From andreas at eversberg.eu  Tue Nov  5 06:55:29 2013
From: andreas at eversberg.eu (Andreas Eversberg)
Date: Tue, 05 Nov 2013 07:55:29 +0100
Subject: Dropping LLC frames due the TBF destruction
In-Reply-To: <20131027111242.GN6295@xiaoyu.lan>
References: <20131027111242.GN6295@xiaoyu.lan>
Message-ID: <52789661.1040502@eversberg.eu>

Holger Hans Peter Freyther wrote:
> * there is no indication to the SGSN for dropped frames/octets?
> * can't the DL-TBF be re-associated/updated after the assignment
>   is done again?
>   

hi holger,

there is one issue about failed downlink TBF: the PCU does not do any
retry. when the assignment is sent to the phone and it fails, there is
no retry according to the specs. i assume that this is a task for the
SGSN. osmo-sgsn/openggsn does not resend any LLC frame in case of
LLC-DISCARDED message, if i look at the source code of libosmogb.
(LLC-DISCARDED is forwarded at bssgp_rx_llc_disc() of gprs_bssgp.c, but
i don't see where it is handled.)

what actually happens is that the TCP layer will do the resend, but this is
a problem, especially at the border of coverage where many downlink
assignments get lost.

i have once enabled the debugging at ipaccess BTS and saw that the PCU
itself does the retry of lost downlink assignments (4 or 5 times).

regards,

andreas


From andreas at eversberg.eu  Tue Nov  5 07:14:05 2013
From: andreas at eversberg.eu (Andreas Eversberg)
Date: Tue, 05 Nov 2013 08:14:05 +0100
Subject: Losing ACKs due TLLI changes?
In-Reply-To: <20131027123313.GO6295@xiaoyu.lan>
References: <20131027123313.GO6295@xiaoyu.lan>
Message-ID: <52789ABD.5030509@eversberg.eu>

Holger Hans Peter Freyther wrote:
> <0007> gprs_rlcmac_meas.cpp:103 UL RSSI of TLLI=0x88661bc6: -67 dBm
> <0002> bts.cpp:945 Got ACK, but UL TBF is gone TLLI=0xe512eba3
> <0007> gprs_rlcmac_meas.cpp:158 DL packet loss of IMSI=274080000004765 / TLLI=0xe512eba3: 0%
> <0002> tbf.cpp:668 TBF TFI=0 TLLI=0x88661bc6 T3169 timeout during transsmission
> <0002> tbf.cpp:690 - Assignment was on PACCH
> <0002> tbf.cpp:694 - No uplink data received yet
>
> So there is an ACK for TLLI=0xe512eba3 but at the same time the tlli
> 0x88661bc6 is timing out.
>
> PCU->SGSN TLLI=0x88661bc6 Attach Request
> SGSN->PCU TLLI=0x88661bc6 Attach Accept (new P-TMSI)
> PCU->SGSN TLLI=0xe512eba3 Attach Complete (new TLLI) (6s later)
>

hi holger,

can you provide the complete log, starting with the attach request? i
looked at the gprs_rlcmac_rcv_control_block() function which already
handles some TLLI change, but i am not sure if there is a bug or if
there must be some other way to handle TLLI change.

regards,

andreas


From andreas at eversberg.eu  Tue Nov  5 07:39:49 2013
From: andreas at eversberg.eu (Andreas Eversberg)
Date: Tue, 05 Nov 2013 08:39:49 +0100
Subject: Losing ACKs due TLLI changes?
In-Reply-To: <20131030111731.GA23718@xiaoyu.lan>
References: <20131027123313.GO6295@xiaoyu.lan>
	<20131030111731.GA23718@xiaoyu.lan>
Message-ID: <5278A0C5.2000000@eversberg.eu>

Holger Hans Peter Freyther wrote:
> +       printf("%s TLLLI changed...... 0x%08x->0x%08x\n",
> +               tbf_name(this), m_tlli, tlli);
> +
> +       if (direction == GPRS_RLCMAC_DL_TBF) {
> +               gprs_rlcmac_tbf *ul_tbf;
> +               ul_tbf = bts->tbf_by_tlli(m_tlli, GPRS_RLCMAC_UL_TBF);
> +
> +               if (ul_tbf)
> +                       ul_tbf->m_tlli = tlli;
> +       }
> +
>

oops, i am sorry. i did not read all the follow-ups before answering
your message...

the update of both DL and UL TBFs seems to be a good solution.


in the function above you wrote this comment:

* TODO: There could be multiple DL and UL TBFs and we should
* have a proper way to link all the related TBFs so we can do
* a group update.

how can we have multiple TBFs for a single direction? i don't see
anything like this in the specs.


From hfreyther at sysmocom.de  Tue Nov  5 08:01:27 2013
From: hfreyther at sysmocom.de (Holger Hans Peter Freyther)
Date: Tue, 5 Nov 2013 09:01:27 +0100
Subject: Dropping LLC frames due the TBF destruction
In-Reply-To: <52789661.1040502@eversberg.eu>
References: <20131027111242.GN6295@xiaoyu.lan> <52789661.1040502@eversberg.eu>
Message-ID: <20131105080127.GB9642@xiaoyu.lan>

On Tue, Nov 05, 2013 at 07:55:29AM +0100, Andreas Eversberg wrote:

dear andreas,

as usual there is more than one dimension to this problem.

1.) Dropping LLC data with no indication.
When trying to understand a problem (e.g. Samsung3 traffic stalls) one
needs to analyze what is going wrong. The "contract" for the SGSN/PCU
is that an LLC frame will either be forwarded to the MS or a discarded
message will be sent (frames, octets, etc.).

BUT with the current code this contract is broken. One simply cannot
know if the LLC frame has actually reached the phone...


2.) Retry or not to retry.

> there is one issue about failed downlink TBF: the PCU does not do any
> retry. when the assignment is sent to the phone and it fails, there is
> no retry according to the specs. i assume that this is a task for the
> SGSN. osmo-sgsn/openggsn does not resend any LLC frame in case of
> LLC-DISCARDED message, if i look at the source code of libosmogb.
> (LLC-DISCARDED is forwarded at bssgp_rx_llc_disc() of gprs_bssgp.c, but
> i don't see where it is handled.)

Well, retry or not to retry is an architectural decision. The question
is what to do with the frames we are discarding?

-- 
- Holger Freyther <hfreyther at sysmocom.de>       http://www.sysmocom.de/
=======================================================================
* sysmocom - systems for mobile communications GmbH
* Schivelbeiner Str. 5
* 10439 Berlin, Germany
* Sitz / Registered office: Berlin, HRB 134158 B
* Geschaeftsfuehrer / Managing Directors: Holger Freyther, Harald Welte


From hfreyther at sysmocom.de  Tue Nov  5 08:04:28 2013
From: hfreyther at sysmocom.de (Holger Hans Peter Freyther)
Date: Tue, 5 Nov 2013 09:04:28 +0100
Subject: Losing ACKs due TLLI changes?
In-Reply-To: <52789ABD.5030509@eversberg.eu>
References: <20131027123313.GO6295@xiaoyu.lan> <52789ABD.5030509@eversberg.eu>
Message-ID: <20131105080428.GC9642@xiaoyu.lan>

On Tue, Nov 05, 2013 at 08:14:05AM +0100, Andreas Eversberg wrote:

Dear Andreas,

> can you provide the complete log, starting with the attach request? i
> looked at the gprs_rlcmac_rcv_control_block() function which already
> handles some TLLI change, but i am not sure if there is a bug or if
> there must be some other way to handle TLLI change.

everything that is needed to understand/reproduce the issue has been
posted. There is now even a unit test for this behavior.

holger




From laforge at gnumonks.org  Tue Nov  5 16:41:58 2013
From: laforge at gnumonks.org (Harald Welte)
Date: Tue, 5 Nov 2013 17:41:58 +0100
Subject: Dropping LLC frames due the TBF destruction
In-Reply-To: <52789661.1040502@eversberg.eu>
References: <20131027111242.GN6295@xiaoyu.lan> <52789661.1040502@eversberg.eu>
Message-ID: <20131105164158.GM12353@nataraja.gnumonks.org>

Hi Andreas,

On Tue, Nov 05, 2013 at 07:55:29AM +0100, Andreas Eversberg wrote:

> there is one issue about failed downlink TBF: the PCU does not do any
> retry. when the assignment is sent to the phone and it fails, there is
> no retry according to the specs. i assume that this is a task for the
> SGSN. osmo-sgsn/openggsn does not resend any LLC frame in case of
> LLC-DISCARDED message, if i look at the source code of libosmogb.
> (LLC-DISCARDED is forwarded at bssgp_rx_llc_disc() of gprs_bssgp.c, but
> i don't see where it is handled.)
> 
> what actually happens is that the TCP layer will do the resend, but this is
> a problem, especially at the border of coverage where many downlink
> assignments get lost.

I'm not sure if I'm mixing up things, but we have three potential
protocol layers that can take care of reliable delivery:

1) the RLC/MAC layer between PCU and MS
2) the LLC protocol
3) the TCP layer inside user IP


From andreas at eversberg.eu  Thu Nov  7 06:44:40 2013
From: andreas at eversberg.eu (Andreas Eversberg)
Date: Thu, 07 Nov 2013 07:44:40 +0100
Subject: Dropping LLC frames due the TBF destruction
In-Reply-To: <52789661.1040502@eversberg.eu>
References: <20131027111242.GN6295@xiaoyu.lan> <52789661.1040502@eversberg.eu>
Message-ID: <527B36D8.7010001@eversberg.eu>

On Tue, Nov 05, 2013 at 07:55:29AM +0100, Andreas Eversberg wrote:

dear andreas,

as usual there is more than one dimension to this problem.

1.) Dropping LLC data with no indication.
When trying to understand a problem (e.g. Samsung3 traffic stalls) one
needs to analyze what is going wrong. The "contract" for the SGSN/PCU
is that a LLC frame will be either forwarded to the MS or a discarded
message will be sent (frames, octets, etc..).

BUT with the current code this contract is broken. One simply can not
know if the LLC frame has actually reached the phone...

2.) Retry or not to retry.

> there is one issue about failed downlink TBF: the PCU does not do any
> retry. when the assignment is sent to the phone and it fails, there is
> no retry according to the specs. i assume that this is a task for the
> SGSN. osmo-sgsn/openggsn does not resend any LLC frame in case of
> LLC-DISCARDED message, if i look at the source code of libosmogb.
> (LLC-DISCARDED is forwarded at bssgp_rx_llc_disc() of gprs_bssgp.c, but
> i don't see where it is handled.)

Well, retry or not to retry is an architectural decision. The question
is what to do with the frames we are discarding?

-- 
- Holger Freyther <hfreyther at sysmocom.de>       http://www.sysmocom.de/
=======================================================================
* sysmocom - systems for mobile communications GmbH
* Schivelbeiner Str. 5
* 10439 Berlin, Germany
* Sitz / Registered office: Berlin, HRB 134158 B
* Geschaeftsfuehrer / Managing Directors: Holger Freyther, Harald Welte


From andreas at eversberg.eu  Thu Nov  7 07:35:49 2013
From: andreas at eversberg.eu (Andreas Eversberg)
Date: Thu, 07 Nov 2013 08:35:49 +0100
Subject: Dropping LLC frames due the TBF destruction
In-Reply-To: <20131105080127.GB9642@xiaoyu.lan>
References: <20131027111242.GN6295@xiaoyu.lan> <52789661.1040502@eversberg.eu>
	<20131105080127.GB9642@xiaoyu.lan>
Message-ID: <527B42D5.7080500@eversberg.eu>

(just ignore my last mail. somehow i hit the wrong key...)


hi holger, hi harald,

i looked at the TS 08.18. the LLC-DISCARDED.ind message does not contain
any reference to the frame(s) that have been discarded, so the SGSN
cannot resend specific frames that are dropped. also a resend does not
seem to be specified. it is the task of the PCU to make sure the LLC
frame is delivered.

during a downlink TBF, lost downlink blocks are resent by the RLC/MAC
protocol, but assignment is not. (at least not correctly)

here is my suggestion: if the assignment fails, the PCU should resend
the downlink TBF assignment several times. the number of retries should
be a config option. it may happen that the assignment on PACCH fails
while the MS is in packet transfer mode. if the MS switches back to
packet idle mode in the meantime, the retry must be performed on AGCH.
it may also happen that the assignment on AGCH fails because the MS is
establishing an uplink TBF. if this happens, the retry must be performed
on PACCH. if the maximum number of retries is reached, the LLC frame (or
list of frames) should be discarded and an LLC-DISCARDED.ind message
should be sent to the SGSN.
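
The retry policy suggested above could be sketched roughly as follows. This is only an illustration, not osmo-pcu code: the names `DlAssignRetry`, `MsMode`, `AssignAction` and `max_retries` are hypothetical, and a real PCU would have to derive the MS mode from its own TBF state.

```cpp
// Sketch only: hypothetical types, not part of osmo-pcu.
enum class MsMode { PacketIdle, PacketTransfer };
enum class AssignAction { SendOnAgch, SendOnPacch, DiscardAndNotifySgsn };

struct DlAssignRetry {
    unsigned max_retries;  // assumed to come from a config option
    unsigned sent = 0;     // assignments sent so far (first try + retries)

    explicit DlAssignRetry(unsigned max) : max_retries(max) {}

    // Called for the first attempt and again after every failed attempt.
    AssignAction next(MsMode mode)
    {
        // The first attempt plus max_retries retries have all failed:
        // drop the queued LLC frame(s) and send LLC-DISCARDED.ind.
        if (sent > max_retries)
            return AssignAction::DiscardAndNotifySgsn;
        ++sent;
        // The channel follows the current MS mode, which may have
        // changed since the previous attempt.
        return mode == MsMode::PacketTransfer ? AssignAction::SendOnPacch
                                              : AssignAction::SendOnAgch;
    }
};
```

With `max_retries` set to 2, three assignments are sent in total before the frames are discarded, and each attempt picks PACCH or AGCH from the MS mode at that moment.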

regards,

andreas


From laforge at gnumonks.org  Thu Nov  7 07:50:58 2013
From: laforge at gnumonks.org (Harald Welte)
Date: Thu, 7 Nov 2013 08:50:58 +0100
Subject: Dropping LLC frames due the TBF destruction
In-Reply-To: <527B42D5.7080500@eversberg.eu>
References: <20131027111242.GN6295@xiaoyu.lan> <52789661.1040502@eversberg.eu>
	<20131105080127.GB9642@xiaoyu.lan> <527B42D5.7080500@eversberg.eu>
Message-ID: <20131107075058.GF639@nataraja.gnumonks.org>

On Thu, Nov 07, 2013 at 08:35:49AM +0100, Andreas Eversberg wrote:
> i looked at the TS 08.18. the LLC-DISCARDED.ind message does not contain
> any reference to the frame(s) that have been discarded, so the SGSN
> cannot resend specific frames that are dropped. also a resend does not
> seem to be specified. it is the task of the PCU to make sure the LLC
> frame is delivered.

Thanks for your research and analysis.

-- 
- Harald Welte <laforge at gnumonks.org>           http://laforge.gnumonks.org/
============================================================================
"Privacy in residential applications is a desirable marketing option."
                                                  (ETSI EN 300 175-7 Ch. A6)


From hfreyther at sysmocom.de  Mon Nov 11 19:24:08 2013
From: hfreyther at sysmocom.de (Holger Hans Peter Freyther)
Date: Mon, 11 Nov 2013 20:24:08 +0100
Subject: Coverity issues in gsm_rlcmac.cpp
Message-ID: <20131111192408.GD30839@xiaoyu.lan>

Dear Ivan,

could you please have a look at the coverity issues in the gsm_rlcmac.cpp
routines? 

Uninitialized scalar variable:
gsm_rlcmac.cpp:5321 ar.direction not initialized
gsm_rlcmac.cpp:5039 ar.direction not initialized
gsm_rlcmac.cpp:5155 ar.direction not initialized
gsm_rlcmac.cpp:4872 ar.direction not initialized

Just initialize it in csnStreamInit?


Out-of-bounds read:
gsm_rlcmac.cpp:5502 " Overrunning array "data->RLC_DATA" of 20 bytes
at byte offset 22 using index "i" (which evaluates to 22)."

gsm_rlcmac.cpp:5440 "  Overrunning array "data->RLC_DATA" of 20 bytes
at byte offset 22 using index "i" (which evaluates to 22)."

Maybe just add an assert that dataNumOctets <= 20?
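
Both suggestions could look roughly like the sketch below. The struct and function names are hypothetical stand-ins; only `direction`, `RLC_DATA` and `dataNumOctets` are taken from the Coverity output, and the real definitions and the signature of csnStreamInit are not shown in this mail. The sketch returns an error where the mail proposes an assert, so that the check can be exercised without aborting.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins for the real structures.
struct StreamSketch {
    int direction;           // the field reported as uninitialized
    int remaining_bits_len;
};

struct RlcDataSketch {
    std::uint8_t RLC_DATA[20];
};

// Set every field in the init routine so no caller can read garbage.
inline void stream_init_sketch(StreamSketch *ar, int remaining_bits_len)
{
    ar->direction = 0;
    ar->remaining_bits_len = remaining_bits_len;
}

// Read at most sizeof(RLC_DATA) octets; refuse instead of overrunning.
inline bool read_rlc_data_sketch(const RlcDataSketch *data,
                                 std::size_t dataNumOctets,
                                 std::uint8_t *out)
{
    if (dataNumOctets > sizeof(data->RLC_DATA))
        return false;  // an assert() here would abort instead
    for (std::size_t i = 0; i < dataNumOctets; i++)
        out[i] = data->RLC_DATA[i];
    return true;
}
```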




From Ivan.Kluchnikov at fairwaves.ru  Tue Nov 12 13:29:57 2013
From: Ivan.Kluchnikov at fairwaves.ru (Ivan Kluchnikov)
Date: Tue, 12 Nov 2013 17:29:57 +0400
Subject: Coverity issues in gsm_rlcmac.cpp
In-Reply-To: <20131111192408.GD30839@xiaoyu.lan>
References: <20131111192408.GD30839@xiaoyu.lan>
Message-ID: <CA+QHiD85J=ePa8z7kYPyChA31_Hw1hd-=zZgu4dUVz1Z+rPOPA@mail.gmail.com>

Hi Holger,

2013/11/11 Holger Hans Peter Freyther <hfreyther at sysmocom.de>:
>
> Uninitialized scalar variable:
> gsm_rlcmac.cpp:5321 ar.direction not initialized
> gsm_rlcmac.cpp:5039 ar.direction not initialized
> gsm_rlcmac.cpp:5155 ar.direction not initialized
> gsm_rlcmac.cpp:4872 ar.direction not initialized
>
> Just initialize it in csnStreamInit?

Yes.

>
> Out-of-bounds read:
> gsm_rlcmac.cpp:5502 " Overrunning array "data->RLC_DATA" of 20 bytes
> at byte offset 22 using index "i" (which evaluates to 22)."
>
> gsm_rlcmac.cpp:5440 "  Overrunning array "data->RLC_DATA" of 20 bytes
> at byte offset 22 using index "i" (which evaluates to 22)."
>
> Maybe just add an assert that dataNumOctets <= 20?

Yes, it makes sense.


-- 
Regards,
Ivan Kluchnikov.
http://fairwaves.ru


From Ivan.Kluchnikov at fairwaves.ru  Tue Nov 12 13:44:43 2013
From: Ivan.Kluchnikov at fairwaves.ru (Ivan Kluchnikov)
Date: Tue, 12 Nov 2013 17:44:43 +0400
Subject: Coverity issues in gsm_rlcmac.cpp
In-Reply-To: <CA+QHiD85J=ePa8z7kYPyChA31_Hw1hd-=zZgu4dUVz1Z+rPOPA@mail.gmail.com>
References: <20131111192408.GD30839@xiaoyu.lan>
	<CA+QHiD85J=ePa8z7kYPyChA31_Hw1hd-=zZgu4dUVz1Z+rPOPA@mail.gmail.com>
Message-ID: <CA+QHiD8+nywi51pVfMWhqp3R8EMOmRgfiv3bJF4H0f7zJrr8_A@mail.gmail.com>

I will prepare a patch for these issues soon.

2013/11/12 Ivan Kluchnikov <Ivan.Kluchnikov at fairwaves.ru>:
> Hi Holger,
>
> 2013/11/11 Holger Hans Peter Freyther <hfreyther at sysmocom.de>:
>>
>> Uninitialized scalar variable:
>> gsm_rlcmac.cpp:5321 ar.direction not initialized
>> gsm_rlcmac.cpp:5039 ar.direction not initialized
>> gsm_rlcmac.cpp:5155 ar.direction not initialized
>> gsm_rlcmac.cpp:4872 ar.direction not initialized
>>
>> Just initialize it in csnStreamInit?
>
> Yes.
>
>>
>> Out-of-bounds read:
>> gsm_rlcmac.cpp:5502 " Overrunning array "data->RLC_DATA" of 20 bytes
>> at byte offset 22 using index "i" (which evaluates to 22)."
>>
>> gsm_rlcmac.cpp:5440 "  Overrunning array "data->RLC_DATA" of 20 bytes
>> at byte offset 22 using index "i" (which evaluates to 22)."
>>
>> Maybe just add an assert that dataNumOctets <= 20?
>
> Yes, it makes sense.
>
>
>
>
> --
> Regards,
> Ivan Kluchnikov.
> http://fairwaves.ru


-- 
Regards,
Ivan Kluchnikov.
http://fairwaves.ru


From dwillmann at sysmocom.de  Mon Nov 18 15:56:23 2013
From: dwillmann at sysmocom.de (Daniel Willmann)
Date: Mon, 18 Nov 2013 16:56:23 +0100
Subject: gprs_rlcmac_received_lost() in gprs_rlcmac_meas.cpp
Message-ID: <20131118155623.GA7003@adrastea.totalueberwachung.de>

Hello Andreas,

while looking through the osmo-pcu code to figure out why some
connections stall we had some problems making sense of the elapsed time
calculations in gprs_rlcmac_meas.cpp.

Could you confirm/deny my assumptions about it or explain the idea
behind it?

> gettimeofday(&now_tv, NULL);
> elapsed = ((now_tv.tv_sec - loss_tv->tv_sec) << 7)
> 	+ ((now_tv.tv_usec - loss_tv->tv_usec) << 7) / 1000000;

I assume here you're calculating the duration of the measurement period
so far and since you want to have sub-second accuracy you multiply
everything by 128. Why 128?
Is it because it simplifies the throughput calculation?
(tbf->meas.dl_bw_octets/elapsed in gprs_rlcmac_dl_bw())

> if (elapsed < 128)
> 	return 0;

Is the intention here that the duration of the measurements is supposed
to be one second? So every second these measurements are printed out and
reset?


Regards
Daniel Willmann

-- 
- Daniel Willmann <dwillmann at sysmocom.de>       http://www.sysmocom.de/
=======================================================================
* sysmocom - systems for mobile communications GmbH
* Schivelbeiner Str. 5
* 10439 Berlin, Germany
* Sitz / Registered office: Berlin, HRB 134158 B
* Geschaeftsfuehrer / Managing Directors: Holger Freyther, Harald Welte


From andreas at eversberg.eu  Tue Nov 19 05:27:07 2013
From: andreas at eversberg.eu (Andreas Eversberg)
Date: Tue, 19 Nov 2013 06:27:07 +0100
Subject: gprs_rlcmac_received_lost() in gprs_rlcmac_meas.cpp
In-Reply-To: <20131118155623.GA7003@adrastea.totalueberwachung.de>
References: <20131118155623.GA7003@adrastea.totalueberwachung.de>
Message-ID: <528AF6AB.9070102@eversberg.eu>

Daniel Willmann wrote:
>> gettimeofday(&now_tv, NULL);
>> elapsed = ((now_tv.tv_sec - loss_tv->tv_sec) << 7)
>> 	+ ((now_tv.tv_usec - loss_tv->tv_usec) << 7) / 1000000;
>
> I assume here you're calculating the duration of the measurement period
> so far and since you want to have sub-second accuracy you multiply
> everything by 128. Why 128?
> Is it because it simplifies the throughput calculation?
> (tbf->meas.dl_bw_octets/elapsed in gprs_rlcmac_dl_bw())

hi daniel,

yes. first i wanted something more accurate than one second, so i
multiplied the elapsed time by 128. 128 bytes are 1024 bits, so it
simplifies the calculation for kbits/s.

>> if (elapsed < 128)
>> 	return 0;
>
> Is the intention here that the duration of the measurements is supposed
> to be one second? So every second these measurements are printed out and
> reset?

after every transmitted downlink frame the gprs_rlcmac_dl_bw() function
is called. if at least one second has elapsed, the throughput over the
elapsed time is printed out. it is just a simple solution without the
requirement for a timer for each TBF.
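
The arithmetic can be followed in a small stand-alone sketch. The helper names `TimeSketch`, `elapsed_128` and `rate_kbit` are made up for illustration; only the formula mirrors the quoted code. The sketch multiplies by 128 instead of shifting left by 7 so that a negative microsecond difference stays well defined.

```cpp
#include <cstdint>

// Hypothetical helper type standing in for struct timeval.
struct TimeSketch {
    std::int64_t sec;
    std::int64_t usec;
};

// Elapsed time in units of 1/128 s.
inline std::int64_t elapsed_128(const TimeSketch &now, const TimeSketch &start)
{
    return (now.sec - start.sec) * 128
         + ((now.usec - start.usec) * 128) / 1000000;
}

// octets / (seconds * 128) == (octets * 8 / 1024) / seconds, so the
// quotient is directly the rate in units of 1024 bit/s.
inline std::int64_t rate_kbit(std::int64_t octets, std::int64_t elapsed)
{
    if (elapsed < 128)  // less than one second measured so far
        return 0;
    return octets / elapsed;
}
```

For example, 3840 octets transferred in 1.5 s give an elapsed value of 192 and a quotient of 20, which matches 3840 * 8 / 1.5 = 20480 bit/s.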

regards,

andreas


From hfreyther at sysmocom.de  Sun Nov 24 18:05:35 2013
From: hfreyther at sysmocom.de (Holger Hans Peter Freyther)
Date: Sun, 24 Nov 2013 19:05:35 +0100
Subject: Summary of osmo-pcu failures/defects
Message-ID: <20131124180535.GA8222@xiaoyu.lan>

Hi,

I have added rate counters to the PCU and noticed that a lot of frames
in the DL are re-sent without there actually being any NACK or
transmission errors. I have also re-produced a "stall" by just using
ping with a big enough data size and a low enough interval (the PCU
appears to transmit the ICMP PING but the MS never gets to send a PONG
packet).

In both cases it could be a scheduling issue (e.g. we don't schedule
the UL Packet ACK/NACK). Is any of you aware of it? I had already
highlighted that the scheduling is not fair. I wonder if we are seeing
starvation (and will re-factor to be able to unit test this).

In terms of scheduling. For the SBA we are already having a reservation
that we try to honor in multiple places. Maybe it is time to extend it
and have more reservations?

holger



From andreas at eversberg.eu  Mon Nov 25 07:24:27 2013
From: andreas at eversberg.eu (Andreas Eversberg)
Date: Mon, 25 Nov 2013 08:24:27 +0100
Subject: Summary of osmo-pcu failures/defects
In-Reply-To: <20131124180535.GA8222@xiaoyu.lan>
References: <20131124180535.GA8222@xiaoyu.lan>
Message-ID: <5292FB2B.6060308@eversberg.eu>

Holger Hans Peter Freyther wrote:
> Hi,
>
> I have added rate counters to the PCU and noticed that a lot of frames
> in the DL are re-sent without there actually being any NACK or
> transmission errors.

dear holger,

i can only guess what causes the issues you describe. so i suggest that
we should do a debug session at the congress, if you attend it and
find the time.

generally there is always a re-send due to the delay between the pcu and
the radio interface. once all rlc/mac data blocks of an llc frame have
been sent to the phone, a packet downlink ack control block is requested
and scheduled. the pcu repeats all unacknowledged data blocks in a loop
until it receives that requested control block. this way the pcu starts
re-sending data blocks that might have been lost before it actually
knows if and which blocks still need to be resent. (this procedure is
described in TS 04.60.)

> I have also re-produced a "stall" by just using
> ping with a big enough data size and a low enough interval (the PCU
> appears to transmit the ICMP PING but the MS never gets to send a PONG
> packet).

could this be an mtu issue at the phone? does the phone request an
uplink tbf to send a pong?

> In both cases it could be a scheduling issue (e.g. we don't schedule
> the UL Packet ACK/NACK).

in case of an ongoing uplink tbf, we ack/nack received or timed out
uplink rlc/mac blocks by using a control block. even if there is an
ongoing downlink tbf, all control blocks have priority over data blocks.

> Is any of you aware of it? I had already
> highlighted that the scheduling is not fair. I wonder if we are seeing
> starvation (and will re-factor to be able to unit test this).
>
> In terms of scheduling. For the SBA we are already having a reservation
> that we try to honor in multiple places. Maybe it is time to extend it
> and have more reservations?

uplink control messages are reserved in the tbf object (poll_fn). the
sba control messages are reserved in a separate structure, since sba are
not (or not yet) related to a tbf.

regards,

andreas


From hfreyther at sysmocom.de  Mon Nov 25 08:21:08 2013
From: hfreyther at sysmocom.de (Holger Hans Peter Freyther)
Date: Mon, 25 Nov 2013 09:21:08 +0100
Subject: Summary of osmo-pcu failures/defects
In-Reply-To: <5292FB2B.6060308@eversberg.eu>
References: <20131124180535.GA8222@xiaoyu.lan> <5292FB2B.6060308@eversberg.eu>
Message-ID: <20131125082108.GK8222@xiaoyu.lan>

On Mon, Nov 25, 2013 at 08:24:27AM +0100, Andreas Eversberg wrote:

> could this be an mtu issue at the phone? does the phone request an
> uplink tbf to send a pong?

the first 10 pings going through and then nothing coming back? I don't
think it is an MTU issue. I don't know what the root cause is but the
structure of the code doesn't make it easy to find it.


> > In both cases it could be a scheduling issue (e.g. we don't schedule
> > the UL Packet ACK/NACK).
> in case of an ongoing uplink tbf, we ack/nack received or timed out
> uplink rlc/mac blocks by using a control block. even if there is an
> ongoing downlink tbf, all control blocks have priority over data blocks.


As I have already pointed out, some control blocks can starve. I don't
know if that actually happens, but it is where I will continue to look.


> > Is any of you aware of it? I had already
> > highlighted that the scheduling is not fair. I wonder if we are seeing
> > starvation (and will re-factor to be able to unit test this).
> >
> > In terms of scheduling. For the SBA we are already having a reservation
> > that we try to honor in multiple places. Maybe it is time to extend it
> > and have more reservations?
> >   
> uplink control messages are reserved in the tbf object (poll_fn). the
> sba control messages are reserved in a separate structure, since sba are
> not (or not yet) related to a tbf.

well. In the tbf you have:

  else if (bts->sba()->find(trx->trx_no, ts, (fn + 13) % 2715648))
             LOGP(DRLCMAC, LOGL_DEBUG, "Polling cannot be "
                         "sheduled, because single block alllocation "
                         "already exists\n");

But there is nothing that checks if another TBF is setting poll_fn to
a frame that another tbf is already polling. 
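
A check for the missing case might look like the sketch below. It is an illustration under assumptions: `TbfPollSketch` is a simplified, hypothetical record and not the osmo-pcu TBF class, and the helper names are made up. Only the `(fn + 13) % 2715648` computation is taken from the snippet above.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical, simplified TBF record.
struct TbfPollSketch {
    std::uint8_t trx_no;
    std::uint8_t ts;
    bool poll_pending;
    std::uint32_t poll_fn;
};

const std::uint32_t GSM_FN_MODULUS = 2715648;  // same modulus as above

// Frame number used for the poll, as in the quoted snippet.
inline std::uint32_t poll_fn_for(std::uint32_t fn)
{
    return (fn + 13) % GSM_FN_MODULUS;
}

// True if some TBF already reserved this frame on the same TRX/timeslot.
inline bool poll_fn_taken(const std::vector<TbfPollSketch> &tbfs,
                          std::uint8_t trx_no, std::uint8_t ts,
                          std::uint32_t fn)
{
    for (const TbfPollSketch &t : tbfs) {
        if (t.poll_pending && t.trx_no == trx_no && t.ts == ts
            && t.poll_fn == fn)
            return true;
    }
    return false;
}
```

A caller would run this next to the existing SBA lookup and skip or postpone the poll when either of them reports a conflict.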



From hfreyther at sysmocom.de  Mon Nov 25 23:18:03 2013
From: hfreyther at sysmocom.de (Holger Hans Peter Freyther)
Date: Tue, 26 Nov 2013 00:18:03 +0100
Subject: Summary of osmo-pcu failures/defects
In-Reply-To: <20131125082108.GK8222@xiaoyu.lan>
References: <20131124180535.GA8222@xiaoyu.lan> <5292FB2B.6060308@eversberg.eu>
	<20131125082108.GK8222@xiaoyu.lan>
Message-ID: <20131125231803.GB18797@xiaoyu.lan>

On Mon, Nov 25, 2013 at 09:21:08AM +0100, Holger Hans Peter Freyther wrote:

> As I have already pointed out. Some control blocks can starve. I don't
> know if it does but it is where I will continue to have a look.


Here is the starvation theory for the ping:

We have an LLC frame (or many queued up)... we schedule the polls and
at the final_ack indication we decide to re-use the TBF. This means that
we will schedule another PACKET DOWNLINK ASSIGNMENT. But at the same
time we either want to honor the "rh->si" or want to schedule the ACK
due to SEND_ACK_AFTER_FRAMES.

So we more or less want to send "PACKET DOWNLINK ASSIGNMENT" and the
"PACKET UPLINK ACK" at the same time (with more DL tbfs/traffic this
is getting more likely) but currently we will always prefer the PACKET
DOWNLINK ASSIGNMENT. This means that the uplink will starve (the window
will stall, rh->si being set, etc.).


Fairness improvements:

* put ul_ass_tbf, dl_ass_tbf, ul_ack_tbf in an array and store the
  last index in the PDCH and iterate over it.
* move the TBF to the front of the ul_tbfs and dl_tbfs list so that
  it is not selected again.
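
The first bullet could be sketched as a round-robin over the three control message candidates. All names here (`CtrlSlot`, `PdchCtrlRoundRobin`, `pick`) are hypothetical; the sketch only assumes that the PDCH can tell, per candidate, whether something is pending.

```cpp
#include <array>
#include <cstddef>

// The three slots stand for the ul_ass_tbf, dl_ass_tbf and ul_ack_tbf
// candidates named above.
enum CtrlSlot { UL_ASS = 0, DL_ASS = 1, UL_ACK = 2, NUM_CTRL_SLOTS = 3 };

// Hypothetical per-PDCH state remembering the last served slot.
struct PdchCtrlRoundRobin {
    std::size_t last = NUM_CTRL_SLOTS - 1;  // first scan starts at UL_ASS

    // pending[i] tells whether slot i has a control message to send.
    // Returns the chosen slot, or -1 if nothing is pending.
    int pick(const std::array<bool, NUM_CTRL_SLOTS> &pending)
    {
        for (std::size_t step = 1; step <= NUM_CTRL_SLOTS; step++) {
            std::size_t idx = (last + step) % NUM_CTRL_SLOTS;
            if (pending[idx]) {
                last = idx;
                return static_cast<int>(idx);
            }
        }
        return -1;
    }
};
```

Because the scan always starts one past the slot served last, a pending PACKET UPLINK ACK is served at the latest after one PACKET DOWNLINK ASSIGNMENT, even if downlink assignments are pending all the time.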




From hfreyther at sysmocom.de  Tue Nov 26 08:22:56 2013
From: hfreyther at sysmocom.de (Holger Hans Peter Freyther)
Date: Tue, 26 Nov 2013 09:22:56 +0100
Subject: Summary of osmo-pcu failures/defects
In-Reply-To: <5292FB2B.6060308@eversberg.eu>
References: <20131124180535.GA8222@xiaoyu.lan> <5292FB2B.6060308@eversberg.eu>
Message-ID: <20131126082256.GH32012@xiaoyu.lan>

On Mon, Nov 25, 2013 at 08:24:27AM +0100, Andreas Eversberg wrote:

> generally there is always a re-send due to delay between pcu and the
> radio interface. if all rlc/mac data blocks of an llc frame have been
> sent to the phone, a packet downlink ack control block is requested and
> scheduled. the pcu repeats all unacknowledged data blocks in a loop
> until it receives that requested control block. this way the pcu starts
> re-sending data blocks that might have been lost, before it actually knows if
> and which blocks still need to be resent. (this procedure is described
> in TS 04.60.)


Are you referring to 9.1.3.1 "Acknowledge state array V(B) for GPRS
TBF Mode"?

"
 If there are no further RLC data blocks available for transmission
 (i.e. the RLC data block with BSN= V(S) does not exist), the sending
 side shall transmit the oldest RLC data block whose corresponding
 element in V(B) has the value PENDING_ACK, then the next oldest block
 whose corresponding element in V(B) has the value PENDING_ACK, etc. 
"


With statistics in place it appears that for CS4 close to 50% of the
sent RLC blocks are re-sends. Is that what you expected when implementing
it?

holger



From andreas at eversberg.eu  Wed Nov 27 07:49:17 2013
From: andreas at eversberg.eu (Andreas Eversberg)
Date: Wed, 27 Nov 2013 08:49:17 +0100
Subject: Summary of osmo-pcu failures/defects
In-Reply-To: <20131125231803.GB18797@xiaoyu.lan>
References: <20131124180535.GA8222@xiaoyu.lan> <5292FB2B.6060308@eversberg.eu>
	<20131125082108.GK8222@xiaoyu.lan>
	<20131125231803.GB18797@xiaoyu.lan>
Message-ID: <5295A3FD.10708@eversberg.eu>

Holger Hans Peter Freyther wrote:
> Here is the starvation theory for the ping:
>
> We have an LLC frame (or many queued up)... we schedule the polls and
> at the final_ack indication we decide to re-use the TBF. This means that
> we will schedule another PACKET DOWNLINK ASSIGNMENT. But at the same
> time we either want to honor the "rh->si" or want to schedule the ACK
> due to SEND_ACK_AFTER_FRAMES.

the idea behind the priority of PACKET DOWNLINK ASSIGNMENT is that i do
not want the phone to switch back to idle mode. if i ACKed all uplink
blocks, the MS might switch back to idle mode immediately and would
never receive the PACKET DOWNLINK ASSIGNMENT.

when there was an ongoing downlink tbf, the phone stays in transfer mode
until T3193 fires, so in case of a tbf re-use we can safely schedule a
PACKET UPLINK ACK/NACK prior to the PACKET DOWNLINK ASSIGNMENT.

> So we more or less want to send "PACKET DOWNLINK ASSIGNMENT" and the
> "PACKET UPLINK ACK" at the same time (with more DL tbfs/traffic this
> is getting more likely) but currently we will always prefer the PACKET
> DOWNLINK ASSIGNMENT. This means that the uplink will starve (the window
> will stall, rh->si being set, etc.).

once we have sent the PACKET DOWNLINK ASSIGNMENT we can ACK the uplink
blocks right afterwards. this should resolve the stalled window just a
bit later. this is my theory, but your tests showed that this does not
work as i would expect.


From andreas at eversberg.eu  Wed Nov 27 08:06:00 2013
From: andreas at eversberg.eu (Andreas Eversberg)
Date: Wed, 27 Nov 2013 09:06:00 +0100
Subject: Summary of osmo-pcu failures/defects
In-Reply-To: <20131126082256.GH32012@xiaoyu.lan>
References: <20131124180535.GA8222@xiaoyu.lan> <5292FB2B.6060308@eversberg.eu>
	<20131126082256.GH32012@xiaoyu.lan>
Message-ID: <5295A7E8.4010406@eversberg.eu>

Holger Hans Peter Freyther wrote:
> Are you referring to 9.1.3.1 "Acknowledge state array V(B) for GPRS
> TBF Mode"?

yes, the protocol sends all unacknowledged blocks in a loop until all
blocks are acknowledged and the tbf is complete.

> With statistics in place it appears that for CS4 close to 50% of the
> sent RLC blocks are re-sends. Is that what you expected when implementing
> it?

if there is a continuous download (opening a web page), there should be
almost no re-sends unless the window stalls or the download is complete.
in case of a ping with several small tbfs, there are re-sends due to the
delay between the PCU and the radio interface: the PCU sends the final
block of a tbf (polling bit set) and then waits for PACKET DOWNLINK
ACK/NACK. a higher delay causes more re-sends at the end of each tbf, so
smaller tbfs show a higher percentage of re-sends.
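
The relation between tbf size and re-send share can be put into a toy model. It is a simplification with assumed parameters, not measured data: it supposes that every block period between the final block and the arrival of the PACKET DOWNLINK ACK/NACK is filled with a re-send. The function and parameter names are hypothetical.

```cpp
// new_blocks:   RLC data blocks the tbf carries once
// delay_blocks: block periods that pass before the ack arrives, all
//               assumed to be filled with re-sends
inline double resend_share(unsigned new_blocks, unsigned delay_blocks)
{
    unsigned total = new_blocks + delay_blocks;
    return total == 0 ? 0.0 : static_cast<double>(delay_blocks) / total;
}
```

Under this model a tbf with 4 new blocks and a delay of 4 block periods consists of 50% re-sends, while a tbf with 28 new blocks and the same delay drops to 12.5%.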