milestone: 3G CS Location Update Accept

historical

We have not really reached the LU milestone properly yet.

I observe two MM failure modes. One fails after the first reply the MSC
sends to the UE; the other fails one or two messages after that (depending
on how I count them). Let's call them MM.1 and MM.2.

When I power cycle the hNodeB, I randomly get one of these two failures.
I tend to have to power cycle it, because it seems that SCTP stops working
(see below).

MM.1)

  UE  hNodeB  osmo-hnbgw    osmo-cscn
  |                             |
  | --- LU REQ ---------------> |
  | <--------- ID REQ (IMSI) -- |
  | [seconds pass]              |
  | <------- LU REJ (timeout)-- |

  missing/expecting:
  | --- ID RESP --------------> |

  This is what I called "probably a timing issue of the reply towards the UE".

MM.2)

  UE  hNodeB  osmo-hnbgw    osmo-cscn
  |                             |
  | --- LU REQ ---------------> |
  | <--------- ID REQ (IMSI) -- |
  | --- ID RESP --------------> |
  | <------------- LU ACCEPT -- |
  | <--------------- MM INFO -- |
  | [seconds pass]              |
  | <------- LU REJ (timeout)-- |

  missing/expecting:
  | --- TMSI REALLOC COMPL----> |

SCTP)

  When one of the above failures has occurred, I no longer get the
  HEARTBEAT/HEARTBEAT_ACK messages that go through SCTP roughly every 6
  seconds. Instead, Wireshark shows a bunch of errors and retransmissions:
  "Destination unreachable (Protocol unreachable)"
  or even
  "ABORT [Malformed package]"
  or
  "ABORT" / "Protocol violation" with Cause Information
  "Association exceeded its max retans count" [sic: "retans"]

  It seems SCTP itself has stopped working in that case.

GW-sctp_recvmsg)

  In addition to that, I get an osmo-hnbgw failure mode if, after testing
  the above cases (it doesn't matter which one), I let a few minutes pass.

  After a little while, I get
    <0000> hnbgw.c:171 Error during sctp_recvmsg()
  (sctp_recvmsg() returns -1, which I found impossible to qualify further
  short of heading into kernel debugging)

  During local testing of the same situation with hnb-test via loopback
  (127.0.0.1 as well as the same machine's "public" IP), this SCTP error
  doesn't occur, and consequently osmo-hnbgw doesn't segfault.

  When I run hnb-test from a different box and connect it to the osmo-hnbgw
  running on my machine, it also works without problems. The
  sctp_recvmsg() error occurs only when the hNodeB does the same.
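
  One way to narrow down the -1 return without kernel debugging might be
  to capture errno right at the failing call. The following is only a
  sketch: classify_rx_errno() and log_rx_failure() are hypothetical helpers,
  not existing osmo-hnbgw functions, and the grouping of errno values is an
  assumption about what would be useful to tell apart.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical verdicts for a failed receive call (not osmo-hnbgw API). */
enum rx_verdict { RX_RETRY, RX_PEER_GONE, RX_FATAL };

/* Map an errno value to a coarse verdict. The grouping is an assumption. */
static enum rx_verdict classify_rx_errno(int err)
{
    switch (err) {
    case EAGAIN:
#if EWOULDBLOCK != EAGAIN
    case EWOULDBLOCK:
#endif
    case EINTR:
        return RX_RETRY;        /* nothing to read right now, try again */
    case ECONNRESET:
    case ECONNABORTED:
    case ENOTCONN:
    case ETIMEDOUT:
        return RX_PEER_GONE;    /* the association is no longer usable */
    default:
        return RX_FATAL;        /* e.g. EBADF, EFAULT: a local bug */
    }
}

/* Log the saved errno in readable form and return the verdict.
 * The caller must copy errno immediately after the failing call,
 * before any other library call can overwrite it. */
static enum rx_verdict log_rx_failure(const char *what, int saved_errno)
{
    fprintf(stderr, "%s failed: errno=%d (%s)\n",
            what, saved_errno, strerror(saved_errno));
    return classify_rx_errno(saved_errno);
}
```

  A caller would do something like "if (rc < 0) verdict =
  log_rx_failure("sctp_recvmsg", errno);" directly after the receive call,
  so that the log line at least distinguishes a dead association from a
  local programming error.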

GW.segf)

  Shortly after the SCTP error, osmo-hnbgw segfaults. This is probably due
  to wrong/missing osmo-fd/timer cleanup after the sctp_recvmsg() error
  code.
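
  If missing cleanup is indeed the cause, the fix would probably amount to
  a teardown that is safe to call more than once. Below is a minimal sketch
  of that idea. struct hnb_conn_sketch and its flags are made up for
  illustration and only stand in for the real fd and timer registration; in
  libosmocore the corresponding steps would presumably be osmo_timer_del()
  and osmo_fd_unregister() before close().

```c
#include <stdbool.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical per-connection state; not the real osmo-hnbgw struct. */
struct hnb_conn_sketch {
    int fd;              /* socket descriptor, -1 once closed */
    bool fd_registered;  /* stands in for select-loop registration */
    bool timer_armed;    /* stands in for a pending timer */
};

/* Tear down a connection after a receive error.
 * Returns true if anything was actually released, false if the
 * connection was already torn down (or c is NULL). Calling it twice
 * is harmless, so a late callback cannot close the fd again. */
static bool hnb_conn_sketch_teardown(struct hnb_conn_sketch *c)
{
    bool released = false;

    if (c == NULL)
        return false;

    if (c->timer_armed) {       /* stop the timer before the state goes away */
        c->timer_armed = false;
        released = true;
    }
    if (c->fd_registered) {     /* remove from the poll set before close() */
        c->fd_registered = false;
        released = true;
    }
    if (c->fd >= 0) {
        close(c->fd);
        c->fd = -1;             /* mark as closed for later callers */
        released = true;
    }
    return released;
}
```

  The ordering (timer, then unregister, then close) is a design choice of
  this sketch: nothing should be able to fire on a descriptor that has
  already been closed.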

MSC)

  I also get crashes of the MSC, in the form of the CSCN, in conjunction
  with a LU reject due to timeout and invalidation of a subscriber conn.
  One time I got a CPU eater where two rb tree nodes pointed at each other
  via rb_left, and rb_erase kept looping through those two. Mostly I get
  a plain segfault. This is not as reproducible as the others, though.
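
  To catch the looping case earlier, a debug build could check the left
  chain for a cycle before erasing. The sketch below uses Floyd's
  tortoise-and-hare check on a simplified node type. struct node_sketch is
  hypothetical and only mirrors the left/right links of an rb tree node; it
  is not the actual rb_node used by the CSCN.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for an rb tree node (hypothetical). */
struct node_sketch {
    struct node_sketch *left;
    struct node_sketch *right;
};

/* Return true if following ->left from start eventually revisits a node.
 * slow advances one link per step, fast advances two; they can only
 * meet again if the chain loops back on itself. */
static bool left_chain_has_cycle(const struct node_sketch *start)
{
    const struct node_sketch *slow = start;
    const struct node_sketch *fast = start;

    while (fast != NULL && fast->left != NULL) {
        slow = slow->left;
        fast = fast->left->left;
        if (slow == fast)
            return true;
    }
    return false;   /* reached a NULL link: the chain terminates */
}
```

  Two nodes pointing at each other via their left links, as in the case
  described above, make this function return true instead of spinning.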

Solutions?

- Connect the hNodeB with a proper timing source? So far no GPS is
  connected.

- SCTP debugging?

- Rather concentrate on further development using hNodeB mocking test
  programs?
  (Obviously catch the segfaults in the osmo code, but they are not the
  real problem. Once they are solved, the basic messaging problems will
  still exist.)

~Neels
