We have not really reached the LU milestone properly yet.
I observe two MM failure modes: one fails after the first reply the MSC
sends to the UE, the other fails one or two messages after that (depending
on how I count them). Let's call them MM.1 and MM.2.
When I power cycle the hNodeB, I randomly get one of these two failures.
I tend to have to power cycle it, because it seems that SCTP stops working
(see below).
MM.1)
UE hNodeB osmo-hnbgw osmo-cscn
| |
| --- LU REQ ---------------> |
| <--------- ID REQ (IMSI) -- |
| [seconds pass] |
| <------- LU REJ (timeout)-- |
missing/expecting:
| --- ID RESP --------------> |
This is what I called "probably a timing issue of the reply towards the UE".
MM.2)
UE hNodeB osmo-hnbgw osmo-cscn
| |
| --- LU REQ ---------------> |
| <--------- ID REQ (IMSI) -- |
| --- ID RESP --------------> |
| <------------- LU ACCEPT -- |
| <--------------- MM INFO -- |
| [seconds pass] |
| <------- LU REJ (timeout)-- |
missing/expecting:
| --- TMSI REALLOC COMPL----> |
SCTP)
When one of the above failures has occurred, I no longer get the
HEARTBEAT/HEARTBEAT_ACK messages that go through SCTP roughly every 6
seconds. Instead, Wireshark shows a bunch of errors and retransmissions:
"Destination unreachable (Protocol unreachable)"
or even
"ABORT [Malformed package]"
or
"ABORT" / "Protocol violation" with Cause Information
"Association exceeded its max retans count" [sic: "retans"]
It seems SCTP itself has stopped working in that case.
GW-sctp_recvmsg)
In addition to that, I get an osmo-hnbgw failure mode if I let a few
minutes pass after testing either of the above cases (it doesn't matter
which one).
After a little while, I get
<0000> hnbgw.c:171 Error during sctp_recvmsg()
(sctp_recvmsg() returns -1, which I cannot qualify any further short of
heading into kernel debugging)
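One thing that might narrow this down before any kernel debugging: like
other recv-style calls, sctp_recvmsg() sets errno when it returns -1, so
logging errno right after the call should at least tell a reset or aborted
association apart from a transient condition. A minimal sketch follows;
describe_recv_result() is a hypothetical helper, not existing osmo-hnbgw
code, and the split into "transient" and "fatal" is my assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of osmo-hnbgw): turn the return value of
 * a recv-style call such as sctp_recvmsg(), plus the errno captured
 * immediately after it, into a string suitable for a log line. */
const char *describe_recv_result(int rc, int saved_errno,
                                 char *buf, size_t buf_len)
{
    assert(buf != NULL && buf_len > 0);

    if (rc > 0) {
        snprintf(buf, buf_len, "received %d bytes", rc);
    } else if (rc == 0) {
        /* Zero bytes read; on a connection-style socket this usually
         * means the peer shut the association down. */
        snprintf(buf, buf_len, "zero bytes read, peer probably closed");
    } else if (saved_errno == EAGAIN || saved_errno == EWOULDBLOCK ||
               saved_errno == EINTR) {
        /* Nothing to read right now, or interrupted: keep the fd. */
        snprintf(buf, buf_len, "transient error: %s (errno %d)",
                 strerror(saved_errno), saved_errno);
    } else {
        /* Anything else: treat the connection as dead and clean up. */
        snprintf(buf, buf_len, "fatal error: %s (errno %d)",
                 strerror(saved_errno), saved_errno);
    }
    return buf;
}
```

In the read callback this would be used roughly as: store the return value
of sctp_recvmsg() in rc, copy errno into a local variable on the very next
line (before any logging call can overwrite it), then log the result of
describe_recv_result(rc, saved_errno, buf, sizeof(buf)).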
During local testing of the same situation with hnb-test via loopback
(127.0.0.1 as well as the same machine's "public" IP), this SCTP error
doesn't occur, and consequently osmo-hnbgw doesn't segfault.
When I run hnb-test from a different box and connect it to the osmo-hnbgw
running on my machine, it also works without problems. The sctp_recvmsg()
error occurs only when the hNodeB does the same.
GW.segf)
Shortly after the SCTP error, osmo-hnbgw segfaults. This is probably due
to wrong/missing osmo-fd/timer cleanup in the sctp_recvmsg() error
path.
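To illustrate the kind of cleanup ordering I have in mind, here is a
self-contained sketch. It does not use libosmocore: struct conn, the
watched[] table and the conn_*() functions are all hypothetical stand-ins
for an osmo_fd registered with the select loop. The point is only the
order: stop the event loop from calling back, close the fd, and free the
context last, so nothing can touch freed memory afterwards.

```c
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

#define MAX_WATCHED 16

/* Hypothetical per-connection context. */
struct conn {
    int fd;         /* -1 once closed */
    int registered; /* non-zero while the event loop may call back */
};

/* Stand-in for the event loop's list of watched connections. */
static struct conn *watched[MAX_WATCHED];

int conn_register(struct conn *c)
{
    for (int i = 0; i < MAX_WATCHED; i++) {
        if (!watched[i]) {
            watched[i] = c;
            c->registered = 1;
            return 0;
        }
    }
    return -1; /* table full */
}

void conn_unregister(struct conn *c)
{
    for (int i = 0; i < MAX_WATCHED; i++) {
        if (watched[i] == c)
            watched[i] = NULL;
    }
    c->registered = 0;
}

int conn_watched_count(void)
{
    int n = 0;
    for (int i = 0; i < MAX_WATCHED; i++)
        n += (watched[i] != NULL);
    return n;
}

/* Tear down after a fatal read error, in a fixed order:
 * 1. unregister, so no further callbacks reference this context,
 * 2. close the fd,
 * 3. free the context. */
void conn_destroy(struct conn *c)
{
    if (c->registered)
        conn_unregister(c);
    if (c->fd >= 0) {
        close(c->fd);
        c->fd = -1;
    }
    free(c);
}
```

With libosmocore I would expect the equivalent to be osmo_fd_unregister()
followed by close() and freeing the per-HNB context, plus deleting any
timers that still point at it. I have not verified that against the
current hnbgw.c.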
MSC)
I also get crashes of the MSC, in the form of the CSCN, in conjunction
with a LU reject due to timeout and invalidation of a subscriber conn.
One time I got a CPU eater where two rb tree nodes pointed at each other
via rb_left, and rb_erase kept looping through those two. Mostly I get
a plain segfault. This is not as reproducible as the others, though.
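As a debugging aid for the looping case, a check like the following could
be run on a suspect node before erasing it. This is only a sketch: struct
node is a minimal stand-in I made up that mirrors the rb_left/rb_right
naming, not the real rb tree node type, and left_chain_has_cycle() is a
hypothetical helper. It uses Floyd's tortoise-and-hare walk along the
rb_left links, so it terminates even when the links loop.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for a tree node; only the child links matter here. */
struct node {
    struct node *rb_left;
    struct node *rb_right;
};

/* Return 1 if following rb_left from 'start' ends up in a loop,
 * 0 if the chain reaches NULL. The slow pointer advances one link per
 * step and the fast pointer two; they can only meet if there is a
 * cycle. */
int left_chain_has_cycle(const struct node *start)
{
    const struct node *slow = start;
    const struct node *fast = start;

    while (fast && fast->rb_left) {
        slow = slow->rb_left;
        fast = fast->rb_left->rb_left;
        if (slow == fast)
            return 1;
    }
    return 0;
}
```

For the situation described above (two nodes pointing at each other via
rb_left) this returns 1, which could be turned into an assertion or a log
line so the corruption shows up before rb_erase spins on it.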
Solutions?
- Connect the hNodeB with a proper timing source? So far no GPS is
connected.
- SCTP debugging?
- Rather concentrate on further development using hNodeB mocking test
programs?
(Obviously catch the segfaults in the osmo code, but they are not the
real problem. Once they are solved, the basic messaging problems will
still exist.)
~Neels