Hi Holger,
I was able to narrow it down, and it seems this is the patch that causing the problem:
http://cgit.osmocom.org/libosmocore/commit/?id=f5a079f739c57d8be7c59149fd454...
Up to that patch it works (more or less), but after I applied it, I got this seg faults.
I don't know how to do back traces but I try to get some info about this topic and will send the results.
Csaba
On Sat, Jun 29, 2013 at 03:58:26PM +0200, Sipos Csaba wrote:
Hi Holger,
Hi,
I was able to narrow it down, and it seems this is the patch that causing the problem:
http://cgit.osmocom.org/libosmocore/commit/?id=f5a079f739c57d8be7c59149fd454...
Up to that patch it works (more or less), but after I applied it, I got this seg faults.
good. this patch was meant to detect use after free cases. This means the code claims that the memory is not used anymore, but it was.
I don't know how to do back traces but I try to get some info about this topic and will send the results.
$ ulimit -c unlimited $ nitb.... ... Segmentation fault (core dumped) $ gdb nitb core gdb > bt <BACKTRACE IS PRINTED>
include the output from here.
Hi Holger,
This was the result of the backtrace:
root@openbsc:~# ulimit -c unlimited root@openbsc:~# osmo-nitb -c 1800_nokia.cfg <0018> input/lapd.c:212 LAPD Allocating SAP for SAPI=62 / TEI=1 <0018> input/lapd.c:223 k=1 N200=3 N201=260 T200=1.0 T203=10.0 <0018> input/lapd.c:485 LAPD DL-ESTABLISH request TEI=1 SAPI=62 DB: Database initialized. DB: Database prepared. <001d> sms_queue.c:220 Attempting to send 20 SMS <0018> input/lapd.c:624 LAPD DL-ESTABLISH confirm TEI=1 SAPI=62 <0005> bts_nokia_site.c:58 bootstrapping OML for BTS 0 <0005> bts_nokia_site.c:1679 ABIS_OM_MDISC_FOM <0005> bts_nokia_site.c:1507 (0x81) NOKIA_BTS_ACK <0005> bts_nokia_site.c:1539 ACK = 1 <0018> input/lapd.c:513 LAPD DL-RELEASE request TEI=1 SAPI=62 <0018> input/lapd.c:628 LAPD DL-RELEASE confirm TEI=1 SAPI=62 Segmentation fault (core dumped) root@openbsc:~# gdb nitb core GNU gdb (Ubuntu/Linaro 7.4-2012.04-0ubuntu2.1) 7.4-2012.04 Copyright (C) 2012 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "i686-linux-gnu". For bug reporting instructions, please see: http://bugs.launchpad.net/gdb-linaro/... nitb: No such file or directory. [New LWP 6873] Core was generated by `osmo-nitb -c 1800_nokia.cfg'. Program terminated with signal 11, Segmentation fault. #0 0xb774c11d in ?? () (gdb) bt #0 0xb774c11d in ?? () #1 0xfffffffe in ?? () #2 0x00000003 in ?? () #3 0xb775acdf in ?? () #4 0x000006ec in ?? () #5 0x00000000 in ?? () (gdb)
I hope this is going to help.
BR, Csaba
On Sat, Jun 29, 2013 at 11:52:06PM +0200, Sipos Csaba wrote:
root@openbsc:~# gdb nitb core
'nitb' was meant as a place holder because I don't know if you invoke it with ./osmo-nitb, osmo-nitb, /usr/local/bin/osmo-nitb. When invoking gdb you need to pass the right one. E.g. you didn't use 'nitb' to start the application either. ;)
GNU gdb (Ubuntu/Linaro 7.4-2012.04-0ubuntu2.1) 7.4-2012.04 #0 0xb774c11d in ?? () (gdb) bt #0 0xb774c11d in ?? () #1 0xfffffffe in ?? () #2 0x00000003 in ?? () #3 0xb775acdf in ?? () #4 0x000006ec in ?? () #5 0x00000000 in ?? () (gdb)
No, the above is not useful at all. The information about methods and line numbers is missing. Re-run the gdb command and point it to the nitb that was used.
holger
Thanks Holger,
I think this is the rigth one:
root@openbsc:~# ulimit -c unlimited root@openbsc:~# osmo-nitb <0018> input/lapd.c:212 LAPD Allocating SAP for SAPI=62 / TEI=1 <0018> input/lapd.c:223 k=1 N200=3 N201=260 T200=1.0 T203=10.0 <0018> input/lapd.c:485 LAPD DL-ESTABLISH request TEI=1 SAPI=62 DB: Database initialized. DB: Database prepared. <001d> sms_queue.c:220 Attempting to send 20 SMS <0018> input/lapd.c:624 LAPD DL-ESTABLISH confirm TEI=1 SAPI=62 <0005> bts_nokia_site.c:58 bootstrapping OML for BTS 0 <0005> bts_nokia_site.c:1679 ABIS_OM_MDISC_FOM <0005> bts_nokia_site.c:1507 (0x81) NOKIA_BTS_ACK <0005> bts_nokia_site.c:1539 ACK = 1 <0018> input/lapd.c:513 LAPD DL-RELEASE request TEI=1 SAPI=62 <0018> input/lapd.c:628 LAPD DL-RELEASE confirm TEI=1 SAPI=62 Segmentation fault (core dumped) root@openbsc:~# gdb osmo-nitb core GNU gdb (Ubuntu/Linaro 7.4-2012.04-0ubuntu2.1) 7.4-2012.04 Copyright (C) 2012 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "i686-linux-gnu". For bug reporting instructions, please see: http://bugs.launchpad.net/gdb-linaro/... Reading symbols from /usr/local/bin/osmo-nitb...done. [New LWP 6873]
warning: Can't read pathname for load map: Input/output error. [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/i386-linux-gnu/libthread_db.so.1". Core was generated by `osmo-nitb -c 1800_nokia.cfg'. Program terminated with signal 11, Segmentation fault. #0 0xb774c11d in lapd_send_i (line=1606, lctx=<optimized out>) at lapd_core.c:1802 1802 if (!dl->tx_hist[h].msg) { (gdb) bt #0 0xb774c11d in lapd_send_i (line=1606, lctx=<optimized out>) at lapd_core.c:1802 #1 0xb774ed9d in lapd_rx_i (lctx=0xbfcf0028, msg=0x81a31a8) at lapd_core.c:1606 #2 lapd_ph_data_ind (msg=0x81a31a8, lctx=0xbfcf0028) at lapd_core.c:1641 #3 0xb76fd5dd in lapd_receive (li=0x8177aa0, msg=0x81a31a8, error=0xbfcf00ac) at input/lapd.c:461 #4 0xb76f4046 in e1inp_rx_ts_lapd (e1i_ts=0x81796cc, msg=0x81a31a8) at e1_input.c:557 #5 0xb76f7c64 in handle_ts1_read (bfd=0x8179c38) at input/dahdi.c:190 #6 dahdi_fd_cb (bfd=0x8179c38, what=1) at input/dahdi.c:376 #7 0xb770cfc1 in osmo_select_main (polling=0) at select.c:158 #8 0x0804cc09 in main (argc=3, argv=0xbfcf03f4) at bsc_hack.c:346 (gdb)
BR, Csaba
On Sun, Jun 30, 2013 at 12:00:06PM +0200, Sipos Csaba wrote:
Thanks Holger,
Good Afternoon,
<0018> input/lapd.c:628 LAPD DL-RELEASE confirm TEI=1 SAPI=62
this releases the LAPD structures.. (and now sets dl->tx_hist to NULL)
#0 0xb774c11d in lapd_send_i (line=1606, lctx=<optimized out>) at lapd_core.c:1802 1802 if (!dl->tx_hist[h].msg) {
and it is accessed even after it has been released. Now I don't know if this access would be legetimate but it is from 'dead' memory.
Could you please create a PCAP file of the communication on the line? NITB has a config param for that and it should work for LAPD.
Hi Holger
I've made some progress with the segfault problem regarding Nokia and patch: f5a079f739c57d8be7c59149fd45475c402a45fc
It seems if I use "nokia_site skip-reset 1" in the config file, the problem is gone. I am pretty sure after these months, that the problem about the LAPD errors, this segfault problem and the multi-BTS problem is somewhere around the code that is responsible for this reset sequence, because if I turn it off, all problems are gone.
Maybe the cause is that the code is not multi-BTS aware (the reset part). For example it sends the RESET code for one BTS (despite the first one got acked properly), but it wont send the RESET code to the other BTS. Another thing is that after the RESET code sent to a particular BTS (and recieved an ack), all LAPD connections should be disconnected and keep it that way, till the RESET_TIMER expires (because NOkia BTS sends garbage on the E1 during reset), then try the bootstrapping again. And this is what actually happens:
1. OpenBSC sends BTS_RESET 2. OpenBSC receives ACK. 3. OpenBSC closes the LAPD connections as it should. 4. But somehow the LAPD connection got restarted during the reset period (some data received from the BTS, there is a lot of unimplemented TRAU warnings because the data is garbage).
The problem is that if I have to change some parameters, I have no choice but to use the NOkia reset code, otherwise the parameters are not going to be change in the BTS (despite the successful bootstrapping of OML and RSL).
Otherwise the code seems to handle multi Nokia BTSes well. It seems that the solution is somewhere in the code responsible for the reset part.
I hope I can get some help from someone who understands the code better. Once again, I am happy to do whatever it takes to sort this out.
BR, Csaba
On Sun, Jun 30, 2013 at 12:00:06PM +0200, Sipos Csaba wrote:
Thanks Holger,
Good Afternoon,
<0018> input/lapd.c:628 LAPD DL-RELEASE confirm TEI=1 SAPI=62
this releases the LAPD structures.. (and now sets dl->tx_hist to NULL)
#0 0xb774c11d in lapd_send_i (line=1606, lctx=<optimized out>) at lapd_core.c:1802 1802 if (!dl->tx_hist[h].msg) {
and it is accessed even after it has been released. Now I don't know if this access would be legetimate but it is from 'dead' memory.
Could you please create a PCAP file of the communication on the line? NITB has a config param for that and it should work for LAPD.
On Sun, Jul 28, 2013 at 05:34:57PM +0200, Sipos Csaba wrote:
Hi Holger
I've made some progress with the segfault problem regarding Nokia and patch: f5a079f739c57d8be7c59149fd45475c402a45fc
Hi,
I can only provide you with debug aid.
It seems if I use "nokia_site skip-reset 1" in the config file, the problem is gone. I am pretty sure after these months, that the problem about the LAPD errors, this segfault problem and the multi-BTS problem is somewhere around the code that is responsible for this reset sequence, because if I turn it off, all problems are gone.
* the reset will lead to LDAP release.. * some one is using the released link..
debug:
* Set a breakpoint in the code that is relreasing the hist buff * Once it is hit set a watchpoint on the memory regions... * Figure out who is accessing it first.. * Then try to figure out if that is correct or not ** E.g. do we need to flush a queue>
...
Hey Holger,
Thanks for the help. I am a little new in debugging so it might take a while, till I find out why the LAPD connection got reopened after OpenBSC receives the ack for the Nokia reset_req.
Until we can figure it out what is the problem, there is a temporary solution.
With "nokia_site skip-reset 1" in the config file and with the following modification in bts_nokia_site.c:
static int shutdown_om(struct gsm_bts *bts) { /* TODO !? */ LOGP(DNM, LOGL_NOTICE, "Sending Nokia RESET_REQ before shutdown\n"); abis_nm_reset(bts, 1); return 0; }
now the reset takes place not during the initialization, but during the shutdown. The result is the same: on the next OpenBSC run all Nokia BTS will be in a "clean restarted" state already, so despite the "skip-reset 1" both the OML and the RSL can be bootstrapped properly.
This is not a final solution for sure, there is no exception handling, it doesn't wait for ACK or anything, but at least it seems it is working for now, and there is no LAPD errors and segfault failures neither, even with the "problematic" patch applied.
Will test this in 1-2 days, and report back with the results.
BR, Csaba
On Sun, Jul 28, 2013 at 05:34:57PM +0200, Sipos Csaba wrote:
Hi Holger
I've made some progress with the segfault problem regarding Nokia and patch: f5a079f739c57d8be7c59149fd45475c402a45fc
Hi,
I can only provide you with debug aid.
It seems if I use "nokia_site skip-reset 1" in the config file, the problem is gone. I am pretty sure after these months, that the problem about the LAPD errors, this segfault problem and the multi-BTS problem is somewhere around the code that is responsible for this reset sequence, because if I turn it off, all problems are gone.
- the reset will lead to LDAP release..
- some one is using the released link..
debug:
- Set a breakpoint in the code that is relreasing the hist buff
- Once it is hit set a watchpoint on the memory regions...
- Figure out who is accessing it first..
- Then try to figure out if that is correct or not
** E.g. do we need to flush a queue>
On Sun, Jul 28, 2013 at 11:39:05PM +0200, Sipos Csaba wrote:
Hey Holger,
Thanks for the help. I am a little new in debugging so it might take a while, till I find out why the LAPD connection got reopened after OpenBSC receives the ack for the Nokia reset_req.
Just try, don't be afraid. We are happy to answer any self motivated technical question. We all had to learn how to use gdb and various other tools.
Will test this in 1-2 days, and report back with the results.
great
Hey Holger,
Thanks for the support.
If you or anybody can answer to one question, it would help the debugging very much:
In your opinion without explicitly closing the OML and RSL lapd connection, it is possible, that even after a RELEASE CONFIRMed lapd connection got restarted improperly, if it receives data from the BTS?
Because that is exactly what is happening:
1. OpenBSC send RESET_REQ to Nokia BTS 2. OpenBSC receives ACK from the BTS 3. OpenBSC releases the LAPD connection (but not closes it, like with Ctrl+C) 4. Some data received by OpenBSC via the E1 link, and the lapd connection for OML got restarted (despite we are in the RESET state and RESET_TIMER not expired yet)
According to Dieter (who helped a lot), during the RESET NOkia BTS sends garbage on the E1 link, therefore during the reset it is wise not to listen on any timeslots, until the RESET_TIMER expires (15-20 seconds). According to him, it was working like this with mISDN, but there was no DAHDI when he wrote the Nokia related codes, and now he don't have much time to debug this issue. SO the problem is that the LAPD connection somehow got restarted though it shouldn't durin the RESET state.
So this is where I am at the moment.
BR, Csaba
Just try, don't be afraid. We are happy to answer any self motivated technical question. We all had to learn how to use gdb and various other tools.
Will test this in 1-2 days, and report back with the results.
great
Hey Holger,
After placing some markers into the code, I have found the part which is responsible for proper close of the OML connection:
lapd_sap_stop(ts->lapd, link->tei, link->sapi);
which is called properly by the Nokia related code, after the ACK got received for the RESET_REQ. The lapd connection closes as it should:
Fri Aug 2 01:10:16 2013 <0018> input/lapd.c:513 LAPD DL-RELEASE request TEI=1 SAPI=62 Fri Aug 2 01:10:16 2013 <0018> lapd_core.c:2183 Message DL-RELEASE-REQUEST received in state LAPD_STATE_MF_EST Fri Aug 2 01:10:16 2013 <0018> lapd_core.c:2024 perform local release Fri Aug 2 01:10:16 2013 <0018> lapd_core.c:229 new state LAPD_STATE_MF_EST -> LAPD_STATE_IDLE Fri Aug 2 01:10:16 2013 <0018> lapd_core.c:222 stop T203 Fri Aug 2 01:10:16 2013 <0018> input/lapd.c:628 LAPD DL-RELEASE confirm TEI=1 SAPI=62
Now OML lapd connection is closed, but for some unkown reason lapd_core.c still try to process the received lapd messages for an already closed connection, which causes the segfault:
Fri Aug 2 01:10:16 2013 <0018> lapd_core.c:1781 lapd_send_i() called from line 1615 Segmentation fault (core dumped)
And this is not because the Nokia related code tries to send something when the lapd connection is closed. From the above log, it is clear that the LAPD connection got terminated as it should, so it seems this is a problem with lapd_core.c in general, that for an already closed connection, if something got received on the OML timeslot, lapd_core.c tries to process it. And this behavior leads to the segfault.
This is normal, that lapd_core.c tries to process data related to an already closed connection?
BR, Csaba
Sipos Csaba wrote:
Fri Aug 2 01:10:16 2013 <0018> input/lapd.c:628 LAPD DL-RELEASE confirm TEI=1 SAPI=62
Ok. Do you have a later message from the following line?
LOGP(DLLAPD, LOGL_INFO, "I received in state %s\n", lapd_state_names[dl->state]);
This is normal, that lapd_core.c tries to process data related to an already closed connection?
If it is the root cause of an error then of course it isn't correct.
"Normal" isn't a very good word when describing software algorithms.
//Peter