NITB: high cpu usage and "crossed" messages when SMS table grows


Keith

13 Mar 2017

11:01 p.m.

Hi all,

I temporarily disabled a cron job we run at rhizomatica that purges the hlr SMS table of sent messages every day. After a few days I noticed slightly sluggish behaviour in the VTY, and sure enough, the nitb was consuming 100% cpu, not always, but presumably whenever it does a queue run.

I also just heard that in the last few days, we got a number of reports from users, some confirmed by photos of the phones, about messages being delivered to the wrong destination.

Apologies if this has been touched on before. I can't find a way to search this list's archives directly. Anyway, it might not be so bad to bring it up, as purging via a cron job that calls sqlite3 is not ideal, and anyway.. these crossed messages shouldn't happen, right? I imagine we'd like to track this one down.

Keith.


Neels Hofmeyr

14 Mar

1:04 a.m.

On Mon, Mar 13, 2017 at 11:01:57PM +0100, Keith wrote:

...

I temporarily disabled a cron job we run at rhizomatica that purges the hlr SMS table of sent messages every day. After a few days I noticed slightly sluggish behaviour in the VTY, and sure enough, the nitb was consuming 100% cpu, not always, but presumably whenever it does a queue run.

Hmm, that's a very vague indicator. How performant is the hardware? For how long does this load endure? Does the process hang otherwise, is service disrupted?

From how I got to know the SMS code, it appears to have sound safeguards in place, e.g. limits the number of SMS to be delivered per queue run, and only attempts deliveries of SMS for actually attached subscribers ... But in fact we don't have load testing in place. It would be good to find out where the disproportionate CPU load is coming from -- SQLite? The NITB sms code?

From a theoretical standpoint I'd also expect the SMS database to discard messages that are past a certain age, not sure though, as I'm not that deeply familiar with that (yet).

Would be good to know: how many SMS are pending, for how many subscribers, of which how many are currently attached? How often are SMS deliveries being retried and end in failure? ... and anything else you can think of.

...

I also just heard that in the last few days, we got a number of reports from users, some confirmed by photos of the phones, about messages being delivered to the wrong destination.

Whoa! That should absolutely not happen. I can't see how this is even possible.

...

I imagine we'd like to track this one down.

Optimally, we would want to be able to reproduce the failure. Do you have any edge data on the scenario in which this situation comes up? ~N

Alexander Chemeris

9:49 a.m.

Hi Neels,

On Mar 14, 2017 03:05, "Neels Hofmeyr" <nhofmeyr(a)sysmocom.de> wrote:

On Mon, Mar 13, 2017 at 11:01:57PM +0100, Keith wrote:

...

Hmm, that's a very vague indicator. How performant is the hardware? For how long does this load endure? Does the process hang otherwise, is service disrupted?

...

From how I got to know the SMS code, it appears to have sound safeguards in place, e.g. limits the number of SMS to be delivered per queue run, and only attempts deliveries of SMS for actually attached subscribers ... But in fact we don't have load testing in place. It would be good to find out where the disproportionate CPU load is coming from -- SQLite? The NITB sms code? From a theoretical standpoint I'd also expect the SMS database to discard messages that are past a certain age, not sure though, as I'm not that deeply familiar with that (yet). Would be good to know: how many SMS are pending, for how many subscribers, of which how many are currently attached? How often are SMS deliveries being retried and end in failure? ... and anything else you can think of.

We looked into this a couple of years ago, but didn't come up with a good solution and context-switched to something else.

Just add a few thousand SMS to the DB (the DB should be roughly 10 MB or more) and start OsmoNITB. You'll notice it's agony immediately. It's been a while, but IIRC the issue is that the DB didn't have proper indexes for the kinds of queries we're running on it, so it's getting super inefficient.

Regarding removing SMS - it should be done based on the validity time of the SMS, and that was completely broken. I had a patch set which fixed validity time handling, but IIRC it wasn't merged. We can probably dig it up, but I don't have much time to rebase / adapt it to the new codebase right now. If there are any volunteers, that would be great.

Please excuse typos. Written with a touchscreen keyboard.

--
Regards,
Alexander Chemeris
CTO/Founder Fairwaves, Inc.
https://fairwaves.co
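The point about missing indexes can be illustrated with a small, self-contained sketch. It does not use the real OsmoNITB schema: the table layout, the `dest_addr` column, the index name `idx_sms_pending` and the query itself are assumptions made up for the demonstration; only the table name `SMS` and the `sent` column appear in this thread. It asks SQLite for its query plan before and after adding an index, which is one way to check whether a given query walks the whole table.

```python
import sqlite3

# Hypothetical, simplified stand-in for the SMS table. The real OsmoNITB
# schema has more columns and may use different names.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE SMS ("
    " id INTEGER PRIMARY KEY,"
    " sent TIMESTAMP,"           # NULL while the message is still pending
    " dest_addr TEXT NOT NULL,"  # assumed recipient column
    " text TEXT)"
)

# Mostly delivered messages plus a small pending tail, roughly what an
# unpurged table looks like.
rows = [
    ("2017-03-13 00:00:00" if i % 50 else None, str(1000 + i % 500), "hello")
    for i in range(20000)
]
conn.executemany("INSERT INTO SMS (sent, dest_addr, text) VALUES (?, ?, ?)", rows)

# Assumed example of a "pending messages for one subscriber" lookup.
QUERY = "SELECT id FROM SMS WHERE sent IS NULL AND dest_addr = ?"


def plan(connection):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    cur = connection.execute("EXPLAIN QUERY PLAN " + QUERY, ("1000",))
    return [row[-1] for row in cur.fetchall()]


plan_before = plan(conn)  # no usable index, so SQLite scans the table
conn.execute("CREATE INDEX idx_sms_pending ON SMS (dest_addr, sent)")
plan_after = plan(conn)   # the same query can now search the index

print("before:", plan_before)
print("after: ", plan_after)
```

Running the same `EXPLAIN QUERY PLAN` check against a copy of a production database, with the queries NITB actually issues, would show which of them fall back to a full table scan.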

Neels Hofmeyr

12:21 p.m.

On Tue, Mar 14, 2017 at 11:49:52AM +0300, Alexander Chemeris wrote:

...

Just add a few thousand SMS to the DB (the DB should be roughly 10 MB or more) and start OsmoNITB. You'll notice it's agony immediately. It's been a while, but IIRC the issue is that the DB didn't have proper indexes for the kinds of queries we're running on it, so it's getting super inefficient.

Ah, so that part is SQLite DB related. I've been very active in Osmocom for the past 15 or so months thanks to sysmocom, but still learning new aspects of the code base regularly :)

...

Regarding removing SMS - it should be done based on the validity time of the SMS, and that was completely broken. I had a patch set which fixed validity time handling, but IIRC it wasn't merged. We can probably dig it up, but I don't have much time to rebase / adapt it to the new codebase right now. If there are any volunteers, that would be great.

You could push it as a private branch without bothering to pull it up to recent code and tell this list about it, maybe someone will be interested. For Osmocom's future plans in general, we are moving away from having an SQLite database in the OsmoNITB (and the new OsmoMSC). With the new VLR, the database is only used for SMS, no longer for subscribers, and there has been talk about implementing a proper SMSC, i.e. a separate process to handle SMS. So we would probably not want to spend effort on optimizing the old SMS storage "just before" we go on to get rid of it altogether. Sending SMS to the wrong recipient though is probably worth fixing. That should really not happen. ~N

Keith

2:37 p.m.

On 14/03/2017 12:21, Neels Hofmeyr wrote:

...

Ah, so that part is SQLite DB related. I've been very active in Osmocom for the past 15 or so months thanks to sysmocom, but still learning new aspects of the code base regularly :)

Yep, there are some big FIXME messages in openbsc/src/libmsc/db.c: db_sms_store()

...

Regarding removing SMS - it should be done based on the validity time of the SMS, and that was completely broken. I had a patch set which fixed validity time handling, but IIRC it wasn't merged. We can probably dig it up, but I don't have much time to rebase / adapt it to the new codebase right now. If there are any volunteers, that would be great.

I am mildly concerned about concurrent write access to the sqlite hlr. Although I saw the SMS table corrupted once, which prevented the nitb from starting, I'm not even sure it was caused by concurrent writes. I read up about locking and how sqlite handles this, and it should be OK.

...

For Osmocom's future plans in general, we are moving away from having an SQLite database in the OsmoNITB (and the new OsmoMSC). So we would probably not want to spend effort on optimizing the old SMS storage "just before" we go on to get rid of it altogether.

Yep.. how long is just before? (in ms please) :-) There isn't really an uncomplicated (out-of-the-box) SMSC solution at this time. We should probably fix this up and create some internal purging of sent SMS? I wanted to write routines to view more info about the SMS queue from the vty. Maybe this part is not worth it, especially as I've done it in python.
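As a rough illustration of the external purge discussed here, the following sketch deletes delivered messages from Python's standard `sqlite3` module. The `DELETE` statement is the one quoted elsewhere in this thread; the function name `purge_sent_sms`, the `busy_timeout_s` parameter and the database path are hypothetical. The `timeout` argument only makes this connection wait for a lock held by another writer; it is not a statement about how NITB itself handles locking.

```python
import sqlite3


def purge_sent_sms(db_path, busy_timeout_s=5.0):
    """Delete delivered messages and return how many rows were removed.

    Hypothetical helper; `db_path` points at the SQLite file holding the
    SMS table.
    """
    # `timeout` makes this connection wait up to busy_timeout_s seconds for
    # a lock held by another writer instead of failing immediately with
    # "database is locked".
    conn = sqlite3.connect(db_path, timeout=busy_timeout_s)
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            cur = conn.execute("DELETE FROM SMS WHERE sent IS NOT NULL")
            return cur.rowcount
    finally:
        conn.close()
```

A cron job could call `purge_sent_sms("/path/to/hlr.sqlite3")` (placeholder path) and log the returned count, which also gives a cheap daily record of how fast the table grows.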

...

Sending SMS to the wrong recipient though is probably worth fixing. That should really not happen.

No, it makes people rather irate. :-(

Neels Hofmeyr

15 Mar

2:22 p.m.

On Tue, Mar 14, 2017 at 02:37:00PM +0100, Keith wrote:

...

We should probably fix this up and create some internal purging of sent SMS?

...

Sending SMS to the wrong recipient though is probably worth fixing. That should really not happen.

No, it makes people rather irate. :-(

The social dimension of it is potentially quite destructive :P "Wait what!? *Who* is *Mandy*!?!?" Plus the IMSI == Inf thing. I fully agree and would love to jump in and help out, but unfortunately I can't divert my attention from the current commitment, at least not now... Switching to an external SMSC is probably also going to take Inf ms. ~N

Harald Welte

2:43 p.m.

New subject: proper e-mail quoting is important (was Re: NITB: high cpu usage and "crossed" messages when SMS table grows)

Hi Alexander,

I would really appreciate it if you could fix the quoting in your e-mails. My time is (sorry) too limited to have to read very carefully through every line and then try to figure out whether it is something you quoted from somebody else's mail or something you actually added to the discussion. This means reading such a mail takes many times as long as reading properly formatted mails.

Thanks for your attention. In fact, I had already started to not read through your most recent mails anymore due to the lack of proper quoting, and I thought it would be better to actually let you know rather than silently disregarding the mails.

I apologize for my impulsive response to this, and use this message as a general reminder to everyone on this list on how important proper quoting is.

Regards,
Harald

--
- Harald Welte <laforge(a)gnumonks.org> http://laforge.gnumonks.org/
============================================================================
"Privacy in residential applications is a desirable marketing option."
(ETSI EN 300 175-7 Ch. A6)

Keith

14 Mar

2:15 p.m.

On 14/03/2017 01:04, Neels Hofmeyr wrote:

...

On Mon, Mar 13, 2017 at 11:01:57PM +0100, Keith wrote:

the nitb was consuming 100% cpu, not always, but presumably whenever it does a queue run.

Hmm, that's a very vague indicator.

I know, I know.. :-/

...

Would be good to know: how many SMS are pending, for how many subscribers, of which how many are currently attached? How often are SMS deliveries being retried and end in failure? ... and anything else you can think of.

Yesterday I did DELETE FROM SMS WHERE sent IS NOT NULL and the cpu usage problem went away.

Previous to that, on the site where I most noticed it:

Total entries in table: 114231
Pending (not sent): 1662
Distinct number of subscribers with pending SMS: 501

So, as Alexander says, it seems to have to do simply with the number of entries in the table.

Alex, if you dig out those patches, I'm happy to submit them for Code Review.
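The three figures above can be gathered with a couple of aggregate queries. This is a minimal sketch, not the script used on the production sites: it assumes that pending messages are the rows where `sent` is NULL (the inverse of the purge condition above) and that the recipient is stored in a column called `dest_addr`, which is a made-up name here. The function name `sms_queue_stats` is likewise hypothetical.

```python
import sqlite3


def sms_queue_stats(conn):
    """Return total rows, pending rows and distinct recipients with pending SMS.

    Assumes a table SMS where `sent` is NULL for undelivered messages and a
    hypothetical `dest_addr` column identifies the recipient.
    """
    total = conn.execute("SELECT COUNT(*) FROM SMS").fetchone()[0]
    pending, recipients = conn.execute(
        "SELECT COUNT(*), COUNT(DISTINCT dest_addr) FROM SMS WHERE sent IS NULL"
    ).fetchone()
    return {
        "total": total,
        "pending": pending,
        "subscribers_with_pending": recipients,
    }
```

Logging these numbers periodically alongside CPU load would help show whether the slowdown tracks the total table size, as the figures above suggest, or the number of pending messages.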

...

messages being delivered to the wrong destination.

Whoa! That should absolutely not happen. I can't see how this is even possible.

That's exactly the reaction that happened when I mentioned it at OsmoDevCon last year. :-) I have also been somewhat incredulous of these reports, putting it down to user error, or possibly something in our SMPP->kannel->python stuff->kannel->SMPP->Osmo chain, and after working a little on that code, the problem ceased to be reported, but it's quite telling that at the same time that I stop purging the SMS table and we grow above 100,000 entries in SMS table, we get reports from at least 4 communities of these "crossed" messages. k/


openbsc@lists.osmocom.org
