On Mon, Mar 13, 2017 at 11:01:57PM +0100, Keith wrote:
> I temporarily disabled a cron job we run at rhizomatica that purges the hlr
> SMS table of sent messages every day.
> After a few days I noticed slightly sluggish behaviour in the VTY, and
> sure enough, the nitb was consuming 100% cpu, not always, but presumably
> whenever it does a queue run.
Hmm, that's a very vague indicator. How powerful is the hardware? How long
does the load last? Does the process hang, or is service otherwise
disrupted?
From what I have seen of the SMS code, it appears to have sound safeguards in
place, e.g. it limits the number of SMS delivered per queue run and only
attempts delivery to subscribers that are actually attached ... But in fact we
don't have load testing in place. It would be good to find out where the
disproportionate CPU load is coming from -- SQLite? The NITB SMS code? From a
theoretical standpoint I'd also expect the SMS database to discard messages
that are past a certain age, not sure though, as I'm not that deeply familiar
with that (yet).
Would be good to know: how many SMS are pending, for how many subscribers, and
how many of those subscribers are currently attached? How often are SMS
deliveries retried, and how often do they end in failure? ... and anything else
you can think of.
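
For what it's worth, here is a rough sketch of how those numbers could be
pulled straight from the HLR database with Python's sqlite3 module. The table
and column names (SMS.sent, SMS.dest_addr, SMS.deliver_attempts,
Subscriber.extension, Subscriber.lac) and the "lac > 0 means attached"
convention are my assumptions about the schema, not something I have verified
against your version; please check with `.schema` in the sqlite3 shell and
adjust before trusting the output.

```python
import sqlite3


def sms_queue_stats(conn):
    """Return a dict of counts describing the pending SMS backlog.

    ASSUMED schema (hypothetical, verify against the real database):
      SMS(sent, dest_addr, deliver_attempts)  -- sent IS NULL: still pending
      Subscriber(extension, lac)              -- lac > 0: currently attached
    """
    cur = conn.cursor()

    def one(sql):
        # Run a query that yields a single number and return it.
        return cur.execute(sql).fetchone()[0]

    return {
        # SMS that have not been marked as sent yet
        "pending": one("SELECT COUNT(*) FROM SMS WHERE sent IS NULL"),
        # distinct destination numbers among the pending SMS
        "recipients": one(
            "SELECT COUNT(DISTINCT dest_addr) FROM SMS WHERE sent IS NULL"
        ),
        # ... of which the destination subscriber is currently attached
        "attached_recipients": one(
            "SELECT COUNT(DISTINCT s.dest_addr) FROM SMS s "
            "JOIN Subscriber sub ON sub.extension = s.dest_addr "
            "WHERE s.sent IS NULL AND sub.lac > 0"
        ),
        # pending SMS that have already been tried at least once
        "retried": one(
            "SELECT COUNT(*) FROM SMS "
            "WHERE sent IS NULL AND deliver_attempts > 0"
        ),
    }
```

To stay out of the way of the running nitb, I would call this on a copy of the
database file, or open it read-only, e.g.
sqlite3.connect("file:hlr.sqlite3?mode=ro", uri=True), where the file name is
again just a guess at your setup.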
> I also just heard that in the last few days, we got a number of reports from
> users, some confirmed by photos of the phones, about messages being delivered
> to the wrong destination.
Whoa! That should absolutely not happen. I can't see how this is even possible.
I imagine we'd like to track this one down.
Optimally, we would want to be able to reproduce the failure. Do you have any
edge data on the scenario in which this situation comes up?
~N