On 11/09/17 14:24, Neels Hofmeyr wrote:
I had the watchdog script in place that power cycles the quad modem board and
restarts ofono as soon as the modem names mismatch what we expect. But lynxis
disabled it. The reasoning is that we won't catch ofono errors. That may be
true; my idea was to be able to automatically recover ofono without manual
intervention. I guess it all depends on how closely you (lynxis) watch the gsm
testers for failures? My focus is not particularly on ofono, so when I hit a
broken situation and need to test things, I will restart ofono rather than
investigate the failure in depth; in a dismissive way: "come on ofono, do
what I want now." What do you guys think about this?
I think we should auto-restart ofono in prod, since we will still catch the
ofono failure in one test run in the histogram and can analyse it, and at
least all subsequent tests keep running without human intervention (i.e.
systemctl restart ofono). This way we avoid having tests fail for many
hours, only to find after restarting ofono that a regression was
introduced in osmo-* in the meantime, and then having to bisect
potentially lots of commits.
For RnD, I think it's fine to have manual intervention; I don't want
automatic stuff running at the same time I'm running tests because I
want a controlled environment there.
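
To make the prod behaviour concrete, here is a rough sketch of one watchdog
check, not the script that was actually deployed. The modem names, the
list_modems callable and the function names are all hypothetical, and the
power cycling of the quad modem board is left out because the command for it
is not given in this thread.

```python
import subprocess

# Hypothetical modem names; the real expected set is not given in this thread.
EXPECTED_MODEMS = {"/modem_a", "/modem_b", "/modem_c", "/modem_d"}


def restart_ofono():
    """Restart ofono the same way one would by hand."""
    subprocess.run(["systemctl", "restart", "ofono"], check=True)


def watchdog_step(list_modems, restart=restart_ofono, expected=EXPECTED_MODEMS):
    """Run one check and restart ofono if the modem names mismatch.

    list_modems is a caller-supplied callable returning the modem names
    currently visible (assumption: how they are queried is up to the caller).
    Returns True if a restart was triggered, False otherwise.
    """
    actual = set(list_modems())
    if actual == set(expected):
        return False
    # Log the mismatch first so the failure can still be analysed afterwards.
    print("modem mismatch: expected %s, got %s"
          % (sorted(expected), sorted(actual)))
    restart()
    return True
```

Logging before the restart is the point here: the failure stays visible for
later analysis while the following test runs can continue unattended.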
Something I thought about before: we could implement a kind of random or round
robin to not always pick the first matching resources in the list. The advantage
is that we would cycle through the hardware and be forced to precisely formulate
e.g. modem requirements. The disadvantage is that not every test is run exactly
the same, adding complexity that may obscure analysis, i.e. to reproduce a run
on a particular modem, we would have to somehow clamp that randomness, e.g.
log a random seed at the start and allow passing in a random seed on the
cmdline.
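
For illustration only, the seed idea could look roughly like the sketch
below. This is not existing osmo-gsm-tester code: pick_resource, make_rng and
the --seed option are hypothetical names, and resources are assumed to be
plain dicts.

```python
import argparse
import random


def parse_args(argv=None):
    """Parse a hypothetical --seed option for reproducing an earlier run."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--seed", type=int, default=None,
                        help="seed logged by the run to reproduce")
    return parser.parse_args(argv)


def make_rng(seed=None):
    """Return (rng, seed); draw a fresh seed if none was passed in."""
    if seed is None:
        seed = random.SystemRandom().randrange(2 ** 32)
    # Log the seed at the start so the run can be replayed later.
    print("resource selection seed: %d" % seed)
    return random.Random(seed), seed


def pick_resource(resources, matches, rng):
    """Pick one matching resource at random instead of always the first."""
    candidates = [r for r in resources if matches(r)]
    if not candidates:
        raise LookupError("no matching resource available")
    return rng.choice(candidates)
```

Passing the logged seed back in via --seed would reproduce the same sequence
of picks, provided the resource list and the order of requests are unchanged.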
I also thought about this some time ago, but I think we already have too
many unstable things right now, so let's not add more instability. I also
think we can invest time in other stuff which may be more interesting.
Once we run suites in parallel we will start using resources more
intensively and we'll get this for free.
--
- Pau Espin Pedrol <pespin(a)sysmocom.de>
http://www.sysmocom.de/