A wild guess, but can any of you check whether the GSM tester state is corrupt again? My test with virtual equipment failed the last two times I tried, with no IP addresses available, and the production run seems broken as well.
thank you!
holger
On Mon, Jan 28, 2019 at 08:08:36AM +0000, Holger Freyther wrote:
> A wild guess but can any of you check if the GSM tester state is corrupt again? My test with virtual equipment failed the last two times I tried with no IP addresses available and the production run seems broken as well.
> thank you!
I notice the osmo-gsmtester-prod moved to IP address .107 since I was last on it. But I can log in.
The /var/tmp/osmo-gsm-tester/state has reserved resources since 2019-01-23 22:21:01.
The 'ps' output looks troubling: a large number of tcpdumps running, 843!! wtf
See http://people.osmocom.org/neels/eengeeK2/gsm_tester_ps.txt
No 'osmo-gsm-tester.py' in ps, so nothing is running.
I erased the reserved_resources and rebooted.
~N
Hi,
That happens from time to time because, while a job is running there, the connection between the jenkins master and slave fails. As a result, jenkins decides it's a good idea to kill -9 the osmo-gsm-tester process, which leaves its child processes alive without anybody controlling them.
I usually take care of removing those processes myself (ps -ef | grep osmo, kill) and cleaning the state (rm /var/tmp/osmo-gsm-tester/state/*) when I see lots of tests failing with UNKNOWN during the more-or-less daily jenkins job. Unfortunately, since I am on sick leave these days, it may take longer than usual to get this done after a connection failure, my apologies.
Thanks Neels for taking care of the issue.
Kind regards, Pau
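As a rough illustration of the manual cleanup described above, the following sketch lists candidate leftover processes before anything is killed. It assumes orphaned children end up with parent PID 1, which does not hold under a subreaper or some session managers. The helper names `parse_ps` and `find_orphans` and the `osmo|tcpdump` pattern are made up for this example and are not part of osmo-gsm-tester.

```python
import re
from typing import List, NamedTuple


class Proc(NamedTuple):
    pid: int
    ppid: int
    args: str


def parse_ps(ps_output: str) -> List[Proc]:
    """Parse the text printed by `ps -eo pid,ppid,args` (header line included)."""
    procs = []
    for line in ps_output.splitlines()[1:]:  # skip the header line
        fields = line.split(None, 2)
        if len(fields) < 3 or not (fields[0].isdigit() and fields[1].isdigit()):
            continue  # ignore lines that do not look like "PID PPID ARGS"
        procs.append(Proc(int(fields[0]), int(fields[1]), fields[2]))
    return procs


def find_orphans(procs: List[Proc], pattern: str = r"osmo|tcpdump") -> List[Proc]:
    """Return processes matching `pattern` whose parent is PID 1.

    Assumption: children of a killed parent were reparented to PID 1.
    """
    rx = re.compile(pattern)
    return [p for p in procs if p.ppid == 1 and rx.search(p.args)]
```

To use it, pass the captured output of `ps -eo pid,ppid,args` to `parse_ps` and review the result of `find_orphans` by hand. The filter also matches any legitimate long-running daemon with "osmo" in its command line and parent PID 1, so it is a list of candidates and not a kill list.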
Hi Pau,
On Mon, Jan 28, 2019 at 09:30:36PM +0100, Pau Espin Pedrol wrote:
> that happens from time to time because while running a job in there, the jenkins connection between jenkins master and slave fails. As a result, jenkins decides it's a good idea to kill -9 osmo-gsm-tester process, leaving as a result its child processes alive without anybody controlling them.
FYI, this leaking of processes has been a known bug in Jenkins since 2013 (!), which many people have raised at https://issues.jenkins-ci.org/browse/JENKINS-17116
It's a real pity that such an important bug doesn't seem to get fixed by Jenkins upstream. If there are any Java developers with some spare cycles reading here, it would really be great if you could contribute a related fix upstream. Thanks!
On 28. Jan 2019, at 20:30, Pau Espin Pedrol pespin@sysmocom.de wrote:
> Hi,
Hi!
I don't know how the state works but have you considered:
* Adding stale detection, either in code or jenkins?
* far fetched.. look if subreaper can be of any help?
cheers holger
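One way the stale-detection idea could look is sketched below. It is only an illustration: the JSON file layout, the `owner_pid` field and the function names are assumptions for this example and do not describe how osmo-gsm-tester actually stores its reserved resources. The idea is that a reservation records the PID of its owner, and a reservation whose owner no longer exists is treated as stale.

```python
import json
import os


def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID currently exists."""
    try:
        os.kill(pid, 0)  # signal 0 performs the existence check only
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def write_reservation(path: str, resources) -> None:
    """Record which process reserved which resources (hypothetical format)."""
    with open(path, "w") as f:
        json.dump({"owner_pid": os.getpid(), "resources": list(resources)}, f)


def is_stale(path: str) -> bool:
    """A reservation counts as stale if its recorded owner no longer exists."""
    with open(path) as f:
        owner = json.load(f)["owner_pid"]
    return not pid_alive(owner)
```

A PID check like this is a heuristic, since PIDs can be reused by unrelated processes. Storing the owner's start time next to the PID would make it more robust, at the cost of more bookkeeping.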
> that happens from time to time because while running a job in there, the jenkins connection between jenkins master and slave fails. As a result, jenkins decides it's a good idea to kill -9 osmo-gsm-tester process, leaving as a result its child processes alive without anybody controlling them.
> I usually take care myself of removing those processes (ps -ef | grep osmo, kill) and cleaning state (rm /var/tmp/osmo-gsm-tester/state/*) when I see lots of tests failing with UNKNOWN during the more-or-less daily jenkins job. Unfortunately, due to being on sick leave these days it may take longer than usual to have this work done after a conn failure, my apologies.
> Thanks Neels for taking care of the issue.
> Kind regards, Pau
> --
> - Pau Espin Pedrol pespin@sysmocom.de http://www.sysmocom.de/
> =======================================================================
> - sysmocom - systems for mobile communications GmbH
> - Alt-Moabit 93
> - 10559 Berlin, Germany
> - Sitz / Registered office: Berlin, HRB 134158 B
> - Geschaeftsfuehrer / Managing Director: Harald Welte
On Tue, Jan 29, 2019 at 04:25:08PM +0000, Holger Freyther wrote:
> On 28. Jan 2019, at 20:30, Pau Espin Pedrol pespin@sysmocom.de wrote:
>> Hi,
> Hi!
> I don't know how the state works but have you considered:
> - Adding stale detection, either in code or jenkins?
> - far fetched.. look if subreaper can be of any help?
The state works pretty nicely if you don't kill -9 the process :P
The thought to build another safeguard around it has come up a few times before.
Yet we probably wouldn't be talking about it if jenkins didn't maintain that habit of just kicking open the airlock to void space on the osmo-gsm-tester.
(OTOH of course it would be even nicer if we could deal with kill -9 safely as well, but where do you draw the line; there's always *some* place where a kill -9 breaks the plan.)
~N
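For reference, the subreaper idea raised earlier in the thread could be sketched as a small Linux-only wrapper. This is an illustration under stated assumptions, not a tested fix for osmo-gsm-tester: the function names are made up, the wrapper is assumed to be started in place of the real command, and children are found by scanning `/proc`. It also only moves the line mentioned above: if the wrapper itself receives the kill -9, the leftovers are orphaned just as before.

```python
import ctypes
import os
import signal
import subprocess

PR_SET_CHILD_SUBREAPER = 36  # value from <linux/prctl.h>, Linux >= 3.4


def become_subreaper() -> None:
    """Ask the kernel to reparent orphaned descendants to this process."""
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.prctl(PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def children_of(parent: int):
    """Return PIDs whose parent is `parent`, found by scanning /proc."""
    kids = []
    for name in os.listdir("/proc"):
        if not name.isdigit():
            continue
        try:
            with open(f"/proc/{name}/stat") as f:
                stat = f.read()
        except OSError:
            continue  # the process vanished while we were scanning
        # After the last ')' the fields are: state, ppid, ...
        ppid = int(stat.rsplit(")", 1)[1].split()[1])
        if ppid == parent:
            kids.append(int(name))
    return kids


def run_and_reap(cmd) -> int:
    """Run `cmd`, then SIGKILL and reap every descendant it left behind."""
    become_subreaper()
    rc = subprocess.call(cmd)
    # Whatever `cmd` left running has been reparented to this process.
    while True:
        leftovers = children_of(os.getpid())
        if not leftovers:
            return rc
        for pid in leftovers:
            try:
                os.kill(pid, signal.SIGKILL)
                os.waitpid(pid, 0)  # reap, so no zombie remains
            except (ProcessLookupError, ChildProcessError):
                pass  # already gone or already reaped
```

The loop repeats because killing one leftover can orphan its own children, which are then reparented to the wrapper as well.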