Dear all,
for probably about a year (or longer) we have been putting up with VTY tests which cause builds to break under unclear circumstances. I personally believe the probability of a VTY test failing has recently increased again, and this is barely tolerable anymore. Often, rebasing/cherry-picking the given patch one or two times doesn't help either. Yet the given patch-under-test does not even touch anything related to VTY, as in https://gerrit.osmocom.org/3899, which has failed in https://jenkins.osmocom.org/jenkins/job/OpenBSC-gerrit/2451/ and https://jenkins.osmocom.org/jenkins/job/OpenBSC-gerrit/2454/
I know Neels and others have already spent significant time in the past trying to resolve this - unsuccessfully.
So I think the situation has reached a point where we should disable the vty tests, or at least the specific part of the vty tests that is known to break most frequently.
I definitely want us to have *more* testing, not less. However, when the test itself is not stable yet - particularly after that much time - we cannot have that buggy test delay our development.
I would vote for running those tests regularly (daily, every few hours, you name it), but not as part of the mandatory build verification for gerrit V+1.
What do others think?
On Mon, Sep 11, 2017 at 09:09:06PM +0200, Harald Welte wrote:
> Dear all,
> for probably about a year (or longer) we have been putting up with VTY tests which cause builds to break under unclear circumstances. I personally believe the probability of a VTY test failing has recently increased again, and this is barely tolerable anymore. Often, rebasing/cherry-picking the given patch one or two times doesn't help either. Yet the given patch-under-test does not even touch anything related to VTY, as in https://gerrit.osmocom.org/3899, which has failed in https://jenkins.osmocom.org/jenkins/job/OpenBSC-gerrit/2451/ and https://jenkins.osmocom.org/jenkins/job/OpenBSC-gerrit/2454/
> I know Neels and others have already spent significant time in the past trying to resolve this - unsuccessfully.
> So I think the situation has reached a point where we should disable the vty tests, or at least the specific part of the vty tests that is known to break most frequently.
> I definitely want us to have *more* testing, not less. However, when the test itself is not stable yet - particularly after that much time - we cannot have that buggy test delay our development.
If there are no resources / no one with an assignment to actively maintain this, then it's reasonable to disable it, or at least disable the tests that are breaking things now.
> I would vote for running those tests regularly (daily, every few hours, you name it), but not as part of the mandatory build verification for gerrit V+1.
I think this is fine, so we get fallout later on that we can address via robots, and make things a bit more agile.
In the Linux kernel, we usually get all these reports from robots afterwards, so I would say it's reasonable to follow the same approach.
On Mon, Sep 11, 2017 at 09:09:06PM +0200, Harald Welte wrote:
> In https://gerrit.osmocom.org/3899 which has failed in https://jenkins.osmocom.org/jenkins/job/OpenBSC-gerrit/2451/ and https://jenkins.osmocom.org/jenkins/job/OpenBSC-gerrit/2454/
This particular failure is due to a VTY change in libosmocore. I have fixed it in osmo-bsc.git, and this needs to be applied to openbsc.git as well. Change-Id: I77931d6a09c42c443c6936000592f22a7fd06cab
However, let's decide when we stop developing on openbsc.git. Every patch that is merged to openbsc.git now needs additional work to be applied to osmo-*.git.
Vice versa, we only need to backport serious fixes to openbsc.git; this is one of them.
Here is the backport to openbsc.git: https://gerrit.osmocom.org/3921
> I know Neels and others have already spent significant time in the past trying to resolve this - unsuccessfully.
That was the testBSCreload running into "Broken Pipe" errors.
If you see more of those, we may want to disable the testBSCreload:
https://gerrit.osmocom.org/3922
~N
On 13. Sep 2017, at 07:41, Neels Hofmeyr nhofmeyr@sysmocom.de wrote:
Hi!
> On Mon, Sep 11, 2017 at 09:09:06PM +0200, Harald Welte wrote:
> > In https://gerrit.osmocom.org/3899 which has failed in https://jenkins.osmocom.org/jenkins/job/OpenBSC-gerrit/2451/ and https://jenkins.osmocom.org/jenkins/job/OpenBSC-gerrit/2454/
> This particular failure is due to a VTY change in libosmocore. I have fixed it in osmo-bsc.git, and this needs to be applied to openbsc.git as well. Change-Id: I77931d6a09c42c443c6936000592f22a7fd06cab
Great. So the VTY tests found a behavior change and did their job. I think disabling tests is a slippery slope. Let's assume we would run them daily and send emails. How likely would it be that n failures in a row trigger a request to disable the mail notifications?
We do run into some form of resource limitation and mitigated it by reducing the number of executors (but that number is up again). In the past the VTY test runner forgot to close sockets, but even after that was fixed we were still running into something.
So either a form of kernel limit (and I couldn't find a MIB counting it) or something caused by "slow" (as recently pointed out) disk leading to a slow start of the software under test?
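If the slow-start theory holds, one way to rule it out would be to have the test runner wait until the VTY port accepts connections before sending any command. A minimal sketch, not the actual test runner code; `wait_for_port` and its default timings are hypothetical names and values chosen for illustration:

```python
import socket
import time


def wait_for_port(host, port, timeout=10.0, interval=0.1):
    """Return True once a TCP connect to (host, port) succeeds, False on timeout.

    Hypothetical helper: retries the connect until `timeout` seconds have
    passed, so a slowly starting program under test is not mistaken for a
    broken one.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # The probe connection is closed immediately by the context manager.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Logging how long the wait took on each run would also show whether start-up time correlates with the failures.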
> > I know Neels and others have already spent significant time in the past trying to resolve this - unsuccessfully.
> That was the testBSCreload running into "Broken Pipe" errors.
> If you see more of those, we may want to disable the testBSCreload:
Broken pipe still sounds like either we kill the TCP connection before we want to, or the remote process terminated. Could we dump core and check whether any core files exist at the end of the test run (and check that dumping core in a container works)?
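A rough sketch of what such a check could look like at the end of a test run. This is an assumption-laden illustration: `enable_core_dumps` and `find_core_files` are hypothetical helpers, and the `core*` pattern assumes core files land in the working directory under that name, which depends on the kernel's core_pattern setting (inside a container that setting comes from the host).

```python
import glob
import os
import resource


def enable_core_dumps():
    """Raise the soft core-size limit to the hard limit (inherited by children)."""
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))


def find_core_files(directory, pattern="core*"):
    """Return sorted paths of regular files in `directory` matching `pattern`."""
    paths = glob.glob(os.path.join(directory, pattern))
    return sorted(p for p in paths if os.path.isfile(p))
```

The runner would call `enable_core_dumps()` before starting the program under test and fail the job if `find_core_files()` returns anything afterwards.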
holger
On Wed, Sep 13, 2017 at 04:08:51PM +0800, Holger Freyther wrote:
> > This particular failure is due to a VTY change in libosmocore. I have fixed it in osmo-bsc.git, and this needs to be applied to openbsc.git as well. Change-Id: I77931d6a09c42c443c6936000592f22a7fd06cab
> Great. So the VTY tests found a behavior change and did their job.
Also it has just uncovered that the osmo-nitb VTY's 'nitb' vty node lacked the default node commands (including 'exit' and 'end'), so it was never really necessary to leave the 'nitb' node. It always took the parent node's exit/end command. Fix applied in https://gerrit.osmocom.org/3921, hope it gets +V now, then we can rebase the other patches onto it.
> I think disabling tests is a slippery slope.
As for the particular testBSCreload test, which apparently still fails sometimes and which, despite extensive digging, we never found a way to solidify: I think that we might be ok with disabling that one. Only that one.
> We do run into some form of resource limitation and mitigated it by reducing the number of executors (but that number is up again). In the past the VTY test runner forgot to close sockets, but even after that was fixed we were still running into something.
> So either a form of kernel limit (and I couldn't find a MIB counting it) or something caused by "slow" (as recently pointed out) disk leading to a slow start of the software under test?
[...]
> Broken pipe still sounds like either we kill the TCP connection before we want to, or the remote process terminated. Could we dump core and check whether any core files exist at the end of the test run (and check that dumping core in a container works)?
The debug output we've added so far didn't suggest an obvious error. It would be great to find a solution ... but I feel that other todos are more important ATM :/
~N
On Wed, Sep 13, 2017 at 01:02:57PM +0200, Neels Hofmeyr wrote:
> Also it has just uncovered that the osmo-nitb VTY's 'nitb' vty node lacked the default node commands (including 'exit' and 'end'), so it was never really necessary to leave the 'nitb' node. It always took the parent node's exit/end
s/necessary/possible
> command. Fix applied in https://gerrit.osmocom.org/3921, hope it gets +V now, then we can rebase the other patches onto it.
~N