We recently hit a bunch of jenkins failures, due to a full disk.
Just now I removed 172G worth of docker images from build2-deb9build-ansible; I thought we had the docker cleanup automated by now?
Even after that, build-2 still uses 244G of its root file system, which doesn't seem right. Most of it is in the deb9build-ansible lxc:
root@build-2 /var/lib/lxc/deb9build-ansible/rootfs # du -hs * | sort -h
[...]
2.2G    opt
5.8G    usr
8.1G    tmp    (what!)
 33G    home
153G    var
tmp/ has many folders like '196M tmp.u3y02wgBNI', which are all from March to May this year. I will delete them now.
home:

root@build-2 /var/lib/lxc/deb9build-ansible/rootfs/home/osmocom-build # du -hs *
0       bin
 19G    jenkins
 14G    jenkins_build_artifact_store
1.2G    osmo-ci
Interesting, I wasn't aware of us using the artifact store. It seems to come from some manual builds between April and October. Removing.

jenkins workspaces of 19G seem ok.
But osmo-ci of 1.2G!? That seems to be a manual build of the coverity job -- though the date is pretty recent, so is our coverity job actually building in ~osmocom-build/osmo-ci instead of in a workspace?
Even after the docker cleanup commands I used from the osmocom.org servers
wiki page:

  docker rm $(docker ps -a -q)
  docker rmi $(docker images -q -f dangling=true)

there are still 321 docker images around, most of which are months old. Not
sure why the above cleanups don't catch those. I'm just going to
indiscriminately blow all of them away now.
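For the record, blowing them all away boils down to something like this (a
sketch, not from the wiki page; 'docker system prune' assumes a reasonably
recent docker):

  # remove every container (force kills running ones), then every image:
  docker rm -f $(docker ps -a -q)
  docker rmi -f $(docker images -q)

  # newer docker can do the whole cleanup in one command:
  docker system prune -a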
Maybe a good cleanup strategy would be to automatically wipe out the entire build slave lxc every week or so and re-create it from scratch?
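As a rough sketch of that idea (the playbook name is made up; I assume the
'-ansible' suffix means the container is already provisioned via ansible):

  # stop and throw away the build slave container...
  lxc-stop -n deb9build-ansible
  lxc-destroy -n deb9build-ansible
  # ...and re-provision it from scratch (hypothetical playbook name):
  ansible-playbook deb9build.yml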
After this, we have on build-2:

Filesystem      Size  Used Avail Use% Mounted on
/dev/md2        438G   83G  333G  20% /
----- host-2
Similar story in the deb9build-ansible lxc on host-2: tons of docker images, just removed all of them.
But after that we still have:

root@host2 ~ # df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/md2        438G  311G  105G  75% /
On host-2, though, there are a lot of services running.
root@host2 / # du -hs * | sort -h
[...]
1.2G    usr
 59G    var
 75G    home
176G    external
[...]
2.7G    gerrit
3.1G    redmine-20170530-before-upgrade-to-3.4.tar
4.3G    mailman
5.7G    openmoko-wiki
7.8G    gitolite
9.9G    openmoko-people
 29G    redmine
112G    jenkins
root@host2 /external/jenkins/home/jobs # du -hs * | sort -h
171M    nplab-m3ua-test
198M    master-osmo-pcu
241M    ttcn3-sip-test
251M    osmo-gsm-tester_build-osmo-bsc
262M    ttcn3-ggsn-test
287M    gerrit-osmo-ttcn3-hacks
297M    master-osmo-bsc
322M    master-libosmo-sccp
328M    osmo-gsm-tester_build-osmo-sgsn
355M    master-osmo-mgw
359M    master-libosmo-netif
365M    osmo-gsm-tester_build-osmo-iuh
390M    gerrit-asn1c
392M    gerrit-osmo-bsc
419M    ttcn3-nitb-sysinfo
445M    osmo-gsm-tester_build-osmo-msc
456M    osmo-gsm-tester_manual-build-all
461M    master-libosmocore
461M    TEST_osmocomBB_with_libosmocore_dep
482M    master-osmo-iuh
611M    master-osmo-sgsn
704M    gerrit-osmo-bts
748M    master-osmo-msc
756M    gerrit-osmo-msc
929M    master-openbsc
1.1G    master-osmo-bts
1.1G    ttcn3-hlr-test
1.2G    gerrit-libosmocore
1.2G    ttcn3-mgw-test
1.9G    osmo-gsm-tester-rnd_run
2.0G    ttcn3-sgsn-test
3.0G    ttcn3-msc-test
3.2G    osmo-gsm-tester_run
3.5G    master-asn1c
4.2G    ttcn3-bsc-test-sccplite
4.7G    osmo-gsm-tester_run-rnd
6.2G    osmo-gsm-tester_gerrit
6.3G    osmo-gsm-tester_run-prod
7.5G    osmo-gsm-tester_ttcn3
8.5G    ttcn3-bsc-test
 43G    ttcn3-bts-test
It seems we are caching 211 ttcn3-bts-test builds. That seems a tad much. Indeed https://jenkins.osmocom.org/jenkins/view/TTCN3/job/ttcn3-bts-test/configure has "[ ] Discard old builds" (unchecked). Looking in osmo-ci, the jobs/ttcn3-testsuites.yml has no 'build-discarder' set. I guess we should add one? Any discard option preferences? A month? A year? (compare master-builds.yml)
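In case anyone wants the per-job build counts, something like this should
produce them (a sketch against the paths from the du above; the handful of
lastX symlinks in each builds/ dir inflate the count slightly):

  for j in /external/jenkins/home/jobs/*/builds; do
      printf '%6d %s\n' "$(ls "$j" | wc -l)" "${j%/builds}"
  done | sort -n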
----- admin-2
It seems I cannot log in there, or at least I don't know the IP address...

  ssh: Could not resolve hostname admin2.osmocom.org: Name or service not known

So I guess I can't check there.
~N
On Wed, Dec 05, 2018 at 04:22:07PM +0100, Neels Hofmeyr wrote:
> ----- admin-2
>
> It seems I cannot log in there, or at least I don't know the IP address...
>
>   ssh: Could not resolve hostname admin2.osmocom.org: Name or service not known
>
> So I guess I can't check there.
The mad jenkins web ui script console gives me:

Filesystem      Size  Used Avail Use% Mounted on
/dev/md2        438G  143G  273G  35% /
So it's not near critical yet, but also might have a bit of build-up?
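If we want to catch such build-up before jenkins falls over again, even a
trivial cron'ed check would do; a sketch (threshold and mail address are
placeholders, and 'df --output' needs GNU coreutils):

  #!/bin/sh
  # mail a warning when the root fs exceeds 90% usage
  use=$(df --output=pcent / | tail -n1 | tr -dc '0-9')
  if [ "$use" -gt 90 ]; then
      echo "root fs at ${use}% on $(hostname)" \
          | mail -s "disk space warning" admin@example.org
  fi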
~N
On 12/5/18 4:22 PM, Neels Hofmeyr wrote:
> It seems we are caching 211 ttcn3-bts-test builds. That seems a tad much.
> Indeed https://jenkins.osmocom.org/jenkins/view/TTCN3/job/ttcn3-bts-test/configure
> has "[ ] Discard old builds" (unchecked). Looking in osmo-ci, the
> jobs/ttcn3-testsuites.yml has no 'build-discarder' set. I guess we should
> add one? Any discard option preferences? A month? A year? (compare
> master-builds.yml)
The values that master-builds.yml has sound sane to me, so I created a patch that applies them to ttcn3-testsuites.yml:
https://gerrit.osmocom.org/#/c/osmo-ci/+/12141/
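For reference, a jenkins-job-builder build-discarder stanza looks roughly
like this (the values here are placeholders, not necessarily what the patch
uses):

  - job-template:
      name: '{job-name}'
      properties:
        - build-discarder:
            days-to-keep: 30     # placeholder: keep about a month
            num-to-keep: 120     # placeholder cap on kept builds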
Oliver
On Wed, Dec 05, 2018 at 04:22:07PM +0100, Neels Hofmeyr wrote:
> We recently hit a bunch of jenkins failures, due to a full disk.
thanks for investigating.
> Just now I removed 172G worth of docker images from build2-deb9build-ansible;
> I thought we had the docker cleanup automated by now?
I was also under that impression.
> tmp/ has many folders like '196M tmp.u3y02wgBNI', which are all from March
> to May this year. I will delete them now.
I think that's the ttcn3 test runs, which (from our shell script?) generate such a tmp directory for pcap and log files?
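If so, that matches mktemp's default naming, and leftovers could be reaped
periodically; a sketch (the 7-day cutoff is my arbitrary pick):

  # mktemp -d yields exactly this naming scheme, e.g. /tmp/tmp.u3y02wgBNI:
  tmpdir="$(mktemp -d)"

  # reap week-old leftovers, e.g. from cron inside the lxc:
  find /tmp -maxdepth 1 -type d -name 'tmp.*' -mtime +7 -exec rm -rf {} +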
> Even after the docker cleanup commands I used from the osmocom.org servers
> wiki page:
>
>   docker rm $(docker ps -a -q)
>   docker rmi $(docker images -q -f dangling=true)
>
> there are still 321 docker images around, most of which are months old. Not
> sure why the above cleanups don't catch those.
because something still references them.
> I'm just going to indiscriminately blow all of them away now.
would have been interesting to investigate where those references come from. Maybe something is starting containers that keep persistent state ('docker run' without '--rm'), which will introduce reference counts to the underlying images?
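Something along these lines would show what is pinning them (plain docker
CLI, a sketch):

  # stopped containers still hold a reference to their image:
  docker ps -a --format '{{.ID}} {{.Image}} {{.Status}}'
  # and tagged but unused images are not "dangling", so the wiki's
  # dangling=true filter never matches them:
  docker images --format '{{.ID}} {{.Repository}}:{{.Tag}} {{.CreatedSince}}'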
> Maybe a good cleanup strategy would be to automatically wipe out the entire
> build slave lxc every week or so and re-create it from scratch?
an option, but not sure what kind of fall-out that might create due to build slave unavailability during re-creation time, ...?
> On host-2, though, there are a lot of services running.
sure, it's our main *.osmocom.org server. Never intended as a build host, just created build slaves there as long as we still have capacity.
> ----- admin-2
>
> It seems I cannot log in there, or at least I don't know the IP address...
Not even an osmocom.org machine. Again just using some spare capacity there.
General note:
"large" root severs are unfortunately rather expensive to rent (even at Hetzner), so I've been hesitant to spend even more than the several hundred EUR that we spend for hosting per month already.
If somebody has spare capacity (mainly CPU + RAM) and would want to make that available to Osmocom, it would be much appreciated.
Another strategy might be to simply buy machines and run them e.g. from the sysmocom office. The bandwidth requirements are low, so running behind a cable modem is feasible.
Regards,
  Harald
On Thu, Dec 06, 2018 at 07:29:37AM +0100, Harald Welte wrote:
> > On host-2, though, there are a lot of services running.
>
> sure, it's our main *.osmocom.org server. Never intended as a build host,
> just created build slaves there as long as we still have capacity.
> > ----- admin-2
> >
> > It seems I cannot log in there, or at least I don't know the IP address...
>
> Not even an osmocom.org machine. Again just using some spare capacity there.
Would be good then to take a look and make sure that the spare capacity isn't hogging the disk space away from the real services.
I used to be able to log in via admin2.osmocom.org. jenkins reaches the build slave via an IPv6 address; haven't figured out yet what the admin2 root server login would be.
~N