New subject: looking for more uild hosts / Re: jenkins build slaves: disks full

6 Dec 2018


      We recently hit a bunch of jenkins failures, due to a full disk.
Just now I removed 172G worth of docker images from build2-deb9build-ansible;
I thought we had the docker cleanup automated by now?
Even after that, build-2 still uses 244G of its root file system, which doesn't
seem right.  most of it is also in the deb9build-ansible lxc:
root@build-2 /var/lib/lxc/deb9build-ansible/rootfs # du -hs * | sort -h
[...]
2.2G	opt
5.8G	usr
8.1G	tmp   (what!
33G	home
153G	var
The tmp/ has many folders like
196M	tmp.u3y02wgBNI
which are all from March to May this year. I will delete them now.
home:
root@build-2 /var/lib/lxc/deb9build-ansible/rootfs/home/osmocom-build # du -hs *
0	bin
19G	jenkins
14G	jenkins_build_artifact_store
1.2G	osmo-ci
Interesting, I wasn't aware of us using the artifact store.
Seems to come from some manual builds between April-October.
Removing.
jenkins workspaces of 19G seems ok.
But osmo-ci of 1.2G!?
That seems to be a manual build of the coverity job -- though the date is
pretty recent, so is our coverity job actually building in
~osmocom-build/osmo-ci instead of in a workspace?
Even after the docker cleanup commands I used from the osmocom.org servers wiki page:
docker rm $(docker ps -a -q)
docker rmi $(docker images -q -f dangling=true)
There are still 321 docker images around, most of which are months old.
Not sure why above cleanups don't catch those.
I'm just going to indiscriminately blow all of them away now.
Maybe a good cleanup strategy would be to every week or so automatically wipe
out the entire build slave lxc and re-create it from scratch?
After this, we have on build-2:
Filesystem      Size  Used Avail Use% Mounted on
/dev/md2        438G   83G  333G  20% /
------ host-2
Similar story on host-2 deb9build-ansible lxc: tons of docker images, just removed all of them.
But after that we still have
root@host2 ~ # df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/md2        438G  311G  105G  75% /
On host-2 though there are a lot of services running.
root@host2 / # du -hs * | sort -h
[...]
1.2G	usr
59G	var
75G	home
176G	external
[...]
2.7G	gerrit
3.1G	redmine-20170530-before-upgrade-to-3.4.tar
4.3G	mailman
5.7G	openmoko-wiki
7.8G	gitolite
9.9G	openmoko-people
29G	redmine
112G	jenkins
root@host2 /external/jenkins/home/jobs # du -hs * | sort -h
171M	nplab-m3ua-test
198M	master-osmo-pcu
241M	ttcn3-sip-test
251M	osmo-gsm-tester_build-osmo-bsc
262M	ttcn3-ggsn-test
287M	gerrit-osmo-ttcn3-hacks
297M	master-osmo-bsc
322M	master-libosmo-sccp
328M	osmo-gsm-tester_build-osmo-sgsn
355M	master-osmo-mgw
359M	master-libosmo-netif
365M	osmo-gsm-tester_build-osmo-iuh
390M	gerrit-asn1c
392M	gerrit-osmo-bsc
419M	ttcn3-nitb-sysinfo
445M	osmo-gsm-tester_build-osmo-msc
456M	osmo-gsm-tester_manual-build-all
461M	master-libosmocore
461M	TEST_osmocomBB_with_libosmocore_dep
482M	master-osmo-iuh
611M	master-osmo-sgsn
704M	gerrit-osmo-bts
748M	master-osmo-msc
756M	gerrit-osmo-msc
929M	master-openbsc
1.1G	master-osmo-bts
1.1G	ttcn3-hlr-test
1.2G	gerrit-libosmocore
1.2G	ttcn3-mgw-test
1.9G	osmo-gsm-tester-rnd_run
2.0G	ttcn3-sgsn-test
3.0G	ttcn3-msc-test
3.2G	osmo-gsm-tester_run
3.5G	master-asn1c
4.2G	ttcn3-bsc-test-sccplite
4.7G	osmo-gsm-tester_run-rnd
6.2G	osmo-gsm-tester_gerrit
6.3G	osmo-gsm-tester_run-prod
7.5G	osmo-gsm-tester_ttcn3
8.5G	ttcn3-bsc-test
43G	ttcn3-bts-test
It seems we are caching 211 ttcn3-bts-test builds. That seems a tad much.
Indeed https://jenkins.osmocom.org/jenkins/view/TTCN3/job/ttcn3-bts-test/configure
has "[ ] Discard old builds" (unchecked).
Looking in osmo-ci, the jobs/ttcn3-testsuites.yml has no 'build-discarder' set.
I guess we should add one? Any discard option preferences? A month? A year?
(compare master-builds.yml)
----- admin-2
It seems I can not login there, or at least I don't know the IP address...
  ssh: Could not resolve hostname admin2.osmocom.org: Name or service not known
So I guess I can't check there.
~N