Hi Neels,
Thank you for pointing me in the direction of LU Reject cause values - it appears that the problem is indeed there! Prior to the most recent change in osmo-hlr (https://gerrit.osmocom.org/c/osmo-hlr/+/16808), the LU Reject cause returned by OsmoCNI to the MS was horribly bad. Even with this patch applied, the default cause value is *still* horribly bad unless the operator finds the vty setting that needs to be changed away from its default.
> 3GPP TS 24.008 4.4.1 describes LU Reject.
Thank you once again for pointing me in the right direction - but as I studied that portion of the spec, I quickly realized that the section of most relevance here is not 4.4.1, but 4.4.4.7. Right now the default LU Reject cause returned by OsmoCNI is #2 (IMSI unknown in HLR), but the spec tells MS implementations to treat #2 the same as #3 (Illegal MS) and #6 (Illegal ME), which deal with stolen phones and the like. Thus by the spec, an MS that receives LU Reject cause #2 is supposed to assume that the SIM is intrinsically bad (bad for any PLMN, not just the one it just tried) and abstain from ANY further registration attempts on any PLMN until the phone is switched off or the SIM is removed. That is exactly the behaviour I see with my wife's Nokia C3-00.
Instead we need to return cause #11 (PLMN not allowed): this cause tells phones to stop trying to connect to *our* PLMN specifically, and to go look for their home PLMN or some other PLMN that allows them to roam, which ours does not.
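As a rough illustration of the difference, here is a small Python sketch of how an MS is expected to react to these causes. It paraphrases 3GPP TS 24.008 4.4.4.7 and is not spec text; the names `MsAction`, `LU_REJECT_CAUSES` and `describe` are made up for this sketch and do not come from any Osmocom code.

```python
# Sketch only: paraphrase of MS behaviour on Location Updating Reject,
# per 3GPP TS 24.008 section 4.4.4.7. All names here are hypothetical.
from enum import Enum


class MsAction(Enum):
    SIM_INVALID = "treat SIM as invalid until switch-off or SIM removal"
    FORBID_PLMN = "add this PLMN to the forbidden list, then select another PLMN"


# Cause number -> (cause name, expected MS reaction)
LU_REJECT_CAUSES = {
    2: ("IMSI unknown in HLR", MsAction.SIM_INVALID),
    3: ("Illegal MS", MsAction.SIM_INVALID),
    6: ("Illegal ME", MsAction.SIM_INVALID),
    11: ("PLMN not allowed", MsAction.FORBID_PLMN),
}


def describe(cause: int) -> str:
    """Return a one-line summary of what the MS should do for a cause."""
    name, action = LU_REJECT_CAUSES[cause]
    return f"#{cause} ({name}): {action.value}"


if __name__ == "__main__":
    # Compare the current default (#2) with the proposed one (#11).
    for cause in (2, 11):
        print(describe(cause))
```

The point of the table is that #2, #3 and #6 all land on the same "SIM is invalid" reaction, while #11 only blacklists the rejecting PLMN.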
I see that this topic has already been addressed (somewhat) in the code review of the patch that just got merged into osmo-hlr mere days ago, and there is this open issue too:
https://osmocom.org/issues/5865
I would argue that having this wrong default cause value in the code is a *critical* bug that actively endangers innocent operators of test GSM networks, causing them to inadvertently disrupt operation of live commercial networks in their neighbourhoods and exposing them to risk of being prosecuted for doing so. For those of us who do NOT have the luxury of living in the middle of an ocean or a desert or a Rhizomatica-style village, and who have to operate our test GSM networks in areas covered by regular commercial cell services, the highest imperative needs to be ensuring that these test networks of ours do NOT disrupt regular commercial networks or their phones in any way.
Consider the super-common scenario of someone in our community operating a test GSM network in some "inhabited" location where other people live nearby, using their phones on their respective commercial networks. If some passer-by, completely unrelated to your test network, happens to be physically right next to your test BTS, such that the signal from the test network is the strongest despite its very low transmit power, the passer-by's phone WILL try to "jump ship" from the commercial network to the test network. My practical experience absolutely confirms this unfortunate fact. What happens next depends on the LU Reject cause returned by the test network. If the test network returns cause #2 and the phone behaves per the spec, it will interpret the reject as "oops, my SIM is bad, I need to withdraw from all further activity". If the phone instead behaves like Keith noted in the CR and keeps retrying, it will keep retrying *to the test network*. Either way the end result is the same: the phone will disconnect from the legitimate operator's network, and the subscriber will become unreachable at his or her phone number. Folks, that's interference with or disruption of public telecom networks, and any one of us can get into serious trouble for such behaviour - thus I say that we need to fix our default LU Reject cause ASAP!
> Maybe some new configuration for reject causes could help, but am not sure how. Seems to be the HLR's job, so, such a feature in MSC could be a layer violation.
The cause value is set in osmo-hlr, and osmo-msc merely passes it through. To get the critical fix in place as quickly as possible, I simply patched the hard-coded value in osmo-hlr on my system. For people who run the latest code from git: if your osmo-hlr includes commit 268a33e58b9d (merged just days ago), you need to add this vty setting under the hlr node:
reject-cause not-found plmn-not-allowed
This setting changes the LU Reject cause from #2 to #11, which should (hopefully) stop the test network from interfering with commercial networks by "luring away" their subscribers.
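For reference, in a config file the setting would sit roughly like this. It is only a minimal fragment, assuming the usual one-space indentation of Osmocom config files and that you keep your settings in a file such as osmo-hlr.cfg (the file name is an assumption); everything else under the hlr node stays as you already have it:

```
hlr
 reject-cause not-found plmn-not-allowed
```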
If no one beats me to it, I will see if I can submit a Gerrit patch to osmo-hlr to change this default.
M~