Hi Oliver, Neels, community,
I had some comments on the D-GSM work that didn't really fit directly into the gerrit code review, so I thought I'd post them here.
== DNS zone / .msisdn suffix ==
One question I had was regarding the use of the .{msisdn,imsi} TLD. I would argue it is probably better to use something that fits within the existing DNS hierarchy without contesting IANA's authority on gTLDs.
Historically, ETSI/3GPP made the mistake of using ".gprs" for resolving APNs on the GRX. This was later changed to something under 3gppnetwork.org or the like, where that domain is actually registered by 3GPP with normal domain registrars, but without any publicly accessible zone records. This way the name is reserved in the public hierarchy and there is no risk of clashes.
I'm not sure how much of a concern this is to us, given that our use case is much more niche than the GRX. However, the "cost" is probably rather small to change this to something like .{msisdn,imsi}.dgsm.osmocom.org ? Sure, the packets will get larger by a few bytes, but given all the other overhead I think it's not really going to have any impact?
What are your thoughts on that?
== MSISDN format ==
Another thought is whether or not there are any concerns regarding the MSISDN format. Historically, this is one of the weaknesses of OsmoMSC, inherited from the OpenBSC days where we just thought in terms of PBX extension numbers. In reality, an MSISDN consists of a TON (type of number), an NPI (numbering plan indicator) and the related digits. IIRC, the TON can also specify whether the number is national or international, i.e. whether it's prefixed with the country code or not.
It would be great to make sure that the format used in the mDNS queries is somewhat standardized, even if only by the documentation requiring that all queries be done in fully qualified form with the country code present. The NPI is sort-of bogus, as IMHO E.164 is the only one applicable to MSISDNs.
Any thoughts?
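To make the "fully qualified form" idea concrete, here is a minimal sketch of normalizing a number to international form before it goes into a query. The function name and the `default_cc` parameter are hypothetical, not part of osmo-hlr or mslookup; the TON values (0 = unknown, 1 = international, 2 = national) are as I recall them from the 3GPP TS 24.008 number encoding and should be checked against the spec.

```python
# Hypothetical helper, not an existing Osmocom API.
# TON values assumed from 3GPP TS 24.008 (verify against the spec).
TON_UNKNOWN = 0
TON_INTERNATIONAL = 1
TON_NATIONAL = 2


def to_international(digits: str, ton: int, default_cc: str) -> str:
    """Return the digits in international form (country code first, no '+')."""
    if not digits.isdigit():
        raise ValueError("MSISDN must consist of digits only")
    if ton == TON_INTERNATIONAL:
        # Already carries the country code.
        return digits
    if ton == TON_NATIONAL:
        # Prefix the country code of the local network (an assumption:
        # the national number is given without a trunk prefix).
        return default_cc + digits
    # With an unknown TON we cannot tell whether a country code is present.
    raise ValueError("cannot qualify a number with unknown TON")
```

For example, `to_international("30123456", TON_NATIONAL, "49")` yields `"4930123456"`, while a number with an unknown TON is rejected rather than guessed at.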
== The use of 'age' vs. absolute timestamp ==
In my original D-GSM ideas I always thought we'd send an absolute UTC timestamp of when a given HLR/MSC last saw that subscriber. The idea here being that any rural GSM network will have some kind of GNSS receiver for clock stability in the BTS anyway, and one can hence assume that timestamps are synchronized.
The advantage of a relative 'age' is obvious: You don't care about the absolute clock value being correct anymore. The potential downside is that propagation delay might matter. If you have a rather slow / loaded geostationary satellite link from one village, but a faster terrestrial link from another village, the 'age' will be ambiguous while an absolute timestamp wouldn't have that.
Given that the delays we're talking about are probably all sub-second, or possibly about 1 s, it's probably not a problem.
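The ambiguity can be illustrated with a small model. This is only a sketch under simplifying assumptions: each site computes the age at the moment the query reaches it, clocks are perfect, and all names and numbers are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    last_seen: float      # absolute time (s) the subscriber was last seen there
    request_delay: float  # one-way delay (s) until the query reaches the site


def reported_age(site: Site, query_sent_at: float) -> float:
    """Age as computed by the site when the (delayed) query arrives."""
    return (query_sent_at + site.request_delay) - site.last_seen


def pick_by_age(sites, query_sent_at: float) -> str:
    """Requester picks the site reporting the smallest age."""
    return min(sites, key=lambda s: reported_age(s, query_sent_at)).name


def pick_by_timestamp(sites) -> str:
    """Requester picks the site with the newest absolute timestamp."""
    return max(sites, key=lambda s: s.last_seen).name


sites = [
    Site("terrestrial", last_seen=100.0, request_delay=0.05),
    Site("satellite", last_seen=100.2, request_delay=0.30),
]
```

With a query sent at t = 110 s, the terrestrial site reports an age of about 10.05 s and the satellite site about 10.10 s, so `pick_by_age` selects the terrestrial site even though the satellite site saw the subscriber 0.2 s later. The error is bounded by the difference in one-way delays, which is why it only matters when two sightings are that close together.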
== GSUP keepalives / connection loss detection ==
In the presence of unreliable back-haul mesh between villages, the GSUP connection can also not be seen as reliable. We would expect to see TCP stalls due to packet loss, etc.
Have you considered this in your implementation and/or done any testing based on simulated lossy networks to ensure we properly use either TCP keepalives or IPA application-level PING/PONG to detect lost connections and recover from such situations (by closing the old and re-establishing)?
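As a rough illustration of the TCP-level option, the following sketch enables keepalive probing on a socket. It uses the standard Linux socket options as exposed by Python's `socket` module; the timing values are arbitrary assumptions, and this is not how osmo-hlr or libosmocore configure their sockets.

```python
import socket


def enable_keepalive(sock, idle_s=10, interval_s=5, probes=3):
    """Ask the kernel to detect a dead peer after roughly
    idle_s + interval_s * probes seconds of silence."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The per-socket tuning knobs are platform-specific, so only set
    # the ones this platform actually provides.
    for name, value in (
        ("TCP_KEEPIDLE", idle_s),       # idle time before the first probe
        ("TCP_KEEPINTVL", interval_s),  # interval between probes
        ("TCP_KEEPCNT", probes),        # unanswered probes before giving up
    ):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
    # Keepalive probes are only sent on an idle connection. On Linux,
    # TCP_USER_TIMEOUT (milliseconds) additionally bounds how long sent
    # data may remain unacknowledged before the connection is dropped.
    user_timeout = getattr(socket, "TCP_USER_TIMEOUT", None)
    if user_timeout is not None:
        timeout_ms = (idle_s + interval_s * probes) * 1000
        sock.setsockopt(socket.IPPROTO_TCP, user_timeout, timeout_ms)
```

An application-level IPA PING/PONG would serve the same purpose one layer up, with the advantage that it also detects a peer whose TCP stack is alive but whose application is stuck.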
Unreliable networks can be easily simulated by Linux built-in 'tc netem' for providing configurable packet loss / latency / jitter.
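For a test setup, the netem invocation could be wrapped in a small helper like the one below. The helper name and the example values are hypothetical; the `tc qdisc add dev ... root netem` syntax with `delay` and `loss` parameters is the standard one from the tc-netem man page.

```python
def netem_command(dev, delay_ms, jitter_ms, loss_pct):
    """Build the argv for a 'tc netem' qdisc with delay, jitter and loss.

    The command is only constructed here, not executed; applying it
    requires root privileges.
    """
    return [
        "tc", "qdisc", "add", "dev", dev, "root", "netem",
        "delay", f"{delay_ms}ms", f"{jitter_ms}ms",
        "loss", f"{loss_pct}%",
    ]


# Example: a satellite-like link with 300 ms +/- 50 ms delay and 10% loss.
cmd = netem_command("eth0", 300, 50, 10)
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` as root, and the qdisc removed again with `tc qdisc del dev eth0 root`.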
I also saw some comments / code related to "if a second connection using the same IPA ID arrives, we're screwed" (paraphrasing here). I would expect this not to be uncommon even if every MSC/HLR out there is configured correctly, exactly because e.g. the remote MSC/HLR has already decided that the TCP/GSUP connection is dead and starts to reconnect by performing a local-end release, while the "local" MSC/HLR still thinks the old connection is alive. If the old connection "wins" (i.e. is preferred) I see potential trouble here.
Situations like that probably warrant some carefully designed tests to create exactly those situations.
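One possible policy for the duplicate-ID case is "newest connection wins": when a peer with an already-known IPA ID connects again, the stale connection is closed and replaced. The sketch below only illustrates that policy; the class names are invented and it does not describe the current osmo-hlr behavior.

```python
class FakeConn:
    """Stand-in for a GSUP/IPA connection, for illustration only."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


class PeerRegistry:
    """Maps an IPA ID to the single connection currently serving it."""

    def __init__(self):
        self._by_id = {}

    def register(self, ipa_id, conn):
        """Register conn for ipa_id; close and return any stale connection."""
        old = self._by_id.get(ipa_id)
        if old is not None and old is not conn:
            # The peer reconnected before we noticed the old link died.
            old.close()
        self._by_id[ipa_id] = conn
        return old

    def lookup(self, ipa_id):
        return self._by_id.get(ipa_id)
```

A test for the scenario above would register two connections under the same ID and check that only the second one remains open. Whether newest-wins is actually safe (e.g. against a misconfigured second peer reusing an ID) is a separate question.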
Regards, Harald
On 03/12/2019 17:20, Harald Welte wrote:
Hi Oliver, Neels, community,
I had some comments on the D-GSM work that didn't really fit directly into the gerrit code review, so I thought I'd post them here.
Hi as well.
I add some short comments inline (disclaimer, I'm not fully up to speed with the code)
== DNS zone / .msisdn suffix ==
One question I had was regarding the use of the .{msisdn,imsi} TLD. I would argue it is probably better to use something that fits within the existing DNS hierarchy without contesting IANA's authority on gTLDs.
I'd vote for, if at all possible, not making any link to/dependency on DNS hierarchy.
== The use of 'age' vs. absolute timestamp ==
In my original D-GSM ideas I always thought we'd send an absolute UTC timestamp of when a given HLR/MSC last saw that subscriber. The idea here being that any rural GSM network will have some kind of GNSS receiver for clock stability in the BTS anyway, and one can hence assume that timestamps are synchronized.
I'd vote for a configurable option or an (abs/rel) flag in the mslookup request. I would not want this implementation to stall a 1st release though.
The advantage of a relative 'age' is obvious: You don't care about the absolute clock value being correct anymore. The potential downside is that propagation delay might matter. If you have a rather slow / loaded geostationary satellite link from one village, but a faster terrestrial link from another village, the 'age' will be ambiguous while an absolute timestamp wouldn't have that.
It's true, and it's also possible that we might have geographically close villages (making cases of fast LAC switching probable) at the same time as having a satellite link in one or both of these villages.
My personal preference is that the community GSM operator should also manage (in as much as possible) the terrestrial links and where possible, ensure existence of those between geographically close communities, but reality is.. confounding on many levels.
Given that the delays we're talking about are probably all sub-second, or possibly about 1 s, it's probably not a problem.
Agreed. I would add that it's my intention that, whenever there is this kind of doubt about the actual location of an MS when an incoming call needs to be delivered, we would simply bridge the call to both locations (at the SIP side) anyway, resulting in paging in both villages, and the first to pick up wins.
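The "bridge to both" idea could be expressed as selecting every site whose reported age lies within some margin of the best one. This is a sketch only: the function name is hypothetical and the 1 s default margin is an assumption based on the delay estimate discussed in this thread.

```python
def sites_to_page(ages, margin_s=1.0):
    """Return all sites whose reported age is within margin_s of the
    smallest age, i.e. all plausible current locations.

    ages: dict mapping site name -> reported age in seconds.
    """
    if not ages:
        return []
    best = min(ages.values())
    return sorted(site for site, age in ages.items() if age - best <= margin_s)
```

For instance, with ages of 10.05 s and 10.10 s from two neighbouring villages and 300 s from a third, only the first two would be paged.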
== GSUP keepalives / connection loss detection ==
In the presence of unreliable back-haul mesh between villages, the GSUP connection can also not be seen as reliable. We would expect to see TCP stalls due to packet loss, etc.
We don't envisage a separation between MSC and HLR over unreliable back-haul, but I think I'm missing something here. (I still need to actually implement a local setup and observe)
Hi Keith,
On Wed, Dec 04, 2019 at 02:36:43PM +0100, Keith wrote:
== DNS zone / .msisdn suffix ==
One question I had was regarding the use of the .{msisdn,imsi} TLD. I would argue it is probably better to use something that fits within the existing DNS hierarchy without contesting IANA's authority on gTLDs.
I'd vote for, if at all possible, not making any link to/dependency on DNS hierarchy.
I'm not suggesting a dependency. You can always operate whatever DNS or mDNS on whatever domain names in a network under your control. I just think it might be smart to try to avoid using a global namespace that more "authoritative" users might use for something else in the future, who knows.
The D-GSM mDNS will work irrespective of the public DNS system as we know it.
== GSUP keepalives / connection loss detection ==
In the presence of unreliable back-haul mesh between villages, the GSUP connection can also not be seen as reliable. We would expect to see TCP stalls due to packet loss, etc.
We don't envisage a separation between MSC and HLR over unreliable back-haul, but I think I'm missing something here. (I still need to actually implement a local setup and observe)
In an inbound roaming situation, you have the MSC (VPLMN) in one village and the "authoritative" HLR for that subscriber (HPLMN) in another village.
On 04/12/2019 20:46, Harald Welte wrote:
We don't envisage a separation between MSC and HLR over unreliable back-haul, but I think I'm missing something here. (I still need to actually implement a local setup and observe)
In an inbound roaming situation, you have the MSC (VPLMN) in one village and the "authoritative" HLR for that subscriber (HPLMN) in another village.
Ah, indeed. I do need to set this up and play with it. In my head, I was imagining that an IMSI attach request triggers an mslookup (broadcast) and that the response message contains what the VLR (MSC) needs. I wasn't imagining TCP connections relating to core GSM over the unreliable IP network.
I'm wary of TCP. I'm no expert in all the things that can be tuned, so maybe there's a solution I'm unaware of?
If not, I don't like TCP over tinc-vpn at all where the underlying IP connection is unreliable. Things tend to stall for a long time. This is precisely the main problem I was seeing with multi-master distributed databases. They work fine in theory in a data centre, but add severe packet loss and it all falls down.
I'm quite possibly still missing something?
Thanks!
k.
On Tue, Dec 03, 2019 at 05:20:33PM +0100, Harald Welte wrote:
== DNS zone / .msisdn suffix ==
.{msisdn,imsi}.dgsm.osmocom.org ? Sure, the packets will get larger by
sure, we can do that. An idea is to do that merely on the DNS encoding, and strip it off to have only *.imsi in the mslookup client? (If we add other methods, we might not use the domain kind of representation at all there)
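A minimal sketch of that idea: the suffix is appended only when encoding the DNS name and stripped again on the receiving side, so the mslookup client keeps working with the short form. The function names are hypothetical, the default suffix is simply the example from this thread, and the query string in the usage note is illustrative.

```python
# Assumed default, taken from the example suggested in this thread.
DEFAULT_SUFFIX = "dgsm.osmocom.org"


def to_dns_name(query: str, suffix: str = DEFAULT_SUFFIX) -> str:
    """Append the configured suffix for the on-the-wire DNS encoding."""
    return f"{query}.{suffix}" if suffix else query


def from_dns_name(name: str, suffix: str = DEFAULT_SUFFIX) -> str:
    """Strip the configured suffix again; names without it pass through."""
    tail = "." + suffix
    if suffix and name.endswith(tail):
        return name[: -len(tail)]
    return name
```

With these, `to_dns_name("sip.voice.123.msisdn")` gives `"sip.voice.123.msisdn.dgsm.osmocom.org"`, and `from_dns_name` maps it back to the short form.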
== MSISDN format ==
Another thought is whether or not there are any concerns regarding the MSISDN format. Historically, this is one of the weaknesses of OsmoMSC, inherited from the OpenBSC days where we just thought in terms of PBX extension numbers. In reality, an MSISDN consists of a TON (type of number), an NPI (numbering plan indicator) and the related digits. IIRC, the TON can also specify whether the number is national or international, i.e. whether it's prefixed with the country code or not.
It would be great to make sure that the format used in the mDNS queries is somewhat standardized, even if only by the documentation requiring that all queries be done in fully qualified form with the country code present. The NPI is sort-of bogus, as IMHO E.164 is the only one applicable to MSISDNs.
Any thoughts?
So far it just works (TM) ... we reflect the MSISDNs saved in the HLR DB 1:1, not sure what a TON representation might change about that.
It always passes through the MSC and SIP first. If it's any consolation, the PBX gets the TON in the SIP INVITE (if it does), and it could choose to treat numbers in any fashion. For mslookup to work though, the MSISDN must reflect whatever we find in the HLR database. If that is unable to reflect a TON (as is currently the case), then handling non-naive MSISDNs would have to happen in the SIP dialplan anyway. As soon as a given string becomes parseable by an mslookup server, i.e. say we implement some +123 notation in osmo-hlr, it would suddenly start to work out. mslookup itself doesn't care much about the MSISDN, but I think we do call osmo_msisdn_str_valid() on it. That could change easily, the point being that it really is an arbitrary string (without dots) that gets sent as MSISDN.
== The use of 'age' vs. absolute timestamp ==
Given that the delays we're talking about are probably all sub-second, or possibly about 1 s, it's probably not a problem.
I went through the same thoughts. When I do a first attach to a site, I find it expected that a caller might not reach me for five more seconds. If it is even that much, ever.
So I favored the elegance of 'age' vs absolute timestamp, because an entire timezone/clockdrift/faulty GPS receiver family of problems simply vanishes completely.
== GSUP keepalives / connection loss detection ==
In the presence of unreliable back-haul mesh between villages, the GSUP connection can also not be seen as reliable. We would expect to see TCP stalls due to packet loss, etc.
Have you considered this in your implementation and/or done any testing based on simulated lossy networks to ensure we properly use either TCP keepalives or IPA application-level PING/PONG to detect lost connections and recover from such situations (by closing the old and re-establishing)?
Unreliable networks can be easily simulated by Linux built-in 'tc netem' for providing configurable packet loss / latency / jitter.
I also saw some comments / code related to "if a second connection using the same IPA ID arrives, we're screwed" (paraphrasing here). I would expect this not to be uncommon even if every MSC/HLR out there is configured correctly, exactly because e.g. the remote MSC/HLR has already decided that the TCP/GSUP connection is dead and starts to reconnect by performing a local-end release, while the "local" MSC/HLR still thinks the old connection is alive. If the old connection "wins" (i.e. is preferred) I see potential trouble here.
Situations like that probably warrant some carefully designed tests to create exactly those situations.
We haven't tested this at all. Should become an issue on redmine.
~N
Hi Neels,
On Wed, Dec 04, 2019 at 02:52:06PM +0100, Neels Hofmeyr wrote:
On Tue, Dec 03, 2019 at 05:20:33PM +0100, Harald Welte wrote:
== DNS zone / .msisdn suffix ==
.{msisdn,imsi}.dgsm.osmocom.org ? Sure, the packets will get larger by
sure, we can do that. An idea is to do that merely on the DNS encoding, and strip it off to have only *.imsi in the mslookup client? (If we add other methods, we might not use the domain kind of representation at all there)
good idea. It could also be a global config setting somewhere of whatever suffix is added after the .imsi/.msisdn, defaulting to the osmocom example.
We haven't tested this at all. Should become an issue on redmine.
Will you create them accordingly? As indicated, I think it's [at least] two different levels: a) ensuring that keepalive on either TCP or IPA is enabled and works, and b) creating situations where the same peer establishes a second new connection while the old one is still not torn down (timeout not expired yet, FIN packets lost, ...)
Hey all,
On 12/4/19 8:50 PM, Harald Welte wrote:
On Wed, Dec 04, 2019 at 02:52:06PM +0100, Neels Hofmeyr wrote:
On Tue, Dec 03, 2019 at 05:20:33PM +0100, Harald Welte wrote:
== DNS zone / .msisdn suffix ==
.{msisdn,imsi}.dgsm.osmocom.org ? Sure, the packets will get larger by
sure, we can do that. An idea is to do that merely on the DNS encoding, and strip it off to have only *.imsi in the mslookup client? (If we add other methods, we might not use the domain kind of representation at all there)
good idea. It could also be a global config setting somewhere of whatever suffix is added after the .imsi/.msisdn, defaulting to the osmocom example.
this is now implemented, see: https://osmocom.org/issues/4309
Regards, Oliver