nextepc based LTE network @ CCCamp2019

historical

Hi all,

The Chaos Communication Camp 2019 (an international hacker camp with about
5500 participants) took place last week.  At this camp there is a tradition
of operating Osmocom based 2G and, more recently, also 3G networks.

This time I operated a nextepc based 4G/LTE network next to the camp 2G/3G
networks.  In order to share one subscriber database, I have implemented
osmo_dia2gsup, which translates S6a/S6d Diameter into the Osmocom GSUP
protocol, so nextepc can be used with osmo-hlr instead of nextepc-hssd.

The network consisted of six Ericsson RBS6402 operating in Band 7 (2600 MHz).

Some more details can be found at

Regarding the nextepc side:

* 2439 unique IMSIs were seen
** 147 unique IMSIs of CCC SIM cards (26242)
** 2292 non-CCC IMSIs
** 75 unique MCC-MNC tuples
** 34 unique MCCs
** The usual suspects (Europe + North America), but also...
*** Malaysia, Indonesia, Australia, New Zealand, South Africa
* 560 Attach accept (CCC SIM cards)
* 46590 Attach reject (commercial operator SIM cards)
* 629 PDN context (APN) activations
* 235 handovers between cells (X2)
* 64 crashes + restarts of nextepc-mme
* 9 crashes + restarts of nextepc-pgw
* 0 crashes + restarts of nextepc-sgw
* 10 crashes + restarts of nextepc-pcrf

In general, it worked quite nicely, and I have to congratulate Sukchan on
his work on nextepc.

I investigated some of the crashes, reported them to the issue tracker and
attempted to fix some of them on-site.  The actual codebase that was running
can be found at https://github.com/laf0rge/nextepc/commits/laforge/cccamp19

From my experience with operating such a "large" nextepc network for the first
time, I have the following overall feedback, which basically boils down
to three major areas:

== the use of assert() ==

ASSERT should never be triggered by anything that is received from another
network entity.  So if an eNB sends an unknown S1AP-ID, or if an SGW sends
an unknown TEID, or if the NAS MAC validation fails, or an EMM message
cannot be decoded - all of those must be handled gracefully without
terminating the program.  This 'fail fast' way of programming can be
done when writing code in C++ (exceptions that are caught) or in Erlang
(one process per message, crashing that one doesn't bring the entire MME
down).

I've tried my best to review all ogs_assert() in the MME and came up
with the following patch:
https://github.com/laf0rge/nextepc/commit/3b528af8fd51c85769123338eb57a4635c9d699e
which requires
https://github.com/laf0rge/ogslib/commit/dc36ccbb080038306666931bdc97f6204fd5c011
which introduces ogs_expect() and ogs_expect_or_return() macros that can
be used in many places instead of ogs_assert().
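
To illustrate the idea, here is a minimal sketch of what such macros could
look like.  This is not the actual ogslib code from the commit above: the
names my_expect() and my_expect_or_return_val(), the expect_failures counter
and the toy UE lookup are all assumptions made up for this example.

```c
#include <stdio.h>

/* Hypothetical sketch, NOT the real ogslib macros. */
static unsigned long expect_failures; /* counts failed expectations */

/* Log a failed expectation and continue, where assert() would abort(). */
#define my_expect(expr) \
    do { \
        if (!(expr)) { \
            expect_failures++; \
            fprintf(stderr, "%s:%d: expectation `%s' failed\n", \
                    __FILE__, __LINE__, #expr); \
        } \
    } while (0)

/* Same, but additionally leave the current function with 'val'. */
#define my_expect_or_return_val(expr, val) \
    do { \
        if (!(expr)) { \
            expect_failures++; \
            fprintf(stderr, "%s:%d: expectation `%s' failed\n", \
                    __FILE__, __LINE__, #expr); \
            return (val); \
        } \
    } while (0)

/* Toy stand-in for the MME's UE context store (hypothetical). */
struct ue_ctx {
    unsigned enb_ue_s1ap_id;
};

static struct ue_ctx known_ues[] = { { 1 }, { 2 }, { 3 } };

static struct ue_ctx *ue_find_by_s1ap_id(unsigned id)
{
    size_t i;

    for (i = 0; i < sizeof(known_ues) / sizeof(known_ues[0]); i++) {
        if (known_ues[i].enb_ue_s1ap_id == id)
            return &known_ues[i];
    }
    return NULL;
}

/* Handle a message from an eNB: an unknown ID drops the message
 * and returns an error, but the process keeps running. */
static int handle_s1ap_msg(unsigned enb_ue_s1ap_id)
{
    struct ue_ctx *ue = ue_find_by_s1ap_id(enb_ue_s1ap_id);

    my_expect_or_return_val(ue != NULL, -1);
    my_expect(ue->enb_ue_s1ap_id == enb_ue_s1ap_id);
    /* ... normal processing of the message would go here ... */
    return 0;
}
```

With this, handle_s1ap_msg(99) logs the failed expectation and returns -1 to
its caller, which can then discard the message or send an error indication.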

It would also be possible to use this kind of 'fail fast' approach in C
programs, but then one would have to use longjmp() from the 'assert',
and you would have to use some kind of hierarchical memory allocator so
that in the 'exception handler' you can release any dynamic allocations
that were made before.
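
A rough sketch of that longjmp() approach follows, under stated assumptions:
struct msg_ctx, ctx_alloc(), CTX_CHECK() and the toy decoder are hypothetical
names invented here, and the "hierarchical allocator" is reduced to a single
level (one context per message that owns all allocations).  A real
implementation would more likely use something like talloc.

```c
#include <setjmp.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Header in front of every allocation; the union keeps the payload aligned. */
union chunk {
    union chunk *next;
    max_align_t align;
};

/* Per-message context (hypothetical): owns all allocations for one message. */
struct msg_ctx {
    union chunk *chunks;
    jmp_buf on_error;
};

static size_t live_chunks; /* allocations not yet released, for leak checks */

/* Release everything that was allocated from this context. */
static void ctx_free_all(struct msg_ctx *ctx)
{
    while (ctx->chunks) {
        union chunk *c = ctx->chunks;
        ctx->chunks = c->next;
        free(c);
        live_chunks--;
    }
}

/* The 'assert' replacement: unwind to the handler instead of abort(). */
#define CTX_CHECK(ctx, expr) \
    do { if (!(expr)) longjmp((ctx)->on_error, 1); } while (0)

static void *ctx_alloc(struct msg_ctx *ctx, size_t len)
{
    union chunk *c = malloc(sizeof(*c) + len);

    CTX_CHECK(ctx, c != NULL);
    c->next = ctx->chunks;
    ctx->chunks = c;
    live_chunks++;
    return c + 1;
}

/* Toy decoder: 0x07 is just an arbitrary marker value for this sketch. */
static void decode_msg(struct msg_ctx *ctx, const unsigned char *buf, size_t len)
{
    unsigned char *copy = ctx_alloc(ctx, len + 1);

    memcpy(copy, buf, len);
    CTX_CHECK(ctx, len >= 2);        /* message too short */
    CTX_CHECK(ctx, copy[0] == 0x07); /* unexpected first octet */
}

/* Process one message; returns 0 on success, -1 if it was rejected. */
static int handle_msg(const unsigned char *buf, size_t len)
{
    /* Heap-allocated, so its contents stay well-defined across longjmp(). */
    struct msg_ctx *ctx = calloc(1, sizeof(*ctx));
    int rc;

    if (!ctx)
        return -1;
    if (setjmp(ctx->on_error) == 0) {
        decode_msg(ctx, buf, len);
        rc = 0;
    } else {
        rc = -1; /* the 'exception handler' */
    }
    ctx_free_all(ctx); /* both paths release all dynamic allocations */
    free(ctx);
    return rc;
}
```

The context is heap-allocated rather than a local variable because automatic
variables modified between setjmp() and longjmp() have indeterminate values
afterwards unless they are declared volatile.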

== the lack of introspection ==

When you operate a network, it is vital to have some visibility.  For the
MME you want to inspect how many subscribers are currently attached, where
they are attached (TAC), whether they currently have an UE Context (and at
which eNB), which TMSI/GUTI was allocated, etc.

Likewise, for both SGW and PGW you want to see which PDN contexts exist, from
which peer IP addresses, which APN was used, what IP addresses have been
allocated, etc.

In the Osmocom world, we implement this introspection in two ways:
* by means of the VTY interface (for the human user)
* by means of the CTRL interface (for other programs)

If I hadn't been busy with debugging various other issues, I would have actually
attempted to add a basic VTY interface to nextepc-mmed.

For sure there may be better ways to expose this state (ideally with the same
piece of code providing access to both human users as well as external programs),
but I'm not aware of any nice C language implementation in FOSS that one could
use right away.
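
As a very rough illustration of the kind of per-subscriber state meant here,
the following sketch formats one line per subscriber that is readable by
humans and easy to parse for programs.  It does not use the Osmocom VTY or
CTRL APIs and is not nextepc code; struct mme_ue_info, mme_ue_dump() and the
key=value format are assumptions made up for this example.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical snapshot of what the MME knows about one subscriber. */
struct mme_ue_info {
    char imsi[16];   /* up to 15 digits plus NUL */
    uint16_t tac;    /* tracking area where the UE is attached */
    uint32_t m_tmsi; /* M-TMSI part of the allocated GUTI */
    bool has_ue_ctx; /* is there currently a UE context at an eNB? */
    uint32_t enb_id; /* only valid if has_ue_ctx is true */
};

/* Format one subscriber as a single key=value line.
 * Returns what snprintf() returns (length needed, or negative on error). */
static int mme_ue_dump(char *out, size_t len, const struct mme_ue_info *ue)
{
    if (ue->has_ue_ctx)
        return snprintf(out, len, "imsi=%s tac=%u m_tmsi=0x%08x enb=%u",
                        ue->imsi, (unsigned)ue->tac,
                        (unsigned)ue->m_tmsi, (unsigned)ue->enb_id);

    return snprintf(out, len, "imsi=%s tac=%u m_tmsi=0x%08x enb=none",
                    ue->imsi, (unsigned)ue->tac, (unsigned)ue->m_tmsi);
}
```

A command handler (for humans) and a machine interface (for programs) could
then both iterate over the subscriber list and call the same function.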

== logging without context ==

When looking at log file output, it is very important that the output
always carries sufficient context.  If there are many subscribers acting in
parallel, you need to know which subscriber / pdn context / ... a given log
message relates to, otherwise the log message is rather useless.

For example, if you get
	[mme] DEBUG: [MME] Authentication-Information-Answer (mme-fd-path.c:211)
then even at DEBUG level you have no indication whatsoever for which
particular subscriber this AIA was received.  I would normally expect
that the UE is resolved from the DIAMETER session-id, and then the UE's
identity (IMSI) can be printed.
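
One way to get there is a logging helper that always takes the subscriber as
an argument and prefixes every line with its identity.  The sketch below
assumes the UE has already been looked up; struct ue, ue_log_fmt(), the
IMSI=... prefix and the example IMSI are hypothetical and not part of the
nextepc or ogslib logging API.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical, reduced UE structure: only the identity used for logging. */
struct ue {
    char imsi[16]; /* up to 15 digits plus NUL */
};

/* Format a single log line that is always prefixed with the subscriber
 * identity, so that 'grep IMSI=...' finds everything about one UE.
 * Returns the total length needed, like snprintf(), or negative on error. */
static int ue_log_fmt(char *out, size_t out_len, const struct ue *ue,
                      const char *fmt, ...)
{
    va_list ap;
    int n, m;

    n = snprintf(out, out_len, "[mme] DEBUG: IMSI=%s ",
                 ue ? ue->imsi : "unknown");
    if (n < 0 || (size_t)n >= out_len)
        return n; /* error, or the prefix alone filled the buffer */

    va_start(ap, fmt);
    m = vsnprintf(out + n, out_len - (size_t)n, fmt, ap);
    va_end(ap);

    return m < 0 ? m : n + m;
}
```

For the example above, calling
ue_log_fmt(buf, sizeof(buf), ue, "Authentication-Information-Answer")
would yield a line that names the subscriber the AIA belongs to.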

I also find it suboptimal that log messages often span multiple lines, which
means you cannot simply 'grep' for something, as you always need to check some
lines before and/or after.  But I guess contrary to the lack of context, this
is a matter of taste and one can have different opinions about it.

I'll try to contribute as much as I can regarding bug fixes and
enhancements.  Thanks again for all the great work so far!

Regards,
	Harald
-- 
- Harald Welte <laforge at gnumonks.org>           http://laforge.gnumonks.org/
============================================================================
"Privacy in residential applications is a desirable marketing option."
                                                  (ETSI EN 300 175-7 Ch. A6)