osmo-msc[master]: mgcp: use osmo-mgw to switch rtp streams

historical

Patch Set 11:

(70 comments)

Actually we should not merge this patch until we did not solve the problems with 3G (rab assignment response not parsed).

For the other topic, please see my email.

https://gerrit.osmocom.org/#/c/4980/2//COMMIT_MSG
Commit Message:

Line 15: Depends: osmo-mgw Iab6a6038e7610c62f34e642cd49c93d11151252c
> please use the tag as "Depends: foo" to match general tag syntax like below
Done

https://gerrit.osmocom.org/#/c/4980/2/include/osmocom/msc/msc_ifaces.h
File include/osmocom/msc/msc_ifaces.h:

Line 44
> -1: msc_ifaces.h/c didn't really turn out as clearly as I had intended... B
Done

https://gerrit.osmocom.org/#/c/4980/2/include/osmocom/msc/msc_mgcp.h
File include/osmocom/msc/msc_mgcp.h:

Line 33: 
> ?: I wonder whether rtp_endpoint and the conn_ids should rather be part of 
The structs are similar, however I would not include this into the mgcp_client API. It is not guaranteed that every client will use exactly one connection. Integrating this into the client would require some more elaborated approach. We would have to manage the connection IDs some how, which means we would have to track open connections, add CRCX and remove connection IDs to lists on DLCX. This would be rather complex. I think letting the user tracking the connection IDs is pretty ok.

https://gerrit.osmocom.org/#/c/4980/11/include/osmocom/msc/msc_mgcp.h
File include/osmocom/msc/msc_mgcp.h:

Line 26: /* MGCP state handler context (fsm etc..) */
> explain the scope? mgcp context for one call leg? one subscriber?
Done

Line 31: 	/* RTP endpoint number */
> (omit comments that just mirror the variable name; explain what the endpoin
Done

Line 35: 	 * needed */
> use line width instead of wrapping short lines
Done

https://gerrit.osmocom.org/#/c/4980/11/src/libmsc/a_iface.c
File src/libmsc/a_iface.c:

Line 412: 	rtp_addr_in.sin_addr.s_addr = inet_addr(conn->rtp.local_addr_ran);
> needs some error checking. man inet_addr:
Done

https://gerrit.osmocom.org/#/c/4980/11/src/libmsc/gsm_04_08.c
File src/libmsc/gsm_04_08.c:

Line 1333: 	/* Initiate the teadown of the related connections on the MGW */
> "teadown"
Done

Line 1390: /* helper function for tch_bridge() to bridge the RTP Voice streams also */
> (I find "helper function for" is something you could write everywhere; I al
The "helper function for" serves basically the purpose to mark the function as a nested function. Unfortunately we do not have nested functions in C (see also pascal).

Actually we do not need to have this in a separate function anyway so I removed it.

Line 2691: 	uint32_t addr = inet_addr(trans->conn->rtp.local_addr_cn);
> inet_addr, error checking
Done

https://gerrit.osmocom.org/#/c/4980/11/src/libmsc/iucs.c
File src/libmsc/iucs.c:

Line 226: 	     " rtp=%x:%u, use_x213_nsap=%d\n", conn_id, rab_id, rtp_ip,
> I would have said, use line width, but this is just moving code, right?
I have corrected the formatting of the whole function now. This is code that had been moved from msc_ifaces.c to here.

(splitting the log strings is a coding style violation anyway, so I concatenated the strings again and corrected the the formatting)

Line 235: 		     " conn_id=%d rab_id=%d rtp=%x:%u\n",
> same
Done

https://gerrit.osmocom.org/#/c/4980/2/src/libmsc/msc_ifaces.c
File src/libmsc/msc_ifaces.c:

Line 194
> looks like callers should just call msc_mgcp_call_assignment() directly.
Done

Line 249
> looks like callers should just call msc_mgcp_call_release() directly.
Done

https://gerrit.osmocom.org/#/c/4980/2/src/libmsc/msc_mgcp.c
File src/libmsc/msc_mgcp.c:

Line 47: #define MGCP_RAN_TIMEOUT 10	/* in seconds */
> -1: prefer msc_mgcp_ instead of int_ because if you follow this scheme, we 
Done

Line 59: 	MGCP_ERR_MGW_FAIL,
> -1: typical naming would be
Done

Line 93: 	ST_CALL,
> -1: again you are using the same general struct name, just 'fsm_', like in 
Done

Line 126: 	 * the CN side */
> fsm_
Done

Line 217: {
> -1: again, you do not need to log FSM states or events, the FSM code does t
Done

Line 222: 	int rc;
> -1: best use "0x%x" so we always know it is logged in hex. Are you sure it 
osmo-mgw parses the numerical digit as hexadecimal number, so yes we should display it in hexadecimal here too.

Line 240: 	if (snprintf(mgcp_msg.endpoint, MGCP_ENDPOINT_MAXLEN, MGCP_ENDPOINT_FORMAT, mgcp_ctx->rtp_endpoint) >=
> (wondering whether this log line adds any info to the log, since the FSM al
Done

Line 248: 	/* Transmit MGCP message to MGW */
> lol, return at the end of a void function? (some more below)
Done

Line 319: 		handle_error(mgcp_ctx, MGCP_ERR_UNEXP_TEARDOWN);
> nope
Done

Line 422: 		goto error;
> nope
Done

Line 442: 	 * of the side pointing towards the BSS should be already communicated
> (code dup: could have one handle_error(...) below and goto assignment_fail 
Done

Line 460: 	struct gsm_subscriber_connection *conn;
> (then)
Done

Line 465: 	OSMO_ASSERT(mgcp_ctx);
> ?: no timeout?
I have added a 10sec timeout now. I think after 20sec we can be pretty sure that there are problems on the BSC side and the call can not be completed anyway.

Line 492: 		.conn_id = mgcp_ctx->conn_id_cn,
> ...
Done

Line 503: 	OSMO_ASSERT(msg);
> (line break or drop the " " in the middle)
Done

Line 507: 	rc = mgcp_client_tx(mgcp, msg, mgw_mdcx_cn_resp_cb, mgcp_ctx);
> (rather combine with preceding LOGPFSML for less log lines and a bit less l
Done

Line 552: 		return;
> you're copying these lines and local variables around everywhere, even in f
Done

Line 609: 	};
> (combine logs)
Done

Line 691: 		.presence = (MGCP_MSG_PRESENCE_ENDPOINT | MGCP_MSG_PRESENCE_CALL_ID),
> really an error?
Done

Line 803: 			 .in_event_mask = S(EV_TEARDOWN) | S(EV_ASSIGN),
> (we often use an S() macro, e.g. see top of vlr_lu_fsm.c and struct sub_pre
Done

Line 805: 			 .name = OSMO_STRINGIFY(ST_MDCX_RAN),
> (I like to use OSMO_STRINGIFY(ST_CRCX_RAN) because then there won't be typo
Done

Line 877: 	if (conn->via_ran == RAN_UTRAN_IU)
> "the a"
Done

Line 895: 	OSMO_ASSERT(mgcp_ctx->fsm);
> can't we log some context? Remember Holger's scenario ... it is 37c3, you h
We encode trans->transaction_id in the fsm name, so where we use LOGPFSML() we have some context, which should be sufficient. In the LOGP() case here it is not so easy because we would have to add the context info externally. From what I can see we won't get until here anyway if trans or trans->conn is missing.

Line 927: 	if (!addr || strlen(addr) <= 0) {
> I think it should be a child of the conn->conn_fsm.
I think we have a design problem here anyway. mgcp_client_tx() takes the
function pointer, including context references and indefinitely waits for an
answer of the mgw. The answer may arrive late when the FSM and the context is
long freed. There is no real way to cope with this on the MSC side. We have
to free the context info at some point. What we would need is a functionality
in mgcp_client through which we can tell that the context is freed and that
we are not interested in any late answers anymore. Did I overlook something
in the client? Do we already have some functionality like this?

Line 954: 		osmo_fsm_inst_dispatch(mgcp_ctx->fsm, EV_ASSIGN, mgcp_ctx);
> context please in all of these logs.
Done

Line 961:  * trans: transaction context
> In general, you could have a look whether any code path leading to these ch
Done

https://gerrit.osmocom.org/#/c/4980/3/src/libmsc/msc_mgcp.c
File src/libmsc/msc_mgcp.c:

Line 65: 	MGCP_ERR_ASS_TIMEOUT,
> (prefix 'msc_mgcp_': nice. Why not the common value_string[] suffix '_names
Done

Line 283: 
> -1: oops: %0x%x
Done

Line 389: 	/* Notify the FSM that we got the response. */
> well, or we added a new RAN type and forgot to implement it here. I usually
Done

Line 445: 	osmo_fsm_inst_state_chg(fi, ST_MDCX_CN, MGCP_RAN_TIMEOUT, MGCP_RAN_TIMEOUT_TIMER_NR);
> (i.e. you don't want to combine these two logs?)
Done

Line 529: 
> (i.e. you don't want to combine these two logs?)
Done

Line 803: 			 .in_event_mask = S(EV_TEARDOWN) | S(EV_ASSIGN),
> i.e. you don't want the FSM to be a child of the conn_fsm?
I think this will cause the FSM to freed to early. I already tried to free when conn is freed. This leads to a crash because the MGW response arrives after the freeing is done. We should discuss this tomorrow, I think this is a difficult problem.

Line 806: 			 .action = fsm_mdcx_ran_cb,
> (this is also a duplicate of FSM internal logging, right?)
Done

Line 858: 	if (!trans->conn) {
> -1: context
Done

https://gerrit.osmocom.org/#/c/4980/11/src/libmsc/msc_mgcp.c
File src/libmsc/msc_mgcp.c:

Line 56: /* Some internal cause codes to indicate fault
> use line width
Done

Line 70: /* Human readable respresentation of the faul codes,
> line width, also multiple more times below
Comment lines that exceed the 80 line boundary are becoming difficult to read. Also readability is decreased when only the last word wraps.

Line 102: 	 * may now forward IP/Port of the remote call leg to the MGW*/
> (space before '*/')
Done

Line 167: 	if (fi->T == MGCP_MGW_TIMEOUT_TIMER_NR) {
> (a lot of "Note:" below)
Done

Line 180: 		osmo_fsm_inst_dispatch(fi, EV_TEARDOWN, mgcp_ctx);
> could make sense to place teardown in a ST_HALT -> on_enter function? Other
I have seen the possibility of supplying an onenter function recently but too late. I do not think that it is an all too good idea to patch this at such a late development state.

Line 231: 		 "CRCX/RAN: creating connection for the RAN side on " "MGW endpoint:0x%x...\n", mgcp_ctx->rtp_endpoint);
> join "..." "..."
Done

Line 710: 	osmo_fsm_inst_state_chg(fi, ST_HALT, MGCP_MGW_TIMEOUT, MGCP_MGW_TIMEOUT_TIMER_NR);
> above, a TEARDOWN event follows after the HALT state_chg. What triggers the
The EV_TEARDOWN event is fired on error via:
handle_error()
fsm_timeout_cb()

And on a regular call end by:
msc_mgcp_call_release()

Line 747: 	if (mgcp_ctx->free_ctx) {
> the free_ctx flag essentially creates two kinds of HALT states. One where w
The flag is needed to know if some outside entity still needs to access the
context information or not. At the point of the call release we can be sure that
nobody will access the context anymore and that we may free it.

An internal error can happen anytime. We could then change to a state
ST_HALT_ERROR but then we still do not know if anyone still needs the context.
So this does not help us to get rid of the flag because this state also must
not free the context too early.

The root of the probelm (I do not see it as a problem) is the way the
surrounding logic is designed. When the call is released by gsm_04_08.c then
the call is over from the gsm_04_08.c point of view. But we have all the RTP
connections still running. We must safely tear them down and then free the
context ourseleves. For this we must know that the call war really released
from ouside (then we may free) or we just decided because of an error condition
to tear down the RTP connections. If so we may free, but only when the ouside
also acknowledged the end of the call not early because then we run into a
use-after free problem.

To make it more clear I have added an event EV_TEARDOWN_ERROR, so then we are
at least clear who decided to teardown.

Line 749: 		talloc_free(mgcp_ctx);
> mabe clear the free_ctx flag to be safe? otherwise set mgcp_ctx->fsm to NUL
The double free is prevented inherently because the FSM instance is freed including all context data. There is no way to enter fsm_halt_cb again because the FSM is gone. Also clearing the flag does not make sense when the context is freed.

Line 761: 			 .out_state_mask = S(ST_HALT) | S(ST_CALL) | S(ST_CRCX_CN),
> IIUC the next state is CRCX_CN, not ST_CALL? Howcome do all states allow tr
The reason for this is because ST_CALL supplys all the tools to teardown the connection on the MGW. Bascially what we do here is to skip to the call state and then immediately tear down the call.

Line 765: 	/* When the response to the CRCX is received, then proceed with sending
> "the RAN CRCX"
Done

Line 774: 	 * the RAT, this will either trigger an Assignment Request on the
> RAT?
RAT = Radio Access Technology

Line 783: 	 * we will enter the MDCX phaseby sending an MDCX for the CN side
> "phaseby"
Done

Line 793: 	 * the assignment is completed */
> above it sounds like CRCX_COMPL sends the Assignment Requests and MDCX_CN r
The assignment is completed in parallel to the FSM. There may be cases where the IP-Address of the RAN side is not yet valid because the assignment is still ongoing. In this case we just wait until the assignment completes. If we already have a valid IP/Port, then we implicitly know that the Assignment completed in the meantime. Then we just fall through and continue.

Line 800: 	/* When the response for the MDCX is received, send the MDCX for the
> "for the CN MDCX"
Done

Line 803: 			 .in_event_mask = S(EV_TEARDOWN) | S(EV_ASSIGN),
> I find this hard to follow: ST_MDCX_RAN receives EV_ASSIGN, but ST_ASS_COMP
Yes, ST_MDCX_RAN must wait for EV_ASSIGN because it must not execute before. Otherwise it would run on invalid IP/Port data. ST_ASS_COMPL is missnamed, I will rename it. This is entered when the MDCX for the CN side completes. If the assignment is detected non complete, than it stays there, if not it falls through. But I see, the naming is irretating. I will rename the states.

Line 808: 	/* The MDCX phase is complete when the response is received from the
> "the RAN side MDCX response"
Done

Line 809: 	 * MGW. The call is now active */
> when the call is now active, why is this not the ST_CALL state? Oh, you mea
Yes, thats correct. The call is active when the MGW responds. Then we go to ST_CALL and wait for the EV_TEARDOWN event which then terminates the call.

Line 825: 	 * the state machine. */
> IIUC there should be DLCX for both the RAN and CN sides, each returning a D
No, doing it one by one is more complicated and error prone. Ending both connections at a time with a wildcarded DLCX is RFC conform and the safest and fastest way to do it.

Line 843: /* Notify that that a new call begins. This will create a connection for the
> "that that"
Done

Line 846:  * trans: transaction context
> use punctuation in doc, don't rely on line endings
Done

Line 912: /* Inform the FSM that the assignment (RAN connection) is now complete
> punctuation on all of these lines
Done

-- 
To view, visit https://gerrit.osmocom.org/4980
To unsubscribe, visit https://gerrit.osmocom.org/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ieea9630358b3963261fa1993cf1f3b563ff23538
Gerrit-PatchSet: 11
Gerrit-Project: osmo-msc
Gerrit-Branch: master
Gerrit-Owner: dexter <pmaier at sysmocom.de>
Gerrit-Reviewer: Harald Welte <laforge at gnumonks.org>
Gerrit-Reviewer: Jenkins Builder
Gerrit-Reviewer: Neels Hofmeyr <nhofmeyr at sysmocom.de>
Gerrit-Reviewer: dexter <pmaier at sysmocom.de>
Gerrit-HasComments: Yes