[PATCH] Decoding UCS2 USSD

historical

Thanks Harald for checking the license thing out, although I think
the code is so trivial that it's difficult to claim copyright. Let's
see.

Thanks Pavel for the patch! I have some comments:

Pavel Baturko wrote:
> From 4fcf1f7afc29053833e524cc35c15dd57ebbbf64 Mon Sep 17 00:00:00 2001
> From: pab <pab at pab>

Please set the git config user.name and user.email options correctly:

git config --global user.name 'Pavel Baturko'
git config --global user.email your at email.here

> Date: Thu, 20 Sep 2012 01:36:28 +0000
> Subject: [PATCH] Add USSD UCS2 support
> 
> ---

There is no commit message. I don't know why this would be OK.
Please write a few relevant sentences about the change *in* the
commit message, rather than in the email you send where you attach
the commit message. The commit message is the correct place for the
information, since that is the canonical storage facility for the
code. The mailing list is not.

> +++ b/src/host/layer23/src/mobile/gsm480_ss.c
> @@ -24,6 +24,9 @@
>  #include <stdio.h>
>  #include <string.h>
>  #include <stdlib.h>
> +#include <wchar.h>
> +#include <wctype.h>
> +#include <locale.h>

Are these files always guaranteed to be available, or are some
configure.ac checks required in this change?

> @@ -439,7 +442,7 @@ static int gsm480_tx_invoke(struct gsm_trans *trans, struct msgb *msg,
>  	/* Pre-pend the invoke ID */
>          msgb_push_TLV1(msg, GSM0480_COMPIDTAG_INVOKE_ID, trans->ss.invoke_id);
>  
> -	/* Wrap this up as invoke vomponent */
> +	/* Wrap this up as invoke component */
>  	if (msg_type == GSM0480_MTYPE_FACILITY)
>  		msgb_wrap_with_TL_asn(msg, GSM0480_CTYPE_RETURN_RESULT);
>  	else
> @@ -592,7 +595,7 @@ int ss_send(struct osmocom_ms *ms, const char *code, int new_trans)
>  		LOGP(DSS, LOGL_INFO, "Existing transaction.\n");
>  		return gsm480_tx_ussd(trans, GSM0480_MTYPE_FACILITY, code);
>  	}
> -
> +	

Please remove all unrelated changes from the commit. I am strongly in
favor of cleaning up code, but it needs to be done in separate
commits, not mixed in with other changes. Mixing changes makes life
very difficult for anyone who might need to bisect the code. It is
completely unneccessary. It's also really ugly.

> @@ -751,14 +755,27 @@ static int gsm480_rx_ussd(struct gsm_trans *trans, const uint8_t *data,
>  		LOGP(DSS, LOGL_NOTICE, "DSC tag too short\n");
>  		return -EINVAL;
>  	}
> +
>  	if (data[0] != ASN1_OCTET_STRING_TAG || tag_len != 1) {
>  		LOGP(DSS, LOGL_NOTICE, "Expecting DSC tag\n");
>  		return -EINVAL;
>  	}
> -	if (tag_data[0] != 0x0f) {
> -		LOGP(DSS, LOGL_NOTICE, "DSC not 0x0f\n");
> -		return -EINVAL;
> +	
> +	switch (tag_data[0])
> +	{
> +		case DCS_GSM7DEAFULT:
> +		case DCS_GSM7DEAFULT2:
> +		case DCS_0f:
> +			dcs_charset = CGB_GEN_CHARSET7;
> +			break;

I really like the kernel style for switch:

switch (foo) {
case BAR:
case BAZ:
	quux();
	break;
}

Ie. moving the opening brace, and using one less tab inside.

> +		default:
> +			LOGP(DSS, LOGL_NOTICE, "DSC <%d> unknown\n", tag_data[0]);
> +			return -EINVAL;
>  	}
> +
>  	len -= tag_data - data + tag_len;
>  	data = tag_data + tag_len;

The added newline seems an unrelated change.

> @@ -767,25 +784,81 @@ static int gsm480_rx_ussd(struct gsm_trans *trans, const uint8_t *data,
>  		LOGP(DSS, LOGL_NOTICE, "Text tag too short\n");
>  		return -EINVAL;
>  	}
> +	
>  	if (data[0] != ASN1_OCTET_STRING_TAG) {
>  		LOGP(DSS, LOGL_NOTICE, "Expecting text tag\n");
>  		return -EINVAL;
>  	}

The one above too.

> -	num_chars = tag_len * 8 / 7;
> -	/* Prevent a mobile-originated buffer-overrun! */
> -	if (num_chars > sizeof(text) - 1)
> -		num_chars = sizeof(text) - 1;
> -	text[sizeof(text) - 1] = '\0';
> -	gsm_7bit_decode(text, tag_data, num_chars);
> -
> -	for (i = 0; text[i]; i++) {
> -		if (text[i] == '\r')
> -			text[i] = '\n';
> +	
> +	if (dcs_charset == CGB_GEN_CHARSET7)
> +	{

Please pay attention to the existing style in code that you work on,
and make sure that you follow it exactly. You put the opening brace
on the following line, while the rest of the code does not do this.

In any case, please change this if (x == foo) else if (x == bar) code
to use switch ().

> +	else if (dcs_charset == CGB_GEN_CHARSET16)
> +	{
> +		// USC2 charset, see ISO/IEC 10646
> +		const char *curr_locale;
> +		uint8_t hex_or_str; // 1 - print string in utf-8, 0 -print raw hex

The hex_or_str variable name is absolutely horrible. Please change it
to something that also conveys the meaning of the value; something
self-descriptive. One suggestion would be have_utf8_locale.

> +		int j = 0;
> +	
> +		num_chars = tag_len / 2; // since 1 char = 2 bytes

This discards the last surplus byte if there is one. Good!

> +		// determine if console locale is UTF-8 and we can print message
> +		// if not - print raw hex UTF-8 data
> +		curr_locale = setlocale(LC_ALL, "");

Passing an empty string as locale is an invasive operation, which is
why I suggested to pass NULL. Why do you use the empty string instead
of NULL?

> +		hex_or_str = (strstr(curr_locale, "UTF-8")) ? 1 : 0;

I'd suggest variable = strstr() != NULL;

> +		if (!hex_or_str) printf("Received USSD (UFT-8, hex):\n");

Three comments for a single line of code:

First the style, I don't think that conditions and their code is on
the same line anywhere else in the code. Following existing style.

Second, I don't believe other parts of the code write directly to any
file descriptors, but rather they all use logging macros. Follow
existing style.

Third, in the patch there are several occurences of the "UFT-8" typo.

> +		for (i = 0; i < num_chars; i++) {
> +			unsigned char utf8[4];
> +			int ucs2, res;
> +
> +			ucs2 = ((tag_data[2*i]<<8)|tag_data[2*i+1]);

Look at how other code uses whitespace and follow existing style.

> +			res = ucs2_to_utf8(ucs2, utf8);

I would either just include the code here, or write a new function
which suits our purposes which converts the full string.

> +			if (res > 0) {
> +				if (hex_or_str) {
> +					memcpy(text+j, utf8, res);
> +					j += res;
> +				}
> +				else {
> +					int j;
> +					for (j=0; j<res; j++) printf("%02x", utf8[j]);
> +					printf(" ");
> +				}
> +			}

I still think printf() is not used anywhere else in the code.

And the code makes absolutely no sense. In one case there is a
memcpy(), in the other case there is printf().

Also, why is there a memcpy at all - please just write the data to
the correct location in the first place. Always strive for zero copy.
Thank you.

> +		if (hex_or_str) {
> +			text[j] = '\0';
> +		}
> +		else {
> +			printf("\n");
> +			sprintf(text, "USSD not printed(locale not UFT-8)");
> +		}
> +		
> +		gsm480_ss_result(trans->ms, text, 0);

Did you investigate if this function will handle UTF-8 data
correctly? I sure don't know that.

> @@ -935,6 +1008,7 @@ static int gsm480_rx_result(struct gsm_trans *trans, const uint8_t *data,
>  		LOGP(DSS, LOGL_NOTICE, "Invoke ID too short\n");
>  		return -EINVAL;
>  	}
> +
>  	if (data[0] != GSM0480_COMPIDTAG_INVOKE_ID || tag_len != 1) {
>  		LOGP(DSS, LOGL_NOTICE, "Expecting invoke ID\n");
>  		return -EINVAL;

This seems to be an unrelated change.

> @@ -960,6 +1034,7 @@ static int gsm480_rx_result(struct gsm_trans *trans, const uint8_t *data,
>  		LOGP(DSS, LOGL_NOTICE, "Sequence tag too short\n");
>  		return -EINVAL;
>  	}
> +
>  	if (data[0] != GSM_0480_SEQUENCE_TAG) {
>  		LOGP(DSS, LOGL_NOTICE, "Expecting Sequence Tag, trying "
>  			"Operation Tag\n");

And this.

> @@ -974,6 +1049,7 @@ operation:
>  		LOGP(DSS, LOGL_NOTICE, "Operation too short\n");
>  		return -EINVAL;
>  	}
> +
>  	if (data[0] != GSM0480_OPERATION_CODE || tag_len != 1) {
>  		LOGP(DSS, LOGL_NOTICE, "Expecting Operation Code\n");
>  		return -EINVAL;

And this.

> @@ -1223,8 +1299,8 @@ static int gsm480_mmss_ind(int mmss_msg, struct gsm_trans *trans,
>  	/* pull the MMSS header */
>  	msgb_pull(msg, sizeof(struct gsm48_mmxx_hdr));
>  
> -	LOGP(DSS, LOGL_INFO, "(ms %s) Received est/data '%u'\n", ms->name,
> -		msg_type);
> +	LOGP(DSS, LOGL_INFO, "(ms %s) Received est/data '%u'='%X'\n", ms->name,
> +		msg_type, msg_type);
>  
>  	switch (msg_type) {
>  	case GSM0480_MTYPE_RELEASE_COMPLETE:

And this.

> +++ b/src/shared/libosmocore/include/osmocom/gsm/gsm_utils.h
..
> @@ -143,9 +145,51 @@ enum gsm_chan_t {
>  	GSM_LCHAN_TCH_H,
>  	GSM_LCHAN_UNKNOWN,
>  	GSM_LCHAN_CCCH,
> -	GSM_LCHAN_PDTCH,
>  	_GSM_LCHAN_MAX
>  };

Seems unrelated.

> +/* Data Coding Schemes (DCS), 23.038, 4 */
> +#define CGB_GEN            0x00    /* 00xxxxxxb, General Data Coding indication */
> +#define CGB_GEN_COM        0x20    /* xx1xxxxxb, text is uncompressed */
> +#define CGB_GEN_UNCOM      0x00    /* xx0xxxxxb, text is compressed (23.042) */
> +#define CGB_GEN_CLASS      0x10    /* xxx1xxxxb, bits 1 to 0 have a message class meaning */
> +#define CGB_GEN_NOCLASS    0x00    /* xxx0xxxxb, bits 1 to 0 are reserved and have no message class */
> +#define CGB_GEN_CLASS0     0x00    /* xxxxxx00b, Class 0 */
> +#define CGB_GEN_CLASS1     0x01    /* xxxxxx01b, Class 1, Default meaning: ME-specific */
> +#define CGB_GEN_CLASS2     0x02    /* xxxxxx10b, Class 2, (U)SIM specific message */
> +#define CGB_GEN_CLASS3     0x03    /* xxxxxx11b, Class 3, Default meaning: TE specific (27.005) */
> +#define CGB_GEN_CHARSET7   0x00    /* xxxx00xxb, GSM 7 bit default alphabet */
> +#define CGB_GEN_CHARSET8   0x04    /* xxxx01xxb, 8 bit data */
> +#define CGB_GEN_CHARSET16  0x08    /* xxxx10xxb, UCS2 (16bit) */
> +#define CGB_GEN_CHARSETR   0x0C    /* xxxx11xxb, Reserved */
> +
> +#define CGB_AUTODEL        0x40    /* 01xxxxxxb, Message Marked for Automatic Deletion Group */
> +#define CGB_RESERVED       0x8B    /* 10001011b, Reserved */
> +
> +#define CGB_MWI_DISC       0xC0    /* 1100xxxxb, Message Waiting Indication Group: Discard Message */
> +#define CGB_MWI_STOR       0xD0    /* 1101xxxxb, Message Waiting Indication Group: Store Message */
> +#define CGB_MWI_INACT      0x00    /* xxxx0xxxb, Set Indication Inactive */
> +#define CGB_MWI_ACT        0x08    /* xxxx1xxxb, Set Indication Active */
> +#define CGB_MWI_RESERVED   0x00    /* xxxxx0xxb, Reserved */
> +#define CGB_MWI_VMAIL      0x00    /* xxxxxx00b, Voicemail Message Waiting */
> +#define CGB_MWI_FAX        0x01    /* xxxxxx01b, Fax Message Waiting */
> +#define CGB_MWI_EMAIL      0x02    /* xxxxxx10b, Electronic Mail Message Waiting */
> +#define CGB_MWI_OTHER      0x03    /* xxxxxx11b, Other Message Waiting */
> +#define CGB_MWI_STOR2      0xE0    /* 1110xxxxb, as CGB_MWI_STOR, but text in UCS2 */
> +
> +#define CGB_DCMC           0xF0    /* 1111xxxxb,  Data coding/message class */
> +#define CGB_DCMC_RESERVED  0x08    /* xxxx0xxxb,  Reserved */
> +#define CGB_DCMC_ALPH7     0x00    /* xxxxx0xxb, GSM 7 bit default alphabet  */
> +#define CGB_DCMC_ALPH8     0x04    /* xxxxx0xxb, 8 bit alphabet  */
> +
> +/* Default 7 bit GSM */
> +#define DCS_GSM7DEAFULT   (CGB_GEN    | CGB_GEN_UNCOM | CGB_GEN_NOCLASS | CGB_GEN_CHARSET7)
> +/* As default, but last bit is 1.
> + * It makes no sense but some providers sends this type of DCS  */
> +#define DCS_GSM7DEAFULT2  (CGB_GEN    | CGB_GEN_UNCOM | CGB_GEN_NOCLASS | CGB_GEN_CHARSET7  | CGB_GEN_CLASS1)
> +/* This constant was initialy in code */
> +#define DCS_0f            (CGB_GEN    | CGB_GEN_UNCOM | CGB_GEN_NOCLASS | CGB_GEN_CHARSETR  | CGB_GEN_CLASS3)

I hope a better name than DSC_0f can be found for it! Please talk to
the developer who added the constant, to understand what it is about.
Use git blame to find out who it is.

> +/* Autodel 16 bit UCS2 */
> +#define DCS_ADEL_USC2     (CGB_AUTODEL| CGB_GEN_UNCOM | CGB_GEN_NOCLASS | CGB_GEN_CHARSET16)
>  
>  #endif

Lots of new defines. They're not all required, but I do like that
they are all there now.

> +++ b/src/shared/libosmocore/src/gsm/gsm_utils.c
> @@ -102,6 +102,30 @@ static unsigned char gsm_7bit_alphabet[] = {
>  	0xff, 0x7d, 0x08, 0xff, 0xff, 0xff, 0x7c, 0xff, 0x0c, 0x06, 0xff, 0xff, 0x7e, 0xff, 0xff
>  };
>  
> +/* Convert UCS2 char to UTF8 char */
> +int ucs2_to_utf8 (int ucs2, unsigned char * utf8)
> +{

If you do end up keeping this function then even if it is too trivial
to be copyrightable and/or the author doesn't care I think it is
polite to at the very least reference it's origin.

Thanks for your effort - I hope you'll spin a new improved patch!

//Peter