[solved] Incorrect interpretation when specifying unicode

Hi, everybody!
The following situation:
This is the API https://x.com/index.php?app=ws&u=xxxx&h=c3f2454g543c3b8bfsdfa2311b1&op=pv&to=4444444444&unicode=1&msg=tex+consisting+of+82+Latin+characters
When I use the API to send an SMS of 82 characters (fewer than 160) with unicode=1 in the API string, it is counted as 2 SMS instead of 1, because unicode=1 is specified. If I remove unicode=1 from the API string, it is counted as 1 SMS, which is correct. But there is one catch: without unicode=1, Russian characters are not interpreted correctly and are displayed as “???” in the SMS.
Conclusion:
With unicode specified, all output is displayed correctly, but Latin letters are counted as 2 instead of 1. Without unicode, Latin characters are counted correctly (each character is 1), but Russian characters are not displayed correctly.

Additional information:
PlaySMS version: 1.4.3

Installed PHP modules:
php7.3-cli
php7.3-common
php7.3-curl
php7.3-fpm
php7.3-gd
php7.3-json
php7.3-mbstring
php7.3-mysql
php7.3-opcache
php7.3-readline
php7.3-xml

Best regards,
Jamshid Tursunov

Anton, is it possible to add logic so that Latin letters are not treated as Unicode when unicode=1 is specified in the API?
Thanks in advance.

Best regards,
Jamshid Tursunov

Please help me solve this problem.
Thank you in advance.

Best regards,
Jamshid Tursunov

Unicode SMS is up to 70 chars max. See this: https://www.twilio.com/docs/glossary/what-sms-character-limit

anton

Good day!
Thanks for answering and thanks for the tip.
I will outline the situation a little differently.
P.S. Kannel is used as the gateway.
I noticed that the SMS I composed was 82 characters long. That should be 1 SMS, since it does not exceed 160 characters. But the API option unicode=1 was specified, and billing counted it as 2 SMS instead of 1. When I removed unicode=1 from the API, billing counted it as 1 SMS.
P.P.S. Only Latin letters and digits were used in the text.

Best regards,
Jamshid Tursunov

Non-unicode SMS covers only the 128 ASCII chars, which do not include many other characters such as unicode ones (Russian, Arabic, etc.).

For unicode SMS, we need to submit to Kannel with the type set to unicode, which limits the SMS to 70 chars per SMS.

If you’re submitting a unicode text (an SMS containing at least 1 unicode char) to Kannel (playSMS via Kannel) and you don’t tell Kannel that it is unicode, then Kannel will submit it as non-unicode to the provider and the recipient can’t read it properly. Therefore you need to submit it as unicode.

So unicode text sent as SMS is limited to 70 chars per SMS; anything longer will be counted as more than 1 SMS. playSMS follows that rule and adjusts accordingly.
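
As a rough illustration of the counting described above, here is a minimal Python sketch. It assumes the commonly used SMS limits (160 characters for a single non-unicode SMS and 153 per part once it is split; 70 and 67 for unicode). `sms_count` is a hypothetical helper, not playSMS code, so playSMS's own counting may differ in detail.

```python
import math

# Assumed limits: (single SMS, each part of a split SMS).
NON_UNICODE_LIMITS = (160, 153)
UNICODE_LIMITS = (70, 67)


def sms_count(text: str, unicode: bool) -> int:
    """Return how many SMS parts `text` occupies for the chosen encoding.

    Simplification: counts one unit per Python character, which holds for
    Latin and Cyrillic text but not for characters such as emoji.
    """
    single, part = UNICODE_LIMITS if unicode else NON_UNICODE_LIMITS
    if len(text) <= single:
        return 1
    return math.ceil(len(text) / part)


latin_82 = "a" * 82
print(sms_count(latin_82, unicode=False))  # 1
print(sms_count(latin_82, unicode=True))   # 2
```

This matches the case reported in this thread: the same 82 Latin characters are 1 SMS without unicode and 2 SMS with it.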

anton

Anton, thanks for answering.
On the Kannel side, unicode processing is enabled (smsbox - mo-recode = true).
Imagine this situation: the API does not specify unicode (unicode=1). Everything works fine if the SMS uses only Latin letters, and each character counts as 1. If we use Russian letters, the output will not be correct. If we combine Latin and Russian letters, the output will also not be correct: it will be displayed partially correctly (the Latin letters) and partially incorrectly (the Russian letters). That is because unicode=1 is not specified in the API URL.

If we pass the unicode option with argument 1, all characters, even Latin ones, are treated as unicode. The proof is that 82 Latin characters count as 2 SMS. So my question is: is it possible to add logic so that, with unicode specified in the API URL, Latin characters are not counted as Unicode and only non-Latin characters are treated as Unicode?

In the PlaySMS settings there is the option “Enable credit unicode SMS as normal SMS”. If you enable it, a text of 82 Latin characters with unicode=1 specified in the API URL is counted as 1 SMS. But then other Unicode characters are also counted as 1 character each.

Best regards,
Jamshid Tursunov

It seems to me that before transmitting data to Kannel, we must determine which characters are Unicode and which are not. Then, when transmitting mixed data, where some characters are Unicode and some are not, the characters will be interpreted accordingly.

Best regards,
Jamshid Tursunov

Hi,

Understood that you want to do that, but as far as I know you need to decide whether the SMS contains any unicode char or not, and then use unicode=1 if it does. So whether it is unicode or not does not depend on each character but on the whole SMS: if the SMS contains unicode chars, you need to submit unicode=1 to make it work for the recipient.

When playSMS uses unicode=1, it’s up to the gateway plugin to implement it. In Kannel the unicode option selects which encoding Kannel will use to process the whole SMS (not per character).

Here is the relevant code:

anton

Anton, thank you very much for answering so extensively.
That raises one logical question: what should we do in this situation?
We want to send several SMS, and some contain Unicode while others do not.
We get the following picture:
Assume that unicode is not set in the API URL. In this case, SMS without Unicode characters will be displayed and read correctly, and SMS that contain Unicode characters will not be displayed correctly, but read as Unicode characters (70 characters).
Another situation: unicode is specified in the API URL (unicode=1). In this case, whatever characters are transmitted (Unicode or not), everything will be displayed correctly to the end user, but all characters will be counted as unicode (70 characters per SMS), even though Latin characters are not Unicode characters.

Sending all characters as Unicode is also not correct. Suppose unicode=1 is specified and we want to send 10000 SMS, of which 9999 use only Latin letters and 1 uses Russian characters. All of them will be processed as Unicode; the 1 Russian SMS will be interpreted correctly, as expected, but for the sake of that 1 SMS, the 9999 Latin SMS will be counted as 2x9999.
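
To put numbers on this, here is a small Python sketch. It assumes every message in the batch is 82 characters long and uses the usual 160/153 and 70/67 per-part limits. The `parts` helper and the batch itself are hypothetical; this is only an illustration, not how playSMS bills.

```python
import math


def parts(text: str, as_unicode: bool) -> int:
    """Number of SMS parts under the assumed per-part limits."""
    single, multi = (70, 67) if as_unicode else (160, 153)
    return 1 if len(text) <= single else math.ceil(len(text) / multi)


# Hypothetical batch: 9999 Latin messages and 1 Russian message, 82 chars each.
batch = ["a" * 82] * 9999 + ["я" * 82]

# unicode=1 forced on every message.
forced = sum(parts(m, True) for m in batch)

# unicode decided per message (assumption: non-ASCII means unicode).
detected = sum(parts(m, not m.isascii()) for m in batch)

print(forced, detected)  # 20000 10001
```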

Using 2 API URLs on the application side, one with unicode and one without, is also not correct. You will agree that logically this is not right.

It seems to me that the API should be more flexible and have some kind of verification mechanism, perhaps before the data is passed to the plugin, that drops the unicode=1 option if none of the characters are Unicode. I’m not a programmer; I can only judge from the situation and approach it logically. If my ideas do not fit the logic of the code, I apologize.

Thank you for the time allotted earlier.

Best regards,
Jamshid Tursunov

Anton, here is an idea; what do you think about the following?
We add logic that checks the contents of the SMS before passing it to the Kannel plugin, i.e. it processes the SMS, classifies it as Unicode or non-Unicode and, depending on the result, passes unicode=1 to the plugin or not. In that case we do not need unicode=1 in the API URL.

Best regards,
Jamshid Tursunov

If you need the webservices API, it means that you are using a script to do custom processing before submitting SMS to Kannel via playSMS. Can you do your own detection of which SMS contain unicode chars and which do not, and then submit a different URL (one with unicode=1, one without)?

It’s just for proof of concept. If you can, then the detection part can be integrated into playSMS; I’ll help add it to playSMS.
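
A proof of concept along those lines might look like the following Python sketch. The query parameters are the ones from the URL at the top of this thread; `needs_unicode` and `build_send_url` are hypothetical names, and treating everything outside 7-bit ASCII as unicode is a simplifying assumption.

```python
from urllib.parse import urlencode

# Host and path taken from the example URL in this thread.
BASE_URL = "https://x.com/index.php"


def needs_unicode(text: str) -> bool:
    """Assumption: any character outside 7-bit ASCII requires unicode=1."""
    return not text.isascii()


def build_send_url(user: str, token: str, to: str, text: str) -> str:
    """Build the webservices URL, adding unicode=1 only when it is needed."""
    params = {"app": "ws", "u": user, "h": token, "op": "pv", "to": to}
    if needs_unicode(text):
        params["unicode"] = "1"
    params["msg"] = text
    # urlencode percent-encodes the message as UTF-8 and turns spaces into "+".
    return BASE_URL + "?" + urlencode(params)


print(build_send_url("xxxx", "TOKEN", "4444444444", "Hello world"))
print(build_send_url("xxxx", "TOKEN", "4444444444", "Привет, мир"))
```

The first URL carries no unicode parameter, the second one carries unicode=1, so the calling script keeps a single code path instead of two hand-written URLs.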

anton

Good day, Anton!
Thanks for answering.
Yes. I would like to say that if the SMS content is not checked for Unicode characters before it is passed to the Kannel plugin, we would have to use this approach and, as you noticed, this would be a bad solution.

I think the verification logic should be as follows:
By default, before data is sent to the Kannel plugin, the SMS content should be checked for Unicode. If at least 1 Unicode character is present in the content, treat all characters of the content as Unicode.
If none of the characters is Unicode but unicode=1 is specified in the API URL, ignore unicode=1 and treat all characters as non-Unicode.

Then we will not need the method described above (two API URLs, one with unicode and one without).

Or you can completely remove the ability to specify unicode=1 and implement all the logic in the code. Since the sets of Unicode and non-Unicode characters are fixed, they can be defined once in the code along with the verification logic. Then we do not need to specify unicode=1 in the API URL. Instead, the check is automated in the code, and the system determines from the characters in the content whether to treat it as Unicode or not.

Respectfully,
Jamshid Tursunov

There was logic to detect unicode: https://github.com/antonraharja/playSMS/blob/master/web/plugin/gateway/kannel/fn.php#L97

But I decided to remove it: https://github.com/antonraharja/playSMS/commit/e0ef38eb2e95dfad2a3bf748d0724d81912d4627

You can try to uncomment it and test.

Here is the function used to detect unicode: https://github.com/antonraharja/playSMS/blob/master/web/lib/fn_core.php#L767-L792

anton

Good day, Anton!
Thanks for answering.
Yes, of course I can.
So after I uncomment the lines, I should test without specifying unicode=1, right?

Respectfully,
Jamshid Tursunov

Good day, Anton!
Uncommented the line:


The situation is as follows:
All Russian characters are now interpreted correctly.
But at the same time, all characters (Latin and Russian) are counted as non-Unicode characters. Two texts of 82 characters, one Latin and the other Cyrillic (Russian), are both counted as 1 SMS, although in the second case it should have been 2 SMS instead of 1. Another case: if Cyrillic (Russian) characters follow the 82 Latin characters, that is, mixed Latin and Cyrillic text, it is counted the same way.
P.S. unicode=1 is not specified in the API URL.

Respectfully,
Jamshid Tursunov

Uncommenting that line only makes playSMS detect whether the SMS is unicode or not before sending it to Kannel. If it contains unicode characters, playSMS automatically passes it to Kannel as a unicode SMS by adding the charset option (just below that uncommented line).

So was the detection wrong? Or was the processing by Kannel or playSMS wrong for your example SMS text?

anton

Good Morning!
PlaySMS log chunk:

2020-03-03 10:33:28 PID5e5dec28b3b61 jftp L2 sendsms # start uid:5 sender_id:[SMS] smsc:[]
2020-03-03 10:33:28 PID5e5dec28b3b61 jftp L2 sendsms_queue_create # saving queue_code:fc3b6d4ba9556e4dee7907593739e2d1 src:SMS scheduled:2020-03-03 10:33:28
2020-03-03 10:33:28 PID5e5dec28b3b61 jftp L2 sendsms_queue_create # saved queue_code:fc3b6d4ba9556e4dee7907593739e2d1 id:3031
2020-03-03 10:33:28 PID5e5dec28b3b61 jftp L2 sendsms # dst_count:1 sms_count:1 total_charges:0.01
2020-03-03 10:33:28 PID5e5dec28b3b61 jftp L2 sendsms_queue_push # saving queue_code:fc3b6d4ba9556e4dee7907593739e2d1 dst:+998974332423
2020-03-03 10:33:28 PID5e5dec28b3b61 jftp L2 sendsms_queue_push # saved queue_code:fc3b6d4ba9556e4dee7907593739e2d1 smslog_id:123300
2020-03-03 10:33:28 PID5e5dec28b3b61 jftp L2 sendsms # end queue_code:fc3b6d4ba9556e4dee7907593739e2d1 queue_count:1 **sms_count:1** failed_queue:0 failed_sms:0
- - 2020-03-03 10:33:29 PID5e5dec29845a0 - L2 sendsmsd # start processing queue_code:fc3b6d4ba9556e4dee7907593739e2d1 chunk:0 queue_count:1 **sms_count:1** scheduled:2020-03-03 10:33:28 uid:5 gpid:0 sender_id:SMS
- - 2020-03-03 10:33:29 PID5e5dec29845a0 - L2 sendsmsd # sending queue_code:fc3b6d4ba9556e4dee7907593739e2d1 smslog_id:123300 to:+998974332423 sms_count:1 counter:1
- - 2020-03-03 10:33:29 PID5e5dec29845a0 - L2 recvsms_process # using default SMSC smsc:[kannel]
- - 2020-03-03 10:33:29 PID5e5dec29845a0 - L2 sendsms_process # start
- - 2020-03-03 10:33:29 PID5e5dec29845a0 - L2 simplerate_hook_rate_cansend # allowed user uid:5 sms_to:+998974332423 adhoc_credit:0.27 count:1 **rate:0.01 charge:0.01** adhoc_balance:0.26
- - 2020-03-03 10:33:29 PID5e5dec29845a0 - L2 sendsms # saving smslog_id:123300 u:5 parent_uid:0 g:0 gw:kannel smsc:kannel s:SMS d:+998974332423 type:text unicode:0 status:0
- - 2020-03-03 10:33:29 PID5e5dec29845a0 - L2 sendsms_process # saved smslog_id:123300 id:123300
- - 2020-03-03 10:33:29 PID5e5dec29845a0 - L2 simplerate_hook_rate_deduct # enter smslog_id:123300
- - 2020-03-03 10:33:29 PID5e5dec29845a0 - L2 simplebilling_hook_billing_post # saving smslog_id:123300 rate:0.01 count:1 charge:0.01
- - 2020-03-03 10:33:29 PID5e5dec29845a0 - L2 simplebilling_hook_billing_post # saved smslog_id:123300 id:123270
- - 2020-03-03 10:33:29 PID5e5dec29845a0 - L2 sendsms_process # end
- - 2020-03-03 10:33:29 PID5e5dec29845a0 - L2 sendsmsd # result queue_code:fc3b6d4ba9556e4dee7907593739e2d1 to:+998974332423 flag:1 smslog_id:123300
- - 2020-03-03 10:33:29 PID5e5dec29845a0 - L2 sendsmsd # finish processing queue_code:fc3b6d4ba9556e4dee7907593739e2d1 uid:5 sender_id:SMS queue_count:1 sms_count:1
192.168.100.218 192.168.100.224 2020-03-03 10:33:43 PID5e5dec373945a - L2 kannel__call # start load:/var/www/playsms/plugin/gateway/kannel/dlr.php
192.168.100.218 192.168.100.224 2020-03-03 10:33:43 PID5e5dec373945a - L2 kannel__call # end load dlr
- - 2020-03-03 10:33:44 PID5e5d30d9513de - L2 simplebilling__finalize # saving smslog_id:123300
- - 2020-03-03 10:33:44 PID5e5d30d9513de - L2 simplebilling__finalize # saved smslog_id:123300

Before passing it to Kannel, PlaySMS counts 82 Cyrillic (Russian) characters as 1 SMS.

Respectfully,
Jamshid Tursunov

It was not detected as unicode.

Can you paste your sample SMS text here?

anton