MySQL causing peak CPU load

try killing the playsmsd daemons one at a time and observe

start the kills with dlrssmsd, then restart playsmsd, then kill recvsmsd manually, restart again, and so on

see which service is affecting MySQL

anton

OK, thank you, I will try it.

Regards,
Jamshid Tursunov

Anton, I collected information on each process individually and on all of them together, and here is what happened:
Sequential shutdown and startup:


Before shutting down the processes, the CPU load varied in the range of 105-140%.
I stopped them in the following order:
First I stopped ratesmsd; the CPU load dropped to around 30-50% and held steady in that range while I watched.
Second I stopped recvsmsd; the CPU load stayed in the 30-43% range.
Third I stopped schedule; the CPU load stayed in the 30-42% range.
Fourth I stopped sendsmsd; the CPU load stayed in the 30-42% range.
Fifth I stopped dlrssmsd; the CPU load dropped to 0.7%, occasionally spiking to 1%.


With all five services stopped, the CPU load stayed at 0.7%, occasionally spiking to 1%.


When I started dlrssmsd, the CPU load jumped straight up to 30-42%.
Next I started sendsmsd; the CPU load remained in the 30-42% range.
After starting schedule, the CPU load stayed in the 30-42% range.
Then I started recvsmsd, and the CPU load stayed in the 30-42% range.
When I started ratesmsd, the CPU load held in the 40-45% range, but it very often spikes, with the load reaching as much as 145%.


Individual start and stop (I started and stopped each service on its own to measure how much load a single process adds):
When I started dlrssmsd, the CPU load jumped straight up to 42%.
Next I started sendsmsd; the CPU load kept to the 1.3-1.7% range.
After I enabled schedule, the CPU load stayed in the 1.3-1.7% range.
Then I started recvsmsd; the CPU load kept to the 1.3-1.7% range.
When I started ratesmsd, the CPU load immediately began spiking above the 100% mark.

I want to emphasize that even after I stopped all the playSMS processes, the memory occupied by MySQL stayed at 14%. In other words, the memory is not freed after the playSMS processes are stopped.

Respectfully,
Jamshid Tursunov

Anton, if possible, please look at the MySQL behavior described above. Is it possible to optimize MySQL to reduce the CPU load? These spikes occur even when the system is idle.

Respectfully,
Jamshid Tursunov

Anton, good day!
Did you have time to look at the unstable behavior of playSMS when interacting with the database?

Respectfully,
Jamshid Tursunov

Hi, sorry, no, and I think it's better to ask MySQL professionals/communities about that.

Looks to me like it's all about how to correctly index and/or optimize the MySQL configs. Of course there are parts of playSMS where changes could be made to help, but I haven't had time to focus on that.

anton

Hello dear Anton!
I have to bother you again; your help is needed to solve this issue :)
As you can see, I tried to describe the problem in detail for each service separately.
I really need your help with this issue; apparently no one but you will be able to figure it out.

Respectfully,
Jamshid Tursunov

what's your server's specification (CPU, RAM, is the disk SSD or not)?

have you tried tweaking the MySQL config?

have you tried adding indexes and testing them? perhaps ask a MySQL pro to help you fix it (or follow the indexes pointed out by @Edilson_Spessoto in previous posts)

how big is your data? roughly how many rows are in your tables? (just an estimate, no need for exact numbers: above a million, above a hundred thousand rows, something like that)

anton

Try shutting down playSMS, then observe whether MySQL behaves the same. Is this a physical machine or a virtual one? What are the storage config and physical specs? Does any other application share the same volume as your MySQL? It would be great if you provided your partition table and the directories where your components are installed. You might also check whether something else is hitting your MySQL server; try netstat.

Anton has a point there, but we need to understand your hardware specs. Indexing is good, but I have a helpdesk system with almost 10 years of data and 1K users that has never had extra indexing, and I still don't have issues with that database server.

Just helping here.

From your post, physical memory is at 60% and swap isn't even used, meaning your memory utilization is fine.

But we don't know your storage specs, how you manage the partitions, or on which volumes your application's working directories are stored.

You might also check your physical storage performance and your hardware specs.

@anton, if you have the opportunity and resources, please fix this problem as well.
You can reproduce a similar situation on a working playSMS installation.

Note: the detailed per-process CPU measurements (sequential shutdown and startup, plus individual start/stop of each service) are in my post above, including the observation that MySQL's memory stays at 14% even after all playSMS processes are stopped.

Regards,
Jamshid Tursunov

Most likely, some tables are missing indexes.
I'm certainly no deep DB specialist, but every source on the Internet points to this.

Regards,
Jamshid Tursunov

This may be partially fixed here:

Those changes, especially in credit/fn.php, removed the INNER JOIN to tblSMSOutgoing when querying balance.

parent_uid and uid are now also saved in tblBilling, so there's no need to JOIN the two tables when querying balance.

For a new installation it won't be a problem, but for an upgrade you need to alter tblBilling by adding parent_uid and uid columns, and then copy the parent_uid and uid values for each row from tblSMSOutgoing into tblBilling.
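
A minimal sketch of that upgrade step, assuming both tables share an smslog_id key and integer uid columns (verify against your schema and back up first):

ALTER TABLE playsms_tblBilling
    ADD COLUMN parent_uid INT NOT NULL DEFAULT 0,
    ADD COLUMN uid INT NOT NULL DEFAULT 0;

UPDATE playsms_tblBilling b
JOIN playsms_tblSMSOutgoing o ON o.smslog_id = b.smslog_id
SET b.parent_uid = o.parent_uid, b.uid = o.uid;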

anton

Will check soon.

Yours faithfully,
Jamshid Tursunov

  1. Can you share how you created those indexes in MySQL?
  2. Just wondering, where does the field 'pid' in playsms_tblSMSOutgoing_queue_dst come from? I can't find a field with that name in the table.

Thanks in advance

This problem has not gone away; I think it’s mainly caused by inefficient code.

I find the codebase a bit difficult to follow because it contains a fair bit of 'magic': numbers instead of constants, backtraces and function-name suffixes instead of explicit attachment to hooks, iteration with for instead of foreach loops, and so on. So feel free to correct me on the facts, @anton.

There’s $core_config['isdlrd'] which is true by default in your config.php. This is meant to defer some work to the daemon (that calls dlrd()), otherwise it’s done immediately when dlr() is called. However, even if you set this config value to false, dlrd() is still called and still does the same work. It looks through every record in playsms_tblDLR for flag_processed is 1, even if there are never any that will match. This search is on an un-indexed column, and that table can contain anywhere from thousands to millions of records, depending on how often you send, how many recipients you typically have, and how long you’ve been using PlaySMS. The same daemon then does something similar in getsmsstatus() on playsms_tblSMSOutgoing (another full table scan, no indices). The latter table has one record for every SMS sent, fewer than the other one, but a lot. It then waits 1 second before doing this again, all day every day. The resulting CPU thrashing occurs, regardless. It can have nothing queued, no unprocessed messages, etc., but it will keep performing the checks over and over.

You can increase that default wait by setting DAEMON_SLEEP in etc/playsmsd.conf, but this value is also used for all the other daemons playSMS runs. A longer wait may unnecessarily throttle normal sending and receiving; I chose to do this anyway because the reduction in thrashing is quite significant.
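
For reference, that is a single line in etc/playsmsd.conf; 10 here is just an illustration, and given the 1-second default described above the value appears to be in seconds (match the syntax of your existing file):

DAEMON_SLEEP=10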

Adding a non-unique index to playsms_tblDLR.flag_processed may help, but it's up to your database to decide at query time whether to use it. It has to see that most of the indexed records don't match what it's looking for, so that using the index to find the ones that do match will be faster than scanning the whole table. Initially this wasn't the case for me, as I had a huge number of records which logged a message as ready to send but unprocessed, yet later records for the same message (matching smslog_id) showed it as delivered and/or failed and processed. To me, this looks like the daemon constantly re-examines old delivery statuses, looking for messages where it already has an answer and will never get a new one because the message files are long gone. If I mark the older records as processed (because the matching later ones make me think they have been), the index becomes very helpful, and looking for unprocessed records is super-fast. I'm not 100% sure this is the right approach, but the alternative is infinite scanning of a huge table, looking for responses that I doubt will ever arrive.
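
The index itself is one statement; the clean-up step is sketched below under stated assumptions (an auto-increment id column, and flag values where 1 means awaiting processing and 2 means processed). Those values are guesses to show the shape of the query, so verify against your own data and back up first:

ALTER TABLE playsms_tblDLR ADD INDEX (flag_processed);

-- Mark stale rows whose smslog_id already has a later, processed row.
-- The flag values and the id column are assumptions; check your schema.
UPDATE playsms_tblDLR d
JOIN (
    SELECT smslog_id, MAX(id) AS last_id
    FROM playsms_tblDLR
    WHERE flag_processed = 2
    GROUP BY smslog_id
) done ON done.smslog_id = d.smslog_id AND d.id < done.last_id
SET d.flag_processed = 2
WHERE d.flag_processed = 1;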

For the other table, adding this index massively sped up the queries, because very few records had p_status set to 0, as needed by the daemon:

ALTER TABLE playsms_tblSMSOutgoing ADD INDEX (p_smsc, p_status, flag_deleted);
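
You can check whether the optimizer actually picks it up with something like this (the WHERE clause just mirrors the filter described above, and the literal values are placeholders):

EXPLAIN SELECT * FROM playsms_tblSMSOutgoing
WHERE p_smsc = 'gammu' AND p_status = 0 AND flag_deleted = 0;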

This part is gammu-specific, but when PlaySMS checks the status of messages to mark as delivered or not, it manually reads every file in the error and sent folders into some lists, then checks those lists until it finds a matching file, instead of using something like glob() to look for a direct match. It also has some odd logic regarding timestamps of old messages, which continued looking for results of old sends many months after they’d completed. I had to patch that out.

I think performance issues stem from tasks using constant, repetitive, and resource-intensive polling of both the database and filesystem, instead of being event-based. It’s probably difficult to avoid some of this because it would require using platform-specific filesystem watching tools like inotify, and considerable revision to the code to accommodate them. However, it needs addressing. One strategy could involve adding indexed(!) timestamp columns to some tables so that tasks where only new and recently-modified records are relevant could reduce the amount of data to examine.
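
As an illustration of that strategy (the column name, the choice of table, and the interval are assumptions, not anything playSMS ships with):

-- Hypothetical: track modification time and index it so daemons
-- can restrict their scans to recently-changed rows.
ALTER TABLE playsms_tblDLR
    ADD COLUMN updated_at TIMESTAMP NOT NULL
        DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    ADD INDEX (updated_at);

-- A daemon query could then examine only the last few minutes.
SELECT * FROM playsms_tblDLR
WHERE flag_processed = 1 AND updated_at > NOW() - INTERVAL 5 MINUTE;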

In practice, I have to run lots of queries directly from cron jobs to periodically correct & de-duplicate phonebook & stoplist entries, clean the DLR table, and remove old sends. This means far less data for the daemons to churn through. I also disabled the credit, simplebilling, and simplerate plugins (because we don’t use them and they consume a lot of processing time). This can be done by prepending a . to their folder names under plugin/feature/. I then had to add a custom one, otherwise every send fails. It’s basically only this:

function mypluginname_hook_rate_cansend() {
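    // Always report that sending is allowed; with the rate plugins
    // disabled, nothing else answers this hook.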
    return true;
}

You could probably add this to any enabled plugin, changing its prefix to match, YMMV. I could have changed the main sending code instead to assume that a null return from that hook meant ‘no rate plugins, so no send limits’.
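
For the cron-driven clean-up mentioned above, a retention query might look like this (the p_datetime column and the 90-day window are assumptions; adjust to your schema and retention policy):

-- Hypothetical nightly job: drop send records older than 90 days.
DELETE FROM playsms_tblSMSOutgoing
WHERE p_datetime < NOW() - INTERVAL 90 DAY;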

Something to make clean-up easier is to add foreign keys, so that deleting a single queue record automatically deletes its related records. Don’t blame me if you break your database, but I use these foreign keys which seem to work for me:

playsms_tblDLR: FOREIGN KEY (`smslog_id`) REFERENCES `playsms_tblSMSOutgoing_queue_dst` (`id`) ON DELETE CASCADE
playsms_tblSMSOutgoing_queue_dst: FOREIGN KEY (`queue_id`) REFERENCES `playsms_tblSMSOutgoing_queue` (`id`) ON DELETE CASCADE
playsms_tblSMSOutgoing: FOREIGN KEY (`queue_code`) REFERENCES `playsms_tblSMSOutgoing_queue` (`queue_code`) ON DELETE CASCADE
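
For reference, the first of those would be applied with something like this (a sketch; it needs InnoDB tables, matching column types, and no orphaned rows):

ALTER TABLE playsms_tblDLR
    ADD FOREIGN KEY (`smslog_id`)
    REFERENCES `playsms_tblSMSOutgoing_queue_dst` (`id`)
    ON DELETE CASCADE;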

Those foreign keys also add indices to your tables, which may speed up other queries that join on those columns. If you cannot add those foreign keys, it’s usually because there are orphaned child records that need to be deleted, e.g.:

DELETE FROM playsms_tblDLR WHERE NOT EXISTS (SELECT * FROM playsms_tblSMSOutgoing_queue_dst WHERE playsms_tblSMSOutgoing_queue_dst.id = playsms_tblDLR.smslog_id);
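
By analogy, the same pattern clears orphans for the other child relation (following the queue_id reference above):

DELETE FROM playsms_tblSMSOutgoing_queue_dst WHERE NOT EXISTS (SELECT * FROM playsms_tblSMSOutgoing_queue WHERE playsms_tblSMSOutgoing_queue.id = playsms_tblSMSOutgoing_queue_dst.queue_id);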

I had to make those changes from the mariadb (or mysql) command line because they took a while to complete.


I don’t mean to disparage Anton’s work, it’s a large project that’s clearly had a lot of time and effort put into it, which I doubt I’d have the patience for. It mostly behaves as expected but hasn’t been written in a way that enables it to scale well.

I’m sure this seems like a big rant, but I’m just sharing that improvements are possible with good indices, avoiding processing things that don’t need it, and not keeping old data forever.

If you disable the billing plugins, you'll also need a plugin function with something like this:

function mypluginname_hook_rate_getcharges() {
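    // Dummy charge values; check what this hook's callers expect
    // from each element before relying on this.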
    return [1, 0, 0];
}