CMS_WebFarmServer and scaling

Hey there,

I'm coming here because we just had an incident that almost cost us our production environment.

We're working on an Xperience Website with this architecture:

  • One CMS instance

  • One Frontend instance that can scale to up to 4

All that in Azure.

We're currently doing a lot of tests on our scaling, so there is a lot of server creation/destruction.

A few minutes ago, we had the quite unpleasant surprise of seeing our SQL Server instance go up to 100% with nothing peculiar running. Let's check the problem.

Proc_CMS_WebFarmTask_DeleteOrphanedTasks => min 15s, avg 15s, max 15s

That's not good. 15s felt a lot like a SQL timeout, and indeed it was.

Let's dive.

The body of DeleteOrphanedTasks uses a NOT IN (...) over all the task IDs in WebFarmServerTasks.

And at that moment, I got it.

Let's check our servers...

18 servers with ServerEnabled seems like a lot, as we only had 1 instance at that moment (+1 CMS + 1 staging slot, I guess). Let's check what's going on...

(A query returning the last ping by ServerId.)

So, some servers haven't answered for more than 12h but are still considered enabled (ServerEnabled)?

Once I deleted all the tasks/servers that were not there anymore, the DB load went back to normal.

Ever had this problem? What is the normal behavior for our use case? Does Xperience support autoscaling, or do I have to manage the server list on our side? I feel a process should do the cleaning at least hourly.

Furthermore, DELETE TOP is a really bad pattern. We already had a problem in K13 with the same kind of query. You were doing almost the same thing to delete logs, and with a big log intake you could easily break the system. The query lists all the rows and takes the first x. If the listing takes longer than your timeout, the whole cleanup breaks.
You have an identity field. Get the min/max ID from your query and delete on Id > X. It will be lightning fast in comparison.
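
To make the idea concrete, here is a minimal Python sketch of deleting by ID range in small batches. It runs against an in-memory SQLite database rather than SQL Server, and the `delete_in_id_ranges` helper, the table name and the column name are all made up for illustration; this is not how Xperience does it.

```python
import sqlite3


def delete_in_id_ranges(conn, table, id_col, max_id_to_delete, batch_size=1000):
    """Delete every row with id <= max_id_to_delete, one ID range at a time.

    Each DELETE only touches a bounded slice of the identity column, so no
    single statement has to list all candidate rows first.
    Note: table and id_col are hypothetical names and must be trusted input,
    since they are interpolated into the SQL text.
    """
    cur = conn.cursor()
    # Find where the range to delete starts (uses the primary key index).
    row = cur.execute(
        f"SELECT MIN({id_col}) FROM {table} WHERE {id_col} <= ?",
        (max_id_to_delete,),
    ).fetchone()
    if row[0] is None:
        return 0  # nothing to delete

    low = row[0]
    deleted = 0
    while low <= max_id_to_delete:
        high = min(low + batch_size - 1, max_id_to_delete)
        cur.execute(
            f"DELETE FROM {table} WHERE {id_col} BETWEEN ? AND ?",
            (low, high),
        )
        deleted += cur.rowcount
        conn.commit()  # keep each transaction short
        low = high + 1
    return deleted
```

Called as `delete_in_id_ranges(conn, "EventLog", "EventID", 3200, batch_size=500)`, it would remove rows 1 to 3200 in seven short statements instead of one long one.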

[edit] seems we can't put SQL. Not quite practical :D

Environment

  • Xperience by Kentico version: [31.5.0]

  • .NET version: [10]

  • Execution environment: [Private cloud (Azure)]

Answers

In regards to the large list of servers, it is standard behavior for a server not responding for 12hrs to still be considered enabled and "active." There is a description of this behavior here (it is K13 documentation, but the functionality is the same). Servers will remain in the system and generate tasks until 24hrs, then they are deleted along with their tasks.

You can see on that page there is also a mention of the CMSWebFarmNotRespondingInterval key; this also still applies in XbK. It is noted that the default 24hr interval may cause unnecessary load on the system when using autoscaling, so I would certainly recommend that you add this key to your applications with a shorter interval. Ideally, when you check the CMS_WebFarmServer table you will have 2-5 servers (1 admin and 1-4 front end). Setting this key to something like 6 hours could really help clean up the tables and improve performance.
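
To illustrate the effect of a shorter interval, here is a minimal Python sketch. It is only a model of the behavior described above, not Kentico's implementation; the `split_servers` helper, the server names and the ping times are made up.

```python
from datetime import datetime, timedelta


def split_servers(last_pings, now, not_responding_interval):
    """Return (kept, stale) server names based on each server's last ping."""
    kept, stale = [], []
    for name, last_ping in sorted(last_pings.items()):
        # A server is stale once it has been silent longer than the interval.
        if now - last_ping > not_responding_interval:
            stale.append(name)
        else:
            kept.append(name)
    return kept, stale


# Hypothetical snapshot: two live servers and two leftovers from scaling tests.
now = datetime(2025, 1, 1, 12, 0)
last_pings = {
    "admin": now - timedelta(minutes=1),
    "frontend-1": now - timedelta(minutes=2),
    "frontend-old-a": now - timedelta(hours=12),
    "frontend-old-b": now - timedelta(hours=20),
}

# With a 24-hour interval nothing is stale yet; with 6 hours both leftovers are.
print(split_servers(last_pings, now, timedelta(hours=24)))
print(split_servers(last_pings, now, timedelta(hours=6)))
```

In this model, servers that were destroyed 12 or 20 hours ago would keep accumulating tasks under the default interval, but would already be cleaned up under a 6-hour one.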

Thanks a lot. Would be nice to add that to the Xperience documentation.

I'm gonna try lowering the value, should fix our problem.

Thanks again :)
