Graceful Scaling of Queue Workers

Background:

One of the most utilized products of Locus – Dispatcher, is used to solve the constrained vehicle routing problem. The constrained vehicle routing problem is a hard problem in computer science – an extension of the travelling salesman problem, where instead of a single salesman visiting multiple cities or customers, we now have a fleet of salesmen visiting customers, each with their own constraints such as weight and volumetric capacities, distance, time slots, skills, locality familiarity, operating hours and fairness among salesmen.

The objective could be reducing the total time or distance on the road while meeting all the constraints. Solving such a complex problem is non-trivial. For the case of a single salesman visiting 80 customers, the number of permutations is 80! (that is 80 factorial, which is 80 x 79 x 78 … x 1, which if calculated comes out to be ~7.15 x 10118 a number so large, its value is greater than the number of atoms in the known universe! (1078 ~ 1082)

Solving such a problem might appear like magic to our end users, but engineers truly understand the complexity of doing this on production systems at scale! We’ve developed our own algorithms to solve the vehicle routing problem at Locus, taking into account all the real world constraints that our clients might need. Our systems could take from 5 mins to 60 mins to get a good solution for most use cases, depending on the number of tasks, vehicles and constraints. As output, the system spits out a plan, which is a collection of tours – sequences of tasks that need to be performed by each salesman

Architecture Overview and Working:-

A simplified view of our architecture on AWS is as follows:

A request for plan generation (data to be optimized – consisting of tasks, vehicles and configurations) is received by a load balancer(ELB) and directed to one of the attached API servers(EC2 instances managed by Elastic Beanstalk in our case). The API server adds a record containing the input data to the database(we use Aurora DB) and adds an element to the queue(we use SQS) containing an identifier to the record. A machine from a group of high compute workers – 16 core machines with Intel Xeon Platinum 8000 series processors(C5.4XLarge machine) – picks up the plan to process it. Each worker machine runs only one plan at a time, utilizing all the machines cores for optimal speed, since the plan optimization is highly demanding on compute power. During processing, there is a regular heartbeat signal sent to the queue, to keep the message invisible to other workers. Once a plan is completed, its message is deleted from the queue.

Autoscaling:

The number of plan generation requests varies throughout the day. It would be highly inefficient to use a fixed number of worker nodes – using fewer workers would mean longer wait times for customers during peak hours, while using more workers would result in significantly higher costs, most of which would be wasted while workers remain idle. Therefore we’ve used autoscaling rules to horizontally scale the group in and out based on the load.

The Problem:

Scaling out is pretty straightforward – we use custom metrics based on the queue length that trigger autoscaling rules; more messages in the queue imply that more workers are needed to process them. The challenge is when scaling in; how do we let auto scaling reduce the number of instances, while making sure an instances running a plan is not terminated?

Our Solution:

To solve this problem, we’ve used a feature provided by AWS called Autoscaling Lifecycle Hooks. Lifecycle hooks allow you to perform custom actions during instance launch or termination. We’ve configured the lifecycle hooks of the autoscaling group managing the worker nodes to emit an event into an SQS queue prior to instance termination. This can be done by going to Autoscaling Groups in the EC2 console, opening the Lifecycle Hooks tab and creating a new Lifecycle Hook for instance termination. Then go to the Rules section in the Cloudwatch console and create a new rule – An Event Pattern with Service Name as Autoscaling, event type as Instance Launch and Terminate and choose EC2 instance terminate action in specific events, and choose the AutoScalingGroup(s) to apply this to. In Targets set SQS queue and provide a queue name to receive the lifecycle event messages.

The queue message contains details about the instance to be terminated(instance id), the auto scaling group which the instance is part of, and the lifecycle-hook event. These fields are required when sending heartbeat events or for completing the lifecycle action. We could have a Lambda poll the SQS queue and send commands to terminate specific instances based on their state, but a better approach is for each instance to check whether it needs to be terminated by checking some shared memory containing this information. We used a table in AuroraDB for this.  The queue message is picked up by any worker node which writes these details to the database and deletes the message from the queue.

Every instance of the queue worker group periodically queries the database, roughly once a minute. If it finds a termination entry against its autoscaling group and instance id, it prepares itself for shutdown – it stops polling the queue for new plans. If it is currently processing a plan(in Java, we use a synchronized hash set in a static variable to store the currently running plans), it sends a heartbeat1 event to AWS Autoscaling, which extends the timeout length before shutdown. This process repeats as long as the plan is being processed. If it is not processing a plan, it passes a ‘continue’ message to AWS Autoscaling, which continues2 with the instance termination, and then deletes the database entry. This ensures that the instance does not prematurely terminate while processing a plan, but also that it gets shut down soon after completing the plan generation.

*We call AmazonAutoScaling#recordLifecycleActionHeartbeat with a request parameter, setting autoScalingGroupName, instanceId and lifecycleHookName within it.

*We call AmazonAutoScaling#completeLifecycleAction with a request parameter, setting autoScalingGroupName, instanceId, lifecycleHookName and set lifecycleActionResult to “CONTINUE” within it.

To understand more about how we solve the Vehicle Routing Problem, visit Locus.

Leave a Reply