A typical latency-critical service is based on client-server interaction: the client sends a request, and the server-side application must respond within a given time frame. This time frame is typically referred to as the Quality of Service (QoS) target. To cope with user demand, such services are scaled across numerous servers. The question is how to maintain QoS, generally for any application that runs in a cluster environment, while aiming to minimise energy consumption or maximise throughput (by running batch jobs) [2,3].
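As a minimal sketch of what "meeting a QoS target" means in practice, the snippet below checks whether the tail (95th-percentile) latency of a batch of requests stays under a target. The 10 ms target and the simulated latencies are illustrative assumptions, not values from any particular service.

```python
import random

QOS_TARGET_MS = 10.0  # assumed target; real services define their own


def p95(latencies):
    """Return the 95th-percentile latency of a list of samples."""
    ordered = sorted(latencies)
    index = min(len(ordered) - 1, int(0.95 * len(ordered)))
    return ordered[index]


def meets_qos(latencies, target_ms=QOS_TARGET_MS):
    """True if the tail latency stays within the QoS target."""
    return p95(latencies) <= target_ms


# Simulated workload: mostly fast requests, a few slow tail requests.
samples = [random.uniform(1, 8) for _ in range(95)] + \
          [random.uniform(12, 20) for _ in range(5)]
print(meets_qos(samples))
```

QoS is usually stated over the tail rather than the mean, because a small fraction of slow requests dominates the user-perceived experience.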
For this, we need to do the following:
1. Run latency-critical services in a multi-node scenario. Examples of latency-critical services are given by the programs in benchmark suites such as TailBench.
2. Experiment with and understand an existing cluster manager (Uber's Peloton).
3. Apply machine-learning techniques (such as reinforcement learning) to optimise an existing cluster scheduler (placement-management strategies).
Research questions we would like to answer:
1. How are workloads currently scaled/distributed across multiple nodes to cope with user demand? I.e., exploring containers, Apache Mesos, etc.
2. How do cluster schedulers work, and hypothetically, what is required to build one? (We will learn this from point (2))
3. How can existing strategies be optimised with machine-learning techniques?
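To give a flavour of question (3), the sketch below uses an epsilon-greedy bandit, one of the simplest reinforcement-learning techniques, to learn which node keeps latency lowest when placing requests. The node names and the reward signal (negative observed latency) are illustrative assumptions, not part of any real cluster scheduler.

```python
import random
from collections import defaultdict


class EpsilonGreedyPlacer:
    """Learns a placement preference from observed per-node rewards."""

    def __init__(self, nodes, epsilon=0.1):
        self.nodes = list(nodes)
        self.epsilon = epsilon                 # exploration probability
        self.value = defaultdict(float)        # running reward estimate per node
        self.count = defaultdict(int)          # placements per node

    def choose(self):
        """Explore a random node with probability epsilon, else exploit."""
        if random.random() < self.epsilon:
            return random.choice(self.nodes)
        return max(self.nodes, key=lambda n: self.value[n])

    def update(self, node, reward):
        """Incrementally update the node's average reward."""
        self.count[node] += 1
        self.value[node] += (reward - self.value[node]) / self.count[node]


# Usage: reward is the negative observed latency, so lower latency is better.
placer = EpsilonGreedyPlacer(["node-a", "node-b", "node-c"])
for _ in range(1000):
    node = placer.choose()
    latency = {"node-a": 5.0, "node-b": 9.0, "node-c": 7.0}[node] + random.random()
    placer.update(node, -latency)
```

A real scheduler would of course condition on system state (load, interference, co-located batch jobs) rather than learn a single static preference, but the explore/exploit trade-off shown here is the core idea.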
- Interest in reading [large volumes of] research papers.
- Development skills in Python.
- Willing to learn a new language such as Go.
- Domain knowledge of machine learning.
If you think you hit all the check-boxes, I’d love to discuss the details with you!
Advisers: Rajiv Nishtala and Magnus Själander