Monitor Scheduler Utilization in Elixir With AppSignal

When it comes to monitoring your Elixir application, it's challenging to make sense of the many metrics and statistics that you can read from the internals of the Erlang virtual machine. In this post, we'll be looking at the scheduler utilization metric in order to understand what it is, why we should monitor it, and how to monitor it.

What Is Scheduler Utilization?

In concurrent systems, scheduling is the mechanism by which work that needs to be done is assigned to the resources needed to do it. In the Erlang VM, this assignment is handled by the schedulers. By default, there is one scheduler for each logical CPU core in your system, so tasks can be performed in parallel.
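
As a quick check, the number of schedulers on a given machine can be read from an `iex` session. This is a minimal sketch using `System.schedulers/0` and `System.schedulers_online/0` from the Elixir standard library; the variable names are only illustrative.

```elixir
# Number of scheduler threads the VM was started with.
total = System.schedulers()

# Number of schedulers currently online, i.e. available to run processes.
online = System.schedulers_online()

IO.puts("schedulers: #{total}, online: #{online}")
```

On most setups both values should match the number of logical cores, unless the VM was started with flags that change the scheduler count.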

The scheduler utilization rate is a percentage value representing the amount of time that each scheduler is in use. A low scheduler utilization rate means that the scheduler is mostly idle, waiting to receive work to do, while a high scheduler utilization rate means that the scheduler has spent most of its time working on one or more tasks.

In order to ensure responsiveness, the Erlang VM is "busy waiting" when the application is idle. Because of this, metrics such as CPU usage, as provided by the OS, may not accurately represent your application's actual workload. Using the Erlang standard library to measure the scheduler utilization rate provides you with a metric that more closely corresponds to the actual work performed by your application.

Why Does Scheduler Utilization Matter?

Under full scheduler utilization, the work that needs to be done may start piling up in the schedulers' run queues. In a network application, such as a web server, this could result in increased latency: the schedulers' resources are spread thinly across many ongoing tasks, so a scheduler may take more time to attend to new tasks, such as handling incoming requests.
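
To get a feel for whether work is piling up, the run queue lengths can be inspected directly. The following is a rough sketch based on `:erlang.statistics/1`; the variable names are illustrative, and the numbers are a point-in-time snapshot rather than an average.

```elixir
# Total number of processes and ports that are ready to run but still
# waiting for a scheduler, summed over the normal run queues.
total_queued = :erlang.statistics(:total_run_queue_lengths)

# The same information broken down per run queue.
per_queue = :erlang.statistics(:run_queue_lengths)

IO.inspect(total_queued, label: "total run queue length")
IO.inspect(per_queue, label: "run queue lengths")
```

On an idle system these values should stay at or near zero. A total that keeps growing suggests that the schedulers cannot keep up with the incoming work.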

To see this in action, let's start a Phoenix server from the iex interactive Elixir shell, and start many concurrent, long-running tasks from that shell, in order to keep the scheduler busy. For this example, we will start ten thousand processes, each counting down to zero from a million:

$ iex -S mix phx.server

defmodule Countdown do
  def from(0), do: :done
  def from(count), do: from(count - 1)

  def start(count), do: Task.start(fn -> from(count) end)

  def start_many(amount, count) do
    Enum.each(1..amount, fn _ -> start(count) end)
  end
end

Countdown.start_many(10_000, 1_000_000)

If we attempt to perform requests against the Phoenix server shortly after starting these tasks, e.g. by reloading a page, we'll see that it can take up to several seconds for it to respond. The scheduling system in the Erlang VM attempts to ensure a fair distribution of execution time amongst tasks, but as the number of waiting tasks increases, it takes longer for the scheduler to get around to handling your request.

Measuring Scheduler Utilization

While these long-running tasks are ongoing, we can also see the effect they have on the scheduler utilization rate. To see the scheduler utilization over a given time span, we must first collect a sample from :scheduler.sample/0. Then, pass that sample to :scheduler.utilization/1 in order to obtain the utilization rate since its collection:

$ iex

sample = :scheduler.sample()
Countdown.start_many(10_000, 1_000_000)
# a few seconds later...
:scheduler.utilization(sample)
# => [{:total, 0.41156659296142006, '41.2%'}, ...]
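
The two calls above can be wrapped in a small helper that measures the total utilization while a given function runs. This is only a sketch: `SchedulerCheck` and `measure/1` are hypothetical names, not part of any library, and the code assumes the `:scheduler` module (shipped with Erlang's `runtime_tools` application) is available in your environment.

```elixir
defmodule SchedulerCheck do
  # Hypothetical helper for illustration; not part of Erlang, Elixir or AppSignal.

  # Runs `fun` and returns {result, total_utilization}, where the
  # utilization is a float between 0.0 and 1.0 covering the time `fun` took.
  def measure(fun) when is_function(fun, 0) do
    sample = :scheduler.sample()
    result = fun.()
    {result, total(:scheduler.utilization(sample))}
  end

  # Picks the aggregated {:total, value, percent} entry out of the
  # list returned by :scheduler.utilization/1.
  defp total(utilization) do
    {:total, value, _percent} = List.keyfind(utilization, :total, 0)
    value
  end
end

# Usage: sleeping keeps the schedulers mostly idle, so the value should be low.
{result, utilization} = SchedulerCheck.measure(fn -> Process.sleep(200) end)
IO.inspect({result, utilization}, label: "result and total utilization")
```

Besides the `:total` entry, the list also contains one entry per scheduler, which can be useful when the load is unevenly distributed.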

We should note that this is an artificial example, in the sense that it's rare for tasks to be spawned in this manner. Most often, you would use Task.async_stream instead of calling Task.start in a loop, and Task.async_stream would, by default, limit the number of concurrent tasks it runs to the number of schedulers online in your system. This would prevent the scheduler run queues from filling up, and therefore prevent response times from spiking in this manner. In this example, the saturation of the schedulers' run queues is what causes the latency increase, and the high scheduler utilization is what keeps the queues from emptying.
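
For comparison, here is a sketch of the same countdown work run through `Task.async_stream/3`. The module name `BoundedCountdown` and the function `run_many/2` are hypothetical, and the amounts are kept small so the example finishes quickly. The `max_concurrency` option is spelled out for clarity, although `System.schedulers_online()` is already its default value.

```elixir
defmodule BoundedCountdown do
  # Same countdown as in the earlier example, under a hypothetical module name.
  def from(0), do: :done
  def from(count), do: from(count - 1)

  # Runs `amount` countdowns, but never more than the number of
  # online schedulers at the same time.
  def run_many(amount, count) do
    1..amount
    |> Task.async_stream(fn _ -> from(count) end,
      max_concurrency: System.schedulers_online(),
      timeout: :infinity
    )
    |> Enum.to_list()
  end
end

results = BoundedCountdown.run_many(100, 100_000)
IO.inspect(length(results), label: "completed countdowns")
```

Unlike `Task.start`, this call blocks until all the tasks have finished and links them to the caller, so it is not a drop-in replacement for fire-and-forget work. The benefit is that only a bounded number of tasks compete for the schedulers at any given moment.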

How To Monitor Scheduler Utilization With AppSignal

Starting with version 2.2.8 of our Elixir integration, AppSignal automatically displays a graph for the scheduler utilization metric in its Erlang magic dashboard. All you need to do is add AppSignal to your app.

If you want to get a focused view on specific metrics like these, creating a specialised dashboard is a great way to do so. You can import this dashboard to have a dedicated view of the Erlang schedulers' utilization metric. Click "Add dashboard", then "Import dashboard" and copy-paste the following dashboard configuration:

{
  "title": "Erlang schedulers",
  "description": "",
  "visuals": [
    {
      "title": "Utilization",
      "line_label": "%type% #%id% - %hostname%",
      "display": "LINE",
      "format": "percent",
      "draw_null_as_zero": true,
      "metrics": [
        {
          "name": "erlang_scheduler_utilization",
          "fields": [
            {
              "field": "GAUGE"
            }
          ],
          "tags": [
            {
              "key": "hostname",
              "value": "*"
            },
            {
              "key": "id",
              "value": "*"
            },
            {
              "key": "type",
              "value": "*"
            }
          ]
        }
      ],
      "type": "timeseries"
    }
  ]
}

Let's see an example of how your dashboard might look in a healthy situation, with a locally running server processing somewhere around 300 requests per minute. The scheduler activity hovers around 10%, the run queue remains at zero, and the server is responsive.