
Best Practices for Background Jobs in Elixir

Miguel Palhas


Erlang & Elixir are ready for asynchronous work right off the bat. Generally speaking, background job systems aren't needed as much as in other ecosystems, but they still have their place for particular use cases.

This post goes through a few best practices I often try to think of in advance when writing background jobs, so that I don't hit some of the pain points that have hurt me multiple times in the past.

If you've ever deployed a new task, only to find out that a bug caused it to go rogue (e.g. sending way too many emails, way too quickly), you'll recognize the kind of problem these practices try to prevent.

Flavours

Elixir already gives you the ability to schedule asynchronous work pretty easily. Something as simple as this already covers a lot:

Task.async(fn ->
  # some heavy lifting
end)

You might need something a bit more powerful, either just for convenience (having some tooling & monitoring around that task), or because you need something like periodic jobs. Again, all this can be achieved with something like a GenServer:

defmodule PeriodicJob do
  use GenServer

  @period 60_000

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, opts)
  end

  def init(_opts) do
    Process.send_after(self(), :poll, @period)

    {:ok, :state}
  end

  def handle_info(:poll, state) do
    # some heavy lifting

    Process.send_after(self(), :poll, @period)

    {:noreply, state}
  end
end

You can also use a job scheduling library such as Quantum. If you come from Ruby land and are used to libraries such as Sidekiq, you might be more familiar with something like this:

#
# lib/my_app/scheduler.ex
#
defmodule MyApp.Scheduler do
  use Quantum.Scheduler, otp_app: :my_app
end

#
# config/config.exs
#
config :my_app, MyApp.Scheduler,
  jobs: [
    first: [
      # every hour
      schedule: "0 * * * *",
      task: {MyApp.ExampleJob, :run, []}
    ],
    second: [
      # every minute
      schedule: "* * * * *",
      task: {MyApp.AnotherExampleJob, :run, []}
    ]
  ]

Some may argue that since Erlang/OTP already provides the infrastructure for creating these processes, packages such as Quantum are not necessary. However, the structure created by them can end up being more intuitive, especially if you're not that familiar with OTP. This might be the case with someone coming from Ruby or other such communities.

How to Structure Background Jobs

Let's now get into a few tips that will help you keep your jobs ready to deal with potential future problems!

Most of them are preventive measures, precisely because these are background processes. They're not responding to an HTTP request, and they run without any intervention, so debugging can be hard if you don't take some precautions.

Let's consider a small example that sends confirmation emails to users that haven't received it yet:

defmodule MyApp.ExampleJob do
  import Ecto.Query

  def run do
    get_users()
    |> Enum.each(fn user ->
      # send single email to user
    end)
  end

  defp get_users do
    MyApp.User
    |> where(confirmation_email_sent: false)
    |> MyApp.Repo.all()
  end
end

1. Put in a Kill Switch

This is one of those mistakes I'll never make again since it has hurt me so many times.

Let's say you've created a background job, tested, deployed, and configured it to run periodically and send some emails.

It hits production, and you soon notice that something's wrong. The same 100 people are being spammed with emails every minute. You messed up the function that fetches the next batch of users (get_users/0 in our example), and it always goes over the same batch. It's a developer's horror story. You need to fix it (or kill it) quickly, but all that time waiting for a new release to get online is physically painful.

So, avoid that:

defmodule MyApp.ExampleJob do
  def run do
    if enabled?() do
      # ...
    end
  end

  defp enabled? do
    # check a Redis flag, or a database record, or anything really
  end
end

You can plug in some persistent system that allows you to quickly toggle the job on/off. A good suggestion would be to use a feature flag package, such as FunWithFlags.
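As a rough sketch, the enabled?/0 check could be backed by FunWithFlags like this (the :example_job flag name is just an assumption for illustration):

defmodule MyApp.ExampleJob do
  def run do
    if enabled?() do
      # ...
    end
  end

  # Backed by FunWithFlags, so the job can be toggled without a deploy.
  # The :example_job flag name is arbitrary.
  defp enabled? do
    FunWithFlags.enabled?(:example_job)
  end
end

With that in place, calling FunWithFlags.disable(:example_job) from a remote IEx session is enough to stop a runaway job while you prepare a proper fix.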

2. Always Batch Your Jobs by Default

It's easy to miss this one on a first draft. You're just trying to quickly get something online. But batching becomes important when you're working on a very resource-intensive job, or simply when your list of records to process grows too quickly.

Doing User |> where(confirmation_email_sent: false) |> Repo.all() can be dangerous if there's potential for that to yield too many results. You may end up consuming too many resources for something that could be done in smaller batches, keeping your system a lot more stable:

defmodule MyApp.ExampleJob do
  import Ecto.Query

  @batch_size 100

  defp get_users do
    MyApp.User
    |> where(confirmation_email_sent: false)
    |> limit(^@batch_size)
    |> MyApp.Repo.all()
  end
end

Whatever job queue mechanism you plug this worker into, it will end up being called frequently, so there's no need to rush: processing smaller batches one at a time will still get through every record eventually.
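For this to work, each record also has to be marked as processed inside the batch, otherwise every run would pick up the same 100 users again. A minimal sketch, assuming the confirmation_email_sent field used in the query above:

defmodule MyApp.ExampleJob do
  # Flipping the flag that get_users/0 filters on means the next run
  # naturally moves on to the following batch of users.
  defp process_user(user) do
    # ... deliver the confirmation email here, via your mailer of choice ...

    user
    |> Ecto.Changeset.change(confirmation_email_sent: true)
    |> MyApp.Repo.update!()
  end
end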

3. Avoid Overlaps

This is kind of related to the previous point, but it's a concern that goes beyond performance.

If you program a job to run every minute, and a single execution has the potential to last longer than that, you end up risking cascading performance problems, or even worse, race conditions, where the first and second executions are both trying to process the same set of data, and conflict with each other in the process.

This is obviously dependent on what your exact business logic is, but as a general rule, it's best to be defensive here.

If you use a GenServer approach like the one showcased above, this is solved automatically: rather than scheduling jobs every minute, you use Process.send_after(self(), :poll, delay) to only schedule the next run after the current one has finished, avoiding overlap.

When using Quantum, you also have an overlap: false option that you can add to automatically prevent this:

config :my_app, MyApp.Scheduler,
  jobs: [
    first: [
      # every hour
      schedule: "0 * * * *",
      task: {MyApp.ExampleJob, :run, []},
      overlap: false
    ]
  ]

This, by the way, might already be reason enough to consider using a package rather than just plain Elixir.

4. Plug in a Manual Mode

If your job is processing a batch of records, it's useful to plug in some public functions that allow you to manually process specific records. This can serve two purposes:

  • Better ability to debug the job
  • Ability to do a few manual runs before enabling the global job (by toggling the feature flag discussed above)

A sample structure could look like this:

defmodule MyApp.ExampleJob do
  import MyApp.Lock

  alias MyApp.User

  def run do
    lock("example_job", fn ->
      get_users()
      |> Enum.each(&process_user/1)
    end)
  end

  def run_manually(users) when is_list(users) do
    lock("example_job", fn ->
      users
      |> Enum.each(&process_user/1)
    end)
  end

  def run_manually(user), do: run_manually([user])

  def process_user(%User{} = user) do
    # process a single user
  end
end

In this case, we're creating a run_manually/1 public function that can receive either a single user or a batch of them and performs the same logic as the automatic job would.
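In practice, this lets you exercise the job from a remote IEx session before flipping the feature flag on. The query and user id below are purely illustrative:

import Ecto.Query

# Process a small, hand-picked batch first.
users =
  MyApp.User
  |> where(confirmation_email_sent: false)
  |> limit(5)
  |> MyApp.Repo.all()

MyApp.ExampleJob.run_manually(users)

# Or a single record, using the single-user clause.
MyApp.User |> MyApp.Repo.get!(42) |> MyApp.ExampleJob.run_manually()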

One important detail here is, once again, avoiding a race condition. In this case, that's handled by a custom Lock module that uses the redis_mutex package:

defmodule MyApp.Lock do
  use RedisMutex
  require Logger

  def lock(lock_name, fun) do
    with_lock(lock_name, 60_000) do
      fun.()
    end
  rescue
    _e in RedisMutex.Error ->
      Logger.debug("#{lock_name} another process already running")
  end
end

The lock, which is invoked both on manual runs and by the regular background job, ensures that you won't cause any unintentional conflicts if you try to do a manual run at the same time the job is doing the same processing. It also happens to solve the overlap problem discussed previously in this post.

Conclusion

All of these tips come from problems I've bumped into in the past, usually production bugs or user complaints. So I hope some of them help you avoid the same mistakes. Let me know if you have any further thoughts! 👋

P.S. If you'd like to read Elixir Alchemy posts as soon as they get off the press, subscribe to our Elixir Alchemy newsletter and never miss a single post!

