Supervisors: Building fault-tolerant Elixir applications

We briefly touched on supervision when we talked about processes in the first edition of Elixir Alchemy. In this edition, we'll take it a step further by explaining how supervision works in Elixir, and we'll give you an introduction to building fault-tolerant applications.

A phrase you'll likely run into when reading up on fault tolerance in Elixir and Erlang is "Let it crash". Instead of preventing exceptions from happening, or catching them immediately when they occur, you're usually advised not to program defensively. That might sound counter-intuitive: how do crashing processes help with building fault-tolerant applications? Supervisors are the answer.

Fault tolerance

Instead of taking down the whole system when one of its components fails, fault-tolerant applications can recover from exceptions by restarting the affected parts while the rest of the system keeps running.

In Elixir, supervisors are tasked with restarting processes when they fail. Instead of trying to handle all possible exceptions within a process, the "Let it crash"-philosophy shifts the burden of recovering from such failures to the process' supervisor.

The supervisor makes sure the process is restarted if needed, bringing it back to its initial state, ready to accept new messages.

Supervisors

To see how supervisors work, we'll use a GenServer with some state. We'll implement a cast to store a value, and a call to retrieve that value later.

When started, our GenServer sets its initial state to :empty and registers itself by the name :cache, so we can refer to it later.

# lib/cache.ex
defmodule Cache do
  use GenServer

  def start_link() do
    GenServer.start_link(__MODULE__, :empty, [name: :cache])
  end

  # Use the value passed by start_link/0 (:empty) as the initial state.
  def init(state) do
    {:ok, state}
  end

  def handle_call(:get, _from, state) do
    {:reply, state, state}
  end

  def handle_cast({:save, new}, _state) do
    {:noreply, new}
  end
end

Let's jump into IEx to supervise our GenServer. The Supervisor has to be started with a list of workers. In our case, we'll use a single worker with the module name (Cache), and an empty list of arguments (because Cache.start_link/0 doesn't take any).

$ iex -S mix
iex(1)> import Supervisor.Spec
Supervisor.Spec
iex(2)> {:ok, _pid} = Supervisor.start_link([worker(Cache, [])], strategy: :one_for_one)
{:ok, #PID<0.120.0>}
iex(3)> GenServer.call(:cache, :get)
:empty
iex(4)> GenServer.cast(:cache, {:save, :hola})
:ok
iex(5)> GenServer.call(:cache, :get)
:hola
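
Supervisor.Spec has been deprecated since Elixir 1.5 in favor of child specifications. On a newer Elixir version, the same setup might look roughly like the sketch below. It assumes a start_link/1 that accepts (and ignores) an argument, so the default child_spec/1 generated by use GenServer can be used. The module name ModernCache and the registered name :modern_cache are made up for this example.

```elixir
defmodule ModernCache do
  use GenServer

  # start_link/1 matches the child spec that `use GenServer` generates.
  def start_link(_opts) do
    GenServer.start_link(__MODULE__, :empty, name: :modern_cache)
  end

  @impl true
  def init(state), do: {:ok, state}

  @impl true
  def handle_call(:get, _from, state), do: {:reply, state, state}

  @impl true
  def handle_cast({:save, new}, _state), do: {:noreply, new}
end

# A bare module name in the children list is shorthand for {ModernCache, []}.
{:ok, _sup} = Supervisor.start_link([ModernCache], strategy: :one_for_one)

:ok = GenServer.cast(:modern_cache, {:save, :hola})
value = GenServer.call(:modern_cache, :get)
IO.inspect(value)
```

Passing the module itself keeps the restart settings next to the worker's code, which is the main design difference from the worker/2 helper.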

If the process crashes, our supervisor will automatically restart it. Let's try that by killing the process manually.

...
iex(6)> pid = Process.whereis(:cache)
#PID<0.121.0>
iex(7)> Process.exit(pid, :kill)
true
iex(8)> GenServer.call(:cache, :get)
:empty
iex(9)> Process.whereis(:cache)
#PID<0.127.0>

As you can see, the :cache process was restarted by our supervisor immediately when it crashed, and getting its value revealed that it returned to its initial state (:empty).
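
In IEx, the time it takes to type the next command is more than enough for the supervisor to finish restarting the process. A script has no such pause, so it may need to wait for the new process explicitly. The sketch below shows one way to do that. It uses an Agent registered as :demo_cache as a stand-in for the Cache module, and RestartDemo.await_restart/3 is a hypothetical helper written for this example, not part of Elixir.

```elixir
defmodule RestartDemo do
  # Polls until `name` is registered to a pid other than `old_pid`.
  def await_restart(name, old_pid, attempts \\ 50)

  def await_restart(_name, _old_pid, 0), do: {:error, :timeout}

  def await_restart(name, old_pid, attempts) do
    case Process.whereis(name) do
      pid when is_pid(pid) and pid != old_pid ->
        {:ok, pid}

      _ ->
        Process.sleep(10)
        await_restart(name, old_pid, attempts - 1)
    end
  end
end

# Stand-in for Cache: an Agent holding :empty, registered as :demo_cache.
child = %{
  id: :demo_cache,
  start: {Agent, :start_link, [fn -> :empty end, [name: :demo_cache]]}
}

{:ok, _sup} = Supervisor.start_link([child], strategy: :one_for_one)

Agent.update(:demo_cache, fn _ -> :hola end)
old_pid = Process.whereis(:demo_cache)
Process.exit(old_pid, :kill)

# Wait for the supervisor to register the replacement process.
{:ok, new_pid} = RestartDemo.await_restart(:demo_cache, old_pid)
restarted_state = Agent.get(:demo_cache, & &1)

IO.inspect(new_pid != old_pid)
IO.inspect(restarted_state)
```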

Dynamic Supervisor

In our first example, the process we supervised was built to run indefinitely. In some cases, however, you'd want your application to spawn processes when needed, and shut them down when their work is done.

Imagine that we want to track football matches. When a match starts, we'll start a process. We'll send messages to that process to update the score, and this process will live until the match ends.

To try this out, we'll define another GenServer named FootballMatchTracker, which we can use to store and fetch the current score for both teams.

# lib/football_match_tracker.ex
defmodule FootballMatchTracker do
  use GenServer

  def start_link([match_id: match_id]) do
    GenServer.start_link(__MODULE__, :ok, [name: match_id])
  end

  def new_event(match_id, event) do
    GenServer.cast(match_id, {:event, event})
  end

  def get_score(match_id) do
    GenServer.call(match_id, :get_score)
  end

  def init(:ok) do
    {:ok, %{home_score: 0, away_score: 0}}
  end

  def handle_call(:get_score, _from, state) do
    {:reply, state, state}
  end

  def handle_cast({:event, event}, state) do
    case event do
      "home_goal" -> {:noreply, %{state | home_score: state[:home_score] + 1}}
      "away_goal" -> {:noreply, %{state | away_score: state[:away_score] + 1}}
      # Stopping with :normal is not a crash, so a :transient child is not restarted.
      "end" -> {:stop, :normal, state}
    end
  end
end

Next, we'll implement a supervisor for FootballMatchTracker.

# lib/football_match_supervisor.ex
defmodule FootballMatchSupervisor do
  use Supervisor

  def start_link do
    Supervisor.start_link(__MODULE__, [], [name: :football_match_supervisor])
  end

  def init([]) do
    children = [
      worker(FootballMatchTracker, [], restart: :transient)
    ]

    supervise(children, strategy: :simple_one_for_one)
  end
end

Each FootballMatchTracker is registered under the match identifier passed to its start_link/1 function. Since the Supervisor behaviour is a GenServer under the hood, we can use GenServer features such as name registration, like we did before. In this case, we'll use :football_match_supervisor.

Let's take our supervisor for a spin. We'll start a child with a :match_id, check the initial state, and add a home goal.

$ iex -S mix
iex(1)> FootballMatchSupervisor.start_link()
{:ok, #PID<0.119.0>}
iex(2)> Supervisor.start_child(:football_match_supervisor, [[match_id: :match_123]])
{:ok, #PID<0.121.0>}
iex(3)> FootballMatchTracker.get_score(:match_123)
%{away_score: 0, home_score: 0}
iex(4)> FootballMatchTracker.new_event(:match_123, "home_goal")
:ok
iex(5)> FootballMatchTracker.get_score(:match_123)
%{away_score: 0, home_score: 1}

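When we send an event the tracker doesn't know ("goal" has no matching clause in our GenServer's case expression), we'll get an exception, and the process will crash. The exact log output depends on your Elixir and OTP versions; the part to look for is the {:case_clause, "goal"} reason.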

iex(6)> FootballMatchTracker.new_event(:match_123, "goal")
:ok
13:13:44.658 [error] GenServer :match_123 terminating
** (UndefinedFunctionError) function FootballMatchTracker.terminate/2 is undefined or private
    (supervisors_example) FootballMatchTracker.terminate({{:case_clause, "goal"}, [{FootballMatchTracker, :handle_cast, 2, [file: 'lib/football_match_tracker.ex', line: 24]}, {:gen_server, :try_dispatch, 4, [file: 'gen_server.erl', line: 601]}, {:gen_server, :handle_msg, 5, [file: 'gen_server.erl', line: 667]}, {:proc_lib, :init_p_do_apply, 3, [file: 'proc_lib.erl', line: 247]}]}, %{away_score: 0, home_score: 1})
    (stdlib) gen_server.erl:629: :gen_server.try_terminate/3
    (stdlib) gen_server.erl:795: :gen_server.terminate/7
    (stdlib) proc_lib.erl:247: :proc_lib.init_p_do_apply/3
Last message: {:"$gen_cast", {:event, "goal"}}
State: %{away_score: 0, home_score: 1}
iex(8)> Process.whereis(:match_123)
#PID<0.127.0>

Because we used :transient as the restart option, the supervisor only restarts its children on abnormal termination, like the exception above. The :simple_one_for_one strategy is what lets us start any number of these children on demand. Like before, the process is restarted, which brings it back to its initial state.

When we stop the process using the "end"-message, the supervisor won't restart it.

iex(9)> FootballMatchTracker.new_event(:match_123, "end")
:ok
iex(10)> Process.whereis(:match_123)
nil
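
The supervisor above relies on the :simple_one_for_one strategy, which has been deprecated since Elixir 1.6 in favor of the DynamicSupervisor module. As a rough sketch of how the same idea could look there, the example below uses a trimmed-down tracker. The names MatchTracker and :match_supervisor are made up for this example, and the tracker only handles "home_goal" and "end" to keep it short.

```elixir
defmodule MatchTracker do
  # restart: :transient means "restart only after an abnormal exit".
  use GenServer, restart: :transient

  def start_link([match_id: match_id]) do
    GenServer.start_link(__MODULE__, :ok, name: match_id)
  end

  @impl true
  def init(:ok), do: {:ok, %{home_score: 0, away_score: 0}}

  @impl true
  def handle_call(:get_score, _from, state), do: {:reply, state, state}

  @impl true
  def handle_cast({:event, "home_goal"}, state) do
    {:noreply, %{state | home_score: state.home_score + 1}}
  end

  def handle_cast({:event, "end"}, state) do
    # A :normal stop is not a crash, so the child is not restarted.
    {:stop, :normal, state}
  end
end

{:ok, _sup} = DynamicSupervisor.start_link(strategy: :one_for_one, name: :match_supervisor)

# Children are added at runtime; the tuple becomes MatchTracker.child_spec/1's argument.
{:ok, pid} =
  DynamicSupervisor.start_child(:match_supervisor, {MatchTracker, [match_id: :match_123]})

GenServer.cast(:match_123, {:event, "home_goal"})
score = GenServer.call(:match_123, :get_score)
IO.inspect(score)

# Monitor the tracker so we can observe why it exits.
ref = Process.monitor(pid)
GenServer.cast(:match_123, {:event, "end"})

exit_reason =
  receive do
    {:DOWN, ^ref, :process, ^pid, reason} -> reason
  after
    1_000 -> :timeout
  end

IO.inspect(exit_reason)
```

Here the restart option lives on the worker (use GenServer, restart: :transient) instead of in the supervisor's child list, so every tracker started through the supervisor picks it up.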

Inside the Supervisor

Now that we've seen some examples of how to use supervisors, let's take it a step further, and try to figure out how they work internally.

A supervisor is basically a GenServer with the capability of starting, supervising and restarting processes. The child processes are linked to the supervisor, and because the supervisor traps exits, it receives an :EXIT message whenever one of its children crashes, which prompts it to restart that child.

So, if we want to implement our own supervisor, we need to start a linked process for each of its children. If one crashes, we'll catch the :EXIT message, and we'll start it again.
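
Before building the full supervisor, the mechanism can be tried out in isolation. The snippet below is a minimal sketch using a plain linked process rather than a GenServer, with :boom as an arbitrary exit reason picked for this example.

```elixir
# Without this flag, the linked child's crash would take this process down too.
Process.flag(:trap_exit, true)

# Start a linked process that exits abnormally right away.
child = spawn_link(fn -> exit(:boom) end)

# With exits trapped, the crash arrives as an ordinary message.
reason =
  receive do
    {:EXIT, ^child, reason} -> reason
  after
    1_000 -> :timeout
  end

IO.inspect(reason)
```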

defmodule MySupervisor do
  use GenServer

  def start_link(args, opts) do
    GenServer.start_link(__MODULE__, args, opts)
  end

  def init([children: children]) do
    Process.flag(:trap_exit, true) # for handling EXIT messages

    # Start every child and remember which module each pid belongs to.
    state =
      Map.new(children, fn child ->
        {:ok, pid} = child.start_link()
        {pid, child}
      end)

    {:ok, state}
  end

  def handle_info({:EXIT, from, reason}, state) do
    IO.puts "Exit pid: #{inspect from} reason: #{inspect reason}"
    child = state[from]
    {:ok, pid} = child.start_link()
    # Drop the dead pid and track the replacement instead.
    {:noreply, state |> Map.delete(from) |> Map.put(pid, child)}
  end
end

Let's try it with our Cache module:

$ iex -S mix
iex(1)> MySupervisor.start_link([children: [Cache]], [])
{:ok, #PID<0.108.0>}
iex(2)> GenServer.cast(:cache, {:save, :hola})
:ok
iex(3)> Process.whereis(:cache)
#PID<0.109.0>

If we kill the process, like we did before, our custom supervisor will automatically restart it.

iex(4)> :cache |> Process.whereis |> Process.exit(:kill)
Exit pid: #PID<0.109.0> reason: :killed
true
iex(5)> Process.whereis(:cache)
#PID<0.113.0>
iex(6)> GenServer.call(:cache, :get)
:empty

Here's what happened:

  1. Our supervisor receives a list of child modules through the start_link/2 function, which are started by the init/1 function.
  2. By calling Process.flag(:trap_exit, true), we make sure the supervisor doesn't crash when one of its children does.
  3. Instead, the supervisor receives an :EXIT message. When that happens, our supervisor looks up the crashed process's module in its state by pid and starts it again in a new process.

Conclusion

By learning how to use the Supervisor behaviour module, we learned quite a bit about building fault-tolerant applications in Elixir. Of course, there's more to supervisors than we could cover in this article, and the different options (like restarting strategies) can be found in the Elixir documentation.

We’d love to know how you liked this article, if you have any questions about it, and what you’d like to read about next, so be sure to let us know at @AppSignal.