A Guide to Hot Code Reloading in Elixir

Elixir (or Erlang) offers great benefits for building software, including concurrency, scalability and reliability.

In this series, we will examine how to make the most of these benefits in your production code upgrades. This article will focus on hot code reloading and upgrades. But before we dive in, let's quickly define OTP.

What Is OTP in Elixir?

Formally, Erlang/OTP is a specific implementation of the Erlang Runtime System, i.e. a set of libraries, compilers and a VM implementation.

Informally, OTP often denotes a set of principles to build robust apps in Erlang and the corresponding set of built-in libraries.

Hot Code Reload: Tackling the Uncertainties

The term is used loosely, so it's worth pinning down what we mean by it.

When we speak of hot code reload or hot code upgrade, we usually mean an ability to change a running process behavior without any negative impact on that process. For example, we may change the behavior of a process that holds a TCP connection without terminating this connection.

The uncertainty is about scale. The question is whether we can upgrade:

a module
a package (an application in terms of OTP)
a whole running instance (a release in terms of OTP)

OTP offers tools for upgrading at any scale. In this article, we will consider application level upgrades.

How Are OTP and Hot Code Upgrades Related?

As we will see, hot code upgrades on a larger scale (application and release levels) work only for systems built according to OTP principles.

Hot Code Upgrades: The Basics

A good starting point to understand hot code reload is Hot Code Reloading in Elixir.

It explains the following key points:

How to reload code for a single module
How two versions of code exist after loading a new version of the module: new code and old code
The importance of external calls, which make it possible to transition from old code to new code
How GenServer helps us make such a transition seamlessly

At this point, I'd like to highlight one important concept in-depth: code purge.

Should You Code Purge in Elixir?

What happens if we want to upgrade code two or more times?

Let's create a small mix project:

mix new code_purge
cd code_purge

Then update lib/code_purge.ex to the following:

# lib/code_purge.ex
defmodule CodePurge do
  def pi do
    3.14
  end
end

Now launch an iex shell with mix:

iex -S mix
iex(1)> CodePurge.pi
3.14

Then update lib/code_purge.ex to:

# lib/code_purge.ex
defmodule CodePurge do
  def pi do
    3.142
  end
end

And recompile the project in a separate shell:

mix compile

In our iex shell, we reload the module code:

iex(2)> :code.load_file(CodePurge)
{:module, CodePurge}
iex(3)> CodePurge.pi
3.142

Everything worked as expected. :code.load_file/1 found the updated Elixir.CodePurge.beam in the _build/dev/lib/code_purge/ebin folder (as mix sets up code paths for us) and reloaded it.
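To see where the code server looks, you can ask it directly. This is a minimal sketch, using the standard library's Enum module as a stand-in so that it runs outside the toy project. In the iex session above, you would pass CodePurge instead.

```elixir
# :code.which/1 returns the location of the .beam file a module is loaded
# from. Enum stands in for CodePurge so the snippet runs anywhere.
beam_path = :code.which(Enum)
IO.puts("Enum was loaded from #{beam_path}")

# :code.get_path/0 lists the directories the code server searches, in order.
# Under iex -S mix, the project's ebin directory is among them.
search_dirs = :code.get_path()
IO.puts("The code path has #{length(search_dirs)} entries")
```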

But what happens if we try to reload this module once more, without actually changing it?

iex(4)> :code.load_file(CodePurge)
{:error, :not_purged}

What Went Wrong Here?

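That doesn't work. Erlang keeps at most two versions of a module loaded, the current code and the old code, and the old slot is still occupied by the version we replaced during the previous reload.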
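The two-version limit can be observed directly. The following is a sketch, not part of the walkthrough above. It compiles a throwaway module in memory (TwoVersions is a hypothetical name) and reloads the same binary with :code.load_binary/3, on the assumption that it follows the same current/old rules as :code.load_file/1.

```elixir
# Compile a throwaway module in memory. Code.compile_string/1 loads it and
# returns the compiled binary, which we reuse to simulate reloads.
[{mod, beam}] =
  Code.compile_string("""
  defmodule TwoVersions do
    def pi, do: 3.14
  end
  """)

# No reload has happened yet, so there is no old version.
before_reload = :erlang.check_old_code(mod)

# First reload: the running version becomes "old", the new one "current".
first_reload = :code.load_binary(mod, ~c"nofile", beam)
after_reload = :erlang.check_old_code(mod)

# Second reload: the old slot is still occupied, so the load is refused.
second_reload = :code.load_binary(mod, ~c"nofile", beam)

IO.inspect({before_reload, first_reload, after_reload, second_reload})
```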

To overcome this, the :code module has two more functions: :code.purge/1 and :code.soft_purge/1.

A purge evicts the old code:

iex(5)> :code.purge(CodePurge)
false
iex(6)> :code.load_file(CodePurge)
{:module, CodePurge}

We can upgrade the code of the module again after the purge. But why do we even need to control that? Why not purge code automatically?

Well, there may still be processes running old code, and we should decide what to do with them during the upgrade. This is also why there are two functions:

:code.purge/1: kills any processes still running old code, then removes the old code
:code.soft_purge/1: removes the old code only if no process is running it; otherwise it returns false and leaves everything untouched
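The difference between the two functions can be sketched with a throwaway process. Everything here is illustrative: Lingerer is a hypothetical module, compiled in memory and reloaded with :code.load_binary/3 so that the snippet is self-contained. The process is spawned without a link, so killing it doesn't take the caller down.

```elixir
[{mod, beam}] =
  Code.compile_string("""
  defmodule Lingerer do
    def loop do
      receive do
        {:ping, from} ->
          send(from, :pong)
          # Local call: the process stays in the version it started in.
          loop()
      end
    end
  end
  """)

pid = spawn(mod, :loop, [])

# Handshake, so we know the process is executing Lingerer code.
send(pid, {:ping, self()})

:ok =
  receive do
    :pong -> :ok
  after
    1000 -> :timeout
  end

# Reload: the version the process is running becomes the old code.
{:module, ^mod} = :code.load_binary(mod, ~c"nofile", beam)

# soft_purge refuses to purge while a process still runs old code.
soft = :code.soft_purge(mod)

# purge kills such processes, then removes the old code.
hard = :code.purge(mod)

IO.inspect({soft, hard, Process.alive?(pid)})
```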

This has an important consequence: if we upgrade a module more than once and use :code.purge/1 in between, any process still running the old code is killed.

Let's illustrate this.

How Not to Do a Code Upgrade

First, add file lib/code_purge/pi.ex to your toy project with the following content:

# lib/code_purge/pi.ex
defmodule CodePurge.Pi do
  def start_link do
    spawn_link(&server/0)
  end

  def server do
    receive do
      {:get, from} ->
        send(from, {:ok, 3.14})
        CodePurge.Pi.server()
    end
  end

  def get(pid) do
    send(pid, {:get, self()})

    receive do
      {:ok, value} ->
        {:ok, value}
    after
      1000 ->
        :error
    end
  end
end

Then, run an iex shell, spawn a server and check that everything is fine:

iex(1)> pid = CodePurge.Pi.start_link()
#PID<0.140.0>
iex(2)> CodePurge.Pi.get(pid)
{:ok, 3.14}

Now, reload the module once (without any actual changes to functions) and try to purge it so that you can do the next 'upgrade':

iex(3)> :code.load_file(CodePurge.Pi)
{:module, CodePurge.Pi}
iex(4)> :code.purge(CodePurge.Pi)
** (EXIT from #PID<0.152.0>) shell process exited with reason: killed

What Happened Here?

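As expected, your server just died, and because it was started with spawn_link, it took the linked shell process down with it. Even the fully qualified call to CodePurge.Pi.server/0 couldn't save you: the server received no message after the first upgrade, so it never reached that call and never transitioned to the new code.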
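For contrast, here is a sketch of the case where the fully qualified call does help. The server handles messages between the reload and the purge, so it has already moved to the new code by the time the old code is removed. Walker is a hypothetical module, compiled in memory and reloaded with :code.load_binary/3 to keep the snippet self-contained.

```elixir
[{mod, beam}] =
  Code.compile_string("""
  defmodule Walker do
    def loop do
      receive do
        {:ping, from} ->
          send(from, :pong)
          # Fully qualified call: jumps to the current version of the module.
          __MODULE__.loop()
      end
    end
  end
  """)

# Helper: send a ping and wait for the matching pong.
ping = fn pid ->
  send(pid, {:ping, self()})

  receive do
    :pong -> :ok
  after
    1000 -> :timeout
  end
end

pid = spawn(mod, :loop, [])
:ok = ping.(pid)

# Reload. A server parked in receive keeps running the version it started
# in, which is now the old code.
{:module, ^mod} = :code.load_binary(mod, ~c"nofile", beam)

# Handling a message ends with the fully qualified call, which moves the
# server to the new version. After two round trips, the second ping can
# only have been answered from the new version.
:ok = ping.(pid)
:ok = ping.(pid)

# Nothing runs old code any more, so the purge has nobody to kill.
killed? = :code.purge(mod)

IO.inspect({killed?, Process.alive?(pid)})
```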

This isn't robust. One of the obvious reasons for the failure is that we didn't use OTP libraries (GenServer and related libraries) dedicated to creating this kind of server.

Avoid Spawn in Real-World Software Development

In many books and articles, we see code examples demonstrating the power of Elixir or Erlang: tons of processes easily spawned directly with spawn or spawn_link.

However, in real-world software development, we generally should avoid creating home-brewed servers or other long-running processes, and should instead use OTP libraries.

Even for 'one-off' asynchronous tasks, we shouldn't directly use spawn or spawn_link.

Elixir has a great alternative to spawn, though: the Task module (covered in depth in the AppSignal article Demystifying processes in Elixir).
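As a small illustration of that advice, a one-off computation can be handed to Task instead of a bare spawn. The computation itself is a made-up placeholder.

```elixir
# Task.async/1 starts a process linked to and monitored by the caller,
# and returns a handle to it.
task = Task.async(fn -> Enum.sum(1..100) end)

# Task.await/2 blocks until the result arrives and exits if the timeout
# (5 seconds by default) is hit.
result = Task.await(task)

IO.inspect(result)
```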

How To Do a Code Upgrade Using GenServer

Let's create a better version of our server in lib/code_purge/pi_gs.ex:

# lib/code_purge/pi_gs.ex
defmodule CodePurge.PiGs do
  use GenServer

  def start_link(value \\ 3.14) do
    GenServer.start_link(__MODULE__, value)
  end

  def init(value) do
    {:ok, value}
  end

  def handle_call(:get, _from, value) do
    {:reply, value, value}
  end

  def get(pid) do
    GenServer.call(pid, :get)
  end
end

And now, try to upgrade/purge the code of a running process several times:

iex(1)> {:ok, pid} = CodePurge.PiGs.start_link()
{:ok, #PID<0.161.0>}
iex(2)> CodePurge.PiGs.get(pid)
3.14
iex(3)> :code.load_file(CodePurge.PiGs)
{:module, CodePurge.PiGs}
iex(4)> :code.purge(CodePurge.PiGs)
false
iex(5)> :code.load_file(CodePurge.PiGs)
{:module, CodePurge.PiGs}
iex(6)> :code.purge(CodePurge.PiGs)
false
iex(7)> CodePurge.PiGs.get(pid)
3.14

Nothing bad happens! The reason why is easy to understand.

Our pid process doesn't spin in CodePurge.PiGs code. It runs a GenServer loop, and we don't update the GenServer module code at all.

CodePurge.PiGs is a callback module, and its name is kept in the GenServer's internal state. GenServer makes external calls to CodePurge.PiGs functions when serving GenServer requests.
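The idea can be sketched with a toy generic server. This is not how GenServer is implemented. MiniServer and PiCallback are hypothetical names, and the message protocol is invented for the illustration. The point is only that the loop lives in one module while the callback module is a value held in the loop's arguments.

```elixir
defmodule MiniServer do
  # Generic loop: the process spins here, never in the callback module.
  def start(callback_mod, state) do
    spawn(fn -> loop(callback_mod, state) end)
  end

  def call(pid, request) do
    ref = make_ref()
    send(pid, {:call, self(), ref, request})

    receive do
      {^ref, reply} -> reply
    after
      1000 -> :timeout
    end
  end

  defp loop(callback_mod, state) do
    receive do
      {:call, from, ref, request} ->
        # Fully qualified call through a variable: it always dispatches to
        # the current version of the callback module.
        {reply, new_state} = callback_mod.handle_call(request, state)
        send(from, {ref, reply})
        loop(callback_mod, new_state)
    end
  end
end

defmodule PiCallback do
  # Only short-lived callbacks live here, so reloading this module never
  # leaves a process parked in its old code.
  def handle_call(:get, value), do: {value, value}
end

pid = MiniServer.start(PiCallback, 3.14)
reply = MiniServer.call(pid, :get)
IO.inspect(reply)
```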

The remaining challenge is to migrate the state of running GenServer processes, so that it matches what the new code expects.

For a single GenServer, this can be done through the :sys module and the code_change callback of GenServer. This is covered in depth in the previously mentioned hot code reloading article, so here we'll only briefly demonstrate it.

Without closing the previous iex session, let's update lib/code_purge/pi_gs.ex to the following and compile:

# lib/code_purge/pi_gs.ex
defmodule CodePurge.PiGs do
  use GenServer

  def start_link(value \\ 3.14) do
    GenServer.start_link(__MODULE__, value)
  end

  def init(value) do
    {:ok, [value]}
  end

  def handle_call(:get, _from, st) do
    [value] = st
    {:reply, value, st}
  end

  def get(pid) do
    GenServer.call(pid, :get)
  end

  def code_change(_old_vsn, value, _extra) do
    {:ok, [value]}
  end
end

In code_change, we migrate the old state by wrapping it in a list. We also updated the handle_call and init callbacks to work with the new state shape. Now, in the existing iex session, run:

iex(8)> :code.purge(CodePurge.PiGs)
false
iex(9)> :sys.suspend(pid)
:ok
iex(10)> :code.load_file(CodePurge.PiGs)
{:module, CodePurge.PiGs}
iex(11)> :sys.change_code(pid, CodePurge.PiGs, nil, [])
:ok
iex(12)> :sys.resume(pid)
:ok
iex(13)> CodePurge.PiGs.get(pid)
3.14
iex(14)> :sys.get_state(pid)
[3.14]

Everything works fine, and the last call to :sys.get_state demonstrates that the state has actually changed.
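One thing to watch with a code_change like the one above is that running it a second time would wrap the state again. A possible guard against that is sketched below, with a hypothetical WrappedPi module. The sketch skips the reload itself and only exercises the state migration through :sys.

```elixir
defmodule WrappedPi do
  use GenServer

  def init(value), do: {:ok, value}

  def handle_call(:get, _from, state), do: {:reply, state, state}

  # Old shape (a bare number) is wrapped in a list. A state that is already
  # a list is passed through, so a repeated migration is harmless.
  def code_change(_old_vsn, value, _extra) when is_number(value), do: {:ok, [value]}
  def code_change(_old_vsn, state, _extra) when is_list(state), do: {:ok, state}
end

{:ok, pid} = GenServer.start_link(WrappedPi, 3.14)

# Same suspend / change_code / resume sequence as in the iex session,
# minus the purge and reload steps.
migrate = fn ->
  :ok = :sys.suspend(pid)
  :ok = :sys.change_code(pid, WrappedPi, nil, [])
  :ok = :sys.resume(pid)
  :sys.get_state(pid)
end

first = migrate.()
second = migrate.()
IO.inspect({first, second})
```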

Wrap-up

In the first part of this series, we've seen that a GenServer implementation is needed for effective hot code upgrades. We've also demonstrated how to upgrade a single GenServer instance consistently.

Upgrading an individual process, together with its callback module, can be used as a 'tactical weapon' to fix localized bugs or add some logging.

But updating a system at a greater scale, on a regular basis, requires more powerful tools. In the next part of the series, I'll delve into the world of supervisors in Elixir.

I hope you found this run-through of hot code reloading useful. See you next time for supervisors!
