Working with Supervisors in Elixir

Sep 07, 2016

Table of contents:

Why supervising processes is important
What’s the difference between monitoring and linking processes?
Understanding “let it crash”
Supervisors
Creating a Flakey Service
Using a Supervisor
Starting the Application
Conclusion

Last week we looked at organising Elixir projects using Mix. Mix is a build tool for creating, compiling, and testing your Elixir projects.

In Elixir, an “application” is a set of modules that can be started and stopped as a single unit. One of the most important characteristics of Elixir and Erlang is their fault tolerance. By building our applications from isolated processes, we can contain a problem so that it does not have a knock-on effect on other users.

When something goes wrong in a process, we need to be able to recover. In Elixir, this typically means “let it crash” and start a new process in its place.

Normally this would seem like an organisational nightmare. But in Elixir, thanks to OTP, this is exactly how the language was designed to be used.

In Elixir and Erlang, we have what are known as Supervisors. Supervisors are processes that have the sole job of monitoring other processes. When a process that is being supervised crashes, the supervisor will automatically start a new process in its place.

In today’s tutorial we are going to be exploring how we can work with supervisors in Elixir to build highly fault tolerant applications.

Why supervising processes is important

A couple of weeks ago we looked at using processes in Elixir.

Processes are isolated from one another and allow us to write code that is both parallel and concurrent. This means if something goes wrong in a process, none of the rest of the processes will be disrupted.

By default processes are isolated but you can also monitor or link a process to another process. This allows you to take an action if that process dies.

In Elixir and Erlang, you build your applications from trees of supervised processes. This means if something goes wrong in any of your processes, you can isolate that fault to just the processes that are concerned.

And by monitoring or linking to another process, we can automatically recover from the failure, thus providing exceptionally high fault tolerance.

What’s the difference between monitoring and linking processes?

In the last section I mentioned you can either monitor or link processes. These are the two different methods for supervising processes in Elixir.

We’ve already seen how to link processes in Working with Processes in Elixir, where we used spawn_link rather than spawn to create a new process.

When you link two processes, a crash in either one will also kill the other. For example, typing the following into iex will spawn a new process that raises an exception, killing both the spawned process and the current shell process (iex then starts a fresh shell for you):

spawn_link(fn -> raise "smell ya later" end)

Monitoring a process, on the other hand, only sends a message to the monitoring process when the monitored process exits. The monitoring process itself keeps running:

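# Spawn a new process that stays alive until it is killed
# (with a short sleep it could exit before we get to monitor it)
pid = spawn(fn -> :timer.sleep(:infinity) end)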

# Monitor the process
Process.monitor(pid)

# Kill the process
Process.exit(pid, :kill)

# Print and clear the messages in the current process mailbox
flush()
# {:DOWN, #Reference<0.0.1.82>, :process, #PID<0.86.0>, :killed}

In this example I create a process and monitor it. Next I kill the process by sending it an exit signal with the reason :kill. Finally I flush the mailbox of the current process and see that it received a :DOWN message, with the reason :killed, when the monitored process died.
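Outside of iex you won’t have flush, so here is a minimal sketch of the same idea using receive. It assumes nothing beyond the standard library; the variable names are my own. It also shows that a monitor reports a normal exit as well as a crash:

```elixir
# A process that stays alive until something kills it.
pid = spawn(fn -> Process.sleep(:infinity) end)
ref = Process.monitor(pid)

Process.exit(pid, :kill)

# Match on the monitor reference so we only pick up our own :DOWN message.
killed_reason =
  receive do
    {:DOWN, ^ref, :process, ^pid, reason} -> reason
  after
    1_000 -> :timeout
  end

# spawn_monitor/1 spawns and monitors in one step, which avoids a race
# with a process that finishes almost immediately.
{pid2, ref2} = spawn_monitor(fn -> :done end)

normal_reason =
  receive do
    {:DOWN, ^ref2, :process, ^pid2, reason} -> reason
  after
    1_000 -> :timeout
  end

IO.inspect({killed_reason, normal_reason})
# {:killed, :normal}
```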

When you first learn about supervising processes this seems like a subtle difference. We will be exploring the difference between the two options and when to use each with practical examples over the coming weeks.
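If you want to watch a link take down a process without losing your iex shell, one option is to put a throwaway process in the middle. This is only a sketch: the intermediate process and the variable names are my own invention, and the exact stacktrace in the exit reason will differ on your machine.

```elixir
# spawn_monitor/1 starts the middle process and monitors it in one step.
{parent, ref} =
  spawn_monitor(fn ->
    # The child is linked to `parent`, not to the process running this code.
    spawn_link(fn -> raise "smell ya later" end)

    # On its own this process would sleep forever; only the link can end it.
    Process.sleep(:infinity)
  end)

# The monitor tells us why `parent` went down, while we carry on running.
reason =
  receive do
    {:DOWN, ^ref, :process, ^parent, reason} -> reason
  after
    1_000 -> :still_alive
  end

IO.inspect(reason, label: "parent exited with")
```

The link kills the middle process along with its child, while the monitor only delivers a message to us.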

Understanding “let it crash”

If you are coming to Elixir from another programming language, something you probably do a lot of is rescuing from exceptions using the try / catch construct. This is a defensive style of programming where you try to anticipate what could go wrong, and gracefully recover.

However, in Elixir and Erlang, things are a bit different. Instead of littering your code with defensive constructs, you only write the happy path. This means you allow anything that could go wrong to just go wrong.

Typically this will mean the process will crash, but because processes are supervised, it’s very easy and natural to recover and continue.
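To make this a bit more concrete, here is a small sketch of happy path code. HappyPathSketch and parse_age/1 are hypothetical names that exist only for this illustration. The function only handles input that parses cleanly, and anything else crashes the process that called it:

```elixir
defmodule HappyPathSketch do
  # Hypothetical helper: only a successful, complete parse is handled.
  # Any other result fails the match and crashes the calling process.
  def parse_age(input) do
    {age, ""} = Integer.parse(input)
    age
  end
end

# The happy path returns a plain value.
age = HappyPathSketch.parse_age("42")

# Bad input crashes only the process that ran the code.
{pid, ref} = spawn_monitor(fn -> HappyPathSketch.parse_age("not a number") end)

crash_reason =
  receive do
    {:DOWN, ^ref, :process, ^pid, reason} -> reason
  after
    1_000 -> :timeout
  end

IO.inspect(age, label: "parsed")
IO.inspect(crash_reason, label: "worker crashed with")
```

There is no try or rescue anywhere. The failed match is the error handling, and the process running this code is unaffected by the crash.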

Supervisors

The concept of supervising processes, allowing them to crash, and automatically replacing them with new processes is a central theme of the Erlang philosophy.

Supervisors are processes that have the sole responsibility of supervising other processes. Because supervisors have a very simple job, they are very unlikely to crash themselves.

In a typical Elixir application, you will create a supervision tree, constructed of supervisors and worker processes. This will often require multiple supervisors to supervise the various workers of the tree.

The beautiful outcome of this is that when something inevitably goes wrong in one of the leaves or branches of the tree, the problem is isolated and easily recovered from without disturbing the rest of the application.

Creating a Flakey Service

We’ve covered a lot of the theory of Elixir applications and their supervisors without looking at code. I often find that the best way to really understand a concept is to play with it in code, and so to further our understanding let’s create a new application to play with.

First up, let’s use Mix to create a new application:

mix new flakey

The main module of this application is called Flakey.Service and it should have a single function that returns the state of the service:

defmodule Flakey.Service do
  use GenServer

  def start_link do
    GenServer.start_link(__MODULE__, :ok, name: __MODULE__)
  end

  def init(:ok) do
    {:ok, :ok}
  end

  def check do
    GenServer.call(__MODULE__, {:check})
  end

  def handle_call({:check}, _from, :ok) do
    {:reply, :ok, :ok}
  end
end

For this service I’m using GenServer. The second argument to GenServer.start_link/3, the :ok atom, is passed to init/1, which returns it as the initial state of the service.

One thing to note that we haven’t covered so far is the name key in the options list given to start_link/3. Registering the process under a name means we don’t need its pid in order to communicate with it.
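As a quick sketch of what name registration gives us, the following throwaway GenServer can be reached either by pid or by name. NamedSketch and its :ping message are hypothetical and not part of the Flakey application:

```elixir
defmodule NamedSketch do
  use GenServer

  def start_link do
    # Register the process under the module name.
    GenServer.start_link(__MODULE__, :ok, name: __MODULE__)
  end

  def init(:ok), do: {:ok, :ok}

  def handle_call(:ping, _from, state), do: {:reply, :pong, state}
end

{:ok, pid} = NamedSketch.start_link()

# The same process answers whether we address it by pid or by name.
by_pid = GenServer.call(pid, :ping)
by_name = GenServer.call(NamedSketch, :ping)

# Process.whereis/1 looks up the pid behind a registered name.
registered = Process.whereis(NamedSketch)

IO.inspect({by_pid, by_name, registered == pid})
# {:pong, :pong, true}
```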

The service has a single check function that returns the status of the service.

We can fire up iex to see this service in action:

iex -S mix

First we start the service:

{:ok, pid} = Flakey.Service.start_link()

Next we can call the check/0 function to see if the service is ok:

Flakey.Service.check()
# :ok

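However, we can kill the service. Because we started it with start_link from the shell, iex will also report that the shell process exited and start a fresh one: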

Process.exit(pid, :kill)

Now if we try to check the state of the service again we can see that the process is no longer available:

Flakey.Service.check()
# ** (exit) exited in: GenServer.call(Flakey.Service, {:check}, 5000)
# ** (EXIT) no process

Using a Supervisor

Ideally, if the Flakey.Service process crashes, we want it to be automatically restarted. This is the perfect job for a supervisor:

defmodule Flakey.Supervisor do
  use Supervisor

  def start_link do
    Supervisor.start_link(__MODULE__, [])
  end

  def init(_) do
    children = [
      worker(Flakey.Service, [])
    ]

    supervise(children, strategy: :one_for_one)
  end
end

The supervisor is now responsible for starting the Flakey.Service so we don’t have to start it directly ourselves. Although it doesn’t make a difference in this example, we can no longer rely on using a pid because if the service restarts, the pid won’t be the same.

The supervisor is configured in the init/1 callback. In this case we’re declaring Flakey.Service as a worker process and choosing the :one_for_one strategy.

The strategy you choose determines how the supervisor restarts its children. With :one_for_one, when a child process crashes, only that process is replaced with a new one. There are a few other restart strategies, but we won’t worry about them for now.
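The worker and supervise helpers used above come from Supervisor.Spec, which has been deprecated since Elixir 1.5. If you are on a newer version, the same supervisor can be sketched with child specifications and Supervisor.init/2. This is a minimal sketch assuming Elixir 1.5 or later. The ModernFlakey module names are hypothetical, chosen so they don’t clash with the code in this post:

```elixir
defmodule ModernFlakey.Service do
  use GenServer

  # Child specifications call start_link/1, so it takes one (ignored) argument.
  def start_link(_opts) do
    GenServer.start_link(__MODULE__, :ok, name: __MODULE__)
  end

  def check, do: GenServer.call(__MODULE__, {:check})

  @impl true
  def init(:ok), do: {:ok, :ok}

  @impl true
  def handle_call({:check}, _from, :ok), do: {:reply, :ok, :ok}
end

defmodule ModernFlakey.Supervisor do
  use Supervisor

  def start_link(opts \\ []) do
    Supervisor.start_link(__MODULE__, opts)
  end

  @impl true
  def init(_opts) do
    # `use GenServer` defines child_spec/1, so the module name is enough here.
    children = [
      ModernFlakey.Service
    ]

    Supervisor.init(children, strategy: :one_for_one)
  end
end

{:ok, _sup} = ModernFlakey.Supervisor.start_link()
IO.inspect(ModernFlakey.Service.check())
# :ok
```

The main difference to be aware of is that the service’s start_link now takes a single argument, because that is what the default child specification calls.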

Now if we fire up iex we can have a play with the supervisor.

First up we can start the supervisor:

Flakey.Supervisor.start_link()

Starting the supervisor will also start the Flakey.Service so we don’t have to do that manually.

Next we can verify that the Flakey.Service has also been started:

Flakey.Service.check()

We can grab the pid of the Flakey.Service:

pid = Process.whereis(Flakey.Service)

Now that we have the pid we can kill the process:

Process.exit(pid, :kill)

We can now check to see if the service is still ok:

Flakey.Service.check()

And finally we can find the pid of the process to see that it is actually a new pid for the new process:

pid = Process.whereis(Flakey.Service)
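When you type these steps by hand the supervisor has plenty of time to restart the service, but a script runs them back to back, and the restart happens asynchronously. Here is a rough sketch of the same experiment as a script. It assumes Elixir 1.5 or later (for the child specification that use GenServer generates), and RestartSketch.Service is a hypothetical stand-in for Flakey.Service:

```elixir
defmodule RestartSketch.Service do
  use GenServer

  def start_link(_opts), do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)
  def check, do: GenServer.call(__MODULE__, {:check})

  def init(:ok), do: {:ok, :ok}
  def handle_call({:check}, _from, :ok), do: {:reply, :ok, :ok}
end

{:ok, _sup} = Supervisor.start_link([RestartSketch.Service], strategy: :one_for_one)

old_pid = Process.whereis(RestartSketch.Service)
ref = Process.monitor(old_pid)
Process.exit(old_pid, :kill)

# Wait until the old process is confirmed dead.
receive do
  {:DOWN, ^ref, :process, ^old_pid, :killed} -> :ok
after
  1_000 -> raise "service was not killed"
end

# Poll briefly until the supervisor has registered the replacement.
new_pid =
  Enum.find_value(1..100, fn _attempt ->
    case Process.whereis(RestartSketch.Service) do
      pid when is_pid(pid) and pid != old_pid ->
        pid

      _ ->
        Process.sleep(10)
        nil
    end
  end)

status = RestartSketch.Service.check()
IO.inspect({status, new_pid != old_pid})
# {:ok, true}
```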

Starting the Application

Every time we boot up iex with -S mix we are starting the Flakey application. Instead of manually starting the supervisor each time, we can have it started automatically so everything is set up when the application starts.

The first thing we need to do is to convert the empty Flakey module, which was created when Mix generated the project, into an application callback module:

defmodule Flakey do
  use Application

  def start(_type, _args) do
    Flakey.Supervisor.start_link()
  end
end

As you can see, we need to use the Application module and then implement the start/2 callback. In this case we simply need to start the Flakey.Supervisor.

Next we need to tweak the mix.exs file that was also automatically created when we generated the mix application.

The only thing to change is the application/0 function, where we add the mod key:

def application do
  [applications: [:logger], mod: {Flakey, []}]
end

The mod key is where we provide the module that will be called on start up. It can be any module that implements the Application behaviour. The first element of the tuple is the module, and the second is the argument that will be passed to start/2, in this case an empty list.
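If your project was generated with Elixir 1.4 or later, you will probably find extra_applications in mix.exs instead of applications, because Mix now infers the applications list from your dependencies. Assuming that is the case for you, the equivalent fragment would look something like this:

```elixir
# mix.exs (fragment): the mod key is the only addition to the generated function
def application do
  [extra_applications: [:logger], mod: {Flakey, []}]
end
```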

With all of this in place, we can restart iex again:

iex -S mix

Now we can call the check/0 function on the Flakey.Service without having to start the supervisor first:

Flakey.Service.check()

Finally, there is one more thing we should look at in this introduction to supervisors. If you run the following line in iex you should be presented with a new window:

:observer.start()

Next, if you click on the Applications tab, you should see a tree that represents the supervision tree of the application. Now if you right click on Elixir.Flakey.Service and choose “Kill Process”, you can kill the Flakey.Service process and see it pop right back automatically thanks to our supervisor.

Conclusion

A big part of Elixir applications is the “let it crash” philosophy. Instead of writing defensive code, we only write the happy path.

If something goes wrong we just let the process crash, because when a process crashes we can automatically create a new process using a supervisor. A supervisor is a process whose single responsibility is starting its child processes and restarting them when they crash.

Elixir applications are built in the structure of supervision trees. If a process dies, the crash is isolated and dealt with as part of the supervision tree.

This means if a leaf or a branch dies, the rest of the application is not touched. This is the secret to how Elixir applications are so highly fault tolerant.