
Video object detection in Elixir using Nx and Bumblebee

Jan 26, 2023

Table of contents:

  1. What are we going to build?
  2. Setting up the project
  3. Implementing object detection from a video
  4. Building the Phoenix LiveView application
  5. Conclusion

I was recently contacted by a client to build a prototype of object detection in a video stream using Elixir. The client recognised Elixir’s strengths in data pipelines, multimedia processing, and native machine learning capabilities and wanted to understand what it would take to get a simple prototype working.

We decided to build the prototype using Bumblebee, Elixir’s new machine learning library that provides pre-trained neural network models (including transformers) on top of Axon.

The client has graciously allowed me to make the prototype public and write up my process for building it to share the learning with the Elixir ecosystem.

Here is the repo on GitHub if you would like to follow the commits: philipbrown/video-object-detection.

So let’s get to it!

What are we going to build?

This is going to be a fairly simple end-to-end prototype of object detection in a video stream, so we can take a couple of shortcuts to make our lives easier.

Firstly, instead of using an actual video stream, we’re just going to use an .mp4 file that we can read from the filesystem. Secondly, we aren’t going to train our own neural network; instead, we’ll grab an existing pre-trained model from Hugging Face via Bumblebee.

In order to make the prototype work, we’ll need to implement the following functionality.

Firstly, we need to grab the first frame of the video and convert it to an Nx tensor that can be given to the neural network. The response from the neural network will be a prediction of what object can be found in the frame.

Secondly, we need to take the first frame and the prediction and display them to the user in the browser. Once the frame and prediction have been displayed, the second frame will be passed into the neural network, and so on. Fortunately Phoenix LiveView will make this very easy to build.

So, now that we know what we’re going to be building, let’s jump into the code.

Setting up the project

The first thing we need to do is to set up a new Phoenix project:

mix phx.new video-object-detection --app app --no-ecto

This command will generate a new Phoenix project with the app name of app. We won’t need a database for this project so we can pass the --no-ecto flag.

Next we need to create a new LiveView module that we can work from. By default Phoenix will generate a dead route, controller, and template files. We can just delete them and create the LiveView module instead. Here are the changes I made during this step.
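If you’re not following the commits in the repo, a minimal version of that LiveView module and route might look something like this (the PageLive name and the "/" route are my assumptions, not necessarily what the repo uses):

# lib/app_web/live/page_live.ex
defmodule AppWeb.PageLive do
  use AppWeb, :live_view

  def mount(_params, _session, socket) do
    {:ok, socket}
  end

  def render(assigns) do
    ~H"""
    <p>Hello, world</p>
    """
  end
end

And in lib/app_web/router.ex, replace the generated "/" route with:

live "/", PageLive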

Now if you fire up the application with the following command:

iex -S mix phx.server

Then go to http://localhost:4000 in your browser and you should see “Hello, world”.

Implementing object detection from a video

Now that we have the foundation of the project in place, we can get to the meat and potatoes of implementing object detection.

The first thing we need to do is to add a couple of dependencies to our mix.exs file:

{:bumblebee, "~> 0.1.2"},
{:exla, ">= 0.0.0"},
{:evision, "~> 0.1.27"}

First up we have Bumblebee and EXLA. As I mentioned earlier, Bumblebee is what we’re going to be using for our pre-trained neural network; it’s built on top of Nx and Axon. EXLA is the Elixir binding for Google’s XLA compiler, which will compile and run our numerical code efficiently. We are also going to be using Evision, which provides bindings for OpenCV. OpenCV is an amazing library of modules and functions that makes it easier to work with images and video, and Evision makes those functions easy to use from Elixir while playing really nicely with Nx.

Run the following command in the terminal to add those dependencies to the project:

mix deps.get

You will also need to add the following line to your config/config.exs file so that Nx uses the EXLA.Backend:

config :nx, default_backend: EXLA.Backend

Next, we need an .mp4 file that we can use to perform object detection on. If you have an .mp4 on your computer you can use that, or you can “acquire” one from YouTube. Save your .mp4 file under the priv directory as video.mp4.

To get the object detection functionality working, we can pop open an iex session and walk through the steps. Run the following command in the terminal to start one:

iex -S mix

First, we’ll create the path to the .mp4 file as a variable so we can grab it later:

path = Path.join(:code.priv_dir(:app), "video.mp4")

Next, we can create the video using Evision:

video = Evision.VideoCapture.videoCapture(path)

%Evision.VideoCapture{
  fps: 25.0,
  frame_count: 4766.0,
  frame_width: 1280.0,
  frame_height: 720.0,
  isOpened: true,
  ref: #Reference<0.1803582736.1051328520.196295>
}

The return value is a struct containing some metadata about the video along with a reference to the underlying OpenCV video capture.

Next, we can read the first frame of the video:

frame = Evision.VideoCapture.read(video)

%Evision.Mat{
  channels: 3,
  dims: 2,
  type: {:u, 8},
  raw_type: 16,
  shape: {720, 1280, 3},
  ref: #Reference<0.1803582736.1051328522.197029>
}

Again, the return value is a struct that contains a reference. If you were to call that function again, you would see that the second struct has a different reference. That’s because you just read the second frame of the video, despite not passing in an updated video struct. This kind of hidden state is a little strange in Elixir-land, but it’s just something to be aware of.
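You can see this for yourself in iex by reading a second frame and comparing the references (a quick check; the exact reference values will differ on your machine):

# Reading again returns the *next* frame, even though we pass the same `video` struct
frame2 = Evision.VideoCapture.read(video)
frame.ref == frame2.ref
#=> false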

Now that we have a frame of the video, we can perform the object detection. However, before we can do that, we first need to convert the frame into an Nx tensor:

tensor = frame |> Evision.Mat.to_nx() |> Nx.backend_transfer()

Evision provides the to_nx/1 function to convert the %Evision.Mat{} struct to a tensor. However, the backend of that tensor will be set to Evision.Backend, so before we pass it to Bumblebee we first need to call Nx.backend_transfer/1 to move it to our default backend (EXLA.Backend, which we configured earlier).

The next thing we need to do is to set up Bumblebee:

{:ok, model_info} = Bumblebee.load_model({:hf, "microsoft/resnet-50"})
{:ok, featurizer} = Bumblebee.load_featurizer({:hf, "microsoft/resnet-50"})

serving = Bumblebee.Vision.image_classification(model_info, featurizer,
  top_k: 1,
  compile: [batch_size: 1],
  defn_options: [compiler: EXLA]
)

In this example I’m using ResNet-50, a pre-trained convolutional neural network for image classification, which we’ll use to identify the most likely object in each frame. Bumblebee will automatically download everything we need from Hugging Face.

The return value of the image_classification/3 function is an %Nx.Serving{} struct. Nx provides the Nx.Serving module to encapsulate running inference on the model with new data.

So now that we have the %Nx.Serving{} struct, we can make our first object detection prediction using the first frame of the video:

result = Nx.Serving.run(serving, tensor)

%{predictions: [%{label: "valley, vale", score: 0.2353769838809967}]}

As you can see, the neural network predicts that the object in the first frame of my video is “valley, vale”.

You could now manually grab the second frame of the video and pass it to Nx.Serving.run/2, but let’s instead leverage LiveView and have it run automatically in the browser.
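For example, a rough sketch of that manual approach in iex, classifying the next few frames in a loop, might look like this:

# Classify the next five frames; each read/1 call advances the video by one frame
for _ <- 1..5 do
  video
  |> Evision.VideoCapture.read()
  |> Evision.Mat.to_nx()
  |> Nx.backend_transfer()
  |> then(&Nx.Serving.run(serving, &1))
end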

Building the Phoenix LiveView application

Now that we have object detection working, we can transfer the code from iex into the LiveView module so we can interact with it in the browser.

The first thing we need to do is to set up a couple of things in the mount/3 function. The mount/3 function is called when the LiveView is first mounted:

def mount(_params, _session, socket) do
  path = Path.join(:code.priv_dir(:app), "video.mp4")

  {:ok,
  socket
  |> assign(running?: false)
  |> assign(image: nil)
  |> assign(prediction: nil)
  |> assign(serving: serving())
  |> assign(video: Evision.VideoCapture.videoCapture(path))}
end

First we generate the path to the video file exactly like we did in iex. Next, we set some default values in the socket assigns: running?, image, and prediction.

Next, we create the serving struct just like we did earlier with the following private function:

defp serving do
  {:ok, model_info} = Bumblebee.load_model({:hf, "microsoft/resnet-50"})
  {:ok, featurizer} = Bumblebee.load_featurizer({:hf, "microsoft/resnet-50"})

  Bumblebee.Vision.image_classification(model_info, featurizer,
    top_k: 1,
    compile: [batch_size: 1],
    defn_options: [compiler: EXLA]
  )
end

And finally, we create the video and set it into the socket assigns.

We also need to update the render/1 function of the LiveView:

def render(assigns) do
  ~H"""
  <div class="min-h-screen flex flex-col">
    <div class="flex-1 flex flex-col justify-center mx-auto max-w-7xl">
      <div class="flex flex-col items-center justify-center">
        <button :if={!@running?} phx-click="start">
          <Heroicons.play solid class="w-32 h-32 fill-gray-300" />
        </button>
      </div>

      <div :if={@running?} class="flex flex-col gap-4">
        <img :if={@image} src={["data:image/jpg;base64,", @image]} class="max-w-3xl" />

        <p :if={@prediction} class="text-2xl">
          <%= @prediction %>
        </p>
      </div>
    </div>
  </div>
  """
end

If running? is set to false (which it is by default) we display a “play” button that the user can click to kick off the object detection. The button has a phx-click attribute that we need to provide a handle_event/3 callback for in order for it to work.

If running? is set to true and a prediction has been made, we can display the frame as an image and the prediction as text to the screen. Every time there is a new frame and prediction, LiveView will take care of updating the screen for us.

You should be able to view this LiveView module in the browser with the “play” button displayed. However, if you click the button you will get an error. Let’s now implement the final part of this application.

First we need to handle the phx-click event from the “play” button:

def handle_event("start", _params, socket) do
  send(self(), :run)

  {:noreply, assign(socket, running?: true)}
end

Here we are listening for the "start" event and then sending a :run message to self(), which is the pid of the current LiveView process. This means we can perform some work asynchronously just as you would with a regular Elixir process. We also mark the running? assign as true.

As we’ve sent a message to the current process, we also need a handle_info/2 callback to handle the message:

def handle_info(:run, %{assigns: %{running?: true}} = socket) do
  frame = socket.assigns.video |> Evision.VideoCapture.read()
  prediction = predict(socket.assigns.serving, frame)

  send(self(), :run)

  {:noreply,
  socket
  |> assign(prediction: prediction)
  |> assign(image: Evision.imencode(".jpg", frame) |> Base.encode64())}
end

def handle_info(_msg, socket), do: {:noreply, socket}

This handle_info/2 function is where the main action happens. If the :run message is received and running? is true in the socket assigns, we take the next frame of the video and perform object detection on it.

First we read the next frame from the video and pass it, along with the serving struct, to a predict/2 function. Here is the implementation of that function:

defp predict(serving, frame) do
  tensor = frame |> Evision.Mat.to_nx() |> Nx.backend_transfer()

  %{predictions: [%{label: label}]} = Nx.Serving.run(serving, tensor)

  label
end

This should look very familiar, as it’s exactly what we were doing in iex: we convert the frame to a tensor, pass it to Nx.Serving.run/2, and return the label string.

Back in the handle_info/2 function, now that we have made the prediction, we can send ourselves the same :run message to automatically repeat the process:

send(self(), :run)

Finally we can update the socket with the prediction and the frame converted to an image. To render the image we can use Evision.imencode/2, which will convert the frame to a binary, and then base64 encode the binary so that it can be displayed in an <img> tag in HTML:

{:noreply,
  socket
  |> assign(prediction: prediction)
  |> assign(image: Evision.imencode(".jpg", frame) |> Base.encode64())}

Now every time the handle_info/2 function is called and the prediction and image assigns are updated, Phoenix will automatically send the diff to the browser and make the update, rendering the new frame and the given prediction.
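One detail the prototype glosses over is what happens when the video runs out of frames. As far as I can tell, Evision.VideoCapture.read/1 returns false once there are no frames left, which would crash the predict/2 call. If you want the loop to stop gracefully, a sketch along these lines (not part of the repo) would do it:

def handle_info(:run, %{assigns: %{running?: true}} = socket) do
  case Evision.VideoCapture.read(socket.assigns.video) do
    %Evision.Mat{} = frame ->
      # There is another frame: classify it and queue up the next run
      send(self(), :run)

      {:noreply,
       socket
       |> assign(prediction: predict(socket.assigns.serving, frame))
       |> assign(image: Evision.imencode(".jpg", frame) |> Base.encode64())}

    _ ->
      # No more frames: stop the loop so the "play" button is shown again
      {:noreply, assign(socket, running?: false)}
  end
end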

Conclusion

And there you have it, video object detection with a live updating browser interface in a very small amount of code. Evision, Bumblebee, Nx, and Axon provide everything you need out of the box to start working with machine learning in Elixir, and of course Phoenix LiveView makes it incredibly easy to build real-time web applications.

It’s a very exciting time to be working on these types of projects. Elixir is dramatically reducing the cost to build end-to-end real-time, machine learning applications.

Philip Brown

@philipbrown
