
Building a full-text search engine in Elixir

Mar 22, 2023

Table of contents:

  1. Motivations
  2. What are the advantages of Haystack?
  3. Prior art
  4. An overview of Haystack
  5. Tokenizing incoming data
  6. Transforming tokens
  7. Implementing the storage layer
  8. Building the Index
  9. Defining queries
  10. Conclusion

Full-text search is a common requirement across many types of websites and web applications. I recently needed the ability to index and search content on a project I’ve been working on. However, using an off-the-shelf search solution didn’t make sense for this project.

Instead of building a closed-source solution as part of that project, I decided to build an open source library that I could use as a building block for my other Elixir projects that fit the same requirements. That open source library is Haystack, a simple, extendable full-text search engine written in Elixir.

I strongly believe projects built on the Beam have a unique set of competitive advantages in comparison to projects built in other technologies, and so I wanted to explore what that would look like for a native search index.

This blog post is an overview of how Haystack is built, the motivations for the project, and the design decisions I have made to allow Haystack to be as simple and extendable as possible.

Motivations

Before we get too deep into looking at how Haystack is built, let’s first take a look at the motivations for building it in the first place.

As I mentioned in the introduction, using an off-the-shelf search solution didn’t make sense for the project I was building. The project is a simple data-driven Elixir Phoenix application. Currently the project is encapsulated as a single web server. In order to keep the complexity of the project to a minimum, I don’t want to add any additional infrastructure that I would need to set up, deploy and maintain. This means deploying an Elasticsearch cluster is out of the question.

PostgreSQL and SQLite both offer excellent full-text search solutions. This is perfect if you are already using one of these two databases in your project. However, as this project is not currently using a database, it doesn’t make sense to add one just so I can use the full-text search functionality.

For this project, and a handful of other projects I want to build, I want to keep the running cost as low as possible. This means paying for a hosted search solution is also out of the question.

And so I was left with the obvious choice of building my own solution. I felt like this was the right choice to make because the “cost” of developing and maintaining Haystack can be spread over a number of my projects that will benefit from it.

Finally, let’s be honest, building Haystack means I have something interesting to write about on Culttt. Building a search engine is one of those technical blog posts that, like building your own compiler, is just incredibly intoxicating to those of us with a predisposition for reading about the technical intricacies of how stuff works.

What are the advantages of Haystack?

Whilst Haystack will definitely make some compromises when compared with existing search solutions, I do genuinely believe it has a number of advantages that make it better than the competition in certain ways.

The first, and most obvious, advantage is the one I mentioned above: if you have an Elixir project that is a simple web server deployment and you don’t want the headache or additional cost of deploying and managing additional infrastructure or paying for third-party services, Haystack can offer you full-text search that is very easy to get started with.

If you are currently building Elixir projects, I think you will find modifying and extending Haystack to meet your requirements very easy. Doing the same for one of the existing off-the-shelf solutions is likely going to be a lot more difficult or impossible, depending on what you want to do.

And finally, if you need extremely low latency, having a solution that is embedded inside of your application is going to be faster than a solution that is forced to make a network roundtrip to return results.

Prior art

Before we get into looking at how Haystack is built, we should take a brief look at the prior art and sources of inspiration.

The first source of inspiration was Elasticlunr, an existing full-text search library built entirely in Elixir.

When I was first planning the project that required full-text search I initially planned to use Elasticlunr. However when I came to add it to the project and take it for a test drive, I found a couple of problems that I felt made this a no-go.

After assessing how I would go about making the changes I required to make Elasticlunr work for my requirements, it felt like there were a number of fundamental changes that I would need to make that would completely change the original spirit of the project. I’m usually a proponent of contributing to an existing project rather than starting your own. However, if the changes you are proposing completely rewrite large portions of the original work, I think it’s better to start a new project.

Elasticlunr also hasn’t had any updates in the last year despite having a PR that looks 90% finished. As full-text search is going to be a core component of a number of my projects, I didn’t want to outsource it to an abandoned project. Finally I reached out to the author of Elasticlunr but didn’t receive a response. The combination of all of these factors meant that I couldn’t see Elasticlunr as a viable choice.

The second source of inspiration was Building a full-text search engine in 150 lines of Python code. Before building Haystack, I only had a basic understanding of how this type of project would be built. I really love this kind of “from scratch” blog post because it provides a guide to building something of value from first principles, which is incredibly valuable when you are trying to learn something without any prior knowledge. I hope the blog post I’m writing now will be added to that collective literature and be used by the next person as inspiration for building what they have in mind.

An overview of Haystack

The rest of this blog post will be an overview of how Haystack has been built, as well as the design considerations I’ve made that I hope will make the project as simple and extensible as possible for other people to use in their projects.

Whenever I stumble upon an interesting new project, the first thing I do is look at the README.md to get an understanding of how I would use the project. So think of this section as the high-level overview of how to use Haystack.

The entry point to Haystack is the Haystack module. This module is responsible for managing the various indexes you have as part of the project as well as any future configuration or high level concerns.

These concerns are encapsulated in the %Haystack{} struct:

haystack = Haystack.new()

Currently, the Haystack module has a single index/3 function that accepts the %Haystack{} struct, the name of the index as an atom, and a callback function to apply changes to the index. If the index does not already exist, Haystack will create one and pass it into the callback so you can configure or use it:

Haystack.index(haystack, :animals, fn index ->
  index
end)

Once you have an index, the first thing you need to do is specify the fields of the index:

alias Haystack.Index

Haystack.index(haystack, :animals, fn index ->
  index
  |> Index.ref(Index.Field.term("id"))
  |> Index.field(Index.Field.new("name"))
  |> Index.field(Index.Field.new("description"))
end)

The ref is the identifier of the document. You can create as many fields on the document as you want. When you add documents to the index, Haystack will automatically extract the fields you have configured.
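For illustration, adding a document might look something like the following. Note that Index.add/2 is an assumption on my part here; check the Haystack documentation for the exact function:

```elixir
alias Haystack.Index

# Assumes an Index.add/2 function for adding documents to the index;
# the real API may differ, see the Haystack docs.
Haystack.index(haystack, :animals, fn index ->
  Index.add(index, [
    %{"id" => 1, "name" => "Red Panda", "description" => "A small mammal native to the Himalayas."}
  ])
end)
```

Haystack extracts the configured ref and fields from each document map, so anything outside those fields is ignored.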

You can also provide nested fields using dot syntax:

alias Haystack.Index

Haystack.index(haystack, :people, fn index ->
  index
  |> Index.ref(Index.Field.term("id"))
  |> Index.field(Index.Field.new("name"))
  |> Index.field(Index.Field.new("address.postcode"))
end)

Each Haystack index has a configurable storage implementation. By default the storage implementation is essentially a map, which is great for testing and prototyping, but probably not a good solution if you want to use Haystack as part of a real application. You can configure a custom storage implementation on the index:

alias Haystack.{Index, Storage}

Haystack.index(haystack, :animals, fn index ->
  index
  |> Index.ref(Index.Field.term("id"))
  |> Index.field(Index.Field.new("name"))
  |> Index.field(Index.Field.new("description"))
  |> Index.storage(Storage.ETS.new())
end)

The ETS storage implementation is provided as one of the default implementations that ships with Haystack out of the box. If you would like to extend Haystack with your own storage implementation, it’s very easy to do so. We’ll look at doing that later in this blog post.

Finally, the most important part of a library that provides full-text search is the ability to query the indexes. Here is the most basic version of applying queries:

Haystack.index(haystack, :animals, &Index.search(&1, "Red Panda"))

This would search the :animals index for the phrase "Red Panda". Haystack also offers a powerful way to build and run search queries, which we’ll look at later in this blog post.

So that’s the high level overview of using Haystack. Let’s now jump into the details of how it is built.

Tokenizing incoming data

The first step in building a search engine is taking the incoming data and tokenizing it. Tokenization is the process of splitting the input into individual words so that they can be indexed.

Part of the tokenization process is extracting the offset and length of each of the words. This makes it possible to highlight matched words in the excerpt that is provided with the search results.
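To see why the positions matter: an offset and length pair maps a token straight back into the original value, for example with binary_part/3:

```elixir
value = "needle in a haystack"

# A token recorded at offset 12 with length 8 recovers the
# matched word from the original string:
binary_part(value, 12, 8)
# => "haystack"
```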

To extract the words and positions from the incoming data we can apply the same regex twice, with different return options.

To capture alphanumeric characters, we can use the following regex:

~r/([[:alnum:]]+)/

If you wanted to capture all characters, you could use the following regex:

~r/(.+)/

You might want to use this regex for identifier type fields where it’s important to match the entire value.

We can then use the Regex.scan/3 function to apply the regex to the incoming value:

# Extract the tokenized words from the value.
defp words(v, r),
  do: Regex.scan(r, v, capture: :first) |> List.flatten()

# Extract the positions from the value.
defp positions(v, r),
  do: Regex.scan(r, v, capture: :first, return: :index) |> List.flatten()

You can use the return: :index option to return the position of each word. The position is given as a tuple of the index of the first character and the length of the match.
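Running the same regex through Regex.scan/3 with each option shows the paired outputs the two helpers above produce:

```elixir
regex = ~r/([[:alnum:]]+)/

# The words, as matched strings:
Regex.scan(regex, "red panda", capture: :first)
# => [["red"], ["panda"]]

# The positions, as {offset, length} tuples:
Regex.scan(regex, "red panda", capture: :first, return: :index)
# => [[{0, 3}], [{4, 5}]]
```

Zipping the two lists together is what pairs each word with its position.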

We can use the following tokenize/2 function to apply the regex and extract the tokens as a list. I’m wrapping each token in a struct to formalise this data structure:

@doc """
Tokenize a value with a given separator.
## Examples
    iex> tokens = Tokenizer.tokenize("Needle in a Haystack", Tokenizer.separator(:default))
    iex> Enum.map(tokens, & &1.v)
    ~w{needle in a haystack}
"""
@spec tokenize(term, Regex.t()) :: list(Token.t())
def tokenize(v, r) when is_integer(v),
  do: v |> to_string() |> tokenize(r)

def tokenize(v, r) when is_binary(v) do
  v = String.downcase(v)

  [words(v, r), positions(v, r)]
  |> Enum.zip()
  |> Enum.map(fn {v, {o, l}} -> Token.new(v, offset: o, length: l) end)
end

I’m also providing some default regexes as part of this module’s API. Here is the full Tokenizer module:

defmodule Haystack.Tokenizer do
  @moduledoc """
  A module for tokenizing values.

  This module provides utilities for tokenizing values. The default tokenizer
  removes anything but alphanumeric characters and extracts the positions of
  words using a start offset and length.

  There's also a `:full` tokenizer that can be used when the full value should
  be treated as a single token. For example, a serial code.
  """

  alias Haystack.Tokenizer.Token

  @default ~r/([[:alnum:]]+)/
  @full ~r/(.+)/

  # Public

  @doc """
  Return the separator

  ## Examples

      iex> Tokenizer.separator(:default)
      ~r/([[:alnum:]]+)/

      iex> Tokenizer.separator(:full)
      ~r/(.+)/

  """
  @spec separator(atom) :: Regex.t()
  def separator(:default), do: @default
  def separator(:full), do: @full

  @doc """
  Tokenize a value.

  ## Examples

      iex> tokens = Tokenizer.tokenize("Needle in a Haystack")
      iex> Enum.map(tokens, & &1.v)
      ~w{needle in a haystack}

  """
  @spec tokenize(term) :: list(Token.t())
  def tokenize(v), do: tokenize(v, separator(:default))

  @doc """
  Tokenize a value with a given separator.

  ## Examples

      iex> tokens = Tokenizer.tokenize("Needle in a Haystack", Tokenizer.separator(:default))
      iex> Enum.map(tokens, & &1.v)
      ~w{needle in a haystack}

  """
  @spec tokenize(term, Regex.t()) :: list(Token.t())
  def tokenize(v, r) when is_integer(v),
    do: v |> to_string() |> tokenize(r)

  def tokenize(v, r) when is_binary(v) do
    v = String.downcase(v)

    [words(v, r), positions(v, r)]
    |> Enum.zip()
    |> Enum.map(fn {v, {o, l}} -> Token.new(v, offset: o, length: l) end)
  end

  # Private

  # Extract the tokenized words from the value.
  defp words(v, r),
    do: Regex.scan(r, v, capture: :first) |> List.flatten()

  # Extract the positions from the value.
  defp positions(v, r),
    do: Regex.scan(r, v, capture: :first, return: :index) |> List.flatten()
end

Transforming tokens

The next step after tokenization is applying transformations to the tokens. This normalises and optimises the tokens ready for indexing.

This is one of the points of integration that I wanted to offer so that consumers of Haystack can apply their own series of transformations based on their unique requirements. However, whilst I want to make transformations configurable, I also want to have sensible defaults that “just work” out of the box.

To ensure each stage of the transformation pipeline is interchangeable, the Elixir way is to define a behaviour that all of the pipeline stages implement. Here’s what that behaviour looks like:

@doc """
Apply a transformation on a list of tokens.
"""
@callback transform(list(Token.t())) :: list(Token.t())

The transform/1 behaviour callback accepts a list of tokens and returns a list of tokens. Each stage of the transformation pipeline builds upon the previous stages to return a list of tokens that have been transformed ready to be indexed.
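As a sketch of how a consumer could slot in their own stage, here’s a hypothetical transformer (not part of Haystack) that drops very short tokens:

```elixir
defmodule MyApp.Transformer.MinLength do
  @moduledoc """
  Illustrative only: removes tokens shorter than three characters.
  """

  @behaviour Haystack.Transformer

  # Keep only tokens whose value is at least three characters long.
  @impl Haystack.Transformer
  def transform(tokens),
    do: Enum.filter(tokens, &(String.length(&1.v) >= 3))
end
```

Because every stage has the same shape, this module can be dropped into the pipeline alongside (or instead of) the defaults.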

Out of the box I want to offer two important transformation stages, a way to stem words and a way to remove stop words.

Here’s the stemmer transformer:

defmodule Haystack.Transformer.Stemmer do
  @moduledoc """
  A transformer for stemming words.
  """

  @behaviour Haystack.Transformer

  # Behaviour: Haystack.Transformer

  @doc """
  Apply a stemming transformation on a list of tokens.

  ## Examples

      iex> tokens = Tokenizer.tokenize("Needle in a Haystack")
      iex> tokens = Transformer.Stemmer.transform(tokens)
      iex> Enum.map(tokens, & &1.v)
      ~w{needl in a haystack}

  """
  @impl Haystack.Transformer
  def transform(tokens) do
    Enum.map(tokens, fn token ->
      %{token | v: Stemmer.stem(token.v)}
    end)
  end
end

Stemming words is the process of reducing each word to the stem so that different forms of the same word are grouped together. This module simply delegates to the excellent Stemmer library.
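A quick iex session shows the effect; the exact outputs come from the Stemmer library’s Porter2 implementation:

```elixir
# Different forms of "needle" reduce to the same stem,
# so a search for one form matches the others.
iex> Stemmer.stem("needle")
"needl"
iex> Stemmer.stem("needles")
"needl"
```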

Here’s the stop word transformer:

defmodule Haystack.Transformer.StopWords do
  @moduledoc """
  A transformer for removing stop words.
  """

  @behaviour Haystack.Transformer

  @words Application.app_dir(:haystack, "priv/haystack/stop_words.txt")
         |> File.read!()
         |> String.trim()
         |> String.split("\n")

  # Behaviour: Haystack.Transformer

  @doc """
  Remove stop words from a list of tokens.

  ## Examples

      iex> tokens = Tokenizer.tokenize("Needle in a Haystack")
      iex> tokens = Transformer.StopWords.transform(tokens)
      iex> Enum.map(tokens, & &1.v)
      ~w{needle haystack}

  """
  @impl Haystack.Transformer
  def transform(tokens),
    do: Enum.reject(tokens, &Enum.member?(@words, &1.v))
end

This module removes common words that add very little value. The list of words is stored as a text file in the priv directory of the Haystack project.
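The mechanism itself is just set-membership rejection; here’s a standalone equivalent with a tiny hard-coded word list:

```elixir
# A tiny stand-in for the priv/haystack/stop_words.txt list.
stop_words = ~w{a an in of the}

# Reject any word that appears in the stop word list.
~w{needle in a haystack}
|> Enum.reject(&Enum.member?(stop_words, &1))
# => ["needle", "haystack"]
```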

Both of these transformation modules are replaceable by the engineer who is implementing Haystack.

Here’s the full Transformer module:

defmodule Haystack.Transformer do
  @moduledoc """
  A module for transforming tokens.

  This module includes a transform/1 callback behaviour for implementing stages
  of a transformation pipeline. The default pipeline transformers includes
  stages for stemming and removing stop words.
  """

  alias Haystack.Tokenizer.Token
  alias Haystack.Transformer

  @default [
    Transformer.Stemmer,
    Transformer.StopWords
  ]

  # Behaviour

  @doc """
  Apply a transformation on a list of tokens.
  """
  @callback transform(list(Token.t())) :: list(Token.t())

  # Public

  @doc """
  Return the list of default transformers.

  ## Examples

      iex> Transformer.default()
      [Transformer.Stemmer, Transformer.StopWords]

  """
  @spec default :: list(module)
  def default, do: @default

  @doc """
  Apply a pipeline of transformers to a list of tokens.

  ## Examples

      iex> tokens = Tokenizer.tokenize("Needle in a Haystack")
      iex> tokens = Transformer.pipeline(tokens, Transformer.default())
      iex> Enum.map(tokens, & &1.v)
      ~w{needl haystack}

  """
  @spec pipeline(list(Token.t()), list(module)) :: list(Token.t())
  def pipeline(tokens, transformers) do
    Enum.reduce(transformers, tokens, fn transformer, tokens ->
      transformer.transform(tokens)
    end)
  end
end

You will see I’ve included a list of default transformations and a pipeline/2 function that applies a list of transformations to a list of tokens. We’ll use these functions later in this blog post.

Implementing the storage layer

The next big area to think about is how data is stored. This is another area of the library that I want to be configurable. There are a couple of reasons why I’d want to do this.

Firstly, I want to make it as easy as possible to test all of the functionality of Haystack that relies on a storage implementation and so I want a simple mechanism for storage in tests. However, I also want to offer an out-of-the-box solution that can be used in a production application.

Fortunately I’m building Haystack on top of the Beam, and in the world of the Beam, we get the wonderful ETS as a key-value storage engine for free, without any third-party dependencies.

I also want the ability for anyone who uses Haystack to be able to use their own storage implementation without having to send a PR to the main Haystack repository. For example, they might want to store their data in Redis.

To achieve these design goals it’s time to define another behaviour. Here’s what that looks like:

defmodule Haystack.Storage do
  @moduledoc """
  A module and behaviour for storing data.

  This module acts as a convenient way to delegate to the storage implementation
  within Haystack.

  This has been created as a behaviour to make it easy to provide your own
  storage implementation. By default Haystack provides a memory storage
  implementation, which uses a map to store data.
  """

  # Types

  @type opts :: Keyword.t()
  @type t :: struct
  @type k :: term
  @type v :: term

  # Errors

  defmodule NotFoundError do
    @moduledoc """
    Not found error.
    """

    @type t :: %__MODULE__{message: String.t()}

    defexception ~w{message}a
  end

  # Behaviour

  @doc """
  Create a new storage.
  """
  @callback new(opts) :: t

  @doc """
  Fetch an item from storage.
  """
  @callback fetch(t, k) :: {:ok, v} | {:error, NotFoundError.t()}

  @doc """
  Fetch an item from storage.
  """
  @callback fetch!(t, k) :: v

  @doc """
  Insert an item into storage
  """
  @callback insert(t, k, v) :: t

  @doc """
  Update an item in storage.
  """
  @callback update(t, k, function) :: {:ok, t}

  @doc """
  Update an item in storage.
  """
  @callback update!(t, k, function) :: t

  @doc """
  Upsert an item in storage.
  """
  @callback upsert(t, k, v, function) :: t

  @doc """
  Delete an item from storage.
  """
  @callback delete(t, k) :: t

  @doc """
  Return the count of items in storage.
  """
  @callback count(t) :: integer

  @doc """
  Serialize the storage.
  """
  @callback serialize(t) :: binary
end

Each storage is expected to implement the functions of the behaviour, as well as provide a struct that can be used internally by the implementation to store configuration details or data. I’ve also added some convenience functions to this module that delegate to the implementation.
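The delegation relies on Elixir’s ability to pattern match the module name out of any struct, which works for any struct-based implementation:

```elixir
defmodule Dispatch do
  # %module{} binds the struct's module, so the call can be
  # forwarded to whichever implementation the struct belongs to.
  def delegate(%module{} = struct, func, args),
    do: apply(module, func, [struct | args])
end

# MapSet is a struct, so the same trick dispatches to MapSet.member?/2:
Dispatch.delegate(MapSet.new([1, 2]), :member?, [2])
# => true
```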

Here’s the full Storage module:

defmodule Haystack.Storage do
  @moduledoc """
  A module and behaviour for storing data.

  This module acts as a convenient way to delegate to the storage implementation
  within Haystack.

  This has been created as a behaviour to make it easy to provide your own
  storage implementation. By default Haystack provides a memory storage
  implementation, which uses a map to store data.
  """

  # Types

  @type opts :: Keyword.t()
  @type t :: struct
  @type k :: term
  @type v :: term

  # Errors

  defmodule NotFoundError do
    @moduledoc """
    Not found error.
    """

    @type t :: %__MODULE__{message: String.t()}

    defexception ~w{message}a
  end

  # Behaviour

  @doc """
  Create a new storage.
  """
  @callback new(opts) :: t

  @doc """
  Fetch an item from storage.
  """
  @callback fetch(t, k) :: {:ok, v} | {:error, NotFoundError.t()}

  @doc """
  Fetch an item from storage.
  """
  @callback fetch!(t, k) :: v

  @doc """
  Insert an item into storage
  """
  @callback insert(t, k, v) :: t

  @doc """
  Update an item in storage.
  """
  @callback update(t, k, function) :: {:ok, t}

  @doc """
  Update an item in storage.
  """
  @callback update!(t, k, function) :: t

  @doc """
  Upsert an item in storage.
  """
  @callback upsert(t, k, v, function) :: t

  @doc """
  Delete an item from storage.
  """
  @callback delete(t, k) :: t

  @doc """
  Return the count of items in storage.
  """
  @callback count(t) :: integer

  @doc """
  Serialize the storage.
  """
  @callback serialize(t) :: binary

  # Public

  @doc """
  Fetch an item from storage.

  ## Examples

      iex> storage = Storage.Map.new()
      iex> Storage.fetch(storage, :name)
      {:error, %Storage.NotFoundError{message: "Not found"}}

      iex> storage = Storage.Map.new()
      iex> storage = Storage.insert(storage, :name, "Haystack")
      iex> Storage.fetch(storage, :name)
      {:ok, "Haystack"}

  """
  @spec fetch(t, k) :: {:ok, v} | {:error, NotFoundError.t()}
  def fetch(storage, k),
    do: delegate(storage, :fetch, [k])

  @doc """
  Fetch an item from storage.

  ## Examples

      iex> storage = Storage.Map.new()
      iex> storage = Storage.insert(storage, :name, "Haystack")
      iex> Storage.fetch!(storage, :name)
      "Haystack"

  """
  @spec fetch!(t, k) :: v
  def fetch!(storage, k),
    do: delegate(storage, :fetch!, [k])

  @doc """
  Insert an item into storage

  ## Examples

      iex> storage = Storage.Map.new()
      iex> storage = Storage.insert(storage, :name, "Haystack")
      iex> Storage.fetch!(storage, :name)
      "Haystack"

  """
  @spec insert(t, k, v) :: t
  def insert(storage, k, v),
    do: delegate(storage, :insert, [k, v])

  @doc """
  Update an item in storage.

  ## Examples

      iex> storage = Storage.Map.new()
      iex> Storage.update(storage, :name, &String.upcase/1)
      {:error, %Storage.NotFoundError{message: "Not found"}}

      iex> storage = Storage.Map.new()
      iex> storage = Storage.insert(storage, :name, "Haystack")
      iex> {:ok, storage} = Storage.update(storage, :name, &String.upcase/1)
      iex> Storage.fetch!(storage, :name)
      "HAYSTACK"

  """
  @spec update(t, k, function) :: {:ok, t}
  def update(storage, k, f),
    do: delegate(storage, :update, [k, f])

  @doc """
  Update an item in storage.

  ## Examples

      iex> storage = Storage.Map.new()
      iex> storage = Storage.insert(storage, :name, "Haystack")
      iex> storage = Storage.update!(storage, :name, &String.upcase/1)
      iex> Storage.fetch!(storage, :name)
      "HAYSTACK"

  """
  @spec update!(t, k, function) :: t
  def update!(storage, k, f),
    do: delegate(storage, :update!, [k, f])

  @doc """
  Upsert an item in storage.

  ## Examples

      iex> storage = Storage.Map.new()
      iex> storage = Storage.upsert(storage, :name, "HAYSTACK", &String.upcase/1)
      iex> Storage.fetch!(storage, :name)
      "HAYSTACK"

      iex> storage = Storage.Map.new()
      iex> storage = Storage.insert(storage, :name, "Haystack")
      iex> storage = Storage.upsert(storage, :name, "HAYSTACK", &String.upcase/1)
      iex> Storage.fetch!(storage, :name)
      "HAYSTACK"

  """
  @spec upsert(t, k, v, function) :: t
  def upsert(storage, k, v, f),
    do: delegate(storage, :upsert, [k, v, f])

  @doc """
  Delete an item from storage.

  ## Examples

      iex> storage = Storage.Map.new()
      iex> storage = Storage.delete(storage, :name)
      iex> Storage.fetch(storage, :name)
      {:error, %Storage.NotFoundError{message: "Not found"}}

      iex> storage = Storage.Map.new()
      iex> storage = Storage.insert(storage, :name, "Haystack")
      iex> storage = Storage.delete(storage, :name)
      iex> Storage.fetch(storage, :name)
      {:error, %Storage.NotFoundError{message: "Not found"}}

  """
  @spec delete(t, k) :: t
  def delete(storage, k),
    do: delegate(storage, :delete, [k])

  @doc """
  Return the count of items in storage.

  ## Examples

      iex> storage = Storage.Map.new()
      iex> storage = Storage.insert(storage, :name, "Haystack")
      iex> Storage.count(storage)
      1

  """
  @spec count(t) :: integer
  def count(storage),
    do: delegate(storage, :count, [])

  @doc """
  Serialize the storage.

  ## Examples

      iex> storage = Storage.Map.new()
      iex> Storage.serialize(storage)

  """
  @spec serialize(t) :: binary
  def serialize(storage),
    do: delegate(storage, :serialize, [])

  @doc """
  Deserialize the storage.

  ## Examples

      iex> binary = Storage.serialize(Storage.Map.new())
      iex> Storage.deserialize(binary)

  """
  @spec deserialize(binary) :: t
  def deserialize(binary),
    do: :erlang.binary_to_term(binary)

  # Private

  defp delegate(%module{} = storage, func, args),
    do: apply(module, func, [storage] ++ args)
end
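The serialize/deserialize pair is a thin wrapper around Erlang’s external term format, which round-trips any term:

```elixir
storage = %{data: %{name: "Haystack"}}

# term_to_binary/1 and binary_to_term/1 are inverses, so a
# serialized storage can be written to disk and restored later.
binary = :erlang.term_to_binary(storage)
:erlang.binary_to_term(binary) == storage
# => true
```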

The default map storage implementation is perfect for testing and prototyping. Using a map makes it very easy to make confident assertions without the overhead of complicated storage concerns. Here is the Map storage implementation module:

defmodule Haystack.Storage.Map do
  @moduledoc """
  A map implementation of the storage behaviour.
  """

  alias Haystack.Storage

  @behaviour Haystack.Storage

  # Types

  @type t :: %__MODULE__{data: %{term => term}}

  @enforce_keys ~w{data}a

  defstruct @enforce_keys

  # Behaviour: Haystack.Storage

  @doc """
  Create a new storage.

  ## Examples

      iex> Storage.Map.new()

  """
  @impl Haystack.Storage
  def new(_opts \\ []),
    do: struct(__MODULE__, data: %{})

  @doc """
  Fetch an item from storage.

  ## Examples

      iex> storage = Storage.Map.new()
      iex> Storage.Map.fetch(storage, :name)
      {:error, %Storage.NotFoundError{message: "Not found"}}

      iex> storage = Storage.Map.new()
      iex> storage = Storage.Map.insert(storage, :name, "Haystack")
      iex> Storage.Map.fetch(storage, :name)
      {:ok, "Haystack"}

  """
  @impl Haystack.Storage
  def fetch(storage, k) do
    case Map.fetch(storage.data, k) do
      {:ok, v} -> {:ok, v}
      :error -> {:error, %Storage.NotFoundError{message: "Not found"}}
    end
  end

  @doc """
  Fetch an item from storage.

  ## Examples

      iex> storage = Storage.Map.new()
      iex> storage = Storage.Map.insert(storage, :name, "Haystack")
      iex> Storage.Map.fetch!(storage, :name)
      "Haystack"

  """
  @impl Haystack.Storage
  def fetch!(storage, k) do
    case fetch(storage, k) do
      {:ok, v} -> v
      {:error, error} -> raise error
    end
  end

  @doc """
  Insert an item into storage.

  ## Examples

      iex> storage = Storage.Map.new()
      iex> storage = Storage.Map.insert(storage, :name, "Haystack")
      iex> Storage.Map.fetch!(storage, :name)
      "Haystack"

  """
  @impl Haystack.Storage
  def insert(storage, k, v),
    do: Map.update!(storage, :data, &Map.put(&1, k, v))

  @doc """
  Update an item in storage.

  ## Examples

      iex> storage = Storage.Map.new()
      iex> Storage.Map.update(storage, :name, &String.upcase/1)
      {:error, %Storage.NotFoundError{message: "Not found"}}

      iex> storage = Storage.Map.new()
      iex> storage = Storage.Map.insert(storage, :name, "Haystack")
      iex> {:ok, storage} = Storage.Map.update(storage, :name, &String.upcase/1)
      iex> Storage.Map.fetch!(storage, :name)
      "HAYSTACK"

  """
  @impl Haystack.Storage
  def update(storage, k, f) do
    with {:ok, v} <- fetch(storage, k) do
      {:ok, Map.update!(storage, :data, &Map.put(&1, k, f.(v)))}
    end
  end

  @doc """
  Update an item in storage.

  ## Examples

      iex> storage = Storage.Map.new()
      iex> storage = Storage.Map.insert(storage, :name, "Haystack")
      iex> storage = Storage.Map.update!(storage, :name, &String.upcase/1)
      iex> Storage.Map.fetch!(storage, :name)
      "HAYSTACK"

  """
  @impl Haystack.Storage
  def update!(storage, k, f) do
    case update(storage, k, f) do
      {:ok, storage} -> storage
      {:error, error} -> raise error
    end
  end

  @doc """
  Upsert an item in storage.

  ## Examples

      iex> storage = Storage.Map.new()
      iex> storage = Storage.Map.upsert(storage, :name, "HAYSTACK", &String.upcase/1)
      iex> Storage.Map.fetch!(storage, :name)
      "HAYSTACK"

      iex> storage = Storage.Map.new()
      iex> storage = Storage.Map.insert(storage, :name, "Haystack")
      iex> storage = Storage.Map.upsert(storage, :name, "HAYSTACK", &String.upcase/1)
      iex> Storage.Map.fetch!(storage, :name)
      "HAYSTACK"

  """
  @impl Haystack.Storage
  def upsert(storage, k, v, f) do
    case update(storage, k, f) do
      {:ok, storage} -> storage
      {:error, _} -> insert(storage, k, v)
    end
  end

  @doc """
  Delete an item from storage.

  ## Examples

      iex> storage = Storage.Map.new()
      iex> storage = Storage.Map.delete(storage, :name)
      iex> Storage.Map.fetch(storage, :name)
      {:error, %Storage.NotFoundError{message: "Not found"}}

      iex> storage = Storage.Map.new()
      iex> storage = Storage.Map.insert(storage, :name, "Haystack")
      iex> storage = Storage.Map.delete(storage, :name)
      iex> Storage.Map.fetch(storage, :name)
      {:error, %Storage.NotFoundError{message: "Not found"}}

  """
  @impl Haystack.Storage
  def delete(storage, k),
    do: Map.update!(storage, :data, &Map.delete(&1, k))

  @doc """
  Return the count of items in storage.

  ## Examples

      iex> storage = Storage.Map.new()
      iex> storage = Storage.insert(storage, :name, "Haystack")
      iex> Storage.count(storage)
      1

  """
  @impl Haystack.Storage
  def count(storage),
    do: Enum.count(storage.data)

  @doc """
  Serialize the storage.

  ## Examples

      iex> storage = Storage.Map.new()
      iex> Storage.serialize(storage)

  """
  @impl Haystack.Storage
  def serialize(storage),
    do: :erlang.term_to_binary(storage)
end

The second out-of-the-box storage implementation I want to provide is one backed by ETS. This storage implementation can be used in production applications and means anyone can use Haystack without the friction of having to provide a way to store data. Here is the ETS storage implementation:

defmodule Haystack.Storage.ETS do
  @moduledoc """
  An ETS implementation of the storage behaviour.
  """

  use GenServer

  alias Haystack.Storage

  @behaviour Haystack.Storage

  @type t :: %__MODULE__{
          name: GenServer.name(),
          table: atom,
          data: list({term, term}),
          load: nil | fun()
        }

  @enforce_keys ~w{name table data load}a

  defstruct @enforce_keys

  # Behaviour: Haystack.Storage

  @doc """
  Create a new storage.
  """
  @impl Haystack.Storage
  def new(opts \\ []) do
    opts =
      opts
      |> Keyword.put(:data, [])
      |> Keyword.put_new(:load, nil)
      |> Keyword.put_new(:name, :haystack)
      |> Keyword.put_new(:table, :haystack)

    struct(__MODULE__, opts)
  end

  @doc """
  Fetch an item from storage.
  """
  @impl Haystack.Storage
  def fetch(storage, k) do
    case :ets.lookup(storage.table, k) do
      [] -> {:error, %Storage.NotFoundError{message: "Not found"}}
      [{^k, v}] -> {:ok, v}
    end
  end

  @doc """
  Fetch an item from storage.
  """
  @impl Haystack.Storage
  def fetch!(storage, k) do
    case fetch(storage, k) do
      {:ok, v} -> v
      {:error, error} -> raise error
    end
  end

  @doc """
  Insert an item into storage.
  """
  @impl Haystack.Storage
  def insert(storage, k, v),
    do: GenServer.call(storage.name, {:insert, {k, v}})

  @doc """
  Update an item in storage.
  """
  @impl Haystack.Storage
  def update(storage, k, f) do
    with {:ok, v} <- fetch(storage, k) do
      {:ok, GenServer.call(storage.name, {:insert, {k, f.(v)}})}
    end
  end

  @doc """
  Update an item in storage.
  """
  @impl Haystack.Storage
  def update!(storage, k, f) do
    case update(storage, k, f) do
      {:ok, storage} -> storage
      {:error, error} -> raise error
    end
  end

  @doc """
  Upsert an item in storage.
  """
  @impl Haystack.Storage
  def upsert(storage, k, v, f) do
    case update(storage, k, f) do
      {:ok, storage} -> storage
      {:error, _} -> insert(storage, k, v)
    end
  end

  @doc """
  Delete an item from storage.
  """
  @impl Haystack.Storage
  def delete(storage, k),
    do: GenServer.call(storage.name, {:delete, k})

  @doc """
  Return the count of items in storage.
  """
  @impl Haystack.Storage
  def count(storage) do
    info = :ets.info(storage.table)

    Keyword.fetch!(info, :size)
  end

  @doc """
  Serialize the storage.
  """
  @impl Haystack.Storage
  def serialize(storage),
    do: GenServer.call(storage.name, :serialize)

  # Behaviour: GenServer

  def child_spec(opts) do
    storage = Keyword.fetch!(opts, :storage)

    %{id: {__MODULE__, storage.table}, start: {__MODULE__, :start_link, [opts]}}
  end

  def start_link(opts) do
    {storage, opts} = Keyword.pop!(opts, :storage)

    GenServer.start_link(__MODULE__, storage, [name: storage.name] ++ opts)
  end

  @impl true
  def init(storage) do
    :ets.new(storage.table, [:set, :protected, :named_table, :compressed])

    {:ok, storage, {:continue, :ok}}
  end

  @impl true
  def handle_call({:insert, {k, v}}, _from, storage) do
    true = :ets.insert(storage.table, {k, v})

    {:reply, storage, storage}
  end

  def handle_call({:delete, k}, _from, storage) do
    true = :ets.delete(storage.table, k)

    {:reply, storage, storage}
  end

  def handle_call(:serialize, _from, storage) do
    data = :ets.tab2list(storage.table)

    {:reply, :erlang.term_to_binary(%{storage | data: data}), storage}
  end

  @impl true
  def handle_continue(:ok, %{load: nil} = storage), do: {:noreply, storage}

  def handle_continue(:ok, storage) do
    data = storage.load.()

    Enum.each(data, &:ets.insert(storage.table, &1))

    :erlang.garbage_collect(self())

    {:noreply, %{storage | load: nil}}
  end
end
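Since the ETS storage is a GenServer, the process needs to be started before the storage can be used. Here's a sketch of using it directly; the :name and :table values are illustrative, and in a real application you would add the storage to your supervision tree via child_spec/1 rather than calling start_link/1 yourself:

```elixir
alias Haystack.Storage

storage = Storage.ETS.new(name: :my_haystack, table: :my_haystack)

# Starting the process creates the named ETS table.
{:ok, _pid} = Storage.ETS.start_link(storage: storage)

# Writes go through the GenServer; reads hit ETS directly.
storage = Storage.ETS.insert(storage, :name, "Haystack")
Storage.ETS.fetch!(storage, :name)
# => "Haystack"
```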

Of course, if you aren’t happy with either of these implementations, you can provide your own by implementing the behaviour callbacks, backed by a totally different method of persisting data; the choice is yours.

Building the Index

The main control centre of Haystack is the Index module. This module is responsible for the configuration of the index, how data is stored in the index, and providing a public interface for the underlying modules that implement the functionality.

Here’s the first part of the Index module:

defmodule Haystack.Index do
  @moduledoc """
  A module for managing indexes.
  The index is used to manage configuration and acts as a central point to add,
  update, delete, or query the storage.
  """

  alias Haystack.{Index, Query, Storage, Tokenizer, Transformer}

  # Types

  @type attrs :: %{insert: list({module, Keyword.t()}), delete: list({module, Keyword.t()})}
  @type fields :: %{Index.Field.k() => Index.Field.t()}
  @type opts :: Keyword.t()
  @type t :: %__MODULE__{
          attrs: attrs,
          fields: fields,
          name: atom,
          ref: Index.Field.t(),
          storage: struct()
        }

  @enforce_keys ~w{attrs fields name ref storage}a

  defstruct @enforce_keys

  # Public

  @doc """
  Create a new index.
  ## Examples
      iex> Index.new(:animals)
  """
  @spec new(atom, opts) :: t
  def new(name, opts \\ []) do
    opts =
      opts
      |> Keyword.put(:name, name)
      |> Keyword.put(:fields, %{})
      |> Keyword.put_new(:storage, Storage.Map.new([]))
      |> Keyword.put_new(:attrs, Index.Attr.default())

    struct(__MODULE__, opts)
  end

  @doc """
  Set the attrs of the index.
  ## Examples
      iex> index = Index.new(:animals)
      iex> Index.attrs(index, %{insert: [Global], delete: [Global]})
  """
  @spec attrs(t, attrs) :: t
  def attrs(index, attrs),
    do: %{index | attrs: attrs}

  @doc """
  Add a ref to the index.
  ## Examples
      iex> index = Index.new(:animals)
      iex> Index.ref(index, Index.Field.new("id"))
  """
  @spec ref(t, Index.Field.t()) :: t
  def ref(index, ref),
    do: %{index | ref: ref}

  @doc """
  Add a field to the index.
  ## Examples
      iex> index = Index.new(:animals)
      iex> Index.field(index, Index.Field.new("name"))
  """
  @spec field(t, Index.Field.t()) :: t
  def field(index, field),
    do: %{index | fields: Map.put(index.fields, field.k, field)}

  @doc """
  Set the storage on the index.
  ## Examples
      iex> index = Index.new(:animals)
      iex> Index.storage(index, Storage.Map.new())
  """
  @spec storage(t, struct) :: t
  def storage(index, storage),
    do: %{index | storage: storage}
end

The index has a struct that captures the name of the index, a map of fields, and the ref we saw configured at the start, as well as the chosen storage implementation. The index also has the concept of attrs, which we will look at shortly.

The first index-related submodule to look at is Index.Document. This module is responsible for normalizing, extracting, tokenizing, and transforming the data. We also convert the transformed data into an %Index.Document{} struct so we have a formalised data structure to pass around. Here is the Index.Document module in full:

defmodule Haystack.Index.Document do
  @moduledoc """
  A module for preparing documents to be stored.
  """

  alias Haystack.{Tokenizer, Transformer}

  # Types

  @type t :: %__MODULE__{
          ref: String.t(),
          fields: %{String.t() => list(Tokenizer.Token.t())}
        }

  @enforce_keys ~w{ref fields}a

  defstruct @enforce_keys

  # Public

  @doc """
  Create a new document.
  """
  def new(index, params) do
    fields = [index.ref | index.fields |> Map.values()]

    {[%{v: ref}], fields} =
      params
      |> normalize()
      |> extract(fields)
      |> tokenize()
      |> transform()
      |> Enum.into(%{})
      |> Map.pop!(index.ref.k)

    struct(__MODULE__, ref: ref, fields: fields)
  end

  # Private

  # Normalize the keys of the map to strings
  defp normalize(map) when is_map(map),
    do: Enum.map(map, &normalize/1) |> Enum.into(%{})

  defp normalize({k, v}) when is_map(v),
    do: {to_string(k), v |> Enum.map(&normalize/1) |> Enum.into(%{})}

  defp normalize({k, v}),
    do: {to_string(k), v}

  #  Extract the fields from the map
  defp extract(map, fields) do
    fields
    |> Enum.map(fn field -> {get_in(map, field.path), field} end)
    |> Enum.reject(fn {v, _field} -> is_nil(v) end)
    |> Enum.map(fn {v, field} -> {field.k, {v, field}} end)
  end

  # Tokenize the values
  def tokenize(list) do
    Enum.map(list, fn {k, {v, field}} ->
      {k, {Tokenizer.tokenize(v, field.separator), field}}
    end)
  end

  # Transform the values
  def transform(list) do
    Enum.map(list, fn {k, {tokens, field}} ->
      tokens = Transformer.pipeline(tokens, field.transformers)
      counts = Enum.frequencies_by(tokens, & &1.v)
      groups = Enum.group_by(tokens, & &1.v)

      values =
        Enum.map(groups, fn {v, group} ->
          %{}
          |> Map.put(:v, v)
          |> Map.put(:positions, Enum.map(group, &{&1.offset, &1.length}))
          |> Map.put(:tf, Map.get(counts, v) / Enum.count(tokens))
        end)

      {k, values}
    end)
  end
end

The :tf value is the term frequency: the number of times the term appears in the given field, divided by the total number of tokens in that field. It’s a measure of how relevant a token is within the document.
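As a quick standalone illustration of the calculation (mirroring the Enum.frequencies_by/2 logic above):

```elixir
# A field tokenized to three tokens, where "red" appears twice:
tokens = ~w(red panda red)

counts = Enum.frequencies(tokens)
tf = counts["red"] / length(tokens)
# tf == 2 / 3, i.e. roughly 0.667
```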

The next interesting thing to look at is the index attrs. An important part of building a search index is the data that you store. The data is stored as an “inverted index”, which means each term maps to the set of documents that contain it.
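To make that concrete, here’s a toy standalone sketch of an inverted index; this is for illustration only, not Haystack’s actual storage format:

```elixir
docs = %{"1" => "red panda", "2" => "giant panda"}

# Invert the mapping: each term points to the refs of the
# documents that contain it.
inverted =
  docs
  |> Enum.flat_map(fn {ref, text} ->
    text |> String.split() |> Enum.map(&{&1, ref})
  end)
  |> Enum.group_by(fn {term, _ref} -> term end, fn {_term, ref} -> ref end)

inverted["panda"]
# => ["1", "2"]
```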

This is simple enough to build out in isolation but once you start layering on additional data to be stored, things can become more complicated. In order to make the responsibility of storing data as simple and as configurable as possible, I wanted to define those attrs as interchangeable modules that could be added, removed, or replaced.

This has a number of benefits.

Firstly it makes implementing the core set of attributes to be stored very simple because we’ve formalised those stages into discrete steps.

Secondly, this makes it possible for any consumer of Haystack to add, remove, or replace any of the attrs and completely control the data that is processed and stored as part of their search index. For example, if you wanted to store word embeddings so you could extend Haystack to offer semantic search, you could simply add a new implementation to the list of existing attrs instead of diving into the source code of an increasingly complicated module.

So to achieve this, we once again define a behaviour that each of the attr modules must implement:

@doc """
The attr key.
"""
@callback key(opts) :: tuple

@doc """
Insert an attr.
"""
@callback insert(Index.t(), Index.Document.t(), opts) :: Index.t()

@doc """
Delete an attr.
"""
@callback delete(Index.t(), Index.Document.t(), opts) :: Index.t()

This behaviour requires that implementations provide functions for inserting and deleting data, as well as a unique key that represents the attr type in storage. An interesting thing to note is that updating has not been included as part of this behaviour, because it can be seen as the composition of deleting and inserting. Another decision I made is that in order to delete data from the index, the consumer must supply the original document to be deleted, rather than just the ref. This decreases the amount of data that must be stored, a trade-off worth making so that Haystack can be used in relatively low-memory environments.

As part of this module I’ve also added a list of default attrs that will be used unless the developer supplies a custom list. Here is the Attr module in full:

defmodule Haystack.Index.Attr do
  @moduledoc """
  A behaviour for defining a store attr.
  """

  alias Haystack.Index

  @attrs [
    {Index.Attr.Global, []},
    {Index.Attr.Docs, []},
    {Index.Attr.Meta, include: [:tf]}
  ]

  # Types

  @type opts :: Keyword.t()

  # Behaviour

  @doc """
  The attr key.
  """
  @callback key(opts) :: tuple

  @doc """
  Insert an attr.
  """
  @callback insert(Index.t(), Index.Document.t(), opts) :: Index.t()

  @doc """
  Delete an attr.
  """
  @callback delete(Index.t(), Index.Document.t(), opts) :: Index.t()

  # Public

  @doc """
  Return the default attrs.
  """
  @spec default :: %{insert: list({module, Keyword.t()}), delete: list({module, Keyword.t()})}
  def default, do: %{insert: @attrs, delete: Enum.reverse(@attrs)}
end

Next we can look at the default attr modules.

The first attr module simply stores a global list of all document refs. We need this list to calculate the inverse document frequency score later in this blog post. Here’s what that module looks like:

defmodule Haystack.Index.Attr.Global do
  @moduledoc """
  A module for storing the global list of refs.
  """

  import Record

  alias Haystack.{Index, Storage}

  @behaviour Index.Attr

  defrecord :global, []

  # Behaviour: Index.Attr

  @impl Index.Attr
  def key(_opts \\ []), do: global()

  @impl Index.Attr
  def insert(index, %{ref: ref}, _opts \\ []) do
    storage = Storage.upsert(index.storage, key(), [ref], &(&1 ++ [ref]))

    Index.storage(index, storage)
  end

  @impl Index.Attr
  def delete(index, %{ref: ref}, _opts \\ []) do
    storage = Storage.upsert(index.storage, key(), ref, &(&1 -- [ref]))

    storage =
      case Storage.fetch!(storage, key()) do
        [] -> Storage.delete(storage, key())
        _ -> storage
      end

    Index.storage(index, storage)
  end
end

The next attr module stores a list of documents that have a particular field, term pair:

defmodule Haystack.Index.Attr.Docs do
  @moduledoc """
  A module for returning a list of docs for a given field, term.
  """

  import Record

  alias Haystack.{Index, Storage}

  @behaviour Index.Attr

  defrecord :docs, field: nil, term: nil

  # Behaviour: Index.Attr

  @impl Index.Attr
  def key(field: field, term: term),
    do: docs(field: field, term: term)

  @impl Index.Attr
  def insert(index, %{ref: ref, fields: fields}, _opts \\ []) do
    storage =
      Enum.reduce(fields, index.storage, fn {field, terms}, storage ->
        Enum.reduce(terms, storage, fn term, storage ->
          k = key(field: field, term: term.v)
          Storage.upsert(storage, k, [ref], &[ref | &1])
        end)
      end)

    Index.storage(index, storage)
  end

  @impl Index.Attr
  def delete(index, %{ref: ref, fields: fields}, _opts \\ []) do
    storage =
      Enum.reduce(fields, index.storage, fn {field, terms}, storage ->
        Enum.reduce(terms, storage, fn term, storage ->
          k = key(field: field, term: term.v)

          case Storage.fetch!(storage, k) do
            [^ref] -> Storage.delete(storage, k)
            refs -> Storage.insert(storage, k, refs -- [ref])
          end
        end)
      end)

    Index.storage(index, storage)
  end
end

This makes it really easy to find the list of documents that contain a specific term in a given field, because the data is stored as an inverted index.

Finally we have a meta attr that stores additional metadata about the field, term pair for a given document:

defmodule Haystack.Index.Attr.Meta do
  @moduledoc """
  A module for storing the meta for a field, term.
  """

  import Record

  alias Haystack.{Index, Storage}

  @behaviour Index.Attr

  defrecord :meta, ref: nil, field: nil, term: nil

  # Behaviour: Index.Attr

  @impl Index.Attr
  def key(ref: ref, field: field, term: term),
    do: meta(ref: ref, field: field, term: term)

  @impl Index.Attr
  def insert(index, %{ref: ref, fields: fields}, opts \\ []) do
    include = Keyword.get(opts, :include, [:tf])

    storage =
      Enum.reduce(fields, index.storage, fn {field, terms}, storage ->
        Enum.reduce(terms, storage, fn term, storage ->
          k = key(ref: ref, field: field, term: term.v)
          Storage.insert(storage, k, Map.take(term, include))
        end)
      end)

    Index.storage(index, storage)
  end

  @impl Index.Attr
  def delete(index, %{ref: ref, fields: fields}, _opts \\ []) do
    storage =
      Enum.reduce(fields, index.storage, fn {field, terms}, storage ->
        Enum.reduce(terms, storage, fn term, storage ->
          k = key(ref: ref, field: field, term: term.v)
          Storage.delete(storage, k)
        end)
      end)

    Index.storage(index, storage)
  end
end

The metadata that is stored is the term frequency and, optionally, the positions we calculated earlier. By default we don’t store the positions, as that extra data would increase memory usage and might not be needed.

Back in the Index module, we can add a couple of functions for adding and deleting data from the index:

@doc """
Add documents to the index.
## Examples
    iex> index = Index.new(:animals)
    iex> Index.add(index, [])
"""
@spec add(t, list(map)) :: t
def add(index, data) do
  docs = Enum.map(data, &Index.Document.new(index, &1))

  Enum.reduce(docs, index, fn doc, index ->
    Enum.reduce(index.attrs.insert, index, fn {module, opts}, index ->
      module.insert(index, doc, opts)
    end)
  end)
end

@doc """
Delete documents in the index.
## Examples
    iex> index = Index.new(:animals)
    iex> Index.delete(index, [])
"""
@spec delete(t, list(map)) :: t
def delete(index, data) do
  docs = Enum.map(data, &Index.Document.new(index, &1))

  Enum.reduce(docs, index, fn doc, index ->
    Enum.reduce(index.attrs.delete, index, fn {module, opts}, index ->
      module.delete(index, doc, opts)
    end)
  end)
end

These two functions use the Index.Document module to create the index documents and then reduce over the list of attr modules for the given action, applying each attr to the index for each document.
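Putting this together, here’s a sketch of configuring an index and adding and deleting documents, assuming the default Map storage and default attrs; the field names and document are illustrative:

```elixir
alias Haystack.Index

index =
  Index.new(:animals)
  |> Index.ref(Index.Field.new("id"))
  |> Index.field(Index.Field.new("name"))

# Adding runs each document through every configured insert attr.
index = Index.add(index, [%{id: 1, name: "Red Panda"}])

# Deleting requires the original document, not just the ref.
index = Index.delete(index, [%{id: 1, name: "Red Panda"}])
```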

Defining queries

The final part of Haystack that needs to be built is a method for defining queries and then applying them to the index to get a list of results.

Once again this was an area of Haystack that I wanted to be easy to implement whilst also being easy to extend or modify by an engineer without having to make a PR to the main repository.

A query is a composable data structure of clauses and expressions that can be evaluated against the index. Similar to a lot of the modules we’ve looked at previously, each clause and expression is implemented as an implementation of a behaviour so that they can be swapped, replaced or added to very easily.

Queries should also be infinitely nestable to allow the engineer to define complex queries that meet their requirements. So with that being said, let’s take a look at how queries work in Haystack.

Here’s the Query module:

defmodule Haystack.Query do
  @moduledoc """
  A module for building queries.
  """

  alias Haystack.{Index, Query}
  alias Haystack.Tokenizer.Token

  # Types

  @type statement :: (Query.t(), Index.t() -> list(map()))
  @type t :: %__MODULE__{
          clause: Query.Clause.t(),
          config: Keyword.t()
        }

  @enforce_keys ~w{index clause config}a

  defstruct @enforce_keys

  @config clause: Query.Clause.default(),
          expression: Query.Expression.default()

  # Public

  @doc """
  Create a new Query.
  """
  @spec new(Keyword.t()) :: t
  def new(opts \\ []),
    do: struct(__MODULE__, Keyword.put_new(opts, :config, @config))

  @doc """
  Add a clause to the query.
  """
  @spec clause(t, Query.Clause.t()) :: t
  def clause(query, clause),
    do: %{query | clause: clause}

  @doc """
  Evaluate a clause.
  """
  @spec evaluate(t(), Index.t(), Query.Clause.t(), list(statement)) :: list(map())
  def evaluate(query, index, clause, statements) do
    module = get_in(query.config, [:clause, clause.k])
    module.evaluate(query, index, statements)
  end

  @doc """
  Evaluate an expression.
  """
  @spec evaluate(t(), Index.t(), Query.Expression.t()) :: list(map())
  def evaluate(query, index, expression) do
    module = get_in(query.config, [:expression, expression.k])
    module.evaluate(index, expression)
  end

  @doc """
  Run the given query.
  """
  @spec run(t, Index.t()) :: list(map)
  def run(query, index) do
    statement = Query.Clause.build(query.clause)
    responses = statement.(query, index)

    responses
    |> Query.IDF.calculate(index.storage)
    |> Enum.group_by(& &1.ref)
    |> Enum.map(fn {ref, fields} ->
      score = Enum.reduce(fields, 0, &(&1.score + &2))

      fields = Enum.group_by(fields, & &1.field)

      fields =
        Enum.map(fields, fn {k, list} ->
          {k, Enum.map(list, &Map.take(&1, [:positions, :term]))}
        end)

      %{ref: ref, score: score, fields: Map.new(fields)}
    end)
    |> Enum.sort_by(& &1.score, :desc)
  end

  @doc """
  Build a clause for the given list of tokens.

  ## Examples

      iex> index = Index.new(:animals)
      iex> index = Index.field(index, Index.Field.new("name"))
      iex> tokens = Tokenizer.tokenize("Red Panda")
      iex> tokens = Transformer.pipeline(tokens, Transformer.default())
      iex> Query.build(:match_all, index, tokens)

  """
  @spec build(atom, Index.t(), list(Token.t())) :: Query.Clause.t()
  def build(:match_all, index, tokens) do
    Enum.reduce(tokens, Query.Clause.new(:all), fn token, clause ->
      Enum.reduce(Map.values(index.fields), clause, fn field, clause ->
        Query.Clause.expressions(clause, [
          Query.Expression.new(:match, field: field.k, term: token.v)
        ])
      end)
    end)
  end

  def build(:match_any, index, tokens) do
    Enum.reduce(tokens, Query.Clause.new(:any), fn token, clause ->
      Enum.reduce(Map.values(index.fields), clause, fn field, clause ->
        Query.Clause.expressions(clause, [
          Query.Expression.new(:match, field: field.k, term: token.v)
        ])
      end)
    end)
  end
end

The query module has a struct that contains a single clause that is used as the root of the query. The struct also has a config keyword list that allows the engineer to set their own mapping of clauses and expressions.

Here’s a rundown of the functions in this module.

First up we provide helpers to create a new query and to add a clause to the query. If the config is not supplied by the engineer, we set a default value:

@doc """
Create a new Query.
"""
@spec new(Keyword.t()) :: t
def new(opts \\ []),
  do: struct(__MODULE__, Keyword.put_new(opts, :config, @config))

@doc """
Add a clause to the query.
"""
@spec clause(t, Query.Clause.t()) :: t
def clause(query, clause),
  do: %{query | clause: clause}

Next, we have evaluate functions for clauses and expressions. These functions basically just look up the matching clause or expression type in the config and then delegate to the configured module:

@doc """
Evaluate a clause.
"""
@spec evaluate(t(), Index.t(), Query.Clause.t(), list(statement)) :: list(map())
def evaluate(query, index, clause, statements) do
  module = get_in(query.config, [:clause, clause.k])
  module.evaluate(query, index, statements)
end

@doc """
Evaluate an expression.
"""
@spec evaluate(t(), Index.t(), Query.Expression.t()) :: list(map())
def evaluate(query, index, expression) do
  module = get_in(query.config, [:expression, expression.k])
  module.evaluate(index, expression)
end

Next, we have the run function that applies the query to the given index:

@doc """
Run the given query.
"""
@spec run(t, Index.t()) :: list(map)
def run(query, index) do
  statement = Query.Clause.build(query.clause)
  responses = statement.(query, index)

  responses
  |> Query.IDF.calculate(index.storage)
  |> Enum.group_by(& &1.ref)
  |> Enum.map(fn {ref, fields} ->
    score = Enum.reduce(fields, 0, &(&1.score + &2))

    fields = Enum.group_by(fields, & &1.field)

    fields =
      Enum.map(fields, fn {k, list} ->
        {k, Enum.map(list, &Map.take(&1, [:positions, :term]))}
      end)

    %{ref: ref, score: score, fields: Map.new(fields)}
  end)
  |> Enum.sort_by(& &1.score, :desc)
end

The run/2 function works by using the root clause to build a statement that can then be executed, and the results are returned. The IDF, or Inverse Document Frequency, is then calculated for each result; this weights terms that appear in fewer documents across the whole index more highly. Finally, we group the results by their ref, calculate the total score for each document, and sort on that score.
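The Query.IDF module isn’t shown in this post, but the idea can be sketched in isolation. A common formulation (Haystack’s exact weighting may differ) takes the logarithm of the total number of documents, which the Global attr tracks, divided by the number of documents containing the term, which the Docs attr gives us:

```elixir
# 2 of 10 documents contain the term, so the term is fairly rare
# and receives a higher weight.
total_docs = 10
docs_with_term = 2

idf = :math.log(total_docs / docs_with_term)
# A result's final score can then combine tf and idf, e.g. tf * idf.
```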

Next up we’ll look at the clause module. This module is responsible for defining the behaviour of clauses as well as offering some convenience functions. Here is that module:

defmodule Haystack.Query.Clause do
  @moduledoc """
  A module for defining query clauses.
  """

  alias Haystack.{Index, Query}
  alias Haystack.Query.Clause

  # Types

  @type k :: atom
  @type t :: %__MODULE__{
          k: k,
          clauses: list(t),
          expressions: list(Query.Expression.t())
        }

  @enforce_keys ~w{k clauses expressions}a

  defstruct @enforce_keys

  @clauses all: Clause.All,
           any: Clause.Any

  # Behaviour

  @doc """
  Evaluate the given clause.
  """
  @callback evaluate(Query.t(), Index.t(), list(Query.statement())) :: list(map())

  # Public

  @doc """
  Create a new clause.

  ## Examples

      iex> Query.Clause.new(:any)

  """
  @spec new(k) :: t
  def new(k),
    do: struct(__MODULE__, k: k, clauses: [], expressions: [])

  @doc """
  Return the default clauses.

  ## Examples

      iex> Query.Clause.default()

  """
  @spec default :: [{atom, module}]
  def default, do: @clauses

  @doc """
  Add a list of clauses.

  ## Examples

      iex> clause = Query.Clause.new(:any)
      iex> Query.Clause.clauses(clause, [Query.Clause.new(:any)])

  """
  def clauses(clause, clauses),
    do: %{clause | clauses: clause.clauses ++ clauses}

  @doc """
  Add a list of expressions.

  ## Examples

      iex> clause = Query.Clause.new(:any)
      iex> expression = Query.Expression.new(:match, field: "name", term: "Haystack")
      iex> Query.Clause.expressions(clause, [expression])

  """
  def expressions(clause, expressions),
    do: %{clause | expressions: clause.expressions ++ expressions}

  @doc """
  Build the clause into a statement.

  ## Examples

      iex> clause = Query.Clause.new(:any)
      iex> Query.Clause.build(clause)

  """
  @spec build(t) :: Query.statement()
  def build(clause) do
    fn query, index ->
      statements =
        clause.expressions
        |> Enum.map(&Query.Expression.build/1)
        |> Enum.concat(Enum.map(clause.clauses, &build/1))

      Query.evaluate(query, index, clause, statements)
    end
  end
end

The most interesting bit of this module is the build/1 function that accepts a clause and returns a function. The returned function is the statement we looked at earlier, which we use to evaluate the query against the index.

The returned statement captures a nested method of evaluating sub-clauses and expressions: for each clause, we build a list of statements from its expressions and its nested clauses, and then evaluate that list of statements.

Out of the box, Haystack has two clause implementations, for “any” and “all” queries. These modules are really simple because they essentially just evaluate the given list of statements (which could be expressions, clauses, or a mixture of both) and then take the union or the intersection of the results, depending on the type of clause.

Here’s how those two modules look:

defmodule Haystack.Query.Clause.All do
  @moduledoc """
  A module to define "all" clauses.
  """

  alias Haystack.Query

  @behaviour Query.Clause

  # Behaviour: Query.Clause

  @impl Query.Clause
  def evaluate(query, index, statements) do
    responses = Enum.map(statements, & &1.(query, index))

    results =
      Enum.map(responses, fn results ->
        results
        |> Enum.map(&Map.get(&1, :ref))
        |> MapSet.new()
      end)

    {acc, results} = List.pop_at(results, 0, MapSet.new())

    refs = Enum.reduce(results, acc, &MapSet.intersection(&2, &1))

    responses
    |> List.flatten()
    |> Enum.filter(&Enum.member?(refs, &1.ref))
  end
end
defmodule Haystack.Query.Clause.Any do
  @moduledoc """
  A module to define "any" clauses.
  """

  alias Haystack.Query

  @behaviour Query.Clause

  # Behaviour: Query.Clause

  @impl Query.Clause
  def evaluate(query, index, statements) do
    responses = Enum.map(statements, & &1.(query, index))

    results =
      Enum.map(responses, fn results ->
        results
        |> Enum.map(&Map.get(&1, :ref))
        |> MapSet.new()
      end)

    {acc, results} = List.pop_at(results, 0, MapSet.new())

    refs = Enum.reduce(results, acc, &MapSet.union(&2, &1))

    responses
    |> List.flatten()
    |> Enum.filter(&Enum.member?(refs, &1.ref))
  end
end

Next we have the expression module. You can think of this as a simplified version of a clause:

defmodule Haystack.Query.Expression do
  @moduledoc """
  A module for defining query expressions.
  """

  alias Haystack.{Index, Query}

  # Types

  @type k :: atom
  @type t :: %__MODULE__{
          k: k,
          field: String.t(),
          term: String.t()
        }

  @enforce_keys ~w{k field term}a

  defstruct @enforce_keys

  @expressions match: Query.Expression.Match

  # Behaviour

  @doc """
  Evaluate the given clause.
  """
  @callback evaluate(Index.t(), t) :: list(map())

  # Public

  @doc """
  Create a new expression.

  ## Examples

      iex> Query.Expression.new(:match, field: "name", term: "Haystack")

  """
  @spec new(k, Keyword.t()) :: t
  def new(k, opts),
    do: struct(__MODULE__, [k: k] ++ opts)

  @doc """
  Return the default expressions.

  ## Examples

      iex> Query.Expression.default()

  """
  @spec default :: [{atom, module}]
  def default, do: @expressions

  @doc """
  Build the expression into a statement.

  ## Examples

      iex> expression = Query.Expression.new(:match, field: "name", term: "Haystack")
      iex> Query.Expression.build(expression)

  """
  @spec build(t) :: Query.statement()
  def build(expression) do
    fn query, index ->
      Query.evaluate(query, index, expression)
    end
  end
end

Expressions are simply a way to evaluate a single match against the index; they do not contain any nested clauses or expressions. Out of the box there is a single match expression, which uses the inverted index to find matches:

defmodule Haystack.Query.Expression.Match do
  @moduledoc """
  A module to define "match" expressions.
  """

  alias Haystack.Query
  alias Haystack.Storage
  alias Haystack.Index.Attr.{Docs, Meta}

  @behaviour Query.Expression

  # Behaviour: Query.Expression

  @impl Query.Expression
  def evaluate(index, %Query.Expression{k: :match} = exp) do
    with {:ok, refs} <- refs(index, exp) do
      Enum.map(refs, fn ref ->
        key = Meta.key(ref: ref, field: exp.field, term: exp.term)

        meta = Storage.fetch!(index.storage, key)

        meta
        |> Map.put(:ref, ref)
        |> Map.put(:score, Map.get(meta, :tf, 0))
        |> Map.put(:field, exp.field)
        |> Map.put(:term, exp.term)
      end)
    end
  end

  # Private

  defp refs(%{storage: storage}, exp) do
    case Storage.fetch(storage, Docs.key(field: exp.field, term: exp.term)) do
      {:ok, results} -> {:ok, results}
      {:error, _} -> []
    end
  end
end

Back in the query module we can define a couple of convenience functions for building common queries:

@doc """
Build a clause for the given list of tokens.
## Examples
    iex> index = Index.new(:animals)
    iex> index = Index.field(index, Index.Field.new("name"))
    iex> tokens = Tokenizer.tokenize("Red Panda")
    iex> tokens = Transformer.pipeline(tokens, Transformer.default())
    iex> Query.build(:match_all, index, tokens)
"""
@spec build(atom, Index.t(), list(Token.t())) :: Query.Clause.t()
def build(:match_all, index, tokens) do
  Enum.reduce(tokens, Query.Clause.new(:all), fn token, clause ->
    Enum.reduce(Map.values(index.fields), clause, fn field, clause ->
      Query.Clause.expressions(clause, [
        Query.Expression.new(:match, field: field.k, term: token.v)
      ])
    end)
  end)
end

def build(:match_any, index, tokens) do
  Enum.reduce(tokens, Query.Clause.new(:any), fn token, clause ->
    Enum.reduce(Map.values(index.fields), clause, fn field, clause ->
      Query.Clause.expressions(clause, [
        Query.Expression.new(:match, field: field.k, term: token.v)
      ])
    end)
  end)
end

Both of these functions simply return a query data structure. However, there’s nothing stopping you from building your own custom, complex queries using the building blocks of clauses and expressions. You are also free to define your own implementations of the existing clauses and expressions, or to extend Haystack with new clause and expression types for different kinds of queries.
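For example, a query could be assembled by hand from the same building blocks the convenience functions use. The sketch below builds an `:any` clause matching either of two terms against a `"name"` field; the index, field, and terms are purely illustrative:

```elixir
# A hand-built query using the clause/expression building blocks.
# Assumes `index` is an existing, populated Haystack index with a
# "name" field; the terms here are illustrative.
clause =
  Query.Clause.new(:any)
  |> Query.Clause.expressions([
    Query.Expression.new(:match, field: "name", term: "red"),
    Query.Expression.new(:match, field: "name", term: "panda")
  ])

Query.new()
|> Query.clause(clause)
|> Query.run(index)
```

Because clauses and expressions are plain data, this composes the same way the generated `:match_all` and `:match_any` queries do.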

Finally, we can take a look at the index module, which provides a search function to make it easy to build and run simple queries from a string.

@doc """
Search the index.
"""
@spec search(t, String.t(), Keyword.t()) :: list(map)
def search(index, v, opts \\ []) do
  type = Keyword.get(opts, :query, :match_any)
  tokens = Tokenizer.tokenize(v)
  tokens = Transformer.pipeline(tokens, Transformer.default())

  Query.new()
  |> Query.clause(Query.build(type, index, tokens))
  |> Query.run(index)
end

Here we tokenize and transform the search string, then pass the resulting list of tokens to one of the convenience functions on the query module. By default this runs a “match any” query.
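In practice, that makes searching a one-liner. A sketch, assuming an `:animals` index with a `"name"` field has already been created and populated:

```elixir
# Default behaviour: a "match any" query over the search string.
# `index` is assumed to be an existing, populated Haystack index.
results = Index.search(index, "red panda")

# Pass the :query option to require every token to match instead.
strict = Index.search(index, "red panda", query: :match_all)
```

The `:query` option simply selects which `Query.build/3` clause builder is used, so custom query types slot in the same way.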

Conclusion

So that was a “brief” overview of Haystack. I’ve left out some details to keep this post at a reasonable length, but I’ve tried to cover the most important topics.

Haystack is currently at version 0.1.0, so the code shown in this blog post could change. If you want to see Haystack in action, take a look at the search functionality on Culttt.

The obvious question to ask yourself is, should you use Haystack? I think if you are already happy with your existing search solution it’s probably not worth the effort to switch. However, if you are unhappy with your current solution, or you are in a situation like I was where none of the existing off-the-shelf solutions make sense, I believe Haystack could be a good choice for you.

I’m going to continue maintaining Haystack because it’s now integrated into this website, and I have three other projects I want to add it to. I’d love to explore different ways to extend the functionality of Haystack via storage, attrs, and queries, and to add support for different languages. I’d also love to explore how Haystack can be used across different Beam projects such as Nerves, Elixir Desktop, Livebook, and LiveView Native, as well as use Nx and Axon to build machine learning capabilities for a search solution built entirely in Elixir.

If you’re looking to add Haystack to a Nimble Publisher powered blog, stay tuned for my next post.

Philip Brown

@philipbrown
