Mar 22, 2023
Full-text search is a common requirement in many different types of websites and web applications. I recently needed the ability to index and search content on a project I’ve been working on. However, using an off-the-shelf search solution didn’t make sense for this project.
Instead of building a closed-source solution as part of that project, I decided to build an open source library that I could use as a building block for my other Elixir projects that fit the same requirements. That open source library is Haystack, a simple, extendable full-text search engine written in Elixir.
I strongly believe projects built on the Beam have a unique set of competitive advantages in comparison to projects built in other technologies, and so I wanted to explore what that would look like for a native search index.
This blog post is an overview of how Haystack is built, the motivations for the project, and the design decisions I have made to allow Haystack to be as simple and extendable as possible.
Before we get too deep into looking at how Haystack is built, let’s first take a look at the motivations for building it in the first place.
As I mentioned in the introduction, using an off-the-shelf search solution didn’t make sense for the project I was building. The project is a simple data-driven Elixir Phoenix application. Currently the project is encapsulated as a single web server. In order to keep the complexity of the project to a minimum, I don’t want to add any additional infrastructure that I would need to set up, deploy and maintain. This means deploying an Elasticsearch cluster is out of the question.
PostgreSQL and SQLite both offer excellent full-text search solutions. This is perfect if you are already using one of these two databases in your project. However, as this project is not currently using a database, it doesn’t make sense to add one just so I can use the full-text search functionality.
For this project, and a handful of other projects I want to build, I want to keep the running cost as low as possible. This means paying for a hosted search solution is also out of the question.
And so I was left with the obvious choice of building my own solution. I felt like this was the right choice to make because the “cost” of developing and maintaining Haystack can be spread over a number of my projects that will benefit from it.
Finally, let’s be honest, building Haystack means I have something interesting to write about on Culttt. Building a search engine is one of those technical blog posts that, like building your own compiler, is just incredibly intoxicating to those of us with a predisposition for reading about the technical intricacies of how stuff works.
Whilst Haystack will definitely make some compromises when compared with existing search solutions, I do genuinely believe it has a number of advantages that make it better than the competition in certain ways.
The first and most obvious advantage, as I mentioned above, is that if you have an Elixir project that is a simple web server deployment and you don’t want the headache or additional cost of deploying and managing additional infrastructure or paying for third-party services, Haystack offers full-text search that is very easy to get started with.
If you are currently building Elixir projects, I think you will find modifying and extending Haystack to meet your requirements very easy. Doing the same for one of the existing off-the-shelf solutions is likely going to be a lot more difficult or impossible, depending on what you want to do.
And finally, if you need extremely low latency, having a solution that is embedded inside of your application is going to be faster than a solution that is forced to make a network roundtrip to return results.
Before we get into looking at how Haystack is built, we should take a brief look at the prior art and sources of inspiration.
The first source of inspiration was Elasticlunr, an existing full-text search library built entirely in Elixir.
When I was first planning the project that required full-text search, I initially planned to use Elasticlunr. However, when I came to add it to the project and take it for a test drive, I found a couple of problems that I felt made it a no-go.
After assessing how I would go about making the changes I required to make Elasticlunr work for my requirements, it felt like there were a number of fundamental changes I would need to make that would completely change the original spirit of the project. I’m usually a proponent of contributing to an existing project rather than starting your own. However, if the changes you are proposing would completely rewrite large portions of the original work, I think it’s better to start a new project.
Elasticlunr also hasn’t had any updates in the last year, despite having a PR that looks 90% finished. As full-text search is going to be a core component of a number of my projects, I didn’t want to outsource it to an abandoned project. Finally, I reached out to the author of Elasticlunr but didn’t receive a response. The combination of all of these factors meant that I couldn’t see Elasticlunr as a viable choice.
The second source of inspiration was Building a full-text search engine in 150 lines of Python code. Before building Haystack, I only had a basic understanding of how this type of project would be built. I really love these kinds of “from scratch” blog posts as they provide a guide to building something of value from first principles. I find this type of blog post to be incredibly valuable when you are trying to learn something without any prior knowledge. I hope the blog post I’m writing now will be added to that collective literature and be used by the next person as inspiration for building what they have in mind.
The rest of this blog post will be an overview of how Haystack has been built as well as the design considerations I’ve made that I hope will make the project as simple and extensible for other people to use in their projects.
Whenever I stumble upon a new interesting project, the first thing I do is look at the README.md to get an understanding of how I would use the project. So think of this section as the high-level overview of how to use Haystack.
The entry point to Haystack is the Haystack module. This module is responsible for managing the various indexes you have as part of the project, as well as any future configuration or high-level concerns. These concerns are encapsulated in the %Haystack{} struct:
haystack = Haystack.new()
Currently, the Haystack module has a single index/3 function that accepts the name of the index as an atom and a callback function to apply changes to the index. If the index does not already exist, Haystack will create one and pass it into the callback so you can configure or use it:
Haystack.index(haystack, :animals, fn index ->
index
end)
Once you have an index, the first thing you need to do is specify the fields of the index:
alias Haystack.Index
Haystack.index(haystack, :animals, fn index ->
index
|> Index.ref(Index.Field.term("id"))
|> Index.field(Index.Field.new("name"))
|> Index.field(Index.Field.new("description"))
end)
The ref is the identifier of the document. You can create as many fields on the document as you want. When you add documents to the index, Haystack will automatically extract the fields you have configured.
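As a hedged sketch of what that looks like in practice, documents are plain maps whose keys match the configured fields. Note that the Index.add/2 function name below is an assumption on my part for illustration; check the Haystack docs for the exact API:

```elixir
alias Haystack.Index

# Hypothetical example: Index.add/2 is assumed for illustration.
# Each document is a map whose keys match the configured fields;
# Haystack extracts the "id", "name" and "description" values.
docs = [
  %{"id" => 1, "name" => "Red Panda", "description" => "A small mammal native to the eastern Himalayas"},
  %{"id" => 2, "name" => "Giant Panda", "description" => "A bear native to south central China"}
]

haystack =
  Haystack.index(haystack, :animals, fn index ->
    Index.add(index, docs)
  end)
```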
You can also provide nested fields using dot syntax:
alias Haystack.Index
Haystack.index(haystack, :people, fn index ->
index
|> Index.ref(Index.Field.term("id"))
|> Index.field(Index.Field.new("name"))
|> Index.field(Index.Field.new("address.postcode"))
end)
Each Haystack index has a configurable storage implementation. By default the storage implementation is essentially a map, which is great for testing and prototyping, but probably not a good solution if you want to use Haystack as part of a real application. You can configure a custom storage implementation on the index:
alias Haystack.{Index, Storage}
Haystack.index(haystack, :animals, fn index ->
index
|> Index.ref(Index.Field.term("id"))
|> Index.field(Index.Field.new("name"))
|> Index.field(Index.Field.new("description"))
|> Index.storage(Storage.ETS.new())
end)
The ETS storage implementation is provided as one of the default implementations that ships with Haystack out of the box. If you would like to extend Haystack with your own storage implementation, it’s very easy to do so. We’ll look at doing that later in this blog post.
Finally, the most important part of a library that provides full-text search is the ability to query the indexes. Here is the most basic version of applying queries:
Haystack.index(haystack, :animals, &Index.search(&1, "Red Panda"))
This would search the :animals index for the phrase "Red Panda". Haystack also offers a powerful way to build and run search queries, which we’ll look at later in this blog post.
So that’s the high level overview of using Haystack. Let’s now jump into the details of how it is built.
The first step in building a search engine is to take the incoming data and tokenize it. Tokenizing is the process of splitting the input into individual words so that they can be indexed.
Part of the tokenization process is extracting the offset and length of each word. This makes it possible to highlight matched words in the excerpt that is provided as part of the search results.
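For example, given a token’s offset and length, you can recover the matched substring from the original value with String.slice/3, ready to be wrapped in highlight markup:

```elixir
value = "Needle in a Haystack"

# Suppose the tokenizer reported the token "haystack" with an offset
# of 12 and a length of 8. Slicing the original value recovers the
# exact substring, preserving the original casing.
{offset, len} = {12, 8}

String.slice(value, offset, len)
# => "Haystack"
```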
To extract the words and their positions from the incoming data, we can apply the same regex with different return values.
To capture alphanumeric characters, we can use the following regex:
~r/([[:alnum:]]+)/
If you wanted to capture all characters, you could use the following regex:
~r/(.+)/
You might want to use this regex for identifier type fields where it’s important to match the entire value.
We can then use the Regex.scan/3 function to apply the regex to the incoming value:
# Extract the tokenized words from the value.
defp words(v, r),
do: Regex.scan(r, v, capture: :first) |> List.flatten()
# Extract the positions from the value.
defp positions(v, r),
do: Regex.scan(r, v, capture: :first, return: :index) |> List.flatten()
The return: :index option returns the position of each word instead of the word itself. Each position is given as a tuple of the index of the first character and the length of the match.
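To make the two return values concrete, here’s what each call produces for a simple value:

```elixir
regex = ~r/([[:alnum:]]+)/

# With capture: :first, Regex.scan/3 returns the matched words.
Regex.scan(regex, "red panda", capture: :first) |> List.flatten()
# => ["red", "panda"]

# Adding return: :index returns {offset, length} tuples instead.
Regex.scan(regex, "red panda", capture: :first, return: :index) |> List.flatten()
# => [{0, 3}, {4, 5}]
```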
We can use the following tokenize/2 function to apply the regex and extract the tokens as a list. I’m wrapping each token in a struct to formalise this data structure:
@doc """
Tokenize a value with a given separator.
## Examples
iex> tokens = Tokenizer.tokenize("Needle in a Haystack", Tokenizer.separator(:default))
iex> Enum.map(tokens, & &1.v)
~w{needle in a haystack}
"""
@spec tokenize(term, Regex.t()) :: list(Token.t())
def tokenize(v, r) when is_integer(v),
do: v |> to_string() |> tokenize(r)
def tokenize(v, r) when is_binary(v) do
v = String.downcase(v)
[words(v, r), positions(v, r)]
|> Enum.zip()
|> Enum.map(fn {v, {o, l}} -> Token.new(v, offset: o, length: l) end)
end
I’m also providing some default regexes as part of this module’s API. Here is the full Tokenizer module:
defmodule Haystack.Tokenizer do
@moduledoc """
A module for tokenizing values.
This module provides utilities for tokenizing values. The default tokenizer
removes anything but alphanumeric characters and extracts the positions of
words using a start offset and length.
There's also a `:full` tokenizer that can be used when the full value should
be treated as a single token. For example, a serial code.
"""
alias Haystack.Tokenizer.Token
@default ~r/([[:alnum:]]+)/
@full ~r/(.+)/
# Public
@doc """
Return the separator
## Examples
iex> Tokenizer.separator(:default)
~r/([[:alnum:]]+)/
iex> Tokenizer.separator(:full)
~r/(.+)/
"""
@spec separator(atom) :: Regex.t()
def separator(:default), do: @default
def separator(:full), do: @full
@doc """
Tokenize a value.
## Examples
iex> tokens = Tokenizer.tokenize("Needle in a Haystack")
iex> Enum.map(tokens, & &1.v)
~w{needle in a haystack}
"""
@spec tokenize(term) :: list(Token.t())
def tokenize(v), do: tokenize(v, separator(:default))
@doc """
Tokenize a value with a given separator.
## Examples
iex> tokens = Tokenizer.tokenize("Needle in a Haystack", Tokenizer.separator(:default))
iex> Enum.map(tokens, & &1.v)
~w{needle in a haystack}
"""
@spec tokenize(term, Regex.t()) :: list(Token.t())
def tokenize(v, r) when is_integer(v),
do: v |> to_string() |> tokenize(r)
def tokenize(v, r) when is_binary(v) do
v = String.downcase(v)
[words(v, r), positions(v, r)]
|> Enum.zip()
|> Enum.map(fn {v, {o, l}} -> Token.new(v, offset: o, length: l) end)
end
# Private
# Extract the tokenized words from the value.
defp words(v, r),
do: Regex.scan(r, v, capture: :first) |> List.flatten()
# Extract the positions from the value.
defp positions(v, r),
do: Regex.scan(r, v, capture: :first, return: :index) |> List.flatten()
end
The next step after tokenization is applying transformations on the tokens. This normalizes and optimises the tokens ready for indexing.
This is one of the points of integration that I wanted to offer so that consumers of Haystack can apply their own series of transformations based on their unique requirements. However, whilst I want to make transformations configurable, I also want to have sensible defaults that “just work” out of the box.
To ensure each stage of the transformation pipeline is interchangeable, the idiomatic Elixir approach is to define a behaviour that all of the pipeline stages implement. Here’s what that behaviour looks like:
@doc """
Apply a transformation on a list of tokens.
"""
@callback transform(list(Token.t())) :: list(Token.t())
The transform/1 callback accepts a list of tokens and returns a list of tokens. Each stage of the transformation pipeline builds upon the previous stages to return a list of tokens that have been transformed, ready to be indexed.
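Because the behaviour is a single callback, writing a custom stage is straightforward. As an illustration, here’s a hypothetical transformer (not part of Haystack) that drops tokens shorter than three characters:

```elixir
defmodule MyApp.Transformer.MinLength do
  @moduledoc """
  A hypothetical transformer that removes tokens shorter than three
  characters. Not part of Haystack; shown purely as an illustration
  of implementing the behaviour.
  """

  @behaviour Haystack.Transformer

  # Reject any token whose value is shorter than three characters.
  @impl Haystack.Transformer
  def transform(tokens),
    do: Enum.reject(tokens, &(String.length(&1.v) < 3))
end
```

Because every stage shares the same contract, a module like this can be slotted into the pipeline alongside, or instead of, the defaults.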
Out of the box I want to offer two important transformation stages: a way to stem words and a way to remove stop words.
Here’s the stemmer transformer:
defmodule Haystack.Transformer.Stemmer do
@moduledoc """
A transformer for stemming words.
"""
@behaviour Haystack.Transformer
# Behaviour: Haystack.Transformer
@doc """
Apply a stemming transformation on a list of tokens.
## Examples
iex> tokens = Tokenizer.tokenize("Needle in a Haystack")
iex> tokens = Transformer.Stemmer.transform(tokens)
iex> Enum.map(tokens, & &1.v)
~w{needl in a haystack}
"""
@impl Haystack.Transformer
def transform(tokens) do
Enum.map(tokens, fn token ->
%{token | v: Stemmer.stem(token.v)}
end)
end
end
Stemming is the process of reducing each word to its stem so that different forms of the same word are grouped together. This module simply delegates to the excellent Stemmer library.
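The Stemmer package exposes a single Stemmer.stem/1 function:

```elixir
# Different forms of a word reduce to the same stem, so a search for
# one form will match documents containing the others.
Stemmer.stem("needle")
# => "needl"

Stemmer.stem("needles")
# => "needl"
```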
Here’s the stop word transformer:
defmodule Haystack.Transformer.StopWords do
@moduledoc """
A transformer for removing stop words.
"""
@behaviour Haystack.Transformer
@words Application.app_dir(:haystack, "priv/haystack/stop_words.txt")
|> File.read!()
|> String.trim()
|> String.split("\n")
# Behaviour: Haystack.Transformer
@doc """
Remove stop words from a list of tokens.
## Examples
iex> tokens = Tokenizer.tokenize("Needle in a Haystack")
iex> tokens = Transformer.StopWords.transform(tokens)
iex> Enum.map(tokens, & &1.v)
~w{needle haystack}
"""
@impl Haystack.Transformer
def transform(tokens),
do: Enum.reject(tokens, &Enum.member?(@words, &1.v))
end
This module removes common words that add very little value. The list of words is stored as a text file in the priv directory of the Haystack project.
Both of these transformation modules are replaceable by the engineer who is implementing Haystack.
Here’s the full Transformer module:
defmodule Haystack.Transformer do
@moduledoc """
A module for transforming tokens.
This module includes a transform/1 callback behaviour for implementing stages
of a transformation pipeline. The default pipeline transformers includes
stages for stemming and removing stop words.
"""
alias Haystack.Tokenizer.Token
alias Haystack.Transformer
@default [
Transformer.Stemmer,
Transformer.StopWords
]
# Behaviour
@doc """
Apply a transformation on a list of tokens.
"""
@callback transform(list(Token.t())) :: list(Token.t())
# Public
@doc """
Return the list of default transformers.
## Examples
iex> Transformer.default()
[Transformer.Stemmer, Transformer.StopWords]
"""
@spec default :: list(module)
def default, do: @default
@doc """
Apply a pipeline of transformers to a list of tokens.
## Examples
iex> tokens = Tokenizer.tokenize("Needle in a Haystack")
iex> tokens = Transformer.pipeline(tokens, Transformer.default())
iex> Enum.map(tokens, & &1.v)
~w{needl haystack}
"""
@spec pipeline(list(Token.t()), list(module)) :: list(Token.t())
def pipeline(tokens, transformers) do
Enum.reduce(transformers, tokens, fn transformer, tokens ->
transformer.transform(tokens)
end)
end
end
You will see I’ve included a list of default transformations and a pipeline/2 function that applies a list of transformations to a list of tokens. We’ll use these functions later in this blog post.
The next big area to think about is how data is stored. This is another area of the library that I want to be configurable. There are a couple of reasons for this.
Firstly, I want to make it as easy as possible to test all of the functionality of Haystack that relies on a storage implementation and so I want a simple mechanism for storage in tests. However, I also want to offer an out-of-the-box solution that can be used in a production application.
Fortunately I’m building Haystack on top of the Beam, and in the world of the Beam, we get the wonderful ETS as a key-value storage engine for free, without any third-party dependencies.
I also want the ability for anyone who uses Haystack to be able to use their own storage implementation without having to send a PR to the main Haystack repository. For example, they might want to store their data in Redis.
To achieve these design goals it’s time to define another behaviour. Here’s what that looks like:
defmodule Haystack.Storage do
@moduledoc """
A module and behaviour for storing data.
This module acts as a convenient way to delegate to the storage implementation
within Haystack.
This has been created as a behaviour to make it easy to provide your own
storage implementation. By default Haystack provides a memory storage
implementation, which uses a map to store data.
"""
# Types
@type opts :: Keyword.t()
@type t :: struct
@type k :: term
@type v :: term
# Errors
defmodule NotFoundError do
@moduledoc """
Not found error.
"""
@type t :: %__MODULE__{message: String.t()}
defexception ~w{message}a
end
# Behaviour
@doc """
Create a new storage.
"""
@callback new(opts) :: t
@doc """
Fetch an item from storage.
"""
@callback fetch(t, k) :: {:ok, v} | {:error, NotFoundError.t()}
@doc """
Fetch an item from storage.
"""
@callback fetch!(t, k) :: v
@doc """
Insert an item into storage
"""
@callback insert(t, k, v) :: t
@doc """
Update an item in storage.
"""
@callback update(t, k, function) :: {:ok, t}
@doc """
Update an item in storage.
"""
@callback update!(t, k, function) :: t
@doc """
Upsert an item in storage.
"""
@callback upsert(t, k, v, function) :: t
@doc """
Delete an item from storage.
"""
@callback delete(t, k) :: t
@doc """
Return the count of items in storage.
"""
@callback count(t) :: integer
@doc """
Serialize the storage.
"""
@callback serialize(t) :: binary
end
Each storage is expected to implement the functions of the behaviour as well as provide a struct that can be used internally to the implementation to store configuration details or data. I’ve also added some convenience functions to this module to delegate to the implementation.
Here’s the full Storage module:
defmodule Haystack.Storage do
@moduledoc """
A module and behaviour for storing data.
This module acts as a convenient way to delegate to the storage implementation
within Haystack.
This has been created as a behaviour to make it easy to provide your own
storage implementation. By default Haystack provides a memory storage
implementation, which uses a map to store data.
"""
# Types
@type opts :: Keyword.t()
@type t :: struct
@type k :: term
@type v :: term
# Errors
defmodule NotFoundError do
@moduledoc """
Not found error.
"""
@type t :: %__MODULE__{message: String.t()}
defexception ~w{message}a
end
# Behaviour
@doc """
Create a new storage.
"""
@callback new(opts) :: t
@doc """
Fetch an item from storage.
"""
@callback fetch(t, k) :: {:ok, v} | {:error, NotFoundError.t()}
@doc """
Fetch an item from storage.
"""
@callback fetch!(t, k) :: v
@doc """
Insert an item into storage
"""
@callback insert(t, k, v) :: t
@doc """
Update an item in storage.
"""
@callback update(t, k, function) :: {:ok, t}
@doc """
Update an item in storage.
"""
@callback update!(t, k, function) :: t
@doc """
Upsert an item in storage.
"""
@callback upsert(t, k, v, function) :: t
@doc """
Delete an item from storage.
"""
@callback delete(t, k) :: t
@doc """
Return the count of items in storage.
"""
@callback count(t) :: integer
@doc """
Serialize the storage.
"""
@callback serialize(t) :: binary
# Public
@doc """
Fetch an item from storage.
## Examples
iex> storage = Storage.Map.new()
iex> Storage.fetch(storage, :name)
{:error, %Storage.NotFoundError{message: "Not found"}}
iex> storage = Storage.Map.new()
iex> storage = Storage.insert(storage, :name, "Haystack")
iex> Storage.fetch(storage, :name)
{:ok, "Haystack"}
"""
@spec fetch(t, k) :: {:ok, v} | {:error, NotFoundError.t()}
def fetch(storage, k),
do: delegate(storage, :fetch, [k])
@doc """
Fetch an item from storage.
## Examples
iex> storage = Storage.Map.new()
iex> storage = Storage.insert(storage, :name, "Haystack")
iex> Storage.fetch!(storage, :name)
"Haystack"
"""
@spec fetch!(t, k) :: v
def fetch!(storage, k),
do: delegate(storage, :fetch!, [k])
@doc """
Insert an item into storage
## Examples
iex> storage = Storage.Map.new()
iex> storage = Storage.insert(storage, :name, "Haystack")
iex> Storage.fetch!(storage, :name)
"Haystack"
"""
@spec insert(t, k, v) :: t
def insert(storage, k, v),
do: delegate(storage, :insert, [k, v])
@doc """
Update an item in storage.
## Examples
iex> storage = Storage.Map.new()
iex> Storage.update(storage, :name, &String.upcase/1)
{:error, %Storage.NotFoundError{message: "Not found"}}
iex> storage = Storage.Map.new()
iex> storage = Storage.insert(storage, :name, "Haystack")
iex> {:ok, storage} = Storage.update(storage, :name, &String.upcase/1)
iex> Storage.fetch!(storage, :name)
"HAYSTACK"
"""
@spec update(t, k, function) :: {:ok, t}
def update(storage, k, f),
do: delegate(storage, :update, [k, f])
@doc """
Update an item in storage.
## Examples
iex> storage = Storage.Map.new()
iex> storage = Storage.insert(storage, :name, "Haystack")
iex> storage = Storage.update!(storage, :name, &String.upcase/1)
iex> Storage.fetch!(storage, :name)
"HAYSTACK"
"""
@spec update!(t, k, function) :: t
def update!(storage, k, f),
do: delegate(storage, :update!, [k, f])
@doc """
Upsert an item in storage.
## Examples
iex> storage = Storage.Map.new()
iex> storage = Storage.upsert(storage, :name, "HAYSTACK", &String.upcase/1)
iex> Storage.fetch!(storage, :name)
"HAYSTACK"
iex> storage = Storage.Map.new()
iex> storage = Storage.insert(storage, :name, "Haystack")
iex> storage = Storage.upsert(storage, :name, "HAYSTACK", &String.upcase/1)
iex> Storage.fetch!(storage, :name)
"HAYSTACK"
"""
@spec upsert(t, k, v, function) :: t
def upsert(storage, k, v, f),
do: delegate(storage, :upsert, [k, v, f])
@doc """
Delete an item from storage.
## Examples
iex> storage = Storage.Map.new()
iex> storage = Storage.delete(storage, :name)
iex> Storage.fetch(storage, :name)
{:error, %Storage.NotFoundError{message: "Not found"}}
iex> storage = Storage.Map.new()
iex> storage = Storage.insert(storage, :name, "Haystack")
iex> storage = Storage.delete(storage, :name)
iex> Storage.fetch(storage, :name)
{:error, %Storage.NotFoundError{message: "Not found"}}
"""
@spec delete(t, k) :: t
def delete(storage, k),
do: delegate(storage, :delete, [k])
@doc """
Return the count of items in storage.
## Examples
iex> storage = Storage.Map.new()
iex> storage = Storage.insert(storage, :name, "Haystack")
iex> Storage.count(storage)
1
"""
@spec count(t) :: integer
def count(storage),
do: delegate(storage, :count, [])
@doc """
Serialize the storage.
## Examples
iex> storage = Storage.Map.new()
iex> Storage.serialize(storage)
"""
@spec serialize(t) :: binary
def serialize(storage),
do: delegate(storage, :serialize, [])
@doc """
Deserialize the storage.
## Examples
iex> binary = Storage.serialize(Storage.Map.new())
iex> Storage.deserialize(binary)
"""
@spec deserialize(binary) :: t
def deserialize(binary),
do: :erlang.binary_to_term(binary)
# Private
defp delegate(%module{} = storage, func, args),
do: apply(module, func, [storage] ++ args)
end
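The serialize/1 and deserialize/1 functions pair naturally if you want to persist storage to disk and restore it later. A quick sketch (the file path is just an example):

```elixir
alias Haystack.Storage

# Build some storage and write its serialized form to disk.
storage = Storage.insert(Storage.Map.new(), :name, "Haystack")
File.write!("/tmp/haystack.bin", Storage.serialize(storage))

# Later, read the binary back and restore the storage struct.
restored =
  "/tmp/haystack.bin"
  |> File.read!()
  |> Storage.deserialize()

Storage.fetch!(restored, :name)
# => "Haystack"
```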
The default map storage implementation is perfect for testing and prototyping. Using a map makes it very easy to make confident assertions without the overhead of complicated storage concerns. Here is the Map storage implementation module:
defmodule Haystack.Storage.Map do
@moduledoc """
A map implementation of the storage behaviour.
"""
alias Haystack.Storage
@behaviour Haystack.Storage
# Types
@type t :: %__MODULE__{data: %{term => term}}
@enforce_keys ~w{data}a
defstruct @enforce_keys
# Behaviour: Haystack.Storage
@doc """
Create a new storage.
## Examples
iex> Storage.Map.new()
"""
@impl Haystack.Storage
def new(_opts \\ []),
do: struct(__MODULE__, data: %{})
@doc """
Fetch an item from storage.
## Examples
iex> storage = Storage.Map.new()
iex> Storage.Map.fetch(storage, :name)
{:error, %Storage.NotFoundError{message: "Not found"}}
iex> storage = Storage.Map.new()
iex> storage = Storage.Map.insert(storage, :name, "Haystack")
iex> Storage.Map.fetch(storage, :name)
{:ok, "Haystack"}
"""
@impl Haystack.Storage
def fetch(storage, k) do
case Map.fetch(storage.data, k) do
{:ok, v} -> {:ok, v}
:error -> {:error, %Storage.NotFoundError{message: "Not found"}}
end
end
@doc """
Fetch an item from storage.
## Examples
iex> storage = Storage.Map.new()
iex> storage = Storage.Map.insert(storage, :name, "Haystack")
iex> Storage.Map.fetch!(storage, :name)
"Haystack"
"""
@impl Haystack.Storage
def fetch!(storage, k) do
case fetch(storage, k) do
{:ok, v} -> v
{:error, error} -> raise error
end
end
@doc """
Insert an item into storage.
## Examples
iex> storage = Storage.Map.new()
iex> storage = Storage.Map.insert(storage, :name, "Haystack")
iex> Storage.Map.fetch!(storage, :name)
"Haystack"
"""
@impl Haystack.Storage
def insert(storage, k, v),
do: Map.update!(storage, :data, &Map.put(&1, k, v))
@doc """
Update an item in storage.
## Examples
iex> storage = Storage.Map.new()
iex> Storage.Map.update(storage, :name, &String.upcase/1)
{:error, %Storage.NotFoundError{message: "Not found"}}
iex> storage = Storage.Map.new()
iex> storage = Storage.Map.insert(storage, :name, "Haystack")
iex> {:ok, storage} = Storage.Map.update(storage, :name, &String.upcase/1)
iex> Storage.Map.fetch!(storage, :name)
"HAYSTACK"
"""
@impl Haystack.Storage
def update(storage, k, f) do
with {:ok, v} <- fetch(storage, k) do
{:ok, Map.update!(storage, :data, &Map.put(&1, k, f.(v)))}
end
end
@doc """
Update an item in storage.
## Examples
iex> storage = Storage.Map.new()
iex> storage = Storage.Map.insert(storage, :name, "Haystack")
iex> storage = Storage.Map.update!(storage, :name, &String.upcase/1)
iex> Storage.Map.fetch!(storage, :name)
"HAYSTACK"
"""
@impl Haystack.Storage
def update!(storage, k, f) do
case update(storage, k, f) do
{:ok, storage} -> storage
{:error, error} -> raise error
end
end
@doc """
Upsert an item in storage.
## Examples
iex> storage = Storage.Map.new()
iex> storage = Storage.Map.upsert(storage, :name, "HAYSTACK", &String.upcase/1)
iex> Storage.Map.fetch!(storage, :name)
"HAYSTACK"
iex> storage = Storage.Map.new()
iex> storage = Storage.Map.insert(storage, :name, "Haystack")
iex> storage = Storage.Map.upsert(storage, :name, "HAYSTACK", &String.upcase/1)
iex> Storage.Map.fetch!(storage, :name)
"HAYSTACK"
"""
@impl Haystack.Storage
def upsert(storage, k, v, f) do
case update(storage, k, f) do
{:ok, storage} -> storage
{:error, _} -> insert(storage, k, v)
end
end
@doc """
Delete an item from storage.
## Examples
iex> storage = Storage.Map.new()
iex> storage = Storage.Map.delete(storage, :name)
iex> Storage.Map.fetch(storage, :name)
{:error, %Storage.NotFoundError{message: "Not found"}}
iex> storage = Storage.Map.new()
iex> storage = Storage.Map.insert(storage, :name, "Haystack")
iex> storage = Storage.Map.delete(storage, :name)
iex> Storage.Map.fetch(storage, :name)
{:error, %Storage.NotFoundError{message: "Not found"}}
"""
@impl Haystack.Storage
def delete(storage, k),
do: Map.update!(storage, :data, &Map.delete(&1, k))
@doc """
Return the count of items in storage.
## Examples
iex> storage = Storage.Map.new()
iex> storage = Storage.insert(storage, :name, "Haystack")
iex> Storage.count(storage)
1
"""
@impl Haystack.Storage
def count(storage),
do: Enum.count(storage.data)
@doc """
Serialize the storage.
## Examples
iex> storage = Storage.Map.new()
iex> Storage.serialize(storage)
"""
@impl Haystack.Storage
def serialize(storage),
do: :erlang.term_to_binary(storage)
end
The second out-of-the-box storage implementation I want to provide is one backed by ETS. This storage implementation can be used in production applications and means anyone can use Haystack without the friction of having to provide a way to store data. Here is the ETS storage implementation:
defmodule Haystack.Storage.ETS do
@moduledoc """
An ETS implementation of the storage behaviour.
"""
use GenServer
alias Haystack.Storage
@behaviour Haystack.Storage
@type t :: %__MODULE__{
name: GenServer.name(),
table: atom,
data: list({term, term}),
load: nil | fun()
}
@enforce_keys ~w{name table data load}a
defstruct @enforce_keys
# Behaviour: Haystack.Storage
@doc """
Create a new storage.
"""
@impl Haystack.Storage
def new(opts \\ []) do
opts =
opts
|> Keyword.put(:data, [])
|> Keyword.put_new(:load, nil)
|> Keyword.put_new(:name, :haystack)
|> Keyword.put_new(:table, :haystack)
struct(__MODULE__, opts)
end
@doc """
Fetch an item from storage.
"""
@impl Haystack.Storage
def fetch(storage, k) do
case :ets.lookup(storage.table, k) do
[] -> {:error, %Storage.NotFoundError{message: "Not found"}}
[{^k, v}] -> {:ok, v}
end
end
@doc """
Fetch an item from storage.
"""
@impl Haystack.Storage
def fetch!(storage, k) do
case fetch(storage, k) do
{:ok, v} -> v
{:error, error} -> raise error
end
end
@doc """
Insert an item into storage.
"""
@impl Haystack.Storage
def insert(storage, k, v),
do: GenServer.call(storage.name, {:insert, {k, v}})
@doc """
Update an item in storage.
"""
@impl Haystack.Storage
def update(storage, k, f) do
with {:ok, v} <- fetch(storage, k) do
{:ok, GenServer.call(storage.name, {:insert, {k, f.(v)}})}
end
end
@doc """
Update an item in storage.
"""
@impl Haystack.Storage
def update!(storage, k, f) do
case update(storage, k, f) do
{:ok, storage} -> storage
{:error, error} -> raise error
end
end
@doc """
Upsert an item in storage.
"""
@impl Haystack.Storage
def upsert(storage, k, v, f) do
case update(storage, k, f) do
{:ok, storage} -> storage
{:error, _} -> insert(storage, k, v)
end
end
@doc """
Delete an item from storage.
"""
@impl Haystack.Storage
def delete(storage, k),
do: GenServer.call(storage.name, {:delete, k})
@doc """
Return the count of items in storage.
"""
@impl Haystack.Storage
def count(storage) do
info = :ets.info(storage.table)
Keyword.fetch!(info, :size)
end
@doc """
Serialize the storage.
"""
@impl Haystack.Storage
def serialize(storage),
do: GenServer.call(storage.name, :serialize)
# Behaviour: GenServer
def child_spec(opts) do
storage = Keyword.fetch!(opts, :storage)
%{id: {__MODULE__, storage.table}, start: {__MODULE__, :start_link, [opts]}}
end
def start_link(opts) do
{storage, opts} = Keyword.pop!(opts, :storage)
GenServer.start_link(__MODULE__, storage, [name: storage.name] ++ opts)
end
@impl true
def init(storage) do
:ets.new(storage.table, [:set, :protected, :named_table, :compressed])
{:ok, storage, {:continue, :ok}}
end
@impl true
def handle_call({:insert, {k, v}}, _from, storage) do
true = :ets.insert(storage.table, {k, v})
{:reply, storage, storage}
end
def handle_call({:delete, k}, _from, storage) do
true = :ets.delete(storage.table, k)
{:reply, storage, storage}
end
def handle_call(:serialize, _from, storage) do
data = :ets.tab2list(storage.table)
{:reply, :erlang.term_to_binary(%{storage | data: data}), storage}
end
@impl true
def handle_continue(:ok, %{load: nil} = storage), do: {:noreply, storage}
def handle_continue(:ok, storage) do
data = storage.load.()
Enum.each(data, &:ets.insert(storage.table, &1))
:erlang.garbage_collect(self())
{:noreply, %{storage | load: nil}}
end
end
Of course, if you aren’t happy with either of these implementations, you can provide your own by implementing the behaviour callbacks, or write an implementation for an entirely different method of persisting data; the choice is yours.
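One detail of the ETS implementation above worth highlighting is that reads go straight to the table while writes are serialized through the GenServer. Here's a self-contained sketch (not from Haystack itself) of the ETS visibility rule this relies on:

```elixir
# A :protected named ETS table can be read from any process, but only its
# owner may write to it. That's why fetch/2 reads the table directly while
# insert/3 goes through GenServer.call/2.
defmodule EtsDemo do
  def run do
    owner =
      spawn(fn ->
        :ets.new(:demo, [:set, :protected, :named_table])
        :ets.insert(:demo, {:k, 1})

        receive do
          :stop -> :ok
        end
      end)

    # Give the owner a moment to create the table
    Process.sleep(50)

    # Any process may read a :protected table...
    [{:k, 1}] = :ets.lookup(:demo, :k)

    # ...but writes from a non-owner raise ArgumentError
    result =
      try do
        :ets.insert(:demo, {:k, 2})
      rescue
        ArgumentError -> :write_denied
      end

    send(owner, :stop)
    result
  end
end

result = EtsDemo.run()
# result == :write_denied
```

This split gives concurrent, copy-free reads while keeping writes consistent through a single process.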
The main control centre of Haystack is the Index
module. This module is responsible for the configuration of the index, how data is stored in the index, and providing a public interface for the underlying modules that implement the functionality.
Here’s the first part of the Index
module:
defmodule Haystack.Index do
@moduledoc """
A module for managing indexes.
The index is used to manage configuration and acts as a central point to add,
update, delete, or query the storage.
"""
alias Haystack.{Index, Query, Storage, Tokenizer, Transformer}
# Types
@type attrs :: %{insert: list({module, Keyword.t()}), delete: list({module, Keyword.t()})}
@type fields :: %{Index.Field.k() => Index.Field.t()}
@type opts :: Keyword.t()
@type t :: %__MODULE__{
attrs: attrs,
fields: fields,
name: atom,
ref: Index.Field.t(),
storage: struct()
}
@enforce_keys ~w{attrs fields name ref storage}a
defstruct @enforce_keys
# Public
@doc """
Create a new index.
## Examples
iex> Index.new(:animals)
"""
@spec new(atom, opts) :: t
def new(name, opts \\ []) do
opts =
opts
|> Keyword.put(:name, name)
|> Keyword.put(:fields, %{})
|> Keyword.put_new(:storage, Storage.Map.new([]))
|> Keyword.put_new(:attrs, Index.Attr.default())
struct(__MODULE__, opts)
end
@doc """
Set the attrs of the index.
## Examples
iex> index = Index.new(:animals)
iex> Index.attrs(index, %{insert: [Global], delete: [Global]})
"""
@spec attrs(t, attrs) :: t
def attrs(index, attrs),
do: %{index | attrs: attrs}
@doc """
Add a ref to the index.
## Examples
iex> index = Index.new(:animals)
iex> Index.ref(index, Index.Field.new("id"))
"""
@spec ref(t, Index.Field.t()) :: t
def ref(index, ref),
do: %{index | ref: ref}
@doc """
Add a field to the index.
## Examples
iex> index = Index.new(:animals)
iex> Index.field(index, Index.Field.new("name"))
"""
@spec field(t, Index.Field.t()) :: t
def field(index, field),
do: %{index | fields: Map.put(index.fields, field.k, field)}
@doc """
Set the storage on the index.
## Examples
iex> index = Index.new(:animals)
iex> Index.storage(index, Storage.Map.new())
"""
@spec storage(t, struct) :: t
def storage(index, storage),
do: %{index | storage: storage}
end
The index struct captures the name of the index, the map of fields and the ref we saw configured at the start, as well as the chosen storage implementation. The index also has the concept of attrs, which we will look at later.
The first index related sub module to look at is Index.Document
. This module is responsible for normalizing, extracting, tokenizing, and transforming the data. We also convert the transformed data into a %Index.Document{}
struct so we have a formalised data structure to pass around. Here is the Index.Document
module in full:
defmodule Haystack.Index.Document do
@moduledoc """
A module for preparing documents to be stored.
"""
alias Haystack.{Tokenizer, Transformer}
# Types
@type t :: %__MODULE__{
ref: String.t(),
fields: %{String.t() => list(Tokenizer.Token.t())}
}
@enforce_keys ~w{ref fields}a
defstruct @enforce_keys
# Public
@doc """
Create a new document.
"""
def new(index, params) do
fields = [index.ref | index.fields |> Map.values()]
{[%{v: ref}], fields} =
params
|> normalize()
|> extract(fields)
|> tokenize()
|> transform()
|> Enum.into(%{})
|> Map.pop!(index.ref.k)
struct(__MODULE__, ref: ref, fields: fields)
end
# Private
# Normalize the keys of the map to strings
defp normalize(map) when is_map(map),
do: Enum.map(map, &normalize/1) |> Enum.into(%{})
defp normalize({k, v}) when is_map(v),
do: {to_string(k), v |> Enum.map(&normalize/1) |> Enum.into(%{})}
defp normalize({k, v}),
do: {to_string(k), v}
# Extract the fields from the map
defp extract(map, fields) do
fields
|> Enum.map(fn field -> {get_in(map, field.path), field} end)
|> Enum.reject(fn {v, _field} -> is_nil(v) end)
|> Enum.map(fn {v, field} -> {field.k, {v, field}} end)
end
# Tokenize the values
defp tokenize(list) do
Enum.map(list, fn {k, {v, field}} ->
{k, {Tokenizer.tokenize(v, field.separator), field}}
end)
end
# Transform the values
defp transform(list) do
Enum.map(list, fn {k, {tokens, field}} ->
tokens = Transformer.pipeline(tokens, field.transformers)
counts = Enum.frequencies_by(tokens, & &1.v)
groups = Enum.group_by(tokens, & &1.v)
values =
Enum.map(groups, fn {v, group} ->
%{}
|> Map.put(:v, v)
|> Map.put(:positions, Enum.map(group, &{&1.offset, &1.length}))
|> Map.put(:tf, Map.get(counts, v) / Enum.count(tokens))
end)
{k, values}
end)
end
end
The :tf
value is the term frequency: how often the term appears in the given field, divided by the total number of tokens for that field. It’s a measure of how relevant a token is within the document.
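The term-frequency calculation can be sketched in isolation, independent of Haystack:

```elixir
# Occurrences of each term divided by the total number of tokens in the field.
tokens = ~w(the quick brown fox jumps over the lazy dog)
total = Enum.count(tokens)

tf =
  tokens
  |> Enum.frequencies()
  |> Map.new(fn {term, count} -> {term, count / total} end)

# "the" appears twice in nine tokens, so tf["the"] == 2/9
tf["the"]
```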
The next interesting thing to look at is the index attrs. An important part of building a search index is the data that you store. The data is stored as an “inverted index”, meaning the terms themselves are used as keys that map to the documents containing them.
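A minimal sketch of an inverted index looks like this (a plain map, independent of Haystack's storage):

```elixir
# Each term maps to the refs of the documents containing it, so looking up
# "which documents contain this term?" is a single map access.
docs = %{
  "1" => ~w(red panda),
  "2" => ~w(giant panda)
}

inverted =
  for {ref, terms} <- docs, term <- terms, reduce: %{} do
    acc -> Map.update(acc, term, [ref], &[ref | &1])
  end

Enum.sort(inverted["panda"]) # both documents contain "panda"
```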
This is simple enough to build out in isolation but once you start layering on additional data to be stored, things can become more complicated. In order to make the responsibility of storing data as simple and as configurable as possible, I wanted to define those attrs as interchangeable modules that could be added, removed, or replaced.
This has a number of benefits.
Firstly, it makes implementing the core set of attributes very simple, because we’ve formalised those stages into discrete steps.
Secondly, it makes it possible for any consumer of Haystack to add, remove, or replace any of the attrs and completely control the data that is processed and stored as part of their search index. For example, you might want to store word embeddings so you can extend Haystack to offer semantic search: instead of diving into the source code of an increasingly complicated module, you can simply add a new implementation to the list of existing attrs.
So to achieve this, we once again define a behaviour that each of the attr modules must implement:
@doc """
The attr key.
"""
@callback key(opts) :: tuple
@doc """
Insert an attr.
"""
@callback insert(Index.t(), Index.Document.t(), opts) :: Index.t()
@doc """
Delete an attr.
"""
@callback delete(Index.t(), Index.Document.t(), opts) :: Index.t()
This behaviour requires that implementations provide functions for inserting and deleting data, as well as a unique key that represents the attr type in storage. An interesting thing to note is that updating is not included in this behaviour, because it can be seen as the composition of deleting and inserting. Another decision I made is that, in order to delete data from the index, the consumer must supply the original document to be deleted rather than just the ref
. This decreases the amount of data that must be stored, a trade-off worth making so that Haystack can run in relatively low-memory environments.
As part of this module I’ve also added a list of default attrs that will be used unless the developer supplies a custom list. Here is the Attr
module in full:
defmodule Haystack.Index.Attr do
@moduledoc """
A behaviour for defining a store attr.
"""
alias Haystack.Index
@attrs [
{Index.Attr.Global, []},
{Index.Attr.Docs, []},
{Index.Attr.Meta, include: [:tf]}
]
# Types
@type opts :: Keyword.t()
# Behaviour
@doc """
The attr key.
"""
@callback key(opts) :: tuple
@doc """
Insert an attr.
"""
@callback insert(Index.t(), Index.Document.t(), opts) :: Index.t()
@doc """
Delete an attr.
"""
@callback delete(Index.t(), Index.Document.t(), opts) :: Index.t()
# Public
@doc """
Return the default attrs.
"""
@spec default :: %{insert: list({module, Keyword.t()}), delete: list({module, Keyword.t()})}
def default, do: %{insert: @attrs, delete: Enum.reverse(@attrs)}
end
Next we can look at the default attr modules.
The first attr module simply stores a global list of all refs. We need to maintain a global list of all document refs to calculate the inverse document frequency later in this blog post. Here’s what that module looks like:
defmodule Haystack.Index.Attr.Global do
@moduledoc """
A module for storing the global list of refs.
"""
import Record
alias Haystack.{Index, Storage}
@behaviour Index.Attr
defrecord :global, []
# Behaviour: Index.Attr
@impl Index.Attr
def key(_opts \\ []), do: global()
@impl Index.Attr
def insert(index, %{ref: ref}, _opts \\ []) do
storage = Storage.upsert(index.storage, key(), [ref], &(&1 ++ [ref]))
Index.storage(index, storage)
end
@impl Index.Attr
def delete(index, %{ref: ref}, _opts \\ []) do
storage = Storage.upsert(index.storage, key(), [], &(&1 -- [ref]))
storage =
case Storage.fetch!(storage, key()) do
[] -> Storage.delete(storage, key())
_ -> storage
end
Index.storage(index, storage)
end
end
The next attr module stores a list of documents that have a particular field, term pair:
defmodule Haystack.Index.Attr.Docs do
@moduledoc """
A module for returning a list of docs for a given field, term.
"""
import Record
alias Haystack.{Index, Storage}
@behaviour Index.Attr
defrecord :docs, field: nil, term: nil
# Behaviour: Index.Attr
@impl Index.Attr
def key(field: field, term: term),
do: docs(field: field, term: term)
@impl Index.Attr
def insert(index, %{ref: ref, fields: fields}, _opts \\ []) do
storage =
Enum.reduce(fields, index.storage, fn {field, terms}, storage ->
Enum.reduce(terms, storage, fn term, storage ->
k = key(field: field, term: term.v)
Storage.upsert(storage, k, [ref], &[ref | &1])
end)
end)
Index.storage(index, storage)
end
@impl Index.Attr
def delete(index, %{ref: ref, fields: fields}, _opts \\ []) do
storage =
Enum.reduce(fields, index.storage, fn {field, terms}, storage ->
Enum.reduce(terms, storage, fn term, storage ->
k = key(field: field, term: term.v)
case Storage.fetch!(storage, k) do
[^ref] -> Storage.delete(storage, k)
refs -> Storage.insert(storage, k, refs -- [ref])
end
end)
end)
Index.storage(index, storage)
end
end
This makes it really easy to find a list of documents that have a specific term as part of that field because it’s stored as an inverted index.
Finally we have a meta attr that stores additional metadata about the field, term pair for a given document:
defmodule Haystack.Index.Attr.Meta do
@moduledoc """
A module for storing the meta for a field, term.
"""
import Record
alias Haystack.{Index, Storage}
@behaviour Index.Attr
defrecord :meta, ref: nil, field: nil, term: nil
# Behaviour: Index.Attr
@impl Index.Attr
def key(ref: ref, field: field, term: term),
do: meta(ref: ref, field: field, term: term)
@impl Index.Attr
def insert(index, %{ref: ref, fields: fields}, opts \\ []) do
include = Keyword.get(opts, :include, [:tf])
storage =
Enum.reduce(fields, index.storage, fn {field, terms}, storage ->
Enum.reduce(terms, storage, fn term, storage ->
k = key(ref: ref, field: field, term: term.v)
Storage.insert(storage, k, Map.take(term, include))
end)
end)
Index.storage(index, storage)
end
@impl Index.Attr
def delete(index, %{ref: ref, fields: fields}, _opts \\ []) do
storage =
Enum.reduce(fields, index.storage, fn {field, terms}, storage ->
Enum.reduce(terms, storage, fn term, storage ->
k = key(ref: ref, field: field, term: term.v)
Storage.delete(storage, k)
end)
end)
Index.storage(index, storage)
end
end
The metadata stored includes the term frequency and, optionally, the positions we calculated earlier. By default we don’t store the positions, as that extra data would add to memory usage and might not be needed.
Back in the Index
module, we can add a couple of functions for adding and deleting data from the index:
@doc """
Add documents to the index.
## Examples
iex> index = Index.new(:animals)
iex> Index.add(index, [])
"""
@spec add(t, list(map)) :: t
def add(index, data) do
docs = Enum.map(data, &Index.Document.new(index, &1))
Enum.reduce(docs, index, fn doc, index ->
Enum.reduce(index.attrs.insert, index, fn {module, opts}, index ->
module.insert(index, doc, opts)
end)
end)
end
@doc """
Delete documents in the index.
## Examples
iex> index = Index.new(:animals)
iex> Index.delete(index, [])
"""
@spec delete(t, list(map)) :: t
def delete(index, data) do
docs = Enum.map(data, &Index.Document.new(index, &1))
Enum.reduce(docs, index, fn doc, index ->
Enum.reduce(index.attrs.delete, index, fn {module, opts}, index ->
module.delete(index, doc, opts)
end)
end)
end
These two functions use the Index.Document
module to create the index document and then reduce through the list of attr modules for the given action and apply the attr to the index for the given document.
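The reduce-over-modules pattern these functions rely on can be sketched in isolation. The Upcase and Suffix modules below are hypothetical stand-ins for attr modules:

```elixir
# Reduce an accumulator through a list of {module, opts} pairs, the same
# shape add/2 and delete/2 use for their attr pipelines.
defmodule Upcase do
  def apply(acc, _opts), do: String.upcase(acc)
end

defmodule Suffix do
  def apply(acc, opts), do: acc <> Keyword.fetch!(opts, :suffix)
end

steps = [{Upcase, []}, {Suffix, suffix: "!"}]

result =
  Enum.reduce(steps, "haystack", fn {mod, opts}, acc ->
    mod.apply(acc, opts)
  end)

# result == "HAYSTACK!"
```

Because each step is just a `{module, opts}` tuple, consumers can reorder, remove, or add steps without touching the pipeline itself.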
The final part of Haystack that needs to be built is a method for defining queries and then applying them to the index to get a list of results.
Once again this was an area of Haystack that I wanted to be easy to implement whilst also being easy to extend or modify by an engineer without having to make a PR to the main repository.
A query is a composable data structure of clauses and expressions that can be evaluated against the index. Similar to a lot of the modules we’ve looked at previously, each clause and expression is implemented as an implementation of a behaviour so that they can be swapped, replaced or added to very easily.
Queries should also be infinitely nestable to allow the engineer to define complex queries that meet their requirements. So with that being said, let’s take a look at how queries work in Haystack.
Here’s the Query
module:
defmodule Haystack.Query do
@moduledoc """
A module for building queries.
"""
alias Haystack.{Index, Query}
alias Haystack.Tokenizer.Token
# Types
@type statement :: (Query.t(), Index.t() -> list(map()))
@type t :: %__MODULE__{
clause: Query.Clause.t(),
config: Keyword.t()
}
@enforce_keys ~w{clause config}a
defstruct @enforce_keys
@config clause: Query.Clause.default(),
expression: Query.Expression.default()
# Public
@doc """
Create a new Query.
"""
@spec new(Keyword.t()) :: t
def new(opts \\ []),
do: struct(__MODULE__, Keyword.put_new(opts, :config, @config))
@doc """
Add a clause to the query.
"""
@spec clause(t, Query.Clause.t()) :: t
def clause(query, clause),
do: %{query | clause: clause}
@doc """
Evaluate a clause.
"""
@spec evaluate(t(), Index.t(), Query.Clause.t(), list(statement)) :: list(map())
def evaluate(query, index, clause, statements) do
module = get_in(query.config, [:clause, clause.k])
module.evaluate(query, index, statements)
end
@doc """
Evaluate an expression.
"""
@spec evaluate(t(), Index.t(), Query.Expression.t()) :: list(map())
def evaluate(query, index, expression) do
module = get_in(query.config, [:expression, expression.k])
module.evaluate(index, expression)
end
@doc """
Run the given query.
"""
@spec run(t, Index.t()) :: list(map)
def run(query, index) do
statement = Query.Clause.build(query.clause)
responses = statement.(query, index)
responses
|> Query.IDF.calculate(index.storage)
|> Enum.group_by(& &1.ref)
|> Enum.map(fn {ref, fields} ->
score = Enum.reduce(fields, 0, &(&1.score + &2))
fields = Enum.group_by(fields, & &1.field)
fields =
Enum.map(fields, fn {k, list} ->
{k, Enum.map(list, &Map.take(&1, [:positions, :term]))}
end)
%{ref: ref, score: score, fields: Map.new(fields)}
end)
|> Enum.sort_by(& &1.score, :desc)
end
@doc """
Build a clause for the given list of tokens.
## Examples
iex> index = Index.new(:animals)
iex> index = Index.field(index, Index.Field.new("name"))
iex> tokens = Tokenizer.tokenize("Red Panda")
iex> tokens = Transformer.pipeline(tokens, Transformer.default())
iex> Query.build(:match_all, index, tokens)
"""
@spec build(atom, Index.t(), list(Token.t())) :: Query.Clause.t()
def build(:match_all, index, tokens) do
Enum.reduce(tokens, Query.Clause.new(:all), fn token, clause ->
Enum.reduce(Map.values(index.fields), clause, fn field, clause ->
Query.Clause.expressions(clause, [
Query.Expression.new(:match, field: field.k, term: token.v)
])
end)
end)
end
def build(:match_any, index, tokens) do
Enum.reduce(tokens, Query.Clause.new(:any), fn token, clause ->
Enum.reduce(Map.values(index.fields), clause, fn field, clause ->
Query.Clause.expressions(clause, [
Query.Expression.new(:match, field: field.k, term: token.v)
])
end)
end)
end
end
The query module has a struct that contains a single clause that is used as the root of the query. The struct also has a config keyword list that allows the engineer to set their own mapping of clauses and expressions.
Here’s a rundown of the functions in this module.
First up we provide helpers to create a new query and to add a clause to the query. If the config
is not supplied by the engineer, we set a default value:
@doc """
Create a new Query.
"""
@spec new(Keyword.t()) :: t
def new(opts \\ []),
do: struct(__MODULE__, Keyword.put_new(opts, :config, @config))
@doc """
Add a clause to the query.
"""
@spec clause(t, Query.Clause.t()) :: t
def clause(query, clause),
do: %{query | clause: clause}
Next, we have evaluate
functions for clauses and expressions. These functions basically just look up the matching clause or expression type in the config
and then delegate to the configured module:
@doc """
Evaluate a clause.
"""
@spec evaluate(t(), Index.t(), Query.Clause.t(), list(statement)) :: list(map())
def evaluate(query, index, clause, statements) do
module = get_in(query.config, [:clause, clause.k])
module.evaluate(query, index, statements)
end
@doc """
Evaluate an expression.
"""
@spec evaluate(t(), Index.t(), Query.Expression.t()) :: list(map())
def evaluate(query, index, expression) do
module = get_in(query.config, [:expression, expression.k])
module.evaluate(index, expression)
end
Next, we have the run function that applies the query to the given index:
@doc """
Run the given query.
"""
@spec run(t, Index.t()) :: list(map)
def run(query, index) do
statement = Query.Clause.build(query.clause)
responses = statement.(query, index)
responses
|> Query.IDF.calculate(index.storage)
|> Enum.group_by(& &1.ref)
|> Enum.map(fn {ref, fields} ->
score = Enum.reduce(fields, 0, &(&1.score + &2))
fields = Enum.group_by(fields, & &1.field)
fields =
Enum.map(fields, fn {k, list} ->
{k, Enum.map(list, &Map.take(&1, [:positions, :term]))}
end)
%{ref: ref, score: score, fields: Map.new(fields)}
end)
|> Enum.sort_by(& &1.score, :desc)
end
The run/2
function works by using the root clause to build a statement that can then be executed, returning the results. The IDF, which stands for Inverse Document Frequency, is then calculated for each result; it weights terms that appear in fewer documents across the whole index more highly. Finally, we group results by their ref
, calculate the total score
for each document, and then sort on that score.
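The exact weighting applied by Haystack's Query.IDF module isn't shown in this post, but the classic definition it is based on can be sketched with made-up numbers:

```elixir
# idf(term) = log(total documents / documents containing the term)
total_docs = 1000
doc_freq = %{"panda" => 10, "the" => 990}

idf = fn term -> :math.log(total_docs / Map.fetch!(doc_freq, term)) end

# A rare term like "panda" is weighted far higher than a ubiquitous one
idf.("panda") # ≈ 4.6
idf.("the")   # ≈ 0.01
```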
Next up we’ll look at the clause module. This module is responsible for defining the behaviour of clauses as well as offering some convenience functions. Here is that module:
defmodule Haystack.Query.Clause do
@moduledoc """
A module for defining query clauses.
"""
alias Haystack.{Index, Query}
alias Haystack.Query.Clause
# Types
@type k :: atom
@type t :: %__MODULE__{
k: k,
clauses: list(t),
expressions: list(Query.Expression.t())
}
@enforce_keys ~w{k clauses expressions}a
defstruct @enforce_keys
@clauses all: Clause.All,
any: Clause.Any
# Behaviour
@doc """
Evaluate the given clause.
"""
@callback evaluate(Query.t(), Index.t(), list(Query.statement())) :: list(map())
# Public
@doc """
Create a new clause.
## Examples
iex> Query.Clause.new(:any)
"""
@spec new(k) :: t
def new(k),
do: struct(__MODULE__, k: k, clauses: [], expressions: [])
@doc """
Return the default clauses.
## Examples
iex> Query.Clause.default()
"""
@spec default :: [{atom, module}]
def default, do: @clauses
@doc """
Add a list of clauses.
## Examples
iex> clause = Query.Clause.new(:any)
iex> Query.Clause.clauses(clause, [Query.Clause.new(:any)])
"""
def clauses(clause, clauses),
do: %{clause | clauses: clause.clauses ++ clauses}
@doc """
Add a list of expressions.
## Examples
iex> clause = Query.Clause.new(:any)
iex> expression = Query.Expression.new(:match, field: "name", term: "Haystack")
iex> Query.Clause.expressions(clause, [expression])
"""
def expressions(clause, expressions),
do: %{clause | expressions: clause.expressions ++ expressions}
@doc """
Build the clause into a statement.
## Examples
iex> clause = Query.Clause.new(:any)
iex> Query.Clause.build(clause)
"""
@spec build(t) :: Query.statement()
def build(clause) do
fn query, index ->
statements =
clause.expressions
|> Enum.map(&Query.Expression.build/1)
|> Enum.concat(Enum.map(clause.clauses, &build/1))
Query.evaluate(query, index, clause, statements)
end
end
end
The most interesting bit of this module is the build/1
function that accepts a clause and returns a function. The returned function is the statement
we looked at earlier, which we use to evaluate the query against the index.
The returned function essentially captures a nested method of evaluating sub-clauses and expressions: for a clause, it builds a list of statements from each of the clause’s expressions and nested clauses, and then evaluates that list of statements.
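Stripped of Haystack's types, the nested-evaluation idea can be sketched with plain closures. The match and all functions below are hypothetical simplifications of expressions and clauses:

```elixir
# An "expression" becomes a function returning results; a clause becomes a
# function that evaluates a list of such functions and intersects them.
match = fn refs -> fn -> MapSet.new(refs) end end

all = fn statements ->
  fn ->
    statements
    |> Enum.map(& &1.())
    |> Enum.reduce(&MapSet.intersection/2)
  end
end

# Clauses nest: a statement can wrap other statements arbitrarily deep.
statement = all.([match.([1, 2, 3]), all.([match.([2, 3, 4]), match.([3, 4, 5])])])

statement.() |> MapSet.to_list()
# => [3]
```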
Out of the box, Haystack has two clause implementations, for any
and all
queries. These modules are really simple because they essentially just evaluate the given list of statements (which could be expressions, clauses, or a mixture of both) and then find the union or the intersection of those results, depending on the type of clause.
Here’s how those two modules look:
defmodule Haystack.Query.Clause.All do
@moduledoc """
A module to define "all" clauses.
"""
alias Haystack.Query
@behaviour Query.Clause
# Behaviour: Query.Clause
@impl Query.Clause
def evaluate(query, index, statements) do
responses = Enum.map(statements, & &1.(query, index))
results =
Enum.map(responses, fn results ->
results
|> Enum.map(&Map.get(&1, :ref))
|> MapSet.new()
end)
{acc, results} = List.pop_at(results, 0, MapSet.new())
refs = Enum.reduce(results, acc, &MapSet.intersection(&2, &1))
responses
|> List.flatten()
|> Enum.filter(&Enum.member?(refs, &1.ref))
end
end
defmodule Haystack.Query.Clause.Any do
@moduledoc """
A module to define "any" clauses.
"""
alias Haystack.Query
@behaviour Query.Clause
# Behaviour: Query.Clause
@impl Query.Clause
def evaluate(query, index, statements) do
responses = Enum.map(statements, & &1.(query, index))
results =
Enum.map(responses, fn results ->
results
|> Enum.map(&Map.get(&1, :ref))
|> MapSet.new()
end)
{acc, results} = List.pop_at(results, 0, MapSet.new())
refs = Enum.reduce(results, acc, &MapSet.union(&2, &1))
responses
|> List.flatten()
|> Enum.filter(&Enum.member?(refs, &1.ref))
end
end
Next we have the expression module. You can think of this as a simplified version of a clause:
defmodule Haystack.Query.Expression do
@moduledoc """
A module for defining query expressions.
"""
alias Haystack.{Index, Query}
# Types
@type k :: atom
@type t :: %__MODULE__{
k: k,
field: String.t(),
term: String.t()
}
@enforce_keys ~w{k field term}a
defstruct @enforce_keys
@expressions match: Query.Expression.Match
# Behaviour
@doc """
Evaluate the given clause.
"""
@callback evaluate(Index.t(), t) :: list(map())
# Public
@doc """
Create a new expression.
## Examples
iex> Query.Expression.new(:match, field: "name", term: "Haystack")
"""
@spec new(k, Keyword.t()) :: t
def new(k, opts),
do: struct(__MODULE__, [k: k] ++ opts)
@doc """
Return the default expressions.
## Examples
iex> Query.Expression.default()
"""
@spec default :: [{atom, module}]
def default, do: @expressions
@doc """
Build the expression into a statement.
## Examples
iex> expression = Query.Expression.new(:match, field: "name", term: "Haystack")
iex> Query.Expression.build(expression)
"""
@spec build(t) :: Query.statement()
def build(expression) do
fn query, index ->
Query.evaluate(query, index, expression)
end
end
end
Expressions are simply a way to evaluate a result; they do not need any nested clauses or expressions. Out of the box there is a single match
expression, which uses the inverted index to find matches:
defmodule Haystack.Query.Expression.Match do
@moduledoc """
A module to define "match" expressions.
"""
alias Haystack.Query
alias Haystack.Storage
alias Haystack.Index.Attr.{Docs, Meta}
@behaviour Query.Expression
# Behaviour: Query.Expression
@impl Query.Expression
def evaluate(index, %Query.Expression{k: :match} = exp) do
with {:ok, refs} <- refs(index, exp) do
Enum.map(refs, fn ref ->
key = Meta.key(ref: ref, field: exp.field, term: exp.term)
meta = Storage.fetch!(index.storage, key)
meta
|> Map.put(:ref, ref)
|> Map.put(:score, Map.get(meta, :tf, 0))
|> Map.put(:field, exp.field)
|> Map.put(:term, exp.term)
end)
end
end
# Private
defp refs(%{storage: storage}, exp) do
case Storage.fetch(storage, Docs.key(field: exp.field, term: exp.term)) do
{:ok, results} -> {:ok, results}
{:error, _} -> []
end
end
end
Back in the query module we can define a couple of convenience functions for building common queries:
@doc """
Build a clause for the given list of tokens.
## Examples
iex> index = Index.new(:animals)
iex> index = Index.field(index, Index.Field.new("name"))
iex> tokens = Tokenizer.tokenize("Red Panda")
iex> tokens = Transformer.pipeline(tokens, Transformer.default())
iex> Query.build(:match_all, index, tokens)
"""
@spec build(atom, Index.t(), list(Token.t())) :: Query.Clause.t()
def build(:match_all, index, tokens) do
Enum.reduce(tokens, Query.Clause.new(:all), fn token, clause ->
Enum.reduce(Map.values(index.fields), clause, fn field, clause ->
Query.Clause.expressions(clause, [
Query.Expression.new(:match, field: field.k, term: token.v)
])
end)
end)
end
def build(:match_any, index, tokens) do
Enum.reduce(tokens, Query.Clause.new(:any), fn token, clause ->
Enum.reduce(Map.values(index.fields), clause, fn field, clause ->
Query.Clause.expressions(clause, [
Query.Expression.new(:match, field: field.k, term: token.v)
])
end)
end)
end
Both of these functions simply return a query data structure. However, there’s nothing stopping you from building your own custom, complex query using the building blocks of clauses and expressions. You are also free to define your own implementations of the existing clauses and expressions, or extend Haystack with implementations for entirely different types of queries.
Finally we can take a look at the index module, which provides a search function to make it easy to build and run simple queries from a string.
@doc """
Search the index.
"""
@spec search(t, String.t(), Keyword.t()) :: list(map)
def search(index, v, opts \\ []) do
type = Keyword.get(opts, :query, :match_any)
tokens = Tokenizer.tokenize(v)
tokens = Transformer.pipeline(tokens, Transformer.default())
Query.new()
|> Query.clause(Query.build(type, index, tokens))
|> Query.run(index)
end
Here we are tokenizing and transforming the query and then passing the list of tokens to one of the convenience functions of the query module. By default this will run a “match any” query.
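To make the whole flow concrete, here is a toy, self-contained version of the same pipeline, independent of Haystack: tokenize, build an inverted index of term frequencies, then run a "match any" style search scored by tf alone (no transformers, IDF, or composable clauses):

```elixir
defmodule ToySearch do
  def tokenize(text),
    do: text |> String.downcase() |> String.split(~r/\W+/, trim: true)

  # Build an inverted index: term => %{ref => tf}
  def index(docs) do
    for {ref, text} <- docs, tokens = tokenize(text), term <- tokens, reduce: %{} do
      acc ->
        tf = Enum.count(tokens, &(&1 == term)) / length(tokens)
        Map.update(acc, term, %{ref => tf}, &Map.put(&1, ref, tf))
    end
  end

  # "Match any": collect every document matching any query term, sum the
  # per-term tf scores per document, and sort by score descending.
  def search(index, query) do
    query
    |> tokenize()
    |> Enum.flat_map(&Map.to_list(Map.get(index, &1, %{})))
    |> Enum.group_by(fn {ref, _tf} -> ref end, fn {_ref, tf} -> tf end)
    |> Enum.map(fn {ref, tfs} -> %{ref: ref, score: Enum.sum(tfs)} end)
    |> Enum.sort_by(& &1.score, :desc)
  end
end

idx = ToySearch.index(%{"1" => "Red Panda", "2" => "Giant Panda", "3" => "Red Fox"})
ToySearch.search(idx, "red panda")
# "Red Panda" matches both terms, so it ranks first
```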
So that was a “brief” overview of Haystack. I’ve left out some details to keep this post at a reasonable length, but I’ve tried to cover the most important topics.
Haystack is currently on version 0.1.0
and so the code shown in this blog post could change. If you want to see Haystack in action, take a look at the search of Culttt.
The obvious question to ask yourself is, should you use Haystack? I think if you are already happy with your existing search solution it’s probably not worth the effort to switch. However, if you are unhappy with your current solution, or you are in a situation like I was where none of the existing off-the-shelf solutions make sense, I believe Haystack could be a good choice for you.
I’m going to continue maintaining Haystack because it’s now integrated into this website and I have three other projects I want to add it to. I’d love to explore different ways to extend the functionality of Haystack via storage, attrs, and queries, and to support different languages. I’d also love to explore how Haystack can be used across different Beam projects such as Nerves, Elixir Desktop, Livebook, and LiveView Native, as well as use Nx and Axon to build machine learning capabilities for a search solution built entirely in Elixir.
If you’re looking to add Haystack to a Nimble Publisher powered blog, stay tuned for my next post.