The No-BS guide to AutoComplete and FuzzySearch in Elasticsearch


Before we begin, here are a few basics.

Analyzer:

An analyzer performs the analysis: it splits the indexed phrase/word into tokens/terms against which the search can then be run with ease.

An analyzer is made up of a tokenizer and filters.

Elasticsearch ships with numerous analyzers by default; here, we use custom analyzers tweaked to meet our requirements.
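
To see what an analyzer produces, run a sample phrase through the _analyze API. A minimal sketch using the built-in standard analyzer (any analyzer defined in your index settings can be named the same way):

curl -X POST http://localhost:9200/_analyze \
-H 'Content-Type: application/json' \
-d '{
  "analyzer": "standard",
  "text": "Ready Player One"
}'

This returns the tokens ready, player, and one, each with its position and character offsets.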

Filter:

A filter removes/filters tokens from the token stream produced by the tokenizer. This is useful when we need to remove false positives from the search results based on the inputs.

We will be using a stop word filter to remove the keywords specified in the search configuration from the query text.
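
The _analyze API also accepts an ad-hoc tokenizer/filter combination, which is handy for trying out a filter before wiring it into an index. A sketch using the built-in stop filter, which drops English stop words by default:

curl -X POST http://localhost:9200/_analyze \
-H 'Content-Type: application/json' \
-d '{
  "tokenizer": "standard",
  "filter": ["lowercase", "stop"],
  "text": "The Book Thief"
}'

The stop filter drops the, leaving only book and thief.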

Tokenizer:

The input string needs to be split into tokens before it can be matched against the indexed documents. We use the ngram tokenizer here, which splits the text into fixed-size, overlapping terms.
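
To get a feel for what an ngram tokenizer emits, here is a sketch with min_gram and max_gram both set to 3, mirroring the fuzzy search settings used below:

curl -X POST http://localhost:9200/_analyze \
-H 'Content-Type: application/json' \
-d '{
  "tokenizer": {
    "type": "ngram",
    "min_gram": 3,
    "max_gram": 3,
    "token_chars": ["letter", "digit"]
  },
  "text": "ready"
}'

This yields rea, ead, and ady: every 3-character window of "ready".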

Mappings:

The created analyzer needs to be mapped to a field name, so that it is applied to that field while indexing and querying.

'Tis time!!!

Now that we have covered the basics, it's time to create our index.

Fuzzy Search:

First on our index list is fuzzy search:

Index Creation:

curl -vX PUT http://localhost:9200/books -d @fuzzy_index.json \
--header "Content-Type: application/json"
 
{
  "mappings": {
    "list": {
      "_all": {
        "analyzer": "my_analyzer"
      },
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "my_analyzer",
          "include_in_all": false
        }
      }
    }
  },
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "my_tokenizer",
            "filter": ["lowercase", "my_filter"]
          }
        },
        "filter": {
          "my_filter": {
            "type": "stop",
            "stopwords": ["&", "and", "the", ",", "'"]
          }
        },
        "tokenizer": {
          "my_tokenizer": {
            "type": "ngram",
            "min_gram": 3,
            "max_gram": 3,
            "token_chars": ["letter", "digit"]
          }
        }
      }
    }
  }
}
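
Once the index exists, it is worth sanity-checking the analyzer against it by name; a quick sketch:

curl -X POST http://localhost:9200/books/_analyze \
-H 'Content-Type: application/json' \
-d '{
  "analyzer": "my_analyzer",
  "text": "Ready Player One"
}'

This should return the lowercased 3-grams of each word: rea, ead, ady, pla, lay, aye, yer, one.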

The following books and their corresponding authors are then loaded into the index.

name                       author
To Kill a Mockingbird      Harper Lee
When You're Ready          J.L. Berg
The Book Thief             Markus Zusak
The Underground Railroad   Colson Whitehead
Pride and Prejudice        Jane Austen
Ready Player One           Ernest Cline
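
One way to load them is through the _bulk API; a sketch showing the first two documents (the type name list matches the _type seen in the search results below, and the IDs are auto-generated):

curl -X POST http://localhost:9200/books/list/_bulk \
-H 'Content-Type: application/x-ndjson' \
--data-binary '{ "index": {} }
{ "name": "To Kill a Mockingbird", "author": "Harper Lee" }
{ "index": {} }
{ "name": "Ready Player One", "author": "Ernest Cline" }
'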

Now, when a fuzzy query such as the following is issued:

curl -X POST \
http://localhost:9200/books/_search \
-H 'Cache-Control: no-cache' \
-H 'Content-Type: application/json' \
-d '{
  "query": {
    "match": {
      "name": "ready"
    }
  }
}'
This query, with "ready" as the match keyword, returns the books whose name contains it:
{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1.3434829,
    "hits": [
      {
        "_index": "books",
        "_type": "list",
        "_id": "AWRcM7yplopA60Y3lBSr",
        "_score": 1.3434829,
        "_source": {
          "name": "Ready Player One",
          "author": "Ernest Cline"
        }
      },
      {
        "_index": "books",
        "_type": "list",
        "_id": "AWRcNRwGlopA60Y3lBSs",
        "_score": 0.53484553,
        "_source": {
          "name": "When You're Ready",
          "author": "J.L. Berg"
        }
      }
    ]
  }
}

AutoComplete:

Next up is autocomplete. The only difference between the fuzzy search index and the autocomplete index is the min_gram and max_gram values.

In this case, the min_gram and max_gram values are set depending on the number of characters to be auto-filled, as follows:


{
  "mappings": {
    "autocomplete": {
      "_all": {
        "analyzer": "my_analyzer"
      },
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "my_analyzer",
          "include_in_all": false
        }
      }
    }
  },
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "my_tokenizer",
            "filter": ["lowercase", "my_filter"]
          }
        },
        "filter": {
          "my_filter": {
            "type": "stop",
            "stopwords": ["&", "and", "the", ",", "'"]
          }
        },
        "tokenizer": {
          "my_tokenizer": {
            "type": "ngram",
            "min_gram": 1,
            "max_gram": 30,
            "token_chars": ["letter", "digit"]
          }
        }
      }
    }
  }
}
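
With min_gram set to 1, even a single typed character yields matches, and longer prefixes narrow them down. A sketch of the query an autocomplete box might fire as the user types "rea" (assuming the mapping above was created as an index named autocomplete and loaded with the same books):

curl -X POST http://localhost:9200/autocomplete/_search \
-H 'Content-Type: application/json' \
-d '{
  "query": {
    "match": {
      "name": "rea"
    }
  }
}'

Both "Ready Player One" and "When You're Ready" would come back, since each name contains the gram rea.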
