Before we begin, here are a few basics.
Analyzer:
An analyzer splits an indexed phrase or word into the tokens/terms against which the search is performed. An analyzer is made up of tokenizers and filters.
Elasticsearch ships with numerous analyzers by default; here, we use custom analyzers tweaked to meet our requirements.
Filter:
A filter removes/filters keywords from the query, which is useful when we need to remove false positives from the search results based on the inputs. We will be using a stop word filter that removes the keywords specified in the search configuration from the query text; for example, with AND configured as a stop word, a query for "Pride AND Prejudice" is effectively searched as "Pride Prejudice".
Tokenizer:
The input string needs to be split into terms before it can be searched against the indexed documents. We will be using the ngram tokenizer here, which splits the query text into terms of a configurable size; a 3-gram tokenizer, for instance, splits "ready" into rea, ead, and ady.
Mappings:
The created analyzer needs to be mapped to a field name for it to be used efficiently while querying.
'Tis time!
Now that we have covered the basics, it's time to create our index.
Fuzzy Search:
The first upon our index list is fuzzy search.
Index Creation:
curl -vX PUT http://localhost:9200/books -d @fuzzy_index.json \
--header "Content-Type: application/json"
{
  "mappings": {
    "books": {
      "_all": {
        "analyzer": "my_analyzer"
      },
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "my_analyzer",
          "include_in_all": false
        }
      }
    }
  },
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "my_tokenizer",
            "filter": ["lowercase", "my_filter"]
          }
        },
        "filter": {
          "my_filter": {
            "type": "stop",
            "stopwords": ["&", "AND", "THE", ",", "'"]
          }
        },
        "tokenizer": {
          "my_tokenizer": {
            "type": "ngram",
            "min_gram": 3,
            "max_gram": 3,
            "token_chars": ["letter", "digit"]
          }
        }
      }
    }
  }
}
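Once the index is created, the analyzer can be sanity-checked with the _analyze API (the exact syntax varies slightly across Elasticsearch versions). For the word "ready", the 3-gram tokenizer should emit the terms rea, ead, and ady:
# Query-parameter form used by older Elasticsearch versions;
# newer versions expect a JSON request body instead.
curl -X GET 'http://localhost:9200/books/_analyze?analyzer=my_analyzer&text=ready'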
Then, the following books and their corresponding authors are loaded into the index (a sketch of the load follows the table).
name | author
To Kill a Mockingbird | Harper Lee
When You're Ready | J.L. Berg
The Book Thief | Markus Zusak
The Underground Railroad | Colson Whitehead
Pride and Prejudice | Jane Austen
Ready Player One | Ernest Cline
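A minimal sketch of the load using the bulk API; the document type "list" is taken from the _type field visible in the search results below, and the remaining books are loaded the same way:
# "list" is the document type seen in the search results further down.
curl -X POST 'http://localhost:9200/books/list/_bulk' \
  -H 'Content-Type: application/json' \
  --data-binary '
{"index": {}}
{"name": "To Kill a Mockingbird", "author": "Harper Lee"}
{"index": {}}
{"name": "Ready Player One", "author": "Ernest Cline"}
'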
Now, a fuzzy query is fired:
curl -X POST \
  http://localhost:9200/books/_search \
  -H 'Cache-Control: no-cache' \
  -H 'Content-Type: application/json' \
  -d '{
  "query": {
    "match": {
      "name": "ready"
    }
  }
}'
The response matches both books containing the term "ready":
{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1.3434829,
    "hits": [
      {
        "_index": "books",
        "_type": "list",
        "_id": "AWRcM7yplopA60Y3lBSr",
        "_score": 1.3434829,
        "_source": {
          "name": "Ready Player One",
          "author": "Ernest Cline"
        }
      },
      {
        "_index": "books",
        "_type": "list",
        "_id": "AWRcNRwGlopA60Y3lBSs",
        "_score": 0.53484553,
        "_source": {
          "name": "When You're Ready",
          "author": "J.L. Berg"
        }
      }
    ]
  }
}
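This is what makes the search fuzzy: a misspelling such as "playr" still shares the 3-grams pla and lay with the indexed word "player", so a query like the one below (a sketch against the same index) should still surface "Ready Player One":
# "playr" is a deliberate typo; its 3-grams overlap those of "player".
curl -X POST \
  http://localhost:9200/books/_search \
  -H 'Content-Type: application/json' \
  -d '{
  "query": {
    "match": {
      "name": "playr one"
    }
  }
}'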
AutoComplete:
Next up is autocomplete. The only difference between the fuzzy search index and the autocomplete index is the min_gram and max_gram values.
In this case, the min_gram and max_gram values are set according to the number of characters to be auto-completed, as follows:
{
  "mappings": {
    "autocomplete": {
      "_all": {
        "analyzer": "my_analyzer"
      },
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "my_analyzer",
          "include_in_all": false
        }
      }
    }
  },
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "my_tokenizer",
            "filter": ["lowercase", "my_filter"]
          }
        },
        "filter": {
          "my_filter": {
            "type": "stop",
            "stopwords": ["&", "AND", "THE", ",", "'"]
          }
        },
        "tokenizer": {
          "my_tokenizer": {
            "type": "ngram",
            "min_gram": 1,
            "max_gram": 30,
            "token_chars": ["letter", "digit"]
          }
        }
      }
    }
  }
}
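With a min_gram of 1 and a max_gram of 30, every substring of a word (up to 30 characters) is indexed as its own term, so even a partial fragment of a title finds the full document. As a sketch, assuming this index is named autocomplete_books and loaded with the same books, a query for the fragment "rea" should return both "Ready Player One" and "When You're Ready":
# autocomplete_books is a hypothetical name for an index created
# with the settings above and loaded with the same documents.
curl -X POST \
  http://localhost:9200/autocomplete_books/_search \
  -H 'Content-Type: application/json' \
  -d '{
  "query": {
    "match": {
      "name": "rea"
    }
  }
}'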