Introduction to Analyzer in Elasticsearch
One component we can tune so that Elasticsearch returns relevant documents is the Analyzer. The Analyzer processes the text we want to index, and it is one of the components that determine which documents match a query and how relevant they are.
A bit about Inverted Index
Since the Analyzer is tightly coupled to the Inverted Index, we first need to understand what an Inverted Index is.
An Inverted Index is a data structure that maps each token to the identifiers of the documents that contain it. Besides document identifiers, the Inverted Index also stores each token's position within the document. Because Elasticsearch maps tokens to document identifiers, it can look up the matching documents for a query directly and return them quickly.
Indexing documents into Inverted Index
Let’s say that we want to index 2 documents:
Document 1: “Elasticsearch is fast”
Document 2: “I want to learn Elasticsearch”
Let’s take a peek at what the Inverted Index holds after the Analysis and Indexing process.
Each term is counted and mapped to the identifiers of the documents that contain it, along with its position in each document. The Inverted Index holds lowercase terms such as “elasticsearch”, “is”, and “fast” rather than the full documents “Elasticsearch is fast” or “I want to learn Elasticsearch”, because the documents go through the Analysis process, which is the main topic of this article.
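To make the mapping concrete, here is a minimal Python sketch that builds a toy inverted index for the two documents. The `analyze` and `build_inverted_index` helpers are made up for illustration; `analyze` only lowercases and splits on non-word characters, which is a rough stand-in for a real analyzer, and the structure is far simpler than what Elasticsearch actually stores.

```python
import re
from collections import defaultdict


def analyze(text):
    """Rough stand-in for an analyzer: lowercase, then split on non-word characters."""
    return re.findall(r"\w+", text.lower())


def build_inverted_index(docs):
    """Map each token to {document id: [positions of the token in that document]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, token in enumerate(analyze(text)):
            index[token].setdefault(doc_id, []).append(position)
    return dict(index)


docs = {
    1: "Elasticsearch is fast",
    2: "I want to learn Elasticsearch",
}
index = build_inverted_index(docs)

# Print every token with the documents and positions where it occurs.
for token in sorted(index):
    print(token, index[token])
```

In this sketch the key is the lowercase token “elasticsearch”, which points at both documents; the original capitalized word is not a key at all.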
Querying into Inverted Index
There is one thing to note about querying the Inverted Index: Elasticsearch only returns documents that contain exactly the same term as the one queried.
We can easily test this with two types of Elasticsearch queries, Match Query and Term Query. Basically, a Match Query goes through an Analysis process while a Term Query doesn't. If you’re interested in the difference between them, you can read my other article, “Elasticsearch: Text vs. Keyword”.
If you run a Term Query for “Elasticsearch” against the index in the example above, you won’t get any results. This happens because the token in the Inverted Index is “elasticsearch”, with a lowercase “e”. When you run the same search with a Match Query, Elasticsearch analyzes the query text into “elasticsearch” before searching the Inverted Index, so the query returns results.
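The difference can be imitated with a few lines of Python. This is only a sketch under simplifying assumptions: `term_query` and `match_query` are hypothetical helpers that mimic the behavior described above on a toy index, not the Elasticsearch query implementations, and `analyze` is the same rough lowercase-and-split stand-in for an analyzer.

```python
import re


def analyze(text):
    """Rough stand-in for an analyzer: lowercase, then split on non-word characters."""
    return re.findall(r"\w+", text.lower())


# Toy inverted index for the two example documents: token -> set of document ids.
docs = {1: "Elasticsearch is fast", 2: "I want to learn Elasticsearch"}
index = {}
for doc_id, text in docs.items():
    for token in analyze(text):
        index.setdefault(token, set()).add(doc_id)


def term_query(term):
    """Look the term up exactly as given, with no analysis."""
    return sorted(index.get(term, set()))


def match_query(text):
    """Analyze the query text first, then look up each resulting token."""
    hits = set()
    for token in analyze(text):
        hits |= index.get(token, set())
    return sorted(hits)


print(term_query("Elasticsearch"))   # [] because the stored token is lowercase
print(match_query("Elasticsearch"))  # [1, 2] because the query is lowercased first
```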
What is Analyzer in Elasticsearch?
When we insert a text document into Elasticsearch, it doesn’t index the text as it is. The text goes through an Analysis process performed by an Analyzer, which transforms the text and splits it into tokens before saving them to the Inverted Index.
For example, inserting “Let’s build an Autocomplete!” into Elasticsearch produces 4 terms: “let’s”, “build”, “an”, and “autocomplete”.
The Analyzer affects how we search the text, but it doesn’t change the content of the text itself. With the previous example, if we search for “build”, Elasticsearch still returns the full text “Let’s build an Autocomplete!” instead of only “build”.
Elasticsearch’s Analyze API
Elasticsearch provides a very convenient API that we can use to test and visualize what an analyzer does:
Request
curl --request GET \
  --url http://localhost:9200/_analyze \
  --header 'Content-Type: application/json' \
  --data '{
    "analyzer": "standard",
    "text": "Let'\''s build an autocomplete!"
  }'
Response
{
  "tokens": [
    {
      "token": "let's",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "build",
      "start_offset": 6,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "an",
      "start_offset": 12,
      "end_offset": 14,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "autocomplete",
      "start_offset": 15,
      "end_offset": 27,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}
This API makes debugging an analyzer much easier. We will use it a lot in this article.
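If you prefer calling the Analyze API from a script, the request can be sketched with Python's standard library. The helper names `build_analyze_request` and `analyze_tokens` are made up for this example, the URL assumes the same local node as the curl command, and the sketch sends a POST, which the `_analyze` endpoint accepts alongside GET.

```python
import json
import urllib.request


def build_analyze_request(text, analyzer="standard", base_url="http://localhost:9200"):
    """Build (but do not send) an HTTP request for the _analyze endpoint."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze_tokens(text, analyzer="standard", base_url="http://localhost:9200"):
    """Send the request and return only the token strings from the response."""
    request = build_analyze_request(text, analyzer, base_url)
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return [entry["token"] for entry in payload["tokens"]]
```

With a node running locally, `analyze_tokens("Let's build an autocomplete!")` should return the same four tokens shown in the response above.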
Elasticsearch Analyzer Components
Elasticsearch’s Analyzer has three components you can modify depending on your use case:
- Character Filters
- Tokenizer
- Token Filter
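The three components run in order: character filters first, then the tokenizer, then token filters. A minimal Python sketch of that pipeline follows. Every function in it is a hypothetical stand-in chosen for illustration (for instance, the apostrophe normalization is not something the standard analyzer does); only the ordering of the stages reflects how an analyzer is put together.

```python
import re


def char_filter(text):
    """Stage 1: rewrite raw characters. Here: turn curly apostrophes into straight ones."""
    return text.replace("\u2019", "'")


def tokenizer(text):
    """Stage 2: split the text into tokens. Here: runs of word characters, keeping inner apostrophes."""
    return re.findall(r"\w+(?:'\w+)*", text)


def token_filter(tokens):
    """Stage 3: transform the tokens. Here: lowercase every token."""
    return [token.lower() for token in tokens]


def analyze(text):
    """Run the three stages in the order an analyzer applies them."""
    return token_filter(tokenizer(char_filter(text)))


print(analyze("Let's build an Autocomplete!"))
# ["let's", 'build', 'an', 'autocomplete']
```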
Character Filters
The first step in the Analysis process is Character Filtering, which removes, adds, or replaces characters in the text before it is split into tokens.
There are three built-in Character Filters in Elasticsearch:
- HTML Strip Character Filter: Strips out HTML tags such as <b>, <i>, <div>, <br />, et cetera.
- Mapping Character Filter: Lets you map one string to another. For example, if you want users to be able to search for an emoji, you can map “:)” to “smile”.
- Pattern Replace Character Filter: Replaces text that matches a regular expression with another string. Be careful though, a Pattern Replace Character Filter can slow down document indexing.
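To get a feel for what each kind of character filter does to the text, here is a rough Python imitation of the three behaviors. These functions are simplified stand-ins written for this example and are not how Elasticsearch implements its `html_strip`, `mapping`, and `pattern_replace` filters; the sample inputs are also made up.

```python
import html
import re


def html_strip(text):
    """Drop anything that looks like an HTML tag, then decode entities such as &amp;."""
    return html.unescape(re.sub(r"<[^>]+>", "", text))


def mapping(text, mappings):
    """Replace every occurrence of each source string with its target string."""
    for source, target in mappings.items():
        text = text.replace(source, target)
    return text


def pattern_replace(text, pattern, replacement):
    """Replace every match of a regular expression (Python backreference syntax)."""
    return re.sub(pattern, replacement, text)


print(html_strip("<p>I <b>love</b> search &amp; autocomplete</p>"))
# I love search & autocomplete
print(mapping("Great job :)", {":)": "smile"}))
# Great job smile
print(pattern_replace("order 123-456", r"(\d+)-(\d+)", r"\1_\2"))
# order 123_456
```

All three operate on the raw string, which is why character filters run before the tokenizer sees the text.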