Custom Analyzers¶
Overview¶
An Atlas Search analyzer prepares a set of documents to be indexed by performing a series of operations to transform, filter, and group sequences of characters. You can define a custom analyzer to suit your specific indexing needs.
Text analysis is a three-step process:

1. Character filtering. You can specify one or more character filters to use in your custom analyzer. Character filters examine text one character at a time and perform filtering operations.
2. Tokenization. An analyzer uses a tokenizer to split chunks of text into groups, or tokens, for indexing purposes. For example, the whitespace tokenizer splits text fields into individual words based on where whitespace occurs.
3. Token filtering. After the tokenization step, the resulting tokens can pass through one or more token filters. A token filter performs operations such as:
   - Stemming, which reduces related words, such as "talking", "talked", and "talks", to their root word "talk".
   - Redaction, the removal of sensitive information from public documents.
Usage¶
A custom analyzer has the following syntax:
"analyzers": [ { "name": "<name>", "charFilters": [ <list-of-character-filters> ], "tokenizer": { "type": "<tokenizer-type" }, "tokenFilters": [ <list-of-token-filters> ] } ]
A custom analyzer has the following attributes:
Attribute | Type | Description | Required? |
---|---|---|---|
name | string | Name of the custom analyzer. Names must be unique within an index, and may not start with any of the following strings: lucene., builtin., mongodb. | yes |
charFilters | list of objects | Array containing zero or more character filters. | no |
tokenizer | object | Tokenizer to use. | yes |
tokenFilters | list of objects | Array containing zero or more token filters. | no |
To use a custom analyzer when indexing a collection, include it in the index definition. In the following example, a custom analyzer named htmlStrippingAnalyzer uses a character filter to remove all HTML tags except the a tag from the text.
{ "analyzer": "htmlStrippingAnalyzer", "mappings": { "dynamic": true }, "analyzers": [ { "name": "htmlStrippingAnalyzer", "charFilters": [ { "type": "htmlStrip", "ignoredTags": ["a"] } ], "tokenizer": { "type": "standard" }, "tokenFilters": [] } ] }
Character Filters¶
Character filters always require a type field, and some take additional options as well.
"charFilters": [ { "type": "<filter-type>", "<additional-option>": <value> } ]
Atlas Search supports four types of character filters:
Type | Description |
---|---|
htmlStrip | Strips out HTML constructs. |
icuNormalize | Normalizes text with the ICU Normalizer. Based on Lucene's ICUNormalizer2CharFilter. |
mapping | Applies user-specified normalization mappings to characters. Based on Lucene's MappingCharFilter. |
persian | Replaces instances of zero-width non-joiner with ordinary space. Based on Lucene's PersianCharFilter. |
htmlStrip¶
The htmlStrip
character filter has the following attributes:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | The type of this character filter. Must be htmlStrip . | yes | |
ignoredTags | array of strings | A list of HTML tags to exclude from filtering. | no |
The following example index definition uses a custom analyzer named
htmlStrippingAnalyzer
. It uses the htmlStrip
character
filter to remove all HTML tags from the text except the a
tag. It
uses the standard tokenizer and no
token filters.
{ "analyzer": "htmlStrippingAnalyzer", "mappings": { "dynamic": true }, "analyzers": [{ "name": "htmlStrippingAnalyzer", "charFilters": [{ "type": "htmlStrip", "ignoredTags": ["a"] }], "tokenizer": { "type": "standard" }, "tokenFilters": [] }] }
icuNormalize¶
The icuNormalize
character filter has the following attribute:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | The type of this character filter. Must be icuNormalize . | yes |
The following example index definition uses a custom analyzer named
normalizingAnalyzer
. It uses the icuNormalize
character
filter, the whitespace tokenizer
and no token filters.
{ "analyzer": "normalizingAnalyzer", "mappings": { "dynamic": true }, "analyzers": [ { "name": "normalizingAnalyzer", "charFilters": [ { "type": "icuNormalize" } ], "tokenizer": { "type": "whitespace" }, "tokenFilters": [] } ] }
mapping¶
The mapping
character filter has the following attributes:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | The type of this character filter. Must be mapping. | yes | |
mappings | object | An object containing a comma-separated list of mappings. A mapping indicates that one character or group of characters should be substituted for another, in the format <original> : <replacement>. | yes | |
The following example index definition uses a custom analyzer named mappingAnalyzer. It uses the mapping character filter to replace each backslash character (\) with a forward slash (/). It uses the keyword tokenizer and no token filters.
{ "analyzer": "mappingAnalyzer", "mappings": { "dynamic": true }, "analyzers": [ { "name": "mappingAnalyzer", "charFilters": [ { "type": "mapping", "mappings": { "\\": "/" } } ], "tokenizer": { "type": "keyword" }, "tokenFilters": [] } ] }
persian¶
The persian
character filter has the following attribute:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | The type of this character filter. Must be persian . | yes |
The following example index definition uses a custom analyzer named
persianCharacterIndex
. It uses the persian
character filter,
the whitespace
tokenizer and no token filters.
{ "analyzer": "persianCharacterIndex", "mappings": { "dynamic": true }, "analyzers": [ { "name": "persianCharacterIndex", "charFilters": [ { "type": "persian" } ], "tokenizer": { "type": "whitespace" }, "tokenFilters": [] } ] }
Tokenizers¶
A custom analyzer's tokenizer determines how Atlas Search splits up text into discrete chunks for indexing.
Tokenizers always require a type field, and some take additional options as well.
"tokenizer": { "type": "<tokenizer-type>", "<additional-option>": "<value>" }
Atlas Search supports the following tokenizer options:
Name | Description |
---|---|
standard | Tokenize based on word break rules from the Unicode Text Segmentation algorithm. |
keyword | Tokenize the entire input as a single token. |
whitespace | Tokenize based on occurrences of whitespace between words. |
nGram | Tokenize into text chunks, or "n-grams", of given sizes. |
edgeGram | Tokenize input from the beginning, or "edge", of a text input into n-grams of given sizes. |
regexCaptureGroup | Match a regular expression pattern to extract tokens. |
regexSplit | Split tokens with a regular-expression based delimiter. |
standard¶
The standard
tokenizer has the following attributes:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | The type of this tokenizer. Must be standard. | yes | |
maxTokenLength | integer | Maximum length for a single token. Tokens greater than this length are split at maxTokenLength into multiple tokens. | no | 255 |
The following example index definition uses a custom analyzer named
standardShingler
. It uses the standard
tokenizer and the
shingle token filter.
{ "analyzer": "standardShingler", "mappings": { "dynamic": true }, "analyzers": [ { "name": "standardShingler", "charFilters": [], "tokenizer": { "type": "standard", "maxTokenLength": 10, }, "tokenFilters": [ { "type": "shingle", "minSingleSize": 2, "maxShingleSize": 3 } ] } ] }
keyword¶
The keyword
tokenizer has the following attribute:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | The type of this tokenizer. Must be keyword . | yes |
The following example index definition uses a custom analyzer named
keywordTokenizingIndex
. It uses the keyword
tokenizer and a
regular expression token filter that redacts email addresses.
{ "analyzer": "keywordTokenizingIndex", "mappings": { "dynamic": true }, "analyzers": [ { "name": "keywordTokenizingIndex", "charFilters": [], "tokenizer": { "type": "keyword" }, "tokenFilters": [ { "type": "regex", "pattern": "^([a-z0-9_\\.-]+)@([\\da-z\\.-]+)\\.([a-z\\.]{2,5})$", "replacement": "redacted", "matches": "all" } ] } ] }
whitespace¶
The whitespace
tokenizer has the following attributes:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | The type of this tokenizer. Must be whitespace. | yes | |
maxTokenLength | integer | Maximum length for a single token. Tokens greater than this length are split at maxTokenLength into multiple tokens. | no | 255 |
The following example index definition uses a custom analyzer named
whitespaceLowerer
. It uses the whitespace
tokenizer and a
token filter that lowercases all tokens.
{ "analyzer": "whitespaceLowerer", "mappings": { "dynamic": true }, "analyzers": [ { "name": "whitespaceLowerer", "charFilters": [], "tokenizer": { "type": "whitespace" }, "tokenFilters": [ { "type": "lowercase" } ] } ] }
nGram¶
The nGram
tokenizer has the following attributes:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | The type of this tokenizer. Must be nGram . | yes | |
minGram | integer | Number of characters to include in the shortest token created. | yes | |
maxGram | integer | Number of characters to include in the longest token created. | yes |
The following example index definition uses a custom analyzer named ngramShingler. It uses the nGram tokenizer to create tokens between 2 and 5 characters long, and the shingle token filter.
{ "analyzer": "ngramShingler", "mappings": { "dynamic": true }, "analyzers": [ { "name": "ngramShingler", "charFilters": [], "tokenizer": { "type": "nGram", "minGram": 2, "maxGram": 5 }, "tokenFilters": [ { "type": "shingle", "minSingleSize": 2, "maxShingleSize": 3 } ] } ] }
edgeGram¶
The edgeGram
tokenizer has the following attributes:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | The type of this tokenizer. Must be edgeGram . | yes | |
minGram | integer | Number of characters to include in the shortest token created. | yes | |
maxGram | integer | Number of characters to include in the longest token created. | yes |
The following example index definition uses a custom analyzer named edgegramShingler. It uses the edgeGram tokenizer to create tokens between 2 and 5 characters long, starting from the first character of the text input, and the shingle token filter.
{ "analyzer": "edegramShingler", "mappings": { "dynamic": true }, "analyzers": [ { "name": "edgegramShingler", "charFilters": [], "tokenizer": { "type": "edgeGram", "minGram": 2, "maxGram": 5 }, "tokenFilters": [ { "type": "shingle", "minSingleSize": 2, "maxShingleSize": 3 } ] } ] }
regexCaptureGroup¶
The regexCaptureGroup
tokenizer has the following attributes:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | The type of this tokenizer. Must be regexCaptureGroup. | yes | |
pattern | string | A regular expression to match against. | yes | |
group | integer | Index of the character group within the matching expression to extract into tokens. Use 0 to extract all character groups. | yes | |
The following example index definition uses a custom analyzer named phoneNumberExtractor. It uses the regexCaptureGroup tokenizer to create a single token from the first US-formatted phone number present in the text input.
{ "analyzer": "phoneNumberExtractor", "mappings": { "dynamic": true }, "analyzers": [ { "name": "phoneNumberExtractor", "charFilters": [], "tokenizer": { "type": "regexCaptureGroup", "pattern": "^\\b\\d{3}[-.]?\\d{3}[-.]?\\d{4}\\b$", "group": 0 }, "tokenFilters": [] } ] }
regexSplit¶
The regexSplit
tokenizer has the following attributes:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | The type of this tokenizer. Must be regexSplit . | yes | |
pattern | string | A regular expression to match against. | yes |
The following example index definition uses a custom analyzer named
dashSplitter
. It uses the regexSplit
tokenizer
to create tokens from hyphen-delimited input text.
{ "analyzer": "dashSplitter", "mappings": { "dynamic": true }, "analyzers": [ { "name": "dashSplitter", "charFilters": [], "tokenizer": { "type": "regexSplit", "pattern": "[-]+", }, "tokenFilters": [] } ] }
Token Filters¶
Token Filters always require a type field, and some take additional options as well.
"tokenFilters": [ { "type": "<token-filter-type>", "<additional-option>": <value> } ]
Atlas Search supports the following token filters:
Name | Description |
---|---|
lowercase | Normalizes token text to lowercase. |
length | Removes tokens that are too short or too long. |
icuFolding | Applies character folding from Unicode Technical Report #30. |
icuNormalizer | Normalizes tokens using a standard Unicode Normalization Mode. |
nGram | Tokenizes input into n-grams of configured sizes. |
edgeGram | Tokenizes input into edge n-grams of configured sizes. |
shingle | Constructs shingles (token n-grams) from a series of tokens. |
regex | Applies a regular expression to each token, replacing matches with a specified string. |
snowballStemming | Stems tokens using a Snowball-generated stemmer. |
lowercase¶
The lowercase
token filter has the following attribute:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | The type of this token filter. Must be lowercase . | yes |
The following example index definition uses a custom analyzer named
lowercaser
. It uses the standard tokenizer with the lowercase
token filter to lowercase
all tokens.
{ "analyzer": "lowercaser", "mappings": { "dynamic": true }, "analyzers": [ { "name": "lowercaser", "charFilters": [], "tokenizer": { "type": "standard" }, "tokenFilters": [ { "type": "lowercase" } ] } ] }
length¶
The length
token filter has the following attributes:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | The type of this token filter. Must be length . | yes | |
min | integer | The minimum length of a token. Must be less than or equal to max . | no | 0 |
max | integer | The maximum length of a token. Must be greater than or equal to min . | no | 255 |
The following example index definition uses a custom analyzer named
longOnly
. It uses the length
token filter to index only tokens
that are at least 20 UTF-16 code units long after tokenizing with the
standard tokenizer.
{ "analyzer": "longOnly", "mappings": { "dynamic": true }, "analyzers": [ { "name": "longOnly", "charFilters": [], "tokenizer": { "type": "standard" }, "tokenFilters": [ { "type": "length", "min": 20 } ] } ] }
icuFolding¶
The icuFolding
token filter has the following attribute:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | The type of this token filter. Must be icuFolding . | yes |
The following example index definition uses a custom analyzer named
diacriticFolder
. It uses the keyword tokenizer with the icuFolding
token filter to apply
foldings from UTR#30 Character Foldings. Foldings include accent
removal, case folding, canonical duplicates folding, and many others
detailed in the report.
{ "analyzer": "diacriticFolder", "mappings": { "dynamic": true }, "analyzers": [ { "name": "diacriticFolder", "charFilters": [], "tokenizer": { "type": "keyword" }, "tokenFilters": [ { "type": "icuFolding", } ] } ] }
icuNormalizer¶
The icuNormalizer
token filter has the following attributes:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | The type of this token filter. Must be icuNormalizer. | yes | |
normalizationForm | string | Normalization form to apply. Accepted values are: nfd (Canonical Decomposition), nfc (Canonical Decomposition, followed by Canonical Composition), nfkd (Compatibility Decomposition), and nfkc (Compatibility Decomposition, followed by Canonical Composition). For more information about the supported normalization forms, see Section 1.2: Normalization Forms, UTR#15. | no | nfc |
The following example index definition uses a custom analyzer named
normalizer
. It uses the whitespace tokenizer, then normalizes
tokens by Canonical Decomposition, followed by Canonical Composition.
{ "analyzer": "normalizer", "mappings": { "dynamic": true }, "analyzers": [ { "name": "normalizer", "charFilters": [], "tokenizer": { "type": "whitespace" }, "tokenFilters": [ { "type": "icuNormalizer", "normalizationForm": "nfc" } ] } ] }
nGram¶
The nGram
token filter has the following attributes:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | The type of this token filter. Must be nGram. | yes | |
minGram | integer | The minimum length of generated n-grams. Must be less than or equal to maxGram. | yes | |
maxGram | integer | The maximum length of generated n-grams. Must be greater than or equal to minGram. | yes | |
termNotInBounds | string | Accepted values are: include, omit. If include is specified, tokens shorter than minGram or longer than maxGram are indexed as-is. If omit is specified, those tokens are not indexed. | no | omit |
The following example index definition uses a custom analyzer named
persianAutocomplete
. It functions as an autocomplete analyzer for
Persian and other languages that use the zero-width non-joiner
character. It performs the following operations:
- Normalizes zero-width non-joiner characters with the persian character filter.
- Tokenizes by whitespace with the whitespace tokenizer.
- Applies a series of token filters:
  - icuNormalizer
  - shingle
  - nGram
{ "analyzer": "persianAutocomplete", "mappings": { "dynamic": true }, "analyzers": [ { "name": "persianAutocomplete", "charFilters": [ { "type": "persian" } ], "tokenizer": { "type": "whitespace" }, "tokenFilters": [ { "type": "icuNormalizer", "normalizationForm": "nfc" }, { "type": "shingle", "minShingleSize": 2, "maxShingleSize": 3 }, { "type": "nGram", "minGram": 1, "maxGram": 10 } ] } ] }
edgeGram¶
The edgeGram
token filter has the following attributes:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | The type of this token filter. Must be edgeGram. | yes | |
minGram | integer | The minimum length of generated n-grams. Must be less than or equal to maxGram. | yes | |
maxGram | integer | The maximum length of generated n-grams. Must be greater than or equal to minGram. | yes | |
termNotInBounds | string | Accepted values are: include, omit. If include is specified, tokens shorter than minGram or longer than maxGram are indexed as-is. If omit is specified, those tokens are not indexed. | no | omit |
The following example index definition uses a custom analyzer named
englishAutocomplete
. It performs the following operations:
- Tokenizes with the standard tokenizer.
- Applies token filtering with the following filters:
  - icuFolding
  - shingle
  - edgeGram
{ "analyzer": "englishAutocomplete", "mappings": { "dynamic": true }, "analyzers": [ { "name": "englishAutocomplete", "charFilters": [], "tokenizer": { "type": "standard" }, "tokenFilters": [ { "type": "icuFolding" }, { "type": "shingle", "minShingleSize": 2, "maxShingleSize": 3 }, { type: "edgeGram", "minGram": 1, "maxGram": 10 } ] } ] }
shingle¶
The shingle
token filter has the following attributes:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | The type of this token filter. Must be shingle. | yes | |
minShingleSize | integer | Minimum number of tokens per shingle. Must be less than or equal to maxShingleSize. | yes | |
maxShingleSize | integer | Maximum number of tokens per shingle. Must be greater than or equal to minShingleSize. | yes | |
The following example index definition uses a custom analyzer named
shingler
. It creates shingles of two and three token combinations
after tokenizing with the standard tokenizer. minShingleSize
is set to 2
, so it
does not index input when only one token is created from the tokenizer.
{ "analyzer": "shingler", "mappings": { "dynamic": true }, "analyzers": [ { "name": "shingler", "charFilters": [], "tokenizer": { "type": "standard" }, "tokenFilters": [ { "type": "shingle", "minShingleSize": 2, "maxShingleSize": 3 } ] } ] }
regex¶
The regex
token filter has the following attributes:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | The type of this token filter. Must be regex. | yes | |
pattern | string | Regular expression pattern to apply to each token. | yes | |
replacement | string | Replacement string to substitute wherever a matching pattern occurs. | yes | |
matches | string | Acceptable values are: all, first. If matches is set to all, replace all matching patterns. Otherwise, replace only the first matching pattern. | yes | |
The following example index definition uses a custom analyzer named
emailRedact
. It uses the keyword tokenizer.
It finds strings that look like email addresses and replaces them with
the word redacted
.
{ "analyzer": "emailRedact", "mappings": { "dynamic": true }, "analyzers": [ { "name": "emailRedact", "charFilters": [], "tokenizer": { "type": "keyword" }, "tokenFilters": [ { "type": "regex", "pattern": "^([a-z0-9_\\.-]+)@([\\da-z\\.-]+)\\.([a-z\\.]{2,5})$", "replacement": "redacted", "matches": "all" } ] } ] }
snowballStemming¶
The snowballStemming
token filter has the following attributes:
Name | Type | Description | Required? | Default |
---|---|---|---|---|
type | string | The type of this token filter. Must be snowballStemming. | yes | |
stemmerName | string | The Snowball stemmer to use, specified by language name, for example english, french, german, russian, or spanish. | yes | |
The following example index definition uses a custom analyzer named
frenchStemmer
. It uses the lowercase
token filter and the
standard tokenizer, followed
by the french
variant of the snowballStemming
token filter.
{ "analyzer": "frenchStemmer", "mappings": { "dynamic": true }, "analyzers": [ { "name": "frenchStemmer", "charFilters": [], "tokenizer": { "type": "standard" }, "tokenFilters": [ { "type": "lowercase" }, { "type": "snowballStemming", "stemmerName": "french" } ] } ] }
Query Example¶
A collection named minutes
contains the following documents:
{ "_id" : 1, "text" : "<head> This page deals with department meetings. </head>" } { "_id" : 2, "text" : "The head of the sales department spoke first." } { "_id" : 3, "text" : "<body>We'll head out to the conference room by noon.</body>" }
The index definition for the minutes
collection uses a custom analyzer
with the htmlStrip character filter to strip out
HTML tags when searching for text specified in the query
field of a
search.
{ "analyzer": "htmlStrippingAnalyzer", "mappings": { "dynamic": true }, "analyzers": [{ "name": "htmlStrippingAnalyzer", "charFilters": [{ "type": "htmlStrip", "ignoredTags": ["a"] }], "tokenizer": { "type": "standard" }, "tokenFilters": [] }] }
The following search operation looks for occurrences of the string
head
in the text
field of the minutes
collection.
db.minutes.aggregate([
  {
    $search: {
      text: {
        query: "head",
        path: "text"
      }
    }
  }
])
The query returns the following results:
{ "_id" : 2, "text" : "The head of the sales department spoke first." } { "_id" : 3, "text" : "<body>We'll head out to the conference room by noon.</body>" }
The document with _id: 1 is not returned because the only occurrence of the string head is inside the HTML tag <head>, which the htmlStrippingAnalyzer strips out. The document with _id: 3 also contains HTML tags, but the string head appears outside of them, so the document is a match.