Custom Analyzers

An Atlas Search analyzer prepares a set of documents to be indexed by performing a series of operations to transform, filter, and group sequences of characters. You can define a custom analyzer to suit your specific indexing needs.

Text analysis is a three-step process:

  1. Character Filters

    You can specify one or more character filters to use in your custom analyzer. Character filters examine text one character at a time and perform filtering operations.

  2. Tokenizer

    An analyzer uses a tokenizer to split chunks of text into groups, or tokens, for indexing purposes. For example, the whitespace tokenizer splits text fields into individual words based on where whitespace occurs.

  3. Token Filters

    After the tokenization step, the resulting tokens can pass through one or more token filters. A token filter performs operations such as:

    • Stemming, which reduces related words, such as "talking", "talked", and "talks", to their root word "talk".
    • Redaction, the removal of sensitive information from public documents.
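
For intuition, here is a minimal sketch in plain JavaScript (not Atlas Search internals) of how these three stages might transform a short string. The stages loosely mirror the htmlStrip character filter, the whitespace tokenizer, and the lowercase token filter described later on this page; Atlas Search performs the real analysis server-side when it builds the index.

// Minimal sketch of the three analysis stages, for illustration only.
const input = "<p>Talking about MongoDB</p>";

// 1. Character filter: strip HTML tags (similar in spirit to htmlStrip).
const filtered = input.replace(/<[^>]+>/g, "");

// 2. Tokenizer: split the text on whitespace (similar in spirit to the whitespace tokenizer).
const tokens = filtered.split(/\s+/).filter(t => t.length > 0);

// 3. Token filter: lowercase every token (like the lowercase token filter).
const lowercased = tokens.map(t => t.toLowerCase());

console.log(lowercased); // [ 'talking', 'about', 'mongodb' ]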

A custom analyzer has the following syntax:

"analyzers": [
{
"name": "<name>",
"charFilters": [ <list-of-character-filters> ],
"tokenizer": {
"type": "<tokenizer-type"
},
"tokenFilters": [ <list-of-token-filters> ]
}
]

A custom analyzer has the following attributes:

Attribute | Type | Description | Required?
name | string | Name of the custom analyzer. Names must be unique within an index, and may not start with any of the following strings: "lucene.", "builtin.", or "mongodb.". | yes
charFilters | list of objects | Array containing zero or more character filters. | no
tokenizer | object | Tokenizer to use. | yes
tokenFilters | list of objects | Array containing zero or more token filters. | no

To use a custom analyzer when indexing a collection, include it in the index definition. In the following example, a custom analyzer named htmlStrippingAnalyzer uses a character filter to remove all HTML tags from the text except the a tag.

{
  "analyzer": "htmlStrippingAnalyzer",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "htmlStrippingAnalyzer",
      "charFilters": [
        {
          "type": "htmlStrip",
          "ignoredTags": ["a"]
        }
      ],
      "tokenizer": {
        "type": "standard"
      },
      "tokenFilters": []
    }
  ]
}
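
If your cluster and mongosh version support db.collection.createSearchIndex(), one way to create an index from this definition is sketched below. The collection name pages is a placeholder; you can also paste the same definition into the Atlas UI or send it through the Atlas Administration API.

// Hypothetical example: create the index from mongosh on a placeholder collection named "pages".
db.pages.createSearchIndex(
  "default",
  {
    "analyzer": "htmlStrippingAnalyzer",
    "mappings": { "dynamic": true },
    "analyzers": [
      {
        "name": "htmlStrippingAnalyzer",
        "charFilters": [ { "type": "htmlStrip", "ignoredTags": ["a"] } ],
        "tokenizer": { "type": "standard" },
        "tokenFilters": []
      }
    ]
  }
)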

Character filters always require a type field, and some take additional options as well.

"charFilters": [
{
"type": "<filter-type>",
"<additional-option>": <value>
}
]

Atlas Search supports four types of character filters:

Type | Description
htmlStrip | Strips out HTML constructs.
icuNormalize | Normalizes text with the ICU Normalizer. Based on Lucene's ICUNormalizer2CharFilter.
mapping | Applies user-specified normalization mappings to characters. Based on Lucene's MappingCharFilter.
persian | Replaces instances of zero-width non-joiner with ordinary space. Based on Lucene's PersianCharFilter.

The htmlStrip character filter has the following attributes:

Name | Type | Description | Required? | Default
type | string | The type of this character filter. Must be htmlStrip. | yes |
ignoredTags | array of strings | A list of HTML tags to exclude from filtering. | no |

Example

The following example index definition uses a custom analyzer named htmlStrippingAnalyzer. It uses the htmlStrip character filter to remove all HTML tags from the text except the a tag. It uses the standard tokenizer and no token filters.

{
  "analyzer": "htmlStrippingAnalyzer",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "htmlStrippingAnalyzer",
      "charFilters": [
        {
          "type": "htmlStrip",
          "ignoredTags": ["a"]
        }
      ],
      "tokenizer": {
        "type": "standard"
      },
      "tokenFilters": []
    }
  ]
}

The icuNormalize character filter has the following attribute:

Name | Type | Description | Required? | Default
type | string | The type of this character filter. Must be icuNormalize. | yes |

Example

The following example index definition uses a custom analyzer named normalizingAnalyzer. It uses the icuNormalize character filter, the whitespace tokenizer and no token filters.

{
  "analyzer": "normalizingAnalyzer",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "normalizingAnalyzer",
      "charFilters": [
        {
          "type": "icuNormalize"
        }
      ],
      "tokenizer": {
        "type": "whitespace"
      },
      "tokenFilters": []
    }
  ]
}
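
For intuition about what Unicode normalization does, the following plain JavaScript snippet (not the Lucene ICU normalizer that Atlas Search uses) shows two different encodings of the same visible string comparing equal only after normalization:

// "café" written two ways: a precomposed é versus an e followed by a combining accent.
const composed = "caf\u00E9";
const decomposed = "cafe\u0301";

console.log(composed === decomposed);                                    // false
console.log(composed.normalize("NFC") === decomposed.normalize("NFC"));  // true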

The mapping character filter has the following attributes:

Name | Type | Description | Required? | Default
type | string | The type of this character filter. Must be mapping. | yes |
mappings | object | An object containing a comma-separated list of mappings. A mapping indicates that one character or group of characters should be substituted for another, in the format <original> : <replacement>. | yes |

Example

The following example index definition uses a custom analyzer named mappingAnalyzer. It uses the mapping character filter to replace instances of \\ with /. It uses the keyword tokenizer and no token filters.

{
  "analyzer": "mappingAnalyzer",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "mappingAnalyzer",
      "charFilters": [
        {
          "type": "mapping",
          "mappings": {
            "\\": "/"
          }
        }
      ],
      "tokenizer": {
        "type": "keyword"
      },
      "tokenFilters": []
    }
  ]
}
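
As a hypothetical usage example, suppose documents in a collection named files (a placeholder name) store Windows-style paths in a path field and are indexed with mappingAnalyzer. Because \ is replaced with / before the keyword tokenizer runs, and the query string is analyzed the same way, a forward-slash query can match a backslash value:

// Hypothetical query; "files" and "path" are placeholder names.
// A document whose path field is exactly "C:\temp\reports" is indexed as the
// single token "C:/temp/reports", which this query string also produces.
db.files.aggregate([
  {
    $search: {
      text: {
        query: "C:/temp/reports",
        path: "path"
      }
    }
  }
])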

The persian character filter has the following attribute:

Name | Type | Description | Required? | Default
type | string | The type of this character filter. Must be persian. | yes |

Example

The following example index definition uses a custom analyzer named persianCharacterIndex. It uses the persian character filter, the whitespace tokenizer and no token filters.

{
  "analyzer": "persianCharacterIndex",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "persianCharacterIndex",
      "charFilters": [
        {
          "type": "persian"
        }
      ],
      "tokenizer": {
        "type": "whitespace"
      },
      "tokenFilters": []
    }
  ]
}

A custom analyzer's tokenizer determines how Atlas Search splits up text into discrete chunks for indexing.

Tokenizers always require a type field, and some take additional options as well.

"tokenizer": {
"type": "<tokenizer-type>",
"<additional-option>": "<value>"
}

Atlas Search supports the following tokenizer options:

Name | Description
standard | Tokenize based on word break rules from the Unicode Text Segmentation algorithm.
keyword | Tokenize the entire input as a single token.
whitespace | Tokenize based on occurrences of whitespace between words.
nGram | Tokenize into text chunks, or "n-grams", of given sizes.
edgeGram | Tokenize input from the beginning, or "edge", of a text input into n-grams of given sizes.
regexCaptureGroup | Match a regular expression pattern to extract tokens.
regexSplit | Split tokens with a regular-expression based delimiter.

The standard tokenizer has the following attributes:

Name | Type | Description | Required? | Default
type | string | The type of this tokenizer. Must be standard. | yes |
maxTokenLength | integer | Maximum length for a single token. Tokens greater than this length are split at maxTokenLength into multiple tokens. | no | 255

Example

The following example index definition uses a custom analyzer named standardShingler. It uses the standard tokenizer and the shingle token filter.

{
  "analyzer": "standardShingler",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "standardShingler",
      "charFilters": [],
      "tokenizer": {
        "type": "standard",
        "maxTokenLength": 10
      },
      "tokenFilters": [
        {
          "type": "shingle",
          "minShingleSize": 2,
          "maxShingleSize": 3
        }
      ]
    }
  ]
}

The keyword tokenizer has the following attribute:

Name | Type | Description | Required? | Default
type | string | The type of this tokenizer. Must be keyword. | yes |

Example

The following example index definition uses a custom analyzer named keywordTokenizingIndex. It uses the keyword tokenizer and a regular expression token filter that redacts email addresses.

{
  "analyzer": "keywordTokenizingIndex",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "keywordTokenizingIndex",
      "charFilters": [],
      "tokenizer": {
        "type": "keyword"
      },
      "tokenFilters": [
        {
          "type": "regex",
          "pattern": "^([a-z0-9_\\.-]+)@([\\da-z\\.-]+)\\.([a-z\\.]{2,5})$",
          "replacement": "redacted",
          "matches": "all"
        }
      ]
    }
  ]
}

The whitespace tokenizer has the following attributes:

Name | Type | Description | Required? | Default
type | string | The type of this tokenizer. Must be whitespace. | yes |
maxTokenLength | integer | Maximum length for a single token. Tokens greater than this length are split at maxTokenLength into multiple tokens. | no | 255

Example

The following example index definition uses a custom analyzer named whitespaceLowerer. It uses the whitespace tokenizer and a token filter that lowercases all tokens.

{
  "analyzer": "whitespaceLowerer",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "whitespaceLowerer",
      "charFilters": [],
      "tokenizer": {
        "type": "whitespace"
      },
      "tokenFilters": [
        {
          "type": "lowercase"
        }
      ]
    }
  ]
}
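
As a hypothetical usage example, because both the indexed text and the query string pass through whitespaceLowerer, matching is effectively case-insensitive. The collection and field names below are placeholders:

// A document whose description field contains "MongoDB Atlas" is indexed with the
// lowercase token "mongodb", so this mixed-case query still matches.
db.products.aggregate([
  {
    $search: {
      text: {
        query: "MongoDB",
        path: "description"
      }
    }
  }
])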

The nGram tokenizer has the following attributes:

Name | Type | Description | Required? | Default
type | string | The type of this tokenizer. Must be nGram. | yes |
minGram | integer | Number of characters to include in the shortest token created. | yes |
maxGram | integer | Number of characters to include in the longest token created. | yes |

Example

The following example index definition uses a custom analyzer named ngramShingler. It uses the nGram tokenizer to create tokens between 2 and 5 characters long and the shingle token filter.

{
  "analyzer": "ngramShingler",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "ngramShingler",
      "charFilters": [],
      "tokenizer": {
        "type": "nGram",
        "minGram": 2,
        "maxGram": 5
      },
      "tokenFilters": [
        {
          "type": "shingle",
          "minShingleSize": 2,
          "maxShingleSize": 3
        }
      ]
    }
  ]
}
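
For intuition about what the nGram tokenizer emits, this plain JavaScript sketch (not Lucene's implementation) lists every 2- to 5-character n-gram of a short string:

// Generate all n-grams between minGram and maxGram characters long.
function nGrams(text, minGram, maxGram) {
  const grams = [];
  for (let size = minGram; size <= maxGram; size++) {
    for (let i = 0; i + size <= text.length; i++) {
      grams.push(text.slice(i, i + size));
    }
  }
  return grams;
}

console.log(nGrams("atlas", 2, 5));
// [ 'at', 'tl', 'la', 'as', 'atl', 'tla', 'las', 'atla', 'tlas', 'atlas' ]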

The edgeGram tokenizer has the following attributes:

Name | Type | Description | Required? | Default
type | string | The type of this tokenizer. Must be edgeGram. | yes |
minGram | integer | Number of characters to include in the shortest token created. | yes |
maxGram | integer | Number of characters to include in the longest token created. | yes |

Example

The following example index definition uses a custom analyzer named edgegramShingler. It uses the edgeGram tokenizer to create tokens between 2 and 5 characters long, starting from the first character of the text input, and the shingle token filter.

{
  "analyzer": "edgegramShingler",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "edgegramShingler",
      "charFilters": [],
      "tokenizer": {
        "type": "edgeGram",
        "minGram": 2,
        "maxGram": 5
      },
      "tokenFilters": [
        {
          "type": "shingle",
          "minShingleSize": 2,
          "maxShingleSize": 3
        }
      ]
    }
  ]
}
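
For comparison with the nGram tokenizer, edge n-grams are always anchored at the start of the input: for the single-word input atlas, grams between 2 and 5 characters would be at, atl, atla, and atlas.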

The regexCaptureGroup tokenizer has the following attributes:

Name | Type | Description | Required? | Default
type | string | The type of this tokenizer. Must be regexCaptureGroup. | yes |
pattern | string | A regular expression to match against. | yes |
group | integer | Index of the character group within the matching expression to extract into tokens. Use 0 to extract all character groups. | yes |

Example

The following example index definition uses a custom analyzer named phoneNumberExtractor. It uses the regexCaptureGroup tokenizer to create a single token from the first US-formatted phone number present in the text input.

{
  "analyzer": "phoneNumberExtractor",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "phoneNumberExtractor",
      "charFilters": [],
      "tokenizer": {
        "type": "regexCaptureGroup",
        "pattern": "^\\b\\d{3}[-.]?\\d{3}[-.]?\\d{4}\\b$",
        "group": 0
      },
      "tokenFilters": []
    }
  ]
}
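
As a hypothetical usage example, a query string that is itself a US-formatted phone number is reduced to a single token by the same analyzer, so it can match a field that stores exactly that number. The collection and field names below are placeholders:

// Hypothetical query; "contacts" and "phone" are placeholder names.
db.contacts.aggregate([
  {
    $search: {
      text: {
        query: "555-123-4567",
        path: "phone"
      }
    }
  }
])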

The regexSplit tokenizer has the following attributes:

Name | Type | Description | Required? | Default
type | string | The type of this tokenizer. Must be regexSplit. | yes |
pattern | string | A regular expression to match against. | yes |

Example

The following example index definition uses a custom analyzer named dashSplitter. It uses the regexSplit tokenizer to create tokens from hyphen-delimited input text.

{
  "analyzer": "dashSplitter",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "dashSplitter",
      "charFilters": [],
      "tokenizer": {
        "type": "regexSplit",
        "pattern": "[-]+"
      },
      "tokenFilters": []
    }
  ]
}
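
For example, the input text 2023-01-15 would produce the tokens 2023, 01, and 15.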

Token filters always require a type field, and some take additional options as well.

"tokenFilters": [
{
"type": "<token-filter-type>",
"<additional-option>": <value>
}
]

Atlas Search supports the following token filters:

Name | Description
lowercase | Normalizes token text to lowercase.
length | Removes tokens that are too short or too long.
icuFolding | Applies character folding from Unicode Technical Report #30.
icuNormalizer | Normalizes tokens using a standard Unicode Normalization Mode.
nGram | Tokenizes input into n-grams of configured sizes.
edgeGram | Tokenizes input into edge n-grams of configured sizes.
shingle | Constructs shingles (token n-grams) from a series of tokens.
regex | Applies a regular expression to each token, replacing matches with a specified string.
snowballStemming | Stems tokens using a Snowball-generated stemmer.

The lowercase token filter has the following attribute:

Name | Type | Description | Required? | Default
type | string | The type of this token filter. Must be lowercase. | yes |

Example

The following example index definition uses a custom analyzer named lowercaser. It uses the standard tokenizer with the lowercase token filter to lowercase all tokens.

{
  "analyzer": "lowercaser",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "lowercaser",
      "charFilters": [],
      "tokenizer": {
        "type": "standard"
      },
      "tokenFilters": [
        {
          "type": "lowercase"
        }
      ]
    }
  ]
}

The length token filter has the following attributes:

Name | Type | Description | Required? | Default
type | string | The type of this token filter. Must be length. | yes |
min | integer | The minimum length of a token. Must be less than or equal to max. | no | 0
max | integer | The maximum length of a token. Must be greater than or equal to min. | no | 255

Example

The following example index definition uses a custom analyzer named longOnly. It uses the length token filter to index only tokens that are at least 20 UTF-16 code units long after tokenizing with the standard tokenizer.

{
  "analyzer": "longOnly",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "longOnly",
      "charFilters": [],
      "tokenizer": {
        "type": "standard"
      },
      "tokenFilters": [
        {
          "type": "length",
          "min": 20
        }
      ]
    }
  ]
}
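
For example, after standard tokenization, a 20-character token such as internationalization is kept, while shorter tokens like department are dropped from the index.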

The icuFolding token filter has the following attribute:

Name | Type | Description | Required? | Default
type | string | The type of this token filter. Must be icuFolding. | yes |

Example

The following example index definition uses a custom analyzer named diacriticFolder. It uses the keyword tokenizer with the icuFolding token filter to apply foldings from UTR#30 Character Foldings. Foldings include accent removal, case folding, canonical duplicates folding, and many others detailed in the report.

{
  "analyzer": "diacriticFolder",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "diacriticFolder",
      "charFilters": [],
      "tokenizer": {
        "type": "keyword"
      },
      "tokenFilters": [
        {
          "type": "icuFolding"
        }
      ]
    }
  ]
}

The icuNormalizer token filter has the following attributes:

Name | Type | Description | Required? | Default
type | string | The type of this token filter. Must be icuNormalizer. | yes |
normalizationForm | string | Normalization form to apply. Accepted values are: nfd (Canonical Decomposition), nfc (Canonical Decomposition, followed by Canonical Composition), nfkd (Compatibility Decomposition), and nfkc (Compatibility Decomposition, followed by Canonical Composition). For more information about the supported normalization forms, see Section 1.2: Normalization Forms, UTR#15. | no | nfc

Example

The following example index definition uses a custom analyzer named normalizer. It uses the whitespace tokenizer, then normalizes tokens by Canonical Decomposition, followed by Canonical Composition.

{
  "analyzer": "normalizer",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "normalizer",
      "charFilters": [],
      "tokenizer": {
        "type": "whitespace"
      },
      "tokenFilters": [
        {
          "type": "icuNormalizer",
          "normalizationForm": "nfc"
        }
      ]
    }
  ]
}

The nGram token filter has the following attributes:

Name | Type | Description | Required? | Default
type | string | The type of this token filter. Must be nGram. | yes |
minGram | integer | The minimum length of generated n-grams. Must be less than or equal to maxGram. | yes |
maxGram | integer | The maximum length of generated n-grams. Must be greater than or equal to minGram. | yes |
termNotInBounds | string | Accepted values are: include or omit. If include is specified, tokens shorter than minGram or longer than maxGram are indexed as-is. If omit is specified, those tokens are not indexed. | no | omit

Example

The following example index definition uses a custom analyzer named persianAutocomplete. It functions as an autocomplete analyzer for Persian and other languages that use the zero-width non-joiner character. It performs the following operations:

  • Character filtering with the persian character filter.
  • Tokenization with the whitespace tokenizer.
  • Token filtering with the icuNormalizer, shingle, and nGram token filters.

{
  "analyzer": "persianAutocomplete",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "persianAutocomplete",
      "charFilters": [
        {
          "type": "persian"
        }
      ],
      "tokenizer": {
        "type": "whitespace"
      },
      "tokenFilters": [
        {
          "type": "icuNormalizer",
          "normalizationForm": "nfc"
        },
        {
          "type": "shingle",
          "minShingleSize": 2,
          "maxShingleSize": 3
        },
        {
          "type": "nGram",
          "minGram": 1,
          "maxGram": 10
        }
      ]
    }
  ]
}

The edgeGram token filter has the following attributes:

Name | Type | Description | Required? | Default
type | string | The type of this token filter. Must be edgeGram. | yes |
minGram | integer | The minimum length of generated n-grams. Must be less than or equal to maxGram. | yes |
maxGram | integer | The maximum length of generated n-grams. Must be greater than or equal to minGram. | yes |
termNotInBounds | string | Accepted values are: include or omit. If include is specified, tokens shorter than minGram or longer than maxGram are indexed as-is. If omit is specified, those tokens are not indexed. | no | omit

Example

The following example index definition uses a custom analyzer named englishAutocomplete. It performs the following operations:

  • Tokenizes with the standard tokenizer.
  • Applies the following token filters:

    • icuFolding
    • shingle
    • edgeGram

{
  "analyzer": "englishAutocomplete",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "englishAutocomplete",
      "charFilters": [],
      "tokenizer": {
        "type": "standard"
      },
      "tokenFilters": [
        {
          "type": "icuFolding"
        },
        {
          "type": "shingle",
          "minShingleSize": 2,
          "maxShingleSize": 3
        },
        {
          "type": "edgeGram",
          "minGram": 1,
          "maxGram": 10
        }
      ]
    }
  ]
}
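
As a hypothetical usage example, the edge n-grams let short prefixes match full words. The collection and field names below are placeholders; because the query string is analyzed with the same analyzer, the prefix mee produces tokens that were also indexed for the word meeting:

// Hypothetical query; "posts" and "title" are placeholder names.
// A document whose title contains "meeting" is indexed with the edge n-grams
// "m", "me", "mee", and so on, so this short prefix still matches.
db.posts.aggregate([
  {
    $search: {
      text: {
        query: "mee",
        path: "title"
      }
    }
  }
])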

The shingle token filter has the following attributes:

Name | Type | Description | Required? | Default
type | string | The type of this token filter. Must be shingle. | yes |
minShingleSize | integer | Minimum number of tokens per shingle. Must be less than or equal to maxShingleSize. | yes |
maxShingleSize | integer | Maximum number of tokens per shingle. Must be greater than or equal to minShingleSize. | yes |

Example

The following example index definition uses a custom analyzer named shingler. It creates shingles of two and three token combinations after tokenizing with the standard tokenizer. minShingleSize is set to 2, so it does not index input when only one token is created from the tokenizer.

{
  "analyzer": "shingler",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "shingler",
      "charFilters": [],
      "tokenizer": {
        "type": "standard"
      },
      "tokenFilters": [
        {
          "type": "shingle",
          "minShingleSize": 2,
          "maxShingleSize": 3
        }
      ]
    }
  ]
}
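
For intuition about what shingles look like, this plain JavaScript sketch (not Lucene's shingle filter) builds every 2- and 3-token shingle from a list of tokens produced by a tokenizer:

// Build shingles (token n-grams) between minShingleSize and maxShingleSize tokens long.
function shingles(tokens, minShingleSize, maxShingleSize) {
  const out = [];
  for (let size = minShingleSize; size <= maxShingleSize; size++) {
    for (let i = 0; i + size <= tokens.length; i++) {
      out.push(tokens.slice(i, i + size).join(" "));
    }
  }
  return out;
}

console.log(shingles(["the", "sales", "department", "spoke", "first"], 2, 3));
// [ 'the sales', 'sales department', 'department spoke', 'spoke first',
//   'the sales department', 'sales department spoke', 'department spoke first' ]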

The regex token filter has the following attributes:

Name | Type | Description | Required? | Default
type | string | The type of this token filter. Must be regex. | yes |
pattern | string | Regular expression pattern to apply to each token. | yes |
replacement | string | Replacement string to substitute wherever a matching pattern occurs. | yes |
matches | string | Acceptable values are: all or first. If matches is set to all, replace all matching patterns. Otherwise, replace only the first matching pattern. | yes |

Example

The following example index definition uses a custom analyzer named emailRedact. It uses the keyword tokenizer. It finds strings that look like email addresses and replaces them with the word redacted.

{
  "analyzer": "emailRedact",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "emailRedact",
      "charFilters": [],
      "tokenizer": {
        "type": "keyword"
      },
      "tokenFilters": [
        {
          "type": "regex",
          "pattern": "^([a-z0-9_\\.-]+)@([\\da-z\\.-]+)\\.([a-z\\.]{2,5})$",
          "replacement": "redacted",
          "matches": "all"
        }
      ]
    }
  ]
}

The snowballStemming token filter has the following attributes:

Name | Type | Description | Required? | Default
type | string | The type of this token filter. Must be snowballStemming. | yes |
stemmerName | string | The following values are valid: arabic, armenian, basque, catalan, danish, dutch, english, finnish, french, german, german2 (alternative German language stemmer; handles the umlaut by expanding ü to ue in most contexts), hungarian, irish, italian, kp (Kraaij-Pohlmann stemmer, an alternative stemmer for Dutch), lithuanian, lovins (the first-ever published "Lovins JB" stemming algorithm), norwegian, porter (the original Porter English stemming algorithm), portuguese, romanian, russian, spanish, swedish, turkish. | yes |

Example

The following example index definition uses a custom analyzer named frenchStemmer. It uses the lowercase token filter and the standard tokenizer, followed by the french variant of the snowballStemming token filter.

{
  "analyzer": "frenchStemmer",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "frenchStemmer",
      "charFilters": [],
      "tokenizer": {
        "type": "standard"
      },
      "tokenFilters": [
        {
          "type": "lowercase"
        },
        {
          "type": "snowballStemming",
          "stemmerName": "french"
        }
      ]
    }
  ]
}

A collection named minutes contains the following documents:

{ "_id" : 1, "text" : "<head> This page deals with department meetings. </head>" }
{ "_id" : 2, "text" : "The head of the sales department spoke first." }
{ "_id" : 3, "text" : "<body>We'll head out to the conference room by noon.</body>" }

The index definition for the minutes collection uses a custom analyzer with the htmlStrip character filter to strip out HTML tags when searching for text specified in the query field of a search.

{
  "analyzer": "htmlStrippingAnalyzer",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "htmlStrippingAnalyzer",
      "charFilters": [
        {
          "type": "htmlStrip",
          "ignoredTags": ["a"]
        }
      ],
      "tokenizer": {
        "type": "standard"
      },
      "tokenFilters": []
    }
  ]
}

The following search operation looks for occurrences of the string head in the text field of the minutes collection.

db.minutes.aggregate([
  {
    $search: {
      text: {
        query: "head",
        path: "text"
      }
    }
  }
])

The query returns the following results:

{ "_id" : 2, "text" : "The head of the sales department spoke first." }
{ "_id" : 3, "text" : "<body>We'll head out to the conference room by noon.</body>" }

The document with _id: 1 is not returned because the string head appears only inside the HTML tag <head>, which the analyzer strips out. The document with _id: 3 contains HTML tags, but the string head appears outside of them, so the document is a match.
