
Custom Analyzers

On this page

  • Overview
  • Usage
  • Character Filters
  • Tokenizers
  • Token Filters

Overview

An Atlas Search analyzer prepares a set of documents to be indexed by performing a series of operations to transform, filter, and group sequences of characters. You can define a custom analyzer to suit your specific indexing needs.

Text analysis is a three-step process:

  1. Character Filters

    You can specify one or more character filters to use in your custom analyzer. Character filters examine text one character at a time and perform filtering operations.

  2. Tokenizer

    An analyzer uses a tokenizer to split chunks of text into groups, or tokens, for indexing purposes. For example, the whitespace tokenizer splits text fields into individual words based on where whitespace occurs.

  3. Token Filters

    After the tokenization step, the resulting tokens can pass through one or more token filters. A token filter performs operations such as:

    • Stemming, which reduces related words, such as "talking", "talked", and "talks" to their root word "talk".
    • Redaction, the removal of sensitive information from public documents.

Usage

A custom analyzer has the following syntax:

"analyzers": [
{
"name": "<name>",
"charFilters": [ <list-of-character-filters> ],
"tokenizer": {
"type": "<tokenizer-type"
},
"tokenFilters": [ <list-of-token-filters> ]
}
]

A custom analyzer has the following attributes:

name (string, required): Name of the custom analyzer. Names must be unique within an index, and may not start with any of the following strings:

  • lucene.
  • builtin.
  • mongodb.

charFilters (list of objects, optional): Array containing zero or more character filters.
tokenizer (object, required): Tokenizer to use.
tokenFilters (list of objects, optional): Array containing zero or more token filters.

To use a custom analyzer when indexing a collection, include it in the index definition. This page contains sample index definitions and query examples for character filters, tokenizers, and token filters. The query examples use a sample minutes collection with the following documents:

{
"_id": 1,
"page_updated_by": {
"last_name": "AUERBACH",
"first_name": "Siân",
"email": "auerbach@example.com",
"phone": "123-456-7890"
},
"text" : "<head> This page deals with department meetings. </head>"
}
{
"_id": 2,
"page_updated_by": {
"last_name": "OHRBACH",
"first_name": "Noël",
"email": "ohrback@example.com",
"phone": "123-456-0987"
},
"text" : "The head of the sales department spoke first."
}
{
"_id": 3,
"page_updated_by": {
"last_name": "LEWINSKY",
"first_name": "Brièle",
"email": "lewinsky@example.com",
"phone": "123-456-9870"
},
"text" : "<body>We'll head out to the conference room by noon.</body>"
}
{
"_id": 4,
"page_updated_by": {
"last_name": "LEVINSKI",
"first_name": "François",
"email": "levinski@example.com",
"phone": "123-456-8907"
},
"text" : "<body>The page has been updated with the items on the agenda.</body>"
}
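
For reference, a query against this collection has the following general shape. This is a minimal sketch: it assumes the search index is created with the default name default and that a custom analyzer is applied through the index definition, as in the examples that follow.

db.minutes.aggregate([
  {
    "$search": {
      "index": "default",        // assumed index name
      "text": {
        "query": "department",   // search term, analyzed with the index's analyzer
        "path": "text"           // field to search
      }
    }
  }
])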

Character Filters

Character filters always require a type field, and some take additional options as well.

"charFilters": [
{
"type": "<filter-type>",
"<additional-option>": <value>
}
]

Atlas Search supports four types of character filters:

htmlStrip: Strips out HTML constructs.
icuNormalize: Normalizes text with the ICU Normalizer. Based on Lucene's ICUNormalizer2CharFilter.
mapping: Applies user-specified normalization mappings to characters. Based on Lucene's MappingCharFilter.
persian: Replaces instances of zero-width non-joiner with ordinary space. Based on Lucene's PersianCharFilter.

The htmlStrip character filter has the following attributes:

type (string, required): The type of this character filter. Must be htmlStrip.
ignoredTags (array of strings, optional): A list of HTML tags to exclude from filtering.
Example
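
The following index definition is a minimal sketch of an analyzer that uses the htmlStrip character filter; the analyzer name htmlStrippingAnalyzer and the choice to ignore <a> tags are illustrative. It strips all other HTML constructs from fields such as the text field in the sample minutes collection, then tokenizes with the standard tokenizer.

{
  "analyzer": "htmlStrippingAnalyzer",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "htmlStrippingAnalyzer",
      "charFilters": [
        {
          "type": "htmlStrip",
          "ignoredTags": ["a"]
        }
      ],
      "tokenizer": {
        "type": "standard"
      },
      "tokenFilters": []
    }
  ]
}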

The icuNormalize character filter has the following attribute:

type (string, required): The type of this character filter. Must be icuNormalize.
Example

The following example index definition uses a custom analyzer named normalizingAnalyzer. It uses the icuNormalize character filter, the whitespace tokenizer and no token filters.

{
"analyzer": "normalizingAnalyzer",
"mappings": {
"dynamic": true
},
"analyzers": [
{
"name": "normalizingAnalyzer",
"charFilters": [
{
"type": "icuNormalize"
}
],
"tokenizer": {
"type": "whitespace"
},
"tokenFilters": []
}
]
}

The mapping character filter has the following attributes:

type (string, required): The type of this character filter. Must be mapping.
mappings (object, required): An object containing a comma-separated list of mappings. A mapping indicates that one character or group of characters should be substituted for another, in the format <original> : <replacement>.
Example

The following example index definition uses a custom analyzer named mappingAnalyzer. It uses the mapping character filter to replace instances of \\ with /. It uses the keyword tokenizer and no token filters.

{
"analyzer": "mappingAnalyzer",
"mappings": {
"dynamic": true
},
"analyzers": [
{
"name": "mappingAnalyzer",
"charFilters": [
{
"type": "mapping",
"mappings": {
"\\": "/"
}
}
],
"tokenizer": {
"type": "keyword"
},
"tokenFilters": []
}
]
}
Tip
See also:

The shingle token filter for a sample index definition and query.

The persian character filter has the following attribute:

type (string, required): The type of this character filter. Must be persian.
Example

The following example index definition uses a custom analyzer named persianCharacterIndex. It uses the persian character filter, the whitespace tokenizer and no token filters.

{
"analyzer": "persianCharacterIndex",
"mappings": {
"dynamic": true
},
"analyzers": [
{
"name": "persianCharacterIndex",
"charFilters": [
{
"type": "persian"
}
],
"tokenizer": {
"type": "whitespace"
},
"tokenFilters": []
}
]
}

Tokenizers

A custom analyzer's tokenizer determines how Atlas Search splits up text into discrete chunks for indexing.

Tokenizers always require a type field, and some take additional options as well.

"tokenizer": {
"type": "<tokenizer-type>",
"<additional-option>": "<value>"
}

Atlas Search supports the following tokenizer options:

standard: Tokenize based on word break rules from the Unicode Text Segmentation algorithm.
keyword: Tokenize the entire input as a single token.
whitespace: Tokenize based on occurrences of whitespace between words.
nGram: Tokenize into text chunks, or "n-grams", of given sizes. You can't use the nGram tokenizer in synonym mapping definitions.
edgeGram: Tokenize input from the beginning, or "edge", of a text input into n-grams of given sizes. You can't use the edgeGram tokenizer in synonym mapping definitions.
regexCaptureGroup: Match a regular expression pattern to extract tokens.
regexSplit: Split tokens with a regular-expression based delimiter.
uaxUrlEmail: Tokenize URLs and email addresses. Although the uaxUrlEmail tokenizer tokenizes based on word break rules from the Unicode Text Segmentation algorithm, we recommend using it only when the indexed field value includes URLs and email addresses. For fields that do not include URLs or email addresses, use the standard tokenizer to create tokens based on word break rules.

The standard tokenizer has the following attributes:

type (string, required): The type of this tokenizer. Must be standard.
maxTokenLength (integer, optional, default 255): Maximum length for a single token. Tokens greater than this length are split at maxTokenLength into multiple tokens.
Example

The following example index definition uses a custom analyzer named standardShingler. It uses the standard tokenizer and the shingle token filter.

{
"analyzer": "standardShingler",
"mappings": {
"dynamic": true
},
"analyzers": [
{
"name": "standardShingler",
"charFilters": [],
"tokenizer": {
"type": "standard",
"maxTokenLength": 10,
},
"tokenFilters": [
{
"type": "shingle",
"minShingleSize": 2,
"maxShingleSize": 3
}
]
}
]
}
Tip
See also:

The regex token filter for a sample index definition and query.

The keyword tokenizer has the following attribute:

type (string, required): The type of this tokenizer. Must be keyword.
Example

The following example index definition uses a custom analyzer named keywordTokenizingIndex. It uses the keyword tokenizer and a regular expression token filter that redacts email addresses.

{
"analyzer": "keywordTokenizingIndex",
"mappings": {
"dynamic": true
},
"analyzers": [
{
"name": "keywordTokenizingIndex",
"charFilters": [],
"tokenizer": {
"type": "keyword"
},
"tokenFilters": [
{
"type": "regex",
"pattern": "^([a-z0-9_\\.-]+)@([\\da-z\\.-]+)\\.([a-z\\.]{2,5})$",
"replacement": "redacted",
"matches": "all"
}
]
}
]
}
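
A query against a field indexed with this analyzer might have the following shape. This is a sketch only: the index name default is an assumption, and it does not imply which documents match after the redaction filter rewrites the indexed tokens.

db.minutes.aggregate([
  {
    "$search": {
      "index": "default",                  // assumed index name
      "text": {
        "query": "auerbach@example.com",   // the whole field value is a single token under the keyword tokenizer
        "path": "page_updated_by.email"
      }
    }
  }
])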

The whitespace tokenizer has the following attributes:

type (string, required): The type of this tokenizer. Must be whitespace.
maxTokenLength (integer, optional, default 255): Maximum length for a single token. Tokens greater than this length are split at maxTokenLength into multiple tokens.
Example

The following example index definition uses a custom analyzer named whitespaceLowerer. It uses the whitespace tokenizer and a token filter that lowercases all tokens.

{
"analyzer": "whitespaceLowerer",
"mappings": {
"dynamic": true
},
"analyzers": [
{
"name": "whitespaceLowerer",
"charFilters": [],
"tokenizer": {
"type": "whitespace"
},
"tokenFilters": [
{
"type": "lowercase"
}
]
}
]
}
Tip
See also:

The shingle token filter for a sample index definition and query.

The nGram tokenizer has the following attributes:

type (string, required): The type of this tokenizer. Must be nGram.
minGram (integer, required): Number of characters to include in the shortest token created.
maxGram (integer, required): Number of characters to include in the longest token created.
Example

The following example index definition uses a custom analyzer named ngramShingler. It uses the nGram tokenizer to create tokens between 2 and 5 characters long and the shingle token filter.

{
"analyzer": "ngramShingler",
"mappings": {
"dynamic": true
},
"analyzers": [
{
"name": "ngramShingler",
"charFilters": [],
"tokenizer": {
"type": "nGram",
"minGram": 2,
"maxGram": 5
},
"tokenFilters": [
{
"type": "shingle",
"minShingleSize": 2,
"maxShingleSize": 3
}
]
}
]
}

The edgeGram tokenizer has the following attributes:

type (string, required): The type of this tokenizer. Must be edgeGram.
minGram (integer, required): Number of characters to include in the shortest token created.
maxGram (integer, required): Number of characters to include in the longest token created.
Example

The following example index definition uses a custom analyzer named edgegramShingler. It uses the edgeGram tokenizer to create tokens between 2 and 5 characters long starting from the first character of text input and the shingle token filter.

{
"analyzer": "edegramShingler",
"mappings": {
"dynamic": true
},
"analyzers": [
{
"name": "edgegramShingler",
"charFilters": [],
"tokenizer": {
"type": "edgeGram",
"minGram": 2,
"maxGram": 5
},
"tokenFilters": [
{
"type": "shingle",
"minShingleSize": 2,
"maxShingleSize": 3
}
]
}
]
}

The regexCaptureGroup tokenizer has the following attributes:

type (string, required): The type of this tokenizer. Must be regexCaptureGroup.
pattern (string, required): A regular expression to match against.
group (integer, required): Index of the character group within the matching expression to extract into tokens. Use 0 to extract all character groups.
Example

The following example index definition uses a custom analyzer named phoneNumberExtractor. It uses the regexCaptureGroup tokenizer to create a single token from the first US-formatted phone number present in the text input.

{
"analyzer": "phoneNumberExtractor",
"mappings": {
"dynamic": true
},
"analyzers": [
{
"name": "phoneNumberExtractor",
"charFilters": [],
"tokenizer": {
"type": "regexCaptureGroup",
"pattern": "^\\b\\d{3}[-.]?\\d{3}[-.]?\\d{4}\\b$",
"group": 0
},
"tokenFilters": []
}
]
}

The regexSplit tokenizer has the following attributes:

type (string, required): The type of this tokenizer. Must be regexSplit.
pattern (string, required): A regular expression to match against.
Example

The following example index definition uses a custom analyzer named dashSplitter. It uses the regexSplit tokenizer to create tokens from hyphen-delimited input text.

{
"analyzer": "dashSplitter",
"mappings": {
"dynamic": true
},
"analyzers": [
{
"name": "dashSplitter",
"charFilters": [],
"tokenizer": {
"type": "regexSplit",
"pattern": "[-]+"
},
"tokenFilters": []
}
]
}
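
A query against the phone numbers tokenized by this analyzer might look like the following sketch. The index name default is an assumption; 7890 is one of the hyphen-delimited segments that the regexSplit tokenizer produces from the sample phone numbers.

db.minutes.aggregate([
  {
    "$search": {
      "index": "default",               // assumed index name
      "text": {
        "query": "7890",                // one hyphen-delimited segment of a phone number
        "path": "page_updated_by.phone"
      }
    }
  }
])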

The uaxUrlEmail tokenizer has the following attributes:

type (string, required): The type of this tokenizer. Must be uaxUrlEmail.
maxTokenLength (integer, optional, default 255): The maximum number of characters in one token.
Example

The following example index definition uses a custom analyzer named emailUrlExtractor. It uses the uaxUrlEmail tokenizer to create tokens up to 200 characters long each for all text, including email addresses and URLs, in the input. It converts all text to lowercase using the lowercase token filter.

{
"analyzer": "emailUrlExtractor",
"mappings": {
"dynamic": true
},
"analyzers": [
{
"name": "emailUrlExtractor",
"charFilters": [],
"tokenizer": {
"type": "uaxUrlEmail",
"maxTokenLength": "200",
},
"tokenFilters": [
"type": "lowercase"
]
}
]
}

Token Filters

Token filters always require a type field, and some take additional options as well.

"tokenFilters": [
{
"type": "<token-filter-type>",
"<additional-option>": <value>
}
]

Atlas Search supports the following token filters:

daitchMokotoffSoundex: Creates tokens for words that sound the same based on the Daitch-Mokotoff Soundex phonetic algorithm. This filter can generate multiple encodings for each input, where each encoded token is a 6-digit number.
lowercase: Normalizes token text to lowercase.
length: Removes tokens that are too short or too long.
icuFolding: Applies character folding from Unicode Technical Report #30.
icuNormalizer: Normalizes tokens using a standard Unicode Normalization Mode.
nGram: Tokenizes input into n-grams of configured sizes. You can't use the nGram token filter in synonym mapping definitions.
edgeGram: Tokenizes input into edge n-grams of configured sizes. You can't use the edgeGram token filter in synonym mapping definitions.
shingle: Constructs shingles (token n-grams) from a series of tokens. You can't use the shingle token filter in synonym mapping definitions.
regex: Applies a regular expression to each token, replacing matches with a specified string.
snowballStemming: Stems tokens using a Snowball-generated stemmer.
stopword: Removes tokens that correspond to the specified stop words. This token filter doesn't analyze the specified stop words.
trim: Trims leading and trailing whitespace from tokens.

The daitchMokotoffSoundex token filter has the following attributes:

type (string, required): The type of this token filter. Must be daitchMokotoffSoundex.
originalTokens (string, optional, default include): Specifies whether to include or omit the original tokens in the output of the token filter. Value can be one of the following:

  • include - include the original tokens with the encoded tokens in the output of the token filter. We recommend this value if you want to query on both the original tokens and the encoded forms.
  • omit - omit the original tokens and include only the encoded tokens in the output of the token filter. Use this value if you want to query only on the encoded forms of the original tokens.
Example
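
The following index definition is a minimal sketch of an analyzer that uses the daitchMokotoffSoundex token filter; the analyzer name dmsAnalyzer is illustrative. It tokenizes with the standard tokenizer and keeps the original tokens alongside their phonetic encodings, so a field such as page_updated_by.last_name can be searched by spelling as well as by similar-sounding forms.

{
  "analyzer": "dmsAnalyzer",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "dmsAnalyzer",
      "charFilters": [],
      "tokenizer": {
        "type": "standard"
      },
      "tokenFilters": [
        {
          "type": "daitchMokotoffSoundex",
          "originalTokens": "include"
        }
      ]
    }
  ]
}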

The lowercase token filter has the following attribute:

type (string, required): The type of this token filter. Must be lowercase.
Example

The following example index definition uses a custom analyzer named lowercaser. It uses the standard tokenizer with the lowercase token filter to lowercase all tokens.

{
"analyzer": "lowercaser",
"mappings": {
"dynamic": true
},
"analyzers": [
{
"name": "lowercaser",
"charFilters": [],
"tokenizer": {
"type": "standard"
},
"tokenFilters": [
{
"type": "lowercase"
}
]
}
]
}
Tip
See also:

The regex token filter for a sample index definition and query.

The length token filter has the following attributes:

type (string, required): The type of this token filter. Must be length.
min (integer, optional, default 0): The minimum length of a token. Must be less than or equal to max.
max (integer, optional, default 255): The maximum length of a token. Must be greater than or equal to min.
Example

The following example index definition uses a custom analyzer named longOnly. It uses the length token filter to index only tokens that are at least 20 UTF-16 code units long after tokenizing with the standard tokenizer.

{
"analyzer": "longOnly",
"mappings": {
"dynamic": true
},
"analyzers": [
{
"name": "longOnly",
"charFilters": [],
"tokenizer": {
"type": "standard"
},
"tokenFilters": [
{
"type": "length",
"min": 20
}
]
}
]
}

The icuFolding token filter has the following attribute:

type (string, required): The type of this token filter. Must be icuFolding.
Example

The following example index definition uses a custom analyzer named diacriticFolder. It uses the keyword tokenizer with the icuFolding token filter to apply foldings from UTR#30 Character Foldings. Foldings include accent removal, case folding, canonical duplicates folding, and many others detailed in the report.

{
"analyzer": "diacriticFolder",
"mappings": {
"dynamic": true
},
"analyzers": [
{
"name": "diacriticFolder",
"charFilters": [],
"tokenizer": {
"type": "keyword"
},
"tokenFilters": [
{
"type": "icuFolding"
}
]
}
]
}

The icuNormalizer token filter has the following attributes:

type (string, required): The type of this token filter. Must be icuNormalizer.
normalizationForm (string, optional, default nfc): Normalization form to apply. Accepted values are:

  • nfd (Canonical Decomposition)
  • nfc (Canonical Decomposition, followed by Canonical Composition)
  • nfkd (Compatibility Decomposition)
  • nfkc (Compatibility Decomposition, followed by Canonical Composition)

For more information about the supported normalization forms, see Section 1.2: Normalization Forms, UTR#15.
Example

The following example index definition uses a custom analyzer named normalizer. It uses the whitespace tokenizer, then normalizes tokens by Canonical Decomposition, followed by Canonical Composition.

{
"analyzer": "normalizer",
"mappings": {
"dynamic": true
},
"analyzers": [
{
"name": "normalizer",
"charFilters": [],
"tokenizer": {
"type": "whitespace"
},
"tokenFilters": [
{
"type": "icuNormalizer",
"normalizationForm": "nfc"
}
]
}
]
}

The nGram token filter has the following attributes:

type (string, required): The type of this token filter. Must be nGram.
minGram (integer, required): The minimum length of generated n-grams. Must be less than or equal to maxGram.
maxGram (integer, required): The maximum length of generated n-grams. Must be greater than or equal to minGram.
termNotInBounds (string, optional, default omit): Accepted values are:

  • include
  • omit

If include is specified, tokens shorter than minGram or longer than maxGram are indexed as-is. If omit is specified, those tokens are not indexed.
Example

The following example index definition uses a custom analyzer named persianAutocomplete. It functions as an autocomplete analyzer for Persian and other languages that use the zero-width non-joiner character. It performs the following operations:

  • Replaces instances of zero-width non-joiner with ordinary space using the persian character filter.
  • Tokenizes by whitespace with the whitespace tokenizer.
  • Applies the icuNormalizer, shingle, and nGram token filters.

{
"analyzer": "persianAutocomplete",
"mappings": {
"dynamic": true
},
"analyzers": [
{
"name": "persianAutocomplete",
"charFilters": [
{
"type": "persian"
}
],
"tokenizer": {
"type": "whitespace"
},
"tokenFilters": [
{
"type": "icuNormalizer",
"normalizationForm": "nfc"
},
{
"type": "shingle",
"minShingleSize": 2,
"maxShingleSize": 3
},
{
"type": "nGram",
"minGram": 1,
"maxGram": 10
}
]
}
]
}

The edgeGram token filter has the following attributes:

type (string, required): The type of this token filter. Must be edgeGram.
minGram (integer, required): The minimum length of generated n-grams. Must be less than or equal to maxGram.
maxGram (integer, required): The maximum length of generated n-grams. Must be greater than or equal to minGram.
termNotInBounds (string, optional, default omit): Accepted values are:

  • include
  • omit

If include is specified, tokens shorter than minGram or longer than maxGram are indexed as-is. If omit is specified, those tokens are not indexed.
Example

The following example index definition uses a custom analyzer named englishAutocomplete. It performs the following operations:

  • Tokenizes with the standard tokenizer.
  • Token filtering with the following filters:

    • icuFolding
    • shingle
    • edgeGram
{
"analyzer": "englishAutocomplete",
"mappings": {
"dynamic": true
},
"analyzers": [
{
"name": "englishAutocomplete",
"charFilters": [],
"tokenizer": {
"type": "standard"
},
"tokenFilters": [
{
"type": "icuFolding"
},
{
"type": "shingle",
"minShingleSize": 2,
"maxShingleSize": 3
},
{
"type": "edgeGram",
"minGram": 1,
"maxGram": 10
}
]
}
]
}
Tip
See also:

The shingle token filter for a sample index definition and query.

The shingle token filter has the following attributes:

type (string, required): The type of this token filter. Must be shingle.
minShingleSize (integer, required): Minimum number of tokens per shingle. Must be less than or equal to maxShingleSize.
maxShingleSize (integer, required): Maximum number of tokens per shingle. Must be greater than or equal to minShingleSize.
Example
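
The following index definition is a minimal sketch of an analyzer that uses the shingle token filter; the analyzer name shingler is illustrative. It lowercases whitespace-delimited tokens and then groups them into shingles of two to three tokens, which can help match short phrases such as "sales department" in the text field.

{
  "analyzer": "shingler",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "shingler",
      "charFilters": [],
      "tokenizer": {
        "type": "whitespace"
      },
      "tokenFilters": [
        {
          "type": "lowercase"
        },
        {
          "type": "shingle",
          "minShingleSize": 2,
          "maxShingleSize": 3
        }
      ]
    }
  ]
}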

The regex token filter has the following attributes:

type (string, required): The type of this token filter. Must be regex.
pattern (string, required): Regular expression pattern to apply to each token.
replacement (string, required): Replacement string to substitute wherever a matching pattern occurs.
matches (string, required): Acceptable values are:

  • all
  • first

If matches is set to all, replace all matching patterns. Otherwise, replace only the first matching pattern.
Example
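
The following index definition is a minimal sketch of an analyzer that uses the regex token filter; the analyzer name phoneRedactor and the pattern are illustrative. It tokenizes the entire field value with the keyword tokenizer and replaces any token that looks like a US-formatted phone number, such as those in the page_updated_by.phone field, with the literal string redacted.

{
  "analyzer": "phoneRedactor",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "phoneRedactor",
      "charFilters": [],
      "tokenizer": {
        "type": "keyword"
      },
      "tokenFilters": [
        {
          "type": "regex",
          "pattern": "^\\d{3}-\\d{3}-\\d{4}$",
          "replacement": "redacted",
          "matches": "all"
        }
      ]
    }
  ]
}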

The snowballStemming token filter has the following attributes:

type (string, required): The type of this token filter. Must be snowballStemming.
stemmerName (string, required): The following values are valid:

  • arabic
  • armenian
  • basque
  • catalan
  • danish
  • dutch
  • english
  • finnish
  • french
  • german
  • german2 (Alternative German language stemmer. Handles the umlaut by expanding ü to ue in most contexts.)
  • hungarian
  • irish
  • italian
  • kp (Kraaij-Pohlmann stemmer, an alternative stemmer for Dutch.)
  • lithuanian
  • lovins (The first-ever published "Lovins JB" stemming algorithm.)
  • norwegian
  • porter (The original Porter English stemming algorithm.)
  • portuguese
  • romanian
  • russian
  • spanish
  • swedish
  • turkish
Example

The following example index definition uses a custom analyzer named frenchStemmer. It uses the lowercase token filter and the standard tokenizer, followed by the french variant of the snowballStemming token filter.

{
"analyzer": "frenchStemmer",
"mappings": {
"dynamic": true
},
"analyzers": [
{
"name": "frenchStemmer",
"charFilters": [],
"tokenizer": {
"type": "standard"
},
"tokenFilters": [
{
"type": "lowercase"
},
{
"type": "snowballStemming",
"stemmerName": "french"
}
]
}
]
}

The stopword token filter has the following attributes:

type (string, required): The type of this token filter. Must be stopword.
tokens (array of strings, required): The list of stop words that correspond to the tokens to remove. Value must be one or more stop words.
ignoreCase (boolean, optional, default true): Flag that indicates whether to ignore the case of stop words when filtering the tokens to remove. The value can be one of the following:

  • true - ignore case and remove all tokens that match the specified stop words
  • false - be case-sensitive and remove only tokens that exactly match the specified case
Example

The following example index definition uses a custom analyzer named stopwordRemover. It uses the stopword token filter after the whitespace tokenizer to remove the tokens that match the defined stop words is, the, and at. The token filter is case-insensitive and will remove all tokens that match the specified stop words.

{
"analyzer": "tokenTrimmer",
"mappings": {
"dynamic": true
},
"analyzers": [
{
"name": "stopwordRemover",
"charFilters": [],
"tokenizer": {
"type": "whitespace"
},
"tokenFilters": [
{
"type": "stopword",
"tokens": ["is", "the", "at"]
}
]
}
]
}

The trim token filter has the following attribute:

type (string, required): The type of this token filter. Must be trim.
Example

The following example index definition uses a custom analyzer named tokenTrimmer. It uses the trim token filter after the keyword tokenizer to remove leading and trailing whitespace in the tokens created by the keyword tokenizer.

"analyzer": "tokenTrimmer",
"mappings": {
"dynamic": true
},
"analyzers": [
{
"name": "tokenTrimmer",
"charFilters": [],
"tokenizer": {
"type": "keyword"
},
"tokenFilters": [
{
"type": "trim"
}
]
}
]
}