
Custom Analyzers

Overview

An Atlas Search analyzer prepares a set of documents to be indexed by performing a series of operations to transform, filter, and group sequences of characters. You can define a custom analyzer to suit your specific indexing needs.

Text analysis is a three-step process:

  1. Character Filters

    You can specify one or more character filters to use in your custom analyzer. Character filters examine text one character at a time and perform filtering operations.

  2. Tokenizer

    An analyzer uses a tokenizer to split chunks of text into groups, or tokens, for indexing purposes. For example, the whitespace tokenizer splits text fields into individual words based on where whitespace occurs.

  3. Token Filters

    After the tokenization step, the resulting tokens can pass through one or more token filters. A token filter performs operations such as:

    • Stemming, which reduces related words, such as “talking”, “talked”, and “talks” to their root word “talk”.
    • Redaction, the removal of sensitive information from public documents.
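
Taken together, the three stages form a pipeline: characters flow through the character filters, the filtered text is split into tokens, and the tokens pass through the token filters before indexing. The following sketch is illustrative only and is not how Atlas Search executes analysis internally; it mimics the stages in plain JavaScript for a hypothetical input string.

// Illustration only: a toy pipeline that mimics the three analysis stages.
// Atlas Search runs these steps inside Lucene; this sketch just makes the
// data flow concrete for a hypothetical input string.
const input = "<p>Talking, talked, talks</p>";

// 1. Character filter: strip HTML tags (roughly what htmlStrip does).
const filtered = input.replace(/<[^>]+>/g, "");

// 2. Tokenizer: split on whitespace (roughly what the whitespace tokenizer does).
const tokens = filtered.split(/\s+/).filter(t => t.length > 0);

// 3. Token filter: lowercase each token (the lowercase token filter).
const finalTokens = tokens.map(t => t.toLowerCase());

console.log(finalTokens); // [ 'talking,', 'talked,', 'talks' ]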

Usage

A custom analyzer has the following syntax:

"analyzers": [
  {
    "name": "<name>",
    "charFilters": [ <list-of-character-filters> ],
    "tokenizer": {
      "type": "<tokenizer-type"
    },
    "tokenFilters": [ <list-of-token-filters> ]
  }
]

A custom analyzer has the following attributes:

Attribute Type Description Required?
name string Name of the custom analyzer. Names must be unique within an index, and may not start with the strings lucene., builtin., or mongodb. yes
charFilters list of objects Array containing zero or more character filters. no
tokenizer object Tokenizer to use. yes
tokenFilters list of objects Array containing zero or more token filters. no

To use a custom analyzer when indexing a collection, include it in the index definition. In the following example index definition, a custom analyzer named htmlStrippingAnalyzer uses a character filter to remove all HTML tags from the text except the a tag.

{
  "analyzer": "htmlStrippingAnalyzer",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "htmlStrippingAnalyzer",
      "charFilters": [
        {
          "type": "htmlStrip",
          "ignoredTags": ["a"]
        }
      ],
      "tokenizer": {
        "type": "standard"
      },
      "tokenFilters": []
    }
  ]
}
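
If you only want the custom analyzer applied to specific fields, you can reference it from a static field mapping instead of setting it as the index-wide analyzer. The following sketch is one possible shape, assuming a hypothetical string field named description; adjust the field name and mapping to your own schema.

{
  "mappings": {
    "dynamic": false,
    "fields": {
      "description": {
        "type": "string",
        "analyzer": "htmlStrippingAnalyzer"
      }
    }
  },
  "analyzers": [
    {
      "name": "htmlStrippingAnalyzer",
      "charFilters": [
        {
          "type": "htmlStrip",
          "ignoredTags": ["a"]
        }
      ],
      "tokenizer": {
        "type": "standard"
      },
      "tokenFilters": []
    }
  ]
}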

Character Filters

Character filters always require a type field, and some take additional options as well.

"charFilters": [
  {
    "type": "<filter-type>",
    "<additional-option>": <value>
  }
]

Atlas Search supports four types of character filters:

Type Description
htmlStrip Strips out HTML constructs.
icuNormalize Normalizes text with the ICU Normalizer. Based on Lucene’s ICUNormalizer2CharFilter.
mapping Applies user-specified normalization mappings to characters. Based on Lucene’s MappingCharFilter.
persian Replaces instances of zero-width non-joiner with ordinary space. Based on Lucene’s PersianCharFilter.

htmlStrip

The htmlStrip character filter has the following attributes:

Name Type Description Required? Default
type string The type of this character filter. Must be htmlStrip. yes  
ignoredTags array of strings A list of HTML tags to exclude from filtering. no  

Example

The following example index definition uses a custom analyzer named htmlStrippingAnalyzer. It uses the htmlStrip character filter to remove all HTML tags from the text except the a tag. It uses the standard tokenizer and no token filters.

{
  "analyzer": "htmlStrippingAnalyzer",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [{
    "name": "htmlStrippingAnalyzer",
    "charFilters": [{
      "type": "htmlStrip",
      "ignoredTags": ["a"]
    }],
    "tokenizer": {
      "type": "standard"
    },
    "tokenFilters": []
  }]
}

icuNormalize

The icuNormalize character filter has the following attribute:

Name Type Description Required? Default
type string The type of this character filter. Must be icuNormalize. yes  

Example

The following example index definition uses a custom analyzer named normalizingAnalyzer. It uses the icuNormalize character filter, the whitespace tokenizer and no token filters.

{
  "analyzer": "normalizingAnalyzer",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "normalizingAnalyzer",
      "charFilters": [
        {
          "type": "icuNormalize"
        }
      ],
      "tokenizer": {
        "type": "whitespace"
      },
      "tokenFilters": []
    }
  ]
}

mapping

The mapping character filter has the following attributes:

Name Type Description Required? Default
type string The type of this character filter. Must be mapping. yes  
mappings object An object containing a comma-separated list of mappings. A mapping indicates that one character or group of characters should be substituted for another, in the format <original> : <replacement>. yes  

Example

The following example index definition uses a custom analyzer named mappingAnalyzer. It uses the mapping character filter to replace instances of \\ with /. It uses the keyword tokenizer and no token filters.

{
  "analyzer": "mappingAnalyzer",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "mappingAnalyzer",
      "charFilters": [
        {
          "type": "mapping",
          "mappings": {
            "\\": "/"
          }
        }
      ],
      "tokenizer": {
        "type": "keyword"
      },
      "tokenFilters": []
    }
  ]
}
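
As a rough illustration of what this mapping does, consider a hypothetical field value containing Windows-style paths. Because the keyword tokenizer keeps the whole field as a single token, the only visible effect is the character substitution itself; the sketch below applies the same substitution in plain JavaScript.

// Illustration only: apply the mapping { "\\": "/" } to a hypothetical value.
const mappings = { "\\": "/" };
const value = "C:\\temp\\reports\\q1.txt"; // i.e. C:\temp\reports\q1.txt

// Replace each mapped character with its replacement, leave others unchanged.
const filtered = [...value].map(ch => mappings[ch] ?? ch).join("");

console.log(filtered); // C:/temp/reports/q1.txt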

persian

The persian character filter has the following attribute:

Name Type Description Required? Default
type string The type of this character filter. Must be persian. yes  

Example

The following example index definition uses a custom analyzer named persianCharacterIndex. It uses the persian character filter, the whitespace tokenizer and no token filters.

{
  "analyzer": "persianCharacterIndex",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "persianCharacterIndex",
      "charFilters": [
        {
          "type": "persian"
        }
      ],
      "tokenizer": {
        "type": "whitespace"
      },
      "tokenFilters": []
    }
  ]
}

Tokenizers

A custom analyzer’s tokenizer determines how Atlas Search splits up text into discrete chunks for indexing.

Tokenizers always require a type field, and some take additional options as well.

"tokenizer": {
  "type": "<tokenizer-type>",
  "<additional-option>": "<value>"
}

Atlas Search supports the following tokenizer options:

Name Description
standard Tokenize based on word break rules from the Unicode Text Segmentation algorithm.
keyword Tokenize the entire input as a single token.
whitespace Tokenize based on occurrences of whitespace between words.
nGram Tokenize into text chunks, or “n-grams”, of given sizes.
edgeGram Tokenize input from the beginning, or “edge”, of a text input into n-grams of given sizes.
regexCaptureGroup Match a regular expression pattern to extract tokens.
regexSplit Split tokens with a regular-expression based delimiter.

standard

The standard tokenizer has the following attributes:

Name Type Description Required? Default
type string The type of this tokenizer. Must be standard. yes  
maxTokenLength integer Maximum length for a single token. Tokens greater than this length are split at maxTokenLength into multiple tokens. no 255

Example

The following example index definition uses a custom analyzer named standardShingler. It uses the standard tokenizer and the shingle token filter.

{
  "analyzer": "standardShingler",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "standardShingler",
      "charFilters": [],
      "tokenizer": {
        "type": "standard",
        "maxTokenLength": 10,
      },
      "tokenFilters": [
        {
          "type": "shingle",
          "minSingleSize": 2,
          "maxShingleSize": 3
        }
      ]
    }
  ]
}

keyword

The keyword tokenizer has the following attribute:

Name Type Description Required? Default
type string The type of this tokenizer. Must be keyword. yes  

Example

The following example index definition uses a custom analyzer named keywordTokenizingIndex. It uses the keyword tokenizer and a regular expression token filter that redacts email addresses.

{
  "analyzer": "keywordTokenizingIndex",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "keywordTokenizingIndex",
      "charFilters": [],
      "tokenizer": {
        "type": "keyword"
      },
      "tokenFilters": [
        {
          "type": "regex",
          "pattern": "^([a-z0-9_\\.-]+)@([\\da-z\\.-]+)\\.([a-z\\.]{2,5})$",
          "replacement": "redacted",
          "matches": "all"
        }
      ]
    }
  ]
}

whitespace

The whitespace tokenizer has the following attributes:

Name Type Description Required? Default
type string The type of this tokenizer. Must be whitespace. yes  
maxTokenLength integer Maximum length for a single token. Tokens greater than this length are split at maxTokenLength into multiple tokens. no 255

Example

The following example index definition uses a custom analyzer named whitespaceLowerer. It uses the whitespace tokenizer and a token filter that lowercases all tokens.

{
  "analyzer": "whitespaceLowerer",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "whitespaceLowerer",
      "charFilters": [],
      "tokenizer": {
        "type": "whitespace"
      },
      "tokenFilters": [
        {
          "type": "lowercase"
        }
      ]
    }
  ]
}
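
As a rough illustration, a hypothetical field value passes through this analyzer as follows: the whitespace tokenizer splits it into words, and the lowercase token filter lowercases each one.

// Illustration only: how whitespaceLowerer treats a hypothetical value.
const value = "The Quick Brown FOX";

// Whitespace tokenizer, then lowercase token filter.
const tokens = value.split(/\s+/).map(t => t.toLowerCase());

console.log(tokens); // [ 'the', 'quick', 'brown', 'fox' ]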

nGram

The nGram tokenizer has the following attributes:

Name Type Description Required? Default
type string The type of this tokenizer. Must be nGram. yes  
minGram integer Number of characters to include in the shortest token created. yes  
maxGram integer Number of characters to include in the longest token created. yes  

Example

The following example index definition uses a custom analyzer named ngramShingler. It uses the nGram tokenizer to create tokens between 2 and 5 characters long and the shingle token filter.

{
  "analyzer": "ngramShingler",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "ngramShingler",
      "charFilters": [],
      "tokenizer": {
        "type": "nGram",
        "minGram": 2,
        "maxGram": 5
      },
      "tokenFilters": [
        {
          "type": "shingle",
          "minSingleSize": 2,
          "maxShingleSize": 3
        }
      ]
    }
  ]
}
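
As a rough illustration, the following sketch computes the character n-grams that a 2-5 nGram tokenizer produces for a short hypothetical input; Lucene may emit the tokens in a different order, but the set of grams is the same.

// Illustration only: 2-5 character n-grams for a hypothetical input.
function nGrams(text, minGram, maxGram) {
  const grams = [];
  for (let start = 0; start < text.length; start++) {
    for (let len = minGram; len <= maxGram && start + len <= text.length; len++) {
      grams.push(text.slice(start, start + len));
    }
  }
  return grams;
}

console.log(nGrams("cafe", 2, 5));
// [ 'ca', 'caf', 'cafe', 'af', 'afe', 'fe' ]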

edgeGram

The edgeGram tokenizer has the following attributes:

Name Type Description Required? Default
type string The type of this tokenizer. Must be edgeGram. yes  
minGram integer Number of characters to include in the shortest token created. yes  
maxGram integer Number of characters to include in the longest token created. yes  

Example

The following example index definition uses a custom analyzer named edgegramShingler. It uses the edgeGram tokenizer to create tokens between 2 and 5 characters long starting from the first character of text input and the shingle token filter.

{
  "analyzer": "edegramShingler",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "edgegramShingler",
      "charFilters": [],
      "tokenizer": {
        "type": "edgeGram",
        "minGram": 2,
        "maxGram": 5
      },
      "tokenFilters": [
        {
          "type": "shingle",
          "minSingleSize": 2,
          "maxShingleSize": 3
        }
      ]
    }
  ]
}
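
As a rough illustration, the sketch below computes the 2-5 character edge grams for a short hypothetical input; unlike nGram, only grams anchored at the start of the input are produced.

// Illustration only: 2-5 character edge n-grams for a hypothetical input.
function edgeGrams(text, minGram, maxGram) {
  const grams = [];
  for (let len = minGram; len <= maxGram && len <= text.length; len++) {
    grams.push(text.slice(0, len));
  }
  return grams;
}

console.log(edgeGrams("search", 2, 5));
// [ 'se', 'sea', 'sear', 'searc' ]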

regexCaptureGroup

The regexCaptureGroup tokenizer has the following attributes:

Name Type Description Required? Default
type string The type of this tokenizer. Must be regexCaptureGroup. yes  
pattern string A regular expression to match against. yes  
group integer Index of the character group within the matching expression to extract into tokens. Use 0 to extract the entire match. yes  

Example

The following example index definition uses a custom analyzer named phoneNumberExtractor. It uses the regexCaptureGroup tokenizer to create a single token from the first US-formatted phone number present in the text input.

{
  "analyzer": "phoneNumberExtractor",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "phoneNumberExtractor",
      "charFilters": [],
      "tokenizer": {
        "type": "regexCaptureGroup",
        "pattern": "^\\b\\d{3}[-.]?\\d{3}[-.]?\\d{4}\\b$",
        "group": 0
      },
      "tokenFilters": []
    }
  ]
}

regexSplit

The regexSplit tokenizer has the following attributes:

Name Type Description Required? Default
type string The type of this tokenizer. Must be regexSplit. yes  
pattern string A regular expression to match against. yes  

Example

The following example index definition uses a custom analyzer named dashSplitter. It uses the regexSplit tokenizer to create tokens from hyphen-delimited input text.

{
  "analyzer": "dashSplitter",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "dashSplitter",
      "charFilters": [],
      "tokenizer": {
        "type": "regexSplit",
        "pattern": "[-]+",
      },
      "tokenFilters": []
    }
  ]
}

Token Filters

Token filters always require a type field, and some take additional options as well.

"tokenFilters": [
  {
    "type": "<token-filter-type>",
    "<additional-option>": <value>
  }
]

Atlas Search supports the following token filters:

Name Description
lowercase Normalizes token text to lowercase.
length Removes tokens that are too short or too long.
icuFolding Applies character folding from Unicode Technical Report #30.
icuNormalizer Normalizes tokens using a standard Unicode Normalization Mode.
nGram Tokenizes input into n-grams of configured sizes.
edgeGram Tokenizes input into edge n-grams of configured sizes.
shingle Constructs shingles (token n-grams) from a series of tokens.
regex Applies a regular expression to each token, replacing matches with a specified string.
snowballStemming Stems tokens using a Snowball-generated stemmer.

lowercase

The lowercase token filter has the following attribute:

Name Type Description Required? Default
type string The type of this token filter. Must be lowercase. yes  

Example

The following example index definition uses a custom analyzer named lowercaser. It uses the standard tokenizer with the lowercase token filter to lowercase all tokens.

{
  "analyzer": "lowercaser",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "lowercaser",
      "charFilters": [],
      "tokenizer": {
        "type": "standard"
      },
      "tokenFilters": [
        {
          "type": "lowercase"
        }
      ]
    }
  ]
}

length

The length token filter has the following attributes:

Name Type Description Required? Default
type string The type of this token filter. Must be length. yes  
min integer The minimum length of a token. Must be less than or equal to max. no 0
max integer The maximum length of a token. Must be greater than or equal to min. no 255

Example

The following example index definition uses a custom analyzer named longOnly. It uses the length token filter to index only tokens that are at least 20 UTF-16 code units long after tokenizing with the standard tokenizer.

{
  "analyzer": "longOnly",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "longOnly",
      "charFilters": [],
      "tokenizer": {
        "type": "standard"
      },
      "tokenFilters": [
        {
          "type": "length",
          "min": 20
        }
      ]
    }
  ]
}
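
As a rough illustration, the sketch below shows which tokens from a hypothetical token stream survive a length filter with min set to 20.

// Illustration only: tokens kept by a length filter with min: 20.
const tokens = ["internationalization", "accessibility", "localization"];

// Only tokens of at least 20 UTF-16 code units pass the filter.
const kept = tokens.filter(t => t.length >= 20);

console.log(kept); // [ 'internationalization' ]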

icuFolding

The icuFolding token filter has the following attribute:

Name Type Description Required? Default
type string The type of this token filter. Must be icuFolding. yes  

Example

The following example index definition uses a custom analyzer named diacriticFolder. It uses the keyword tokenizer with the icuFolding token filter to apply foldings from UTR#30 Character Foldings. Foldings include accent removal, case folding, canonical duplicates folding, and many others detailed in the report.

{
  "analyzer": "diacriticFolder",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "diacriticFolder",
      "charFilters": [],
      "tokenizer": {
        "type": "keyword"
      },
      "tokenFilters": [
        {
          "type": "icuFolding",
        }
      ]
    }
  ]
}

icuNormalizer

The icuNormalizer token filter has the following attributes:

Name Type Description Required? Default
type string The type of this token filter. Must be icuNormalizer. yes  
normalizationForm string Normalization form to apply. Accepted values are nfd (Canonical Decomposition), nfc (Canonical Decomposition, followed by Canonical Composition), nfkd (Compatibility Decomposition), and nfkc (Compatibility Decomposition, followed by Canonical Composition). For more information about the supported normalization forms, see Section 1.2: Normalization Forms, UTR#15. no nfc

Example

The following example index definition uses a custom analyzer named normalizer. It uses the whitespace tokenizer, then normalizes tokens by Canonical Decomposition, followed by Canonical Composition.

{
  "analyzer": "normalizer",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "normalizer",
      "charFilters": [],
      "tokenizer": {
        "type": "whitespace"
      },
      "tokenFilters": [
        {
          "type": "icuNormalizer",
          "normalizationForm": "nfc"
        }
      ]
    }
  ]
}

nGram

The nGram token filter has the following attributes:

Name Type Description Required? Default
type string The type of this token filter. Must be nGram. yes  
minGram integer The minimum length of generated n-grams. Must be less than or equal to maxGram. yes  
maxGram integer The maximum length of generated n-grams. Must be greater than or equal to minGram. yes  
termNotInBounds string Accepted values are include and omit. If include is specified, tokens shorter than minGram or longer than maxGram are indexed as-is. If omit is specified, those tokens are not indexed. no omit

Example

The following example index definition uses a custom analyzer named persianAutocomplete. It functions as an autocomplete analyzer for Persian and other languages that use the zero-width non-joiner character. It performs the following operations:

  • Normalizes the zero-width non-joiner character with the persian character filter.
  • Tokenizes with the whitespace tokenizer.
  • Token filtering with the following filters:
    • icuNormalizer
    • shingle
    • nGram

{
  "analyzer": "persianAutocomplete",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "persianAutocomplete",
      "charFilters": [
        {
          "type": "persian"
        }
      ],
      "tokenizer": {
        "type": "whitespace"
      },
      "tokenFilters": [
        {
          "type": "icuNormalizer",
          "normalizationForm": "nfc"
        },
        {
          "type": "shingle",
          "minShingleSize": 2,
          "maxShingleSize": 3
        },
        {
          "type": "nGram",
          "minGram": 1,
          "maxGram": 10
        }
      ]
    }
  ]
}

edgeGram

The edgeGram token filter has the following attributes:

Name Type Description Required? Default
type string The type of this token filter. Must be edgeGram. yes  
minGram integer The minimum length of generated n-grams. Must be less than or equal to maxGram. yes  
maxGram integer The maximum length of generated n-grams. Must be greater than or equal to minGram. yes  
termNotInBounds string Accepted values are include and omit. If include is specified, tokens shorter than minGram or longer than maxGram are indexed as-is. If omit is specified, those tokens are not indexed. no omit

Example

The following example index definition uses a custom analyzer named englishAutocomplete. It performs the following operations:

  • Tokenizes with the standard tokenizer.
  • Token filtering with the following filters:
    • icuFolding
    • shingle
    • edgeGram

{
  "analyzer": "englishAutocomplete",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "englishAutocomplete",
      "charFilters": [],
      "tokenizer": {
        "type": "standard"
      },
      "tokenFilters": [
        {
          "type": "icuFolding"
        },
        {
          "type": "shingle",
          "minShingleSize": 2,
          "maxShingleSize": 3
        },
        {
          "type": "edgeGram",
          "minGram": 1,
          "maxGram": 10
        }
      ]
    }
  ]
}

shingle

The shingle token filter has the following attributes:

Name Type Description Required? Default
type string The type of this token filter. Must be shingle. yes  
minShingleSize integer Minimum number of tokens per shingle. Must be less than or equal to maxShingleSize. yes  
maxShingleSize integer Maximum number of tokens per shingle. Must be greater than or equal to minShingleSize. yes  

Example

The following example index definition uses a custom analyzer named shingler. It creates shingles of two and three token combinations after tokenizing with the standard tokenizer. minShingleSize is set to 2, so it does not index input when only one token is created from the tokenizer.

{
  "analyzer": "shingler",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "shingler",
      "charFilters": [],
      "tokenizer": {
        "type": "standard"
      },
      "tokenFilters": [
        {
          "type": "shingle",
          "minShingleSize": 2,
          "maxShingleSize": 3
        }
      ]
    }
  ]
}
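
As a rough illustration, the sketch below builds 2- and 3-token shingles from a hypothetical token stream; Lucene's shingle filter may differ in details such as token ordering and separator handling, so treat the output as indicative only.

// Illustration only: 2- and 3-token shingles from a hypothetical token stream.
function shingles(tokens, minSize, maxSize) {
  const out = [];
  for (let start = 0; start < tokens.length; start++) {
    for (let size = minSize; size <= maxSize && start + size <= tokens.length; size++) {
      out.push(tokens.slice(start, start + size).join(" "));
    }
  }
  return out;
}

console.log(shingles(["the", "head", "of", "sales"], 2, 3));
// [ 'the head', 'the head of', 'head of', 'head of sales', 'of sales' ]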

regex

The regex token filter has the following attributes:

Name Type Description Required? Default
type string The type of this token filter. Must be regex. yes  
pattern string Regular expression pattern to apply to each token. yes  
replacement string Replacement string to substitute wherever a matching pattern occurs. yes  
matches string Acceptable values are all and first. If matches is set to all, replace all matching patterns. Otherwise, replace only the first matching pattern. yes  

Example

The following example index definition uses a custom analyzer named emailRedact. It uses the keyword tokenizer. It finds strings that look like email addresses and replaces them with the word redacted.

{
  "analyzer": "emailRedact",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "emailRedact",
      "charFilters": [],
      "tokenizer": {
        "type": "keyword"
      },
      "tokenFilters": [
        {
          "type": "regex",
          "pattern": "^([a-z0-9_\\.-]+)@([\\da-z\\.-]+)\\.([a-z\\.]{2,5})$",
          "replacement": "redacted",
          "matches": "all"
        }
      ]
    }
  ]
}
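
As a rough illustration, the sketch below applies the same anchored pattern to two hypothetical field values. Because the keyword tokenizer emits the entire field as one token, the pattern only matches when the whole value is an email address.

// Illustration only: the emailRedact pattern applied to hypothetical values.
const pattern = /^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,5})$/;

console.log("john.doe@example.com".replace(pattern, "redacted"));
// redacted

console.log("contact: john.doe@example.com".replace(pattern, "redacted"));
// contact: john.doe@example.com  (no match, token left unchanged)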

snowballStemming

The snowballStemming token filter has the following attributes:

Name Type Description Required? Default
type string The type of this token filter. Must be snowballStemming. yes  
stemmerName string The Snowball-generated stemmer to use. Valid values are listed below. yes  

The following stemmerName values are valid:

  • arabic
  • armenian
  • basque
  • catalan
  • danish
  • dutch
  • english
  • finnish
  • french
  • german
  • german2 (Alternative German language stemmer. Handles the umlaut by expanding ü to ue in most contexts.)
  • hungarian
  • irish
  • italian
  • kp (Kraaij-Pohlmann stemmer, an alternative stemmer for Dutch.)
  • lithuanian
  • lovins (The first-ever published “Lovins JB” stemming algorithm.)
  • norwegian
  • porter (The original Porter English stemming algorithm.)
  • portuguese
  • romanian
  • russian
  • spanish
  • swedish
  • turkish

Example

The following example index definition uses a custom analyzer named frenchStemmer. It uses the lowercase token filter and the standard tokenizer, followed by the french variant of the snowballStemming token filter.

{
  "analyzer": "frenchStemmer",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "name": "frenchStemmer",
      "charFilters": [],
      "tokenizer": {
        "type": "standard"
      },
      "tokenFilters": [
        {
          "type": "lowercase"
        },
        {
          "type": "snowballStemming",
          "stemmerName": "french"
        }
      ]
    }
  ]
}

Query Example

A collection named minutes contains the following documents:

{ "_id" : 1, "text" : "<head> This page deals with department meetings. </head>" }
{ "_id" : 2, "text" : "The head of the sales department spoke first." }
{ "_id" : 3, "text" : "<body>We'll head out to the conference room by noon.</body>" }

The index definition for the minutes collection uses a custom analyzer with the htmlStrip character filter to strip out HTML tags when searching for text specified in the query field of a search.

{
  "analyzer": "htmlStrippingAnalyzer",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [{
    "name": "htmlStrippingAnalyzer",
    "charFilters": [{
      "type": "htmlStrip",
      "ignoredTags": ["a"]
    }],
    "tokenizer": {
      "type": "standard"
    },
    "tokenFilters": []
  }]
}

The following search operation looks for occurrences of the string head in the text field of the minutes collection.

db.minutes.aggregate([
  {
    $search: {
       search: {
         query: "head",
         path: "text"
       }
    }
  }
])

The query returns the following results:

{ "_id" : 2, "text" : "The head of the sales department spoke first." }
{ "_id" : 3, "text" : "<body>We'll head out to the conference room by noon.</body>" }

The document with _id: 1 is not returned, because the string head appears only as part of the HTML tag <head>, which the analyzer strips out. The document with _id: 3 also contains HTML tags, but the string head appears in the text itself, so the document matches.