Path Syntax Examples

Beta

The Atlas Data Lake is available as a Beta feature. The product and the corresponding documentation may change at any time during the Beta stage. For support, see Atlas Support.

Overview

The path setting supports parsing filenames into computed fields. Data Lake can add the computed fields to each document generated from the parsed file. Data Lake can also target queries on those computed field values to only the files with a matching filename.

  • You can specify a single parsing function on the filename:

    /path/to/files/{<fieldA> <data-type>}
    
  • You can specify multiple parsing functions on the filename:

    /path/to/files/{<fieldA> <data-type>}-{<fieldB> <data-type>}
    
  • You can specify parsing functions alongside static strings in the filename:

    /path/to/files/prefix-{<fieldA> <data-type>}-suffix
    
  • You can use the dot operator (.) in a field name to map the parsed value to a nested field:

    /path/to/files/{<fieldA>.<fieldB> <data-type>}
    
  • You can specify ObjectIds in the path to the files to create partitions:

    /path/to/files/{objid objectid}
    
  • You can specify a range of ObjectIds in the path to the files to create partitions:

    /path/to/files/{min(obj) objectid}-{max(obj) objectid}
    
  • You can specify parsing functions along the path to the filename:

    /path/{<fieldA> <data-type>}/{<fieldB> <data-type>}/{<fieldC> <data-type>}/*
    

Partition attributes in the path default to the string data type when you omit the data type. For example, consider the following path:

/employees/{startDate}

In the above example, startDate is interpreted as a string. For more information on all supported data types, see Supported Partition Attribute Types.
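
The defaulting rule above can be illustrated with a small Python sketch. It is not Data Lake's parser; the helper name partition_attributes and the regular expression are assumptions made for illustration only. The sketch lists each partition attribute in a path and falls back to string when no data type is given.

```python
import re

# Hypothetical pattern: matches {name}, {name type}, or {name type:regex}.
# The body may contain one level of nested braces, as in {year int:\d{4}}.
ATTRIBUTE = re.compile(r"\{((?:[^{}]|\{[^{}]*\})*)\}")

def partition_attributes(path):
    """Return (field, data_type) pairs, defaulting the type to 'string'."""
    attributes = []
    for body in ATTRIBUTE.findall(path):
        name, _, rest = body.partition(" ")
        # Anything after ':' is a regex, not part of the data type.
        data_type = rest.split(":", 1)[0] or "string"
        attributes.append((name, data_type))
    return attributes

print(partition_attributes("/employees/{startDate}"))
# [('startDate', 'string')]
print(partition_attributes("/invoices/{invoiceNumber string}-{invoiceDate date}"))
# [('invoiceNumber', 'string'), ('invoiceDate', 'date')]
```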

Examples

The following examples demonstrate how to parse filenames into computed fields:

Parse Single Field from Filename

Consider a data store accountingArchive containing files where the filename describes an invoice date. For example, the filename /invoices/1564671291998.json contains the invoices for the UNIX timestamp 1564671291998.

The following databases object generates a field invoiceDate by parsing the filename as a UNIX timestamp:

"databases" : [
  {
    "name" : "accounting",
    "collections" : [
      {
        "name" : "invoices",
        "dataSources" : [
          {
            "storeName" : "accountingArchive",
            "path" : "/invoices/{invoiceDate date}"
          }
        ]
      }
    ]
  }
]

Data Lake adds the computed field and value to each document generated from the filename. Documents generated from the example filename include a field invoiceDate: ISODate("2019-08-01T14:54:51Z"). Queries on the invoiceDate field can be targeted to only those files that match the specified value.
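
To check the conversion by hand, the following Python sketch interprets the example filename as milliseconds since the UNIX epoch. The helper name millis_to_utc is hypothetical, and the sketch only reproduces the arithmetic; it does not describe how Data Lake performs the conversion internally.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def millis_to_utc(filename_stem):
    """Interpret a filename stem as milliseconds since the UNIX epoch."""
    return EPOCH + timedelta(milliseconds=int(filename_stem))

invoice_date = millis_to_utc("1564671291998")
print(invoice_date.strftime("%Y-%m-%dT%H:%M:%SZ"))  # 2019-08-01T14:54:51Z
```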

Parse Multiple Fields from Filename

Consider a data store accountingArchive containing files where the filename describes an invoice number and invoice date. For example, the filename /invoices/MONGODB12345-1564671291998.json contains the invoice MONGODB12345 for the UNIX timestamp 1564671291998.

The following databases object generates:

  • A field invoiceNumber by parsing the first segment of the filename as a string.
  • A field invoiceDate by parsing the second segment of the filename as a UNIX timestamp.

"databases" : [
  {
    "name": "accounting",
    "collections" : [
      {
        "name" : "invoices",
        "dataSources" : [
          {
            "storeName" : "accountingArchive",
            "path" : "/invoices/{invoiceNumber string}-{invoiceDate date}"
          }
        ]
      }
    ]
  }
]

Data Lake adds the computed fields and values to each document generated from the filename. Documents generated from the example filename include the following fields:

  • invoiceNumber : "MONGODB12345"
  • invoiceDate : ISODate("2019-08-01T14:54:51Z")

Queries that include both the invoiceNumber and invoiceDate fields can be targeted to only those files that match the specified values.

Use Regular Expression to Parse Fields from Filename

Consider a data store accountingArchive containing files where the filename describes an invoice number and invoice date. For example, the filename /invoices/MONGODB12345-20190102.json contains the invoice MONGODB12345 for the date 20190102.

The following databases object generates:

  • A field invoiceNumber by parsing the first segment of the filename as a string.
  • A field year by using a regular expression to parse only the first 4 digits of the second segment of the filename as an int.
  • A field month by using a regular expression to parse only the next 2 digits of the second segment of the filename as an int.
  • A field day by using a regular expression to parse only the next 2 digits of the second segment of the filename as an int.

"databases" : [
  {
    "name" : "accounting",
    "collections" : [
      {
        "name" : "invoices",
        "dataSources" : [
          {
            "storeName" : "accountingArchive",
            "path" : "/invoices/{invoiceNumber string}-{year int:\\d{4}}{month int:\\d{2}}{day int:\\d{2}}"
          }
        ]
      }
    ]
  }
]

Data Lake adds the computed fields and values to each document generated from the filename. Documents generated from the example filename include the following fields:

  • invoiceNumber : "MONGODB12345"
  • year : 2019
  • month : 1
  • day : 2
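
The field extraction above can be approximated in Python with named groups standing in for the {field type:regex} attributes. This is a sketch under that assumption, not the matching logic Data Lake uses; the function name parse is hypothetical.

```python
import re

# Named groups play the role of the partition attributes in the path.
PATTERN = re.compile(
    r"^(?P<invoiceNumber>.+)-(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})$"
)

def parse(stem):
    """Return the computed fields for a filename stem, or None if it does not match."""
    match = PATTERN.match(stem)
    if match is None:
        return None
    fields = match.groupdict()
    # Attributes declared as int lose any leading zeros.
    for name in ("year", "month", "day"):
        fields[name] = int(fields[name])
    return fields

print(parse("MONGODB12345-20190102"))
# {'invoiceNumber': 'MONGODB12345', 'year': 2019, 'month': 1, 'day': 2}
```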

Important

You must escape the regex string specified in the path so that the configuration remains valid JSON. For example, write each backslash in the regex as \\, and escape any double quotes that the regex includes.
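
One way to avoid escaping by hand is to let a JSON serializer produce the escaped string. The sketch below uses Python's standard json module; writing the regex as a raw Python string is a convenience assumed for this illustration, not something Data Lake requires.

```python
import json

# The regex as you would write it unescaped, held in a raw Python string.
path = r"/invoices/{invoiceNumber string}-{year int:\d{4}}{month int:\d{2}}{day int:\d{2}}"

# json.dumps doubles each backslash so the result is valid JSON.
encoded = json.dumps({"path": path})
print(encoded)

# Decoding the JSON gives back the original, unescaped regex.
restored = json.loads(encoded)["path"]
print(restored == path)  # True
```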

Queries that include all generated fields can be targeted to only those files that match the specified values.

Identify Ranges of Queryable Data from Filename

Consider a data store accountingArchive containing files where the filename describes the range of data contained in the file. For example, the filename /invoices/1546367712000-1549046112000.json contains invoices for the time period between 2019-01-01 and 2019-02-01, with both dates represented as UNIX timestamps.

The following databases object treats the first segment of the filename as the minimum of the time range and the second segment as the maximum:

"databases" : [
  {
    "name" : "accounting",
    "collections" : [
      {
        "name" : "invoices",
        "dataSources" : [
          {
            "storeName" : "accountingArchive",
            "path" : "/invoices/{min(invoiceDate) date}-{max(invoiceDate) date}"
          }
        ]
      }
    ]
  }
]

When Data Lake receives a query on the "invoiceDate" field, it uses the specified path to identify which files contain the data that matches the query.

Queries on the invoiceDate field can be targeted to only those files whose range captures the specified value, including the min and max date.
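
The range targeting described above can be sketched in Python as follows. The second filename, the query date, and the helper names file_range and files_for are hypothetical, and the inclusive comparison mirrors the statement that the min and max dates are included. Treat it as an illustration of the idea rather than Data Lake's implementation.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def file_range(filename):
    """Parse '<minMillis>-<maxMillis>.json' into an inclusive (min, max) pair."""
    low, high = filename.rsplit(".", 1)[0].split("-")
    return (EPOCH + timedelta(milliseconds=int(low)),
            EPOCH + timedelta(milliseconds=int(high)))

def files_for(filenames, invoice_date):
    """Keep only the files whose range captures invoice_date."""
    selected = []
    for name in filenames:
        low, high = file_range(name)
        if low <= invoice_date <= high:
            selected.append(name)
    return selected

files = [
    "1546367712000-1549046112000.json",
    "1549046112001-1551465312000.json",  # hypothetical second file
]
query = datetime(2019, 1, 15, tzinfo=timezone.utc)
print(files_for(files, query))  # ['1546367712000-1549046112000.json']
```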

Important

The field specified for the min and max ranges must exist in every document contained in the file to avoid unexpected or undesired behavior. Data Lake does not perform any validation that the underlying data conforms to this constraint.

Identify Nested Fields from Filename

Data Lake supports querying nested data when the nested data value is also the filename. You can use the dot operator (.) in your path to map the partition attributes in your storage configuration to nested fields in your documents.

Consider a data store accountingArchive. The data store contains files with names that match values of nested fields in the documents. For example:

accountingArchive
|--invoices
   |--January.json
   |--February.json
   ...

Suppose the January.json file contains a document with the following fields:

{
  "invoice": {
     "invoiceNumber" : "MONGODB12345",
     "year" : 2019,
     "month": "January", //value matches filename
     "date": 02
  },
  "vendor": "MONGODB",
  ...
}

The following databases object identifies month as a nested field inside a document. The databases object also identifies the value of month as the name of the file that contains the document.

"databases" : [
  {
    "name" : "accounting",
    "collections" : [
      {
        "name" : "invoices",
        "dataSources" : [
          {
            "storeName" : "accountingArchive",
            "path" : "/invoices/{invoice.month string}"
          }
        ]
      }
    ]
  }
]

When Data Lake receives a query on a specific month such as January, it uses the specified path to identify which file contains the data that matches the query.
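
As an illustration of the dot notation, the following Python sketch resolves a dotted attribute name against a document and compares the result with the filename. The helper get_nested is hypothetical, and the document is a trimmed copy of the January.json example.

```python
def get_nested(document, dotted_name):
    """Follow a dotted name such as 'invoice.month' into nested dictionaries."""
    value = document
    for key in dotted_name.split("."):
        value = value[key]
    return value

document = {
    "invoice": {"invoiceNumber": "MONGODB12345", "year": 2019, "month": "January"},
    "vendor": "MONGODB",
}
filename = "January.json"

# The value of the nested field matches the filename without its extension.
stem = filename.rsplit(".", 1)[0]
print(get_nested(document, "invoice.month") == stem)  # True
```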

Create Partitions from ObjectIds

You can specify ObjectIds in the path to the files. For files that contain an ObjectId in the filename, Data Lake creates a partition for each ObjectId.

Consider the following data store, accountingArchive. This data store contains files that include an ObjectId in the filename:

accountingArchive
|--invoices
   |--507f1f77bcf86cd799439011.json
   |--507f1f77bcf86cd799439012.json
   |--507f1f77bcf86cd799439013.json
   |--507f1f77bcf86cd799439014.json
   |--507f1f77bcf86cd799439015.json

The following databases object creates partitions for the ObjectIds.

"databases" : [
  {
    "name" : "accounting",
    "collections" : [
      {
        "name" : "invoices",
        "dataSources" : [
          {
            "storeName" : "accountingArchive",
            "path" : "/invoices/{objid objectid}"
          }
        ]
      }
    ]
  }
]

Or, suppose the data store accountingArchive contains files that include a range of ObjectIds in the filename. For example:

accountingArchive
|--invoices
   |--507f1f77bcf86cd799439011-507f1f77bcf86cd799439020.json
   |--507f1f77bcf86cd799439021-507f1f77bcf86cd799439030.json
   |--507f1f77bcf86cd799439031-507f1f77bcf86cd799439040.json

The following databases object creates partitions for the given range of ObjectIds.

"databases" : [
  {
    "name" : "accounting",
    "collections" : [
      {
        "name" : "invoices",
        "dataSources" : [
          {
            "storeName" : "accountingArchive",
            "path" : "/invoices/{min(obj) objectid}-{max(obj) objectid}"
          }
        ]
      }
    ]
  }
]

When Data Lake receives a query on the ObjectId, it uses the specified path to identify which file contains the data that matches the query.
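
A rough Python sketch of the range lookup follows. It assumes ObjectIds are ordered by their 12 raw bytes and that both bounds are inclusive; the helper names are hypothetical, and the queried ObjectId is made up for the example.

```python
def objectid_bytes(hex_string):
    """Convert a 24-character hexadecimal ObjectId string to its 12 raw bytes."""
    if len(hex_string) != 24:
        raise ValueError("an ObjectId has 24 hexadecimal characters")
    return bytes.fromhex(hex_string)

def files_for_objectid(filenames, object_id):
    """Keep only the files whose min-max range contains object_id."""
    wanted = objectid_bytes(object_id)
    selected = []
    for name in filenames:
        low, high = name.rsplit(".", 1)[0].split("-")
        if objectid_bytes(low) <= wanted <= objectid_bytes(high):
            selected.append(name)
    return selected

files = [
    "507f1f77bcf86cd799439011-507f1f77bcf86cd799439020.json",
    "507f1f77bcf86cd799439021-507f1f77bcf86cd799439030.json",
    "507f1f77bcf86cd799439031-507f1f77bcf86cd799439040.json",
]
print(files_for_objectid(files, "507f1f77bcf86cd799439025"))
# ['507f1f77bcf86cd799439021-507f1f77bcf86cd799439030.json']
```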

Create Partitions from File Path

You can specify parsing functions on any path leading up to the filename. Each computed field is based on a parsing function along the path. When querying the data, Data Lake converts each computed field to a partition. Partitions, which are synonymous with subdirectories, are then used to filter files with increasing precision.

Consider a data store accountingArchive with the following directory structure:

invoices
|--MONGO12345
   |--2019
      |--01
         |--02

The following databases object creates the invoiceNumber, year, month, and day partitions, which narrow a query to a small set of files:

"databases" : [
  {
    "name" : "accounting",
    "collections" : [
      {
        "name" : "invoices",
        "dataSources" : [
          {
            "storeName" : "accountingArchive",
            "path" : "/invoices/{invoiceNumber string}/{year int}/{month int}/{day int}/*"
          }
        ]
      }
    ]
  }
]
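
The mapping from directory levels to typed partition values can be sketched in Python as shown below. The file name part-0.json and the helper partitions_for are hypothetical; the sketch hard-codes the four attributes from the example path and is not a general implementation.

```python
# Each directory level under /invoices maps to one attribute and its type.
ATTRIBUTES = [("invoiceNumber", str), ("year", int), ("month", int), ("day", int)]

def partitions_for(key):
    """Map '/invoices/<number>/<year>/<month>/<day>/<file>' to typed fields."""
    segments = key.strip("/").split("/")
    # Expect the static prefix, one segment per attribute, and a filename.
    if len(segments) != len(ATTRIBUTES) + 2 or segments[0] != "invoices":
        return None
    values = segments[1:-1]
    try:
        return {name: cast(value) for (name, cast), value in zip(ATTRIBUTES, values)}
    except ValueError:
        return None  # a segment did not parse as the declared int type

print(partitions_for("/invoices/MONGO12345/2019/01/02/part-0.json"))
# {'invoiceNumber': 'MONGO12345', 'year': 2019, 'month': 1, 'day': 2}
```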

Generate Dynamic Collection Names from File Path

Consider a data store accountingArchive with the following directory structure:

invoices
|--SuperSoftware
|--UltraSoftware
|--MegaSoftware

The following databases object generates a dynamic collection name from the file path:

"databases" : [
  {
    "name" : "invoices",
    "collections" : [
      {
        "name" : "*",
        "dataSources" : [
          {
            "storeName" : "accountingArchive",
            "path" : "/invoices/{collectionName()}/"
          }
        ]
      }
    ]
  }
]

When applied to the example directory structure, the path results in the following collections:

  • SuperSoftware
  • UltraSoftware
  • MegaSoftware
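
The derivation of collection names from the directory level that collectionName() occupies can be sketched in Python. The file names under each directory and the helper collection_names are hypothetical, and the sketch only mimics the naming, not how Data Lake discovers files.

```python
def collection_names(keys, prefix="/invoices/"):
    """Collect the distinct directory names found directly under the prefix."""
    names = []
    for key in keys:
        if not key.startswith(prefix):
            continue
        name = key[len(prefix):].split("/", 1)[0]
        if name and name not in names:
            names.append(name)
    return names

keys = [
    "/invoices/SuperSoftware/a.json",  # hypothetical file names
    "/invoices/UltraSoftware/b.json",
    "/invoices/MegaSoftware/c.json",
    "/invoices/SuperSoftware/d.json",
]
print(collection_names(keys))
# ['SuperSoftware', 'UltraSoftware', 'MegaSoftware']
```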

Note

When you dynamically generate collections from filenames, the number of collections is not accurately reported in the Data Lake view.