Data Lake Configuration

Beta

The Atlas Data Lake is available as a Beta feature. The product and the corresponding documentation may change at any time during the Beta stage.

This page describes the configuration options available for Atlas Data Lake. Each Data Lake configuration file defines mappings between your data stores and Data Lake.

Data Lake configuration files use the JSON format. You must connect to the Data Lake using the mongo shell to retrieve or update the current configuration. For example, consider an S3 bucket datacenter-alpha containing data collected from a datacenter:

|--metrics
   |--hardware

The /metrics/hardware path stores JSON files with metrics derived from the datacenter hardware, where each filename is the UNIX timestamp of the 24-hour period covered by that file:

/hardware/1551398400.json

The following configuration file:

  • Defines a data store on the datacenter-alpha S3 bucket in the us-east-1 AWS region. The data store is restricted to data files in the metrics folder path.
  • Maps files from the hardware folder to a MongoDB database datacenter-alpha-metrics and collection hardware. The configuration mapping includes parsing logic for capturing the timestamp implied in the filename.
{
  "stores" : [
    {
      "s3" : {
        "name" : "datacenter-alpha",
        "region" : "us-east-1",
        "bucket" : "datacenter-alpha",
        "prefix" : "/metrics",
        "delimiter" : "/"
      }
    }
  ],
  "databases" : {
    "datacenter-alpha-metrics" : {
      "hardware" : [
        {
          "store" : "datacenter-alpha",
          "definition" : "/hardware/{date date}"
        }
      ]
    }
  }
}

Atlas Data Lake parses the S3 bucket datacenter-alpha and scans all files under /metrics/hardware/. The collection definition uses the definition parsing syntax to add a date field to each document generated from a file, where the field value is the filename parsed as a UNIX timestamp and converted to an ISO-8601 date.

Users connected to the Data Lake can use the MongoDB Query Language and supported aggregations to analyze data in the S3 bucket through the datacenter-alpha-metrics.hardware collection.
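For example, a query like the following sketch could be served from the mapped collection; Data Lake can use the parsed date field to narrow the scan to only the matching files (the query value here is illustrative):

db = db.getSiblingDB("datacenter-alpha-metrics")   // hyphenated database names require getSiblingDB()

// Returns documents from the file whose parsed filename produced this date.
db.hardware.find( { "date" : ISODate("2019-03-01T00:00:00Z") } )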

Retrieve or Update Data Lake Configuration

You can retrieve or update the Data Lake configuration by connecting a mongo shell to the Data Lake:

  1. From the Atlas UI, select Data Lake from the left-hand navigation.
  2. Click Connect for the Data Lake to which you want to connect.
  3. Click Connect with the Mongo Shell.
  4. Follow the instructions in the Connect modal. If you already have the mongo shell installed, ensure you are running the latest stable release of version 3.6 or later.

Once connected to the Data Lake, you can use the following database commands to retrieve or update the Data Lake configuration:

Retrieve Configuration
use admin
db.runCommand( { "storageGetConfig" : 1 } )

The command returns the current Data Lake configuration. For complete documentation on the configuration document format, see Configuration Document Format.
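For example, you can capture the returned document in a shell variable and print it for inspection (a minimal sketch):

use admin
var result = db.runCommand( { "storageGetConfig" : 1 } )
// Print the returned document, which includes the current configuration.
printjson(result)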

Set or Update Configuration
use admin
db.runCommand( { "storageSetConfig" : <config> } )

Replace <config> with the Data Lake configuration document. For complete documentation on the configuration document, see Configuration Document Format.
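For example, to apply the configuration shown at the beginning of this page, you could hold the document in a shell variable and pass it to the command (a sketch; substitute your own stores and mappings):

use admin
var config = {
  "stores" : [
    { "s3" : { "name" : "datacenter-alpha", "region" : "us-east-1",
               "bucket" : "datacenter-alpha", "prefix" : "/metrics",
               "delimiter" : "/" } }
  ],
  "databases" : {
    "datacenter-alpha-metrics" : {
      "hardware" : [
        { "store" : "datacenter-alpha", "definition" : "/hardware/{date date}" }
      ]
    }
  }
}
db.runCommand( { "storageSetConfig" : config } )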

Configuration Document Format

The Data Lake configuration document has the following format:

{
  "stores" : [
    {
      "s3" : {
        "name" : "<string>",
        "region" : "<string>",
        "bucket" : "<string>",
        "prefix" : "<string>",
        "delimiter" : "<string>"
      }
    }
  ],
  "databases" : {
    "<database>" : {
      "<collection>" : [
        {
          "store" : "<string>",
          "defaultFormat" : "<string>",
          "definition" : "<string>"
        }
      ]
    }
  }
}
stores
The stores object defines each data store associated with the Data Lake. Data Lake can only access data stores defined in the stores object.
databases
The databases object defines the mapping between each data store defined in stores and a MongoDB database and collection.

stores

"stores" : [
  "s3" : {
    "name" : "<string>",
    "region" : "<string>",
    "bucket" : "<string>",
    "prefix" : "<string>",
    "delimiter" : "<string>"
  }
]
stores

The stores object defines an array of data stores associated with a Data Lake. Currently, a data store represents files in an S3 bucket. A Data Lake can only access data stores defined in the stores object.

stores.[n].s3

Defines an AWS S3 bucket as a data store.

stores.[n].s3.name

Name of the data store. The databases.<database>.<collection>.[n].store field references this value as part of the mapping configuration.

stores.[n].s3.region

Name of the AWS region in which the S3 bucket is hosted. For a list of valid region names, see Amazon Web Services (AWS).

stores.[n].s3.bucket

Name of the AWS S3 bucket. Must exactly match the name of an S3 bucket which Data Lake can access given the configured AWS IAM credentials.

See the Data Lake setup documentation for more information on configuring Data Lake access to AWS.

stores.[n].s3.prefix

Optional. Data Lake applies this prefix when searching for files in the S3 bucket.

For example, consider an S3 bucket metrics with the following structure:

metrics
|--hardware
|--software
   |--computed

The data store prepends the value of prefix to the databases.<database>.<collection>.[n].definition to create the full path for files to ingest. Setting prefix to /software restricts any databases objects using the data store to subpaths of /software.

If omitted, Data Lake searches all files from the root of the S3 bucket.
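A stores entry using a prefix might look like the following sketch (the name and region values are illustrative):

"stores" : [
  {
    "s3" : {
      "name" : "metrics",
      "region" : "us-east-1",
      "bucket" : "metrics",
      "prefix" : "/software",
      "delimiter" : "/"
    }
  }
]

With this prefix, a definition of /computed in a databases object resolves to the full path /software/computed.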

stores.[n].s3.delimiter

Optional. The delimiter that separates path elements in the prefix.

If omitted, defaults to "/".

databases

"databases" : {
  "<database>" : {
    "<collection>" : [
      {
        "store" : "<string>",
        "defaultFormat" : "<string>",
        "definition" : "<string>"
      }
    ]
  }
}
databases

Each nested object represents a database and its collections, each of which maps to a data store in the stores array.

databases.<database>

The name of the database to which Data Lake maps the data contained in the data store. Each <database> can have multiple nested <collection> objects.

databases.<database>.<collection>

The name of the collection to which Data Lake maps the data contained in each databases.<database>.<collection>.[n].store. Each object in the array represents the mapping between the collection and an object in the stores array.

You can generate collection names dynamically from file paths by specifying * for the collection name and the collectionName() function in the definition field. See Generate Dynamic Collection Names from File Path for examples.

databases.<database>.<collection>.[n].store

Name of a data store to map to the <collection>. Must match the name of an object in the stores array.

databases.<database>.<collection>.[n].defaultFormat

Optional. Specifies the default format Data Lake assumes if it encounters a file without an extension while searching the store.

If omitted, Data Lake attempts to detect the file type by scanning a few bytes of the file.
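For example, the following collection mapping sketch assumes extensionless files under /hardware should be parsed as JSON (the ".json" value here is illustrative; use a format value supported by your Data Lake release):

"hardware" : [
  {
    "store" : "datacenter-alpha",
    "defaultFormat" : ".json",
    "definition" : "/hardware/{date date}"
  }
]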

databases.<database>.<collection>.[n].definition

Controls how Atlas Data Lake searches for and parses files in the store before mapping them to the <collection>. Data Lake prepends the stores.[n].s3.prefix to the definition to build the full path to search within. Specify / to capture all files and folders from the prefix path.

For example, consider an S3 bucket metrics with the following structure:

metrics
|--hardware
|--software
   |--computed

A definition of / directs Data Lake to search all files and folders in the metrics bucket.

A definition of /hardware directs Data Lake to search only that path for files to ingest.

If the data store specifies a prefix of /software, a definition of /computed directs Data Lake to search for files only in the path /software/computed.

Appending the * wildcard character to the definition directs Data Lake to include all files and folders from that point in the path. For example, /software/computed* would match files like /software/computed-detailed, /software/computedArchive, and /software/computed/errors.

definition supports additional syntax for parsing filenames, including:

  • Generating document fields from filenames.
  • Using regular expressions to control field generation.
  • Setting boundaries for bucketing filenames by timestamp.

See Definition Syntax for more information.

Definition Syntax

definition supports parsing filenames into computed fields. Data Lake can add the computed fields to each document generated from the parsed file, and can target queries on those computed field values to only those files whose filenames match.

  • You can specify a single parsing function on the filename:

    /path/to/files/{<fieldA> <data-type>}
    
  • You can specify multiple parsing functions on the filename:

    /path/to/files/{<fieldA> <data-type>}-{<fieldB> <data-type>}
    
  • You can specify parsing functions alongside static strings in the filename:

    /path/to/files/prefix-{<fieldA> <data-type>}-suffix
    

For more information on supported data types, see Supported Data Types.

Parse Single Field from Filename

Consider a data store accountingArchive containing files where the filename describes an invoice date. For example, the filename /invoices/1551398400.json contains the invoices for the UNIX timestamp 1551398400.

The following databases object generates a field invoiceDate by parsing the filename as a UNIX timestamp:

"databases" : {
  "accounting" : {
    "invoices" : [
      "store" : "accountingArchive",
      "definition" : "/invoices/{invoiceDate date}"
    ]
  }
}

Data Lake adds the computed field and value to each document generated from the file. Documents generated from the example filename include a field invoiceDate: ISODate("2019-03-01T00:00:00Z"). Queries on the invoiceDate field can be targeted to only those files that match the specified value.
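For example, the following sketch targets only the file /invoices/1551398400.json (the accounting database name comes from the mapping above):

db = db.getSiblingDB("accounting")

// Data Lake narrows the scan to files whose parsed filename
// produced a matching invoiceDate value.
db.invoices.find( { "invoiceDate" : ISODate("2019-03-01T00:00:00Z") } )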

Parse Multiple Fields from Filename

Consider a data store accountingArchive containing files where the filename describes an invoice number and invoice date. For example, the filename /invoices/MONGO12345-1551398400.json contains the invoice MONGO12345 for the UNIX timestamp 1551398400.

The following databases object generates:

  • A field invoiceNumber by parsing the first segment of the filename as a string.
  • A field invoiceDate by parsing the second segment of the filename as a UNIX timestamp.
"databases" : {
  "accounting" : {
    "invoices" : [
      "store" : "accountingArchive",
      "definition" : "/invoices/{invoiceNumber string}-{invoiceDate timestamp}"
    ]
  }
}

Data Lake adds the computed fields and values to each document generated from the filename. Documents generated from the example filename include the following fields:

  • invoiceNumber : "MONGO12345"
  • invoiceDate : ISODate("2019-03-01T00:00:00Z")

Queries that include both the invoiceNumber and invoiceDate fields can be targeted to only those files that match the specified values.
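For example (a sketch using the mapping above):

db = db.getSiblingDB("accounting")

// Both computed fields must match, so only the file
// /invoices/MONGO12345-1551398400.json is read.
db.invoices.find( {
  "invoiceNumber" : "MONGO12345",
  "invoiceDate" : ISODate("2019-03-01T00:00:00Z")
} )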

Use Regular Expression to Parse Fields from Filename

Consider a data store accountingArchive containing files where the filename describes an invoice number and invoice date. For example, the filename /invoices/MONGO12345-20190102.json contains the invoice MONGO12345 for the date 20190102.

The following databases object generates:

  • A field invoiceNumber by parsing the first segment of the filename as a string.
  • A field year by using a regular expression to parse only the first 4 digits of the second segment of the filename as an int.
  • A field month by using a regular expression to parse only the next 2 digits of the second segment of the filename as an int.
  • A field day by using a regular expression to parse only the next 2 digits of the second segment of the filename as an int.
"databases" : {
  "accounting" : {
    "invoices" : [
      "store" : "accountingArchive",
      "definition" : "/invoices/{invoiceNumber string}-{year int:\d{4}}{month int:\d{2}}{day int:\d{2}}"
    ]
  }
}

Data Lake adds the computed fields and values to each document generated from the filename. Documents generated from the example filename include the following fields:

  • invoiceNumber : "MONGO12345"
  • year : 2019
  • month : 1
  • day : 2

Important

You must escape the regex string specified in the definition. For example, the backslashes in the regex above must be escaped as \\d within the JSON string, and any double quotes must also be escaped.

Queries that include all generated fields can be targeted to only those files that match the specified values.
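For example (a sketch using the mapping above):

db = db.getSiblingDB("accounting")

// Matches only files whose filename parses to year 2019, month 01, day 02.
db.invoices.find( { "year" : 2019, "month" : 1, "day" : 2 } )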

Identify Ranges of Queryable Data from Filename

Consider a data store accountingArchive containing files where the filename describes the range of data contained in the file. For example, the filename /invoices/1546300800-1546387200.json contains invoices for the time period between 2019-01-01 and 2019-01-02 using UNIX timestamp representation for those dates.

The following databases object identifies the minimum time range as the first segment of the filename and the maximum time range as the second segment of the filename:

"databases" : {
  "accounting" : {
    "invoices" : [
      "store" : "accountingArchive",
      "definition" : "/invoices/{min(invoiceDate) timestamp}-{max(invoiceDate) timestamp}"
    ]
  }
}

When Data Lake receives a query on the "invoiceDate" field, the specified definition allows it to identify which files contain the data that matches the query.

Queries on the invoiceDate field can be targeted to only those files whose range captures the specified value.
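For example, the following sketch reads only files whose filename range contains the queried value, such as /invoices/1546300800-1546387200.json:

db = db.getSiblingDB("accounting")

// Only files whose [min, max] filename range captures this value are read.
db.invoices.find( { "invoiceDate" : ISODate("2019-01-01T12:00:00Z") } )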

Important

The field specified for the min and max ranges must exist in every document contained in the file to avoid unexpected or undesired behavior. Data Lake does not perform any validation that the underlying data conforms to this constraint.

Generate Dynamic Collection Names from File Path

Consider a data store accountingArchive with the following directory structure:

invoices
|--SuperSoftware
|--UltraSoftware
|--MegaSoftware

The following databases object generates a dynamic collection name from the file path:

"databases" : {
  "invoices" : {
    "*" : {
      "store" : "accountingArchive",
      "definition" : "/invoices/{collectionName()}/"
    }
  }
}

When applied to the example directory structure, the definition results in the following collections:

  • SuperSoftware
  • UltraSoftware
  • MegaSoftware
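For example, you could query one of the generated collections directly (a sketch; the collection names come from the directory names):

db = db.getSiblingDB("invoices")

// Reads only files under the /invoices/SuperSoftware/ path.
db.SuperSoftware.find( {} )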

Note

When you dynamically generate collections from filenames, the number of collections is not accurately reported in the Data Lake view.

Supported Data Types

The following describes each supported data type and provides an example of its usage:

int
Parses the filename as an integer.

filename: /zipcodes/90210.json

definition: /zipcodes/{zipcode int}

Inserts the "zipcode": 90210 field into the document produced by /zipcodes/90210.json.

string
Parses the filename as a string.

filename: /employees/949-555-0195.json

definition: /employees/{phone string}

Inserts the "phone": "949-555-0195" field into the document produced by /employees/949-555-0195.json.

date
Parses the filename as a UNIX timestamp.

filename: /metrics/1551398400.json

definition: /metrics/{startTimestamp date}

Inserts the "startTimestamp": ISODate("2019-03-01T00:00:00Z") field into the document produced by /metrics/1551398400.json.

isodate
Parses the filename as an ISO-8601 format date.

filename: /metrics/2019-01-03T00:00:00Z.json

definition: /metrics/{startTimestamp isodate}

Inserts the "startTimestamp": ISODate("2019-01-03T00:00:00Z") field into the document produced by /metrics/2019-01-03T00:00:00Z.json.