create

Beta

The Atlas Data Lake is available as a Beta feature. The product and the corresponding documentation may change at any time during the Beta stage. For support, see Atlas Support.

The create command creates a collection for existing stores in the Atlas Data Lake storage configuration.

The wildcard "*" can be used with the create command in two ways:

  • As the name of the collection, to dynamically create collections that map to files and folders in the specified file path on the data store.
  • In the path parameter, to create a collection that maps to multiple files and folders in the specified file path on the data store.

Syntax

db.runCommand({ "create" : "<collection-name>|*", "dataSources" : [{ "storeName" : "<store-name>", "path" : "<path-to-files-or-folders>", "defaultFormat" :  "<file-extension>" }]})
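The syntax above can also be assembled programmatically before it is passed to db.runCommand. The following is a minimal sketch, not part of Data Lake: buildCreateCommand is a hypothetical helper name, and the checks it performs only mirror the required fields listed on this page.

```javascript
// Hypothetical helper (an assumption, not a Data Lake API): builds the
// document that would be passed to db.runCommand() for the create command.
function buildCreateCommand(collectionName, dataSources) {
  if (typeof collectionName !== "string" || collectionName.length === 0) {
    throw new Error('collection name must be a non-empty string or "*"');
  }
  if (!Array.isArray(dataSources) || dataSources.length === 0) {
    throw new Error("dataSources must be a non-empty array");
  }
  const sources = dataSources.map((source) => {
    // storeName and path are the required fields of each data source.
    if (!source.storeName || !source.path) {
      throw new Error("each data source needs a storeName and a path");
    }
    const entry = { storeName: source.storeName, path: source.path };
    // defaultFormat is optional, so it is copied only when present.
    if (source.defaultFormat !== undefined) {
      entry.defaultFormat = source.defaultFormat;
    }
    return entry;
  });
  return { create: collectionName, dataSources: sources };
}

const command = buildCreateCommand("airbnb", [
  { storeName: "egS3Store", path: "/json/airbnb", defaultFormat: ".json" },
]);
console.log(JSON.stringify(command));
// In a shell connected to Data Lake, this would be run as: db.runCommand(command)
```

Building the document separately makes it easy to inspect or log the command before running it.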

Parameters

<collection-name>|*
  Type: string. Required: yes.
  The name of the collection to which Data Lake maps the data contained in the data store, or the wildcard "*" to dynamically create collections using the wildcard collection function (collectionName()).

dataSources
  Type: array of objects. Required: yes.
  Array of objects where each object represents a data store in the stores array to map to the collection.

dataSources.storeName
  Type: string. Required: yes.
  The name of a data store to map to the <collection-name>. The value must match the name in the stores array.

dataSources.path
  Type: string. Required: yes.
  The path to the files and folders. Specify / to capture all files and folders from the prefix path. See Path Syntax Examples for more information.

dataSources.defaultFormat
  Type: string. Required: no.
  The format that Data Lake defaults to if it encounters a file without an extension while querying the data store. If omitted, Data Lake attempts to detect the file type by scanning a few bytes of the file. The following values are valid:
  .json, .json.gz, .bson, .bson.gz, .avro, .avro.gz, .tsv, .tsv.gz, .csv, .csv.gz, .parquet
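Because defaultFormat accepts only a fixed set of values, it can be checked on the client before the command is sent. The following is a minimal sketch; isValidDefaultFormat is a hypothetical helper name, and the list is copied from the valid values above.

```javascript
// Valid defaultFormat values, as listed in the parameter description.
const VALID_DEFAULT_FORMATS = [
  ".json", ".json.gz", ".bson", ".bson.gz", ".avro", ".avro.gz",
  ".tsv", ".tsv.gz", ".csv", ".csv.gz", ".parquet",
];

// Hypothetical helper (an assumption, not a Data Lake API).
// defaultFormat is optional, so undefined is accepted.
function isValidDefaultFormat(format) {
  return format === undefined || VALID_DEFAULT_FORMATS.includes(format);
}

console.log(isValidDefaultFormat(".json"));    // true
console.log(isValidDefaultFormat("json"));     // false (the leading dot is missing)
console.log(isValidDefaultFormat(undefined));  // true
```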

Output

The command returns the following output if it succeeds. You can verify the results by running the commands in Verify Collection. If it fails, see Troubleshoot Errors below for recommended solutions.

{ ok: 1 }

Examples

The following examples use the sample datasets, airbnb and weather, on AWS S3 stores with the following settings:

Store Name      egS3Store      sampleS3Store
Region          us-east-2      us-east-1
Bucket          sbx-data-lake  sbx-yadl-store
Prefix          json           egData
Delimiter       /              /
Sample Dataset  airbnb         weather

The Getting Started with Atlas Data Lake tutorial contains instructions for preparing your S3 bucket and uploading the sample dataset.

Basic Example

The following command creates a collection named airbnb in the sampleDB database in the storage configuration. The airbnb collection maps to the airbnb sample dataset in the json folder in the S3 store named egS3Store.

use sampleDB
db.runCommand({ "create" : "airbnb", "dataSources" : [{ "storeName" : "egS3Store", "path" : "/json/airbnb", "defaultFormat" : ".json" }]})

The previous command returns the following output:

{ "ok" : 1 }

The following commands show that the collection was successfully created:

> show collections
airbnb
> db.runCommand({"storageGetConfig" : 1 })
{
  "ok" : 1,
  "storage" : {
    "stores" : [{
      "name" : "egS3Store",
      "provider" : "s3",
      "region" : "us-east-2",
      "bucket" : "sbx-data-lake",
      "delimiter" : "/",
      "prefix" : ""
    }],
    "databases" : [{
      "name" : "sampleDB",
      "collections" : [{
        "name" : "airbnb",
        "dataSources" : [{
          "storeName" : "egS3Store",
          "path" : "/json/airbnb",
          "defaultFormat" : ".json"
        }]
      }]
    }]
  }
}

Multiple Data Sources Example

The following command creates a collection named egCollection in the sampleDB database in the storage configuration. The egCollection collection maps to the following sample datasets:

  • airbnb dataset in the json folder in the S3 store named egS3Store
  • weather dataset in the egData folder in the S3 store named sampleS3Store

use sampleDB
db.runCommand({ "create" : "egCollection", "dataSources" : [{ "storeName" : "egS3Store", "path" : "/json/airbnb" }, { "storeName" : "sampleS3Store", "path" : "/egData/weather" }]})

The previous command returns the following output:

{ "ok" : 1 }

The following commands show that the collection was successfully created:

> show collections
egCollection
> db.runCommand({"storageGetConfig":1})
{
  "ok" : 1,
  "storage" : {
    "stores" : [{
      "name" : "egS3Store",
      "provider" : "s3",
      "region" : "us-east-2",
      "bucket" : "sbx-data-lake",
      "delimiter" : "/",
      "prefix" : ""
    },
    {
      "name" : "sampleS3Store",
      "provider" : "s3",
      "region" : "us-east-1",
      "bucket" : "sbx-yadl-store",
      "delimiter" : "/",
      "prefix" : ""
    }],
    "databases" : [{
      "name" : "sampleDB",
      "collections" : [{
        "name" : "egCollection",
        "dataSources" : [
          {
            "storeName" : "sampleS3Store",
            "path" : "egData/weather/*"
          },
          {
            "storeName" : "egS3Store",
            "path" : "json/airbnb/*"
          }
        ]
      }]
    }]
  }
}

Wildcard Usage Examples

The following examples show the two ways to specify the wildcard "*" with the create command. Both map to the Data Lake store named egS3Store in the storage configuration, which contains the airbnb sample dataset in a folder named json.

The following example uses the create command to dynamically create collections for the files in the path /json/ in the egS3Store data store. It uses the collectionName() function to name the collections after the filenames in the specified path.

use sampleDB
db.runCommand({ "create" : "*", "dataSources" : [{ "storeName" : "egS3Store", "path": "/json/{collectionName()}"}]})

The previous command returns the following output:

{ "ok" : 1 }

The following commands show that the collection was successfully created:

> show collections
airbnb
> db.runCommand({"storageGetConfig" : 1 })
{
  "ok" : 1,
  "storage" : {
    "stores" : [{
      "name" : "egS3Store",
      "provider" : "s3",
      "region" : "us-east-2",
      "bucket" : "sbx-data-lake",
      "delimiter" : "/",
      "prefix" : ""
    }],
    "databases" : [{
      "name" : "sampleDB",
      "collections" : [{
        "name" : "*",
        "dataSources" : [{
          "storeName" : "egS3Store",
          "path" : "/json/{collectionName()}"
        }]
      }]
    }]
  }
}
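To make the naming rule concrete, the following sketch shows one way a path template containing {collectionName()} could be matched against an object path to derive a collection name. It is an illustration only and not Data Lake's actual matching logic: collectionNameFor is a hypothetical function, and it assumes the placeholder stands for a single path segment and ignores any template text after the placeholder.

```javascript
// Illustration only (an assumption, not Data Lake's implementation):
// derive a collection name by treating {collectionName()} as one path segment.
function collectionNameFor(template, objectPath) {
  const placeholder = "{collectionName()}";
  const index = template.indexOf(placeholder);
  if (index === -1) {
    return null; // the template has no wildcard collection function
  }
  const prefix = template.slice(0, index);
  if (!objectPath.startsWith(prefix)) {
    return null; // the object lies outside the templated path
  }
  // Take the first path segment that follows the fixed prefix.
  const segment = objectPath.slice(prefix.length).split("/")[0];
  return segment.length > 0 ? segment : null;
}

console.log(collectionNameFor("/json/{collectionName()}", "/json/airbnb"));    // "airbnb"
console.log(collectionNameFor("/json/{collectionName()}", "/egData/weather")); // null
```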

The following example uses the wildcard "*" in the path parameter to create a collection named egCollection that maps to all files and folders in the json folder of the Data Lake store named egS3Store.

use sampleDB
db.runCommand({ "create" : "egCollection", "dataSources" : [{ "storeName" : "egS3Store", "path": "/json/*"}]})

The previous command returns the following output:

{ "ok" : 1 }

The following commands show that the collection was successfully created:

> show collections
egCollection
> db.runCommand({"storageGetConfig" : 1 })
{
  "ok" : 1,
  "storage" : {
    "stores" : [{
      "name" : "egS3Store",
      "provider" : "s3",
      "region" : "us-east-2",
      "bucket" : "sbx-data-lake",
      "delimiter" : "/",
      "prefix" : ""
    }],
    "databases" : [{
      "name" : "sampleDB",
      "collections" : [{
        "name" : "egCollection",
        "dataSources" : [{
          "storeName" : "egS3Store",
          "path" : "/json/*"
        }]
      }]
    }]
  }
}

Verify Collection

You can verify that the command successfully created the collection by running either of the following commands:

show collections
db.runCommand({ "storageGetConfig" : 1 })
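The storageGetConfig check can also be scripted. The following is a minimal sketch that searches a response document for a collection. findCollection is a hypothetical helper, and sampleConfig is a hand-written, abbreviated stand-in shaped like the storageGetConfig output shown in the examples on this page, not a live response.

```javascript
// Hypothetical helper (an assumption, not a Data Lake API): returns the
// collection entry from a storageGetConfig-style response, or null.
function findCollection(config, databaseName, collectionName) {
  const databases = (config.storage && config.storage.databases) || [];
  const database = databases.find((entry) => entry.name === databaseName);
  if (!database) {
    return null;
  }
  const collections = database.collections || [];
  return collections.find((entry) => entry.name === collectionName) || null;
}

// Abbreviated stand-in for a storageGetConfig response.
const sampleConfig = {
  ok: 1,
  storage: {
    databases: [{
      name: "sampleDB",
      collections: [{
        name: "airbnb",
        dataSources: [{ storeName: "egS3Store", path: "/json/airbnb", defaultFormat: ".json" }],
      }],
    }],
  },
};

const found = findCollection(sampleConfig, "sampleDB", "airbnb");
console.log(found !== null);              // true
console.log(found.dataSources[0].path);   // "/json/airbnb"
```

In a shell connected to Data Lake, the first argument would instead be the result of db.runCommand({ "storageGetConfig" : 1 }).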

Troubleshoot Errors

If the command fails, it returns one of the following errors:

{
        "ok" : 0,
        "errmsg" : "store name does not exist",
        "code" : 9,
        "codeName" : "FailedToParse"
}

Solution: Ensure that the specified storeName matches the name of a store in the stores array. You can run the listStores command to retrieve the list of stores in your Data Lake storage configuration.

{
        "ok" : 0,
        "errmsg" : "collection name already exists in the database",
        "code" : 9,
        "codeName" : "FailedToParse"
}

Solution: Ensure that the collection name is unique. You can run the show collections command to retrieve the list of existing collections.
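The two errors above can be handled in a script by inspecting the ok and errmsg fields of the result. The following is a minimal sketch; explainCreateResult is a hypothetical helper, and matching on the exact errmsg strings shown on this page is an assumption that may not hold if the messages change.

```javascript
// Suggested fixes keyed by the error messages documented above.
const CREATE_ERROR_SOLUTIONS = {
  "store name does not exist":
    "Check that storeName matches a store in the stores array (run listStores).",
  "collection name already exists in the database":
    "Choose a unique collection name (run show collections).",
};

// Hypothetical helper (an assumption, not a Data Lake API): turns a create
// command result into a short, human-readable message.
function explainCreateResult(result) {
  if (result.ok === 1) {
    return "Collection created.";
  }
  const solution = CREATE_ERROR_SOLUTIONS[result.errmsg];
  return solution || "Unrecognized error: " + result.errmsg;
}

console.log(explainCreateResult({ ok: 1 }));
console.log(explainCreateResult({
  ok: 0,
  errmsg: "store name does not exist",
  code: 9,
  codeName: "FailedToParse",
}));
```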