Sample Training Dataset

The sample_training database contains a set of realistic data used in MongoDB Private Training Offerings. This dataset is based on publicly available data sources such as Crunchbase, NYC OpenData, and Open Flights, which are described in the collection sections below.

These realistic datasets are used by our students to explore MongoDB’s functionality across our private training labs and exercises.

To learn how to load the sample data provided by Atlas into your cluster, see Load Sample Data into Your Cluster.

Collections

The sample_training database contains the following collections:

Collection Name | Description
companies | Contains company information from Crunchbase.
grades | Contains student grade information for a given class, including scores on different assessments.
inspections | Contains a list of New York City business inspections, including whether the business passed or failed the inspection.
posts | Contains randomized US Senate speeches organized as blog posts with randomly generated comments.
routes | Contains information on airline routes, with source and destination airports, the operating airline, and the type of airplane. This collection is used in labs that explore the $graphLookup aggregation stage.
stories | Contains stories from Digg, a website for sharing and commenting on online content.
trips | Contains trip data from the New York City Citibike service. This data is useful for exploring the $graphLookup aggregation stage and showcasing geospatial queries.
tweets | Contains tweet data from the Twitter Decahose stream service.
zips | Contains general postal/zip code data for United States cities.

sample_training.companies

This collection contains information on companies listed on Crunchbase, such as each company's website and blog URLs, funding rounds, and known individuals associated with the company.
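As an illustration of how these fields fit together, the following minimal sketch pulls a few headline values out of a companies document. It assumes the document has already been decoded into native Python types (as a driver such as PyMongo would return it); the helper name summarize_company and the trimmed company dictionary are hypothetical, with values copied from the sample document on this page.

```python
from typing import Any


def summarize_company(doc: dict[str, Any]) -> dict[str, Any]:
    """Collect a few headline fields from a companies document."""
    offices = doc.get("offices") or []
    relationships = doc.get("relationships") or []
    return {
        "name": doc.get("name"),
        "founded_year": doc.get("founded_year"),
        # Each office is a subdocument; keep only offices that have a city.
        "cities": [o["city"] for o in offices if o.get("city")],
        # Each relationship embeds a "person" subdocument.
        "people": [
            f'{r["person"]["first_name"]} {r["person"]["last_name"]}'
            for r in relationships
        ],
        "funding_round_count": len(doc.get("funding_rounds") or []),
    }


# Trimmed copy of the sample document, with numbers as native ints.
company = {
    "name": "Mobiance",
    "founded_year": 2004,
    "funding_rounds": [],
    "offices": [{"city": "Bangalore", "country_code": "IND"}],
    "relationships": [
        {
            "is_past": True,
            "person": {"first_name": "Ritesh", "last_name": "Ambastha"},
            "title": "Product Manager",
        }
    ],
}

print(summarize_company(company))
```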

Indexes

This collection contains the following indexes:

Name | Index | Description
_id_ | { "_id": 1 } | Primary key index on the _id field.

Sample Document

{
  "_id": {
      "$oid": "52cdef7c4bab8bd675298291"
  },
  "acquisition": null,
  "acquisitions": [],
  "alias_list": null,
  "blog_feed_url": "http://mobiance.wordpress.com/feed/",
  "blog_url": "http://mobiance.wordpress.com/",
  "category_code": "web",
  "competitions": [],
  "created_at": "Tue Feb 12 17:31:58 UTC 2008",
  "crunchbase_url": "http://www.crunchbase.com/company/mobiance",
  "deadpooled_day": null,
  "deadpooled_month": null,
  "deadpooled_url": null,
  "deadpooled_year": null,
  "description": null,
  "email_address": "info@mobiance.com",
  "external_links": [],
  "founded_day": {
      "$numberInt": "1"
  },
  "founded_month": {
      "$numberInt": "10"
  },
  "founded_year": {
      "$numberInt": "2004"
  },
  "funding_rounds": [],
  "homepage_url": "http://www.mobiance.com",
  "image": {
      "attribution": null,
      "available_sizes": [
          [
              [
                  {
                      "$numberInt": "150"
                  },
                  {
                      "$numberInt": "43"
                  }
              ],
              "assets/images/resized/0001/1859/11859v1-max-150x150.png"
          ],
          [
              [
                  {
                      "$numberInt": "208"
                  },
                  {
                      "$numberInt": "60"
                  }
              ],
              "assets/images/resized/0001/1859/11859v1-max-250x250.png"
          ],
          [
              [
                  {
                      "$numberInt": "208"
                  },
                  {
                      "$numberInt": "60"
                  }
              ],
              "assets/images/resized/0001/1859/11859v1-max-450x450.png"
          ]
      ]
  },
  "investments": [],
  "ipo": null,
  "milestones": [],
  "name": "Mobiance",
  "number_of_employees": {
      "$numberInt": "5"
  },
  "offices": [
      {
          "address1": "BC-3, Atrium Business Center,",
          "address2": "Coles Road, Frazer Town,",
          "city": "Bangalore",
          "country_code": "IND",
          "description": null,
          "latitude": null,
          "longitude": null,
          "state_code": null,
          "zip_code": "560005"
      }
  ],
  "overview": "<p>Mobiance provides the technology to track cell phones ...",
  "partners": [],
  "permalink": "mobiance",
  "phone_number": "+91-80- 41264756",
  "products": [],
  "providerships": [],
  "relationships": [
      {
          "is_past": true,
          "person": {
              "first_name": "Ritesh",
              "last_name": "Ambastha",
              "permalink": "ritesh-ambastha"
          },
          "title": "Product Manager"
      }
  ],
  "screenshots": [],
  "tag_list": null,
  "total_money_raised": "$0",
  "twitter_username": null,
  "updated_at": "Thu Dec 01 07:37:10 UTC 2011",
  "video_embeds": []
}
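Several sample documents on this page wrap values in Extended JSON type markers such as $oid, $numberInt, $numberDouble, and $date. PyMongo's bson.json_util module provides a full decoder for this format; the sketch below is a simplified standard-library stand-in that handles only those few wrappers, and the names decode_wrapper and EPOCH are hypothetical.

```python
import json
from datetime import datetime, timedelta, timezone
from typing import Any

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def decode_wrapper(obj: dict[str, Any]) -> Any:
    """object_hook for json.loads: unwrap a few Extended JSON wrappers."""
    if len(obj) != 1:
        return obj
    key, value = next(iter(obj.items()))
    if key in ("$numberInt", "$numberLong"):
        return int(value)
    if key == "$numberDouble":
        return float(value)
    if key == "$oid":
        return value  # kept as the 24-character hex string
    if key == "$date" and isinstance(value, int):
        # json.loads decodes inner objects first, so the nested
        # {"$numberLong": "..."} has already become an int (milliseconds).
        return EPOCH + timedelta(milliseconds=value)
    return obj


raw = (
    '{"_id": {"$oid": "52cdef7c4bab8bd675298291"},'
    ' "name": "Mobiance",'
    ' "founded_year": {"$numberInt": "2004"}}'
)
decoded = json.loads(raw, object_hook=decode_wrapper)
print(decoded)
```

The hook leaves any object it does not recognize untouched, so ordinary subdocuments pass through unchanged.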

sample_training.grades

This collection contains randomly generated student grades. Each document has a class_id that identifies the class and a student_id that identifies the student. The student's scores for that class are stored in the scores array, whose subdocuments each have two fields: type, the kind of assessment, and score, the student's score on that assessment.
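To show how the scores array can be used, here is a minimal sketch that averages one student's scores per assessment type. It assumes the document has been decoded into native Python floats; the function name average_by_type is hypothetical. The PIPELINE constant is a roughly analogous aggregation pipeline that computes the same averages across every document in the collection rather than for a single student.

```python
from collections import defaultdict
from statistics import mean


def average_by_type(doc: dict) -> dict[str, float]:
    """Average the scores of one grades document per assessment type."""
    by_type: dict[str, list[float]] = defaultdict(list)
    for entry in doc["scores"]:
        by_type[entry["type"]].append(entry["score"])
    return {kind: mean(values) for kind, values in by_type.items()}


# Collection-wide analogue expressed as an aggregation pipeline.
PIPELINE = [
    {"$unwind": "$scores"},
    {"$group": {"_id": "$scores.type", "average": {"$avg": "$scores.score"}}},
]

# The sample document from this page, with native Python numbers.
grade = {
    "class_id": 173,
    "student_id": 4,
    "scores": [
        {"type": "exam", "score": 19.81430597438296},
        {"type": "quiz", "score": 16.851404299968642},
        {"type": "homework", "score": 60.108751761488186},
        {"type": "homework", "score": 22.886167083915776},
    ],
}

print(average_by_type(grade))
```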

Indexes

This collection contains the following indexes:

Name | Index | Description
_id_ | { "_id": 1 } | Primary key index on the _id field.

Sample Document

{
    "_id": {
        "$oid": "56d5f7eb604eb380b0d8d8fa"
    },
    "class_id": {
        "$numberDouble": "173"
    },
    "scores": [
        {
            "score": {
                "$numberDouble": "19.81430597438296"
            },
            "type": "exam"
        },
        {
            "score": {
                "$numberDouble": "16.851404299968642"
            },
            "type": "quiz"
        },
        {
            "score": {
                "$numberDouble": "60.108751761488186"
            },
            "type": "homework"
        },
        {
            "score": {
                "$numberDouble": "22.886167083915776"
            },
            "type": "homework"
        }
    ],
    "student_id": {
        "$numberDouble": "4"
    }
}

sample_training.inspections

The inspections collection was taken from the NYC OpenData dataset. Each inspections document contains information about:

  • The inspected business name, sector and address,
  • Inspection id, result, date and certificate number.
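In the sample document, the inspection date is stored as a string such as "Mar  3 2015" rather than as a date type. The sketch below shows one way to turn that string into a Python date and to test the result field. The helper names are hypothetical, and the format string assumes every date follows the pattern seen in the sample document and that month abbreviations are in English.

```python
from datetime import date, datetime


def parse_inspection_date(text: str) -> date:
    """Parse a date string such as "Mar  3 2015" into a date object."""
    # Collapse the padding spaces used for single-digit days.
    normalized = " ".join(text.split())
    return datetime.strptime(normalized, "%b %d %Y").date()


def failed(doc: dict) -> bool:
    """Return True when the inspection result is "Fail"."""
    return doc.get("result") == "Fail"


# Trimmed copy of the sample document from this page.
inspection = {
    "id": "11247-2015-ENFO",
    "business_name": "SPRAGUE OPERATING RESOURCES LLC.",
    "date": "Mar  3 2015",
    "result": "Fail",
}

print(parse_inspection_date(inspection["date"]), failed(inspection))
```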

Indexes

This collection contains the following indexes:

Name | Index | Description
_id_ | { "_id": 1 } | Primary key index on the _id field.

Sample Document

{
   "_id": {
     "$oid": "56d61033a378eccde8a8357e"
   },
   "address": {
       "city": "LAWRENCE",
       "number": 1,
       "street": "BAY BLVD",
       "zip": 11559
   },
   "business_name": "SPRAGUE OPERATING RESOURCES LLC.",
   "certificate_number": 3019422,
   "date": "Mar  3 2015",
   "id": "11247-2015-ENFO",
   "result": "Fail",
   "sector": "Fuel Oil Dealer - 814"
}

sample_training.posts

The posts collection is a set of randomly generated blog posts created using US Senate speeches as the seed for the document body field. Each document contains:

  • Information on the blog posts like body text, author, permalink, date and title,
  • Randomly generated list of tags,
  • Randomly generated list of comment subdocuments.
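As a small illustration of working with the tags and comments arrays, the sketch below counts how many posts carry each tag and how many comments each post has. The helper names are hypothetical, the first post is a trimmed copy of the sample document, and the second post is invented purely for the example.

```python
from collections import Counter


def tag_counts(posts: list[dict]) -> Counter:
    """Count the number of posts in which each tag appears."""
    counts: Counter = Counter()
    for post in posts:
        # Use a set so a tag repeated within one post counts once.
        counts.update(set(post.get("tags", [])))
    return counts


def comment_count(post: dict) -> int:
    """Return the number of comment subdocuments on a post."""
    return len(post.get("comments", []))


posts = [
    {
        "title": "Bill of Rights",
        "tags": ["watchmaker", "santa", "math"],
        "comments": [
            {"author": "Santiago Dollins"},
            {"author": "Jaclyn Morado"},
        ],
    },
    # Hypothetical second post, not taken from the dataset.
    {"title": "Example post", "tags": ["math", "dream"], "comments": []},
]

print(tag_counts(posts).most_common(1), [comment_count(p) for p in posts])
```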

Indexes

This collection contains the following indexes:

Name | Index | Description
_id_ | { "_id": 1 } | Primary key index on the _id field.

Sample Document

{
    "_id": {
      "$oid": "50ab0f8bbcf1bfe2536dc3f9"
    },
    "author": "machine",
    "body": "Amendment I\n<p>Congress shall make no law respecting ...  ",
    "comments": [
        {
            "author": "Santiago Dollins",
            "body": "Lorem ipsum dolor sit amet, consectetur adipisicing...",
            "email": "HvizfYVx@pKvLaagH.com"
        },
        {
            "author": "Jaclyn Morado",
            "body": "Lorem ipsum dolor sit amet, consectetur adipisicing...",
            "email": "WpOUCpdD@hccdxJvT.com"
        }
        ...
    ],
    "date": {
      "$date": {
        "$numberLong": "1332804016000"
      }
    },
    "permalink": "aRjNnLZkJkTyspAIoRGe",
    "tags": [
        "watchmaker",
        "santa",
        "xylophone",
        "math",
        "handsaw",
        "dream",
        "undershirt",
        "dolphin",
        "tanker",
        "action"
    ],
    "title": "Bill of Rights"
}

sample_training.routes

The routes collection was sourced from Open Flights data. Its documents describe airline routes between airports.

Each document contains information about:

  • Airline data in a subdocument containing the name, alias, unique identifier and the IATA airline code,
  • The source and destination airports, identified by their IATA airport code,
  • Route codeshare and the number of stops.
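Because each route links a source airport to a destination airport, the collection forms a graph, which is why it suits labs on the $graphLookup stage. The sketch below is an in-memory analogue of that kind of traversal written as a breadth-first search; it is not the $graphLookup stage itself. The function name reachable is hypothetical, and every route other than BTK to OVB is invented for the example.

```python
from collections import defaultdict, deque


def reachable(routes: list[dict], origin: str, max_hops: int) -> dict[str, int]:
    """Map each airport reachable within max_hops flights to its hop count."""
    graph: dict[str, set[str]] = defaultdict(set)
    for route in routes:
        graph[route["src_airport"]].add(route["dst_airport"])

    hops = {origin: 0}
    queue = deque([origin])
    while queue:
        airport = queue.popleft()
        if hops[airport] == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in graph[airport]:
            if nxt not in hops:
                hops[nxt] = hops[airport] + 1
                queue.append(nxt)

    del hops[origin]  # report destinations only
    return hops


routes = [
    {"src_airport": "BTK", "dst_airport": "OVB"},  # from the sample document
    {"src_airport": "OVB", "dst_airport": "SVO"},  # hypothetical
    {"src_airport": "SVO", "dst_airport": "LED"},  # hypothetical
    {"src_airport": "SVO", "dst_airport": "BTK"},  # hypothetical
]

print(reachable(routes, "BTK", 2))
```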

Indexes

This collection contains the following indexes:

Name | Index | Description
_id_ | { "_id": 1 } | Primary key index on the _id field.

Sample Document

 {
     "_id": {
       "$oid": "56e9b39b732b6122f877fa5c"
     },
     "airline": {
         "alias": "2G",
         "iata": "CRG",
         "id": 1654,
         "name": "Cargoitalia"
     },
     "airplane": "A81",
     "codeshare": "",
     "dst_airport": "OVB",
     "src_airport": "BTK",
     "stops": 0
 }

sample_training.stories

The stories collection is sourced from Digg, a service for sharing web posts and stories. Each story document contains information about the posting, such as:

  • Hyperlink of the story post, number of comments,
  • User that posted the story,
  • Number of diggs (likes) in the story.

The documents in this collection showcase different sets of rich subdocuments. They are useful for teaching how to represent multi-dimensional data and how to query it with dot notation.
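To make the idea of dot notation concrete, the sketch below resolves a dotted path such as "user.name" against a nested document. It is a simplified illustration with a hypothetical helper name, get_path: it follows subdocument keys and explicit array indexes only, and does not emulate how MongoDB queries match across all elements of an array.

```python
from typing import Any


def get_path(doc: Any, path: str) -> Any:
    """Follow a dotted path through nested dicts and lists."""
    current = doc
    for part in path.split("."):
        if isinstance(current, list):
            current = current[int(part)]  # numeric part selects an element
        else:
            current = current[part]
    return current


# Trimmed copy of the sample document from this page.
story = {
    "diggs": 493,
    "topic": {"name": "General Sciences", "short_name": "general_sciences"},
    "shorturl": [{"short_url": "http://digg.com/d31LjHP", "view_count": 2109}],
    "user": {"name": "lekahe", "profileviews": 63011},
}

print(get_path(story, "user.name"))
print(get_path(story, "shorturl.0.view_count"))
```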

Indexes

This collection contains the following indexes:

Name | Index | Description
_id_ | { "_id": 1 } | Primary key index on the _id field.

Sample Document

{
    "_id": {
      "$oid": "4ba2681f238d3ba3ca000065"
    },
    "comments": 79,
    "container": {
        "name": "Science",
        "short_name": "science"
    },
    "description": "Today's youth are generally not the self-centered,...",
    "diggs": 493,
    "href": "http://digg.com/general_sciences/Study_Today_s_youth_a...",
    "id": "19955283",
    "link": "http://www.brainmysteries.com/research/Study_Todays_yo...",
    "media": "news",
    "promote_date": 1268866803,
    "shorturl": [
        {
            "short_url": "http://digg.com/d31LjHP",
            "view_count": 2109
        }
    ],
    "status": "popular",
    "submit_date": 1268725903,
    "thumbnail": {
        "contentType": "image/jpeg",
        "height": 80,
        "originalheight": 232,
        "originalwidth": 350,
        "src": "http://digg.com/general_sciences/Study_Today_s_youth_...",
        "width": 80
    },
    "title": "Study: Today's youth aren't ego-driven slackers after all ",
    "topic": {
        "name": "General Sciences",
        "short_name": "general_sciences"
    },
    "user": {
        "icon": "http://digg.com/users/lekahe/l.png",
        "name": "lekahe",
        "profileviews": 63011,
        "registered": 1187263066
    }
}

sample_training.trips

The trips collection contains bike trip data from the New York City Citibike service. The documents are composed of:

  • Bicycle unique identifier,
  • Trip start and stop time and date,
  • Trip start and end station names and geospatial locations,
  • User information such as gender, year of birth and service type (Customer or Subscriber).
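The station locations are stored as GeoJSON Point objects, whose coordinates are ordered longitude first, then latitude. As a rough illustration of what can be done with them, the sketch below estimates the straight-line distance between the start and end stations with the haversine formula. The function name and the Earth radius constant are choices made for this example, not part of the dataset.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(a: list[float], b: list[float]) -> float:
    """Great-circle distance in km between two [longitude, latitude] pairs."""
    lon1, lat1 = map(radians, a)
    lon2, lat2 = map(radians, b)
    h = (
        sin((lat2 - lat1) / 2) ** 2
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


# Station locations copied from the sample document on this page.
trip = {
    "start station location": {
        "type": "Point",
        "coordinates": [-73.9734419, 40.74780373],
    },
    "end station location": {
        "type": "Point",
        "coordinates": [-73.98057249, 40.72955361],
    },
}

distance = haversine_km(
    trip["start station location"]["coordinates"],
    trip["end station location"]["coordinates"],
)
print(f"{distance:.2f} km")
```

For the sample trip this gives a little over 2 km, which is the straight-line distance and not the distance actually ridden.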

Indexes

This collection contains the following indexes:

Name | Index | Description
_id_ | { "_id": 1 } | Primary key index on the _id field.

Sample Document

{
    "_id": {
      "$oid": "572bb8222b288919b68abf82"
    },
    "bikeid": 14785,
    "birth year": 1977,
    "end station id": 433,
    "end station location": {
        "coordinates": [
            -73.98057249,
            40.72955361
        ],
        "type": "Point"
    },
    "end station name": "E 13 St & Avenue A",
    "gender": 1,
    "start station id": 518,
    "start station location": {
        "coordinates": [
            -73.9734419,
            40.74780373
        ],
        "type": "Point"
    },
    "start station name": "E 39 St & 2 Ave",
    "start time": {
      "$date": {
        "$numberLong": "1332804016000"
      }
    },
    "stop time": {
      "$date": {
        "$numberLong": "1352114016000"
      }
    },
    "tripduration": 812,
    "usertype": "Subscriber"
}

sample_training.tweets

The tweets collection is composed of real tweets posted on Twitter. The documents in this collection exemplify rich document structures such as subdocuments, arrays embedded in subdocuments, and arrays of subdocuments. They also use a variety of data types across their fields.

For information on querying embedded documents, see Query on Embedded/Nested Documents in the MongoDB manual.
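As one example of reading these nested structures, the sketch below walks the entities.hashtags array and uses each entry's indices pair to slice the matching span out of the tweet text. The helper name hashtag_spans is hypothetical, and the code assumes the indices are start and end offsets into the text field, as they appear to be in the sample document.

```python
def hashtag_spans(tweet: dict) -> list[tuple[str, str]]:
    """Pair each hashtag's text with the slice of the tweet it points to."""
    text = tweet["text"]
    spans = []
    for tag in tweet["entities"]["hashtags"]:
        start, end = tag["indices"]
        spans.append((tag["text"], text[start:end]))
    return spans


# Trimmed copy of the sample document from this page.
tweet = {
    "text": "#IEGreekStepShow I'm there.. Are you?",
    "entities": {
        "hashtags": [{"indices": [0, 16], "text": "IEGreekStepShow"}],
        "urls": [],
        "user_mentions": [],
    },
    "user": {"screen_name": "juanyoungonline", "followers_count": 988},
}

print(hashtag_spans(tweet))
```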

Indexes

This collection contains the following indexes:

Name | Index | Description
_id_ | { "_id": 1 } | Primary key index on the _id field.

Sample Document

{
  "_id": {
    "$oid": "5c8eccb0caa187d17ca62433"
  },
  "contributors": null,
  "coordinates": null,
  "created_at": "Thu Sep 02 18:11:30 +0000 2010",
  "entities": {
      "hashtags": [
          {
              "indices": [
                  0,
                  16
              ],
              "text": "IEGreekStepShow"
          }
      ],
      "urls": [],
      "user_mentions": []
  },
  "favorited": false,
  "geo": null,
  "id": "22819404700",
  "in_reply_to_screen_name": null,
  "in_reply_to_status_id": null,
  "in_reply_to_user_id": null,
  "place": null,
  "retweet_count": null,
  "retweeted": false,
  "source": "<a href=\"http://blackberry.com/twitter\" ...",
  "text": "#IEGreekStepShow I'm there.. Are you?",
  "truncated": false,
  "user": {
      "contributors_enabled": false,
      "created_at": "Mon Apr 13 23:24:53 +0000 2009",
      "description": "We are a team of people who specialize in: ...",
      "favourites_count": 11,
      "follow_request_sent": null,
      "followers_count": 988,
      "following": null,
      "friends_count": 1371,
      "geo_enabled": false,
      "id": 30990697,
      "lang": "en",
      "listed_count": 29,
      "location": "ANYWHERE We are NEEDED!!!",
      "name": "Juan Young II",
      "notifications": null,
      "profile_background_color": "000000",
      "profile_background_image_url": "http://a1.twimg.com/profile_backg...",
      "profile_background_tile": true,
      "profile_image_url": "http://a2.twimg.com/profile_images/110364584...",
      "profile_link_color": "CC3300",
      "profile_sidebar_border_color": "FFFFFF",
      "profile_sidebar_fill_color": "F7DA93",
      "profile_text_color": "000000",
      "profile_use_background_image": true,
      "protected": false,
      "screen_name": "juanyoungonline",
      "show_all_inline_media": false,
      "statuses_count": 14804,
      "time_zone": "Pacific Time (US & Canada)",
      "url": "http://www.hard2please-ent.com",
      "utc_offset": -28800,
      "verified": false
  }
}

sample_training.zips

The zips collection contains information on US cities and their postal/zip codes. Each document contains the city name, zip code, the geographic coordinates of the city center, state and population.

This dataset is used to explore 2d Index creation and queries.
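To give a feel for the kind of query a 2d index supports, the sketch below builds a $geoWithin filter with a $box shape and, separately, applies the same rectangle test in plain Python to the sample document's loc field. The helper names box_filter and in_box are hypothetical, and the box corners are arbitrary values chosen to surround the sample point.

```python
def box_filter(bottom_left: tuple[float, float],
               upper_right: tuple[float, float]) -> dict:
    """Build a query filter matching documents whose loc lies in a box."""
    return {
        "loc": {"$geoWithin": {"$box": [list(bottom_left), list(upper_right)]}}
    }


def in_box(doc: dict,
           bottom_left: tuple[float, float],
           upper_right: tuple[float, float]) -> bool:
    """In-memory version of the same rectangle test on loc.x and loc.y."""
    x, y = doc["loc"]["x"], doc["loc"]["y"]
    return (
        bottom_left[0] <= x <= upper_right[0]
        and bottom_left[1] <= y <= upper_right[1]
    )


# The sample document from this page.
zip_doc = {
    "city": "CLEVELAND",
    "loc": {"x": 86.559355, "y": 33.992106},
    "pop": 2369,
    "state": "AL",
    "zip": "35049",
}

print(box_filter((86, 33), (87, 34)))
print(in_box(zip_doc, (86, 33), (87, 34)))
```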

Indexes

This collection contains the following indexes:

Name | Index | Description
_id_ | { "_id": 1 } | Primary key index on the _id field.

Sample Document

{
  "_id": {
    "$oid": "5c8eccc1caa187d17ca6ed29"
  },
  "city": "CLEVELAND",
  "loc": {
      "x": 86.559355,
      "y": 33.992106
  },
  "pop": 2369,
  "state": "AL",
  "zip": "35049"
}