Optimize Data Lake Query Performance

Beta

The Atlas Data Lake is available as a Beta feature. The product and the corresponding documentation may change at any time during the Beta stage. For support, see Atlas Support.

The performance of your Atlas Data Lake is affected by the following factors:

  • The structure of your data in S3 and how you represent it in your Atlas Data Lake configuration file.
  • The size of your data files.
  • The format and structure of your data files.

Data Structure in S3

For easier management, make sure that your data is logically grouped into partitions. You can leverage partitions to improve Data Lake performance by mapping them to partition attributes in your configuration file.

You can improve your Data Lake’s performance by ensuring that your partition structure maps to your query patterns and that it is defined in your configuration file. By mapping your partition attributes (the parts of your S3 prefix that look like folders) to a query attribute, Data Lake can selectively open only the files that contain data related to your query. This both reduces the amount of time a query takes and decreases cost, since Data Lake reads and downloads fewer files from AWS.

Example

Consider an S3 bucket named metrics with the following structure:

metrics
|--hardware
|--software
   |--computer
   |--phone

You can set a partition attribute for “metric type” by defining /metrics/{metric_type string}/* in your configuration file. If you issue a query that contains {metric_type: software}, Data Lake only scans the files with the prefix /software and ignores files with the prefix /hardware.

You can then set a partition attribute for “software type” by defining /metrics/{metric_type string}/{software_type string} in your configuration file. If you issue a query that contains {metric_type: software, software_type: computer}, Data Lake ignores files with the prefix /phone.
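The pruning behavior described above can be sketched in a few lines of Python. This is an illustration of the concept, not the Data Lake implementation; the key layout and attribute names mirror the metrics example on this page:

```python
# Illustrative sketch (not the Data Lake implementation): given partition
# attributes parsed from S3 key prefixes, select only the keys whose path
# segments match the query's partition values.

def partition_pruned_keys(keys, query, template=("metric_type", "software_type")):
    """Return the S3 keys Data Lake would need to scan for `query`.

    `keys` are S3 object keys like "metrics/software/computer/file2.json".
    `template` names the partition attributes in path order; both the
    template and the key layout here are assumptions for illustration.
    """
    selected = []
    for key in keys:
        # Path segments after the bucket-level prefix map to partition attributes.
        segments = key.split("/")[1:]  # drop the "metrics" prefix
        attrs = dict(zip(template, segments))
        if all(attrs.get(field) == value for field, value in query.items()):
            selected.append(key)
    return selected

keys = [
    "metrics/hardware/file1.json",
    "metrics/software/computer/file2.json",
    "metrics/software/phone/file3.json",
]

# A query on {metric_type: "software", software_type: "computer"} prunes
# everything under /hardware and /phone.
print(partition_pruned_keys(keys, {"metric_type": "software", "software_type": "computer"}))
```

Only the file under /software/computer is scanned; the /hardware and /phone prefixes are skipped entirely.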

For more information on mapping partition attributes to a collection path, see Path Syntax Examples.
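As a rough sketch of how the path from the example above might appear in a configuration file: the path value uses the syntax shown on this page, but the surrounding field names (databases, collections, dataSources, storeName) and all other values are illustrative assumptions; see Path Syntax Examples for the authoritative format.

```json
{
  "databases": [
    {
      "name": "metricsdb",
      "collections": [
        {
          "name": "metrics",
          "dataSources": [
            {
              "storeName": "metricsStore",
              "path": "/metrics/{metric_type string}/{software_type string}"
            }
          ]
        }
      ]
    }
  ]
}
```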

Data File Size

Each file that Data Lake handles requires a certain amount of compute resources. If your data store contains many small data files, the resources required compound and can reduce performance. Conversely, very large data files are also problematic, because Data Lake must then download and scan data that is unnecessary for your query.

For most use cases, a performant file size is 100 to 200 MB.
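As an illustration of this guidance, the following sketch (not an Atlas tool) classifies files against the 100 to 200 MB range. The file list here is hypothetical; in practice you would list your S3 objects (for example, with the AWS CLI or SDK) and feed their sizes in:

```python
# Sketch: flag data files whose size falls outside the recommended range.

MB = 1024 * 1024
RECOMMENDED_MIN = 100 * MB
RECOMMENDED_MAX = 200 * MB

def audit_file_sizes(files):
    """Classify (name, size_in_bytes) pairs against the 100-200 MB guidance."""
    report = {"too_small": [], "ok": [], "too_large": []}
    for name, size in files:
        if size < RECOMMENDED_MIN:
            report["too_small"].append(name)
        elif size > RECOMMENDED_MAX:
            report["too_large"].append(name)
        else:
            report["ok"].append(name)
    return report

# Hypothetical S3 objects and sizes for illustration.
files = [
    ("metrics/software/computer/small.json", 5 * MB),
    ("metrics/software/computer/right-sized.json", 150 * MB),
    ("metrics/hardware/huge.json", 2048 * MB),
]
print(audit_file_sizes(files))
```

Files in the "too_small" bucket are candidates for consolidation; files in "too_large" are candidates for splitting along your partition attributes.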

Data File Format

Atlas Data Lake supports several data file formats. You can improve performance by compressing certain file formats or by optimizing file contents for your queries.

Compression

When you compress data files, they take less time to download. The time saved downloading a compressed file typically outweighs the extra time Data Lake spends decompressing it, so compression is a net performance win.
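As a rough, self-contained illustration of the size savings (actual ratios depend entirely on your data), gzip-compressing repetitive JSON-like records shrinks them substantially:

```python
import gzip

# Hypothetical repetitive metrics records; your real data files live in S3.
records = b'{"metric_type": "software", "value": 42}\n' * 10_000
compressed = gzip.compress(records)

print(len(records), len(compressed))           # the compressed copy is far smaller
assert gzip.decompress(compressed) == records  # compression is lossless
```

Less data on the wire means less time spent downloading from S3 before scanning can begin.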

You can compress supported data file formats using gzip.

File Structure

Parquet, Avro, and ORC files contain metadata about the file itself so that an application can traverse the file contents in different ways. If you structure your data file to align with the queries you want to run, Atlas Data Lake can leverage this metadata to quickly jump to the right data.

Of these formats, Parquet files provide the best performance and space efficiency for Atlas Data Lake, which is optimized to parse Parquet row groups and column chunks.