Files and Directories

Working with data in your filesystems

If your Dremio cluster is connected to Amazon S3, HDFS, or your NAS, you can query directories and files stored in these data sources.

Dremio supports many file formats, including Parquet, JSON, delimited files, Excel, and others. If your files are compressed, Dremio can query them directly.

For Dremio to query a file or directory, it must first be configured as a dataset.

Files

Individual files can be configured as datasets by clicking the dataset configuration button. Hover over the file you want to configure and the configuration button appears on the right.

Click the button on the right that shows a directory pointing to a directory with a table icon. A dialog opens that lets you configure the dataset. The options in this dialog depend on the format of the file. For a TXT file, for example, you would configure the delimiters and other options.

After you click Save, you can see the dataset you created.

And if you navigate back to the directory where the file is stored, it is listed as a physical dataset.

Directories

Groups of files with the same structure in a common directory can be queried together as if they were a single table. To configure a directory as a dataset, navigate into the filesystem data source you have set up in Dremio, such as HDFS. You will see a list of directories.

If you click on this directory, you can see that there are many files.

As described above, you can configure each of these files as a dataset that Dremio can query. Alternatively, if all the files share a common structure, you can configure the directory as a dataset, and all the files will be queried together as if they were a single table. To configure the directory, first hover over the directory to view the configuration button.

Click the button on the right that shows a directory pointing to a directory with a table icon. Next you will see the dialog for configuring the data in the directory, similar to the dialog for configuring a single file.

Dremio will sample several files in the directory to guide you through the setup. The options presented here will depend on the format of the files in the directory.

After you click Save, you should see the contents of the directory as a single dataset.

And now if you return to this data source, the directory is listed as a physical dataset instead of a directory.

Partitioned Datasets

When working with partitioned datasets, Dremio automatically discovers partition directory structures and makes partition values available as additional fields for that dataset.

For a dataset partitioned by year, month, and day, Dremio includes three additional fields (named dirN) that hold the partition values: dir0 for the top-level directory (year), dir1 for the second level (month), and dir2 for the third level (day).

dir0    dir1        dir2
2018    February    15
2018    February    14
2018    February    13
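
To make the dirN naming concrete, here is a minimal Python sketch of how partition fields could be derived from a file's location under the dataset root. It is an illustration of the idea only, not Dremio's implementation; the function name `partition_fields`, the root `/data/events`, and the file name are hypothetical.

```python
from pathlib import PurePosixPath


def partition_fields(dataset_root: str, file_path: str) -> dict:
    """Map each directory level between the dataset root and the file to a dirN field."""
    relative = PurePosixPath(file_path).relative_to(PurePosixPath(dataset_root))
    # Every path component except the file name itself is treated as a partition level.
    levels = relative.parts[:-1]
    return {f"dir{i}": value for i, value in enumerate(levels)}


# Hypothetical layout (an assumption for this example): <root>/<year>/<month>/<day>/<file>
fields = partition_fields(
    "/data/events",
    "/data/events/2018/February/15/part-0.parquet",
)
print(fields)  # {'dir0': '2018', 'dir1': 'February', 'dir2': '15'}
```

A file that sits directly in the dataset root has no partition levels, so the sketch returns an empty mapping for it.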

When you query these datasets, filters on the partition columns let Dremio access and scan only the relevant partitions, which can greatly improve query performance.
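
The effect of a partition filter can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not how Dremio plans queries: the file listing is hypothetical, only equality filters are handled, and `prune_partitions` is an illustrative name.

```python
def prune_partitions(files, filters):
    """Keep only files whose partition values satisfy every equality filter."""
    return [
        path
        for path, partitions in files
        if all(partitions.get(field) == value for field, value in filters.items())
    ]


# Hypothetical file listing: (path, partition values taken from the directory levels).
files = [
    ("2018/February/13/a.parquet", {"dir0": "2018", "dir1": "February", "dir2": "13"}),
    ("2018/February/14/b.parquet", {"dir0": "2018", "dir1": "February", "dir2": "14"}),
    ("2018/February/15/c.parquet", {"dir0": "2018", "dir1": "February", "dir2": "15"}),
]

# Similar in spirit to a query filter such as: WHERE dir2 = '15'
to_scan = prune_partitions(files, {"dir2": "15"})
print(to_scan)  # ['2018/February/15/c.parquet']
```

Only one of the three files is left to read, and the decision is made from directory names alone, before any file is opened.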

Additionally, when you run queries with filters on Parquet-based datasets, if some files contain only a single value for a field used in the filter condition, Dremio accesses and scans only the relevant files, even if there is no explicit directory structure for partitioning. Dremio achieves this by inspecting Parquet file footers and using that information for partition pruning at query time.
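
The following Python sketch models footer-based pruning. It assumes that a single-valued field can be recognized because the minimum and maximum statistics recorded for it in the footer are equal; how Dremio actually makes this determination is not described here. `ColumnStats`, `files_to_scan`, the file names, and the `country` field are all hypothetical, and the footers are simulated rather than read from real Parquet files.

```python
from dataclasses import dataclass


@dataclass
class ColumnStats:
    """Stand-in for the per-column min/max statistics kept in a Parquet footer."""
    minimum: object
    maximum: object


def files_to_scan(footers, field, value):
    """Skip files where `field` holds a single value that differs from `value`."""
    selected = []
    for path, columns in footers.items():
        stats = columns.get(field)
        single_valued = stats is not None and stats.minimum == stats.maximum
        if single_valued and stats.minimum != value:
            continue  # The footer alone shows that no row in this file can match.
        selected.append(path)
    return selected


# Simulated footers for three files in one flat directory (no partition directories).
footers = {
    "a.parquet": {"country": ColumnStats("DE", "DE")},
    "b.parquet": {"country": ColumnStats("US", "US")},
    "c.parquet": {"country": ColumnStats("AT", "US")},  # mixed values, must be scanned
}
print(files_to_scan(footers, "country", "US"))  # ['b.parquet', 'c.parquet']
```

In this sketch, files with mixed values or missing statistics are always kept, because the footer alone cannot rule them out.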