2.0 Release Notes

Enhancements

Acceleration

Starflake reflections: When creating reflections on datasets with joins, Dremio now keeps statistics and detects the relationship type of each join (e.g., 1-1, many-1). If the joins are non-expanding, Dremio can use this property to accelerate a larger set of queries. For example, if a user creates a reflection on a dataset that joins a fact table with three dimension tables, and the joins are non-expanding, Dremio can accelerate queries that include any subset of these joins (e.g., the fact table joined with just one of the dimension tables) without requiring multiple reflections.
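
To make the "non-expanding" property concrete, here is a minimal Python sketch. It is an illustration only, not Dremio's detection logic; the helper names and sample tables are hypothetical. A join is treated as non-expanding when every fact row matches exactly one dimension row, so the join preserves the fact table's row count and aggregates.

```python
from collections import Counter


def is_non_expanding(fact_rows, dim_rows, fact_key, dim_key):
    """Return True if every fact row matches exactly one dimension row."""
    dim_counts = Counter(row[dim_key] for row in dim_rows)
    return all(dim_counts[row[fact_key]] == 1 for row in fact_rows)


def inner_join(left, right, left_key, right_key):
    """Naive inner join used only to show the effect on row counts."""
    index = {}
    for r in right:
        index.setdefault(r[right_key], []).append(r)
    return [{**l, **r} for l in left for r in index.get(l[left_key], [])]


# Hypothetical fact and dimension tables.
sales = [
    {"sale_id": 1, "store_id": 10, "amount": 5},
    {"sale_id": 2, "store_id": 10, "amount": 7},
    {"sale_id": 3, "store_id": 20, "amount": 9},
]
stores = [{"store_id": 10, "city": "Oslo"}, {"store_id": 20, "city": "Lima"}]

joined = inner_join(sales, stores, "store_id", "store_id")
# Because the join is many-1 on a unique key, the joined result has the
# same number of rows as the fact table, and sums over "amount" are unchanged.
```

Because the row count is preserved, an aggregate computed over the joined data equals the same aggregate over the fact table alone, which is why one reflection on the full join can serve queries that use only a subset of the joins.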

External reflections: Dremio now supports external reflections, which allow summary tables or other digests built in external systems to be used within Dremio's reflection framework. Once defined using SQL commands, datasets in any data source that Dremio supports can be used by Dremio's cost-based optimizer to accelerate queries.

Better handling of missing dependencies: Dremio is now resilient to scenarios where a reflection fails to refresh because the data source is down or because the reflection is on an empty table.

Better handling of cyclical dependencies: If the user creates two reflections that can substitute for one another, Dremio ensures that only one substitutes for the other.

New reflection system tables: Reflection information can now be programmatically accessed using the following tables: sys.reflections, sys.materializations, sys.refreshes.

Reflections now pick up schema updates to underlying datasets in many cases: For example, Dremio automatically updates reflection definitions when users change column types or drop columns not referenced in a reflection.

Reflection REST API: Users can now create and manage reflections using the new Reflection REST API.

Improved reflection statuses: Individual reflections now offer more detailed status information. This includes whether a reflection is ready to be used for acceleration, the status of refreshes associated with that reflection, information about refresh failures, and whether the underlying dataset's schema changed after the reflection was created.

Support for enabling/disabling individual reflections: Reflections can now be enabled/disabled individually. The reflection administration page has also been overhauled and now lists reflections individually, grouped by dataset.

Manual reflection refresh: Users can request an immediate refresh of all reflections that depend on a given dataset from the UI or the REST API.

Propagation of reflection refresh interval changes: Datasets in a data source now inherit the refresh interval settings of their parent source whenever the settings for that source change. This behavior is disabled for a dataset if users change the settings on that dataset directly.
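
The inheritance rule described here can be sketched as follows. This is a simplified Python model with hypothetical names (Dataset, override_hours, effective_refresh_hours), not Dremio's internal representation: a dataset follows its source's interval unless the interval was set on the dataset directly.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Dataset:
    name: str
    # Set only when a user edits the interval on the dataset itself (assumption).
    override_hours: Optional[int] = None


def effective_refresh_hours(dataset: Dataset, source_hours: int) -> int:
    """Return the dataset's own interval if set, otherwise the source's."""
    if dataset.override_hours is not None:
        return dataset.override_hours
    return source_hours
```

With this model, changing the source interval changes the result for every dataset that has no override, while datasets edited directly keep their own value.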

Ability to skip reflection recommendations: Reflection recommendation generation can now be skipped.

Mixed type fields can now be used as part of reflections: Dremio now allows users to use mixed data type fields in reflection definitions.

Improved reflection suggestions based on data profile: Reflection suggestion logic has been updated to provide better recommendations for a variety of data profiles.

Reflections no longer rely on moves to support atomic updates: Reflection creation logic has been updated to not rely on moves to support atomic updates. This also affects CTAS queries.

Web Application and APIs

SQL REST API: Users can now execute queries in Dremio using the new SQL REST API. Once submitted, queries can be polled for progress, and once they complete, their results can be retrieved.
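
The submit, poll, and retrieve flow can be sketched in Python as below. This is a minimal sketch under stated assumptions: the three callables (submit, get_status, get_results) are hypothetical placeholders for whatever HTTP calls a client makes against the SQL and Job APIs, and the state names "COMPLETED", "FAILED" and "CANCELED" are assumed rather than taken from these notes.

```python
import time


def run_query(sql, submit, get_status, get_results,
              poll_seconds=1.0, max_polls=60):
    """Submit a query, poll until it finishes, then fetch its results.

    submit(sql) -> job id; get_status(job_id) -> state string;
    get_results(job_id) -> rows. All three are caller-supplied placeholders.
    """
    job_id = submit(sql)
    for _ in range(max_polls):
        state = get_status(job_id)
        if state == "COMPLETED":  # assumed terminal success state
            return get_results(job_id)
        if state in ("FAILED", "CANCELED"):  # assumed terminal failure states
            raise RuntimeError(f"job {job_id} ended in state {state}")
        time.sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

Passing the HTTP calls in as functions keeps the polling logic independent of any particular client library and makes it easy to exercise without a running server.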

Catalog REST API: Users can now browse the Dremio catalog as well as create and modify sources, spaces, folders and datasets using the new Catalog API.

Job REST API: Users can now get status and results for a specific job using the new Job API.

Dataset Votes (EE only): Admins can now see all datasets that have votes, which helps them understand acceleration demand and dataset popularity.

Jobs page now lists enqueued status: The Jobs page now distinguishes enqueued, planning and running statuses instead of listing them all as running.

Coordination and Metadata

Improved ODBC/JDBC metadata performance: Metadata calls from ODBC and JDBC clients should now perform better due to caching and retrieval optimizations.

Improved INFORMATION_SCHEMA retrieval performance: INFORMATION_SCHEMA queries should now perform better due to caching and retrieval optimizations.

Improved handling of system, Java and environment variables for YARN deployments: Users can now separately specify system, Java and environment variables as part of deploying executors via YARN. Previously, environment variables could not be passed in YARN deployments.

Ability to start Dremio in the foreground: Dremio’s startup script has been updated to support running in the foreground. This can be accessed using the dremio start-fg command.

Improved systemd support: Dremio now packages a tmpfiles.d file in its RPM installation, which ensures that systemd re-creates /var/run/dremio after a system restart.

Source Adapters

Improved source configuration change impact warnings: Dremio now warns users when they make configuration changes to a source that will cause existing reflection, format and sharing settings to be cleared. Dremio also avoids additional operations if the changes do not impact metadata, leading to faster update times.

Add support for ALTER SOURCE REFRESH STATUS: Users can now trigger the refresh of a source's status through a SQL command.

Automated source status monitoring and recovery: Dremio now monitors the status of data sources, reports problematic states, and frequently attempts to reconnect to the source.

Option to ignore Elasticsearch scroll result count mismatches: Occasionally, Elasticsearch returns an incorrect number of reported hits. By default, Dremio fails such queries to protect against incorrect results. There is now an option to disable this behavior for Elasticsearch sources.

Automatic query retry when Elasticsearch alias definition changes: Dremio now monitors the validity of Elasticsearch aliases and automatically updates its metadata at query time if a change is found. Dremio then re-executes the query with the new metadata.

Automatic query retry when RDBMS schema changes are detected: Dremio now monitors changes to RDBMS schemas and automatically updates its metadata at query time if a change is found. Dremio then re-executes the query with the new metadata.

Execution

Optimized IN clause handling: IN clause performance has been greatly improved through both execution-level optimizations and better query planning for queries that include IN clauses.

Executor internal CPU containerization: The number of CPU cores that an executor can access can now be limited by Dremio. This is available for all deployment models (YARN, bare metal, etc.).

Upgraded to latest version of Arrow (0.9): This upgrade also moves our decimal memory format to little endian. We've also added backward compatibility support for clients that don't support the latest version of Arrow.

Dictionary encoding is disabled by default for reflections and $scratch tables: To ensure optimal heap usage, dictionary encoding for reflections and $scratch tables has been turned off by default. This option can be controlled using the store.parquet.enable_dictionary_encoding support key in Admin > Advanced Settings.

Query profiles now include additional planning information: Default query profiles have been updated to include additional planning information.

Improved diagnostic reporting when queries are canceled due to memory limits: We’ve improved the way Dremio accounts for memory and now record a wider set of telemetry, including node memory usage details in addition to the existing query memory usage.

Improved early query termination support: To optimize and minimize resource usage, Dremio may terminate queries early in some cases. For example, if one side of a join is evaluated to be empty, Dremio will not continue to process the other side.
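
The join example can be illustrated with a small Python sketch of an inner hash join. This is a generic illustration of the idea, not Dremio's operator code; the function name and row format are hypothetical. If the build side turns out to be empty, the probe side is never read.

```python
def hash_join(build_rows, probe_rows, key):
    """Inner hash join that skips the probe side when the build side is empty."""
    table = {}
    for row in build_rows:
        table.setdefault(row[key], []).append(row)

    if not table:
        # An inner join with an empty side has no output, so stop here
        # without consuming probe_rows at all.
        return

    for row in probe_rows:
        for match in table.get(row[key], []):
            yield {**match, **row}
```

Because the function is a generator and returns before iterating over probe_rows, an expensive upstream scan feeding the probe side is never started when the build side is empty.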

Bug Fixes

Acceleration

Restart of master node causes reflection refresh to be started: In some cases, restarting the master/coordinator node used to cause reflections that were not due for refresh to be refreshed prematurely.

Deleting datasets may leave orphan reflections: In certain cases, deleting a dataset with reflections would not remove the reflections associated with that dataset.

Reflections are sometimes matched but not used when working with joins: In some cases where portions of a query can be accelerated, Dremio would match reflections but would not use them. Planning logic has been updated to fix this.

Occasional sub-optimal query plans when aggregation reflections are available for a dataset: When working against datasets with aggregation reflections, query performance would sometimes degrade due to bad plan choices. This issue is now fixed.

Reflection suggestion analysis would sometimes fail: Reflection suggestion analysis jobs would sometimes fail when encountering non-UTF8 characters. This is now fixed.

Creating a reflection sometimes fires more than one job for the same reflection: Dremio now guarantees that no more than a single refresh job will be running at a time for a particular reflection.

Coordination and Metadata

INFORMATION_SCHEMA and JDBC/ODBC metadata user-level filtering: Metadata included in both INFORMATION_SCHEMA tables and JDBC/ODBC calls is now filtered to only include items that the user has access to view.

Improved isolation of bad data sources: Problematic data sources are now identified faster and do not impact metadata retrieval for other data sources in the system.

Incorrect username case handling when using LDAP: If a user tried to log in for the first time using their username in the wrong case, that user’s home space would fail to initialize. This is now correctly handled.

Excessive planning time for queries with many joins when relevant reflections are available: Queries that included many joins on datasets with reflections defined could have excessively long planning times. This issue is now fixed.

Distributed storage paths are now only controlled by the master node configuration: Previously, distributed storage paths (i.e., paths.dist) were determined by the last node launched in the cluster.

Metadata issues when accessing datasets from Microsoft Power BI: Sometimes, if a dataset hadn’t been queried from Dremio before, trying to access it from Microsoft Power BI caused an error. This behavior is now fixed.

Multiple window functions on virtual datasets would cause failures: Queries on virtual datasets that included multiple window functions would cause a “400 - Bad request” error from Dremio’s server. This issue is now addressed.

Provisioning screen would sometimes throw “Version of submitted Cluster does not match stored” exception: This is now fixed.

JDBC clients would get blocked indefinitely when receiving invalid messages: In rare cases when JDBC clients received invalid messages, they were blocked indefinitely. This behavior has been fixed.

dremio.conf references were layered incorrectly: In cases where an option in dremio.conf is referenced by another option, the default value for the referenced option would be used instead of the user-defined value. This issue has been fixed.

Using NOT (a IN …) syntax would cause query failures: Queries including NOT (a IN …) now run without issues.

Compatibility issue between JRuby and JDBC driver: When trying to use Dremio’s JDBC driver from JRuby, users would get a NoClassDefFoundError. This issue has been fixed.

Dremio startup command might pick the wrong Java binary: In some cases, even if the JAVA_HOME variable was set, the Dremio startup command would pick the wrong Java binary. This logic has been updated to always check first whether $JAVA_HOME/bin/java is available before searching for alternatives.
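
The lookup order described above can be sketched in Python. This is an illustration of the order of checks only, not the actual startup script; the function name find_java is hypothetical, and falling back to a PATH search is an assumption about what "searching for alternatives" means.

```python
import os
import shutil


def find_java(env=None):
    """Prefer $JAVA_HOME/bin/java, then fall back to searching PATH."""
    env = os.environ if env is None else env

    java_home = env.get("JAVA_HOME")
    if java_home:
        candidate = os.path.join(java_home, "bin", "java")
        # Use JAVA_HOME only if the binary exists and is executable.
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate

    # Assumed fallback: look for "java" on the PATH (returns None if absent).
    return shutil.which("java", path=env.get("PATH"))
```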

Dataset metadata is marked as expired after coordinator node restart: This behavior degraded performance for the initial set of queries after a restart because of the cost of in-line metadata fetches. This issue has been addressed.

Restart of master node may cause failed jobs to be marked as “In Progress”: These types of jobs are now correctly marked as “Failed”.

$_dremio_$_update_$ field shows up incorrectly in INFORMATION_SCHEMA and ODBC/JDBC metadata calls: For datasets based on file-system sources, the $_dremio_$_update_$ field may show up incorrectly as part of dataset metadata in INFORMATION_SCHEMA and ODBC/JDBC metadata calls. This would cause query failures.

Source Adapters

Exception when using OVER clauses in RDBMS sources: Using OVER clauses and window functions with RDBMS sources would previously sometimes fail with the error “Cannot convert RexNode to equivalent Dremio expression”. These types of queries now succeed.

Issue working with conflicting types for the same field when using an Elasticsearch alias: Working with nested fields that have the same name but different types in an Elasticsearch alias used to cause queries to fail. Dremio now correctly ignores such nested fields.

Incorrect handling of NULLs when pushing down not-equals expression to Elasticsearch: Previously, not-equals expressions handled NULLs incorrectly when pushed down to Elasticsearch sources. Pushdown logic has been updated to address this.

Performance issues with Elasticsearch queries including a LIMIT clause: Previously, Dremio would keep fetching until Dremio’s internal batch size was reached. This logic has been updated to terminate once the user-requested limit has been reached.
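
The change in fetch behavior can be sketched as a paging loop in Python. This is a generic illustration with hypothetical names (fetch_page stands in for one scroll request), not Dremio's Elasticsearch reader: the loop stops as soon as the requested limit is satisfied instead of continuing to fill an internal batch.

```python
def fetch_with_limit(fetch_page, limit):
    """Collect rows page by page, stopping once `limit` rows are gathered."""
    rows = []
    while len(rows) < limit:
        page = fetch_page()  # hypothetical: returns the next page, or [] when done
        if not page:
            break
        # Keep only as many rows as are still needed to reach the limit.
        rows.extend(page[: limit - len(rows)])
    return rows
```

With pages of two rows and a limit of three, this loop issues two page requests and never asks for a third.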

Elasticsearch queries are sometimes under-parallelized: Planning logic for Elasticsearch queries has been updated for a variety of use cases to ensure optimal parallelization.

Execution

Avoid cancelling queries due to out of memory in some large sort queries: When spilling data in external sort, there were cases where we allocated more memory than we had reserved. We tuned the memory allocation algorithm in the sort code to be more adaptive by stepping down the requirement.

Failures related to running FLATTEN with reflections enabled: When using FLATTEN with variable width data, there were cases where we used an incorrect length for the variable width data. This resulted in internal failures related to over-allocation of memory. The problem is now fixed.

Handle reference count failures in reflection materialization: Some reflection materialization jobs were failing due to improper handling of references to internal buffers. The problem is now fixed.

Query failure when joining maps of maps: When working with nested map data (a map inside a map), the memory for the inner map's data was being allocated twice. The problem is now fixed.

Limit result records for queries "Run" in the UI: When querying large datasets through the UI, Dremio now limits the number of records returned. The job details indicate whether the limit has been reached.

Better handle deeply nested lists in JSON: When handling lists in JSON data, our heuristics for allocating memory for nested lists were not optimal, and we ended up making memory allocation requests larger than the OS/JVM allows. We made a few changes to our logic to handle this better.

Using KVGEN function causes ClassCastException: The KVGEN function could fail with a ClassCastException. This issue is now fixed.

If all executors associated with an active job are disconnected, Dremio accidentally marks the job as completed: This behavior has been improved to better monitor executor statuses and correctly mark the job as “Failed”.

Incorrect handling of negative decimal values from Parquet files: When working with negative decimal values from Parquet files, Dremio would fail to interpret them correctly. This issue is now fixed.

Sort operations do not release partially allocated resources when a failure happens: Sort operation logic has been improved to better handle failures and release memory as needed.

NULL or empty string values are not handled in reflection and $scratch table partitioning: Partitioning logic has been updated to substitute DREMIO_DEFAULT_NULL_PARTITION__ for NULL partition values and DREMIO_DEFAULT_EMPTY_VALUE_PARTITION__ for empty partition strings.
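
The substitution rule can be written as a short Python sketch. The two placeholder strings come from the note above; the function name partition_dir_name and the idea that the result is used as a partition label are assumptions made for illustration.

```python
NULL_PARTITION = "DREMIO_DEFAULT_NULL_PARTITION__"
EMPTY_PARTITION = "DREMIO_DEFAULT_EMPTY_VALUE_PARTITION__"


def partition_dir_name(value):
    """Map a partition value to a non-empty label (hypothetical helper)."""
    if value is None:
        return NULL_PARTITION   # NULL partition value
    if value == "":
        return EMPTY_PARTITION  # empty partition string
    return str(value)
```

Using distinct placeholders keeps NULL and empty-string values in separate partitions, so the two cases remain distinguishable.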

Extended blocking operations cause unbalanced thread scheduling: When tasks become unblocked, Dremio will now better evaluate alternative threads that might be able to complete the work more quickly.

Web Application

New lines are not rendered correctly in query previews and runs: New lines are now correctly handled and displayed.

Dremio UI incorrectly caches folder contents after name changes: If a folder, source or space was deleted and re-created with the same name under the same path, Dremio would show previous listings instead of current listings. This behavior is fixed.

Admin > Administrators page sometimes does not load: When using LDAP, the Admin > Administrators page would sometimes not load. Error handling logic has been updated to minimize the impact of an individual problematic record.

Cannot remove given permissions: In certain cases, administrators would not be able to remove permissions they’ve given to users or groups. This is now addressed.

Issue running queries without FROM clause in the UI: The Dremio UI now correctly handles running queries without a FROM clause, for example, a SELECT 1 query.