Subscribe to this blog

Follow by Email

Why SQL Part 3 - Simple optimization makes life simpler

In the second part of this series of posts on (Why SQL Part 2 - It has a powerful framework) of this series of blog posts I explained the underlying framework that SQL is based on. That framework is referred to as “relational algebra” and it was first described in 1970 by E.F. Codd while at IBM . It provides well-founded semantics for both modelling and querying a set of data. Codd's framework established a series of basic principles that govern the construction and execution of a query and these can be summarized as follows:

Projection

Filter

Join

Aggregate

Now that we have examined the basics of the relational model let’s now move on to how the language actually determines the best way to answer a question based on its use of simplified and transparent optimizations. To understand why simplified and transparent optimization is so important it is useful to consider two of the most common programming paradigms and these are: procedural and declarative. Let’s explore these two approaches:

The Procedural approach to programming

This approach aims to break down a specific task into a collection of variables, procedures, functions, corresponding data structures and conditional operators. It uses a list of instructions, wrapped up in procedures and subroutines, to tell a computer what to do step-by-step. A program (or query) written in a procedural language will contain one or more procedures and these procedures can be invoked anywhere within the program hierarchy as well as by other procedures. Examples of a procedural programming language include C/C++, Java, ColdFusion and PASCAL. The example below shows how, using a procedural language such as Java, to return a sorted list of names from a simple table (this example refers to a data set based on a list of employees):

The code snippet above tells the computer precisely the order of processing in terms of how to sort the data. Other procedures load the data into an array for processing, and output the sorted list of names. The full code listing is available on the Google Hadoop & Hive site: https://sites.google.com/site/hadoopandhive/home/dsd

The Declarative approach to programming

The declarative approach allows the developer to simply build the structure and elements of the required program in a way that expresses just the logic of the computation without having to describe the actual control flow. In simple terms it describes “what” the program, or query, should accomplish rather than describing “how” to actually accomplish the desired outcome. The “how” part of the processing is managed transparently by the underlying platform. The most widely used declarative programming language is SQL. Referring back to our problem of returning a sorted list of employees, the SQL, declarative-based, code is much simpler. Using our simple example of returning a sorted list of employees, highlights the key differences between SQL’s declarative-based code and the more workflow-orientated procedural languages such as Java:

As you can see SQL allows the developer to simply build the structure and elements of the required program in a way that expresses just the logic of the computation without having to describe the actual control flow. In simple terms it describes “what” the program, or query, should accomplish rather than describing “how” to actually accomplish the desired outcome. The “how” part of the processing is managed transparently by the underlying platform. This contrasts with the procedural approach to programming, which aims to break down a specific task into a collection of variables, procedures, functions, corresponding data structures and conditional operators. It uses a list of instructions, wrapped up in procedures and subroutines, to tell a computer what to do, step-by-step. Many of today’s big data languages adopt this procedural approach.

Summary

With declarative languages, such as SQL, the developer only has to understand the relationships between entities, the final output and transformations needed to support the output. From these two code samples it is obvious that the declarative approach is more compact, significantly less complex and much easier to maintain and enhance. This explains another of the key reasons why SQL is becoming the go-to language for data analysis.

Key characteristics

Procedural Coding

Declarative Coding

Developer needs to understand…

how to organize and perform tasks, must track changes in state

relationships between entities, the final output and transformations needed to support the output

Flow control is implemented using…

looping, conditional branching and developer-defined function calls

Function calls, high-level recursion

Management of application and system state changes...

has to be controlled and managed by the developer

N/A

Order of execution...

very important

N/A

Simplified Optimizations

The optimization techniques for queries tend to focus on minimizing the level of network traffic, reducing I/O operations, minimizing looping operations and/or eliminating redundant calculations and operations. Each application language has it is own unique set of tuning issues so using lots of different languages to support specific sets of business requirements creates a huge overhead for project teams and reduces the overall agility of the project. Trying to implement and then optimize change requests from the business can become a time-consuming task.

Many big data languages are procedural in their approach where the developer is responsible for implementing optimization techniques to support their workload. These languages provide a series of structures to manage the resources consumed by a job. For example each Java based MapReduce job requires a job configuration file that contains a number of parameters: maximum virtual memory, the cumulative size of the serialization and accounting buffers, the number of segments on disk to be merged at the same time and the percentage of memory relative to the maximum heap-size in which map outputs may be retained during the reduce. Each developer optimizes his/her own specific code/process but does so in complete isolation from the overall workload.

This granular level of control provides tremendous flexibility but can create conflicts when there are multiple applications executing similar queries.The correct level of parallelism for big data maps, within a MapReduce job, seems to be around 10-100 maps per-node, although higher numbers for CPU-light map tasks is possible. However, there is nothing to stop the developer from taking even more resources from the pool to run their code. For more information on the various fine-grained parameters for implementing, configuring and tuning MapReduce jobs refer to the MapReduce tutorial on the Apache Hadoop website.

The fact that SQL is a declarative language means the developer is shielded from the complexities of the underlying query optimization techniques. Data discovery and other analytical requirements that are implemented using the Oracle Database as the processing platform have a rich set of optimization techniques covering parallelization, query transformations, indexing and join algorithms.

Building an Optimal Query Plan

The Oracle query optimiser is at the core of Oracle’s database software and it determines the most efficient method for a SQL statement to access requested data. Due to the many internal statistics and tools within the database, the Oracle Optimiser is usually in a better position than the developer to determine the best method of statement execution. During the process of determining the best plan, the optimizer examines multiple access methods, such as full table scan or index scans, and different join methods such as nested loops and hash joins before arriving at the “best execution plan”. For this reason, all SQL statements use the Oracle Optimiser.

It attempts to generate the best execution plan for a given SQL statement. An “execution plan” describes a recommended method of execution for a SQL statement. Each plan details the combination of steps that the Oracle Database must use to execute a SQL statement. The definition of the “best execution plan” is the plan with the lowest cost among all considered candidate plans. The calculation of the overall cost takes into account key factors of query execution such as I/O, CPU, and communication. With declarative languages all these factors have to be coded, configured and managed by the developer.

Consider a user who queries records for employees who are managers. If the database statistics indicate that 80% of employees are managers, then the optimizer may decide that a full table scan is most efficient. However, if statistics indicate that very few employees are managers, then reading an index followed by a table access by rowid may be more efficient than a full table scan. This type of optimization is impractical to implement in MapReduce, since the Java developer would not only have to now write the query code, but also write an additional set of code to gather and maintain statistics about the underlying data.

Parallel Execution

The most complicated area of optimization is typically in the area of parallel execution. The objective of parallel execution is to reduce the total execution time of an operation by using multiple resources concurrently. Resource availability is the most important prerequisite for scalable parallel execution. For procedural languages developers have to understand a complex set of rules and very carefully consider the best way to parallelize the operations within their code. As mentioned before, this process typically takes place in isolation so that the developer is unaware of other jobs that might be running at the same time and the level of resources that might be available at any given point in time.

The Oracle Database provides a powerful SQL parallel execution engine. It can decide autonomously if and how parallel execution should be enabled for a query. By default the Oracle Database is enabled for parallel execution and when Automatic Degree of Parallelism (Auto DOP) is active, the database decides if a statement should execute in parallel or not and what degree of parallelism (DOP) should be used. The decision when to use parallel execution and the DOP chosen are both based on the resource requirements of a statement and the resources available at the time the query is submitted.

After the optimizer determines the fastest execution plan for a statement then the parallel execution coordinator determines the parallel execution method for each operation within the plan. The parallel execution coordinator has to decide whether an operation can be performed in parallel and if it can it then has to determine how many parallel execution servers it should use. At this point the database has parsed the query, calculated the overall cost and then determined the optimal DoP. Of course it is entirely possible for the cheapest plan to be a serial execution process, therefore, this is also an option. Finally, the Oracle Database will determine if the statement can be executed immediately or if it needs to be queued until more system resources become available. The image below “Optimizer Calculation: Serial or Parallel?” illustrates this decision making process:

In general as more and more statements attempt to run in parallel it quickly becomes very important to manage the utilisation of the parallel processes available. This means some degree of intelligence is required about when to run a statement in parallel and verify whether the requested numbers of parallel processes are available. This is extremely difficult for a developer to do simply because they are working in isolation from the overall workload. The only way to implement this type of intelligent processing is to allow the database to control the whole process.

With SQL all this happens transparently as each SQL statement that is submitted. The developer does need to get involved in determining the degree of parallelization and this simplifies the overall application logic. It also means that as new optimization techniques become available the application source code can transparently reap the benefits of the new features.

Summary

The initial focus on developer centric optimization techniques within big data projects is now giving way to a more platform centric approach as the complexity of analysis demanded by business users and data scientists continues to increase. By letting the database perform the optimisations at source, where the data is located, it is possible to share tuning improvements across a wide spectrum of front-end systems (BI reports, dashboards, applications etc). The ability of the database to “own” the optimization and processing characteristics of how a SQL statement is executed is an important element in the success of SQL as a data analysis language.

By deferring the majority of operations to the database developers can simplify their application code and transparently inherit new features as they are released.

Comments

Post a Comment

Popular posts from this blog

This post covers one of the new SQL performance enhancements that we incorporated into Database 12c Release 2. All of these enhancements are completely automatic, i.e. transparent to the calling app/developer code/query. These features are enabled by default because who doesn’t want their queries running faster with zero code changes?

So in this post I am going to focus on the new In-Memory “cursor duration” temporary table feature. Let’s start by looking at cursor duration temp tables…Above image courtesy of wikimedia.org
What is a cursor duration temp table?
This is a feature that has been around for quite a long time. Cursor duration temporary tables (CDTs) are used to materialize intermediate results of a query to improve performance or to support complex multi step query execution. The following types of queries commonly use cursor duration temp tables:WITH Clause and parallel recursive WITHGrouping SetsStar TransformationFrequent Item Set CountingXLATE
What happens during the …

….and now it’s here in PDF format as well!The free big data warehousing Must-See guide for OpenWorld 2017 is now available for download in PDF format - click here, and yes it’s completely free. This comprehensive guide covers everything you need to know about this year’s Oracle OpenWorld conference so that when you arrive at Moscone Conference Center you are ready to get the most out of this amazing conference. The guide contains the following information:Page 8 - On-Demand VideosPage 17 - JustifyYour tripPage 19 - KeyPresentersPage 41 - Must See SessionsPage 90 - Useful MapsChapter 1 - Introduction to the must-see guide.Chapter 2 - A guide to the key the highlights from last year’s conference so you can relive the experience or see what you missed. Catch the most important highlights from last year's OpenWorld conference with our on demand video service which covers all the major keynote sessions. Sit back and enjoy the highlights. The second section explains why you need to atte…

MATCH_RECOGNIZE and predicates
At a recent user conference I had a question about when and how predicates are applied when using MATCH_RECOGNIZE so that’s the purpose of this blog post. Will this post cover everything you will ever need to know for this topic? Probably!
Where to start….the first thing to remember is that the table listed in the FROM clause of your SELECT statement acts as the input into the MATCH_RECOGNIZE pattern matching process and this raises the question about how and where are predicates actually applied. I briefly touched on this topic in part 1 of my deep dive series on MATCH_RECOGNIZE: SQL Pattern Matching Deep Dive - Part 1.
In that first post I looked at the position of predicates within the explain plan and their impact on sorting. In this post I am going to use the built in measures (MATCH_NUMBER and CLASSIFIER) to show the impact of applying predicates to the results that are returned.
First, if you need a quick refresher course in how to use the MATCH…

Keith Laker

Keith Laker

Disclaimer

Opinions expressed are entirely my own and do not reflect the position of Oracle or any other corporation. Do NOT take anything written here, unless explicitly mentioned otherwise, to be Oracle policy or reflecting Oracle's support policy.

About Me

I have been working with Oracle data warehouse technology for over 20 years working on a wide variety of data warehouse projects both as a consultant and an onsite support engineer. I am now part of the Data Warehouse Product Management Team where I am responsible for analytical SQL. I am based in the UK at our Manchester office.

A key part of my role is to work with our sales teams to brief our customers on data warehousing and analytical SQL: explaining the wide variety of new and exciting opportunities that our DW and analytical solutions can support.

I regularly deliver sales training for data warehousing and analytical SQL across all our sales regions and provide competitive intelligence support across all the major data warehouse vendors.