Making a Small Job Out of Big Data With In-Database Connections - Inspire Europe 2017

The volume of data is growing exponentially and it's now the norm to store every bit of it. With the innovative in-database tools, Alteryx puts Big Data in the hands of all analysts. Learn how to maximize your existing Big Data infrastructure, when to move data out for processing, and when to use in-database blending.

To watch a recording of this session from Inspire 2017, visit https://www.alteryx.com/inspire-europe-2017-tracks

First, we’ll do an overview of what In-DB actually is and why it’s something we’re focusing on. Second, I’ll go into the benefits of using In-DB. After that, we’ll look at some of the tips, tricks, and best practices around using In-DB. Then we’ll talk about the roadmap and vision for In-DB. Finally, we’ll open up Designer and do a quick demo and overview of where to find the In-DB tools and how to get started using them.

First, I’ll give a brief overview of our In-DB functionality.

With our In-DB tools, Alteryx is actually composing a SQL query as In-DB tools are added to a workflow.

Instead of processing this query in Alteryx, we push it down to the underlying database, where all of the processing is done

After all the processing is done, only the results get sent back to Alteryx, if that’s part of the workflow.

In this way (and most importantly), we are preventing any movement of data between Alteryx and the underlying database.
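To make the pushdown idea concrete, here is a minimal, purely illustrative Python sketch. The function names and query shapes are my own, not Alteryx internals: the point is that each In-DB tool wraps the upstream query in a nested SELECT, so the whole workflow stays one SQL statement that the database executes.

```python
# Illustrative sketch of In-DB query composition (not Alteryx's actual code):
# each tool added to the canvas wraps the upstream query in a nested SELECT.
def connect_indb(table):
    """Connect In-DB tool: the starting SELECT against a table."""
    return f"SELECT * FROM {table}"

def filter_indb(query, predicate):
    """Filter In-DB tool: wraps the upstream query and applies a predicate."""
    return f"SELECT * FROM ({query}) AS t WHERE {predicate}"

# Two tools become one statement, pushed down to the database at runtime;
# only the final result (if requested) ever travels back to Alteryx.
q = filter_indb(connect_indb("flights"), "dep_delay > 0")
print(q)
# SELECT * FROM (SELECT * FROM flights) AS t WHERE dep_delay > 0
```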

So, why are we focusing so much on In-DB? Here are a few reasons. First of all, data is changing dramatically. I have several facts to back these statements up over the next few slides, and they all come from a great Forbes article about Big Data, cited below.

- Facts from Forbes (http://www.forbes.com/sites/bernardmarr/2015/09/30/big-data-20-mind-boggling-facts-everyone-must-read/#35f9a3786c1d) in September 2015

Scenario: This is a lot of data to move if you need to clean, blend, prep, or analyze it. And we know we’ll have to, because raw data is messy, and even more so as it becomes more and more unstructured.

By pushing the processing to the underlying database, we limit the movement of data and get huge performance improvements, especially as the data gets bigger and bigger.


Hadoop: open source software for distributed computing

Distributed computing (performing computing tasks using a network of computers in the cloud) is used by many companies today.

The scenario here is that companies are investing lots of time and money in infrastructure to handle all of this data efficiently, and designing these frameworks to handle their data in the way that works best for them, so there’s a lot of customization in these big data environments.

Instead of trying to pull all of this data into Alteryx and building our own functionality to handle it efficiently, or creating a one-size-fits-all solution for every company, we’re using In-DB tools to leverage a company’s existing infrastructure. Alteryx doesn’t have to reinvent the wheel, and companies get to take full advantage of the environments they’ve already put into place.


Data is incredibly valuable and gives companies insights that result in increases of millions of dollars to the bottom line.

The In-DB tools use the same intuitive, no-coding-required, drag-and-drop interface found in our standard tools, making this data accessible to anyone in the company, not just the DBAs, developers, and data scientists.

The biggest benefit of using the In-DB functionality is speed, and that’s because we eliminate the transfer of data to and from Alteryx. While the Alteryx engine is very efficient at processing workflows in-mem, we’ll never be able to control or improve how quickly data is transferred over a network.

So, to illustrate this, we have a visual that shows how long it takes to upload and download a 5 GB dataset using standard upload and download speeds. We’re looking at 2-4 minutes to move data from the data source into Alteryx and 3-5 minutes to move data from Alteryx back into the data source. And these times can vary greatly depending on connection speed and where your data lives, whether it’s in the cloud or where the server is physically located.

So, if we average this out, we’re looking at a total runtime of about 8 min using standard tools, with the vast majority of that time being the actual transfer of the data.

So, remove the need to transfer that data and you’re looking at about 30 seconds because, as mentioned, the performance hits come from the transfer of the data, not the processing.

Using standard I/O at an average download speed of ~50 MB/s, it takes ~2 min to download a 5 GB dataset (wired). At an average download speed of ~22 MB/s, it takes ~4 min to download a 5 GB dataset.
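The arithmetic behind those figures is simple division. Here is a quick sanity check, assuming the quoted speeds are megabytes per second and the dataset is 5 GB:

```python
# Back-of-the-envelope transfer times for the slide's 5 GB dataset,
# assuming the quoted speeds are in megabytes per second.
def transfer_minutes(gigabytes, mb_per_sec):
    """Minutes to move `gigabytes` of data at `mb_per_sec` MB/s."""
    return gigabytes * 1024 / mb_per_sec / 60

print(round(transfer_minutes(5, 50), 1))  # 1.7 -> the "~2 min" wired case
print(round(transfer_minutes(5, 22), 1))  # 3.9 -> the "~4 min" case
```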

To further prove this point, I built a very simple workflow to show the performance gains of In-DB vs standard tools.

We’re simply connecting to a table in a database (SQL DW in this case) that contains on-time data for flights. Then, we’re filtering out the flights that were not delayed and writing that table back to the database.

The table we’re reading from contains 485k records and the filtered table that we’re writing to contains 167k records.

Using standard tools, this workflow takes 7 min 49 sec to run. Using In-DB tools, the exact same workflow takes 22.8 seconds. So, significant performance improvements. And keep in mind that the gains will differ depending on which data source you use and what your workflow is actually doing.

Speed is the headline benefit of using In-DB. To let you leverage that speed, we also offer flexibility and hybrid workflows.

Flexibility refers to being able to use data in your In-DB workflow that doesn’t live in your underlying data source. Say I have an Excel or CSV file that I want to use as part of my data blending, but it lives on my local machine and I want to use it with tables that live in the In-DB source I’m connecting to. With the Data Stream In tool, we allow that flexibility.

Hybrid workflows is similar to flexibility except that, here, we’re specifically referring to streaming data out of an In-DB workflow to subsequently do even more analysis on it. For example, we don’t offer In-DB spatial tools or In-DB reporting tools, but by allowing hybrid workflows you can still take advantage of the In-DB performance benefits by limiting most of the data to the In-DB part of the workflow and only streaming out the relevant data to further analyze via spatial tools or reporting tools.

We have been asked several times whether we have benchmarking tests for In-DB vs. standard tools. I decided to just build a very simple workflow to show the performance gains of In-DB vs. standard tools.

We’re simply connecting to a table in a database (Redshift in this case) that contains on-time data for flights. Then, we’re filtering out the flights that were not delayed and writing that table back to the database.

The table we’re reading from contains 485k records and the filtered table that we’re writing to the database contains 167k records.

Each database has different gains and, as you can see, using the exact same workflow we used in Redshift, SQL DW shows even more significant gains In-DB vs. standard tools. In this case, using In-DB is 7 min 26 sec faster.

I’ve built a workflow that exemplifies the benefits I’ve just mentioned- Speed, Flexibility, and hybrid workflows.

Let’s break these down a bit further.

The first part of this workflow shows how to most effectively leverage the performance gains offered by In-DB.

That’s because you’ll notice all of these tools are In-DB tools, so no transfer of data is ever occurring.

So, if you’re able to do what you need to do using our suite of In-DB tools, you should do so.

In this particular example, we’re using the In-DB filter and formula tools to remove outliers and noisy data. And then we’re using an In-DB join tool to join tables within the database. Then another In-DB filter tool. And, finally, a write data in-db tool to create a new table in the underlying data source.

Again, this type of workflow is where you’re going to see the most efficiency gains compared to standard tools, because your entire workflow is In-DB and, therefore, no data is being transferred into or out of Alteryx.

Because we know all of your data isn’t in one place, this capability leverages the ability to bring in data from other sources and blend it with your In-DB source.

Here, we’re streaming in a local file and joining it with a table in the underlying database. This dataset is outside of the underlying database, but we are still able to offer flexibility in where data is coming from by allowing users to bring this data in via the Data Stream In tool.

The Data Stream In tool is being used to join this flat file with a table in Redshift.

This part of the workflow showcases the flexibility offered by the In-database feature of allowing outside data to be brought into an in-DB workflow for full analysis.

Similarly, the third part of the workflow focuses on the power of hybrid workflows.

We don’t have In-DB reporting tools, but we can easily create a hybrid workflow that lets us stream the data out of the In-DB workflow to take full advantage of the reporting capabilities already contained within Alteryx.

In this particular case, we’re using a stream out tool to move the relevant data outside of Redshift. Next, we’d use some of the reporting tools to create a report and the email tool to email charts with the relevant data to all employees.

As you can see, in this case, the user is able to limit data movement where needed and use both In-db and non in-database tools to maximum efficiency. One key to using this feature efficiently is to only stream out the relevant data that you’ll need for further analysis, so as you can see, I’ve used this at the very end of my workflow, after I’ve summarized my data.

Since last year, we’ve added quite a few more data sources, including Microsoft APS, Microsoft Azure SQL DB, IBM Netezza, and SAP HANA.

We have also added In-DB predictive tools for SQL and Teradata

Full list of what we support as of V11.5

We also offer our most popular predictive models for several of our data sources. This chart shows the different combinations we offer; it can also be found in our help documentation for future reference.

Users often ask when they should use In-DB. And the answer is, it depends.

But, these are some general guidelines to start with and factors to think about.

You’re working with large datasets. I know the next question is: what’s a large dataset? That’s entirely dependent on several factors, including the environment in which the data is being stored, how fast your network is, etc.

The SQL query is triggered at runtime for each Browse In-DB tool, Data Stream Out tool, and Write Data In-DB tool. That means using these tools will impact the performance of your workflow, particularly the Browse In-DB tool and the Data Stream Out tool.

In addition to triggering the SQL query, the Browse In-DB tool and the Data Stream Out tool actually move data into Alteryx. So, let’s first talk about the browse tool and how best to use it.

First of all, limit the use of the Browse tool in your In-DB workflows. The browse tool, by default, moves the first 100 rows of data from the underlying database and displays them in Alteryx. Since one of the underlying principles of In-DB workflows is to limit data movement, it is best to avoid using Browse In-DB tools to realize maximum efficiency. Browse tools can be very useful while building an In-DB workflow to enable transparency into what’s actually going on with your data, but it’s always best to remove them once you’re satisfied with your workflow and certainly before you publish or schedule your workflow.

Secondly, when using the Browse In-DB tool, don’t change the defaults if you don’t have to. The Browse In-DB tool has several options to limit data movement. The first of these is the number of rows to view; the default is 100. We recommend you use this or a smaller number to limit how many rows come back. Again, it’s a great tool to use while building a workflow, but it’s usually unnecessary to bring all rows back into Alteryx.
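In effect, the row limit amounts to capping the composed query so only a handful of rows ever cross the network. This is a sketch of the idea, not the exact SQL Alteryx emits:

```python
# Sketch of the idea behind the Browse In-DB row limit (not the exact SQL
# Alteryx generates): cap the composed query so only ~100 rows move.
def browse_indb(query, rows=100):
    return f"SELECT * FROM ({query}) AS t LIMIT {rows}"

print(browse_indb("SELECT * FROM flights"))
# SELECT * FROM (SELECT * FROM flights) AS t LIMIT 100
```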

The second option is the Enable Caching option. As with any workflow built in Alteryx, it is recommended that you cache your results the first time the workflow is run to allow for faster performance on subsequent runs.

The data stream out tool is a great and necessary tool when building hybrid workflows within Alteryx, but, like the browse tool, it is important to keep in mind that this tool is also moving data from the underlying database into Alteryx.

To optimize use of the data stream out tool, it is best to use this tool as far downstream as possible.

For example, in my workflow, it is one of the last tools I’m using- after the selects, after the summarizes, after all my data is cleaned up and I’m only streaming out the relevant data that I want to further analyze.

Another best practice for using In-DB tools efficiently is to reduce your dataset as quickly as possible. The most effective way to do this is to utilize the Connect In-DB tool and the Filter tool.

When configuring a Connect In-DB tool, avoid selecting all of the columns. From the get-go, it’s best to choose only the relevant columns you want to work with.

Use a filter tool as early as possible in your workflow to filter out all irrelevant data.

By reducing your dataset, the results will be reduced and the subsequent nested queries in your workflow will run faster.

I did promise some tricks along with my tips, so here you go:

As In-DB workflows get larger and larger and more tools are added, the SQL query being composed by Alteryx gets more complex- essentially, each time a tool is added, a nested query is inserted in the original select statement.

In general, the underlying data source should be able to optimize this query, but sometimes it will get so complex that the database will become overwhelmed. One way to see the query and potentially modify it yourself is to use the Dynamic Output In-DB tool.
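To see why the query grows, consider this toy sketch (my own simplification, not Alteryx's generator): every tool adds one more level of nesting, so a long chain produces deeply nested SQL, which is exactly the kind of statement the Dynamic Output In-DB tool lets you inspect.

```python
# Toy illustration of how nesting accumulates: each In-DB tool adds one more
# wrapper around the query. Inspecting the result is what the Dynamic Output
# In-DB tool is for (this is a simplification, not Alteryx's actual output).
from functools import reduce

def nest(query, predicate):
    return f"SELECT * FROM ({query}) AS t WHERE {predicate}"

steps = ["dep_delay > 0", "carrier = 'AA'", "distance < 500"]
q = reduce(nest, steps, "SELECT * FROM flights")
print(q.count("SELECT"))  # 4: the base query plus one SELECT per tool
```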

This is obviously a more advanced option as you will need to understand SQL code for this to be of any use.

It’s also a great illustration of how Alteryx is making this data more accessible to more users by creating the necessary SQL query while the user just drops tools into the workflow, no coding required!

*Because Alteryx creates generic queries…

For my second trick, I want to show how to use the In-DB predictive tools

It’s very simple, but does require a bit of trickery.

If you have an In-DB tool already in your workflow and you drag a predictive tool onto your canvas and connect it to that in-DB tool, some magic will happen.

Your standard predictive tool (linear regression in this case) that you just dragged from your Predictive category will morph into an In-DB version of the Linear Regression tool!

*change back, right-click and choose version

This is more of a best practice than a fun trick, but it’s very useful.

If you’re writing to Hive or Impala (Question audience- is anyone using In-DB tools to write to Hive or Impala?), configure the write tab so that it’s writing directly to HDFS and NOT writing via the ODBC driver.

A lot of people shy away from this option because they think it means we’ll only write a file to HDFS and not register it as a table in Hive like we do via the ODBC driver. That’s false. These options result in the exact same outcome as using the ODBC driver: we write a file to HDFS and register the table in your Hive instance, just as we do with the ODBC driver. And writing directly to HDFS is much faster than going through the ODBC driver.

Alright, y’all. Who thinks we don’t have sort functionality in-DB?

While it’s true we don’t have an explicit sort tool, we do offer users the ability to sort their data.

Let’s first talk about the concept of sorting. Sorting is purely a runtime, UI-based function. It is solely used to make results easier to view for the user. After the results leave the GUI, we’re not maintaining sorts any longer. Tables will never be written back to a database in sorted order- that would be highly inefficient- especially for massive datasets stored in some of our big data sources.

So, keeping that in mind, there’s no real need for a separate sort tool In-DB, because sorting data only makes sense when you actually transfer data from the underlying data source to view it in the GUI. And remember, as soon as we’re transferring data, we’re no longer In-DB, we’re in-mem.

Therefore, you can find the sort functionality in a couple of places for In-DB. One place is the Sample tool: you can check this option and then choose which field to sort by and in what order. Keep in mind, unless you use a Write Data In-DB tool and create a new table in-DB, this sample of sorted data will not persist after your workflow stops running. So, you can either write a table to capture this, or you can drop a Browse In-DB tool after the sample to see the sorted results. Though, again, this sample of your dataset won’t persist unless you actually write it to a new table.

Secondly, you can use the Data Stream Out tool to sort your results. Almost the exact same thing is happening here: check this option and then choose which field to sort by and in what order. Again, this tool alone only manipulates the data at runtime and how you actually view it. So, you need to drop a Browse tool to actually view the sorted results.
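Conceptually (again a sketch, with names of my own choosing), the sort option on these tools just adds an ORDER BY to the query that actually returns rows to Alteryx:

```python
# Sketch: the sort option on the Sample / Data Stream Out tools conceptually
# maps to an ORDER BY on the query that returns rows to Alteryx (illustrative
# only; the function name and query shape are mine, not Alteryx's).
def stream_out_sorted(query, field, descending=False):
    direction = "DESC" if descending else "ASC"
    return f"SELECT * FROM ({query}) AS t ORDER BY {field} {direction}"

print(stream_out_sorted("SELECT * FROM flights", "dep_delay", descending=True))
# SELECT * FROM (SELECT * FROM flights) AS t ORDER BY dep_delay DESC
```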

Make In-DB more accessible: we’re working toward making the In-DB functionality as accessible to all users as our standard drag-and-drop functionality. At this point, users still have to know a bit of SQL to take full advantage of what’s offered In-DB, and we want to remove that dependency. Make the In-DB experience feel the same as the standard experience: this includes everything from the configuration of a connection to the saving and management of aliases. Continue to build out support for the most-requested data sources, as well as adding more tools and functionality to our In-DB suite. Make In-DB “smarter”: this refers to eventually building out a framework and architecture that removes any decision by the end user about where the data should be processed. The user builds a workflow, and Alteryx decides whether it makes sense to process In-DB or in-mem depending on what is happening in that workflow.

We already have Transpose In-DB in the lab, though it currently only works with SQL and Oracle.


First and foremost, using In-DB is fast. And that’s because no data is being transferred. If you remember nothing else, just remember that.

If you want to remember a couple more things, these are good ones: we’ll continue to build out the In-DB functionality and expand its capabilities. A lot of what we decide to do will depend on feedback from users, so please continue to let us know what’s missing from In-DB.

And finally, there are lots of great tricks when using in-db, but one of the least known tricks is that we do offer some predictive model support in-db. So, remember that when you’re trying to build some of the most popular models against the most popular data sources.

Do you have more ideas for In-DB? Data sources you’d like to see, tools you’d like included, other features? Be sure to visit our Community and post your ideas in the Ideas section. We use these to gauge demand and help shape our future roadmap.

The Community is also a great resource for help and any other questions you might have.


1.
MAKING A SMALL JOB OUT OF BIG DATA WITH IN-DATABASE CONNECTIONS
Presented by Alex Patten, Product Manager, Data Platforms

2.
FORWARD-LOOKING STATEMENTS
This presentation includes “forward-looking statements” within the meaning of the Private Securities Litigation Reform Act of 1995.
These forward-looking statements may be identified by the use of terminology such as “believe,” “may,” “will,” “intend,” “expect,”
“plan,” “anticipate,” “estimate,” “potential,” or “continue,” or other comparable terminology. All statements other than statements of
historical fact could be deemed forward-looking, including any projections of product availability, growth and financial metrics and any
statements regarding product roadmaps, strategies, plans or use cases. Although Alteryx believes that the expectations reflected in
any of these forward-looking statements are reasonable, these expectations or any of the forward-looking statements could prove to
be incorrect, and actual results or outcomes could differ materially from those projected or assumed in the forward-looking
statements. Alteryx’s future financial condition and results of operations, as well as any forward-looking statements, are subject to
risks and uncertainties, including but not limited to the factors set forth in Alteryx’s press releases, public statements and/or filings with
the Securities and Exchange Commission, especially the “Risk Factors” sections of Alteryx’s Quarterly Report on Form 10-Q. These
documents and others containing important disclosures are available at www.sec.gov or in the “Investors” section of Alteryx’s website
at www.alteryx.com. All forward-looking statements are made as of the date of this presentation and Alteryx assumes no obligation to
update any such forward-looking statements.
Any unreleased services or features referenced in this or other presentations, press releases or public statements are only intended to
outline Alteryx’s general product direction. They are intended for information purposes only, and may not be incorporated into any
contract. This is not a commitment to deliver any material, code, or functionality (which may not be released on time or at all) and
customers should not rely upon this presentation or any such statements to make purchasing decisions. The development, release,
and timing of any features or functionality described for Alteryx’s products remains at the sole discretion of Alteryx.

7.
WHY IN-DATABASE?
• Data is Growing
• More data has been created in the past two years than in the entire previous history of
the human race
• By 2020, 1.7 megabytes of new information will be created every second for every
human being on the planet
• By 2020, digital universe of data will contain 44 zettabytes, up from 4.4 zettabytes
today
Scenario: This is a lot of data to move if you need to clean, blend, prep, or analyze it
Solution: With In-DB, you don’t have to!
Source:
http://www.forbes.com/sites/bernardmarr/2015/
09/30/big-data-20-mind-boggling-facts-
everyone-must-read/#35f9a3786c1d

8.
WHY IN-DATABASE?
• Data Storage is Changing
• By 2020, at least a third of all data will pass through the cloud
• The Hadoop market is forecast to grow at a compound annual growth of 58%,
surpassing $1 billion by 2020
• Every day, Google uses distributed computing to involve about 1,000 computers in
answering a single search query. This takes no more than 0.2 seconds to compute.
Source:
http://www.forbes.com/sites/bernardmarr/2015/
09/30/big-data-20-mind-boggling-facts-
everyone-must-read/#35f9a3786c1d
Scenario: Infrastructure is costly and customized
Solution: In-DB leverages this infrastructure

9.
WHY IN-DATABASE?
• Access to Data is Key to Valuable Insights
• Currently, less than 0.5% of all data is ever analyzed and used
• For a typical Fortune 1000 company, a 10% increase in data accessibility will increase net
income by $65 million
• Retailers who leverage the full power of big data could increase their operating margins
by 60%
Source:
http://www.forbes.com/sites/bernardmarr/2015/
09/30/big-data-20-mind-boggling-facts-
everyone-must-read/#35f9a3786c1d
Scenario: Data is valuable, but hard to access
Solution: In-DB tools make this data accessible to anyone, no coding required!

23.
*THINK ABOUT USING IN-DB IF…
• You’re working with large datasets
• Data source is sitting on top of HDFS (Hive, Impala, Spark)
• Data source is in the cloud (Azure, Redshift, Snowflake)
• Data source is not on a local server
• Data source is supported in Alteryx
• Any part of your workflow is working with a reduced dataset


38.
#inspire16#
alteryx.com/trial
Ready to bring these incredible
and tangible benefits to your
organization?
Download a FREE Trial of Alteryx
and start making your data work
for you, instead of you working for
your data