Archive for June, 2014

In June 2014, the HP Vertica summer interns headed to the East End House in East Cambridge, MA to work with students through a community service project. Sarah Perkins, a business planner on the Project Management team, organized the project. Since 1875, the East End House has offered innovative programs to the community and continues to strive for excellence. Their programs help support families and individuals through curricula that enhance education standards. Programming supports the whole family with comprehensive services like a Food Pantry, Community Workshops, Parent Education and Senior Programming.

The interns, accompanied by mentors Sarah Lemaire and Jaimin Dave, helped students build bridges with very limited materials: fifty gumdrops and one hundred toothpicks! The goal was to build a bridge that spanned a six-inch gap and would hold at least 300 grams, or 120 pennies.

Teams of interns paired up to work with four students, ranging from third to eighth grade. They watched and assisted as the students discussed strategies, drew prototypes and started to build and re-build their structures. As the students worked on their bridges, they got to know more about the Vertica interns, including their majors, hometowns, and the projects they are working on for the summer. Throughout the course of the day, if students could correctly answer questions about their interns, they would win HP Vertica swag, including toy bulldozers, stress balls, flashlights, and more.

Once the bridges were built, the interns and students tested them across a six-inch gap. Students placed pennies on a paper plate on top of the bridge, one-by-one, until the bridge collapsed under the weight. The winning team’s bridge, led by interns Swikriti Jain and Jun Yin, held 255 pennies, which weigh more than 1 1/3 pounds! The top two teams won a bundle of HP Vertica swag, including t-shirts, water bottles, and baseballs.

The HP Vertica interns had a great time learning about students at the East End House, and helping them build successful bridges. It was a unique opportunity to interact with students of many ages, while also encouraging them to remain active in school and participate in extracurricular activities.

In this part of the de-mythification series, I’ll address another common misconception in the big data marketplace: that a single piece of technology exists that will solve all big data problems. Whereas the first two entries in this series focused on market needs, this one focuses more on the vendor side of things, in terms of how big data has driven technology development, and gives some practical guidance on how an organization can better align its needs with its technology purchases.

Big Data is the Tail Wagging the Vendor

Big data is in the process of flipping certain technology markets upside-down. Ten or so years ago, vendors of databases, ETL, data analysis, etc. all could focus on building tools and technologies for discrete needs, with an evolutionary eye – focused on incremental advance and improvement. That’s all changed very quickly as the world has become much more instrumented. Smartphones are a great example. Pre-smartphone, the data stream from an individual throughout the day might consist of a handful of call-detail records and a few phone status records. Maybe a few kilobytes of data at most. The smartphone changed that. Today a smartphone user may generate megabytes, or even gigabytes of data in a single day from the phone, the broadband, the OS, email, applications, etc. Multiply that across a variety of devices, instruments, applications and systems, and the result is a slice of what we commonly refer to as “Big Data”.

Most of the commentary on big data has focused on the impact to organizations. But vendors have been, in many cases, blindsided. With technology designed for orders of magnitude less data, sales teams accustomed to competing against a short list of well-established competitors, marketing messages focused on clearly identified use cases, and product pricing and packaging oriented towards a mature, slow-growth market, many have struggled to adapt and keep up.

Vendors have responded with updated product taglines (and product packaging) which often read like this:

“End-to-end package for big data storage, acquisition and analysis”

“A single platform for all your big data needs”

“Store and analyze everything”

Don’t these sound great?

But simple messages like these mask the reality that big data analytics comprises distinct activities, that these activities come with different technology requirements, and that much of today’s technology was born in a very different time. The likelihood of a single tool doing everything well is therefore quite low. Let’s start with the analytic lifecycle, depicted in the figure below, and discuss the ways it has driven the state of the technology.

This depicts the various phases of an analytic lifecycle, from the creation and acquisition of data, through exploration and structuring, to analysis and modeling, to putting the information to work. These phases often require very different things from technology. Take the example of acquiring and storing large volumes of data with varying structure. Batch performance is often important here, as is cost to scale. Somewhat less important is ease of use: load jobs tend to change at a lower rate than user queries, especially when the data is in a document-like format (e.g., JSON). By contrast, the development of a predictive model requires a highly interactive technology that combines high performance with a rich analytic toolkit. Batch use will be minimal, while ease of use is key.

Historically, many of the technologies required for big data analytics were built as stand-alone technologies: a database, a data mining tool, an ETL tool, etc. Because of this lineage, the time and effort required to re-engineer these tools to work effectively together as a single technology, with orders of magnitude more data, can be significant.

Despite how a vendor packages technology, organizations must ask themselves this question: what do you really need to solve the business problems? When it comes time to start identifying a technology portfolio to address big data challenges, I always recommend that customers start by putting things in terms of what they really need. This is surprisingly uncommon, because many organizations have grown accustomed to vendor messaging focused on what the vendor wants to sell as opposed to what the customer needs to buy. It may seem like a subtle distinction, but it can make all the difference between a successful project and a very expensive set of technology sitting on the shelf unused.

I recommend engaging in a thoughtful dialog with vendors to assess not only what you need today, but to explore things you might find helpful which you haven’t thought of yet. A good vendor will help you in this process. As part of this exercise, it’s important to avoid getting hung up on the notion that there’s one single piece of technology that will solve all your problems: the single-solution myth.

Once my colleagues and I dispel the single solution myth, we can then have a meaningful dialog with an organization and focus on the real goal: finding the best way to solve their problems with a technology portfolio which is sustainable and agile.

I’ve been asked more than once, “Why can’t there be a single solution? Things would be so much easier that way.” That’s a great question, which I’ll address in my next blog post as I discuss some common-sense perspectives on what technology should, and shouldn’t, do for you.

Obtaining and installing your HP Vertica license may seem like tricky business. Especially if you have more than one. But the process need not be complicated or frustrating. For a Community Edition license, you don’t even need to go through any additional steps after installing Vertica. For Enterprise Edition or Flex Zone licenses, you’ll go through a step-by-step process in HP’s licensing portal called Poetic and then provide Vertica with the path to the license file you download. That’s it! You can even apply your license through the Vertica Management Console. To see the process in action, watch this video about obtaining and installing the different HP Vertica licenses.

If you ask Conservation International this question, they may just say yes. After all, Conservation International has teamed up with HP Earth Insights to provide organizations around the world, from environmentalists to policy makers, with a real-time look at what is happening within our planet’s most valuable natural resource: the rain forest.

But how does their work relate to you as a start-up organization or a Fortune 500 company?
First, they have surprisingly similar analytical needs to many other start-ups and corporations: collecting data regularly from 16 sites around the globe, performing more than 4 million climate measurements as of this February, and managing more than 3 TB of biodiversity information. As the name implies, this information is incredibly, well… diverse, including everything from photos to hand-recorded measurements to weather station and camera trap imagery. While your company may not be recording and analyzing the metadata of candid photos of elephants and chimpanzees, chances are that many of you out there are working with more than one type of data.

Collecting and Analyzing Multiple Data Types
All of these different data types have to be funneled into a database, analyzed, and then acted on. Running queries across millions of climate readings begins to look a lot like querying the diverse customer data that many other companies deal with every day. Many agricultural companies collect sensor data from across their farmlands to forecast how the climate will affect their crops for the upcoming year. These days, utility companies are launching Advanced Metering Infrastructures (AMI) to deal with the staggering amounts of sensor data collected from the energy usage of millions of homes. HP Vertica, as it happens, works as an effective Meter Data Management (MDM) system (read more here).

Visualizing the Data and Reaching More People
Working with HP, Conservation International has built its own analytics system and dashboard from the ground up for visualizing data from all 16 rainforest sites around the globe. CI DBAs discover trends based on over 140 million simulations and analyze the metadata from over 1.7 million photos. Not only is their custom interface intuitive, it also enables them to generate PDFs instantly and share to social media directly from the dashboard. For CI, this means more people now see more of their impact in more places, helping them proactively address environmental threats. For you, it might mean anything from less time spent prepping data to present to management to simply fewer emails to send.

The Power of Prediction for the Greater Good
Like many companies, CI uses standard methodology to process its data and uses R for analysis, as is common in scientific studies. Using R, CI can proactively assess where future trouble spots will be and which parts of their monitored ecosystems are most threatened. Many other HP Vertica customers use R in surprisingly similar ways, such as seeing which neighborhoods a future power outage might affect most, or how serious next year’s dry season will be for a farmer’s crops.

See Conservation International at the HP Vertica Big Data Conference
These are just a few examples of how a truly unique organization uses HP Vertica to analyze unusual data, yet does so in ways that many other groups might find surprisingly familiar. After a closer look, we can see that many organizations have a lot more in common in their data needs than they may think, and HP Vertica is the right tool for the job.

Be sure to attend our upcoming Big Data Conference in Boston, MA, where Conservation International is leading the hackathon!

Modern analytic databases such as HP Vertica often need to process a myriad of workloads ranging from the simplest primary-key lookup to complex analytical queries that include dozens of large tables and joins between them. Different types of load jobs (such as batch type ETL jobs and near real-time trickle loads) keep the data up-to-date in an enterprise data warehouse (EDW). Therefore, an enterprise class database like HP Vertica must have a robust yet easy-to-use mixed-workload management capability.

The Resource Manager

HP Vertica manages complex workloads using the Resource Manager. With the Resource Manager, you manage resource pools, which are pre-defined subsets of system resources with an associated queue. HP Vertica is preconfigured with a set of built-in resource pools that allocate resources to different request types. The General pool allows for a certain concurrency level based on the RAM and cores in the machines.

HP Vertica provides a sophisticated resource management scheme that allows diverse, concurrent workloads to run efficiently on the database. For basic operations, the built-in general pool is usually sufficient. However, you can customize this pool to handle specific workload requirements.

In more complex situations, you can define new resource pools, configured to limit memory usage, concurrency, and query priority. You can optionally restrict each database user to a specific resource pool to control the memory consumption of their requests.
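As a sketch of what such a custom pool might look like (the pool name, sizes, and user are illustrative assumptions, not from the original post):

```sql
-- Create a pool for ad-hoc reporting queries (name and sizes are illustrative).
CREATE RESOURCE POOL reporting_pool
    MEMORYSIZE '4G'          -- memory reserved for this pool
    MAXMEMORYSIZE '8G'       -- hard cap; extra is borrowed from the GENERAL pool
    PLANNEDCONCURRENCY 8     -- expected number of concurrent queries
    MAXCONCURRENCY 16        -- requests beyond this limit wait in the queue
    PRIORITY 20;             -- relative priority versus other pools

-- Restrict a user to the new pool so all of their requests draw from it.
ALTER USER report_user RESOURCE POOL reporting_pool;
```

Limiting MAXMEMORYSIZE and MAXCONCURRENCY per pool is what keeps one workload from starving the others under full system load.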

Understanding and Classifying Workloads

Before you start thinking about resource pools and workload optimization in HP Vertica, you must first develop a solid understanding of the customer’s workloads and know how to properly classify them.

What should you use as classification criteria? You could pick apart a complex query, studying its structure in detail: counting the tables, joins, and aggregate functions, and the number and types of derived tables and analytic functions, to come up with some weighted score. However, this kind of approach is extremely tedious and subjective and, as a result, is not a practical option.

What if we use the standalone runtime of a query as the criterion? This method is also problematic because a query that runs in one minute while using up 80% of a system’s resources should obviously not be in the same category as another query that runs in the same amount of time (one minute) but uses < 0.1% of the resources.

In HP Vertica, the best proxy for query complexity is memory usage. As a modern MPP columnar database, HP Vertica is rarely, if ever, I/O bound. It is also less likely to hit a CPU bottleneck because of the tremendous power and speed of modern multi-core CPUs. Therefore, the most common resource bottleneck in a production HP Vertica cluster running a complex mixed workload is memory. Because of this, the HP Vertica Resource Manager focuses on establishing equitable memory allocation among different workloads or pools. This ensures that no resource pool is starved of memory in the worst-case scenario: under full system load.

Determining Memory Requirements for a Query

If you can somehow determine quickly how much memory a query requires per node, then you can use that value to classify an HP Vertica query (or any other job). Based on extensive real-world performance tuning experience gained from working with some of HP Vertica’s biggest and most demanding customers, I have found the following classification rules to be very easy to use and effective:

Small: < 200 MB
Medium: 200-800 MB
Large: > 800 MB

How can you quickly determine a query’s memory requirement? It turns out that HP Vertica has a convenient profiling option (similar to EXPLAIN). You can use the PROFILE statement to get the total memory required for the query (among other things). As a best practice, set up a small, dedicated profiling pool for this purpose, as shown in the following example:

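A minimal sketch of such a setup (the pool, user, and query names are illustrative assumptions):

```sql
-- Small, dedicated pool used only for profiling. MEMORYSIZE '0%' reserves
-- no memory of its own, so the query borrows everything from GENERAL.
CREATE RESOURCE POOL profiling_pool
    MEMORYSIZE '0%'
    PRIORITY 80;

ALTER USER profiling_user RESOURCE POOL profiling_pool;

-- Run the query under PROFILE; the notices report the memory the query
-- reserved, and a hint points at v_monitor.execution_engine_profiles
-- for the detailed per-operator numbers.
PROFILE SELECT customer_state, COUNT(*)
FROM customer_dim
GROUP BY customer_state;
```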
Creating a dedicated profiling pool forces your query to borrow from the default general pool for any extra memory it needs to execute. If you profile against the general pool itself (a common mistake), then depending on the pool settings, the reserved memory may be more than the query actually needs. Under certain circumstances, HP Vertica can be “fooled” into reporting reserved memory rather than the memory actually allocated and used, which would skew your results.

For more information on the Resource Manager, see Managing Workloads in the HP Vertica documentation set.

Po Hong is a senior pre-sales engineer in HP Vertica’s Corporate Systems Engineering (CSE) group with a broad range of experience in various relational databases such as Vertica, Neoview, Teradata and Oracle.

Yesterday, a few fellow members of the HP Vertica team and I attended Boston TechJam 2014 at City Hall Plaza in Boston. Featuring a digital art display by local artist Cindy Bishop entitled “The Way You Move”, our booth was thronged with people wanting to know more about what we do as the leading big data analytics platform. My team and I want to send out a huge thank you to everyone who stopped by our booth to talk with us. I personally had an amazing time interacting with the rest of the tech community here in beautiful Boston, getting a chance to talk to everyone from up-and-coming innovators to grizzled tech veterans (some of whom may be joining our ranks in the future!)

Below are some pictures I snapped of the festivities (when there was a rare break in between people coming up to the booth). I’m already looking forward to next year!

This blog is just the first in a series that addresses frequently asked tech support questions. For now, we’ll talk about optimizing your database for deletion.

You may find that, from time to time, recovery and query execution are slow due to high volumes of delete vectors. Performing a large number of deletes or updates can negatively affect query performance and recovery because of delete replay.

Delete replay occurs when ROS containers are merged together. The data marked for deletion in each of the ROS containers needs to be re-marked once the containers are merged. This process can hold up your ETL processes because the Tuple Mover lock (T lock) stays on until the replay deletes finish.

Luckily, optimizing your database for deletes can help speed up your processes. If you expect to perform a high number of deletes, first consider the reason for deletion. The following is a list of common reasons for high delete usage:

You regularly delete historical data and upload new data at specific intervals

You constantly update data, or you want to delete data that was loaded by mistake

You often delete staging tables

To optimize your database for deletion, follow the suggestions that correspond to your reason for deletion.

If you regularly delete historical data to make room for newer data, use partitioning to chunk data into groups that will be deleted together. For example, if you regularly delete the previous month’s data, partition the data by month. When you use partitioning, you can use the DROP_PARTITION function to discard all ROS containers that contain data for the partition. This operation removes historical data quickly because no purging or replay deletes are involved.
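As an illustrative sketch (the table and column names are assumptions), a table partitioned by month can shed old data in a single operation:

```sql
-- Partition a fact table by year and month (names are illustrative).
CREATE TABLE sales_fact (
    sale_date  DATE NOT NULL,
    store_id   INTEGER,
    amount     NUMERIC(10,2)
)
PARTITION BY EXTRACT(year FROM sale_date) * 100 + EXTRACT(month FROM sale_date);

-- Drop May 2014 in one shot: the matching ROS containers are discarded
-- outright, so no delete vectors or replay deletes are created.
SELECT DROP_PARTITION('sales_fact', 201405);
```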

You may also want to delete a high volume of data because it was loaded by mistake or because you frequently update data (which involves frequently deleting data). In these cases, you may see a high volume of delete vectors. There are three good ways to prevent this:

Create delete-optimized projections by using a high cardinality column at the end of the sort order. This helps the replay delete process quickly identify rows to be marked for deletion.

Make sure your Ancient History Mark (AHM) is advancing and close to the Last Good Epoch (LGE) or Current Epoch. You may also want to periodically use the MAKE_AHM_NOW function to advance the ancient history mark to the greatest allowable value. When a mergeout occurs, all data that is marked for deletion before the AHM will be purged, minimizing the amount of replay deletes.

Periodically check the number of delete vectors in your tables using the DELETE_VECTORS system table. The automatic Tuple Mover will eventually purge deleted data but if you find your tables have a large number of delete vectors, you can manually purge records using the PURGE_TABLE function.
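The three suggestions above can be sketched in SQL as follows (the table, projection, and column names are illustrative assumptions):

```sql
-- 1. Delete-optimized projection: a high-cardinality column (order_id)
--    ends the sort order so replay deletes can locate rows quickly.
CREATE PROJECTION sales_fact_del AS
SELECT sale_date, store_id, order_id, amount
FROM sales_fact
ORDER BY sale_date, store_id, order_id;

-- 2. Advance the Ancient History Mark so the next mergeout can purge
--    rows marked for deletion before it.
SELECT MAKE_AHM_NOW();

-- 3. Check how many delete vectors each projection is carrying, then
--    manually purge any table whose count has grown too large.
SELECT projection_name, SUM(deleted_row_count) AS deleted_rows
FROM v_monitor.delete_vectors
GROUP BY projection_name
ORDER BY deleted_rows DESC;

SELECT PURGE_TABLE('public.sales_fact');
```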

You may find that you frequently delete staging tables. To streamline this process, truncate the staging table with the TRUNCATE TABLE statement instead of deleting its rows. Truncating a table discards the ROS containers that hold the data rather than creating delete vectors, and is therefore much more efficient than a DELETE.
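For example (the staging table name is an assumption):

```sql
-- Discards the staging table's ROS containers outright; no delete vectors
-- are created, unlike a row-by-row DELETE FROM staging_sales;
TRUNCATE TABLE staging_sales;
```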

Frequently deleting data is often a cause of slow query performance. Fortunately, you can optimize your database for deletions with these tips and avoid the headache.