The Guinness World Record for the Largest Data Warehouse: A Q&A with Tom Traubitz of SAP

This article features Ron Powell’s interview with Tom Traubitz, senior director of product strategy at SAP. Tom and Ron talk about advancements in technology and SAP’s Guinness World Record for the world’s largest data warehouse.

Data warehousing continues to evolve with all of the latest advancements in technology. Can you tell us how SAP has evolved its technology and now is making it possible to architect and manage large data warehouses that access a variety of data sources and integrate many different data types?

Tom Traubitz: There are a couple of technologies that we’ve put a lot of work into. We use in-memory core processing. But in order to get big, we needed to make sure that in-memory core could reach out to lots of data sources and data types. We used a technology that Sybase, which is now part of SAP, pioneered several years ago called Smart Data Access. It is real-time technology that allows the warehouse to interrogate other databases in their native language, not just sending very simple primitives to the remote system. It can ask very complicated questions of the remote system in its own language, and that remote system can share some of the processing load. We then combined that with powerful columnar technologies in SAP IQ, which allows us to get a lot of storage under management in a very large columnar sense.
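To make the contrast between "very simple primitives" and asking the remote system a complicated question concrete, here is a minimal sketch of query pushdown. It is not SAP's Smart Data Access API: Python's built-in sqlite3 module stands in for the remote database, and the `sales` table and all function names are hypothetical, chosen only for illustration.

```python
import sqlite3


def make_remote():
    """Create a stand-in 'remote' database (hypothetical schema and data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("east", 10.0), ("east", 5.0), ("west", 7.5)],
    )
    return conn


def totals_pull_rows(remote):
    """Simple primitive: fetch every row and aggregate on the local side."""
    rows = remote.execute("SELECT region, amount FROM sales").fetchall()
    totals = {}
    for region, amount in rows:
        totals[region] = totals.get(region, 0.0) + amount
    return totals, len(rows)  # len(rows) = rows shipped back to the caller


def totals_pushdown(remote):
    """Pushdown: the remote system runs the aggregation in its own SQL."""
    rows = remote.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"
    ).fetchall()
    return dict(rows), len(rows)


if __name__ == "__main__":
    remote = make_remote()
    print(totals_pull_rows(remote))
    print(totals_pushdown(remote))
```

Both functions return the same totals, but the pushdown version ships back one row per region instead of every underlying row, which is the sense in which the remote system shares the processing load.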

What does this ability to have large data warehouses mean to the standard enterprise? Do most companies even deal with that amount of data?

Tom Traubitz: Most medium and large companies tend to deal with two to 10 terabytes of data, and they’re growing quickly. We wanted to make sure that we had the headroom. When we build extremely large systems like the Guinness record system, we’re making sure that the technology is ready for them when they grow to these larger sizes. Even though they only have a few terabytes now, we’re prepared to offer them systems as large as they will ever need.

Today’s successful enterprises are the ones that are able to use data most effectively for all types of analytics. Can you talk about how some SAP customers are using advanced analytics?

Tom Traubitz: One well-known company is eBay. They’re using a lot of advanced analytics in real time to make sure that their auctions go off flawlessly. They have tens of thousands of auctions going on continuously, and as the time ticks down, it is very important that the auction finishes correctly. They have a lot of infrastructure that has to be monitored in the background in order to make sure that the auction does not get interrupted in the last few minutes because if it does and the auction fails, they have to refund the person who originally placed the auction with them. It becomes a reputation matter. So being able to flawlessly execute auctions means being able to look at all of their systems, thousands of computers, and make sure that all those signals from those computers are saying that everything is okay. So they do a lot of that in real time with our technology. Another company, MKI in Japan, is doing real-time DNA results. Traditionally, their DNA tests would take two days to process. They now can process those in about 20 minutes. So the patient gets the results right there while they’re at the clinic. Other companies such as John Deere, Bayer MaterialScience, B. Braun, Mercedes AMG and others are doing a lot of things such as being able to process their core transaction applications -- their business applications -- and their analytics on the same platform: SAP HANA. This allows them to continuously evaluate their business data in an analytic setting without intermediary ETLs or other ways of moving data around.

Being able to process the data where it lives is really important. What about the cloud? What kind of cloud adoption are you seeing from your customers?

Tom Traubitz: We’re seeing very strong adoption of the cloud, particularly our own SAP HANA Cloud Platform. We have a number of offerings in that area from an enterprise cloud, which is a fully managed service for our customers, to SAP HANA One, which is a good starting point for startups and developers that can be run on Amazon Web Services. We tested in the cloud to make sure that we could do things large. So we actually did a 100-terabyte prototype in the Amazon cloud using 111 instances of HANA spanning 1,776 cores of compute power, and we federated it together so we could do 60 billion rows of information being loaded at 8 million rows per second. With all those speeds and feeds, what we found was that with all these nodes to deal with that 100 terabytes of data, we could still do traditional queries at around 330 milliseconds, which means that you’re still getting sub-second response time to complicated queries of all data. The cloud is there. It is ready to go. We have many enterprise-level offerings for it, and we’ve made sure that you can go big in the cloud. We are ready for all of our customers to add their technology to the cloud.
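The speeds and feeds quoted above can be cross-checked with a little arithmetic. The sketch below uses only the figures from the interview. It assumes, for illustration, that cores and data were spread evenly across instances and that the 8 million rows per second rate was sustained for the whole load. The interview states neither, and the function and constant names are hypothetical.

```python
# Figures quoted in the interview for the 100-terabyte cloud prototype.
INSTANCES = 111
TOTAL_CORES = 1_776
DATA_TB = 100
TOTAL_ROWS = 60_000_000_000
LOAD_RATE_ROWS_PER_SEC = 8_000_000


def prototype_summary():
    """Derive per-instance and load-time figures from the quoted numbers.

    Assumes an even spread across instances and a sustained load rate;
    both are illustrative assumptions, not statements from the interview.
    """
    load_seconds = TOTAL_ROWS / LOAD_RATE_ROWS_PER_SEC
    return {
        "cores_per_instance": TOTAL_CORES / INSTANCES,
        "tb_per_instance": DATA_TB / INSTANCES,
        "load_seconds": load_seconds,
        "load_minutes": load_seconds / 60,
    }


if __name__ == "__main__":
    for name, value in prototype_summary().items():
        print(f"{name}: {value:,.2f}")
```

Under those assumptions the numbers work out to 16 cores and roughly 0.9 terabytes per instance, with the 60 billion rows loading in about 125 minutes.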

Those queries can be ad hoc, right? You don’t have to know what they are ahead of time.

Tom Traubitz: Absolutely. That’s one of the great things about HANA. All the data is indexed all the time, even when you use HANA in a federated circumstance. You can simply ask that question without a developer or an analyst having pre-aggregated data or pre-organized the data warehouse specifically to respond to that question.

I remember in the early days of data warehousing and my days at DM Review when we talked about a large data warehouse being in the 50-60 gigabyte range. Eventually, we moved into the terabyte range, but now SAP has set a new Guinness World Record for the largest data warehouse at an amazing 12.1 petabytes. Can you tell us about that data warehouse?

Tom Traubitz: Sure. It was a combination of social media, text, business data and other sources of information to simulate a really large business. So essentially, we took existing business data, and we duplicated it up to demonstrate in proportion a really large business. We built it on HP hardware with NetApp disks, Red Hat Linux, our SAP HANA product, our SAP IQ, and BMMsoft technology for federation software. What we wanted to do was create a genuine business model of a large data warehouse as per the Guinness rules. We brought together about 22 HP ProLiant systems and about 20 NetApp storage arrays to get that physical presence. Then we used about 20 nodes of SAP IQ 16 and about 5 nodes of SAP HANA, and we put that all together to get that 12.1 petabytes. To put that in perspective, two petabytes is about all the data in the U.S. academic research libraries. It’s still not the biggest in the world. I mean, academics have estimated that the written works of all mankind in all languages since the dawn of recorded history amount to about 50 petabytes. However, I think that you’ll see our next warehouse trying to encapsulate all of that.

Wow -- I look at 12.1 petabytes, and if 50 petabytes is all of the information, you literally have created a warehouse that contains a quarter of all the information out there. That is really amazing.

Tom Traubitz: An amazing amount of data. We’re not too surprised that some of our biggest enterprises are consuming a lot of data because social media has been driving new information that I think allows people to get closer to their customers and better understand what their customers want. However, in order to really understand social media, you have to be able to gather it in a large context and treat it statistically, which means you want a lot of observation points in order to make good judgments about that data.

I know the world of analytics is only going to get bigger, and I thank you for sharing SAP’s Guinness World Record for the world’s largest data warehouse.

Ron is an independent analyst, consultant and editorial expert with extensive knowledge and experience in business intelligence, big data, analytics and data warehousing. Currently president of Powell Interactive Media, which specializes in consulting and podcast services, he is also Executive Producer of The World Transformed Fast Forward series. In 2004, Ron founded the BeyeNETWORK, which was acquired by TechTarget in 2010. Prior to the founding of the BeyeNETWORK, Ron was cofounder, publisher and editorial director of DM Review (now Information Management). He maintains an expert channel and blog on the BeyeNETWORK and may be contacted by email at rpowell@powellinteractivemedia.com.