A New World of Data

Finally, on a quiet 17-hour flight over the Pacific Ocean to Australia with no Gogo internet access, I completely disconnected from all the constant digital buzz and rejuvenated my true introverted spirit. No email, forums, Twitter, LinkedIn, Facebook, Instagram, conference calls, texts or instant messages for Q&A tech support. In this dreamy long stretch of treasured total silence, I contemplated my future and seized the moment to review hybrid solution architecture designs that I have been collecting to launch the Exploring Oceans of Data series.

It is no secret that I love to keep a pulse on the market. In this series, I will first share a few significant industry shifts that are slowly but surely changing data platform architectures. Then I will introduce an example of a modernized hybrid data platform logical technical architecture. In future articles, I will dive deeper into specific solution areas discussing technology capabilities, target use cases, lessons learned and how to get started exploring your own data with these innovations.

How Hadoop Changed the Landscape

Over the past four years since Dr. David DeWitt gave a Hadoop keynote at PASS Summit, I have been watching big data architectural design patterns evolve. Back then, there were rumors of NoSQL killing the enterprise relational data warehouse world. It didn’t take long for those bleeding-edge, early NoSQL adopters to backpedal on those claims. Facebook publicly discussed adding relational back into their analytics architecture in 2013. Grossly oversimplifying what they learned, relational databases excel in areas where NoSQL struggles. It was a classic case of using the right tool for the job. NoSQL and SQL are complementary to one another. That was indeed the same message Dr. David DeWitt provided with his bridge diagram depicting the first SQL Server PolyBase vision of unifying the NoSQL and SQL worlds. It is exciting that this capability will be available in SQL Server 2016 along with stretch databases to the cloud, embedded R and so much more.

The distributed, rapid storage of just about anything, unstructured or structured, in the Hadoop Distributed File System (HDFS), combined with the capability of contemporary analytics tools to query that data in a massively parallel and timely manner, is appealing, powerful and driving the data platform modernization movement. Despite all the loud hype, big data and Internet of Things (IoT) projects are still in early adoption. Relational databases and Excel remain the primary data sources in most organizations right now. However, that should not deter organizations from preparing for predictable change. The impact of big data will indeed be BIG!
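To make the "store it in pieces, query it in parallel" idea a bit more concrete, here is a toy sketch in Python. It is not Hadoop code and uses no Hadoop API. The sample log lines, the `PARTITIONS` list and both function names are made up for illustration, and threads on a single machine stand in for the nodes of a cluster. Each worker counts events in its own partition, then the partial counts are merged.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical log lines, already split into partitions, loosely the way
# a distributed file system spreads the blocks of a file across nodes.
PARTITIONS = [
    ["web click", "web view", "mobile click"],
    ["mobile view", "web click"],
    ["sensor reading", "web view", "web click"],
]


def count_partition(lines):
    """Map step: count events per source within a single partition."""
    return Counter(line.split()[0] for line in lines)


def parallel_count(partitions):
    """Run the map step on every partition, then merge the partial counts."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(count_partition, partitions))
    total = Counter()
    for partial in partials:
        total.update(partial)  # Reduce step: add the partial counts together.
    return total


source_counts = parallel_count(PARTITIONS)
```

The point of the pattern is that no single worker ever needs to see all of the data. Adding more partitions and more workers is how these platforms keep query times reasonable as volumes grow.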

Another market force that enabled Hadoop growth is Open Source momentum. Can you believe Microsoft loves Linux, develops for iOS first and no longer ignores Android? Power BI shifted from a proprietary visualization approach to an HTML5, D3.js visualization framework shared publicly on GitHub. These are just a few examples of changing times, and Microsoft is not standing alone. Other vendors across the industry have made similar moves. Once major vendors in our industry embraced Hadoop, seeing opportunities to monetize it by providing better tools, security, maintenance and support, the adoption risks were reduced. While the Hadoop ecosystem continues to rapidly improve, exponentially growing volumes of data from digital devices are just beginning to overwhelm traditional data architectures.

Cloud Data Gravity

Another undeniable force that I am seeing in our industry’s mid-life crisis was also discussed at the 2015 Gartner BI Summit. The center of data gravity is moving as more apps are delivered via cloud Software as a Service (SaaS). In the past, I might only have had to extract Salesforce data along with other on-premises app data into a client’s on-premises data warehouse. Today there is a constantly growing list of popular cloud app data sources that analytics pros need to include in decision-making processes. If you neglect the ocean of cloud and IoT data sources that your opponents do include in their analytics, you will lose your competitive edge and may miss a key window of opportunity in the hyper-competitive global economy. Don’t believe me? Here is a competitive reality check. By 2014, 89% of the firms on the 1955 Fortune 500 list had vanished. Steven Denning pointed out in Forbes that “fifty years ago, the life expectancy of a firm in the Fortune 500 was around 75 years. Today, it’s less than 15 years and declining all the time.” Analytics is ultimately about competitive advantage and survival.

There is also attractive cloud streaming analytics, cloud data warehouse and cloud data lake technology that is fairly simple, fast and cost-effective to spin up, scale up or scale down. Compare that with making a multi-million dollar hardware purchase that is almost immediately outdated and then struggling to stitch open source technologies together in-house. In Microsoft’s cloud, called Azure, I can simply “plug-and-play” into a variety of cloud data sources to develop high-scale, advanced analytics solutions with Revolution R, Python, Azure ML and Power BI. Data virtualization is making a rebound, enabling folks to analyze big data without moving it. Another cool analytics capability that I have recently reviewed is a semantic layer for Hadoop from AtScale that felt like a vNext Analysis Services for big data. There are many more cloud data sources and analytics advancements rolling out every single day. These are the types of topics that I plan to cover in my Exploring Oceans of Data series.

Still Afraid of the Cloud?

If you are still afraid of the cloud, I highly recommend reading Steven Sinofsky’s blog, Why Sony’s Breach Matters. It is one of the best wake-up call articles that I have read from an extremely knowledgeable source. I am also seeing the CIA and other organizations with sensitive data adopt cloud. Since I support both early adopters and laggards around the world, I will also share that my laggards have had by far the most issues. I suspect top technical talent moves to companies that embrace the latest and greatest our industry has to offer. Surprisingly, there are a lot of organizations running wild with unsupported or unpatched apps, databases and operating systems. Those groups are at a much higher risk of being the next hacker headline news story. I’d personally rather trust my own data to any of the top three cloud provider environments than to any on-premises laggard environment…and I am a former cloud cynic.

You may not know it from my blog but I was NOT a cloud fan initially. I was skeptical and totally annoyed by cloud, cloud, cloud. I loathe recurring fees for anything and tend to be a technology control freak. After experiencing the difference in my own app development this past year, I am genuinely warming up to cloud. I don’t miss server installation, configuration or administration one bit. I never liked those tasks nearly as much as developing an app, playing with data or finding interesting insights! I super love the frequency of enhancements and pushing an easy button to get scalable, secure, backed up data platform components. Furthermore, there is a priceless peace of mind that comes with knowing someone else is on the hook along with me to make sure everything works. I guess you could say that I have finally seen the cloud light. Now it is time for me to jump on the hybrid train combining on-premises with cloud – in a new world of data.

Traditional versus Modern Approaches

Where to start? How about clarifying that big data analytics is not traditional business intelligence with more data. In a hybrid big data world, we will end up redefining information life-cycle processes. We don’t need to replace our awesome enterprise data warehouse or OLAP cubes despite in-memory advances. In a big data world, ROLAP semantic models still play a key role. The design patterns for querying data “accurately over time” have not changed. Check out Kimball’s Cloudera session. If you just use a bunch of “views” or “lenses”, you end up with a big data mess. Thank goodness a few of my old-timer skills are still relevant in a big data world!

Source: Microsoft

Hybrid Big Data Architecture

We do end up changing the patterns of how data gets loaded and updated. Instead of gathering detailed requirements at a table and field level, then extracting, transforming and loading that data into a data warehouse (“schema on write”), we are now ingesting and loading a variety of data into a data lake for customers to explore (“schema on read”). Data lakes are net new additions for most modern data architecture designs.
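The difference between the two loading patterns is easier to see side by side. Below is a minimal sketch in plain Python, not tied to any specific warehouse or data lake product. The sample events, the field names and all function names are hypothetical. With schema on write, each record is parsed and shaped into fixed columns at load time. With schema on read, the raw records are stored untouched and a structure is applied only when someone queries them.

```python
import json

# Hypothetical raw events, as they might arrive from different sources.
RAW_EVENTS = [
    '{"customer": "A", "amount": "19.99", "channel": "web"}',
    '{"customer": "B", "amount": "5.00"}',
    '{"customer": "C", "amount": "12.50", "device": "sensor-7"}',
]


def load_schema_on_write(raw_lines):
    """Shape every record into fixed (customer, amount) columns at load time.

    Fields that were not part of the agreed requirements are dropped here.
    """
    table = []
    for line in raw_lines:
        record = json.loads(line)
        table.append((record["customer"], float(record["amount"])))
    return table


def load_schema_on_read(raw_lines):
    """Store the raw records untouched, as a data lake would."""
    return list(raw_lines)


def query_lake(lake, field, cast=str):
    """Apply a schema at query time, skipping records that lack the field."""
    values = []
    for line in lake:
        record = json.loads(line)
        if field in record:
            values.append(cast(record[field]))
    return values


warehouse = load_schema_on_write(RAW_EVENTS)
lake = load_schema_on_read(RAW_EVENTS)

total_from_warehouse = sum(amount for _, amount in warehouse)
total_from_lake = sum(query_lake(lake, "amount", float))

# Only the lake can still answer a question nobody planned for up front.
devices = query_lake(lake, "device")
```

Both paths can answer the planned question about amounts. Only the lake still holds the `device` field, because the warehouse load discarded everything outside its predefined columns. The trade-off is that every lake query has to carry its own parsing and type handling.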

Source: Microsoft

The hybrid big data world involves an updated data ecosystem along with related analytics and productivity tools. Here is a higher level overview of the components that I am commonly seeing in modernized technical architecture plans.

Source: Microsoft

In the next Exploring Oceans of Data series article, I will dive into data ingestion. It is an area that I need to improve upon. I typically skip it to indulge in the data analysis fun! Streaming analytics, event hubs, Azure Data Factory and even how to use SQL Server Integration Services (SSIS) with these hybrid big data technologies are fabulous skills to master. I just saw a salary survey from The Verge mentioning that cloud architects rake in over $160,000 USD. As mentioned in my last article, it also looks like these jobs won’t be completely automated in the future, in contrast to the dreary forecast for statisticians, programmers and many other professions. We are beginning to see a future world of man versus machine where data visualization and basic data analysis have become a commodity, cognitive intelligence has matured and smart robots surround us. Get ready for an exciting adventure.

Jen Underwood is a Senior Director at DataRobot and founder of Impact Analytix, LLC. She has a unique blend of product management and “hands-on” experience in data warehousing, reporting, visualization, and advanced analytics. In addition to keeping a constant pulse on industry trends, she enjoys digging into oceans of data to solve complex problems with machine learning.
Over the past 20 years, Jen has held worldwide product management roles at Microsoft and served as a technical lead for system implementation firms. She has experience launching new products and turning around failed projects. Most recently she provided advisory, strategy, educational content development, and marketing services to 100+ technology vendors through her own firm. She has been mentioned by KD Nuggets, Information Management and Forbes for her work. She also has written for InformationWeek, O’Reilly Media, and numerous other tech industry publications.
Jen has a Bachelor of Business Administration – Marketing, Cum Laude from the University of Wisconsin, Milwaukee and a post-graduate certificate in Computer Science – Data Mining from the University of California, San Diego. She was also honored to be a former IBM Analytics Insider, Tableau Zen Master, and Top 10 Women Influencer.