Digital Data Science and the Analytics Warehouse

Here at EY, we spent a good chunk of time last year helping clients build out digital capabilities in the analytics warehouse. In some cases, that meant building traditional data stores and traditional data models in Oracle and SQL Server. But for the most part, it meant building analytics capabilities on top of Hadoop systems; that’s been bracing, difficult, sometimes frustrating, and always interesting. I remain convinced that the future of analytics lies in working at a very detailed level of the data (though I think there’s much to be debated about exactly which level of detail is right). I also remain convinced that massively parallel architectures and in-memory databases will be critical to that future and that these new technologies bring important capabilities to our type of analytics. I’ve also seen first-hand that these new technologies are often immature in ways we simply aren’t used to in our core technology stacks. We’ve experienced problems in GUIs, in memory management, in system administration, and in connectivity that have delayed and frustrated projects and that are reminiscent (to me at least) of my days programming personal computers back in the ’80s and early ’90s. It can be a mess.

Over the course of 2015, I hope to tackle some of the key issues in pursuing this type of new-technology analytics warehouse. There’s probably no practical limit to the list of potential topics, but at the very least I hope to consider the following in considerable depth.

Modeling Digital Data: What does it mean to create a data model in the big data world? Isn’t the whole point of these technologies that we don’t need data models? Well, no. I don’t think that is the point. It’s true that the type of data models we need are fundamentally different from either the normalized relational models or the aggregated cubes we’ve traditionally built. But they still exist. At a theoretical level, I’d argue that they necessarily exist even if they are purely temporary and are never instantiated in a physical data layout but only in the temporary ordering of data as code executes. On a practical level, I think big data often requires significant modeling and re-interpretation of the data to maximize its effective use. I hope to explore this in some detail.

Perhaps this sounds too theoretical. I also hope to take up some practical problems in digital data modeling, including the specific levels of digital detail data that are potentially interesting, the problem of tying digital data to traditional customer data, and some techniques for structuring data that are useful for funnel analysis, pathing, merchandising, and customer journey analysis.
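To make the funnel-analysis case concrete, here is a minimal sketch of what rolling hit-level data up to a visit-level funnel measure might look like. The page types, funnel stages, and sample records are all invented for illustration; a real model would obviously be richer.

```python
from collections import defaultdict

# Hypothetical hit-level records: (visit_id, page_type), in time order.
# The page taxonomy and funnel definition below are illustrative only.
hits = [
    ("v1", "home"), ("v1", "product"), ("v1", "cart"),
    ("v2", "home"), ("v2", "product"),
    ("v3", "home"),
]

FUNNEL = ["home", "product", "cart", "checkout"]

def visit_funnel_depth(hits):
    """Roll hit-level data up to a visit-level funnel-depth measure."""
    stages = defaultdict(set)
    for visit_id, page_type in hits:
        if page_type in FUNNEL:
            stages[visit_id].add(page_type)
    # Depth = deepest contiguous funnel stage reached within the visit.
    depth = {}
    for visit_id, seen in stages.items():
        d = 0
        for stage in FUNNEL:
            if stage not in seen:
                break
            d += 1
        depth[visit_id] = d
    return depth

print(visit_funnel_depth(hits))  # {'v1': 3, 'v2': 2, 'v3': 1}
```

The point of the sketch is that even this trivial roll-up is a modeling decision: the funnel ordering and the notion of "depth" are interpretations layered onto the raw hits, not properties of the data itself.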

Statistical ETL: We often describe a basic digital data model as having five levels (ranging from hit level up to visitor level). It’s a fairly trivial exercise to build this type of data model, and it’s important to realize that while some of this model is similar to what we would have constructed in a traditional relational database, it is also quite useful in a big data, Hadoop world. However, as I’ve argued for some time, one of the great challenges of digital data is that it doesn’t aggregate well. When you build visit- or visitor-level detail files, you are necessarily creating some kind of aggregation. In traditional relational models, we mostly relied on basic counting and summing to do the job of aggregation. That was never ideal. There’s a powerful role here for statistical and analytic methods like clustering and self-organizing maps to build more interesting aggregations as PART of the ETL that helps instantiate the data model. It isn’t that using statistical analysis to create new data fields is extraordinary; it’s that its role in big data systems has been poorly understood.
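As a sketch of what "clustering as ETL" could mean in practice: instead of summing hits into visit totals, you might cluster visit-level feature vectors and carry the cluster label forward as a new field at the visitor level. The tiny hand-rolled k-means and the two-feature visit records below are purely illustrative assumptions, not a recommended implementation (in practice you would reach for a proper library).

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """A tiny k-means, just enough to illustrate clustering as an ETL step."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assign each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(p, centers[i])))
            clusters[nearest].append(p)
        # Recompute each center as the mean of its assigned points.
        for i, members in enumerate(clusters):
            if members:
                centers[i] = tuple(sum(dim) / len(members)
                                   for dim in zip(*members))
    labels = [min(range(k),
                  key=lambda i: sum((a - b) ** 2
                                    for a, b in zip(p, centers[i])))
              for p in points]
    return labels, centers

# Hypothetical visit-level features: (pages_viewed, minutes_on_site).
visits = [(2, 1), (3, 2), (2, 2), (20, 30), (22, 28), (19, 31)]
labels, centers = kmeans(visits, k=2)
```

The resulting labels (here, roughly "bounce-like" versus "engaged" visits) become a derived field written during ETL, which is a far more interesting aggregation than a raw page count.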

Technology Choices: What’s the role of Hadoop in the analytics ecosystem? What about in-memory systems? Is there still a role for traditional database systems? How much do the answers to these questions depend on the type of analytics and reporting you intend to focus on? And what, if anything, does the growing role of powerful detail-level analytic technologies imply for the future of digital analytics software? These are all challenging questions that drive important real-world decisions with huge implications for budget, staffing, and organization. I’m not going to pretend there’s anything like one right answer to ANY of these questions, much less some perfect idealized technology stack. But I do hope to make clear some of the fundamental connections between the business problems you hope to solve and the technologies you intend to deploy. Along the way, I’m going to write about the new ways we’ve been describing the digital analytics technology stack in terms of a diverging series of maturity curves.

Big Data Analytics Methods: One of my few stock presentations is a lightly technical description of what big data is. I’ve written on this in the past, but the presentation encapsulates some very nice visualizations that I think drive home the point that big data is important not because of any traditional IT concerns (embodied in the “Four V’s”) but because it attempts to tackle a set of analysis problems that involve the order, time, and pattern of events. I continue to believe that this is a compelling explanation of what big data is, why it’s different, and how it impacts our traditional notions of technology stack, data integration, and statistical analysis. That still leaves open the question: what is the right set of analytic techniques for digital problems? My answers are no more than exploratory, but at the least I hope to provide a few useful techniques as well as explain why some of the technology-stack directions being pursued within Hadoop are not fruitful.
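A small example of what an order-and-pattern question looks like in code: counting the ordered sequence of events each visitor took. The event stream below is invented for illustration, but the shape of the problem (sort by visitor and time, then reason about the sequence) is exactly what row-at-a-time SQL aggregation handles poorly.

```python
from collections import Counter, defaultdict

# Hypothetical event stream: (visitor_id, timestamp, event) tuples.
events = [
    ("a", 1, "search"), ("a", 2, "view"), ("a", 3, "buy"),
    ("b", 1, "view"),   ("b", 2, "search"), ("b", 3, "view"),
    ("c", 1, "search"), ("c", 2, "view"), ("c", 3, "buy"),
]

def path_counts(events):
    """Count ordered per-visitor event paths -- an order/time/pattern
    question rather than a count-and-sum question."""
    by_visitor = defaultdict(list)
    # Sorting by (visitor, timestamp) reconstructs each visitor's sequence.
    for visitor, ts, event in sorted(events):
        by_visitor[visitor].append(event)
    return Counter(tuple(path) for path in by_visitor.values())

print(path_counts(events))
```

Here two of three visitors followed the search → view → buy path; it is that kind of pattern frequency, not any volume-of-data concern, that makes the analysis distinctively "big data."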

Organization and Process: We recently completed a really enjoyable project for a client: a scan of the enterprise marketplace around analytics organization. For the project, we got to interview a bunch of folks about their enterprise’s approach to analytics and some of the successes and struggles that have followed from that approach. It was fascinating stuff. Along the way, some really clear cautionary tales emerged about how to create value from analytics and about the biggest gotchas in organizing an analytics center of excellence – particularly one focused on advanced data science. As with the technology stack, it’s absurd to expect a single right answer to the organization of analytics in the enterprise. The structure and culture of the individual organization have too much to do with what is possible and right to make that a reasonable expectation. One particularly cogent interviewee described to me how the inevitable cycles of centralization and decentralization of analytics in the organization were somewhat Hegelian, with thesis followed by antithesis followed by synthesis and then a (never-ending and, in this case, non-convergent) repetition of the cycle. But that shouldn’t engender complete cynicism. There are individual tactics and processes that seem to promote success regardless of where you are in that never-ending cycle and, depending on the type of business problems that are most pressing, there are probably good reasons why it makes sense to pivot your organization at certain times in certain directions.

While I do intend this to be a pretty extensive series and I expect to be writing on these topics for much of the year, I haven’t laid out a series roadmap at any level of detail and I’m going to freely intersperse other topics along the way. I’ve been working, for instance, on a discussion with fellow EY analyst Loren Hadley of the difficulties and challenges of doing analytics on ecommerce sites based in China and I hope to have that ready in the next week or two. Fascinating stuff, and a useful reminder that big data and data science are still just a piece of what digital analytics is all about.