Posts

Ever wonder how to migrate from a relational database management system (RDBMS) schema structure to a big data platform (NoSQL database implementation). This article (part 2 of an overall series; Big Data: part 1) offers some practical advice for such a migration.

In the following sections you’ll find information that will show you the ins and outs of migrating from a traditional RDBMS schema structure to a big data storage structure. While most of the examples are made to be fairly generic, the final structure will be created for use in the “Wide Column / Column Family Store” variety. There are many ways of getting to the end goal. The process I illustrate is one of many, but it does work well for what I’m trying to show.

Warning! Here’s where things will get a little deep, technically speaking. Not to worry though! Just like in American football when you know you’ll be breaking through the defenses line, buckle up your chin-strap and keep going. The reality is that even if this discussion causes you to twitch a bit (indicating that you may be on the upper end of the non-technical speaking population) you’ll probably come through it just fine.

Relational to Composite Structure

For the sake of narrowing the scope of this discussion, I’m going to limit the focus to the “De-normalize and duplicate” approach. We’ll be ignoring the “Balanced” approach because it takes too much time to explain correctly. We’ll also be ignoring the “Model it as-is” approach because that’s a poor a practice and doesn’t perform well.

With our discussion scope sufficiently narrowed, we’ll start by tackling a relatively simple relational structure. The very first thing we’ll need to do is to evaluate which entities can be de-normalized to become what I call super-classes. “Super-class” is not a standard big data term. It’s my term and I find it makes things easier for the initial discussion. I’ll explain why later. Each of these super-classes will be used to help define the new composite structure (an actual Big Data term). We’ll be using the following Entity Relationship Diagram (ERD) to lay out the steps needed to identify our super-classes.

In this example we have a relational structure that is used to manage information about Really Simple Syndication (RSS) feeds. Remember, there are no joins and no referential integrity in the No SQL database world. Thus, we’ll be adapting the relational structure represented in the ERD to one that matches more of a composite structure.

Finding and de-normalizing the entities in our diagram into our super-classes can be as easy as asking a few simple questions. These questions will help us identify the super-class candidates, and the de-normalization will help us create them correctly. Ready? Let’s go!

Does the entity contain foreign keys (references) to other entities?

If the answer to the foreign key question is no, we have to ask an additional clarifying question. Do other entities reference this entity? If yes, in most cases, we have a pure reference entity that will be absorbed into one or more super-classes. Reference entities become part of the super-classes that reference them (de-normalization).

If yes, this is an indicator that the entity could be either an intersection/junction or an actual super-class. To discover which, we’ll ask a clarifying question. Does the entity contain only foreign keys, foreign keys that all make up the primary key or foreign keys with an additional, separate primary key, dates or other superfluous information? A ‘no’ means we’ve identified a super-class. ‘Yes’ indicates that we have identified an intersection/junction entity. We’ll ignore any entities of this type (for now).

Let’s take a look at how this type of analysis works with our existing schema structure. Using the questions we just went through, it looks as though we have three reference entities: “categories”, “sources”, and “item_types.” We also have an intersection/junction entity: “feed_contents.” That leaves the two remaining entities: “feeds” and “contents” as our identified super-classes. That’s it, right? Not quite.

We haven’t touched on the ever-so-popular: (pause for effect…) common modeling jargon, lingo, you know – the terms you just have to know! Let’s get that taken care of now, shall we? The following table shows the common terms that we’ll need to understand to make that all important transition in our thinking when modeling from relational to non-relational.

See! That wasn’t so bad. The key here is that unlike in the relational world we have to being to think about structures in more finite terms. A relational column is broken down into a name-value pair. Primary keys become row keys. Tables are column families and databases are key spaces. Column Families can be related to one another but not in a traditional sense so the concepts of joins and referential integrity do not translate well at all.

Before we get into the resulting structure, we still have to address a few missing parts. Namely, we need to follow through with the de-normalization of the reference entities. Let’s start with the reference table “item_types.” Its information will need to be duplicated into both the “feeds” and the “contents” super-classes. That doesn’t mean simply adding the columns from one table to another. Instead, think of it more as nesting. I’ll use braces and brackets to illustrate table boundaries, nesting depth, and column ownership as follows:

You see, the “item_types” entity literally becomes part of the “feeds” entity. Using that same technique, let’s fast-forward and show how our “feeds” super-class looks after de-normalization.

Along with the modeling terms and the de-normalization technique we just learned about, we’ll also need to know about composite structures and how they can look in Wide Column Stores. Each has their uses, pros, and cons. While I’m going to introduce these structures and their uses, I will not get into all of the pros and cons as there are currently too many discussions for, and arguments against, one structure over another to list them accurately. That said, here we go.

The column family structure is what I would call a “class.” It’s a pretty generic structure and can be easily used to model simple entities.

The super column family structure is derived from what I’ve been calling a “super-class.” It’s the one structure that allows us to model our de-normalized entities and accommodates elements of the column family structure also.

Let’s look at an illustration of how our “feeds” super-class converts into the “feeds” super column family.

I know what you’re thinking: “what about that intersection/junction entity we’ve been ignoring?” You’re right. Now is definitely the right time to bring that up. How are we going to handle the many-to-many relationship? My preferred way is by modeling it explicitly.

While this may seem like a waste of space, this model allows us to efficiently query all the contents for a specific feed as well as all feeds that have a specific content without using the actual super column families making the whole lookup process more efficient. This form of modeling can also be called: Query Pattern Modeling.

How many lookup column families and which data columns you have in those families depends what you’re intending to use in your application data-wise and how you’d like to model your query patterns. Ultimately, modeling the data structure based on the type of data you’re after is the right way to go. I will note that some structures will require much less de-normalization than ours. In our example we opted for super column families to hold the bulk of our data and did our de-normalization for that up-front.

A note on performance: Modeling the intersection/junction explicitly caused us to create two lookup column families. It also caused some obvious, additional, de-normalization for both “feeds” and “contents.” Even with the extra duplication, this method is still very reasonable to populate lookup information for both feed and content detail pages. Both will perform well with this model. Both will cause two lookups: one to query feeds and one to query contents. This will remain constant regardless of the total number of feeds and the total number of contents (a good thing).

In this article, I’ve illustrated how to define a data model for a “Wide Column / Column Family Store” NoSQL database implementation. I’ve shown how to define the model and discussed an applicable modeling technique for our narrowly defined scope.

The information presented is relevant regardless of structure modeling technique you end up using for the “Wide Column / Column Family Store” NoSQL database implementation you choose.

So, you’ve succumbed to the buzz and now you’re looking around trying to make heads or tails of the mass amounts of information out there hyped up as “big data.” Or perhaps you’re even ready to start your own internal project to get your existing applications on the bandwagon. In either case, terrific! Your decision is a good one.

Unfortunately, now comes the flurry of potentially overwhelming questions:

Where do I start?

What are my expectations?

What does big data mean to my company?

What does big data mean in the context of our applications

How do I assess my application needs?

How do I know or determine if big data solutions will work for us?

After some online research, you’ll quickly find that most folks are merely picking a place at the edge of the pool, dipping their toes in here and there to test the water temperature.

The reality is that it is incredibly difficult to define the term big data. Its meaning includes so much more than just storing and using large data sets. When you hear people referring to big data, they’re actually referring to is the use of NoSQL database implementations to store and process large amounts of information.

Don’t be discouraged! In the following sections we’ll get into what those NoSQL databases are and how to identify which is best, if any, for your project[s]. That’s right; the goal here is to provide you with information that will allow you to draw the correct conclusions for your organization and your specific application[s] or project[s]. (In part two of this article, I explain how to migrate from relational databases to NoSQL.)

NoSQL databases aren’t really databases. In fact, they are nothing like a traditional relational database management system (RDBMS). Instead they are implementations of various data stores which do not have fixed schemas, referential integrity, defined joins, or a common storage model. Also, they typically do not adhere to ACID principles (atomicity, consistency, isolation, and durability) and have sometimes widely varied technologies behind them. The term NoSQL (or Not-only SQL) is intended to imply that many of these implementations also support SQL-like query capabilities.

In this big data market where the NoSQL database is king, there are more than 100 different offerings available in various licensed models. The fact that these non-databases vary is no accident. Each distinct implementation has different strengths, weaknesses, and generally accepted uses. However, the bulk of these break down into four major categories based on some common underlying characteristics — as shown in this chart:

Choosing the Right Path

A heavy emphasis should be placed on the definition of your requirements. What are those? Well, that’s a large discussion all by itself. However, I’ll try to quickly paraphrase for the purpose of furthering this topic of discussion: Data requirements are artifacts captured during the process of defining application behavior with respect to gathering, storing, retrieving, or displaying information (data).

For example, in your application are you processing stock quotes, working with CRM data, or processing social information? There are different needs for different application types and thus a varied number of NoSQL implementations, not all of which are designed to be applicable to your needs.

It may be that after careful evaluation, you determine that your current RDBMS approach is valid and appropriate for your application. That’s not a bad thing. Traditional RDBMS certainly has its place and will remain very relevant for business use well into the future.

You see, there’s loads of confusion about this big data thing because there is no One Path concept. It just doesn’t exist. What’s good for one business may not be good for another even though they are doing similar things. There are many factors that go into selecting the right implementation and, honestly, not everyone is careful or critical when evaluating their needs.

An Assessment

So you still think that you might be better off migrating away from RDBMS to a NoSQL implementation. This decision isn’t for the faint of heart. It requires real consideration and planning, which raises several additional questions:

Question 1: How do I know which implementation is correct for my needs?

Fortunately, there are certain high-level criteria that help us get the decision process started. One such is determined by answering this question: Is the application intensive with reads or writes (e.g. large numbers of transactions in an OLTP system or large numbers of reads in an OLAP system) today? If the answer is no, and we’re merely dealing with volumes of information, we can automatically exclude items that fall into the “Column Families/Wide Column Stores” category. There are always caveats but that’s a good, general rule of thumb. An initial litmus test if you will.

Let’s test this with an example: We’ve got a compiled application that processes transactions for a global book seller. This application sees no fewer than 70,000 transactions a minute, 24 hours per day, 7 days per week. Which category of NoSQL database implementations fits? It’s obvious, right? You bet! It’s the “Column Families/Wide Column Stores” category.

Let’s try another. In this example we’ve got a web-based application that enables title companies to enable secure signing of large numbers of documents. The transactional volume isn’t large nor are the numbers of reads. Which category fits now? Right! It’s the “Document Store” category. This is primarily because only a small amount of data is changing; the signatures and whole documents need to be stored.

I think you’ve probably got a handle on how to determine initial fits by category now. So long as we ask the right questions, with the right view of our data requirements, we should always be able to identify the right category to start trying to assess which implementations might meet our big data needs.

We’re done then, right? We can pick any random implementation from the category we’ve identified, get it installed, configured and deploy our application[s] so we can start touting our Big Data story! Hold on there, we’re not really done yet.

Question 2: What will we be using to communicate?

The language[s] your application[s] use to communicate is an important consideration when choosing the right NoSQL database implementation. In the earlier examples we were able to identify the right categories based on our knowledge of the requirements and the application usage. Now we need to narrow down the field of choices and zero in on what is likely to be right for our needs. Unfortunately we actually don’t have enough information to identify, even at a high level, which implementation matches our communication needs. Yet!

We’ll need help to answer this and other questions, so we’ll employ the use of another table. In this table we’ve laid out a few of the most popular NoSQL database implementations, their protocol[s], API[s], licenses and replication models:

Clear as mud? Right! Basically what this table shows is exactly what you’ll find on the web: Every NoSQL database implementation has its own way of getting data into and out of its store. Fortunately though, this is precisely what we need to match our requirements with implementations in our selected category because, in the end, our application[s] will need to know how to interact with it.

Shall we try another example?: This time, we’ll use our high-transaction application for our global book seller which we’ve already matched to NoSQL database implementations in the “Column Families / Wide Column Stores” category. Now we take a look at our application’s communication requirements. The application today communicates via an ODBC driver (a compiled binary). We could make some assumption that if we use one compiled driver, we can use any of them. That line of thinking is not un-heard of and would lead us to select Cassandra from the above tables because it resides in the right category and because of it also uses a compiled driver: its Thrift driver. A stretch? Perhaps a small one, but it’s really close enough for the purposes of our discussion.

Can it really be that easy? Well, yes and no. There are many other factors that we should consider when choosing a NoSQL database implementation. Some of these will leverage information from the table above. The additional criteria can be determined by answering the following questions:

Is this a commercial application? If so, is this for internal or external use? You see, there may be specific licensing restrictions that must be considered, especially in commercial, for-profit, applications.

Are there existing deployment restrictions such as supported operating systems or others like single vs. multi-server, replication model, or even needing to meet specific backup requirements for the company’s disaster recovery plan?

Not asking for or failing to give heed to these answers may cause your project to fail due to a poorly selected match.

I hope this has helped in some small way to eliminate some of the confusion around big data and what it might mean for your company.

There are far too many areas within web development in need of innovation today. However, as I see it the biggest areas for innovation in this space are:

Big Data – Everyone is talking about big data and how important and impactful it is on business today. However, I’m not reading much about it when it comes to the web. Let’s face it, businesses who are worth their salt have information and some form of branding and/or marketing on the web. That’s a pretty simplistic achievement by today’s standards. The next big leap for business is to move their practices, information access, and even their management to the web. That will mean the need for broader, device independent access, support for heterogeneous data sources both big and small, embracing the mobile user base regardless of deployment preference: on-premise or cloud, and a lot more use of machine learning and predictive analytics geared toward real-time decision making. All of this encompasses our use of what’s called Big Data today. However, the innovations surrounding the use of it today are small and will pale in comparison to what the future holds.

Web Intelligence – As our knowledge of design and user experience expands, we begin to expand our belief system; re-examining what is and isn’t possible in the world wide web. We already employ intelligent algorithms used for machine learning and predictive analytics.

However, I envision a world where those same algorithms, spread across thousands of source systems, will join into a global intelligence network providing behavior pattern recognition, experience insights, and even pattern-based categorization.

Yes, the web is about to become a whole lot more intelligent and Komaya will be right there helping to shape it, mold it, and to help our clients benefit greatly from it.

Performance – I believe that over the next 3-5 years we will see vast improvements over the current forms of image and video compression of today. Development tools will be smarter and automatically analyze images to find the sweet spot, such that the quality is high enough and the file size is as small as possible. More than that though, web servers of the future will work harder and smarter and will have built-in features that make experience and performance decisions on-the-fly. Those server-side algorithms will process each element of a page transmission and evaluate things like:

What is the client type?

What is the overall page size being requested?

Do images need existing transparency?

Do images need their animated elements?

Do images contain non-visible data?

and will be able to act accordingly to ensure the down-stream clients, regardless of type get the best possible speed and visual experience. Rest assured that every innovation we see between now and even 5 years from now will itself lead to even more opportunities. It’s an exciting time to be in the web space.

How will sites of the future effect revenue?

The short answer is they already do in some very big ways. As new web standards emerge, we see a continual need from our clients to include more and more visual elements. These changes are exciting and new and draw more and more traffic to their sites, which is good! More traffic can equate to more revenue. However, more imagery equates to bigger page sizes which can become big performance problems very quickly. Page performance effects revenue, plain and simple. Small sites, like AutoAnything, cut their load time in half, and saw revenue grow by 13 percent. Large sites, like Amazon have shown that for every 100 milliseconds of performance slowdown, they experience a 1 percent drop in revenue. This is especially true for mobile users. Large pages mean long load times and poor perceived performance. While that does mean big money for the mobile carriers who are all too eager to cash in on the data plans, it also means a lot of frustration for downstream mobile users which equates to less money for those sites as traffic is driven away by poor performance. Sites of the future will have all of the essential elements in-check, striking a good balance between stellar visual experience and high performance.

How will Komaya’s products and solutions be different five years from today?

Current trends are pointing toward a world in which the web plays larger, more relevant and even strategic roles in everyday business. It’s no secret that web implementations are as varied as the sands in the Sahara. Today’s businesses are already seeing the need to expand their web presence to include social, marketing, and other such information. It’s not a big stretch to say that super visual, interactive, and data oriented sites will emerge as the new normal in the next couple of years. Those exciting new sites will have elements that mix state-of-the-art visualization with data that spans a variety of source systems, and languages, each tailored to the user’s expectations and experience. Yes, it’s apparent that the need for smarter, real-time, on-the-fly operations, data presentations, and visualizations becomes the new norm, even for smaller businesses in which staying alive and continuing to be competitive becomes paramount.
Komaya is gearing up to meet these challenges head-on. In fact over the next three to five years we’ll have sites that employ a wide variety of visualization and data technologies with incredible performance! We’ve got algorithms that interpret or infer what a users’ needs are and make requests to cache necessary data preemptively. I think that’s enough of a peak behind the curtain for now though.

What is the most exciting part of your job?

There are many exciting elements in what I consider my job. Not the least of which are the many opportunities I have to interact with our prospects and clients. I really do enjoy helping them realize their dreams and see their successes! A close second is the opportunity to interact and rub shoulders with some of the smartest developers in the world. There are so many opportunities in front of us. From technologies to explore and employ, to stale areas of web design ripe for innovation, there are far too many variables for things not to be constantly exciting and new around here. Our very existence depends on our ability to make our clients’ visions a reality. That means that things that seem impossible today, are the very things we’re toying with. The goal for all of us at Komaya is to make today’s impossibilities tomorrow’s possibilities and even realities. I believe that we will be a key player in revolutionizing the way we think about web solutions, their varied implementations, and much more. You see, it doesn’t really matter what’s impossible today. To quote Lewis Carroll’s Through the Looking Glass: “Why, sometimes I’ve believed as many as six impossible things before breakfast.”