Posts Tagged ‘Big data’

NoSQL database supplier Couchbase says it is tweaking its key-value storage server to hook into Fusion-io's PCIe flash ioMemory products – caching the hottest data in RAM and storing lukewarm info in flash. Couchbase will use the ioMemory SDK to bypass the host operating system's I/O subsystems and buffers and drill straight into the flash cache.

Can you hear it? It's starting to happen. Can you feel it? The biggest single meme of the last two years, Big Data/NoSQL, is mashing up with PCIe SSDs and in-memory databases. What does it mean? One can only guess, but the performance gains to be had using a product like Couchbase to overcome the limits of a traditional tables-and-rows SQL database will be amplified when optimized and paired up with PCIe SSD data stores. I'm imagining something like a 10X boost in data reads/writes on the Couchbase back end, and something more like real-time performance from workloads that might previously have been treated like a Data Mart/Data Warehouse. If the move to use the ioMemory SDK and directFS technology with Couchbase is successful, you are going to see some interesting benchmarks and white papers about the performance gains.
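
The RAM-over-flash tiering idea described above can be sketched in a few lines: hot keys live in a small in-memory LRU, and anything that falls out of it is demoted to a slower "flash" store. This is just an illustration of the concept, not the Couchbase or ioMemory API; the `TieredCache` class and its tier sizes are made up for the example.

```python
# Minimal sketch of RAM-over-flash tiering: hot keys in an LRU dict,
# demoted keys in a stand-in "flash" dict (which a real system would
# back with ioMemory or an SSD file).
from collections import OrderedDict

class TieredCache:
    def __init__(self, ram_capacity=2):
        self.ram = OrderedDict()          # hot tier, LRU-ordered
        self.flash = {}                   # warm tier stand-in
        self.ram_capacity = ram_capacity

    def put(self, key, value):
        self.ram[key] = value
        self.ram.move_to_end(key)         # mark as most recently used
        if len(self.ram) > self.ram_capacity:
            cold_key, cold_val = self.ram.popitem(last=False)
            self.flash[cold_key] = cold_val   # demote coldest to flash

    def get(self, key):
        if key in self.ram:
            self.ram.move_to_end(key)     # refresh hotness
            return self.ram[key]
        if key in self.flash:
            value = self.flash.pop(key)
            self.put(key, value)          # promote back to RAM
            return value
        return None
```

The point of the sketch is the promotion/demotion traffic between tiers: the working set stays in RAM while the long tail sits in flash, which is exactly where a PCIe SSD's latency advantage over spinning disk pays off.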

What is Violin Memory Inc. doing in this market segment of tiered database caches? Violin is teaming with SAP to create a tiered cache for the HANA in-memory database from SAP. The SSD SAN array provided by Violin could be multi-tasked to do other duties (providing a cache to any machine on the SAN network). More likely, though, this product would be a dedicated caching store to speed up all operations of a RAM-based HANA installation, accelerating online transaction processing and parallel queries on real-time data. No doubt SAP users could stand to gain a lot if they are already invested heavily in the SAP universe of products. But for the more enterprising, entrepreneurial types, I think Fusion-io and Couchbase could help get a legacy-free group of developers up and running with equal performance and scale. Whichever one you pick is likely to do the job once it's been purchased, installed and is up and running in a QA environment.

In Part One we covered data, big data, databases, relational databases and other foundational issues. In Part Two we talked about data warehouses, ACID compliance, distributed databases and more. Now we'll cover non-relational databases, NoSQL and related concepts.

I really give a lot of credit to ReadWriteWeb for packaging up this three-part series (started May 24th, I think). It at least narrows down what is meant by all the fast-and-loose terms white papers and admen are throwing around to get people to consider their products in RFPs. Just know this, though: in many cases the NoSQL databases that keep coming onto the market tend to be one-off solutions created by big social networking companies who couldn't get MySQL/Oracle/MS SQL to scale in size/speed sufficiently during their early build-outs. Just think of Facebook hitting the 500-million-user mark and you will know that there's got to be a better way than relational algebra and tables with columns and rows.

In Part 3 we finally get to what we have all been waiting for: non-relational databases, the so-called NoSQL. Google's MapReduce technology is quickly shown as one of the most widely known examples of a NoSQL-style distributed system that, while not adhering to absolute or immediate consistency, gets there with 'eventual consistency' (consistency being the big C in the acronym ACID). The coolest thing about MapReduce is the similarity (at least in my mind) it bears to the SETI@home project, where 'work units' were split out of large data tapes, distributed piecemeal over the Internet and analyzed on people's desktop computers. The completed units were then gathered up and brought together into a final result. This is similar to how Google does its big data analysis to get work done in its data centers. And it carries on in Hadoop, an open-source version of MapReduce started at Yahoo and now part of the Apache organization.
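
The map/shuffle/reduce split described above fits in a toy example. Here's a word count, the classic MapReduce demo: map emits (word, 1) pairs, a shuffle groups them by key, and reduce sums each group. A real framework like Hadoop distributes each phase across machines; this sketch runs the same three phases in one process, just to show the shape of the model.

```python
# Toy MapReduce word count: map -> shuffle -> reduce, single process.
from collections import defaultdict

def map_phase(documents):
    # Emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    # Group all emitted values by their key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Collapse each group to a single count.
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data big ideas", "data streams"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
# counts is {"big": 2, "data": 2, "ideas": 1, "streams": 1}
```

The SETI@home analogy holds up nicely here: the map phase is the "work units" going out, and the reduce phase is the completed units coming back together.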

Document databases are cool too, and very much like an object-oriented database where you have a core item with attributes appended. I think also of LDAP directories, which have similarities to object-oriented databases. A person has a 'Common Name', or CN, attribute. The CN is as close to a unique identifier as you can get, with all the other attributes strung along, appended on the end as they need to be added, in no particular order. The ability to add attributes as needed is like 'tagging' the way social networking, picture-sharing and bookmarking websites do it. You just add an arbitrary tag in order to help search engines index the site and help relevant web searches find your content.
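
That CN-plus-appended-attributes idea sketches easily: each record is a free-form dict keyed by something CN-like, and attributes (or tags) get appended later without touching any schema. The `upsert` helper and the field names below are made up for illustration; real document stores such as CouchDB or MongoDB persist and index documents like these.

```python
# Sketch of a schema-less document store: a CN-like key, attributes
# appended in any order, serialized as JSON.
import json

people = {}

def upsert(cn, **attributes):
    # Fetch or create the document, then append whatever was given.
    doc = people.setdefault(cn, {"cn": cn})
    doc.update(attributes)
    return doc

upsert("Jane Doe", title="Analyst")
upsert("Jane Doe", tags=["nosql", "ldap"])   # schema grows on demand

# Round-trip through JSON, the way a document database would store it.
record = json.loads(json.dumps(people["Jane Doe"]))
```

Notice there's no ALTER TABLE anywhere: the second `upsert` adds a field the first one never mentioned, which is exactly the tagging-style flexibility the article is describing.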

The relationship between graph databases and mind-mapping is also very interesting. There's a good graphic illustrating a graph database of blog content that shows how relation lines are drawn and labeled. Having used mind-mapping products before, I now have a much better understanding of graph databases. Nice parallel there, I think.
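
The labeled relation lines in that graphic boil down to edges carrying names. A minimal sketch, with node and label names invented for the example (not any particular graph database's API):

```python
# Sketch of a labeled graph of blog content: nodes are posts, authors
# and tags; each edge stores the relation's label alongside the target.
from collections import defaultdict

edges = defaultdict(list)           # node -> [(label, other node)]

def relate(subject, label, obj):
    edges[subject].append((label, obj))

relate("post:42", "written_by", "author:alice")
relate("post:42", "tagged_with", "tag:big-data")
relate("author:alice", "follows", "author:bob")

def neighbours(node, label):
    # Follow only the edges carrying the requested label.
    return [other for (lbl, other) in edges[node] if lbl == label]
```

Querying `neighbours("post:42", "tagged_with")` walks only the tag edges, which is the graph-database equivalent of tracing one branch of a mind map.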

At the very end of the article there's mention of NewSQL, of which Drizzle is an interesting offshoot. Looking up more about it, I found it interesting as a fork of the MySQL project. Specifically, Drizzle factors out tons of functions that some folks absolutely need but most don't always use (like, say, 32-bit legacy support). There has been a lot of work to get the code smaller, so the overall line count went from over 1 million for MySQL to just under 300,000 for the Drizzle project. Speed and simplicity are the order of the day with Drizzle. Add a missing function by simply loading the plug-in into the main app, and you get back whichever MySQL features you were missing.
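
Drizzle's trick of shipping a small core and restoring features as plug-ins can be sketched as a simple registry: the core looks a feature up by name and only pays for what was actually loaded. The registry and the `checksum` feature below are invented for illustration, not Drizzle's actual plug-in interface.

```python
# Sketch of a plug-in registry: features register under a name, and
# the core dispatches by name, failing cleanly if a feature was never
# loaded (i.e. was "factored out" of this build).
plugins = {}

def register(name):
    def wrap(fn):
        plugins[name] = fn
        return fn
    return wrap

@register("checksum")
def checksum(row):
    # Trivial stand-in feature: byte-sum of a row, mod 256.
    return sum(ord(c) for c in row) % 256

def run_feature(name, *args):
    if name not in plugins:
        raise KeyError(f"feature {name!r} not compiled in or loaded")
    return plugins[name](*args)
```

The payoff is the same as Drizzle's: the core stays small and fast, and anything optional is an opt-in registration rather than a permanent resident of the codebase.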

After linking to Part 1 of this series of articles on ReadWriteWeb (all the way back in May), today there's yet more terminology and info for the enterprising, goal-oriented technologist. Again, there's some good info and a diagram to explain some of the concepts and what makes these things different from what we are already using today. I particularly liked finding out about the performance benefits of these different architectures versus the tables, columns and rows of traditional relational-algebra-driven SQL databases.

Where I work we have lots of historic data kept on file in a Data Warehouse. This typically gets used to generate reports to show compliance, meet regulations and continue to receive government grants. For the more enterprising information analyst it also provides a source of historic data for creating forecasts modeled on past activity. For the data scientist it provides an opportunity to discover things people didn't know existed within the data (Data Mining). But now that things are becoming more 'real-time' there's a call for analyzing data streams as they occur, instead of after the fact (the Data Warehouse and Data Mining approach).
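
The difference between after-the-fact warehouse queries and stream analysis can be shown with a running aggregate: each event updates the statistic the moment it arrives, instead of waiting for a nightly batch job. The `RunningMean` class and the sample values are made up for the example.

```python
# Sketch of streaming analysis: the statistic is updated per event,
# so an answer is available at every point in the stream.
class RunningMean:
    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, value):
        self.count += 1
        self.total += value
        return self.total / self.count   # current mean, event by event

stream = [12.0, 8.0, 10.0]
monitor = RunningMean()
means = [monitor.update(v) for v in stream]
# means is [12.0, 10.0, 10.0]
```

A warehouse would give you only the final 10.0, weeks later; the streaming version had an answer after the very first event, which is the whole appeal of real-time analysis.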

This is the shortest and most pragmatic presentation I've seen about what SSDs can do for you. He recommends buying Intel 320s and getting your feet wet by moving from a bicycle to a Ferrari. Later on, if you need to go with a PCIe SSD, do it, but that's more like the difference between a Ferrari and a Formula 1 race car. Personally, in spite of the lack of major difference Artur is trying to illustrate, I still like the idea of buying once and getting more than you need. And if this doesn't start you down the road of seriously buying SSDs of some sort, check out this interview with Violin Memory CEO Don Basile:

Basile said: “Larry is telling people to use flash … That’s the fundamental shift in the industry. … Customers know their competitors will adopt the technology. Will they be first, second or last in their industry to do so? … It will happen and happen relatively quickly. It’s not just speed; it’s the lowest cost of database transaction in history. [Flash] is faster and cheaper on the exact same software. It’s a no-brainer.”

Violin Memory is the current market leader in data center SSD installations for transactional data and analytical processing. The boost folks are getting from putting their databases on Violin Memory boxes is automatic, requires very little tuning, and the results are just flat-out astounding. The 'Larry' quoted above is Larry Ellison of Oracle, the giant database maker. So with that kind of praise I'm going to say the tipping point is near, but please read the article. Chris Mellor lays out a pretty detailed future of evolution in SSD sales and new product development. 3-bit multi-level memory cells in NAND flash are what Mellor thinks will be the tipping point, as price is still the biggest sticking point for anyone responsible for bidding on new storage system installs. However, while that price sticking point is a bigger issue for batch-oriented, off-line data warehouse analysis, for online streaming analysis SSD is cheaper per byte per second of throughput. So depending on the typical style of database work you do or the performance you need, SSD is putting the big-iron spinning-hard-disk vendors to shame. The inertia of big capital outlays and cozy vendor relationships will make some shops slower to adopt the new technology (“But IBM is giving us such a big discount!” … “We are an EMC shop,” etc.). However, the competitors of the folks who own those data centers will soon eat all the low-hanging fruit a simple cutover to SSDs will afford, and the competitive advantage will swing to the early adopters.

*Late Note: Chris Mellor just followed up Monday night (June 27th) with an editorial further laying out the challenge to disk storage presented by the data center Flash Array vendors. Check it out:

What should the disk drive array vendors do, if this scenario plays out? They should buy in or develop their own all-flash array technology. Having a tier of SSD storage in a disk drive array is a good start, but customers will want the simpler choice of an all-flash array and, anyway, those are here now. Guys like Violin and Whiptail and TMS are knocking on the storage array vendors' customers' doors right now.

In short, big data simply means data sets that are large enough to be difficult to work with. Exactly how big is big is a matter of debate. Data sets that are multiple petabytes in size are generally considered big data (a petabyte is 1,024 terabytes). But the debate over the term doesn't stop there.

There's big doin's inside and outside the data center these days. You cannot spend a day without a cool new article about some new project that's just been open-sourced by one of the departments inside the social networking giants, Hadoop being the biggest example. What, you ask, is Hadoop? It is a project Yahoo started after Google began spilling the beans on its two huge technological leaps in massively parallel databases and real-time data stream processing. The first was called BigTable: a huge distributed database that could be brought up on an inordinately large number of commodity servers and then ingest all the indexing data sent by Google's web bots as they found new websites. That's the database and ingestion point. The second is the way in which the rankings and 'pertinence' of the indexed websites would be calculated, through PageRank. The invention for the real-time processing of all this collected data is called MapReduce: a way of pulling in, processing and quickly sorting out the important, highly ranked websites. Yahoo read the white papers put out by Google and subsequently created a version of those technologies, which today powers the Yahoo! search engine. Having put this into production and realized the benefits, Yahoo turned it into an open-source project to lower the threshold for people wanting to get into the Big Data industry. Similarly, they wanted many programmers' eyes looking at the source code, adding features, packaging it and, all-importantly, debugging what was already there. Hadoop is the name given to the Yahoo bag of software, and it is what a lot of people initially adopt if they are trying to do large-scale collection and real-time analysis of Big Data.
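
The PageRank calculation mentioned above also fits in a small sketch: repeatedly redistribute each page's rank along its outbound links, with a damping factor, until the ranks settle. The four-page link graph below is invented for the example; Google's real graph has billions of nodes and is exactly why BigTable and MapReduce were needed to compute this at scale.

```python
# Toy PageRank by power iteration over a tiny hand-made link graph.
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}

def pagerank(links, damping=0.85, iterations=50):
    n = len(links)
    ranks = {page: 1.0 / n for page in links}
    for _ in range(iterations):
        # Every page keeps a (1 - damping) baseline...
        new = {page: (1 - damping) / n for page in links}
        # ...and passes the rest of its rank along its outbound links.
        for page, outbound in links.items():
            share = damping * ranks[page] / len(outbound)
            for target in outbound:
                new[target] += share
        ranks = new
    return ranks

ranks = pagerank(links)
```

Page "c" ends up ranked highest because three pages link to it, while "d", which nothing links to, stays at the baseline: the 'pertinence' calculation in miniature.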

Another discovery along the way toward the Big Data movement was a parallel attempt to overcome the limitations of extending the schema of a typical database holding all the incoming indexed websites. Tables and rows and Structured Query Language (SQL) have ruled the day since about 1977 or so, and for many kinds of tabular data there is no substitute. However, the kinds of data being stored now fall into the big amorphous mass of binary large objects (BLOBs) that can slow down a traditional database. So a non-SQL approach was adopted, and there are parts of the BigTable database and Hadoop that dump the unique key values and relational tables of SQL just to get the data in and characterize it as quickly as possible, or better yet to re-characterize it by adding elements to the schema after the fact. Whatever you are doing, what you collect might not be structured or easily structurable, so you're going to need to play fast and loose with it, and you need a database of some sort equal to that task. Enter the NoSQL movement, collecting and analyzing Big Data in its least structured form. So my recommendation to anyone trying to fit the square peg of relational databases into the round hole of their unstructured data is: give up. Go NoSQL and get to work.
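
That "get it in first, characterize it later" idea is worth seeing concretely: raw records go into a key-value store as-is, and a later pass adds derived fields with no schema migration at all. The store and field names here are illustrative only, not any particular NoSQL system's API.

```python
# Sketch of schema-less ingestion followed by after-the-fact
# re-characterization, the pattern described above.
store = {}

def ingest(key, raw):
    # No schema, no upfront modelling: just get the data in.
    store[key] = {"raw": raw}

def enrich(key, **derived):
    # Re-characterize later by appending derived fields.
    store[key].update(derived)

ingest("page:1", "<html>big data rocks</html>")
enrich("page:1", length=len(store["page:1"]["raw"]), lang="en")
```

In a relational world the `enrich` step would be an ALTER TABLE and a backfill; here it's just another write, which is why this style suits data you can't fully structure up front.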

This first article from ReadWriteWeb is good in that it lays the foundation for what a relational database universe looks like and how you can manipulate it. Having established what IS, future articles will look at what quick, dirty workarounds and one-off projects people have come up with to fit their needs, and subsequently which 'works for me' solutions have been turned into bigger open-source projects that 'work for others', as that is where each of these technologies will really differentiate itself. Ease of use and lowering the threshold will be deciding factors in many people's adoption of a NoSQL database, I'm sure.