Monthly Archives: October 2014

(Excerpt from original post on the Taneja Group News Blog)

I had a blast last week at Strata/Hadoop World NY 2014. I got a real sense that the mass of big data sponsors/vendors is finally focusing on what it takes to get big data solutions into production operations. In fact, in one of the early keynotes it was noted that the majority of attendees were practicing software engineers, not analytical data scientists. Certainly there was no shortage of high-profile use cases bandied about, and some impressive sessions on advanced data science, but on the show floor much of the talk was about making big data work in real-world data centers.

I’ll certainly be diving into many of these topics more deeply, but here is a not-so-brief roundup of major themes culled from the 20+ sponsors I met with at the show:

(Excerpt from original post on the Taneja Group News Blog)

Ok, who here has used their free Dropbox (or similar) account not just at home but at work? Turns out that most of us who can get away with it do. Enterprises that care about data security are all scrambling to provide those kinds of data services to users, while hoping to do it far more securely.

(Excerpt from original post on the Taneja Group News Blog)

(Quoting myself from a BlueData press release today -) “Taneja predicted that 2014 would be the year of virtualized Big Data computing, and BlueData is proving that out,” said Mike Matchett, senior analyst and consultant at Taneja Group. “BlueData essentially virtualizes scale-out computing, turning a physical cluster into a Big Data cloud platform with elastic provisioning and policy-driven management. Best of all, BlueData helps companies leverage their Big Data wherever it currently exists, streaming it in with performance-boosting technologies to the self-provisioning Hadoop/NoSQL cloud. With this leap, companies of all sizes can now readily make progress on broader, more aggressive Big Data visions.”

I’ve written before about the opportunities and benefits that virtualizing big data clusters can provide, especially in quick spin-up/spin-down use cases, migrations, and test/dev, and also about the various storage options for Hadoop (see our Taneja Group BrightTalk channel for some past presentations). Existing Hadoop virtual hosting solutions like VMware BDE and the OpenStack Sahara project (formerly Project Savanna) have proven out the use cases, but fundamentally there is still the problem of how best to handle corporate data. If we virtualize HDFS nodes, we aren’t going to tackle PB-scale data sets. If we go with native HDFS on commodity servers, we’ll miss critical enterprise features. And if we try to use enterprise SANs, we suffer performance penalties, not to mention possibly dedicating expensive storage only to the cluster. (And copying big data sets to AWS ECS? Yikes!)

We really only want one master copy of our data if we can help it, but it also must be secure, protected, shared across workflows (with file and transactional access through other protocols), performant, and highly available. MapR might get us all that for physical Big Data clusters, but we need it for virtual compute clusters too. BlueData bridges this gap by providing virtualized hosting of the compute side of the Hadoop ecosystem (and other big data scale-out compute solutions), while baking in underneath an optimizing IO “service” that channels in existing enterprise storage, fronting it as HDFS to the virtually hosted Hadoop nodes.

You could call this HDFS virtualization, but it’s not the virtual hosting of HDFS nodes as in BDE or Project Serengeti, nor the complete remote indirection of HDFS that EMC Isilon offers. Rather, it’s more like abstraction, akin to what IBM SVC does for regular storage. EMC’s ViPR HDFS used with VMware BDE might in some ways be seen as functionally comparable, but ViPR requires some modification to the Hadoop environment to work and isn’t integrated with BDE to provide any IO performance optimizations.

What are these performance optimizations? BlueData provides a native caching solution underneath the virtual compute clusters called IOBoost, and a related SAN/NAS attachment facility called DataTap. Together these can be used to pull and stream any existing data from where it sits into the virtualized clusters for analysis, without dedicating, duplicating, or moving data unnecessarily. What I really like is that all of an organization’s existing data processing can simply “share” its data from existing storage with analytics running in the virtualized big data clusters. Internally, IT can now offer an elastic big data “cloud” on corporate data sets without having to stage, build, or maintain any new storage solutions.

Today’s news from BlueData is that they are offering a free (in perpetuity) 5-node license of their full enterprise EPIC platform, not just the free one-node community edition already available. With no restrictions on cores or storage, full cloud-like multi-tenant provisioning, and the ability to analyze existing data where it currently sits, it seems downright hard not to grab this free license and stand up an internal big data cloud of your own.

(Excerpt from original post on the Taneja Group News Blog)

One thing is certain in technology: the wheel keeps turning from differentiating advantage to fungible commodity, and then eventually back again. Now we think the time has come for data center connectivity to arise once more and become a competitive asset again. Yep, I’m talking about cables, switches, and the actual physical connections that tie all our precious IT infrastructure together. We think Fiber Mountain is about to change the game here and provide some real disruption to the way data centers are wired.

First, consider the changes in data center traffic: a shift from north-south to more east-west, driven by increased virtualization and hyperconvergence densities, scale-out clusters for SANs, big data, and private clouds, and the increased dynamics (and possible fluid motion) of increasingly “software-defined” resources. According to Fiber Mountain, 70% of traffic these days actually stays within a rack. Maintaining a proper cabling and switch infrastructure requires a lot of time, attention to detail, and adherence to “best practices,” not to mention endless spreadsheets and arcane cabling schemes. It’s getting harder to maintain, and the cost keeps climbing for the huge core switches needed to keep up that precious hub design in which most packets must still go up to the core and come back down.

Given this reality, Fiber Mountain has just announced a new paradigm. First, they let you pull ultra-dense, intelligent fiber cables everywhere (think 24+ fiber “ribbon” cables with MPO connectors) across the row and data center. Everything is on fiber, which provides for any speed of any protocol, current or future. Second, they provide matching intelligent top-of-rack, end-of-row, and core switches based on full optical cross-connections. Light speed ahead. While YMMV, think about sub-5ns latency anywhere to anywhere in the data center (and maybe beyond?), with a much smaller (and cheaper) core switch requirement.

The first mind-bender here is that every cable termination knows and can report on what it is connected to physically (MAC address). That’s right, the fiber (and/or copper if you insist) itself knows what it is connected to. You no longer track which cable is connected to which port on which device; the system does it for you. Just rack whatever you have and plug it into whatever fiber happens to be dangling nearby (no, we don’t really recommend leaving fiber just dangling, but you get the point). No more time or effort spent labeling, tracking, inventorying, or troubleshooting what cable goes to which port on what device.

The rack and row switches can then smartly switch any traffic directly from point A to point B without going through a formerly slow and bottlenecked core. And if traffic does need to flow through the core now, it’s all glass. And if required, fiber capacity can be dedicated for things like “remote” DASD or even physical-level multi-tenancy separation assurance. It’s all managed by the Fiber Mountain “meta” orchestration system, which knows about topology, integrates alarms, and controls connections (and has a REST API for you hackers out there). This is SDN in spades…

There is a great roadmap for trying this out, starting with one row and migrating row by row in parallel connectivity as desired. Fiber Mountain estimates the total capex of this scheme at one-third of what folks pay today for the big core-hub-centered design, with at least twice the capacity and clear performance advantages. And the opex is expected to be much lower: less space, power, cooling, and MAC effort, and fewer connectivity mistakes/outages.

I’ve been predicting the eventual convergence of DCIM, IPM, and APM. I had thought the key overlap was in tying power consumption to performance to application (showback/chargeback). But now I see that it’s going to be much more than that, with physical connectivity as “software-defined” as servers, storage, and logical-layer network functions.

What can you do with 2/3+ of your data center connectivity budget back in your pocket? Let us know if you’ve had a chance to look at Fiber Mountain, and what you think about this new future data center design.

An IT industry analyst article published by SearchDataCenter.

Just because you can add a cache doesn’t mean you should. It is possible to have the wrong kind, so weigh your options before implementing memory-based cache for a storage boost.

Can you ever have too much cache?

Cache is the new black… As a performance optimizer, cache has never gone out of style, but today’s affordable flash and cheap memory mean it is now worn by every data center device.

Fundamentally, a classic read cache helps avoid long repetitive trips through a tough algorithm or down a relatively long input/output (I/O) channel. If a system does something tedious once, it temporarily stores the result in a read cache in case it is requested again.

Duplicate requests don’t need to come from the same client. For example, in a large virtual desktop infrastructure (VDI) scenario, hundreds of virtual desktops might want to boot from the same master image of an operating system. With a cache, every user gets a performance boost, and the downstream system is saved a lot of duplicate I/O work.
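To make the idea concrete, here is a minimal sketch of a shared read cache (all names here are hypothetical, not from any product mentioned above): the first request for a block pays the slow back-end cost, and every later request, from any client, is served from memory.

```python
class ReadCache:
    """Toy shared read cache: key -> data, filled on first miss."""

    def __init__(self, backend):
        self.backend = backend  # slow function standing in for disk/network I/O
        self.store = {}         # in-memory cache
        self.hits = 0
        self.misses = 0

    def read(self, key):
        if key in self.store:            # cache hit: no back-end I/O at all
            self.hits += 1
            return self.store[key]
        self.misses += 1
        data = self.backend(key)         # cache miss: do the slow work once
        self.store[key] = data           # remember it in case it's asked again
        return data


def disk_read(block):
    """Stand-in for a tedious back-end operation."""
    return f"data-for-{block}"


cache = ReadCache(disk_read)
# Hundreds of VDI clients booting from the same master image block:
for _ in range(100):
    cache.read("master-image-block-0")
# Only the first read did real I/O; the other 99 were served from memory.
```

In real systems the same effect is what Python’s `functools.lru_cache` provides for function results, with eviction added so the cache can’t grow without bound.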

The problem with using old-school, memory-based cache for writes is that if you lose power, you lose the cache. Thus, unless it has battery backup, memory cache is used only for reads. Writes are set up to “write through”: new data must persist somewhere safe on the back end before the application continues.
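The write-through rule can be sketched in a few lines (again a toy illustration with hypothetical names): a write is acknowledged only after it lands on the persistent back end, and the volatile cache copy is merely an optimization that can be lost without losing data.

```python
class WriteThroughCache:
    """Toy write-through cache: persist first, then update volatile memory."""

    def __init__(self):
        self.backend = {}  # stands in for safe persistent storage
        self.cache = {}    # volatile in-memory copy

    def write(self, key, value):
        self.backend[key] = value  # persist first ("write through")
        self.cache[key] = value    # then warm the volatile cache

    def read(self, key):
        if key in self.cache:      # served from memory when possible
            return self.cache[key]
        value = self.backend[key]  # otherwise fetch from the back end
        self.cache[key] = value
        return value


c = WriteThroughCache()
c.write("block-7", b"new data")

# Simulate a power failure wiping the volatile cache: the data survives,
# because the write already reached the back end before acknowledgment.
c.cache.clear()
assert c.read("block-7") == b"new data"
```

The trade-off is write latency: every write waits on the slow back end, which is exactly why battery-backed (or flash-based) write caches exist.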

Flash is nonvolatile random access memory (NVRAM) and is used as cache or as a tier of storage directly…

RT @TruthinIT: There's no cost of goods like a traditional NAS device where I've got disks I've got to pay for. And if I'm not using the data on those disks, I still got to pay for those disks. bit.ly/2BBX073 @Nasuni @smworldbigdata

In 30 min I'm interviewing @Cohesity (and a customer) on @TruthinIT about Mass Data Fragmentation. It's about having too many copies across four or five different "dimensions", including cloud! Join our webcast (12.11.18) @ 1pmET (and there will be prizes) bit.ly/2PdqrQn