<h1>ReadWrite: Big Data</h1>
<p><a href="http://readwrite.com/tag/big-data">http://readwrite.com/tag/big-data</a></p>
<p><em>Copyright 2015 Wearable World Inc. | Last updated Tue, 31 Mar 2015 11:04:11 -0700</em></p>
<hr />
<h1>Why Apple Had To Take The NoSQL Plunge</h1>
<p>Apple has quietly—<a href="http://readwrite.com/2015/03/24/apple-foundationdb-database-talent-icloud-itunes">and not so quietly</a>—been buying up Big Data companies over the past few years, most recently acquiring FoundationDB but in 2013 <a href="http://appleinsider.com/articles/15/03/25/apple-acquires-big-data-analytics-firm-acunu">also purchasing Acunu</a>, maker of a real-time analytics platform. The intent seems to be to purchase data-infrastructure talent—and very particular talent at that.</p>
<p>Basically, Apple needed to get into NoSQL database technology in a bad way. These alternatives to traditional relational databases (long known as SQL systems) offer speed and flexibility that older-style databases can only dream of.</p>
<blockquote><p><strong>See also: <a href="http://readwrite.com/2013/03/25/when-nosql-databases-are-good-for-you">When NoSQL Databases Are—Yes—Good For You</a></strong></p></blockquote>
<p>As former Wall Street analyst and NoSQL (MongoDB and now Aerospike) executive Peter Goldmacher declares, Apple's interest in NoSQL translates into a need to handle "massive workloads in a cost-effective way."</p>
<figure><img src="http://a3.files.readwrite.com/image/upload/c_fill,cs_srgb,dpr_1.0,q_80,w_620/MTI5MDU2MjQ3MjkzOTc1MDA2.jpg" /><figcaption>Peter Goldmacher</figcaption></figure>
<p>In a wide-ranging interview, Goldmacher points to the need to rethink enterprise data and calls out Hadoop and NoSQL technologies as the foundational bedrock of any Big Data strategy.</p>
<p><strong>ReadWrite</strong>: <em>Apple bought FoundationDB, but already uses quite a bit of Cassandra, MongoDB, HBase, and Couchbase. At least as measured by job postings, it's <a href="https://jobs.apple.com/us/search#&amp;ss=FoundationDB&amp;t=0&amp;so=&amp;lo=0*USA&amp;pN=0">not using FoundationDB</a> (the product). Why do you think they opted to purchase FoundationDB, the company?</em></p>
<p><strong>Goldmacher</strong>: Apple is first and foremost an extremely innovative company in everything it does. It has created both transitional (iPod) and transformational (iPad) technologies, and this desire to always innovate permeates the fabric of its corporate culture.</p>
<p>If you look at the software products the company provides, like iTunes, iMessage, iAd, etc., all of these products operate at massive scale. If they were written on traditional relational database technologies, it's not clear that a) they would work, or that b) they wouldn't bankrupt the company, given the scale at which these products operate and the cost of a traditional RDBMS license.</p>
<p>So Apple innovated and was a very early adopter of NoSQL. It is reasonable to wonder whether Apple's software products would even have been possible without NoSQL technologies.</p>
<p>And here we are almost a decade after these products were launched, and Apple is yet again taking advantage of new technology.
While the existing NoSQL technology was up to the task, it was expensive because of the massive server farms required to support the scale and the people required to support those server farms.</p>
<p>FoundationDB offers a key-value store database akin to what Apple was using with Cassandra, but it runs in memory, which means you can reduce your hardware by a factor of 8-10x. Said another way, if the company was using 75,000 servers to support the workload, as I've seen speculated in the press [and <a href="http://cassandra.apache.org/">on the Cassandra project page</a>], FoundationDB would enable them to get that down to 7,500 servers.</p>
<p>As to why they would purchase FoundationDB, the company: I think they loved the technology and figured that if they just bought the company, they'd have the talent in house to continue to enhance the product, and thus the ability to continue to innovate on the product front.</p>
<p>[<strong>Asay note</strong>: It's worth pointing out that not everyone agrees on the value of FoundationDB's actual product today; MongoDB executive Kelly Stirman, for one, disputes it. But we'll let Goldmacher and Stirman duke this one out in another post.]</p>
<p><strong>RW</strong>: <em>You say that the initial wave of NoSQL players can't handle "massive workloads in a cost-effective way." What is it about multi-model databases like Aerospike and FoundationDB that gives them this ability?</em></p>
<p><strong>PG</strong>: FoundationDB and Aerospike are key-value store databases akin to Cassandra, but the secret sauce is that the data resides in flash and not on spinning disk. This creates significant performance advantages, with the knock-on effect of needing less hardware.</p>
<p><strong>RW</strong>: <em>You do realize, of course, that DataStax, MongoDB, and others have customers running at "massive scale," right? DataStax has Netflix and other marquee customers <a href="http://www.datastax.com/1-million-writes">at significant scale</a>, as <a href="http://www.mongodb.com/mongodb-scale">does MongoDB</a>....</em></p>
<p><strong>PG</strong>: Absolutely, but there's massive scale and then there's the cost of massive scale. If I can get similar performance at 1/10th of the cost, and massive scale means I am spending $50M, why wouldn't I take that cost down to $5M?</p>
<p><strong>RW</strong>: <em>Do you think Apple's acquisition is a sign of things to come for NoSQL, generally? Are we about to enter a consolidation phase?</em></p>
<p><strong>PG</strong>: I think Apple is one of a special class of companies, like Google, LinkedIn and Facebook, that are so cutting edge and so heavily reliant on data as an asset that they absolutely must own and innovate on the technology that supports the business.</p>
<p>So we may or may not be entering a phase of consolidation in the NoSQL world, but the buying rationale won't be anything like Apple's rationale for buying FoundationDB.</p>
<p>I can clearly see a world where traditional enterprise IT companies that don't have a dog in the database fight buy NoSQL vendors to go after Oracle. In fact, EMC is already pretty far down this path.</p>
<p>At some point the Ciscos and Dells of the world have to step up and become players in the database space, because we are seeing the database players getting into the hardware space. The stage was set a long time ago for consolidation, and I believe this trend will continue.</p>
<p><strong>RW</strong>: <em>Let's pick winners.
If an enterprise were forced to use only two Big Data technologies, what should they be and why?</em></p>
<p><strong>PG</strong>: Well, it feels like everything is Big Data technology these days.... Still, if I were running IT at a large company, I would be investing in Hadoop and NoSQL.</p>
<p>With Hadoop, you have the ability to dramatically and cost-effectively expand the contents, and thus the value, of your data warehouse, which is extremely important. The more you can measure, the more you can improve.</p>
<p>And in the NoSQL world, you have two opportunities.</p>
<p>First, use MongoDB/DataStax/CouchDB to replace workloads that have historically run in Oracle even though they weren't a great fit, either because of cost or functionality limitations. For example, MongoDB enjoys a number of consistent use cases like content management systems, web catalogs and web sites. Oracle is overkill for that.</p>
<p>So those NoSQL players help you do old things better.</p>
<p>But if you want to do new and truly innovative things, you need enormous speed and scalability. This is the second opportunity.</p>
<p>One of the most common use cases for Aerospike is in the AdTech world. The AdTech players load an Aerospike database every morning with relatively static data created in Hadoop. This data is essentially a person's profile, based on their cookies, as they click around the internet every day.</p>
<p>In a gross oversimplification: Peter is a 45-year-old male who lives in the Bay Area and shops on all the bargain web sites. This data gets loaded into Aerospike, and then Aerospike collects data all day about what Peter is clicking on that day.</p>
<p>Well, if Peter is clicking around a bunch of web sites looking for a watch, Timex or the local watch store would bid aggressively for the opportunity to put an advertisement in front of Peter, because he is exhibiting characteristics of a likely buyer. That is a great example of deriving tremendous value from your data warehouse by making the data actionable when it matters.</p>
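<p>[<strong>Asay note</strong>: To make that pattern concrete, here is a minimal sketch of the flow Goldmacher describes: profiles batch-loaded each morning, clicks recorded during the day, and a bid decision made against the combined state. A plain Python dict stands in for the key-value store (this is not Aerospike's actual client API), and the names and thresholds are invented.]</p>
<pre><code>from collections import defaultdict

# Morning batch load: relatively static profiles built offline (e.g., in Hadoop).
profiles = {
    "user:peter": {"age": 45, "region": "Bay Area", "segment": "bargain shopper"},
}

# Intraday state: click events recorded as they arrive.
clicks = defaultdict(list)

def record_click(user_key, category):
    """Append one observed click to the user's intraday history."""
    clicks[user_key].append(category)

def should_bid(user_key, ad_category, min_clicks=3):
    """Bid aggressively if today's clicks show intent matching the ad."""
    if user_key not in profiles:
        return False  # no profile, no basis for a bid
    return clicks[user_key].count(ad_category) >= min_clicks

# Peter browses watch sites during the day...
for _ in range(3):
    record_click("user:peter", "watches")

print(should_bid("user:peter", "watches"))  # True: he looks like a likely buyer
</code></pre>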
<p><em>Photo by <a href="https://www.flickr.com/photos/mac_ivan/7679988780/">Ivan Bandura</a></em></p>
<p><em>It needs serious database talent—but not just any talent.</em></p>
<p><em>Work | Tue, 31 Mar 2015 06:00:00 -0700 | Matt Asay</em><br />
<a href="http://readwrite.com/2015/03/31/apple-foundationdb-nosql-peter-goldmacher">http://readwrite.com/2015/03/31/apple-foundationdb-nosql-peter-goldmacher</a></p>
<hr />
<h1>Batch Your Big Data Jobs—Or Stream Them?</h1>
<p>Even among the über-sexy big data elite, Apache Spark is smoking. Promising dramatically better performance on in-memory (100x faster than Hadoop's MapReduce!) and on-disk (10x faster!) workloads, Spark seems to be leading the charge into a beautifully fast Big Data future.</p>
<p>According to some, Hadoop's batch-oriented days—that is, where you have to pile all your data together, process it through Hadoop and then interpret the output—may be numbered. But while alternatives to batch processing certainly look promising, rumors of Hadoop's death may be a wee bit exaggerated.</p>
<h2>Why Batch When You Can Stream?</h2>
<p>Just as Hadoop started to hit mainstream consciousness, some people started touting The Next Big Thing. As <a href="http://databricks.com/">Databricks</a> engineer Patrick Wendell told me in an interview, we are at the "beginning of what will likely be a major expansion of streaming workloads over the next few years." Such workloads would start yielding results while the analysis was still underway, rather than forcing you to wait for the entire job to finish.</p>
<p>Of course, "streaming analytics" is a lot easier to say than to actually implement, according to Wendell:</p>
<blockquote><p>The big technical challenges with streaming are around operational complexity. Streaming programs are inherently more complex to maintain than offline batch processing engines: you have to be "always on," have quick response time, and deal with bursty incoming data. Furthermore, it can be expensive from an engineering perspective to maintain two different stacks: one for batch processing and the other for streaming.</p></blockquote>
<p>The answer, according to Wendell and another streaming analytics pioneer, <a href="http://www.zoomdata.com/">Zoomdata</a>, is to consolidate big data technologies around streaming analytics. But the two companies approach the problem differently.</p>
<h2>Streamlining Big Data</h2>
<p>For <a href="http://databricks.com/">Databricks</a>, the company behind Apache Spark, the best approach is to "unify the streaming programming model with batch," as Wendell explains. Doing so—as Databricks accomplished with <a href="https://spark.apache.org/streaming/">Spark Streaming</a>—"lets users take existing business logic and apply it in real time." This means that "all of the effort they put into writing code to define metrics, do anomaly detection, etc., they can do it directly on their streaming data."</p>
<p>The big payoff? "They only have to maintain one software stack."</p>
<p>Just as importantly, as Cloudera's <a href="http://blog.cloudera.com/blog/2014/11/how-to-do-near-real-time-sessionization-with-spark-streaming-and-apache-hadoop/">Ted Malaska highlights</a>, Spark Streaming lets you "create data pipelines that process streamed data using the same API that you use for processing batch-loaded data."</p>
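<p>Here's a minimal PySpark sketch of that idea: one function holds the business logic and is applied, unchanged, both to a static RDD and to each micro-batch of a DStream. It assumes the Spark 1.x streaming API of the era; the input path, host and port are placeholders.</p>
<pre><code>from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "shared-logic-demo")
ssc = StreamingContext(sc, 5)  # 5-second micro-batches

def error_counts(rdd):
    """The business logic: count ERROR lines per source host."""
    return (rdd.filter(lambda line: "ERROR" in line)
               .map(lambda line: (line.split()[0], 1))
               .reduceByKey(lambda a, b: a + b))

# Batch: run the logic over yesterday's logs (path is a placeholder).
batch_result = error_counts(sc.textFile("hdfs:///logs/2015-03-22")).collect()

# Streaming: run the *same* function over live data, micro-batch by micro-batch.
live = ssc.socketTextStream("localhost", 9999)  # placeholder source
live.transform(error_counts).pprint()

ssc.start()
ssc.awaitTermination()
</code></pre>
<p>The point of the design is that <code>error_counts</code> never knows whether it's running in batch or streaming mode; only the input source changes.</p>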
"They only have to maintain one software stack."</p><p>But just as importantly, as Cloudera's <a href="http://blog.cloudera.com/blog/2014/11/how-to-do-near-real-time-sessionization-with-spark-streaming-and-apache-hadoop/">Ted Malaska highlights</a>, is that Spark Streaming allows you to "create data pipelines that process streamed data using the same API that you use for processing batch-loaded data."&nbsp;</p><p>Not everyone agrees.</p><h2>"Unnecessary Tradeoffs"</h2><p>According to <a href="http://www.zoomdata.com/">Zoomdata</a> CEO and co-founder Justin Langseth (with whom <a href="http://readwrite.com/2014/11/11/business-intelligence-big-data-zoomdata-justin-langseth">I spoke recently about business intelligence and Big Data</a>), batch-oriented systems like Hadoop are unnecessary in an increasingly real-time world:</p><blockquote tml-render-layout="inline"><p>There is no real need to batch up data given today’s modern architectures such as Kafka and Kinesis. Modern data stores such as MongoDB, Cassandra, Hbase, and DynamoDB can accept and store data as a stream, and modern BI tools like the ones we make at Zoomdata are able to process and visualize these streams as well as historical data, in a very seamless way. Just like your home DVR can play live TV, rewind a few minutes or hours, or play movies from last century, the same is possible with data analysis tools like Zoomdata that treat time as a fluid.</p></blockquote><p>As Langseth told me in our interview, proposed new architectures that incorporate the best of batch and real-time are a step backward:</p><blockquote tml-render-layout="inline"><p>Those who have proposed a “Lambda Architecture,” which effectively separates paths for real-time and batched data, are espousing an unnecessary trade-off, one that is optimized for legacy tooling that simply wasn’t engineered to handle streams of data be they historical or real-time. At Zoomdata we believe that it is not necessary to separate-track real-time and historical data, as there is now end-to-end tooling that can handle both form sourcing, to transport, to storage, to analysis and visualization.</p></blockquote><p>The key, as Langseth continues, is not to get mired in batch-oriented systems at all, <em>even if you don't currently care about real-time analysis of your data</em>. Sticking with streaming data from the start "massively simplifies big data architectures [as] you don’t need to worry about batch windows, recovering from batch process failures, and so on," he says.</p><p>In short, "even if you don’t need to analyze data from five seconds or even five minutes ago to make business decisions, it still may be simplest and easiest to handle the data as a stream nevertheless."</p><h2>The Future Takes A Long Time</h2><p>Even if Langseth is correct, and developers are better off dumping batch for stream-based systems, it's going to take a long time to get there. As Datastax senior community manager Scott Hirleman told me, "Truly forward thinking companies are just starting to experiment now [with streaming analytics] so that says that even to reach Hadoop's level of awareness will be a few years or more."</p><p>Real-time analytics may be a thing, in other words, but it's a thing that will take a long time to really hit.</p><p>And when it does, as Hadoop creator Doug Cutting stressed in an interview with me, "streaming [will simply] join[] the suite of processing options that folks have at their disposal." 
Streaming <em>and</em> Hadoop, in other words, not <em>or</em>.</p>
<p>This may not be the beatific future Langseth envisions, but it may be the best we get.</p>
<p><em>Real-time alternatives to Hadoop may give you a choice.</em></p>
<p><em>Work | Mon, 23 Mar 2015 12:23:35 -0700 | Matt Asay</em><br />
<a href="http://readwrite.com/2015/03/23/hadoop-big-data-batch-streaming">http://readwrite.com/2015/03/23/hadoop-big-data-batch-streaming</a></p>
<hr />
<h1>"Reactive" Systems: Easy To Program, Now Easier To Manage</h1>
<p>Enterprise deployment of "reactive" systems is <a href="http://readwrite.com/2014/09/19/reactive-programming-jonas-boner-typesafe">no longer a fringe concept</a> enjoyed by a handful of early adopters. According to the <a href="http://www.reactivemanifesto.org/">Reactive Manifesto</a>, developing these message-driven, elastic, resilient, consistent, and highly responsive applications is what's fueling the new wave of systems deployed on everything from mobile devices to cloud-based clusters running thousands of multi-core processors.</p>
<p>Users expect millisecond response times and constant uptime.</p>
<blockquote><p><strong>See also: <a href="http://readwrite.com/2014/09/19/reactive-programming-jonas-boner-typesafe">As Systems Get More Complex, Programming Is Getting "Reactive"</a></strong></p></blockquote>
<p>Web-giant enterprises like Twitter, Facebook, Google and Netflix measure their data in petabytes, but even smaller companies face enormous pressure to keep pace with the market. Expectations have changed: rather than batch processing, both users and lines of business expect data to be processed in real time, for both user experience and a competitive advantage in the market.</p>
<p>This can strain legacy architectures, even those dealing with "medium data" measured in gigabytes. On top of user expectations being significantly higher today than a decade ago, the pressures facing enterprise operations teams to manage resilient, responsive systems are brutal; most existing technologies are not designed to deploy and manage reactive systems running on clusters.</p>
<h2>Chain Reaction</h2>
<figure><img src="http://a3.files.readwrite.com/image/upload/c_fill,cs_srgb,dpr_1.0,q_80,w_620/MTI4NzExMDgxNzEwMzYxMDU0.jpg" /><figcaption>Kevin Webber</figcaption></figure>
<p>Typesafe, developer of Akka, Scala, and Play Framework, and a company I've covered several times because of the cool open source things it helps to build, has launched a new tool to manage reactive applications called ConductR.</p>
<p>I spoke recently to Kevin Webber, who joined Typesafe last year as a developer advocate. Most recently he led the team that redesigned Walmart Canada's new e-commerce platform. Walmart's platform was built on Akka, Play and Scala.</p>
<p><strong>ReadWrite</strong>: <em>What's driving this new scale challenge where monolithic applications are morphing into microservices?</em></p>
<p><strong>Kevin Webber</strong>: The short answer is that enterprises need reactive systems to respond to growth in mobile computing and the Internet of Things. The rapid cycle of technological innovation is troubling for enterprises.</p>
<p>If we look back 10 years to 2005, you'll notice that we didn't have the iPhone, Facebook, Twitter, Netflix streaming, Kindle, App Store, Spotify or Big Data tools. In 2005, the average household of consumers would have one or two computers—a desktop and laptop, let's say—that would connect to applications running on one or two redundant servers.
Internet connectivity wasn't especially fast, and applications weren't especially fast.</p>
<p>Let's face it, by today's standards nothing was really that fast. And most of us were fine with it—back then.</p>
<p>Jumping forward to 2015, we see the need for a major change in the network topology of enterprises in order to meet the demands being placed on back-end infrastructure. What has happened is that the very same household now has multiple devices per person—all of which may use different protocols and platforms. Now we have a multitude of devices connecting to a multitude of application server instances.</p>
<p>A single application server simply can't handle the strain, no matter how much vertical scalability is attempted. There's only one option—to scale out.</p>
<h2>Managing Reaction</h2>
<p><strong>RW</strong>: <em>So Typesafe believes that while the development side of the organization is embracing the principles of reactive programming, the operations side needs new tools for managing these new applications?</em></p>
<p><strong>KW</strong>: Yes, enterprises have begun embracing the concept of architecting reactive systems for many reasons: they are more flexible, loosely coupled, and scalable. This makes them easier to develop and amenable to change, plus significantly more tolerant of failure; when failure does occur, they meet it with elegance rather than disaster.</p>
<p>Because enterprise systems are being built as smaller individual applications rather than the giant monoliths of the past, the next logical step is to enable operations to intelligently deploy and manage their production systems using the same reactive principles.</p>
<p>Whereas 10 years ago a system may have been composed of a handful of core applications, now a system may be composed of tens or hundreds of smaller, lightweight applications. Although distributed, more resilient against failure and capable of handling much higher loads, reactive systems are actually safer, more predictable and easier to manage and deploy than traditional systems. However, they differ from traditional architectures, and this unfamiliarity must be managed using new approaches and tools.</p>
<p>Armed with existing solutions, is your ops team confident that it has the tools needed to handle the demands placed on enterprise infrastructure in 2020? Software architectures are changing; the real question is whether your operations team is ready to handle these changes.</p>
<p>We will be the first to say that the challenges of deploying and managing reactive systems are not obvious to the casual observer.
Without technologies designed specifically for distributed architectures, there are several reasons why we believe enterprises are struggling to deploy and manage reactive systems:</p>
<ol><li>Existing operations processes and methodologies are inadequate for reactive systems.</li><li>Existing operations tools and technologies are inadequate for reactive systems.</li><li>The costs and risks of downtime using inadequate solutions are higher than ever.</li></ol>
<h2>How To React To Reactive</h2>
<p><strong>RW</strong>: <em>So how should reactive-inclined enterprises manage this new complexity?</em></p>
<p><strong>KW</strong>: That's what ConductR is for. The idea behind ConductR is to let operations teams experience the same convenience that development teams get using Play Framework to build their reactive applications.</p>
<p>Indeed, your system needs to be designed in a reactive way—minimal shared state, for example. Here, resilience is key. You should be prepared for your database to disappear at any point in time.</p>
<p>Once you're that resilient, you can move around a cluster very easily, shifting load over to active areas and reducing latency to a minimum. Once your apps are tolerant enough to lose everything, shut down and immediately restart somewhere else, then you've got yourself a reactive system.</p>
<p><strong>RW</strong>: <em>How does ConductR fit into an enterprise's new architecture?</em></p>
<p><strong>KW</strong>: Enterprises have other tools designed to handle provisioning, virtualization and other infrastructure needs for their machines and clusters, so ConductR was designed to work well with existing technology and promote the four tenets of reactive systems—responsive, resilient, elastic and message-driven.</p>
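<p>As a concrete illustration of Webber's "be prepared for your database to disappear" advice, here's a small, hedged Python sketch of one such pattern: a worker that treats its data store as transient, retries with backoff, and then crashes fast so a supervisor can restart it elsewhere. The store and its failures are simulated; none of this is ConductR's actual API.</p>
<pre><code>import random
import time

class StoreUnavailable(Exception):
    """Stand-in for a database that can vanish at any moment."""

def flaky_store_write(record):
    """Simulated write: fails roughly a third of the time."""
    if random.random() > 0.66:
        raise StoreUnavailable("store went away")
    print("wrote", record)

def write_with_backoff(record, attempts=5):
    """Retry with exponential backoff; if the store stays gone, crash fast."""
    for attempt in range(attempts):
        try:
            flaky_store_write(record)
            return
        except StoreUnavailable:
            time.sleep(0.1 * 2 ** attempt)  # 0.1s, 0.2s, 0.4s, ...
    # Crash fast: a supervisor restarts the worker, possibly on another node.
    raise SystemExit("store unreachable; exiting so a supervisor can restart us")

if __name__ == "__main__":
    for i in range(10):
        write_with_backoff({"event": i})
</code></pre>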
<p><em>Photo courtesy of FitStar</em></p>
<p><em>The new tool ConductR makes it easier to manage high-scalability applications.</em></p>
<p><em>Hack | Thu, 12 Mar 2015 11:54:16 -0700 | Matt Asay</em><br />
<a href="http://readwrite.com/2015/03/12/reactive-programming-conductr-kevin-webber">http://readwrite.com/2015/03/12/reactive-programming-conductr-kevin-webber</a></p>
<hr />
<h1>Hadoop Creator: If You Want To Succeed With Big Data, Start Small</h1>
<p>"What is Hadoop?" is a simple question with a not-so-simple answer. Using it successfully is even more complex.</p>
<p>Even its creator, Doug Cutting, offered an accurate-but-unsatisfying "it depends" response when I asked him last week at the Strata+Hadoop World conference to define Hadoop. He wasn't being coy. Despite serving as the poster child for Big Data, Hadoop has grown into a complicated ecosystem of complementary and sometimes competitive projects.</p>
<figure><img src="http://a3.files.readwrite.com/image/upload/c_fill,cs_srgb,w_620/MTI4MjUxODQxMjU4NTA2MjUw.png" /><figcaption>Doug Cutting</figcaption></figure>
<p>Which is precisely what makes it so interesting and powerful.</p>
<p>As Cutting went on to tell me, Hadoop can fill a myriad of different roles within an enterprise. The trick to getting real value from it, however, is to start with just one.</p>
<h2>The New Linux</h2>
<p>Hadoop, avers Cutting, is much like Linux. "Linux, properly speaking, is the kernel and nothing more," he notes, <a href="http://en.wikipedia.org/wiki/GNU/Linux_naming_controversy">channeling his inner Richard Stallman</a>. "But more generally, it's an ecosystem of projects. Hadoop is like that."</p>
<p>But this wasn't always the case.</p>
<p>Hadoop started as a new way to process (MapReduce) and store (Hadoop Distributed File System, or HDFS) data. Ten years later, Hadoop has become a motley assembly of oddly named projects, including Pig, Hive, YARN, HBase, and more.</p>
<p>Already a far-ranging ecosystem, Hadoop is the largest galaxy in an ever-growing universe of Big Data (though people often use "Hadoop" to mean Big Data generally). Ultimately, says Cutting, Hadoop expresses a certain "style" of thinking about data, one that centers on scalability (commodity hardware, open source, distributed reliability) and agility (no need to transform data to a common schema on load; instead, load it and then improvise on schema as you go along).</p>
<p>All of which, he says, means one simple thing: "More power to more people more easily with more data."</p>
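<p>Cutting's "improvise on schema as you go" point is what's usually called schema-on-read. Here's a minimal Python sketch of the idea, with made-up log data: store raw lines now, and impose whatever structure the question of the day requires at query time.</p>
<pre><code>import csv
import io

# Schema-on-write forces one structure at load time. Schema-on-read keeps
# the raw bytes and applies a schema only when a question is asked.
raw_lines = [
    "2015-02-25,checkout,us-west,12.50",
    "2015-02-25,search,us-east,",
    "2015-02-26,checkout,eu-west,8.00",
]

def read_as(schema, lines):
    """Project raw CSV lines onto whatever columns today's query needs."""
    reader = csv.reader(io.StringIO("\n".join(lines)))
    return [dict(zip(schema, row)) for row in reader]

# Today's question: revenue by region. The schema is applied now, at read time.
events = read_as(["date", "action", "region", "amount"], raw_lines)
revenue = {}
for e in events:
    if e["action"] == "checkout":
        revenue[e["region"]] = revenue.get(e["region"], 0.0) + float(e["amount"])

print(revenue)  # {'us-west': 12.5, 'eu-west': 8.0}
</code></pre>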
<h2>Calling Your Hadoop Baby Ugly</h2>
<p>"Easily," however, is in the eye of the data science beholder.</p>
<p>Hadoop, despite its power, has yet to see mainstream adoption (it still accounts for just 3% of the enterprise storage footprint, as <a href="https://451research.com/report-short?entityId=79405">451 Research finds</a>), largely <a href="http://readwrite.com/2013/12/06/hadoop-enterprise-big-data">because of how complicated it is</a>. And that's not helped by how fast the Hadoop ecosystem continues to grow.</p>
<p>This, in turn, may be one reason that Hadoop deployments haven't grown as fast as they otherwise might, as Gartner analyst <a href="http://blogs.gartner.com/merv-adrian/2015/02/18/hadoop-adoption-moving-but-not-necessarily-forward/">Merv Adrian highlights</a>:</p>
<figure><img src="http://a5.files.readwrite.com/image/upload/c_fill,cs_srgb,w_620/MTI4MjUyMzczMjk3NTgwMDQy.png" /><figcaption>Source: Gartner</figcaption></figure>
<p>Cutting recognizes Hadoop's warts. While it would be too much to say he celebrates them, it is true that he's not embarrassed by them. At all. As he puts it, "One of the things I liked when I got started in open source is that I didn't have to apologize. Nor did I have to deceive and say a project could do things well if it actually couldn't."</p>
<p>The code, after all, tells its own truth.</p>
<p>Hence, he's comfortable saying things like:</p>
<blockquote><p>Hadoop is what it is. It used to be a lot worse and used a lot less. It's gotten a lot better and is now used a lot more.</p></blockquote>
<p>How much more? Despite the Gartner analysis above, Cutting told me that Cloudera sees the Hadoop world doubling each year: "doubling of number of customers, company revenue, customer cluster sizes, and even Strata has roughly doubled each year."</p>
<p>That's good, but what would make it move even faster? Otherwise stated, what does he see as the biggest barriers to Hadoop adoption?</p>
<h2>Barriers To Hadoop World Domination</h2>
<p>There are a few things that block Hadoop's progress, Cutting tells me. First, there are features that enterprises need that Hadoop and its ecosystem still lack.</p>
<p>More than this, he suggests, there are missing integrations between the different stars in the Big Data/Hadoop galaxy. In other words, it's still not easy or seamless to move data between different tools, like Hive to Solr or MongoDB to Impala. Enterprises want to use different tools to deal with the same data set.</p>
<p>We're also still waiting on applications to remove complexity and streamline Hadoop. As he suggests, Cloudera sees common use cases—like risk management for financial services companies—that require the same tools to solve the problem. At some point these "recipes" for success (use tools X and Y in this or that way) need to become productized applications, rather than each enterprise assembling the recipe on its own.</p>
<p>Finally, Hadoop needs more people trained to understand it. If Strata roughly doubles in size each year, that implies that half its attendees are newbies. Getting those newbies up to speed is paramount to helping their enterprises embrace Hadoop.</p>
<h2>Start Small To Go Big</h2>
<p>There are, of course, a swelling number of online and classroom-led courses to teach people Hadoop, which is one way to become proficient.</p>
<p>But Cutting thinks there's another way to drive fast, effective learning. As he puts it, "What works best and leads to the least disappointment is to look at your business and find the low-hanging fruit, a discrete project that could save or make the company a lot of money."</p>
<p>Though Cloudera sells a vision of enterprise data hubs, he thinks that's more of an end goal, not the first step. "Don't try to jump to moving your company to an enterprise data hub," he declares. "Not at first. Start with a point solution with relatively low risk." Then grow the solution (and the team's understanding) from there.</p>
<p>"If you're doing it right," he continues, "others will find out what you're doing and they'll ask to add extra data to your Hadoop setup. Maybe it doesn't solve your immediate business problem, but it allows Hadoop experience to grow organically within an organization."</p>
<p>Which seems like exactly the right way to go big with Big Data: by starting small.</p>
<p>The reality, as he notes, is that enterprises don't turn over their technology investments very fast.
As such, Hadoop primarily gets used for new applications, with the majority of enterprises still running on old technology. It will take years for Hadoop to take its rightful place in the enterprise, but by starting small, Hadoop-savvy employees can position their companies to profit all along that growth trajectory.</p>
<p><em>Image courtesy of <a href="http://www.shutterstock.com">Shutterstock</a></em></p>
<p><em>Doug Cutting says walk before you run.</em></p>
<p><em>Hack | Wed, 25 Feb 2015 06:47:59 -0800 | Matt Asay</em><br />
<a href="http://readwrite.com/2015/02/25/hadoop-big-data-start-small-doug-cutting">http://readwrite.com/2015/02/25/hadoop-big-data-start-small-doug-cutting</a></p>
<hr />
<h1>Four Ways Data Visualization Makes Big Data Easier</h1>
<p>Strata+Hadoop World, Big Data's big conference last week, was filled with sessions dedicated to the gospel of bigness: More data equals more good. From data lakes to enterprise data hubs, the industry has made a fetish of gathering ever more data.</p>
<p>Because, you know, insights are bound to occur. In a twist on open source's "given enough eyeballs, all bugs are shallow," Big Data proclaims, "Given enough data, all data will sprout correlations and consequent insights."</p>
<p>Except, of course, that it doesn't.</p>
<p>As much as we want to fetishize data volumes, the reality is that data is only as useful as the people interpreting it. Yes, machines can programmatically act on correlations they "see" in large data sets, but truly revolutionary change may start with Big Data; it ends with Big Insights from real people.</p>
<h2>Signal, Meet Noise</h2>
<p>Even T.S. Eliot, one of the great poets of the twentieth century, knew this. Writing in 1935, Eliot bemoans the insight we've lost in spite of a wealth of data:</p>
<blockquote><p>Where is the wisdom we have lost in knowledge? Where is the knowledge we have lost in information?</p></blockquote>
<p>At least some of the struggles we have with Big Data arise from not knowing what to do with all the data we now accumulate. This shows through in <a href="http://readwrite.com/2013/09/18/gartner-on-big-data-everyones-doing-it-no-one-knows-why">a Gartner survey</a>:</p>
<figure><img src="http://a5.files.readwrite.com/image/upload/c_fill,cs_srgb,w_620/MTIxNDI3Mjk0OTQ0OTIxMTAx.png" /></figure>
<p>More data, it turns out, doesn't automagically turn into more insight, as noted statistician <a href="http://readwrite.com/2013/03/29/nate-silver-gets-real-about-big-data">Nate Silver declares</a>:</p>
<blockquote><p>If the quantity of information is increasing by 2.5 quintillion bytes per day, the amount of <em>useful</em> information almost certainly isn't. Most of it is just noise, and the noise is increasing faster than the signal. There are so many hypotheses to test, so many data sets to mine—but a relatively constant amount of objective truth.</p></blockquote>
<p>Real insight begins when people apply domain expertise to a body of data to intelligently query that data. As Silver continues, "The numbers have no way of speaking for themselves. We speak for them. We imbue them with meaning." Hence, while we can introduce biases into the data, we can also attain perspicacity.</p>
<h2>Visualizing Data</h2>
<p>To enable individuals to make sense of the ever-increasing mountains of corporate data, companies like Tableau, Roambi, Zoomdata, and other next-generation business intelligence vendors have arisen.
These companies make it easier for the rank and file within an enterprise to understand data.</p>
<p>As Zoomdata's <a href="http://readwrite.com/2014/11/11/business-intelligence-big-data-zoomdata-justin-langseth">Justin Langseth told ReadWrite</a>, the point is not to deliver more data to "high priest" data scientists, but rather</p>
<blockquote><p>to provide a beautiful, simple, yet powerful interface and underlying tech stack to allow regular business people to access, visualize, and collaborate around data that is residing and streaming into a variety of big data backends, and do that efficiently at large data and user scale.</p></blockquote>
<p>Or as Roambi recently <a href="http://blog.roambi.com/why-we-still-can%E2%80%99t-fully-grasp-the-mobile-analytics-we-have">noted</a> in a blog post, "As you invest in big data and analytics solutions, make sure you invest just as much into the people who will use them."</p>
<p>As the company explains, "It's up to the business to invest in training end-users how to think about and use data and analytics as much as they invest in the actual infrastructure and product." In other words, downloading Hadoop isn't the answer. Not the final answer, anyway.</p>
<p>Which may be another way of saying that companies need to prioritize their people, not their data. As Roambi coaches, data-before-people is increasingly the norm, and it causes several related problems:</p>
<ol><li><strong>Analysts aren't sure which metrics to provide:</strong> They may know how to pick apart data to discover insights, but don't know how to communicate these through dashboards that tell a story to a particular job function.</li><li><strong>Metrics aren't being segregated based on job roles:</strong> Different roles require different data.</li><li><strong>End-users can't transform information into knowledge:</strong> People need training to learn how to think about data effectively.</li><li><strong>Businesses are collecting data without changing behaviors:</strong> Organizations should change in response to the data.</li></ol>
<p>The foundation for resolving these issues is to better visualize data for mere mortals. Small wonder, then, that Tableau, the market leader, has recently seen its <a href="https://www.google.com/finance?q=NYSE%3ADATA&amp;ei=F-zoVPH1EoSTqQGgpoCABA">stock hit all-time highs</a>.</p>
<p>By all means, keep investing in Hadoop, NoSQL databases, and other Big Data infrastructure. Just don't forget to also invest in the data visualization software that will help make it meaningful for your employees, who will ultimately be the ones to make sense of your data.</p>
<p><em>Lead photo by <a href="https://www.flickr.com/photos/golbenge/5400710724">Seongbin Im</a></em></p>
<p><em>You need better tools to make data meaningful.</em></p>
<p><em>Work | Mon, 23 Feb 2015 06:20:58 -0800 | Matt Asay</em><br />
<a href="http://readwrite.com/2015/02/23/big-data-visualization">http://readwrite.com/2015/02/23/big-data-visualization</a></p>
<hr />
<h1>Project Myriad Aims To Bring Order To The Big Data Universe</h1>
<p>Big Data no longer looks like a Hadoop monopoly, but it's not yet clear exactly what its future will be.</p>
<p>For years, the open-source data storage-and-management framework Hadoop—and its associated data-processing tool MapReduce—were virtually synonymous with Big Data. Now, though, the would-be Big Data scientist has a much larger array of software tools from which to choose, one of the most promising being Spark, which <a href="http://readwrite.com/2015/01/27/spark-scala-hadoop-typesafe-dean-wampler">I covered recently</a>.</p>
<p>Spark and other tools herald an emerging trend toward "<a href="http://www.zdnet.com/article/fast-data-hits-the-big-data-fast-lane/">fast data</a>," which poses a lot of questions for how Big Data jobs get done. Historically, the approach has been to run large MapReduce jobs in batches on dedicated clusters, with Apache YARN as the default cluster manager.</p>
<blockquote><p><strong>See also: <a href="http://readwrite.com/2013/05/24/hadoop-20-yarn-bid-data-mapreduce">Hadoop 2.0 And YARN—This Summer's Big Data Breakthrough</a></strong></p></blockquote>
<p>Maybe there's a better way.</p>
<p>Developers from eBay, MapR and Mesosphere have collaborated to release <a href="https://github.com/mesos/myriad">Project Myriad</a>, a framework that integrates YARN with <a href="http://opensource.com/business/14/9/open-source-datacenter-computing-apache-mesos">Apache Mesos</a>—another open-source cluster manager—to run Big Data workloads on the same clusters as other applications in the datacenter and the cloud. Today its developers submitted Myriad to the Apache Incubator, affirming their commitment to open-source collaboration.</p>
<p>I spoke with Adam Bordelon, distributed systems architect at Mesosphere, Apache Mesos committer, and a key committer on Project Myriad, to learn more about the benefits of moving Big Data workloads out of standalone, dedicated YARN clusters and into a single shared pool of resources where YARN workloads run alongside the rest of your datacenter applications.</p>
<h2>Minding Your Knitting</h2>
<p><strong>ReadWrite</strong>: <em>Tell us a little bit about the origins of the project and why its committers saw a need to extend the capabilities of YARN.</em></p>
<p><strong>Adam Bordelon</strong>: Apache Hadoop is the de facto standard for running big data workloads today, but the original MapReduce JobTracker could only scale to a few thousand nodes. To scale further, YARN took the resource-management component out of the JobTracker and moved it into its own separate process.</p>
<p>As Hadoop gains traction and becomes the home for the data lake, there is an increasing need to integrate Hadoop with other datacenter services, ideally co-locating the data in HDFS/HBase with the non-Hadoop services that need it.</p>
<p>But the typical Hadoop deployment model favors static partitioning of datacenter resources into separate Hadoop clusters, database clusters, web server clusters, etc.
This practice under-utilizes the overall datacenter resources and exhibits poor data sharing between Hadoop clusters and other applications in the same datacenter or cloud.</p>
<p>Last year Mohit Soni—an engineer at eBay—had the idea of using a Mesos cluster to elastically run YARN alongside other workloads. He was specifically interested in offloading traffic during peak hours, as well as solving data replication challenges across different data silos, where different data sets were marooned on separate YARN clusters.</p>
<p>But he also had a broader vision of a comprehensive framework that combined YARN with Apache Mesos to finally (and cleanly) break Big Data workloads out of dedicated static clusters and allow YARN to coexist with non-Hadoop applications, including long-running web services, streaming applications (e.g., Storm), continuous integration tools (e.g., Jenkins), HPC jobs (e.g., MPI), Docker containers, as well as custom scripts and applications.</p>
<h2>Virtualizing Big Data Workloads</h2>
<p><strong>RW</strong>: <em>Who is going to be most excited about the general availability of Myriad?</em></p>
<p><strong>AB</strong>: It's really the operations teams who will be most excited about Myriad (analytics teams typically are not as concerned with how to share their resources with other clusters). But the poor ops teams have been wondering why all these Hadoop data scientists get their own resources—because it adds a huge amount of complexity to manage multiple clusters within the datacenter, and the aggregate utilization rates are very poor when you have dedicated Hadoop clusters isolated from other workloads.</p>
<p>Myriad addresses two important goals for ops. One is improving cluster utilization: rather than the Hadoop cluster crunching numbers overnight and sitting relatively idle during peak web traffic hours, Myriad enables Mesos to dynamically share resources between the Hadoop cluster and web servers and other applications on demand, even simultaneously co-locating Hadoop jobs on the same machines as other tasks, an approach that can easily double or triple utilization.</p>
<p>The other goal is easier administration. With statically partitioned clusters, if you wanted to add a new node to your Hadoop cluster, you'd have to execute a lot of manual procedures, decommissioning an underutilized server and then configuring it to become a Hadoop node. With Myriad, workloads can just expand into unused capacity when those resources are needed.</p>
<p><strong>RW</strong>: <em>As Big Data analytics become more real-time, what does the complexity look like on the back end?</em></p>
<p><strong>AB</strong>: When the data became bigger than the compute, we started moving the compute to the data, rather than the other way around. This is the principle that MapReduce is based on.</p>
<p>But as the compute itself becomes faster, the demands of real-time or interactive analytics push us to reduce the overhead of scheduling and launching short-lived tasks. Mesos' two-level scheduler model enables Mesos itself to be thin and fast in its scheduling decisions, while individual frameworks like Marathon or Spark (originally developed as an example Mesos framework) can choose their own scheduling policies, either spending a long time deciding the best place for a long-running service, or quickly placing a real-time task on the first available resources.</p>
<p>This approach is preferable to a monolithic scheduler that treats long-running jobs and interactive queries equally, forcing the same scheduling overhead on all workload classes.</p>
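<p>To make the two-level idea concrete, here's a toy Python sketch of Mesos-style resource offers: the first level (the master) stays thin and only brokers offers, while each framework applies its own policy to accept or decline them. The framework names, policies and capacities are invented for illustration.</p>
<pre><code># Toy model of two-level scheduling: the master only brokers resource offers;
# each framework scheduler applies its own policy to accept or decline them.

offers = [
    {"node": "n1", "cpus": 8},
    {"node": "n2", "cpus": 2},
    {"node": "n3", "cpus": 16},
]

class BatchFramework:
    """YARN-ish: deliberates over large allocations for long-running jobs."""
    name = "yarn-batch"
    def consider(self, offer):
        return offer["cpus"] >= 8  # only wants big chunks

class InteractiveFramework:
    """Interactive queries: grabs the first offer that fits, however small."""
    name = "interactive"
    def consider(self, offer):
        return offer["cpus"] >= 1  # any capacity will do, placed immediately

def master_offer_loop(frameworks, offers):
    """First level: a thin master that makes offers and records decisions."""
    for offer in offers:
        for fw in frameworks:
            if fw.consider(offer):  # second level: the framework's own policy
                print(offer["node"], "->", fw.name)
                break
        else:
            print(offer["node"], "declined by all frameworks")

master_offer_loop([BatchFramework(), InteractiveFramework()], offers)
</code></pre>
<p>The master never inspects workloads; all the placement policy lives in the frameworks, which is what keeps the first level thin and fast.</p>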
<h2>YARN vs. Mesos</h2>
<p><strong>RW</strong>: <em>There is a pretty spirited <a href="http://www.quora.com/How-does-YARN-compare-to-Mesos">Quora thread</a> where YARN and Mesos comparisons and contrasts are debated pretty heavily. What does it mean that Big Data adopters no longer have to choose?</em></p>
<p><strong>AB</strong>: Yeah, as Jay Kreps says in that thread, YARN and Mesos have the same goal—to share a large cluster of machines between different frameworks.</p>
<p>The biggest difference is that YARN was designed to be Hadoop-specific, and Mesos was designed to handle an infinite range of workload classes with custom per-framework schedulers.</p>
<p>What's really exciting here is that organizations that are very committed to Hadoop and MapReduce now have a way to elastically expand their YARN cluster while at the same time taking advantage of Mesos' ability to run any other kind of workload, including non-Hadoop applications like web servers, mobile backends, distributed databases and other types of common services.</p>
<p>And at the same time, the community that's already running Mesos can now tap into the power of YARN on their unified Mesos cluster. Running Myriad requires no changes to YARN or Mesos source code, so it can be easily integrated into existing Mesos or Hadoop clusters.</p>
<p><strong>RW</strong>: <em>How does Myriad fit into some of the larger trends affecting the datacenter?</em></p>
<p><strong>AB</strong>: Myriad is one of the clearest examples of how companies are starting to treat the datacenter as if it were just one big computer, where you can install new "killer apps" like YARN, Spark, Kafka, Cassandra and HDFS with a single command and run all of these services multitenant on the same cluster, while isolating them from each other's resources using Linux containers.</p>
<p>Much of the work we're doing at Mesosphere, building an operating system for the datacenter, the Mesosphere DCOS, is based on the belief in this trend.</p>
<p>Myriad will bring YARN to the same level as all of these easy-to-install services—where YARN is just another framework that runs reliably and efficiently alongside other common services on a datacenter-scale distributed operating system.</p>
<p><em>Image by <a href="https://www.flickr.com/photos/quinndombrowski/7619249908">Quinn Dombrowski</a></em></p>
<p><em>It makes YARN play nicely with Mesos in Hadoop.</em></p>
<p><em>Work | Wed, 11 Feb 2015 12:58:15 -0800 | Matt Asay</em><br />
<a href="http://readwrite.com/2015/02/11/project-myriad-big-data-hadoop-yarn-mesos">http://readwrite.com/2015/02/11/project-myriad-big-data-hadoop-yarn-mesos</a></p>
<hr />
<h1>Big Data Failures Owe More To Business Culture Than Technology</h1>
<p>Hope springs eternal, especially in Big Data. Despite widespread failure to achieve much of anything with Big Data projects, gargantuan piles of cash keep flowing into such projects, hitting $31 billion in 2013 and expected to top $114 billion by 2018.</p>
<p>Yet while 60% of executives believe Big Data will upend their industries within three years—according to a <a href="http://www.capgemini-consulting.com/resource-file-access/resource/pdf/big_data_pov_03-02-15.pdf">recent Capgemini report</a>—a mere 8% describe their own projects as "very successful," while another 27% call their efforts "successful."</p>
<p>Given how much companies are spending, one would hope for better returns. Real success, however, derives from a cultural affinity for data, not simply a technology purchase.</p>
<h2>Failure All The Way Down</h2>
<p>No one seems to dispute the inherent value of data, and the more of it the better. In Capgemini's survey, fully 60% of respondents believe Big Data is going to change the world, starting with their industries.</p>
<p>Yet when asked about the status of their Big Data initiatives, it's clear that reality bites:</p>
<figure><img src="http://a1.files.readwrite.com/image/upload/c_fill,cs_srgb,w_620/MTI3OTkyODg2NDA1OTAwOTMx.png" /></figure>
<p>For those who have been paying attention, this isn't really news. After all, <a href="http://readwrite.com/2013/09/18/gartner-on-big-data-everyones-doing-it-no-one-knows-why">Gartner found a few years ago</a> that while everyone was jumping into Big Data, few knew how to make it work.</p>
<p>As to why Big Data projects fail, the answer is "it depends." Some of the reasons are cultural ("ineffective coordination of teams across the organization"), while others are more easily fixed ("dependency on legacy systems"):</p>
<figure><img src="http://a2.files.readwrite.com/image/upload/c_fill,cs_srgb,w_620/MTI3OTkyOTQ3MzQwNzQ5NDQz.png" /></figure>
<p>Looking at this list, it's hard to see how things improve in the short term.</p>
<h2>Teaching Old Data Dogs New Data Tricks</h2>
<p>Some vendors pitch a "data lake" as the solution to the first problem. Capgemini's survey finds that 79% of enterprises haven't completely integrated data sources from across the organization. To make this simpler, the data lake advocates insist that enterprises don't need to standardize data as it enters the organization; instead they can keep it in its original format and just store it in one big repository.</p>
<blockquote><p><strong>See also: <a href="http://readwrite.com/2014/08/13/data-lake-hype-gartner-attack">Oh, Go Jump In A Data Lake</a></strong></p></blockquote>
<p>While this sounds simple, it's not clear that it's actually useful.</p>
<p>I've <a href="https://twitter.com/mjasay/status/553616333179351041">heard</a> some call such a data lake a "Hadump," an unflattering play on the <a href="http://readwrite.com/tag/hadoop">decentralized storage framework Hadoop</a> that suggests that having all the data reside in one place doesn't by itself make it useful.
As Gartner analyst <a href="http://readwrite.com/2014/08/13/data-lake-hype-gartner-attack">Nick Heudecker has pointed out</a>:</p>
<blockquote><p>The fundamental issue with the data lake is that it makes certain assumptions about the users of information. It assumes that users recognize or understand the contextual bias of how data is captured, that they know how to merge and reconcile different data sources without "a priori knowledge," and that they understand the incomplete nature of datasets, regardless of structure.</p></blockquote>
<p>So some fail with newfangled, Hadoop-inspired data lakes, while many more fail by trying to get antiquated data infrastructure (e.g., relational databases) to fit modern data (messy, disparate, and lots of it).</p>
<h2>The Cultural Problem</h2>
<p>But the biggest cause of failure, even if not acknowledged as such, is that most enterprises simply don't have a culture of data-centricity. At best, they treat "Big Data" as a discrete project with a definitive completion date.</p>
<p>As such, they're not set up to succeed, as the Capgemini report finds:</p>
<blockquote><p>There are many factors that go into the making of a successful Big Data implementation. However, the single biggest factor that we observed was that organizations that have a strong operating model stood apart. This operating model has multiple distinct elements, which include, among others, a well-defined organizational structure, systematic implementation plan, and strong leadership support.</p></blockquote>
<p>Each of these three things ties into a corporate culture that appreciates, and is built around, data.</p>
<p>I would also add, following something that Zoomdata's Justin Langseth recently said to me, that design is an essential element of any successful Big Data project. The best Big Data projects will bring data to life for the rank and file within an enterprise, not merely the high priests and priestesses of data science.</p>
<p>In sum, Big Data success flows from a cultural affinity for data, which can be sparked by a strong leader within an organization but ultimately must become how an entire company thinks about its business.</p>
<p><em>Image courtesy of <a href="http://www.shutterstock.com">Shutterstock</a></em></p>
<p><em>Most corporate cultures still aren't built around data.</em></p>
<p><em>Work | Mon, 09 Feb 2015 09:13:09 -0800 | Matt Asay</em><br />
<a href="http://readwrite.com/2015/02/09/big-data-failure-blame-corporate-culture">http://readwrite.com/2015/02/09/big-data-failure-blame-corporate-culture</a></p>
<hr />
<h1>How Open Source Succeeds In The Cloud—It Trades Freedom For Simplicity</h1>
<p>Those new to open source won't remember just how much of the early code amounted to little more than crappy-but-free clones of popular proprietary products. Boy, how times have changed.</p>
<p>Open source, once a clumsy (but free!) imitator of proprietary innovation, is now taking the lead on industry innovation, with Big Data being the most obvious example. While this is a hugely positive industry shift, it also introduces complexities. Namely, with so much exceptional open source software contending to power your next Big Data project, how do you choose which to use?</p>
<h2>Opening Up Innovation</h2>
<p>Black Duck Software recently named its annual "<a href="https://www.blackducksoftware.com/open-source-rookies">Open Source Rookies of the Year</a>," pulling data from thousands of projects on project activity, commit pace, project team attributes, and other factors. Spanning cloud and virtualization, mobile, social media and more, the winners reflect the ever-increasing scope of code that is successfully developed in the open, rather than behind closed doors.</p>
<blockquote><p><strong>See also: <a href="http://readwrite.com/2014/08/15/open-source-software-business-zulily-erp-wall-street-journal">Why Your Company Needs To Write More Open-Source Software</a></strong></p></blockquote>
<p>Nowhere is this trend more evident than in Big Data.</p>
<p>As Cloudera co-founder <a href="https://www.linkedin.com/pulse/20131003190011-29380071-the-cloudera-model">Mike Olson declares</a>, "No dominant platform-level software infrastructure has emerged in the last ten years in closed-source, proprietary form." That's a stunning assessment, but it's absolutely true. Open source may have come to life as an imitator, but it's innovating at a frenetic pace in Big Data land.</p>
<p>Which may be a problem.</p>
<h2>Spoiled By Open Source Riches</h2>
<p>Big Data projects are now being released at such a frenetic pace that developers struggle to keep up. In case you're just getting your feet wet with Hadoop, for example, you now need to consider Spark, <a href="http://samza.apache.org/">Samza</a> or a variety of other oddly named but increasingly important Big Data tools.</p>
<blockquote><p><strong>See also: <a href="http://readwrite.com/2014/12/31/big-data-companies--applications-money-startups">Applications Drive The Biggest Money In Big Data</a></strong></p></blockquote>
<p>Importantly, these tools are largely being born within enterprises like LinkedIn that have serious Big Data needs that no commercial software can solve. Even the National Weather Service has jumped in, <a href="http://www.emc.ncep.noaa.gov/GFS/code.php">open-sourcing the code</a> that powers its global forecast system.</p>
<p>While most companies won't need such niche code, they may want the sorts of things released by the big Web companies. Take, for instance, <a href="http://siliconangle.com/blog/2015/02/02/what-you-missed-in-big-data-hadoop-no-longer-the-only-game-in-town/">LinkedIn's release of Apache Samza</a>:</p>
<blockquote><p>The LinkedIn-developed framework is designed to process complex real-time workloads that require special handling after ingestion. It embeds a local key-value store in every stream that makes it possible to store the kind of contextual information needed to carry out advanced operations such as merging datasets locally instead of having to query a remote system every time they're needed.</p></blockquote>
<p>This leads to fantastic performance.</p>
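<p>The "local key-value store" idea in that description is easy to demo in miniature. Here's a hedged Python sketch (not Samza's API) of a stream task that keeps user context in a task-local dict so it can enrich events without a remote lookup per message. The message shapes and field names are invented.</p>
<pre><code># Miniature version of the local-state pattern described above: context lives
# in a task-local store, so enriching an event is a dict lookup rather than
# a remote query per message.

local_state = {}  # stand-in for an embedded key-value store

def process(message):
    """Profile updates build local state; events are enriched from it."""
    if message["type"] == "profile_update":
        local_state[message["user"]] = message["profile"]
        return None
    profile = local_state.get(message["user"], {})
    return dict(message, region=profile.get("region", "unknown"))

stream = [
    {"type": "profile_update", "user": "u1", "profile": {"region": "us-west"}},
    {"type": "click", "user": "u1", "page": "/pricing"},
    {"type": "click", "user": "u2", "page": "/docs"},
]

for enriched in filter(None, map(process, stream)):
    print(enriched)
</code></pre>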
<p>It also leads to the question: What should a developer use to tackle her organization's data load?</p>
<p>On the database side, there are hundreds of options, ranging from NoSQL databases like MongoDB and Cassandra to relational mainstays like Oracle and MySQL. Should a developer choose the most popular database, picking from a list like DB-Engines' ranking? That's one approach, but you could easily end up with a big mismatch between the workload and the tool managing it.</p>
<p>If this seems like a trivial problem, it's not. At all. I spent years working for Big Data infrastructure providers, and now work for a company trying to make sense of the deluge of open source Big Data tools. It's hard to keep up, and very difficult to know which to use.</p>
<h2>Closing Off Choices</h2>
<p>One reason that Amazon Web Services (AWS) has become the go-to public cloud is that the company has managed to simultaneously offer a broad array of open source solutions to run (supported and unsupported) on its cloud, and a suite of proprietary services for everything from email to data warehousing.</p>
<p>Developers, anxious to "get stuff done," can turn to AWS and know that they'll have both a variety of options and the safety of a paved path.</p>
<p>Microsoft Azure has followed suit. Not content to roll out a Hadoop-based analytics service, for example, Microsoft is now close to releasing <a href="http://www.zdnet.com/article/microsoft-to-offer-a-paid-version-of-its-internal-cosmos-big-data-service/">Cosmos</a>, its parallel processing and storage service. Or take the company's support for MongoDB, an open source document database, to appeal to those who want the popular NoSQL database. At the same time, Microsoft has rolled out its own document database as a service, for those who want a document database but may prefer Microsoft's packaging of it.</p>
<p>Microsoft, in short, wants to provide choice to its customers, but curated and nicely packaged.</p>
<p>This looks like the future of open source infrastructure: free to download, but perhaps more useful rolled into a cloud service that removes complexity (and choice). It may not be what the open source crowd would prefer, but it may end up being the ideal way to turn open source Big Data innovation into solutions mainstream enterprises can actually use.</p>
<p><em>Photo by <a href="https://www.flickr.com/photos/incredibleguy/6937250243">George Thomas</a></em></p>
<p><em>As demonstrated by Amazon and Microsoft.</em></p>
<p><em>Cloud | Tue, 03 Feb 2015 07:00:00 -0800 | Matt Asay</em><br />
<a href="http://readwrite.com/2015/02/03/open-source-big-data-simplicity-not-freedom">http://readwrite.com/2015/02/03/open-source-big-data-simplicity-not-freedom</a></p>
http://readwrite.com/2015/02/03/open-source-big-data-simplicity-not-freedomCloudTue, 03 Feb 2015 07:00:00 -0800Matt AsayHow Big Data Could Limit Super Bowl Sticker Shock<!-- tml-version="2" --><p><em>Guest author Alex Salkever is the&nbsp;head of product marketing and business development at Silk.co.</em></p><p>Andrew Kitchell is from Seattle and is the co-founder of <a href="http://www.pricemethod.com">PriceMethod</a>, a startup that helps Airbnb and HomeAway hosts price their properties. His co-founder Joe Fraiman is from Boston. They both follow football and pondered going to the Super Bowl, but were floored by the high prices for accommodations—even though their business is all about supply and demand, which gives them a certain insight into the impact of 100,000 people abruptly descending on a city in search of an affordable place to stay.</p><div tml-image="ci01c5ad08c001c80a" tml-image-caption="Credit: PriceMethod" tml-render-layout="inline"><figure><img src="http://a1.files.readwrite.com/image/upload/c_fill,cs_srgb,w_620/MTI3Njk4NDE2NzM3MjI5Mjc4.png" /><figcaption>Credit: PriceMethod</figcaption></figure></div><p>So Kitchell and Fraiman flipped their methodology around and built a simple tool to help Super Bowl attendees find cheaper <a href="http://www.pricemethod.com/superbowl/guest">last-minute lodging</a>. They took the same Big Data harvesting and categorization infrastructure they had built and, on a dime, put a new UI on the results to make it easier for the public to search for cheap accommodations—the exact opposite of their normal business helping peer-to-peer property owners charge what the market will bear.</p><p>I caught up with Kitchell to talk about their Super Bowl findings and how PriceMethod crawls data and builds data models that can give property owners the same pricing tools as big hotel chains. Here's a lightly edited version of our conversation.</p><h2>Leveling The Playing Field</h2><p><em><strong>ReadWrite:</strong> So where did the idea come from? </em></p><p><strong>Andrew Kitchell:</strong> We are a data science-focused team of Y Combinator alums, and <em>usually</em> we help Airbnb and HomeAway listings with data-driven pricing. However, my co-founder is from Boston, and I'm from Seattle, so we thought this would be a fun time to use our data to help our fellow football fans.</p><p><em><strong>RW:</strong> Tell us a little bit about how PriceMethod works.</em></p><p><strong>AK: </strong>We’re trying to level the playing field for P2P (peer-to-peer) accommodations versus traditional big hotels. To do that, we need to have a good picture of the entire market, including hotels and other accommodation sources.</p><p>As a base we collect data from Airbnb and HomeAway, the two biggest P2P accommodation networks. We do that several times per day. Additionally, we collect hotel price and occupancy data from multiple sources across the Internet. Primarily, we use hotel data to build a predictive pricing model for local demand. We assume that hotels, because they have very strong predictive pricing tools, are already baking good assumptions about local demand into their prices, based on their own algorithms and historical data.</p><p>We also use vacation rental and P2P property data to build a reactive pricing model. This adjusts prices based on how local demand translates into actual bookings within a neighborhood and inventory type. You need that in the P2P market because it is still somewhat unpredictable. 
</p><p><em><strong>RW:</strong> How do you account for things like the price of inventory taken off the market?</em></p><p><strong>AK:</strong> For scraped hotel and vacation rental or P2P listings, we infer the "booked price" for any day from the last observed price. We collect data from channels throughout the day, so we will observe and record any booking within, at most, 24 hours. With a linked account, we can get perfect access to booking data. As a first step, though, the last observed price is enough to inform a robust model.</p><h2>How To Build A Pricing Model</h2><p><em><strong>RW:</strong> Your team has some deep experience in building pricing models for big financial firms in commodities and other trading markets. How do you build your pricing models for the P2P accommodations market? </em></p><p><strong>AK: </strong>Our current pricing model consists of four components. First, we base price recommendations on the average market value of similar listings. Then we make a local adjustment based on the popularity of any given neighborhood, which refines that base model.</p><p>We then apply a time-sensitive model informed by the booking curve of the local market, taking into account when local bookings are expected to arrive. Lastly, we look at demand-driven changes depending on the local availability of vacation rentals and hotels. </p><p><em><strong>RW: </strong>So how is the Super Bowl different in terms of pricing?</em></p><p><strong>AK:</strong> By our calculation, at least 75% of the P2P and vacation rental market is underpriced for the Super Bowl. We're seeing some amazing price increases from informed owners; our favorite example of how the rest of the accommodations market is moving is that someone is selling a basic room at 20x the normal rate.</p><p>For the Super Bowl, we wanted to determine how hosts could price their homes during a period of <em>exceptional</em> demand. So we actually skewed our model to analyze how much experienced P2P hosts—those with more reviews and more future bookings—were increasing prices, and how booked out these listings were at their raised prices. In some cases, owners are raising prices to as much as 15 times their normal rates, so we were able to observe bookings at these homes to gauge the efficacy of the increases. </p><p>For hosts during the Super Bowl, we used this analysis to recommend a reasonable range of price increases for other homes<em>.</em> For travelers attending the Super Bowl, we used this same process to determine which homes were priced best in comparison to their potential value. </p>
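<p>To make those four components concrete, here is a toy rendering in Scala. It is our illustration, not PriceMethod's actual system: every field name, multiplier, and threshold below is invented for the example.</p>
<pre>
// Four adjustments, applied in order: a comparable-listings base price, a
// neighborhood-popularity factor, a booking-curve factor, and a local
// supply/demand factor. All numbers here are invented for illustration.
case class Listing(
  comparableAvgPrice: Double, // average market value of similar listings
  neighborhoodIndex: Double,  // above 1.0 for popular neighborhoods, below 1.0 otherwise
  daysUntilStay: Int,         // how far out the night being priced is
  localOccupancy: Double      // share of nearby inventory already booked, 0.0 to 1.0
)

object ToyPricer {
  // Time-sensitive factor from a simplified booking curve: hold price while the
  // date is far out, discount as an unbooked night approaches.
  private def bookingCurve(daysOut: Int): Double =
    if (daysOut > 30) 1.0 else 0.85 + 0.005 * daysOut

  // Demand factor: scarce nearby inventory supports higher prices.
  private def demandFactor(occupancy: Double): Double =
    1.0 + 0.5 * math.max(0.0, occupancy - 0.7)

  def recommend(l: Listing): Double =
    l.comparableAvgPrice * l.neighborhoodIndex *
      bookingCurve(l.daysUntilStay) * demandFactor(l.localOccupancy)
}
</pre>
<p>Worked through, a listing whose comparables average $200 a night, in a hot neighborhood (index 1.2), 10 days out (curve factor 0.90), with 90% of nearby inventory booked (demand factor 1.10), would be recommended at roughly $238.</p>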
<h2>Let's Talk Nerdy</h2><p><em><strong>RW: </strong>What does your tech stack look like?</em></p><p><strong>AK:</strong> It’s a Rails stack with a Postgres database and Redis for caching. The whole thing sits on top of Amazon Web Services, so we can spin up as many nodes as we need to do our crawls. We use Mechanize for a lot of our crawling, and a combination of APIs, mobile APIs and standard Web data to fuel our system. AWS makes it very easy to get up and running; it’s almost a no-brainer. It has so many tools, and for the cost and the power, it’s quite amazing.</p><p><em><strong>RW:</strong> For vacation rental owners who use you, how much more money can they expect to make? </em></p><p><strong>AK: </strong>Our initial numbers show we are increasing their revenue by 20% to 40%. Those numbers should get better as our customer base grows. We can’t disclose numbers yet, but this is a huge, multi-billion-dollar market that is poorly addressed today. Airbnb is adding thousands of listings per day. We’re bootstrapping and plan to raise money in a few months. But we’re confident the market is there.</p><p><em>Lead graphic courtesy of <a href="http://www.shutterstock.com">Shutterstock</a></em></p>By helping visitors find low-cost lodging in Phoenix for the big game.http://readwrite.com/2015/01/29/super-bowl-cheap-rooms-pricemethod-big-data
http://readwrite.com/2015/01/29/super-bowl-cheap-rooms-pricemethod-big-dataCloudThu, 29 Jan 2015 06:00:00 -0800Alex SalkeverThe Big-Data Tool Spark May Be Hotter Than Hadoop, But It Still Has Issues<!-- tml-version="2" --><p>Hadoop is hot. But its kissing cousin Spark is even hotter.</p><p>Indeed, Spark is hot like Apache Hadoop was half a decade ago. Spawned at UC Berkeley’s AMPLab, Spark is a fast data-processing engine that works in the Hadoop ecosystem, replacing MapReduce. It is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and iterative algorithms, such as those commonly found in machine learning and graph processing.</p><p>San Francisco-based Typesafe, the commercial backer of Scala, Play Framework, and Akka, and sponsor of a <a href="http://readwrite.com/2014/10/20/java-8-adoption-apache-spark-internet-of-things">popular survey of Java developers I wrote about last year</a>, recently conducted a <a href="http://info.typesafe.com/COLL-20XX-Spark-Survey-Report_LP.html?lst=RW&amp;lsd=COLL-20XX-Spark-Survey-Trends-Adoption-Report">survey of developers about Spark</a>. More than 2,000 developers (2,136, to be exact) responded. Three conclusions jump out:</p><ol><li><strong>Spark awareness and adoption are seeing hockey-stick growth.</strong> Google Trends <a href="http://www.google.com/trends/explore#q=apache%20spark&amp;cmpt=q&amp;tz=">confirms</a> this. The survey shows that 71% of respondents have at least evaluation or research experience with Spark, and 35% are now using it or plan to use it.</li><li><strong>Faster data processing and event streaming are the focus for enterprises.</strong> By far the most desirable features are Spark's vastly improved processing performance over MapReduce (over 78% mention this) and its ability to process event streams (over 66%), something MapReduce cannot do; the sketch after this list illustrates the difference.</li><li><strong>Perceived barriers to adoption are not major blockers.</strong> When asked what's holding them back from the Spark revolution, respondents mentioned their own lack of experience with Spark and the need for more detailed documentation, especially for advanced application scenarios and performance tuning. They mentioned perceived immaturity in general, as well as integration with other middleware, like message queues and databases. Lack of commercial support, which is still spotty even among the Hadoop vendors, was also a concern. Finally, some respondents said their organizations aren't in need of Big Data solutions at this time.</li></ol>
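<p>That second finding is easiest to appreciate in code. Below is a minimal sketch in Scala (ours, not Typesafe's, using Spark 1.x APIs) of the same counting logic run once over data at rest and then continuously over five-second micro-batches of a live stream. The HDFS paths and socket source are placeholders. MapReduce can express the first half; it has no equivalent of the second.</p>
<pre>
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object Counts {
  // One function expresses the computation...
  def countWords(lines: RDD[String]): RDD[(String, Int)] =
    lines.flatMap(_.split("\\s+")).map(word => (word, 1)).reduceByKey(_ + _)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("counts"))

    // ...run once over data at rest: the MapReduce-style batch case.
    countWords(sc.textFile("hdfs:///data/articles")).saveAsTextFile("hdfs:///out/counts")

    // The same logic run continuously over micro-batches of a live stream,
    // which classic MapReduce has no way to express.
    val ssc = new StreamingContext(sc, Seconds(5))
    ssc.socketTextStream("localhost", 9999)
      .transform(rdd => countWords(rdd)) // reuse the batch logic per micro-batch
      .print()
    ssc.start()
    ssc.awaitTermination()
  }
}
</pre>
<p>The detail worth noting is the reuse: the streaming half runs the exact same function as the batch half, which is a big part of Spark's appeal over maintaining separate batch and streaming codebases.</p>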
<p>I spoke to Typesafe’s architect for Big Data Products and Services, Dean Wampler (<a href="https://twitter.com/deanwampler">@deanwampler</a>), about the rise of Spark. Wampler <a href="http://www.infoq.com/presentations/spark-scala-mapreduce-java">recently recorded a talk</a> on why he thinks Spark/Scala are rapidly replacing MapReduce/Java as the most popular Big Data compute engine in the enterprise.</p><h2>Striking The Spark</h2><div tml-image="ci01c56a771001efe2" tml-image-caption="Dean Wampler" tml-render-layout="right"><figure><img src="http://a2.files.readwrite.com/image/upload/c_fill,cs_srgb,dpr_1.0,q_80,w_620/MTI3NjI1MjE5NDg4NjYzNTYy.jpg" /><figcaption>Dean Wampler</figcaption></figure></div><p><strong>ReadWrite</strong>:&nbsp;<em>For those venturing into Spark, what are the most common hurdles?</em></p><p><strong>Wampler</strong>:&nbsp;It’s mostly about acquiring expertise and finding good documentation with deep, non-trivial examples. Many people aren’t sure how to manage, monitor, and tune their jobs and clusters. Commercial support for Spark is still limited, especially for non-YARN deployments; even among the Hadoop vendors, support is spotty.</p><p>Spark still needs to mature in many ways, especially the newer modules, such as Spark SQL and Spark Streaming. Older tools, like Hadoop and MapReduce, have had a longer runway and hence more time to be hardened and for expertise to be documented. All these issues are being addressed, and they should be resolved relatively soon.</p><p><strong>RW</strong>:&nbsp;<em>I hear people ask "where are you running Spark?" all the time, suggesting a pretty broad range of resource management strategies, e.g., standalone clusters, YARN, Mesos. Do you believe the industry will tend to run Big Data clusters in isolation, or do you see it eventually moving to running Big Data clusters alongside other applications in production?&nbsp;</em></p><p><strong>DW</strong>: I think most organizations will still use fewer, larger clusters, just so their operations teams have fewer clusters to watch. Mesos and YARN really make this approach attractive. Conversely, Spark makes it easier to set up small, dedicated clusters for specific problems. Say you’re ingesting the Twitter firehose. You might want a dedicated cluster tuned optimally for that streaming challenge. Maybe it forwards “curated” data to another cluster, say a big one used for data warehousing.</p><h2>Keeping The Spark Alive</h2><p><strong>RW</strong>:&nbsp;<em>Is the operations side of Spark different from the operations side of MapReduce?</em></p><p><strong>DW</strong>:&nbsp;For batch jobs, it’s about the same. Streaming jobs, however, raise new challenges.&nbsp;</p><p>For a typical batch job, whether it’s written in Spark or MapReduce, you submit a job to run, it gets its resources from YARN or Mesos, and once it finishes, the resources are released. In Spark Streaming, however, jobs run continuously, so you might need more robust recovery if the job dies, to ensure stream data isn’t lost.&nbsp;</p><p>Another problem is resource allocation. For a batch job, it’s probably okay to give it a set of resources and have those resources locked up for the job’s life cycle. (Note, however, that some dynamic management is already done by YARN and Mesos.) Long-running jobs really need more dynamic resource management, so you don’t have idle resources during relatively quiescent periods, or overwhelmed resources during peak times.&nbsp;</p><p>Hence, you really want the ability to grow and shrink resource allocations, with scaling up and down automated. This is not a trivial problem to solve, and you can’t rely on human intervention either.</p>
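<p>That recovery problem has a standard answer worth sketching. The snippet below is a minimal illustration (ours, not Wampler's) of Spark Streaming's checkpointing: metadata and state are written to reliable storage, and on startup the job either builds a fresh context or reconstitutes one from the checkpoint, so a restarted driver doesn't lose in-flight stream state. The path, host, and port are placeholders.</p>
<pre>
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ResilientStream {
  val checkpointDir = "hdfs:///checkpoints/click-stream" // placeholder path

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("click-stream")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir) // periodically persist metadata and state
    ssc.socketTextStream("ingest-host", 9999) // placeholder source
      .countByWindow(Seconds(60), Seconds(10)) // windowed state that must survive restarts
      .print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // Fresh start: build a new context. After a crash: rebuild from the checkpoint.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
</pre>
<p>Note that this covers recovery, not the dynamic grow-and-shrink of resources Wampler describes; as he says, that part remains an open problem.</p>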
<p><strong>RW</strong>: <em>Let’s talk about the Scala/Spark connection. Does Spark require knowledge of Scala? Are most people using Spark also well versed in Scala? And is it more the case that Scala users are those who tend to favor Spark, or is Spark creating a “pull” effect into Scala?</em></p><p><strong>DW</strong>: Spark is written in Scala, and it is pulling people toward Scala. Typically they’re coming from the Big Data ecosystem already, and they are used to working with Java, if they are developers, or languages like Python and R, if they are data scientists.&nbsp;</p><p>Fortunately for everyone, Spark supports several languages: Scala, Java, and Python, with R on the way. So people don’t necessarily have to switch to Scala.&nbsp;</p><p>There has been a lag in API coverage for the other languages, but the Spark team has almost closed the gap. The rule of thumb is that you’ll get the best runtime performance if you use Scala or Java, and the most concise code if you use Scala or Python. So Spark is drawing people to Scala, but it doesn’t require you to be a Scala expert.&nbsp;</p><p>I like the fact that Spark uses the more mainstream features of Scala. It doesn’t require mastery of more advanced constructs.</p>
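<p>A quick sketch of the conciseness point, using only the mainstream Scala features Wampler mentions: anonymous functions, tuples, and pattern matching. The toy job below computes a mean rating per product, the kind of task that takes a page or more of MapReduce-era Java. The input path and naive CSV parsing are placeholders for illustration.</p>
<pre>
import org.apache.spark.{SparkConf, SparkContext}

object AvgRatings {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("avg-ratings"))
    sc.textFile("hdfs:///ratings.csv")                  // lines like "user,product,rating"
      .map(_.split(","))                                // naive CSV split, for illustration
      .map(f => (f(1), (f(2).toDouble, 1)))             // (product, (rating, count))
      .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
      .mapValues { case (sum, count) => sum / count }   // product to mean rating
      .take(10)
      .foreach(println)
    sc.stop()
  }
}
</pre>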
<p><em>Photo courtesy of <a href="http://www.shutterstock.com">Shutterstock</a></em></p>It's the cool kid these days, but it's flunking some subjects.http://readwrite.com/2015/01/27/spark-scala-hadoop-typesafe-dean-wampler
http://readwrite.com/2015/01/27/spark-scala-hadoop-typesafe-dean-wamplerWorkTue, 27 Jan 2015 07:00:00 -0800Matt Asay