ReadWrite - mapreduce
http://readwrite.com/tag/mapreduce
Copyright 2015 Wearable World Inc. | Tue, 31 Mar 2015 15:02:36 -0700

<h1>The Big-Data Tool Spark May Be Hotter Than Hadoop, But It Still Has Issues</h1>
<p>Hadoop is hot. But its kissing cousin Spark is even hotter.</p><p>Indeed, Spark is hot like Apache Hadoop was half a decade ago. Spawned at UC Berkeley’s AMPLab, Spark is a fast data-processing engine that works in the Hadoop ecosystem, replacing MapReduce. It is designed to handle both batch processing (similar to MapReduce) and newer workloads like streaming, interactive queries, and the iterative algorithms commonly found in machine learning and graph processing.</p><p>San Francisco-based Typesafe, sponsor of a <a href="http://readwrite.com/2014/10/20/java-8-adoption-apache-spark-internet-of-things">popular survey of Java developers I wrote about last year</a> and the commercial backer of Scala, the Play Framework, and Akka, recently conducted a <a href="http://info.typesafe.com/COLL-20XX-Spark-Survey-Report_LP.html?lst=RW&amp;lsd=COLL-20XX-Spark-Survey-Trends-Adoption-Report">survey of developers about Spark</a>. More than 2,000 developers (2,136, to be exact) responded. Three conclusions jump out from the findings:</p><ol><li><strong>Spark awareness and adoption are seeing hockey-stick growth.</strong> Google Trends <a href="http://www.google.com/trends/explore#q=apache%20spark&amp;cmpt=q&amp;tz=">confirms</a> this.
The survey shows that 71% of respondents have at least evaluated or researched Spark, and 35% are using it now or plan to use it.</li><li><strong>Faster data processing and event streaming are the focus for enterprises.</strong> By far the most desirable features are Spark's vastly improved processing performance over MapReduce (mentioned by over 78%) and its ability to process event streams (mentioned by over 66%), something MapReduce cannot do.</li><li><strong>Perceived barriers to adoption are not major blockers.</strong> When asked what's holding them back from the Spark revolution, respondents cited their own lack of experience with Spark and the need for more detailed documentation, especially for advanced application scenarios and performance tuning. They mentioned perceived immaturity in general, as well as integration with other middleware, like message queues and databases. Lack of commercial support, which is still spotty even among the Hadoop vendors, was also a concern. Finally, some respondents said their organizations simply don't need big data solutions at this time.</li></ol><p>I spoke with Typesafe’s architect for Big Data products and services, Dean Wampler (<a href="https://twitter.com/deanwampler">@deanwampler</a>), to get his thoughts on the rise of Spark.
Wampler <a href="http://www.infoq.com/presentations/spark-scala-mapreduce-java">recently recorded a talk</a> on why he thinks Spark and Scala are rapidly replacing MapReduce and Java as the most popular Big Data compute engine in the enterprise.</p><h2>Striking The Spark</h2><div tml-image="ci01c56a771001efe2" tml-image-caption="Dean Wampler" tml-render-layout="right"><figure><img src="http://a2.files.readwrite.com/image/upload/c_fill,cs_srgb,dpr_1.0,q_80,w_620/MTI3NjI1MjE5NDg4NjYzNTYy.jpg" /><figcaption>Dean Wampler</figcaption></figure></div><p><strong>ReadWrite</strong>: <em>For those venturing into Spark, what are the most common hurdles?</em></p><p><strong>Wampler</strong>: It’s mostly around acquiring expertise and having good documentation with deep, non-trivial examples. Many people aren’t sure how to manage, monitor, and tune their jobs and clusters. Commercial support for Spark is still limited, especially for non-YARN deployments; even among the Hadoop vendors, it is spotty.</p><p>Spark still needs to mature in many ways, especially the newer modules, such as Spark SQL and Spark Streaming. Older tools, like Hadoop and MapReduce, have had a longer runway and hence more time to be hardened and for expertise to be documented. All these issues are being addressed, and they should be resolved relatively soon.</p><p><strong>RW</strong>: <em>I hear people ask "where are you running Spark?" all the time, suggesting a pretty broad range of resource management strategies, e.g., standalone clusters, YARN, Mesos. Do you believe the industry will tend to run Big Data clusters in isolation, or do you see it eventually running Big Data clusters alongside other applications in production?</em></p><p><strong>DW</strong>: I think most organizations will still use fewer, larger clusters, just so their operations teams have fewer clusters to watch.
Mesos and YARN really make this approach attractive. Conversely, Spark makes it easier to set up small, dedicated clusters for specific problems. Say you’re ingesting the Twitter firehose. You might want a dedicated cluster tuned optimally for that streaming challenge. Maybe it forwards “curated” data to another cluster, say a big one used for data warehousing.</p><h2>Keeping The Spark Alive</h2><p><strong>RW</strong>: <em>Is the operations side of Spark different from the operations side of MapReduce?</em></p><p><strong>DW</strong>: For batch jobs, it’s about the same. Streaming jobs, however, raise new challenges.</p><p>For a typical batch job, whether it’s written in Spark or MapReduce, you submit a job to run, it gets its resources from YARN or Mesos, and once it finishes, the resources are released. In Spark Streaming, however, jobs run continuously, so you might need more robust recovery if a job dies, so stream data isn’t lost.</p><p>Another problem is resource allocation. For a batch job, it’s probably okay to give it a set of resources and have those resources locked up for the job’s life cycle. (Note, however, that some dynamic management is already done by YARN and Mesos.) Long-running jobs really need more dynamic resource management, so you don’t have idle resources during relatively quiescent periods, or overwhelmed resources during peak times.</p><p>Hence, you really want the ability to grow and shrink resource allocations, with scaling up and down automated. This is not a trivial problem to solve, and you can’t rely on human intervention either.</p><p><strong>RW</strong>: <em>Let’s talk about the Scala/Spark connection. Does Spark require knowledge of Scala? Are most people using Spark also well versed in Scala?
And is it more the case that Scala users tend to favor Spark, or is Spark creating a “pull” effect into Scala?</em></p><p><strong>DW</strong>: Spark is written in Scala, and it is pulling people toward Scala. Typically they’re coming from the Big Data ecosystem already, and they are used to working with Java, if they are developers, or with languages like Python and R, if they are data scientists.</p><p>Fortunately for everyone, Spark supports several languages: Scala, Java, and Python, with R support coming. So people don’t necessarily have to switch to Scala.</p><p>There has been a lag in API coverage for the other languages, but the Spark team has almost closed the gap. The rule of thumb is that you’ll get the best runtime performance if you use Scala or Java, and the most concise code if you use Scala or Python. So Spark is drawing people to Scala, but it doesn’t require you to be a Scala expert.</p><p>I like the fact that Spark uses the more mainstream features of Scala. It doesn’t require mastery of the more advanced constructs.</p><p><em>Photo courtesy of <a href="http://www.shutterstock.com">Shutterstock</a></em></p>
It's the cool kid these days, but it's flunking some subjects.
http://readwrite.com/2015/01/27/spark-scala-hadoop-typesafe-dean-wampler
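The concision gap between the two programming models discussed above can be illustrated without either framework. Below is a rough sketch in plain Python: a word count written MapReduce-style, with explicit map, shuffle, and reduce phases, next to the single chained pipeline that a Spark-style API (flatMap / map / reduceByKey) encourages. None of this is Spark or Hadoop code; it only mimics the shape of each programming model.

```python
from collections import defaultdict
from functools import reduce
from itertools import groupby

lines = ["spark is hot", "hadoop is hot", "spark is hotter"]

# --- MapReduce style: three explicit phases -----------------------------
def map_phase(line):
    # Emit a (key, value) pair for every word, as a mapper would.
    return [(word, 1) for word in line.split()]

mapped = [kv for line in lines for kv in map_phase(line)]

# Shuffle: group all values by key, as the framework does between phases.
shuffled = defaultdict(list)
for word, count in mapped:
    shuffled[word].append(count)

# Reduce: fold each key's values down to a single result.
mr_counts = {word: reduce(lambda a, b: a + b, counts)
             for word, counts in shuffled.items()}

# --- Spark style: one chained pipeline over the same data ---------------
# Plain-Python stand-ins for flatMap (the generator), shuffle (sorted +
# groupby), and reduceByKey (sum per group).
pairs = sorted((w, 1) for line in lines for w in line.split())
spark_counts = {w: sum(c for _, c in grp)
                for w, grp in groupby(pairs, key=lambda kv: kv[0])}

assert mr_counts == spark_counts
print(spark_counts)
```

Both halves compute the same result; the point is that the batch-oriented version forces you to spell out the phases, while the pipeline version reads as one expression, which is roughly the ergonomic difference survey respondents favored.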
Work | Tue, 27 Jan 2015 07:00:00 -0800 | Matt Asay

<h1>Top Big Data Trends Of 2013</h1>
<div tml-image="ci01b280a300018266" tml-render-position="right" tml-render-size="medium"><figure><img src="http://a4.files.readwrite.com/image/upload/c_fill,cs_srgb,dpr_1.0,q_80,w_620/MTIyMzAxNTc2Mzg2NzQwODM4.jpg" /></figure></div><p><em><a href="http://readwrite.com/series/reflect">ReadWriteReflect</a> offers a look back at major technology trends, products and companies of the past year.</em></p><p>Thank the U.S. National Security Agency for bringing Big Data onto the national stage. The agency’s questionable use of data mining and analysis in the name of national security raised Main Street’s awareness of Big Data like never before.</p><p>Big Data is about making sense of a wellspring of unstructured and structured information: What do we like? What did we buy? What do we do? Answering these questions has proven extremely valuable to companies large and small, as well as to local and worldwide organizations.</p><blockquote tml-render-position="right" tml-render-size="medium"><p><strong>See also: <a href="http://readwrite.com/2013/12/26/big-data-myths-reality">Big Data Myths Give Way To Reality In 2014</a></strong></p></blockquote><p>In a year that saw everyone from retailers like Target to healthcare facilities in Texas embrace these questions, one could say Big Data really came into its own in 2013.</p><p>So let’s look back at some of the key trends that emerged in the Big Data ecosystem.</p><h2>Entering Mainstream Thinking</h2><p>When Jake Porway, a data scientist from the National Geographic Society, compared data to “a bucket of crude oil” last year, executives around the globe collectively nodded in agreement.</p><p>This year, the aforementioned NSA’s probing into phone records and customer databases at technology companies like Apple and Google caused enough of a stir that other tech companies began revisiting
their own privacy procedures.</p><p>In most ways, however, Big Data is helping humanity. The World Health Organization’s Global Burden of Disease project searches vast data sets from 21 countries to keep track of diseases and avoid pandemics. Environmental organization Conservation International announced its plan this year to gather data from thousands of sensors and databases to help scientists take proactive steps against ecological threats.</p><p>For good or ill, the concept of Big Data is strong enough that more companies are preparing to spend time and money on Big Data analysis tools. Some 64 percent of organizations surveyed by researchers at Gartner invested in Big Data technology this year. Thousands of executives at major organizations like Bank of America, HSBC, AIG and even the Federal Communications Commission now identify themselves as Chief Data Officers. That’s a title you just didn’t see on the corporate ladder a few years ago.</p><h2>Hadoop, The Elephant In The Room</h2><div tml-image="ci01b2791260016d19" tml-render-position="right" tml-render-size="medium"><figure><img src="http://a4.files.readwrite.com/image/upload/c_fill,cs_srgb,dpr_1.0,q_80,w_620/MTIyMjkzMjU4NjQ1NzAxMjIy.jpg" /></figure></div><p>The beauty of Big Data is less about the data gathering and more about making sense of it all. This year, we saw improvements to the software systems that power Big Data projects and increases in training for people who use Big Data tools.</p><p>The Apache Software Foundation’s software project for processing Big Data collections, known as Hadoop, continued to capture the imaginations of data scientists everywhere. While there are many off-the-shelf software products that can collect and sort data, Hadoop, NoSQL databases and the other leading Big Data technologies are open source.
That means they can be customized.</p><p>This year saw the release of Hadoop 2.2, which increased the number of search tools that can be used, added the capability to build apps right inside the Hadoop framework instead of bolting them on, and brought official support for Hadoop on Microsoft Windows.</p><p>The software community also seems to want to play well with others within the Hadoop framework. IBM, Cloudera, Netflix and LinkedIn are but a few of the companies contributing code to the greater good of cracking the Big Data nut.</p><p>Other tools for building Big Data software are getting the royal treatment as well. In addition to Hadoop, programmers are incorporating MapReduce, Beyond Java, Scala, and C/C++ into various platforms.</p><h2>Show Me The Money</h2><p>Some of the biggest news around Big Data was the amount of money thrown at companies that can analyze bits and pieces of information.</p><p>An estimated $3.6 billion was injected into startups focused on Big Data. Young companies like Palantir, MongoDB, Hortonworks, Cloudera, DataStax and Mu Sigma all saw their pocketbooks lined with venture capital funding.</p><p>Acquisitions of Big Data companies were also popular. Google, Apple, Walmart, Facebook and IBM all added companies that specialize in Big Data analysis.</p><p>Financial investors also took a liking to Big Data in 2013. Splunk, Tableau, and Rocket Fuel all transitioned from private to public companies this year. More companies in this category are expected to follow suit in 2014.</p><h2>So Much More To Learn</h2><p>The one trend in Big Data that pretty much all analysts agree on is that we don’t know where this is headed.
Just understanding what Big Data is remains a challenge for a significant number of people, according to researchers at Gartner.</p><p>While many companies are looking at extracting information from their own data warehouses, the vast majority of businesses working on Big Data projects have yet to include external data sources in their analysis. For Big Data, 2013 was the year of experimentation and early deployment.</p><p>"Different industries have different priorities when it comes to big data,” Lisa Kart, research director at Gartner, said back in September. “Industries that are driving the customer experience priority are retail, insurance, media and communications, and banking, while process efficiency is a top priority for manufacturing, government, education, healthcare and transportation organizations."</p><p>As businesses and organizations look forward to 2014, there is no denying that Big Data will have a major impact on their decisions, just as it did in 2013.</p><p><em>Lead image courtesy of Flickr user <a href="http://www.flickr.com/photos/t_buchtele/">t_buchtele</a> via CC</em></p>
Big data was on the move in 2013 thanks to NSA snooping, improved tools and a need to connect the dots.
http://readwrite.com/2013/12/26/top-big-data-trends-of-2013
Web | Thu, 26 Dec 2013 09:02:00 -0800 | Michael Singer

<h1>Hadoop 2.0 Makes Big Data Even More Accessible</h1>
<p>It took a little longer than expected, but the Apache Software Foundation announced the general availability of Apache Hadoop 2.0 yesterday, which will ultimately be an elephant-sized step forward in how Hadoop is used for managing big data collections.</p><p>The biggest change in Apache Hadoop 2.2.0, the <a href="http://hadoop.apache.org/releases.html#15+October%2C+2013%3A+Release+2.2.0+available">first generally available version of the 2.x series</a>, is the update of the MapReduce framework to Apache YARN, also known as MapReduce 2.0. MapReduce is a big feature in Hadoop: the batch processor that lines up search jobs that go into the Hadoop distributed file system (HDFS) to pull out useful information. In the previous version of MapReduce, jobs could only be done one at a time, in batches, because that's how the Java-based MapReduce tool worked.</p><blockquote tml-render-position="right" tml-render-size="medium"><p><strong>See also: <a href="http://readwrite.com/2013/05/23/hadoop-what-it-is-and-how-it-works">Hadoop: What It Is And How It Works</a></strong></p></blockquote><p>With this update, MapReduce 2.0 enables multiple search tools to hit the data within the HDFS storage system at the same time.</p><div tml-image="ci01b2816380058266" tml-image-caption="The new YARN/MapReduce 2.0 architecture."
tml-render-position="right" tml-render-size="medium"><figure><img src="http://a2.files.readwrite.com/image/upload/c_fill,cs_srgb,w_620/MTIyMzAyNDAzMTY3OTM5ODY1.png" /><figcaption>The new YARN/MapReduce 2.0 architecture.</figcaption></figure></div><p>What YARN does is divide the functionality of MapReduce even further, breaking the two major responsibilities of the MapReduce JobTracker component (resource management and job scheduling/monitoring) into separate entities: a global ResourceManager and a per-application ApplicationMaster.</p><p>Splitting up these functions provides a more powerful way to manage a Hadoop cluster's resources than the current MapReduce systems can offer. It manages resources much the way an operating system handles jobs, which means no more one-at-a-time limitation.</p><p>With MapReduce 2.0, developers can now build apps directly within Hadoop, instead of bolting them on from the outside, as many third-party vendor tools have had to do in Hadoop 1.0. This essentially establishes Hadoop 2.0 as a platform on which developers can create applications that search for and manipulate data far more efficiently.</p><p>While YARN is the biggest change in the new version of Hadoop, there are some nice changes on the HDFS side, too, including high availability for HDFS, HDFS snapshots, and support for the NFSv3 filesystem to access data in HDFS, if need be.</p><p>Also, Hadoop 2.2 is now officially supported on Microsoft Windows, which will no doubt stir up interest from companies committed to Microsoft-only platforms.</p><p>There will no doubt be growing pains with Hadoop as companies migrate to the new release, but the fundamental changes to the MapReduce framework will mean even more usefulness for Hadoop in big-data scenarios moving forward.
Expect a lot of new tools that capitalize on the new capabilities in YARN, and soon.</p><p><em>YARN image courtesy of Hortonworks.</em></p>
The new major-version release of Apache Hadoop will enable apps to directly connect to stored data.
http://readwrite.com/2013/10/16/hadoop-2-yarn-mapreduce-2-big-data-more-accessible
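The ResourceManager/ApplicationMaster split described in the article can be sketched as a toy scheduler. This is a conceptual illustration in plain Python, not Hadoop code: the class names mirror YARN's components, but the allocation logic is invented for the example. The point is that a global arbiter hands out containers from a shared pool while each application tracks its own progress, so several jobs advance in the same scheduling rounds rather than queueing behind one JobTracker.

```python
class ResourceManager:
    """Global arbiter: hands out containers (resource slots) from a shared pool."""
    def __init__(self, total_containers):
        self.available = total_containers

    def allocate(self, n):
        granted = min(n, self.available)
        self.available -= granted
        return granted

    def release(self, n):
        self.available += n


class ApplicationMaster:
    """Per-application: negotiates resources and tracks its own job's progress."""
    def __init__(self, name, tasks):
        self.name = name
        self.remaining = tasks

    def run_round(self, rm, want):
        granted = rm.allocate(want)
        done = min(granted, self.remaining)  # each container finishes one task per round
        self.remaining -= done
        rm.release(granted)                  # hand containers back for the next round
        return done


# Unlike MapReduce 1's single JobTracker queue, several applications can
# hold containers and make progress in the same scheduling rounds.
rm = ResourceManager(total_containers=4)
apps = [ApplicationMaster("etl", tasks=3), ApplicationMaster("query", tasks=2)]

rounds = 0
while any(a.remaining for a in apps):
    for a in apps:
        a.run_round(rm, want=2)
    rounds += 1

print(rounds)  # prints 2: both apps finished in two interleaved rounds
```

The numbers here (four containers, two containers requested per round) are arbitrary; the design point is only the separation of concerns, with resource arbitration in one place and job bookkeeping in another.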
Work | Wed, 16 Oct 2013 07:34:30 -0700 | Brian Proffitt

<h1>Making Do With Google's Leftovers</h1>
<div tml-image="ci01b2823960016d19" tml-render-position="center" tml-render-size="large"><figure><img src="http://a4.files.readwrite.com/image/upload/c_fill,cs_srgb,dpr_1.0,q_80,w_620/MTIyMzAzMzIxMjE3MzMwNDU3.jpg" /></figure></div><p>It's perhaps one of the industry's great ironies that today's hottest enterprise technology is yesterday's leftovers at Google. Hadoop, an open-source implementation of Google's MapReduce technology, is all the rage in the enterprise as a primary tool for tackling Big Data, and it probably will remain so for years to come.</p><p>But at Google, MapReduce may already be too slow and not nearly scalable enough.</p><p>This isn't news. Mike Miller, CEO of Cloudant, <a href="http://gigaom.com/2012/07/07/why-the-days-are-numbered-for-hadoop-as-we-know-it">made this point</a> in 2012, and Bill McColl, CEO of Cloudscale, <a href="http://www.nytimes.com/external/gigaom/2010/10/23/23gigaom-beyond-hadoop-next-generation-big-data-architectu-81730.html">made it</a> two years before that. As McColl argued in 2010, "the people who really do have cutting edge performance and scalability requirements today have already moved on from the Hadoop model."</p><p>Which is another way of saying Google lives in the future.</p><p>I've <a href="http://readwrite.com/2013/01/04/html5-not-linux-key-to-ubuntus-quixotic-mobile-war">told the story before</a> about a wealthy friend telling me his money lets him "see into the future a few years" by affording expensive things today that will be cheap for everyone in the future.
In a similar fashion, Google, <a href="http://readwrite.com/2013/01/07/trickle-down-web-innovation-breathing-new-life-into-enterprise-it">not to mention other web giants like Facebook and Twitter</a>, is building things today to solve problems of scale and data processing that will likely be commonplace for mainstream enterprises tomorrow.</p><p>Today, Google's data and scale problems are almost magical. Tomorrow, they will likely be average.</p><p>That may mean that peering into the future, whether you're an entrepreneur or a venture capitalist, is as simple as watching Google. While <a href="https://developers.facebook.com/opensource/">Facebook releases</a> much of its code as open source, the place to gaze into Google's soul is its treasure trove of <a href="http://research.google.com/pubs/papers.html">published research</a>. There you'll find "Efficient spatial sampling of large geographical tables" and more information on "Spanner: Google's Globally-Distributed Database."</p><p>You will see, in other words, the future of enterprise computing, otherwise known as Google's leftovers.</p><p><em>Image courtesy of <a href="http://www.shutterstock.com/gallery-987p1.html?cr=00&amp;pl=edit-00">AHMAD FAIZAL YAHYA</a> / <a href="http://www.shutterstock.com/?cr=00&amp;pl=edit-00">Shutterstock</a>.</em></p>
Some of the industry's hottest enterprise technology is old news at Google (and other web giants), and likely will remain such for years to come. We are living with yesterday's Google leftovers.
http://readwrite.com/2013/04/10/making-do-with-googles-leftovers
Work | Wed, 10 Apr 2013 07:07:44 -0700 | Matt Asay