Svedlund Nordström, Johan

Abstract [en]

The immense growth of the web has led to the age of Big Data. Companies like Google, Yahoo and Facebook generate massive amounts of data every day. To gain value from this data, it must be stored and processed effectively. Hadoop, a Big Data framework, can store and process Big Data in a scalable and performant fashion. Both Yahoo and Facebook, two major IT companies, deploy Hadoop as their solution to the Big Data problem. Many application areas for Big Data would benefit from the ability to share datasets across cluster boundaries. However, Hadoop does not support searching for datasets, either within a single Hadoop cluster or across many Hadoop clusters. Similarly, there is only limited support for copying datasets between Hadoop clusters (using DistCp). This project presents a solution to this weakness using the Hadoop distribution Hops and its frontend, Hopsworks. Clusters advertise their peer-to-peer and search endpoints to a central server called Hops-Site. The advertised endpoints build a global Hadoop ecosystem and give clusters the ability to participate in public search or peer-to-peer sharing of datasets. Hopsworks users are given the choice to write data into Kafka as it is being downloaded. This opens up new possibilities for data scientists, who can interactively analyse remote datasets without having to download everything in advance. Data written into Kafka during the download can be consumed by entities like Spark Streaming or Flink.