Overview

Applications such as web search and social networking have been moving from centralized to decentralized cloud architectures to improve their scalability. MapReduce, a programming framework for processing large amounts of data using thousands of machines in a single cloud, also needs to be scaled out to multiple clouds to adapt to this evolution. Building a multi-cloud distributed architecture is in itself a substantial challenge. Dealing with the new types of faults introduced by such a setting, such as the outage of a whole datacenter or an arbitrary fault caused by a malicious cloud insider, increases the endeavor considerably.

Medusa is a platform that allows MapReduce computations to scale out to multiple clouds and tolerate several types of faults.

Our solution fulfills four objectives: First, it is transparent to the user, who writes her typical MapReduce application without modification. Second, it does not require any modification to the widely used Hadoop framework. Third, the proposed system goes well beyond the fault-tolerance offered by MapReduce to tolerate arbitrary faults, cloud outages, and even malicious faults caused by corrupt cloud insiders. Fourth, it achieves this increased level of fault tolerance at reasonable cost.

In the following sections I present my notes on installing and configuring the platform.

Installation of the Proxy

The emulab-install.sh script installs all the required packages on a clean OS: Hadoop MapReduce, Python 2.7, and Java.

Creating python environment

This tool controls the execution of MapReduce jobs across several clusters.
We tested the proxy with Python 2.7 and Fabric 1.6.0.
It is preferable to install it in an isolated Python environment using virtualenv.

Here is an example of creating a virtualenv environment:

virtualenv ENV (e.g., $ virtualenv python-2.7)
source ENV/bin/activate

All the clusters must be connected to a RabbitMQ server.
All the clusters must know each other, and each cluster's public key must be added to the others' authorized_keys to allow passwordless SSH connections (the clusters I am using already know each other).
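The passwordless SSH setup can be sketched as follows (the remote host name is hypothetical; as noted above, the clusters used here are already configured):

```shell
# Create ~/.ssh and a passphrase-less RSA key pair if one does not exist yet.
mkdir -p "$HOME/.ssh"
[ -f "$HOME/.ssh/id_rsa" ] || ssh-keygen -t rsa -N "" -f "$HOME/.ssh/id_rsa" -q

# Copy the public key to every other cluster (hypothetical host name):
# ssh-copy-id ubuntu@other-cluster
```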

You can install the Python packages listed in pip-plugins.sh.

Hadoop MapReduce

To run Hadoop MR, you need to configure it first.
If you need help installing MapReduce, you can follow this link 1.
I have also set some environment variables in .bashrc.
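As a sketch, the .bashrc entries might look like this (the installation paths are assumptions; adjust them to your machine):

```shell
# Hypothetical installation paths for Java and Hadoop.
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop

# Make the Hadoop binaries available on the PATH.
export PATH="$PATH:$HADOOP_HOME/bin"
```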

3. How to configure RabbitMQ in Amazon EC2?

DNS translates names into IP addresses. 0.0.0.0 means "listen on all network interfaces", so whatever IP address the public DNS name resolves to is used. In fact, there is usually only one IP address per instance. Because an EC2 instance has both a private and a public DNS name, the queues/channels end up on the private DNS name, even though the host is configured with the public one.
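You can check what a name resolves to from the instance itself; a minimal sketch (localhost is used here as a deterministic example; on EC2 you would pass the instance's public DNS name instead):

```shell
# Print the IPv4 addresses a name resolves to via the local resolver.
# On an EC2 instance, passing its *public* DNS name here returns the
# *private* IP, which is why queues/channels end up on the private address.
getent ahostsv4 localhost
```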

If you get the error "User: ubuntu is not allowed to impersonate ubuntu", verify that core-site.xml contains the property below, then restart Hadoop. Depending on the Hadoop version, the matching hadoop.proxyuser.ubuntu.groups property may also be required.

<property>
  <name>hadoop.proxyuser.ubuntu.hosts</name>
  <value>*</value>
</property>
13/07/28 14:26:29 ERROR tools.DistCp: Exception encountered
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RemoteException): User: ubuntu is not allowed to impersonate ubuntu
at org.apache.hadoop.hdfs.web.JsonUtil.toRemoteException(JsonUtil.java:169)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.validateResponse(WebHdfsFileSystem.java:283)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.access$500(WebHdfsFileSystem.java:91)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$Runner.getResponse(WebHdfsFileSystem.java:549)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$Runner.run(WebHdfsFileSystem.java:470)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.run(WebHdfsFileSystem.java:403)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getHdfsFileStatus(WebHdfsFileSystem.java:570)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getFileStatus(WebHdfsFileSystem.java:581)
at org.apache.hadoop.fs.FileSystem.isFile(FileSystem.java:1371)
at org.apache.hadoop.tools.SimpleCopyListing.validatePaths(SimpleCopyListing.java:67)
at org.apache.hadoop.tools.CopyListing.buildListing(CopyListing.java:79)
at org.apache.hadoop.tools.GlobbedCopyListing.doBuildListing(GlobbedCopyListing.java:90)
at org.apache.hadoop.tools.CopyListing.buildListing(CopyListing.java:80)
at org.apache.hadoop.tools.DistCp.createInputFileListing(DistCp.java:326)
at org.apache.hadoop.tools.DistCp.execute(DistCp.java:151)
at org.apache.hadoop.tools.DistCp.run(DistCp.java:118)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.tools.DistCp.main(DistCp.java:374)

##### 5. How to set an internal IP address in ExoGeni?

In ExoGeni, the NEuca service sets the hostname to a loopback address. Thus, we cannot assign the internal IP to the hostname. See this link

By default, NEuca writes a loopback address for the hostname [1]. For me this is a problem because I would like to assign the IP address 172.16.100.1 to the hostname, but currently I can't. Is there a workaround to assign the 172.16.100.1 IP to the hostname?
[1]
$ cat /etc/hosts
127.0.0.1 localhost
127.0.1.1 ubuntu
### BEGIN NEuca loopback modifications - DO NOT EDIT BETWEEN THESE LINES. ###
127.255.255.1 NodeGroup0-0
### END NEuca loopback modifications - DO NOT EDIT BETWEEN THESE LINES. ###

Answer:

Take a look at the configuration file /etc/neuca/config; you will find a section that looks like this:

[runtime]
## Set the node name in /etc/hosts to a value in the loopback space.
## Value can be "true" or "false"
# set-loopback-hostname = true
## The address that should be added to /etc/hosts if "set-loopback-hostname" is "true"
## This address *must* be in the 127.0.0.0/8 space; any other value will result in an error.
# loopback-address = 127.255.255.1

You can disable setting the hostname to loopback by uncommenting set-loopback-hostname and changing its value to false.
After you have done so, restart the NEuca daemon: /etc/init.d/neuca restart
Afterward, you can edit the /etc/hosts file as you wish.
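The final edit can be sketched like this; the example works on a copy of the file so it is safe to run anywhere, using the NodeGroup0-0 entry and the 172.16.100.1 address from the question above (on the node itself you would edit /etc/hosts directly, after disabling set-loopback-hostname):

```shell
# Work on a copy of the hosts file containing the NEuca loopback entry.
printf '127.0.0.1 localhost\n127.255.255.1 NodeGroup0-0\n' > hosts.example

# Replace the NEuca loopback entry with the desired internal IP.
sed -i 's/^127\.255\.255\.1 NodeGroup0-0$/172.16.100.1 NodeGroup0-0/' hosts.example

cat hosts.example
```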