Ofir's random thoughts of data technologies

Main menu

Post navigation

In this post I’ll explore how the HDFS default values really work, which I found to be quite surprising and non-intuitive, so there s a good lesson here.

In my case, I have my local virtual Hadoop cluster, and I have a different client outside the cluster. I’ll just put a file from the external client into the cluster in two configurations.

All the nodes of my Hadoop (1.2.1) cluster have the same hdfs-site.xml file with the same (non-default) value for dfs.block.size (renamed to dfs.blocksize in Hadoop 2.x) of 134217728, which is 128MB. In my external node, I also have the hadoop executables with a minimal hdfs-site.xml.

First, I have set dfs.block.size to 268435456 (256MB) in my client hdfs-site.xml and copied a 400MB file to HDFS:

Now, let’s remove the reference for dfs.block.size from my hdfs-site.xml file on the client. I expected that, since we don’t specify a value in our client, it should not ask for any specific block size, so the actual block size will be the value set in the NameNode and DataNodes – 128MB. However, this is what I got:

Well, the file definitely has a block size of 64MB, not 128MB! This is actually the default value for dfs.block.size in Hadoop 1.x.

It seems that the value of dfs.block.size is dictated directly by the client, regarding of the cluster setting. If a value is not specified, the client just picks the default value. This finding is not specific to this parameter – for example, the same thing happens with dfs.replication and others….

So, if you ever wondered why you must have full copy of the Hadoop config files around instead of just pointing to the NameNode and JobTracker, now you know…
Now double check your client setup!

Corrections and comments are welcomed!

P.S.Just as I was publishing it, I found a shorter way to check the block size of a file, that also works from any client:

When a file is written to HDFS, it is split up into big chucks called data blocks, whose size is controlled by the parameter dfs.block.size in the config file hdfs-site.xml (in my case – left as the default which is 64MB). Each block is stored on one or more nodes, controlled by the parameter dfs.replication in the same file (in most of this post – set to 3, which is the default). Each copy of a block is called a replica.

Regarding my setup – I have nine nodes spread over three racks. Each rack has one admin node and two worker nodes (running DataNode and TaskTracker), using Apache Hadoop 1.2.1. Please note that in my configuration, there is a simple mapping between host name, IP address and rack number -host name is hadoop[rack number][node number] and IP address is 10.0.1.1[rack number][node number].

Where should the NameNode place each replica? Here is the theory:

The default block placement policy is as follows:

Place the first replica somewhere – either a random node (if the HDFS client is outside the Hadoop/DataNode cluster) or on the local node (if the HDFS client is running on a node inside the cluster).

Place the second replica in a different rack.

Place the third replica in the same rack as the second replica

If there are more replicas – spread them across the rest of the racks.

That placement algorithm is a decent default compromise – it provides protection against a rack failure while somewhat minimizing inter-rack traffic.

So, let’s put some files on HDFS and see if it behaves like the theory in my setup – try it on yours to validate.

Putting a file into HDFS from outside the Hadoop cluster

As my block size is 64MB, I created a file (called /sw/big) on my laptop for testing. The file is about 125MB – so it consumes two HDFS blocks. It is not inside one of the data node containers but outside the Hadoop cluster. I then copied it to HDFS twice:

Conclusion – we can see that when the HDFS client is outside the Hadoop cluster, for each block, there is a single replica on a random rack and two other replicas on a different, random rack, as expected.

Putting a file into HDFS from within the Hadoop cluster

OK, let’s repeat the exercise, now copying the big file from a local path on one of the data nodes (I used hadoop22) to HDFS:

Conclusion – we can see that when the HDFS client is inside the Hadoop cluster for each block, there is a single replica on the local node (hadoop22 in this case) and two other replicas on a different,random rack, as expected.

Just for fun – create a file with more than three replicas

We can override Hadoop parameters when invoking the shell using -D option. Let’s use it to create a file with four replicas from node hadoop22:

In this post I’ll cover how to define and debug a simple Hadoop network topology.

Background

Hadoop is designed to run on large clusters of commodity servers – in many cases spanning many physical racks of servers. A physical rack is in many cases a single point of failure (for example, having typically a single switch for lower cost), so HDFS tries to place block replicas on more than one rack. Also, there is typically more bandwidth within a rack than between the racks, so the software on the cluser (HDFS and MapReduce / YARN) can take it into account. This leads to a question:

How does the NameNode know the network topology?

By default, the NameNode has no idea which node is in which rack. It therefore by default assumes that all nodes are in the same rack, which is likely true for small clusters. It calls this rack “/default-rack“.

So, we have to teach Hadoop our cluster network topology – the way the nodes are grouped into racks. Hadoop supports a pluggable rack topology implementation – controlled by the parameter topology.node.switch.mapping.impl in core-site.xml, which specifies a java class implementation. The default implementation is using a user-provided script, specified in topology.script.file.name in the same config file, a script that gets a list of IP addresses or host names and returns a list of rack names (in Hadoop2: net.topology.script.file.name). The script will get up to topology.script.number.args parameters per invocation, by default up to 100 requests per invocation (in Hadoop2: net.topology.script.number.args).

In my case, I set it to /usr/local/hadoop-1.2.1/conf/topology.script.sh which I copied from the Hadoop wiki. I just made a few changes – I changed the path to my conf directory in the second line, added some logging of the call in the third line and changed the default rack name near the end to /rack01:

A bit long, but pretty straight forward. Please note that in my configuration, there is a simple mapping between host name, IP address and rack number – the host name is hadoop[rack number][node number] and IP address is 10.0.1.1[rack number][node number].

You could of course write any logic into the script. For example, using my naming convention, I could have written a simple script that just takes the second-to-last character and translate it to rack number – that would work on both IP addresses and host names. As another example – I know some companies allocate IP addresses for their Hadoop cluster as x.y.[rack_number].[node_number] – so they can again just parse it directly.

Before we test the script, if you have read my previous post on LXC setup, please note that I made a minor change – I switched to a three-rack cluster, to make the block placement more interesting. So, my LXC nodes are:

Why did the NameNode send all the IP addresses of the DataNodes in a single call? In this case, I have pre-configured another HDFS parameter called dfs.hosts in hdfs-site.xml. This parameter points to a file with a list of all nodes that are allowed to run a data node. So, when the NameNode started, it just asked for the mapping of all known data node servers.

What happens if you don’t use dfs.hosts? To check, I removed this parameter from my hdfs-site.xml file, restarted the NameNode and started all the data nodes. In this case, the NameNode called the topology script once per DataNode (when they first reported their status to the NameNode):

I hope this post has enough data for you to hack and QA your own basic network topologies. In the next post I hope to investigate block placement – how to see where HDFS is actually putting each copy of each block under various conditions.

TL;DR Why and how I created a working 9-node Hadoop Cluster on my laptop

In this post I’ll cover why I wanted to have a decent multi-node Hadoop cluster on my laptop, why I chose not to use virtualbox/VMware player, what is LXC (Linux Containers) and how did I set it up. The last part is a bit specific to my desktop O/S (Ubuntu 13.10).

Why install a fully-distributed Hadoop cluster on my laptop?

Hadoop has a “laptop mode” called pseudo-distributed mode. In that mode, you run a single copy of each service (for example, a single HDFS namenode and a single HDFS datanode), all listening under localhost.

This feature is rather useful for basic development and testing – if all you want is write some code that uses the services and check if it runs correctly (or at all). However, if your focus is the platform itself – an admin perspective, than that mode is quite limited. Trying out many of Hadoop’s feature requires a real cluster – for example replication, failure testing, rack awareness, HA (of every service types) etc.

Why VirtualBox or VMware Player are not a good fit?

When you want to run several / many virtual machines, the overhead of these desktop virtualization sotware is big. Specifically, as they run a full-blown guest O/S inside them, they:

Consume plenty of RAM per guest – always a premium on laptops.

Consume plenty of disk space – which is limited, especially on SSD.

Take quite a while to start / stop

To run many guests on a laptop, a more lightweight infrastructure is needed.

What is LXC (Linux Containers)

LXC is a lightweight virtualization solution for Linux, sometimes called “chroot on steroids”. It leverages a set of Linux kernel features (control groups, kernel namespaces etc) to provide isolation between groups of processes running on the host. In other words, in most cases the containers (virtual machines) are using the host kernel and operating system, but are isolated as if they are running on their own environent (including their own IP addresses etc).

In practical terms, I found that the LXC containers start in a sub-second. They have no visible memory overhead – for example, if a container runs the namenode, only the namenode consumes RAM, based on its java settings. They also consume very little disk space, and their file system is just stored directly in the host file system (under /var/lib/lxc/container_name/rootfs), which is very useful. You don’t need to statically decide how much RAM or virtual cores to allocate each (though you could) – they consume just what they need. Actually, while the guests are isolated from each other, you can see all their processes together in the host using the ps. command as root..

Some other notes – LXC is already used by Apache Mesos and Dockers, amongst others. A “1.0” release is targeted next month (February). If you are using RHEL/CentOS, LXC is provided as a “technical preview” on RHEL 6 (likely best to use a current update).

If you want a detailed introduction, Stéphane Graber which is one of the LXC maintainer started an excellent ten-post series on LXC a couple of weeks ago.

So, how to set it up?

The following are some of my notes. I only tried it on my laptop, with Ubuntu 13.10 64-bit as a guest, so if you have a different O/S flavor, expect some changes.

All commands below are as root. start by installing LXC:

apt-get install lxc

The installation also creates a default virtual network bridge device for all containers (lxcbr0). Next, create your first Ubuntu container called test1 – it will take a few minutes as some O/S packages will be downloaded (but they will be cached, so the next containers will be created in a few seconds):

lxc-create -t ubuntu -n test1

Start the container with:

lxc-start -d -n test1

List all your containers (and their IP addreses) with:

lxc-ls --fancy

and open a shell on it with:

lxc-attach -n test1

stop the container running on your host:

lxc-stop -n test1

You can also clone a container (lxc-clone) or freeze/unfreeze a running one (lxc-freeze / lxc-unfreeze). LXC also has a web GUI with some nice visualizations (though some of it was broken for me), install it with:

Now, some specific details for an Hadoop container

1. Install a JVM inside the container. The default LXC Ubuntu container has a minimal set of packages . At first, I installed openjdk in it:

apt-get install openjdk-7-jdk

However, I decided to switched to Oracle JDK as that is what’s typically being used. That required some extra steps (the first step adds the “add-apt-repository” command, the others set a java7 repository to ease installation and updates):

By default, LXC uses DHCP to provide dynamic IP addresses to containers. As Hadoop is very sensitive to IP addresses (it occasionally uses them instead of host names), I decided that I want static IPs for my Hadoop containers.

In Ubuntu, the lxc bridge config file is /etc/default/lxc-net – I uncommented the following line:

That should cover more containers than I’ll ever need… My naming convention is that the host name is hadoop[rack number][node number] and the IP is 10.0.1.1[rack number][hostname]. For example, host hadoop13 is located at rack 1 node 3 with IP address of 10.0.113. This way it is easy to figure out what’s going on even if the logs only have IP.

I’ve also added all of the potential containers to my /etc/hosts and copied them to hadoop11, which I use as a template to clone the rest of the nodes:

I ran into one piece of ugliness regarding DHCP. I use hadoop11 container as a template, so I as I develop my environment I have destroyed the rest of the containers and cloned them again a few times. At that point, they got a new dynamic IP – it seems the DHCP server remembered that the static IP that I wanted was already allocated. Eventually, I added this to the end of my “destroy all nodes” script to solve that:

If you run into networking issues – I recommend checking the Ubuntu-specific docs, as lots of what you find on the internet is scary network hacks that are likely not needed in Ubuntu 13.10 (and didn’t work for me).

Another setup thing – if you run on BTRFS (a relatively new copy-on-write filesystem), the lxc-clone should detect it and leverage it. It doesn’t seem to work for me, but instead copying files directly into /var/lib/lxc/container_name/rootfs using cp –reflink works as expected (you can verify disk space after copying by waiting a few seconds and then issuing btrfs file df /var/lib/lxc/ )

3. Setting Hadoop

This is pretty standard… You should just set up ssh keys, maybe create some O/S users for Hadoop, download Hadoop bits from Apache or a vendor and get going… Personally, I used Apache Hadoop (1.2.1 and 2.2.0) – I have set things up on a single node and then clone it to create a cluster.

In addition, to ease my tests, I’ve been scripting the startup, shutdown and full reset of the environment based on a config file that I created. For example, here is my current config file for Apache Hadoop 1.2.1 (the scripting works nicely but is still evolving – ping me if you are curious):

Anyway, Rob Klopp wrote an interesting post titled “Specialized Databases vs. Swiss Army Knives“. In it, he argues with Stonebraker’s claim that the database market will split to three-to-six categories of databases, each with its own players. Rob counter claims by saying that data is typically used in several ways, so it is cumbersome to have several specialized databases instead of one decent one (like Hana of course…).

I have a somewhat different perspective. In the past, let’s say ten years ago, I was sure that Oracle database is the right thing to throw at any database challenge, and I think many in the industry shared that feeling (each with his/her favorite database, of course). There was a belief that a single database engine could be smart enough, flexible enough, powerful enough to handle almost everything.

That belief is now history. As I will show, it is now well understood and acknowledged by all the major vendors that a general-purpose database engine just can’t compete in all high-end niches. HOWEVER, the existing vendors are, as always, adapting. All of them are extending their databases to offer multiple database engines inside their product, each for a different use case.

The leader here seems to be Microsoft SQL Server. SQL Server 2014 (cuirrently at CTP2) comes with three separate database engines. In addition to the existing engine, they introduced Hekaton – an in-memory OLTP engine that looks very promising. They also delivered a brand new implementation of their columnar format – now called clustered columnstore index – which is now fully updatable and is actually not an index – it is a primary table storage format with all the usual plumbing (delta trees with a tuple mover process when enough rows have accumulated).

Oracle has been promoting two engines for a long while – TimesTen for low-latency, intensive OLTP (in-memory of course) ,and the Oracle database. Within the Oracle database, they are planning to introduce a second engine for data warehousing in the next release (with an in-memory columnar format) – which I discussed in the past. And IBM DB2 have already introduced an in-memory columnar engine called BLU Acceleration six months ago.

Now, why do I call them engines? It seems to me all these are major, intrusive features, with whole new on-disk formats, memory representations, processing logic and code paths, well beyond a new index type or similar change. In addition, some come with different or no locking/latching and other dramatic internal changes.

So, to wrap it up, while I believe that specialized engines is a new must, the existing players are working to implement multiple engines within their products to keep them relevant. As for having a single decent database for everything – if it does support the required high performance and scalability, with low cost and rich enough functionality (HA, security, resource management, great optimizer etc)… There is always place for a single database for different workloads, if and when it delivers.

Earlier today a witty comment I made on twitter led to a long and heated discussion about vendor exaggerations regarding SQL-on-Hadoop relative performance. It started as a link to this post by Hyeong-jun Kim, CTO and Chief Architect at Gruter. That post discusses some ways that vendor exaggerate and suggests to verify these claims with your own data and queries.

Anyway, Milind Bhandarkar from Pivotal suggested that joining an industry standard benchmarking effort might be the right think. I disagree, wanted to elaborate on that:

Industry Standard SQL-on-Hadoop benchmark won’t improve a thing

These benchmarks (at least in the SQL space) don’t help users to pick a technology. I’ve never heard of a customer who picked some solution because it lead the performance or price/performance list of TPC-C or TPC-H.
Customers will always benchmark on their data and their workload…

…And they do so because in this space, many of the small variations in data (ex: data distribution within a column) or workload (SQL features, concurrency, transaction type mix) will have dramatic impact on the results.

So, the vendors who write the benchmark will fight to death to have the benchmark highlight their (exisitng) strengths. If the draft benchmark will show them at the bottom, they’ll retire and bad mouth the benchmark for not incorporating their suggestions.

Of course, the close-source players still won’t allow publishing benchmark results – the “DeWitt Clause” (right Milind? What about HAWK?)

And even with a standard, all vendors will still use micro-benchmarks to highlight the full extent of new features and optimizations, so the rebuttals and flame wars will not magically end.

What I think is the right thing for users

Since vendors (and users) will not agree on a single benchmark, the next best thing is that each player will develop their own benchmark(s), but will make it easily reproducible.

For example, share the dataset, the SQLs, the specific pre-processing if any, the specific non-default config options if any, the exact SW versions involved, the exact HW config and hopefully – provide the script that runs the whole benchmark.

This would allow the community and ecosystem – users, prospects and vendors – to:

Quickly reproduce the results (maybe on a somewhat different config).

Play with different variations to learn how stable are the conclusions (different file formats, data set size, SQL and parameter variations any many more).

Share their result in the open, using the same disclosure, for everyone to learn from it or respond to it.

Short version – community over committee.

Why that also won’t happen

Sharing a detailed, easily reproducible report is great for smart users who want to educate themselves and choose the right product.

However, there is nearly zero incentive for vendors and projects to share it, especially for the leaders. Why? Because they are terrified that a competitor will use it to show that he is way faster… That could be the ultimate marketing fail (a small niche player afford it, since they have little to lose).

Some other reasonable excuses – there could be a dependency on internal testing frameworks, scripts or non-public datasets, not enough resources to clean up and document each micro-benchmark or to follow up with everyone etc.

Also, such disclosure may prevent marketing / management from highlighting some results out of context (or add wishful thinking)… Not sure many are willing to report to their board – “sales are down since we told everyone that our product is not yet good enough”.
For example – I haven’t yet seen a player claiming “Great Performance! We are now 0.7x faster than the current performance leader!”. Somehow you’ll need to claim that you are great, even if for some a specific scenario under specific (undisclosed?) constraints.

Summary

You can’t really optimize picking the right tool by building a universal performance benchmark (and picking is not only about performance).

Human interests (commercial and others) will generate friction as all compete for some mindshare. Social pressure might sometimes help to make competitors support civilized technical discussion, but only rarely.

when sharing performance numbers, please be nice – share what you can and don’t mislead.

As a user, try to learn and verify before making big commitments, and be prepare to screw up occasionally…

Last month I attended AWS Summit in Tel-Aviv. It was a great event, one of the largest local tech events I’ve seen. Anyway, a session on Redshift got me thinking.

When Amazon Redshift was announced (Nov 2012), I was working at Greenplum. I remember that while Redshift pricing was impressive, I mostly dismissed it at that time due to some perceived gaps in functionality and being in a “limited beta” status. Back to last month’s session, I came to see what’s new and was taken by surprise by the tone of the presentation.

As a database expert, I thought I would hear about the implementation details – the sharding mechanism, the join strategies, the various types of columnar encoding and compression, the pipelined data flow, the failover design etc… While a few of these were briefly mentioned, they were definitely not the focus. Instead, the main focus was the value of getting your data warehouse in a PaaS platform.

So what is the value? It is a combination of several things. Lower cost is of course a big one – it allows smaller organization to have a decent data warehouse. But it is much more than cost – being a PaaS means you can immediately start working without worrying about everything from a complex, time-consuming POC to all the infrastructure work for successful production-grade deployment. Just provision a production cluster for a day or two, try it out on your data and queries, and if you like the performance (and cost) – simply continue to use it. It allows a very safe play – or as they put it “low cost of failure“. In addition, a PaaS environment can handle all the ugly infrastructure tasks around MPP databases or appliances that typically block adoption – backups, DR, connectivity etc.
Let me elaborate on these points:

I guess anyone who did a DW POC can relate to the following painful process. You have to book the POC many weeks in advance, pick from several uneasy options (remote POC, fly to vendor central site or coordinate a delivery to your data center), pick in advance a small subset of data and queries to test, don’t use your native BI/ETL tools, rely on the vendor experts to run and tune the POC, finish with a lot of open questions as you ran out of time, pick something, pay a lot of money, wait a couple of months for delivery, start migration / testing / re-tuning (or even DW design and implementation) and several months later you will likely realize how good (or bad) your choice really was… This is always a very high risk maneuver, and the ability to provision a production environment, try it out and if you like it, just keep it without a huge upfront commitment is a very refreshing concept.

Same goes for many “annoying” infrastructure bits. For example – backup, recovery and DR. Typically these will not be tested in a real-world configuration as part of a POC due to various constraints, and the real tradeoff between feasible options in cost, RPO and RTO will not seriously evaluated or in many cases not even considered. Having a working backup included with Redshift – meaning, the backup functionality with underlying infrastructure and storage – is again huge. Another similar one is patching – it is nice that it’s Amazon’s problem.

Last of these is scaling. With an appliance, scaling out is painful. You order an (expensive) upgrade, wait a couple of month for it to be install, then try to roll it out (which likely involves many hours of storage re-balancing), then pray it works. Obviously you can’t scale down to reduce costs. With Redshift, they provision a second cluster (while your production is switched to read-only mode), and switch you over to the new one “likely within an hour”. Of course, with Redshift you can scale up or down on demand, or even turn off the cluster when not needed (but there are some pricing gotchas as there is a strong pricing bias to a 3-year reserved instances).

If you’re interested, you can find Guy Ernest’s presentation here .During his presentation, most slides were hidden as he and Intel had only 45 minutes – I see now that the full slide deck actually does have some slides full of implementation details 🙂

BTW – another thing that I was curious about was – can the Amazon team significantly evolve the Redshift code base given its external origin(ParAccel)? I assumed that it is not a trivial task. Well, I just read last week that AWS are releasing a big functionality upgrade or Redshift (plus some more bits), so I think that one is also off the table.

It would be interesting to see if Redshift would now gain more traction, especially as it got more integrated into the AWS offering and workflow.

Categories

EMC Employee Disclaimer

The opinions and interests expressed on EMC employee blogs are the employees’ own and do not necessarily represent EMC’s positions, strategies or views. EMC makes no representation or warranties about employee blogs or the accuracy or reliability of such blogs. When you access employee blogs, even though they may contain the EMC logo and content regarding EMC products and services, employee blogs are independent of EMC and EMC does not control their content or operation. In addition, a link to a blog does not mean that EMC endorses that blog or has responsibility for its content or use.