Pythian - Data Experts Blog » Gwen Shapira
Official Pythian Blog - Love Your Data

Thoughts on Intel's Hadoop Distribution
Tue, 04 Jun 2013

When I heard that Intel announced their own Hadoop distribution, my first thought was "Why would they do that?" This blog post is an attempt to explore why anyone would need their own Hadoop distribution, what Intel can gain by having their own, and who is likely to adopt Intel's distribution.

Why does anyone need a Hadoop distribution? Hadoop is open source, and it would make sense for Red Hat and Canonical to package Hadoop and add it to their own distributions – just like they do with MySQL and other open source applications. Instead, we have Cloudera, Hortonworks, MapR, EMC, Intel and probably many more, each with their own Hadoop distribution.

When you try to pick a Hadoop distribution, the first thing you'll notice is that each one has a slightly different set of components. Cloudera includes Flume and Sqoop, which Hortonworks doesn't. Hortonworks includes Ambari and a platform by Talend. Having a distribution gives companies a chance to define Hadoop. This matters a lot to new adopters, and especially to larger companies – they look at the distribution as an indication of which components are safe to use, and are reluctant to add components outside it. As an example, Oozie and Azkaban are similar tools that perform the same task: managing jobs in Hadoop. In my experience, Oozie is far more popular, not because it's a superior tool, but because it is part of the popular Cloudera distribution.

There's a reason Hadoop users prefer to use a distribution as a whole rather than mix and match toolchains: considering the many components in a Hadoop production system, matching versions to make sure all the tools work well together is a challenging task. Companies that release their own distribution pick the right versions, test a lot, and furiously patch to make sure all the components work as a whole. This is somewhat similar to the way Oracle will announce that 11g is supported on RHEL5 but not RHEL6, except much more so. Of course, Red Hat could do the same, as they do for all the software in their Linux distribution, but as you can see, they don't.

When users choose a well known distribution they don’t just get a well chosen and tested mix of components. They also get the option of purchasing support for this distribution. That’s the main benefit for companies selling their own Hadoop distribution: You go through all the trouble of picking components and testing them, so that you are well positioned to provide support for them. Other companies can of course sell support for the same distribution – Pythian will happily support any Hadoop distribution you choose. But the owner of the distribution has some advantage since it is much more difficult for 3rd party supporters to offer bug fixes in Hadoop code.

Of course, all this doesn’t apply to Intel, who shows no intention of selling support.

So why would Intel need their own distribution?

Let's start with the basics: Intel sells CPUs. That's their main line of business, but they also write software. For example, Intel's C compiler is first rate. I used to love working with it. Intel wrote their own compiler so that executables generated with it will always use the best Intel features. This means that popular software runs faster on Intel CPUs, because their performance features are used even when developers don't know about them (Oracle's optimizer attempts to do the same, with less success).

How does this apply to Hadoop? Clearly Intel noticed that Hadoop clusters tend to have lots of CPUs, and they are interested in making sure that these CPUs are always Intel – possibly by making sure that Hadoop runs faster on Intel CPUs.

“The Intel Distribution for Apache Hadoop software is a 100% open source software product that delivers Hardware enhanced performance and security (via features like Intel® AES-NI™ and SSE to accelerate encryption, decryption, and compression operation by up to 14 times).”

“With this distribution Intel is contributing to a number of open source projects relevant to big data such as enabling Hadoop and HDFS to fully utilize the advanced features of the Xeon™ processor, Intel SSD, and Intel 10GbE networking.”

“Intel is contributing enhancements to enable granular access control and demand driven replication in Apache HBase to enhance security and scalability, optimizations to Apache Hive to enable federated queries and reduce latency. ”

Intel is doing for Hadoop the same thing it did for C compilers – making sure it uses the best hardware enhancements available in Intel CPUs and other Intel hardware components. The nice thing is that the enhancements are available as open source – Intel doesn't care that the software is free, since they are selling the hardware!

Improved Hadoop security is at the top of the list of things the enterprise needs from Hadoop (http://tdwi.org/Blogs/Philip-Russom/2013/04/Hadoop-Functionality-that-Needs-Improvement.aspx). Mixing Intel's well-known encryption support on the CPU with the enterprise requirement for improved security is a very smart move in my book. I know that security is much more than just fast encryption, but if Intel can leverage their security brand to create a strong security model for Hadoop, it's a welcome effort. The security offerings are promising indeed – key management, unified and integrated access management, and possibly even replacing Kerberos with something better integrated? Sign me up, and from what I hear, my customers are ready to sign up too.

None of this has been officially released yet, and I didn't try to compile and run the code, so I can't say much about what is actually delivered. Perhaps someone did and can comment. But I did notice another interesting detail. The Project Rhino README lists all the Hadoop components that Intel intends to include in its unified and integrated security model:

Core: A set of shared libraries

HDFS: The Hadoop filesystem

MapReduce: Parallel computation framework

ZooKeeper: Configuration management and coordination

HBase: Column-oriented database on HDFS

Hive: Data warehouse on HDFS with SQL-like access

Pig: Higher-level programming language for Hadoop computations

Oozie: Orchestration and workflow management

Mahout: A library of machine learning and data mining algorithms

Flume: Collection and import of log and event data

Sqoop: Imports data from relational databases

Look familiar to anyone? That's because it's more or less identical to Cloudera's Hadoop distribution. Why did Intel choose to use CDH? Possibly because of its focus on the enterprise toolchain – those are the tools you'll need to build an ETL pipeline and a data-science practice on Hadoop. If Intel's unified solution doesn't include these tools, getting the enterprise adoption they are looking for will be a much bigger challenge. However, it does open new questions: Will Intel offer support for the distribution, or will they leave it to Cloudera, who already supports all the components? And can you have a "unified security solution" that leaves Hortonworks and MapR completely out of the plan?

It’s far too early to tell where this will all go, but so far Intel has made interesting decisions that make me look forward to the day when they have more to download than just a PDF. If you have any thoughts on where this is all going, I’d love to read your comments too.

Love Your MongoDB
Mon, 04 Feb 2013

Pythian now officially supports MongoDB both as On-Demand and Managed Services offerings.
We’ve been dipping our toes into the MongoDB pool for some time now, as more and more of our customers adopt MongoDB into their data infrastructure, but now it’s official.

Part of the decision to officially support MongoDB was driven by our desire to offer our customers full managed services for their entire data stack, which includes MongoDB now. But I’d like to believe that my own advocacy to expand our services was part of the decision, and I kept advocating this because MongoDB is the perfect database to have with managed services.

MongoDB is perfect for managed services because there is no other database that is so much fun to use as a developer and so challenging to support as an administrator. Even as an experienced database administrator and a newbie developer, I much prefer to program on MongoDB than to manage it. This is not the case with Oracle or MySQL, where tuning and tinkering are actually more fun than writing SQL.

Why is MongoDB such amazing fun for developers? MongoDB is a JSON document store. You can store any JSON doc there through a very easy-to-use API. Tons of RESTful web APIs send back JSON docs, and being able to easily dump them into a database makes MongoDB just perfect. I'm working on a small program to provide some social network analysis from Twitter, and because Twitter throttles some of the APIs rather aggressively, I need to slowly grab the data I need over many hours. Using MongoDB to store all that data is a no-brainer. It's as easy as writing to text files.
Of course, unlike text files, MongoDB lets you build indexes and run queries.
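For illustration, here is roughly what that workflow looks like from the command line (a sketch only – the database, collection, and field names are hypothetical, and it assumes a local mongod and a file with one tweet JSON document per line):

# Load raw tweet documents straight into MongoDB – no schema needed
mongoimport --db twitter --collection tweets --file tweets.json

# Index a field we happen to care about
mongo twitter --eval 'db.tweets.ensureIndex({"user.screen_name": 1})'

# And query it like any other database
mongo twitter --eval 'print(db.tweets.find({"user.screen_name": "some_user"}).count())'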

This brings me to the most important reason MongoDB is awesome. MongoDB supports the most important paradigm of the post-relational-database era: grab data first, structure it later. As much as DBAs don't want to hear about it, the requirement to define a data model and schema before you start collecting data creates serious friction in the early phases of development. It's not a big deal for a multi-month, large-scale project, but it can be a problem for the small hacks I'm typically involved in. In a recent project, we needed to store events from a queue in a database. All events have an originator, a timestamp, and a priority. But they also have a bunch of "other data". In a pure relational schema, we can only keep the data we know we need and know how to structure, and maybe store the "other data" as a CLOB and be totally unable to use it later. With MongoDB, I can store everything, since the events themselves are JSON. And I will still be able to query the data, even though I don't know in advance what the data contains or what I'll want to get in the queries.
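As a rough sketch of what that looks like (hypothetical event documents, via the mongo shell):

mongo events --eval '
  db.events.insert({originator: "app01", ts: new Date(), priority: 3,
                    other: {order_id: 1234, region: "EMEA"}});
  db.events.insert({originator: "app02", ts: new Date(), priority: 1,
                    other: {error_code: "ORA-00600"}});
  // Later, query on a field nobody planned for up front:
  printjson(db.events.find({"other.region": "EMEA"}).toArray());
'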

MongoDB is not the only “store now, structure later” datastore; Hadoop allows you to do the same and so do many other data stores. But if you are now a relational-only shop, you can be sure that your developers are looking at one of those solutions with keen interest, and you’ll be wise to do the same.

Why is MongoDB a pain to manage? The web is full of horror stories from companies regretting the moment they adopted MongoDB. It is interesting to see how those MongoDB war stories typically involve downtimes far longer than anything you’d see in other data stores. It is very rare to see downtime on Oracle or MySQL lasting longer than 4 hours. With MongoDB, stories often include over 12h of downtime. This is the main reason I was nagging my managers to provide MongoDB support. I care about our customers and don’t want to see them experience 12h downtime. As an IT professional, I find the idea unacceptable.

A lot of those downtimes look completely preventable or at least solvable in shorter amounts of time. The main causes can be reduced to:

1. A company adopted MongoDB without fully understanding the model, its benefits, and its limitations. Especially its limitations. Any database has to be well understood before it goes into production, and MongoDB is no different. Assuming that you can just install a database, throw data in, and expect it to work is lunacy. This kind of lunacy used to be too common among MongoDB adopters. I wrote two blog posts (No silver bullet and Difficulty of migrations) just ranting about this kind of optimism in the MongoDB community.

2. MongoDB was managed by a small team of developers. Developers are naturally not as good as experienced administrators at detecting problems early and responding accordingly, and they can often be out of their depth when things really go wrong. DBAs and sysadmins read mailing lists where people describe how their database crashed and discuss possible solutions. Developers are more likely to read mailing lists with programming problems and solutions. So guess who is more likely to recognize a production problem and know the solution?

Why are those problems worse in MongoDB than in other databases? Part of it is maturity – MongoDB is not as mature as Oracle or MySQL. The documentation (official and in blogs and forums) is not as complete, the error messages are not as clear, and instrumentation is still lacking. Finding the root cause of issues still takes longer. But the main problem is that MongoDB is getting adopted into production by teams of developers with little to no operational involvement. Even if the operations team knows MongoDB is running somewhere (and that's not always the case!), they probably don't go to MongoDB training or even read a book. They have enough other work to do, and they hope the developers who pushed it into production know what they are doing. I've seen this happen with early adoption of MySQL, and it's deja vu all over again. This is why our most experienced MySQL admins have been busy learning MongoDB – they've been there before too.

I hope you see now why I’m so excited about managed services for MongoDB. Your developers can have fun, and leave the pain to us.

To answer the most frequently asked question: No, we are not competing with 10Gen. First, because we like 10Gen and hope to work with them a lot. Second, because we can’t hope to compete with 10Gen – they employ many MongoDB developers, they know the code inside out, and they can fix bugs for you when you need it. How can we compete with that? Just like our Oracle customers still use Oracle support, we encourage our MongoDB customers to also have 10Gen support.
The third reason is that 10Gen can't compete with us either. We offer full managed services – we will configure full monitoring of MongoDB based on our experience and best practices, and the alerts will go to our pagers, providing 24/7 support. We aim to fix problems before you even notice they're there. We are very proactive about your high availability, recoverability, and performance. We have tons of experience making systems run so smoothly that you'll forget they are there, until your customers call to ask "Why is everything so much faster now?" (True story!)

Speaking of monitoring, our team of experts is building a MongoDB monitoring system as I write this blog post. I think this is the secret sauce for MongoDB success. Monitoring and capacity planning go hand in hand and are the keys to keeping systems up and performing well, so we put a lot of focus on getting the basics right. Of course, if you already have your own monitoring solution (or use 10Gen's), we'll integrate our pagers and capacity planning systems with whatever monitoring you use.

Are you running MongoDB in production? How’s the experience so far?

Hadoop FAQ – But What About the DBAs?
Fri, 25 Jan 2013

There is one question I hear every time I make a presentation about Hadoop to an audience of DBAs. This question was also recently asked in LinkedIn's DBA Manager forum, so I finally decided to answer it in writing, once and for all.

“As we all see there are lot of things happening on Big Data using Hadoop etc….
Can you let me know where do normal DBAs like fit in this :
DBAs supporting normal OLTP databases using Oracle, SQL Server databases
DBAs who support day to day issues in Datawarehouse environments .

Do DBAs need to learn Java (or) Storage Admin ( like SAN technology ) to get into Big Data ? ”

I hear a few questions here:

Do DBAs have a place at all in the Big Data and Hadoop world? If so, what is that place?

Do they need new skills? Which ones?

Let me start by introducing everyone to a new role that now exists in many organizations: Hadoop Cluster Administrator.

Organizations that have not yet adopted Hadoop sometimes imagine Hadoop as a developer-only system. I think this is the reason I get so many questions about whether or not we need to learn Java every time I mention Hadoop. Even within Pythian, when I first introduced the idea of Hadoop services, my managers asked whether we would need to learn Java or hire developers.

Organizations that did adopt Hadoop found out that any production cluster larger than 20-30 nodes requires a full-time admin. This admin's job is surprisingly similar to a DBA's job – he is responsible for the performance and availability of the cluster, the data it contains, and the jobs that run there. The list of tasks is almost endless and also strangely familiar – deployment, upgrades, troubleshooting, configuration, tuning, job management, installing tools, architecting processes, monitoring, backups, recovery, etc.

I have not seen a single organization with a production Hadoop cluster that didn't have a full-time admin, but if you don't believe me – note that Cloudera offers a Hadoop Administrator certification and O'Reilly sells a book called "Hadoop Operations".

So you are going to need a Hadoop admin.

Who are the candidates for the position? The best option is to hire an experienced Hadoop admin. In 2-3 years, no one will even consider doing anything else. But right now there is an extreme shortage of Hadoop admins, so we need to consider less perfect candidates. The usual suspects tend to be: junior Java developers, sysadmins, storage admins, and DBAs.

Junior Java developers tend not to do well in a cluster admin role, just like PL/SQL developers rarely make good DBAs. Operations and dev are two different career paths that tend to attract different types of personalities.

When we get to the operations personnel, storage admins are usually out of consideration because their skillset is too unique and valuable to other parts of the organization. I’ve never seen a storage admin who became a Hadoop admin, or any place where it was even seriously considered.

I've seen both DBAs and sysadmins become excellent Hadoop admins. In my highly biased opinion, DBAs have some advantages:

Everyone knows DBA stands for “Default Blame Acceptor”. Since the database is always blamed, DBAs typically have great troubleshooting skills, processes, and instincts. All of these are critical for good cluster admins.

DBAs are used to managing systems with millions of knobs to turn, all of which have a critical impact on the performance and availability of the system. Hadoop is similar to databases in this sense – tons of configurations to fine-tune.

DBAs, much more than sysadmins, are highly skilled in keeping developers in check and making sure no one accidentally causes critical performance issues on an entire system. This skill is critical when managing Hadoop clusters.

DBA experience with DWH (especially Exadata) is very valuable. There are many similarities between DWH workloads and Hadoop workloads, and similar principles guide the management of the system.

DBAs tend to be really good at writing their own monitoring jobs when needed. Every production database system I've seen has a crontab file full of customized monitors and maintenance jobs. This skill continues to be critical for Hadoop systems.

To be fair, sysadmins also have important advantages:

They typically have more experience than DBAs managing huge numbers of machines.

They have experience working with configuration management and deployment tools (puppet, chef), which is absolutely critical when managing large clusters.

They can feel more comfortable digging in the OS and network when configuring and troubleshooting systems, which is an important part of Hadoop administration.

Note that in both cases I’m talking about good, experienced admins – not those that can just click their way through the UI. Those who really understand their systems and much of what is going on outside the specific system they are responsible for. You need DBAs who care about the OS, who understand how hardware choices impact performance, and who understand workload characteristics and how to tune for them.

There is another important role for DBAs in the Hadoop world: Hadoop jobs often get data from databases or output data to databases. Good DBAs are very useful in making sure this doesn’t cause issues. (Even small Hadoop clusters can easily bring down an Oracle database by starting too many full-table scans at once.) In this role, the DBA doesn’t need to be part of the Hadoop team as long as there is good communication between the DBA and Hadoop developers and admins.
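For example, a DBA can insist that Sqoop imports from the production database run with a small, fixed number of mappers, since each mapper opens its own session and scans its own slice of the table. A sketch, with made-up connection details:

sqoop import \
  --connect jdbc:oracle:thin:@proddb01:1521:ORCL \
  --username etl_user -P \
  --table ORDERS \
  --num-mappers 4 \
  --target-dir /user/etl/orders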

What about Java?
Hadoop is written in Java, and a fairly large share of Hadoop jobs will be written in Java too.
Hadoop admins will need to be able to read Java error messages (because that is typically what you get from Hadoop), understand the concepts of the Java virtual machine and a bit about tuning it, and write small Java programs that can help in troubleshooting. On the other hand, most admins don't need to write huge amounts of Hadoop code (you have developers for that), and for what they do write, non-Java solutions such as Streaming, Hive, and Pig (and Impala!) can be enough. My experience taught me that good admins learn enough Java to work on a Hadoop cluster within a few days. There's really not that much to know.
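For instance, an ad-hoc check that would otherwise be a small Java MapReduce job is often a one-liner in Hive (the table and column names here are just placeholders):

hive -e "SELECT status_code, COUNT(*) FROM access_logs GROUP BY status_code;"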

What about SAN technology?
Hadoop's storage system is very different from a SAN. It generally uses local disks (JBOD) – not storage arrays, and not even RAID. Hadoop admins will need to learn about HDFS, Hadoop's file system, but not about traditional SAN systems. However, if they are DBAs or sysadmins, I suspect they already know far too much about SAN storage.

So what skills do Hadoop Administrators need?

First and foremost, Hadoop admins need general operational expertise: good troubleshooting skills and an understanding of system capacity, bottlenecks, and the basics of memory, CPU, OS, storage, and networking. I will assume that any good DBA has these covered.

Second, good knowledge of Linux is required, especially for DBAs who spent their life working with Solaris, AIX, and HPUX. Hadoop runs on Linux. They need to learn Linux security, configuration, tuning, troubleshooting, and monitoring. Familiarity with open source configuration management and deployment tools such as Puppet or Chef can help. Linux scripting (perl / bash) is also important – they will need to build a lot of their own tools here.

Third, they need Hadoop skills. There's no way to avoid this :) They need to be able to deploy a Hadoop cluster, add and remove nodes, figure out why a job is stuck or failing, configure and tune the cluster, find the bottlenecks, monitor critical parts of the cluster, configure name-node high availability, pick a scheduler and configure it to meet SLAs, and sometimes even take backups.

So yes, there’s a lot to learn. But very little of it is Java, and there is no reason DBAs can’t do it. However, with Hadoop Administrator being one of the hottest jobs in the market (judging by my LinkedIn inbox), they may not stay DBAs for long after they become Hadoop Admins…

Any DBAs out there training to become Hadoop admins? Agree that Java isn’t that important? Let me know in the comments.

Hadoop FAQ – Getting Started
Fri, 11 Jan 2013

After my "Building Integrated DWH with Oracle and Hadoop" webinar for IOUG Big Data SIG on Tuesday, I got a bunch of excellent follow-up questions. The most frequently asked questions were: "What is the minimum I need to do to get started with Hadoop?" and "How do I load data into Hadoop?"

Since so many people are interested in the same questions, it makes more sense to put my answers on a blog than to copy and paste them to everyone personally. Also, in the grand open source tradition, there are many ways to get started with Hadoop and load data into Hadoop.

Let’s go over a few options for getting started with Hadoop:
Hadoop can run in a few different modes.

Local mode – Here, you can run your map-reduce code over files in your local filesystem. In this mode there is no cluster or HDFS. It’s mostly used to test if your fancy map-reduce jar files will run at all.

Pseudo-distributed mode - In this mode, you are running all Hadoop processes as you would in a real cluster, but they are all running from a single server. My test “cluster” is a pseudo-distributed Hadoop running in a VM, and for many tests this is enough. For example, I ran all the Hadoop-Oracle integration scenarios in this setup.

Fully-distributed mode – Here the sky is the limit. All production clusters run in this mode, and they can be huge and complex. But for starters, we are looking at just two or three machines, either VMs or in the cloud, running just one of each of the basic processes. For some tests you need a fully distributed cluster – for example, there is a blog post coming up about configuring HA HDFS, and I couldn't do that with just a single node.

Getting started in local mode is probably easiest. You download a release of Apache Hadoop and install it. Then configure /etc/hadoop/conf/hadoop-env.sh with your JAVA_HOME (you need Java 1.6 to run Hadoop). At this point you can run bin/hadoop jar <your job> and presto! You are running Hadoop!
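A minimal local-mode session looks roughly like this (the version number and examples jar name are illustrative and vary between releases):

tar xzf hadoop-1.0.4.tar.gz
cd hadoop-1.0.4
export JAVA_HOME=/usr/java/default   # or set it in conf/hadoop-env.sh
bin/hadoop jar hadoop-examples-1.0.4.jar wordcount /tmp/input /tmp/output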

This may be good enough for developers, but DBAs usually don't feel they have really experimented with a new data store unless they have some place to store the data. The local file system isn't what we are looking for, so we usually get started with pseudo-distributed mode.

In Pseudo-distributed mode, you will be running all basic Hadoop processes:

HDFS name node – master process for HDFS. There is always just one of those, and it is responsible for keeping track of the filesystem meta data, such as file names and directories.

HDFS data node – slave process for HDFS. In a real cluster there are many of those. They are responsible for communicating with the client, storing the data, and replicating it.

I used the first method when I wanted to run Hadoop on a system that already had Oracle installed. It's easier to install Hadoop on an Oracle server than vice versa. The rest of the time, I used the VM for my tests.

Fully-distributed mode is required when you run a more serious POC or want to test some of the HA features.

There are no easy ways to get started with a full cluster. Both require a deeper understanding of Hadoop, but they are not very difficult either:

Take the VM from the pseudo-distributed step and run two copies of it. Make sure they can communicate with each other. Stop all Hadoop services on both VMs. Configure /etc/hadoop/conf/core-site.xml, /etc/hadoop/conf/hdfs-site.xml, and /etc/hadoop/conf/mapred-site.xml with the appropriate IP addresses so the services will be able to find each other. Start the name node and job tracker on one node and the data node and task tracker on the other.
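As a rough sketch, assuming a plain Apache tarball install (CDH packages use service scripts instead), the last step looks something like this:

# On the first VM
bin/hadoop namenode -format            # first time only
bin/hadoop-daemon.sh start namenode
bin/hadoop-daemon.sh start jobtracker

# On the second VM
bin/hadoop-daemon.sh start datanode
bin/hadoop-daemon.sh start tasktracker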

O.K. In one way or another, you now have Hadoop. Now let's see how we get data into HDFS. I'm going to assume a pseudo-distributed cluster here, since this is what I mostly use.

There are literally endless ways to get data into Hadoop, so let's review some of my favorites. Note that none of these methods require any special Java or even programming knowledge. They can be used by any DBA:

Copy a file: hadoop fs -put localfile /user/hadoop/hadoopfile

If you have a large number of files (very common in my experience), a shell script that runs multiple "put" commands in parallel will greatly speed up the process. File copying is easy to parallelize without any need to write fancy MR code.

You can also have a cron job scan a directory for new files and "put" them in HDFS as they show up.
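A sketch of both ideas, with made-up local and HDFS paths (load_to_hdfs.sh is a hypothetical wrapper script around the same put logic):

# Push a directory of files into HDFS, eight at a time
ls /data/incoming/*.csv | xargs -n1 -P8 -I{} hadoop fs -put {} /user/hadoop/incoming/

# Or a crontab entry that sweeps the directory every 10 minutes
*/10 * * * * /usr/local/bin/load_to_hdfs.sh /data/incoming /user/hadoop/incoming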

Mount HDFS as a file system and simply copy files or write files there. The instructions for mounting are in my presentation.

Use Sqoop to get data from a database to Hadoop. This was also covered in my presentation.

Use Flume to continuously load data from logs into Hadoop. There are some catches there – you want to make sure you get relatively fresh data since you don’t want Flume to do too much buffering, but you also don’t want many small files because that’s bad for HDFS. I’ll probably blog about this in the future.

In my presentation, I mentioned Perl as a method of ETLing data from MySQL to Hadoop. It was mentioned because I use Perl whenever possible, not necessarily because it is the recommended approach. Most of the time, using Sqoop will be a better idea. Pre-processing the data in the DB can also be a good idea, and if you need custom code (like I did) and can program in Java (which I hate doing), you should probably write proper MR code.

In any case, I used an interface called Streaming, which allows you to run any shell command that reads and writes key-value pairs as a mapper or reducer. I wrote my own Perl scripts to get data from the DB and do some pre-processing before dumping it into Hadoop.
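The invocation looks roughly like this – the streaming jar location varies by distribution, and the mapper and reducer script names are just placeholders:

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
  -input /user/hadoop/raw_events \
  -output /user/hadoop/clean_events \
  -mapper extract.pl \
  -reducer aggregate.pl \
  -file extract.pl -file aggregate.pl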

If you have other tips on getting started with Hadoop or for loading data, the comments are all yours! :)

New Year, New Big Data Appliance
Tue, 25 Dec 2012

Shortly before we all went on break for the holiday, Oracle announced the new BDA X3-2. Now I have time to properly sit down with a glass of fine scotch and dig into the details of the release. It turns out that there are quite a few changes packed in. We are getting new hardware, new Hadoop, new Connectors, and new NoSQL. Tons of awesome features are included.

For those in a hurry, these are the best new features in the release in my opinion:

64G RAM per node, expandable up to 512G.
Disregard what I said in the past about this being too much memory for workloads that tend to be IO-bound. There is no such thing as too much memory, especially not for map-reduce jobs, which tend to be written in Java – not a language known for its efficient use of memory. You will want at least 1GB of RAM per job, with 2G being more reasonable. 64G of RAM will allow you to run around 30 jobs per machine, which is rather balanced for 16 cores. More memory will also let you configure larger IO and network buffers, give reduce-side joins more "scratch" space, and, of course, run Impala.

Hadoop is upgraded from CDH3 to CDH4.1 with the following new features:

High-availability name node. This means that the name node is no longer a single point of failure and removes the biggest issue with deploying Hadoop as a production enterprise cluster.

Federated name nodes. The amount of data stored in the cluster is limited by the amount of memory in the name node. Federated name nodes allow working around this barrier by splitting the filesystem between multiple name nodes for a higher total memory limit. I doubt anyone with a BDA will require this feature.

YARN – the new job framework with the new resource manager.

Impala! It actually doesn’t arrive on the BDA, and I’m not even sure if it’s supported. Impala is pretty beta anyway. But it’s also pretty awesome. Queries that take 10 seconds to run on Hive take milliseconds in Impala. It does everything in-memory, which is the best excuse for upgrading to the 512G RAM version of BDA. Remember that it is not part of the BDA, but if you decide to install it, it should run fine on CDH4.1. If you try to install it and break something, don’t tell support “Gwen said it should run fine”.

New Connectors:
BDA arrives with connectors that are compatible with CDH4. The release notes were a bit useless as they listed all connector capabilities without breaking them down into old and new features. I had to dig into the docs to try to find what is actually new. Here’s what I figured out. Feel free to correct me if I got it wrong:

Oracle SQL Connector for HDFS (OSCH) is the new name for Oracle Direct Connector for HDFS. The direct connector worked as a pre-processor for external tables, allowing us to reference a file on HDFS. This was pretty cool. The new connector runs as an MR process and creates the external table for you. If the data you want is in Hive, it will read the Hive metastore to create the external table definition for you. Normal files get external tables in which all columns are VARCHAR2. The deeper integration with MapReduce seems to allow better support for parallel queries. It also looks like the Avro file format and compression codecs are now supported – which is awesome considering how much trouble the lack of support caused me in the past.

Oracle Loader for Hadoop seems to support loading data from NoSQL 2.0 in addition to Hadoop. Support for Avro was added here as well.

There is also a new Connector for R, but I haven't dug into its features yet.

New management tools:
I haven't looked into any features there either, but the new BDA includes Cloudera Manager and a Big Data plugin for OEM.

Oracle NoSQL 2.0:
The new Oracle NoSQL release is the first release that has features specifically for the Enterprise Edition.

Hadoop integration - New classes allow Hadoop MR jobs to read data stored in Oracle NoSQL. I’ve seen a lot of requests for this feature, but I never really understood it. I’m curious to see how this will be used. If anyone uses this feature, I also want to hear why they don’t use HBase. However, MongoDB and Cassandra already have support for MR jobs, so it’s nice to see Oracle NoSQL closing this gap.

Access from Oracle RDBMS through External Tables (Enterprise Edition only) – I think it’s implemented through the new support for MR jobs and the new Hadoop connectors.

Avro support - It defines schema for the data contained in the record value. Schemas are defined with JSON, and there is some support for schema evolution.

Support for different numbers of replication nodes per physical storage node – This allows heterogeneous hardware in the NoSQL cluster. Not that exciting for BDA owners.

Elastic sharding – This is a feature that was sorely missing from the previous release. You can now add replication nodes, move them around, rebalance load between nodes, etc. The number of partitions is still static, so you still want to configure it right from the installation.

Stream-based API for storing very large values without fully materializing them in memory.

Clearly an awesome release, packed not just with new features, but with features that the customers actually need. I can see existing BDA owners looking to upgrade, not for the hardware boost, but for critical features like HA namenode and elastic NoSQL.

This brings me to a painful point: Nothing in the release notes or white papers even mentions the possibility of an upgrade. Sure, owners of a BDA can just start installing the new Cloudera Manager, upgrade to CDH4, install the new Oracle NoSQL, etc. However, if this is indeed enterprise software, the word "patch" should have been mentioned somewhere, in my opinion. To further complicate things, while there are very clear instructions on upgrading a cluster from CDH3 to CDH4 intact, it remains unclear how to upgrade Oracle NoSQL to release 2.0 without the equivalent of exporting the data out and re-importing it into a new cluster. What's even less clear is whether Oracle will even support CDH4.1 and Oracle NoSQL 2.0 on the old appliance. If they do, it's not really an appliance, and if they don't, it makes the whole BDA proposition far less attractive.

The only other complaint I have about BDA X3-2 is the operating system. OEL 5.8. Meh. I guess Oracle decided that upgrading every single component in this release is too much and left the OS alone?

If you are considering buying a BDA or just wondering why Oracle geeks like me dig this Hadoop stuff, I'm giving a webinar for the IOUG Big Data SIG. I will explain why Hadoop is just what your enterprise data warehouse needs and what the best ways are to integrate Hadoop and Oracle. The webinar is on Jan 8 at 11am PST, and I'm racing against the clock to update the presentation with all the cool BDA X3-2 features.

No Time Like the Present
Thu, 20 Dec 2012

I'm not one to panic over non-events. I wouldn't survive very long in my career if I were. Besides, I always have plenty of real emergencies to worry about. But when even the New York Times tells us to prepare for the end of the world, which may or may not arrive in a few days, I start to worry. I even found a nice website to tell me if the apocalypse happened yet.
Like any self-respecting DBA, I can't prepare for anything without concrete requirements. "Apocalypse" and "End of the World" are rather vague descriptions and don't give me a good idea of what I'm preparing for. I asked some people what they expect the end of the world to be like and got the following list: earthquakes, war, shortage of natural resources, chaos, stock market crash, epidemic, and satanic hosts of demons.

But the fact remains that most Americans, and DBAs among them, are woefully unprepared for an apocalypse. Here at Pythian, we want to help our customers and friends prepare for the apocalypse, in the hope that we still have customers and friends next week.

Prepare your database for the apocalypse with these 10 easy tips:

The best advice for the new era comes from our Oracle ACE, OCM, and RMAN expert, Yuri Velikanov, who said: "RMAN DROP DATABASE should make all the preparations much trouble-less". Indeed – no database, no problems!

If you are somewhat reluctant to actually delete your database, you can try a slightly less drastic approach. Take backups a few hours before the end of the world, and then shut down. In other words, treat the end of the world like you would any other risky maintenance, such as applying a patchset. I’m sure you’ll find that there are many similarities between the events.
Of course, the interesting question is where to store the backup. If the world is about to end, the only safe place for your backup is in orbit. Launch your backup satellite before it’s too late.

Over years of working in IT, we noticed a certain interesting correlation: Database problems typically happen after DBAs do something to the database. Deployments, upgrades, space allocations, all sorts of routine maintenance actions have some probability of ending in tears. If no one touches the database, it will typically keep on happily functioning for years.
To survive the end of the world without a hitch, you should prevent your DBA team from touching your database. We are not suggesting going as far as firing them (although we will be happy to hire them at Pythian if you do), but consider sending them on an early new-year vacation somewhere remote and possibly safe from trouble. Hawaii should be wonderful at this time of year, but I'm not sure about volcanic activity.

On the topic of DBAs, you should definitely prepare for the event that some of your DBAs will not survive the apocalypse. Place the job ads today, or better yet, sign a flexible contract with your friendly remote-DBA provider.

Why stop at just signing a contract with a friendly remote-DBA provider? It’s time to spend your budget like there’s no tomorrow. Personal Exadata X3-8 for the entire DBA staff perhaps?

Of course, the ultimate plan against sudden lack of DBAs is automation. Make sure all your routine tasks are scripted and automated, and your database may survive the apocalypse even if none of your DBAs do. Some may even argue that automating all routine tasks is a good idea even if the world does not come to an end.

Now is also an excellent time to review and revise your disaster recovery plans (DRP). Years ago, our department underwent an audit that required us to have a DR plan. At the time, we did not have multiple datacenters or even off-site backups, so we doubted our ability to pass the audit, not to mention survive an apocalypse. Luckily, our manager was very detail-oriented and noticed that passing the audit just required us to have a DR plan. It did not specify what the plan should actually contain. We drafted a simple DR plan:

When in trouble or in doubt,
Run in circles, scream and shout.

We passed the audit.

I noticed that there is a large number of Friday night parties being promoted as “End of the World Party” or “Party like There’s No Tomorrow”. This is my personal plan for the end of the world – if the music is loud enough and I’m enjoying the dancing, I probably won’t even notice the earthquakes and demons. Luckily for all beer lovers, the US government tested beer safety following a nuclear apocalypse and declared commercially packaged beer safe so we can keep on partying as the world ends.

Some of us have the tendency to defer to tomorrow anything that does not absolutely have to be done today. Knowing that the world may end on Friday is the best excuse to procrastinate. If we put our unpleasant tasks off long enough, we may not need to do them at all. Ever.
For example, the last tip in this list will be published on Monday.

Disclaimer:
The above tips are a joke, yes? We did not validate any of them and performed no dry-runs of the end of the world. Use common sense and keep your databases safe. If you feel a strong need to prepare for an apocalypse, it's never too early to start worrying about the Unix-time rollover in January 2038.

Changing SID on a RAC Environment
Sat, 08 Dec 2012

This post is just a short note documenting a procedure that isn't done frequently or described anywhere else (to my knowledge).

Sometime last month, a customer asked for my help to change the SIDs on one of his RAC databases to match the new corporate standard. The database name matched the standard so we could leave that alone, but the SID needed to be changed.

Here’s what we did to change it (on 11gR2, but it should work on older versions too):

And of course, comment below if you think I forgot a step or know about a better way to do it.

Concrete Advice for Abstract Writers
Thu, 22 Nov 2012

October is abstract-writing season. Many database conferences have their call-for-papers deadline in the last half of October, and I spend significant portions of my time considering the areas I'm interested in these days and whether they will make good presentations. Having an interesting topic to present is only half the battle; I must convince the conference organizers that my presentation will be educational and entertaining. Most conferences base their decision on a title and abstract submitted by the hopeful speakers, so much depends on being able to compose a short description of your presentation that will catch their eye and compel them to invite you to the conference.

November is abstract-reading season. I volunteer with multiple user groups, and as part of my activities, I review abstracts and recommend the ones I think should be included in the conference program. I estimate I've read close to a hundred abstracts in the past few weeks. When reviewing abstracts, I see huge variance – the best abstracts are clear and make the topic sound exciting. Bad abstracts are either boring or give you no meaningful idea of what the presentation will include. The worst are just a random collection of buzzwords. I always assumed that presenters who write bad abstracts either don't know what they are going to talk about or don't care enough about the topic to make the abstracts better.

At least this is what I used to believe, until my own husband asked me to take a look at an abstract he was about to send off. It was terrible. I know that my husband is smart and that he cared about the topic. (How can anyone not care about methods of cheating in visual cryptography?) As I sat with him at the kitchen table, gently trying to explain the issues in the abstract and how to correct them, it occurred to me that abstract writing does not come naturally to everyone.

The main difficulty in writing good abstracts is that you are trying to accomplish multiple goals in the same short text: You are trying to convince the abstract reviewer that the topic is interesting, that you are an expert on the subject matter, and that you will be educational and entertaining. At the same time, you need to describe the general content of the presentation to your potential audience so that they’re able to make an informed decision on whether to attend the presentation.

I recommend splitting the abstract into two parts and dedicating a paragraph, or at least a sentence, to each.

The first part is the purpose. The goal here is to explain why the topic is relevant and interesting. This section should sound like a newspaper headline since they share a similar goal: to catch the eye and create interest.

For example: “Getting training data for a recommender system is easy: if users clicked it, it’s a positive – if they didn’t, it’s a negative. … Or is it?”

This part shouldn’t be overly verbose. One or two sentences are typically enough to introduce the topic, demonstrate its importance, and create interest in the presentation.

“The MapReduce programming model lets developers without experience with parallel and distributed systems utilize the resources of a large, multi-CPU system. Hadoop clusters can be used to implement this model, but Oracle Database also provides mechanisms to support the same model – and with less programming. “

The second part should provide more details about your specific presentation. It is a prose form of your presentation's top-level outline. Which topics will you focus on? What will the audience learn? Will there be a demo or use-case examples?

For example: “In this talk, we use examples from production recommender systems to bring training data to the forefront: from overcoming presentation bias to the art of crowdsourcing subjective judgments to creative data exhaust exploitation and feature creation.”

or

“Bryn and Maria address ways to implement an application to solve a specific problem where, because of the typically huge volumes of data that must be distilled down to provide interesting information, performance matters. A person whose job is to work with an installed system won’t be able to use what he explains in his talk to tune its performance.”

Here’s a complete example for an excellent abstract that follows this format:

“Craigslist uses a variety of data storage systems in its backend systems: in-memory, SQL, and NoSQL. This talk is an overview of how craigslist works with a focus on the data storage and management choices that were made in each of its major subsystems. These include MySQL, memcached, Redis, MongoDB, Sphinx, and the filesystem. Special attention will be paid to the benefits and tradeoffs associated with choosing from the various popular data storage systems, including long-term viability, support, and ease of integration.”

You read this type of abstract and you immediately know what the presentation is about (variety of data storage systems), why they are interesting (craigslist uses them), and what the main topics of the presentation are (tradeoffs, benefits, support, integration, and a long list of data storage systems).

The most common reason reviewers reject abstracts is that we can’t figure out the content of the presentation, or the abstract does not convince us that the presentation has anything new or interesting to say about the subject.

Two common mistakes to avoid:

1. An abstract is not an intro – Too often I see abstracts that are all about explaining why the topic is interesting and important. It is important to remember that an abstract should summarize the entire presentation; it should not contain just the content of the first three introduction slides. We and your audience need to understand what you will talk about, not just gain a general idea of the subject matter.

2. An abstract is not a random collection of buzzwords – Too often I see an abstract that is simply a collection of randomly chosen buzzwords thrown together. Nothing will get your abstract onto my "reject" pile faster. If I can't get a clear idea of the three main points of your presentation by reading your abstract, I will strongly suspect that you have absolutely no idea what you want to talk about. If you don't care enough about the topic to figure out what you want to say, I don't care to accept it.

Just taking this small two-part construction step for your abstract will put you way ahead of 90% of the abstracts I see submitted to conferences. Your abstract will be interesting, not too long, and will include specific details on your topic. If you picked the right topic, this is enough to get the abstract accepted to most conferences.

Good luck and have an excellent conference season!

Select Statement Generating Redo and Other Mysteries of Exadata
Wed, 31 Oct 2012

Like many good stories, this one also started with an innocent question from a customer:

“I often check “SELECT COUNT(*) FROM TABLE;” on our old 10.2.0.4 cluster and the new Exadata. The Exadata is slower than our old cluster by few minutes.
Exadata is not good at count(*) ? or other problem?”

To which I replied:
“Exadata should be AMAZING at count(*)…
so, we definitely need to check why it took so long…
which query?”

Exadata should do count(*) either from memory or by using smart scans, and since my friend is counting fairly large tables that won’t all fit into memory, Exadata should be significantly faster than their old system.

So, I started by trying to reproduce the issue. The customer sent me his slow query, and I ran it. The same query that took him 3 minutes to run, took me 2 seconds. Clearly we were not doing the same thing, and disk access was then our main suspect.

To see what his executions indicate, I asked the customer to run the same script adding “SET AUTOTRACE ON” at the top and send me the script and the results.
"SET AUTOTRACE ON" is far from perfect, but it is an easy way to get an overview of the issue without annoying the customer into non-compliance.
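The kind of script involved is nothing fancy – roughly this, with placeholder credentials and table name:

sqlplus -S user/password@exa <<'EOF'
SET AUTOTRACE ON
SET TIMING ON
SELECT COUNT(*) FROM big_table;
EXIT
EOF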

There was a lot of disk access, but I was distracted by the huge redo size. Why was a select statement generating so much redo?

My best guess was delayed block cleanout. A quick Twitter survey added another vote for cleanouts and a suggestion to check for audits. Checking for audits was easy: we had no auditing on selects. But how do I confirm whether or not I'm doing block cleanouts?

The tweeps suggested: “First trace it. Then you have concrete evidence.” Trace is often the performance tuning tool of choice, since it shows very detailed information about the queries that run and their wait events. But trace is not the perfect tool for everything - “delayed block cleanout” is neither a wait event nor a query, so this information will not show up in a trace file.
Oracle keeps track of cleanouts using “delayed block cleanout” statistics. V$SESSTAT has this information, but it is cumulative. I will have to take a snapshot of V$SESSTAT before and after each query. If only there was a tool that would make it easier…
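For reference, the manual version of that snapshot is a query like the following, run in the session before and after the statement of interest (placeholder credentials; the statistic list is trimmed to the ones relevant here):

sqlplus -S user/password@exa <<'EOF'
SELECT sn.name, st.value
FROM   v$mystat st JOIN v$statname sn ON sn.statistic# = st.statistic#
WHERE  sn.name IN ('redo size',
                   'cleanouts only - consistent read gets',
                   'cleanouts and rollbacks - consistent read gets');
EXIT
EOF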

I ran Snapper for snapshots every 5 seconds during the time the scripts ran. Why not use Snapper’s multi-snapshot ability? Because it only displays the information after the run, and my ADHD mind wanted constant feedback like I got from top and vmstat.

Leaving out a lot of boring stuff, we can immediately see that we do not have any delayed block cleanout. Instead, we have redo… for lost write detection.
What is this lost write detection? A quick Google query led me to a parameter that was added in version 10g: DB_LOST_WRITE_PROTECT. The documentation clearly said:
“When the parameter is set to TYPICAL on the primary database, the instance logs buffer cache reads for read-write tablespaces in the redo log, which is necessary for detection of lost writes.”
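Checking the current value and switching the feature off is straightforward (a sketch, assuming an spfile; on RAC you would apply it to all instances):

sqlplus -S / as sysdba <<'EOF'
SHOW PARAMETER db_lost_write_protect
ALTER SYSTEM SET db_lost_write_protect = NONE SID='*' SCOPE=BOTH;
EXIT
EOF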

In the old cluster, the parameter was unset and defaulted to NONE. On the Exadata X2-2, the default is TYPICAL. Maybe Oracle knows something we don't about write safety on Exadata? Regardless, when we set the parameter to NONE, the results were very different.

Understanding and getting rid of the mysterious REDO is great. But in follow-up tests, we saw that every time the query had to touch disk, Exadata was still significantly slower than the cranky old SAN we used in the old cluster.

Let’s go back to Snapper.

Snapper output showed that 84.4% of the time was spent on "cell single block physical read", and the statistics showed that it was all "optimized" – meaning read from Exadata's flash cache. It's nice that the reads are fast, but why do we use "cell single block physical read" at all?

I turned to the execution plans – incidentally, the one piece of information needed for this troubleshooting that Snapper doesn't show you.

Do you see the difference? Are you surprised to learn that Exadata will apply smart scans on INDEX FAST FULL SCAN, but not on INDEX FULL SCAN?

Why did Oracle choose INDEX FULL SCAN instead of FAST FULL SCAN?

There could be many reasons: Maybe the index is not unique or allows nulls. Or maybe our optimizer parameters make single block access too attractive. Or maybe we have bad statistics.

Since it's a short list, I eliminated the options one by one. The primary key exists, and the optimizer parameters were all default. The table statistics were fine, so I was almost fooled. But the index statistics indicated zero rows!
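From there, the obvious next step is to check and regather the index statistics – roughly this, with placeholder names:

sqlplus -S user/password@exa <<'EOF'
SELECT index_name, num_rows, last_analyzed
FROM   user_indexes
WHERE  table_name = 'BIG_TABLE';

EXEC DBMS_STATS.GATHER_INDEX_STATS(ownname => USER, indname => 'BIG_TABLE_PK');
EXIT
EOF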

Over the years, the paper became a classic in our field. It is widely referenced by security professionals and performance-monitoring experts, both of whom need to perform detailed analysis of the data Oracle communicates over the network.

The original paper, however, became nearly impossible to find. It seemed to have only been published on the ukcert website, and after it was removed from their servers, the only place to find it was web.archive.org. Web Archive is wonderful, but it is a very unreliable way to preserve one of the most important papers published in our field.

Fortunately, Ian Redfern released his paper into the public domain. I can now reproduce it here in full to prevent it from disappearing forever:

Oracle Protocol

This document is an attempt to document the network protocol used by Oracle
database clients to communicate with Oracle database servers in order to allow
developers to decode this traffic and construct new, interoperable client and
server software.

The network protocol is known variously as SQL*Net, Net8, TNS and TTC7 – I
shall refer to it as Net8. It can be run over a number of transports, but I
shall only discuss the TCP/IP variant. I believe the details are valid for all
Oracle versions since Oracle 7.2

Basics

All Net8 traffic goes over an ordinary TCP connection to port 1521 on the
server, although this can be overridden. After logging in, multiple transactions
are carried over the connection until it is closed after logout.

Every packet begins with a length, a checksum, a type and a flags byte. Like
all Net8 integers, these are Big-Endian. The maximum length of a packet is the
SDU (Session Data Unit), which is at most 4086 bytes. By default the SDU is 4086
and the TDU (Transport Data Unit) is 32767 (also its maximum) – the TDU is never
smaller than the SDU.

The checksum is either the ones complement of the sum of the packet header or
whole packet (like an IP checksum) or – in reality – zero.

Connect

A Connect packet is of type 1. Its length is 34 unless there is connection
data. Connection data is a string of the form
(SOURCE_ROUTE=yes)(HOP_COUNT=0)(CONNECT_DATA=((SID=)CID=(PROGRAM=)(HOST=)(USER=)))
or similar.

If the connection data is longer than 221 bytes, it is carried immediately
after the CONNECT packet and the CONNECT packet length is 34 bytes, as if there
were no connection data.

It should be acceptable to use these canned packets for negotiations – they
simply disable all ANO facilities.

Types and marshalling

This is not a true self-descriptive mechanism like ASN.1 or XML, but it does
deal with variable-length binary data, and so it has a marshalling mechanism for
doing so.

There are four native types: B1, B2, B4 and PTR. Each one can be shipped as
native, universal, LSB or (universal and LSB). Native values are big-endian,
universal ones are length-byte-preceded and LSB ones are little-endian.

By default, B1 types (signed and unsigned bytes) are native, B2, B4 and PTR
are universal. Universal types are a length followed by the non-zero bytes of
data, so 0 is represented as just as zero byte. Negative values are indicated by
setting the high bit of the length.

The following types fit into this scheme:

UB1, unsigned byte length 1 (B1)

SB1, signed byte length 1, never negative, B1

UB2, unsigned byte length 2 (B2)

SB2, signed byte length 2 (B2)

UB4, unsigned byte length 4 (B4)

SB4, signed byte length 4 (B4)

UWORD, unsigned word length 4 (B4)

SWORD, signed word length 4 (B4)

RefCursor, signed word length 4 (B4)

B1Array, array of B1, written as native

UB4Array, array of UB4, written as multiple UB4s

Ptr, pointer, byte 0 if null, otherwise byte 1

O2U, boolean, byte 0 if false, byte 1 if true

NULLPTR, byte 0

PTR, byte 1

CHR, character array, written as native or CLR if conversion

CLR, byte array

DALC, byte array, either 0 (if null/empty) or SB4 length followed by CLR

UCS2, single unicode character

TEXT, 0-terminated array of B1

A CLR is a byte array in 64-byte blocks. If its length <=64, it is just
length-byte-preceded and written as native. Null arrays can be written as the
single bytes 0x0 or 0xff. If length >64, first a LNG byte (0xfe) is written,
then the array is written in length-byte-preceded chunks of 64 bytes (although
the final chunk can be shorter), followed by a 0 byte. A chunk preceded by a
length of 0xfe is ignored.

A UCS2 character is (if B2 is universal, as is usual) prefixed by a byte of 1
or 2. The character then follows in one or two bytes, reversed if B2 is LSB
(which it usually isn’t).

In this document I will not mark B1 types as they are always raw bytes.

Logon

First we get the v8 TTI protocol negotiation. The client passes in its client
type and a list of versions – presumably those it is compatible with. The TTI7
client handles up to version 4, sqlplus up to 5 and the JDBC client up to 6.

I shall document the latest protocol, version 6, as used by the JDBC client,
as it is the current version.

Password algorithm

The Oracle password encryption mechanism is based on DES, and uses a random
challenge from the server which the client must encrypt. The algorithm is quite
complex, and is most easily described in the attached Perl source
– you will need Crypt::DES and Crypt::CBC to use it.
There is now also a C version, orapasswd.c
by Xue Yong Zhi, which requires OpenSSL.

This document and its accompanying source code samples are in
the public domain, and you may do anything with them that you
wish. The author takes no responsibility for the accuracy of their
contents. Some of the terms in this document are trademarks of Oracle
and other companies. No trade secrets or other privileged information
has been used in its compilation, and the author has no relationship
with Oracle.