Monday, 22 April 2013

MySQL Applier For Hadoop: Real time data export from MySQL to HDFS

MySQL replication enables data to be replicated from one MySQL database
server (the master) to one or more MySQL database servers (the
slaves). Now imagine how many more use cases could be served if
the slave (to which data is replicated) were not restricted to being a
MySQL server, but could be any other database server or platform,
with replication events applied in real time!

This is what the new
Hadoop Applier empowers you to do.

An example of such a slave could be a
data warehouse system such as Apache Hive, which uses HDFS as a data
store. If you have a Hive metastore associated with HDFS (Hadoop
Distributed File System), the Hadoop
Applier can populate Hive tables in real time. Data is
exported from MySQL to text files in HDFS, and thereby into Hive
tables. It is as simple as running a 'CREATE TABLE' HiveQL statement on Hive
to define a table structure matching that on MySQL (and yes, you
can use any row and column delimiters you want), and then running the Hadoop
Applier to start real-time data replication.
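
For instance, a minimal sketch of such a Hive definition (the table name rep_test and its columns are hypothetical, chosen to mirror a MySQL table of the same shape):

    CREATE TABLE rep_test (id INT, msg STRING)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    STORED AS TEXTFILE;

The delimiter in the FIELDS TERMINATED BY clause must match the one the Hadoop Applier uses when writing the datafiles.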

The motivation to develop the Hadoop
Applier is that currently there is no tool available to perform this
real-time transfer. Existing solutions to import data into HDFS
include Apache Sqoop, which is well proven and enables batch transfers,
but which as a result requires re-imports from time to time to keep the
data updated. It reads the source MySQL database via a JDBC connector
or a fastpath connector and performs a bulk data transfer, which can
create overhead on your operational systems, making other queries
slow. And in a case where there are only a few changes to the
database compared to the size of the data, Sqoop might take too long
to load them.
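
For comparison, a typical Sqoop batch import (the connection details and table name here are hypothetical) looks something like this:

    sqoop import --connect jdbc:mysql://localhost/test_db \
      --username root --table customers \
      --target-dir /user/hive/warehouse/customers

Each such run re-reads the table over JDBC in bulk, which is exactly the periodic re-import overhead described above.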

On the other hand, the
Hadoop Applier reads from a binary log and inserts data in real
time, applying the events as they
happen on the MySQL server; other queries can therefore continue to
execute unaffected. No bulk transfers required! The Hadoop Applier picks up only the changes and inserts them, which
is a lot faster.

Hadoop Applier can thus be a solution when you need
to rapidly acquire new data from MySQL for real-time processing
within Hadoop.

Introducing The Applier:

It is a method
which replicates events from the MySQL binary
log to provide real time integration of MySQL with Hadoop and
related frameworks which work on top of HDFS. There are many use
cases for the integration of unstructured data stored in Apache
Hadoop and structured data from relational databases such as MySQL.

The Hadoop Applier
provides real-time connectivity between MySQL and Hadoop/HDFS (Hadoop
Distributed File System), which can be used for big data analytics:
for purposes like sentiment analysis, marketing campaign analysis,
customer churn modeling, fraud detection, risk modelling and many
more. You can read more about the role of the Hadoop Applier in big data
in the blog by Mat Keep. Many widely used
systems, such as Apache Hive,
use HDFS as a data store.

The diagram below represents the
integration:

Replication via the Hadoop Applier happens
by reading binary
log events and writing them into a file in HDFS (Hadoop
Distributed File System) as soon as they happen on the MySQL master.
"Events" describe database changes such as table creation
operations or changes to table data.
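
For the Applier to see row events, the master must be started with a binary log in row format and, as noted in the replies below, with checksums disabled. A minimal sketch of the relevant my.cnf settings:

    [mysqld]
    log-bin         = mysql-bin   # enable the binary log
    binlog_format   = ROW         # the Applier reads row events
    binlog_checksum = NONE        # required by the current Applier release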

As soon as an insert query is fired on the
MySQL master, it is passed to the Hadoop Applier. This data is then
written into a text file in HDFS. Once the data is in HDFS files, other Hadoop ecosystem platforms and
databases can consume this data for their own applications.
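
To make the flow concrete with a hypothetical example (database, table and values are illustrative; the warehouse path matches the output shown later in the comments), an insert on the master such as

    INSERT INTO test_db.rep_test VALUES (1, 'hello');

ends up appended as a delimited text row, roughly

    1,hello

in a datafile under /user/hive/warehouse/test_db.db/rep_test/ in HDFS.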

In general, we have a 'customers' table, and our users can register on the website (new records appear in the table); they can also change their statuses, e.g. confirm their accounts or deactivate them (existing records are updated). We need to run reports which gather analytics about users with different statuses, so it's important for us to have the latest updates of users' statuses in Hadoop.

Currently we are trying to use the Sqoop tool (http://sqoop.apache.org/) to set up an incremental import workflow based on a lastmodified_date field, and then merge the latest updates with the initial data in Hadoop. But as of Sqoop 1.4.2 there are a lot of bugs which currently do not allow us to build such a workflow in an automated way (without human interference).
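
For reference, the kind of incremental import we are attempting (table and paths as above, the timestamp is a placeholder) is along these lines:

    sqoop import --connect jdbc:mysql://localhost/test_db \
      --username root --table customers \
      --incremental lastmodified --check-column lastmodified_date \
      --last-value '2013-04-22 00:00:00' \
      --target-dir /user/hive/warehouse/customers_delta

followed by a merge of the delta against the initial data, and this merge step is the part that is hard to automate reliably on 1.4.2.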

So currently we are very interested in the 'updates' feature of the Hadoop Applier.

It would also be great to have the ability to migrate the initial MySQL schema into Hive tables (I mean, without creating the Hive tables by hand in advance), because it's a bit frustrating to recreate (mirror) a whole database schema in Hive by hand if we are going to migrate a couple of databases with hundreds of tables. For instance, Sqoop has the ability to fetch metadata from the db, map column types to appropriate Hive types, and generate the Hive tables by itself.

Please make sure that the MySQL server code is built. MySQL Server generates some header files (which are required by the Hadoop Applier) during compilation of the server code, hash.h being one of them. Hence the error.

Hey there, I have the same problem. What do you mean by "make sure that the MySQL server code is built"? I downloaded the binary version of the MySQL server but couldn't find hash.h in the include dir. Can you assist me with this please?

Unfortunately, downloading the binary version of the MySQL server will not help. You need any one of the following:
- MySQL Server source code (http://dev.mysql.com/downloads/mysql/#downloads , select 'Source Code' from the drop-down menu)

Now, what I meant by "make sure that the MySQL server code is built" is: if you are using the MySQL server code for the headers and libraries, run the following commands on the source code you downloaded:
- cmake .
- make
This will make sure that all the required header files and the library are in place when you include/link to the Hadoop Applier.
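
Put together, a minimal sketch of the sequence (the source directory name is illustrative):

    cd mysql-5.6-source   # the unpacked server source tree
    cmake .               # generate the build files
    make                  # builds the server and generates headers such as hash.h

After this, the Applier build can pick up the headers and libraries from this directory.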

Hope this helps. Please reply on the thread in case you still face issues.

my.cnf file:

# The MySQL database server configuration file.
#
# You can copy this to one of:
# - "/etc/mysql/my.cnf" to set global options,
# - "~/.my.cnf" to set user-specific options.
#
# One can use all long options that the program supports.
# Run program with --help to get a list of available options and with
# --print-defaults to see which it would actually understand and use.
#
# For explanations see
# http://dev.mysql.com/doc/mysql/en/server-system-variables.html

# This will be passed to all mysql clients
# It has been reported that passwords should be enclosed with ticks/quotes
# escpecially if they contain "#" chars...
# Remember to edit /etc/mysql/debian.cnf when changing the socket location.
[client]
port = 3306

# Here is entries for some specific programs
# The following values assume you have at least 32M ram

# This was formally known as [safe_mysqld]. Both versions are currently parsed.
[mysqld_safe]
nice = 0

#
# Instead of skip-networking the default is now to listen only on
# localhost which is more compatible and is not less secure.
bind-address = 0.0.0.0
#
# * Fine Tuning
#
# * IMPORTANT: Additional settings that can override those from this file!
#   The files must end with '.cnf', otherwise they'll be ignored.
#federated

my hadoop config:

<property>
  <name>fs.default.name</name>
  <value>hdfs://amr-Lenovo-G580:54310</value>
  <description>The name of the default file system. A URI whose scheme and
  authority determine the FileSystem implementation. The uri's scheme
  determines the config property (fs.SCHEME.impl) naming the FileSystem
  implementation class. The uri's authority is used to determine the host,
  port, etc. for a filesystem.</description>
</property>

<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/hduser/tmp</value>
</property>

<property>
  <name>dfs.replication</name>
  <value>2</value>
  <description>Default block replication. The actual number of replications
  can be specified when the file is created. The default is used if
  replication is not specified in create time.</description>
</property>

<property>
  <name>dfs.support.broken.append</name>
  <value>true</value>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>amr-Lenovo-G580:54311</value>
  <description>The host and port that the MapReduce job tracker runs at.
  If "local", then jobs are run in-process as a single map and reduce
  task.</description>
</property>

Great that the compilation works fine now! Thanks for trying this out.

For this issue, I request you to check the following:

1. I understand that you cannot see the data in Hive, but is the data replicated into HDFS? i.e., check the HDFS file system (you can do it from the command line or the Hadoop web GUI).

If yes, then check the base dir into which the db structure is created. This should match the data warehouse directory in Hive.

By default, the base dir into which the Hadoop Applier writes is set as /user/hive/warehouse.

The hive configuration file (hive-default.xml.template) should have the property:

<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/user/hive/warehouse</value>
  <description>location of default database for the warehouse</description>
</property>

The value here, "/user/hive/warehouse", should match the value you set as the base dir while running the happlier executable.
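
As a quick command-line check for point 1 (the database and table names here are taken from the example output below), something like this should list the replicated datafiles if replication into HDFS is working:

    hadoop fs -ls /user/hive/warehouse/mysql_replicate_test.db/rep_test

If the files are there but Hive shows no data, the mismatch is between the base dir and Hive's warehouse directory.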

2. Please note that replication starts from the first insert you make into the MySQL table. Note that the executable 'happlier' should be running when the insert is made into the table on the MySQL server.

So when you execute ./happlier mysql://root@localhost:13000 hdfs://amr-Lenovo-G580:54310 the output should be this:

"The default data warehouse directory in HDFS will be set to /user/hive/warehouseChange the default warehouse directory (Y or N)?"

Please give a Y or N option. Suppose you give N; the output will be
"Connected to HDFS File System
The data warehouse directory is set as /user/hive/warehouse."

Now, whenever an insert is done, the success message from happlier will be:
"Written N bytes to datafile in the following directory: hdfs://amr-Lenovo-G580:54310/user/hive/warehouse/mysql_replicate_test.db/rep_test."

Please have a look at the demo here, it might be of some help: http://www.youtube.com/watch?v=mZRAtCu3M1g

In case this does not help, I request you to please paste the output of happlier when a row insert is done on MySQL.

The database I created is there and all, but the connector isn't working.

Here is the output of the executable:

The default data warehouse directory in HDFS will be set to /usr/hive/warehouse
Change the default data warehouse directory? (Y or N) Y
Enter the absolute path to the data warehouse directory : /user/hive/warehouse
Connected to HDFS file system
The data warehouse directory is set as /user/hive/warehouse
...

Nothing gets printed after this, even when I insert into the tables.

I added some debugging code to the connector; basically I print statements after each step of the connection. Here is what I got:

Some comments: Please note that the server you start using mtr does not use the conf settings you might have specified in INSTALLDIR/my.cnf; hence your binlog format settings there are not used here.

You may please use the command Neha specified, or else give it as an option when starting the server:

./mtr --start --suite=rpl --mysqld=--binlog_format=ROW --mysqld=--binlog_checksum=NONE

Also, the debugging code gives the correct output, i.e.

user: root, password: , host: localhost, binlog: , port: 13000

The binlog name is initially empty, and that is expected. This is because the Hadoop Applier does not get the name of the binlog file when it connects to the server and registers as a slave. The name is only sent once the Applier, registered as a slave, requests the binlog dump, i.e. requests a binlog stream from the server (via the COM_BINLOG_DUMP command).

The problem was that when I was trying to insert into MySQL, I was connecting to MySQL on port 3306 rather than port 13000, which caused the binlog event not to be triggered.

That said, I couldn't compile the binlog-browser. I don't really need it now, but here is the error I got:

make binlog-browser
g++ binlog-browser.cpp -o binlog-browser
binlog-browser.cpp: In function ‘bool check_event_db(mysql::Binary_log_event**)’:
binlog-browser.cpp:267:8: error: ‘WRITE_ROWS_EVENT_V1’ was not declared in this scope
binlog-browser.cpp:269:8: error: ‘UPDATE_ROWS_EVENT_V1’ was not declared in this scope
binlog-browser.cpp:271:8: error: ‘DELETE_ROWS_EVENT_V1’ was not declared in this scope
binlog-browser.cpp:280:33: error: ‘STMT_END_F’ is not a member of ‘mysql::Row_event’
binlog-browser.cpp: In function ‘int main(int, char**)’:
binlog-browser.cpp:636:48: error: ‘str_error’ was not declared in this scope
binlog-browser.cpp:688:51: error: ‘str_error’ was not declared in this scope
binlog-browser.cpp:728:38: error: ‘WRITE_ROWS_EVENT_V1’ is not a member of ‘mysql’
binlog-browser.cpp:730:38: error: ‘UPDATE_ROWS_EVENT_V1’ is not a member of ‘mysql’
binlog-browser.cpp:732:38: error: ‘DELETE_ROWS_EVENT_V1’ is not a member of ‘mysql’
binlog-browser.cpp:752:37: error: ‘STMT_END_F’ is not a member of ‘mysql::Row_event’
binlog-browser.cpp:796:21: error: ‘class mysql::Binary_log_event’ has no member named ‘print_long_info’
binlog-browser.cpp:796:52: error: ‘class mysql::Binary_log_event’ has no member named ‘print_event_info’
binlog-browser.cpp:814:18: error: ‘class mysql::Binary_log_event’ has no member named ‘print_long_info’
binlog-browser.cpp:814:49: error: ‘class mysql::Binary_log_event’ has no member named ‘print_event_info’
make: *** [binlog-browser] Error 1

I'm guessing it's not seeing the MySQL include files. Will check it once I get some time.

W.r.t. the binlog-browser, I think the problem isn't a failure to find the header files, but the use of outdated header files (from the previous release) of the applier, namely 'binlog_event.h'.

Please note, the Hadoop Applier is the second release of mysql-replication-listener (you can have a quick look on launchpad here).

I notice that the "no member" errors are all for additions made in the current release.

But in that scenario, I wonder why 'make happlier' doesn't report the same errors. Did you do something different to compile it?

I'm running into a more interesting problem now. When I start the happlier along with a script that parses a large XML file and inserts it into MySQL, the applier exits randomly. Also, when I check the exit status ("echo $?") it prints 0, which means that it was a normal exit.

Is there any case where the happlier will exit with no external interaction? Or exit at all, for that matter?

Sorry for the late reply. No, the happlier is a continuous process, and ideally it should not exit on starting another process which inserts data into MySQL. If it does, there should be an error code or message.

Thank you for the interest in the product. Yes, there are workarounds possible. The only change from MySQL 5.5 affecting the applier is the new field types supporting fractional timestamps, added in version 5.6.

1. A quick and short way, if you require compatibility with MySQL 5.5 only (this would work for 5.6 too, but these fields will not be supported), is to apply this patch:

2. A better way, however, is to have support for both, which requires detecting the library version (libmysqlclient) at build time and using it as a flag for conditional compilation of the above field types.

For Binary_log_event objects, is the object pointed to by m_header leaked? An object is allocated as "m_waiting_event" in Binlog_tcp_driver::wait_for_next_event, and is then passed to the event, but I don't see where it might be deleted.

Thank you for pointing out the issue, and thank you for trying out the applier!

In the case of m_header, it is an object in Binary_log_event, so the memory is freed when the destructor is called. The caller has the responsibility to free the memory used to set m_header. In the current code, the tcp_driver sets it using m_waiting_event (memory allocated on the heap), and the file_driver sets it using m_event_log_header (memory allocated on the stack).

There is indeed a memory leak when the tcp_driver is used to connect to the MySQL server (namely m_waiting_event). It should be assigned memory in the constructor(s) of Binlog_tcp_driver instead of in wait_for_next_event, and freed in the disconnect method.

A patch to address this has been committed, and will be published in the next release.

Thank you for trying out the Applier. - MySQL 5.6.2: Great! When you run this along with the Applier, make sure to set binlog_checksum=NONE before starting the server.

- Hadoop version: You can use any. I have tested it with 1.0.4, but you can try the latest stable versions. From the link you provided, you can download any one which suits your platform. You may need to edit the file 'FindHDFS.cmake', if necessary, so that HDFS_LIB_PATHS is set to a path containing libhdfs.so, and HDFS_INCLUDE_DIRS points to the location of hdfs.h.

For 1.x versions, the library path is $ENV{HADOOP_HOME}/c++/Linux-i386-32/lib, and the header files are contained in $ENV{HADOOP_HOME}/src/c++/libhdfs. For 2.x releases, the libraries and header files can be found in $ENV{HADOOP_HOME}/lib/native and $ENV{HADOOP_HOME}/include respectively. (Details in part 2 of this blog: http://innovating-technology.blogspot.com/2013/04/mysql-hadoop-applier-part-2.html )
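
As a sketch for a 2.x tarball install (the install path is hypothetical), the two locations FindHDFS.cmake needs are:

    export HADOOP_HOME=/usr/local/hadoop
    ls $HADOOP_HOME/lib/native/libhdfs.so   # candidate for HDFS_LIB_PATHS
    ls $HADOOP_HOME/include/hdfs.h          # candidate for HDFS_INCLUDE_DIRS

with the two paths copied into FindHDFS.cmake if cmake does not find them on its own.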

We run a large production environment with OLTP in MySQL, using application-level sharding. We also use statement-based replication. Currently, we use MySQL backups to bulk load HDFS every day, but this takes a long time, and we're looking for real-time options. Because we use OLTP SQL, we need update and delete statement support, and because we're using statement-based replication (row-based has a bug which makes it not work for our particular use case), the Hadoop Applier doesn't currently do much for us. What is the status of the project? Is it being worked on actively? Is there a roadmap?

Sorry for the delay in the reply. Thank you for sharing your use case with us.

Making the Applier work with statement-based replication is difficult for us, because the rows are not guaranteed to be the same when they are replicated to the binary logs, especially for unsafe queries. Also, we would like to know the RBR-specific bug which is blocking your work; that will help us get a better insight.

We have considered adding update and delete, but there are no concrete plans yet.

Sorry, I tried, but I am not able to reproduce the problem. Can you please give the exact steps you followed to build the code? Did you use an IDE? Request you to please mention the directory path in which you executed the build command.

Thank you,
Shubhangi

P.S. Please note this might be a bug, and you may report it on bugs.mysql.com, under the category 'MySQL Server: Binlog'.

The exception says 'NoClassFound', and I suspect the classpath is not set correctly. For Hadoop versions 2.0.0 and above, the classpath doesn't support wildcard characters. If you add the jars explicitly to the CLASSPATH, your app will work.
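
A minimal sketch of doing that in the shell (assuming a standard Hadoop 2.x tarball layout under $HADOOP_HOME):

    # add every Hadoop jar to the classpath explicitly instead of using wildcards
    for jar in $HADOOP_HOME/share/hadoop/common/*.jar \
               $HADOOP_HOME/share/hadoop/common/lib/*.jar \
               $HADOOP_HOME/share/hadoop/hdfs/*.jar; do
      CLASSPATH=$CLASSPATH:$jar
    done
    export CLASSPATH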

The applier runs for a few minutes and then I get the following error:

Written 215 bytes to datafile in the following directory: hdfs://localhost:8020/user/hive/warehouse/test_db.db/test_insert
Written 319 bytes to datafile in the following directory: hdfs://localhost:8020/user/hive/warehouse/test_db.db/test_insert
14/05/15 14:55:39 INFO hdfs.DFSClient: Exception in createBlockOutputStream
java.io.EOFException: Premature EOF: no length prefix available
    at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1987)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1344)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1193)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:531)
14/05/15 14:55:39 WARN hdfs.DFSClient: DataStreamer Exception
java.lang.NullPointerException
    at org.apache.hadoop.hdfs.DFSOutputStream$Packet.writeTo(DFSOutputStream.java:279)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:579)
FSDataOutputStream#close error:
java.io.IOException: All datanodes 127.0.0.1:50010 are bad. Aborting...
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1127)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:531)
hdfsOpenFile(datafile1.txt): FileSystem#append((Lorg/apache/hadoop/fs/Path;)Lorg/apache/hadoop/fs/FSDataOutputStream;) error:
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException): failed to create file /user/hive/warehouse/test_db.db/test_insert/datafile1.txt for DFSClient_NONMAPREDUCE_1866428327_1 on client 127.0.0.1 because current leaseholder is trying to recreate file.
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2458)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInternal(FSNamesystem.java:2340)
...
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.append(DistributedFileSystem.java:316)
    at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:1161)
Failed to open datafile1.txt for writing!
hdfsOpenFile(datafile1.txt): FileSystem#append((Lorg/apache/hadoop/fs/Path;)Lorg/apache/hadoop/fs/FSDataOutputStream;) error:
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException): failed to create file /user/hive/warehouse/test_db.db/test_insert/datafile1.txt for DFSClient_NONMAPREDUCE_1866428327_1 on client 127.0.0.1 because current leaseholder is trying to recreate file.

At the same time I see errors in the namenode log file:

2014-05-15 14:55:39,128 WARN org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.append: failed to create file /user/hive/warehouse/test_db.db/test_insert/datafile1.txt for DFSClient_NONMAPREDUCE_1866428327_1 on client 127.0.0.1 because current leaseholder is trying to recreate file.
2014-05-15 14:55:39,128 WARN org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:mysql (auth:SIMPLE) cause:org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /user/hive/warehouse/test_db.db/test_insert/datafile1.txt for DFSClient_NONMAPREDUCE_1866428327_1 on client 127.0.0.1 because current leaseholder is trying to recreate file.

Hi Karan, sorry for the delay in the reply. It seems the datanode did not allow the transfer. I am not sure of the root cause: can you first try using "hdfs dfs -put" for the same path, to verify? If it works fine, that implies the Applier is facing some issue.
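
For example, a minimal check along those lines (the probe file is hypothetical; the target directory is taken from your error output):

    echo "1,test" > /tmp/probe.txt
    hdfs dfs -put /tmp/probe.txt /user/hive/warehouse/test_db.db/test_insert/

If the put itself fails, the problem lies in the HDFS/datanode setup rather than in the Applier.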

Other users have reported the same issue. One of them resolved it on a Linux VM by manually linking $JAVA_HOME/jre/lib//xawt/libmawt.so to $JAVA_HOME/jre/lib/libmawt.so.

Also, can you please check the following:
1. Do you have JAVA_HOME set?
2. Do you have CLASSPATH set to point to the jars required to run Hadoop itself? (command: export CLASSPATH=$(hadoop classpath) )
3. Can you please try running Hadoop and check if it runs fine?

Maybe installing the Oracle JDK (I use 1.7.0_03) instead of OpenJDK would help.

Hi, does the Hadoop Applier support PARTITIONs on Hive? Currently we use Sqoop plus a shell script to replicate data from MySQL to HDFS (Hive), and we add partitions based on date and hour. This is necessary for us to use Impala; otherwise Impala may crash for lack of memory.

I am not sure whether the Hadoop Applier meets our needs or provides a way to run a shell script along with it.
