Data Streaming with MariaDB

While companies across the globe use Big Data to solve their analytical problems, it can be a hassle to extract data from a bunch of data sources, apply the necessary transformations, and eventually load it all into an analytical platform such as Hadoop.

This obviously takes time and effort. Instead of a collection of ETL jobs, MariaDB provides a data streaming solution directly from OLTP MariaDB TX to OLAP MariaDB AX.

Fast real-time data streaming is achieved with the help of a CDC (Change Data Capture) framework that streams data from MariaDB to MariaDB ColumnStore, using MariaDB MaxScale 2.2.x as a bin-log router. MaxScale reads the binary logs from a MariaDB TX server and streams the data directly to MariaDB AX (MariaDB ColumnStore) for analytics.

The Setup

Requirements

Here is a quick look at the setup that we are going to be working on. I am using Oracle VirtualBox virtual machines running CentOS 7:

1x CentOS 7 VM for MariaDB TX 3.0

1x CentOS 7 VM for MaxScale as a Replication Router

1x CentOS 7 VM for CDC Adapter + Connector and ColumnStore API

1x CentOS 7 VM for ColumnStore. We will be using a single-node “combined” ColumnStore setup for simplicity; refer to the MariaDB ColumnStore Installation Guide page for detailed distributed setup instructions using CentOS 7 VMs.

Our setup will look like this:

MariaDB (192.168.56.101)

MaxScale (192.168.56.102)

CDC Server (192.168.56.103)

ColumnStore (192.168.56.104)

Note: We will assume that you already have a MariaDB (Source of our data streaming) and ColumnStore (Data’s final destination) readily available for use.

Preparing the VM OS

There are a few important things that are required before we start the installations.

Note: the following steps must be performed and validated on all the VMs

Disable SELinux

For the purposes of testing, we want SELinux disabled. Make sure that your SELinux configuration, in the file /etc/selinux/config, looks something like this on all the nodes:

# This file controls the state of SELinux on the system.
# SELINUX= can take one of these three values:
# enforcing - SELinux security policy is enforced.
# permissive - SELinux prints warnings instead of enforcing.
# disabled - No SELinux policy is loaded.
SELINUX=disabled
# SELINUXTYPE= can take one of these two values:
# targeted - Targeted processes are protected,
# minimum - Modification of targeted policy. Only selected processes are protected.
# mls - Multi Level Security protection.
SELINUXTYPE=targeted

The change here is, of course, the SELINUX=disabled setting.

After a reboot of the server, check that SELinux has actually been disabled; use either of the two commands (sestatus or getenforce) to confirm.
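For example, a quick check after the reboot; both commands should report a disabled state:

```shell
# Verify SELinux is disabled after the reboot
sestatus      # should report: SELinux status: disabled
getenforce    # should print: Disabled
```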

We will be using the curl script to set up the MariaDB repositories. On production environments, internet access is normally not available; in that case, we can download the RPMs externally and transfer the files to the servers using your favorite secure file transfer tool.
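As a sketch, the repository setup script can be fetched and run like this (the URL is MariaDB's published download location at the time of writing; verify it against the current documentation before use):

```shell
# Download and run the MariaDB repository setup script (requires internet access)
curl -sS https://downloads.mariadb.com/MariaDB/mariadb_repo_setup | sudo bash
```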

Install CDC Server

Setup the Repository

Log in to the CDC server (192.168.56.103). We will need to set up two extra repositories that are currently not added automatically by the curl script. The CDC Adapter repository, however, is taken care of by it.

Setup Communication Between CDC and ColumnStore

Now that the CDC Adapter, API, and Connectors have been installed, we can start to set up communication between the CDC and ColumnStore servers.

In this exercise we are using a single-instance ColumnStore; in the case of a distributed install, we would work with the ColumnStore UM1 node.

ColumnStore Configuration

Connect to the CDC server using an ssh client and pull /home/mysql/mariadb/columnstore/etc/Columnstore.xml from the ColumnStore server using scp, rsync, or any other method available for file transfer between the servers. Columnstore.xml should be downloaded to /etc and owned by the root user.

The file can be downloaded from any ColumnStore node, as Columnstore.xml is automatically synchronized between all the nodes. Use mariadb as the remote user to connect to ColumnStore.

As this is the first time connecting to ColumnStore from the CDC server, Linux will ask for a yes/no host-key confirmation and for the mariadb user's password.

Once the file is downloaded, ensure that it has 644 permissions so that everyone can read it.
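Run on the CDC server, the transfer and permission steps above might look like this, assuming the IP addresses from our setup:

```shell
# Pull Columnstore.xml from the ColumnStore node as the mariadb user
scp mariadb@192.168.56.104:/home/mysql/mariadb/columnstore/etc/Columnstore.xml /etc/Columnstore.xml

# Make the file owned by root and readable by everyone
sudo chown root:root /etc/Columnstore.xml
sudo chmod 644 /etc/Columnstore.xml
```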

The following events are captured in the binary logs of the master MariaDB server:

INSERT 3 Records to cdc_tab.

UPDATE col for a row in cdc_tab.

INSERT 3 Records to cdc_tab.

DELETE a row from cdc_tab.
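The events above could be produced with statements along these lines; the column names (id, col) are illustrative and would need to match your actual cdc_tab definition:

```shell
# Generate some binlog events on the master (table columns are illustrative)
mysql -u root -p cdc_test <<'SQL'
INSERT INTO cdc_tab (col) VALUES ('a'), ('b'), ('c');
UPDATE cdc_tab SET col = 'updated' WHERE id = 2;
INSERT INTO cdc_tab (col) VALUES ('d'), ('e'), ('f');
DELETE FROM cdc_tab WHERE id = 1;
SQL
```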

Setup MaxScale

Log on to the MaxScale server and edit the /etc/maxscale.cnf file: remove everything under the [maxscale] threads=auto section and add the following configuration. This will set up MaxScale for data streaming to ColumnStore using an Avro listener.

Take note of master_id in the [replication-router] section; it points to the master MariaDB server's ID that we defined earlier while setting up MariaDB as a master, while server_id is MaxScale's own ID as a replication slave.

# Replication configuration that points to a particular server_id and binlog
[replication-router]
type=service
router=binlogrouter
user=maxuser
passwd=maxpwd
server_id=2
master_id=1
binlogdir=/var/lib/maxscale
mariadb10-compatibility=1
filestem=mariadb-bin

# Replication listener that will listen to the Master DB using the port 6603
[replication-listener]
type=listener
service=replication-router
protocol=MySQLClient
port=6603

# Avro service that will generate JSON files from the bin-logs that were
# received from the Master DB and store them in the <avrodir> using the
# replication-router
[avro-router]
type=service
router=avrorouter
source=replication-router
avrodir=/var/lib/maxscale

# Avro listener that is used by the Avro router on a specific port to be used by CDC
[avro-listener]
type=listener
service=avro-router
protocol=cdc
port=4001

[replication-router]

This section defines a bin-log router from Master MariaDB to MaxScale Slave using replication user maxuser which was created earlier.

[replication-listener]

This is the mysql client listener service. Any server with a MariaDB client can connect to MaxScale on port 6603, as specified in this section. We will connect to MaxScale later using this port to set it up as a REPLICATION SLAVE.

[avro-router]

This is the router service that routes the bin-log data into Avro (JSON) files. CDC will use these Avro files to generate bulk loading scripts for the ColumnStore database.

[avro-listener]

The Avro listener uses the avro-router service; this listener is used by the CDC Adapter to get the Avro files and stream them to ColumnStore.

Once /etc/maxscale.cnf has been modified on the MaxScale server, restart MaxScale service.

Now we can connect to the MaxScale MariaDB service using the installed client and set up MaxScale as a REPLICATION SLAVE: use the proper master MariaDB IP in CHANGE MASTER TO MASTER_HOST, followed by START SLAVE and SHOW SLAVE STATUS to ensure the slave is running without any issues.
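A minimal sketch of these steps, assuming the maxuser/maxpwd replication credentials from the configuration; the binlog file name and position are placeholders that must match your master's actual state (check SHOW MASTER STATUS on the master):

```shell
# Connect to MaxScale's MySQL-protocol listener and configure it as a slave
mysql -u maxuser -pmaxpwd -h 192.168.56.102 -P 6603 <<'SQL'
CHANGE MASTER TO
  MASTER_HOST='192.168.56.101',
  MASTER_PORT=3306,
  MASTER_USER='maxuser',
  MASTER_PASSWORD='maxpwd',
  MASTER_LOG_FILE='mariadb-bin.000001',  -- placeholder: use your master's file
  MASTER_LOG_POS=4;                       -- placeholder: use your master's position
START SLAVE;
SHOW SLAVE STATUS\G
SQL
```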

This setup is just like setting up MariaDB master/slave replication; the only difference in this case is that we are setting up MaxScale as a bin-log router slave. Take note that we have used the -h 192.168.56.102 argument to pass in MaxScale's IP address and the -P 6603 argument to pass in the [replication-listener] port from the /etc/maxscale.cnf file. That port belongs to the mysql client service, which is why we are able to connect to MaxScale's MariaDB interface.

-P is the port on which the MaxScale [avro-listener] service is listening

cdc_test is the database name on the source (MariaDB) and target (ColumnStore)

cdc_tab is the table name on the source (MariaDB) and target (ColumnStore)
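Putting those pieces together, the adapter invocation looks roughly like this; the CDC credentials are assumptions for illustration (a CDC user must be created on the MaxScale side first), and you should check mxs_adapter --help for the exact flags of your version:

```shell
# Stream cdc_test.cdc_tab from MaxScale's CDC listener (port 4001) into ColumnStore
# cdcuser/cdcpassword are placeholder CDC credentials
mxs_adapter -h 192.168.56.102 -P 4001 -u cdcuser -p cdcpassword cdc_test cdc_tab
```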

The target database/table should already be created before starting mxs_adapter; if not, mxs_adapter will provide a script to create the table in ColumnStore automatically. Let's start the service and see what happens.

There is another problem! Our source table had a serial column, which MariaDB treats as a primary key. Since ColumnStore does not support primary keys / indexes, we will need to change the CREATE TABLE script before executing it in the ColumnStore DB.
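For instance, the edited script might look something like the following sketch; the data columns are illustrative, and the exact metadata columns come from the script that mxs_adapter generates for your table:

```shell
# Create the target table in ColumnStore with the PRIMARY KEY clause removed
mysql -u root -p cdc_test <<'SQL'
CREATE TABLE cdc_tab (
  id INT,                 -- serial column from the source, minus the PRIMARY KEY
  col VARCHAR(100),       -- illustrative data column
  domain INT,             -- CDC event metadata columns added by the
  server_id INT,          -- generated script (names illustrative)
  sequence INT,
  event_number INT,
  event_type VARCHAR(50),
  `timestamp` INT
) ENGINE=columnstore;
SQL
```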

Another thing to note here is that the table structure in ColumnStore is quite different from the source. There are a few additional columns. This table is an event table that captures all the INSERT, UPDATE and DELETE events. We can use this data to identify the latest row for each specific ID and run some analytics.

One more thing to notice: at the moment, one mxs_adapter instance can only handle one table at a time. This prevents streaming entire MariaDB databases to ColumnStore. Instead, we could create aggregate tables on MariaDB TX and stream that data to ColumnStore for a better analytics use case.

Let’s start the mxs_adapter once more and see if it can stream the data from our source MariaDB TX to target MariaDB AX.

We can see the INSERT, UPDATE (before and after) and DELETE events. The SEQUENCE column indicates the sequence in which these events were triggered, and the timestamp column indicates the time at which each event took place.

Avro files

While setting up MaxScale we specified avrodir as /var/lib/maxscale. This can be any other location.

In case of issues with the data streaming, one can try restarting the MaxScale service. If it still fails, as a last resort one can delete *.avsc, *.avro, avro.index and avro-conversion.ini from the avrodir and restart the MaxScale service. It should then be able to recover.

Since the mxs_adapter is running, as and when new data is inserted into the cdc_test.cdc_tab table, it will automatically be streamed into ColumnStore.
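To see this in action, insert a row on the master and query the event table on ColumnStore (credentials and column names are assumptions matching the earlier examples):

```shell
# On the master (192.168.56.101): insert a new row
mysql -u root -p -h 192.168.56.101 \
  -e "INSERT INTO cdc_test.cdc_tab (col) VALUES ('streamed');"

# On ColumnStore (192.168.56.104): the new event should appear shortly
mysql -u root -p -h 192.168.56.104 \
  -e "SELECT * FROM cdc_test.cdc_tab ORDER BY sequence DESC LIMIT 5;"
```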

Conclusion

Using this setup, we can stream data directly from OLTP MariaDB TX not only to MariaDB AX but also to other targets that can consume Avro/JSON data. A Kafka adapter is already available on the Data Adapters download page.