I was interested to hear about semi-sync replication improvements
in MySQL’s 5.7.4 DMR release and decided to check it out. I
previously blogged about semi-sync and
was pretty disappointed by its performance across WAN
distances back then, particularly with many client threads.

While Shard-Query can work over multiple nodes, this blog post
focuses on using Shard-Query with a single node.
Shard-Query can add parallelism to queries that
use partitioned tables. Very large
tables can often be partitioned fairly easily. Shard-Query can
leverage partitioning to add parallelism, because each partition
can be queried independently. Because MySQL 5.6 supports the
partition hint, Shard-Query can add parallelism to any
partitioning method (even subpartitioning) on 5.6, but it is limited
to RANGE/LIST partitioning methods on earlier versions.
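
To make the idea concrete, here is a minimal Python sketch of fanning a query out over partitions and combining the partial results. It is not how Shard-Query itself is implemented (Shard-Query uses Gearman workers); the function names, the `lineorder` table, the `lo_revenue` column and the partition names are all assumptions for illustration, and `fake_run_query` stands in for a real MySQL connection. The only MySQL feature it relies on is the 5.6 `PARTITION (...)` selection syntax.

```python
from concurrent.futures import ThreadPoolExecutor


def partition_queries(table, partitions, select_expr, where=None):
    """Build one query per partition using MySQL 5.6 partition selection."""
    clause = f" WHERE {where}" if where else ""
    return [
        f"SELECT {select_expr} FROM {table} PARTITION ({name}){clause}"
        for name in partitions
    ]


def parallel_sum(run_query, queries, max_workers=4):
    """Run the per-partition queries concurrently and add the partial sums."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(run_query, queries))
    # SUM() over an empty partition yields NULL (None), so skip those.
    return sum(p for p in partials if p is not None)


# Hypothetical stand-in for a database: partition name -> values in that partition.
fake_partitions = {"p0": [1, 2, 3], "p1": [10, 20], "p2": []}


def fake_run_query(sql):
    """Pretend to execute the SQL by summing the rows of the named partition."""
    name = sql.split("PARTITION (")[1].split(")")[0]
    rows = fake_partitions[name]
    return sum(rows) if rows else None


queries = partition_queries("lineorder", list(fake_partitions), "SUM(lo_revenue)")
total = parallel_sum(fake_run_query, queries)
```

With a real server, `run_query` would open its own connection per call, since each partition query has to run on a separate connection to execute in parallel.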

The output shown from Shard-Query is from the command-line client, but
you can use MySQL Proxy to communicate with Shard-Query too.

In the examples I am going to use the schema from the Star Schema
Benchmark. I generated data for scale factor 10, which
means about 6GB of data in the largest table. I am going to show
a few different queries, and …

Here are results for Shard-Query 2.0 Beta 1* on the Star Schema Benchmark at scale factor
10. In the comparison below the “single threaded” response
times for InnoDB are the response times reported in
my previous test which did not use
Shard-Query.

Shard-Query configuration

Shard-Query has been configured to use a single host. The
Shard-Query configuration repository is stored on the host.
Gearman is also running on the host, as are
the Gearman workers. In short, only one host is involved in
the testing.

It is finally here. After three years of
development, the new version of Shard-Query is available for broad
testing.

This new version of Shard-Query is vastly improved over previous
versions in many ways. This is in large part because
the previous version of Shard-Query (version 1.1)
went into production at a large company. Their feedback
during implementation was invaluable in building the new
Shard-Query features. The great thing is that this means
that many of the new 2.0 features have already been tested in at
least one production environment.

This post is intended to highlight the new features in
Shard-Query 2.0. I will be making posts about individual
features as well as posting benchmark results.

Earlier this week we all read GigaOM's article with this title:
"Why the days are numbered for Hadoop as we know it". I know GigaOM
likes to provoke scandals sometimes (we all remember some other
unforgettable piece), but there is something behind
it...

Hadoop today (after SOA not so long ago) is one of the worst cases
of an abused buzzword ever known to man. It's everything,
everywhere, can cure illnesses and do "big-data" at the same
time! Wow! Actually, Hadoop is a software framework that
supports data-intensive distributed applications, derived from
Google's MapReduce and Google File System (GFS) papers.

My take from the article is this: Hadoop is a foundation,
low-level platform. I used the word …

In my previous post I covered the shared-disk paradigm's pros
and cons, and concluded that it cannot really
qualify as a scale-out solution when it comes to massive OLTP,
big data, high session counts and a mixture of reads and
writes.

Read/write splitting is achieved when numerous
replicated database servers are used for reads. This way the
system can scale to cope with an increase in concurrent load. This
solution qualifies as a scale-out solution because it
allows expansion beyond the boundaries of one DB: the DB
machines are shared-nothing, and more can be added as slaves to the
replication "group" when required.
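
As a rough illustration of the routing decision, here is a minimal Python sketch. The class name, the server labels and the round-robin policy are assumptions made for the example, not a description of any particular product. It also ignores real-world complications such as replication lag, transactions and locking reads like `SELECT ... FOR UPDATE`, which would need to go to the master.

```python
import itertools


class ReadWriteRouter:
    """Send reads to replicas in round-robin order and everything else to the master."""

    READ_VERBS = ("SELECT", "SHOW")

    def __init__(self, master, replicas):
        self.master = master
        self.replicas = list(replicas)
        self._cycle = itertools.cycle(self.replicas) if self.replicas else None

    def add_replica(self, replica):
        """Grow the replication group; the round-robin order restarts."""
        self.replicas.append(replica)
        self._cycle = itertools.cycle(self.replicas)

    def route(self, sql):
        """Return the server that should execute this statement."""
        words = sql.split()
        verb = words[0].upper() if words else ""
        if verb in self.READ_VERBS and self._cycle is not None:
            return next(self._cycle)
        return self.master


router = ReadWriteRouter("master", ["replica1", "replica2"])
statements = ["SELECT 1", "INSERT INTO t VALUES (1)", "select 2", "SELECT 3"]
targets = [router.route(s) for s in statements]
```

Adding capacity for reads is then just a call to `add_replica`, which mirrors the point above that new slaves can join the replication group when required.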

Scale challenges in the analytics world come from the growing
amounts of data. Most solutions have been leveraging three main
aspects: columnar storage, RAM and parallelism.
Columnar storage makes scans and data filtering more precise and
focused. After that, it all comes down to I/O: the faster
the I/O is, the faster the query will finish and bring results.
Faster disks and SSDs can play a good role, but above all: RAM! …

TokuDB has a big advantage over B-trees when trickle loading data
into existing tables. However, it is possible to preprocess the
data when bulk loading into empty tables or when new indexes are
created. TokuDB release 4 now uses a parallel algorithm to speed
up these types of bulk insertions. How does the parallel loader
performance compare with the serial loader? We use the Air
Traffic Control (ATC) data and queries described in a Percona blog and also used in
an experiment with TokuDB 2.1.0 to gain some
insight.

Our ATC data is about 122M rows in size, is stored in a 40GiB CSV
file, and can be found in our Amazon S3 public bucket. See the
end of this blog for details. We …

At Kscope this year, I attended a half day in-depth session
entitled Data Warehousing Performance Best Practices, given by
Maria Colgan of Oracle. My impression, which was confirmed by folks
in the Oracle world, is that she knows her way around the Oracle
optimizer.

These are my notes from the session, which include comparisons of
how Oracle works (which Maria gave) and how MySQL works (which I
researched to figure out the difference, which is why this blog
post took a month after the conference to write). Note that I am
not an expert on data warehousing in either Oracle or MySQL, so
these are more concepts to think about than hard-and-fast advice.
In some places, I still have questions, and I am happy to have
folks comment and contribute what they know.

One interesting point brought up:
Maria quoted someone (she said the name but I did not grab it)
from …

Content reproduced on this site is the property of the respective copyright holders.
It is not reviewed in advance by Oracle and does not necessarily represent the opinion
of Oracle or any other party.