
Friday, July 1, 2016

While starting up an Apache Cassandra 1.2 instance, I noticed the following error in the log.

INFO 10:38:23,334 Opening /var/lib/cassandra/data/MYKEYSPACE/MYCOLUMNFAMILY/MYKEYSPACE-COLUMNFAMILY-hf-2508 (2275767 bytes)
ERROR 10:38:23,467 Exception in thread Thread[SSTableBatchOpen:2,5,RMI Runtime]
org.apache.cassandra.io.sstable.CorruptSSTableException: java.io.EOFException
at org.apache.cassandra.io.compress.CompressionMetadata.<init>(CompressionMetadata.java:108)
at org.apache.cassandra.io.compress.CompressionMetadata.create(CompressionMetadata.java:63)
at org.apache.cassandra.io.util.CompressedPoolingSegmentedFile$Builder.complete(CompressedPoolingSegmentedFile.java:42)
at org.apache.cassandra.io.sstable.SSTableReader.load(SSTableReader.java:418)
at org.apache.cassandra.io.sstable.SSTableReader.open(SSTableReader.java:209)
at org.apache.cassandra.io.sstable.SSTableReader.open(SSTableReader.java:157)
at org.apache.cassandra.io.sstable.SSTableReader$1.run(SSTableReader.java:273)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException
at java.io.DataInputStream.readUnsignedShort(DataInputStream.java:340)
at java.io.DataInputStream.readUTF(DataInputStream.java:589)
at java.io.DataInputStream.readUTF(DataInputStream.java:564)
at org.apache.cassandra.io.compress.CompressionMetadata.<init>(CompressionMetadata.java:83)
... 11 more

You may have noticed that the SSTable version is hf, which belongs to Cassandra 1.1: this node had just been upgraded to Cassandra 1.2, and this was its first 1.2 boot-up.
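As the log line shows, the SSTable version string is embedded right in the file name, so if you want to check how many SSTables on disk are still in the old 1.1 format, a quick listing will do (a sketch, using the data path from the log above):

$ ls /var/lib/cassandra/data/MYKEYSPACE/MYCOLUMNFAMILY/ | grep -- '-hf-'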

Tracing the stack trace above through the Cassandra 1.2 source code, it turns out the compression metadata could not be opened due to file corruption. I tried nodetool upgradesstables, nodetool scrub, and restarting the Cassandra instance, but the error persisted. I guess in this case nothing can really help, so I ended up stopping the Cassandra instance, removing this data SSTable together with its companion SSTable files, and starting it up again. The error was gone, and I ran a repair afterwards.
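For the record, the remediation went roughly like this. Treat it as a sketch rather than a recipe: the service name, data path, and generation number 2508 come from my environment and the log above, so adjust them to yours.

$ sudo service cassandra stop
# remove the corrupt SSTable together with all its companion components
# (Data, Index, CompressionInfo, Filter, Statistics, ...)
$ rm /var/lib/cassandra/data/MYKEYSPACE/MYCOLUMNFAMILY/MYKEYSPACE-COLUMNFAMILY-hf-2508-*
$ sudo service cassandra start
# stream the removed data back from the replicas
$ nodetool -h localhost repair MYKEYSPACE MYCOLUMNFAMILY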

Sunday, June 19, 2016

Recently I got the opportunity to upgrade a production Cassandra cluster from 1.1.12 to 1.2.19, and in the midst of the upgrade I noticed the following in the log file during boot-up of a Cassandra 1.2 instance.

As the logging level was WARN, I did not worry that much. Going into the code, it turns out that Cassandra 1.2 added a metric known as ConnectionMetrics. This metric lives under the JMX domain org.apache.cassandra.metrics, with type Connection and name Timeouts. It is not available in Cassandra 1.1.
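If you want to eyeball this metric yourself, it is exposed over JMX; a quick way (assuming the default JMX port 7199 and a node running locally) is:

$ jconsole localhost:7199
# then browse to the MBean:
#   org.apache.cassandra.metrics -> Connection -> <peer address> -> Timeouts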

Saturday, June 18, 2016

Recently I was assigned a project to upgrade Cassandra from 1.1 to 1.2 (I know it is ancient Cassandra, but who cares? We just want it to work, and Cassandra delivers just that), and one of the main features of Cassandra 1.2 is virtual nodes.

Although there is a red warning note in this instruction, I took some time to investigate it, knowing that we do not enable bleeding-edge technology or home-grown customizations of the Cassandra code. If you are selecting Cassandra 1.2 for your upgrade and want to try the virtual nodes upgrade as well, choose a version earlier than 1.2.19. Why? Read here: https://github.com/apache/cassandra/blob/cassandra-1.2.19/NEWS.txt#L19-L23

I started a three-node Cassandra 1.2.18 cluster in a sandbox environment where I could safely test the upgrade from 1.1 to 1.2 and, after that, the upgrade to virtual nodes.
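The vnode migration itself went roughly like this. Treat it as a sketch rather than a recipe: the cassandra.yaml path depends on your distribution, and cassandra-shuffle is the shuffle utility that ships with 1.2.18.

# on every node: switch to 256 virtual tokens, then restart
# (leave initial_token in cassandra.yaml unset/commented out)
$ sudo vi /etc/cassandra/cassandra.yaml    # set: num_tokens: 256
$ sudo service cassandra restart

# from any one node: schedule the range relocations, then start them
$ cassandra-shuffle create
$ cassandra-shuffle enable

# watch the number of pending relocations; it should drain to zero
$ cassandra-shuffle ls | wc -l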

As you can read above, I created a shuffle and enabled it. The tokens changed to 256 per node, and the count of pending shuffles started coming down. I thought, hey man, this can actually work! I happily announced to the team that it looked like we would be able to migrate to Cassandra vnodes.

However, the next morning, when I checked the upgrade progress, oh gosh, the shuffle seemed to have gone into a loop.

The count of pending shuffles stayed at 744, so unfortunately we have to stay with the non-vnode setup. If you have had a successful virtual nodes upgrade, please leave a comment below with the version path you took and the shuffle steps you used to successfully upgrade your C* cluster to vnodes.

I end this article with the steps I have taken. If you intend to upgrade to vnodes, I suggest you don't waste time: you might as well spin up a new cluster if more and more of the in-place upgrade turns out not to be possible. What comes to mind now is the partitioner change (random to murmur3) and the vnodes technology, both of which are far simpler on a fresh cluster.

Friday, June 17, 2016

Today, we will again look into MongoDB, this time on the specific topic of aggregation.

Aggregation operations process data records and return computed results. They group values from multiple documents together and can perform a variety of operations on the grouped data to return a single result. MongoDB provides three ways to perform aggregation: the aggregation pipeline, the map-reduce function, and single-purpose aggregation methods.

Let's start with a sample aggregation on the zip code data set. Importing 29,353 objects finished within a second. Blazing fast, maybe because I'm using an SSD, heh.
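For reference, the import and the first query looked roughly like this (a sketch, assuming the zips.json sample data set from the MongoDB aggregation tutorial and a database named test):

$ curl -O http://media.mongodb.org/zips.json
$ mongoimport --db test --collection zipcodes --file zips.json

# states with a total population of ten million or more
$ mongo --quiet test <<'EOF'
printjson(db.zipcodes.aggregate([
  { $group: { _id: "$state", totalPop: { $sum: "$pop" } } },
  { $match: { totalPop: { $gte: 10 * 1000 * 1000 } } }
]).toArray())
EOF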

All the queries in the examples work. It's amazing that all three bring back results within a second! Whilst this is a short article to convince you to use aggregation in MongoDB, if you have been convinced, you should really try the following useful links too.

Sunday, June 5, 2016

Once I was asked by a company what the data structures in Java are, and I was not prepared at all. But as usual, why bother remembering every detail when we can google and start to read? It turns out the answer they were seeking was the Java collections framework, and the next question followed: what are the characteristics of the collections?

Well, to be really honest, who memorizes every detail when we can read the javadoc? Anyway, I recently found this chart circulating on Facebook, which reminded me of the questions asked. I thought it was helpful: we should not memorize every fine detail; the essential point is that you know where to get the material and are willing to share.

So here goes!

This is a short article, and I hope you find it useful as a daily coding reference rather than for answering funny interview questions. Haha!

Saturday, June 4, 2016

Sector/Sphere is an open source software suite for high-performance distributed data storage and processing. It can be broadly compared to Google's GFS/MapReduce stack. Sector is a distributed file system targeting data storage over a large number of commodity computers. Sphere is the programming framework that supports massive in-storage parallel data processing for data stored in Sector. Additionally, Sector/Sphere is unique in its ability to operate in a wide area network (WAN) setting.

Today, we will take a look into another big data technology, Sector/Sphere. Let's download the source here.
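My build attempt went something like the following; a sketch only, as the tarball name below is hypothetical (use whatever version you actually downloaded), and I simply ran make in the source root:

$ tar xzf sector-sphere-2.8.tar.gz    # hypothetical file name
$ cd sector-sphere
$ make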

It's a bit of a pity, as the compilation failed, and that is definitely a blocker for new people wanting to pick up this great software. If you develop in or know C++, please leave a message in the comments on how to make this compilation work. Otherwise, if you want to know more about this software, this is another useful link.

Friday, June 3, 2016

It's been a while since I last actively blogged, owing to a family issue. But I hope things will go smoother from here, and I can continue doing what I like best: learning information technology and contributing back to the open source community.

ECL is a declarative, data-centric programming language designed in 2000 to allow a team of programmers to process big data across a high-performance computing cluster without the programmer being involved in many of the lower-level, imperative decisions.[1][2]

As this article is only meant as an introduction, we will just go through whatever documentation is officially available from HPCC Systems, LexisNexis Risk Solutions. To speed up your acquaintance with ECL, download a virtual image from this link. This virtual machine comes preconfigured with an HPCC system ready to go, together with ECL to play with.

For me, I chose the image of the current version, gold release, running on a 64-bit CPU. Next, you need to install VirtualBox on your PC in order to run this virtual image. In the past, I have described many times how to install VirtualBox via apt-get.
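On Debian or Ubuntu that boils down to:

$ sudo apt-get update
$ sudo apt-get install virtualbox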

One particular thing you might want to pay attention to in order to quickly get the downloaded virtual image running in VirtualBox: the HPCC Systems appliance requires two network adapters, so make sure both are configured correctly. You don't have to create a new virtual machine; just choose 'Import Appliance' from the File menu, navigate to the downloaded image, and import it. Then power on the virtual machine.
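If you prefer the command line over the GUI, the same import can be done with VBoxManage (a sketch; the appliance file and VM names below are hypothetical, so use the ones from your download):

$ VBoxManage import HPCCSystemsVM-amd64.ova
$ VBoxManage startvm "HPCCSystemsVM" --type headless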

Open your browser and point to http://<your hpcc system ip address>:8010/#/stub/ECL . Click on ECL at the top of the menu bar, then, in the submenu below, click on Playground. There are some examples of how to model the data store, insert the data, and then query the data. If you want more explanation, you can read this link.

That's it for this learning experience. I must say it is very easy to set up and quickly learn what ECL is, compared to the Hadoop systems covered previously. If you are looking into big data analytics, ECL might be a good option to begin with.