Blog about software development on JVM and related technologies

Main menu

Post navigation

In the previous blog post we measured the effect of simplest JVM JIT optimisation technique of method inlining. The code example was a bit unnatural as it was super simple Scala code just for demonstration purposes of method inlining. In this post I would like to share a general approach I am using when I want to check how JIT treats my code or if there is some possibility to improve the code performance in regards to JIT. Even the method inlining requires the code to meet certain criteria as bytecode length of inlined methods etc. For this purpose I am regularly using great OpenJDK project called JITWatch which comes with bunch of handy tools in regard to JIT. I am pretty sure that there is probably more tools and I will be more than happy if you can share your approaches when dealing with JIT in the comment section bellow the article.

Java HotSpot is able to produce a very detailed log of what the JIT compiler is exactly doing and why. Unfortunately the resulting log is very complex and difficult to read. Reading this log would require an understanding of the techniques and theory that underline JIT compilation. Free tool like JITWatch process those logs and abstract this complexity away form the user.

those settings will produce log file hotspot_pidXXXXX.log. For purpose of this article I re-used code from previous blog located on my GitHub account with JVM flags enabled in build.sbt.

In order to look into generated machine code in JITWatch we need to install HotSpot Disassembler (HSDIS) to install it to $JAVA_HOME/jre/lib/server/. For Mac OS X that can be used from here and try renaming it to hsdis-amd64-dylib. In order to include machine code into generated JIT log we need to add JVM flag -XX:+PrintAssembly.

The most interesting from our perspective is TriView where we can see source code, JVM bytecode and native code. For this particular example we disabled method inlining via JVM Flag “-XX:CompileCommand=dontinline, com/jaksky/jvm/tests/jit/IncWhile.inc“

To just compare with the case when the method body of IncWhile.inc is inlined – native code size is grater 216 compared to 168 with the same bytecode size.

Compile Chain provides also a great view on what is happening with the code

Inlining report provides a great overview what is happening with the code

As it can be seen the effect of tiered compilation as described in JIT optimisation starts with client C1 JIT optimisation and then switches to server C2 optimisation. The same or even better view can be found on Compiler Thread activity which provides a timeline view. To refresh memory check overview of JVM threads. Note: standard java code is subject to JIT optimizations too that’s why so many compilation activity here.

JITWatch is really awesome tool and provides many others views which doesn’t make sense to screenshot all e.g. cache code allocation, nmethodes etc. For detail information I really suggest reading JITWatch wiki pages. Now the question is how to write JIT friendly code? Here pure jewel of JITWatch comes in: Suggestion Tool. That is why I like JITWatch so much. For demonstration I selected somewhat more complex problem – N Queens problem.

Suggestion tool clearly describes why certain compilations failed and what was the exact reason. It is coincidence that in this example we hit again just inlining as there is definitely more going on in JIT but this window provides clear view into how we can possibly help JIT.

Another great tool which is also a part of JITWatch is JarScan Tool. This utility will scan a list of jars and count bytecode size of every method and constructor. The purpose of this utility is to highlight the methods that are bigger than HotSpot threshold for inlining hot methods (default 35 bytes) so it provides hints where to focus benchmarking to see whether decomposing code into smaller methods brings some performance gain. Hotness of the method is determined by set of heuristics including call frequency etc. But what can eliminate the method from inlining is its size. For sure just the method size it too big breaching some limit for inlining doesn’t automatically mean that method is performance bottleneck. JarScan tool is static analysis tool which has no knowledge of runtime statistics hence real method hotness.

To wrap up, JITWatch is a great tool which provides insight into HotSpot JIT compilations happening during program execution and it can help you to understand how decision made at the source code level can affect the performance of the program.

Like this:

Previous article structure of JVM – java memory model briefly mentions bytecode executions modes and article JVM internal threads provides additional insight into internal architecture of JVM execution. In this article we focus on Just In Time compilation and on some of its basic optimisation techniques. We also discuss performance impact of one optimisation technique namely method inlining. In the reminder of this article we focus solely on HotSpot JVM however principles are valid in general.

HotSpot JVM is a mixed-mode VM which means that it starts off interpreting the bytecode, but it can compile code into very highly optimised native machine code for faster execution. This optimised code runs extremely fast and performance can be compared with C/C++ code. JIT compilation happens on method basis during runtime after the method has been run a number of times and considered as a hot method. The compilation into machine code happens on a separate JVM thread and will not interrupt the execution of the program. While the compiler thread is compiling a hot method JVM keeps on using the interpreted version of the method until the compiled version is ready. Thanks to code runtime characteristics HotSpot JVM can make sophisticated decision about how to optimise the code.

Java HotSpot VM is capable of running in two separate modes (C1 and C2) and each mode has a different situation in which it is usually preferred:

C1 (-client) – used for application where quick startup and solid optimization are needed, typically GUI application are good candidates.

C2 (-server) – for long running server application

Those two compiler modes use different techniques for JIT compilation so it is possible to get for the same method very different machine code. Modern java application can take advantage of both compilation modes and starting from Java SE 7 feature called tiered compilation is available. Application starts with C2 compilation which enables fast startup and once the application is warmed up compiler C2 takes over. Since Java SE 8 tiered compilation is a default. Server optimisation are more aggressive based on assumptions which may not always hold. These optimizations are always protected with guard condition to check whether the assumption is correct. If an assumption is not valid JVM reverts the optimisation and drops back to interpreted mode. In server mode HotSpot VM runs a method in interpreted mode 10 000 times before compiling it (can be adjusted via -XX:CompileThreshold=5000). Changing this threshold should be considered thoroughly as HotSpot VM works best when it can accumulate enough statistics in order to make intelligent decision what to compile. If you wanna inspect what is compiled use -XX:PrintCompilation.

Among most common JIT compilation techniques used by HotSpot VM is method inlining, which is practice of substituting the body of a method into the places where the method is called. This technique saves the cost of calling the method. In the HotSpot there is a limit on method size which can be substituted. Next technique commonly used is monomorphic dispatch which relies on a fact that there are paths through method code which belongs to one reference type most of the time and other paths that belong to other type. So the exact method definitions are known without checking thanks to this observation and the overhead of virtual method lookup can be eliminated. JIT compiler can emit optimised machine code which is faster. There is many other optimisation techniques as loop optimisation, dead code elimination, intrinsics and others.

The performance gain by inlining optimisation can be demonstrated on simple Scala code:

Where method inc is eligible for inlining as the method body is smaller than 35 bytes of JVM bytecode (actual size of inc method is 9 bytes). Inlining optimisation can be verified by looking into JIT optimised machine code.

Difference is obvious when compared to machine code when inlining is disabled use –XX:CompileCommand=dontinline,com/jaksky/jvm/tests/jit/IncWhile.inc

Difference in runtime characteristics is also significant as the benchmark results show. With disabled inlining:

When inlining enabled JVM JIT also capable to use next optimizations like loop optimizations which might case that our whole loop is eliminated as it is easily predictable. We would get time around 3 ns which is for 1GHz processor unreal to perform billion of operations. To disable most of loop optimizations use -XX:LoopOptsCount=0 JVM option.

so the performance gain by inlining a method body can be quite significant 2 seconds vs 300 miliseconds.

In this post we discussed mechanics of Java JIT compilation and some optimisation techniques used. We particularly focused on the one of the simplest optimisation technique called method inlining. We demonstrated performance gain brought by eliminating a method call represented by invokevirtual bytecode instruction. Scala also offers a special annotation @inline which should help us with performance aspects of the code under the development. All the code for running the experiments is available online on my GitHub account.

Like this:

In the Structure of Java Virtual Machine we scratched the surface of class file structure, how it is connected to java memory model via class loading process. Also we briefly discussed bytecode structure and its execution including short introduction to Just In Time runtime optimisation. In this post we will look more at internals of execution engine, however there there is no ambition to substitute a detailed VM implementation documentation for HotSpot JVM but just provide enough details to gain bigger picture.

Basic Threading model in HotSpot JVM is a one to one mapping between Java threads (an instance of java.lang.Thread) and native operating system threads. The native thread is created when the Java thread is started and reclaimed once it terminates. The operating system is responsible for scheduling all threads and dispatching to any available CPU. The relationship between java threads priorities and operating system thread priorities varies across operating systems.

HotSpot provides monitors by which threads running application code may participate in mutual exclusion (mutex) protocol. Monitor is either locked or unlocked. Only one thread may own the lock at any time. Only after acquiring ownership of the monitor thread may enter the critical section protected by this monitor. Critical sections are referred as synchronised blocks delineated by synchronised keyword.

Apart from application threads JVM contains also internal threads which can be categorised to following groups:

VM thread spends its time waiting for requested operations to appear in the operation queue (VMOperationQueue). Operation are typically passed to VM Thread because they require the VM to reach safepoint before they can be executed. When the VM is at safepoint all threads inside the VM have been blocked and any threads executing in native code are prevented from returning to the VM while the safepoint is in progress. This means that VM operation can be executed knowing that no thread can be in the middle of modifying heap. All threads are in a state such that their Java stacks are unchanging and can be examined.

Most familiar VM operation is related to garbage collection, particularly stop-the-world phase of garbage collection that is common to many garbage collocational algorithms. Other VM operation are: thread stacks dumps, thread suspension or stopping, inspection or modification via JVMTI etc. VM operation can by synchronous or asynchronous.

Safepoints are initiated using cooperative pooling based mechanism. Thread asks: “Should I block for a safepoint?” Moment when this is happening often is during thread state transition. Threads executing interpreted code don’t usually ask the question, instead when safepoint is requested interpreter switches to different dispatch table which includes that question. When safepoint is over dispatched table is switched back. Once safepoint has been requested VM Thread must wait until all threads are known to be in safepoint safe state before proceeding with operation. During safepoint thread lock is used to block any threads that were running and releasing lock when operation completed.

Like this:

Java based applications run in Java Runtime Environment (JRE) which consists of set of Java APIs and Java Virtual Machine (JVM). The JVM loads an application via class loaders and run it by execution engine.

JVM runs on all kind of hardware where executes a java bytecode without changing java execution code. VM implements a Write Once Run Everywhere principle or so-called platform independence principle. Just to sum up JVM key design principles:

Platform independence

Clearly defined primitive data types – Languages like C or C++ have size of primitive data types dependent on the platform. Java is unified in that matter.

Java uses bytecode as an intermediate representation between source code and machine code which runs on hardware. Bytecode instruction are represented as 1 byte numbers e.g. getfield 0xb4, invokevirtual 0xb6 hence there is maximum of 256 instructions. If the instruction doesn’t need operand so next instruction immediately follows otherwise operands follows instruction according to instruction set specification. Those instructions are contained in class files produced by java compilation. Exact structure of class file is defined in “Java Virtual Machine specification” section 4 – class file format. After some version information there are sections like: constant pools, access flags, fields info, this and super info, methods info etc. See the spec for the details.

A class loader loads compiled java bytecode to the Runtime Data Areas and the execution engine executes the java bytecode. Class is loaded when is used for the first time in the JVM. Class loading works in dynamic fashion on parent child (hierarchical) delegation principle. Class unloading in not allowed. Some time ago I wrote article about class loading on application server. Detail mechanics of class loading is out of scope for this article.

Runtime data areas are used during execution of the program. Some of these areas are created when JVM starts and destroyed when the JVM exits. Other data areas are per-thread – created on thread creation and destroyed on thread exit. Following picture is based mainly on JVM 8 internals (doesn’t include segmented code cache and dynamic linking of language introduced by JVM 9).

Program counter – exist for one thread and has address of current executed instruction unless it is native then the PC is undefined. PC is in fact pointing to at a memory address in the Method Area.

Stack – JVM stack exists for one thread and holds one Frame for each method executing on that thread. It is LIFO data structure. Each stack frame has reference to for local variable array, operand stack, runtime constant pool of a class where the code being executed.

Native Stack – not supported by all JVMs. If JVM is implemented using C-linkage model for JNI than stack will be C stack(order of arguments and return will be identical in the native stack typical to C program). Native methods can call back into the JVM and invoke java methods.

Stack Frame – stores references that points to the objects or arrays on the heap.

Local variable array – all the variables used during execution of the method, all method parameters and locally defined variables.

Operand stack – used during execution of the bytecode instruction. Most of the bytecode is manipulating operand stack moving from local variables array.

Heap – area shared by all threads used to allocate class instances and arrays in runtime. Heap is the subject of Garbage Collection as a way of automatic memory management used by java. This space is most often mentioned in JVM performance tuning.

Non-Heap memory areas

Method area – shared by all threads. It stores runtime constant pool information, field and method information, static variable, method bytecode for each classes loaded by the JVM. Details of this area depends on JVM implementation.

Runtime constant pool – this area corresponds constant pool table in the class file format. It contains all references for methods and fields. When a method or field is referred to JVM searches the actual address of the method or field in the memory by using constant pool

Method and constructor code

Code cache – used for compilation and storage of methods compiled to native code by JIT compilation

…

Bytecode assigned to runtime data areas in the JVM via class loader is executed by execution engine. Engine reads bytecode in the unit of instruction. Execution engine must change the bytecode to the language that can be executed by the machine. This can happen in one of two ways:

JIT (Just In Time) compiler – compensate disadvantage of interpretation. Start executing the code in interpreted mode and JIT compiler compile the entire bytecode to native code. Execution is than switched from interpretation to execution of native code which is much faster. Native code is stored in the cache. Compilation to native code takes time so JVM uses various metrics to decide whether to JIT compile the bytecode.

How the JVM execution engine runs is not defined by JVM specification so venders are free to improve their JVM engines by various techniques.

Like this:

During past several years working habits and working style has been rapidly changing. And this trend will continue for sure, just visit Google trends and search for “Digital nomad” or “Remote work” here. However some profession undergoes this change with a better ease than others. But it is clear that companies that understand that trend benefit from that.

Working in a different style requires a brand new set of tools and approaches which provides you similar working conditions as when people are co-located at the same office. Video conferencing and phone or skype is just the beginning and doesn’t cover all aspects.

In the following paragraphs I am going to summarize tools I found useful while working as software developer remotely in fully distributed team. Those tools are either free or offers some free functionality and I still consider them very useful in various situations. Spectrum of the tools starts with some project management or planning tools to communication.

For the communication – chat and calls Slack becomes standard tool widely adopted now. It allows to you to freely organize your teams, let them create channels they need. It supports wide range of plugins e.g. chat bots and it is well integrated with others tools. Provides application on all desktop and mobile platforms.

When solving some issue or just want to present something to the audience screen sharing becomes very handy tool. Found Join.me pretty handy. Free plan with limited size of the audience was just big enough. Working well on Mac OS and Windows, Linux platform I haven’t tried yet.

When it comes to the pure conferencing phone calls or chat Discord recently took my breath away by awesome sound quality. Again it offers desktop and mobile clients. plus you can use just browser version if you do not wish to install anything on your PC.

Now I slightly move to planning and designing tools during software development process and doesn’t matter if you use Scrum or Kanban. Those have their place there. Shared task board with post it notes. The one I found useful and free is Scrumblr. The only disadvantage is that it is public. It allows you to design the number of sections, change the colors of the notes and add them markers etc.

When we touched an agile development methodology there is no planning and estimation without planning poker. I found useful BitPoints. Simple yet meeting all our needs and free online tool where you invite all participants to the game. It allows you to do various setting like type of deck etc.

When designing phase reached shared online diagramming tools we found rally useful is Sketchboard. It offers wide range of diagrams types and shapes. It offers traditional UML diagrams for sure. Free versions offers few private diagrams otherwise you go public with your design. Allows comments and team discussion.

Some time we just missed traditional white board session and just brain storm. So a web white board tool AWW meet our needs. Simple yet power full.

This concludes the set of tools I found useful during a past year while working in a distributed team remotely. I hope that you found at least one useful or didn’t know it before. Do you have other tools you found useful or have better variants of those mentioned above? Please share it in comment section!

Working on the next project using again awesome Apache Kafka and again fighting against fundamental misunderstanding of the philosophy of this technology which probably usually comes from previous experience using traditional messaging systems. This blog post aims to make the mindset switch as easy as possible and to understand where this technology fits in. What pitfalls to be aware off and how to avoid them. On the other hand this article doesn’t try to cover all or goes to much detail.

Apache Kafka is system optimized for writes – essentially to keep up with what ever speed or amount producer sends. This technology can configured to meet any required parameters. That is one of the motivations behind naming this technology after famous writer Franz Kafka. If you want to understand philosophy of this technology you have to take a look with fresh eye. Forget what you know from JMS, RabbitMQ, ZeroMQ, AMQP and others. Even though the usage patterns are similar internal workings are completely different – opposite. Following table provides quick comparison

JMS, RabbitMQ, …

Apache Kafka

Push model

Pull model

Persistent message with TTL

Retention Policy

Guaranteed delivery

Guaranteed “Consumability”

Hard to scale

Scalable

Fault tolerance – Active – passive

Fault tolerance – ISR (In Sync Replicas)

Core ideas in Apache Kafka comes from RDBMS. I wouldn’t describe Kafka as a messaging system but rather as a distributed database commit log which in order to scale can be partitioned. Once the information is written to the commit log everybody interested can read it at its own pace and responsibility. It is consumers responsibility to read it not the responsibility of the system to deliver the information to consumer. This is the fundamental twist. Information stays in the commit log for limited time given by retention policy applied. During this period it can be consumed even multiple times by consumers. As the system has reduced set of responsibilities it is much easier to scale. It is also really fast – as sequence read from the disk is similar to random access memory read thanks to effective file system caching.

Topic partition is basic unit of scalability when scaling out Kafka. Message in Kafka is simple key value pair represented as byte arrays. When message producer is sending a message to Kafka topic a client partitioner decides to which topic partition message is persisted based on message key. It is a best practice that messages that belongs to the same logical group are send to the same partition. As that guarantee clear ordering. On the client side exact position of the client is maintained on per topic partition bases for assigned consumer group. So point to point communication is achieved by using exactly the same consumer group id when clients are reading from topic partition. While publish subscribe is achieved by using distinct consumer group id for each client to topic partition. Offset is maintained for consumer group id and topic partition and can be reset if needed.

Topic partitions can be replicated zero or n times and distributed across the Kafka cluster. Each topic partition has one leader and zero or n followers depends on replication factor. Leader maintains so called In Sync Replicas (ISR) defined by delay behind the partition leader is lower than replica.lag.max.ms. Apache zookeeper is used for keeping metadata and offsets.

Kafka defines fault tolerance in following terms:

acknowledge – broker acknowledge to producer message write

commit – message is written to all ISR and consumer can read

While producer sends messages to Kafka it can require different levels of consistency:

0 – producer doesn’t wait for confirmation

1 – wait for acknowledge from leader

ALL – wait for acknowledge from all ISR ~ message commit

Apache Kafka is quite flexible in configuration and as such it can meet many different requirements in terms of throughput, consistency and scalability. Replication of topic partition brings read scalability on consumer side but also poses some risk as it is some additional level of complexity to achieve this. If you are unaware of those corner cases it might lead to nasty surprises especially for new comers. So let’s take a closer look at following scenario.

We have topic partition wit a replication factor 2. Producer requires highest consistency level, set to ack = all. Replica 1 is currently leader. Message 10 is committed hence available to clients. Message 11 is not acknowledged nor committed due to the failure of replica 3. Replica 3 will be eliminated from ISR or put offline. That causes that message 11 becomes acknowledged and committed.

Next time we loose Replica 2 it is eliminated from ISR and same situation repeats for messages 12 and 13.

Situation can still be a lot worse, if cluster looses current partition leader – Replica 1 is down now.

What happens if Replica 2 or Replica 3 goes back online before Replica 1? One of those becomes a new partition leader and we lost data messages 12 and 13 for sure!

Is that a problem? Well the correct answer is: It depends. There are scenarios where this behavior is perfectly fine. Imagine collecting logs from all machines via sending them through Kafka. On the other hand if we implement event sourcing and we just lost some events that we cannot recreate the application state correctly. Yes we have a problem! Unfortunately, if that doesn’t changed in latest releases, that is default configuration if you just install new fresh Kafka cluster. It is a set up which favor availability and throughput over other factors. But Kafka allows you to set it up in a way that it meets your requirements for consistency as well but will sacrifice some availability in order to achieve that (CAP theorem). To avoid described scenario you should use following configuration. Producer should require acknowledge level ALL. Do not allow to kafka perform a new leader election for dirty replicas – use settings unclean.leader.election.enable = false. Use replication factor (default.replication.factor = 3) and require minimal number of replicas to be in sync state to higher than 1 (min.insync.replicas = 2).

We already quickly touched the topic of message delivery to consumer. Kafka doesn’t guarantees that message was delivered to all consumers. It is responsibility of the consumers to read messages. So there is no semantics of persistent message as known from traditional messaging systems. All messages send to Kafka are persistent meaning available for consumption by clients according to retention policy. Retention policy essentially specifies how long the message will be available in Kafka. Currently there are two basic concepts – limited by space used for keeping messages or time for which the message should be at least available. The one which gets violated first wins.

When I need to clean the data from the Kafka (triggered by retention policy) there are two options. The simplest one is just delete the message. Or I can compact messages. Compaction is a process where for each message key is just one message, usually the latest one. That is actually a second semantics of key used in message.

What features you cannot find in Apache Kafka compared to traditional messaging technologies? Probably the most significant is an absence of any selector in combination with listen (wake me on receive). For sure can be implemented via correlation id, but efficiency is on the completely different level. You have to read all messages, deserialize those and filter. Compared to traditional selector which uses custom field in message header where you don’t need even to deserialize message payload that is on completely different level. Monitoring Kafka on production environment essentially concerns elementary question: Are the consumers fast enough? Hence monitoring consumers offsets in respect to retention policy.

Kafka was created in LinkedIn to solve specific problem of modern data driven application to fill the gap in traditional ETL processes usually working with flat files and DB dumps. It is essentially enterprise service bus for data where software components needs exchange data heavily. It unifies and decouples data exchange among components. Typical uses are in “BigData” pipeline together with Hadoop and Spark in lambda or kappa architecture. It lays down foundations of modern data stream processing.

This post just scratches basic concepts in Apache Kafka. If you are interested in details I really suggest to read following sources which I found quite useful on my way when learning Kafka:

In this post dedicated to Big Data I would like to summarize hadoop file formats and provide some brief introduction to this topic. As things are constantly evolving especially in the big data area I will be glad for comments in case I missed something important. Big Data framework changes but InputFormat and OutputFormat stays the same. Doesn’t matter what’s big data technology is in use, can be hadoop, spark or …

Let’s start with some basic terminology and general principles. Key term in mapreduce paradigm is split which defines a chunk of the data processed by single map. Split is further divided into record where every record is represented as a key-value pair. That is what you actually know from mapper API as your input. The number of splits gives you essentially the number of map tasks necessary to process the data which is not in clash with number of defined mappers for your mapreduce slots. This just means that some map tasks need to wait untill the map slot is available for processing. This abstraction is hidden in IO layer particularly InputFormat or OutputFormat class which contains RecordReader, RecordWriter class responsible for further division into records. Hadoop comes with a bunch of pre-defined file formats classes e.g. TextInputFormat, DBInputFormat, CombinedInputFormat and many others. Needless to say that there is nothing which prevents you from coming with your custom file formats.

Described abstraction model is closely related to mapreduce paradigm but what is the relation to underlying storage like HDFS? First of all, mapreduce and distributed file system (DFS) are two core hadoop concepts which are “independent” and the relation is defined just through the API between those components. The well-known DFS implementation is HDFS but there are several other possibilities(s3, azure blob, …). DFS is constructed for large datasets. Core concept in DFS is a block which represents a basic unit of the original dataset for a manipulation and processing e.g. replication etc. This fact puts also additional requirements on dataset file format: it has to be splittable – that means that you can process a given block independently from the rest of the dataset. If the file format is not splittable and you would run a mapreduce job you wouldn’t get any level of parallelism and the dataset would be processed by a single mapper. Splittability requirement also applies if the compression is desired as well.

What is the relation between a block from DFS and split from mapreduce? Both of them are essentially key abstractions for paralelization but just in different frameworks and in ideal case they are aligned. If they are perfectly aligned that hadoop can take full advantage of so-called data locality feature which runs the map or reduce tasks on a cluster node where the data resides and minimize the additional network traffic. In case of imprecise alignment a remote reads will happen for records missing for a given split. For that reasons file formats includes sync markers or points.

To take an advantage and full power of hadoop you design your system for a big files. Typically the DFS block size is 64MB but can be bigger. That means that biggest hadoop enemy is a small file. The number of files which lives in DFS is somehow limited by the size of Name Node memory. All the datasets metadata are kept in memory. Hadoop offers several strategies how to avoid of this bad scenario. Let’s go through those file formats.

HAR file (stands for hadoop archive) – is a specific file format which essentially packs a bunch of files into a single logical unit which is kept on name node. HAR files doesn’t support additional compression and as far as I know are transparent to mapreduce. Can help if name nodes are running out of memory.

Sequence file is a kind of file based data structure. This file format is splittable as it contains a sync point after several records. Record consists of key – value and metadata. Where key and value is serialized via class whose name is kept in the metadata. Classes used for serialization needs to be on CLASSPATH.

Map file is again a kind of hadoop file based data structure and it differs from a sequence file in a matter of the order. Map file is sorted and you can perform a look up. Behavior pretty similar to java.util.Map class.

Avro data file is based on avro serializaton framework which was primarily created for hadoop. It is a splittable file format with a metadata section at the beginning and then a sequence of avro serialized objects. Metadata section contains a schema for a avro serialization. This format allows a comparison of data without deserialization.

Google Protocol buffers are not natively supported by hadoop but you can plug the support via libraries as elephant-bird from twitter.

So what about file formats as XML and JSON? They are not natively splittable and so “hard” to deal. A common practice is to store then into text file a single message per line.

For textual files needless to say that those files are the first class citizens in hadoop. TextInputFormat and TextOutputFormat deal with those. Byte offset is used as key and the value is the line content.

This blog post just scratches the surface of hadoop file formats but I hope that it provides a good introduction and explain connection between two essential concepts – mapreduce and DFS. For further reference book Hadoop Definite guide goes into the great detail.

Java applications archives as jar, war and ear files are elementary distribution blocks in the java world. At the beginning managing all of these libraries and components were a bit cumbersome and error prone as the project dependencies depends on another libraries and all those transitive dependencies creates so called dependency hell. In order to ease developers of this burden Apache Maven (maven like tools) were developed. Every artefact has so called coordinates which uniquely identifies it and all dependencies are driven by those coordinates in a recursive fashion.

Maven ease the management at the stage of the artefact development but doesn’t help that much when we want to deploy the application component. Many times that’s not such a big deal if your run time environment is clustered J2EE aplication server e.g. Weblogic cluster. You hand the ear or war over to your ops team and they deploy it to the cluster via cluster management console to all nodes at once. They need to maintain an archive of deployed components in case of roll back etc. This is the simplest case (isolated component and doesn’t solve dependencies e.g. libraries provided in the cluster etc.) where management is relatively clean but relying on the process a lot. When we consider different run time environment like run the application as a server less java process (opposed to J2EE cluster) then the stuff gets a bit more complicated even for a simplest case. Your java applications are typically distributed as a jar file and you need to distribute it to every single linux server where instance of this process is running. Apart of that standard jar file doesn’t contain dependencies. One possible solution to that would be to create shaded (fat) jar file which has all dependencies embedded. I suppose that you have a repository where all builds are archived. Does it make sense to store those big archives where the major part are 3rd party libraries? This is probably not the right way to go.

Another aspect of roll out process is ability to automate it. In case of J2EE clusters like weblogic there is often scripting tool provided (WLST ~ weblogic scripting tool). The land of pure jar is again a lot worse. You can take some advantage of maven but that doesn’t solve all the problems. Majority of production environments in java world run on linux operating system so why not to try to take advantage of linux server standard distribution management like yum, apt etc. for distributing rpm linux packages. This system provides atomicity, dependency management between rpm linux packages, easy way to roll back (keeps track of versions), minimise the number of manual steps – potential of human error is reduced and involves native auditing. It is pretty easy to get an info about installation history.

To pack your java jar file application you need a tool called rpmbuild which creates a linux package from a SPEC file. SPEC file is something like pom in maven world plus it contains instruction how to install, uninstall etc. Packages containing a required plus handy tools are rpmdevtool and rpmlint. On linux OS it is simple to install it. On Windows OS you need a cygwin installed with the same tool set. In order to build your rpm work space run the following command. It is highly recommended not to run it under root account if there is no special need for it.

rpmdev-setuptree

This command creates rpmbuild folder – that is the place where all linux RPM packaging will happen. It contains sub-folders: BUILD, RPMS, SOURCES, SPECS, SRPMS. For us the important ones are RPMS – will contain final linux rpm and SPECS – this is the place we need to put our SPEC file describing installation and content of our application.
This file is the core of the linux rpm packaging. It contains all the information about version, dependencies, installation, un-installation, upgrade etc. We can create our skeleton SPEC file by running following command:

rpmdev-newspec

Majority of directives in this file are clear from its name, e.g. Name, Version, Summary, BuildArch etc. BuildRoot require special attention. It is a sort of proxy which mimics a root of system under the construction e.g. If I want to install my [application_name] (replace this place holder with actual name) to /usr/local/[application_name] location I have to create this structure under BuildRoot during the installation. Then there are important sections which corresponds to various phases of installation: %prep, %build, %install – which is the most important for as as we do not build from sources but just pack already built jar file to linux rpm package. Last very important section of this file is %files which lists all files which will be in the final linux rpm package and hence installed in the target machine. Apart from that there can be additional hooks to installation and un-installation process as %post, %preun, %postun etc. which allows you to customize a process as you need. Sample SPEC file follows:

Several things to highlight in the SPEC file example: tmppath points to the location where the installed application is prepared, that is essentially what is going to be packed to rpm package. %defattr set the standard attributes to files if special one are not specified. %config denotes configuration files which means that for the first installation those standard one are provided but in case of upgrade those file will not be overwritten as they are probably customized to this particular instance.
Now we are ready to create the linux rpm package just the last step is pending:

rpmbuild -v -bb --clean SPECS/nameOfTheSpecFile.spec

Created package can be found in RPMS subfolder. We can test the package locally by

rpm -i nameOfTheRpmPackage.rpm

To complete the smoke test lets remove the package by

rpm -e nameOfTheApplication

Creating a SPEC file should be pretty straightforward process and once you create your SPEC file for the application building of linux rpm package is one minute job. But if you want to automate it there is a maven plugin which generates a SPEC file for you. It is essentially wrapper of rpmbuild utility which means that plugin works fine on linux with tool set installed but on windows machine you need have cygwin installed and create wrapper bat file to mimic rpmbuild utility for the plugin. Detailed manual can be found for example here.

Couple things to highlight when creating a SPEC file. Prepare the linux package for all scenarios – install, remove, upgrade and configuration management right from the beginning. Test it properly. It can save you a lot of troubles and manual work in case of large installations. Creating a new version of java application is only about about replacing jar file, re-packaging rpm bundle.

In this quick walk through I tried to show that creating of linux rpm package as a unit for software deployment of the java application is not that difficult and can neaten a roll out process. I just scratch the surface of linux rpm packaging and I was far away from showing all capabilities of this approach. I will conclude this post by several links which I found really useful.

Scalability, Availability, Resilience – those are just common examples of computer system requirements which forms an overall application architecture very strongly and have direct impact to “indicators” such as Customer Satisfaction Ratio, Revenue, Cost, etc. The weakest part of the system has the major impact on those parameters. Topic of this post availability is defined as percentage of time that a system is capable of serving its intended function.

In BigData era Apache Hadoop is a common component to nearly every solution. As the system requirements are shifting for purely batch oriented systems to near-to-real-time systems this just adds pressure on systems availability. Clearly if system in batch mode runs every midnight than 2 hours downtime is not such a big deal as opposed to near-to-real-time systems where result delayed by 10 min is pointless.

I this post I will try to summarize Hadoop high availability strategies as a complete and ready to use solutions I encountered during my research on this topic.

In Hadoop 1.x the well known fact is that the Name Node is a single point of failure and as such all high availability strategies tries to cope with that – strengthen the weakest part of the system. Just to clarify widely spread myth – Secondary Name Node isn’t a back up or recovery node by nature. It has different tasks than Name Node BUT with some changes Secondary Name Node can be started in the role of Name Node. But neither this doesn’t work automatically nor that wasn’t the original role for SNN.

High availability strategies can be categorized by the state of standby: Hot/Warm Standby or Cold Standby. This has direct correlation to fail over(start up) time. To give an raw idea(according to doc): Cluster with 1500 nodes with PB capacity – the start up time is close to one hour. Start up consists of two major phases: restoring the metadata and then every node in HDFS cluster need to report block location.

Typical solution for Hadoop 1.x which makes use of NFS and logical group of name nodes. Some resources claim that in case of NFS unavailability the name node process aborts what would effectively stop the cluster. I couldn’t verify that fact in different sources of information but I feel important to mention that. Writing name node metadata to NFS need to be exclusive to a single machine in order to keep metadata consistent. To prevent collisions and possible data corruption a fencing method needs to be defined. Fencing method assures that if the name node isn’t responsive that he is really down. In order to have a real confidence a sequence of fencing strategies can be defined and they are executed in order. Strategies ranges from simple ssh call to power supply controlled over the network. This concept is sometimes called shot me in the head. The fail over is usually manual but can be automated as well. This strategy works as a cold standby and hadoop providers typically provides this solution in their High Availability Kits.

Because of the relatively long start up time of back up name node some companies (e.g. Facebook) developed their own solutions which provides hot or warm standby. Facebook’s solution to this problem is called avatar node. The idea behind is relatively simple: Every node is wrapped to so called avatar(no change to original code base needed!). Primary avatar name node writes to shared NFS filler. Standby avatar node consist of secondary name node and back up name node. This node continuously reads HDFS transaction logs and keeps feeding those transactions to encapsulated name node which is kept in safe mode which prevents him from performing of any active duties. This way all name node metadata are kept hot. Avatar in standby mode performs duties of secondary name node. Data nodes are wrapped to avatar data nodes which sends block reports to both primary and standby avatar node. Fail over time is about a minute. More information can be found here.

Another attempt to create a Hadoop 1.x hot standby coming form China Mobile Research Institute is based on running synchronization agents and sync master. This solution brings another questions and it seems to me that it isn’t so mature and clear as the previous one. Details can be found here.

An ultimate solution to high availability brings Hadoop 2.x which removes a single point of failure by a different architecture. YARN (Yet Another Resource Negotiator) also called MapReduce 2. And for HDFS there is an another concept called Quorum Journal Manager (QJM) which can use NFS or Zookeeper as synchronization and coordination framework. Those architectural changes provides the option of running two redundant NameNodes in the same cluster in an Active/Passive configuration with a hot standby.

This post just scratches the surface of Hadoop High Availability and doesn’t go deep in detail daemon by daemon but I hope that it is a good starting point. If someone from the readers is aware of some other possibility I am looking forward to seeing that in the comment section.

Using standard J2EE containers for application deployment is not always suitable option. Time to time you need to run an java application (jar file) as a server less, more light weight linux process. Using standard java -cp …. MainClass is feasible but sooner or latter you will reveal that there is something important missing. Especially if you are supposed to run multiple components in this way. I becomes relly messy and hard to manage pretty soon. On linux system there is a solution which is a lot better – run the component as a linux service.
Lets make is simple and easy to understand. Linux service is essentially a “process” which is driven by init script and has defined API – set of standard commands for management of the underlying linux process. Those linux service commands looks as following(processor represents actual name as defined in init script, see latter):

service processor start
service processor status
service processor stop
service processor restart

That’s a lot simpler, easy to manage and monitor, right? You don’t need to know where particular jar file is located etc. Examples of init scripts can be usually located /etc/init.d/samples or just simply read scripts in /etc/init.d which contains various init scripts for different kinds of linux services already present on the system.
For java applications there is a bunch of projects which acts as a service wrappers. That enables you to quickly and easily turn jar file to regular linux service as a program daemon. There are wrappers even for windows OS. For some reasons I was directed to use just linux server standard tools so the reminder of this post will be about making the linux service program daemon in a common way via shell scripts.
First of all there is a necessity to create startup and shutdown script with a need to properly manage pid (process id) file accordingly. A good practice is to have a dedicated user to run a particular linux services and have them installed under /usr/local/xxx .
startup script follows:

Startup script creates pid file located /var/run/processor – user under which the installation is running needs to have appropriate privileges.
Finally the init script which needs to be placed into /etc/init.d folder:

In the init script there is a need to change to appropriate java apps installation folder JAVA_HOME if not default and SERVICE_USER to user which is supposed to run this service. Service can be started under root account or SERVICE_USER without password specification or any other user with knowledge of credentials.

If you have a production like experience with java service wrappers mentioned at the beginning of the article don’t hesitate and share it! This way it serves the purpose at given situation.