Usually, I don’t use singletons that objects need to fetch using a static method. I prefer using dependency injection to give each object what it needs. But for lower-level concerns, like logging or security, it is common to use singletons that are accessible from a static method, like the way you get a logger with SLF4J:

Logger logger = LoggerFactory.getLogger(HelloWorld.class);

The logger object is a singleton obtained via a static method.

So how should we implement those static methods to return a singleton?

Lazy or not?

The first question you need to ask is whether or not you need lazy initialization of the singleton. According to Wikipedia, “lazy initialization is the tactic of delaying the creation of an object, the calculation of a value, or some other expensive process until the first time it is needed” (emphasis mine). So if building your singleton is not really expensive, there is probably no need to build it lazily. In that case, the easiest way to build a singleton is to use a static field:

public class SingletonHolder
{
    public static final Singleton INSTANCE = new Singleton();
}

Then, to get this singleton, you only need to grab the instance:

Singleton singleton = SingletonHolder.INSTANCE;

Of course, if you prefer, you could add a getter to the class to return the instance instead of having it public.
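That variant might look like this (a minimal sketch; the Singleton class name is illustrative):

```java
public class Singleton {
    // Initialized once, when the class is loaded.
    private static final Singleton INSTANCE = new Singleton();

    // Private constructor: nobody else can instantiate the class.
    private Singleton() {
    }

    public static Singleton getInstance() {
        return INSTANCE;
    }
}
```

The getter keeps the field private, which leaves you free to change the implementation later without touching callers.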

The important thing about this singleton implementation is the use of a static field to hold the instance. Static fields are initialized once, when the class is loaded. So INSTANCE will be unique because it will be instantiated only once. But beware of serialization and reflection hacks that could wreak havoc and make INSTANCE not unique anymore. Check Item 3 of Joshua Bloch’s excellent Effective Java, 2nd Edition for details.

If you are using Java 1.5+, maybe the best approach to creating a singleton is to use an enum:

public enum EnumSingleton
{
    INSTANCE;
}

To get the instance:

EnumSingleton.INSTANCE;

The advantage of this method is that the singleton is guaranteed to be unique, even in the face of serialization or reflection. And you have to admit it is very simple!

According to Michael Borgwardt in this Stack Overflow answer, 99.99% of the time, this is all you need to build a singleton in Java. I’m sure that 99.99% is more a figure of speech than a scientifically validated number, but you get the idea!

I’m lazy!

If your singleton is really that expensive to build, maybe you want to defer its creation to the moment it is first accessed.

One challenge when implementing a lazy-initialized singleton is to correctly handle multiple threads. For a singleton that is expensive to build, we want the first thread that needs it to trigger its creation, while all the threads that need it afterward simply grab the existing instance. We don’t want multiple threads triggering the creation of multiple instances (it wouldn’t be a singleton anymore!).

Probably the simplest way to achieve this is to synchronize the method used to get and build the singleton, like this:
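A minimal sketch of that synchronized version (class names are illustrative):

```java
public class LazySingleton {
    private static LazySingleton instance;

    private LazySingleton() {
    }

    // The whole method is synchronized: only one thread at a time can
    // check for null and, if needed, create the instance.
    public static synchronized LazySingleton getInstance() {
        if (instance == null) {
            instance = new LazySingleton();
        }
        return instance;
    }
}
```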

Nothing complicated about this. For many implementations that need lazy initialization, this is perfectly fine. The main drawback is that under heavy thread usage, it might become a performance bottleneck (but please, do profile before jumping to that conclusion). As we can see, the whole getInstance method is synchronized, but what we really need to protect is the creation of the singleton. Once it is created, there is no need for the method to be synchronized anymore. How can we achieve this?
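The classic attempt is double-checked locking; a sketch (deliberately close to the textbook version, with illustrative class names):

```java
public class DclSingleton {
    // As written, this field is NOT volatile: this is the broken
    // pre-Java-1.5 idiom. Declaring it volatile is what makes
    // double-checked locking safe from Java 1.5 on.
    private static DclSingleton instance;

    private DclSingleton() {
    }

    public static DclSingleton getInstance() {
        if (instance == null) {                  // first check, no lock taken
            synchronized (DclSingleton.class) {
                if (instance == null) {          // second check, lock held
                    instance = new DclSingleton();
                }
            }
        }
        return instance;
    }
}
```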

As we can see, the creation of the singleton is protected by a synchronized block, but the method itself is not. Once the singleton is built, INSTANCE is returned and the synchronized block is never entered again. Intuitively, it seems OK. The main problem with this is, well, it doesn’t work in Java! Because of memory visibility and reordering of instructions by the compiler and/or runtime, even if a thread sees INSTANCE as non-null, it might not be properly initialized yet (as if the constructor had not been run). See the famous “Double-Checked Locking is Broken” Declaration for more details.

But wait! Thanks to changes in the Java memory model from version 1.5, you can make double-checked locking work by either making INSTANCE volatile, or by making sure the Singleton class is immutable. See again the Declaration for details.
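There is an even simpler way to get both laziness and thread safety; a sketch of the idiom (class names are illustrative):

```java
public class HolderSingleton {
    private HolderSingleton() {
    }

    // The nested class is not loaded until getInstance() first
    // references it; the JVM guarantees that class initialization
    // is thread-safe, so no explicit synchronization is needed.
    private static class Holder {
        static final HolderSingleton INSTANCE = new HolderSingleton();
    }

    public static HolderSingleton getInstance() {
        return Holder.INSTANCE;
    }
}
```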

That method is called Initialization-on-demand holder. Notice that the singleton is held by a private static class. But is it lazy and thread-safe? Will the singleton only be initialized on the first call to getInstance? Quoting Brian Goetz:

This idiom derives its thread safety from the fact that operations that are part of class initialization, such as static initializers, are guaranteed to be visible to all threads that use that class, and its lazy initialization from the fact that the inner class is not loaded until some thread references one of its fields or methods.

So INSTANCE will only be initialized the first time it is accessed and all other threads will get the initialized INSTANCE afterward. The beauty of this method is its simplicity. Less code and no synchronized. And according to Brian Goetz, it is faster.

In Item 71 of Effective Java, 2nd Edition, Joshua Bloch recommends this last method for lazy initialization of a static field. For lazy initialization of instance fields, he recommends double-checked locking (of course, only if you really need it).

Conclusion

Like I said in the introduction, in general, I think it is best to use dependency injection instead of singletons accessible from a static method. But if you do decide that a singleton is what you need, you should go with the simplest solution, like a singleton initialized in a static field.

Recently, I’ve been using @ContextConfiguration to annotate my Spring test classes. This allows you to have a context available for the test, without having to load it manually in a @Before method. It also gives the opportunity to use @Autowired within the test class.

But when using this feature with Spring AOP, you can run into a situation where test classes get advised. I often use the same package name for my business and test classes. For example, if my business classes are in the package ‘com.company.xyz’, then I put my tests into another source folder (src/test/java), but using the same ‘com.company.xyz’ package name for the test classes. With this scheme, if you define a pointcut for all public methods of the ‘com.company.xyz’ package, Spring will try to advise the public methods of your test classes. This happens because, thanks to @ContextConfiguration, the test classes are now part of the Spring context.

What is the problem with having the test classes advised? Most of the time, my test classes do not implement any interface. When Spring AOP tries to advise a class that does not implement an interface, it needs to create a cglib proxy instead of a JDK proxy. Since all the business classes that I need to advise implement at least one interface, I don’t need to include cglib in the project. So when trying to advise my test classes, Spring throws an exception saying that cglib is not available.

So, I needed a way to tell Spring AOP not to advise my test classes. It turns out that the pointcut expression syntax allows you to tell Spring not to advise a class if it carries a specific annotation. So, using this Stack Overflow answer, I came up with this code:
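The code was along these lines (a sketch: the aspect and advice names are hypothetical, the two pointcut expressions are the interesting part):

```java
import org.aspectj.lang.annotation.Aspect;
import org.aspectj.lang.annotation.Before;
import org.aspectj.lang.annotation.Pointcut;

@Aspect
public class BusinessAspect {

    // Matches any class annotated with @ContextConfiguration,
    // i.e. the Spring test classes.
    @Pointcut("@within(org.springframework.test.context.ContextConfiguration)")
    public void isTestClass() {
    }

    // Matches all public methods in com.company.xyz and its subpackages.
    @Pointcut("execution(public * com.company.xyz..*.*(..))")
    public void publicMethod() {
    }

    // Advise the business methods, but not the test classes.
    @Before("publicMethod() && !isTestClass()")
    public void beforePublicMethod() {
        // advice body goes here
    }
}
```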

So here, we define one pointcut for classes annotated with @ContextConfiguration and another one for all the public methods of the com.company.xyz package. We can then combine these two pointcut expressions to prevent Spring from advising the test classes (notice the !isTestClass()).

Last Saturday, I went to Hack/Reduce. Organized by the folks at Hopper, the event was an opportunity to learn to use Hadoop. They took the time to prepare an EC2 cluster of more than a hundred nodes. They also loaded a set of popular datasets for us to play with. With clear instructions on how to deploy our jobs to the cluster, we were ready to hack! Each team had an idea about what they wanted to do with the data, and the Hopper folks were there to help us realize it!

I teamed up with Mathieu Carbou and David Avenante to build a full inverted index from the Wikipedia dataset. Our job used Mathieu’s xmltool to parse each Wikipedia page and some Lucene tokenizers to extract words and positions. It was a lot of fun to see this running on more than 400 CPUs!

At the end of the day, each team took the time to present what they were able to accomplish. It was really impressive to see what was done in a single day! One team dug through flight data and discovered that it is cheaper to travel on Friday and Saturday. Many teams also leveraged the Bixi dataset to extract interesting information about Montrealers’ usage of the bikes. Neat stuff!

I’d like to thank the Hopper staff for such a nice event! Well done, guys!

Today started with Scaling Web Apps with RabbitMQ. We were introduced to the basics of AMQP and went through some use cases where using a message queue system makes sense for a web application. An example is image processing, which can be done asynchronously by sending those jobs to a worker process via RabbitMQ.

Next it was Varnish in action. Varnish is a web server accelerator that works by caching page content. It is configured to sit in front of the web server, protecting it from serving requests for which the response is already in the cache. It is also called a reverse proxy. We were given some general configuration guidelines.

I was then introduced to What every developer should know about Performance. It was the third talk I attended by Morgan Tocker and I did not regret it because he’s a great speaker. For him, the main aspect to consider when trying to optimize the performance of a web application is response time. But not only the average response time: it is important to properly log the response time of all requests in order to determine in what circumstances it is bad. We should also log what each request is doing so that we know where the time was spent. He mentioned that we should not be afraid of activating those logs in production, as the overhead is generally low and this is crucial data for optimizing performance.

Just before lunch, we were lectured by a Microsoft representative on Interoperability and Web Standards. The presenter worked hard to convince us that Microsoft has changed and that it is now working to implement open standards. This was an awkward moment.

After lunch was a presentation on Solr Search Engine: Beyond The Basics. Despite the fact that I already know Solr well, I learned a couple of things during this talk, like the fact that you can define a default core. The presenter was obviously well versed in Solr and the slides were funny!

My last talk of the day was Step by Step: GC Tuning in the HotSpot JVM. That was pretty technical stuff for a Friday afternoon! We learned the basics of generational GC in HotSpot, discussed the differences between the Parallel Old collector and the Concurrent Mark-Sweep collector, and went through some settings to control the behavior of those collectors. The presenter suggested that we should always leave GC logging enabled in production in order to monitor any GC-related problem in an application. Similar to what Morgan Tocker said earlier in the day, we should not fear the overhead incurred by logging, as this is vital data.

So that was it! I really enjoyed the time spent at Confoo. I learned quite a lot!

I started the day with Designing HTTP Interfaces and RESTful Web Services. So far, it’s my favorite talk of the event. The presenter is very knowledgeable about all things HTTP and REST and had a funny way of presenting. We learned about what it really takes to design clean and efficient HTTP and REST APIs. It was the first time I heard about the Richardson Maturity Model. I encourage you to take a look at the slides of this one; it was really good!

Then I went to Scalable Architecture 101. I think there was too much information in this one. We basically went through all the typical servers (web, database, mail, cache, DNS and others!) and talked about what it takes to make each of them scalable. One point the presenter insisted on, and that I agree with, is to never use a database to store web sessions. Use Memcached!

After a nice lunch with other attendees, I attended a Panel: Which NoSQL database should you use? Three panelists, representing CouchDB, Cassandra and MongoDB respectively, answered questions from a moderator and the audience. Basically, they talked about how those three products handle ACID. To me, CouchDB stood out as the most interesting of the group, but I do not think the Cassandra representative did a good job of promoting his database.

I concluded the day with Linked Data: The new black. It was an introduction to RDF and the principles of the semantic web. I think I was a bit tired at this point because I had trouble keeping my focus. But the presenter did a fairly good job of showing how this is useful and that we are only at the beginning! Web 3.0 looks promising!

I’ve spent my first day at Confoo.ca here in Montréal. I learned a lot in a single day! Here are my thoughts about the talks I attended.

I’ve started the day with Java EE 6 – how J2EE became popular again. We learned about the new Java EE 6 released a couple of months ago. We spent most of the time exploring the features of the web profile. It’s interesting to see how the standard Java technologies are now influenced by open source products like Spring, Guice and Hibernate.

Then, I went to a presentation on whether or not we still need Java web frameworks. The presenter’s main argument was that traditional Java web frameworks like Struts, Tapestry or Wicket are not appropriate anymore. He even went as far as to say that MVC is not valid anymore! Some attendees were skeptical. He advocates picking only the products that are needed to get the job done, instead of relying on a complete framework. He then presented the solution that his team built for a client using different products like Jersey, Socket.io and Redis.

After lunch, I went to a somewhat related talk entitled Why MVC is not an application architecture. The presenter started by explaining that, originally, the MVC pattern was nothing more than the Observer pattern for UIs. According to him, it has nothing to do with the Web, especially not with today’s web applications. He then got into different patterns to help PHP developers build more layered applications. To me, it was interesting to hear how the PHP community is applying patterns similar to the ones described in the Core J2EE Patterns book.

After that, I got into a completely different subject, An Overview of Flash Storage for Databases. The talk was an overview of the current state of technologies related to enterprise SSDs. While still expensive, those drives are way more efficient in terms of IOPS. The presenter claimed that, to really benefit from them, applications, especially RDBMSes, need to change the algorithms they use to read and write data to disk.

The last talk I attended was Building servers with Node.js. We started by being introduced to the general principles of evented I/O, then got a nice introduction to Node.js through the traditional echo server example. While I really think that asynchronous I/O is the way to go to build scalable web servers, I am not comfortable with the spaghetti code that results from the use of callbacks. I like the approach taken by other evented I/O products, like cool.io and async-http-client, to make the code more readable. But nevertheless, I think this is a really nice piece of software!

The Reactor pattern is a common design pattern to provide non-blocking I/O. Instead of having multiple threads blocked waiting for I/O to complete on a connection, you assign a single thread the responsibility of monitoring all the connections. When the I/O operations are completed for a connection, that thread fires an event so that another thread starts processing the data coming from the connection. This approach works well when you have to handle a lot of connections, because you are not forced to dedicate a thread to each connection, which might consume a lot of resources if the number of connections is high.

Since Java 1.4, the Selector class provides an implementation of this pattern. You start by registering connections with a Selector instance. Then you call the select method of the Selector to get a list of all the connections that are ready to perform I/O operations, like reads and writes. A single thread can be assigned the responsibility of polling the selector and sending notifications when a connection is ready. See this nice tutorial for more details on how to use the Selector class.
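A minimal sketch of the pattern with the Selector class (the class and method names here are illustrative, and the event handling is elided):

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.util.Iterator;

public class ReactorSketch {
    // Open a selector and register a non-blocking server channel with it.
    static Selector startReactor(int port) throws IOException {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(port));
        server.configureBlocking(false);
        server.register(selector, SelectionKey.OP_ACCEPT);
        return selector;
    }

    // The event loop: a single thread polls the selector for ready channels.
    static void runLoop(Selector selector) throws IOException {
        while (selector.isOpen()) {
            if (selector.select(100) == 0) {
                continue;
            }
            Iterator<SelectionKey> keys = selector.selectedKeys().iterator();
            while (keys.hasNext()) {
                SelectionKey key = keys.next();
                keys.remove();
                if (key.isAcceptable()) {
                    // accept the connection, register it for OP_READ, and
                    // hand completed reads off to worker threads
                }
            }
        }
    }
}
```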

To familiarize myself with the Selector class, I wrote nio-crawler. It uses a single thread to fetch links from the web, using the Selector class. When a page is fully downloaded, it is passed to a handler thread that parses the HTTP and the HTML to get the new links to follow. The handler threads never block on I/O, so they never sit idle (as long as there is data to parse, of course).

I started learning Clojure a couple of weeks ago. Though my brain is hurting, I must say that I really enjoy it. It is so refreshing! Clojure has been on my radar for a year or two, but that video from Rich Hickey gave me the motivation to actually start learning it. I am currently using the labrepl to practice writing Clojure code. Nice stuff!

Browsing through Github, you can find a couple of really high-quality projects written in Clojure, like Compojure and Leiningen. It always amazes me to see how fast the community can build around a promising new language.

Nutch allows you to crawl a site or a collection of sites. If your objective is to simply crawl the content once, it is fairly easy. But if you want to continuously monitor a site and crawl updates, it can be harder. Harder because the Nutch documentation does not have many details about that.

After a bit of digging, I found that Nutch offers an AdaptiveFetchSchedule class that can be used for that purpose. To understand how this class works, let’s recap how Nutch manages crawls.

Nutch maintains a record on file of all the URLs that it has encountered while crawling. This record is called the crawl db. Initially, the crawl db is built from a list of URLs provided by the user via the inject command. An important concept in Nutch is the generate/fetch/update cycle. The generate command looks in the crawl db for all the URLs due for fetch and groups them into a segment. A URL is due for fetch if it is either a new URL or if it is time to re-crawl it; more on that later. The fetch command will, well, fetch all the URLs of the segment from the web. After that, the update command will add the results of the crawl (stored in the segment) to the crawl db. Each URL crawled will be updated to indicate the fetch time and the next scheduled fetch. Newly discovered URLs will also be added and marked as not fetched.
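From the command line, one round of this cycle looks roughly like this (a sketch assuming a Nutch 1.x install; the paths are illustrative, and the update step is the updatedb command):

```shell
# seed the crawl db with a list of URLs
bin/nutch inject crawl/crawldb seed-urls/

# group the URLs due for fetch into a new segment
bin/nutch generate crawl/crawldb crawl/segments
segment=`ls -d crawl/segments/* | tail -1`

# fetch the URLs of the segment from the web
bin/nutch fetch $segment

# merge the crawl results back into the crawl db
bin/nutch updatedb crawl/crawldb $segment
```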

By default, Nutch will set the next scheduled fetch of a page to the fetch time plus a constant interval. The default value is 30 days, but it can be changed to any value in the file nutch-site.xml via the db.fetch.interval.default property. On a later generate call, if the time has come, the URL will be added to a segment and re-crawled. This default behavior can be acceptable if roughly all pages of a site change at approximately the same rhythm. But if the site being crawled contains a lot of pages that almost never change, you would probably want Nutch to visit those pages less often and concentrate on the ones that change frequently. This is not possible with the default fetch schedule, which uses the same constant interval for every URL.

Enter the Adaptive Fetch Schedule. This fetch schedule adapts to the rhythm of changes of a page and sets the next fetch time accordingly. When a new URL is added to the crawl db, it is initially set to be re-fetched at the default interval. The next time the page is visited, the Adaptive Fetch Schedule will increase the interval before the next fetch if the page has not changed, and decrease it if the page has changed. Note that a maximum and a minimum interval are defined in the configuration; the interval will never be longer than the maximum or shorter than the minimum. So after a while, the pages that change often will tend to be visited more often than the ones that do not. The relevant properties are:

db.fetch.schedule.class: the fetch schedule implementation to use

db.fetch.interval.default: the default number of seconds between re-fetches of a page

db.fetch.schedule.adaptive.min_interval: the minimum number of seconds between re-fetches of a page

db.fetch.schedule.adaptive.max_interval: the maximum number of seconds between re-fetches of a page

db.fetch.schedule.adaptive.inc_rate: if a page is unmodified, the interval before the next fetch is increased by this rate

db.fetch.schedule.adaptive.dec_rate: if a page is modified, the interval before the next fetch is decreased by this rate

db.fetch.schedule.adaptive.sync_delta: if true, try to synchronize with the time of page change by shifting the next fetch time by a fraction (sync_rate) of the difference between the last modification time and the last fetch time
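As a sketch, enabling the adaptive schedule in nutch-site.xml might look like this (the interval values, in seconds, are illustrative):

```xml
<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
</property>
<property>
  <name>db.fetch.interval.default</name>
  <!-- 30 days -->
  <value>2592000</value>
</property>
<property>
  <name>db.fetch.schedule.adaptive.min_interval</name>
  <!-- never re-fetch a page more than once a day -->
  <value>86400</value>
</property>
<property>
  <name>db.fetch.schedule.adaptive.max_interval</name>
  <!-- never wait more than 90 days between fetches -->
  <value>7776000</value>
</property>
```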

If a page was modified, the Adaptive Fetch Schedule will store the last fetch time as the last modification time. Nutch will use that information in the If-Modified-Since header of the HTTP request of the next fetch. If the web server supports it and the page has not changed since, it will return only a 304 code. Note that there is a bug in Nutch 1.0 that prevents this from working properly. I have reported the bug and it will be fixed in Nutch 1.1. You can use the trunk in the meantime.

How can Nutch detect whether a page has changed? Each time a page is fetched, Nutch computes a signature for it. At the next fetch, if the signature is the same (or if a 304 is returned by the web server because of the If-Modified-Since header), Nutch knows the page was not modified. By default, the signature of a page is built not only from its content, but also from the HTTP headers returned with the page. So even if the content of a page has not changed, if an HTTP header differs (like an ETag or a date), the signature changes. To solve that problem, there is the TextProfileSignature class. It is designed to look only at the text content of a page to build the signature. To use it, you need to set the db.signature.class property to org.apache.nutch.crawl.TextProfileSignature.
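In nutch-site.xml, that setting looks like:

```xml
<property>
  <name>db.signature.class</name>
  <value>org.apache.nutch.crawl.TextProfileSignature</value>
</property>
```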

A word about the setting db.fetch.schedule.adaptive.sync_delta. I set it to false for my crawls because I have not been able to really understand what it is good for. As I described earlier, the next fetch time is computed by adding a dynamic interval to the last fetch time. But with this setting set to true, the interval is applied to a reference time located between the last fetch time and the last modification time. If someone can enlighten me about the usefulness of this, please do!