Over the last year or so I have reduced the number of blog posts I write. This is partly due to changes in my work and schedule. I have also found that I cannot keep up with moderating the comments on WordPress, and I am getting a lot of spam.

Of late I have become fanatical about learning about aircraft: following aircraft on flightradar24, keeping track of the tail numbers I flew on, asking questions and all that. I even check the on-time performance of any flight I book or take before I plan my journey!

I have been working on understanding how IPython works - the kernels, the clients and so on. I have managed to figure out how the zmq client and server mechanism works, and how it makes it so simple to add so many types of clients. It's really awesome.

Nowadays the emphasis seems to be on developer productivity - you have tools and libraries built by people who know how to do it right, these are open source, and there is a community to help you out. Or there is a large corpus of previous projects in the company that you can simply reuse and save a lot of work. All you need to do is get the libraries, follow the examples and you should be good to go! You should be able to build castles out of thin air!

Web scraping has been around for a long time, and there are many libraries to get you started. The whole process is very standardised. Scrapy (http://scrapy.org/), one of the easiest to use Python web scraping tools, is even available on the cloud - http://scrapinghub.com/.

A while ago I had posted about issues with poll() vs poll(timeout) on the Java concurrent blocking queues, and issues with the bucket sizes of concurrent hash maps. Those solutions worked well - but the issues with locks seem to love me, so they are back. And this time it is on the other end of the operation - the offer() and offer(timeout) calls.
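
The same non-blocking vs timed distinction exists in Python's queue.Queue, which makes for a compact illustration. This is a rough analog, not the Java code from the post, and the try_put helper is my own:

```python
import queue

def try_put(q, item, **kwargs):
    """Attempt a put, returning True on success and False on queue.Full."""
    try:
        q.put(item, **kwargs)
        return True
    except queue.Full:
        return False

q = queue.Queue(maxsize=1)
assert try_put(q, "first") is True                 # succeeds; queue is now full
assert try_put(q, "second", block=False) is False  # like offer(): fails immediately
assert try_put(q, "third", timeout=0.1) is False   # like offer(timeout): waits, then gives up
```

The timed variant is usually the safer default: it gives a waiting producer a chance to succeed without hanging forever.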

We all have given estimates for development work - and we all have either missed them at some point or been asked why it takes so much time to get something done. How we respond to these questions varies - but in all cases the common theme is that there were either scope changes, or unexpected issues and defects, or integration problems - and we always promise to do better next time.

At some point in our daily lives we have all come across situations where we felt we should remember and stick to our monthly budgets better - this is by far easier than tracking individual expenses - but it still requires that we remember numbers off the top of our heads.

I have been writing code on the job for over 8 years now, and have written many thousands of lines of code in Java - partly because Java is verbose, and partly because of the size of the problem being solved and the size of my deliverables. I have also tried my hand at Python where I can get more done in fewer lines of code, and Python frameworks like Django where I can get a basic website up in fewer lines of code than anywhere else.

Big Data - large data sets on the order of petabytes - is a buzzword you must have heard each time you hear about the next big internet website or social network or e-commerce website. Amazon, Google, Facebook, Netflix and many more large internet corporations have proved time and again that Big Data, and analysis of these data sets, can provide the critical business intelligence that will take you to the next level and make you the absolute leader. Great!

Most of us have worked on applications that are small enough to be deployed to the user's desktop, and also on applications that are deployed to servers. The latter could be web applications deployed to a web server - in this case the code is the same across all servers. There can be cases where many components use the same code base but are distinct enough that you can put a different version of the code on each component. But generally, the way the code is deployed depends on the nature of the application. And depending on the deployment, there can be some issues with how you manage the code.

If you have ever worked on any sort of adapter - a piece of code that takes entities from one system and makes them available to a different system - then you definitely must have come across a situation where the two systems do not support the same set of operations, or mutations, on the entity. There will be strong reasons for this, and since they are two different systems, there will be another thousand reasons why each was built the way it was built.

A few weeks back, when I was looking at how I could build an asynchronous task execution system for a Django app, I came across Celery, and then I found something called Kombu (http://kombu.readthedocs.org/en/latest/), the message passing framework that Celery uses.

Open development process - the Utopian dream where you are building on a software platform or using a product and you find a bug; you check out the source code, figure out what to fix and how; make the fix, add tests, create a patch and submit it. It gets approved and pushed out, and you are a happy developer! You fixed a problem, you have a contribution to show and you earn brownie points! This is developer heaven.

What is the significance of the transient keyword in Java? If you know the answer, good! You are a person who uses it a lot, or a person who has read about it very recently. If it seems like a word from a half remembered dream, well, don't worry, you have company. I was, am, and will be confused if you ask me about this in an hour. It is one of those things that I learnt but never had to use - mainly because I never worked on code that required me to worry about how my objects were serialized. I could delegate that to the libraries.
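
In Java, transient marks a field to be skipped during serialization. For the Python-minded, a rough analog (not the Java mechanism itself, and the class here is entirely hypothetical) is excluding a field from pickling via __getstate__:

```python
import pickle

class Session:
    def __init__(self, user, password):
        self.user = user
        self.password = password  # treat this like a 'transient' field

    def __getstate__(self):
        # Copy the instance dict and drop the field we don't want serialized,
        # much as Java's transient keyword excludes a field from serialization.
        state = self.__dict__.copy()
        del state["password"]
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.password = None  # 'transient' fields come back with a default value

restored = pickle.loads(pickle.dumps(Session("alice", "s3cret")))
# restored.user == "alice", restored.password is None
```

As in Java, the point is the same: fields that are sensitive or derivable should not travel with the serialized form.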

We all know data, and we all know consistency of data when dealing with transactions. There is another aspect of data - temporality, meaning data at a given point in time. What is the value of something now? And what was its value as of yesterday morning? I haven't worked much with temporal data, but I have used a few applications that provided it - in their own ways. I was reading through the new Google research paper on Spanner - their globally distributed, time-aware database - and came across the TrueTime API. This forced me to think about the temporality of data and how important it is.

Have you ever been in a situation where a connection to a resource was lost and your application either did not tell you about it, did not try to reconnect, or ended up in a mess trying to reconnect? How does that feel? You might have wished for something that could automatically recover, so that you don't need to intervene, or at least avoid the manual recovery work that comes after a restart.

On my day job I build applications that need to get data in and out fast to the other services on our distributed architecture - speed is the key here, and so is the ability to reuse existing code to build new services. We are more than happy to build something that is configuration driven and can be easily re-purposed and deployed for another requirement by just changing the XML files. All this leads to a design that is flexible, separates out responsibilities, distinguishes interface from implementation and, more importantly, is layered. The layer that consumes does not enrich, and the layer that enriches does not publish back again.

Distributed computing and parallel computing used to be things I considered very high tech stuff that I was not working on. But over the years I figured out that I was working on some of these - without knowing it. I first realized that I was building distributed systems when I got to work on a redesign of a messaging layer at an investment bank - the number of components that touched the messages, and the whole way in which we distributed the logic and load across a set of services, made me realize that what I had worked on previously was also something similar - only that I had called it Service Oriented Architecture (SOA).

Every so often, I hear someone saying or writing about how a particular solution seems wrong and how, given a chance, they would design it properly and show how it should be done. My reaction - bravo! If you do get a chance to do the thing cleanly again, definitely do it. Until, of course, your solution has evolved enough and there is a new system or a new guy claiming the same thing yet again.

I decided to go ahead and try using the cache that I built for some scenario that I might expect to see in real life. Part of the work I do daily demands that I start one thread to create some data that goes into a cache or a messaging layer and then another thread in the same application consumes it - pretty basic multithreading in Java and nothing really special.

Every now and then I try to see if I can take something that I built in Java and build it in Python. I had been thinking of building a socket application in Python using the Twisted framework - something other than the basic echo server. Finally I did it today, by building a simple, rudimentary cache with just PUT and GET commands - not even a delete. I admit it is lazy of me, but it illustrates the simplicity, and it can easily be built on top of.
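
Setting the Twisted networking aside, the heart of such a cache is a tiny command handler over a dict. This is my own illustrative sketch, not the post's code, and the command grammar (PUT key value / GET key) is an assumption:

```python
class SimpleCache:
    """A rudimentary PUT/GET cache; the wire protocol here is illustrative."""

    def __init__(self):
        self._store = {}

    def handle(self, line):
        # Split into at most 3 parts so PUT values may contain spaces.
        parts = line.strip().split(" ", 2)
        cmd = parts[0].upper()
        if cmd == "PUT" and len(parts) == 3:
            self._store[parts[1]] = parts[2]
            return "OK"
        if cmd == "GET" and len(parts) == 2:
            return self._store.get(parts[1], "NOT_FOUND")
        return "ERROR"

cache = SimpleCache()
cache.handle("PUT name python")  # returns "OK"
cache.handle("GET name")         # returns "python"
```

In a Twisted protocol class, handle() would be called once per received line, with the return value written back to the client.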

Oftentimes when I set out to do something new on the job, I hear a simple requirement over and above the actual business function - keep it generic enough that it can be reused. And I set about achieving the mythical generic solution that can be reused again and again for a variety of similar use cases. Do I succeed? That depends on how you define success. Given that my hands are tied by an existing platform, and the solution has to fit into it rather than be a rogue process, I make compromises, and we end up with something that looks much better than hard coding but also cannot be directly reused without small changes. The closest I got was a component that takes JavaScript functions in configuration, so I can use these scripts to tweak behavior as I see fit.

Yesterday I wrote a post about using the Proxy class in Java to create a trivial mock framework - creating the same thing in Python takes even fewer lines of code. The generic proxy can be implemented very easily as shown below.
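
A minimal sketch of such a generic proxy, intercepting attribute access with __getattr__ (the class and attribute names here are illustrative, not necessarily the post's original code):

```python
class RecordingProxy:
    """Wraps any object and records each method call before delegating."""

    def __init__(self, target):
        self._target = target
        self.calls = []  # names of methods invoked through the proxy

    def __getattr__(self, name):
        # Called only when normal lookup fails, i.e. for anything
        # not defined on the proxy itself - so it forwards to the target.
        attr = getattr(self._target, name)
        if not callable(attr):
            return attr

        def wrapper(*args, **kwargs):
            self.calls.append(name)       # record, like a mock would
            return attr(*args, **kwargs)  # then delegate to the real object

        return wrapper

proxy = RecordingProxy([])
proxy.append(1)
proxy.append(2)
# proxy.calls == ["append", "append"]; the wrapped list is now [1, 2]
```

This is the Python counterpart of java.lang.reflect.Proxy with an InvocationHandler: one interception point, any target object.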

Most of us, if not all, have written unit tests for our code using JUnit or comparable tools. In most cases we are in control of what is being tested and we can provide all the inputs that are needed to test the scenario. But then there are cases where there are external factors or classes that cannot be instantiated for tests and we need to find ways to simulate them - the suggested approach is to use mock objects.

I am a core Java developer, and have been one for about 8 years now, and when people come to me and ask how much more memory my application will need, or complain that Java applications eat up memory on their boxes - I just give them a dry smile, not wishing to comment or explain why my applications are no different from theirs. C++ or Python applications can and will use memory too, and they have memory leaks as well. Java is not the only memory hog out there - yes, we demand our pound of flesh up front in the form of heap space, but we live within it.

These days application scalability is an implicit requirement in whatever you build - you might serve a few hundred users on day 1, but as time goes by everyone expects your code to deliver for thousands of users, or maybe for a few users with near real time performance. Even a second of delay is not tolerated. This applies not only to web applications like Twitter or Facebook but also to enterprise applications used for boring back office processing.

I have been working with middleware systems for the last 8 years, building and using them in many ways. And even after all this, I must say that there is a lot of confusion in my head as to what a messaging platform must offer and what it should do. Confusion not because I don't know my stuff, but because you can solve the problem in more ways than one.

This post is not about how to use Python decorators - there are many posts on the internet about that, and one in particular is a favorite of mine, which faithfully shows up on page one of my Google search on this topic -

Disclaimer - I am not going to reveal here that messaging systems are built from a secret alien technology that none of us knew about. We all know that messaging applications are built using simple sockets over TCP or UDP, some mechanism to represent queues or topics, some storage on the back end to persist messages between restarts, and a means to subscribe to data and receive callbacks. That is generally all there is to a messaging system.

If I told you that a person is visiting you, gave you no information about him, and asked you to prepare a list of things for him to do while he was visiting, would you be able to do a good job of it? You might actually get a so-so result, but the outcome of whatever you do will be far from good.

Multithreaded programming is not rocket science, but it is still difficult to get right. Anyone who has built a moderately complex solution with more than two threads knows how quickly things can get out of hand. In fact, it is possible to get into a mess even with two threads.

If you walked past most programmers these days and asked what it is that they were coding for, the likely answers would be - business requirement, use case, user story, bug fix - maybe a few more, but very few of them would say that they are coding to solve a problem. You may ask if it really matters; I think it does impact how we think about the solution.

Electronic Trading or Computer Trading - whichever name you use for it - is how most financial markets work these days. Not just stocks or cash products; even commodities are traded electronically. All this is driven by huge computer systems that work very fast and very efficiently and handle huge volumes at large scale.

Before you decide to pooh-pooh this post and call me someone who has no idea how great distributed systems are - forget that idea. I am not talking about all the social networks or the high volume websites out there; I am not talking about the oh-so-agile and oh-so-great enterprise software that you built last night. I have spent all my working life building messaging layers that facilitate distributed systems, and I still swear by them. I am just going to talk about a different thing here.

If you tell me about unit tests and the importance of having those to build a good application, my emotions swing between two extremes - one where I fantasize about having the perfectly unit tested software, and two where I want to vent my frustration on those tests being broken - everything in between rarely crosses my mind.

We all need to measure time, whether we are late for an appointment or want to decide how much longer we can laze before we need to get going. In our daily life, thankfully, we only deal with minutes and hours, rarely with seconds. This makes it easy. But it is a different story when we deal with computers and software - we need to measure in milliseconds and microseconds. Some applications, like low latency market data for financial institutions, demand nanosecond precision in this fast paced world of electronic trading.
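
In Python, for instance, time.perf_counter_ns() (available since Python 3.7) exposes a nanosecond-resolution monotonic clock suited to such measurements. The workload below is just a stand-in:

```python
import time

start = time.perf_counter_ns()
total = sum(range(1_000_000))  # placeholder work to time
elapsed_ns = time.perf_counter_ns() - start

print(f"summed to {total} in {elapsed_ns} ns ({elapsed_ns / 1e6:.3f} ms)")
```

Note that nanosecond resolution does not guarantee nanosecond accuracy - the clock's actual precision depends on the OS and hardware, which is exactly why low latency shops obsess over their time sources.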

No matter what technology you use, you always have to work with files - source code, compiled binaries, configurations etc. Files are ubiquitous when you work with computers. But have you ever suggested to someone that you should probably store certain data in a file instead of that funky-new-storage-application because it will be simpler? And having done that, have you ever heard them say - 'who uses files anymore?'

I first heard about NoSQL databases in an interview, when an i-am-the-dude developer asked me about my experiences with NoSQL. I had no idea, and later looked them up on the internet and found that they are key-value stores - like a hash map, or like Berkeley DB, used to store objects. In fact I had worked on building a huge messaging platform where we used Berkeley DB JE to do just that - store Java beans representing messages so that we could easily reconstruct them.

Java provides LinkedBlockingQueue as part of the standard library since Java 5. This is a very easy to use blocking queue for sharing data between two threads without running into concurrency issues. I have used it in many use cases where a producer creates a large number of objects in a very short period of time and a set of consumer threads processes them. There have also been cases where the producer is free to create objects at will, but the consumer controls how they are delivered upstream and needs a queue to hold things in between.
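
Python's queue.Queue plays the same role as LinkedBlockingQueue. A minimal producer/consumer sketch of the pattern described above (my own analog, using a None sentinel to stop the consumer) might look like this:

```python
import queue
import threading

q = queue.Queue()   # unbounded, like new LinkedBlockingQueue<>()
results = []

def consumer():
    while True:
        item = q.get()        # blocks until the producer offers something
        if item is None:      # sentinel: producer is done
            break
        results.append(item * 2)  # stand-in for real processing
        q.task_done()

t = threading.Thread(target=consumer)
t.start()

for i in range(5):
    q.put(i)    # producer creates objects at will
q.put(None)     # signal completion
t.join()
# results == [0, 2, 4, 6, 8]
```

With a single consumer the output order is deterministic; with a pool of consumers you would put one sentinel per thread.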

In my last post I described the kinds of everyday business problems that can be solved using map and reduce - even without the kind of computing power that Google or Facebook have. In this post I will show how to implement a map and reduce approach to solve a pseudo-real business problem in Python, using Python's built-in map and reduce functions.
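
As a small taste of those built-ins before the main example, here is a toy map-and-reduce over made-up sales records (the data and field layout are my own, purely for illustration):

```python
from functools import reduce  # reduce moved to functools in Python 3

# Hypothetical business data: (region, amount) sales records
sales = [("east", 120), ("west", 80), ("east", 45), ("west", 60)]

# Map step: extract the amount from each record
amounts = list(map(lambda rec: rec[1], sales))

# Reduce step: fold the amounts into a single total
total = reduce(lambda acc, x: acc + x, amounts, 0)
# total == 305
```

The shape is the important part: a per-record transformation that could run anywhere, followed by an aggregation that combines the pieces.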

Most of us have heard about MapReduce - the Google framework for solving problems in parallel - and about Hadoop, an open source implementation of MapReduce. These tools are used by Google, Yahoo and many other companies to build scalable websites and the like. They all need many nodes to be set up, with the workload distributed across those nodes, to get the results we want.

The first time I heard about Django was actually when I heard about another well known web development platform - Google App Engine. I had read about App Engine somewhere and was itching to try out a test site over the weekend. I am basically a middleware and messaging guy and feel most comfortable with the command line - so I was not fully sure I would be able to build the web pages for my App Engine experiment - but somewhere in the tutorial, while showing the easy to use templates, the presenter mentioned that App Engine supports Django templates. That's how I got to know Django, and I fell in love with it - mostly for the ease with which you can build things, and even more for the admin website.

Any developer will know that it is not possible to code from scratch and re-invent the wheel each time we want to solve a problem. Since most problems fall into generic categories with some custom logic, standard libraries were developed that shipped with the programming languages and helped ease things - the Java standard library and MFC for C++ are some examples. While some languages have required us to install libraries separately, languages like Python have followed the batteries-included approach, where we get everything in the standard install.

We all have heard about, or experienced first hand, situations where an application owner comes back saying that a seemingly small change will take an obscenely long time because it would impact the other components and regression testing will take time. In these cases you will generally also see someone from your team, or above you, start a long winding discussion on why you should have test packs and unit tests, so that you can automatically regression test the whole thing and making a change takes only an unbelievably short time. I am sure that sounded familiar - whichever side you were on.

At least once in our developer lives we have all come across a situation where we need to look at an existing code base, make sense of it, understand exactly what it is doing and fix something. If you are one of the many active contributors to the various open source projects out there, then you might have done this again and again. You look at a project, find an issue, make a fix, send a patch, get it reviewed and earn brownie points when they accept it. If you are a software developer in a large company, then you were probably put in that place when a developer quit or someone had to revive the code of a legacy application.

We all have heard about clustered application servers and databases and how they can replicate data between instances. We have heard about distributed cache implementations like Ehcache and Hazelcast, and even NoSQL databases like MongoDB, that replicate data between all their instances. They do a great job of replicating data efficiently and ensuring data integrity, and they provide different guarantees on data replication and availability.

In my last post I described how a workflow engine works and its various key parts. In this post I will show you how to build a simple workflow engine in Python, using the Django framework.

If you have ever worked on a project that involves building an application for a business process, then you have definitely heard of or used a workflow. A workflow is nothing but a set of steps, with rules that define which step comes after the current one, until you complete all the steps and the result is achieved. So, if you said go to the shop, get a notepad, do your homework on it, and then turn it in for review - that is a simple workflow. This one stops at submitting for review, but if I said that once you get the review result you should go and do something else, then we have added a step in between which requires someone to come in and manually review your work. This is still a workflow, but one that involves a manual approval step.
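
The step-and-rules idea can be sketched in a few lines of Python. This is a toy model of my own, not a real engine; the step names echo the homework example:

```python
class Workflow:
    """A workflow as an ordered list of steps with a current position."""

    def __init__(self, steps):
        self.steps = steps
        self.position = 0  # start at the first step

    @property
    def current_step(self):
        return self.steps[self.position]

    def advance(self):
        # The 'rule' here is simply: next step in the list, stop at the end.
        if self.position < len(self.steps) - 1:
            self.position += 1
        return self.current_step

wf = Workflow(["get_notepad", "do_homework", "submit_for_review"])
wf.advance()
wf.advance()
# wf.current_step == "submit_for_review"
```

A real engine replaces the linear list with per-step transition rules, and a manual approval step is just a step that waits for human input before advance() is allowed.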

Let's admit it - writing an application in Java takes many more lines of code and more configuration than in other languages like Python or Ruby. However, Java has been around for many years, and it will stay around for many years simply because of the number of applications in various enterprises built on it. Of late, though, with the new kids on the block like Rails, Django and all the rapid development paradigms out there, poor Java programmers do feel left out. Of course we have Groovy, which is born out of Java, and a couple of similar options, but we have to stick to Java and the various Java frameworks like Spring, Hibernate etc. for our day jobs.

We all have heard about open source software, the philosophy behind it, how it came into being and so on. The general opinion is that open source software is free, and we can see the source code and change it to fit our needs - perfect! No more paying yearly licence fees! Then there are problems like getting support, and who to sue when your million dollar system built on open source technologies fails - but that is a different story.

Previously I blogged about how to build a simple grid computing framework using the Python standard library. I have recently been working on converting it into a functional library that anyone can use very easily. That is now ready and available for you all. I have not enforced any licence at this point and it is generally free, but I would encourage people to consider donating some amount to help me build the resources required to maintain this as a long term project.

We all have been through those phases in our lives as developers where we religiously followed coding standards and used the most appropriate design patterns. Before we got to that stage we were in a stage where we did not care what these meant, or did not even know them; and sometimes we find ourselves in a situation where we cannot use them for some reason, no matter how much we want to. In all these cases the job got done and we got paid. So, do we really need these? Do they really matter? Or are they just instruments of torture that more experienced and opinionated developers use on juniors? Is it possible at all to write a piece of code that is not coded to standards, has no design patterns and absolutely no documentation, and still works? Works without failing?

Some time back, when I was in India talking to my brother, he mentioned that he could not get himself to work on building something as abstract as software, which mostly takes shape inside the developer's head. He works in accounts and finance and is more used to dealing with numbers and facts that exist on the books. To an extent he is correct - software starts as an idea in someone's head and then manifests itself as a set of applications or services that process or provide data to perform tasks; and in these days of big enterprises spread across the globe using IT to drive business, it helps make money by providing efficiencies.