IntroductionThe author asserts that software of today is built for passing the tests of the QA and not for the rigours of the Production environment. The author provides tips to design systems which will withstand the assaults it will have to face in the Production environments.The author states that most decisions made upfront are the decisions that are the ones that impact the system the most, are most difficult to reverse or change and ironically these are the ones that are taken when the knowledge about the required system is minimal.The author ironically states decrees such as “Use EJB container-managed persistence!”, “All UIs shall be constructed with JSF!”, “All that is, all that was and all that shall ever be lives in Oracle!” are given by architects in ivory towers.

In the stability section the author speaks about how to create and maintain stable systems in this section. The first example the author gives is of an airline company which had the following code:

The author suggests that one should be prepared for as many points of breakages as possible. Tight coupling between systems leads to cascading failures. To avoid this there should be loose coupling between systems. As a corollary calls across systems should be asynchronous. This is not always possible and where possible complicates communication. So one needs to take a proper call on where to have asynchronous processing and where not to.

In chapter 4 the author discusses anti-patterns that lead to failures. The first anti-pattern is that all points of integration are fragile and can lead to failures. It is highlighted that most connections are based on TCP/IP. In TCP IP the first step is a three way handshake to setup the connection between the two systems that need to communicate. The first step is for the requestor to send a SYN packet. This has to be acknowledged by a SYN-ACK packet from the listener and finally the requestor sends a SYN packet to complete the three way handshake and establish the connection. If there is no listener then the failure is quick as the OS responds with RESET packet telling the requestor that its request does not have listener. This is a manageable situation. But if the listener is slow then the request will languish in the listen queue till it is timed out. The typical timeout is in minutes. This means that the requestor can wait for a long time before realising the problem.The classic example of firewall killing the TCP connection between the application server and database server due to long time idle connection is quoted as an example.The next example is about how it is difficult to timeout HTTP Connections in Java.It is stressed again and again that it is better to be cynical than optimistic when developing software. Be prepared for the worst.It is suggested that circuit breakers, i.e. stopping to retry a transaction after a particular number of failed attempts and/or maintaining the status of the underlying layer and deciding to not invoke the layer if there is a problem and timing out after waiting for a reasonable amount of time for the underlying layer to respond are two key mechanisms to avoid cascading problems from one layer to another.The storing of large datasets in the session is highlighted as one of the more frequent ways of running out of memory. It is suggested that either the session be kept light or Softreference be used for storing large datasets to prevent out of memory errors.The author rightly points out that the usage of synchronized keyword can be dangerous in a highly concurrent environment.The advice is also to test third party libraries for breakability.The author coins a new word called “Attack of Self Denial” where an event is published which leads to a flood of requests to the specified application. E.g. the news of a deep discount on a product for a retailer could be a cause for “Attack of Self Denial”. One needs to be prepared beforehand to handle such situations better.A very good suggestion is that if it is not possible to build a shared nothing architecture then limit the number of systems sharing the resources. E.g. instead of sharing the sessions with all the application servers sharing it amongst two application servers so that the replication factor is limited.One key point that the author brings about is most systems “treat the database with far too much trust.” and this is the major cause of problems in most systems. The author illustrates this with an example of how making an unbounded query resulted in continuous crashes at a retailer. The author suggests that always limit the results fetched from the database as a precaution.

The Stability Patterns———————-In this the author lists the patterns that will help the system be more stable.1. Timeout: Use timeouts whenever interacting with a third party, especially when this involves some form of network, even though it may be within a LAN. It is a fail fast pattern to be used along with a circuit breaker, where if a few requests timeout then the resource is marked as down till it is found to be good again. The retry to check if the resource is good can be done at a regular interval suitable for the resource, i.e. delay the retry.2. Circuit Breaker: Akin to the electrical circuit breaker a software circuit breaker prevents the entire software from collapsing under stress by stopping requests to the faulty interface. The users may see errors if this happens to be a crucial interface, but this is better than the user not being able to use the whole system. Typically if an interface frequently times out or fails frequently then the circuit breaker can mark this interface as broken for sometime. After a suitable amount of time it can retry the interface and if found functional it can close the circuit once again enabling the execution of the specific interface. All opening and closing of circuit breaker should be logged and made visible to the operations team so that they are aware of the change in the status.3. Bulkheads: Bulkheads are compartments in a ship which prevent the ship from sinking if there is a damage to the hull. Each bulkhead stops the water from entering beyond it. Similarly use of multiple servers to deploy applications is one form of bulkhead. If the application is compartmentalized so that impact to one compartment does not impact the other is creating bulkheads in application. One example quoted is that of the airlines where ticketing system, flight status systems, flight search system, checkin system could all be deployed separately so that one does not interfere with the other.Another example is if there are two systems which require the same service and if both the systems are critical, it makes sense to have separate setup of the common service for the two systems. Problem access of the common service in one system will not impact the service access of the other system.Grouping pool of thread for specific purpose in a single process will ensure that problem in one thread pool does not prevent the process from servicing other types of requests.The negative side of bulkheads is that it can make optimization of resource usage difficult. One would potentially have to provide more capacity than actually required.4. Steady State: Maintaining a steady state of the systems is very important. Any kind of fiddling with the system for any reason can lead to instability. At the same time to maintain steady state some cleaning up is required. Log files will be generated by the applications and it is important to have a process that will keep removing the log files at the same rate or greater than the rate of generation. Similarly archival of records in a database is important to ensure that the queries on the database continue run consistently.It is important to ensure that one has a finite, controlled number of entries in the in memory cache. Use an LRU or LFU mechanism to keep clearing the cache if it is expected to keep growing beyond known values.5. Fail Fast: Quickly failing a request is very important to the health of the transaction. One should upfront have the statuses of all the external systems before beginning to process a transaction and if any of the external system is in a state which will mean that the transaction will fail then it is better to fail the transaction immediately. This will ensure that no compute power is wasted in processing doomed transactions.6. Handshaking: It is important to have handshaking between any two systems so that the server process has the ability to state that it has its hands full and cannot respond and the client does not waste time trying to make a request which is going to take the server a long time to respond. This helps in failing fast.7. Test Harness: A test harness should be able to emulate bizarre problems, like accepting a connection, but not sending any response, resetting the connection without ever accepting it and so on, not responding for a very very long time, send out large amounts of data as response. Testing the against such a test harness will help test how the system will behave under unexpected conditions.8. Decoupling Middleware: A middleware typically helps shielding the requestor from the nitty-gritties of the server and also from the failures of the server. It helps decoupling two systems while integrating them.

The author very rightly concludes that “Sadly, the absence of a problem is not usually noted. You might be salvaging a badly botched implementation in which case you now have an opportunity to look like a hero. On the other hand, if you’ve done a great job of designing a stable system from the beginning, it’s unlikely that anyone will notice your system’s lack of downtime. That’s just the way it is. Deliver an unbreakable system, and users will surely go on tocomplain about something else. That’s just what users do. In fact, with a system that never goes down, the users will most likely complain that it’s slow. Next, you’ll look at capacity and performance and how to get the most out of your resources.”

In a case study it is illustrated how usage of sessions killed the application. The bots and the regular users increased the number of sessions far beyond what the system could handle and site crashed. This was later resolved by supporting session through URL rewriting so that no new sessions are created by the bots and also by creating a throttling mechanism to control the total number of sessions in the system. The key learning is that the performance test only tested for happy paths and never for situations like bots hitting the site.When planning for capacity it is important to ensure that the software written is optimal and has minimum wastage. If this is not done it would lead to increasing costs of resources required to run the application. As an example if an HTML page has 1K of junk data, this will translate into 1GB of extra bandwidth usage if there are a million requests to this page. The cost of resources multiplies as the usage of the application increases.

Some good patterns to follow are:1. Pool resources, size them properly and monitor them.2. Use caching, limit the maximum memory that can be used by the cached objects and monitor the hit ratio.3. Precompute whatever is possible and recompute only when absolutely necessary.4. Tune Garbage Collection

Some Network points1. Servers in production tend to be multi-homed and it is important to bind the applications to the right home to prevent security issues.2. Given the above scenario it becomes important to correctly make the network routing scenarios.3. Use Virtual IPs where native clustering of applications is not possible. Applications need to be written keeping in mind that this will be the case in production systems.

Some Security aspects:1. Follow the principle of “least privilege”. This states that every action should done with the least privilege required to execute the action. Rnu each application with its own user so if one application is compromised it is only that application and none of the others.2. Ensure that the passwords use to access other services are secured properly. Ensure that the memory dumps of the processes will not reveal the passwords. Keep the passwords away from the installation directory.

Some Availability Aspects1. The cost of the a system grows exponentially with the required availability. Availability should be defined realistically, not idealistically.2. The SLAs should be well defined and measurable. SLAs should be defined by features and dependent on 3rd party SLAs available. The location from where the application is accessed also matters.3. Load Balancing and Reverse Proxies should be used to balance the load across the multiple servers and across the various tiers.4. Clustering will be required in scenarios where the servers need to communicate with each other to exchange some data.

To ensure reliability of the system the topology of the QA environment should be same as that of the Production although the capacity may be far lower.Configuration of the application and environment related configuration should be separated out.Application should be able to announce if it has not started properly.Provide command line options to configure the systems. GUI can be used when sufficient time is at hand and automation is not required.

Every system needs to be transparent, i.e. it needs to show what it is using and what it is doing. Without this information it is very difficult to manage the system. While it is necessary to know the status of the individual parts, it is important to also know the status across all the parts of the system. This helps in analysing any problem that is manifesting in the system.It is not necessary to log the stack trace of a business exception like a validation error which states a mandatory parameter was not entered. It is vital to log the stack trace in case a non business exception occurred.It is important to have a network separate from the production data network for monitoring traffic.A good monitoring system provide visibility to to business outcome and not just technical parameters.

A very good comparison between crystals and tight coupling in software design. “A cluster of objects that can exist together only in a tight collaboration resembles a crystal in a metal. The objects stay together in a tightly bound relationship, just as the atoms in a crystal are tightly bound. In metal, small crystals mean greater malleability. More malleable metals recover from stress better. Large crystals encourage crack formation. In software, large “crystals” make it harder to change the software. When objects in one grain participate in multiple collaboration patterns, they bridge two crystals, forming a larger grained crystal—further reducing the malleability of the software.There is no limit to how far this region of tightly bound crystals can spread. In the extreme case, the crystal grows until it is the boundary of the application. When that happens, every object suits exactly one purpose to which it is supremely adapted. It fits perfectly into place and ultimately relates to every other object. These crystal palaces might even be beautiful in a baroque sort of way. They admit no improvement, in part, because no incremental change is possible and, in part, because nothing can be moved without moving every other object. These tend to be dead structures. Developers tiptoe through crystal palaces, speaking in hushed tones and trying not to touch anything.”