Discussions

For some folks, when the app goes down for whatever reason, it means coming in early tomorrow morning. But for others, it means,
“Mom, Dad, I’m so proud of you and happy for you on this, your 50th wedding anniversary — and I’m truly glad you brought me into this world, but since you raised me to be such a responsible and devoted member of society, you’ll understand me when I say, ‘I’ve gotta go — our app is down.’”
We’re wondering what % of apps out there are treated like this, and what teams do to ensure they stay available at all hours of the day, no matter what time zone the users are in – so we’re asking 2 quick questions to get to the root of it. Answer them here:
http://spreadsheets.google.com/viewform?formkey=dFJpclI2TG5lc2FKZlVzM2JMeEUtZmc6MA..
ZeroTurnaround will announce the results, and give out 5 personal licenses of JavaRebel to winning respondents at the end of next week.

In the enterprise space, state is just as important as behavior, and it is not so easy to tackle in or outside of a container, which is why most systems occasionally go offline (at least internally) to sync changes across different data-processing domains (messaging, databases, …). Even if you can isolate classes from each other, you cannot do the same for the external side effects (of multiple and different versions). This last item seems foreign to many OSGi disciples, who apparently do not understand one of the main reasons why operations stop and restart processes when applying a managed change: to be sure the process can indeed be restarted when something goes terribly amiss in production outside of the container (hardware). It is much easier to roll back at the time the change is applied.

William -
I know you don't agree with me on this one, but OSGi should be selected for its ability to support modularity in a relatively standardized way, not for its ability to reload JARs (which, judging from Eclipse, it is not very good at ;-)
Peace,
Cameron Purdy | Oracle Coherence
http://coherence.oracle.com/

In the enterprise space, state is just as important as behavior, and it is not so easy to tackle in or outside of a container, which is why most systems occasionally go offline (at least internally) to sync changes across different data-processing domains (messaging, databases, …).

I don't see the relevance to HotSwap in production. It's true that OSGi guys think they can solve upgrades by modularization, which I find unlikely. But we think that we could solve e.g. 80% of upgrades without any application reengineering and then still go offline in the 20% of cases where external/too complicated stuff is involved. Plus we _can_ do state/user session migration instantly.

Again, just in case you missed the obvious (which would not be the case if you had some degree of operations experience outside of a developer role, especially in an ITIL-based enterprise environment):

one of the main reasons why operations stop and restart processes when applying a managed change: to be sure the process can indeed be restarted when something goes terribly amiss in production outside of the container (hardware). It is much easier to roll back at the time the change is applied.

The biggest cause of outages is not planned ones, and even planned outages are typically done to correct software issues from previous unplanned outages.
By the way, I would love to see how you propose the test team (checkpoint/gatekeeper team) actually test a hot-deployment change. For it to be any real sort of test, they would have to have real-time replicas of production (deployment, state, workload), especially if you are not going to restart processes; after all, everything is going to work, is it not?
Most enterprise applications are clustered and grid (state) based, with lots of redundancy and fault tolerance built in. Outages happen only at the runtime-instance level, rarely at the service level, unless of course the outage is unplanned.

+1, that's what I was trying to figure out. In our case we do rolling restarts, and in a cluster (even as small as two nodes) the customer never experiences any service outage. Rolling restarts cover about 95% of cases, but at times there are code/schema changes where even the small window of version mismatch could cause data corruption and other issues; in those cases we do incur some unavailability.
Are you proposing that you can take care of these few and far between cases, when all nodes in a cluster must be updated synchronously?
Ilya

+1, that's what I was trying to figure out. In our case we do rolling restarts, and in a cluster (even as small as two nodes) the customer never experiences any service outage. Rolling restarts cover about 95% of cases, but at times there are code/schema changes where even the small window of version mismatch could cause data corruption and other issues; in those cases we do incur some unavailability.

Are you proposing that you can take care of these few and far between cases, when all nodes in a cluster must be updated synchronously?

Ilya

The only solution to this that I can think of is to bring down some subset of the servers (say 50%) and update them, but not allow traffic to go to them. Once these are updated and back in service, you direct all new requests to the updated servers; only existing conversations go to the older servers. Once those conversations end, you bring down the remaining servers and update them.
In general I think that there are probably cases where a full outage can't be avoided.
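The drain-and-cutover routing described above can be sketched in a few lines. This is a hypothetical illustration, not any real load balancer's API: new sessions are pinned to the updated pool once cutover begins, existing conversations stay on the old pool until they end, and the old pool can be taken down once it is empty.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of drain-and-cutover session routing (illustrative names only):
// existing conversations keep their original server pool, new sessions
// go to the updated pool, and the old pool is retired once drained.
class DrainingRouter {
    private final Map<String, String> sessionToPool = new HashMap<>();
    private boolean cutover = false; // once true, new sessions go to "new"

    void beginCutover() { cutover = true; }

    String route(String sessionId) {
        // A session seen before keeps its pool (session affinity);
        // a new session is assigned based on the cutover flag.
        return sessionToPool.computeIfAbsent(
                sessionId, id -> cutover ? "new" : "old");
    }

    void endSession(String sessionId) { sessionToPool.remove(sessionId); }

    boolean oldPoolDrained() {
        return cutover && !sessionToPool.containsValue("old");
    }
}
```

Note this only works if a "conversation" is cleanly bounded; as discussed below, state shared across versions breaks the model.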

The only solution to this that I can think of is to bring down some subset of the servers (say 50%) and update them, but not allow traffic to go to them. Once these are updated and back in service, you direct all new requests to the updated servers; only existing conversations go to the older servers. Once those conversations end, you bring down the remaining servers and update them.

Even then, what if those "conversations" affect state shared across the old and the new versions of the application? For example, what if the new version needs to change the schema of the underlying database in a way that the old version of the application will no longer work (or at least will no longer work correctly)? I see these types of challenges all the time, and there aren't easy answers; regardless of what vendors say, there's usually an awful lot of very hard work going on to limit or eliminate the downtime from the POV of the end user.
Peace,
Cameron Purdy | Oracle Coherence
http://coherence.oracle.com/
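One common way to soften the schema problem described above is an expand/contract migration: add the new column first, have the new code read it with a fallback to the old one, backfill, and drop the old column only after the old app version is gone. A minimal sketch, with a `Map` standing in for a database row and all names purely illustrative:

```java
import java.util.Map;

// Expand/contract sketch: during the migration window both the legacy
// "name" column and the expanded "full_name" column may exist, so the
// new version reads the new column with a fallback to the old one.
class VersionTolerantReader {
    static String fullName(Map<String, String> row) {
        String expanded = row.get("full_name"); // written only by new code
        return expanded != null ? expanded : row.get("name"); // legacy
    }
}
```

The old application version keeps working during the window because nothing it depends on was removed; the contraction step happens only after the rolling update completes.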

Even then, what if those "conversations" affect state shared across the old and the new versions of the application? For example, what if the new version needs to change the schema of the underlying database in a way that the old version of the application will no longer work (or at least will no longer work correctly)? I see these types of challenges all the time, and there aren't easy answers; regardless of what vendors say, there's usually an awful lot of very hard work going on to limit or eliminate the downtime from the POV of the end user.

I understand, and it gets at what I consider a fundamental truth of interface (in the general sense) design: the interface is forever. Sure, it's possible to get everyone to abandon an old interface, but it's something you can't count on or control without leaving clients high and dry. At best you'll have really problematic constraints to deal with. One of the problems I've seen lately is that people say they'll just refactor the interface, so it's OK if it's not correct initially; but they ignore that just because they refactor, it doesn't mean the client will.

One of the problems I've seen lately is that people say they'll just refactor the interface, so it's OK if it's not correct initially; but they ignore that just because they refactor, it doesn't mean the client will.

Refactoring is the ultimate cowboy approach to software:
* First, it assumes you have the whole code base in your grasp;
* Second, it precludes others from working on the same code-base;
* Third, as you pointed out, the ripples from the changes are permanent.
So the cost of refactoring code grows in proportion to the lack of access to (and control over) the entire code-base (including all the "clients"), and to the team size.
(Nonetheless, it's often still worth doing.)
Peace,
Cameron Purdy | Oracle Coherence
http://coherence.oracle.com/

* Third, as you pointed out, the ripples from the changes are permanent.

So the cost of refactoring code grows in proportion to the lack of access to (and control over) the entire code-base (including all the "clients"), and to the team size.

(Nonetheless, it's often still worth doing.)

I love refactoring. I just think that any time you are building abstractions to be used by other parties, you have to assume that they're forever. It's worth taking extra time to try to get it right. You probably won't, but the cost of modifying these interfaces is exponentially higher than modifying things that do not escape your sphere of influence. Any fixes that can be made before your interface is published are going to pay off in spades.
I used to work on B2B systems and one of the things that I learned was you can't tell Apple (Amazon, Walmart, ...) when or how to modify their client to your webservice. And in general, the difficulty of modifying an interface for a remote service increases exponentially with the number of parties that are clients to it.
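Since the thread is Java-centric, one concrete way to evolve a published interface without stranding clients you don't control is a default method: old implementations keep compiling while new callers get the richer signature. The names below are hypothetical, invented for illustration:

```java
// Evolving a published interface without breaking existing implementers:
// the new overload is a default method, so a client compiled against the
// original single-method interface continues to work unchanged.
interface PriceService {
    long priceInCents(String sku);

    // Added in a later release; implementers inherit this fallback
    // rather than failing to compile against the new interface version.
    default long priceInCents(String sku, String currency) {
        if (!"USD".equals(currency)) {
            throw new UnsupportedOperationException("only USD supported");
        }
        return priceInCents(sku);
    }
}
```

This only helps with in-JVM interfaces, of course; for a remote B2B web service, the equivalent discipline is additive, versioned message formats.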

lol - I have to wonder what idiot would work for a company that would mismanage mission-critical apps such that I have to come in on my parents' 50th wedding anniversary.
Did this really happen to someone? Makes me wonder why we don't have labor unions for software engineers.

Hello,
one would think (or hope) that any corporation that operates 'mission critical applications' has staff onsite in case something breaks, not just on call.
Regards,
Dennis
Java Logging, Test Management

Doesn't this all come down to the original requirements specification and testing of the app/infrastructure? If you've captured uptime requirements from the start and tested to make sure you meet these requirements before release, then you should minimize the risks here. If you want to release without proper software testing and test management, then don't you deserve what's coming to you?
