I can see the following errors in /var/log/eucalyptus/cloud-error.log:

0:36:55 [log:653891498@qtp-1693378617-9] ERROR /register
java.lang.RuntimeException: javax.persistence.PersistenceException: org.hibernate.exception.JDBCConnectionException: Cannot open connection
at com.eucalyptus.util.TxHandle.<init>(TxHandle.java:46)
at com.eucalyptus.util.EntityWrapper.<init>(EntityWrapper.java:98)
at com.eucalyptus.util.EntityWrapper.<init>(EntityWrapper.java:91)
at edu.ucsb.eucalyptus.util.EucalyptusProperties.getSystemConfiguration(EucalyptusProperties.java:117)
at edu.ucsb.eucalyptus.admin.server.Registration.getRegistrationId(Registration.java:199)
at edu.ucsb.eucalyptus.admin.server.Registration.doGet(Registration.java:210)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:617)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:717)
at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:502)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:389)
at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:326)
at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:539)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:520)
Caused by: javax.persistence.PersistenceException: org.hibernate.exception.JDBCConnectionException: Cannot open connection
at org.hibernate.ejb.AbstractEntityManagerImpl.throwPersistenceException(AbstractEntityManagerImpl.java:614)
at org.hibernate.ejb.TransactionImpl.begin(TransactionImpl.java:41)
at com.eucalyptus.util.TxHandle.<init>(TxHandle.java:40)
... 24 more
Caused by: org.hibernate.exception.JDBCConnectionException: Cannot open connection
at org.hibernate.exception.SQLStateConverter.convert(SQLStateConverter.java:97)
at org.hibernate.exception.JDBCExceptionHelper.convert(JDBCExceptionHelper.java:66)
at org.hibernate.exception.JDBCExceptionHelper.convert(JDBCExceptionHelper.java:52)
at org.hibernate.jdbc.ConnectionManager.openConnection(ConnectionManager.java:449)
at org.hibernate.jdbc.ConnectionManager.getConnection(ConnectionManager.java:167)
at org.hibernate.jdbc.JDBCContext.connection(JDBCContext.java:142)
at org.hibernate.transaction.JDBCTransaction.begin(JDBCTransaction.java:85)
at org.hibernate.impl.SessionImpl.beginTransaction(SessionImpl.java:1353)
at org.hibernate.ejb.TransactionImpl.begin(TransactionImpl.java:38)
... 25 more
Caused by: java.sql.SQLException: Connection is broken: java.net.SocketException: Connection timed out
at org.hsqldb.jdbc.Util.sqlException(Unknown Source)
at org.hsqldb.jdbc.jdbcConnection.getAutoCommit(Unknown Source)
at sun.reflect.GeneratedMethodAccessor73.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.logicalcobwebs.proxool.WrappedConnection.invoke(WrappedConnection.java:162)
at $Proxy27.getAutoCommit(Unknown Source)
at org.hibernate.connection.ProxoolConnectionProvider.getConnection(ProxoolConnectionProvider.java:81)
at org.hibernate.jdbc.ConnectionManager.openConnection(ConnectionManager.java:446)
... 30 more

HOW FIXED: The fix consists of two trivial iptables calls added to the eucalyptus upstart script. Upstream had these calls in their init scripts, but they were inadvertently dropped when porting Eucalyptus to upstart. These iptables commands ensure that the iptables kernel module (and, most importantly, the IP connection tracker) is loaded and active before Eucalyptus comes up. Without the connection tracker, Eucalyptus will often establish a connection to the database; iptables is then loaded and existing connections are mangled, breaking the connection to the database. The user will see the problem in any one of a number of disguised ways (front end not working, API tools not responding, etc.). All of these problems are due to an inaccessible database. After a while (10-20 minutes), Eucalyptus will reset the database connection. With this fix, the above problems should never happen; Eucalyptus should be back up and running within 1-2 minutes of boot (if not immediately).
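As an illustration only (the exact commands are not quoted in this report), a minimal sketch of such a pre-start addition could look like the following; running any iptables command is enough to force the kernel to load the iptables and connection-tracking modules:

    # Hypothetical pre-start stanza for the eucalyptus upstart job.
    # Listing the tables forces the iptables/conntrack kernel modules to
    # load before Eucalyptus opens its database connections.
    pre-start script
        iptables -L -n > /dev/null 2>&1 || true
        iptables -t nat -L -n > /dev/null 2>&1 || true
    end script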

REPRODUCING THE BUG:
Reboot your UEC (or run: sudo restart eucalyptus). If restarting eucalyptus takes a *long* time, you are experiencing one symptom of this bug. Once upstart thinks that eucalyptus is up, try: sudo wget --no-check-certificate https://localhost:8443 and if this takes a long time, or fails outright, you are experiencing a symptom of this bug. Note that the problem is inherently a race condition, and therefore may not be immediately reproducible. Try rebooting/restarting a few times, and you will likely hit it.
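A rough probe for the race after each restart (the 30-second timeout here is an arbitrary choice, not part of the original report):

    sudo restart eucalyptus
    time wget --no-check-certificate --timeout=30 -O /dev/null https://localhost:8443

A healthy front end answers within seconds; hitting the bug shows up as the wget hanging or timing out.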

REGRESSION POTENTIAL:
I cannot see any possible regression potential. The iptables modules will be loaded eventually anyway; this patch just ensures that they are loaded before Eucalyptus tries to start services.

I also have to kill -9 the remaining eucalyptus-cloud process after the stop. I don't see the same exception in cloud-error.log. In cloud-output.log everything seems normal, then nothing more is logged. euca_conf is run, and nothing shows in the log. Then after some time you start getting tons of:

We're having trouble reproducing this problem on our Lucid systems against eucalyptus-cloud-1.6.2~bzr1124-0ubuntu3. We think the issue is possibly related to some process startup behavior, but it would be great to get more system information, or a procedure by which the issue can often be triggered. Some questions follow, the answers to which will help us try to reproduce:

1.) does this condition appear on a fresh boot (i.e. there are definitely no prior eucalyptus-cloud processes running when a new eucalyptus-cloud process is started)?
2.) does this condition appear when the primary network interface is definitely up and configured?
3.) does this condition appear on 'start eucalyptus-cloud', 'restart eucalyptus-cloud', or both?
4.) do you see the condition when the cloud/walrus/sc are all on the same system or is this a stand-alone cloud service (or some other topology)?
5.) does this happen after a cloud has been configured/working or is it always during the initial setup (during registration of other components)?

If we can trap the problem a bit further, we'll surely be able to find the solution!

We've finally been able to (we believe) reproduce this type of condition on our Lucid machines, and have figured out why it is being triggered on Lucid UEC installations. The Eucalyptus front-end components (cloud, walrus, sc) require that, on startup, the network interface that was used to register components is up and configured (i.e. has the IP address that was used at component registration time). Our init process ensures this by running after the 'network' init script completes, but in the upstart case it looks like the eucalyptus components can start after the network job has started but before it completes. In sum, if eucalyptus starts before the interface is configured, this condition will be triggered and the front end will need to be restarted so that it can go through the initialization process again.

Note that this is a startup-time requirement only: once the system is up, if for some reason the interface goes down and comes back up, the service will stay active. It is only during startup that eucalyptus requires the interface to be alive and configured.

Is there a way to have the eucalyptus upstart scripts wait until the network interface is fully alive before starting the front-end components?

Okay, I talked to Scott about similar upstart situations last week, and I think I have his blessing on what I had thought would be a hacky approach.

Basically, we need to create an additional upstart (pseudo) job that starts on 'started networking' and loops until some condition is satisfied (in this case, an IP address has been obtained), at which point the pseudo job emits an event. The Eucalyptus upstart jobs would simply wait on this event.
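A sketch of what that helper job could look like; the job file name, the eth0 interface, and the ip-address-obtained event name are illustrative assumptions, not the actual packaged job:

    # /etc/init/eucalyptus-ip-check.conf (hypothetical helper job)
    start on started networking
    task
    script
        # Wait until the registration interface (assumed to be eth0 here)
        # has an IPv4 address, then signal the Eucalyptus jobs to proceed.
        while ! ip addr show eth0 2>/dev/null | grep -q 'inet '; do
            sleep 1
        done
        initctl emit ip-address-obtained
    end script

The Eucalyptus jobs would then use 'start on ip-address-obtained' in place of starting directly on the networking events.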

I don't have a current Karmic test environment, since I'm totally focused on Lucid development right now. If I could get one person to confirm this fix on their Karmic UEC, I will upload to karmic-proposed and begin the SRU process (where the fix will again need to be confirmed).

Nick/Boris/Torsten- can any of you guys help verify the fix in the PPA?

Accepted eucalyptus into karmic-proposed; the package will build now and be available in a few hours. Please test and give feedback here. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you in advance!
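For reference, enabling -proposed on Karmic amounts to something like the following (the wiki page above is the authoritative procedure):

    echo "deb http://archive.ubuntu.com/ubuntu karmic-proposed restricted main universe multiverse" | sudo tee -a /etc/apt/sources.list
    sudo apt-get update
    sudo apt-get install -t karmic-proposed eucalyptus-cloud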

I've helped each of you guys either manually fix this bug, or upgrade to the PPA packages where I fixed it in the packaging.

The fix is currently awaiting verification. It would be great if someone could install the -proposed package and add a comment confirming that their issue is fixed, so that this upload can move into -updates.

I have another fix queued for eucalyptus karmic SRU, but I want to get this package moved to -updates before overwriting what's in -proposed now.

I enabled karmic-proposed and installed eucalyptus-*-1.6~bzr931-0ubuntu7.5 on the training cloud in the datacenter. I rebooted the node and front end, and the cloud was operational from the start. I started an instance and ssh'ed to it. All looks good and operational.