I'm looking into this right now. At first glance, there appear to be two failure modes. Some requests never reach the app (the client gets an empty response). Of the requests that do go through, some return 404 errors, and those 404s are returned by the application itself.

The second issue is easier for me to reproduce, so I'm looking at that first.

~2018-10-29 16:20: Anton and I started investigating the issue. We found that the load balancer sometimes returned an empty response and sometimes forwarded the request to the application.

When the application did receive the request, it sometimes responded with a 200, sometimes with a 404, and occasionally with a 500 caused by a datastore error.
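For anyone trying to reproduce this, a simple probe that tallies responses makes the mix of failure modes visible. This is only a minimal sketch; the URL is a placeholder for the deployment's actual endpoint:

```python
import requests

# Placeholder URL; substitute the deployment's ELB hostname and path.
URL = 'http://example-elb.us-east-1.elb.amazonaws.com/config'

tallies = {}
for _ in range(100):
    try:
        key = requests.get(URL, timeout=5).status_code
    except requests.exceptions.RequestException:
        # Covers the "empty response" case where the LB drops the request.
        key = 'no response'
    tallies[key] = tallies.get(key, 0) + 1

print(tallies)
```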

While investigating the load balancer, we found that all 5 machines had been added to the ELB. 3 of them were marked "down", which is expected because they don't serve traffic on port 80.
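For reference, the per-instance health states can be inspected with boto3. A minimal sketch, assuming a placeholder load balancer name:

```python
import boto3

# Classic ELB client; the load balancer name is a placeholder.
elb = boto3.client('elb')
health = elb.describe_instance_health(LoadBalancerName='appscale-lb')

for state in health['InstanceStates']:
    # 'InService' corresponds to "up", 'OutOfService' to "down".
    print(state['InstanceId'], state['State'], state.get('Description', ''))
```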

Minor action item: We should remove the 3 AppScale machines that do not have the load_balancer role from the ELB.

The two machines with the load_balancer role were alternating between down and up according to the ELB. We discovered that this was because the ELB's health check was marking a machine as "down" after 2 consecutive failures (the previously mentioned 404s and 500s) and "up" only after 10 consecutive successes.

~2018-10-29 15:10: I temporarily relaxed the ELB's HealthyThreshold to the minimum permissible value (2) and its UnhealthyThreshold to the maximum permissible value (10). I did this in case the application was making URLFetch calls to itself: I wanted to make sure that, while we debugged the application, the ELB kept forwarding requests to it.
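The change amounted to something like the following boto3 call. This is a sketch: the load balancer name, health check target, interval, and timeout are placeholders, and only the two thresholds reflect the actual change:

```python
import boto3

elb = boto3.client('elb')

# 2 consecutive successes mark a machine "up"; it takes 10 consecutive
# failures to mark it "down" again, so the ELB keeps forwarding traffic
# even while the app intermittently returns 404s and 500s.
elb.configure_health_check(
    LoadBalancerName='appscale-lb',
    HealthCheck={
        'Target': 'HTTP:80/',
        'Interval': 10,
        'Timeout': 5,
        'HealthyThreshold': 2,
        'UnhealthyThreshold': 10,
    },
)
```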

After making this change, the ELB quickly marked the relevant machines as "up", and requests were consistently forwarded to the app. The app still occasionally responded with a 404 or a 500.

~2018-10-29 15:40: We found that 2 of the 4 running AppServer instances consistently responded to /config requests with 404s. The other 2 instances usually responded with a 200, but they would occasionally respond with a 500 due to a datastore error.

We isolated the two problematic instances by keeping their processes running but removing routing to them, so that we could debug them without affecting users.

We found that the two problematic instances had encountered a datastore error while they were starting (in app.lp.initialize_lp). Because of this, Flask never finished registering all of the defined routes, and the app responded to all subsequent requests to /config with a 404.
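As a minimal sketch of the mechanism (the initializer and routes below are hypothetical stand-ins, not the real app code): when an error raised during startup is swallowed before route registration finishes, the process keeps serving, but Flask answers 404 for every route that was never registered.

```python
from flask import Flask

app = Flask(__name__)

def initialize_lp():
    # Stand-in for app.lp.initialize_lp: a startup call to the datastore.
    raise Exception('datastore error')

def setup_routes():
    @app.route('/healthy')
    def healthy():
        return 'ok'

    initialize_lp()  # raises, so the route below is never registered

    @app.route('/config')
    def config():
        return '{}'

try:
    setup_routes()
except Exception:
    # If the server harness swallows the startup error, the process stays
    # alive and keeps serving, but /config was never registered, so Flask
    # answers every request to it with a 404.
    pass
```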

With the problematic instances unrouted, the app no longer responded with 404s. However, it still occasionally encountered errors during datastore operations. We tracked this problem down to a single datastore server that had lost its connection to ZooKeeper. Typically when this happens, kazoo (the ZooKeeper client library we use) will continually try to reconnect. However, a bug in the kazoo library prevented that from happening in this case. I submitted a pull request to address the kazoo bug here: https://github.com/python-zk/kazoo/pull/534
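The fix itself is in the PR above, but for anyone debugging similar symptoms, kazoo's connection state listener makes a stuck client visible: a healthy client bounces through SUSPENDED/LOST and back to CONNECTED, while a client hitting this bug reports LOST and never recovers. A minimal sketch, with a placeholder ZooKeeper host:

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts='127.0.0.1:2181')  # placeholder host

def on_state_change(state):
    # kazoo is expected to keep retrying after SUSPENDED or LOST; a client
    # that reports LOST and never returns to CONNECTED is stuck.
    print('ZooKeeper connection state:', state)

zk.add_listener(on_state_change)
zk.start()
```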

Timeline of relevant events on the deployment (all timestamps are UTC):

2018-10-28 06:32:06: The datastore server at 172.31.16.247:4003 lost its connection to ZooKeeper.
2018-10-28 06:38:28: The same datastore server encountered the kazoo bug, preventing it from reconnecting.
2018-10-29 11:35:27: The AppServer instance at 172.31.16.65:20001 encountered an error while making a call to the problematic datastore server during initialize_lp. All future calls to /config on this instance were handled with a 404.
2018-10-29 11:35:28: The AppServer instance at 172.31.25.57:20004 encountered the same problem as the other instance.
2018-10-29 17:44:51: The problematic instances were stopped, fixing the 404s.
2018-10-29 18:16:35: The problematic datastore server was restarted, fixing the occasional datastore API errors.