December 14, 2011

Over the weekend (12/09/11 – 12/10/11) we performed critical, preemptive upgrades for Rockerduck. During the upgrade cycle we increased memory on the Mailbox servers, rebalanced resource distribution on the Client Access servers, and added Mailbox servers for quorum retention and higher availability.

Mailbox, mailbox, mailbox…

By utilizing the current mailbox server layout, we were able to increase memory in the Rockerduck mailbox servers in a staggered pattern without disrupting service to clients on Rockerduck. As each mailbox server was prepared for the upgrade, we moved all active mailboxes from that server to a passive mailbox node and then blocked the server from activating any database copy. After each memory upgrade was completed, we ran a memory stress test on the server for 8 hours to check for consistency. Once the upgrade was completed on a node, we brought the node back into the DAG and back up to availability.
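The staggered pattern above can be sketched as a small Python model. This is only an illustration, not the tooling we used: the MailboxNode class, the drain/resume/rolling_upgrade functions and the node and database names are all hypothetical, and the model assumes every other node can take over any database (that is, it holds a passive copy of it).

```python
class MailboxNode:
    """Hypothetical stand-in for one mailbox server in the DAG."""

    def __init__(self, name, active=()):
        self.name = name
        self.active = set(active)        # databases currently active here
        self.activation_blocked = False  # True while under maintenance


def drain(node, dag):
    """Move every active database off `node`, then block activation on it."""
    targets = [n for n in dag if n is not node and not n.activation_blocked]
    if not targets:
        raise RuntimeError("no passive node available to take over")
    for db in sorted(node.active):
        # Assumption: pick the target hosting the fewest active databases.
        target = min(targets, key=lambda n: len(n.active))
        target.active.add(db)
    node.active.clear()
    node.activation_blocked = True


def resume(node):
    """Allow the node to host active databases again."""
    node.activation_blocked = False


def rolling_upgrade(dag, upgrade):
    """Upgrade one node at a time so every database stays active somewhere."""
    all_dbs = set().union(*(n.active for n in dag))
    for node in dag:
        drain(node, dag)
        upgrade(node)  # stand-in for the memory upgrade and stress test
        resume(node)
        # Every database must still be active on some node.
        assert set().union(*(n.active for n in dag)) == all_dbs


dag = [
    MailboxNode("mbx1", {"db1", "db2"}),
    MailboxNode("mbx2", {"db3"}),
    MailboxNode("mbx3"),
]
upgraded = []
rolling_upgrade(dag, lambda node: upgraded.append(node.name))
```

Only one node is ever blocked at a time, which is why clients keep an active copy of every database throughout the cycle.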

Labs vs. Real World Results

Mailbox servers were not the only servers in Rockerduck to be upgraded. Over the past two weeks we’ve been monitoring the response statistics on CAS servers with a new memory / processor configuration.

Originally, when we performed the initial testing and scaling of Rockerduck, we saw the lowest overall latency and response time for RPC and Web Services with fewer CAS servers that each had more RAM and processor cores. Over time, we noticed that real-world RPC latency was significantly outside the range of our original lab results, which caused us to reevaluate our delivery of CAS services.

All CAS servers for Rockerduck sit behind a hardware-based load balancer. Each client that connects to the load balancer is assigned to a specific CAS node for up to 5 hours on certain services (RPC, EWS), based on the client’s WAN IP. The original design for the CAS nodes was 3 nodes with 8GB of RAM and 4 processor cores available.

Unfortunately, this “least connected” model had the potential to (and sometimes did) tie larger groups of users from different IP addresses together on the same node, essentially choking that server with queued requests.
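To make the effect concrete, here is a rough Python simulation of WAN-IP persistence. It is a simplified guess at the behavior, not the load balancer's actual algorithm: the assign_by_source_ip function, the node names and the client counts are made up for illustration, the IP addresses come from the reserved documentation ranges, and a new WAN IP is assumed to be pinned to the node with the fewest pinned IPs.

```python
def assign_by_source_ip(clients, nodes):
    """Return the number of users that end up on each node.

    `clients` is a list of (wan_ip, user) pairs. Each WAN IP is pinned to a
    node at first contact, and every user behind that IP follows the pin.
    """
    pinned = {}                              # wan_ip -> node
    ips_per_node = {n: 0 for n in nodes}
    users_per_node = {n: 0 for n in nodes}
    for wan_ip, _user in clients:
        if wan_ip not in pinned:
            # Assumption: a new IP goes to the node with the fewest pinned IPs.
            node = min(nodes, key=lambda n: ips_per_node[n])
            pinned[wan_ip] = node
            ips_per_node[node] += 1
        users_per_node[pinned[wan_ip]] += 1
    return users_per_node


# One large office behind a single WAN IP, plus four small sites.
clients = [("203.0.113.10", f"office-{i}") for i in range(200)]
for site in range(1, 5):
    clients += [(f"198.51.100.{site}", f"site{site}-{i}") for i in range(5)]

three_nodes = assign_by_source_ip(clients, ["cas1", "cas2", "cas3"])
five_nodes = assign_by_source_ip(clients, ["cas1", "cas2", "cas3", "cas4", "cas5"])
```

With three nodes, the large office and one of the small sites share cas1 (205 users) while cas3 serves only 5. With five nodes, the large office still lands on a single node, but no second site is stacked on top of it.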

The new setup for the CAS nodes is a balance of 6GB of RAM with 3 Processor cores available. This new configuration allowed us to introduce two new CAS servers to more efficiently process requests across multiple nodes without any additional “upgrades” to the CAS roles.

During our statistical collection phase, the nodes with the new configuration showed a 40% reduction in response time on RPC requests and Address Book requests: