Don't go after the biggest problem first. Tackle the low hanging fruit first. Or tackle both in parallel if your team have sufficient man power.

Optimize without downtime

Move sorting query types into the application and limit the amount of data returned to the browser.

Reduce range

Range on primary keys.

Benchmark, make change, benchmark, make change (cycle of improvement)

Always have a plan to rollback.

Incremental rollout (rollout to a small set of users before rolling it out to all users).

Horizontal scalability. Rather than buying big, centralized servers, buy a lot of thin, cheap servers. If one fails, rollover to another servers. Spawn up a new server and join the cluster. Use the AWS auto-scale approach.

Work with a team. Access the current problems. Define the issues. Determine available hardware.

Have a road map.

The database server should never be on the same box as the web server, even when the box does not have a lot of hits. This makes it easy to isolate resource contention problem, and do performance tuning.

Things to keep in mind

Operating system impose certain limits on your program.

On Linux, you can only have approximately 32,000 files inside a directory. That is a lot of files, but in reality, if you have more than 1000 files per directory, you should see performance degraded. If you have a lot of files to be stored on the same filesystem, split them up, and store them in nested directories where the names of the directories are 2 or 3 letters long.

The same limitation mentioned above also dictate how many tables your mysql database can have. For each table, mysql use 3 files. So at the max, you can only have 32000 / 3 tables.

pmap - reports memory map for a process. Check out this blog for some explaination.

Keep it all in memory: I/O will kill your latency, so make sure all of your data is in memory. This generally means managing your own in-memory data structures and maintaining a persistent log, so you can rebuild the state after a machine or process restart. Some options for a persistent log include Bitcask, Krati, LevelDB and BDB-JE. Alternatively, you might be able to get away with running a local, persisted in-memory database like redis or MongoDB (with memory » data). Note that you can loose some data on crash due to their background syncing to disk.

Keep data processing colocated: Network hops are faster than disk seeks but even still they will add a lot of overhead. Ideally, your data should fit entirely in memory on one host. With AWS providing almost 1/4 TB of RAM in the cloud and physical servers offering multiple TBs this is generally possible. If you need to run on more than one host you should ensure that your data and requests are properly partitioned so that all the data necessary to service a given request is available locally.

Keep context switches to a minimum: Context switches are a sign that you are doing more compute work than you have resources for. You will want to limit your number of threads to the number of cores on your system and to pin each thread to its own core.

Keep your reads sequential: All forms of storage, wither it be rotational, flash based, or memory performs significantly better when used sequentially. When issuing sequential reads to memory you trigger the use of prefetching at the RAM level as well as at the CPU cache level. If done properly, the next piece of data you need will always be in L1 cache right before you need it. The easiest way to help this process along is to make heavy use of arrays of primitive data types or structs. Following pointers, either through use of linked lists or through arrays of objects, should be avoided at all costs.

Batch your writes: This sounds counterintuitive but you can gain significant improvements in performance by batching writes. However, there is a misconception that this means the system should wait an arbitrary amount of time before doing a write. Instead, one thread should spin in a tight loop doing I/O. Each write will batch all the data that arrived since the last write was issued. This makes for a very fast and adaptive system.

Use your cache wisely: With all of these optimizations in place, memory access quickly becomes a bottleneck. Pinning threads to their own cores helps reduce CPU cache pollution and sequential I/O also helps preload the cache. Beyond that, you should keep memory sizes down using primitive data types so more data fits in cache. Additionally, you can look into cache-oblivious algorithms which work by recursively breaking down the data until it fits in cache and then doing any necessary processing.

Non blocking as much as possible: Make friends with non blocking and wait free data structures and algorithms. Every time you use a lock you have to go down the stack to the OS to mediate the lock which is a huge overhead. Often, if you know what you are doing, you can get around locks by understanding the memory model of the JVM, C++11 or Go.

Async as much as possible: Any processing and particularly any I/O that is not absolutely necessary for building the response should be done outside the critical path.

Parallelize as much as possible: Any processing and particularly any I/O that can happen in parallel should be done in parallel. For instance if your high availability strategy includes logging transactions to disk and sending transactions to a secondary server those actions can happen in parallel.