On the destination node, RowMutationVerbHandler uses Table.Apply to hand the write first to CommitLog.java, then to the Memtable for the appropriate ColumnFamily.

When a Memtable is full, it is sorted and written out asynchronously as an SSTable by ColumnFamilyStore.switchMemtable.
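The switch described above can be sketched roughly as follows. This is a minimal illustration, not Cassandra's code: the class MemtableSwitchSketch, its entry-count threshold, and its accessor methods are all hypothetical, and the flush happens synchronously here, whereas the real one is asynchronous and writes to disk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Hypothetical stand-in for a Memtable and its flushed SSTables. */
final class MemtableSwitchSketch {
    private final int maxEntries;  // assumed definition of "full"
    // TreeMap keeps keys sorted, so a flushed table is already in order.
    private TreeMap<String, String> memtable = new TreeMap<>();
    private final List<TreeMap<String, String>> sstables = new ArrayList<>();

    MemtableSwitchSketch(int maxEntries) {
        this.maxEntries = maxEntries;
    }

    synchronized void put(String key, String value) {
        memtable.put(key, value);
        if (memtable.size() >= maxEntries) {
            switchMemtable();
        }
    }

    /** Freezes the full memtable as an "SSTable" and starts a fresh one. */
    private void switchMemtable() {
        sstables.add(memtable);      // never modified again after this point
        memtable = new TreeMap<>();  // new writes go to the fresh memtable
    }

    synchronized int sstableCount() {
        return sstables.size();
    }

    synchronized List<String> keysOfSSTable(int index) {
        return new ArrayList<>(sstables.get(index).keySet());
    }

    synchronized int liveEntries() {
        return memtable.size();
    }
}
```

With a threshold of 2, writing keys "b" and then "a" produces one flushed table whose keys are in sorted order, and leaves an empty live memtable.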

When enough SSTables exist, they are merged by ColumnFamilyStore.doFileCompaction.

Making this concurrency-safe without blocking writes or reads while we remove the old SSTables from the list and add the new one is tricky, because naive approaches require waiting for all readers of the old SSTables to finish before deleting them (we cannot know whether a reader has actually started opening the file yet; if it has not and we delete the file first, the read will fail). The approach we have settled on is not to delete old SSTables synchronously; instead we register a phantom reference with the garbage collector, so that an SSTable is deleted once no references to it remain. (We also write a compaction marker to the file system, so that if the server is restarted before that happens, the old SSTables are cleaned out at startup.)
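The phantom-reference technique can be illustrated with the standard java.lang.ref classes. This is a sketch under assumptions: DeferredDeleteSketch and its methods are hypothetical names, and the file deletion is represented by a Runnable so the example stays free of I/O. It does not show how Cassandra actually wires this up, and the compaction marker is not shown.

```java
import java.lang.ref.PhantomReference;
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical helper: runs a cleanup action once a reader object is unreachable. */
final class DeferredDeleteSketch {
    private final ReferenceQueue<Object> queue = new ReferenceQueue<>();
    // Holding the references here keeps them alive until they are enqueued.
    private final Map<Reference<?>, Runnable> pending = new ConcurrentHashMap<>();

    /** Registers deleteAction (standing in for "delete the SSTable file"). */
    Reference<?> deleteWhenUnreachable(Object reader, Runnable deleteAction) {
        PhantomReference<Object> ref = new PhantomReference<>(reader, queue);
        pending.put(ref, deleteAction);
        return ref;
    }

    /** Runs the actions for every reader the collector has reclaimed so far. */
    int drain() {
        int ran = 0;
        Reference<?> ref;
        while ((ref = queue.poll()) != null) {
            Runnable action = pending.remove(ref);
            if (action != null) {
                action.run();
                ran++;
            }
        }
        return ran;
    }
}
```

Nothing is deleted while any reader still holds the object, so no reader can have the file removed underneath it. A background thread would call drain() periodically; how promptly a reference is enqueued depends on the garbage collector.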

Read path

StorageProxy gets the nodes responsible for replicas of the keys from the ReplicationStrategy, then sends read messages to them.

The read message may be a SliceFromReadCommand, a SliceByNamesReadCommand, or a RangeSliceReadCommand, depending on the type of query.

On the data node, ReadVerbHandler gets the data from CFS.getColumnFamily or CFS.getRangeSlice and sends it back as a ReadResponse.

The row is located by doing a binary search on the index in SSTableReader.getPosition.
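As a rough illustration of the lookup, the sketch below binary-searches a sorted, in-memory list of keys and returns the matching file offset. RowIndexSketch is a hypothetical class, and the assumption that every key appears in the index is a simplification made for the example. It is not a description of how SSTableReader.getPosition stores its index.

```java
import java.util.Arrays;

/** Hypothetical row index: sorted keys paired with file offsets. */
final class RowIndexSketch {
    private final String[] keys;     // must be sorted ascending
    private final long[] positions;  // positions[i] is the offset of keys[i]

    RowIndexSketch(String[] sortedKeys, long[] positions) {
        if (sortedKeys.length != positions.length) {
            throw new IllegalArgumentException("keys and positions differ in length");
        }
        this.keys = sortedKeys.clone();
        this.positions = positions.clone();
    }

    /** Returns the offset of the row for key, or -1 if the key is absent. */
    long getPosition(String key) {
        int i = Arrays.binarySearch(keys, key);  // O(log n) comparisons
        return i >= 0 ? positions[i] : -1L;
    }
}
```

For an index over "apple", "kiwi", "pear" with offsets 0, 128, 512, looking up "kiwi" yields 128 and looking up "fig" yields -1.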

For single-row requests, we use a QueryFilter subclass to pick the data from the Memtable and SSTables that we are looking for. The Memtable read is straightforward. The SSTable read is a little different depending on which kind of request it is:

If we are reading a slice of columns, we use the row-level column index to find where to start reading, and deserialize block-at-a-time (where "block" is the group of columns covered by a single index entry) so we can handle the "reversed" case without reading vast amounts into memory.

If we are reading a group of columns by name, we still use the column index to locate each column, but first we check the row-level bloom filter to see if we need to do anything at all.

The column readers provide an Iterator interface, so the filter can easily stop when it's done, without reading more columns than necessary.

Because we may need to merge columns from multiple SSTable versions, the reader iterators are combined through a ReducingIterator, which takes an iterator of uncombined columns as input and yields combined versions as output.
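The reducing step can be sketched as an iterator wrapper. Everything here is illustrative: SketchColumn and ReducingColumnIterator are hypothetical names, the input is assumed to be already sorted by column name, and "combining" is assumed to mean keeping the version with the highest timestamp. The real ReducingIterator may reduce differently.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Hypothetical column: a name, a value, and a write timestamp. */
final class SketchColumn {
    final String name;
    final String value;
    final long timestamp;

    SketchColumn(String name, String value, long timestamp) {
        this.name = name;
        this.value = value;
        this.timestamp = timestamp;
    }
}

/** Collapses runs of same-named columns into the newest version of each. */
final class ReducingColumnIterator implements Iterator<SketchColumn> {
    private final Iterator<SketchColumn> source;  // sorted by column name
    private SketchColumn pending;                 // first column of the next run

    ReducingColumnIterator(Iterator<SketchColumn> sortedByName) {
        this.source = sortedByName;
        this.pending = source.hasNext() ? source.next() : null;
    }

    @Override
    public boolean hasNext() {
        return pending != null;
    }

    @Override
    public SketchColumn next() {
        if (pending == null) {
            throw new NoSuchElementException();
        }
        SketchColumn winner = pending;
        pending = null;
        while (source.hasNext()) {
            SketchColumn c = source.next();
            if (!c.name.equals(winner.name)) {
                pending = c;  // start of the next run; keep it for later
                break;
            }
            if (c.timestamp > winner.timestamp) {
                winner = c;   // newer version of the same column wins
            }
        }
        return winner;
    }
}
```

Because the wrapper pulls from its source only as far as the start of the next run, a caller that stops early never forces the remaining columns to be read.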

If a quorum read was requested, StorageProxy waits for a majority of nodes to reply and makes sure the answers match before returning. Otherwise, it returns the data reply as soon as it gets it, and checks the other replies for discrepancies in the background in StorageService.doConsistencyCheck. This background check is called "read repair"; it also helps the replicas become consistent sooner.

As an optimization, StorageProxy only asks the closest replica for the actual data; the other replicas are asked only to compute a hash of the data.
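The data-versus-hash comparison might look like the following sketch. The class DigestReadSketch, its method names, and the choice of MD5 as the hash function are assumptions made for illustration; only java.security.MessageDigest is a real API here.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

/** Hypothetical comparison of one full data reply against digest-only replies. */
final class DigestReadSketch {
    /** Hash that a digest-only replica would send back (MD5 is an assumption). */
    static byte[] digest(byte[] rowData) {
        try {
            return MessageDigest.getInstance("MD5").digest(rowData);
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide MD5.
            throw new IllegalStateException(e);
        }
    }

    /** True if any replica's digest disagrees with the digest of the data reply. */
    static boolean needsRepair(byte[] dataReply, List<byte[]> digestReplies) {
        byte[] expected = digest(dataReply);
        for (byte[] replicaDigest : digestReplies) {
            if (!MessageDigest.isEqual(expected, replicaDigest)) {
                return true;
            }
        }
        return false;
    }
}
```

Only one replica ships the full row; the others return a fixed-size hash, which is enough to detect a mismatch but not to say which replica is stale.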