@rbranson: If your data fits in main memory, you're doing it wrong. #strangeloop

@peakscale: Using schemaless DBs an "overreaction" & "confuses the poor impl. of schemas with the value that schemas provide"

@adrianco: GM: Performance analysis is complicated by your brain thinking LINEARLY about a computer system that is NONLINEAR.

@littleidea: it's better to have infinite scalability and not need it, than to need infinite scalability and not have it

Looks like Google is on the right track with their language understanding efforts. How hierarchical is language use: In this paper, we review evidence from the recent literature supporting the hypothesis that sequential structure may be fundamental to the comprehension, production and acquisition of human language. Moreover, we provide a preliminary sketch outlining a non-hierarchical model of language use and discuss its implications and testable predictions.

Lots of techniques for Enhancing the Scalability of Memcached. Very detailed and filled with many potential wins for your own code. Optimized memcached increases throughput by 6X and performance per watt by 3.4X over the baseline, though a commenter pointed out the tests were against an older version of memcached. Some of the changes: the hash table locking mechanism was changed to allow parallel access; Bag LRU – the data structure was changed to an array of different-sized LRUs with singly-linked-list bags of cache items; DELETE and STORE operations now use a parallel hash table approach with striped locks; locks were removed on GETs; and many more.
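The striped-lock idea can be sketched in miniature: partition the table into N independently locked stripes so that operations on different stripes never contend. This is an illustrative Python sketch of the general technique, not memcached's C internals (class and method names are invented):

```python
import threading

class StripedLockTable:
    """Hash table guarded by N lock stripes instead of one global lock.
    Writers on different stripes proceed in parallel; only keys that
    hash to the same stripe contend. (Illustrative sketch only.)"""

    def __init__(self, num_stripes=16):
        self._stripes = [threading.Lock() for _ in range(num_stripes)]
        self._buckets = [dict() for _ in range(num_stripes)]

    def _index(self, key):
        return hash(key) % len(self._stripes)

    def store(self, key, value):
        i = self._index(key)
        with self._stripes[i]:          # only ~1/N of keys contend here
            self._buckets[i][key] = value

    def get(self, key):
        # The optimized memcached went further and made GETs lock-free;
        # a plain dict read is atomic enough in CPython for this sketch.
        return self._buckets[self._index(key)].get(key)

    def delete(self, key):
        i = self._index(key)
        with self._stripes[i]:
            self._buckets[i].pop(key, None)
```

With 16 stripes, a workload spread evenly over keys sees roughly 1/16th the lock contention of a single global lock.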

Copying data 3 times for safety has to be expensive, especially as storage requirements skyrocket. StorageMojo in More efficient erasure coding in Windows Azure storage shows how advanced erasure codes can provide reliability with dramatically reduced storage requirements. We'll probably see more of this in the future.
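The simplest erasure code illustrates the saving: one XOR parity block protects N data blocks against a single loss at 1/N overhead, versus 200% overhead for triple replication. Azure's Local Reconstruction Codes are far more sophisticated, but the principle is the same (illustrative Python; function names are invented):

```python
def xor_parity(blocks):
    """Compute one parity block as the byte-wise XOR of equal-sized
    data blocks. Storing N data blocks + 1 parity costs (N+1)/N of the
    data size, versus 3x for triple replication."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)

def recover(surviving_blocks, parity):
    """Rebuild the single missing block: XOR of all survivors plus the
    parity cancels everything except the lost block."""
    return xor_parity(list(surviving_blocks) + [parity])
```

This tolerates only one lost block per parity group; real erasure codes like Reed-Solomon or Azure's LRC add more parity blocks to survive multiple failures while still beating replication on overhead.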

Beginning on October 8th Stanford is having what looks like a really cool course: An Introduction to Computer Networks. One of the teachers is Nick McKeown. I've listened to him speak a few times and he's excellent.

Nature decouples. You can understand nature one layer of the onion at a time. You don't need to know about quarks and gluons to understand water turbulence. What makes science possible is that you can study the different layers of the onion independently. Software is still usually a Big Ball of Mud.

I've been taking a course on architecture, so I found the article Fundamental: Stress-Strain Curves In Web Engineering by John Allspaw quite thoughtful. Many parallels between structural engineering and software engineering want to be drawn, but since engineering is a conscious balancing of forces against goals, the problem for software engineering is that it has no equations of equilibrium to guide structural decisions. Also good, A Mature Role for Automation: Part I.

Since you can't predict the future, your best bet is to measure and react. Generate lots and lots of bets. Put them in the field. Measure which ones succeed. Then scale up the winners.
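This measure-and-scale loop is essentially a multi-armed bandit. A minimal epsilon-greedy sketch (all names hypothetical; the true payoff of each bet is hidden from the algorithm, which only observes sampled outcomes):

```python
import random

def epsilon_greedy(bets, trials=10000, epsilon=0.1, seed=42):
    """Explore many 'bets' (options with unknown payoff), measure results,
    and shift traffic toward the winners. `bets` maps name -> true success
    probability, used only to simulate outcomes."""
    rng = random.Random(seed)
    wins = {name: 0 for name in bets}
    plays = {name: 0 for name in bets}
    for _ in range(trials):
        if rng.random() < epsilon:                 # explore: try anything
            choice = rng.choice(list(bets))
        else:                                      # exploit: current winner
            choice = max(bets, key=lambda n: wins[n] / plays[n] if plays[n] else 0.0)
        plays[choice] += 1
        if rng.random() < bets[choice]:
            wins[choice] += 1
    return plays
```

Run against two bets with success rates 5% and 30%, the loop quickly concentrates almost all plays on the stronger one while still spending a small exploration budget on the rest.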

On spinlocks and sleep(): Yes, we really did achieve a 3.7X speedup on a garbage collection benchmark by removing a call to sleep().
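Why a sleep inside a wait loop hurts so much: sleep(t) blocks for at least t, and usually longer, thanks to OS timer granularity and scheduler wakeup latency. A quick Python demonstration of the latency floor (not the article's GC code, just the general effect):

```python
import time

# Each time.sleep(t) blocks for AT LEAST t, and often noticeably more.
# 100 "tiny" 1 ms sleeps therefore put a hard 100 ms floor under this
# loop -- the kind of hidden serialization a sleep() inside a spin-wait
# adds to every contended acquisition, even when the resource frees up
# microseconds later.
start = time.perf_counter()
for _ in range(100):
    time.sleep(0.001)
sleep_elapsed = time.perf_counter() - start  # >= 0.1 s by construction
```

A busy-spin (or a spin-then-yield strategy) wakes within microseconds of the condition changing, at the cost of burning CPU while it waits; which side of that trade-off wins depends on how long waits typically last.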

Scaling Riak to 25 million ops/day at Kiip. Excellent set of notes on the talk. Kiip team found Riak extremely solid. Some advice: Scale early, Don’t use secondary index (2i) in real-time queries, The JavaScript engine requires a lot of RAM, Don’t restart nodes in rapid succession.

Good Cassandra Counters thread on Google Groups. Rohit Bhatia with a nice TLDR: if you want 99.99% accurate counters and can manage with eventual consistency, Cassandra works nicely.
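The eventually consistent counting idea can be illustrated with a grow-only counter CRDT: each replica increments only its own slot, and merging takes per-slot maxima, so all replicas converge no matter the order updates arrive in. Cassandra's counter internals differ, but the convergence principle is the same (illustrative Python):

```python
class GCounter:
    """Grow-only counter CRDT. Each replica increments its own slot;
    merge takes the per-slot max, which is commutative, associative,
    and idempotent -- so replicas converge regardless of message order
    or duplication. (Sketch of the concept, not Cassandra's code.)"""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}

    def increment(self, n=1):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def merge(self, other):
        for rid, c in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), c)

    def value(self):
        return sum(self.counts.values())
```

After both replicas exchange state in any order, both report the same total.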

The new data haves and have-nots. Commerce Weekly: Big data in retail: This model not only caters to large retailers over smaller retailers because of the size of their wallets, but because it’s easier for brands to interact with the corporate headquarters of a major retailer with 1,000 stores than to interact with 1,000 owners of independent stores, Hawkins writes. He goes into detail about how this business model will affect the industry on several fronts — you can read his piece in its entirety here.

Twitter has released Algebird: Algebird is our lightweight abstract algebra library for Scala and is targeted for building aggregation systems (such as Storm).
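The core idea behind Algebird: a monoid (an associative combine operation plus an identity element) lets an aggregation system merge partial results from any number of workers in any grouping and get the same answer. A Python sketch of the principle (Algebird itself is a Scala library):

```python
from functools import reduce

def combine(a, b):
    """A monoid operation: associative, with identity 0. Associativity is
    what lets partial results from different shards/workers be merged in
    any order or grouping -- the property aggregation systems like
    Algebird (and Storm topologies built on it) exploit."""
    return a + b

IDENTITY = 0

data = list(range(10))
# Split across "workers", reduce each shard independently...
shards = [data[:3], data[3:7], data[7:]]
partials = [reduce(combine, shard, IDENTITY) for shard in shards]
# ...then merge the shard results; associativity guarantees this equals
# a single sequential reduce over all the data.
total = reduce(combine, partials, IDENTITY)
```

Swap in a different monoid (max, set union, HyperLogLog merge) and the same split-reduce-merge scaffolding computes a different aggregate, which is what makes the abstraction worth packaging as a library.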

Dmitriy Samovskiy with a Concise Introduction to Infrastructure as Code: Once you achieve high levels in monitoring and deployment (not necessarily highest though), you can start doing things like self-healing, autoscale, testing through fault injection and other cool things < Nice, short list of all the things you can do to create IaC.

On Oracle NoSQL benchmark setup costs:

"We used 15 servers, each configured with two 335 GB SSD cards. We did not have homogeneous CPUs across all 15 servers available to us so 12 of the 15 were Xeon E5-2690, 2.9 GHz, 2 sockets, 32 threads, 193 GB RAM, and the other 3 were Xeon E5-2680, 2.7 GHz, 2 sockets, 32 threads, 193 GB RAM. There might have been some upside in having all 15 machines configured with the faster CPU, but since CPU was not the limiting factor we don't believe the improvement would be significant.

The client machines were Xeon X5670, 2.93 GHz, 2 sockets, 24 threads, 96 GB RAM. Although the clients had 96 GB of RAM, neither the NoSQL Database nor the YCSB clients require anywhere near that amount of memory, and the test could just as easily have been run with much less.

Networking was all 10GigE."

I would estimate the total price = 20K per server + 12-20K per pair of SSD cards (which are Fusion-io) = 480-600K per cluster. This gives us ~2 ops per $1.
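Multiplying out the quoted per-server figures (the prices are the commenter's rough estimates, not vendor quotes):

```python
# Back-of-the-envelope cluster cost from the per-server estimates above.
servers = 15
server_cost = 20_000                  # estimated price per server
ssd_low, ssd_high = 12_000, 20_000    # estimated price per pair of SSD cards

low = servers * (server_cost + ssd_low)    # 15 * 32K = 480K
high = servers * (server_cost + ssd_high)  # 15 * 40K = 600K
```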