Of the 7 lessons, one that I found particularly interesting, was Lesson 3: Open Schema.
It talks about taking a different approach from traditional RDBMS database design, and the pros and cons of that. But the other lessons are interesting too.

Performance tuning in general is an interesting and challenging area. One book I've read about it, which is very good, IMO, is "Writing Efficient Programs", by Jon Bentley of Bell Labs. I own a copy and had read it cover to cover when I bought it - it's that good.

Excerpts from the Wikipedia article about Jon Bentley (linked above):

[ After receiving his Ph.D., he joined the faculty at Carnegie-Mellon University as an assistant professor of computer science and mathematics.[1] At CMU, his students included Brian Reid, John Ousterhout, Jeff Eppinger, Joshua Bloch, and James Gosling, and he was one of Charles Leiserson's advisors. Later, Bentley moved to Bell Laboratories.
...
He wrote the Programming Pearls column for the Communications of the ACM magazine, and later collected the articles into two books of the same name. He has published or presented over 200 papers.
...
Bentley received the Dr. Dobb's Excellence in Programming award in 2004. ]

Though it was written many years ago, and is now probably out of print (it was when I checked a while ago), it has many fascinating articles about performance tuning. What is more of interest, is that the chapters in the book talk about performance tuning at many different levels, from tuning at the level of algorithms and data structures, down to tuning at the level of individual subroutines and the code within them, and even further down the stack to the generated assembly language code and the hardware. (The book does not give actual examples of tuning at the level of hardware, but it does have plenty of "war stories", as Bemtley calls them, about real-life performance tuning cases, and one of them is an astonishing case where the performance of quicksort (IIRC), was improved by over a million times, by working at many levels of the stack, from the architecture down to the hardware.

Loop unrolling is another interesting example, where, IIRC, the speed of a binary search was increased several times. There are many other such examples and war stories.

One particular interesting one (though the code is not given, only a brief description) was a war story where someone said "I remember a young man building an interpreter for an interpreter ...", "thereby packing the program down into an incredibly small amount of space" - or words to that effect. That was actually needed in that case, because of the low memory of the target environment.

Bentley gives many tuning rules in the book, which can be applied to specific situations. Most of them are at the programming level, so are accessible to most programmers (who may not have the luxury of changing the architecture or the hardware).
He also gives situations where the rules are useful, and where they may not be useful.

UPDATE: Though I said it is out of print, I just searched and found an Amazon link for the book: