Wednesday, April 11, 2012

PostgreSQL Scalability Analysis Deconstructed

In 2010, I presented my universal scalability law (USL) at the SURGE conference. I came away with the impression that nobody really understood what I was talking about (quantifying scalability) or maybe DevOps types thought it was all too hard (math). Since then, however, I've come to find out that people like Baron Schwartz did get it and have since applied the USL to database scalability analysis. Apparently, things have continued to propagate to the point where others have heard about the USL from Baron and are now using it too.

Robert Haas is one of those people and he has applied the USL to Postgres scalability analysis. This is all good news. However, there are plenty of traps for new players and Robert has walked into several of them, to the point where, by his own admission, he became confused about what conclusions could be drawn from his USL results. In fact, he analyzed three cases:

PostgreSQL 9.1

PostgreSQL 9.2 with fast locking

PostgreSQL 9.2 current release

I know nothing about Postgres but thankfully, Robert tabulated on his blog the performance data he used and that allows me to deconstruct what he did with the USL. Here, I am only going to review the first of these cases: PostgreSQL 9.1 scalability. I intend to return to the claimed superlinear effects in another blog post.

The USL in Brief

The USL is expressed in terms of the relative capacity
\begin{equation}
C_N = \frac{X_N}{X_1}
\end{equation}
a ratio defined exclusively by the data: the actual throughput measurements, $X_N$. Here, $X_1$ is the throughput measured at a single-client ($N=1$) load and $X_N$ is the throughput measured at any $N > 1$ client load.

But that's only one side of the proverbial coin. On the flip side, the USL provides the following abstract model
\begin{equation}
C_N = \frac{N}{1 + \alpha (N-1) + \beta N (N-1)}
\end{equation}
against which we can regress those same throughput measurements. The USL equation states that the relative capacity is characterized by three effects:

The level of concurrency or ideal parallelism (the '1' in the denominator)

Degradation of linear scaling due to contention effects (the $\alpha$ term)

Retrograde scaling due to coherency delays (the $\beta$ term)

Since each of these physical effects has a corresponding term in the USL model, the assertion is: measurements must match the model. Where they don't match, start explaining why not. The method of nonlinear statistical regression is used to determine what values of the $\alpha$ and $\beta$ parameters best account for the variation in the scalability measurements. The relative capacity $C_N$ is a normalized number. To get the corresponding throughput values (that can be compared with measurements), multiply the relative capacity by $X_1$, i.e., $X_N = C_N \, X_1$. More details about regression techniques and the USL are presented in my Guerrilla capacity planning book and training class.
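To make the two sides of that proverbial coin concrete, here is a minimal sketch (in Python, as a stand-in for the R analysis used later in this post). Only $X_1 = 4373.30$ is taken from the data discussed below; the throughput value at $N = 10$ and the parameter values are hypothetical, purely for illustration.

```python
# Sketch of the two sides of the USL "coin". Only X1 comes from the post;
# XN and the alpha/beta values are hypothetical illustrations.

def usl(N, alpha, beta):
    """Relative capacity C_N = N / (1 + alpha*(N-1) + beta*N*(N-1))."""
    return N / (1 + alpha * (N - 1) + beta * N * (N - 1))

# Data side: C_N is defined exclusively by the measurements.
X1 = 4373.30               # measured single-client throughput (TPS)
XN = 20000.0               # hypothetical throughput at N = 10 clients
C_data = XN / X1

# Model side: the fitted USL gives C_N, and throughput is recovered
# as X_N = C_N * X1 for comparison with the measurements.
C_model = usl(10, alpha=0.01, beta=0.0001)
X_model = C_model * X1
```

Note that the model form guarantees $C_1 = 1$, consistent with the definition $C_N = X_N/X_1$ at $N = 1$.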

My purpose here is not to discourage Robert or anyone else from using the USL model, but if you are going to use it, then it really is in your own interest to learn to use it correctly. It's not just a matter of throwing performance data at the USL model and seeing if the plots look like what you were expecting. If the USL always did what we expected, there would be little point in doing it at all because no new insights would be gleaned. It's also not a one-shot, but an exploratory process in which the USL should be applied iteratively. Performance modeling and analysis is a very technical skill, and that's why I run Guerrilla training classes. To give you a better idea of how all that should go, let's see what the USL analysis of Postgres 9.1 looks like when it's done a bit more rigorously.

The Original Analysis

For reference, the full complement of load test points runs between $N = 1$ and $N = 80$ clients. However, the original plot (Fig. 1 by Robert) only shows the USL analysis performed with gnuplot on a subset of data points for $N \le 32$ clients (shown as little circles).

What happened to the remaining data points? Apparently, Robert decided that he wanted "to avoid confusing the tool." It's not clear whether that is an anthropomorphic reference to the limitations of gnuplot (never used it) or the USL. Regardless, that statement tells me more about the modeler than the model. Only modelers get confused, not models. How can that happen? Let me count the ways:

If all three throughput measurements $X_N$ at each client load $N$ were shown (rather than just the median), there would be a spread in the data. Which is the "true" value?

Without a full appreciation of these concepts, there is always the tendency to believe the data over the model. It's that bias that led Robert to fit the USL to a subset of his data. Since the USL curve passes through most of those data points, that's better. Right? Wrong!

Same USL Analysis in R

Next, let me redo the USL fit in R so as to get calibrated with Fig. 1. First, I create a dataframe in R, normalize the throughputs (Norm) using $C_N$ and check the efficiencies (Effcy) using $C_N/N$.

There are 21 data points with clients in the range $1 \le N \le 80$ with corresponding throughputs lying in the range $4373.30 \le X_N \le 36734.52$. Then, I do the formal statistical regression on the same $N \le 32$ data subset as that used in Fig. 1 to get the following numerical parameter values.
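For readers who prefer Python to R, the regression step can be sketched with SciPy's `curve_fit`. The data below are synthetic, generated from known parameters so the fit can be verified; they are not Robert's actual PG 9.1 measurements. The point is only the mechanics of the nonlinear fit.

```python
# Nonlinear statistical regression of the USL against normalized data.
# Synthetic C_N values (NOT the PG 9.1 measurements) on an N <= 32 subset.
import numpy as np
from scipy.optimize import curve_fit

def usl(N, alpha, beta):
    # Relative capacity C_N predicted by the USL model
    return N / (1 + alpha * (N - 1) + beta * N * (N - 1))

N = np.array([1, 2, 4, 8, 16, 24, 32], dtype=float)
C = usl(N, 0.01, 0.0004)   # noiseless synthetic C_N = X_N / X_1

# Least-squares fit; bounds keep alpha and beta physical (non-negative).
(alpha_hat, beta_hat), _ = curve_fit(usl, N, C, p0=(0.05, 0.001),
                                     bounds=(0.0, 1.0))
```

With real data, one would normalize the measured throughputs by $X_1$ first and then inspect the residuals, not just the fitted curve.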

These values are then combined with $X_1$ and $C_N$, as described above, to plot the USL curve as well as some other numerical scalability quantities that are collected into a legend in Fig. 2. As you can see, it agrees with the USL curve in Fig. 1. In addition, I've shown the full range of measurements (circles) as well as the (dashed) line corresponding to ideal linear scaling, i.e., $\alpha = \beta = 0$ in the USL model.

Figure 2. Same USL analysis as Fig. 1 using R.

The virtue of including the additional reference information is that it allows me to draw the following conclusions. Suppose Robert had not yet measured a client load above $N = 32$ in his benchmark tests. I can see that PG 9.1 scales quite linearly up through the first dozen or so clients, and then begins to fall away from linear due to the nonzero but small value of $\alpha \approx 1\%$. That looks very promising!

However, something "bad" seems to happen above $N \approx 20$ clients. The USL predicts that scaling (as depicted by the regression curve) will go severely retrograde. Moreover, because the $\alpha$ and $\beta$ parameters in the USL model have distinct physical interpretations, I can say that the retrograde scaling (if it occurs—remember, we're pretending we don't have the additional data points yet) will not be due to contention (the $\alpha$ term in the USL) but delays caused by some kind of pairwise exchange (the $N^2$ term in the USL) between the clients—a coherency penalty. Since I'm not a Postgres expert, I have no idea what that could mean, but Robert might. Furthermore, locks are associated with sequencing or serialization (the linear $N$ term in the denominator of the USL) so, any retrograde behavior will be due to something other than locking.
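The load at which the USL predicts the throughput peak follows directly from the fitted parameters. Writing the USL so that $C_1 = 1$ (i.e., with the contention term as $\alpha(N-1)$) and setting $dC_N/dN = 0$ gives $N^* = \sqrt{(1-\alpha)/\beta}$. A one-line check, with illustrative parameter values (not the fitted PG 9.1 values, which are not reproduced here):

```python
# Peak-load prediction from fitted USL parameters: N* = sqrt((1-alpha)/beta).
import math

def usl_peak(alpha, beta):
    """Client load N* at which the USL curve reaches its maximum."""
    return math.sqrt((1 - alpha) / beta)

# Illustrative values only: alpha = 4% contention with beta ~ 0.0011
# coherency would put the peak near N = 30 clients.
n_star = usl_peak(0.04, 0.0011)
```

This is why a nonzero $\beta$, however small, always forces a peak and subsequent retrograde throughput, whereas $\alpha$ alone merely flattens the curve toward a plateau.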

Complete USL Analysis

Now, let's suppose the additional measurements become available and we fit the USL to all those data. The result is shown in Fig. 3. This is the result that put Robert off. Why? Because he hasn't yet learnt his catechism about models coming from God.

Figure 3. USL analysis of the full complement of 21 data points.

Robert looks at the USL curve and says "Why?" I look at the USL curve and say "Why not?" The USL predicts that the peak should be closer to $N = 30$, whereas the data peaks prematurely around $N = 20$ (when viewed from the standpoint of the USL). When you think about it, if something is giving you a hint that the scalability could be better than the current data indicates, wouldn't you want to investigate further to see if that performance improvement might be achievable?

Reviewing Fig. 3 more carefully, I can also say with some authority that PG 9.1 is now scaling better than expected (based on the USL curve in Fig. 2) because it did not go as retrograde as anticipated by the USL, based on the load points up to $N = 32$. In other words, if we already see one improvement, could there be yet more improvements?

Another way of stating this conclusion is to note that the throughput data somewhat exceed USL expectations in the window just prior to the predicted USL peak (say, $15 < N < 20$), fail to meet USL expectations in the window $20 < N < 60$, but finally recover again in the window $60 < N < 80$. This view of the data immediately raises questions about possible room for performance improvements. Can PG do better? Can the scaling be more consistent? If not, why not? Maybe PG can't get there, but we need to understand why not. In this sense, USL modeling is often more about driving explanations than fitting data.

Extended USL Analysis

Finally, let's switch gears again and ask, What would happen if we could magically "turn off" the coherency penalty (i.e., set $\beta = 0$) in Fig. 3? The USL prediction is shown in Fig. 4.

Figure 4. USL analysis with the $\beta$ parameter suppressed.

This USL analysis says the following. If you could maintain the level of contention at the value in Fig. 3 (i.e. $\alpha \approx 4\%$), then near linear scaling could take throughput performance up to a higher plateau (denoted $X_{roof}$ in the legend) given by:
\begin{equation}
\frac{X_1}{\alpha} = \frac{4373.30}{0.03855} = 113,444.90~\text{TPS}
\end{equation}
If we consider the data circles beyond say, $N = 40$ clients in Fig. 4 to constitute an approximate throughput plateau for PG 9.1, then the predicted $X_{roof}$ plateau would be about 4x higher than that. As you can see, under these assumptions, the USL scalability curve would be off the chart!
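The arithmetic behind that plateau estimate is trivial to reproduce (both numbers here, $X_1$ and $\alpha$, are the values quoted in this post):

```python
# Asymptotic throughput ceiling when the coherency penalty is removed
# (beta = 0): the USL plateaus at X_roof = X1 / alpha.
X1 = 4373.30          # single-client throughput (TPS), from the post
alpha = 0.03855       # fitted contention parameter, from the post

X_roof = X1 / alpha   # roughly 113,445 TPS
```

Dividing that ceiling by an observed plateau of roughly 28,000 TPS (the neighborhood of the measured peak) is what yields the "about 4x higher" claim.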

In case you think that's a completely crazy projection coming from the USL, then check out what actually happened to the newer version of Postgres with the so-called fast locking modification. Those PG 9.2 throughput measurements, shown as green joined data points, reached an $X_{roof}$ that is 5x higher than expected from the PG 9.1 data. The PG 9.2 fast lock data makes the PG 9.1 USL prediction look conservative. Some of the additional boost to $X_{roof}$ may have come from the apparent superlinear effect, but I won't go into that aspect here.

Of course, the architectural changes to Postgres 9.2 were not made as a consequence of any USL modeling (that I know about), but they could have been. I hope you can see now how the modeling scenarios (different parameter values) that I've presented here could have been applied to the PG data within the context of the USL. Most importantly, each of these USL scenarios in Figs. 2, 3 and 4 should be seen as fitting together logically and reinforcing one another. All that information was hidden in the original data.

Indeed, I can say with hindsight that whatever changes were made between PG 9.1 and 9.2 with fast locking, they had the predicted effect of making the USL $\beta$ parameter vanishingly small (but not zero, since some slight retrograde behavior persists). In addition, the fast locks reduced the $\alpha$ parameter to something smaller than the PG 9.1 value of $\alpha = 0.03855$. Simply put, if the USL had been used to reach these conclusions, I would've been in a position to say, "I told you so." ;-)

Summary

These are just some of the insights you can gain from the appropriate use of the USL model. In fact, this PostgreSQL example would make a great case study for a Guerrilla training class.

The reason I ignored the points where N>32 is because the data were collected on a system with 32 cores. So there are multiple factors limiting scalability here. Where clients <= cores, we have one set of bottlenecks, principally due to lock contention within PostgreSQL but perhaps also partly due to operating system or hardware effects. However, once the number of clients exceeds the number of cores, we're bound to hit a wall: if all the available CPU resources are already in use, adding more clients can't really continue to improve throughput. What I want to measure is whether it's possible to add throughput by adding more hardware resources (cores), NOT whether or not throughput will flatten out when we run out of cores. The answer to the latter question seems pretty self-evident: if we're efficiently using every core, then the best we can hope for when we run out of cores is that throughput will remain stable. In reality, of course, it will drop off slightly, because task-switching is not perfectly efficient.

I believe a list of basic reading resources could help any technical lead who measures performance. Apart from your blog, I think these well-known links helped me. But I had to search hard. These are the basics though.

Hi Dr. Gunther, I have a question that is maybe more linked to response time analysis than scaling (if you allow such a distinction). Many performance tools and collectors return service time metrics in a synthetic way, such as: avg time, median, 90th, 95th, 99th percentiles. Is it possible from these numbers to understand which dsn they belong to (exponential, power law, normal, etc.)? I'm wondering if this would help as an indication of the correct queueing model to be used (I'm re-reading paragraph 2.11, "Generalized servers", of your great "Analyzing Computer System Performance with Perl::PDQ" book). Thanks, MatteoP

Indeed, it is possible to talk about response-time (R) scalability and from the USL standpoint, we can derive R from the throughput data (X). I explain how to do all this in my upcoming Guerrilla classes.

If we take the Postgres data (above) as the example, then what we can calculate from the USL model is the mean R: as in, statistical mean (average) or 1st moment. The median is p50 and is not a moment of the underlying statistical dsn; which we generally don't know. If I were to plot mean-R for the Postgres data, I already know that it should have the classic "hockey stick" shape. To the degree that it doesn't, we have to explain why not.
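A sketch of that derivation: with no think time, Little's law gives the mean response time as $R_N = N/X_N$, so the USL throughput model implies the response-time curve directly. The parameter values below are illustrative, not fitted values from the Postgres data.

```python
# Mean response time derived from USL throughput via Little's law,
# R_N = N / X_N (assuming zero think time). Parameter values are
# illustrative, not the fitted Postgres values.

def usl_throughput(N, X1, alpha, beta):
    """Predicted throughput X_N = C_N * X1 from the USL model."""
    return X1 * N / (1 + alpha * (N - 1) + beta * N * (N - 1))

def mean_response_time(N, X1, alpha, beta):
    """Mean (1st moment) response time; grows as the classic hockey stick."""
    return N / usl_throughput(N, X1, alpha, beta)
```

Note that $R_N = (1 + \alpha(N-1) + \beta N(N-1))/X_1$, so it rises roughly linearly in the contention-dominated region and quadratically once the coherency term takes over, which is exactly the hockey-stick shape referred to above.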

With regard to your point about identifying queueing models, most queueing models compute measures that are characterized by the statistical mean. We may also be able to calculate higher moments from the assumed dsn in certain cases.

So, on the one hand, we would prefer to have sample moments (average, variance, etc.) from the data to compare with any queueing models. Percentiles (e.g., p50, p90, p95), on the other hand, are merely a way of producing a ranked categorization of the sampled data.