Validation and the Scientific Organization: Organizing U.S. Climate Modeling (4)

This entry is the last I will be writing about organizing U.S. climate modeling, software, and open source communities – for a while. At the end of this entry are links to the blogs/articles in a couple of series. I am going to start by quoting a comment from atmoaggie on the previous entry.

“The difference between all of those (rbr: types of models in a previous comment) and climate models is the ability to study their validity.

I would like to see a climate modeling 10 year forecast of some parameters, such as, maybe, average SST for the month of June 2021. Too specific? How about average global SST for JJA (summer) 2021. Still too specific? Maybe the average global SST for the next 10 years.

I, too, work in modeling. In storm surge modeling, one can very easily tune a model to better match the results for one storm (by adjusting air-sea drag, e.g.) only to find that the model is not useful for forecasting as another parameter or physical calculation is incorrect (the sea floor friction formulation, e.g.)

I bring this up to illustrate what can go wrong when modeling a hindcast, tuning to match observations, and applying that model to forecasts. And climate is far more complex, I think, than tides and TC wind and pressure-forced storm surges.”

I want to bring together two streams of thought that I have pursued over the past few months – validation and the scientific organization. First, I will discuss whether or not climate models can be validated and then argue that the development of a validation plan is at the center of developing a scientific organization.

Validation: As suggested in some of my earlier entries, the question of whether or not climate models can be validated is controversial. The controversy lies, first, in philosophy. The formal discussion of whether climate models can or cannot be validated often starts with a widely cited paper by Naomi Oreskes et al. entitled Verification, Validation, and Confirmation of Numerical Models in the Earth Sciences. In fact, quoting the first two sentences of the abstract:

“Verification and validation of numerical models of natural systems is impossible. This is because natural systems are never closed and because model results are always nonunique.”

Oreskes et al. argue that the performance of the models can be “confirmed” by comparison with observations. However, if the metric of “validation” is a measure of absolute truth, then such absolute validation is not possible. By such a definition, little of the science of complex systems – which would include most biological science, medical science, and nuclear weapons management – can stand up to formal validation.

I will return to the stream I started with the quote from atmoaggie, which refers to storm surge models (see here for an excellent discussion of storm surges by Resio and Westerink). The point of the comment is that a storm surge model can be tuned, and thereby calibrated, based on theory and observations of past storm surges, yet the model may still fail in future predictions of storm surges. This points out a weakness in the development of models of natural systems: adjusting a model to represent a historical situation does not assure that the model correctly represents the physics of cause and effect. In fact, this is a general problem with modeling complex natural systems – if you get the answer “right,” that does not mean you got it right for the right reason. Hence, in the spirit of Oreskes et al., validation is not possible – there is no absolute to be had.

Yet, aren’t storm surge models useful and usable? The same situation is true for weather models and river forecast models, their correctness cannot be assured in any absolute sense, but aren’t they useful and usable? Atmoaggie poses a set of predictions, all of which are reasonable propositions, that may or may not be convincing to him or her. These do not represent a complete set of metrics to evaluate models, and the success or failure of these predictions does not state in any absolute sense whether or not the models have usable information. There are many more elements of model evaluation that determine our level of confidence in the use of models.

It is easy, therefore, to establish that models that cannot be formally validated can be both useful and usable. The results of these models might not be certain, but the degree of confidence that can be attributed to their calculations is very high. This confidence is, in general, established by many forms of model evaluation and additional sources of relevant information, most importantly, observations and basic physical principles.

Validation, verification, evaluation, certification, confirmation, calibration: All of the words in this list have been used in discussions of how to assess the quality of models. For some, there are nuanced differences between the words, but in general discussion they are all likely to take on the same meaning – some quantitative measure of model quality. The word “validation,” however, is special. Within political or philosophical arguments, the statement “models cannot be validated” carries a powerful message, especially if one establishes as a principle that the elimination or reduction of uncertainty is required prior to taking action (see Shearer and Rood). Many scientists take on the mantra that climate models cannot be validated. When I worked at NASA, the culture was that measurements of temperature (for example) could be validated, but that models could not. But if one is talking about temperatures from satellites over a deep layer of the atmosphere, in the spirit of Oreskes et al., can satellite temperature measurements be validated? We can state with stunning confidence that the satellite temperatures are within a certain closeness of a more intuitive or accepted measure of temperature – like a thermometer on a balloon. This is, to me, more calibration than validation, but in my world at NASA, calibration was done in a lab with standards (and that is why we have NIST). At NASA we talked about models being “evaluated.”
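The satellite-versus-balloon comparison above is, in spirit, a calibration exercise: quantify how close one measure of temperature is to another, more accepted one. A minimal sketch of that bookkeeping follows; all numbers are invented stand-ins for illustration, not real retrieval or radiosonde data.

```python
import numpy as np

# Hypothetical sketch: comparing a satellite deep-layer temperature retrieval
# against collocated radiosonde (balloon) layer-mean temperatures.
# Synthetic data: the retrieval has an invented +0.8 K bias and random noise.
rng = np.random.default_rng(0)
sonde_temp = rng.normal(250.0, 5.0, size=200)                 # balloon temps (K)
satellite_temp = sonde_temp + 0.8 + rng.normal(0, 1.2, 200)   # retrieval (K)

bias = np.mean(satellite_temp - sonde_temp)                   # systematic offset
rmse = np.sqrt(np.mean((satellite_temp - sonde_temp) ** 2))   # overall closeness
corr = np.corrcoef(satellite_temp, sonde_temp)[0, 1]          # covariation

print(f"bias = {bias:+.2f} K, rmse = {rmse:.2f} K, r = {corr:.3f}")
```

The point is not the particular statistics but that the result is a statement of closeness to an accepted reference – a calibration – rather than a claim of absolute truth.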

Other arguments I have heard about climate models defying validation come down to what we choose to validate against – what is our standard? Suppose that you are interested in how well the model represents the Pacific Ocean, and I am interested in how well it represents the Arctic Ocean. The scientist down the hall wants to know how well it represents the ice-age cycles, and another wants to know how well it represents the 20th century temperature variability. There is no absolute way to make these choices. More fundamentally, if it is a climate model, then how do we measure “climate?”

The list goes on – I have frequently heard one community making critical remarks about the “science” of other communities. The weather forecast community relies strongly on forecast skill scores, but these measures are by no means unique and, for a variety of reasons, are often only indirectly relevant to the quality of climate models. There is no fundamental reason that an excellent climate model would automatically be an excellent weather forecasting model, and the opposite is true as well. Over the years of my career there have been criticisms of climate science from other fields of physics. The gist of these arguments is: climate scientists don’t validate models the way we do; we do a good job; therefore, they don’t. Such arguments make great fuel for political argument and the maintenance of doubt. (Here is an interesting article by Oreskes and Renouf.)

Validation is, therefore, both controversial and important. I pose that validation is at the center of the development of the scientific organization.

Validation and the Scientific Organization: The definition I have posed for the scientific organization is an organization that, as a whole, functions according to the scientific method. Therefore, if it is a climate modeling organization, the model development path – the modeling problems that are being addressed – is determined in a unified way. In that determination, it is required that ways to measure success be identified. This leads to a strategy of evaluation that is determined prior to the development and implementation of model software. With an evaluation strategy in place, a group of scientists who are independent of the developers can be formed to serve as the evaluation team.

The development of an evaluation plan requires that a fundamental question be asked: What is the purpose of the model development? What is the application? If the model is being developed simply to do “science,” then there is no real constraint that balances the interests of one scientific problem against another, and there is little or no way to set up a ladder of priorities.

Again, I will emphasize that to achieve this, and it can be achieved, is a matter of governance and management. It is a process of developing organizational rather than individual goals. It is a myth to imagine that if a group of individuals are each making the “best” scientific decisions, the accumulation of their activities will be the best integrated science. Science and scientists are not immune to The Tragedy of the Commons. If one wants to achieve scientifically robust results from a unified body of knowledge, then one needs to manage the components of that body of knowledge so that, as a whole, the scientific method is honored. Enough on that pulpit.

Back to evaluation and validation – Minimally, the arguments about the nuanced meaning of validation and evaluation are a subject about which the climate modeling community needs to develop a standard. By my interpretation, the evaluation of climate models can be structured and quantified as “validation.”

The software we produced was an amalgam of weather forecasting and climate modeling. For the validation plan, the strategy was to define a quantitative baseline of model performance for a set of geophysical phenomena. These phenomena were broadly studied and simulated well enough that they described a credibility threshold for system performance, and they were chosen to represent the climate system. Important aspects of this validation approach were that it was defined by a specific suite of phenomena, that it formally separated validation from development, and that it relied on both quantitative and qualitative analysis.

The validation plan separated “scientific” validation from “systems” validation. It included routine point-by-point monitoring of simulations and observations, formal quality assessment by measures of fit between simulations and observations, and calculation of skill scores against a set of “established forecasts.” There was a melding of methodologies from the study of weather and the study of climate. We distinguished the attributes of the scientific validation from the systems validation. The systems validation, focused on the credibility threshold described above, used simulations of longer time scales than the established forecasts and brought attention to a wider range of variables important to climate. The scientific validation was a more open-ended process, often requiring novel scientific investigation of new problems. The modeling software system was released for scientific validation and use after a successful systems validation.
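One of the quantitative pieces above – a skill score against an established reference forecast – can be sketched as follows. This is a minimal illustration with invented numbers, using a standard mean-square-error skill score; it is not the specific formulation used in the plan described.

```python
import numpy as np

# Mean-square-error skill score: 1 is a perfect forecast, 0 means no better
# than the reference (e.g. climatology), negative means worse than it.
def mse_skill_score(forecast, reference, observed):
    mse_f = np.mean((forecast - observed) ** 2)
    mse_r = np.mean((reference - observed) ** 2)
    return 1.0 - mse_f / mse_r

# Illustrative numbers only: e.g. five monthly-mean SSTs in deg C.
obs = np.array([15.2, 15.8, 16.4, 16.1, 15.5])
climatology = np.full_like(obs, 15.8)              # reference: long-term mean
model = np.array([15.4, 15.7, 16.2, 16.0, 15.6])   # simulated values

ss = mse_skill_score(model, climatology, obs)
print(f"skill score vs. climatology: {ss:.3f}")    # > 0: beats climatology
```

Scores like this are useful precisely because they are relative: they quantify performance against an agreed reference, not against absolute truth.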

The end result of this process was a quantitative description of the modeling system against a standard set of measures from one modeling release to the next. Did it meet the criterion of absolute validation? No. Did it provide a defensible quantitative foundation for scientific software and its application? Yes.

All told, little is gained by basing a body of scientific knowledge on the premise that validation is “impossible.” Rather than following such a premise, which immediately devalues the knowledge base, it is more useful to develop a systematic approach to robust, appropriate validation. Such an approach represents the complexity of the Earth’s climate and its investigation, and it serves not only the scientific method but also the communication of that science to other scientists and to those with a stake in the scientific results. It sets a standard.

The deep-ocean temperature trends in the 15 models that had complete ocean data archived for the period 1955-1999 are surprising, because I expected to see warming at all depths. Instead, the models exhibit wildly different behaviors, with deep-ocean cooling just as likely as warming depending upon the model and ocean layer in question.

Three of the models actually produced average cooling of the full depth of the oceans while the surface warmed, which seems physically implausible to say the least. More on that in a minute.

The most common difference between the models and the observations down to 700 m (the deepest level for which we have Levitus observations to compare to) is that the models tend to warm the ocean too much. Of those models that don’t, almost all produce unexpected cooling below the mixed layer (approximately the top 100 m of the ocean).

From what I understand, the differences between the various models’ temperature trends are due to some combination of at least 3 processes:

1) CLIMATE SENSITIVITY: More sensitive models should store more heat in the ocean over time; this is the relationship we want to exploit to estimate the sensitivity of the real climate system from the rates of observed warming compared to the rates of warming in the climate models.

2) CHANGES IN VERTICAL MIXING OVER TIME: The deep ocean is filled with cold, dense water formed at high latitudes, while the upper layers are warmed by the sun. Vertical mixing acts to reduce that temperature difference. Thus, if there is strengthening of ocean mixing over time, there would be deep warming and upper ocean cooling, as the vertical temperature differential is reduced. On the other hand, weakening mixing over time would do the opposite, with deep ocean cooling and upper ocean warming. These two effects, which can be seen in a number of the models, should cancel out over the full depth of the ocean.

3) SPURIOUS TEMPERATURE TRENDS IN THE DEEP OCEAN: This is a problem that the models apparently have not fully addressed. Because it takes roughly 1,000 years for the ocean circulation to overturn, it takes a very long run of a climate model before the model’s deep ocean settles into a stable temperature, say to within 0.1 deg. C. While some knowledgeable expert reading this might want to correct me, it appears to me that some of these models have spurious temperature trends unrelated to the CO2 (and other) forcings imposed on the models during 1955-1999. This is likely due to insufficient “spin-up” time for the model to reach a stable deep-ocean temperature. Until this problem is fixed, I don’t see how models can address the extent to which the extra heat from “global warming” is (or isn’t) being mixed into the deep ocean. Maybe the latest versions of the climate models do better.
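The spin-up problem in (3) is often diagnosed by comparing the forced run against an unforced control run: any temperature trend in the control is, by construction, spurious drift, and a common (if crude) correction subtracts it from the forced-run trend. A toy sketch with synthetic data follows; every number here is invented for illustration.

```python
import numpy as np

years = np.arange(1955, 2000)
rng = np.random.default_rng(1)

# Synthetic deep-ocean temperature anomalies (deg C):
# the control run drifts cool (spurious), the forced run warms slightly.
control_deep = 1.2 - 0.002 * (years - 1955) + rng.normal(0, 0.005, years.size)
forced_deep  = 1.2 + 0.001 * (years - 1955) + rng.normal(0, 0.005, years.size)

control_trend = np.polyfit(years, control_deep, 1)[0]  # deg C / year (drift)
forced_trend  = np.polyfit(years, forced_deep, 1)[0]   # deg C / year
corrected     = forced_trend - control_trend           # drift-corrected trend

print(f"control drift {control_trend*100:+.2f}, forced {forced_trend*100:+.2f}, "
      f"corrected {corrected*100:+.2f} (deg C per century)")
```

In this toy case the drift correction increases the apparent forced warming, because the spurious drift was a cooling; with a warming drift the correction would go the other way. Real drift corrections are more involved, but the bookkeeping is the same.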
