Rushing to see numbers

Scientific programmers and algebra students start out with analogous bad habits.

Beginning algebra students rush to enter numbers into a calculator. They find it comforting to reduce expressions to floating point numbers as frequently as possible. This is understandable, but it’s a habit they need to break for numerous reasons: it’s more work, harder for a grader to follow, less accurate, etc. Better to do as much calculation as possible with symbols, then stick in numbers at the end.
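To make the accuracy point concrete, here’s a small Python illustration (my sketch, not part of the original post) using the standard library’s `fractions` module to stand in for symbolic work:

```python
from fractions import Fraction

# Carry exact values through the calculation, as with symbols on paper.
exact = Fraction(1, 10) + Fraction(2, 10)
print(exact == Fraction(3, 10))  # True: exact arithmetic

# Plugging in floating point numbers immediately introduces rounding error.
print(0.1 + 0.2 == 0.3)          # False: 0.1 + 0.2 is 0.30000000000000004
```

Convert to a decimal only at the end, when a numeric answer is actually needed.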

A similar phenomenon occurs in scientific programming. We’re anxious to see numbers, so we print them out as soon as we produce them. There’s a tendency to sprinkle code with printf statements, then write scripts to scrape the text output to gather results.

It’s better to keep numbers in data structures as long as possible, then dump the data to text as the last step. Why? For one thing, the output format might change: instead of a text dump, maybe you’ll want to put the data in a database or format it as an HTML table. More importantly, the “last step” will often change. What was the last step now becomes the next-to-last step because you think of something else to do. So you do more with the data in memory before producing output, rather than scraping the text output.
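A minimal Python sketch of this idea (the data and function names are mine, purely illustrative): compute into a data structure, and make serialization a separate, swappable last step.

```python
import json

# Compute results into a data structure instead of printing as you go.
results = [{"n": n, "square": n * n} for n in range(1, 6)]

# Today's "last step": a text dump ...
def as_text(rows):
    return "\n".join(f"{r['n']}\t{r['square']}" for r in rows)

# ... tomorrow's: JSON (or a database insert, or an HTML table),
# without re-running the computation or scraping text output.
def as_json(rows):
    return json.dumps(rows)

print(as_text(results))
print(as_json(results))
```

When “the last step” becomes the next-to-last step, only the final function changes; the data in memory is untouched.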

I quickly learned to delay plugging numbers into algebraic expressions. It took me longer to learn not to output numeric results until the last moment.


13 thoughts on “Rushing to see numbers”

I think the analogy is faulty. As a grade-school student I wanted to see numbers. As a researcher I want to see patterns. And the patterns of the final result may often be much less illuminating than the patterns of intermediate steps along the way.

That is how I build up an intuitive sense of what I am modelling. Treating my own computations as a black box is exactly the wrong way to go about it; illuminating the insides of the box is usually more valuable than looking at what the box produces.

I don’t mean to imply we shouldn’t look at numbers early and often. I mean that we shouldn’t print numbers early and often, at least not as an alternative to keeping data in memory. Inspecting values via a debugger or printf statements is very useful.

Printing for inspection is good. Printing as a way to avoid keeping up with results is not so good.

My students struggle with this concept when trying to solve exponential equations. At the end, I want both an exact answer and an approximation. They are so quick to turn everything into decimals. All the decimals are way too cumbersome, not to mention inaccurate. They also don’t see numbers like we do. Manipulating expressions is a form of recreation. It is my own little playground.

Writing smaller command-line utilities allows you to chain things together. They can all be written and tested independently. If you get used to that development workflow you can have a lot of momentum on a project without writing much code.

You can do all that with data structures too, obviously, but the investment at each step is greater.

With data analysis, I’m willing to bet that most people spend more time writing one offs than full-fledged programs. I think scripting up things quickly is usually the cheapest solution in these situations.

Writing a good command line utility requires thinking ahead and producing output that is easy for something else to consume. And writing such a utility may require gathering all output in memory before sending it to stdout. So the thought process isn’t that different from keeping data in memory. It’s still different from indiscriminately printing numbers as they are produced.

Is it really common for scientific/statistical programmers to work the way you warn against? For a 7th grader with an algebra problem that totally makes sense. But a scientific programmer? It just seems so obvious that that’s a really really really bad way to do things…

Immediate output is pretty useful when you want to pipeline the computation: once the first step is done, print out the result, hand it over to another machine, and start processing another batch.

Numeric precision is another thing – the typical printing format for floating point numbers loses precision – maybe we should consider binary serialization? That way things are quicker, and there’s no loss of precision.
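The commenter’s precision point can be checked directly in Python (my illustration; the format choices are assumptions): a short decimal format is lossy, while a shortest-round-trip string or a binary encoding preserves the double exactly.

```python
import struct

x = 0.1 + 0.2  # 0.30000000000000004

# A short decimal format silently rounds: the round trip is lossy.
assert float("%.6f" % x) != x

# Python 3's repr() uses the shortest string that round-trips exactly.
assert float(repr(x)) == x

# Binary serialization stores the exact 8 bytes of the double: no loss.
packed = struct.pack("<d", x)
assert struct.unpack("<d", packed)[0] == x
```

So text output need not lose precision if the format is chosen carefully, but binary serialization sidesteps the question entirely.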

I think what you said also applies to other places. I see people fetch JSON objects and immediately render the values to the page. Eventually no one gets around to writing the caching and all the other interesting things… An application never gets made… just a shell.

I would add that looking at the numbers may not provide any insight at all. If you dump a matrix in a graphics program, can you tell if it’s a rotation? The test is to use the matrix to rotate an object, then display the object.
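One way to make that test concrete, in plain Python for the 2-D case (my own sketch, not the commenter’s code): rotate a known vector and check the properties a rotation must preserve, rather than staring at the matrix entries.

```python
import math

def apply(m, v):
    """Multiply a 2x2 matrix (nested lists) by a 2-vector."""
    return [m[0][0] * v[0] + m[0][1] * v[1],
            m[1][0] * v[0] + m[1][1] * v[1]]

theta = math.pi / 3
R = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta),  math.cos(theta)]]

# Staring at the four entries tells you little; what the matrix
# *does* to an object tells you a lot.
v = [3.0, 4.0]
w = apply(R, v)
assert math.isclose(math.hypot(*w), math.hypot(*v))   # length preserved
angle = math.atan2(w[1], w[0]) - math.atan2(v[1], v[0])
assert math.isclose(angle, theta)                     # turned by theta
```

The same principle scales up: in a graphics program, apply the matrix to an object and display the object.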

As a software architect I would say the same is true with non-scientific programming as well. People want to delve quickly into the code rather than explore the patterns in the design. They’re rushing to the concrete rather than exploring patterns in the abstract.

When I started coaching a boy in junior high school math, he insisted on expressing solutions in decimals because they looked “more exact” than fractions or surds. It took me much effort to convince him that there’s more information in the latter kind of answer.