May 25, 2013

Is there anything in software engineering that is just a string? If not, shouldn't String be an abstract class, forcing developers to subtype and at least name datatypes?

Domain-Driven Security

Former colleague Dan Bergh Johnsson, application security expert Erlend Oftedal, and I have been evangelizing the idea of Domain-Driven Security. We truly believe proper domain and data modeling will kill many of the standard security bugs such as SQL injection and cross-site scripting.

This blog post is a case for Domain-Driven Security and a case against strings.

The addHeader() Method in Java

Let's be concrete and dive directly into programming with HTTP headers.

In Java EE's interface HttpServletResponse we find the following method (ref):

void addHeader(java.lang.String name, java.lang.String value)

Not a heavily debated method as far as I know. On the contrary it looks like most such interfaces do. An implementation of the interface may look like this (ref):

It shows we can really set any string as an HTTP header. And that's convenient, right?

The Ubiquitous String

java.lang.String is the ubiquitous datatype that solves all our problems. It can contain anything and nothing and of course it has its sibling in any popular programming language out there. Let's have a look at what a string is.

Java uses Unicode strings in UTF-16 code units which handle over 100,000 characters. As far as I know C# and JavaScript does the same. The max size of strings is often limited by the max size of integers, typically 2^31 - 1 which is just over 2 billion.

So, a string …

is anything between 0 and 2 billion in length,

can contain 100,000 different characters, and

can be null.

Hardly a good spec for HTTP headers.

HTTP Headers By the Spec

RFC 2047 gives us the formal specification of how HTTP headers should look. An excerpt will suffice for our discussion.

message-header = field-name ":" [ field-value ]

field-name = token

field-value = *( field-content | LWS )

field-content = <the OCTETs making up the field-value

and consisting of either *TEXT or combinations of token, separators, and quoted-string>

… which will be interpreted as two responses by the web browser. This is an example of the security attack called HTTP response splitting (link to WASC from where I've adapted my example). And that's just one of the dangers of letting users mess with headers. Setting or deleting cookies is another. In fact, the whole header section is in danger.

The HTTP splitting vulnerability has been fixed under the hood in at least Tomcat 6+, Glassfish 2.1.1+, Jetty 7+, JBoss 3.2.7+. (Thanks for that info, Jeff Williams.)

Should We Fix the addHeader() API?

Now we can ask ourselves two different things. The first is – should we fix the addHeader() and related APIs? Yes. They should look something like this:

… where the two domain classes HttpHeaderName and HttpHeaderValue accept strings to their constructors and validate that the strings adhere to the RFC 2047 specification. In one blow all Java developers are relieved of the burden to write that validation code themselves and relieved of always having to remember running it.

Should String Be An Abstract Class?

The larger question is about strings in general. Yes, they are super convenient. But we're fooling ourselves. We think the time we save by not modeling our domain, by not writing that validation code, by not narrowing down our APIs to do exactly what they're supposed to, we think that time is better spent on other activities. It's not.

I truly believe nothing is just a string. Nothing is any of 100,000 characters and anything between 0 and 2 billion in length.

Therefore String should be an abstract class, forcing us developers to subtype and think about what we're really handling.

Even better, why not have a way to declare that a class can only be used in object composition? That way programmers could choose if an "is-a" relation or a "has-a" relation is most suitable for narrowing down the String class.

24 comments:

I think your post is a good example of why HttpServletResponse should accept a custom type of String, but still doesn't solidly argue that String should be abstract.

To me, "Hi" is still a String. What kind of String would you make it? Obviously it depends upon the context, so perhaps it would be a LogMessage if logging, or perhaps an ErrorMessage if throwing an exception. Is this essentially what you're getting at? Or could there be a SimpleString that would handle cases of overlap, since LogMessage and ErrorMessage might be identical in behavior?

Absolutely. Besides, people would start to write hundreds of versions of THE 'standard' string. If certain methods require a string to have a certain format, then there's other ways to a achieve these restrictions in java.

If I would model "Hi" I would start with the class EnglishWord and a few test cases around it. If it didn't suffice I'd move on to EnglishSentence or EnglishText. All three would differ significantly from String.

I want a really tight model of what my system actually handles. It makes the code more readable and avoids several bugs including null values. Such coding style depends on good refactoring tools so that an expansion from EnglishSentence to EnglishText would not be a major task.

I agree with you on most of the points since currently i am working on a project and have to validate almost 50 fields most of which are string due to client's requirements, but when it comes to declaring classes as EnglishWords / EnglishSentences it would cause problem for other language usage.

I agree with immutability and as of today Java has used its only tool for making sure strings stay immutable, namely declaring them final.

Declaring String abstract implies removing final but that's not at the core of what I'm getting at. I suggest declaring String abstract since that's simply the only way to force developers not to use plain strings.

If there was another way of doing this, for instance declaring a class immutable and restricting in what ways you could extend, override, and overload I would not be worried about the challenges Effective Java mentions.

Actually, my thoughts on this do apply to integers. Almost nothing is "just an integer" and a plethora of bugs have come to life because we use integers in a sloppy way.

To start with Java lacks even the basic support of unsigned integers. At the same time lots of things we model as integers does not make sense in negative form. This has led to infamous bugs where you could order a negative number of bookshelves or stocks.

For languages who do support unsigned integers you instead get bugs due to implicit typecasts, for instance memory corruption in C/C++.

As for variable naming it doesn't help much. A string is a string even though I call my string variable humanName. I'm arguing we need formal modelling and compiler or runtime support for making sure our data adheres to the model.

What about using annotations to describe the type of string and get those strings validated? Not just for validation purposes but also to describe what type of data is contained in the string. That would be really useful for code reviews and SAST tools

I'm not sure annotations will be expressive enough for validation. Take the spec for HTTP headers for instance. How would such an annotation look? How would it capture all the rules for allowed characters and syntax? I suspect we'd be moving towards regular expressions and thus be doomed. :)

I am not quite sure where your argument for making String itself an abstract class (on a whole) comes in. You spent much of the article talking about the problem with addHeader allowing virtually anything to be added and why perhaps creating a customized string class would be advisable in this probably needed. Which you are right.

But all you have shown is an edge use case for this particular domain. Hardly enough evidence to warrant String itself to be an abstract class. Create your own version of the String class for your domain and move on.

I think you have done a good job arguing that HttpHeaderName and HttpHeaderValue are not Strings. On the other hand, having them sub class string would imply that they *are* Strings, participating in an "is a" relationship.

Having them in their own classes seems like a good idea, but I don't see any reason why they would sub class String. They might use a string internally to store the actual data they need, but that doesn't mean they need to *be* Strings.

"where the two domain classes HttpHeaderName and HttpHeaderValue accept strings to their constructors and validate that the strings adhere to the RFC 2047 specification"...

Well you can still accept plain Strings and validate them in the addHeader method in the same manor you would do with your header objects. And of course mention this validation in the documentation. Or accept both variants where the Strings' method will create your header objects and delegate the call to the other method. It would be more readable code than just forcing a programmer to always wrap the String by another object.

I think the more pragmatic view should be considered. Having abstract String would be really annoying. What would be the String's abstract methods? You can create your own wrappers over String or just use some annotation that will describe your specific validation policy or whatever.

The main "problem" is that both "storage type" and "domain type" are represented in the same way - and most people end up using a simple storage type when they should be encapsulating one or more storage types within a domain type.

It is worth thinking about ways individuals and the programming language itself can aid in making people use domain types instead of storage types but I cannot see how making all storage types abstract would accomplish that.

Hi - totally agree with the sentiment, but the example is undermining the article. First off, we recently tested the following containers, and all the modern ones are not susceptible to header injection. (Note that it is not yet specified in the Servlet Spec, and I have taken that up with the expert group).

Funny how the appsec community doesn't notice when something gets fixed :-).

Second, in Tomcat, they already have specific classes to model headers and groups of headers (see MimeHeaders and MimeHeaderField). They didn't extend String, but they made a separate class for this. As usual, the details under the covers are complex, since converting bytes on the wire to UTF-16 Java Strings is messy.

Still, totally agree with your points about strong typing and the benefits for security.

I am interested in your concept, but have a suggestion on the approach.Rather than inheriting the String class, it would be much better to wrap the String class with classes like javax.servlet.http.HttpHeaderName to introduce the validations and restrictions. This way all the String manipulations taken care of JVM is preserved.

So this doesn't really relate to the topic of your post, but I wanted to be able to reference the RFC for my devs to get them to make sure we weren't vulnerable to the response splitting example you give. At a minimum I wanted to make sure that we didn't allow CRLF in header values.

Problem was it seems CRLF are allowed?! From the RFC (which is posted in the blog entry)

field-content =

TEXT =

LWS = [CRLF] 1*( SP | HT )

So LWS includes CRLF and indeed from RFC2616 (not sure why you referenced RFC2047 in the article?) we have:

"HTTP/1.1 header field values can be folded onto multiple lines if the continuation line begins with a space or horizontal tab."

and specifically:

"A CRLF is allowed in the definition of TEXT only as part of a header field continuation."

Which means the allowed set of characters for the header value is not:

"Finally the header value can consist of any ASCII chars 9, 32-126 except 19 chars called separators … or a mix of tokens, separators, and quoted strings."

As 13 and 10 are allowed, sequentially (in that order), as long as the next character is a space or horizontal tab.

HTTP Response Splitting turns out to be harder to defend against than just stripping CRLF, who knew?

I totally agree with your argument, I have had the same thoughts regarding "String customerId" and "String accountNumber", and whatnot. However, making String abstract is not the only solution here, as others have commented. You can of course declare an HttpHeader or a CustomerId or AccountNumber type without subclassing String.

I think it makes perfect sense in some context to have unconstrained strings, and I would really like to see your validator for the "EnglishWord" class ;)

Other than that, I definitely agree that having an API accepting any string as an input, and outputting it without validating it is dangerous. But, then again, NEVER trust user input, no matter whether it's coming from a text box or an http header or cookie or whatever ;)