The evolving Swift string API and implementation

As Microsoft began doing a couple of years ago, Apple’s language designers are designing the next version of Swift in public.[1] One example of the new design is the discussion of String Processing For Swift 4 (GitHub). If you read through the relatively long document, you can at least see that they’re giving the API design a tremendous amount of thought.

API Considerations for Strings

There are so many factors to weigh when building the API, especially for a low-level construct like String.

As they state right at the beginning of the document, they are concerned with “Ergonomics, Correctness, Performance” (probably in that order).

How does the API affect storage?

Is it still possible to use a copy-on-write (COW) pattern to save memory for multiple copies of the same string? Other, similar languages like C# and Java have slowly moved to a more eager copying mechanism to reduce complexity in the memory manager for strings, especially in multi-threaded code.
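Whether storage is shared under the hood is an implementation detail, but the behavior that copy-on-write has to preserve is observable. The following is a minimal sketch, assuming Swift 4 or later; the variable names are only illustrative.

```swift
// Value semantics: the observable contract that copy-on-write must preserve,
// however the storage is shared internally.
let original = "hello"
var copy = original          // may share storage with `original` at this point
copy.append(", world")       // the mutation affects only `copy`

print(original)  // hello
print(copy)      // hello, world
```

The copy only needs its own storage once it is mutated; until then, an implementation is free to share a single buffer.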

How allocation-efficient is the base string library? Does the API help the more well-worn code paths avoid allocation unless absolutely necessary?

What about slicing support? Does the API force copying when it would not be needed? Does it at least allow the decision to copy to be delayed until absolutely necessary?

How accessible are the various supported representations? (E.g. UTF8 vs. UTF16)
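To make the question concrete, here is a small sketch of how the same text looks through each representation. It assumes Swift 4 or later (where `count` is available directly on String); the sample string is made up and written with escapes so that its encoding is unambiguous.

```swift
// "café" plus one emoji, written with escapes to pin down the exact scalars.
let text = "caf\u{E9}\u{1F600}"

let characterCount = text.count                // 5 extended grapheme clusters
let scalarCount = text.unicodeScalars.count    // 5 Unicode scalar values
let utf16Count = text.utf16.count              // 6 UTF-16 code units (the emoji needs a surrogate pair)
let utf8Count = text.utf8.count                // 9 UTF-8 code units (é takes 2 bytes, the emoji 4)

print(characterCount, scalarCount, utf16Count, utf8Count)
```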

How well does the optimally ergonomic API perform when interoperating with Objective-C? This is a common case, because thunking between Swift code and Objective-C/Cocoa APIs happens constantly, so it must be fast and as close to allocation-free as possible.

Is immutable the default, with mutability opt-in? (This prevents unwanted copies and dangling references in the reference-counted world of Swift … although Strings are actually structs rather than classes.)

Does the API do the “right thing” by default? In the case of Swift’s string-handling, this means that the caller of the API works with Unicode graphemes, by default.
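A short sketch of what working with graphemes by default means in practice, assuming Swift 4 or later. The two strings are illustrative: the same letter written once as a single scalar and once as a base letter plus a combining accent.

```swift
let precomposed = "\u{E9}"     // é as a single Unicode scalar
let decomposed = "e\u{301}"    // e followed by a combining acute accent

// Both count as one Character, even though the scalar counts differ.
print(precomposed.count, decomposed.count)          // 1 1
print(decomposed.unicodeScalars.count)              // 2

// String comparison treats canonically equivalent sequences as equal.
let sameText = (precomposed == decomposed)
print(sameText)                                     // true
```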

What about case-sensitive/insensitive comparisons? Accent sensitivity?
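For comparison, this is roughly how these options look today when you lean on Foundation rather than on the standard library. This is a sketch of the existing Foundation calls, not of the API proposed in the document; the sample strings are made up.

```swift
import Foundation

let accented = "R\u{E9}sum\u{E9}"   // "Résumé"
let plain = "resume"

// Ignoring case alone is not enough: the accents still differ.
let caseOnly = accented.caseInsensitiveCompare(plain) == .orderedSame

// Ignoring both case and diacritics makes the two strings compare as equal.
let caseAndAccents = accented.compare(
    plain,
    options: [.caseInsensitive, .diacriticInsensitive]
) == .orderedSame

print(caseOnly, caseAndAccents)
```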

What about ordering? Collation? Localization?

Does the API scale nicely to allow increasing specificity, with good defaults?

Is there consistency within the string API?

What about consistency with similar constructs, like Array?

How does the API fit with developer expectations? Should the String be a Collection? If so, what is the default item-type?
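As a rough illustration of what a Collection of Character buys you, here is a sketch that assumes Swift 4 or later, where String conforms to Collection. The sample text is arbitrary.

```swift
let greeting = "Hello, World"

// Iterating a String yields Character values (extended grapheme clusters).
var vowels = 0
for character in greeting {
    if "aeiou".contains(character) {
        vowels += 1
    }
}

// Generic collection algorithms apply directly.
let reversed = String(greeting.reversed())
let firstCharacter = greeting.first      // an optional Character

print(vowels)      // 3
print(reversed)    // dlroW ,olleH
```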

Why doesn’t Character have the same or a similar API as String? (E.g. why can’t you get the sub-structure of the grapheme cluster for a character without first converting it to a String?)
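The detour in question looks something like the following sketch. The character is an illustrative one, built from a base letter and a combining accent.

```swift
// One Character made of two Unicode scalars: e + combining acute accent.
let accentedE = Character("e\u{301}")

// Going through String to inspect the scalars inside the grapheme cluster.
let scalarValues = String(accentedE).unicodeScalars.map { $0.value }

print(scalarValues)   // [101, 769], i.e. U+0065 and U+0301
```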

Slices/Substrings

A good example is the discussion of how to represent string slices: should there be a separate type, called Substring, analogous to the ArraySlice that already exists for an Array?

“Long-term storage of Substring instances is discouraged. A substring holds a reference to the entire storage of a larger string, not just to the portion it presents, even after the original string’s lifetime ends.

“[…]

“The downside of having two types is the inconvenience of sometimes having a Substring when you need a String, and vice-versa. It is likely this would be a significantly bigger problem than with Array and ArraySlice, as slicing of String is such a common operation. It is especially relevant to existing code that assumes String is the currency type – that is, the default string type used for everyday exchange between APIs. To ease the pain of type mismatches, Substring should be a subtype of String in the same way that Int is a subtype of Optional<Int>.”
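Here is a small sketch of the two-type model from the caller’s side. It assumes Swift 5 (for `firstIndex(of:)`) and that, as in the shipped language as far as I know, the conversion from Substring to String is written explicitly rather than happening implicitly as proposed in the quote. The function `store` is a hypothetical stand-in for any API that takes a String.

```swift
let line = "name=Swift"
let separator = line.firstIndex(of: "=")!

// Slicing yields Substring values that share storage with `line`.
let key = line[..<separator]
let value = line[line.index(after: separator)...]

// Hypothetical API that uses String as its currency type.
func store(_ text: String) -> Int {
    return text.count
}

// An explicit conversion copies the slice into its own String.
let stored = store(String(key))

print(key, value, stored)   // name Swift 4
```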

To implement Collection or not?

For those who watch the Swift API evolve from one major version to the next, with each release introducing backward-incompatible changes, this document should be reassuring: the changes are not made lightly. It may seem like the designers don’t have a plan, but over the years both designers and opinions change. For example, witness the discussion of what the default representation of a string should be.

“[…] in Swift 1.0, String was a collection of Character (extended grapheme clusters). […] In Swift 2.0, String’s Collection conformance was dropped, because we convinced ourselves that its semantics differed from those of Collection too significantly.”

After listing several reasons why the change in Swift 2.0 was not a good direction, they conclude that in 4.0, they should revert to the original behavior.

“It would be much better to legitimize the conformance to Collection and simply document the oddity of any concatenation corner-cases, than to deny users the benefits on the grounds that a few cases are confusing.”
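One such concatenation corner case can be sketched in a few lines, assuming Swift 4 or later. For most collections, the count of a concatenation is the sum of the counts of the parts; with grapheme clusters, that does not always hold.

```swift
let base = "e"
let accent = "\u{301}"          // a lone combining acute accent
let combined = base + accent    // the accent attaches to the preceding "e"

let sumOfParts = base.count + accent.count   // 2
let countOfWhole = combined.count            // 1

print(sumOfParts, countOfWhole)
```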

Again, the discussion is open and public and, despite the claims of some who think that they’re just a bunch of cowboys changing stuff willy-nilly, they have a documented plan.

It’s unfortunate that it took them so long to get there, but this kind of design isn’t always easy.

Consolidating Index Types

Because Swift uses Unicode grapheme clusters as the default “items” view for strings, the discussion of string indices might seem unnecessarily abstract for developers coming from other languages, where the index is always an int offset in bytes.

“String currently has four views–characters, unicodeScalars, utf8, and utf16 […]”

Because of these different views, it’s necessary to discuss how to reduce API surface by consolidating the various index types used to refer to individual elements in these different “views” on a String.
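To show what a consolidated index looks like from the caller’s side, here is a sketch that assumes Swift 5, where (to my understanding) a single String.Index type is shared by all of the views. The sample string is illustrative.

```swift
let text = "caf\u{E9}!"   // "café!"

// Indices are opaque positions, not integers; they are derived by walking the string.
let position = text.index(text.startIndex, offsetBy: 3)

// The same index can be used with each view of the string.
let character = text[position]                      // the Character é
let scalar = text.unicodeScalars[position].value    // 0xE9
let utf16Unit = text.utf16[position]                // 0xE9
let utf8Unit = text.utf8[position]                  // 0xC3, the first of two UTF-8 bytes

print(character, scalar, utf16Unit, utf8Unit)
```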

Doing the Right Thing

It’s not as though C# (or most other mainstream languages) has anything to brag about in its string handling. In that respect, even Swift 1 and 2 are light-years ahead in Unicode correctness, with their focus on grapheme clusters rather than the 90s-era bytes and UTF-16 code units still exposed by default in those other languages.

“A Substring passed where String is expected will be implicitly copied. When compared to the “same type, copied storage” model, we have effectively deferred the cost of copying from the point where a substring is created until it must be converted to String for use with an API.

“A user who needs to optimize away copies altogether should use this guideline: if for performance reasons you are tempted to add a Range argument to your method as well as a String to avoid unnecessary copies, you should instead use Substring.”
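The guideline in the second quote could look something like this in practice. It is a minimal sketch assuming Swift 4 or later; `spaceCount(in:)` is a hypothetical helper, not something from the proposal.

```swift
// Hypothetical helper: takes a Substring instead of a String plus a Range.
func spaceCount(in text: Substring) -> Int {
    return text.filter { $0 == " " }.count
}

let sentence = "the quick brown fox"

// The whole string, viewed as a slice.
let inWhole = spaceCount(in: sentence[...])

// Only the first nine characters ("the quick"), again as a slice.
let inPrefix = spaceCount(in: sentence.prefix(9))

print(inWhole, inPrefix)   // 3 1
```

The caller decides which part of the string to pass by slicing, so the function needs no separate range parameter.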

Their goal is noble, though it’s unclear to what degree the vision can be realized. The following quotation could serve as the high-level goal of any API.

“We should represent these aspects as orthogonal, composable components, abstracting pattern matchers into a protocol like this one, that can allow us to define logical operations once, without introducing overloads, and massively reducing API surface area.”