It was 7 minutes after midnight. The dog was lying on the grass in the middle of the lawn in front of Mrs. Shears' house.

That’s 121 characters, in Twitter’s eyes: An exact count of the letters, numbers, spaces, periods and single-quotes contained in the tweet.


Whitespace Counts

Note that Twitter counts leading and trailing spaces when calculating a string’s length. Given these two strings:

 It was 7 minutes after midnight. The dog was lying on the grass in the middle of the lawn in front of Mrs. Shears' house.
It was 7 minutes after midnight. The dog was lying on the grass in the middle of the lawn in front of Mrs. Shears' house.

Line 1 is 122 characters long (leading whitespace); Line 2 is 121 characters long.

Twitter also counts multiple white spaces between words in a tweet. Given these two strings:

It was     7 minutes after midnight. The dog was lying on the grass in the middle of the lawn in front of Mrs. Shears' house.
It was 7 minutes after midnight. The dog was lying on the grass in the middle of the lawn in front of Mrs. Shears' house.

Line 1 is 125 characters long (4 extra spaces between “was” and “7”); Line 2 is 121 characters long.

Finally, Twitter counts a carriage return / newline combination as a single character. (Technically, it combines carriage return (\r) and newline (\n) into just a newline (\n).) Given these two strings:

It was 7 minutes after midnight.
The dog was lying on the grass in the middle of the lawn in front of Mrs. Shears' house.
It was 7 minutes after midnight. The dog was lying on the grass in the middle of the lawn in front of Mrs. Shears' house.

Lines 1 and 2 combined are 121 characters (after trimming the trailing space off Line 1, following the period, and replacing it with a newline); Line 3 is 121 characters.

So, trim your tweet strings, or you may get unexpected results; and consider running your tweets through a regular expression to remove duplicate whitespace, unless you anticipate multiple spaces being necessary within a tweet.
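As a rough illustration of the whitespace rules above, here is a minimal Python sketch. The helper names (normalize_newlines, tidy_whitespace) are made up for this example, and the pattern only collapses runs of spaces and tabs; adjust it if your tweets need something different.

```python
import re

TWEET = ("It was 7 minutes after midnight. The dog was lying on the grass "
         "in the middle of the lawn in front of Mrs. Shears' house.")

def normalize_newlines(text):
    """Fold each carriage return / newline pair into a single newline."""
    return text.replace("\r\n", "\n")

def tidy_whitespace(text):
    """Trim both ends, then collapse runs of two or more spaces or tabs."""
    text = normalize_newlines(text).strip()
    return re.sub(r"[ \t]{2,}", " ", text)

padded = " " + TWEET                               # one leading space
spaced = TWEET.replace("was 7", "was     7")       # four extra spaces

print(len(TWEET), len(padded), len(spaced))        # 121 122 125
print(len(tidy_whitespace(padded)))                # 121
print(len(tidy_whitespace(spaced)))                # 121
```

Newlines inside the tweet are left alone on purpose, since a line break is a legitimate single character in the count.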

Counting Link Lengths

Per the Twitter documentation, as of this writing every link run through t.co will be either 22 characters (http) or 23 characters (https) in length, regardless of the length of the actual URL you submitted.

Therefore, we can use a regular expression to find all links in our tweet’s body, and count each link as 22 or 23 characters long rather than its actual length. Something like this should find web links:

Will the regex above pass every correct URL possible? Nope; a really weird one might get rejected. Will it pass some poorly formed URLs? Probably.

For all but edge cases, the regex above will do just fine for finding web links. If you need something better, good; find it or make it, then use it, but please spare everyone the details of why your regex is better than this one. On behalf of everyone who has a life to live, we thank you.

Counting matches against this pattern is a two-step process:

1. First, we need to subtract the actual length of each URL in our tweet from the length of our string; then

2. We need to add back, to the length of our string, either 22 or 23 for each URL in the tweet, depending on whether it is secure (https).
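The two steps can be sketched in Python roughly as follows. URL_PATTERN is a deliberately simple stand-in written for this example (anything starting with http:// or https:// up to the next whitespace), not the pattern discussed above, and the function and constant names are likewise assumptions. It will also swallow trailing punctuation and will not catch protocol-less links.

```python
import re

# Hypothetical, simplified pattern for illustration only.
URL_PATTERN = re.compile(r"https?://\S+", re.IGNORECASE)

# Lengths quoted in the text; they can change, so they are parameters.
SHORT_URL_LENGTH = 22
SHORT_URL_LENGTH_HTTPS = 23

def tweet_length(text, http_len=SHORT_URL_LENGTH,
                 https_len=SHORT_URL_LENGTH_HTTPS):
    """Estimate a tweet's length with every matched link counted at t.co length."""
    length = len(text)
    for match in URL_PATTERN.finditer(text):
        url = match.group(0)
        # Step 1: remove the link's real length.
        length -= len(url)
        # Step 2: add back the shortened length for its protocol.
        if url.lower().startswith("https://"):
            length += https_len
        else:
            length += http_len
    return length

print(tweet_length("Read http://example.com/a/very/long/path now"))  # 31
print(tweet_length("See https://example.com"))                       # 27
```

Passing the two lengths as parameters keeps the function usable if the t.co lengths change.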

Important note on what t.co will shorten: The t.co shortener will shorten what looks like a top-level and second-level domain combo, even if it doesn’t have a protocol.

In other words, if you were to include the term ASP.NET in your tweet, Twitter would consider that a URL, prepend it with http://, then run it through the t.co shortener.

To avoid this behavior, either put a space between the two (e.g., ASP .NET) or replace the period with a Unicode character that looks like a period, but isn’t (e.g., the one dot leader, \u2024; note that \u002e is the ordinary period itself).

Important note on what t.co won’t shorten: Twitter will not wrap certain links with t.co. These include two or more links that are joined by commas or periods without spaces (e.g., http://www.site.com,http://www.example.com), and links containing credentials (e.g., http://user:pass@example.com).

If you send these kinds of links via the API, they will not be shortened; their full length will count against the 140-character limit.

Important note on the length of t.co shortened links: As a rule, the t.co shortener will use whichever protocol was passed to it when shortening your link. That is, if your tweet has a non-secure (http) link, t.co will shorten it with http; if you submit a secure (https) link, t.co will shorten it with https.

Also, the length of a t.co shortened link can change. To be 100 percent certain of the current length of a t.co link, you can run a GET request on help/configuration in the Twitter REST 1.1 API; it will return short_url_length and short_url_length_https, which will be integers giving the current length of a t.co shortened link.
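Assuming you have already fetched the help/configuration response as JSON text, pulling out the two values might look like the sketch below. The function name and the sample payload are invented for illustration; only the two field names come from the text above, and the HTTP request and authentication are left out entirely.

```python
import json

def short_url_lengths(config_json):
    """Return (http_length, https_length) from a help/configuration body."""
    config = json.loads(config_json)
    return (int(config["short_url_length"]),
            int(config["short_url_length_https"]))

# Illustrative payload only; a real response carries many more fields.
sample = '{"short_url_length": 22, "short_url_length_https": 23}'

http_len, https_len = short_url_lengths(sample)
print(http_len, https_len)  # 22 23
```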

Counting Complex Characters

Twitter applies two rules when counting extended characters:

The string is first normalized using Unicode’s Normalization Form C; and

Twitter counts codepoints, not UTF-8 bytes, for extended characters.

This is fortunate for us, for a number of technical reasons I won’t bore you with. (It has to do with very geeky stuff relating to how the encoding between UTF-8, in which Twitter operates, and UTF-16, in which .NET operates, is translated; plus how normalization changes the means by which extended characters are encoded.)

I recognize that because I am assessing this string with UTF-16 encoding rather than UTF-8 encoding, I may get unexpected results in terms of length.

That is, it’s possible UTF-16 does not use the same number of code units to encode a character as UTF-8 uses; and it’s possible that even after normalizing, I may send to Twitter a UTF-16 encoded character that it cannot properly normalize as UTF-8, which may be truncated.

All things being equal, that error would probably result in this function overestimating the length of a string, rather than underestimating it. I am also assuming that for most people, letting Twitter handle the conversion to UTF-8 (and subsequent normalization) will be OK.
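To make the difference concrete, here is a small Python sketch of the two rules (normalize to Form C, then count codepoints), next to a count of UTF-16 code units, which is the figure a UTF-16 string length reports. The function names are assumptions for this example. Python strings are indexed by codepoint, so len() already gives the codepoint count.

```python
import unicodedata

def twitter_style_length(text):
    """Normalize to NFC, then count codepoints."""
    return len(unicodedata.normalize("NFC", text))

def utf16_code_units(text):
    """Count 16-bit code units, as a UTF-16 string length would."""
    return len(text.encode("utf-16-le")) // 2

decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent
emoji = "\U0001F600"        # a single codepoint outside the BMP

print(len(decomposed), twitter_style_length(decomposed))     # 5 4
print(twitter_style_length(emoji), utf16_code_units(emoji))  # 1 2
print(len(emoji.encode("utf-8")))                            # 4
```

The emoji case shows the overestimate described above: one codepoint, but two UTF-16 code units and four UTF-8 bytes.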