Arthur's dev blog

Setting HTML/Text to Clipboard revisited

After getting feedback that my original clipboard code doesn't handle all scenarios, especially with Chrome, I went back to the code to get a better understanding of what's going on and to find the correct way to set plain text and an HTML snippet to the clipboard.

Highlights

Setting plain text and html data.

Unicode support for plain text.

Clipboard HTML format

Version.

Parts start/end offsets.

StartFragment/EndFragment comments.

<html> and <body> elements.

Unicode handling.

TL;DR

Setting HTML and Plain text

To set both plain text and rich HTML you need to create a DataObject instance, set both the plain text and the HTML format data on it, then set the data object to the clipboard. The receiving client will read the appropriate data depending on its capabilities.
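A minimal sketch of these steps, assuming a Windows Forms application (System.Windows.Forms). The class and method names are hypothetical, and htmlClipboardFormat is assumed to be already wrapped in the HTML Clipboard Format described in the sections that follow:

```csharp
using System;
using System.Windows.Forms;

static class ClipboardSketch
{
    // Hypothetical helper: bundles the HTML and both plain text formats
    // into one DataObject so each receiving client can pick what it supports.
    public static DataObject CreateDataObject(string htmlClipboardFormat, string plainText)
    {
        var dataObject = new DataObject();
        dataObject.SetData(DataFormats.Html, htmlClipboardFormat);
        dataObject.SetData(DataFormats.Text, plainText);         // regular text format
        dataObject.SetData(DataFormats.UnicodeText, plainText);  // Unicode text format
        return dataObject;
    }

    // Must run on an STA thread, as the Windows Forms clipboard requires.
    public static void CopyToClipboard(string htmlClipboardFormat, string plainText)
    {
        DataObject dataObject = CreateDataObject(htmlClipboardFormat, plainText);
        // 'true' keeps the data on the clipboard after the application exits.
        Clipboard.SetDataObject(dataObject, true);
    }
}
```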

Plain text Unicode support
Note that the plain text was set twice, using the regular and the Unicode format. It is important to set both. Without the regular format, some older clients that do not handle Unicode will not get any text. Without the Unicode format, non-ASCII text won't paste properly, or nothing will paste at all, because some clients expect proper Unicode support and don't read the regular format at all.

HTML format

To set an HTML snippet to the clipboard it must be embedded in the HTML Clipboard Format. This format lets you surround the HTML snippet with context: additional styling elements that apply to the snippet but should not be pasted. The receiving client is responsible for interpreting them properly.

For example, to copy only the "Hello World אבג" text from the HTML snippet in figure 1, you need to create the HTML Clipboard Format string shown in figure 2.

Only what is between <!--StartFragment--> and <!--EndFragment--> should be pasted.

Version

The correct format version is 0.9. Unfortunately, there are blog posts1 and libraries2 that incorrectly use 1.0 for the clipboard format version. It may not seem very important, but as described here, using version 1.0 may identify your application as one that creates an invalid clipboard format and may cause the receiving client to interpret it incorrectly.

Start/End offsets headers

HTML – byte count from the beginning of the clipboard format to the start/end of the HTML context that surrounds the HTML snippet.

Fragment – byte count from the beginning of the clipboard format to the start/end of the HTML snippet that is copied.

Selection – optional, can contain additional information on the selected portion of the copied snippet. I add it with the fragment values just in case some clients incorrectly require it.

Note: I zero-pad the offsets to keep the header size constant, which is convenient when generating the format string.

Note: All the indexes are byte counts and not character counts, see the "Unicode handling" section.
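To make the offset arithmetic concrete, here is a rough sketch of building the format string. It is not my original class: the HtmlClipboardFormat name, the Build method and the 10-digit padding width are assumptions made for illustration, and the snippet is assumed to contain no <html> or <body> elements of its own.

```csharp
using System;
using System.Globalization;
using System.Text;

static class HtmlClipboardFormat
{
    const string StartFragment = "<!--StartFragment-->";
    const string EndFragment = "<!--EndFragment-->";

    // Every offset is padded to 10 digits (an assumed width), so the header
    // has the same length no matter which values are written into it.
    const string HeaderTemplate =
        "Version:0.9\r\n" +
        "StartHTML:{0:D10}\r\n" +
        "EndHTML:{1:D10}\r\n" +
        "StartFragment:{2:D10}\r\n" +
        "EndFragment:{3:D10}\r\n" +
        "StartSelection:{2:D10}\r\n" +
        "EndSelection:{3:D10}\r\n";

    public static string Build(string fragment)
    {
        string prefix = "<html><body>" + StartFragment;
        string suffix = EndFragment + "</body></html>";

        // The header is ASCII-only, so its character count equals its byte count.
        int headerBytes = string.Format(
            CultureInfo.InvariantCulture, HeaderTemplate, 0, 0, 0, 0).Length;

        // All offsets are UTF-8 byte counts from the start of the format string.
        int startHtml = headerBytes;
        int startFragment = startHtml + Encoding.UTF8.GetByteCount(prefix);
        int endFragment = startFragment + Encoding.UTF8.GetByteCount(fragment);
        int endHtml = endFragment + Encoding.UTF8.GetByteCount(suffix);

        string header = string.Format(
            CultureInfo.InvariantCulture, HeaderTemplate,
            startHtml, endHtml, startFragment, endFragment);
        return header + prefix + fragment + suffix;
    }
}
```

Because the header width is fixed, the offsets can be computed in a single pass before the header itself is formatted.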

StartFragment/EndFragment and HTML context

These comments mark the start and end of the actual fragment that needs to be pasted. Everything inside the fragment is pasted as-is, including all the HTML elements present. The HTML surrounding the fragment is the context and can be used to provide the styling for the fragment (useful if the copied snippet was part of a larger HTML document). It is the receiving client's responsibility to parse the HTML in the context and apply its styling properly to the pasted text.

Note: You can also get the fragment substring using the StartFragment/EndFragment offsets in the header.

Note: Not all clients handle context correctly. Chrome, for example, ignores styles set in the context.

<html> and <body> elements
Although not explicitly required by the format, the context must include "<html>" and "<body>" elements to be properly interpreted by some receiving clients. Providing multiple <html> or <body> elements, or placing them inside the fragment, may also cause issues for some clients. Therefore, in my code, I parse the snippet and either add the elements if they are missing or insert the StartFragment/EndFragment comments inside the snippet, so the elements are always in the context and not inside the fragment.

Note: Chrome, for example, will fail to parse the format if the <html> element is missing.
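The idea can be sketched roughly as follows. This is a simplified illustration and not my actual parsing code: the FragmentMarkers class and AddContext method are hypothetical names, and the plain string search for "<body" would be fooled by unusual markup.

```csharp
using System;

static class FragmentMarkers
{
    const string StartFragment = "<!--StartFragment-->";
    const string EndFragment = "<!--EndFragment-->";

    // Returns the snippet with exactly one <html>/<body> pair in the context
    // and the fragment comments placed inside the <body> element.
    public static string AddContext(string html)
    {
        int bodyOpen = html.IndexOf("<body", StringComparison.OrdinalIgnoreCase);
        int bodyClose = html.LastIndexOf("</body", StringComparison.OrdinalIgnoreCase);
        int bodyOpenEnd = bodyOpen < 0 ? -1 : html.IndexOf('>', bodyOpen) + 1;

        // No usable <body> element: wrap the whole snippet as the fragment.
        if (bodyOpen < 0 || bodyOpenEnd <= 0 || bodyClose < bodyOpenEnd)
        {
            return "<html><body>" + StartFragment + html + EndFragment + "</body></html>";
        }

        // <body> exists: put the comments just inside it so the element
        // itself stays in the context and out of the fragment.
        string marked =
            html.Substring(0, bodyOpenEnd) + StartFragment +
            html.Substring(bodyOpenEnd, bodyClose - bodyOpenEnd) + EndFragment +
            html.Substring(bodyClose);

        // Add <html> only when the snippet did not bring its own.
        bool hasHtml = html.IndexOf("<html", StringComparison.OrdinalIgnoreCase) >= 0;
        return hasHtml ? marked : "<html>" + marked + "</html>";
    }
}
```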

Unicode handling

The only character set supported by the clipboard is Unicode using UTF-8 encoding3.
The format header uses only ASCII characters so no special handling is required, but the text of the context (starting at StartHTML) can contain any other characters, including characters that take more than one byte in UTF-8. This has two consequences:

As mentioned earlier, the Start/End offsets in the header are byte counts, so they can be higher than the character count of the text. Providing invalid offsets can cause clients to trim the end of the HTML snippet (a common mistake is to use string.Length, which returns the character count and not the byte count). In my example each character of "אבג" is encoded in 2 bytes, therefore EndFragment-StartFragment=32 and not 29.
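A small illustration of the difference, using only the text part of the example rather than the full fragment from the figures (the class name is hypothetical):

```csharp
using System;
using System.Text;

static class ByteCountDemo
{
    static void Main()
    {
        // "Hello World " followed by the three Hebrew letters alef, bet, gimel,
        // written as escapes to avoid right-to-left display issues in the source.
        string text = "Hello World \u05D0\u05D1\u05D2";

        Console.WriteLine(text.Length);                       // 15 (character count)
        Console.WriteLine(Encoding.UTF8.GetByteCount(text));  // 18 (byte count in UTF-8)
    }
}
```

The 12 ASCII characters take one byte each and the three Hebrew letters take two bytes each, which is where the 3 extra bytes come from.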

When the format string is set on the clipboard object, .NET encodes it with its default string encoding rather than UTF-8, so non-ASCII characters are encoded incorrectly, resulting in '?' characters appearing in the receiving client. To fix it you can either re-encode the string into UTF-8 (hence the "××‘×’" string) or use a UTF-8 encoded stream to set the data to the clipboard (Note: this was fixed in .NET 4.0).
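Both options can be sketched as below, assuming Windows Forms. The class and method names are hypothetical, and htmlClipboardFormat is assumed to be the complete format string, header included. Keep in mind that the re-encoding round trip goes through the system default code page, so its result depends on the machine's locale.

```csharp
using System.IO;
using System.Text;
using System.Windows.Forms;

static class Utf8ClipboardSketch
{
    // Option 1: re-encode so that the bytes .NET eventually writes with the
    // default encoding are the UTF-8 bytes of the original string.
    public static string ReEncode(string htmlClipboardFormat)
    {
        return Encoding.Default.GetString(Encoding.UTF8.GetBytes(htmlClipboardFormat));
    }

    // Option 2: hand over the UTF-8 bytes as a stream, so no string
    // re-encoding is involved at all.
    public static void SetHtmlAsUtf8Stream(DataObject dataObject, string htmlClipboardFormat)
    {
        byte[] utf8 = Encoding.UTF8.GetBytes(htmlClipboardFormat);
        dataObject.SetData(DataFormats.Html, new MemoryStream(utf8));
    }
}
```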

I have just imported this code to my project (https://github.com/michal-czardybon/herring). It works, but for some reason I can’t paste the html to Word/WordPad. It is strange, because I can paste it to Libre Office or to html editor in my email client. Any idea why Word/WordPad does not accept what is in the clipboard? (It only accepts the plain-text version).

I can’t figure out now how I finally got it working… but right now it seems to work well in my software. I have just diffed the code of your class and I see I made no significant modifications. Sorry for disturbing, apparently without a good reason.

Hello! I have used your class, but I have an encoding problem. The french accents (é è à ç ô…) get transformed to Chinese characters on paste, whether it goes through the conversion mechanism (Encoding.Default.GetString…) or not. I had to remove accents using Encoding.ASCII.GetString(Encoding.GetEncoding(1251).GetBytes(htmlContent)). Any idea ?

I just came back here after almost a year of using this code, because our Chinese customers complained that they cannot copy&paste text with Chinese characters. And it appears that the Encoding.Default for Traditional Chinese locales in Windows is broken and
htmlFragment = Encoding.Default.GetString(Encoding.UTF8.GetBytes(htmlFragment));
returns incorrect string.

So I would suggest passing the UTF-8 encoded string directly to the clipboard as a memory stream, so we don’t need the re-encoding.