From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; de; rv:1.8.0.1) Gecko/20060202 Fedora/1.5.0.1-2 Firefox/1.5.0.1
Description of problem:
Replacing a character matched by its decimal ASCII value doesn't work
Version-Release number of selected component (if applicable):
sed-4.1.5-1
How reproducible:
Always
Steps to Reproduce:
1. edit a text file and add some weird characters (I chose the one with ASCII 160)
With vim, you can enter those characters with CTRL-V160 while in insert mode
2. save file
3. try to delete these characters with sed:
cat file | sed 's/\d160//' > newfile
4. open newfile (with vim) and watch how character 160 has been replaced with character 196 (move cursor to character and press 'ga')
Actual Results: ASCII character 160 replaced with ASCII character 196
Expected Results: ASCII character 160 should have been removed
Additional info:

Yes, it definitely looks like it's only matching by byte instead of character.
It just so happens that the UTF-8 representation of \xa0 (which is \xc2\xa0)
ends in the byte \xa0 and this is throwing everything off.
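To make the byte layout concrete, here is a minimal sketch that dumps the UTF-8 encoding of U+00A0. The variable name `nbsp_bytes` is only illustrative, and the character is written with octal escapes so the result does not depend on the terminal's encoding.

```shell
# U+00A0 (no-break space) as UTF-8: two bytes, 0xc2 followed by 0xa0.
# printf octal escapes \302 \240 are the same bytes as \xc2 \xa0.
nbsp_bytes=$(printf '\302\240' | od -An -t x1 | tr -d ' \n')
echo "$nbsp_bytes"
```

The trailing byte is a0, which is exactly what a byte-wise `\d160` pattern matches.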

Following your recipe, I created the file with character #160. Vim encodes it
as UTF-8, creating the two-byte sequence 194,160. Now I pipe that file through
`sed 's/\d160//'`, which gives me a file containing just the byte 194, because
160 got erased. I think this is correct behavior.
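The same experiment can be sketched without vim. This assumes GNU sed (the `\dNNN` escape is a GNU extension) and forces the C locale so that matching is done byte by byte; the surrounding `a` and `b` and the variable name `result` are only there for illustration.

```shell
# Input bytes: 'a', 0xc2, 0xa0, 'b', newline.
# LC_ALL=C makes sed treat the input as single bytes (assumption: GNU sed).
result=$(printf 'a\302\240b\n' | LC_ALL=C sed 's/\d160//' | od -An -t x1 | tr -d ' \n')
echo "$result"
```

Only the a0 byte is removed; the lead byte c2 (decimal 194) stays behind between `a` and `b`.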
Let's see how it works on UTF-8 characters:
$ echo "Á" | od -t x1    # A with acute
0000000 c3 81 0a
$ echo "Ó" | od -t x1    # O with acute
0000000 c3 93 0a
$ echo "Á" | sed 's/Á/Ó/' | od -t x1
0000000 c3 93 0a
result: ok, UTF-8-only text handling works as intended.
Let's see how it works on binary files:
$ echo "Á" | sed 's/\x81/\x82/' | od -t x1
0000000 c3 82 0a
$ echo "Á" | sed 's/\xc3/\xc2/' | od -t x1
0000000 c2 81 0a
result: ok, binary handling works as intended.
For me, it just works. I think per-byte handling is a feature of sed: if you work
with UTF-8 files, use full characters or their full byte sequences, and the work
gets done. On binary files, just spell out the byte values explicitly, and
the work gets done, too.
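Applied to the original report, "use the full code" would mean naming both bytes of the character. A minimal sketch, again assuming GNU sed (for the `\xHH` escape) and the C locale; the sample input and the variable name `clean` are illustrative:

```shell
# Delete every UTF-8 encoded U+00A0 by matching its complete byte sequence c2 a0.
clean=$(printf 'a\302\240b\n' | LC_ALL=C sed 's/\xc2\xa0//g' | od -An -t x1 | tr -d ' \n')
echo "$clean"
```

Because both bytes are matched together, no stray c2 byte is left in the output.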
UTF-8 guarantees non-overlap: the byte sequence that encodes one character
never occurs inside the byte sequence that encodes another character. So, for
example, when looking for the character 'd' (\x64), you are guaranteed that
\x64 is not part of the encoding of any other code point. So for me, sed gets
the job done as intended.
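The ASCII case of that guarantee can be checked by hand: every byte of a multi-byte UTF-8 sequence has its high bit set, so none of them can collide with a byte such as \x64. A small sketch, with three sample characters chosen arbitrarily and `ascii_hits` as an illustrative counter:

```shell
# Bytes of three multi-byte characters:
#   A with acute (c3 81), no-break space (c2 a0), euro sign (e2 82 ac)
bytes=$(printf '\303\201\302\240\342\202\254' | od -An -t u1)

# Count how many of those bytes fall in the ASCII range (below 128).
ascii_hits=0
for b in $bytes; do
    if [ "$b" -lt 128 ]; then
        ascii_hits=$((ascii_hits + 1))
    fi
done
echo "ASCII-range bytes found: $ascii_hits"
```

This only samples three characters, so it illustrates the rule rather than proving it.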
I'm closing it as notabug.
