I need to replace all non-ASCII characters (anything outside \x00-\x7F) with a space. I'm surprised that this is not dead easy in Python, unless I'm missing something. The following function simply removes all non-ASCII characters:

def remove_non_ascii_1(text):
    return ''.join(i for i in text if ord(i)<128)

And this one replaces each non-ASCII character with as many spaces as there are bytes in the character's encoded form (i.e. the – character is replaced with 3 spaces):
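
For reference, a minimal sketch of a regex-based replacement, assuming Python 3 and a str input. The name replace_non_ascii is hypothetical and this is not necessarily the asker's code. On a str it yields one space per code point, not one per byte:

```python
import re

def replace_non_ascii(text):
    """Replace every non-ASCII code point in a str with a single space."""
    return re.sub(r'[^\x00-\x7F]', ' ', text)

sample = 'a\u2013b'                      # 'a', EN DASH, 'b'
print(repr(replace_non_ascii(sample)))   # one space where the dash was
print(len('\u2013'.encode('utf-8')))     # the dash takes 3 bytes in UTF-8
```

The 3-spaces behaviour described above would only appear if the same pattern were applied to the encoded bytes rather than to decoded text.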

@dstromberg: slower; str.join() needs a list (it'll pass over the values twice), and a generator expression will first be converted to one. Giving it a list comprehension is simply faster. See this post.
– Martijn Pieters♦ Nov 19 '13 at 18:42
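
To make the comparison concrete, here is a small sketch of the two forms being discussed. The function names are hypothetical; both return the same result and differ only in what is handed to str.join():

```python
def remove_non_ascii_gen(text):
    # Generator expression: str.join() has to materialize it before joining.
    return ''.join(c for c in text if ord(c) < 128)

def remove_non_ascii_list(text):
    # List comprehension: str.join() receives a ready-made list.
    return ''.join([c for c in text if ord(c) < 128])

sample = 'caf\u00e9 \u2013 ok'
print(remove_non_ascii_gen(sample) == remove_non_ascii_list(sample))
```

Any speed difference is worth measuring on your own interpreter and data, for example with timeit.timeit from the standard library.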

The first piece of code will insert multiple blanks per character if you feed it a UTF-8 byte string.
– Mark Ransom Nov 19 '13 at 19:13

@MarkRansom: I was assuming this to be Python 3.
– Martijn Pieters♦ Nov 19 '13 at 19:15

"– character is replaced with 3 spaces" in the question implies that the input is a bytestring (not Unicode) and therefore Python 2 is used (otherwise ''.join would fail). If OP wants a single space per Unicode codepoint then the input should be decoded into Unicode first.
– jfs Feb 19 '16 at 17:01
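
The difference between working on bytes and working on decoded text can be sketched as follows, assuming Python 3 and UTF-8 input. Both function names are hypothetical:

```python
import re

def spaces_per_byte(data):
    """Operate on raw bytes: each non-ASCII byte becomes one space."""
    return re.sub(rb'[^\x00-\x7F]', b' ', data)

def spaces_per_code_point(data, encoding='utf-8'):
    """Decode first, then replace each non-ASCII code point with one space."""
    text = data.decode(encoding)
    return ''.join([c if ord(c) < 128 else ' ' for c in text])

raw = 'a\u2013b'.encode('utf-8')    # the en dash is 3 bytes in UTF-8
print(spaces_per_byte(raw))         # three spaces between a and b
print(spaces_per_code_point(raw))   # one space between a and b
```

If the encoding of the input is not actually UTF-8, the decode step would need the correct encoding name instead.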

Interesting suggestion, but it assumes the user wants non-ASCII characters converted according to unidecode's rules. It does, however, raise a follow-up question for the asker: why insist on spaces rather than replacing with another character?
– jxramos Feb 18 '16 at 21:15

Thank you, this is a good answer. It doesn't work for the purpose of this question because most of the data that I'm dealing with, such as דותן, does not have an ASCII-like representation. However, in the general sense this is great, thank you!
– dotancohen Feb 20 '16 at 20:16

Yes, I know this does not work for this question, but I landed here trying to solve that problem, so I thought I’d just share my solution to my own problem, which I think is very common for people like @dotancohen who deal with non-ASCII characters all the time.
– Alvaro Fuentes Feb 24 '16 at 19:13

There have been some security vulnerabilities with stuff like this in the past. Just be careful how you implement this!
– deweydb Nov 7 '16 at 18:44

Thank you, this is an important observation. If you do find a logical way to handle combining marks, I would happily add a bounty to the question. I suppose that simply removing the combining mark while leaving the base character alone would be best.
– dotancohen Nov 20 '13 at 10:50

A partial solution is to use ud.normalize('NFC',s) to combine marks, but not all combining combinations are represented by single codepoints. You'd need a smarter solution looking at the ud.category() of the character.
– Mark Tolonen Nov 20 '13 at 10:55
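
One possible way to act on the category idea, as a rough sketch only: drop any character whose Unicode general category starts with 'M' (a mark) and replace every other non-ASCII character with a space. The function name is hypothetical, ud is assumed to be an alias for the standard unicodedata module, and this is not a complete treatment of combining sequences:

```python
import unicodedata as ud

def replace_non_ascii_drop_marks(text):
    """Keep ASCII, drop combining marks, replace other non-ASCII with a space."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)                    # ASCII passes through unchanged
        elif ud.category(ch).startswith('M'):
            continue                          # combining mark: remove it
        else:
            out.append(' ')                   # any other non-ASCII character
    return ''.join(out)

print(repr(replace_non_ascii_drop_marks('e\u0301')))    # 'e' + COMBINING ACUTE ACCENT
print(repr(replace_non_ascii_drop_marks('x\u2013y')))   # en dash between x and y
```

Note that a precomposed character such as U+00E9 is a single non-ASCII code point, so this sketch turns it into a space; whether to normalize first (NFC or NFD) depends on which of those outcomes is wanted.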