I've written a wrapper script that I alias to cat, called safecat, that protects me from accidentally catting binaries and getting a ton of screeching from my PC speaker. Right now it considers a file to be binary, and refuses to cat it, if 30% or more of the characters within the first page are non-text characters (text characters being ASCII 32-127 plus \n, \r, \t and \b).

It just occurred to me that I really don't have any problem with catting a binary so long as I don't get the screeching and my terminal isn't messed up afterwards. Is there a set of characters or character sequences that I can specifically look for and refuse to cat if those are present? That would be more robust.

If you are working in pure ASCII then you only want to let characters 32 to 126, plus 13, 10, and 9, through. Everything below 32 is a control character; 13 and 10 are carriage return and line feed (used for line ends) and 9 is tab. 127 (DEL) is a control character too. Characters above 127 are not defined in ASCII but usually map to something; exactly what they map to depends on your current codepage settings.
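As a rough illustration of that whitelist, here is a minimal Python sketch. The names ALLOWED and filter_ascii are made up for this example, and replacing rejected bytes with "?" (rather than dropping them) is just one possible choice. It treats DEL (127) as a control character, so the printable range ends at 126.

```python
# Bytes allowed through: printable ASCII (32-126) plus tab, line feed
# and carriage return. DEL (127) is deliberately left out.
ALLOWED = frozenset(range(32, 127)) | {9, 10, 13}


def filter_ascii(data: bytes, replacement: bytes = b"?") -> bytes:
    """Return data with every byte outside ALLOWED swapped for `replacement`."""
    return b"".join(bytes([b]) if b in ALLOWED else replacement for b in data)
```

For example, filter_ascii(b"hi\x1b[31m\n") gives b"hi?[31m\n": the ESC byte is neutralised and the rest of the sequence is shown as harmless text.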

One useful tool to remember if you want to see what text is in a binary file (e.g. help text and other documentation stored within an executable, or some binary document format that you don't have the reader for) is the strings command. It scans for sequences of printable characters and outputs them without the rest. The output can be piped through to other tools like grep and less. It is no use if the text is compressed or otherwise not plain text, though. You could update your wrapper to call this in place of cat rather than just refusing to do anything (though I suggest outputting a message first so the user knows that the output has been filtered).
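If you would rather do the same kind of scan inside the wrapper itself, the idea can be approximated in a few lines of Python. This is only a sketch of the technique, not a reimplementation of strings: the function name printable_runs is invented here, and the minimum run length of 4 simply mirrors the default that GNU strings uses.

```python
import re


def printable_runs(data: bytes, min_len: int = 4) -> list:
    """Return every run of at least `min_len` printable ASCII bytes or tabs."""
    # \x20-\x7e is space through tilde; \t is added so tabs don't split runs.
    pattern = re.compile(rb"[\x20-\x7e\t]{%d,}" % min_len)
    return pattern.findall(data)
```

Runs shorter than min_len are dropped, which is what keeps most of the random-looking noise in a binary out of the result.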

Characters most likely to cause problems are ESC(27), SO(14), SI(15), and DC3/X-OFF(19). Some terminals support CSI(155 = 128 + 27) as a short form for introducing ESCape sequences.

Escape (ESC) introduces control sequences. Shift-out (SO) and shift-in (SI) can change character sets and other functionality. X-OFF (DC3) may stop the terminal from sending any data. Bell (7) may be noisy.
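A wrapper could look for exactly these bytes and refuse (or filter) only when one turns up. The following Python sketch shows one way to do that; RISKY and find_risky are names invented for the example, and the list is just the characters named in this answer, not an exhaustive set.

```python
# Bytes singled out above as likely to disturb a terminal.
RISKY = {
    0x07: "BEL",
    0x0E: "SO",
    0x0F: "SI",
    0x13: "DC3/XOFF",
    0x1B: "ESC",
    0x9B: "CSI",
}


def find_risky(data: bytes) -> dict:
    """Map the name of each risky byte present in data to its first offset."""
    found = {}
    for offset, byte in enumerate(data):
        name = RISKY.get(byte)
        if name is not None and name not in found:
            found[name] = offset
    return found
```

An empty result means none of the listed bytes are present. Note that 0x9B also occurs as an ordinary continuation byte in UTF-8 text, so on a UTF-8 terminal you may prefer to drop CSI from the table to avoid false positives.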

You may want to filter out the non-formatting control characters in the range below decimal 32. The most commonly used formatting characters are TAB(9), LF(10), CR(13), and FF(12). BS(8) and VT(11) are less commonly used now.
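Dropping everything below 32 except the common formatting characters is a one-liner with bytes.translate in Python. This is a sketch under the assumption that TAB, LF, FF and CR are the only ones worth keeping; KEEP, DROP and strip_controls are hypothetical names, and bytes of 128 and above are left untouched.

```python
# Formatting characters to keep: TAB, LF, FF, CR.
KEEP = {9, 10, 12, 13}

# Every other control character below 32 is deleted.
DROP = bytes(b for b in range(32) if b not in KEEP)


def strip_controls(data: bytes) -> bytes:
    """Return data with the non-formatting control characters removed."""
    return data.translate(None, DROP)
```

Deleting ESC this way leaves the rest of an escape sequence behind as visible text, e.g. strip_controls(b"\x1b[1mbold") gives b"[1mbold", which is ugly but harmless.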

Control characters are arranged in groups by functionality which could make filtering easier.

Existing tools already handle the problem fairly well. Consider aliasing one of them as cat, but be aware that this can break command chains that expect cat's raw output. You can always get the raw cat back by prefixing the command with a backslash (\cat), which bypasses the alias.

A character is a control character if (before transformation according to
the mapping table) it has one of the 14 codes 00 (NUL), 07 (BEL), 08 (BS),
09 (HT), 0a (LF), 0b (VT), 0c (FF), 0d (CR), 0e (SO), 0f (SI), 18 (CAN), 1a
(SUB), 1b (ESC), 7f (DEL). One can set a 'display control characters' mode
(see below), and allow 07, 09, 0b, 18, 1a, 7f to be displayed as glyphs. On
the other hand, in UTF-8 mode all codes 00-1f are regarded as control
characters, regardless of any 'display control characters' mode.