I have a working snapshot of a module that actually does this sort of thing right, called Encode::Guess::Educated. It has no noncore dependencies.
It is designed to detect the encoding of English-language biomedical research papers. It can reliably detect not merely ASCII and UTF-{8,16,32}, but also the various mutually conflicting 8-bit encodings.

The reason it can do this is that it works off a training model. I looked at three different corpora to do this: one containing 3½M non-ASCII codepoints, one containing 14M of them, and one containing 29M of them.
It makes an educated guess based on conformance to a particular model, and it does very well.

Right now it has only a CLI API and an OO API, no Exporter-based one. Here’s the easiest way to use the CLI API, via
a simple program called gank:
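
Roughly like this, where paper.txt is a stand-in filename and the invocation and output are only my illustration of the idea, not a transcript of the real program:

    % gank paper.txt
    paper.txt: cp1252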

The underlying class’s default training model derives from the complete
PubMed Open Access corpus, and it therefore attains an extremely high measured accuracy of 99.79% when used on English-language biomedical texts.
It also does well on other texts using any Latin-based alphabet. I have
comparative statistics using two alternate training models, but
the PMCOA model is fine for most purposes.
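
For the OO API, usage looks roughly like the following sketch; the method names here (new, guess_file) are placeholders rather than the settled interface, since the API still needs work:

    use Encode::Guess::Educated;

    # Placeholder method names; the real API is still being finalized.
    my $guesser  = Encode::Guess::Educated->new();     # default PMCOA training model
    my $encoding = $guesser->guess_file("paper.txt");  # e.g. "cp1252"
    print "best guess: $encoding\n";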

You may also give gank a -s option to get a short ‘score-card’ of the
various encodings it considered:
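
Something along these lines (the numbers and layout here are made up purely to illustrate the idea):

    % gank -s paper.txt
    100.00  latin1 cp1252
     95.40  macroman
      0.00  ascii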

The rest of each line shows which encodings have that score, listed in the
order of preference used for breaking ties at the same score.
I have it arranged so that it reports the smallest subset
that works; i.e., ascii < latin1 < cp1252, etc.

There’s also a -l option that gives you a long report illustrating
what each possible choice would look like if the text were in that encoding,
with paired lines of literal UTF-8 and \N{...} named characters.
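
For example, a line containing the word naïve would show up as a pair something like this (my own illustration of the pairing, not captured output):

    naïve
    na\N{LATIN SMALL LETTER I WITH DIAERESIS}ve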

I need to do more work on its API (this is just a proof of concept, although it does come with a halfway decent test suite) and of course document it,
but I’m hunkered down right now correcting page-proofs on Camel4,
so I probably won’t get to sprucing up the module for another 7–10 days.