I hope this post saves someone from a similar afternoon to the one I just spent, puzzling over what appeared to be an impossible test result.

TL;DR: string.encode!(string.encoding, …) does nothing, even if string isn’t valid in string.encoding. To really force an encoding, hop through BINARY. Regex matches on unsanitized binary data are a common source of explosions, but RSpec matchers patch =~, so an RSpec test can’t catch these encoding failures.
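Concretely, the hop through BINARY looks something like this (a sketch; the string literal is just an illustration, and the behavior described in the comments is for the 1.9.x series):

```ruby
# On ruby-1.9.2, encode!("UTF-8", ...) on a string already tagged UTF-8
# is a silent no-op, even when the bytes are invalid. Re-tagging the
# string as BINARY first forces a real conversion.
s = "caf\xE9".force_encoding("UTF-8")  # \xE9 alone is not valid UTF-8
s.valid_encoding?                      # => false

bin   = s.dup.force_encoding("BINARY")
fixed = bin.encode("UTF-8", invalid: :replace, undef: :replace)
fixed.valid_encoding?                  # => true
```

The bad byte is replaced with U+FFFD, and the result is genuinely valid UTF-8 rather than merely tagged as such.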

Parts of Tddium, like many other Rails applications, use ActiveRecord object serialization for variable-field data that doesn’t need to be represented in a relational schema. For example, a model like this:
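The original model isn’t reproduced here; a hypothetical equivalent (class and column names are mine, not from our code) would look like the commented declaration below, and the runnable part shows the YAML round-trip that serialize performs under the hood:

```ruby
require "yaml"

# Hypothetical model -- not runnable without Rails; names illustrative:
#
#   class BuildEvent < ActiveRecord::Base
#     serialize :raw_messages   # stored as YAML in a text column
#   end
#
# Under the hood, serialize round-trips the attribute through YAML:
messages = [{ "level" => "info", "text" => "build started" }]
yaml     = YAML.dump(messages)
YAML.load(yaml) == messages   # => true
```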

Internally, ActiveRecord serializes to YAML, which has a lot of nice properties. For a variety of reasons, in some of our apps, we’re still using Syck to process YAML. The source of the message data that gets written to raw_messages doesn’t always reliably produce UTF-8, so, being paranoid, we sanitize in both the producer and the consumer. In the producer, we use Ruby’s encode! method with options to replace invalid and undefined conversions, and in the consumer we drop messages that contain invalid characters. We’d always had some messages dropped, but we recently needed a dropped message for debugging, and we found that the producer’s sanitization wasn’t as zealous as we’d thought.
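A sketch of the two sanitization layers (the helper names and sample string are mine, not from our code). On ruby-1.9.2, the producer’s encode call silently returns the string unchanged whenever the target encoding matches the string’s current encoding — which is exactly the failure described below; later Ruby versions changed this behavior:

```ruby
# Producer side: try to scrub invalid/undefined bytes before writing.
# On ruby-1.9.2 this is a silent no-op when msg is already tagged UTF-8,
# even if its bytes are invalid.
def sanitize(msg)
  msg.encode("UTF-8", invalid: :replace, undef: :replace)
end

# Consumer side: drop any message that still contains invalid bytes.
def usable(messages)
  messages.select(&:valid_encoding?)
end

raw  = "log line \xE9".force_encoding("UTF-8")
kept = usable([sanitize(raw)])
```

On 1.9.2 the invalid message sails through sanitize untouched and is dropped by the consumer — the steady trickle of dropped messages we had been seeing.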

Some Background on Transcoding in Ruby

(For reference, all of this was done with ruby-1.9.2-p290, and rspec 2.6.0.)

Here’s an example of a string that claims to be UTF-8, but isn’t actually. Ruby properly detects that it’s invalid.
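(A reconstructed illustration — the byte sequence is my own example, not the original’s.)

```ruby
# A truncated multibyte sequence tagged as UTF-8: \xE3 opens a
# three-byte character, but the string ends after one continuation byte.
str = "\xE3\x81".force_encoding("UTF-8")

str.encoding         # => #<Encoding:UTF-8>
str.valid_encoding?  # => false

# Matching a regex against it explodes:
#   str =~ /a/   # raises ArgumentError: invalid byte sequence in UTF-8
```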

I don’t know why this happens. I suspect that RSpec redefines =~ to wrap it with a nicer error message, and that somehow keeps it from exploding the way Ruby’s native =~ operator does. Any ideas are welcome.