Truncating HTML in Ruby

For the blogging Web site we wanted snippets of each blog post for the front page and search results. This wasn’t a problem at all for our test blog posts, which we made by pasting lorem ipsum into a text area:

<%= post.body.first(250) %>…

But this did cause a problem for real-world blog posts, where the user would, for some reason beyond my understanding, write the post in Microsoft Word first, then paste in the, er, HTML. I sanitized and removed various bits of HTML, but there came a point where it just didn’t make sense to have a snippet that ended in the middle of a tag:

<img src="…
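To make the failure concrete, here is a small sketch with a made-up post body (the file name and text are hypothetical). It uses plain Ruby slicing, which cuts by character count in the same way as the `first` call above and pays no attention to where a tag begins or ends.

```ruby
# Hypothetical post body, for illustration only.
html = '<p>Holiday photo: <img src="beach.jpg" alt="Beach"></p>'

# Cut by raw character count, ignoring the markup entirely.
snippet = html[0, 28]

puts snippet
# The cut lands inside the <img> tag, so the page receives an
# unterminated attribute and an unclosed <p>.
```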

So, I set out to truncate HTML properly. To this end I extended String with a truncate_html method. I just stuck it in lib/ in my Rails project and require’d it in the Post model. Here, have it:

I have since folded in the changes everyone suggested in the comments (handling HTML entity characters and adding an option to append tail text to the end of the resulting string), and also added an option not to cut words in half when truncating.
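For illustration, a method along these lines might look like the sketch below. It is not the original listing. It assumes a simple regex tokenizer rather than a real HTML parser (so a `>` inside an attribute value would confuse it), and the `tail:` and `break_words:` option names and both constants are invented for this example.

```ruby
# Hypothetical sketch, not the original listing. Core Ruby only.
class String
  # Elements that never take a closing tag.
  VOID_HTML_TAGS = %w[area base br col embed hr img input link meta source track wbr].freeze

  # One token is a comment, a tag, an entity, or a single character.
  HTML_TOKEN = /<!--.*?-->|<[^>]*>|&#?\w+;|./m

  # Keeps at most `limit` visible characters. Tags cost nothing and an
  # entity such as &amp; costs one. Elements still open at the cut are closed.
  def truncate_html(limit, tail: "…", break_words: true)
    out = +""
    open_tags = []
    count = 0
    last_text = nil   # most recent visible token
    safe = nil        # [length of out, open tags] just before the last whitespace
    truncated = false

    scan(HTML_TOKEN) do |token|
      is_tag = token.length > 1 && token.start_with?("<")
      closing = is_tag && token[%r{\A</\s*([A-Za-z][\w-]*)}, 1]

      if count >= limit && !closing
        # Budget spent and more content follows: cut here.
        truncated = true
        if !break_words && safe && !is_tag &&
           token.match?(/\S/) && last_text&.match?(/\S/)
          # The cut would split a word, so rewind to the last whitespace.
          out = out[0, safe[0]]
          open_tags = safe[1]
        end
        break
      end

      if closing
        # Drop the matching element (and anything left open inside it).
        idx = open_tags.rindex(closing.downcase)
        open_tags = open_tags[0, idx] if idx
      elsif is_tag
        name = token[/\A<\s*([A-Za-z][\w-]*)/, 1]
        if name && !VOID_HTML_TAGS.include?(name.downcase) && !token.end_with?("/>")
          # Build a new array so the snapshot held in `safe` is not mutated.
          open_tags += [name.downcase]
        end
      else
        safe = [out.length, open_tags] if token.match?(/\s/)
        last_text = token
        count += 1
      end
      out << token
    end

    return out unless truncated

    out + tail + open_tags.reverse.map { |t| "</#{t}>" }.join
  end
end
```

With this sketch, `'<p>Hello <b>big world</b></p>'.truncate_html(9)` gives `<p>Hello <b>big…</b></p>`, and `'Fish &amp; chips'.truncate_html(6)` gives `Fish &amp;…` with the entity left intact. Passing `break_words: false` backs the cut up to the previous whitespace when it would otherwise land mid-word.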