Each of the Arrays created by the previous algorithms can contain anywhere from 0 to dozens of features, and yet almost all of my vectors end up one-dimensional. I want to do some clustering with this data, but the one-dimensionality is a big problem. Why is this happening and how can I fix it?

I figured out that the error happens precisely when I clean up the data. If I skip the cleanup, HashingTF behaves normally. What am I doing wrong in the cleanup, and how can I perform a similar cleanup without breaking the format?

The pattern [^a-zA-Z,_:] also matches whitespace, so the cleanup collapses each document into a single continuous string. When tokenized, that string yields a single token, and HashingTF produces a Vector with only one entry. Either exclude whitespace from the character class or use RegexTokenizer as a replacement, since it can split and filter in one step.
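Your exact cleanup code isn't shown, but assuming it is a substitution with that pattern, here is a minimal plain-Python sketch of the effect (the sample sentence is made up for illustration):

```python
import re

text = "spark makes, clustering: easy"

# Original cleanup: [^a-zA-Z,_:] also removes whitespace,
# collapsing the sentence into one continuous string.
bad = re.sub(r"[^a-zA-Z,_:]", "", text)
# bad == "sparkmakes,clustering:easy" -> tokenizes to a single token,
# so HashingTF sees one term and yields a one-dimensional vector.

# Fix: add \s to the negated class so whitespace survives the cleanup.
good = re.sub(r"[^a-zA-Z,_:\s]", "", text)
# good == "spark makes, clustering: easy" -> several tokens as expected.

print(bad.split())   # 1 token
print(good.split())  # 4 tokens
```

The same character-class fix applies verbatim in Scala's replaceAll; alternatively, RegexTokenizer with a pattern like \W lets you drop the separate cleanup pass entirely.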