Identification of molecules observed in data derived from mass spectrometry experiments remains very difficult, which hinders the wider application of high-throughput untargeted metabolomics. Standard analysis pipelines compare observed spectra with databases of known molecules, but these databases have very low coverage, so the majority of the measured molecules remain unidentified. Here, I will present an alternative approach for exploring datasets from untargeted metabolomics experiments that uses a technique from text mining (Latent Dirichlet Allocation; LDA) to decompose the observed spectra across a set of shared components. We show that these components often represent molecular substructures. In other words, we are able to break unknown molecules down into building blocks, many of which can be identified. I will show results on a number of standards and real datasets, and describe some future directions of this work.