Monday, February 3, 2014

Python sentence segmentation, kind of quick and mostly legit

Sentence segmentation (splitting a big block of text into sentences) is not trivial. You can't just split on periods, for example, because you'll get tripped up on every Dr. and Ms. and etc. and so on! However, it's a mostly solved problem, and the solutions live in libraries, so here's a quick way to do it in Python.
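To see why, here's a quick sketch of the naive approach falling over (the example sentence is my own):

    # naive approach: split on periods
    text = "Dr. Smith lives on Main St. He works at Acme Inc."
    naive = [s.strip() for s in text.split('.') if s.strip()]
    print(naive)
    # ['Dr', 'Smith lives on Main St', 'He works at Acme Inc']
    # two sentences became three fragments, none of them right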

NLTK is a pretty general-purpose natural language processing toolkit. You could install the whole thing via the instructions on their website, but that will also pull in a lot of other NLP tools. Also, many of these tools can be trained, which makes them more accurate if you have training data, but harder to get started with if you don't. To get a pre-trained model:

- download Punkt from NLTK Data (direct link to Punkt)
- unzip it and copy english.pickle into the same directory as your Python file. This is the trained model, which has been serialized out to a file. (Obviously, this assumes you're segmenting English text; if not, grab one of the other .pickle files.)
- in your Python code, unpickle it like so:

      import pickle

      # open in binary mode ('rb'); a pickle is binary data
      with open('english.pickle', 'rb') as segmenter_file:
          sentence_segmenter = pickle.load(segmenter_file)
- then call:

      sentences = sentence_segmenter.tokenize(text)

  (where text is a string containing all your text)