I have been working on Joshua, a toolkit for statistical machine translation (SMT). Before extracting a grammar from a parallel corpus, one necessary step is to remove sentences longer than 100 words. Such sentences are common in the Hansard corpus, so one needs to implement a filtering function. Here is what I did.
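Independently of the exact implementation, the idea can be sketched in a few lines of Python. This is only a minimal illustration under some assumptions: the function name `filter_parallel` is hypothetical, words are counted by splitting on whitespace, and a sentence pair is dropped whenever either side exceeds the limit, so the two sides of the corpus stay aligned. It is not part of Joshua itself.

```python
def filter_parallel(src_lines, tgt_lines, max_len=100):
    """Return the (source, target) pairs in which both sentences
    have at most max_len whitespace-separated words.

    Hypothetical helper for illustration; not a Joshua API.
    """
    kept = []
    for src, tgt in zip(src_lines, tgt_lines):
        # Assumption: a "word" is any whitespace-delimited token.
        src_len = len(src.split())
        tgt_len = len(tgt.split())
        # Drop the whole pair if either side is too long, so that
        # line i of the source still matches line i of the target.
        if src_len <= max_len and tgt_len <= max_len:
            kept.append((src, tgt))
    return kept


if __name__ == "__main__":
    src = ["a b c", " ".join(["w"] * 101)]
    tgt = ["x y", "z"]
    print(filter_parallel(src, tgt))  # [('a b c', 'x y')]
```

Filtering the two sides together matters: removing a long line from only one file would shift every later sentence out of alignment with its translation.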