I have a large corpus of text-based documents (100,000+) from which I want to extract proper names (e.g. a person's name).

Could anyone recommend techniques and/or software that would be useful in accomplishing this goal? I'm not particularly interested in low-level text parsing; I'm more interested in high-level things such as recognizing and/or ranking the names.

One difficulty is that the same name can refer to different things. Lily could be the name of a person, a place, or a cat, or it could just be the flower.

Natural language processing (NLP) techniques, in particular named entity recognition, can use the surrounding grammatical context to tell some of these cases apart.

That said, a simple (and naive) technique that you could try would be to use the capitalisation of the words. If a word in the middle of a sentence starts with a capital letter, it is usually a name of some sort.

You might be able to reasonably assume that any such word refers to the same thing throughout a given document. Two such words in sequence are probably a first name and surname.
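As a rough sketch of the capitalisation heuristic, something like the following could work. The function name `candidate_names`, the regular expressions, and the ASCII-only tokenisation are my own assumptions for illustration; sentence splitting here is very crude (it breaks on abbreviations such as "Mr."), and the pronoun "I" is excluded by hand.

```python
import re
from collections import Counter

# Crude assumption: a sentence ends at ., ! or ? followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
# A token is either an ASCII word or a single punctuation character.
TOKEN = re.compile(r"[A-Za-z][A-Za-z'-]*|[^\sA-Za-z]")


def candidate_names(text):
    """Count runs of capitalised words that are not sentence-initial."""
    counts = Counter()
    for sentence in SENTENCE_END.split(text):
        run = []
        seen_first_word = False
        for token in TOKEN.findall(sentence):
            is_word = token[0].isalpha()
            if is_word and not seen_first_word:
                # The first word is capitalised whether or not it is a name.
                seen_first_word = True
                continue
            is_pronoun_i = token.split("'")[0] == "I"  # "I", "I'm", "I've"
            if is_word and token[0].isupper() and not is_pronoun_i:
                run.append(token)  # extend the current run, e.g. name + surname
            elif run:
                # A lowercase word or punctuation ends the run.
                counts[" ".join(run)] += 1
                run = []
        if run:
            counts[" ".join(run)] += 1
    return counts
```

For the text "Yesterday Lily Evans met Tom in Paris. Then Lily went home, and Tom stayed." this counts "Tom" twice and "Lily Evans", "Lily" and "Paris" once each, while skipping "Yesterday" and "Then". Note that "Lily" and "Lily Evans" are counted separately; merging them per document would be a further step.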

If the capitalisation in the documents cannot be trusted, you could rely on the capitalisation in a good wordlist instead, and use it to build a list of proper names for the applicable languages.
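A minimal sketch of the wordlist idea, assuming you already have a list of proper names from somewhere (the names below are placeholders, and `build_lookup` and `find_known_names` are hypothetical helpers, not part of any library):

```python
import re


def build_lookup(proper_names):
    """Map the lowercased form of each name to its canonical spelling."""
    return {name.lower(): name for name in proper_names}


def find_known_names(text, lookup):
    """Return wordlist names found in the text, ignoring the text's own case."""
    found = []
    for token in re.findall(r"[A-Za-z][A-Za-z'-]*", text):
        canonical = lookup.get(token.lower())
        if canonical is not None:
            found.append(canonical)
    return found


# Placeholder wordlist; a real one would come from a dictionary or name list.
lookup = build_lookup(["Lily", "Tom", "Paris"])
names = find_known_names("LILY MET TOM IN PARIS", lookup)
```

Here `names` ends up as `["Lily", "Tom", "Paris"]`. This approach cannot tell Lily the person from lily the flower, so it probably works best as a filter combined with other evidence rather than on its own.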