Hello,
I have just released a project on SourceForge that contains four different parsers for Entrez Gene ASN.1 files, based on regular expressions, Parse::RecDescent, Parse::Yapp, and Perl-byacc respectively. They differ in performance. The regex-based parser is the fastest, processing over 13,000 records a minute on average (it finishes the 900+ MB human annotation file in 11 minutes on one Intel Xeon 2.4 GHz CPU). The other parsers are at least several times slower, but I included them since they may be of interest to people learning those tools or choosing among them for a practical project. All parsers are short OO modules (under 100 lines, not counting POD and YACC-generated code), so they are easy to use and understand.
Right now my parsers do not assemble data into Bioperl objects, because for my project I only needed to put the data into a proprietary XML format. That format is not released; it is nothing special, there are just IP issues, and without them I could have released the parser code in February. The parsers behave like XML parsers: they parse Entrez Gene records and assemble the content into data structures only. But I hope they can serve as a base on which Bioperl objects can be built (the data structure is easy to use). Please feel free to use the code for any Bioperl or other project, as I released it under the GPL (thanks to the consent of my company and a collaborating company).
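To give a rough idea of what "parse records and assemble content into data structures" can look like, here is a minimal sketch in Python. It is not the released Perl code and does not reflect its API or its output layout. The record fragment is a simplified, made-up sample in the style of Entrez Gene ASN.1 text, and the names tokenize, parse_item, parse_block, and parse_record are hypothetical.

```python
import re

# Tokens: quoted strings ("" is an escaped quote), braces, commas, bare words.
TOKEN = re.compile(r'"(?:[^"]|"")*"|[{},]|[^\s{},"]+')

def tokenize(text):
    return TOKEN.findall(text)

def parse_item(tokens, pos):
    """Parse one item at tokens[pos]; return (name_or_None, value, next_pos)."""
    tok = tokens[pos]
    if tok == "{":
        value, pos = parse_block(tokens, pos + 1)
        return None, value, pos
    if tok.startswith('"'):
        return None, tok[1:-1].replace('""', '"'), pos + 1
    if tokens[pos + 1] in (",", "}"):
        # A bare word that ends the item is a scalar (integer or keyword).
        value = int(tok) if re.fullmatch(r"-?\d+", tok) else tok
        return None, value, pos + 1
    # Otherwise the bare word is a field name followed by its value.
    inner_name, value, pos = parse_item(tokens, pos + 1)
    if inner_name is not None:
        value = {inner_name: value}  # e.g. "tag id 5" -> {"id": 5}
    return tok, value, pos

def parse_block(tokens, pos):
    """Parse items up to the matching '}'; pos points just past the '{'."""
    items = []
    while tokens[pos] != "}":
        name, value, pos = parse_item(tokens, pos)
        items.append((name, value))
        if tokens[pos] == ",":
            pos += 1
    if items and all(name is not None for name, _ in items):
        # Named fields become a dict (assumption: a repeated name overwrites).
        return {name: value for name, value in items}, pos + 1
    # Unnamed or mixed items become a list.
    values = [v if n is None else {n: v} for n, v in items]
    return values, pos + 1

def parse_record(text):
    """Parse 'TypeName ::= { ... }' and return (type_name, data)."""
    tokens = tokenize(text)
    if len(tokens) < 3 or tokens[1] != "::=":
        raise ValueError("expected 'TypeName ::= ...'")
    _, value, _ = parse_item(tokens, 2)
    return tokens[0], value

# Simplified, illustrative record fragment (not real NCBI data).
SAMPLE = '''
Entrezgene ::= {
  track-info {
    geneid 1,
    status live
  },
  type protein-coding,
  gene {
    locus "A1BG",
    desc "alpha-1-B glycoprotein",
    syn { "A1B", "ABG" }
  }
}
'''

name, record = parse_record(SAMPLE)
print(name, record["track-info"]["geneid"], record["gene"]["locus"])
```

In this sketch a block of named fields becomes a dictionary and a block of unnamed values becomes a list, so a caller can walk the result much like a parsed XML tree. A real parser would also need to handle repeated field names and split a large file into individual records before parsing.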
Please also feel free to contact me if you have any suggestions or bug reports.
The URL for the SourceForge project is http://sourceforge.net/projects/egparser/
Thanks,
Mingyi
Dr. Mingyi Liu
Computational Biologist
GPC Biotech Inc.
610 Lincoln St.
Waltham, MA 02451
USA