> I guess I'm looking for feedback on what I can do to make this actually
> work; one of my main problem now seems to be robust HTML scraping issues.
>
> At the moment, I'm just using the html and xml collections provided as
> part of PLT Scheme to extract content from the HTML-ified reference
> documentation, but the parsers seem particularly unhappy about the
> non-well-formedness in there. Should I be using a different set of
> parsers?
You have several choices for your HTML scraping.
1. WebIt's 'xml-match'
http://celtic.benderweb.net/webit/
Jim Bender's WebIt framework gives you some nifty pattern matching tools
for working with XML expressions.
E.g.,
    (xml-match xexpr
      [(strong ,text)
       (printf "Found this text: ~a~n" text)])
would be a snippet that matches values of xexpr like
    (strong "Hi there!")
but not
    (em (strong "Hi there!"))
So, this approach is "fragile": a pattern only matches the node you hand
it, and you have to handle your own recursion through the tree. Since
the docs may change their HTML format, this is probably a relatively
tedious approach, as it relies on the document structure staying the same.
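
To make "handle your own recursion" concrete, here is a rough sketch in plain Racket. It does not use WebIt at all; `collect-strong` is a hypothetical helper I made up for illustration, and it assumes the document is an ordinary x-expression (nested lists with a tag symbol at the head).

```racket
#lang racket
;; Hypothetical helper (not part of WebIt): collect the string children of
;; every (strong ...) element, at any depth of an x-expression.
(define (collect-strong xexpr)
  (cond
    [(not (pair? xexpr)) '()]   ; strings, symbols, '(): nothing to collect
    [else
     (append
      ;; text directly inside this node, if it is a strong element
      (if (eq? (car xexpr) 'strong)
          (filter string? (cdr xexpr))
          '())
      ;; then recurse into every child
      (append-map collect-strong (cdr xexpr)))]))

(collect-strong '(p "See " (em (strong "Hi there!")) " and " (strong "this")))
;; => '("Hi there!" "this")
```

Every new kind of node you care about means another hand-written walk like this one, which is where the tedium comes from.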
2. XML query languages
A more robust approach might be to use an XML query language. This
approach lets you write (pardon the generalization) "SQL-like" queries
over an (S)XML document. This way, you can say something like "give me
all the nodes that are wrapped in the <strong> tag."
Jim Bender's library (above) contains an XML query language of this nature:
http://celtic.benderweb.net/webit/docs/xquery-pre/
and there is the SXPath library provided by Oleg's SSAX/SXML implementation:
http://okmij.org/ftp/Scheme/xml.html
(You may also get lost in, or enjoy, the other things reachable from the
root of that site at http://okmij.org/ftp/.)
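
As a small sketch of the "give me all the <strong> nodes" query with SXPath: this assumes a current Racket with the `sxml` package installed (`raco pkg install sxml`), which is not necessarily how you would load it in your PLT Scheme version. The document `doc` is invented for the example.

```racket
#lang racket
;; Assumption: the `sxml` package is installed; it provides `sxpath`.
(require sxml)

;; A small hand-written SXML document, made up for this example.
(define doc
  '(*TOP*
    (html
     (body
      (p "See " (em (strong "Hi there!")))
      (div (strong "Another one"))))))

;; '(// strong) plays the role of the XPath "//strong":
;; every strong element, however deeply it is nested.
(define find-strong (sxpath '(// strong)))

(for ([node (find-strong doc)])
  (printf "Found: ~s~n" node))
```

Note that the query says nothing about where the `strong` elements sit, so it should keep working if the surrounding markup is rearranged.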
3. HTMLPrag
All of that said, though, perhaps you should look at Neil's HTMLPrag, a
permissive HTML parser, and his web-scraper-helper-thinger:
http://www.neilvandyke.org/htmlprag/
http://planet.plt-scheme.org/#webscraperhelper.plt1.1
Again, that might take some of the difficulty out of crafting the SXPath
queries to pull a page apart.
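
For the permissive-parsing half of the problem, here is a hedged sketch. It assumes a current Racket with the `html-parsing` package (Neil's later fork of HTMLPrag, which provides `html->xexp`) and the `sxml` package installed; with the original HTMLPrag the corresponding entry point is `html->shtml`. The `sloppy` string is invented for the example.

```racket
#lang racket
;; Assumptions: the `html-parsing` and `sxml` packages are installed.
(require html-parsing sxml)

;; Deliberately non-well-formed HTML: the <li> elements are never closed.
(define sloppy "<ul><li>one <strong>Hi there!</strong><li>two</ul>")

;; Parse permissively into an SXML-style tree rooted at *TOP*.
(define tree (html->xexp sloppy))

;; Then query the result with SXPath instead of walking it by hand.
(define strong-nodes ((sxpath '(// strong)) tree))

(for ([node strong-nodes])
  (printf "Found: ~s~n" node))
```

The point of the split is that the parser absorbs the messiness of the HTML, and the query only has to describe what you want out of the tree.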
All of these approaches are going to stretch your knowledge of Scheme in
one way or another, and there certainly may be other ways to go about
(permissively?) parsing HTML/XML in PLT Scheme. These were the three
that came to mind, and my apologies if I've left anything out or
mis-attributed any work.
Good luck!
Matt