Module documentation for 0.5.1

Scalpel

Scalpel is a web scraping library inspired by libraries like
Parsec
and Perl’s Web::Scraper.
Scalpel builds on top of TagSoup
to provide a declarative and monadic interface.

There are two general mechanisms provided by this library that are used to build
web scrapers: Selectors and Scrapers.

Selectors

Selectors describe a location within an HTML DOM tree. The simplest selector
that can be written is a string value. For example, the selector
"div" matches every single div node in a DOM. Selectors can be combined
using tag combinators. The // operator defines nested relationships within a
DOM tree. For example, the selector "div" // "a" matches all anchor tags
nested arbitrarily deep within a div tag.

In addition to describing the nested relationships between tags, selectors can
also include predicates on the attributes of a tag. The @: operator creates a
selector that matches a tag based on its name and various conditions on the
tag’s attributes. An attribute predicate is just a function that takes an
attribute and returns a boolean indicating whether the attribute matches a
criterion. There are several attribute operators that can be used to generate
common predicates. The @= operator creates a predicate that matches the name and
value of an attribute exactly. For example, the selector "div" @: ["id" @= "article"] matches div tags whose id attribute is equal to "article".
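Putting these combinators together, a short sketch (assuming the OverloadedStrings extension; the markup is invented for illustration):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Text.HTML.Scalpel

-- Every anchor tag nested arbitrarily deep within a div whose id is "article".
articleLinks :: Selector
articleLinks = "div" @: ["id" @= "article"] // "a"

-- Running a scraper over an in-memory page with scrapeStringLike.
links :: Maybe [String]
links = scrapeStringLike page (attrs "href" articleLinks)   -- Just ["/a"]
  where
    page :: String
    page = "<div id=\"article\"><p><a href=\"/a\">one</a></p></div><a href=\"/b\">two</a>"
```

Only the anchor inside the article div is matched; the trailing anchor outside the div is ignored.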

Scrapers

Scrapers are values that are parameterized over a selector and produce a value
from an HTML DOM tree. The Scraper type takes two type parameters. The first
is the string-like type that is used to store the text values within a DOM tree.
Any string-like type supported by Text.StringLike is valid. The second is
the type of value that the scraper produces.

There are several scraper primitives that take selectors and extract content
from the DOM. Each primitive defined by this library comes in two variants:
singular and plural. The singular variants extract the first instance matching
the given selector, while the plural variants match every instance.
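For instance, text extracts the content of the first matching tag while texts extracts the content of every matching tag. A sketch over invented markup:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Text.HTML.Scalpel

page :: String
page = "<div>first</div><div>second</div>"

-- Singular variant: only the first matching div is extracted.
firstDiv :: Maybe String
firstDiv = scrapeStringLike page (text "div")    -- Just "first"

-- Plural variant: every matching div is extracted.
allDivs :: Maybe [String]
allDivs = scrapeStringLike page (texts "div")    -- Just ["first","second"]
```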

Example

Complete examples can be found in the
examples folder in the
scalpel git repository.

The following is an example that demonstrates most of the features provided by
this library. Suppose you have the following hypothetical HTML located at
"http://example.com/article.html" and you would like to extract a list of all
of the comments.
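The full example lives in the repository; a minimal sketch in the same spirit, assuming the hypothetical markup stores each comment in a div with class "comment", might look like:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Text.HTML.Scalpel

-- Hypothetical markup: every comment is a <div class="comment">.
comments :: Scraper String [String]
comments = texts $ "div" @: [hasClass "comment"]

-- Fetch the page and run the scraper over the resulting DOM.
allComments :: IO (Maybe [String])
allComments = scrapeURL "http://example.com/article.html" comments
```

The same scraper can be exercised without networking via scrapeStringLike, which is handy for testing.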

Tips & Tricks

The primitives provided by scalpel are intentionally minimalistic with the
assumption being that users will be able to build up complex functionality by
combining them with functions that work on existing type classes (Monad,
Applicative, Alternative, etc.).

This section gives examples of common tricks for building up more complex
behavior from the simple primitives provided by this library.

OverloadedStrings

Selector, TagName and AttributeName are all IsString instances, and
thus it is convenient to use scalpel with OverloadedStrings enabled. If not
using OverloadedStrings, all tag names must be wrapped with tagSelector.
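A sketch of the same scraper written both ways:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Text.HTML.Scalpel

-- With OverloadedStrings, string literals become selectors directly.
divTextsOverloaded :: Scraper String [String]
divTextsOverloaded = texts "div"

-- Without the extension, wrap tag names with tagSelector instead.
divTextsExplicit :: Scraper String [String]
divTextsExplicit = texts (tagSelector "div")
```

Both scrapers behave identically; the extension only changes how the Selector value is written down.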

Matching Wildcards

Scalpel has three different wildcard values, each corresponding to a distinct use case.

anySelector is used to match all tags:

textOfAllTags = texts anySelector

AnyTag is used when matching all tags with some attribute constraint. For
example, to match all tags with the attribute class equal to "button":

textOfTagsWithClassButton = texts $ AnyTag @: [hasClass "button"]

AnyAttribute is used when matching tags with some arbitrary attribute equal
to a particular value. For example, to match all tags with some attribute
equal to "button":
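A sketch of such a scraper, combining AnyTag with an AnyAttribute predicate (the markup in the test is invented):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Text.HTML.Scalpel

-- Any tag carrying any attribute whose value is exactly "button".
textOfTagsWithSomeAttributeButton :: Scraper String [String]
textOfTagsWithSomeAttributeButton = texts $ AnyTag @: [AnyAttribute @= "button"]
```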

Complex Predicates

It is possible to run into scenarios where the name and attributes of a tag are
not sufficient to isolate interesting tags and properties of child tags need to
be considered.

In these cases the guard function of the Alternative type class can be
combined with chroot and anySelector to implement predicates of arbitrary
complexity.

Building off the above example, consider a use case where we would like to find
the HTML contents of a comment that mentions the word “cat”.

The strategy will be the following:

Isolate the comment div using chroot.

Then within the context of that div the textual contents can be retrieved
with text anySelector. This works because the first tag within the current context
is the div tag selected by chroot, and the anySelector selector will match the
first tag within the current context.

Then the predicate that "cat" appear in the text of the comment will be
enforced using guard. If the predicate fails, scalpel will backtrack and
continue the search for divs until one is found that matches the predicate.
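The steps above can be sketched as follows, again assuming the hypothetical "comment" class from the earlier example:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Monad (guard)
import Data.List (isInfixOf)
import Text.HTML.Scalpel

catComment :: Scraper String String
catComment =
    -- Isolate a comment div; on failure scalpel backtracks to the next candidate.
    chroot ("div" @: [hasClass "comment"]) $ do
        -- anySelector matches the root of the current context, i.e. the div itself.
        contents <- text anySelector
        -- Reject comments that do not mention "cat".
        guard ("cat" `isInfixOf` contents)
        html anySelector
```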

Generalized Repetition

The pluralized versions of the primitive scrapers (texts, attrs, htmls)
allow the user to extract content from all of the tags matching a given
selector. For more complex scraping tasks it will at times be desirable to be
able to extract multiple values from the same tag.

Like the previous example, the trick here is to use a combination of the
chroots function and the anySelector selector.

Consider an extension to the original example where image comments may contain
some alt text and the desire is to return a tuple of the alt text and the URLs
of the images.

The strategy will be the following:

Isolate each img tag using chroots.

Then within the context of each img tag, use the anySelector selector to extract
the alt and src attributes from the current tag.
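These steps can be sketched as (the img markup in the test is invented for illustration):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Text.HTML.Scalpel

altTextAndImages :: Scraper String [(String, String)]
altTextAndImages =
    -- Narrow the context to each individual img tag.
    chroots "img" $ do
        -- anySelector matches the img tag that roots the current context,
        -- so both attributes come from the same tag.
        altText <- attr "alt" anySelector
        srcUrl  <- attr "src" anySelector
        return (altText, srcUrl)
```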

scalpel-core

The scalpel package relies on curl to provide networking support. For small
projects and one-off scraping tasks this is likely sufficient. However, when
using scalpel in existing projects or on platforms without curl, this dependency
can be a hindrance.

For these scenarios users can instead depend on
scalpel-core which does not
provide networking support and does not depend on curl.

Troubleshooting

My Scraping Target Doesn’t Return The Markup I Expected

Some websites return different markup depending on the user agent sent along
with the request. In some cases, this even means returning no markup at all in
an effort to prevent scraping.

To work around this, you can add your own user agent string with a curl option.
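One way to do this is to pass Network.Curl's CurlUserAgent option to scrapeURLWithOpts; the user agent string below is just a placeholder:

```haskell
import Network.Curl (CurlOption (CurlUserAgent))
import Text.HTML.Scalpel

-- A browser-like user agent string; substitute whatever value your target expects.
opts :: [CurlOption]
opts = [CurlUserAgent "Mozilla/5.0 (compatible; my-scraper)"]

-- Fetch a page using the custom options and return its raw markup.
fetchMarkup :: URL -> IO (Maybe [String])
fetchMarkup url = scrapeURLWithOpts opts url (htmls anySelector)
```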

If you do not require network support, you can instead depend on
scalpel-core which does not
depend on curl.

Changes

Change Log

HEAD

0.5.1

Fix bug (#59, #54) in DFS traversal order.

0.5.0

Split scalpel into two packages: scalpel and scalpel-core. The latter
does not provide networking support and does not depend on curl.

0.4.1

Added notP attribute predicate.

0.4.0

Added the chroot tricks (#23 and #25) to README.md along with examples.

Fix backtracking that occurs when using guard and chroot.

Fix bug where the same tag may appear in the result set multiple times.

Performance optimizations when using the (//) operator.

Make Scraper an instance of MonadFail. Practically this means that failed
pattern matches in <- expressions within a do block will evaluate to mzero
instead of throwing an error and bringing down the entire script.

Pluralized scrapers will now return the empty list instead of mzero when there
are no matches.

Add the position scraper which provides the index of the current sub-tree
within the context of a chroots do-block.

0.3.1

Added the innerHTML and innerHTMLs scrapers.

Added the match function which allows for the creation of arbitrary
attribute predicates.

Fixed build breakage with GHC 8.0.1.

0.3.0.1

Make tag and attribute matching case-insensitive.

0.3.0

Added benchmarks and many optimizations.

The select method is removed from the public API.

Many methods now have a constraint that the string type parametrizing
TagSoup’s tag type must be order-able.

Added scrapeUrlWithConfig that will hopefully put an end to multiplying
scrapeUrlWith* methods.

The default behaviour of the scrapeUrl* methods is to attempt to infer the
character encoding from the Content-Type header.

0.2.1.1

Cleanup stale instance references in documentation of TagName and
AttributeName.

0.2.1

Made Scraper an instance of MonadPlus.

0.2.0.1

Fixed examples in documentation and added an examples folder for ready-to-compile
examples. Added Travis tests to ensure that examples remain
compilable.

0.2.0

Removed the StringLike parameter from the Selector, Selectable,
AttributePredicate, AttributeName, and TagName types. Instead they are now
agnostic to the underlying string type, and are only constructable with
Strings and the Any type.

0.1.3.1

Tighten dependencies and drop download-curl altogether.

0.1.3

Add the html and htmls scraper primitives for extracting raw HTML.

0.1.2

Make scrapeURL follow redirects by default.

Expose a new function scrapeURLWithOpts that takes a list of curl options.

Fix bug (#2) where image tags that do not have a trailing “/” are not
selectable.