FunnelWeb is a web crawler which extracts website content such as titles, descriptions,
images and content blocks from existing websites. It filters this content and uploads
it into a new website which uses the Plone CMS. It gives you many options for adjusting
how content is migrated. It is an invaluable tool when you want to migrate a site which doesn’t
use a CMS, or when no tool can migrate content directly from the site’s database.

Funnelweb is organised as a series of steps through which crawled items pass before eventually being
uploaded. Each step has one or more configuration options so you can customise the import process
for your needs. Almost all imports will require some level of configuration.

Any argument from the pipeline can be overridden via the command line, e.g.:

$> bin/funnelweb --crawler:url=http://www.whitehouse.gov

All arguments take the form --step:argument=value.
The first part of each configuration key is the step, e.g. crawler. The second part is the
configuration option for that step, e.g. url. This is followed by = and the value or values.
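The argument form described above can be sketched in a few lines of Python. This is only an illustration of the naming scheme, not funnelweb's own parser; the function name parse_override is hypothetical.

```python
def parse_override(arg):
    """Split '--step:option=value' into (step, option, value).

    A flag without '=' (such as '--template1:debug') yields value None.
    """
    if not arg.startswith("--"):
        raise ValueError("expected an argument starting with '--': %r" % arg)
    # Split on the first '=' only, so values may themselves contain '='.
    key, sep, value = arg[2:].partition("=")
    # Split the key on the first ':' into step name and option name.
    step, colon, option = key.partition(":")
    if not colon or not step or not option:
        raise ValueError("expected --step:option[=value]: %r" % arg)
    return step, option, (value if sep else None)


print(parse_override("--crawler:url=http://www.whitehouse.gov"))
```

Because the key is split before the value is looked at, the colon inside the URL does not confuse the step and option names.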

Some options require multiple lines within a buildout part. These can be overridden
via the command line by repeating the same argument.

Funnelweb imports HTML either from a live website, from a folder on disk, or from a folder
on disk containing HTML which was retrieved from a live website and may still have absolute
links referring to that website.

Funnelweb can only import things it can crawl, i.e. content that is linked from
HTML. If your site contains JavaScript links or password-protected content, then
you may have to perform some extra steps to get funnelweb to crawl your
content.

To crawl a live website, supply the crawler with a base HTTP URL to start crawling from.
Every other URL you want from the site must start with this base URL.
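The prefix rule can be pictured as a simple string check. This is a minimal sketch of the idea only; the function name and the example.com URLs are hypothetical, and the real crawler may normalise URLs before comparing them.

```python
def in_crawl_scope(base_url, url):
    """Return True if url starts with the base URL given to the crawler."""
    return url.startswith(base_url)


base = "http://www.example.com/docs/"
print(in_crawl_scope(base, "http://www.example.com/docs/install.html"))
print(in_crawl_scope(base, "http://www.example.com/blog/news.html"))
```

Under this rule, a base URL that points at a subfolder leaves the rest of the site outside the crawl, so pick the shortest URL that covers everything you want.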

If you’d like to skip processing links with certain mimetypes, you can use the
drop:condition option. This TALES expression determines what will be processed further:
[funnelweb]
recipe = funnelweb
drop-condition: python:item.get('_mimetype') not in ['application/x-javascript','text/css','text/plain','application/x-java-byte-code'] and item.get('_path','').split('.')[-1] not in ['class']
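To see how the expression above treats individual items, it can be copied into an ordinary Python function and tried against a few dictionaries. The function name and the sample items are hypothetical; only the expression itself comes from the configuration above, and the sketch shows its truth value rather than how the step acts on it.

```python
# Mimetypes listed in the expression above.
LISTED_MIMETYPES = [
    'application/x-javascript',
    'text/css',
    'text/plain',
    'application/x-java-byte-code',
]


def condition(item):
    """Evaluate the same expression as the drop-condition option above."""
    return (item.get('_mimetype') not in LISTED_MIMETYPES
            and item.get('_path', '').split('.')[-1] not in ['class'])


# Hypothetical crawled items.
samples = [
    {'_mimetype': 'text/html', '_path': 'index.html'},
    {'_mimetype': 'text/css', '_path': 'style.css'},
    {'_mimetype': 'application/octet-stream', '_path': 'applet/Main.class'},
]
for item in samples:
    print(item['_path'], condition(item))
```

The expression is true for the HTML page and false for the stylesheet and the .class file, which matches the description that it selects what will be processed further.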

Funnelweb has a built-in clustering algorithm that tries to automatically extract the content from the HTML template.
This is slow and not always effective. Often you will need to input your own template extraction rules.

Note that for a single template, e.g. template1, ALL of the XPaths need to match, otherwise
that template will be skipped and the next template tried. If you’d like to make a
single XPath unnecessary for the template to match, use the keyword optional or optionaltext
instead of text or html before the XPath.

In the default pipeline there are four templates called template1, template2, template3 and template4.

When an XPath is applied within a single template, the HTML it matches will be removed from the page.
Another rule in that same template can’t match the same HTML fragment.

If a content part is not useful to Plone (e.g. redundant text, title or description), matching it with a rule
is an effective way to remove that HTML from the content.
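The matching behaviour described above (every required XPath must match, matched HTML is removed, optional rules may fail) can be illustrated with a small sketch. It is not funnelweb's implementation: it uses the standard library's ElementTree, which supports only a subset of XPath and needs well-formed markup, and the function name, rule tuples and sample page are hypothetical. It also assumes that optional behaves like html and optionaltext like text.

```python
import xml.etree.ElementTree as ET

PAGE = ("<html><body><h1>Title</h1>"
        "<div id='content'><p>Body</p></div>"
        "<div id='footer'>Footer</div></body></html>")


def apply_template(page, rules):
    """Return extracted fields, or None if a required rule fails to match.

    rules is a list of (field, kind, xpath) tuples, where kind is one of
    'text', 'html', 'optional' or 'optionaltext'.
    """
    root = ET.fromstring(page)
    fields = {}
    for field, kind, xpath in rules:
        # Rebuild the parent map each time, since earlier rules remove nodes.
        parents = {child: parent for parent in root.iter() for child in parent}
        node = root.find(xpath)
        if node is None or node not in parents:
            if kind.startswith("optional"):
                continue      # an optional rule may fail without consequence
            return None       # a required rule failed: skip this template
        # Remove the matched fragment so no later rule can match it again.
        parents[node].remove(node)
        node.tail = None      # simplification: trailing text is discarded
        if kind in ("text", "optionaltext"):
            fields[field] = "".join(node.itertext()).strip()
        else:
            fields[field] = ET.tostring(node, encoding="unicode")
    # Whatever was not extracted is kept as the leftover template HTML.
    fields["_template"] = ET.tostring(root, encoding="unicode")
    return fields


result = apply_template(PAGE, [
    ("title", "text", ".//h1"),
    ("text", "html", ".//div[@id='content']"),
])
print(result["title"])
print(result["_template"])
```

Adding a rule whose only purpose is to match unwanted markup, and then ignoring the resulting field, is how a fragment gets stripped from the leftover HTML in this sketch.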

To help debug your template rules, you can set debug mode:

$> bin/funnelweb --template1:debug --template2:debug

Setting debug mode on templateauto will give you details about the rules it uses.

By default, funnelweb will automatically create Plone aliases based on the original crawled URLs, so that any old links
will automatically be redirected to the new cleaned-up URLs. You can disable this by setting an empty target:

$> bin/funnelweb --plonealias:target=

You can change what items get published to which state by setting the following

You might need to insert further transformation steps for your particular
conversion use case. To do this, you can extend funnelweb’s underlying
transmogrifier pipeline. Funnelweb uses a transmogrifier pipeline to perform the needed transformations, and all
command-line and recipe options refer to options in the pipeline.
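As a rough picture of what an extra transformation step looks like, here is a sketch in the shape that transmogrifier sections usually take: a class constructed with the previous step's iterator, which yields items onwards. The class name and its key option are hypothetical, and the sketch leaves out the interface declarations and registration that a real blueprint needs, so that it runs on its own.

```python
class LowercasePath(object):
    """Hypothetical pipeline step that lower-cases each item's path."""

    def __init__(self, transmogrifier, name, options, previous):
        # 'previous' is the iterator of items coming from the preceding step.
        self.previous = previous
        # Hypothetical option: which item key to rewrite.
        self.key = options.get("key", "_path")

    def __iter__(self):
        for item in self.previous:
            if self.key in item:
                item[self.key] = item[self.key].lower()
            # Always pass the item on to the next step in the pipeline.
            yield item


# Stand-alone usage with a plain list standing in for the previous step.
step = LowercasePath(None, "lowercase", {}, iter([{"_path": "About/Index.HTML"}]))
print(list(step))
```

Each step only sees the items handed to it by the step before, which is why the order of steps in the pipeline matters.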

If you have decided you need to customise your pipeline and you want to install transformation
steps that use blueprints not already included in funnelweb or transmogrifier, you can include
them using the eggs option in a funnelweb buildout part.

Some transmogrifier blueprints, such as those in plone.app.transmogrifier
(see http://pypi.python.org/pypi/plone.app.transmogrifier), assume they are running inside a Plone
process. Funnelweb doesn’t run inside a Plone process, so these blueprints won’t work. If
you want to upload content into Plone, you can instead use
transmogrify.ploneremote, which provides alternative implementations
that upload content remotely via XML-RPC.
transmogrify.ploneremote is already included in funnelweb, as it is
what funnelweb’s default pipeline uses.

When using the default blueprints in funnelweb the following are some of the attributes that
will become attached to the items that each blueprint has access to. These can be used in the various
condition statements etc. as well as your own blueprints.

_site_url

The base of the URL as passed into the web crawler.

_path

The remainder of the URL. _site_url + _path = URL

_mimetype

The mimetype as returned by the crawler.

_content

The content of the item crawled, including image, file or HTML data.

_orig_path

The original path of the item that was crawled. This is useful for setting redirects so
you don’t get 404 errors after migrating content.

_sort_order

An integer representing the order in which this item was crawled. Helps to determine
what order items should be sorted in folders created on the server if your site
has navigation which has links ordered top to bottom.

_type

The type of object to be created, as returned by the “typeguess” step.

title, description, text, etc.

The template steps will typically create fields with content in them taken from _content.

_template

The template steps will leave the HTML that wasn’t separated out into different fields in this
attribute.

_defaultpage

Set on a Folder item to tell the uploading steps to make the contained item
named in _defaultpage the default page shown for that folder, instead of a content listing.

_transitions

Specify the workflow action you’d like to make on an item after it’s uploaded or updated.

_origin

This is used internally with the transmogrify.siteanalysis.relinker blueprint as a way to
tell it that you have changed the _path and you now want the relinker to update any links that
refer to _origin so that they point to _path.
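To make the attributes above concrete, here is what a single item might look like part-way through the pipeline, written as a plain Python dictionary. All values are hypothetical, and the redirect_map helper is an illustration only, not part of funnelweb.

```python
# A hypothetical item after the crawl, type-guess and template steps.
item = {
    "_site_url": "http://www.example.com/",
    "_path": "about/index.html",
    "_orig_path": "About/Index.html",
    "_mimetype": "text/html",
    "_content": "<html><body><h1>About us</h1></body></html>",
    "_sort_order": 3,
    "_type": "Document",
    "title": "About us",
}

# _site_url + _path gives back the full URL of the item.
url = item["_site_url"] + item["_path"]
print(url)


def redirect_map(items):
    """Map original paths to new paths for items whose path has changed."""
    return {
        i["_orig_path"]: i["_path"]
        for i in items
        if "_orig_path" in i and i["_orig_path"] != i.get("_path")
    }


print(redirect_map([item]))
```

A condition expression in a pipeline option sees the same kind of dictionary, which is why the earlier drop-condition example reads its values with item.get.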

The code of funnelweb itself is fairly minimal. It just sets up and runs a transmogrifier pipeline.
The hard work is actually done by five packages which each contain one or more transmogrifier
blueprints. These are: