Scoping guidance for specific types of sites

Archive-It's software enables partners to capture and display content hosted by popular social media services and many other commonly used platforms. These services frequently change how they serve content, so follow the guidance on this page to avoid problems with your crawls.

Before you crawl

Be specific with your seed URLs.

Be as specific as possible when choosing your seed URL; in other words, add only the page that you want to archive as the seed.

DO NOT use the whole site as a seed: www.facebook.com or www.twitter.com

DO use: http://twitter.com/internetarchive/
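
The DO/DO NOT examples above can be turned into a quick pre-flight check. The following is a minimal sketch, not part of Archive-It's software: the function name is_whole_site_seed is hypothetical, and it assumes that a seed is "too broad" whenever it names only a host with no path. Some platforms may need stricter checks than this.

```python
from urllib.parse import urlsplit


def is_whole_site_seed(seed: str) -> bool:
    """Return True if the seed names only a host, with no path or query.

    Hypothetical helper for illustration; it is not an Archive-It API.
    """
    # urlsplit only recognizes the host when a scheme or "//" is present,
    # so prepend "//" to bare inputs such as "www.facebook.com".
    if "//" not in seed:
        seed = "//" + seed
    parts = urlsplit(seed)
    return parts.path.strip("/") == "" and parts.query == ""


if __name__ == "__main__":
    for seed in (
        "www.facebook.com",
        "www.twitter.com",
        "http://twitter.com/internetarchive/",
    ):
        verdict = "too broad" if is_whole_site_seed(seed) else "specific"
        print(f"{seed}: {verdict}")
```

A check like this only catches the most obvious mistake. It does not replace a test crawl.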

Double-check your seeds!

Does your seed need a trailing slash (/)? Be sure to read the site-specific seed instructions below. A seed that is formatted incorrectly could result in unintentionally archiving millions of documents.

Run a test crawl first

We strongly recommend running a test crawl on all new seeds before running a production (non-test) crawl. A test crawl lets you confirm that your seeds are configured correctly and that you won't unintentionally crawl much more content than desired at the expense of your account's data budget.

Limit your crawls

You may want to set up data and/or document limits for these sites if the test crawl shows an unusually large volume of content and you have confirmed that your seed URL is correct.

After you crawl

It is especially important to review your first captures before regularly crawling your new seeds. Please look through your reports and the archived content after you run your first crawls in order to ensure that your archived content looks accurate, and that you didn't crawl more than you intended.

Sites with automated scoping rules

Recommended scoping rules exist for many popular platforms, including social media sites. Automated default scoping rules will be applied when new seed URLs from the following platforms are added to a collection or when you manually apply the rules to existing seeds:

Automated scoping rules for new seeds

When you add a new seed from one of the platforms that apply automatic scoping rules, you’ll see a + icon beside the seed. Hovering over the icon displays a notice that scoping rules will be automatically applied and that these rules may include ignoring robots.txt files.

After clicking “Add Seeds”, a banner will appear if any of your seeds had scoping rules automatically applied.

To view the rules that were applied to your seed, click on the hyperlinked seed URL listed in the “Seeds” tab of the collection management interface. Once you are in the seed settings, navigate to the “Seed Scope” tab. A link icon in the “Controls” column marks a rule as part of an automatically applied group.

To toggle off or delete any of the automatically applied rules, click the link icon in the “Controls” column. A dialog box will warn that, if unlinked, these rules will not be automatically updated when changes are made to our recommended scoping.

Once you click “Confirm” to edit grouped scoping rules, you will be able to toggle individual rules on and off. Once a rule is toggled to the “off” position, the option to delete it will appear (as illustrated in the first rule listed below).

Note: To facilitate testing Facebook seeds at varying data levels, the 3GB default data limit can be deleted without affecting the other automated Facebook rules, which are linked together.

Automated scoping rules for existing seeds

You can add automated scoping rules in bulk to selected existing seeds from your seed list. Be aware that adding the automated rules will delete all existing seed-level scoping rules from those seeds and replace them with the automated set. You may also want to review your collection-level scoping rules and delete any that are no longer necessary.

Select seed(s) from a collection’s “Seeds” tab by clicking the checkboxes to the left of each seed. You can select any seeds; only seeds from platforms with automated scoping rules will have seed-level rules applied.

Click the “Add Rules” button to add automated scoping rules, where relevant, to the selected seeds. A dialog box will appear listing the templates that will be applied and how many seeds they will be applied to.
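
The bulk behavior described in this section (rules are applied only to matching seeds, existing seed-level rules are replaced, and a summary lists templates with seed counts) can be sketched as below. This is an illustrative model only, not Archive-It's implementation. The TEMPLATES mapping, the template names, and the functions template_for and add_rules are all hypothetical, and apart from Facebook, which is mentioned above, the platforms listed are assumptions.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical host-to-template mapping. The real platform list and rule
# contents are maintained by Archive-It and are not reproduced here.
TEMPLATES = {"facebook.com": "facebook", "twitter.com": "twitter"}


def template_for(seed_url: str):
    """Return the hypothetical template name for a seed, or None."""
    host = (urlsplit(seed_url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return TEMPLATES.get(host)


def add_rules(selected: dict) -> Counter:
    """Apply templates to the selected seeds and return template -> seed count.

    `selected` maps each seed URL to its list of seed-level rules and is
    modified in place.
    """
    applied = Counter()
    for url in selected:
        name = template_for(url)
        if name is None:
            continue  # no automated rules: existing rules are left untouched
        # Existing seed-level rules are discarded and replaced by the template.
        selected[url] = [f"{name}-template"]
        applied[name] += 1
    return applied


if __name__ == "__main__":
    seeds = {
        "https://www.facebook.com/internetarchive/": ["custom block rule"],
        "http://twitter.com/internetarchive/": [],
        "https://example.org/news/": ["custom document limit"],
    }
    print(dict(add_rules(seeds)))  # summary, like the confirmation dialog
    print(seeds["https://example.org/news/"])  # unchanged
```

The point of the model is the asymmetry: a seed with no matching template keeps its custom rules, while a matching seed loses them. This is why it is worth reviewing existing seed-level rules before clicking “Add Rules”.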