Scrapy will look for configuration parameters in ini-style scrapy.cfg files
in standard locations:

/etc/scrapy.cfg or c:\scrapy\scrapy.cfg (system-wide),

~/.config/scrapy.cfg ($XDG_CONFIG_HOME) and ~/.scrapy.cfg ($HOME)
for global (user-wide) settings, and

scrapy.cfg inside a scrapy project’s root (see next section).

Settings from these files are merged in the listed order of preference:
user-defined values have higher priority than system-wide defaults,
and project-wide settings, when defined, override all others.
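
As a rough illustration, a project-root scrapy.cfg typically just points at the
project's settings module, while a user-wide file can carry personal preferences;
myproject and the bpython choice below are placeholder values:

    # scrapy.cfg at the project root
    [settings]
    default = myproject.settings

    # ~/.config/scrapy.cfg (user-wide), e.g. picking a preferred interactive shell
    [settings]
    shell = bpython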

Scrapy also understands, and can be configured through, a number of environment
variables. Currently these are:

SCRAPY_SETTINGS_MODULE (the Python path of the settings module to use),

SCRAPY_PROJECT (the project to use when scrapy.cfg defines more than one), and

SCRAPY_PYTHON_SHELL (the preferred interactive Python shell, e.g. ipython or bpython).
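
For instance (a minimal sketch; myproject.settings and myspider are placeholder
names), you can point Scrapy at a settings module explicitly before invoking a
command:

    export SCRAPY_SETTINGS_MODULE=myproject.settings
    scrapy crawl myspider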

The first line will print the currently active project if you’re inside a
Scrapy project. In this example it was run from outside a project. If run from inside
a project it would have printed something like this:
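
    Scrapy 2.11 - active project: myproject

(Illustrative only; the exact wording, version number and project name depend on
your Scrapy version and your project.)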

You use the scrapy tool from inside your projects to control and manage
them.

For example, to create a new spider:

scrapy genspider mydomain mydomain.com

Some Scrapy commands (like crawl) must be run from inside a Scrapy
project. See the commands reference below for more
information on which commands must be run from inside projects, and which not.

Also keep in mind that some commands may have slightly different behaviours
when running them from inside projects. For example, the fetch command will use
spider-overridden behaviours (such as the user_agent attribute to override
the user-agent) if the url being fetched is associated with some specific
spider. This is intentional, as the fetch command is meant to be used to
check how spiders are downloading pages.

This section contains a list of the available built-in commands with a
description and some usage examples. Remember, you can always get more info
about each command by running:

scrapy <command> -h

And you can see all available commands with:

scrapy -h

There are two kinds of commands, those that only work from inside a Scrapy
project (Project-specific commands) and those that also work without an active
Scrapy project (Global commands), though they may behave slightly differently
when run from inside a project (as they would use the project-overridden
settings).

The genspider command creates a new spider in the current folder, or in the current project’s spiders folder if called from inside a project. The <name> parameter is set as the spider’s name, while <domain> is used to generate the spider’s allowed_domains and start_urls attributes.

This is just a convenience shortcut command for creating spiders based on
pre-defined templates, but certainly not the only way to create spiders. You
can just create the spider source code files yourself, instead of using this
command.
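
A sketch of typical usage (example, example.com, scrapyorg and scrapy.org are
placeholder names; -l lists the available templates and -t selects one):

    # list the available spider templates
    scrapy genspider -l

    # create a spider named "example" for example.com using the default (basic) template
    scrapy genspider example example.com

    # create a spider using the "crawl" template
    scrapy genspider -t crawl scrapyorg scrapy.org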

The fetch command downloads the given URL using the Scrapy downloader and
writes the contents to standard output.

The interesting thing about this command is that it fetches the page how the
spider would download it. For example, if the spider has a USER_AGENT
attribute which overrides the User Agent, it will use that one.

So this command can be used to “see” how your spider would fetch a certain page.

If used outside a project, no particular per-spider behaviour is applied
and it just uses the default Scrapy downloader settings.

Supported options:

--spider=SPIDER: bypass spider autodetection and force use of specific spider
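
As a minimal sketch (the URLs are placeholders, and --nolog is Scrapy's global
option for suppressing log output):

    # fetch with default settings (or the matching spider's overrides, inside a project)
    scrapy fetch --nolog http://www.example.com/some/page.html > page.html

    # force a specific spider's settings to be used (myspider is a placeholder name)
    scrapy fetch --spider=myspider --nolog http://www.example.com/ > page.html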

The view command opens the given URL in a browser, as your Scrapy spider would “see” it.
Sometimes spiders see pages differently from regular users, so this can be used
to check what the spider “sees” and confirm it’s what you expect.

Supported options:

--spider=SPIDER: bypass spider autodetection and force use of specific spider

--no-redirect: do not follow HTTP 3xx redirects (default is to follow them)
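
For example (the URL is a placeholder):

    # download the page with Scrapy and open the result in your browser
    scrapy view http://www.example.com/some/page.html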

The shell command starts the Scrapy shell for the given URL (if given), or empty if no
URL is given. It also supports UNIX-style local file paths, either relative with
./ or ../ prefixes or absolute file paths.
See Scrapy shell for more info.

Supported options:

--spider=SPIDER: bypass spider autodetection and force use of specific spider

-c code: evaluate the code in the shell, print the result and exit

--no-redirect: do not follow HTTP 3xx redirects (default is to follow them);
this only affects the URL you may pass as argument on the command line;
once you are inside the shell, fetch(url) will still follow HTTP redirects by default.
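
A few illustrative invocations (URLs and file paths are placeholders):

    # open an interactive shell with the response for the given URL already fetched
    scrapy shell http://www.example.com/some/page.html

    # evaluate an expression against the response and exit (non-interactive)
    scrapy shell --nolog http://www.example.com/ -c '(response.status, response.url)'

    # work with a local file instead of a URL
    scrapy shell ./path/to/local_file.html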