For Windows, follow instructions here under “Starting the SQL Command Line Terminal”.

With the client started, you should see the postgres=# prompt.

We simply create a database for our project with this command:

postgres=# create database scrape;

Notice I had to include the semicolon there. This is pretty much all the raw SQL we’re going to touch. If you are curious, you can learn more about how to interact with Postgres in their docs.

Settings.py

Create settings.py within my_scraper/scraper_app/. Our settings.py file will only have 4 variables defined (pop quiz: these variables are all caps, do you remember why? Refer to the Data Viz tutorial for a refresher).

If you were walking through the Scrapy tutorial on your own, it would create a settings.py file for you with a few variables that you should define.

Our settings.py file is simply a list of global variables specific to our project so we can import our settings later.

We first give our web scraper a name, and tell Scrapy where our spider module is located.

The drivername is the type of database we’re using – Postgres. Since we’re using Postgres that we installed on our own computer, the location, or the host is localhost. The port is the default port that Postgres listens on.

The username is your username for your machine. The password may not be needed (just an empty string, 'password': ''), or may be the password used when setting up Postgres initially.

The database is the name of the database we created earlier, postgres=# create database scrape;.
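Putting those pieces together, our settings.py so far might look like the sketch below – the bot name, the spider module path, and the username are assumptions here; substitute your own values:

```python
# my_scraper/scraper_app/settings.py -- a sketch; BOT_NAME, the spider
# module path, and the username are placeholder assumptions.
BOT_NAME = 'deals_scraper'                # assumed scraper name
SPIDER_MODULES = ['scraper_app.spiders']  # assumed spider module location

DATABASE = {
    'drivername': 'postgresql',  # the database type
    'host': 'localhost',         # Postgres runs on our own machine
    'port': '5432',              # default Postgres port
    'username': 'lynn',          # your machine's username (assumed here)
    'password': '',              # may be empty, or your Postgres password
    'database': 'scrape',        # the database we created earlier
}
```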

We will return to our settings.py file to add a fourth variable, ITEM_PIPELINES, once we set up our pipelines (how we handle the scraped & parsed data – e.g. we save them to the database).

A few things I want to point out with this example. First, the from sqlalchemy import * line: import * literally imports everything from sqlalchemy into our models.py namespace. This is typically not good; it can hurt performance, and it is also unclear to whoever reads your code later. We specifically want the create_engine() function from sqlalchemy, and if we just import *, it is difficult to see at a glance that create_engine() is defined in and imported from sqlalchemy.

Let’s be better developers and change our import statement to from sqlalchemy import create_engine. Here, we avoid importing everything from the sqlalchemy package, and we are more explicit with what we are using from sqlalchemy.

We make a general import settings statement – it does not import every item in settings.py, but it gives us access to any item we want by later using settings.DATABASE. You can think of the difference between import settings versus from sqlalchemy import * as “take the basket” versus “take everything out of the basket.”

The last item I want to point out before we move on is the use of the double asterisks within the URL() function: **settings.DATABASE. First, we are accessing the DATABASE variable within settings.py. The ** then unpacks all the key/value pairs within the DATABASE dictionary. The URL() function, a constructor defined within SQLAlchemy, will map the keys and values to a URL that SQLAlchemy can understand to make a connection to our database.
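To see what ** does in plain Python, here is a hypothetical make_url() helper (not part of SQLAlchemy) that receives the unpacked dictionary as keyword arguments, the same way URL() does:

```python
# Hypothetical helper showing how ** unpacks a dict into keyword
# arguments; SQLAlchemy's URL() receives the keys the same way.
def make_url(drivername, username, host, port, database):
    return '{}://{}@{}:{}/{}'.format(drivername, username, host, port, database)

DATABASE = {
    'drivername': 'postgresql',
    'username': 'lynn',
    'host': 'localhost',
    'port': '5432',
    'database': 'scrape',
}

# make_url(**DATABASE) is equivalent to
# make_url(drivername='postgresql', username='lynn', ...)
print(make_url(**DATABASE))  # prints postgresql://lynn@localhost:5432/scrape
```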

Then, the URL() function will parse out the elements, and create the following URL for the create_engine() function to read:

'postgresql://lynn@localhost:5432/scrape'

Now create_engine() can read that URL to eventually make a connection to our database.
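Wiring this together in models.py might look like the sketch below – the db_connect() name is our own, the DATABASE dict stands in for import settings, and note that SQLAlchemy 1.4+ spells the constructor URL.create() while older versions call URL() directly:

```python
# Sketch of the connection setup; DATABASE stands in for settings.DATABASE,
# and db_connect() is an assumed helper name, not necessarily the tutorial's.
from sqlalchemy import create_engine
from sqlalchemy.engine.url import URL

DATABASE = {
    'drivername': 'postgresql',
    'username': 'lynn',       # assumed username
    'host': 'localhost',
    'port': 5432,
    'database': 'scrape',
}

def db_connect():
    """Build an engine from the DATABASE dict; it connects lazily, on first use."""
    # Older SQLAlchemy allows URL(**DATABASE); 1.4+ uses URL.create(**DATABASE).
    return create_engine(URL.create(**DATABASE))
```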

Next, we create the table for our ORM. We have to import declarative_base() from SQLAlchemy in order to map a class that defines our table structure to Postgres, as well as a function that will take the metadata of our table to create the table(s) we need.

We give our class a table name, “deals”, as well as 8 fields. Each field will be mapped to a column in our table when it is created through create_deals_table().

For each field, we define its type: Integer for our primary key field, while the rest are Strings. We pass in the label of the column as a string for everything but the id field. Last, for most fields we allow them to be empty (nullable=True) in case we don’t have those fields in the deal data that we collect.
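A sketch of that declarative table is below; the exact column names other than id (title, link, and so on) are assumptions based on the deal data being scraped:

```python
# Sketch of the "deals" table described above; column names other than
# `id` are assumptions based on the deal data being collected.
from sqlalchemy import Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base

DeclarativeBase = declarative_base()

def create_deals_table(engine):
    """Create the table(s) from the metadata gathered by DeclarativeBase."""
    DeclarativeBase.metadata.create_all(engine)

class Deals(DeclarativeBase):
    """Maps this class to the 'deals' table."""
    __tablename__ = 'deals'

    id = Column(Integer, primary_key=True)
    title = Column('title', String)
    link = Column('link', String, nullable=True)
    location = Column('location', String, nullable=True)
    original_price = Column('original_price', String, nullable=True)
    price = Column('price', String, nullable=True)
    end_date = Column('end_date', String, nullable=True)
    description = Column('description', String, nullable=True)
```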