1
00:00:00,000 --> 00:00:04,703
[MUSIC]
2
00:00:04,703 --> 00:00:07,457
We've managed to make a couple
of spiders that were great for
3
00:00:07,457 --> 00:00:10,120
sites that don't require interaction.
4
00:00:10,120 --> 00:00:13,955
But many sites do indeed require
some sort of interaction.
5
00:00:13,955 --> 00:00:17,350
For example,
logging in to a site with a username and
6
00:00:17,350 --> 00:00:20,310
password requires a form submission.
7
00:00:20,310 --> 00:00:21,850
There are many different reasons for
8
00:00:21,850 --> 00:00:25,520
needing to work with forms when
getting and scraping data.
9
00:00:25,520 --> 00:00:28,510
Let's head back into our code to
take a look at some techniques.
10
00:00:30,440 --> 00:00:34,010
Our Horse Land site is
hosted on GitHub Pages,
11
00:00:34,010 --> 00:00:37,120
which doesn't support
backend technologies.
12
00:00:37,120 --> 00:00:43,010
So we'll be using a bit of a workaround
from Formspree to handle the form posts.
13
00:00:43,010 --> 00:00:47,530
Check the teacher's notes for additional
information about formspree.io and
14
00:00:47,530 --> 00:00:48,780
how to get started with that.
15
00:00:48,780 --> 00:00:53,890
If we look at our form page,
we see that it's a pretty simple form
16
00:00:53,890 --> 00:00:57,980
with just a first name,
last name, and a job title.
17
00:00:57,980 --> 00:01:02,390
Scrapy has a class called FormRequest,
which allows for form processing.
18
00:01:02,390 --> 00:01:05,830
And, hold your horses, it's easy to use.
19
00:01:05,830 --> 00:01:08,850
Let's mosey on over to our code and
create a new spider.
20
00:01:10,600 --> 00:01:16,415
So I'll create a new file, gonna be a
Python file, and we'll call it formSpider.
21
00:01:18,650 --> 00:01:21,654
First, FormRequest
will need to be imported.
22
00:01:21,654 --> 00:01:28,150
So from scrapy.http import FormRequest.
23
00:01:28,150 --> 00:01:29,348
And we need to import Spider.
24
00:01:29,348 --> 00:01:32,483
So from scrapy.spiders
25
00:01:32,483 --> 00:01:35,409
import Spider.
26
00:01:37,420 --> 00:01:41,310
We need to create a new class that
inherits from Spider as our next step.
27
00:01:43,200 --> 00:01:49,120
Call it FormSpider and, as we've seen,
we need to give our Spider a name.
28
00:01:51,080 --> 00:01:52,547
We'll just call it horseForm.
29
00:01:55,620 --> 00:01:57,508
And we define our start_urls.
30
00:02:00,302 --> 00:02:01,270
Which again, is a list.
31
00:02:02,370 --> 00:02:04,126
What's the URL for our form?
32
00:02:06,890 --> 00:02:08,189
We'll just cut and paste that in.
33
00:02:11,799 --> 00:02:14,740
This looks pretty familiar so far,
I think.
34
00:02:14,740 --> 00:02:20,030
Next we define our parse method and we'll
define the formdata we want to pass in.
35
00:02:21,310 --> 00:02:25,611
So define parse and formdata.
36
00:02:25,611 --> 00:02:28,673
Let's go use the developer tools in
the browser to see what the form
37
00:02:28,673 --> 00:02:30,530
fields are called.
38
00:02:30,530 --> 00:02:34,022
Come over here, Developer Tools.
39
00:02:36,627 --> 00:02:38,962
So they're down in here in this form.
40
00:02:46,930 --> 00:02:54,540
So we have firstname,
lastname, and jobtitle.
41
00:02:54,540 --> 00:02:56,430
All lower case and no spaces.
42
00:02:58,260 --> 00:02:59,190
So we want firstname.
43
00:03:00,860 --> 00:03:01,950
My first name is Ken.
44
00:03:04,982 --> 00:03:09,140
lastname, Alger.
45
00:03:09,140 --> 00:03:14,133
And jobtitle is Teacher.
46
00:03:15,720 --> 00:03:19,234
Now we need to return a FormRequest
built from the response object.
47
00:03:19,234 --> 00:03:26,940
So return FormRequest.from_response.
48
00:03:26,940 --> 00:03:33,250
We'll pass in the response, the form
number on the page we're processing,
49
00:03:33,250 --> 00:03:40,510
and that's zero based, formnumber,
and then the form data we want.
50
00:03:40,510 --> 00:03:43,340
So formdata = formdata.
51
00:03:45,380 --> 00:03:48,370
And then a callback for what to do next.
52
00:03:48,370 --> 00:03:49,120
So callback.
53
00:03:51,880 --> 00:03:54,540
We'll make a method
here called after_post.
54
00:03:55,740 --> 00:03:59,560
This passes the data we
defined into the form and,
55
00:03:59,560 --> 00:04:04,280
by default, utilizes the submit
button to submit our data.
56
00:04:04,280 --> 00:04:08,640
Then it will do whatever we
define in the after_post method.
57
00:04:08,640 --> 00:04:14,350
Here we could do data saving or
data processing or further scraping tasks.
58
00:04:14,350 --> 00:04:19,470
For now, let's just print out
that the form was processed and
59
00:04:19,470 --> 00:04:21,943
the response object itself.
60
00:04:21,943 --> 00:04:26,865
So we'll define after_post, self,
and again, that takes a response.
61
00:04:26,865 --> 00:04:31,260
We'll print and we'll do
62
00:04:31,260 --> 00:04:36,254
a little formatting, just so
63
00:04:36,254 --> 00:04:41,260
we can see it in the terminal.
64
00:04:41,260 --> 00:04:45,533
And we'll print the response.
65
00:04:45,533 --> 00:04:47,090
Let's just copy this line here.
66
00:04:50,750 --> 00:04:51,710
There we go.
67
00:04:51,710 --> 00:04:58,039
And we can, all right,
let's open a Terminal window,
68
00:05:00,410 --> 00:05:08,861
go to our spiders folder, and
have Scrapy run our crawler.
69
00:05:13,863 --> 00:05:14,580
We look up here.
70
00:05:16,080 --> 00:05:19,370
Great, we see that the spider found and
submitted our form.
71
00:05:19,370 --> 00:05:24,110
In our case here, it was posted
to formspree.io for processing.
72
00:05:24,110 --> 00:05:27,490
Here's our printed information and
our 200 response code.
73
00:05:27,490 --> 00:05:30,390
Great, I've included links
in the teacher's notes
74
00:05:30,390 --> 00:05:32,310
about FormRequest as well.
75
00:05:32,310 --> 00:05:35,350
I'd encourage you to look at it
as it is a powerful tool for
76
00:05:35,350 --> 00:05:39,708
processing forms and
can even be used to handle login forms.