6.1 Customize Pipeline

After extraction finishes, we use a Pipeline to persist the extracted results. We can also customize the Pipeline to implement common functionality. This chapter introduces the Pipeline and uses two examples to explain how to customize it.

6.1.1 Introduction to Pipeline

The Pipeline interface is defined as follows:

public interface Pipeline {

    // ResultItems holds the extracted results as a map-like structure.
    // Data stored with page.putField(key, value) can be read back with ResultItems.get(key).
    public void process(ResultItems resultItems, Task task);

}

As we can see, a Pipeline persists the data extracted by the PageProcessor. This work could also be done inside the PageProcessor, so why use a Pipeline? There are two reasons:

Separation of modules. Extracting a page and persisting the data are two stages of a spider. On the one hand, separating them makes the code structure clearer. On the other hand, the persistence step can be split out and run in another thread or even on another server.

Pipeline functionality is relatively stable, so it is easy to turn into a reusable component. Extraction differs greatly from page to page, but persistence is almost always the same, such as saving to a file or to a database, and applies to nearly all pages. WebMagic already provides several common Pipelines, such as writing to the console, saving to a file, and saving to a file in JSON format.

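In WebMagic, a Spider can have multiple Pipelines; call Spider.addPipeline() to add one. Every Pipeline that has been added is executed. For example, you can add both a ConsolePipeline and a FilePipeline to the same Spider so that the results are printed to the console and saved to files.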
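The following is a minimal sketch of how several Pipelines receive the same results. So that it runs without WebMagic on the classpath, it uses simplified stand-ins: SimplePipeline, MiniSpider, and a plain Map in place of ResultItems are hypothetical names that only mimic the shape of Spider.addPipeline(). The sample values are made up as well.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for WebMagic's Pipeline; a Map replaces ResultItems.
interface SimplePipeline {
    void process(Map<String, Object> resultItems);
}

// Hypothetical stand-in for Spider: it only keeps the registered pipelines.
class MiniSpider {
    private final List<SimplePipeline> pipelines = new ArrayList<>();

    // Returns this so that calls can be chained, like Spider.addPipeline().
    MiniSpider addPipeline(SimplePipeline pipeline) {
        pipelines.add(pipeline);
        return this;
    }

    // Hands one page's results to every pipeline, in registration order.
    void emit(Map<String, Object> resultItems) {
        for (SimplePipeline pipeline : pipelines) {
            pipeline.process(resultItems);
        }
    }
}

class MultiPipelineDemo {
    public static void main(String[] args) {
        List<String> saved = new ArrayList<>();
        MiniSpider spider = new MiniSpider()
                // First pipeline: print the results.
                .addPipeline(items -> System.out.println("console: " + items))
                // Second pipeline: pretend to save them somewhere.
                .addPipeline(items -> saved.add("saved " + items.get("name")));

        Map<String, Object> items = new LinkedHashMap<>();
        items.put("author", "example-user");
        items.put("name", "example-repo");
        spider.emit(items);
    }
}
```

With the real library, the equivalent chain would be built from WebMagic's own Pipeline implementations instead of lambdas.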

6.1.2 Output the result to the console

Consider the following process method of a PageProcessor that crawls GitHub repository pages:

public void process(Page page) {
    page.addTargetRequests(page.getHtml().links().regex("(https://github\\.com/\\w+/\\w+)").all());
    page.addTargetRequests(page.getHtml().links().regex("(https://github\\.com/\\w+)").all());
    // Save the author; the data ends up in ResultItems.
    page.putField("author", page.getUrl().regex("https://github\\.com/(\\w+)/.*").toString());
    page.putField("name", page.getHtml().xpath("//h1[@class='entry-title public']/strong/a/text()").toString());
    if (page.getResultItems().get("name") == null) {
        // When skip is set, this page is not processed by the Pipeline.
        page.setSkip(true);
    }
    page.putField("readme", page.getHtml().xpath("//div[@id='readme']/tidyText()"));
}

Now we want to write the results to the console. ConsolePipeline can do this.
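As an illustration, a console-style Pipeline can be sketched as below. This is not WebMagic's actual ConsolePipeline source: DemoResultItems, DemoPipeline, and DemoConsolePipeline are hypothetical stand-ins so the example runs on its own, the Task argument is omitted, and the output format is an assumption.

```java
import java.io.PrintStream;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for WebMagic's ResultItems: an ordered key-value store.
class DemoResultItems {
    private final Map<String, Object> fields = new LinkedHashMap<>();

    DemoResultItems put(String key, Object value) {
        fields.put(key, value);
        return this;
    }

    Map<String, Object> getAll() {
        return fields;
    }
}

// Hypothetical stand-in for the Pipeline interface (Task argument omitted).
interface DemoPipeline {
    void process(DemoResultItems resultItems);
}

// Prints every extracted field as "key:<TAB>value", one field per line.
class DemoConsolePipeline implements DemoPipeline {
    private final PrintStream out;

    DemoConsolePipeline(PrintStream out) {
        this.out = out;
    }

    @Override
    public void process(DemoResultItems resultItems) {
        for (Map.Entry<String, Object> entry : resultItems.getAll().entrySet()) {
            out.println(entry.getKey() + ":\t" + entry.getValue());
        }
    }
}

class ConsolePipelineDemo {
    public static void main(String[] args) {
        // Sample values are made up for the demo.
        DemoResultItems items = new DemoResultItems()
                .put("author", "example-user")
                .put("name", "example-repo");
        new DemoConsolePipeline(System.out).process(items);
    }
}
```

Passing the PrintStream in through the constructor is a design choice of this sketch: it keeps the class easy to test by pointing it at a buffer instead of System.out.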

Using ConsolePipeline as a reference, you can write your own Pipeline: read the data from ResultItems and process it however you need.

6.1.3 Persist the result in MySQL

First, we introduce an example project, jobhunter. It is a WebMagic crawler integrated with the Spring framework that crawls job information. The example also shows how to use MyBatis to persist the data in a MySQL database.

In Java, there are many ways to save data to a database, such as JDBC, DbUtils, Spring JDBC, and MyBatis. These tools all do the same job, but they differ in complexity. With JDBC, we only need to read the data from ResultItems and save it.

If we use an ORM framework to persist the data, we face a problem: these frameworks all require a well-defined model rather than key-value ResultItems. Taking MyBatis as an example, we can define a DAO using MyBatis-Spring.

Basic Pipeline mode

We have now finished saving the data! But how do we do the same with the plain Pipeline interface? It is easy: if you want to save an object, store the data as an object when you extract it from the page.
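The idea can be sketched as follows: the extraction step builds a model object and stores it under a single key, and the Pipeline reads that object back and passes it to a DAO. All names here (JobInfo, JobInfoDao, InMemoryJobInfoDao, JobInfoDaoPipeline, the "jobInfo" key) are hypothetical; a Map stands in for ResultItems, and an in-memory list stands in for the MyBatis-backed DAO so the sketch runs without a database.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model class for one extracted record.
class JobInfo {
    final String title;
    final String company;

    JobInfo(String title, String company) {
        this.title = title;
        this.company = company;
    }
}

// Hypothetical DAO; a real one would be backed by MyBatis or JDBC.
interface JobInfoDao {
    void add(JobInfo jobInfo);
}

// In-memory DAO so that the sketch runs without a database.
class InMemoryJobInfoDao implements JobInfoDao {
    final List<JobInfo> rows = new ArrayList<>();

    @Override
    public void add(JobInfo jobInfo) {
        rows.add(jobInfo);
    }
}

// Stand-in for a Pipeline: a Map replaces ResultItems, and Task is omitted.
class JobInfoDaoPipeline {
    private final JobInfoDao dao;

    JobInfoDaoPipeline(JobInfoDao dao) {
        this.dao = dao;
    }

    void process(Map<String, Object> resultItems) {
        Object value = resultItems.get("jobInfo");
        // Only persist when the extraction step stored a model object.
        if (value instanceof JobInfo) {
            dao.add((JobInfo) value);
        }
    }
}

class ObjectPipelineDemo {
    public static void main(String[] args) {
        // Extraction step: build the object first, then store it under one key,
        // as page.putField("jobInfo", ...) would do in a PageProcessor.
        Map<String, Object> resultItems = new HashMap<>();
        resultItems.put("jobInfo", new JobInfo("Java Developer", "Example Corp"));

        InMemoryJobInfoDao dao = new InMemoryJobInfoDao();
        new JobInfoDaoPipeline(dao).process(resultItems);
        System.out.println("rows saved: " + dao.rows.size());
    }
}
```

The instanceof check means pages that did not produce a model object are silently ignored rather than causing a ClassCastException.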