Get Started:

First crawler:

Write a class that implements PageProcessor. For example, here is a crawler for GitHub repository information.

    import us.codecraft.webmagic.Page;
    import us.codecraft.webmagic.Site;
    import us.codecraft.webmagic.Spider;
    import us.codecraft.webmagic.processor.PageProcessor;

    public class GithubRepoPageProcessor implements PageProcessor {

        private Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

        @Override
        public void process(Page page) {
            // Discover more repository pages to crawl
            page.addTargetRequests(page.getHtml().links().regex("(https://github\\.com/\\w+/\\w+)").all());
            page.putField("author", page.getUrl().regex("https://github\\.com/(\\w+)/.*").toString());
            page.putField("name", page.getHtml().xpath("//h1[@class='public']/strong/a/text()").toString());
            if (page.getResultItems().get("name") == null) {
                // skip this page
                page.setSkip(true);
            }
            page.putField("readme", page.getHtml().xpath("//div[@id='readme']/tidyText()"));
        }

        @Override
        public Site getSite() {
            return site;
        }

        public static void main(String[] args) {
            Spider.create(new GithubRepoPageProcessor())
                    .addUrl("https://github.com/code4craft")
                    .thread(5)
                    .run();
        }
    }
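
To compile and run an example like the one above, the WebMagic artifacts need to be on the classpath. A minimal Maven dependency sketch is shown below; the version number is a placeholder, and the latest release should be substituted for it:

    <dependency>
        <groupId>us.codecraft</groupId>
        <artifactId>webmagic-core</artifactId>
        <version>LATEST_VERSION</version>
    </dependency>
    <dependency>
        <groupId>us.codecraft</groupId>
        <artifactId>webmagic-extension</artifactId>
        <version>LATEST_VERSION</version>
    </dependency>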