Octoparse – For All Your Web Scraping Needs

Octoparse is a Windows-based web data extraction application. It can extract data from most public websites, in many fields and for many uses. It is growing in popularity because it aims to provide a well-rounded, professional scraping service. Clear guidelines and comprehensive tutorials help users grasp the rules and configure their own scraping tasks quickly. Octoparse also has a user-friendly UI, so even users without any coding skills can handle the tool.

Overview
Octoparse is a powerful extractor aimed at providing professional data extraction services. Users customize their extraction tasks by clicking and dragging blocks in the Workflow Designer pane and by customizing the crawling pattern. They can also schedule tasks and have them run on the Cloud Platform. Octoparse offers two extraction plans: a Paid Extraction Plan and a Free Extraction Plan. Both editions cover basic needs, so users can scrape and export data whichever edition they use. For faster and larger-scale scraping, Octoparse suggests its featured service, Cloud-based Extraction, which is available only in the Paid edition. Data can be exported in various formats, such as Excel, CSV, HTML, and TXT, or to a database (MySQL, SQL Server, or Oracle).

Plans and Pricing

The free and paid editions of Octoparse both provide the basic set of extraction features. The difference is that the paid editions let users scrape data on a larger scale and use the cloud-based service, which can extract data around the clock. A Standard Edition subscription costs $89/month and is limited to 6 simultaneous threads, while a Professional Edition subscription costs $189/month with 14 simultaneous threads. If you’d like the Octoparse team to customize a crawler to your requirements, that costs $99 per crawler.

Workflow

The Octoparse workflow is designed to be user-friendly. The tips are clear, and the icons and operations are straightforward and easy to handle. Octoparse simulates a series of human browsing actions, such as opening a web page in the built-in browser and pointing at and clicking web elements, which the user configures by selecting options in the pop-up designer window. In this way it turns repetitive manual extraction into an automated process and retrieves the structured data users need.

Two modes are provided: Wizard Mode and Advanced Mode. Octoparse suggests that beginners start with Wizard Mode to learn the scraping process quickly. Advanced Mode serves more complicated and larger-scale scraping needs. In Advanced Mode, you drag and drop blocks inside the Workflow Designer to configure your task. Newcomers are expected to finish the training sessions before starting a task in Advanced Mode. To start one, choose New Task (Advanced Mode), and the advanced features become available.

As mentioned above, Octoparse extracts structured data automatically by simulating user browsing behavior. Users customize the simulation by selecting options in the pop-up designer window, which contains both the basic and the advanced actions needed to build a task. For more detailed instructions, users can visit the official Octoparse site and consult its extensive tutorials, which include both video demonstrations and a detailed manual.

Cloud Service for Scraping

To scrape data faster and on a larger scale, Octoparse provides a Cloud Service, which scrapes the web with multiple cloud servers running the task simultaneously, based on distributed computing. To obtain the Cloud Service, upgrade your account from the basic edition to a paid edition, either Standard Edition or Professional Edition. You can then schedule your cloud-based tasks and perform the extraction with multiple threads working concurrently. Note that, for now, Standard Edition limits you to 6 concurrent threads (14 in Professional Edition). Beyond that, Octoparse can add cloud servers to meet growing crawling or scraping needs; contact the Octoparse team if you want a more advanced Cloud Service. Extraction scheduling is also offered.
The speed of cloud-based scraping really impressed me when I tried a simple link extraction: over 3,000 links in 1.5 minutes. The cloud-based service is truly good news for users with higher demands on scraping volume or crawling time.

Octoparse V6.4 has also added task scheduling to the Start Interface, and users can apply saved schedule settings to a batch of tasks.

Advanced Options for Scraping

To optimize tasks, Octoparse provides Advanced Options that help resolve issues occurring during scraping, such as XPath mismatches, missing data, and time delays. Users can improve the quality of scraped data with these tools, including:
• Regexout working
• XPath editing
• Execution timeout settings
• Scrolling down
• Pageanchor hook
• etc.
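
To illustrate the kind of XPath mismatch these options address, here is a minimal Python sketch that is independent of Octoparse itself. The two page snippets and the class name `price` are made up for illustration, and the standard library's `xml.etree.ElementTree` supports only a subset of XPath. A positional path stops matching when the page layout shifts, whereas a path keyed on an attribute keeps working.

```python
import xml.etree.ElementTree as ET

# Two hypothetical versions of the same product page (kept well-formed for ElementTree).
PAGE_V1 = "<html><body><div><span class='price'>19.99</span></div></body></html>"
PAGE_V2 = (
    "<html><body><div>Sale!</div>"
    "<div><span class='price'>17.99</span></div></body></html>"
)

POSITIONAL_XPATH = "./body/div[1]/span"      # depends on the order of elements
ATTRIBUTE_XPATH = ".//span[@class='price']"  # depends only on the class attribute


def extract(page: str, xpath: str):
    """Return the text of the first element matching xpath, or None if nothing matches."""
    node = ET.fromstring(page).find(xpath)
    return node.text if node is not None else None


if __name__ == "__main__":
    for name, page in (("v1", PAGE_V1), ("v2", PAGE_V2)):
        print(name, extract(page, POSITIONAL_XPATH), extract(page, ATTRIBUTE_XPATH))
```

On the second page the positional path finds nothing because a new block was inserted ahead of the price, which is the situation where editing the XPath by hand pays off.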

RegEx Tool

Octoparse provides a built-in regex generator, so users needn’t wrestle with character or string matching themselves. To refine a scraped field, users only need to fill in the matching conditions in the RegEx Tool. After you click the Apply button, the tool automatically generates a matching regular expression.
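
For readers curious what refining a field with a regular expression looks like, the sketch below does it in Python. The field value and the pattern are illustrative assumptions, not output of the Octoparse RegEx Tool.

```python
import re

# Hypothetical raw field as it might come out of a scrape.
raw_price = "Price: $1,299.00 (incl. tax)"

# A dollar amount: digits with optional thousands separators and optional cents.
PRICE_PATTERN = re.compile(r"\$(\d{1,3}(?:,\d{3})*(?:\.\d{2})?)")


def refine_price(text: str):
    """Return the first dollar amount found in text as a float, or None if absent."""
    match = PRICE_PATTERN.search(text)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))


if __name__ == "__main__":
    print(refine_price(raw_price))
```

Returning None for a field without a price keeps missing data visible instead of silently turning it into zero.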

API

The Octoparse API lets users connect their own systems to their scraped data in real time. You can either import the Octoparse data into your own database or use the API to request access to your account’s data. Just configure the rule for your task, run it in the cloud, and the Octoparse cloud servers will do the rest. Data requested through the API is returned as XML.

The Octoparse API lets the user extract data by time window: from one date-time to another, with a maximum interval of 1 hour. That is not very convenient. Insert date-time markers into the link parameters as follows:

In this link, I’ve highlighted the timing (8 am till 9 am in this case) to distinguish it from the white space notation (%20). The cloud task extraction pane shows only the end time and the average time, so it is a hassle to calculate the start time manually:

{start time} = {complete time} – {average used time}
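
The formula above, combined with the one-hour limit, can be sketched in Python as follows. This is only an illustration: the function names, the example dates, and the `YYYY-MM-DD HH:MM:SS` timestamp format are assumptions, not part of the Octoparse API. The sketch derives the start time, splits the run into windows of at most one hour, and encodes each timestamp so that the space becomes `%20`.

```python
from datetime import datetime, timedelta
from urllib.parse import quote

MAX_WINDOW = timedelta(hours=1)  # maximum interval per request, as stated above


def start_time(complete_time: datetime, average_used_time: timedelta) -> datetime:
    """{start time} = {complete time} - {average used time}."""
    return complete_time - average_used_time


def hourly_windows(start: datetime, end: datetime):
    """Split the range from start to end into consecutive windows of at most one hour."""
    windows = []
    cursor = start
    while cursor < end:
        upper = min(cursor + MAX_WINDOW, end)
        windows.append((cursor, upper))
        cursor = upper
    return windows


def encode_timestamp(moment: datetime) -> str:
    """URL-encode a timestamp; the date format used here is an assumption."""
    return quote(moment.strftime("%Y-%m-%d %H:%M:%S"), safe=":")


if __name__ == "__main__":
    # Hypothetical values read from the cloud task extraction pane.
    complete = datetime(2016, 5, 1, 10, 30)
    average = timedelta(hours=2, minutes=30)
    for lower, upper in hourly_windows(start_time(complete, average), complete):
        print(encode_timestamp(lower), encode_timestamp(upper))
```

Each pair of encoded timestamps would then go into the link parameters of one request, so a run longer than an hour needs several requests.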

Version 6.4 adds two API interfaces for users:

Get Task Data based on storage time
Get Task Data from the last Cloud Extraction

A successful HTTP request with a correct access token and taskID returns the same JSON-formatted data as the paging data in 3.1.

Proxying

Does it drive you crazy when your IP address gets banned and you can no longer access a website because you scrape it frequently? It happens often, especially when you extract data from business directories, which apply strict bans to recurring IPs. Octoparse, however, lets you scrape these websites by rotating anonymous HTTP proxy servers. In cloud extraction mode, Octoparse uses more than 500 third-party proxies for automatic IP rotation.

For local extraction, you have to add a list of external proxy addresses manually and configure them for automatic rotation. Octoparse’s tutorials explain how to include IP rotation in a scraping project.

IPs are rotated at a time interval that you set. In this way, you can extract data from the website without the risk of getting your IP address banned, provided you do not overload the site’s bandwidth.
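
Interval-based rotation of this kind can be sketched in a few lines of Python. This is a generic illustration, not Octoparse's implementation: the `ProxyRotator` class is hypothetical, and the addresses come from a range reserved for documentation.

```python
import itertools
import time


class ProxyRotator:
    """Cycle through a proxy list, switching to the next one after a fixed interval."""

    def __init__(self, proxies, interval_seconds, clock=time.monotonic):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._cycle = itertools.cycle(proxies)
        self._interval = interval_seconds
        self._clock = clock  # injectable so the rotation can be tested without waiting
        self._current = next(self._cycle)
        self._switched_at = clock()

    def current(self):
        """Return the active proxy, rotating first if the interval has elapsed."""
        now = self._clock()
        if now - self._switched_at >= self._interval:
            self._current = next(self._cycle)
            self._switched_at = now
        return self._current


if __name__ == "__main__":
    # Hypothetical proxy addresses, for illustration only.
    rotator = ProxyRotator(["203.0.113.10:8080", "203.0.113.11:8080"], interval_seconds=30)
    print(rotator.current())
```

A scraper would call `current()` before each request, so the proxy changes only when the configured interval has passed rather than on every request.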

Smart Mode

Octoparse 6.2 added a brand-new service, Smart Mode, which gives users a much easier way to access data. It converts a web page into a structured table of data instantly after the user enters or pastes a URL. It is really simple and absolutely free of charge!

About the New Version 6.4

Octoparse has released its latest version, V6.4, in which the company has tried to improve the user experience of the Octoparse Extraction Service. In Version 6.4, authorization is no longer required to export data to MySQL. To make tracing easier, an Extraction Failure Report lists detailed information from the scraped website for Local Extraction. And to meet higher user demand, the pagination threshold can now be set to a higher value.

Support
Customer support is responsive and assists paying and free plan users equally. Support is accessible via email and Skype (with no limit for free users).