Extract

Convert websites into structured, usable data

Just enter the URL where your data is located and Import.io takes you there. If your data is behind a login or behind an image, or if you need to interact with a website, Import.io has you covered. Once you are at a web page, you simply point and click on the items of interest and Import.io will learn to extract them into your dataset. Once extractors are fully trained, they can be set to run on a schedule over multiple web pages, creating large datasets ready for transformation, analysis, and integration into your applications and internal systems.

Point-and-click training

Import.io makes it easy for you to show us how to extract data from a page. Simply select a column in your dataset, and point at the item of interest on the page.

Interactive workflows

Record sequences of the actions that you need to perform on a website. For example, you may need to navigate between pages, enter a search term or change a default sort order on a list.

Easy scheduling

Set up your web data extraction to run “on the regular” using pre-set or custom schedules: weekly, daily, hourly, whatever your business needs. Set it and forget it.

Reliable, high quality data...every time

Machine Learning auto-suggest

When you first enter a URL, Import.io attempts to auto-train your extractor by using advanced machine learning techniques. Go from URL to dataset with one click. And we’re constantly getting better.

Download images and files

Download images and documents along with all the web data in one run. Retailers pull product images from manufacturers, and data scientists build training sets for computer vision.

Data behind a login

Authenticated extraction allows you to get data that is only available after logging into a website. You provide the appropriate credentials and Import.io will do the rest.

Website screenshots

Import.io helps ensure compliance and accuracy by allowing you to capture and save screenshots of every page from which you extracted data. This feature is easy to access and creates an auditable record of the extracted data.

Robots.txt

Choose to obey a website’s robots.txt file and avoid gathering data from pages that the website owner doesn’t want crawled.
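To show what obeying robots.txt means in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. It is a generic illustration, not Import.io's implementation. The robots.txt content, the `example-bot` user agent, the `allowed_urls` helper, and the example.com URLs are all made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would fetch it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def allowed_urls(robots_txt, user_agent, urls):
    """Return only the URLs that the robots.txt rules permit for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]

urls = [
    "https://example.com/products/1",
    "https://example.com/private/report",
]

# The /private/ page is filtered out because the rules disallow it.
print(allowed_urls(ROBOTS_TXT, "example-bot", urls))
```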

Notifications

Be notified as soon as data is extracted. Receive email notifications or use webhooks to make sure that you always know when the latest data is available.

Operate at scale, web scale

Multiple pages

Extract data from multiple pages. We automatically detect paginated lists, or you can explicitly click on the “next” page to help us learn.

List page, detail page

List pages contain links to detail pages that contain more data. Import.io allows you to join these into a chain, pulling all of the detail page data at the same time.

URL generator

Use patterns such as page numbers and category names to automatically generate all of the URLs that you need in seconds.
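The idea behind pattern-based URL generation can be sketched in a few lines of Python. This is only an illustration of the concept, not how Import.io builds URLs internally. The `generate_urls` helper, the URL template, and the category names are hypothetical.

```python
from itertools import product

def generate_urls(template, **values):
    """Fill a URL template with every combination of the given values."""
    keys = list(values)
    combos = product(*(values[key] for key in keys))
    return [template.format(**dict(zip(keys, combo))) for combo in combos]

# Hypothetical site layout: two categories, three pages each.
urls = generate_urls(
    "https://example.com/{category}?page={page}",
    category=["shoes", "hats"],
    page=range(1, 4),
)

for url in urls:
    print(url)
```

Two categories and three page numbers yield six URLs, one for each combination.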

Auto-optimize extractors

Whenever you save an extractor, Import.io automatically optimizes it to run in the shortest time possible.

Multi-URL training

Train the same extractor on multiple pages. When a website shows variations in its data across pages of the same type, you want to train against all of those variations.

Upload custom datasets

Combine web data with data from sources outside of Import.io. Simply upload a CSV or Excel file and it becomes a table that can be used in any of your reports.

Advanced options

Country-specific extraction

Control the geographical location from which your web data extraction is running. Extract pricing data in a local currency. All countries supported.

PII masking

Automatically remove personally identifiable information (PII) when extracting web data. We can detect and redact PII such as names, phone numbers and addresses.
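As a rough illustration of redaction, the sketch below masks phone-number-like strings with a regular expression. It is a simplified assumption about how masking could work, not Import.io's detection method: the pattern and the `mask_phone_numbers` helper are hypothetical, the pattern can also match other long digit sequences, and names and addresses generally need more than a regex to detect.

```python
import re

# Loose pattern for phone-like strings: an optional "+", then at least nine
# digits in total, optionally separated by spaces, dots, dashes or parentheses.
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def mask_phone_numbers(text, placeholder="[REDACTED]"):
    """Replace anything that looks like a phone number with a placeholder."""
    return PHONE_RE.sub(placeholder, text)

sample = "Call 555-123-4567 or +1 (555) 987-6543 today"
print(mask_phone_numbers(sample))
```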

XPath & RegEx

Write your own custom extraction rules using XPath and RegEx. This can be especially useful for pulling hidden data and setting up advanced configurations.
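For readers new to these two tools, here is a small Python sketch of an XPath-style rule combined with a regular expression. It uses only the standard library and is not Import.io's rule syntax. The markup, the class names, and the `extract_prices` helper are invented for the example, and `xml.etree.ElementTree` supports only a subset of XPath and needs well-formed markup, so real-world HTML would typically call for a dedicated HTML parser.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed product listing.
PAGE = """
<div>
  <div class="product"><span class="name">Widget</span><span class="price">Price: $19.99</span></div>
  <div class="product"><span class="name">Gadget</span><span class="price">Price: $5.00</span></div>
</div>
"""

# RegEx rule: capture the numeric part of a dollar amount.
PRICE_RE = re.compile(r"\$(\d+\.\d{2})")

def extract_prices(markup):
    """Select product nodes with XPath, then clean the price with a regex."""
    root = ET.fromstring(markup)
    rows = []
    for node in root.findall(".//div[@class='product']"):
        name = node.findtext("span[@class='name']")
        raw_price = node.findtext("span[@class='price']", default="")
        match = PRICE_RE.search(raw_price)
        rows.append({"name": name, "price": float(match.group(1)) if match else None})
    return rows

print(extract_prices(PAGE))
```

The XPath expression picks out where the data lives on the page, and the regular expression trims the selected text down to the value you actually want.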

Web scraping FAQ

What is Web Scraping?

Web scraping (or screen scraping) is a way to get data from a website. By using a web scraping tool, sometimes called a website scraper, you’re able to extract lots of data through an automated process. The tool works by sending a request to each target page, then combing through the returned HTML for specific items. Without the automation, the process of taking that data and saving it for future use would be time-consuming. Web scraping tools offer a range of features for scraping web pages and converting the data into handy formats you can then use.
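The "combing through the HTML" step can be sketched with Python's built-in `html.parser`. This is a bare-bones illustration of the general technique, not a description of any particular tool. The HTML snippet and the `LinkCollector` class are hypothetical, and the page is hard-coded rather than downloaded so the example runs without network access.

```python
from html.parser import HTMLParser

# Hypothetical page content; a real scraper would download this first.
HTML = """
<ul>
  <li><a href="/item/1">First item</a></li>
  <li><a href="/item/2">Second item</a></li>
</ul>
"""

class LinkCollector(HTMLParser):
    """Collect the text and target of every link on a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append({"text": "".join(self._text).strip(), "url": self._href})
            self._href = None

collector = LinkCollector()
collector.feed(HTML)
print(collector.links)
```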

Why Use Web Scraping?

With so much information now online, getting that data can often prove the difference between success and stagnation. Web data can be extremely valuable, not only because it is accurate but also because it is kept up to date. With the right data in hand, you can run the analysis you need to uncover new insights.

Do You Need Special Training for Web Scraping?

Using code to extract data can seem intimidating at first, but no extensive coding experience is needed when using Import.io. Some training will be helpful, such as the point-and-click training mentioned above, but Import.io provides an easy-to-use interface that allows you to perform a variety of data scraping tasks, all without needing to be deeply familiar with coding or machine learning.