Web Robots Documentation

In the Web Robots framework, robot scripts are written in JavaScript as a set of simple step functions. Besides our own step functions, we also include the following libraries, which can be used in any step within a robot:

Functions

done([int delay])

done() notifies the extension that the current step has completed execution so the extension can execute the next step. IMPORTANT: this function must be called ONLY ONCE during step execution. Typically it is placed at the end of a step. done() must be skipped when the step clicks a DOM element that causes another page to load (for example, clicking a button to submit a form); in this case done() is called automatically by the page-loading event.

delay – delay in milliseconds, optional. Used to slow down the robot’s crawling rate.
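As a sketch of the call pattern (done is stubbed out below; in a real robot it is provided by the framework):

```javascript
// Sketch of the done() contract: exactly one call per step, optionally
// with a delay. "done" is a stub standing in for the framework function.
const doneCalls = [];
const done = (delay) => doneCalls.push(delay);

const steps = {};
steps.start = function () {
  // ... extract data from the current page here ...
  done(500); // slow crawling down by waiting 500 ms before the next step
};

steps.start();
```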

next(String url, String nextStep[, Object passedData])

next() adds a new step to the execution queue.

url – a URL to load for this particular step. If loading is not necessary, an empty string “” must be passed. Use “” when the previous step already loads the web page needed for this step (usually by clicking a button or submitting a form).
nextStep – a string containing the name of the step to execute.
passedData – any data that should be passed to the next step, optional.

IMPORTANT: The Web Robots system uses the LIFO (last in, first out) principle for queue execution. If many next() statements are generated, the queue manager will execute the latest one first.
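The LIFO ordering can be illustrated with a plain JavaScript stack (the names below are illustrative, not the framework’s internals):

```javascript
// next() pushes steps onto a stack; the queue manager pops the newest first.
const queue = [];
const next = (url, stepName) => queue.push({ url, stepName });

next("https://example.com/page1", "parse");
next("https://example.com/page2", "parse");
next("https://example.com/page3", "parse");

const executionOrder = [];
while (queue.length) executionOrder.push(queue.pop().url);
// executionOrder: page3 first, page1 last
```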

fork(String url, String nextStep[, Object passedData])

fork() is like next(), except that it starts a new robot from that particular step. fork() is useful when running large robots whose parts can run in parallel. In development mode fork() behaves like next() for development and troubleshooting purposes, while actual forking is performed on cloud instances in production mode.

url – a URL to load for this particular step. Must be a valid URL, not an empty string.
nextStep – a string containing the name of the step to execute.
passedData – any data that should be passed to the next step, optional.

IMPORTANT: fork() should be used only once during a robot run. A newly forked robot starts with default settings, so any settings regarding retries, proxy, skipVisited, etc. should be re-applied after fork(). Fork can launch many parallel robots, which can generate significant load on the target website.
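A sketch of re-applying settings in the forked entry step (fork, setRetries and done are stubs here, only recording the call order; the step names and values are made up for illustration):

```javascript
// A forked robot starts with default settings, so the forked entry step
// re-applies them before doing any work.
const log = [];
const fork = (url, step) => log.push("fork:" + step);
const setRetries = (interval, count, total) => log.push("retries:" + interval);
const done = () => log.push("done");

const steps = {};
steps.start = function () {
  fork("https://example.com/products", "parseProducts");
  done();
};
steps.parseProducts = function () {
  setRetries(30000, 5, 300); // settings do not survive fork(): re-apply here
  done();
};

steps.start();
steps.parseProducts(); // in production this runs in the forked robot
```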

emit(String tableName, array Data)

emit() sends extracted data to the database. It can be called several times during a single step.

tableName – the table name to which data will be saved. One robot can emit data into multiple tables; for example, during execution a robot may emit data to “Users” and “Products” tables.
Data – an array of JSON objects. Even when emitting a single JSON object, put it into an array.

IMPORTANT: Always emit an array variable. If there is a need to emit a single variable, still wrap it in array notation. For example: emit("Strings", [myString]).
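A sketch of the array rule (emit is stubbed to record rows per table; in a robot the framework sends them to the database):

```javascript
// emit() always receives an array, even for a single object.
const tables = {};
function emit(tableName, rows) {
  if (!Array.isArray(rows)) throw new Error("emit expects an array");
  (tables[tableName] = tables[tableName] || []).push(...rows);
}

const user = { name: "Alice" };
emit("Users", [user]); // a single object is still wrapped in an array
emit("Products", [{ sku: "A-1" }, { sku: "A-2" }]);
```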

click(var selector)

Performs a click on a DOM element found by selector. click() simulates a real click better than jQuery’s .click() function because it fires the series of events that happen during a click by a person (mousedown, click, mouseup, etc.).

selector – this parameter can be one of two types: a string CSS selector of the item to click on, or a DOM object to click on.

IMPORTANT: Do not use done() in a step where click() results in a new page loading.
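A sketch of how the two selector types can be handled (document is faked below so the sketch runs outside a browser; the helper name is an assumption, not the framework’s internals):

```javascript
// A CSS selector string is resolved to an element first; a DOM object
// is used as-is.
const button = { tag: "button" };
const document = { querySelector: (sel) => (sel === "#submit" ? button : null) };

function resolveClickTarget(selector) {
  return typeof selector === "string" ? document.querySelector(selector) : selector;
}

const fromString = resolveClickTarget("#submit"); // CSS selector string
const fromObject = resolveClickTarget(button);    // DOM object passed directly
```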

waitFor(String selector[, int maxWaitTime], function actions)

waitFor() is used to wait for a specific DOM element to appear on a page and then perform some actions. Useful when elements appear dynamically some time after the initial page load.

selector – a string CSS selector expected to appear in the DOM.
maxWaitTime – maximum waiting time in milliseconds, optional. The default value of 10,000 ms is used if this parameter is skipped.
actions – a function that is executed after the selector appears in the DOM.
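The semantics can be illustrated against a fake page: poll until the selector appears or the time budget runs out, then run the callback. All names below are assumptions, not the framework’s implementation:

```javascript
// Poll a fake page for a selector; run actions once it appears.
function waitForSketch(page, selector, maxTicks, actions) {
  for (let t = 0; t < maxTicks; t++) {
    if (page.has(selector)) { actions(); return true; }
    page.tick(); // stands in for an asynchronous polling delay
  }
  return false; // selector never appeared within the time budget
}

// Fake page where ".results" appears after a few polling ticks
function makePage(appearAt) {
  let ticks = 0;
  return { has: (s) => s === ".results" && ticks >= appearAt, tick: () => ticks++ };
}

let fired = false;
const found = waitForSketch(makePage(3), ".results", 10, () => { fired = true; });
const timedOut = !waitForSketch(makePage(50), ".results", 10, () => {});
```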

setProxy([string server, int port, array bypassList])

IMPORTANT: This is an advanced function. It can leave the browser stuck with proxy settings if the robot does not finish running cleanly, e.g. if the developer closes the Web Robots IDE extension during a robot run.

Configures Chrome to use the proxy specified in the server string. If port is omitted, it is set to 8888. Proxy settings stay in effect until setProxy() is called again or the robot finishes its run. setProxy() without any parameters resets proxy settings to system defaults (no proxy).

server – proxy server, optional.
port – proxy server’s port, optional.
bypassList – an array of strings, optional. A list of domains that should bypass the proxy. See Chrome’s documentation of the bypassList parameter for more insight into how to use it.
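The parameter defaults can be sketched as follows (the config object shape is an illustration, not Chrome’s actual proxy API, and the server name is made up):

```javascript
// port falls back to 8888; no arguments means "reset to system defaults".
function buildProxyConfig(server, port, bypassList) {
  if (!server) return { mode: "system" }; // setProxy() with no parameters
  return { server, port: port || 8888, bypassList: bypassList || [] };
}

const explicit = buildProxyConfig("proxy.example.com", 3128, ["*.example.com"]);
const defaulted = buildProxyConfig("proxy.example.com"); // port becomes 8888
const reset = buildProxyConfig();
```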

clearCookies()

Clears browser cookies for current page.

Example:

clearCookies();

setRetries(int interval, int count, int total)

Modifies the robot’s step retry behaviour from the default values. The defaults are a 60,000 ms retry interval, 3 retries per step, and 150 retries total. Following this logic, if a step is retried 3 times and still fails, the robot proceeds to the next step. If the robot accumulates 150 retries during a run, it stops and marks the run as failed.

interval – retry interval in milliseconds.
count – maximum retry count on a single step.
total – cumulative retry limit for the robot run. If this limit is reached, the robot stops and marks the run as failed.

Best practice is to increase the retry parameters only if the source is unreliable and reloading the page can yield results. If the source is reliable and errors are caused by error-prone JavaScript in the robot, more retries will not help.

Example:

steps.start = function () {
    next("http://webrobots.io", "stepTwo");
    // Retry every 10 seconds, up to 10 times per step, and stop if a total of 6000 retries is reached.
    setRetries(10000, 10, 6000);
    // All subsequently executed steps will adhere to the new retry policy.
    done();
};
steps.stepTwo = function () {
    console.log("This step will be retried in 10 seconds if there is no done() event");
    done();
};

isNumber(var number)

Returns true if the variable number can be parsed as a real number, and false otherwise.
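A plausible sketch of the behaviour (an assumption, not the framework’s actual implementation): the value counts as a number when it parses as a finite numeric value.

```javascript
// true when the value parses as a finite real number.
function isNumberSketch(value) {
  if (value === null || value === "") return false;
  return !isNaN(parseFloat(value)) && isFinite(value);
}
```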

setSettings(Object settings)

skipVisited – boolean setting that controls behaviour when a next() statement wants to open an already visited page. Applies only to next() steps issued after the setting is enabled. Default value: false.
respectRobotsTxt – boolean setting that controls the robot’s behaviour with respect to the robots.txt file. If set to true, the robot reads and parses the robots.txt file from the target domain and starts respecting its directives. Applies only to next() steps issued after the setting is enabled. Default value: false.
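A sketch of the call shape (the stub below merely illustrates the settings object and its defaults, assuming new values are merged over the current ones; the real function changes live robot behaviour):

```javascript
// Defaults for both documented settings, patched by setSettings().
let settings = { skipVisited: false, respectRobotsTxt: false };
const setSettings = (patch) => { settings = { ...settings, ...patch }; };

setSettings({ skipVisited: true }); // later next() calls skip visited pages
```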