How to Integrate Copyscape’s API into Your PHP Applications

Content scraping is an unfortunate consequence of running an online business. Scrapers take content from your website and post it on their own sites in an effort to improve their search engine rankings. With Copyscape, you can find out who is copying your original content and then use the DMCA process to ask the site owner to take it down.

Copyscape has a manual search function, but the Copyscape API can automate many of those searches for you. The API lets you connect to Copyscape, scan the Internet for duplicate content, and retrieve a list of websites whose content matches yours. You can call the API from PHP with standard tools, which replaces many of the manual searches you would otherwise run for your site. The automation saves time when several sites are copying your content and when you publish content from third-party copywriters.

Overview of the Copyscape API

For experienced PHP programmers, the Copyscape API is neither lengthy nor difficult. The API returns XML to the calling application, which PHP can parse with built-in functions. The response includes the top sites with the most copied content, the URL of each infringing page, and the percentage of words that match.

Before you can use the Copyscape API, you need an account. You can sign up for a new account from the main Copyscape.com home page. Creating an account is free, but you need to add money to the account to use the API. Each call to the API costs $0.05, so factor this cost into your budget when you decide how many API calls to automate. Unfortunately, there is no sandbox for testing your code, so building an application that calls the API costs money. It should cost no more than $5.00 to code and test the application, because the API has few function calls. Once you have the syntax down, a full test run should cost about a dollar.

Coding the API into a PHP Web Application

After you deposit some money into the account, you need your API key. At the top of your PHP class or script, add the following code, which defines the constants for the Copyscape API calls:
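A minimal sketch of those definitions might look like the following. The class name CopyscapeClient, the property names, and the endpoint address are assumptions made for illustration, and the credentials are placeholders for your own values.

```php
<?php
// Hypothetical wrapper class; the class and property names are illustrative.
class CopyscapeClient
{
    // Assumed endpoint; confirm it against the API documentation in your account.
    private $apiurl = 'https://www.copyscape.com/api/';

    // Placeholder credentials: replace these with your own values.
    private $username = 'username';
    private $apikey = 'apikey_from_copyscape';

    // Holds the request URL that the cURL calls read as $this->url.
    private $url;
}
```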

The “username” is the username you chose when you signed up for your Copyscape account. The “apikey_from_copyscape” is the API key listed in your account settings. These values never change. This example is taken from a Copyscape API class, so the variables are defined as private. If you are not using a class, remove the “private” keyword from the code.

The first step in calling the API is to build the URL that points to the Copyscape API and tells it which content page to search for. The URL can be built dynamically, but for instructional purposes, the following example is mostly static except for the variables defined earlier:
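As a hedged sketch, the URL could be assembled as shown here. The endpoint, the “u” and “k” parameter names, the “csearch” operation value, and the example page address are assumptions that should be checked against the API documentation in your account; plain variables stand in for the private class properties so the snippet runs on its own.

```php
<?php
// Illustrative placeholders; in a class these would be the private properties.
$apiurl   = 'https://www.copyscape.com/api/'; // assumed endpoint
$username = 'username';
$apikey   = 'apikey_from_copyscape';
$page     = 'http://www.example.com/my-article.html'; // hypothetical page to check

// u = username, k = API key, o = operation (assumed names and value),
// q = page URL, c = number of results, f = response format.
$url = $apiurl
    . '?u=' . urlencode($username)
    . '&k=' . urlencode($apikey)
    . '&o=csearch'
    . '&q=' . urlencode($page)
    . '&c=5'
    . '&f=xml';

echo $url . PHP_EOL;
```

Inside a class, the same expression would be assigned to $this->url so that the cURL code can read it.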

Notice that in this example, the values for the API URL, the username, and the API key are taken from the variables defined earlier. These values never change, so you can also hardcode them in your URL variable. The “q” query string value is the URL you want to send to the API. Each value is wrapped in the “urlencode” function, which escapes it so the target server can read it as part of a URL query. The “c” query string value tells the API how many results you want returned, and in this case, it is set to 5. The “f” query string value specifies XML, so you can parse the XML from the response.

The above statement searches by URL, but you can also define a query that does a full text search. For instance, if you want to search for a paragraph from a document stored on your computer, you can still use the API, but you remove the “q” parameter and set the “o” operation to “t,” which specifies a full text search. A full text search must be sent as an HTTP POST request, which is described later. The following code sets up a URL for a full text search:

Using cURL to Obtain API Results

Now that the URL is set up, you use the cURL library to call the API. Most hosting providers install cURL by default, so you usually do not need to add it to your hosting account. The following code obtains the XML from the Copyscape API:

//search by url
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, $this->url);
curl_setopt($curl, CURLOPT_TIMEOUT, 60);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true); //return the XML instead of printing it
$response = curl_exec($curl);
if (curl_errno($curl)) //the request itself failed, for example a timeout
{
    $error = curl_error($curl);
    curl_close($curl); //release the handle before returning
    return 'Curl Error: ' . $error;
}
curl_close($curl);

The XML is stored in the “$response” variable. The code above queries the Copyscape API for one of your URLs and returns an error message if the request fails. To post a query for a full text search, use the following code:

//search by text
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, $this->url);
curl_setopt($curl, CURLOPT_TIMEOUT, 60);
curl_setopt($curl, CURLOPT_POST, true); //send the request as an HTTP POST
curl_setopt($curl, CURLOPT_POSTFIELDS, urlencode($textsearch)); //$textsearch holds the text to check
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
$response = curl_exec($curl);
if (curl_errno($curl)) //the request itself failed, for example a timeout
{
    $error = curl_error($curl);
    curl_close($curl); //release the handle before returning
    return 'Curl Error: ' . $error;
}
curl_close($curl);

Parsing the XML and Displaying the Results

PHP 5 includes the simplexml_load_string function, which parses XML for you. After you retrieve the XML, you loop through each XML element and either print the results or process them in your own application. The following code parses each XML result using the “$response” variable obtained earlier from the cURL code:
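A minimal sketch of such a loop follows. The element names (result, url, title, textsnippet, percentmatched) and the sample XML are assumptions for illustration, so compare them with a real response before relying on them. The hardcoded “$response” string stands in for the value returned by cURL.

```php
<?php
// Stand-in for the XML returned by cURL; the element names are assumptions.
$response = <<<XML
<response>
  <result>
    <url>http://www.example.org/copied.html</url>
    <title>Copied page</title>
    <textsnippet>... matching text ...</textsnippet>
    <percentmatched>85</percentmatched>
  </result>
</response>
XML;

$xml = simplexml_load_string($response);
$html_results = '';
$count = 0;

if ($xml !== false) {
    foreach ($xml->result as $result) {
        if ($count >= 10) { // stop after 10 results
            break;
        }
        // Escape every value before placing it in the HTML output.
        $html_results .= '<p><a href="' . htmlspecialchars((string) $result->url) . '">'
            . htmlspecialchars((string) $result->title) . '</a> ('
            . (int) $result->percentmatched . '% matched)<br>'
            . htmlspecialchars((string) $result->textsnippet) . '</p>';
        $count++;
    }
}

echo $html_results . PHP_EOL;
```

Escaping each value with htmlspecialchars is a design choice of this sketch: the titles and snippets come from third-party pages, so they should not be written into your own page unescaped.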

The code first calls the “simplexml_load_string” function to parse the XML into its elements. It then loops through each result and builds the HTML used to display the results in your Web application. If several sites are copying you, the list can be very long, so the count stops at 10 to avoid overloading the server with too many results. Each result includes a snippet of the duplicated text, the percentage of words copied, the URL of the site copying you, and the title of the page where your copied content is found. The “$html_results” variable contains the formatted HTML to display in the user’s browser.

Getting Your Copyscape API Balance

Before any query can succeed, you must have enough money in your account to cover it. Copyscape includes a function call that returns your current balance, so you can verify that the account is funded before querying. Querying Copyscape with a zero balance results in an error, which you can avoid by checking your balance first. You query your balance with the same methods you use to query for duplicate content. Only the URL parameters change, and the API responds with a single XML element that contains the balance.

The first step in getting your balance is to build the URL the same way you build it for the duplicate content queries. The only difference is that the operation (the “o” parameter) is set to “balance.” The following code sets up the URL for the balance query:
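A sketch of the balance URL, under the same assumptions as before (the endpoint and the “u” and “k” parameter names are not confirmed here, and the credentials are placeholders), could look like this:

```php
<?php
// Illustrative placeholders; in a class these would be the private properties.
$apiurl   = 'https://www.copyscape.com/api/'; // assumed endpoint
$username = 'username';
$apikey   = 'apikey_from_copyscape';

// Same structure as the search URL, with the operation set to "balance".
$url = $apiurl
    . '?u=' . urlencode($username)
    . '&k=' . urlencode($apikey)
    . '&o=balance'
    . '&f=xml';

echo $url . PHP_EOL;
```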

Then, you query the API with the same cURL instructions you use for duplicate content searches:

//get the balance
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, $this->url);
curl_setopt($curl, CURLOPT_TIMEOUT, 30);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true); //return the XML instead of printing it
$response = curl_exec($curl);
if (curl_errno($curl)) //the request itself failed, for example a timeout
{
    $error = curl_error($curl);
    curl_close($curl); //release the handle before returning
    return 'Curl Error: ' . $error;
}
curl_close($curl);

After the XML is retrieved, you parse the XML and get the results from the XML elements:

$xml = simplexml_load_string($response);
//format the balance as a dollar figure with two decimal places
$balance = number_format((float) $xml->value, 2, '.', ',');
$totalqueriesleft = (int) $xml->total; //queries left on the account
$totallefttoday = (int) $xml->today; //queries left for today

The code above gets the balance and formats it as a dollar figure. It also reads the number of queries still available to you in total and for the day, so you know whether your balance and allotted queries allow you to run a search on the Copyscape API.
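One way to use these values is to wrap the check in a small helper that runs before each search. This is a sketch only: canAffordQuery is a hypothetical function name, the default cost is the $0.05 per-call price quoted earlier, and the sample XML (including its root element name) is made up to match the elements read above.

```php
<?php
// Returns true when the balance XML shows enough money and at least one
// query left for today. $costPerQuery defaults to the quoted $0.05 price.
function canAffordQuery($balanceXml, $costPerQuery = 0.05)
{
    $xml = simplexml_load_string($balanceXml);
    if ($xml === false) {
        return false; // unparseable response: treat it as "do not query"
    }
    return (float) $xml->value >= $costPerQuery
        && (int) $xml->today > 0;
}

// Sample response with made-up values, shaped like the elements read above.
$sample = '<remaining><value>1.25</value><total>25</total><today>25</today></remaining>';
var_dump(canAffordQuery($sample));
```

An automated job could call this helper with the balance response and skip the search when it returns false.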

These are the basic functions for querying Copyscape. Copyscape offers other functions as well, which are listed in your private account settings after you log in. You can automate the queries, but make sure each automated query runs only after you have verified your balance and confirmed that you have the budget for the searches.

Written by David Walsh

David Walsh is the editor in chief here at Web Hosting Search. Having been in the industry for many years now he knows pretty much everything about everything. At least that's what he keeps telling everyone at the office. So, don't hesitate to drop him a line if you've got a question - david(a)webhostingsearch.com.