These days, data analysis projects often seem to involve the same steps:

1. Fetch documents (usually HTML, JSON or XML) from a remote web site.
2. Extract data from those documents, and save the extracted data as rows in a CSV file.
3. For each row in the CSV file, fetch more data from a different remote web site.
4. Parse that data and save the extracted data as rows in another CSV file.
5. Publish the data.

Here's a bit more information on the steps involved:

Fetching with cURL

To fetch data, PHP's curl_init, curl_setopt, curl_exec and curl_getinfo functions all work nicely. I find it useful to wrap them in a CurlClient class, with subclasses for any non-standard web services that need special handling. This makes it easy to set all the basic options, handle rate limiting, and re-use the same cURL handle, which allows connections to be kept open between requests.
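A minimal sketch of such a wrapper (the class name CurlClient is from the text; the particular options, delay and error handling are illustrative):

```php
class CurlClient
{
    protected $curl;
    protected $delay = 1; // seconds between requests, for rate limiting
    protected $lastRequest = 0;

    public function __construct()
    {
        // One handle for the lifetime of the client, so connections
        // can be kept open between requests
        $this->curl = curl_init();

        curl_setopt_array($this->curl, array(
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_FOLLOWLOCATION => true,
            CURLOPT_ENCODING => 'deflate,gzip',
        ));
    }

    // Subclasses can override this for services that need special handling
    public function get($url)
    {
        // Basic rate limiting: wait until $delay seconds have passed
        $wait = $this->lastRequest + $this->delay - time();
        if ($wait > 0) {
            sleep($wait);
        }
        $this->lastRequest = time();

        curl_setopt($this->curl, CURLOPT_URL, $url);
        $response = curl_exec($this->curl);

        if (curl_getinfo($this->curl, CURLINFO_HTTP_CODE) != 200) {
            throw new Exception('Request failed: ' . $url);
        }

        return $response;
    }
}
```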

Setting CURLOPT_ENCODING to 'deflate,gzip' allows the server to send compressed data, and a combination of gzopen, CURLOPT_FILE and gzclose makes cURL write the response straight to a compressed file without needing to parse it.
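For example (a sketch; the URL and file name are placeholders): the compressed transfer is decoded by cURL, then recompressed by the gzip stream as it's written to disk.

```php
$curl = curl_init('http://example.com/data.xml'); // placeholder URL
$file = gzopen('data.xml.gz', 'w'); // write straight to a gzipped file

curl_setopt($curl, CURLOPT_ENCODING, 'deflate,gzip');
curl_setopt($curl, CURLOPT_FILE, $file);

curl_exec($curl);

curl_close($curl);
gzclose($file);
```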

Parsing the response

If the data is JSON, use json_decode (passing true as the second argument) to turn it into an associative array. If it's HTML or XML, use DOMDocument and DOMXPath to extract the data. DOMXPath::evaluate('string(…)') is particularly useful for queries that should return a single piece of text.
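A sketch of both routes, assuming $json and $html hold responses fetched earlier (the XPath expressions and field names are illustrative):

```php
// JSON: decode to an associative array
$data = json_decode($json, true);

// HTML: suppress warnings about real-world malformed markup
$doc = new DOMDocument();
@$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// string(...) returns a single piece of text directly
$title = $xpath->evaluate('string(//h1[@class="title"])');

// query() returns a node list; evaluate() accepts a context node
foreach ($xpath->query('//table[@id="results"]//tr') as $row) {
    $name = $xpath->evaluate('string(td[1])', $row);
}
```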

Storage

CSV is an obvious choice for storage, as it's straightforward to use for flat records: objects with a single layer of properties and no nested objects.

fputcsv is easy enough to use; fgetcsv is a bit trickier. The general procedure is something like the following sketch (a minimal version, assuming the first row of the file holds the column names):
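```php
$file = fopen('data.csv', 'r');

// The first row holds the column names
$fields = fgetcsv($file);

while (($row = fgetcsv($file)) !== false) {
    // Build an associative array of field => value for each row
    $item = array_combine($fields, $row);
    // ... process $item: fetch more data, write to another CSV, etc.
}

fclose($file);
```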

Publishing

There are several ways to publish a CSV file online:

- Simply put the CSV file online. It's a standard format, and people can easily make use of it.
- Import the file into a Google Spreadsheet and publish it. From there, people can use libraries like Tabletop.js and sheetsee.js to work with the data, or can download the raw data. The limits on spreadsheet size have recently been raised, so it should be able to handle fairly large files.
- Accompany the CSV file with a JSON file containing metadata. Ideally, this will define the data type of each field. Simple Data Format (SDF) is one possible format for this datapackage.json metadata file (see the sketch below). Perhaps this file could even contain a mapping of each property to a URL, analogous to a JSON-LD context document (CSV-LD?).
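As an illustration only, a minimal datapackage.json along these lines might look something like this (the dataset name, field names and types are placeholders):

```json
{
  "name": "example-dataset",
  "resources": [
    {
      "path": "data.csv",
      "schema": {
        "fields": [
          { "name": "id", "type": "integer" },
          { "name": "title", "type": "string" },
          { "name": "date", "type": "date" }
        ]
      }
    }
  ]
}
```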