How to | Clean Up Data Imported from a ZIP File

In addition to importing ZIP files stored on your machine, the Wolfram Language can also import ZIP files directly from a URL. In most cases, processing data in ZIP files is relatively straightforward. However, depending on how the data is formatted, it may need additional processing, as is the case in this example. Despite this, the Wolfram Language lets you import, process, and then plot the data in just a few steps.

The U.S. Census Bureau maintains data on the trade balance between the U.S. and all other countries. You can download the data from their websitein a ZIP file, which contains a spreadsheet with the data.

There are a couple of methods you can use to get the data from the ZIP file into the Wolfram Language. You can import the ZIP file directly from its URL, as done in this notebook, or download and extract it separately and then import the spreadsheet from your local machine.

In the examples that follow, only parts of the data are shown. This is done simply to save space in this document. However, when you process other datasets on your own, you will need to view much larger parts of the data so that you know where to focus your efforts.

Begin by importing the ZIP file from the website. Since the dataset is much too large to show in this document, the output has been suppressed with a semicolon ():

The first worksheet contains some basic descriptive information about the data. Use (the short form for Part) and (the short form for Span) to view the parts of the worksheet that have relevant information.

For example, the Part specification will get the first worksheet spanning from the 6 element to the end. Looking at the data, you can see that the prefixes I and E on the three-letter month abbreviations represent imports or exports:

The third element of each list containing the actual trade data is the relevant country or organization. To isolate and analyze the data for China, for instance, you need to do some additional processing. Here, you will extract data on imports from China by using pattern matching in Cases.

For the pattern, specify a list where the first two elements are underscore characters, and the third element is . The underscore character is a pattern object that can stand for any Wolfram Language expression. For the fourth element, use two underscore characters together. This represents a pattern object that can stand for any sequence of one or more Wolfram Language expressions:

View just the data for 1985. The 12 elements following correspond to the imports for each month, while the thirteenth element is the total of those imports (). The next 12 elements following correspond to the exports for each month, while the thirteenth element is the total of those exports ():

You now have a means of regularly accessing and processing U.S. trade data. For the purposes of this example, the data processing was broken down into steps. For regular use, you may want to combine some of these steps into a function.