quasi-public goods, natural resource economics, public policy, computational economics and other stuff I like

Data hacks with list.files() and unz()

I stumbled on a pretty cool functionality in base R yesterday that I thought I would share. It saved me several hours of time so I thought it might eventually be able to save you some.

Here’s the deal: I was downloading groundwater elevation levels from The California Department of Water Resources website. The problem is that DWR organizes the groundwater levels data in an very inconvenient way. I can get time-series data on individual wells filtered by county, basin, or any number of other spatial identifiers. The kicker is I was only able to download data for 50 wells at a time.

I chose to do my searches at the basin level but some basins I was interested in had several thousand wells. Since I could only download 50 wells at a time, this process involved the rather tedious work of manually clicking the ‘download data’ button several hundred times. Perhaps one day my skill set will evolve to include the ability to automate this process using more advanced programming skills than I currently possess…but at present there was really no way around the manual option…but I did find a way to at least automate the data processing step.

The important details are this: Each data download creates a separate zipped file with 2 .csv files in it, one with the actual well data (depth to groundwater over time) for the 50 wells and one with meta information (lat/long coordinates of wells) on the 50 wells. The really useful aspect of this download process is that the .csv files that I want always take the name of the zipped file they are in.

For example, if I search for the ‘Chowchilla Basin’ in California’s Central Valley I get 8 pages of 50 wells. Manually downloading each page I get 8 zipped files for the Chowchilla Basin. The zipped files have names like “wdlzip188.zip” and the .csv files that I’m interested, within each zipped file, are “GST_188” and “GWL_188.”

In order to process the data I needed to read each of the .csv files for a basin, combine the data from like .csv files then merge the two types of .csv files together so I get the well readings and well information in a single data frame. The manual process would have looked like this:

1. choose a groundwater basin
2. choose a zip file
3. unzip the file
4. read the two .csv files from the unzipped file into R
5. repeat 1 – 4 for all zipped files in a basin folder (some basins have as many as 70 zipped files)
6. put like .csv’s together then merge the 2 .csv types
7. repeat 1 – 6 for all basins (there are around 20)

this process, no doubt would have taken me several hours. Using the 2 commands “list.files” and “unz” I was able to automate this by creating an object with all the names of the files I need to access then iterating through that object.

The important detail here is that all of the separate zipped files for each basin are stored in a folder with the name of that basin. So, for example, the Chowchilla Basin had 8 pages of 50 wells so the folder ‘chowchilla_basin_5_22_05’ has 8 zipped files. All of these basin files (12 in total: chowchilla, kaweah, kern county, kings, madera merced, modesto, pleasant valley, tulare lake, tule, turlock, westside) are contained in a directory called “dwr_sjv_wells.”

#first create an empty list. the elements of this list will be the data frames created by #combining all the well data for each basin...that is, each basin will have it's own element #in the list and that #element will be a data frame:
basin_wells <- list()
#next, we read all of the basin files from the root directory:
setwd("R:/dwr_sjv_wells/")
#get a list of all file names in the current directory
folder.list <- list.files()
#we use the counter i to iterate through this list of folders
for(i in 1:length(folder.list)){
#create a temporary variable with the name of the subdirectory we want to access
dir <- folder.list[i]
path <- paste("R:/dwr_sjv_wells",dir,sep="")
setwd(path)
#list.files give us a list of files in the basin directory
#which we can use to iterate over the zipped file folders
#in the basin directory. I then use lapply to coerce this
#list to a data frame just to make it easier to deal with
folders <- list.files()
#extract numbers from folder names
folders.sub <- substr(folders, 7, nchar(folders))
iter <- data.frame(rbindlist(lapply(folders.sub,function(x){
z <- strsplit(x,"[.]")
zz <- unlist(z)
zzz <- data.frame(nbr=zz[1],ext=zz[2])
return(zzz)
}
)
)
)
#-------------------------------------------------------------------------------------
#now we have:
# a. a directory which is the location of the current basin file
# b. a list of all names of the various zipped files within that directory
#next we use "j" to index through the zipped files in the basin directory
for(j in 1:length(folders)){
new.path <- paste(path,"/",folders[j],sep="")
if(j==1){
#here i give different names to the 2 types of
# .csv files contained in the zipped file
file1 <- paste("GWL_",iter[j,1],".csv",sep="")
file2 <- paste("GST_",iter[j,1],".csv",sep="")
#since file1 and file2 are paths to zipped files we need
#to use the command "unz" to access these files. "unz"
#creates a connection to a zip file and "read.csv" reads
# the .csv file inside the connection.
tmp1 <- unz(new.path,file1)
well.gl <- read.csv(tmp1,header=T)
tmp2 <- unz(new.path, file2)
well.info <- read.csv(tmp2,header=T)
}else{
file1 <- paste("GWL_",iter[j,1],".csv",sep="")
file2 <- paste("GST_",iter[j,1],".csv",sep="")
tmp1 <- unz(new.path,file1)
tmp.gl <- read.csv(tmp1,header=T)
tmp2 <- unz(new.path,file2)
tmp.info <- read.csv(tmp2,header=T)
well.gl <- rbind(well.gl,tmp.gl)
well.info <- rbind(well.info,tmp.info)
}
}
#now, we merge the well data and well info using the common field "site_code."
x <- merge(well.gl,well.info,by="site_code")
#finally, we save the data frame x (which contains the merged well data and well info for #all wells in a basin) as an element in the list
basin_wells[[i]] <- x
}

It deserves some mention here that the proper way to do this is probably to wrap the entire operation in a function and use one of the options from the apply family of R functions instead of a loop…maybe I’ll post a new, efficient version in a few days. In the meantime feel free to use the comment section to offer improvements on my solution for downloading and organizing data from website that don’t allow a single download of all data.

To recap:

1. Create a directory (“dwr_sjv_wells”) with a separate subdirectory for each groundwater basin
2. Set the data download to send each .zip file to the appropriate basin file.
3. Use files.list() to create a object with name of each file in the dwr_sjv_wells directory and iterate through this object.
4. Use files.list() again to iterate through each of the .zip files in the subdirectory.

Of course, all this could be avoided if DWR would just post the data in a format that researchers might actually use instead of actively trying to discourage the use of their data by making it difficult to use…but I’m not holding my breathe on that.