Data Mining the California Solar Statistics with R: Part I

Data Mining the California Solar Statistics with R: Part I

Intro

Today I’m taking a look at the data set available from California Solar Statistics availalbe from https://www.californiasolarstatistics.ca.gov/. This data set lists all the applications for state incentives for both residential and commercial systems, it contains information about the PV (Photovoltaic) system size, location, cost, incentive amount, panel and inverter manufacturer and a lot more, if you want more details check out their web page. I’m interested in taking a look at this data set and seeing what we can learn about residential solar installations in CA.

Getting Started – Cleaning the data

The working data sets have already been screened for input errors but the data still isn’t in the format I am looking for, I would like to see the total installed kW of solar by county by year. Additionally, the input which contains the install county has inconsistent nomenclature. For example, some values for county will say “Los Angles County” while others will just say “Los Angeles”, this is a problem because it will create more counties than exist when I try to group by county. Also, the data currently contains canceled applications, I’m only interested in taking a look at installed systems. In this first block of code I’m going to get the data in a more usable format for my purposes.

##load data, this may take a min
solarData=read.csv(file = "WorkingDataSet_4-15-2015.csv")
## We're interested in the residential data
solarData=subset(solarData,Host.Customer.Sector == "Residential")
## same labels are redundant, remove "county" to avoid this using gsub command
solarData$Host.Customer.Physical.Address.County=gsub( " County","",solarData$Host.Customer.Physical.Address.County)
##There are two instances where no county was listed, I am removing them here
solarData=solarData[solarData$Host.Customer.Physical.Address.County != "",]
#remove cancelled applications, we're insteresed installed systems
solarData=solarData[solarData$First.Cancelled.Date == "",]
#only keep installed applications, if people have filed to receieve their incentives, I am counting it as installed
solarData=solarData[solarData$First.Incentive.Claim.Request.Review.Date != "",]
#extract install year and month variables
solarData$First.Incentive.Claim.Request.Review.Date=as.Date(solarData$First.Incentive.Claim.Request.Review.Date)
##using a package called lubridate with the functions year and month to extract year and month as variables form the data
solarData$year=year(solarData$First.Incentive.Claim.Request.Review.Date)
solarData$month=month(solarData$First.Incentive.Claim.Request.Review.Date)

Next I want to group the data by county and year. The plyr package is great for aggregating data like this, it can be achieved with just a few lines of code. For more about the plyr package see http://plyr.had.co.nz/.

#Get data by county
countyData = ddply(solarData, .(tolower(Host.Customer.Physical.Address.County),year ),summarize,
Systems = length(na.omit(year)),
Total.Size = sum(na.omit(CSI.Rating)))
##rename column name to "County" from "Host.Customer.Physical.Address.County" this will be important when I want to merge that data set with another one
colnames(countyData)[1] ="County"

Now that I’ve got the data set in the format I’m looking for, I’ll use the gglpot2 package to load a county map of CA. Later We’ll be able to plot the solar install data we have per county on this map. I create a variable called “sort” so that after I merge this data set with the solar data I can put it back in the original order from my CA data frame. The merge function doesn’t keep the order of either your data frames. This is important because I won’t be able to plot the data by county correctly if it isn’t in the same order that it came from the ggplot package in.

Loading the California Map from ggplot2

CA=map_data('county')
CA=subset(CA,region=="california")
CA$sort=rep(0,nrow(CA))
##create a sort variable so that after we merge the data sets, we can put in proper order to plot
for (i in 1:nrow(CA)){
CA$sort[i]=i
}
colnames(CA)[6] ="County"

I also need to subset the data so that each year is it’s own data set if I want to compare state wide solar install plots by year. To do this I’m merging the county labels from my CA data frame with the county labels from my countyData data frame.

Looking at the plots by county it’s clear that the applications are more focused in southern CA. In a later post we’ll try and figure out how much of this is due to the high population in southern CA and how much is due to the increased solar insolation. One thing that lookes really strange/surprising to me is the drop in applications in 2014, during this period of time the cost of solar was dropping and the economy was improving. I reached out to the California Energy Comission to ask them about the drop in applications for incentives and they explained that while the CSI (California Solar Incentive) program was intended to last until 2016, the total budget for the program was starting to dry up in 2014 due the larger than anticipated growth in solar. So the drop in applications isn’t related to a decline in PV, just a lack of budget. It would be interesting to see how much the incentives running out changed the number of PV system installs but I haven’t been able to find a source for that data.

In my next post we’ll take a closer look at this data set and see what else we can learn.