I am currently developing a package I would like to submit to rOpenSci at some point this year. The package obtains current and historical data for hurricanes in the Atlantic and east Pacific oceans (eventually, the central Pacific as well).

I already have a similar package on CRAN, HURDAT. However, that is a reanalysis project, so the data may be slightly different (though it covers far more storms).

I’m seeking guidance/advice on the best way to finish off this project, rrricanes. Here’s where it stands now:

beta

scrapes real-time and archived data for tropical cyclones back to 1998, for the Atlantic and east Pacific (east of 140°W, loosely enforced).

It is very slow, specifically with Forecast/Advisory products. These products contain the bulk of the data, but their format is rather free-form, and the earlier the product, the worse it gets. The format has also changed slightly over the years. I’ve used a lot of regex (a lot for me, anyway) to handle the different combinations.

Additionally, it’s accessing HTML pages which may or may not be available right away. On my somewhat slow machine, it may take 10–15 minutes to get one year’s worth of data into a nice clean structure. No good!

Originally I had avoided even thinking of keeping the datasets within the package; they’re about 30–40 MB (in CSV format) for just the three current core products (forecast/advisory, wind probabilities, and strike probabilities). But I don’t think anyone would realistically want to use the package just because of how slow it is. Hell, I’m not even sure I want to use it and I’m writing it for me!!!

So, that being said, these are my thoughts:

A separate repo to hold the data alone; not a package repo, just a simple dataset GitHub repo. There would be three types of datasets:

A summary dataset by year (cyclone name, key, start date, end date)

A dataset for each year by product; one dataset for all forecast/advisory data or strike probabilities, etc.

A dataset for each storm by product.

The last item is optional; I’m not sold on it yet. The largest annual product dataset will be about 700 rows by 125 columns, so I don’t see any reason someone couldn’t just import that and do individual storm analysis off it.
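To illustrate the point above, per-storm analysis is just a subset of the annual product table. This is a minimal sketch; the column names (`storm_key`, `adv`, `wind`) are assumptions for illustration, not the package’s actual schema:

```r
# Hypothetical annual forecast/advisory dataset (made-up values)
annual <- data.frame(
  storm_key = c("AL092017", "AL092017", "AL112017"),
  adv       = c(1, 2, 1),
  wind      = c(35, 45, 30),
  stringsAsFactors = FALSE
)

# Pull one storm's advisories out of the annual table
storm_rows <- function(df, key) df[df$storm_key == key, , drop = FALSE]

one_storm <- storm_rows(annual, "AL092017")
nrow(one_storm)  # 2
```

So a per-storm dataset in the repo would mostly duplicate what a one-line subset already provides.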

And then the actual package repo. The package would keep the same scraping functionality, but it would gain the ability to simply pull datasets from GitHub.

Of course, this means I would need to update the GH dataset repo multiple times a day during an active cyclone, but I can do that either on my local machine or on AWS.

All of this while keeping in mind that I would like to submit the package to rOpenSci; from what I’ve read, it seems eligible but is currently lacking.

My future plans for this package are not limited to storm advisory data. I would also like to add other datasets, such as GIS, reconnaissance, and ship/buoy data.

Let me first point out that I renamed the package to rrricanes and updated the repo links above to reflect that. It is also now connected to rrricanesdata, which holds most of the data in Rda format. I renamed it because I would like to push markdown reports of active storms to Twitter, and rrricanes was available. Thank purrr for the inspiration!

sckott:

What does this mean exactly?

timtrice:

The scraping utility gets timeout errors frequently. It’s not just my local connection, as Travis has timed out occasionally as well. I plan on writing a tryCatch wrapper with maybe three attempts to reconnect on timeouts, but it’s not a priority at this point.
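The tryCatch idea above could be sketched roughly like this. `fetch_with_retry` and its arguments are illustrative, not the package’s actual API; the `reader` argument just makes the fetch function swappable:

```r
# Retry a flaky fetch up to `attempts` times, backing off between tries
fetch_with_retry <- function(url, attempts = 3, wait = 2, reader = readLines) {
  last_err <- NULL
  for (i in seq_len(attempts)) {
    result <- tryCatch(reader(url), error = function(e) e)
    if (!inherits(result, "error")) return(result)  # success: hand back data
    last_err <- result
    if (i < attempts) Sys.sleep(wait * i)           # simple linear back-off
  }
  stop("failed after ", attempts, " attempts: ", conditionMessage(last_err))
}
```

The same wrapper can then guard every page request without touching the parsing code.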

sckott:

How often is data updated? Trying to get a sense if it makes sense to keep data in the package - if data updated very often vs. data updated like once per month or less ?

Data for years prior to the current year will never be updated or changed, so that theoretically could go into the package. The Rda files with compression_level 9 sit at 1.3 MB right now, but that’s only for forecast data and strike probabilities.
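For reference, writing a product dataset that way looks like the following. The data frame here is made up; the point is `save()` with a high compression level, which keeps archives small at the cost of slower writes (fine, since these files never change):

```r
# Made-up product dataset standing in for a year's forecast/advisory data
fcst <- data.frame(adv = 1:3, wind = c(30, 35, 40))

path <- tempfile(fileext = ".Rda")
save(fcst, file = path, compress = "xz", compression_level = 9)

rm(fcst)
load(path)  # restores `fcst` into the workspace
```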

Data for the current season would be updated at a minimum every 6 hours for active storms and possibly more frequently during drastic changes or landfalling situations. My thinking here was running a cron job that would grab the new data and push it to the data repo. But users will still have the scraping functions available as a backup.

sckott:

timtrice:
seems eligible but is currently lacking

In which ways?

I haven’t entirely read over the community’s expectations for new projects. Particularly since I’m the only user and the package is still in testing, I just don’t know if the community would accept it in its current state. Although I have run some validation tests and removed a significant chunk of bugs, some data quality may still be lacking, often because of one little thing that throws off the regex.

Two cases I had just found: some advisories used \r\n instead of the \n found in most products, and some products are in proper case while others are in upper case.
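Both of those cases could be handled by a normalization pass run before any of the regexes, so one pattern set covers every product vintage. A minimal sketch; the function name is illustrative:

```r
# Normalize raw product text before regex parsing
normalize_product <- function(txt) {
  txt <- gsub("\r\n", "\n", txt, fixed = TRUE)  # collapse CRLF line endings
  txt <- gsub("\r", "\n", txt, fixed = TRUE)    # any stray lone \r
  toupper(txt)                                  # fold proper case to upper case
}

normalize_product("Forecast valid 08/0900Z\r\nMax wind 65 kt")
```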

sckott:

I want to see if we can speed up your download times - can you point to that code that you say is slow?

The Forecast/Advisory products are by far the slowest. Here is an example:

There are 125 variables. During scraping, some are transformed (e.g., 138.3W in the text becomes -138.3 in the dataframe, and date/time values like 08/0900Z are converted to proper ymd_hms). Variables such as wind radii are moved from long format to wide format so each advisory occupies only one row in one table.
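The two transformations mentioned above look roughly like this in base R. These regexes and function names are illustrative, not the package’s actual patterns, and the year/month would really come from the advisory header rather than defaults:

```r
# "138.3W" -> -138.3, "45.0E" -> 45.0
lon_to_numeric <- function(x) {
  val <- as.numeric(sub("[WE]$", "", x))
  ifelse(grepl("W$", x), -val, val)
}

# "08/0900Z" -> POSIXct in UTC; day/hour/minute parsed from the token
adv_datetime <- function(x, year = 2017, month = 8) {
  m <- regmatches(x, regexec("^(\\d{2})/(\\d{2})(\\d{2})Z$", x))[[1]]
  ISOdatetime(year, month, as.integer(m[2]), as.integer(m[3]),
              as.integer(m[4]), 0, tz = "UTC")
}

lon_to_numeric("138.3W")   # -138.3
adv_datetime("08/0900Z")   # a POSIXct for the 8th, 09:00 UTC
```

Per-field work like this is cheap; in my experience it is the page fetches, not the transforms, that dominate the runtime.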

If you’re only getting data for one storm, this isn’t an issue. It might take a minute tops (depending on the life of the storm). But to get a whole season of data just for one basin will take several minutes.

It could also be because I’m using rvest. Initially grabbing the data and extracting the product text could possibly be done better.

sckott:

rnoaa does buoy data - though it may be different from the buoy data you’re talking about.

I have this bookmarked and thank you for reminding me! Ship and buoy data will add up so I’m not sure yet how I’m going to handle it. Once I get the GIS data added though I’d like to move to the reconnaissance data as I believe there is more value there.

Particularly since I’m the only user and the package is still in testing, I just don’t know if the community would accept it in its current state. Although I have run some validation tests and removed a significant chunk of bugs, some data quality may still be lacking, often because of one little thing that throws off the regex.

No problem that you think you are the only user; it’s entirely possible others are using it. We do want submitted pkgs to have tests though, so mind that rule. There’s in general no issue with an early-stage pkg being submitted; in fact, it’s nice to get a pkg in review at an earlier stage to get feedback early, rather than after the maintainer is in a way wedded to the pkg structure/conventions, etc.

Scott, that would be very helpful! Thank you. I apologize if the repo seems disorganized; I am working on getting a solid structure. Currently I’m updating the docs on the master branch to reflect the name change and doing minor cleaning.

And I’ll check out profvis. I did use it a bit when I first created the package, and I do remember that fetching the URLs was the most time-consuming part. But I still think I can clean up the code there a bit to make it slightly faster. I’ll use it again to test various ideas.

sckott:

We do want submitted pkgs to have tests though, so mind that rule. There’s in general no issue with an early-stage pkg being submitted; in fact, it’s nice to get a pkg in review at an earlier stage to get feedback early, rather than after the maintainer is in a way wedded to the pkg structure/conventions, etc.

The package is definitely tested; probably too much (I removed some tests in version 0.1.1). Many of the tests were to handle the minor fluctuations from one advisory to another (i.e., Adv 1 and Adv 1…COR), making sure that by fixing one thing I didn’t break another. I’m trying to think of a way to have a develop branch and a test branch so the develop branch won’t take so long to build. Not sure if that’s feasible or even acceptable; just an idea.

I think I will go ahead and submit the package and see what the community thinks. I’m a bit overwhelmed with it in the sense that I know what I need to do, but I may not be documenting it correctly or following proper procedures. Additionally, I know how things are supposed to work; I’m just not sure I’m expressing it correctly to others.