Tag Archives: RSQLite

Lately – over the last month – I have been writing code that helps with the production of a timetable and abstract for conferences.

The latest version of that code can be found here, the associated bookdown project here, and the final result here. Note that the programme may change a good number of times before next Sunday (December 10, 2017) when the conference starts.

The code above is an improved version of the package I wrote for the Biometrics By The Border conference which worked solely with Google Sheets, by using it as a poor man’s database. I wouldn’t advise this because it is an extreme bottleneck when you constantly need to refresh the database information.

The initial project was driven by necessity. I foolishly suggested we use EasyChair to collect information from speakers, not realising that the free version of EasyChair does not allow you to download the abstracts your speakers helpfully provided to you (and fair enough–they are trying to make some money from this service). Please note that if I am incorrect in this statement, I’d be more than happy to find out. A word of advice: if this is still the case, then get your participants to upload an R Markdown (or just plain text) file instead of using the abstract box on the webpage. You can download these files from EasyChair in a single zip file.

The second project was also driven by necessity, but the need in this case had more to do with the sheer complexity of organising several hundred talks over a four day programme. Now that everything is in place, most changes requested by speakers or authors can be accomodated in a few minutes by simply moving the talk on the Google sheet that controls the programme, calling several R functions, recompiling the book and pushing it up to github.

So how does it all work?

The current package does depend on some basic information being stored on Google Sheets. The sheets for my current conference (Joint Conference of the New Zealand Statistical Association and the International Association of Statistical Computing (Asian Regional Section)) can be viewed here on Google Sheets. Not all the worksheets here are important. The key sheets are: Monday, Tuesday, Wednesday, Thursday, All Submissions, Allocations, All_Authors, Monday_Chairs, Tuesday_Chairs, Wednesday_Chairs, and Thursday_Chairs. The first four and the last four sheets (Monday-Thursday) are hand created. The colours are for our convenience and do not matter. The layout does, and some of this is hard-coded into the project. For example all though there are seven streams a day, only six of those belong directly to us as the seventh belong to a satellite conference. The code relies on every scheduled event (including meal breaks) having a time adjacent to it in the Time column. It also collects information about the rooms as the order of these rooms does change on some days. The All_Authors sheet comes from EasyChair. EasyChair provides the facility to download author email information as an Excel spreadsheet which in turn can be uploaded to Google Sheets easily. This sheet is simply the sheet labelled All in the EasyChair file. The All Submissions sheet is a copy-and-paste from the EasyChair submissions page. I probably could webscrape this with a little effort. The Allocations sheet is mostly reflective of the All Submissions sheet, but has user added categorizations that allow us to group the talks into sensible sessions. It also uses formulae to format the titles of the talks so that each title word is capitalized (this is an imperfect process), and that the name of the submitter (who we have assumed is the speaker) is appended to the title so that it can be pasted into the programme sheets.

How do we get data out of Google Sheets?

Enter googlesheets.

The code works in a number of distinct phases. The first is capturing all the relevant information from the web in placing in a SQLite database. I used Jenny Bryan’s googlesheets package to extract the data from my spreadsheets. The process is quite straightforward, although I found that it worked a little more smoothly on my Mac than on my Windows box, although this may have more to do with the fact that the R install was brand new, and so the httr package was not installed. The difference between it being installed and not installed is that when it is installed authentication (with Google) happens entirely within the browser, whereas without you are required to copy and paste an authentication code back into R. When you are doing this many times, the former is clearly more desirable.

Interaction with Google Sheets consists of first grabbing the workbook (Google Sheets does not use this term, it comes from Excel, but it encapsulates the idea that you have a collection of one or more worksheets in any one project, than the folder that contains them all is called a workbook), and then asking for information from the individual sheets. The request to see a workbook is what will prompt the authentication, e.g.

library(googlesheets)
mySheets = gs_ls()
mySheets

The call to gs_ls will prompt authentication. It is important to note that this authentication will last for a certain period of time, and should only require interaction with a web browser the very first time you do it, and never again. Subsequent requests will result in a message along the lines of Auto-refreshing stale OAuth token. The call to gs_ls will return a list of available workbooks. A particular workbook may be retrieved by calling gs_title. For example, this function allows me to get to the conference workbook:

I can use the object returned by this function in turn to access the worksheets I want to work with using the gs_read. The functions in googlesheets are written to be compliant with the tidyverse, and in particular the pipe %>% operator. I will be the first to admit this is not my preferred way of working and I could have easily worked around it. However, in the interests of furthering my skills, I have tried to make the project tidyverse compliant and nearly all the functions will work using the pipe operator, although they take a worksheet or a database as their first argument rather than a tibble.

Once I have a workbook object (really just an authenticated connection), I can then read each of the spreadsheets. This is done by a function called updateDB. The only thing worth commenting on in this function is that the worksheets for each of the days have headers which do not really resolve well into column names. They also have a fixed set of rows and columns that we wish to capture. To that end, the range is specified for each day, and the column headers are simply set to be the first seven (A–G) capital letters of the alphabet. These sheets are stored as tibbles which are then written to an internal SQLite database using the dbWriteTable function. There are eight functions (createRoomsTbl, createAffilTbl, createTitleTbl, createAbstractTbl, createAuthorTbl, createAuthorSubTbl, createProgTbl, createChairTbl) which operate on the database/spreadsheet tables to produce the database tables we need for generating the conference timetable, and for the abstract booklet. These functions are rarely called by themselves—we tend to call a single omnibus function rebuildBD. This function allows the user to refresh the information from the web if need be, and to recreate all of the tables in the database. The bottleneck in this function is retrieving the information from the internet which may take around 20-30 seconds depending on the connection speed.

The database tables provide information for four functions: writeTT, writeProg, writeIndices, and writeSessionChairs. Each of these functions produces one or more R Markdown files in the directory used by the bookdown project.

Bookdown, ePub and gitBook

The final product is generated using bookdown. Bookdown, in explanation sounds simple. Implementation is really improved by the help of a master. I found this blog post by Sean Kross very helpful along with his minimal bookdown project from github. It would be misleading of me to suggest that the programme book was really produced using R Markdown. Small elements are in markdown, but the vast majority of the formatting is achieved by writing code which writes HTML. This is especially true of the conference timetable, and the hyperlinking of the abstracts to the timetable and other indices. The four functions listed above write out six markdown pages. These are the conference timetable, the session chairs table, four pages for each of the days of the conference, and two indices, one linking talks to authors, and one linking talk titles to submission numbers (which for the most part were issued by EasyChair). There is not a lot more to discussion involved here. Sean’s project sets up things in such a way that changes to the markdown or the yaml files will automatically trigger a rebuild.

Things I struggled with

Our conference had six parallel streams. Whilst it is easy enough to make tables that will hold all this information, it is very difficult to decide how best to display this in a way that would suit everyone. The original HTML tables were squashed almost to the point where there was a character per line on a mobile phone screen. We overcame this slightly by fixing the widths of the div element that holds the tables and adding horizontal scrolling. Many people found this feature confusing, and it did not necessarily translate into ePub. We then added some Javascript functionality by way of the tablesaw library. This allowed us to keep the time column persistant no matter which stream people were looking at, and it allowed better scrolling of streams that were offscreen. However, this was still as step too far for the technologically challenged. In the end we resorted to printing out the timetable. I also used excellent Calibre software to take the ePub as input and output it in other format—most usefully Microsoft Word’s .docx format. I know some of you are shuddering at this thought, but it did allow me to create a PDF with the programme timetable rotated and then create a PDF. This made the old fogeys immensely happy, and me immensely irritated, as I thought the gitBook version was quite useful.

Not forgetting the abstracts

Omitted from my workflow so far is mention of the abstracts. We had authors upload LaTeX (.tex) files to EasyChair, or text files. If you don’t do this, and they use EasyChair’s abstract box, then you have to find a way to scrape the data. There are downsides in doing so (and it even occurs in the user submitted files) in that some unicode text seems to creep in. Needless to say, even after I used an R function to convert all the files to markdown, we still had to do a bunch of manual cleaning.

Anyway

I hope someone finds this work useful. I have no intention of running a conference again for at least four years, but I would appreciate it if anyone wants to build on my work.