Jumping Rivers - https://www.jumpingrivers.com
Helping You Move from Data Storage to Data Insights

Catch us at these conferences!
Mon, 09 Sep 2019

At Jumping Rivers we always want to branch out into the data science community, and so this year we're going to quite a few conferences in the autumn. You can catch us at:

GSS (Government Statistical Service) Conference - Edinburgh

From 1-2 October, our very own Esther Gillespie (CEO) and Seb Mellor (Data Engineer) will be attending the GSS conference in Edinburgh. If you see them, feel free to chat or ask for some merch! Unfortunately, there are no tickets available for this one.

EARL - London

EARL London boasts a very strong line-up of speakers, from Sainsbury's to Stack Overflow. We're sponsoring this one, so expect a big Jumping Rivers presence. We've got Esther, Colin Gillespie (Project Manager) and Rhian Davies (Data Scientist) attending.

From 10-12 September you'll be able to catch the three of them heading up our stall, where you can pop by for a chat or for a coaster!

If you've still not grabbed yourself a ticket, you'll have to do it pretty soon!

Why R? - Warsaw

In just a couple of weeks' time, Jumping Rivers will be going international! Four of our team will be crossing borders into Warsaw for the annual Why R? conference. If you like the sound of it, grab yourself a ticket!

Who's going? Myself (Theo Roe, Data Scientist), Colin, Roman Popat (Data Scientist) and Jack Walton (Data Scientist) will be attending. If you see us, feel free to stop us for a chat and grab one of our infamous Jumping Rivers coasters!

As a treat, I'm doing a workshop on Friday morning titled "Shiny basics". I'm also talking in the 10-11:20am Saturday Shiny session about a recent project we took on at Jumping Rivers, titled Improving the communication of environmental data using Shiny.

You can catch Colin talking in the Sunday 15:05-16:05 Vision 1 session. His talk is titled Hacking R as a script kiddie. It covers the relatively easy hacks that can be performed to access systems as data science moves away from local machines to the cloud.

We're RStudio Trainers!
Fri, 16 Aug 2019

Big news. RStudio recently started certifying trainers in three areas: the tidyverse, Shiny and teaching. To be certified to teach a topic you have to pass the exam for that topic and the teaching exam.

Even bigger news. Four of your lovely Jumping Rivers trainers are now certified to teach at least one topic! Check out the RStudio certified trainers page to see me (Theo Roe), Rhian Davies, Colin Gillespie and Roman Popat in action!

P.S. Whilst we've got you: if you want to learn the tidyverse or Shiny, see here.

Upcoming R courses with Jumping Rivers
Sun, 04 Aug 2019

You'll be pleased to know that Jumping Rivers is running R training courses up and down the UK, in London, Newcastle, Belfast and Edinburgh. I've put together a quick summary of the courses available through to the end of the year, sorted by place then date. You can find the booking links and more detail over at our courses page. Don't be afraid to get in contact if you have any questions!

London

12/12 - Advanced Programming in R

This is a two-day intensive course on advanced R programming. The training course will not only cover advanced R programming techniques, such as S3/S4 objects, reference classes and function closures; we will also spend significant time discussing why and where these methods are used. The course will be a mixture of lectures and computer practicals. By the end of the course, participants will be able to use OOP within their own code.
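To give a flavour of one of the topics, a function closure is a function that carries its enclosing environment with it, so it can keep private state between calls (a toy illustration of my own, not course material):

```r
# make_counter() returns a function that remembers how many times
# it has been called, via its enclosing environment
make_counter = function() {
  count = 0
  function() {
    count <<- count + 1  # <<- updates `count` in the enclosing environment
    count
  }
}

counter = make_counter()
counter()  # 1
counter()  # 2
```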

Newcastle

2/12 - 4/12 - Rapid reporting for analysts: An Introduction to R programming through to reporting in three days

This course aims to take each individual through a fundamental approach to using R programming in their current role, ensuring that attendees build confidence about where and how to start when they get back to their desks. By the end of the course, each individual should have already introduced some automation and will be working towards automating all of their reports. Our experience shows that analysts who set up a reproducible report save between 20% and 80% of the time spent on their task.

Belfast

2/9 - Mastering the Tidyverse (Data Carpentry)

The tidyverse is essential for any statistician or data scientist who deals with data on a day-to-day basis. By focusing on small key tasks, the tidyverse suite of packages removes the pain of data manipulation. The tidyverse allows you to

Import data from databases and data sources with ease

Remove the pain of data cleaning

Start understanding your data by transforming, visualising and modelling it

This training course covers key aspects of the tidyverse, including dplyr, lubridate, tidyr, stringr and tibbles.

3/9 - Intro to R

This is a one-day intensive course on R and assumes no prior knowledge. By the end of the course, participants will be able to import, summarise and plot their data. At each step, we avoid using "magic code", and stress the importance of understanding what R is doing.

Edinburgh

4/10 - Intro to R

See above description

11/10 - Programming with R

The benefit of using a programming language such as R is that we can automate repetitive tasks. This course covers the fundamental techniques such as functions, for loops and conditional expressions. By the end of this course, you will understand what these techniques are and when to use them. This is a one-day intensive course on R.
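The three techniques mentioned fit together naturally; as a quick illustration (my own toy example, not course material), here is a function containing a for loop and a conditional expression:

```r
# Count how many elements of a numeric vector are even
count_evens = function(x) {
  n = 0
  for (value in x) {           # a for loop over the input vector
    if (value %% 2 == 0) {     # a conditional expression
      n = n + 1
    }
  }
  n
}

count_evens(1:10)  # 5
```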

18/10 - Introduction to R

See above description

25/10 - Mastering the Tidyverse (Data Carpentry)

See above description

1/11 - Advanced Graphics with R

This is a one-day intensive course on advanced graphics with R. The standard plotting commands in R are known as the base graphics, but are starting to show their age. In this course, we cover more advanced graphics packages - in particular, ggplot2. The ggplot2 package can create advanced and informative graphics. This training course stresses understanding - not just one off R scripts. By the end of the session, participants will be familiar with themes, scales and facets, as well as the wider ggplot2 world of packages.

8/11 - Statistical Modelling with R

From the very beginning, R was designed for statistical modelling. Out of the box, R makes standard statistical techniques easy. This course covers the fundamental modelling techniques. We begin the day by revising hypothesis tests, before moving on to ANOVA tables and regression analysis. The class ends by looking at more sophisticated methods such as clustering and principal components analysis (PCA).

Timing hash functions with the bench package
Tue, 21 May 2019

This blog post has two goals:

Investigate the bench package for timing R functions

Consequently, explore the different algorithms in the digest package using bench

What is digest?

The digest package provides a hash function to summarise R objects. Standard hashes are available, such as md5, crc32, sha-1, and sha-256. The key function in the package is digest().

The number of available hashing algorithms has grown over the years, and as a little side project, we decided to test the speed of the various algorithms. To be clear, I’m not considering any security aspects or the potential of hash clashes, just pure speed.

Timing in R

There are numerous ways of timing R functions. A recent addition to this list is the bench package. The main function bench::mark() has a number of useful features over other timing functions.

To time and compare two functions, we load the relevant packages

library("bench")
library("digest")
library("tidyverse")

then we call the mark() function and compare the md5 with the sha1 hash
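A mark() call along those lines might look like the following (a sketch; the input string and its length are my own choices, not the post's original code):

```r
library("bench")
library("digest")

# A random 10,000-character string to hash (length is an assumption)
set.seed(1)
x = paste(sample(letters, 1e4, replace = TRUE), collapse = "")

res = bench::mark(
  md5 = digest(x, algo = "md5"),
  sha1 = digest(x, algo = "sha1"),
  check = FALSE  # the two algorithms return different hashes
)
res[, c("expression", "median")]
```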

The resulting tibble object contains all the timing information. For simplicity, we've just selected the expression and median time.

More advanced bench

Of course, it’s more likely that you’ll want to compare more than two things. You can compare as many function calls as you want with mark(), as we’ll demonstrate in the following example. It’s also likely that you’ll want to compare these function calls across more than one value. For example, the digest package contains eight different algorithms, ranging from the standard md5 to the newer xxhash64 methods. To compare times, we’ll generate n = 20 random character strings of length N = 10,000. This can all be wrapped up in a single press() function call from the bench package:
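A press() call of that shape might look like the following (a sketch; the string-generating helper is my own, not the post's original code):

```r
library("bench")
library("digest")

# The eight algorithms available in digest at the time of writing
algos = c("md5", "sha1", "crc32", "sha256", "sha512",
          "xxhash32", "xxhash64", "murmur32")

# Helper: one random character string of length N
rand_string = function(N) {
  paste(sample(letters, N, replace = TRUE), collapse = "")
}

results = bench::press(
  algo = algos,
  {
    # n = 20 random strings, each of length N = 10,000
    strings = replicate(20, rand_string(1e4))
    bench::mark(hash = lapply(strings, digest, algo = algo))
  }
)
```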

It’s also worth seeing how the results vary according to the size of the character string N.

Regardless of the value of N, the sha256 algorithm is consistently the slowest.

Conclusion

R is going the way of “tidy” data. Though it wasn't the focus of this blog post, I think that the bench package is as good as other timing packages out there. Not only that, but it fits in with the whole “tidy” data thing. Two birds, one stone.

Thoughts on SatRday Newcastle
Wed, 15 May 2019

Earlier this month I attended the inaugural SatRday Newcastle. This was my first time attending a SatRday event, and I had a really enjoyable day. The event was sponsored by Newcastle University, Sage, RStudio and Jumping Rivers. There were over 100 attendees from across the U.K. Most attendees were from industry, although there were also a couple of academics present. There were also lots of R-Ladies, including women from the newly formed R-Ladies Newcastle, who are launching next month. There was even a four-month-old baby - well, you’ve got to start them young!

There were lots of interesting talks during the day, but here are a couple of my personal favourites.

Noa opened the day by telling stories about climate change, mozzarella and the wisdom of crowds. She taught us that “stories are sticky” and using them can be a great way to help colleagues understand and remember statistical principles. She also highlighted the importance of trust, access and knowledge when working with data.

I don’t really follow football, but Joe’s talk on his soccermatics package was fascinating. He’s created a tool to visualise and analyse football matches, including pitch heatmaps and individual player trajectories. Joe has a huge list of extra functionality he would like to add and is looking for collaborators. If you would like to help develop the soccermatics package, you can reach out on GitHub.

Thomas spoke about the benefits of creative coding, and how those little fun extra projects can provide a healthy distraction from a difficult task at hand, whilst also allowing you to learn new skills. He was, however, keen to stress that having time to code for fun is a privilege and not a requirement, so don’t worry if you don’t have time!

Did you miss us?

Don’t worry if you missed SatRday Newcastle, all of the talks were recorded and we will be sharing them online shortly. We’re also pleased to announce that SatRdays Newcastle will return on 4th April 2020. Save the date!

R Packages: Are we too trusting?
Mon, 04 Feb 2019

One of the great things about R is the myriad of packages. Packages are typically installed via

CRAN

Bioconductor

GitHub

But how often do we think about what we are installing? Do we pay attention or just install when something looks neat? Do we think about security or just take it that everything is secure? In this post, we conducted a little nefarious experiment to see if people pay attention to what they install.

R-bloggers: The hook

R-bloggers is a great resource for keeping on top of what's happening in the world of R. It's one of the resources we recommend whenever we run training courses. For an author to get their site syndicated to R-bloggers, they have to email Tal, who will ensure that the site isn't spammy. I recently saw a tweet (I can't remember from whom) suggesting, tongue in cheek, that to boost your website ranking you could just grab a site that used to appear on R-bloggers.

This gave me an idea for something a bit more devious! Instead of boosting website traffic, could we grab a domain, create a dummy R package, then monitor who installs this package?

A list of contributing sites is nicely provided by R-bloggers. A quick and dirty script grabs select target domains. First we load a few packages
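The script itself isn't reproduced here, but the core idea, reducing each contributing-site URL to a bare domain that can be checked for expiry, can be sketched in base R (the second URL below is a stand-in; in practice the list would be scraped from the R-bloggers contributing-sites page, e.g. with rvest):

```r
# Stand-in URLs; the real script scraped these from the
# R-bloggers contributing-sites page
urls = c("https://vinux.in/post/an-old-post/",
         "http://some-defunct-blog.com/2016/01/intro/")

# Reduce each URL to its bare domain, ready for an expiry/whois check
domains = sub("^https?://([^/]+).*$", "\\1", urls)
domains
```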

In the end, we went with vinux.in. Using the Wayback machine, this site seems to have died around 2017. The cost of claiming this site was £10 for the year.

By claiming this site, I have automatically got a site that has incoming traffic. One evil strategy is simply to sit back and collect traffic from R-bloggers.

blogdown & ggplot2: the bait

Next, I created a GitLab user rstatsgit and a blog via the excellent blogdown package. Clearly we need something to entice people to run our code, so I created a very simple R package that scans ggplot2 themes. Nothing fancy, only a dozen lines of code or so. In case someone looked at the GitLab page, I just copied a few badges from other packages to make it look more genuine. I used Netlify to link our new blog to our recently purchased domain. The resulting blog doesn't look too bad at all.

At the bottom of one of the .R files in the package, there is a simple source() command. This, in theory, could be used to do anything - grab data, passwords, ssh keys. Clearly, we don't do any of this. Instead, it simply pings a site to tell us if the package has been installed.
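In outline, the tail of that .R file looked something like the following (the function name and URL are hypothetical; as described above, the real package only recorded that an installation had happened):

```r
# scan_themes.R (hypothetical file name): a dozen innocuous lines above,
# then, at the bottom of the file, the sting in the tail. A remote
# source()/readLines() call can run or fetch arbitrary code; here it
# would merely notify us of an install. try() keeps a dead URL from
# breaking anything for the user.
ping_home = function(url = "https://example.com/installed") {  # hypothetical URL
  invisible(try(suppressWarnings(readLines(url)), silent = TRUE))
}
ping_home()
```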

R-bloggers & twitter: Delivery

To deliver the content, I'm going for a combination of trying to get it onto r-bloggers via the old RSS feed and tweeting about the page with the #rstats tag.

Did people install the package?

I'll update the blog post with results in a week or two.

Who is not to blame

It's instructive to think about who is not to blame:

GitLab/GitHub: it would be impossible for them to police the code that is uploaded to their sites.

devtools (install_git*()): there are many legitimate uses for this function. Blaming it would be equivalent to blaming StackOverflow for bad advice. It doesn't really make sense.

R-bloggers: it simply isn't feasible to thoroughly vet every post. In the past, the site has quickly reacted to anything spammy and removed offending articles. They also have no control over what happens to a syndicated domain after it changes hands.

The person who owned the site: nope. They owned the site; now they don't. They have no responsibility.

Who is to blame?

Well, I suppose I'm to blame, since I created the site and package 😉 But more seriously, if you installed the package, you're to blame! I think everyone is guilty of copying and pasting code from blogs, StackOverflow and forums without always understanding what's going on. But the internet is a dangerous place, and most people who use R almost certainly have juicy data that shouldn't be released to the outside world.

By pure coincidence, I've noticed that Bob Rudis has started emphasising that we should be more responsible about what we install.

How to protect against this?

This is something we have been helping clients tackle over the last two years. On one hand, companies use R to run the latest algorithms and try cutting edge visualisation methods. On top of this, they employ bright and enthusiastic data scientists who enjoy what they do. If companies make things too restrictive, people will either find a way around the problem or simply leave.

The crucial thing to remember is that if someone really wants to do something unsafe, we can't stop them. Instead, we need to provide safe alternatives that don't hinder work while at the same time reducing overall risk.

When dealing with companies, we help them tackle the problem in a number of ways.

benchmarkme: new version
Tue, 29 Jan 2019

When discussing how to speed up slow R code, my first question is: what is your computer spec? It's always surprised me that people wonder why analysing big data is slow, yet they are using a five-year-old cheap laptop. Spending a few thousand pounds would often make their problems disappear. To quantify the impact of the CPU on analysis, I created the package benchmarkme. The aim of this package is to provide a set of benchmark routines and data from past runs. You can then compare your machine with other CPUs.

The package is now on CRAN and can be installed in the usual way

# R 3.5.X only
install.packages("benchmarkme")

The package contains two main benchmarks:

benchmark_std(): this benchmarks numerical operations such as loops and matrix operations. It comprises three separate benchmarks: prog, matrix_fun, and matrix_cal.

benchmark_io(): this benchmarks reading and writing files (covered below).

To run the standard benchmark and upload your timings, use

res = benchmark_std(runs = 3)
## You can control exactly what is uploaded. See details below.
upload_results(res)

You can compare your results to other users via

plot(res)

The benchmark_io() function

This function benchmarks reading and writing a 5 MB or 50 MB file (if you have less than 4 GB of RAM, reduce the number of runs to 1). Run the benchmark using

res_io = benchmark_io(runs = 3)
upload_results(res_io)
plot(res_io)

By default, the files are written to a temporary directory generated by

tempdir()

which depends on the value of

Sys.getenv("TMPDIR")

You can alter this via the tmpdir argument. This is useful for comparing hard drive access to a network drive.

res_io = benchmark_io(tmpdir = "some_other_directory")

As before, you can compare your results to previous results via

plot(res_io)

Parallel benchmarks

The benchmark functions above have a parallel option - simply specify the number of cores you want to test. For example, to test using four cores

res_par = benchmark_std(runs = 3, cores = 4)

Previous versions of the package

This package was started around 2015. However, multiple changes in the byte compiler over the last few years have made it very difficult to use previous results. Essentially, detecting if and how the byte compiler was being used became nigh on impossible. Also, R has just "got faster", so it doesn't make sense to compare benchmarks between different R versions. So we have to start from scratch (I did spend a few days trying to salvage something, but to no avail).

The previous data can be obtained via

data(past_results, package = "benchmarkmeData")

Machine specs

The package has a few useful functions for extracting system specs:

RAM: get_ram()

CPUs: get_cpu()

BLAS library: get_linear_algebra()

Is byte compiling enabled: get_byte_compiler()

General platform info: get_platform_info()

R version: get_r_version()

The above functions have been tested on a number of systems. If they don’t work on your system, please raise a GitHub issue.

Uploaded datasets

A summary of the uploaded datasets is available in the benchmarkmeData package

data(past_results_v2, package = "benchmarkmeData")

A column of this data set contains the unique identifier returned by the upload_results() function.

What’s uploaded

Two objects are uploaded:

Your benchmarks from benchmark_std or benchmark_io;

A summary of your system information (get_sys_details()).

The get_sys_details() returns:

Sys.info();

get_platform_info();

get_r_version();

get_ram();

get_cpu();

get_byte_compiler();

get_linear_algebra();

installed.packages();

Sys.getlocale();

The benchmarkme version number;

Unique ID - used to extract results;

The current date.

The function Sys.info() does include the user and node names. In the public release of the data, this information will be removed. If you don’t wish to upload certain information, just set the corresponding argument.
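For instance, to withhold Sys.info() from the upload (the argument name below is from memory and may differ; check ?get_sys_details and ?upload_results for the exact interface):

```r
# Hypothetical sketch: exclude Sys.info() from the uploaded system summary
upload_results(res, args = list(sys_info = FALSE))
```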

We’re Hiring: Data Scientist
Mon, 28 Jan 2019

Jumping Rivers is a data science company based in Newcastle. We are not sector based and our clients range across all industries. We are looking for individuals who enjoy a challenge.

Main Duties

Provide technical training

Development of bespoke statistical algorithms

Building web applications using R and Shiny

Data analysis using R and/or

satRdays Newcastle 2019 Conference is Here!
Fri, 25 Jan 2019

We are pleased to announce the very first satRday event in Newcastle upon Tyne (and England). satRdays Newcastle is a one-day, low-cost, community-organised R conference in the heart of Newcastle city centre.

Where?

The event will be held at Newcastle University. Getting to Newcastle is really easy:

Train: 90 minutes from Edinburgh or 3

Picture Credit

R Conference Costs v2.0
Fri, 25 Jan 2019

Last year we gave you a price breakdown of some of the most popular R conferences around the globe for 2017. We’re going to do it again for 2018. Remember, you can get up-to-date information on upcoming conferences via our GitHub page.

It’s important to note that these costs are the prices of an industry ticket for the conference only. If you caught the tickets on early bird or are an academic/student, you could see these prices fall by over 50% in some cases. I’ll also mention extra pricing below.

Conference             | Cost ($) | #Days | Cost/Day
-----------------------|----------|-------|---------
rstudio::conf 2019     | 795      | 2     | 398
eRum 2018              | 311      | 3     | 104
WhyR 2018              | 170      | 4     | 42
Earl London 2018       | 1170     | 3     | 390
satRday Amsterdam 2018 | 45       | 1     | 45
New York R             | 750      | 2     | 375

There’s quite a difference in the price per day. A simple bar plot will highlight this
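That plot can be recreated from the table above; here is a base-R sketch (the original post's chart may well have used ggplot2):

```r
# Conference costs from the table above
confs = data.frame(
  conference = c("rstudio::conf 2019", "eRum 2018", "WhyR 2018",
                 "Earl London 2018", "satRday Amsterdam 2018", "New York R"),
  cost = c(795, 311, 170, 1170, 45, 750),
  days = c(2, 3, 4, 3, 1, 2)
)
confs$cost_per_day = confs$cost / confs$days
confs = confs[order(confs$cost_per_day), ]

# Horizontal bars keep the long conference names readable
par(mar = c(4, 12, 1, 1))
barplot(confs$cost_per_day, names.arg = confs$conference,
        horiz = TRUE, las = 1, xlab = "Cost per day ($)")
```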

Now, with add-ons it could have been a whole different story. For example, if you wanted to attend any of the one or two-day workshops held before the rstudio::conf 2018 conference itself, you’re looking at adding an extra $995-$1,500 overall. However, conferences like eRum and WhyR have no extra pricing.

Cost, but what about ...

Clearly, the cost is only one of many factors used when deciding to attend a conference. Location, networking, date and speakers all play a part. In particular, we are planning on attending the rstudio::conf in 2020, even though it's one of the more expensive events (but it looks fantastic!). In fact, next year the conference is in San Francisco (Jan 27-30th, 2020). The first 100 ticket purchasers will get the special price of $450!