The data.table package provides perhaps the fastest way for data wrangling in R. The syntax is concise and is made to resemble SQL. After studying the basics of data.table and finishing this exercise set successfully you will be able to start easing into using data.table for all your data manipulation needs. We will use data
Related exercise sets:
Vector exercises...

Some time ago I started working with Bayesian methods, using the great rstanarm-package. Beside the fantastic package-vignettes, and books like Statistical Rethinking or Doing Bayesion Data Analysis, I also found the ressources from Tristan Mahr helpful to both better understand Bayesian analysis and rstanarm. This motivated me to implement tools for Bayesian analysis into my

Over the last few days, Rcpp passed another noteworthy hurdle. It is now used by over 10 percent of packages on CRAN (as measured by Depends, Imports and LinkingTo, but excluding Suggests). As of this morning 1130 packages use Rcpp out of a total of...

If you want to work on large data science projects (analyses and machine learning) you need to be able to perform dozens of small tasks ...
For example, you'll need to be able to fluently perform dozens of little bits of data wrangling, just like this ...
The post Simple practice: data wrangling the iris dataset appeared first on SHARP SIGHT...

Organising useR!2017 was a challenge but a very rewarding experience. With about 1200 attendees of over 55 nationalities exploring an interesting program, we believe it is appropriate to call it a success - something the aftermovie only seems to confirm.
Behind the Scenes
To give you a glimpse behind the scenes of the conference organization, Maxim Nazarov held...

What do women do in films? If you analyze the stage directions in film scripts — as Julia Silge, Russell Goldenberg and Amber Thomas have done for this visual essay for ThePudding — it seems that women (but not men) are written to snuggle, giggle and squeal, while men (but not women) shoot, gallop and strap things to other...

I’ve blathered about my crawl_delay project before and am just waiting for a rainy weekend to be able to crank out a follow-up post on it. Working on that project involved sifting through thousands of Web Archive (WARC) files. While I have a nascent package on github to work with WARC files it’s a tad... Continue reading →

The R package seplyr supplies a few neat new coding notations. An Abacus, which gives us the term “calculus.” The first notation is an operator called the “named map builder”. This is a cute notation that essentially does the job of stats::setNames(). It allows for code such as the following: library("seplyr") names

Contributing to an open-source community without contributing code is an oft-vaunted idea that can seem nebulous. Luckily, putting vague ideas into action is one of the strengths of the rOpenSci Community, and their package onboarding system offers a chance to do just that.
This was my first time reviewing a package, and, as with so many things in life, I...

Take a look at the data
This is a phrase that comes up when you first get a dataset.
It is also ambiguous. Does it mean to do some exploratory modelling? Or make some histograms, scatterplots, and boxplots? Is it both?
Starting down either path, you often encounter the non-trivial growing pains of working with a new dataset. The mix ups of...

What is a choice simulator? A choice simulator is an online app or an Excel workbook that allows users to specify different scenarios and get predictions. Here is an example of a choice simulator. Choice simulators have...

Today, we’re continuing our blog series on new features in RStudio 1.1. If you’d like to try these features out for yourself, you can download a preview release of RStudio 1.1.
Object Explorer
You might already be familiar with the Data Viewer in RStudio, which allows for the inspection of data frames and other tabular R objects available in your R...

routr is now available on CRAN, and I couldn’t be happier. It’s release marks
the completion of an idea that stretches back longer than my attempts to bring
network visualization and ggplot2 together (see this post for ref).
While my PhD was stil...

I have a new visual essay up at The Pudding today, using text mining to explore how women are portrayed in film.
The R code behind this analysis in publicly available on GitHub.
I was so glad to work with the talented Russell Goldenberg and...

The recent release of the blscrapeR package brings the “tidyverse” into the fold. Inspired by my recent collaboration with Kyle Walker on his excellent tidycensus package, blscrapeR has been optimized for use within the tidyverse as of the current ...

This example groups stocks together in a network that highlights associations within and between the groups using only historical price data. The result is far from ground-breaking; you can already guess the output. For the most part, the stocks get grouped together into pretty obvious business sectors.
Despite the obvious result, the process of teasing out latent groupings from historic...

After blogging break caused by writing research papers, I managed to secure time to write something new about time series forecasting. This time I want to share with you my experiences with seasonal-trend time series forecasting using simple regression trees. Classification and regression tree (or decision tree) is broadly used machine learning method for modeling. They are favorite because...

Give me fuel, give me fire, reduced deprivation's my desire -
First attempts with simple features
The latest edition of the Scottish Index of Multiple Deprivation (SIMD) was released last year, and has been getting a bit more promotion recen...

Do you want to get your own simmer hexagon sticker? Just fill in this form and get one send to you for free.
Check out r-simmer.org or CRAN for more information on simmer, a discrete-event simulation package for R.

I will be at the AI Summit in San Francisco next month, which means I can't make it to Ignite in Orlando this year. Which is a bit of a shame, because there's a fantastic Data Science track at Ignite. There are 25 sessions on offer, with presentations from my Microsoft colleagues on Microsoft R, Cognitive Toolkit, Bot Framework,...

A/B Testing is a familiar task for many working in business analytics. Essentially, A/B Testing is a simple form of hypothesis testing with one control group and one treatment group. Classical frequentist methodology instructs the analyst to estimate the expected effect of the treatment, calculate the required sample size, and perform a test to determine
Related exercise sets:
Hacking statistics...

Background Sometimes we might want to compare three or four tube types for a particular analyte on a group of patients or we might want to see if a particular analyte is stable over time in aliqioted samples. In these experiments are essentially doing the multivariable analogue of the paired t-test. In the tube-type experiment, … Continue reading Compare...

Just before the summer holidays, BNOSAC presented a talk called Computer Vision and Image Recognition algorithms for R users at the UseR conference. In the talk 6 packages on Computer Vision with R were introduced in front of an audience of about 250 p...

A researcher was presenting an analysis of the impact various types of childhood trauma might have on subsequent substance abuse in adulthood. Obviously, a very interesting and challenging research question. The statistical model included adjustments for several factors that are plausible confounders of the relationship between trauma and substance use, such as childhood poverty. However, the model also include...

The last months, I have worked on brand logo detection in R with Keras. Starting with a model from scratch adding more data and using a pretrained model. The goal is to build a (deep) neural net that is able to identify brand logos in images.
Just ...

As data analysts, we all have tasks that we enjoy more than others. Some like the exploratory analysis steps, some like statistical computing, while others enjoy visualizing and communicating the results of their analyses. But we have yet to meet a dat...

As data analysts, we all have tasks that we enjoy more than others. Some like the exploratory analysis steps, some like statistical computing, while others enjoy visualizing and communicating the results of their analyses. But we have yet to meet a dat...

The dropout approach developed by Hinton has been widely employed in deep learnings to prevent the deep neural network from overfitting, as shown in https://statcompute.wordpress.com/2017/01/02/dropout-regularization-in-deep-neural-networks. In the paper http://proceedings.mlr.press/v38/korlakaivinayak15.pdf, the dropout can also be used to address the overfitting in boosting tree ensembles, e.g. MART, caused by the so-called “over-specialization”. In particular, while first few

In the previous post (https://statcompute.wordpress.com/2017/06/29/model-operational-loss-directly-with-tweedie-glm), it has been explained why we should consider modeling operational losses for non-material UoMs directly with Tweedie models. However, for material UoMs with significant losses, it is still beneficial to model the frequency and the severity separately. In the prevailing modeling practice for operational losses, it is often convenient to