Tuesday, 19 July 2016

This post would probably be the last in my series about merging R and ArcGIS. In August unfortunately I would have to work for real and I will not have time to play with R-Bridge any more.
In this post I would like to present a toolbox to perform some introductory point pattern analysis in R through ArcGIS. Basically, I developed a toolbox to perform the tests I presented in my previous post about point pattern analysis. In there, you can find some theoretical concepts that you need to know to understand what this toolbox can do.
I will start by introducing the sample dataset we are going to use, and then simply show the packages available.

Dataset
For presenting this toolbox I am using the same dataset I used for my previous post, namely the open crime data from the UK. For this post I downloaded crimes in the London area from the whole 2015. As you can see from the image below we are talking about more than 950'000 crimes of several categories, all across London.

I also included a polygon shapefile with the area around London and all its boroughs, this should be visible as blue lines around the city. I included this because point pattern analysis requires the user to set the border of the study area, as I mentioned in my previous post.

Spatio-Temporal Subset
The first package I would like to present is a simple spatio-temporal subsetting tool. This is completely based on R but it is basically a more flexible version of the selection tools available in ArcGIS.

Here users can select points based on various parameters at once. For example, they can subset the polygon shapefile, for example here I'm extracting the borough of Ealing, and extract points only for this area. Then they can subset by time, with the same strings I presented in my previous post about a toolbox for time series analysis. Optionally, they can also subset the dataset itself based on some categories. In this example I'm extracting only the drug related crimes, committed in Ealing in May 2015.
It is important to point out that in this first version of the toolbox users can only select one element in the SQL statements. For example here I have "name" = 'Ealing'. In ArcGIS users could also put an AND and then specify another option. However, in the R code I did not put a way to deal with multiple inputs and conditions (e.g. AND, OR) and therefore only one option can be specified.
The result is a new shapefile, plotted directly on the ArcGIS console with the required subset of the data, as shown below:

Spatio-Temporal Centroid
As you may already know, ArcGIS provides a function to calculate the centroid of a point pattern. However, if we wanted to test for changes in the centroid location with time we would need to first subset our data and then compute the centroid. What I did in this package is merge these two actions into one. This package, presented in the image below, loops through the dataset, subsetting the point pattern by time (users can choose between daily, monthly and yearly subsets) and then calculates the centroid for each time unit. Moreover, I also added an option to select the statistics to use between mean, median and mode.

The results for the three statistics are presented below:

Spatio-Temporal Density
This tool calculates the point density for specific regions and time frames by subsetting your dataset. This is something that you may be able to obtain directly from ArcGIS, but users would need to first subset their data and then perform the density analysis, this tool groups those two things into one. Moreover, the package spatstat, which is used in R for point pattern analysis has some clear advantages compared to the tool available in ArcGIS. For example, as I mentioned in my post it provides ways to calculate the best bandwidth for the density estimation. In the script this is achieved using the function bw.ppl, but this can be changed if you need a different method, you just need to replace this function with another. Moreover, as pointed out in this tutorial, ArcGIS does not correct for edge effects.
Working with this package is very similar to the others I presented before:

Users need to specify the input point pattern, then a polygon shapefile for the study area, which can be subset to reduce the area under investigation. Then users can include a temporal subsetting (here I used the string "2015-10/" which means from October to the end of the year, please refer to this post for more info) and subset their data extracting a certain category of crimes. Again here the SQL statements cannot include more than one category.
Finally, users need to provide a raster dataset for saving the density result. This needs to be a .tif file, otherwise in my tests the result did not appear on screen. The output of this script is the image below, for the borough of Bromley and only for robberies:

Spatio-Temporal Randomness
This is another tool to perform a test for spatial randomness, the G function I explained in my previous post, but on a subset of the main dataset. In fact, this test is available in ArcGIS under "Multi-Distance Spatial Cluster Analysis (Ripleys K Function)", but in this case we are again performing it on a particular subset of our data.
The GUI is very similar to the other I presented before:

The only difference is that here users also need to provide an output folder, where the plot created by R will be saved in jpeg at 300 dpi. Moreover, this tool also provides users with the point shapefile created by subsetting the main dataset.
The output for the borough of Tower Hamlets and only for drug related crimes in March 2015 is the plot below:

Spatio-Temporal Correlogram
As the name suggests I develop this tool to calculate and plot a correlogram on a spatio-temporal subset of my data. For this example I could not use the crime dataset, since I do not have a continuous variable in it. Therefore I loaded the dataset of ozone measurements from sensors installed on trams here in Zurich that I used for my post about spatio-temporal kriging. This tool uses the function correlog from the package xts to calculate the correlogram. This function takes several arguments among which an increment, the number of permutations and a TRUE/FALSE flag if data are unprojected or not. These are all data that users will need to input once they use the tool and are additional options in the GUI, which for the other points is more or less identical to what I presented before, except for the selection of the variable of interest:

The result is the image below, which is again saved in jpeg at 300 dpi. As for the spatio-temporal randomness tool, a shapefile with the spatio-temporal subset used to calculate the correlogram is also saved and opened in ArcGIS directly.

Friday, 15 July 2016

In this post I will introduce another toolbox I created to show the functions that can be added to ArcGIS by using R and the R-Bridge technology.
In this toolbox I basically implemented the functions I showed in the previous post about time series analysis in R.
Once again I prepared a sample dataset that I included in the GitHub archive so that you can reproduce the experiment I'm presenting here. I will start my description from there.

Dataset
As for my previous post, here I'm also including open data in shapefile from the EPA, which I downloaded for free using the custom R function I presented here.
I downloaded only temperature data (in F) from 2013, but I kept two categorical variables: State and Address.

As you can see from the image above the time variable is in the format year-month-day. As I mentioned in the post about the plotting toolbox, it is important to set this format correctly so that R can recognize it. Please refer to this page for more information about the formats that R recognizes.

Time Series Plot
This type of plot is available in several packages, including ggplot2, which I used to create the plotting toolbox. However, in my post about time series analysis I presented the package xts, which is very powerful for handling and plotting time-series data. For this toolbox I decided to maintain the same package and refer everything to xts for several reasons that I would explain along the text.
The first reason is related to the plotting capabilities of this package. Let's take a look for example at the first script in the toolbox, specific for plotting time series.

Similarly to the script for time series in the plotting toolbox, here users need to select the dataset (which can be a shapefile or a CSV, or any other table format that can be accessed in ArcGIS). Then they need to select the variable of interest, in the sample dataset that is Temp, which clearly stands for temperature. Another important information for R is the data/time column and its format, again please refer to my previous post for more information. Finally, I inserted an SQL call to subset the dataset. In this case I'm subsetting a particular station.
The result is the plot below:

As you can see there are quite a few missing values in the dataset related to the station I subset. The very nice thing about the package xts is that with this plot it is perfectly clear where are the missing data, since along the X axis these are evident by the lack of grey tick marks.

Time Histogram
This is a simple bar chart that basically plots time against frequency of samples. The idea behind this plot is to allow users to explore the number of samples for specific time intervals in the dataset.

The user interface is similar to the previous scrips. Users need to select the dataset, then the variable and then the time column and specify its format. I also included an option to select a subset of the dataset with a SQL selection. At this point I included a list to select the averaging period, and users can select between day, month or year. In this case I selected month, which means that R will loop through the months and subset the dataset for each of these. Then it will basically count the number of data sampled in each month and plot this information against the month itself. The result is the plot below:

As you can see we can definitely gather some useful information from this plot; for example we can determine that basically this station, in the year 2013, did not have any problem.

Time Subset
In some cases we may need to subset our dataset according to a particular time period. This can be done in ArcGIS with the "Select by Attribute" tool and by using an SQL string similar to what you see in the image below:

The package xts however, provides much more powerful and probably faster ways to subset by time. For example, in ArcGIS if we want to subset the whole month of June we would need to specify an SQL string like this:
"Time" >= '2013-06-01' AND "Time" < '2013-07-01'

On the contrary, in R and with the package xts if we wanted to do the same we would just need to use the string '2013-06', and R would know to keep only the month of June. Below are some other examples of successful time subset with the package xts (from http://www.inside-r.org/packages/cran/xts/docs/xts):

sample.xts['2013'] # all of 2013
sample.xts['2013-03'] # just March 2013
sample.xts['2013-03/'] # March 2013 to the end of the data set
sample.xts['2013-03/2013'] # March 2013 to the end of 2013
sample.xts['/'] # the whole data set
sample.xts['/2013'] # the beginning of the data through 2013
sample.xts['2013-01-03'] # just the 3rd of January 2013

With this in mind I created the following script:

As you can see from the image above, here there is an option named "Subset" where users can insert one of the strings from the examples above (just the text within square brackets) and select time intervals with the same flexibility allowed in R and the package xts.
The result of this script is a new shapefile containing only the time included in the Subset call.

Time Average Plots
As I showed in my previous post about time series analysis, with the package xts is possible to perform custom functions on specific time intervals with the following commands: apply.daily, apply.weekly, apply.monthly and apply.yearly.
In this toolbox I used these functions to compute the average, 25th and 75th percentiles for specific time intervals, which the user may choose. This is the toolbox:

The only differences from the other scripts are the "Average by", with which the user can select between day, week, month or year. Each of these will trigger the appropriate apply function. Then there is also the possibility to select the position for the plot legend: between topright, topleft, bottomright and bottomleft. Finally, users can select the output folder where the plot below will be saved, along with a CSV with the numerical values for mean, q25 and q75.

Time Function
This is another script that provides direct access to the apply functions I presented before. Here the output is not a plot but a CSV with the results of the function, and users can input their own function directly in the GUI. Let's take a look:

As you can see there is a field named "Function". Here users can insert their own custom function, written in the R language. This function takes a vector (x) and returns a vector and it is in the form:

function(x){sum(x>70}

Only the string within curly brackets needs to be written in the GUI. This will then be passed to the script and applied to the values averaged by day, week, month or year. Users can select this last aspect in the field "Average by". Here for example I am calculating the number of days, for each month, with a temperature above 70 degrees Fahrenheit (21 degrees celsius) in Alaska. The results are saved in CSV in the output folder and printed on screen, as you can see from the image below.

Trend Analysis
In this last script I included access to the function decompose, which I briefly described in my previous post. This function does not work with xts time series, so the time series needs to be loaded with the standard method, ts, in R. This method requires the user to include the frequency of the time series. For this reason I had to add an option for this in the GUI.
Unfortunately, the dataset I created for this experiment only has one full year and thus making a decomposition does not make much sense, but you are invited to try with your data and it should work fine and provide you with results similar to the image below:

Monday, 11 July 2016

In this post I present my third experiment with R-Bridge. The plotting toolbox is a plug-in for ArcGIS 10.3.x that allows the creation of beautiful and informative plot, with ggplot2, directly from the ESRI ArcGIS console.
As always I not only provide the toolbox but also a dataset to try it out. Let's start from here...

Data
For testing the plotting tool, I downloaded some air pollution data from EPA (US Environmental Protection Agency), which provides open access to its database. I created a custom function to download data from EPA that you can find in this post.
Since I wanted to provide a relatively small dataset, I extracted values from only four states: California, New York, Iowa and Ohio. For each of these, I included time series for Temperature, CO, NO2, SO2 and Barometric Pressure. Finally, the coordinates of the points are the centroid for each of these four states. The image below depicts the location and the first lines of the attribute table. This dataset is provided in shapefile and CSV, both can be used with the plotting toolbox.

Toolbox
Now that we have seen the sample dataset, we can take a look at the toolbox. I included 5 packages to help summarize and visualize spatial data directly from the ArcGIS console.

I first included a package for summarizing our variables, this creates a table with some descriptive statistics. Then I included all the major plotting types I presented in my book "Learning R for Data Visualization [VIDEO]" edited by Packt Publishing, with some modifications to the scripts for adapting them to the R-Bridge Technology from ESRI. For more information, practical and theoretical, about each of these plots please refer to the book.

Summary
This is the first package I would like to describe simply because a data analysis should always start with a look at our variables with some descriptive statistics. As for all the tools presented here its use is simple and straightforward with the GUI presented below:

Here the user has to point to the dataset s/he wants to analyze in "Input Table". This can be a shapefile, either loaded already in ArcGIS or not, but it can also be a table, for example a CSV. That is the reason why I included a CSV version of the EPA dataset as well.
At this point the area in "Variable" will fill up with the column names of the input file, from here the user can select the variable s/he is interested in summarizing. The final step is the selection of the "Output Folder". Important: users need to first create the folder and then select it. This is because this parameter is set as input, so the folder needs to exist. I decided to do it this way because otherwise for each new plot a new folder would have need to be created. This way all the summaries and plots can go into the same directory.
Let's take a look at the results:

The Summary package presents two tables, with all the variables we wanted to summarize, arranged one above the other. This is the output users will see on screen once the toolbox has competed its run. Then in the output folder, the R script will save this exact table in PDF format, plus a CSV with the raw data.

Histogram
As the name suggest this tool provides access to the histogram plot in ggplot2. ArcGIS provides a couple of ways to represent data in histograms. The first way is by clicking with the right mouse button on one of the column in the attribute table; a drop-down menu will appear and from there users can click on "Statistics" to access some descriptive statistics and a histogram. Another way is through the "Geostatistical Analyst", which is an add-on for which users need an additional license. This has an "Explore Data" package from which it is possible to create histograms. The problem with both these methods is that the results are not, in my opinion at least, suitable for any publication. You can maybe share them with your colleagues, but I would not suggest using them for any article. This implies that ArcGIS users need to open another software, maybe Excel, to create the plots they require, and we all know how painful it is in Excel to produce any decent plot, and histogram are probably the most painful to produce.
This changes now!!

By combining the power of R and ggplot2 with ArcGIS we can provide users with an easy to use way to produce beautiful visualizations directly within the ArcGIS environment and have them saved in jpeg at 300 dpi. This is what this set of packages does.
Let's now take a look at the interface for histograms:

As for Summary we first need to insert the input dataset, and then select the variable/s we want to plot. If two or more variables are selected, several plots will be created and saved in jpeg.
I also added some optional values to further customize the plots. The first is the faceting variable, which is a categorical variable. If this is selected the plot will have a histogram for each category; for example, in the sample dataset I have the categorical variable "state", with the name of the four states I included. If I select this for faceting the result will be the figure below:

Another option available here is the binwidth. The package ggplot2 usually sets this number as the range of the variable divided by 30. However, this can be customized by the user and this is what you can do with this option. Finally users need to specify an output folder where R will save a jpeg of the plots shown on screen.

Box-Plot
This is another great way to compare variables' distributions and as far as I know it cannot be done directly from ArcGIS. Let's take a look at this package:

This is again very easy to use. Users just need to set the input file, then the variable of interest and the categorical variable for the grouping. Here I am again using the states, so that I compare the distribution of NO2 across the four US states in my dataset.
The results is ordered by median values and it is shown below:

As you can see I decided to plot the categories vertically, as to accomodate long names. This of course can be changed by tweaking the R script. As for each package, this plot is saved in the output folder.

Bar Chart
This tool is generally used to compare different values for several categories, but generally we have one value for each category. However, it may happen that the dataset contains multiple measurements for some categories, and I implemented a way to deal with that here. Below is presented the GUI to this package:

The inputs are basically the same as for box-plots, the only difference is in the option "Average Values". If this is set, R will average the values of the variable for each unique category in the dataset. The results are again ordered, and are saved in jpeg in the output folder:

Scatterplot
This is another great way to visually analyze our data. The package ggplot2 allows the creation of highly customized plots and I tried to implement as much as I could in terms of customization in this toolbox. Let's take a look:

After selecting the input dataset the user can select what to plot. S/he can choose to only plot two variables, one on the X axis and one on the Y axis, or further increase the amount of information presented in the plot by including a variable that changes the color of the points and one for their size. Moreover, there is also the possibility to include a regression line. Color, size and regression line are optional, but I wanted to include them to present the full range of customizations that this package allows.

Once again this plot is saved in the output folder.

Time Series
The final type of plots I included here is the time series, which is also the one with the highest number of inputs from the user side. In fact, many spatial datasets include a temporal component but often time this is not standardized. By that I mean that in some cases the time variable has only a date, and in some cases it includes a time; in other cases the format changes from dataset to dataset. For this reason it is difficult to create an R script that works with most datasets, therefore for time-series plots users need to do some pre-processing. For example, if date and time are in separate columns, these need to be merged into one for this R script to work.
At this point the TimeSeries package can be started:

The first two columns are self explanatory. Then users need to select the column with the temporal information, and then input manually the format of this column.
In this case the format I have in the sample dataset is the following: 2014-01-01
Therefore I have the year with century, a minus sign, the month, another minus sign and the day. I need to use the symbols for each of these to allow R to recognize the temporal format of the file.
Common symbol are:
%Y - Year with century
%y - Year without century
%m - Month
%d - Day
%H - Hour as decimal number (00-23)
%M - Minute as decimal number (00-59)
%S - Second as decimal number

The remaining two inputs are optional, but if one is selected the other needs to be provided. For "Subsetting Column" I intend a column with categorical information. For example, in my dataset I can generate a time-series for each US state, therefore my subsetting column is state. In the option "Subset" users need to write manually the category they want to use to subset their data. Here I just want to have the time-series for California, so I write California. You need to be careful here to write exactly the name you see in the attribute table because R is case sensitive, thus if you write california with a lower case c R will be unable to produce the plot.
The results, again saved in jpeg automatically, is presented below:

Friday, 8 July 2016

With this post I would like to talk again about R-Bridge, which allows a direct communication between ArcGIS and R.
In the previous post, I presented a very simple application of R-Bridge where I built a toolbox to perform k-means clustering on point shapefiles. The purpose of that post was to explain the basics of the technology, but its scientific scope was limited. However, I would like to try step by step to translate more of my R scripts in ArcGIS, so that more people can use them even if they are not experts in R.
In this post I will start presenting a toolbox to handle NetCDF files downloaded from the European Centre for Medium-Range Weather Forecasts (ECMWF).

ECMWF
The European Centre for Medium-Range Weather Forecasts provides free access to numerous weather data through their website. You can go directly to this page to take a look at the data available: http://www.ecmwf.int/en/research/climate-reanalysis/browse-reanalysis-datasets
The data are freely accessible and downloadable (for research purposes!!) but you need to register to the website to be able to do so.

The data are provided in large NetCDF files, which include all the weather variables I selected and for the entire time frame. In R, NetCDF files can be easily imported, using the packages raster and ncdf4, as a raster brick. This will have a X and Y dimensions, plus a time dimension for each of the variables I decided to download. Since I wanted 5 years and the ERA dataset include four datasets per day, I had quite a lot of data.
I decided that for the research I was planning to do I did not need each and every one of these rasters, but rather some time averages, for example the monthly averages for each variable. Therefore I created an R script to do the job.

Now I decided to use R-Bridge to implement the R script I developed in ArcGIS. This should allow people not familiar with R to easily create time averages of weather reanalysis data provided by ECMWF.

Toolbox
I already covered in the previous post the installation process of R-Bridge and explained how to create a new toolbox with a script, so if you do not know how to do these thing please refer to this post.
For this script I created a simple GUI with two inputs and one output:

The first is used to select the NetCDF file, downloaded from the ECMWF website, on the users' computer. The second is a list of values from which the user can select what type of time average to perform:

Users can select between four types: hourly, daily, monthly and yearly averages. In each case, R will select the unique values for each of these categories and create average rasters. Please remember that ECMWF does not provide hourly data, but only observations at specific time of day, every 6 hours or so; therefore, do not expect the script to generate hourly rasters, but only averages for these time slots.

The final thing that users need to provide is the output folder, where R will save all the rasters. This needs to be a new folder!! R will then create the new folder on disk and then save the rasters in there.
For the time being I do not have a way to plot these rasters onto ArcGIS after the script is completed. In theory with the function writeRaster in the raster package it is possible to export a raster to ArcGIS directly, but users would need to provide the name of this output raster in the Toolbox GUI and in this case this is not possible because many rasters are created at once. I also tried to create another toolbox in Model Builder where the R script was followed by an iterator that should have opened the rasters directly from the output folder, but it does not work. If you have any suggestion for doing this I would like to hear them. In any case this is not an issue, the important thing is being able to produce average rasters from NetCDF files.

R Script
In the final part of the post I will present the R script I used for this Toolbox. Here is the code:

As I described in my previous post, the R code for an ArcGIS Toolbox needs to be included in a function that takes inputs and outputs from the ArcGIS GUI.
In this function the very first thing we need to do is load the required packages, with an option to install them if necessary. Then we need to assign object names to the inputs and output parameters.
As you can see I included quite a few print calls so that ArcGIS users can easily follow the process.

At this point we can move to the NetCDF file. As I mentioned, I will be using the raster package to import the NetCDF file, but first I need to open it directly with the package ncdf4 and the function nc_open. This is necessary to obtain the list of variables included in the file. In my tests I downloaded the temperature at 2m above ground and the albedo, therefore the variables' names were d2m and alb. Since these names are generally not known by end users, we need a way to extract them from the NetCDF file, which is provided by these lines.

Once we have that, we can start a for loop when we iterate through the variables. As you can see the first line within the loop needs to import the nc file with the function brick in the package raster. Within this function we need to specify the name of the variable to use. The raster names include the temporal information we are going to need to create the time averages. For this reason I created an object called TIME with POSIXct values created starting from these names, and then I collected these value into a data.frame. This will be used later on to extract only the indexes of the rasters with the correct date/time.

Now I set up a series of if statements that trigger certain actions depending on what the user selected in the Time Average list on the GUI. Let us assume that the user selected "Daily Averages".
At this point R first uses the function format to extract the days from the data.frame with date/time, named df. Then extracts the unique values from this list. The next step involves iterating through these days and create an average raster for each of these. This can be done with the function calc in raster. This function takes a series of rasters, a function (in this case mean) and can also save a raster object on disk. For the correct address for the output file I simply used the function paste to name the file according to the variable and day. The exact same process is performed for the other time averages.

Saturday, 2 July 2016

Last year at the ESRI User Conference in San Diego, there was an announcement of an initiative to bridge ArcGIS and R. This became reality I think early this year with R-Bridge.
Basically, ESRI has created an R library that is able to communicate and exchange data between ArcGIS and R, so that we can create ArcGIS toolboxes using R scripts.

I am particularly interested in this application because R has become quite powerful for spatial data analysis in the last few years. However, I have the impression that within the geography community, R is still considered a bit of an outsider. This is because the main GIS application, i.e. ArcGIS, is based on Python and therefore courses in departments of geography and geomatics tend to focus on teaching Python, neglecting R. This I think is a mistake, since R in my opinion is easier to learn for people without a background in computer science, and has very powerful libraries for spatio-temporal data analysis.
For these reasons, the creation of the R-Bridge is particularly welcomed from my side because it will allow me to teach students how to create powerful new Toolboxes for ArcGIS based on scripts created in R. For example, this Autumn semester we will implement in the course of GIS III a module about geo-sensors, and then I will teach spatio-temporal data analysis using R within ArcGIS. This way students will learn the power of R starting from the familiar environment and user interface of ArcGIS.
Since I never worked with R-Bridge before, Today I started doing some testing and I decided that the best way to learn it was to create a simple Toolbox to do K-Means clustering on point shapefiles, which I think is a function not available in ArcGIS. In this post I will describe in details how to create the Toolbox and the R Script to perform the analysis.

R-Bridge Installation
Installing R-Bridge is extremely simple. TYou only need a recent version of R (I have the 3.0.0) installed on your PC (32-bit or 64-bit, consistent with the version of ArcGIS you have installed) and ArcGIS 10.3.1 or more recent.
At this point you can download the installation files from the R-Bridge GitHub page: https://github.com/R-ArcGIS/r-bridge-install
You can unzip its content anywhere on your PC. At this point you need to run ArcGIS as aministrator (this is very important!!), and then in the ArcCatalog navigate to the folder where you unzip the zip you just downloaded.

Now you just need to run the script "Install R Bindings" and ArcGIS will take care of the rest. I found the process extremely easy!!

Getting Started
ESRI created two examples to help us get started with the development of packages for ArcGIS written in the R language. You can find it here: https://github.com/R-ArcGIS/r-sample-tools
When you unzip this you will find a folder named "Scripts" where you can find R scripts optimized for the use in ArcGIS. I started from these to learn how to create scripts that work.

Clustering Example - R Script
As I said, ESRI created a specific library for R to be able to communicate back and forth with ArcGIS: it is called "arcbinding" and it is installed during the installation process we completed before. This library has a series of functions that allow the R script to be run starting from the ArcGIS console and its GUI. For this reason the R script is a bit different compared to the one you would do to reach the same result outside of ArcGIS. Probably it is better if I just start including some code so that you can better understand.
Below is the full R script I used for this example:

As you can see, the whole script is wrapped in a function called tool_exec with two arguments in_params and out_params. These are the list of input and output parameters that will be passed to R from ArcGIS.
The next three lines are taken directly from the script that ESRI provides. Basically, if the user does not have the package sp installed, R will download, install and load it. You can copy and paste these lines if you need other packages installed on the user's machine to perform your analysis. In this case I am only using the function kmeans, available in the package stats, which is loaded by default in R.
At this point I inserted two print calls with the title and my name on them. This has no real purpose except to let you know that you can print information from R directly onto the dialog in ArcGIS with simple print calls. We will see at the end how they look.
Now we need to create an object for each input and output parameter you need. We will need to specify these in ArcGIS once we create the Toolbox. Since I want to cluster a shapefile, the first input parameter will be this object. Then I want the user to select the number of clusters, so I will create another option for this. Then I would like the user to be able to select the variables s/he wants to use for clustering, so I will need to create an option for this in ArcGIS and then collect it into the object variable. Finally, ArcGIS will save another shapefile with the points plus their cluster. This will be the only output parameter, and I collect it into the object out_shape.

Now I can start the real computation. The function arc.open allows to import in R the shapefile from the Toolbox in ArcGIS. If you want you can take a look at the structure of this object by simply inserting print(str(d)) right after it. This will print the structure of the object d in the dialog created in ArcGIS.
Now we have the function arc.select, which allows to extract from d only the variables we need and that are selected by the user on the Toolbox GUI.
At this point we need to create a data.frame that we are going to fill with only the variables the user selected in the Toolbox. The object variable is a list of strings, therefore we can use its elements to extract single columns from the object data, with the syntax data[,variable[1]].
Since we do not know how many variables will users select and we do not want to limit them, I created an if statement with a loop to attach additional columns to the object data_clust. Then I replaced the column names in data_clust with the names of the variables, this will help me in the next phase.
Now in fact, I want to produce histograms of the variables the user selected. This will allow me to check whether what I am about to do makes sense, and it is one of those things for which R excels. For this I can simply call the function plot and R will show it even if it is called from ArcGIS, as simple as that!! We only need to remember to insert dev.new() so that each plot is created separately and the user can see/save them all.
After this step we can call the function kmeans to cluster our data. Then we can collect the results in a new data.frame, and finally use the function arc.write to write the object out_shape with the results. As you can see we also need to specify the coordinates of each point and this can be done calling the function arc.shape.
Then we print the string "Done!!" and return the output parameters, that will be taken from ArcGIS and shown to the user.

Toolbox
Now that we've seen how to create the R script we can take a look at the Toolbox, since both things need to be done in parallel.
Creating a new Toolbox in ArcGIS is very simple, we just need to open ArcCatalog, click where we want to create it with the right mouse button and then select New->Toolbox.

Once this is done we will then need to add, within this Toolbox, a script. To do this we can again click with the right button on the Toolbox we just created and then select Add->Script...

At this point a dialog will appear where we can set the parameters of this script. First we add a title and a label and click proceed (my PC runs with Italian as the local language, sorry!!)

Then we need to select the R script we need to run. Since the creation of the Toolbox can also be done before taking care of the script, here we can select an R script not completed and ArcGIS will not have any problem. This is what I did to create this example, so that I could debug the R script using print calls and looking at the results on the ArcGIS dialog.

The next window is very important, because it allows us to set the input and output parameters that will then be passed to R. As we saw in the R script here I set 4 parameters, 3 inputs and 1 output. It is important that the order matches what we have in the R script, so for example the number of clusters is the second input.

The first parameter is the input data. For this I used the type "Table View", which allows the user to select a dataset s/he already imported in ArcGIS. I selected this because usually I first load data into ArcGIS, check them, and then perform some analysis. However, if you prefer I think you could also select the type shapefile, to allow users to select a shp directly from their PC.
The next parameter is the number of clusters, which is a simple number. Then we have the field variables. This is very important because we need to set it in a way that allow users to select variables directly from the dataset we are importing.

We can do that by setting the options "Filter" and "Obtained from" that you see in the image above. It is important that we set "Obtained from" with the name of our input data.
At this point we can set the output file, which is a shapefile.

One thing we could do is set the symbology for the shapefile that will be created at the end of the script. To do so we need to create and set a layer file. I did it by changing the symbology to another shapefile and then export it. The only problem is that this technique is not really flexible, meaning that if the layer is set for 5 clusters and users select 10, the symbology will still be with 5 colors. I am not sure if that can be changed or adapted somehow. If the symbology file is not provided the R script will still run correctly and produce a result, but this will not have any colors and users will need to set these afterwards, which probably is not a big deal.
Once this final step is done we can finish the creation of the tool and take a look at the resulting GUI:

Run the Toolbox
Now that we have created both the script and the Toolbox to run it, we can test it. I included a shapefile with the location of Earthquakes that I downloaded from the USGS website yesterday (01 July 2016). This way you can test the tool with real data. As variables you can select: depth, magnitude, distance from volcanoes, faults and tectonic plates. For more info on this dataset please look at one of my previous posts: http://r-video-tutorial.blogspot.ch/2015/06/cluster-analysis-on-earthquake-data.html
We only need to fill the values in the GUI and then click OK. You can see the result in the image below:

As you can see R first produces a histogram of the variable/s the user selects, which can be saved. Then creates a shapefile that is automatically imported in ArcGIS. Moreover, as you can see from the dialog box, we can use the function print to provide messages to the user. Here I put only some simple text, but it may well be some numerical results.