First of all, we’re happy to announce that, as of its 1.0.0 release, the stringr package for R is powered by stringi. For details, read more here.

Also please note that the stringi package version 1.0-1 is now on CRAN. Changelog:

* [GENERAL] #88: C++ API is now available for use in, e.g., Rcpp packages, see https://github.com/Rexamine/ExampleRcppStringi for an example.
* [BUGFIX] #183: Floating point exception raised in `stri_sub()` and `stri_sub

There is a time for some things, and a time for all things; a time for great things, and a time for small things — Miguel de Cervantes

Building R packages from sources may take a long time, especially if they contain a lot of C/C++/Fortran code. Long compile times can be particularly frustrating if you are a package developer who needs to recompile a project very often.

Here is how long it takes to compile the stringi package on my laptop (if the ICU library is also compiled from sources):

On many R installations, the build process is set up so that only one C/C++ source file is compiled at a time:

Yet there is a simple solution: we may ask GNU make to allow more than one job to be run at once. To do so, we edit the /lib64/R/etc/Renviron file (where /lib64/R/ is the result of a call to the R.home() function in R) and set:

Thanks to that, we may now spend the time saved enjoying whomever or whatever we love. :)

Note that MAKE is an environment variable and can also be changed from within the current R session (Sys.setenv) or when starting R from the Linux/Unix terminal (MAKE="..." R – thanks to Nick Kennedy for noticing that).
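As a minimal sketch of the session-level variant, the snippet below locates the Renviron file and sets MAKE for the current R session only. Using one job per detected core, with a fallback to 2 jobs, is an assumption made for illustration; pick whatever value suits your machine.

```r
# Where the site-wide Renviron file lives on this installation
renviron_path <- file.path(R.home("etc"), "Renviron")

# Number of parallel jobs: one per core is an assumption, not a rule;
# detectCores() may return NA, hence the fallback to 2
n_jobs <- parallel::detectCores()
if (is.na(n_jobs)) n_jobs <- 2L

# Affects only the current session (and processes started from it)
Sys.setenv(MAKE = paste0("make -j", n_jobs))
Sys.getenv("MAKE")
```

Packages installed from source later in the same session should then be built with that many parallel jobs, while the Renviron file itself stays untouched.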

A reliable string processing toolkit is a must-have for any data scientist.

A new release of the stringi package is available on CRAN (please wait a few days for Windows and OS X binary builds). As of now, about 850 CRAN packages depend (either directly or recursively) on stringi. And quite recently, the package was listed among the most downloaded R extensions.

Refer to the INSTALL file for more details if you compile stringi from sources (Linux users mostly).

Here’s a list of changes in version 0.5-2. There are many major (like date and time processing) and minor new features and enhancements, as well as bug fixes. In the current release we also focused on giving users of the stringr package an even better string processing experience, since as of its 1.0.0 release it is powered by stringi.

[BACKWARD INCOMPATIBILITY] The second argument to stri_pad_*() has been renamed width.

[NEW FEATURE] #149: stri_pad() and stri_wrap() are now by default based on code point widths instead of the number of code points. Moreover, the default behavior of stri_wrap() is now such that it does not get rid of non-breaking, zero-width, etc. spaces.

[GENERAL] #143: Searching for short fixed patterns (stri_*_fixed()) now relies on the current libC’s implementation of strchr() and strstr(). This is very fast e.g. on glibc utilizing the SSE2/3/4 instruction set.
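To make the distinction in #149 concrete, here is a small sketch comparing the number of code points with the display width of a string. It assumes a reasonably recent stringi that provides stri_width() and the use_length argument of stri_pad_*(); the sample string is ours, not taken from the changelog.

```r
library("stringi")

x <- "\u4e2d\u6587"   # two CJK ideographs, each two columns wide

stri_length(x)        # number of code points
stri_width(x)         # display width in columns

# Default: pad so that the *display width* reaches 6 columns
padded_width <- stri_pad_left(x, width = 6)

# use_length = TRUE: pad so that the *number of code points* reaches 6
padded_length <- stri_pad_left(x, width = 6, use_length = TRUE)

stri_length(padded_width)   # 2 ideographs + 2 spaces
stri_length(padded_length)  # 2 ideographs + 4 spaces
```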

Introduction

Being a teacher can be a very gratifying job. If you teach programming, which is also your favorite hobby, nothing can be better than that. Only one thing can spoil your dream: cheating students. As we all know, one can learn programming only by writing code oneself. Copying another student’s source code makes no sense at all, as the student does not learn anything and, what is more, gets points for something he/she didn’t make.

When there are only a few homework submissions to check, it is easy to do it manually. But what if there is a large number of submissions? Then we need an application to automate the process. There are some well-known tools for “standard” programming languages, such as MOSS or JPLAG for, e.g., C, C++, C#, Java, Scheme, or JavaScript.

But what if we want to automate the process of checking the similarity of R source code? Until now, no such tool has been available. But things have changed.

SimilaR

SimilaR is a service designed to detect similar source code patterns in R code snippets. To create an account, you need to have an e-mail address in an edu domain and somehow prove to us that you’re a tutor (show us your webpage, etc.). Once the account is activated, you just upload your students’ submissions and wait a moment for the results.

Let us look at a working example. Assume that one student submitted the following file:

So we log into SimilaR, choose Antiplagiarism system -> New submission and we get a picture like:

In the area marked with a green rectangle we provide a name for the new submission. We can use this name to identify a group of files. In the blue rectangle we choose the smallest group of functions (functions within one group are not compared with each other): a group of files, one file, or every function compared with every other. Since every student in our example provides her homework in a separate file, we choose the second option.

After we click Submit, we obtain:

In this view we can make sure that the system understands the uploaded files as we expect. If something is wrong, e.g. the source code has syntax errors, we will be notified at this step. Please note that there are no comments in the source code shown and that the indentation style is homogeneous. If everything is OK, we click the Confirm button.

After that we see a list of our submissions. The progress of our submission is shown and dynamically updated. When it is ready, it moves to the top of the list and we can view it.

Let us see the results. There are 4 pairs, as there were 2 functions in each file. The pairs are ordered from the most similar to the least. At first we see only the first 10 pairs, and for each pair we can indicate whether we believe it is similar or not. After evaluating some pairs (see the green rectangle), we can see more of them. This is needed because the system is based on statistical learning algorithms, and we need as much training data as we can obtain so that it becomes even more useful in the future.

Summary

We hope that SimilaR will be a useful tool, and that it will make evaluating the similarity of students’ homework faster and more accurate, as well as making a teacher’s job more convenient. With this tool, R tutors can focus on the most important thing in the teaching process: teaching, not searching for plagiarism and dishonest students. Prior to using the system, make sure you agree with the Terms and Conditions.

Summary

In this tutorial we showed how to submit a simple Map/Reduce job via the Hadoop Streaming API. Interestingly, we used an R script as the mapper and a C++ program as the reducer. In an upcoming blog post we’ll explain how to run a job using the rmr2 package.

Configuring a working Hadoop 2.6.0 environment on CentOS 7 is a bit of a struggle. Here are the steps we took to set everything up so that we have a working Hadoop cluster. Of course, there are many tutorials on this topic on the internet. None of the solutions presented there worked in our case. Thus, there is a high possibility that this step-by-step guide will also leave you very frustrated. Anyway, resolving errors generated by Hadoop should help you understand this environment much better. No pain, no gain.

A new release of the stringi package is available on CRAN (please wait a few days for Windows and OS X binary builds).

# install.packages("stringi") or update.packages()
library("stringi")

Here’s a list of changes in version 0.4-1. In the current release, we particularly focused on making the package’s interface more consistent with that of the well-known stringr package. For a general overview of stringi’s facilities and base R string processing issues, see e.g. here.

(IMPORTANT CHANGE) n_max argument in stri_split_*() has been renamed n.

(IMPORTANT CHANGE) simplify=TRUE in stri_extract_all_*() and stri_split_*() now calls stri_list2matrix() with fill="". fill=NA_character_ may be obtained by using simplify=NA.

(IMPORTANT CHANGE, NEW FUNCTIONS) #120: stri_extract_words has been renamed stri_extract_all_words, stri_locate_boundaries has been renamed stri_locate_all_boundaries, and stri_locate_words has been renamed stri_locate_all_words. New functions are now available: stri_locate_first_boundaries, stri_locate_last_boundaries, stri_locate_first_words, stri_locate_last_words, stri_extract_first_words, stri_extract_last_words.

(NEW FEATURE) #110: Fixed pattern search engine’s settings can now be supplied via opts_fixed argument in stri_*_fixed(), see stri_opts_fixed(). A simple (not suitable for natural language processing) yet very fast case_insensitive pattern matching can be performed now. stri_extract_*_fixed is again available.

(NEW FEATURE) #23: stri_extract_all_fixed, stri_count, and stri_locate_all_fixed may now also look for overlapping pattern matches, see ?stri_opts_fixed.

(NEW FEATURE) #117: stri_extract_all_*(), stri_locate_all_*(), stri_match_all_*() gained a new argument: omit_no_match. Setting it to TRUE makes these functions compatible with their stringr equivalents.

(NEW FEATURE) #118: stri_wrap() gained indent, exdent, initial, and prefix arguments. Moreover, Knuth’s dynamic word wrapping algorithm now assumes that the cost of printing the last line is zero, see #128.
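The calls below sketch a few of the changes listed above. The input strings are made up for illustration, and the snippet assumes stringi 0.4-1 or later.

```r
library("stringi")

# n (formerly n_max) limits the number of pieces returned
stri_split_fixed("a,b,c,d", ",", n = 2)

# simplify = TRUE pads the matrix with "", simplify = NA pads with NA
m_empty <- stri_split_fixed(c("a,b", "c,d,e"), ",", simplify = TRUE)
m_na    <- stri_split_fixed(c("a,b", "c,d,e"), ",", simplify = NA)

# omit_no_match = TRUE yields character(0) rather than NA for no match
no_match <- stri_extract_all_regex("abc", "[0-9]+", omit_no_match = TRUE)

# Case-insensitive and overlapping fixed-pattern search via stri_opts_fixed()
found   <- stri_detect_fixed("ABC", "b",
                             opts_fixed = stri_opts_fixed(case_insensitive = TRUE))
overlap <- stri_count_fixed("aaa", "aa",
                            opts_fixed = stri_opts_fixed(overlap = TRUE))
```

Here found and overlap are stored in variables so that they can be inspected at the console; without overlap = TRUE, the last call would count only non-overlapping occurrences.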

A new release of the stringi package is available on CRAN (please wait a few days for Windows and OS X binary builds).

# install.packages("stringi") or update.packages()
library("stringi")

stringi is an R package providing (but definitely not limited to) equivalents of nearly all the character string processing functions known from base R. While developing the package we had high performance and portability of its facilities in mind.

We implemented each string processing function from scratch. The internationalization and globalization support, as well as many string processing facilities (like regex searching), is provided by IBM’s well-known ICU4C library.

Here is a very general list of the most important features available in the current version of stringi:

Do you find this plot fancy? If so, you can find the code at the end of this article, but if you take a little time to read it thoroughly, you can learn how to create even better ones.

We would like to encourage you and your children (or children you teach) to use our new R package – TurtleGraphics!

The TurtleGraphics package offers R users the functionality of the “turtle graphics” from the Logo educational programming language. The main idea behind it is to inspire children to learn programming and to show that working with a computer can be entertaining and creative.

It is very elementary and clear, and it requires only basic algorithmic thinking skills, which even children are able to develop. You can learn it in just five short steps.

turtle_init() – To start the program call the turtle_init() function. It creates a plot region (called “Terrarium”) and places the Turtle in the middle pointing north.

library(TurtleGraphics)
turtle_init()

turtle_forward() and turtle_backward() – The argument to these functions is the distance you want the Turtle to move. For example, to move the Turtle forward by a distance of 10 units use the turtle_forward() function. To move the Turtle backwards you can use the turtle_backward() function.

turtle_forward(dist=15)

turtle_turn(), turtle_right() and turtle_left() – They change the Turtle's direction by a given angle.

turtle_right(angle=30)

turtle_up() and turtle_down() – To stop the path from being drawn you can simply use the turtle_up() function. Let's consider a simple example. We call the turtle_up() function. Now, when you move forward, the path is not visible. If you want the path to be drawn again you should call the turtle_down() function.

turtle_show() and turtle_hide() – Similarly, you may show or hide the Turtle image, using the turtle_show() and turtle_hide() functions respectively. If you call a lot of functions it is strongly recommended to hide the Turtle first as it speeds up the process.

turtle_hide()
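Step four above comes without a code sample, so here is a small sketch of how turtle_up() and turtle_down() might be combined to draw a dashed line. The segment length and the number of dashes are arbitrary choices of ours; turtle_getpos() is used only to report where the Turtle ends up.

```r
library(TurtleGraphics)

turtle_init()
turtle_hide()              # hiding the Turtle speeds up plotting

start <- turtle_getpos()   # position right after initialization

for (i in 1:3) {
  turtle_forward(5)        # pen is down: this segment is drawn
  turtle_up()
  turtle_forward(5)        # pen is up: this segment leaves a gap
  turtle_down()
}

end <- turtle_getpos()
end - start                # net displacement of the Turtle
```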

These were just the basics of the package. Below we show you its true potential:

One may wonder what the turtle_do() function is doing here. It is an advanced way to use the package. The turtle_do() function is designed to evaluate more complicated plot expressions: it automatically hides the Turtle before starting the operations, which results in faster plotting.
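As a hedged illustration of this idea, the sketch below wraps the drawing of a square in a single turtle_do() call. The helper name draw_square and its side argument are hypothetical, introduced here only for the example.

```r
library(TurtleGraphics)

# Hypothetical helper: draws a square and brings the Turtle back
# to the point where it started
draw_square <- function(side) {
  for (i in 1:4) {
    turtle_forward(side)
    turtle_right(90)
  }
}

turtle_init()
before <- turtle_getpos()

# The Turtle is hidden while the expression is evaluated
turtle_do({
  draw_square(20)
})

after <- turtle_getpos()
```

Comparing before and after is a quick way to check that the Turtle has returned to its starting point, up to floating-point rounding.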