Tag: Analytics

Until now, the R ecosystem of package developers has mostly shrugged off the Big Data question. In a fascinating insight, Hadley Wickham said this in a recent interview; shockingly, it mimics the FUD that you-know-who has been accused of (source):

* From in-memory to disk. If your data fits in memory, it’s small data. And these days you can get 1 TB of RAM, so even small data is big!

* From one computer to many computers.

R is a fantastic environment for the rapid exploration of in-memory data, but there’s no elegant way to scale it to much larger datasets. Hadoop works well when you have thousands of computers, but is incredibly slow on just one machine. Fortunately, I don’t think one system needs to solve all big data problems.

To me there are three main classes of problem:

1. Big data problems that are actually small data problems, once you have the right subset/sample/summary.

2. Big data problems that are actually lots and lots of small data problems.

3. Finally, there are irretrievably big problems where you do need all the data, perhaps because you are fitting a complex model. An example of this type of problem is recommender systems.

Ajay: One reason R Big Data packages have not been developed is that it takes money. The private sector in the R ecosystem is a duopoly: Revolution Analytics (acquired by Microsoft) and RStudio (created by Microsoft alum JJ Allaire). Since RStudio as a company actively tries NOT to step into areas Revolution Analytics works in, it has, in my opinion, stayed out of Big Data for strategic reasons.

The Revolution Analytics project on RHadoop is actually just one consultant working on it here https://github.com/RevolutionAnalytics/RHadoop and it has not been updated in six months.

Blaze presents a pleasant and familiar interface to us regardless of what computational solution or database we use. It mediates our interaction with files, data structures, and databases, optimizing and translating our query as appropriate to provide a smooth and interactive session.

Dask.arrays provide blocked algorithms on top of NumPy to handle larger-than-memory arrays and to leverage multiple cores. They are a drop-in replacement for a commonly used subset of NumPy algorithms.
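To make the phrase "blocked algorithms" concrete, here is a minimal sketch in plain Python. It is not Dask's API; the names iter_blocks and blocked_mean and the block size are made up for illustration. The idea is simply that a result such as a mean can be computed one block at a time, so the whole dataset never has to sit in memory at once.

```python
from itertools import islice


def iter_blocks(values, block_size):
    """Yield successive lists of at most block_size items from any iterable."""
    iterator = iter(values)
    while True:
        block = list(islice(iterator, block_size))
        if not block:
            return
        yield block


def blocked_mean(values, block_size=1000):
    """Compute a mean block by block, keeping only running totals in memory."""
    total = 0.0
    count = 0
    for block in iter_blocks(values, block_size):
        total += sum(block)   # partial result for this block
        count += len(block)
    if count == 0:
        raise ValueError("cannot take the mean of empty data")
    return total / count


if __name__ == "__main__":
    # A generator stands in for data that is too large to load at once.
    print(blocked_mean((x for x in range(1_000_001)), block_size=4096))
```

A real blocked-array library would also split work across cores and handle multidimensional chunks; this sketch only shows the one-block-at-a-time accumulation.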

DyND is a dynamic ND-array library like NumPy. It supports variable length strings, ragged arrays, and GPUs. It is a standalone C++ codebase with Python bindings. Generally it is more extensible than NumPy but also less mature. https://github.com/libdynd/libdynd

LibDyND, a component of the Blaze project, is a C++ library for dynamic, multidimensional arrays. It is inspired by NumPy, the Python array programming library at the core of the scientific Python stack, but tries to address a number of obstacles encountered by some of its users. Examples of this are support for variable-sized string and ragged array types. The library is in a preview development state, and can be thought of as a sandbox where features are being tried and tweaked to gain experience with them.

C++ is a first-class target of the library; the intent is that all its features should be easily usable in the language. This has many benefits, such as that development within LibDyND using its own components is more natural than in a library designed primarily for embedding in another language.

This library is being actively developed together with its Python bindings,

When open source fights, closed source wins. When the Jedi fight, the Sith Lords will win.

So will R people rise to the Big Data challenge, or will they bury their heads in the sand like an ostrich or a kiwi? Will Python people learn from R design philosophies and try to incorporate more of them without reinventing the wheel?

Converting code from one language to another automatically?

How I wish there was some kind of automated conversion tool – that would convert a CRAN R package into a standard Python package which is pip-installable.

Why do humans need one set of accommodation to live in, another to work in, and a third to relax in? It seems we are using three times the number of buildings we should be using.

Why can’t analytics measure the cost to the environment (not just carbon output) of any product and service?

What prevents a global effort for analytics against corruption?

Why does open source software underestimate the need for marketing, and why do proprietary software companies underestimate the need for open sourcing at least a small part of their extensive portfolio?

Why is education and training still so expensive in the era of MOOCs and Internet and Skype?

Why are expensive textbooks (and books and newspapers) still being printed on paper?

Why does it take 15 minutes to set up the projector before any presentation despite the advances in technology?

Why can’t I just 3D print most of my wardrobe and my gadgets?

When will we have virtual reality movies?

Why do software companies focus on creating more and more languages, rather than using machine learning to create a language 1 to language 2 translator? How about a Google/Bing Translate for computer languages?

Why do they do a lot of checking before giving me a credit card but not so much checking before giving me a gun in the USA? Why do 2 billion Indians and Chinese put up with corruption? Why do Europeans work so few hours and Asians so many?

Why do people who write packages in open source make less money than people who write apps for mobiles?

When can software startups focus on job search and dating search as the real problems humans care about, not just website search?

Why is there a digital divide, and what could a donation of 1,000,000 phablets to kids in poor countries do for the future?

When will we start consuming smarter rather than just less or more to heal climate change?

But mostly I am thinking of this? Happy New Year. Stay Awesome and Classy

Suppose, let us just suppose, you want to create random numbers that are reproducible and derived from time stamps.

Here is the code in R:

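> a = as.numeric(Sys.time())  # current time in seconds since the epoch; save it to reproduce the draw
> set.seed(a)                 # seed the generator (the value is treated as an integer)
> rnorm(log(a))               # draw log(a) standard normal values (the count is truncated to a whole number)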

Note: you can use a custom function of the system time (I used the log) to decide how many random numbers to generate. This creates a list, of time-dependent length, of pseudo-random numbers (since nothing machine-driven is purely random in the strict sense of the word). To reproduce the same numbers later, keep the value of a and pass it to set.seed() again.
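For readers coming from the Python side, the same idea can be sketched as follows. This is an illustration only: the function name draw_from_timestamp is made up, and Python's generator will not produce the same values as R's rnorm even with an identical seed.

```python
import math
import random
import time


def draw_from_timestamp(timestamp):
    """Return normal draws seeded by, and sized from, a timestamp."""
    seed = int(timestamp)            # whole seconds, used as the seed
    rng = random.Random(seed)        # private generator, so global state is untouched
    n = int(math.log(seed))          # number of draws, mirroring the log() choice above
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


if __name__ == "__main__":
    stamp = time.time()              # save this value to reproduce the draw later
    first = draw_from_timestamp(stamp)
    second = draw_from_timestamp(stamp)
    print(len(first), first == second)
```

The point is that the timestamp only looks random; once it is recorded, the whole sequence can be regenerated.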

Details

The currently available RNG kinds are given below. kind is partially matched to this list. The default is "Mersenne-Twister".

"Wichmann-Hill":

The seed, .Random.seed[-1] == r[1:3] is an integer vector of length 3, where each r[i] is in 1:(p[i] - 1), where p is the length 3 vector of primes, p = (30269, 30307, 30323). The Wichmann–Hill generator has a cycle length of 6.9536e12 (= prod(p-1)/4, see Applied Statistics (1984) 33, 123 which corrects the original article).
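As a rough illustration of how such a generator works, here is a Python sketch of the published Wichmann-Hill recurrence (algorithm AS 183), with three small multiplicative generators combined modulo 1. The class name and the default seed are made up, and this sketch has not been checked to reproduce R's output bit for bit.

```python
import math


class WichmannHill:
    """Three multiplicative congruential generators combined modulo 1."""

    PRIMES = (30269, 30307, 30323)
    MULTIPLIERS = (171, 172, 170)

    def __init__(self, seed=(1, 2, 3)):
        # Each seed component must lie in 1..(p - 1) for its prime p.
        for s, p in zip(seed, self.PRIMES):
            if not 1 <= s <= p - 1:
                raise ValueError("seed component out of range")
        self.state = list(seed)

    def random(self):
        """Advance all three generators and return a float in [0, 1)."""
        self.state = [
            (a * s) % p
            for a, s, p in zip(self.MULTIPLIERS, self.state, self.PRIMES)
        ]
        return sum(s / p for s, p in zip(self.state, self.PRIMES)) % 1.0


# Cycle length quoted in the help text: prod(p - 1) / 4
CYCLE_LENGTH = math.prod(p - 1 for p in WichmannHill.PRIMES) // 4

if __name__ == "__main__":
    gen = WichmannHill()
    print([round(gen.random(), 6) for _ in range(3)])
    print(CYCLE_LENGTH)
```

The seed check mirrors the constraint above that each r[i] is in 1:(p[i] - 1).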

"Marsaglia-Multicarry":

A multiply-with-carry RNG is used, as recommended by George Marsaglia in his post to the mailing list ‘sci.stat.math’. It has a period of more than 2^60 and has passed all tests (according to Marsaglia). The seed is two integers (all values allowed).

"Super-Duper":

Marsaglia’s famous Super-Duper from the 70’s. This is the original version which does not pass the MTUPLE test of the Diehard battery. It has a period of about 4.6*10^18 for most initial seeds. The seed is two integers (all values allowed for the first seed: the second must be odd).

We use the implementation by Reeds et al. (1982–84).

The two seeds are the Tausworthe and congruence long integers, respectively. A one-to-one mapping to S’s .Random.seed[1:12] is possible but we will not publish one, not least as this generator is not exactly the same as that in recent versions of S-PLUS.

"Mersenne-Twister":

From Matsumoto and Nishimura (1998). A twisted GFSR with period 2^19937 - 1 and equidistribution in 623 consecutive dimensions (over the whole period). The ‘seed’ is a 624-dimensional set of 32-bit integers plus a current position in that set.
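Python's standard random module is also built on the Mersenne Twister, so the "624 integers plus a position" layout can be inspected directly. The sketch below relies on how CPython lays out the tuple returned by getstate(), which is an implementation detail rather than a documented guarantee; the helper name split_state and the seed value are made up.

```python
import random


def split_state(rng):
    """Split a generator's internal state into its words and its position."""
    version, internal_state, gauss_next = rng.getstate()
    return internal_state[:-1], internal_state[-1]


if __name__ == "__main__":
    rng = random.Random(2013)        # arbitrary seed for the demonstration
    words, position = split_state(rng)
    print(len(words), position)

    # Copying the full state into another generator reproduces the stream.
    clone = random.Random()
    clone.setstate(rng.getstate())
    print(rng.random() == clone.random())
```

Saving and restoring the whole state is the Python counterpart of saving and restoring .Random.seed in R.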

"Knuth-TAOCP-2002":

A 32-bit integer GFSR using lagged Fibonacci sequences with subtraction. That is, the recurrence used is

X[j] = (X[j-100] - X[j-37]) mod 2^30

and the ‘seed’ is the set of the 100 last numbers (actually recorded as 101 numbers, the last being a cyclic shift of the buffer). The period is around 2^129.
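The recurrence itself is short enough to sketch in Python. Note the assumptions: this shows only the lagged Fibonacci step with subtraction, the 100 starting values are supplied by the caller, and Knuth's actual seeding routine (and the scrambling R applies to seeds) is not reproduced here.

```python
from collections import deque

MODULUS = 2 ** 30
LONG_LAG = 100
SHORT_LAG = 37


def lagged_fibonacci(seed_values):
    """Generate X[j] = (X[j-100] - X[j-37]) mod 2^30 from 100 starting values."""
    if len(seed_values) != LONG_LAG:
        raise ValueError("exactly 100 starting values are required")
    # The deque always holds the last 100 numbers; appending drops the oldest.
    history = deque((v % MODULUS for v in seed_values), maxlen=LONG_LAG)
    while True:
        # history[0] is X[j-100]; history[-37] is X[j-37].
        new_value = (history[0] - history[-SHORT_LAG]) % MODULUS
        history.append(new_value)
        yield new_value


if __name__ == "__main__":
    # Toy starting values for illustration only, not a recommended seeding.
    stream = lagged_fibonacci(list(range(1, 101)))
    print([next(stream) for _ in range(5)])
```

Keeping only the last 100 values is what the help text means by the seed being "the set of the 100 last numbers".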

"Knuth-TAOCP":

An earlier version from Knuth (1997).

The 2002 version was not backwards compatible with the earlier version: the initialization of the GFSR from the seed was altered. R did not allow you to choose consecutive seeds, the reported ‘weakness’, and already scrambled the seeds.

Initialization of this generator is done in interpreted R code and so takes a short but noticeable time.

"L'Ecuyer-CMRG":

A ‘combined multiple-recursive generator’ from L’Ecuyer (1999), each element of which is a feedback multiplicative generator with three integer elements: thus the seed is a (signed) integer vector of length 6. The period is around 2^191.

The 6 elements of the seed are internally regarded as 32-bit unsigned integers. Neither the first three nor the last three should be all zero, and they are limited to less than 4294967087 and 4294944443 respectively.

This is not particularly interesting of itself, but provides the basis for the multiple streams used in package parallel.

"user-supplied":

Use a user-supplied generator.

Function RNGkind allows user-coded uniform and normal random number generators to be supplied.

Hosting a 6-weekend live online certification course on Business Analytics with R starting June 1 at Edureka. Check www.edureka.in/r-for-analytics for more details. The course has been designed to offer more open data science than current expensive offerings that are tech-oriented rather than business-oriented, with more support and customization than a MOOC. This is because many business customers don’t care if it is lapply or ddply, or command line or GUI, as long as they get good ROI on the time and money spent in shifting to R from other analytics software.

Message from our sponsors and my favorite analytics conference (if only I could attend a cool analytics conference nearby in Asia (Singapore/Turkey?), sighs). Even useR won’t come to Asia, ever?

This is the number 1 conference for analytics in the world, and it is next month in Chicago, USA. So you think you have the best analytics software or product or service? Here is where you can find out!

It’s time to amp-up your analytics strategy. It’s time to beef up your analytics strategy by attending Predictive Analytics World Chicago, June 10-13, 2013. With over 30 case studies from leading organizations across a spectrum of industries, this is the must-attend event for anyone serious about their analytics strategy.

“Predictive Analytics World did a great job keeping up with the trends in Predictive Modeling. There were also plenty of opportunities to learn about the most valuable resources available to data scientists.”
– Conor Sontag, Marketing Evolution

“People who are in analytics must join Predictive Analytics World and see the state of the art projects.”
– Burak Buyuktombak, Avea Telecommunication Services (Turkey)