Saturday, December 29, 2012

I got the following interesting graph workshop invitation from Peter Boncz:

At SIGMOD 2013 a new workshop will be held: Graph Data Management Experiences & Systems (GRADES).

In the database research community, graph data management is nowadays a popular topic, though the published papers tend to be highly technically focused (e.g. on graph reachability indexing or subgraph search), and not always targeted at what we now call Big Data. In the GRADES workshop we strive for an emphasis on Experiences and Systems; that is, we value input from practitioners on graph data management application areas and systems, and the challenges these bring when handling large-scale problems, as found e.g. in life science analytics, social network marketing, digital forensics, telecommunication network analysis and digital publishing.

Given your interesting work on the Distributed GraphLab system, we would like to invite you to consider submitting a short paper, with this focus on experiences and systems. Note that graph data management benchmarks and RDF stores are also in scope. Please see http://event.cwi.nl/grades2013 for more details. The workshop will be held Sunday June 23 2013 in New York, on the day preceding SIGMOD/PODS 2013.

The audience will be a mix of graph data management researchers from SIGMOD/PODS and industrial data management practitioners. One of the goals of the workshop is to confront the research audience with "real" open problems and challenges. The GRADES workshop is well-poised to bring industrial practitioners and owners of graph data management problems to the workshop, as it is sponsored and co-organized by the Linked Data Benchmark Council (LDBC, http://ldbc.eu), which is aimed at industrial benchmarking of graph and RDF database systems. We expect to see industrial participation from graph database and RDF data management vendors inside and outside LDBC, and some of their more involved customers. Do not hesitate to contact us if you wish additional information on LDBC.

You are of course encouraged to share this call with colleagues or other people who might be interested in attending the workshop and/or submitting a paper. Best regards, thanks in advance for your participation, and wishing you a good and productive 2013,

Wednesday, December 19, 2012

Following my gensgd implementation which supports dense feature matrices, I was asked by my mega collaborator Justin Yan to implement a sparse version that supports libsvm format. sparse_gensgd is exactly the same algorithm (high dimensional matrix factorization), but for the sparse case. Perhaps a bit surprisingly, I will show below that the sparse_gensgd algorithm can be used for classification with very nice performance. As a case study I will discuss KDD CUP 2010.

Case study: ACM KDD CUP 2010

In this case study I will show you how to get state-of-the-art performance from the GraphChi CF toolkit on a recent KDD CUP 2010 task. Here is text from the contest website describing the task:

This year's challenge asks you to predict student performance on mathematical problems from logs of student interaction with Intelligent Tutoring Systems. This task presents interesting technical challenges, has practical importance, and is scientifically interesting.

I have used the libsvm data repository, which converted the task into a binary classification problem. The problem is moderate in size: around 20M samples for training, 750K samples for testing, with 29M sparse features.

The winning team was NTU, and here is their winning paper. Here is a graph depicting their single algorithm improvement:

As you can see, a prediction RMSE of around 0.2815 is the best result obtained by a single model.

The data is sparse in the sense that any number of features can appear in one sample. For example, here are the first 3 lines of the data in libsvm format:

The target is either 1 or 0; this is the value we would like to predict as part of the matrix factorization procedure. The rest of the features are integer ids, where most of the values are binary (1) but some are doubles (e.g. 0.477). Given the validation dataset, where only the features are given, we would like to predict the target: either 0 or 1.
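For readers unfamiliar with the format, a libsvm line can be parsed with a few lines of Python (a minimal sketch; the sample line below is made up for illustration):

```python
def parse_libsvm_line(line):
    """Parse one libsvm-format line: '<target> <id>:<value> <id>:<value> ...'."""
    parts = line.split()
    target = float(parts[0])               # 1 or 0 in this binary task
    features = {}
    for token in parts[1:]:
        idx, value = token.split(":")
        features[int(idx)] = float(value)  # mostly 1, occasionally a double like 0.477
    return target, features

target, feats = parse_libsvm_line("1 5:1 12:0.477 900:1")
```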

Instructions:
1) Install GraphChi as instructed here (steps 1-3).
2) Download the datasets kddb (training) and kddb.t (validation) and put them in the root GraphChi folder. (Tip: use bunzip2 to open those files).
3) Create a file named kddb\:info with the following two lines:
%%MatrixMarket matrix coordinate real general
1000 1000 19264097
4) Create a file named kddb.t\:info with the following two lines:
%%MatrixMarket matrix coordinate real general
1000 1000 748400
5) Run as instructed.
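Steps 3 and 4 can also be done with a small Python script (a sketch; the counts are copied from the steps above, and note that the file names contain a literal ':', which the backslash only escapes on the shell command line):

```python
# Write the MatrixMarket-style header files GraphChi expects next to the data files.
headers = {
    "kddb:info":   (1000, 1000, 19264097),  # training: rows cols nonzeros
    "kddb.t:info": (1000, 1000, 748400),    # validation
}
for name, (rows, cols, nnz) in headers.items():
    with open(name, "w") as f:
        f.write("%%MatrixMarket matrix coordinate real general\n")
        f.write("%d %d %d\n" % (rows, cols, nnz))
```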

Monday, December 17, 2012

You can find below two emails I got from industry regarding my 3rd generation solver.
I definitely can understand some of the frustration: I can claim anything in my blog,
while the industry guys have a harder time proving their claims since they charge money for them. My solution costs nothing, so if it is worth more than nothing then everyone is happy about it.

i) Feature engineering of any data type. Feature vectors can be created from any data type. Features can also be combined, for example: bag-of-words + votes (events) + preference ratings + custom features and so on. If data is remotely of value then include it as a feature. Feature vector lengths can range from the tens to tens of millions. The method is equally suitable for items x items, items x users and users x items problem types.

ii) No training. No training is required. Once the feature engineering is completed, the sparse binarized data is fed to the Xyggy engine. The method automatically learns and generalizes.

iii) Multiple items per query. A query consists of one or more items. Typically, the more items per query the better the results.

iv) Dynamic predictions. Predictions are calculated dynamically in near real-time. If the query items change, the predictions are re-calculated. Think of it as dynamic clustering.

v) Scalability and parallelism. The method can be viewed as IR+RecSys and will scale to any required size. The inherent parallelism offers scalability and performance routes to deliver services at web-scale.

vi) Relevance feedback, engineered serendipity and novelty detection. The Android showcase app demonstrates personalization with autonomous discovery utilizing both positive and negative relevance feedback, as well as engineered serendipity. The source code for the Android app will be released soon to show how these capabilities can be built into applications with the Xyggy api. If there is interest, the feature engineering process for this app can be explained, as it is instructive.

More information can be found here. We welcome the opportunity to work with organizations who want a simpler and faster way to deploy scalable intelligent services.

I don't think you expected so much feedback for your blog post, but I guess this is good. Let me add my comments to your post.

You will be surprised how many things happen in the industry that nobody ever bothers to publish. I can easily believe that somebody else has tried this approach. The truth is that I have been trying to convince some of my clients to use tensor factorization (PARAFAC) instead of linear regression, but they are not convinced. One of the reasons is that traditional industry prefers linear regression because of the confidence intervals and statistical significance of the factors. In your approach you aggregate several factors into the time variable. You could have instead used a multidimensional tensor x[i,j,k,l]=a[i]*b[j]*c[k]*d[l] with L1 regularization, and that would have given you some measure of the importance of the variables. I am not sure how you can match the linear regression metrics, but worst case you use bootstrap.

My answer: yes, I have been there, done that. Typically in the industry you write patents on ideas that would hardly be accepted at a decent conference. And I know there is not enough time to pursue publications. By the way, in my approach I am not aggregating several factors into time variables (I only noted this can be done using a traditional matrix factorization approach); I am using separate factors for each variable. The tensor case can hardly be scaled beyond the 3rd dimension because of all the interactions between the latent feature vectors.

Another problem that I see with your approach is handling continuous variables. You are trying to predict delays, and you might have continuous variables like weather temperature or humidity. I believe you implicitly quantize them through hashing. This might be suboptimal, since values that are very close might fall into separate bins. Even worse: if, say, your original dataset didn't have any categorical/ordinal variables, but only d continuous factors, then by quantizing them with hashing there is a high probability that true nearest neighbors will be assigned to different bins. A different approach would be to build m random trees with l leaves each. Now your original d-dimensional space has been transformed into an m*l one, and each point is mapped onto m different leaves. I think this is a better approach than quantizing each variable separately, since it preserves the Euclidean distances up to a small error.

My answer: I agree there is a delicate point here. For the target variable (like flight delay) I support continuous values, but the other features I quantize, so the natural ordering between increasing values is lost. Still, for most problems I tried, it works well in practice. Regarding the lost relations: whenever a single feature appears in two different samples, those samples are connected, and the gradient is computed with respect to both, so data dependencies are maintained very well.
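The quantization under discussion can be sketched as hashing each (field, value) string pair into a bounded id space. This is my own illustration of the idea, not the exact gensgd code, and the table size is an assumption:

```python
import hashlib

def hash_feature(field, value, table_size=2**20):
    """Map a (field, value) pair to a bounded integer id. Note that nearby
    numeric values land in unrelated bins - the ordering loss discussed above."""
    key = ("%s=%s" % (field, value)).encode()
    return int(hashlib.md5(key).hexdigest(), 16) % table_size

# 20.1 and 20.2 degrees receive ids with no ordering relationship:
a = hash_feature("temperature", 20.1)
b = hash_feature("temperature", 20.2)
```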

There is also one detail that is not mentioned in any matrix factorization work and is very critical in practice. If the matrix has block diagonal components, in other words the corresponding graph has disconnected components, you can get very bad results in your recommender system when you are looking for similar items. So good advice is to use the GraphLab method for identifying them first, and then use your favorite factorization.

My answer: most of the graphs I work on are connected, so you may be right, but I am not concerned with this issue.

One of the reasons why companies don't broadcast their solutions is that they are afraid of patents. This is also my advice to my clients, especially if they are in the medical/healthcare industry: never disclose your exact methods. Patent trolls are watching.

My answer: good life in academia!! Earn a low salary and broadcast as much as you like!!

At last I would like to say that I am very glad you posted this approach. I happen to teach an introductory course to practitioners with the title "Learning Machine Learning by Example" http://www.meetup.com/Learning-Machine-Learning-by-Example/ and we use the airline dataset. Your post is giving me an opportunity to introduce the students to matrix factorization coming straight from linear regression!

My answer: very interesting. I hope you will give the students an opportunity to actually play with my code; I think it would be beneficial for their understanding of how different features contribute to the overall solution quality. Nick has kindly added the GraphChi CF toolkit to his machine learning meetup course.

Friday, December 14, 2012

NOTE: This blog post is two years old. We have reimplemented this code as part of GraphLab Create. The implementation in GraphLab Create is preferred since:
1) No input format conversions are needed (like matrix market header setup)
2) No parameter tuning like step sizes and regularization is needed
3) No complicated command line arguments
4) The new implementation is more accurate, especially regarding the validation dataset.
Anyone who wants to try it out should email me and I will send you the exact same code in Python.

**********************************************************************************
A couple of days ago I wrote about new experimental software I am writing, which is what I call 3rd generation collaborative filtering software. I got a lot of interesting feedback from my readers which helps improve the software. Previously I examined its performance on the KDD CUP 2012 dataset. Now I have tried it on completely different datasets and I am quite pleased with the results.

First dataset: Airline on time

Below I will explain how to deploy it on a different problem domain: airline on-time performance. It is a completely different dataset from a different domain, but the gensgd software can still deal with it without any modification. I hope these results, which show how flexible the software is, will encourage additional data scientists to try it out!

The airline on-time dataset has information about 10 years of flights in the US. The data of each year is a csv file with the following format:
Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay

The fields are rather self-explanatory. Each line represents a single flight, with information about the date, carrier, airport etc., and the interesting fields are the ones describing flight duration.

Note: you can get the dataset using the commands:
curl http://stat-computing.org/dataexpo/2009/2008.csv.bz2 -o 2008.csv.bz2
bunzip2 2008.csv.bz2

First task. Can we predict the total time the flight was on the air?

Well, for a matrix factorization method, it is not clear what the actual matrix is here. That is why it is useful to have flexible software. In my experiments I have chosen "UniqueCarrier" and "FlightNum" as the two fields which form the matrix, because together they characterize each flight rather uniquely. Next we need to decide which field we want to predict; I have chosen ActualElapsedTime as the prediction target. Note that those fields are chosen on the fly, so you are more than welcome to choose others and see how good the prediction is in that case.
(Additional information about each field meaning is found here).
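The column selection above can be reproduced with Python's csv module (a sketch; the embedded data row is illustrative, in the same format as 2008.csv, with the header abbreviated to the columns used here):

```python
import csv, io

# Abbreviated 2008.csv header plus one illustrative flight record.
sample = io.StringIO(
    "Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,"
    "UniqueCarrier,FlightNum,TailNum,ActualElapsedTime\n"
    "2008,1,3,4,2003,1955,2211,2225,WN,335,N712SW,128\n"
)
rows = []
for row in csv.DictReader(sample):
    # (carrier, flight number) identify the matrix cell; elapsed time is the target.
    rows.append((row["UniqueCarrier"], row["FlightNum"],
                 float(row["ActualElapsedTime"])))
```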

We got an RMSE of 35.3 minutes on predicted flight time, taking into account the carrier and flight number. That is rather bad: we are half an hour off track.

Next let's throw some temporal features into the computation: Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime. How do we do that? It is very easy! Just add the command line flag --features=0,1,2,3,4,5,6,7, namely the positions of the features in the input file. This is what we call temporal matrix factorization, or tensor factorization. To utilize one of the traditional methods, however, you would need to merge all 8 fields into one integer which encodes the time, which is of course a tedious task.
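The "merge into one integer" step that a traditional tensor method would require can be sketched as follows (my own illustration, with assumed field widths; gensgd avoids this entirely by keeping a separate factor per field):

```python
def merge_time_fields(year, month, day, dow, dep, crs_dep, arr, crs_arr):
    """Pack the eight temporal columns into a single integer key, as a classic
    3D tensor factorization over (carrier, flight, time) would need."""
    key = 0
    # Each field is appended with a width large enough for its value range.
    for value, width in [(year, 10000), (month, 100), (day, 100), (dow, 10),
                         (dep, 10000), (crs_dep, 10000), (arr, 10000), (crs_arr, 10000)]:
        key = key * width + int(value)
    return key
```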

Now we get down to a 4 minute average error. But we can continue the computation (run more iterations) and get even below 2 minutes error. Isn't that neat? The average flight time in 2008 is 127 minutes, so a 2 minute prediction error is not that bad.

Second task: let's predict TaxiIn (time that the plane is on the ground when coming in)

This task is slightly more difficult since, as you may imagine, there is much larger relative variation in taxi-in time than in flight time. But is predicting it harder to set up? No: we simply change --val_pos=19, namely pointing the target to the TaxiIn field.

WARNING: common.hpp(print_copyright:104): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any comments or bug reports to danny.bickson@gmail.com

[quiet] => [1]

INFO: gensgd.cpp(main:1155): Total selected features: 16 :

INFO: gensgd.cpp(main:1158): Selected feature: 1

INFO: gensgd.cpp(main:1158): Selected feature: 2

INFO: gensgd.cpp(main:1158): Selected feature: 3

INFO: gensgd.cpp(main:1158): Selected feature: 4

INFO: gensgd.cpp(main:1158): Selected feature: 5

INFO: gensgd.cpp(main:1158): Selected feature: 6

INFO: gensgd.cpp(main:1158): Selected feature: 7

INFO: gensgd.cpp(main:1158): Selected feature: 10

INFO: gensgd.cpp(main:1158): Selected feature: 11

INFO: gensgd.cpp(main:1158): Selected feature: 12

INFO: gensgd.cpp(main:1158): Selected feature: 13

INFO: gensgd.cpp(main:1158): Selected feature: 14

INFO: gensgd.cpp(main:1158): Selected feature: 15

INFO: gensgd.cpp(main:1158): Selected feature: 16

INFO: gensgd.cpp(main:1158): Selected feature: 17

INFO: gensgd.cpp(main:1158): Selected feature: 18

1.56777) Iteration: 0 Training RMSE: 3.89207

3.01777) Iteration: 1 Training RMSE: 3.64978

4.5159) Iteration: 2 Training RMSE: 3.46472

5.8659) Iteration: 3 Training RMSE: 3.30712

7.26778) Iteration: 4 Training RMSE: 3.17225

8.7159) Iteration: 5 Training RMSE: 3.06696

...

23.6072) Iteration: 16 Training RMSE: 2.60147

24.9789) Iteration: 17 Training RMSE: 2.57697

26.3267) Iteration: 18 Training RMSE: 2.55768

27.6967) Iteration: 19 Training RMSE: 2.54186

29.0773) Iteration: 20 Training RMSE: 2.53113

We get an average RMSE of about 2.5 minutes, which means this task is actually more difficult than predicting air time.

Instructions:
0) Install GraphChi from mercurial using the instructions here.
1) Download the year 2008 from here.
2) Decompress the file using: bunzip2 2008.csv.bz2
3) Create a matrix market format file, named 2008.csv:info, with the following two lines:
%%MatrixMarket matrix coordinate real general
20 7130 1000000
4) Run the commands as instructed above.

Second dataset: Hearst machine learning challenge

A while ago Hearst provided data about email campaigns, and the task was to predict user reaction to emails (clicked / not clicked). The data has several million records about emails sent, with around 273 user features for each email. Here are some of the available fields:
CLICK_FLG,OPEN_FLG,ADDR_VER_CD,AQI,ASIAN_CD,AUTO_IN_MARKET,BIRD_QTY,BUYER_DM_BOOKS,BUYER_DM_COLLECT_SPC_FOOD,BUYER_DM_CRAFTS_HOBBI,BUYER_DM_FEMALE_ORIEN,BUYER_DM_GARDEN_FARM,BUYER_DM_GENERAL,BUYER_DM_GIFT_GADGET,BUYER_DM_MALE_ORIEN,BUYER_DM_UPSCALE,BUYER_MAG_CULINARY_INTERS,BUYER_MAG_FAMILY_GENERAL,BUYER_MAG_FEMALE_ORIENTED,BUYER_MAG_GARDEN_FARMING,BUYER_MAG_HEALTH_FITNESS,BUYER_MAG_MALE_SPORT_ORIENTED,BUYER_MAG_RELIGIOUS,CATS_QTY,CEN_2000_MATCH_LEVEL,CLUB_MEMBER_CD,COUNTRY_OF_ORIGIN,DECEASED_INDICATOR,DM_RESPONDER_HH,DM_RESPONDER_INDIV,DMR_CONTRIB_CAT_GENERAL,DMR_CONTRIB_CAT_HEALTH_INST,DMR_CONTRIB_CAT_POLITICAL,DMR_CONTRIB_CAT_RELIGIOUS,DMR_DO_IT_YOURSELFERS,DMR_MISCELLANEOUS,DMR_NEWS_FINANCIAL,DMR_ODD_ENDS,DMR_PHOTOGRAPHY,DMR_SWEEPSTAKES,DOG_QTY,DWELLING_TYPE,DWELLING_UNIT_SIZE,EST_LOAN_VALUE_RATIO,ETECH_GROUP,ETHNIC_GROUP_CODE,ETHNIC_INSIGHT_MTCH_FLG,ETHNICITY_DETAIL,EXPERIAN_INCOME_CD,EXPERIAN_INCOME_CD_V4,GNDR_OF_CHLDRN_0_3,GNDR_OF_CHLDRN_10_12,GNDR_OF_CHLDRN_13_18,GNDR_OF_CHLDRN_4_6,GNDR_OF_CHLDRN_7_9,HH_INCOME,HHLD_DM_PURC_CD,HOME_BUSINESS_IND,I1_BUSINESS_OWNER_FLG,I1_EXACT_AGE,I1_GNDR_CODE,I1_INDIV_HHLD_STATUS_CODE,INDIV_EDUCATION,INDIV_EDUCATION_CONF_LVL,INDIV_MARITAL_STATUS,INDIV_MARITAL_STATUS_CONF_LVL,INS_MATCH_TYPE,LANGUAGE,LENGTH_OF_RESIDENCE,MEDIAN_HOUSING_VALUE,MEDIAN_LEN_OF_RESIDENCE,MM_INCOME_CD,MOSAIC_HH,MULTI_BUYER_INDIV,NEW_CAR_MODEL,NUM_OF_ADULTS_IN_HHLD,NUMBER_OF_CHLDRN_18_OR_LESS,OCCUP_DETAIL,OCCUP_MIX_PCT,PCT_CHLDRN,PCT_DEROG_TRADES,PCT_HOUSEHOLDS_BLACK,PCT_OWNER_OCCUPIED,PCT_RENTER_OCCUPIED,PCT_TRADES_NOT_DEROG,PCT_WHITE,PHONE_TYPE_CD,PRES_OF_CHLDRN_0_3,PRES_OF_CHLDRN_10_12,PRES_OF_CHLDRN_13_18,PRES_OF_CHLDRN_4_6,PRES_OF_CHLDRN_7_9,PRESENCE_OF_CHLDRN,PRIM_FEM_EDUC_CD,PRIM_FEM_OCC_CD,PRIM_MALE_EDUC_CD,PRIM_MALE_OCC_CD,RECIPIENT_RELIABILITY_CD,RELIGION,SCS_MATCH_TYPE,TRW_INCOME_CD,TRW_INCOME_CD_V4,USED_CAR_CD,Y_OWNS_HOME,Y_PROBABLE_HOMEOWNER,Y_PROBABLE_RENTER,Y_RENTER,YRS_SCHOOLING_CD,Z_CREDIT_CARD

Field meanings and codes are described in detail here. You will need to register at the website to get access to the data.

And this is the first entry:
N,N,,G,,8,0,1,0,0,0,0,1,0,0,0,0,4,0,0,1,0,0,0,B,U,0,,M,Y,0,0,0,0,0,1,1,1,0,2,0,A,C,0,J,18,Y,66,,A,U,U,U,U,U,34,,U,U,84,M,H,1,1,M,5,I,01,00,67,3,,E06,Y,7,3,0,05,0,37,78.09,30,63,36,13.27,59,,N,N,N,N,N,N,U,UU,U,07,6,J,4,,J,4,U,,Y,U,0,Y,,24,,,,,,,F,F,,,,,,,U,Y,,,,,,,17,69,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,5,5,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,NORTH LAUDERDALE,330685141,FL,190815,,,,,,1036,Third Party - Merch,"Mon, 09/20/10 01:04 PM"

For this demo I used the file Modeling_1.csv, which is the first of 5 files, with 400K entries.

We would like to predict the zeroth column (the click flag). I have taken columns 9 and 10 as the matrix from/to entries. The rest of the columns, up to column 40, are features. (While there are more features, the solution is already so accurate that the first 40 are enough.)

We got a very good classifier - starting from the second iteration there are no classification errors.

Some explanation about additional runtime flags not used in the previous examples:
1) --rehash_value=1 - since the target value is not numeric, I used rehash_value to translate Y/N into two numeric integer bins.
2) --cutoff=0.5 - after hashing the target Y/N we get two integers: 0 and 1, so I use 0.5 as a prediction threshold to decide between Y and N.
3) --file_columns=200 - I am looking only at the first 40 columns, so there is no need to parse all 273 columns. (You can play with this parameter at run time.)
4) --has_header_titles=1 - the first line of the input file includes column titles.
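Together, --rehash_value and --cutoff behave like the following sketch (a simplified illustration of the idea; gensgd's actual value hashing differs):

```python
def encode_target(label):
    """Map the textual Y/N target into an integer bin, as --rehash_value=1 does."""
    return 1 if label == "Y" else 0

def classify(prediction, cutoff=0.5):
    """Turn the model's real-valued prediction back into Y/N, as --cutoff=0.5 does."""
    return "Y" if prediction >= cutoff else "N"
```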

Instructions
1) Register to the hearst website.
2) Download the first data file Modeling_1.csv and put in the in main graphchi folder.
3) Create a file named Modeling_1.csv:info and put the following two lines in it:
%%MatrixMarket matrix coordinate real general
11 13 400000
4) Run as instructed.

Tuesday, December 11, 2012

NOTE: This blog post is two years old. We have reimplemented this code as part of GraphLab Create. The implementation in GraphLab Create is preferred since:
1) No input format conversions are needed (like matrix market header setup)
2) No parameter tuning like step sizes and regularization is needed
3) No complicated command line arguments
4) The new implementation is more accurate, especially regarding the validation dataset.
Anyone who wants to try it out should email me and I will send you the exact same code in Python.

After spending a few years writing collaborative filtering software with thousands of installations, and after talking to tens of companies and participating in KDD CUP twice, I have started to develop some next generation collaborative filtering software. The software is very experimental at this point and I am looking for the help of my readers: universities and companies who would like to try it out. [NOTE: I HAVE ADDED SOME UPDATES BELOW ON THURSDAY, DEC 13]

In other words: how do we utilize additional information we have about user features, item features, or even fancier features like a user's friends? This problem is often encountered in practice, and in many cases papers are written about it using specific constructions; see for example Koenigstein's paper. However, in practice most users do not like to break their heads and invent novel algorithms, but want a readily accessible method that can take more features into account without much fine tuning.

The solution:

Following the great success of libFM, I thought about implementing a more general SGD method in GraphChi that can take a list of features into account.

A new SGD-based algorithm was developed with the following features:
1) Support for string features ("John Smith bought The Matrix")
2) Support for dynamic selection of features at runtime.
3) Support for multiple file formats with column permutation.
4) Support for an unlimited number of features.
5) Support for multiple ratings of the same item.
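The underlying model can be sketched as a libFM-style factorization: every feature value owns a latent vector, a prediction is the sum of dot products over all pairs of features in a sample, and SGD updates the vectors. This is my own minimal Python illustration, not the gensgd implementation:

```python
import random

def sgd_factorize(samples, num_features, dim=8, lr=0.1, reg=0.001, iters=1000):
    """samples: list of (feature_ids, target). Each feature id owns a latent
    vector; the prediction sums dot products over all feature pairs."""
    random.seed(0)
    V = [[random.gauss(0, 0.1) for _ in range(dim)] for _ in range(num_features)]

    def predict(feats):
        return sum(V[feats[a]][k] * V[feats[b]][k]
                   for a in range(len(feats))
                   for b in range(a + 1, len(feats))
                   for k in range(dim))

    for _ in range(iters):
        for feats, y in samples:
            err = predict(feats) - y
            for i in feats:
                for k in range(dim):
                    # d(prediction)/dV[i][k] = sum of V[j][k] over the other features
                    grad = err * sum(V[j][k] for j in feats if j != i) + reg * V[i][k]
                    V[i][k] -= lr * grad
    return predict

# Toy example: features 0 and 1 observed together with rating 1.0.
predict = sgd_factorize([([0, 1], 1.0)], num_features=2)
```

On the toy sample the prediction converges toward the 1.0 target; the real implementation adds bias terms, step-size schedules and the feature handling listed above.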

Working example - KDD CUP 2012 - track1

To give a concrete example, I will use the KDD CUP 2012 track1 data, which will demonstrate how easy it is to set up and try the new method.

Explanation: --training is the input file. --val_pos=2 means that the rating is in column 2. --rehash=1 means we treat all fields as strings (and thus support string values). --limit_rating means we handle only the first million ratings (to speed up the demo). --max_iter is the number of SGD iterations. --minval and --maxval are the allowed rating range, and --quiet gives less verbose output. --calc_error displays the classification error (how many predictions were wrong). --file_columns=4 says that there are 4 columns in the input file.

Thursday, Dec 13 - An update

I am getting a lot of readers inputs about this blog post, which is excellent!

One question I got from Xavier Amatriain, manager of recommendations @ Netflix, is why I compute training error and not test error. Xavier is absolutely right; I was quite excited about the results so I wanted to share them before I even had time to compute the test error. Anyway, I promise to do so in a couple of days. But I am quite sure that the model is quite accurate!

I got some interesting inputs from Tianqi Chen, author of SVDFeature software:

I think one important thing we may want to add is support for classification loss (which is extremely easy for SGD). Nowadays RMSE optimization seems to have gone a bit out of fashion; most data are click-through data and the optimization target is ranking instead of RMSE.
I think the feature selection part is quite interesting, since adding junk features to those feature-based factorization models will almost certainly hamper performance. However, directly placing an L1 constraint on the weights will work worse than L2 regularization, so I am curious what trick you used :-)

1. For SGD-FM it is hard to tune parameters like the learning rate, and the MCMC-based method is slow.

2. Recently I found another great model: online Bayesian probit regression (adPredictor), which Bing has used in their CTR prediction. This is an online learning model, it is very fast, and the results are better than logistic regression, so I am thinking about borrowing some ideas from it to turn libFM into an online learning model.

The last kind of feedback I am getting is from companies who claim to have already solved this problem. I think that if the problem was already completely solved, I would not be getting so much feedback about it.
What do you think?

About Me

6 years ago, along with my collaborators at Carnegie Mellon University, I started the GraphLab large scale open source project, which is a framework for implementing machine learning algorithms in parallel and distributed settings. When the project became popular, we decided to raise money to expand the project and provide an industry grade solution.
Specifically, I wrote the award-winning collaborative filtering toolkit for GraphLab, which is widely deployed today and helped us win top places at ACM KDD CUP 2011 and ACM KDD CUP 2012, among other competitions.
Check out our website: http://dato.com