Thanks to some early data from Pizza Girl, of Slice fame, I have some very preliminary findings.

There are a few different ways to tip: by check (only one person did this), by credit card at the door, by pre-tipping with a credit card, and with cash. As seen in these boxplots, cash tippers gave the most, on average. Pre-tippers, who are really just tipping based on feeling, not performance, show the greatest variability. There was even someone who pre-tipped only a dollar. Pre-tipping a large amount might be a good idea, kind of like greasing a palm at a restaurant to get a table, but I don’t see how a small pre-tip helps.

I wonder why people give bigger tips with cash than with credit cards. I would have thought it would be the other way around.

This is just the beginning. Pizza Girl is providing more data as the weeks go on, and as I get more data the analysis will become more sophisticated, so stay tuned as we unravel the world of pizza delivery. In the meantime, check out Pizza Girl’s third installment of her findings on Slice.

His first contention is whether assumptions about categorization are correct. This is certainly important, but hopefully qualified statisticians, social scientists, doctors, etc. are making these decisions and properly counting the results.

Next he discusses whether the numbers you are looking at have been aggregated properly and were arrived at using the proper choices of criteria, protocols and weights. He gives articles such as “The 10 Friendliest Colleges” and “The 20 Most Lovable Neighborhoods” as examples. Having done a lot of work where variable selection and shrinkage are important, I can say that I, for one, allow the data to speak for itself and use various statistical methods to arrive at the correct decision.

Dr. Paulos makes more points, but I’ll let you read the article for yourself. The important takeaway, at least to me, is that when looking at reported statistics and measurements, you should try to figure out what methods were used. That’s why I am always disappointed when articles do not report their methods. I realize that understanding the techniques might be beyond the average person, but that’s when you ask your statistician friend.

Today, Google announced two new services that are sure to be loved by data geeks. First is their BigQuery, which lets you analyze “Terabytes of data, trillions of records.” This is great for people with large datasets. I wonder if a program like R (my favorite statistical analysis package) can read it? If so, would R just pull down the data like it would from any other database? That would most likely result in a data.frame that is far too large for a standard computer to handle. Maybe R can be run in a way that it hits the BigQuery service and leaves the data in there. Maybe even the processing can be done on Google’s end, allowing for much better computation time. This is something I’ve been dreaming of for a while now.
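To make the "pull everything down" versus "leave the data there" distinction concrete, here is a minimal sketch in Python. It uses an in-memory SQLite database purely as a stand-in for a remote query service. It is not the BigQuery API, and the table name `tips` and its rows are made up for illustration.

```python
import sqlite3

# SQLite stands in for a remote query service; this is NOT the BigQuery API.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tips (method TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO tips VALUES (?, ?)",
    # Made-up rows, for illustration only.
    [("cash", 5.0), ("cash", 7.0), ("credit", 4.0), ("credit", 2.0)],
)

# Approach 1: pull every row to the client, then aggregate locally.
# With terabytes of data, this is the step that would not fit in memory.
rows = conn.execute("SELECT method, amount FROM tips").fetchall()
grouped = {}
for method, amount in rows:
    grouped.setdefault(method, []).append(amount)
local_means = {m: sum(v) / len(v) for m, v in grouped.items()}

# Approach 2: let the database aggregate and send back only the summary.
remote_means = dict(
    conn.execute("SELECT method, AVG(amount) FROM tips GROUP BY method").fetchall()
)

print(local_means)
print(remote_means)
```

Both approaches give the same group means, but the second only moves one row per group across the wire, which is the behavior I would hope for from a hosted service.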

Further, can BigQuery produce graphics? If so, this might be a real shot at Business Intelligence tools like QlikView or Cognos that specialize in handling LARGE datasets.

The other day, I was working near Houston Street, teaching a class on QlikView (which itself could be a great post topic about data munging for statisticians). On the last day of the class we decided to head to Bleecker Street for a pizza feast.

Pizza Girl, a pizza delivery girl who is a regular contributor on Slice, tallied up and analyzed the time she spends on various duties in her pizzeria. This is just the first part in a series, but so far she has determined that she spends 67% of her shift driving.

According to her pay schedule, she makes less money while driving ($4.95/hr) than she does while in the pizzeria ($7.50/hr).
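Putting the two figures above together gives a rough blended hourly wage. The sketch below assumes that the remaining 33% of the shift is all paid at the in-store rate and that tips are excluded. Neither assumption comes from Pizza Girl's post, and the function name is hypothetical.

```python
DRIVING_SHARE = 0.67   # share of the shift spent driving, as reported
DRIVING_RATE = 4.95    # dollars per hour while driving
IN_STORE_RATE = 7.50   # dollars per hour while in the pizzeria


def blended_hourly_wage(driving_share, driving_rate, in_store_rate):
    """Time-weighted average of the two hourly rates.

    Assumes all non-driving time is paid at the in-store rate.
    """
    return driving_share * driving_rate + (1 - driving_share) * in_store_rate


wage = blended_hourly_wage(DRIVING_SHARE, DRIVING_RATE, IN_STORE_RATE)
print(f"Blended base wage before tips: ${wage:.2f}/hr")
```

Under those assumptions the base wage works out to roughly $5.79 an hour, much closer to the driving rate than to the in-store rate.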

The New York Times, in what seems like a continuing series on NYC transportation, has an article about a decline in subway ridership. The article points out declines that were to be expected, such as in the financial district or Midtown, as well as expected increases, like along the J, which shares a route with the M and Z, which are facing service cuts. It will be interesting to see how these findings impact the expected service cuts.

Another area with expected results was a massive drop-off at the moribund Mets’ stop and a below-average drop at the World Champion Yankees’ stop. However, the Mets, unlike the Yankees, have a convenient commuter rail stop. Perhaps that explains the drop more than the team’s performance.

Slice recently reported that Fark user “Certainly You Jest” tabulated a list of the 25 most mentioned pizzerias. Naturally, I decided to play with the numbers. Rather than write up another formal paper, I did some quick ad hoc analysis for posting on this blog and I will skip some of the more technical aspects.

First, I augmented the data with the price of a typical plain pie that could feed two to four people and the pizzeria’s distance from New York City. Adding the distance meant I had to remove the multi-state chains, like Monical’s, from the data.

While the number of times a pizzeria is mentioned is count data, it doesn’t quite fit a Poisson distribution, and the Poisson regression didn’t seem to be a good fit. This makes sense since I have three predictors (distance from New York, price and their interaction). You can see this in the two histograms below.
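One common quick check of whether counts look Poisson is to compare the sample variance to the sample mean, since the two are equal for a Poisson distribution. The sketch below is not the analysis behind this post. The function name is hypothetical, and the counts are made-up placeholders rather than the Fark mention data.

```python
from statistics import mean, variance


def dispersion_index(counts):
    """Ratio of sample variance to sample mean.

    Values near 1 are consistent with a Poisson distribution;
    values well above 1 suggest overdispersion.
    """
    return variance(counts) / mean(counts)


# Placeholder counts for illustration only, NOT the real mention data.
placeholder_mentions = [2, 3, 3, 4, 5, 9, 14, 30]

ratio = dispersion_index(placeholder_mentions)
print(f"variance / mean = {ratio:.2f}")
```

A ratio far above 1, as with these skewed placeholder counts, is the kind of signal that would point toward something like a quasi-Poisson or negative binomial model instead.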