Actually, all of this was Day 2 of the conference (Tuesday), because Day 1 was a workshop day. I was technically registered for the conference only; that wouldn’t have stopped me from attending the workshops, but I only arrived around noon on Monday. So the only thing I attended that day was the tutorial “Blind Men and an Elephant: Coalescing Open-source, Academic, and Industrial Perspectives on BigData”. The abstract can be found here. I have to admit, it was not as exciting as it sounded, but that’s probably because by that time I was doubly jet-lagged.

And my earlier post was about Tuesday – the first day of the conference. After the opening there was the first keynote, “Data Crowdsourcing: Is it for Real?”, presented by Hector Garcia-Molina. Let me tell you, he is one of the “textbook characters” for me, and when you see in person somebody whose books and papers you’ve studied since you were a PhD student, you definitely want to shake this person’s hand and tell them you never thought this would happen… and I did, but two days later :).

The keynote was absolutely amazing, and watching Garcia-Molina live was equally exciting :). I tried to take some pictures, but ended up just listening. So here is just one picture – but it describes the talk pretty well :)

“This is for real!” states Garcia-Molina. He says those old Wild West “wanted” ads were the first data crowdsourcing: the sheriff himself could not catch the fugitive, so he would announce a reward for anyone who provided information that helped catch the robber.

In today’s world there are multiple examples of data crowdsourcing. If you look at the picture posted above, Garcia-Molina was talking about one apparently well-known case, in which humans were hired to solve CAPTCHAs to create “non-human” Hotmail addresses for porn sites.

He also talked about a crowdsourcing-based system, which I believe his team at Stanford is developing (since I do not have the text of the keynote, I’m not 100% sure). In any case, here is an example. You need to buy a specific type of cable, but you have no idea what it is called and hence what to search for on Amazon. You take a picture of the cable, publish it to this service, and “ask the crowd” what it is called. The volunteering crowd responds with suggestions, and then you can search on Amazon with these suggestions and see whether they were correct. The feedback helps to improve the quality of the crowdsourcing.
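The feedback loop described above – crowd suggests labels, the search confirms one, and the confirmation feeds back into quality estimates – can be sketched roughly like this. This is purely my own toy illustration, not the actual system from the keynote; all class and method names are made up.

```python
from collections import defaultdict


class CrowdLabeler:
    """Toy sketch of a crowdsourced-labeling feedback loop.

    Hypothetical names throughout; this only illustrates the idea of
    weighting workers by past accuracy, not any real service's design.
    """

    def __init__(self):
        # Every worker starts with the same neutral reliability weight.
        self.reliability = defaultdict(lambda: 1.0)

    def aggregate(self, suggestions):
        """Pick the label with the highest reliability-weighted vote.

        suggestions: list of (worker_id, label) pairs.
        """
        scores = defaultdict(float)
        for worker, label in suggestions:
            scores[label] += self.reliability[worker]
        return max(scores, key=scores.get)

    def feedback(self, suggestions, confirmed_label):
        """After a search confirms the right label, adjust worker weights:
        reward workers who suggested it, penalize those who did not."""
        for worker, label in suggestions:
            if label == confirmed_label:
                self.reliability[worker] *= 1.1
            else:
                self.reliability[worker] *= 0.9


# Example: three crowd answers about a cable photo.
labeler = CrowdLabeler()
answers = [("ann", "HDMI"), ("bob", "HDMI"), ("cat", "DisplayPort")]
guess = labeler.aggregate(answers)       # "HDMI" wins the weighted vote
labeler.feedback(answers, "HDMI")        # the Amazon search confirmed it
```

After a few rounds, consistently correct workers carry more weight, which is one simple way the confirmation step could “improve the quality of the crowdsourcing”.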

After the keynote there was a panel, “Big Data: Old Wine in New Bottle”. The panel description is here. Now, because in this case it’s not one but eight great people talking, and again, I do not have any slides from it, I am not sure I will be able to reproduce all the details of the discussion. There was one funny moment when Hector Garcia-Molina introduced himself as his “evil twin Victor” and put on a fake nose with mustache and glasses whenever he was speaking as “Victor”. Here is one picture from this panel:

This is when Umesh Dayal was giving his 8-minute presentation (he was the only person on the panel whom I knew really well). I was introduced to Rakesh Agrawal a year ago (after reading his papers for years!), but that was just an introduction.

Anyway, the definition of Big Data I liked the most was the one given by David Lomet (another textbook character :)): Big Data is the data that is just a little bit too big for us to process without a problem at this time. And if you look back and try to remember what exactly you considered “big data”, say, 10 years ago – it isn’t anymore! The most important thing is not the actual data volume but the techniques we use to manipulate the data – at some point we no longer need a “special technique” for data we once considered very big.

There were more panels and lots of interesting discussions later, and I will definitely write more about everything – most likely over the weekend, though.


4 responses to “ICDE 2015 Day 1 Part 2”

this company would hire real people who would receive these captchas, resolve them and send the results back, and they were paid by getting access to some porn websites.

Are you sure you got it right? The common case is that no one is hired (that’s expensive! the whole point of crowdsourcing is to get it for free 😉 ). The scenario goes like this: a porn-seeking human comes to the porn site. In the back end, the site then automatically fires off a new account request to the email service. The email service responds with a captcha. The porn site then catches that captcha and presents it to the human porn-seeker (as if it were the porn site’s own). The human solves the captcha, and the porn site relays the solution back to the email service. End result – a new email account that the porn site can now use for spamming or sell to spammers, and an unsuspecting human thrill-seeker who has just been used.

My point is that even without the details this does not sound right ;-). You kept the word “hired”, and that’s the point of my contention. I maintain that hiring someone to just sit and manually perform new account registrations is neither crowdsourcing nor anything new. That’s an ages-old, straightforward method… Think sweatshops for the sake of analogy.

What I was talking about (and I suspect what Garcia-Molina was referring to) is fundamentally different. Nobody was hired. Nobody worked. Nobody even knew they were doing anything except logging in to the porn site. The back-end captcha switcheroo was completely automatic, and cleverly written software used the brain power of unsuspecting humans to bypass anti-robot defense measures.

That’s why I removed the whole piece :). There was actually an extensive discussion about hiring around this process, because they compared the quality of data crowdsourcing in cases where people were hired versus not, and where there was a reward for each suggestion, etc., and the results were not obvious.