Misc Links

Thursday, July 26, 2007

Google has now opened access to its search and MT systems for university researchers, as announced today on its research blog. The search API documentation does not mention any restriction on the number of queries that can be posted (the earlier limit was 1000). Whatever the number is, I am guessing it will be large (drinking from the firehose?). However, the MT API allows 1000 queries per day, with the documentation hinting that this need not be a hard limit.

Looking at the search API output, two things I really miss are the number of hits and the snippet for each search result. The number of hits has been used in several papers for interesting results. The other useful feature is snippets: every search result from Google is accompanied by a small snippet extracted from the page, as shown below for an example query "Dekang Lin".

The information in the snippets can be used as informative features in different tasks like this one in person name disambiguation. (BTW, Dekang is now at Google)

Despite these minor quibbles, these new APIs will be quite useful to all of us and will certainly result in more papers on Googleology.

Later addition: It turns out we can sort of get the counts by repeatedly executing the request (only ten results per request) and counting the search results returned, but the API caps the number of such requests at 100. That means you could get a maximum of 1000 results, which is not quite the same as "Results 1 - 10 of about 779,000,000". Though that number is approximate, it is still indicative of how strong the query is with respect to the web. For example, GoogleCount("Horse+animal") >> GoogleCount("Horse+truck").
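The paging workaround can be sketched roughly as follows. This is only an illustration: `fetch_page` is a hypothetical stand-in for whatever call the real API exposes, the fake backend exists purely so the sketch runs on its own, and the cap of 100 requests of ten results each is my reading of the limit described above.

```python
from typing import Callable, List

PAGE_SIZE = 10      # results returned per request, as described in the post
MAX_REQUESTS = 100  # assumed reading of the cap described in the post


def count_results(fetch_page: Callable[[str, int], List[str]], query: str) -> int:
    """Count results by paging until a short page or the request cap is hit."""
    total = 0
    for request_index in range(MAX_REQUESTS):
        page = fetch_page(query, request_index * PAGE_SIZE)
        total += len(page)
        if len(page) < PAGE_SIZE:
            # A short (or empty) page means there is nothing left to fetch.
            break
    return total


def make_fake_backend(n_results: int) -> Callable[[str, int], List[str]]:
    """Hypothetical stand-in for the search API, serving n_results fake URLs."""
    def fetch_page(query: str, start: int) -> List[str]:
        end = min(start + PAGE_SIZE, n_results)
        return [f"http://example.com/{query}/{i}" for i in range(start, end)]
    return fetch_page


if __name__ == "__main__":
    # A small result set is counted exactly; a large one saturates at the cap.
    print(count_results(make_fake_backend(37), "horse+truck"))
    print(count_results(make_fake_backend(2500), "horse+animal"))
```

The saturation is the whole problem: any two queries with more than 1000 results become indistinguishable, so the count obtained this way cannot replace the estimated hit count for comparisons like the one above.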

Tuesday, July 17, 2007

... because NLP is so underdeveloped in India, even undergraduate-level projects may be contributing to the cutting edge of research.

It turns out he was referring to this post from an undergrad, which tries to give "the Indian perspective", rather inaccurately. Having worked on NLP at one of the IITs, I am compelled to write from a grad student's perspective. Sunayana's post is interesting as it brings out several issues in Indic computing.

1. Lack of annotation data - corpora, treebanks, and aligned texts, which are the sinews and bones of any language processing system. Resources exist, largely due to the efforts of CIIL, various universities and other government agencies, but these are dwarfed by the resources that exist for other languages, like English or the European languages.

However, the rich morphology in Indian languages can be exploited to mitigate the amount of annotation data required for certain tasks, for instance POS tagging.

2. Encoding issues - As rightly pointed out by Sunayana, before the adoption of Unicode, several data sources were locked up in the fonts they used. But things are changing: there is more Indian language content in Unicode today than ever. Websites like BBC and Wikipedia are putting out a lot of content in Unicode for those interested in collecting monolingual, comparable corpora. A cursory glance at Wikipedia statistics shows that the number of articles in, say, Hindi or Tamil has more than doubled in the past six months.

3. Visibility - While there has been an increasing trend to publish in reputed conferences like ICML or ACL, more participation is certainly desirable. IJCAI 2007 was held in India, and I highly recommend that, if you are around, you submit to (sub. deadline: Jul 31st) and/or attend IJCNLP 2008.

This is an exciting time to do NLP research on Indian languages. There are both corporate and government motivations, which translate into grants and support for universities. The group at IIT Bombay, for example, implemented and deployed local-language systems for helping farmers. Similar efforts have been made by other institutes. Microsoft Research in Bangalore, and IBM Research in New Delhi and Bangalore, are working on various projects on Indian languages, including speech recognition.

At the end of all this, I must partially agree with the quote I took from Alex's blog. Yes, some undergrads do make brilliant contributions, but that is simply because of what they have in their bones. This is true for any country or university.

Joseph Price studies the effect of marriage on graduation in his paper, "Does a Spouse Slow You Down?: Marriage and Graduate Student Outcomes".

Here is a quick abstract:

Using data on 11,000 graduate students from 100 departments over a 20 year period, I test whether graduate student outcomes (graduation rates, time to degree, publication success, and initial job placement) differ based on a student’s gender and marital status. I find that married men have better outcomes across every measure than single men. Married women do no worse than single women on any measure and actually have more publishing success and complete their degree in less time. The outcomes of cohabiting students generally fall between those of single and married students.

Monday, July 2, 2007

Gilles Blanchard and François Fleuret, Occam’s Hammer. When we are interested in very tight bounds on the true error rate of a classifier, it is tempting to use a PAC-Bayes bound which can (empirically) be quite tight. A disadvantage of the PAC-Bayes bound is that it applies to a classifier which is randomized over a set of base classifiers rather than a single classifier. This paper shows that a similar bound can be proved which holds for a single classifier drawn from the set. The ability to safely use a single classifier is very nice. This technique applies generically to any base bound, so it has other applications covered in the paper.
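For reference, one commonly quoted form of the PAC-Bayes bound (the Langford-Seeger form, stated here from the general literature rather than from the paper above) makes the randomization explicit. The notation is an assumption of this sketch: P is a prior over base classifiers fixed before seeing the data, Q is any posterior, m is the sample size, and \hat{e}_Q and e_Q are the empirical and true error rates of the randomized classifier that draws a fresh h from Q for each prediction.

```latex
\Pr_{S \sim D^m}\left[\ \forall Q:\ 
  \mathrm{kl}\!\left(\hat{e}_Q \,\|\, e_Q\right)
  \le \frac{\mathrm{KL}(Q \,\|\, P) + \ln\frac{m+1}{\delta}}{m}
\ \right] \ge 1 - \delta,
\qquad
e_Q = \mathbb{E}_{h \sim Q}\, e(h),
\qquad
\mathrm{kl}(q \,\|\, p) = q \ln\frac{q}{p} + (1-q)\ln\frac{1-q}{1-p}.
```

The guarantee is on e_Q, an average over Q, which is why it says nothing directly about any one classifier h drawn from Q. That is the gap the single-classifier result described above addresses.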

Sparse Eigen Methods by D.C. Programming
Bharath Sriperumbudur - University of California, San Diego, USA
David Torres - University of California, San Diego, USA
Gert Lanckriet - University of California, San Diego, USA

Information-Theoretic Metric Learning (one of the best paper awardees)
Jason V. Davis - University of Texas at Austin, USA
Brian Kulis - University of Texas at Austin, USA
Prateek Jain - University of Texas at Austin, USA
Suvrit Sra - University of Texas at Austin, USA
Inderjit S. Dhillon - University of Texas at Austin, USA

Agnostic Active Learning - not from ICML 2007, but exciting: it was discovered last year, and theoretical bounds were proved this year at ICML 2007.
http://hunch.net/~jl/projects/agnostic_active/agnostic-active.pdf