Text identification on the Internet for global search engine use

NOTE: THE APPLICATION PERIOD FOR THIS THESIS PROJECT HAS EXPIRED.

About Picsearch
Picsearch is a premium provider of image and multimedia search services. The services are market leading in relevancy and family friendliness. Picsearch powers several leading Internet sites as well as its own Internet properties. Picsearch was founded in 2000, is privately held and headquartered in Stockholm, Sweden. The image service can be found at www.picsearch.com.

Thesis worker requirement
Excellent programming skills in C++ are required; programming skills in C, Python and Perl are preferable. An understanding of web protocols and familiarity with large databases are also required. Correctness and scalability are the two most important goals when developing at Picsearch. The general goal for all thesis projects is to evaluate and implement one or more solutions to the given problem. A correct solution is preferred to an efficient one. In other words, proof-of-concept solutions are prioritized over industrial-strength applications. All work must be done in C++ and compile on a standard GNU/Linux platform.

Thesis project description
Below are a few suggestions for what your thesis work at Picsearch could be about. If you have ideas of your own that you think would benefit us, or if you would like to suggest modifications to our suggestions, feel free to tell us about them.

1. Phrase identification
A phrase is defined as a sequence of two or more words. This project aims to find an automated way of identifying common phrases used on the Internet (for example “Britney Spears” or “to be or not to be”). Your program will run on a very large set of textual data, so scalability will be an important concern.
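As a starting point for thinking about this project, the sketch below counts word n-grams and keeps those that occur at least a given number of times. It is only a naive in-memory baseline under stated assumptions: the function names (tokenize, count_ngrams, frequent_phrases) and the threshold parameter are hypothetical, whitespace splitting stands in for real tokenization, and a raw frequency count is just one of many possible ways to decide what counts as a "common" phrase. A data set of Internet scale would not fit in a single std::map.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Split text on whitespace. Real web text would also need punctuation
// handling, case folding and encoding detection; this is a simplification.
std::vector<std::string> tokenize(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> tokens;
    std::string word;
    while (in >> word) tokens.push_back(word);
    return tokens;
}

// Count every run of n consecutive tokens. A phrase is two or more words,
// so n must be at least 2.
std::map<std::string, std::size_t> count_ngrams(
        const std::vector<std::string>& tokens, std::size_t n) {
    assert(n >= 2);
    std::map<std::string, std::size_t> counts;
    for (std::size_t i = 0; i + n <= tokens.size(); ++i) {
        std::string phrase = tokens[i];
        for (std::size_t j = 1; j < n; ++j) phrase += " " + tokens[i + j];
        ++counts[phrase];
    }
    return counts;
}

// Keep the phrases seen at least min_count times (hypothetical threshold).
std::vector<std::string> frequent_phrases(
        const std::map<std::string, std::size_t>& counts,
        std::size_t min_count) {
    std::vector<std::string> result;
    for (const auto& entry : counts) {
        if (entry.second >= min_count) result.push_back(entry.first);
    }
    return result;
}
```

Calling count_ngrams(tokenize(text), 2) and then frequent_phrases(counts, 2) on a small text returns the two-word phrases that appear at least twice. Longer phrases such as “to be or not to be” would need larger values of n or a different approach altogether.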

2. Query correction suggestion
This problem consists of finding out what the user was really searching for. There are many reasons why a user may be looking for something other than what he/she actually typed (misspellings are probably the most common reason). The goal of this project is to find a way to give appropriate suggestions when there is a high probability that the user has typed something other than what he/she meant. Note that although this problem intuitively seems closely related to spellchecking, it is not: most queries consist of only one word, and suggestions must be produced within a time limit.
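One simple baseline to compare against is sketched below: compute the Levenshtein edit distance between the typed query and a list of known queries, and suggest the closest one within a distance limit. This is an illustration, not the method the project asks for. The function names, the list of known queries and the max_distance parameter are all assumptions, and a linear scan over every known query would probably not meet a real time limit.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Levenshtein distance: the minimum number of single-character insertions,
// deletions and substitutions needed to turn a into b. Works on bytes, so
// multi-byte encodings are not handled correctly.
std::size_t edit_distance(const std::string& a, const std::string& b) {
    std::vector<std::size_t> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            std::size_t subst = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            cur[j] = std::min({subst, prev[j] + 1, cur[j - 1] + 1});
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}

// Return the known query closest to the typed one, or an empty string when
// the typed query is itself known or nothing lies within max_distance.
std::string suggest(const std::string& query,
                    const std::vector<std::string>& known,
                    std::size_t max_distance) {
    std::string best;
    std::size_t best_distance = max_distance + 1;
    for (const std::string& candidate : known) {
        std::size_t d = edit_distance(query, candidate);
        if (d == 0) return "";  // exact match: no correction needed
        if (d < best_distance) {
            best_distance = d;
            best = candidate;
        }
    }
    return best;
}
```

With the known queries "britney", "brittany" and "bridge", suggest("britny", known, 2) picks "britney", which is one edit away. A fuller solution would presumably also weigh how popular each candidate query is, which edit distance alone ignores.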

3. Related queries
Using the information available on the Internet, find phrases that relate to each other. The phrases in this case will be a subset of the queries to the Picsearch search engine. For instance, if searching for “muscle car” you might also be interested in “dragster” or “hot rod”.
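One possible way to make "relate to each other" measurable is sketched below: treat two queries as related when they tend to appear in the same documents, and score the overlap with the Jaccard coefficient. Everything here is an assumption made for illustration, including the function names, the index that maps each query to the ids of the documents containing it, and the min_similarity threshold. The project itself leaves the choice of relatedness measure open.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

using DocIds = std::set<int>;  // ids of documents that contain a query

// Jaccard coefficient: size of the intersection divided by size of the union.
double jaccard(const DocIds& a, const DocIds& b) {
    if (a.empty() && b.empty()) return 0.0;
    std::size_t shared = 0;
    for (int id : a) shared += b.count(id);
    return static_cast<double>(shared) /
           static_cast<double>(a.size() + b.size() - shared);
}

// Return the other queries whose document sets overlap with the given
// query by at least min_similarity, most similar first.
std::vector<std::string> related_queries(
        const std::string& query,
        const std::map<std::string, DocIds>& index,
        double min_similarity) {
    std::vector<std::string> result;
    auto self = index.find(query);
    if (self == index.end()) return result;  // unknown query

    std::vector<std::pair<double, std::string>> scored;
    for (const auto& entry : index) {
        if (entry.first == query) continue;
        double s = jaccard(self->second, entry.second);
        if (s >= min_similarity) scored.emplace_back(s, entry.first);
    }
    // Highest similarity first; ties are broken alphabetically.
    std::sort(scored.begin(), scored.end(),
              [](const std::pair<double, std::string>& x,
                 const std::pair<double, std::string>& y) {
                  if (x.first != y.first) return x.first > y.first;
                  return x.second < y.second;
              });
    for (const auto& item : scored) result.push_back(item.second);
    return result;
}
```

If “muscle car” and “hot rod” share most of their documents while an unrelated query shares none, related_queries("muscle car", index, 0.4) returns “hot rod” and leaves the unrelated query out.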

4. Word segmentation
Word segmentation is needed when dealing with Asian languages that are written without natural word separators. In such languages the boundary and meaning of a word can only be determined from its context, which is a problem for search engine indexing. Evaluate the available methods and implement one or more word segmentation algorithms, either generic or language specific.
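To illustrate why context matters, the sketch below implements forward maximum matching, a simple dictionary-based segmentation baseline: at each position it takes the longest dictionary word that fits and falls back to a single character otherwise. The function name segment, the dictionary contents and the max_len parameter are assumptions made for this example, and the text is assumed to be already decoded into UTF-32 code points.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Forward maximum matching: greedily take the longest dictionary word
// starting at the current position (at most max_len characters); if none
// matches, emit a single character and move on.
std::vector<std::u32string> segment(const std::u32string& text,
                                    const std::set<std::u32string>& dict,
                                    std::size_t max_len) {
    assert(max_len > 0);
    std::vector<std::u32string> words;
    std::size_t pos = 0;
    while (pos < text.size()) {
        std::size_t len = std::min(max_len, text.size() - pos);
        // Shrink the candidate until it is a dictionary word or one character.
        while (len > 1 && dict.count(text.substr(pos, len)) == 0) --len;
        words.push_back(text.substr(pos, len));
        pos += len;
    }
    return words;
}
```

With a dictionary containing 研究, 研究生, 生命 and 起源, the input 研究生命起源 is split into 研究生, 命 and 起源, because the greedy match takes 研究生 first. The reading 研究 / 生命 / 起源 is missed, which shows how a purely dictionary-driven method fails where context is needed to find the boundary.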

Application
If you are interested in doing your thesis work at Picsearch, please send an e-mail to [email protected] and specify which of the projects you are most interested in (or describe your own suggestion for what you would like to do). Please also provide a copy of your grades and a CV.

Send in your application before Friday the 16th of November. Applications will be answered immediately.