Thursday, 29 December 2011

Common Crawl Foundation is a California 501(c)3 non-profit founded by Gil Elbaz with the goal of democratizing access to web information by producing and maintaining an open repository of web crawl data that is universally accessible.

Saturday, 24 December 2011

Intro: WebSPHINX (Website-Specific Processors for HTML INformation eXtraction) is a Java class library and interactive development environment for web crawlers. A web crawler (also called a robot or spider) is a program that browses and processes Web pages automatically.

Sunday, 18 December 2011

Facebook is going to launch a new feature called "Timeline". It organizes all content (including walls, photos, links, ...) on a Facebook page chronologically. It looks interesting.

Now I am thinking about timelines ... for news readers. Searching on the Internet, I've found some links:
- http://html5.labs.ap.org/
- http://feeds.allofme.com/RSS_Timeline.html?target=http://www.life.com/rss/news
- http://www.labnol.org/internet/google-news-time-as-rss-reader/9089/

Timeline features like this are still under development. There is still room for us ^^.

Wednesday, 30 November 2011

Mono is a software platform designed to allow developers to easily create cross platform applications. Sponsored by Xamarin, Mono is an open source implementation of Microsoft's .NET Framework based on the ECMA standards for C# and the Common Language Runtime. A growing family of solutions and an active and enthusiastic contributing community is helping position Mono to become the leading choice for development of Linux applications.

Tuesday, 22 November 2011

Link: http://www.ngrams.info/
Intro: These n-grams are based on the largest publicly available, genre-balanced corpus of English -- the 450-million-word Corpus of Contemporary American English (COCA). With this n-gram data (2-, 3-, 4-, and 5-word sequences, with their frequencies), you can carry out powerful queries offline -- without needing to access the corpus via the web interface.
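The distributed data is essentially (word sequence, frequency) pairs. As a toy illustration of what such n-gram counts look like (not COCA's actual extraction pipeline), here is how 2- and 3-grams with frequencies can be pulled from tokenized text:

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all n-word sequences from a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

text = "the cat sat on the mat because the cat was tired"
tokens = text.split()

# Count 2-grams and 3-grams with their frequencies.
bigrams = Counter(ngrams(tokens, 2))
trigrams = Counter(ngrams(tokens, 3))

print(bigrams[("the", "cat")])  # 2
```

With counts like these stored locally, frequency queries need no round trip to a web interface.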

Monday, 7 November 2011

PiCloud is a cloud-computing platform that integrates into the Python Programming Language. It enables you to leverage the computing power of Amazon Web Services without having to manage, maintain, or configure virtual servers.

PiCloud integrates seamlessly into your existing code base through a custom Python library, cloud. To offload the execution of a function to our servers, all you must do is pass your desired function into the cloud library. PiCloud will run the function on its high-performance cluster. As you run more functions, our cluster auto-scales to meet your computational needs. Getting on the cloud has never been this easy!
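The offloading pattern described above is close to Python's own concurrent.futures interface: submit a function, get a handle back, collect the result when ready. A minimal local sketch of that call-and-collect pattern, using a thread pool as a stand-in for PiCloud's cluster (the actual PiCloud API differs):

```python
from concurrent.futures import ThreadPoolExecutor

def square(x):
    return x * x

# PiCloud-style offloading, sketched locally: submit the function and its
# argument, get back a job handle, and collect the results later.
with ThreadPoolExecutor(max_workers=4) as pool:
    jobs = [pool.submit(square, n) for n in range(5)]
    results = [job.result() for job in jobs]

print(results)  # [0, 1, 4, 9, 16]
```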

PiCloud improves the full cycle of software development and deployment. Every function run on PiCloud has its resource usage monitored, performance analyzed, and errors traced. This data is further aggregated across all your functions to give you a bird's eye view of your service. PiCloud enables you to develop faster, easier, and smarter.

AlchemyAPI is a family of products from the Alchemy company for extracting knowledge from text. The name "Alchemy" may cause confusion with the Alchemy open-source AI system developed at the University of Washington.

It is quite interesting to see how language technologies are used in real applications.

Monday, 26 September 2011

WebMatrix is a free web development tool from Microsoft that includes everything you need for website development. Start from open source web applications, built-in web templates or just start writing code yourself. It’s all-inclusive, simple and best of all free. Developing websites has never been easier.

Tbot is an automated buddy that provides translations for Windows Live Messenger. It was first launched in 2008 as a prototype and has since become immensely popular. You can have one-on-one conversations with Tbot or invite friends who speak different languages with Tbot translating for you.

Wednesday, 3 August 2011

Intro: TERp is an automatic evaluation metric for Machine Translation, which takes as input a set of reference translations and a set of machine translation output for the same data. It aligns the MT output to the reference translations and measures the number of 'edits' needed to transform the MT output into the reference translation. TERp is an extension of TER (Translation Edit Rate) that adds phrasal substitutions (using automatically generated paraphrases), stemming, synonyms, relaxed shifting constraints, and other improvements.
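Leaving aside TERp's paraphrases, stemming, synonyms and shifts, the core of TER is a word-level edit rate: edits divided by reference length. A simplified sketch (plain Levenshtein over words, without the block shifts of the full metric):

```python
def ter(hypothesis, reference):
    """Word-level edit distance (insert/delete/substitute) divided by
    reference length -- a simplified TER without block shifts."""
    h, r = hypothesis.split(), reference.split()
    # Standard Levenshtein dynamic program over words.
    d = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        d[i][0] = i
    for j in range(len(r) + 1):
        d[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            cost = 0 if h[i - 1] == r[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # delete
                          d[i][j - 1] + 1,        # insert
                          d[i - 1][j - 1] + cost) # substitute / match
    return d[len(h)][len(r)] / len(r)

print(ter("the cat sat", "the cat sat on the mat"))  # 0.5
```

Lower is better: 0.0 means the hypothesis matches a reference exactly.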

MANY is an MT system combination software whose architecture is described in the following picture:

The combination can be decomposed into three steps:

1. The 1-best hypotheses from all M systems are aligned in order to build M confusion networks (one for each system considered as the backbone).

2. All CNs are connected into a single lattice. The first node of each CN is connected to a unique initial node, with probabilities equal to the prior probabilities assigned to the corresponding backbone. The final nodes are connected to a single final node with an arc probability of one.

3. A token-pass decoder is used along with a language model to decode the resulting lattice, and the best hypothesis is generated.
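A heavily simplified sketch of the idea: if the hypotheses were already aligned word-for-word, combination would reduce to position-wise voting weighted by the system priors (the real system instead decodes the full lattice with a token-pass decoder and a language model):

```python
def combine(hypotheses, priors):
    """Toy position-wise weighted voting over already-aligned hypotheses.
    Each system's prior probability acts as its vote weight."""
    out = []
    for pos in range(len(hypotheses[0])):
        votes = {}
        for hyp, p in zip(hypotheses, priors):
            votes[hyp[pos]] = votes.get(hyp[pos], 0.0) + p
        out.append(max(votes, key=votes.get))
    return out

# Three systems' (hypothetical) outputs, aligned word-for-word.
systems = [
    ["the", "cat", "sits"],
    ["the", "cat", "sat"],
    ["a",   "cat", "sat"],
]
print(combine(systems, [0.4, 0.35, 0.25]))  # ['the', 'cat', 'sat']
```

Even when no single system produced the winning sentence, the vote can: "sat" beats "sits" here because two lower-prior systems agree on it.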

This post is to collect papers regarding the system combination problem for Machine Translation systems. (Collect everything first, filter later.)

1) Felipe Sánchez-Martínez. Choosing the best machine translation system to translate a sentence by using only source-language information. In Proceedings of the 15th Annual Conference of the European Association for Machine Translation, p. 97-104, May 30-31, 2011, Leuven, Belgium.

Cunei is a hybrid platform for machine translation that draws upon the depth of research in Example-Based MT (EBMT) and Statistical MT (SMT). In particular, Cunei uses a data-driven approach that extends upon the basic thesis of EBMT--that some examples in the training data are of higher quality or are more relevant than others. Yet, it does so in a statistical manner, embracing much of the modeling pioneered by SMT, allowing for efficient optimization. Instead of using a static model for each phrase-pair, at run-time Cunei models each example of a phrase-pair in the corpus with respect to the input and combines them into dynamic collections of examples. Ultimately, this approach provides a more consistent model and a more flexible framework for integration of novel run-time features.

Wednesday, 20 July 2011

HeidelTime is a multilingual temporal tagger that extracts temporal expressions from documents and normalizes them according to the TIMEX3 annotation standard, which is part of the markup language TimeML (with focus on the "value" attribute). HeidelTime uses different normalization strategies depending on the domain of the documents that are to be processed (news or narratives). It is a rule-based system, and because its source code and its resources (patterns, normalization information, and rules) are strictly separated, one can simply develop resources for additional languages using HeidelTime's well-defined rule syntax.
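As a toy illustration of that pattern/normalization separation (one hypothetical rule, nothing like HeidelTime's real resource files): a regex acts as the extraction resource, and a small function maps each match to a TIMEX3-style "value" attribute.

```python
import re

# Extraction resource: a single date pattern of the form "20 July 2011".
MONTHS = {"January": "01", "February": "02", "March": "03", "April": "04",
          "May": "05", "June": "06", "July": "07", "August": "08",
          "September": "09", "October": "10", "November": "11", "December": "12"}
PATTERN = re.compile(r"(\d{1,2}) (" + "|".join(MONTHS) + r") (\d{4})")

def tag(text):
    """Find date expressions and normalize them to TIMEX3-style values."""
    result = []
    for m in PATTERN.finditer(text):
        day, month, year = m.groups()
        value = f"{year}-{MONTHS[month]}-{int(day):02d}"
        result.append((m.group(0), value))
    return result

print(tag("HeidelTime was presented on 20 July 2011."))
# [('20 July 2011', '2011-07-20')]
```

Adding a language then means swapping in new patterns and month names, while the extraction code stays unchanged.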

Thursday, 7 July 2011

OPUS is a growing collection of translated texts from the web. In the OPUS project we try to convert and align free online data, to add linguistic annotation, and to provide the community with a publicly available parallel corpus. OPUS is based on open source products and the corpus is also delivered as an open content package. We used several tools to compile the current collection. All pre-processing is done automatically. No manual corrections have been carried out.

Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides resizable compute capacity in the cloud. It is designed to make web-scale computing easier for developers.

Amazon EC2’s simple web service interface allows you to obtain and configure capacity with minimal friction. It provides you with complete control of your computing resources and lets you run on Amazon’s proven computing environment. Amazon EC2 reduces the time required to obtain and boot new server instances to minutes, allowing you to quickly scale capacity, both up and down, as your computing requirements change. Amazon EC2 changes the economics of computing by allowing you to pay only for capacity that you actually use. Amazon EC2 provides developers the tools to build failure resilient applications and isolate themselves from common failure scenarios.

Saturday, 18 June 2011

Intro: The success of data-driven approaches and stochastic modeling in computational linguistic research and applications is rooted in the availability of electronic natural language corpora. Despite the central role that annotated corpora play for computational linguistic research and applications, the question of how errors in the annotation of corpora can be detected and corrected has received only little attention. The DECCA project is designed to address this important gap by exploring an error detection and correction method with potential applicability to a wide range of corpus annotations.

SALM is a C++ package that provides functions to locate n-grams and estimate their statistics in a large corpus. The SALM toolkit provides example applications such as estimating type/token frequency, locating n-gram occurrences, and a suffix-array language model that can have an arbitrarily long history for a very large training corpus.
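The suffix-array idea behind SALM can be sketched naively: sort all suffixes of the corpus once, and every n-gram's occurrences then form a contiguous range findable by binary search (SALM's actual C++ implementation is far more memory-efficient than this toy version):

```python
from bisect import bisect_left, bisect_right

corpus = "a rose is a rose is a rose".split()

# Sort all suffixes of the token list -- the suffix-array idea, done naively.
suffixes = sorted(corpus[i:] for i in range(len(corpus)))

def count(ngram):
    """Count n-gram occurrences via binary search: all suffixes that start
    with the n-gram form one contiguous range in the sorted list."""
    lo = bisect_left(suffixes, ngram)
    hi = bisect_right(suffixes, ngram + [chr(0x10FFFF)])  # just past the prefix range
    return hi - lo

print(count(["a", "rose"]))   # 3
print(count(["rose", "is"]))  # 2
```

Once the suffixes are sorted, counting any n-gram of any length costs only two binary searches, which is why the language model's history can be arbitrarily long.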

Monday, 16 May 2011

hunalign aligns bilingual text on the sentence level. Its input is tokenized and sentence-segmented text in two languages. In the simplest case, its output is a sequence of bilingual sentence pairs (bisentences).

In the presence of a dictionary, hunalign uses it, combining this information with Gale-Church sentence-length information. In the absence of a dictionary, it first falls back to sentence-length information, and then builds an automatic dictionary based on this alignment. Then it realigns the text in a second pass, using the automatic dictionary.
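A rough sketch of the length-based fallback (a toy cost model, not Gale-Church's actual probabilistic one, and without hunalign's dictionary scoring or 2-1/1-2 beads): dynamic programming over 1-1, 1-0 and 0-1 moves, penalizing character-length mismatch.

```python
def align_by_length(src, tgt):
    """Toy length-based sentence alignment: DP over 1-1, 1-0 and 0-1
    beads, scoring a 1-1 bead by character-length difference."""
    INF = float("inf")
    SKIP = 10  # flat cost for leaving a sentence unaligned (assumed value)
    n, m = len(src), len(tgt)
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            if i < n and j < m:  # 1-1 bead: pair up the two sentences
                c = cost[i][j] + abs(len(src[i]) - len(tgt[j]))
                if c < cost[i + 1][j + 1]:
                    cost[i + 1][j + 1], back[i + 1][j + 1] = c, (i, j, "1-1")
            if i < n and cost[i][j] + SKIP < cost[i + 1][j]:  # 1-0 bead
                cost[i + 1][j], back[i + 1][j] = cost[i][j] + SKIP, (i, j, "1-0")
            if j < m and cost[i][j] + SKIP < cost[i][j + 1]:  # 0-1 bead
                cost[i][j + 1], back[i][j + 1] = cost[i][j] + SKIP, (i, j, "0-1")
    # Trace back the cheapest path, keeping the (src, tgt) index pairs.
    pairs, i, j = [], n, m
    while back[i][j]:
        pi, pj, kind = back[i][j]
        if kind == "1-1":
            pairs.append((pi, pj))
        i, j = pi, pj
    return pairs[::-1]

src = ["Hello there.", "How are you today?", "Fine."]
tgt = ["Bonjour.", "Comment allez-vous aujourd'hui ?", "Bien."]
print(align_by_length(src, tgt))  # [(0, 0), (1, 1), (2, 2)]
```

In hunalign, the bisentences this first pass produces are then used to build the automatic dictionary for the second pass.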

Like most sentence aligners, hunalign does not deal with changes of sentence order: it is unable to come up with crossing alignments, i.e., segments A and B in one language corresponding to segments B’ A’ in the other language.

There is nothing Hungarian-specific in hunalign; the name simply reflects the fact that it is part of the hun* NLP toolchain.

hunalign was written in portable C++. It can be built under basically any kind of operating system.

"Welcome to YouAlign, your online document alignment solution. No software to purchase, no software to install. With YouAlign you can quickly and easily create bitexts from your archived documents. A YouAlign bitext contains a document and its translation aligned at the sentence level. YouAlign generates TMX files that can be loaded into your translation memory. YouAlign can also generate HTML files that you can publish on the Internet, or use with a full-text search engine to search for terminology and phraseology in context.

YouAlign is powered by the AlignFactory engine, which supports all kinds of formats, including Microsoft Word, Excel and PowerPoint, PDF, HTML, XML, Corel WordPerfect, RTF, Lotus WordPro and plain text."

"The corpus has most of the functionality of the other corpora from http://corpus.byu.edu (e.g. COCA, COHA, and our interface to the BNC), including: searching by part of speech, wildcards, and lemma (and thus advanced syntactic searches), synonyms, collocate searches, frequency by decade (tables listing each individual string, or charts for total frequency), comparisons of two historical periods (e.g. collocates of "women" or "music" in the 1800s and the 1900s), and more." (From Corpora-List)

Friday, 11 March 2011

The summary quality of the above systems is not actually good (I think). One possible improvement is to focus mainly on summarizing content from research papers, which contain very useful and detailed technical material. This could be related to "Related Work Summarization" (see my paper here).

RelEx is an English-language dependency relationship extractor, built on the Carnegie-Mellon link parser. It can identify subject, object, indirect object and many other dependency relationships between words in a sentence. It also generates some advanced semantic relations, such as normalizing questions for question-answering. It also proposes "frames" or "semantic roles", similar in style to those of FrameNet. RelEx includes a basic implementation of the Hobbs anaphora (pronoun) resolution algorithm. As a "by-product", it also provides more basic functions, including entity detection, part-of-speech tagging, noun-number tagging, verb tense tagging, gender tagging, and so on. RelEx now includes a Stanford parser compatibility mode, generating identical output, but more accurately and more quickly.

Scientext is a new, on-line French and English corpus of scientific texts. The corpus includes 4.8 million running tokens in French, 13 million words of research articles in English (medicine and biology), and an English-language sub-corpus of French undergraduate students' texts (1.1 million words). The corpus is organized to facilitate the linguistic study of authorial position and reasoning in scientific articles through phraseology and lexico-grammatical markers linked to causality.