In today's digital society, it is estimated that around 2.5 quintillion bytes (2.5 exabytes) of data are created every day, to the point that 90% of the world's data has been generated in the last two years alone. These data come from all types of sources: sensors used to gather climate information, posts on social networks, blogs, digital images and video, etc. For instance, Twitter generates about 8 terabytes of data per day, while Facebook captures about 100 terabytes. This is what is known as Big Data. One of the main characteristics of this volume of information is that, in many cases, it is not structured.

Natural Language Processing (NLP) is considered one of the methodologies best suited to structuring and organizing the textual information accessible on the Internet. Linguistic processing of large amounts of text is a complex task that requires several subtasks organized into interconnected modules. The main problems faced by NLP researchers are the high computational cost and poor scalability of their tools, which make them impractical for analyzing large volumes (gigabytes or even terabytes) of documents. For this reason, we believe that High Performance Computing (HPC) and Big Data technologies are a natural fit to address the poor performance of language processing modules.
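The idea of organizing linguistic processing as a chain of interconnected modules can be sketched as follows. This is a minimal illustration, not part of the project's tools: the stage names and their toy implementations (a whitespace tokenizer and a dummy tagger) are hypothetical stand-ins for real NLP modules.

```python
# Sketch of an NLP pipeline as a chain of interconnected modules.
# Each stage consumes the output of the previous one.

def tokenize(text):
    # Toy tokenizer: split raw text on whitespace.
    return text.split()

def tag(tokens):
    # Dummy part-of-speech tagger: capitalized tokens get "NOUN".
    return [(tok, "NOUN" if tok[0].isupper() else "X") for tok in tokens]

def pipeline(text, stages):
    # Run the text through each module in order.
    result = text
    for stage in stages:
        result = stage(result)
    return result

if __name__ == "__main__":
    print(pipeline("Twitter generates data", [tokenize, tag]))
    # [('Twitter', 'NOUN'), ('generates', 'X'), ('data', 'X')]
```

In a real suite each stage would be a full linguistic module (tokenization, morphological analysis, parsing), and the chained structure is precisely what makes per-module performance and scalability critical.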

The main goal of the project is to develop a new set of tools and software solutions for Big Data processing that will allow a set of multilingual natural language processing modules to be integrated into a parallel and scalable suite. This suite must process large amounts of text within short execution times while making efficient use of the targeted high-performance systems, paying special attention to heterogeneous architectures. Note that the new NLP modules could be used in more complex, higher-level linguistic applications such as machine translation, information retrieval, question answering, or even new intelligent systems for technological surveillance and monitoring. In addition, the new tools resulting from the project will be general-purpose, so they could be applied to codes and applications from any research area.
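One common way such a suite can exploit parallelism is to distribute independent documents across worker processes. The sketch below, under the assumption of an embarrassingly parallel per-document workload, uses Python's `multiprocessing`; the `analyze` function is a hypothetical stand-in for a costly NLP module.

```python
from multiprocessing import Pool

def analyze(doc):
    # Hypothetical stand-in for an expensive NLP module:
    # here it just counts tokens in the document.
    return len(doc.split())

def process_corpus(docs, workers=4):
    # Distribute documents across worker processes. Since each
    # document is analyzed independently, throughput can scale
    # with the number of available cores.
    with Pool(processes=workers) as pool:
        return pool.map(analyze, docs)

if __name__ == "__main__":
    corpus = ["first short document",
              "another one",
              "a slightly longer document here"]
    print(process_corpus(corpus, workers=2))
    # [3, 2, 5]
```

On heterogeneous architectures the same document-level decomposition applies, but the scheduler would additionally have to decide which device (CPU, GPU, accelerator) runs each module.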

Therefore, this is a multidisciplinary project that brings together two traditionally distant research areas: HPC and NLP. In addition, the expected results of the project have great potential for transfer to industry.