Page 2 of 118 — Leticia Gutiérrez Villarías, Technical University of Denmark (IMM)

36 WHAT I HAVE IMPLEMENTED .......... 106
37 WHAT I DID NOT HAVE THE TIME TO IMPLEMENT .......... 106
XII TEST .......... 108
XIII CONCLUSION .......... 108
38 WHAT WOULD BE DONE DIFFERENTLY IF I COULD DO IT ALL OVER AGAIN .......... 108
XIV POSSIBLE EXTENSIONS .......... 109
1. WHAT DID I GAIN DOING THIS PROJECT? .......... 110
XV REFERENCES .......... 111
39 RECIPE’S WEB SITES CONSULTED .......... 114
40 DEVELOPMENT GROUPS AND INTERESTING PROJECTS ALL AROUND THE WORLD .......... 115
41 LANGUAGES RELATED TO THE SEMANTIC WEB .......... 115
42 CONSULTED DICTIONARIES .......... 115
I. GLOSSARY .......... 116

Master Thesis: Ontology-based semantic querying of the WEB with respect to food recipes

The project consists of a study of the Semantic Web and the new technologies developed for it, making a comparison with the current Web and showing the limitations of the latter.

Afterwards, an application is built to demonstrate the knowledge obtained during the preceding research. This application is an intelligent system able to understand the unstructured web pages posted on the WWW. The user can make queries about the subject of a web page, and the system will resolve them and show all the obtained results.

The main target of this project is to build a system able to answer questions based on the meaning and semantics of the data, instead of its appearance. The goal is to develop a well-structured application with a well-defined meaning, capable of understanding the semantics of the data, and thereby part of the next Web generation.

The Semantic Web will provide semantic meaning to the current Web, so it will be easier (for both people and machines) to work with its data. There are several ways to improve the Web by providing it with meaning.

One is to structure all the information available in some semantic-based form, providing the data along with its meaning. This can be done with some of the current Semantic Web languages, like XML, OWL, DAML, etc. A brief explanation of each one is provided in the next chapter.

But this is a slow task. We can hope that everyone posting new documents on the Web will do so in a semantic-based form, but besides being very unlikely, what happens with all the information already available on the net? Should we remove everything and rewrite it in a structured way? The answer is clear: of course not; that would be nonsense. The main strength of the WWW is that everybody can post anything on it, no matter what it is, no matter where it comes from, no matter how it is written.

But if we want to improve information acquisition from the documents currently spread over the net, some solution has to be found. One solution is presented as this project's goal: to extract information from the current Web and structure it in another way in order to provide semantic meaning to it.





This project develops an Information Extraction process, which extracts relevant information from an unstructured set of HTML pages in the recipes' context. This information is processed in order to provide meaning to it, so the system can “understand” the texts, extract information from them, relate it and store it. The user can then make advanced queries based on the meaning of the data instead of its appearance. All this process of providing meaning to the unstructured texts is guided by an ontology.

 Find and extract the desired information within an input set of documents
 Automatically relate and structure the extracted information
 Automatically store the information in a structured way

I began thinking about this project when I attended the course “Advanced Databases”, taught by Hans Bruun last year (2003, Spring Semester) at DTU. I was very interested in XML's utility as a semi-structured database, as well as its being a Web-oriented language. I began thinking about a possible project to exploit its potential on the Web. Afterwards I read an article written by Tim Berners-Lee [19]. It was then that I came into contact with the concept of the Semantic Web. I was fascinated by this new concept and all its unexplored possibilities.

This section describes the methodology that has guided this project. A methodology is a set of principles that help the project manager choose the methods that best fit a specific project. The use of a methodology helps to produce a better quality product, focusing on documentation standards, acceptability to the user, and maintainability and consistency of the software. It also plans the tasks to ensure that the project will be delivered on time. By defining a methodology, the reader can easily get an idea of the structure of the project, its objectives, and how they will be reached.

This project differs from most projects because its purpose responds to a specific problem but without a specific solution: to find new methods to handle some of the needs and shortcomings that appear nowadays in the WWW. The project starts from a set of broad ideas that are shaped during its development.
It is essential to discern the elements constituting the problem and how they should be improved. The three main parts of this project are:
 Gather information: Define the current shortcomings of the project's domain.
 Define what can be done: State the limits of the project scope, performing research to uncover methods that would have an interesting impact on the problem definition.
 Do it: Find the most suitable implementation for these new methods.

This is mostly a research study. It focuses on finding and discussing new methods to perform previously uncovered actions within the project scope, but the project has also been extended with the implementation of new approaches, becoming a theoretical and practical project at once.

This project has followed the waterfall schema along its development. But the theoretical waterfall diagram [Figure 1] is too rigid to be applied to an investigation project: this model divides the project into clearly separated development stages. This particular project has had a lot of feedback from one stage to the others. When new discoveries are made, it is sometimes necessary to reconsider decisions taken in previous stages. Due to this continuous feedback, a spiral model could be suitable to define the approach, but in the spiral diagram a prototype is made each time a cycle is finished, which has not been done in this project. The diagram which best models the way this project proceeded is a real waterfall diagram [Figure 2].

The Analysis-Requirements-Design stages were interleaved all the time in this project. As is explained in later chapters, some problems and new discoveries found in the implementation phase made the project go backwards to the design phase, to remodel some features in a different way.

This is the time schedule followed during the development of this master thesis. The first month was spent defining the objectives and scope of the project. The next two months were dedicated to reading articles, analyzing the state of the art, finding out the shortcomings of the current situation (concerning the project scope) and proposing different possible solutions. At the end of the third month, a proposal for a possible solution was presented.

Then the implementation phase began. The next month was spent finding out which techniques and kind of design were needed to fulfill the objectives. Once a design of the system was made, the implementation phase began; this phase is when all the ideas were coded. At the end of this phase, a program capable of all the desired features was delivered. Notice that the design and implementation phases overlapped; some decisions were reconsidered while implementing them, for several reasons described in the implementation chapters. Finally, the testing was performed. The documentation was written all along the project, since the very first months, so it reflects accurately the whole project development process.

 World Wide Web Overview: This chapter is an introduction to the problems of the current Web and to future approaches.
 Problem Analysis: This chapter presents an overview of the specific topic that has been chosen to develop this project.
 Requirements Specification: This chapter specifies the limits of the project and defines what exactly the functionality of the system is.
 Domain Modelling: This chapter describes the theoretical models that represent the domain of the project. It is a formal conceptualization of the reality.
 System Design: This chapter explains the design of this project; that is, the choice of the technologies that fulfill the Information Extraction task, based on the selected approach.
 Implementation: This chapter explains the final realization of the selected approach: what has been coded and how the diverse tools used are run.

At the beginning, the Web emerged as some computers interconnected in order to work together and share out the work (1989, Tim Berners-Lee). The Web began to grow, and intranets [see Glossary] and LANs [see Glossary] appeared. But the explosion of personal computers and major advances in the field of telecommunications were the triggers of the Web as we know it today. The growth of the WWW has been impressive in recent years.

In its first stage the Web was conceived as an exchange of documents and data and a kind of working collaboration. It was meant to be a big working place where programs and databases could share their knowledge and work together.

But with the explosion of media programs, video games, films, music, pictures, and so on, the Web is now almost exclusively used by humans and not by machines.

Its main problem is that the information on the WWW is, in most cases, written only for human consumption. The machines cannot understand the meaning of what is online. A lot of pictures, drawings, movies and natural language populate the current Web. This meaningless information is not useful at all for the machines, which cannot operate with this data; they only show it to the user using a proper format.

A big number of languages are used to publish data in the current Web. Some of them are HTML, JSP, ASP, and some media-oriented web languages such as Flash. What they have in common is the lack of semantic meaning.

The incredible growth of the Web has as a direct consequence a big explosion of all kinds of online documents. The information storage and collection works as follows: the information is stored in large databases kept on the servers, and the programs running on the servers generate web pages “on the fly” based on this data. The next picture attempts to briefly describe the information flow schema in the WWW.


Most of these online documents are made only for human consumption, making it impossible for the machines to understand their meaning. Human searching is also often a hard task with several limitations, as explained below.

    ! " ! " ! " ! " # # # # Information retrieval refers to the act of recovering information from the vast amount of on-line documents; getting the desired documents and presenting them to the user.This is the classic way to obtain information from the WWW.It does not extract any information from a document; it just picks up some documents amongall the available documents in the Web. The user will get a document or set of documentshe/she will have to analyze if he/she wants to find the desired informationThe non-structured languages of the current Web make difficult for humans, and more for themachines to locate and acquire the desired information. The current methods to retrieveinformation are browsingand keyword searching; next picture shows a schema of thisinformation acquiring.


Keyword searching normally returns vast amounts of useless data that the user has to filter by hand. “Although search engines index much of the Web's content, they have little ability to select the pages that a user really wants or needs” [Berners-Lee: http://searchwebservices.techtarget.com/sDefinition/0,,sid26_gci214349,00.html]

7.3.2.1 Example of information retrieval by keyword searching

Let us see a practical example of keyword searching and the subsequent browsing, within the recipes context. Imagine that someone is looking for a beef recipe that does not take too long, because he/she does not have much time to cook today, so he/she enters these words in an index server (Google in this case): recipe beef cooking-time 1 hour. The test has been made and 13,700 references were obtained. This is useless, as it will take the user more time to read and sort the recipes than the hour he/she wants to spend in the kitchen. He/she can try to redefine the search to be more accurate: recipe beef cooking-time less than 1 hour. This new search “only” returns 4,930 results. If the user has experience using the index server, the search can be improved with a better use of quotes, for example: recipe beef cooking-time “less than 1 hour”, which gets a more reasonable result of 25 pages. Although the search has improved considerably, the user still has to browse all the recipes to decide which one fits his/her necessities. With this kind of information retrieval, it is not assured that all the pages are recipe pages. Moreover, even among those belonging to this subject, some undesired web pages can be found: for example, one was found with the text “not less than 1 hour”, which is not at all what the user is looking for.
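By contrast, once recipes have been extracted into structured records, the same question becomes an exact filter rather than a keyword guess. The following sketch illustrates the idea only; the record fields and sample data are hypothetical, not the actual schema or results of this system:

```python
# Hypothetical structured recipe records; in this project they would be
# produced by the ontology-guided Information Extraction process.
recipes = [
    {"title": "Beef stew", "main_ingredient": "beef", "cooking_time_min": 150},
    {"title": "Beef stir-fry", "main_ingredient": "beef", "cooking_time_min": 25},
    {"title": "Tomato soup", "main_ingredient": "tomato", "cooking_time_min": 40},
]

def quick_beef_recipes(recipes, max_minutes=60):
    """Return titles of beef recipes cooked in under max_minutes."""
    return [r["title"] for r in recipes
            if r["main_ingredient"] == "beef"
            and r["cooking_time_min"] < max_minutes]

print(quick_beef_recipes(recipes))  # -> ['Beef stir-fry']
```

Unlike the keyword search above, this filter cannot be fooled by a page containing “not less than 1 hour”, because it compares values, not surface text.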

&&&& $'$($'$($'$($'$(“The Semantic Web is an idea of World Wide Web inventor Tim Berners-Leethat the Web asa whole can be made more intelligent and perhaps even intuitive about how to serve a user'sneeds. He foresees a number of ways in which developers can use self-descriptions and othertechniques so that context-understanding programs can selectively find what users want.”[http://searchwebservices.techtarget.com/sDefinition/0,,sid26_gci214349,00.html]


Because of the incredible growth of the WWW, and the difficulties of coping with all this available information (as explained in the previous chapter), the father of the Web, Tim Berners-Lee, is now trying to bring it to a new stage. He has developed a new concept of the Web where people and machines can work together and collaborate to share all kinds of information. This is called the Semantic Web. The aim of this new phase is to make the machines capable of understanding the semantics of the Web; to be able to “read” the Web as a human does. For this purpose, many different approaches have been formulated by many researchers. Most of these methods are detailed throughout this project.

Information Extraction

Instead of returning the whole Web document, as information retrieval does, a new way of getting information from the Web is needed. This is called information extraction. It consists of extracting pre-specified information out of the document and structuring it in some way so that humans and also machines can understand it and treat it. It gets facts out of the Web, instead of documents. Information extraction is much more difficult than information retrieval, but also much more beneficial; the main reason is that the extracted data is structured, so machines can “understand” it and work with it. The reason for doing this is that a lot of information is already online on the Web, but posted in many different ways.
There is no way to access the information in the servers to make the desired queries; this is only possible through the already-generated web pages, and as these are normally unstructured, only humans can read them. So it is time to reverse this process: instead of querying the databases, let us query the Web. This will also allow taking data from different heterogeneous sources and merging all the vast information that is published on the Web, giving tailored information to the user. This way it will be possible to gather all the information that is spread across the Web and reunify it. It allows combining different sources, perhaps written by very different people, for very different purposes, in very different styles and with totally different layouts. But it is a hard task to automate this process, because the machines do not “understand” the meaning of the plain data.
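The first step of “querying the Web instead of the database” is recovering the plain text from an already-generated page, since the HTML layout is all the extractor can access. A minimal sketch using Python's standard-library `html.parser` (the sample page is invented for illustration):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the visible text fragments of an HTML page."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)

def page_text(html):
    """Return the visible text of an HTML page as one string."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)

html = "<html><body><h1>Tomato soup</h1><p>Cooking time: 40 min</p></body></html>"
print(page_text(html))  # -> Tomato soup Cooking time: 40 min
```

Only after this flattening can the harder part begin: deciding which pieces of the recovered text are the facts the ontology describes.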


Once a web page is written in a semantic language, extracting information is a very easy task. The semantic-oriented languages are designed precisely to support semantic queries; the user only has to use an appropriate query language to retrieve the desired information.

Extracting information from the current Web

A lot of information is already available on the Web. We cannot expect that the entire Web will be rewritten in a structured way. This may never happen, as the Web is not a controlled organization where rules can be enforced. On the contrary, it is a very decentralized and unconstrained place where everybody can post anything they want (with the only constraints being the laws of a given country). As explained before, this big amount of unstructured online information requires new methods to gather all the scattered documents and present sensible information to the user. There is a need to make better use of the currently available information. The aim of this project focuses on this task: find a way to extract information from the current Web, although it is not structured properly. There is a need to find methods to “simulate” the Semantic Web on the current Web.

8.2.2.1 The difficulty of information extraction

Information extraction consists of a system that goes over a text with respect to a predefined context, looking for the desired information that fits the context specifications. Afterwards, this meaningful information can be structured in some way. Information extraction is a more powerful way to query the Web, but it presents some difficulties. It does not look for words that syntactically match the words the user is looking for; instead it searches the Web looking for facts, for entities and their relationships; in short, for their semantics. The problem that information extraction systems have to face is the intrinsic complexity of natural language: there are many ways to express the same fact. Below is an example of the many different ways to express the same idea in natural language, within the recipes context.

 “You need five tomatoes of fifty grams each to make the tomato soup”
 “Five tomatoes of fifty grams are needed to prepare the tomato soup”
 “This tomato dish is prepared with five tomatoes which should weigh fifty grams each to get a perfect and tasty result”
 “Ingredients for the tomato soup: 5 small tomatoes of 50 grams”
 “Take the 250 grams of tomatoes (5 approximately) and…”
 “With a quarter of a kilo of tomatoes, which corresponds to five small ones, you can prepare a delicious tomato soup”
 And so forth…
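All the paraphrases above encode the same fact. A minimal sketch of why extraction rules are needed to collapse them: the two regular-expression patterns below are purely illustrative assumptions (a real IE system uses far richer linguistic resources), but they show several surface forms mapping to one structured pair:

```python
import re

# Illustrative patterns only: map a few surface forms of "N tomatoes"
# to one structured (quantity, ingredient) fact.
WORD_NUMBERS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}

PATTERNS = [
    re.compile(r"(?P<qty>\d+|one|two|three|four|five)\s+(?:small\s+)?(?P<ing>tomatoes)"),
    re.compile(r"(?P<ing>tomatoes)\s*\((?P<qty>\d+)\s+approximately\)"),
]

def extract_quantity(sentence):
    """Return (quantity, ingredient) if any pattern matches, else None."""
    for pattern in PATTERNS:
        m = pattern.search(sentence.lower())
        if m:
            qty = m.group("qty")
            return (WORD_NUMBERS[qty] if qty in WORD_NUMBERS else int(qty),
                    m.group("ing"))
    return None

print(extract_quantity("You need five tomatoes of fifty grams each"))
print(extract_quantity("Ingredients: 5 small tomatoes of 50 grams"))
# Both sentences yield the same structured fact: (5, 'tomatoes')
```

The design point is exactly the difficulty stated above: each new wording needs a new rule, which is why ontology-guided extraction is preferable to hand-written patterns.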

The way to achieve information extraction is to build intelligent programs that can “read” the web pages and redefine them in a structured way, understandable by a machine. A brief schema of this process is shown in the next picture:

$ % " &'  “One of the biggest problems we nowadays face in the information society is informationoverload. The Semantic Web aims to overcome this problem by adding meaning to the Web,which can be exploited by software agents to whom people can delegate tasks” (EsperontoProject IST-2001-34373) [http://www.esperonto.net/semanticportal/jsp/frames.jsp]

8.2.2.2 What is an intelligent agent?

The notion of an agent belongs to the AI field. Agents have applications in many AI areas, like process control, electronic commerce, information management, etc. This last application is the one that concerns this project. Agents and intelligent agents are not the same; to show the difference, both definitions are given:

“Agents are simply computer systems that are capable of autonomous action in some environment in order to meet their design objectives” [1]

“An intelligent agent is … one that is capable of flexible autonomous action in order to meet its design objectives” [1]

Here flexible means: responding differently depending on the environment, taking initiative to achieve their goals, and interacting with other agents or humans. There are several ways to provide knowledge to such an agent; most of them are described in depth in the following sections.


With information extraction, the data and its relationships are extracted and structured so that the user can make advanced queries and obtain the desired information.

$ $ $ $ ( ,##*( ,##*( ,##*( ,##*So many different languages oriented to create the Semantic Web have appeared within thelast years. All these languages are structured languages that can carry on meaning besidesgiving structure to the text.They have different characteristics among them. Some are newer than others, and so thenewest ones use to make progress from the previous ones, evolving and improving theircharacteristics.Different levels of semantic are reached: some languages provide meaning to the texts; othersgo further and can make assertions and infer knowledge, etc.

 DARPA Agent Markup Language (DAML+OIL): an extension of XML and RDF. It can conclude statements by itself.
 Web Ontology Language (OWL): the new Semantic Web standard. It became a W3C Recommendation on 10 Feb 2004.
 Resource Description Framework (RDF): became a W3C Recommendation in 1999. It is a general framework to describe the contents of an Internet resource. It is based on metadata (data about data; a definition or description of data).
 eXtensible Markup Language (XML): a flexible text language derived from SGML. It can define both the format and the data, and exchange it worldwide.
 Standard Generalized Markup Language (SGML): a system for organizing and tagging elements of a document. SGML was developed and standardized by the International Organization for Standardization (ISO) in 1986 [http://www.webopedia.com/TERM/S/SGML.html]
In later chapters all these features are explained in detail and a comparison of the semantic languages is presented.
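As a small illustration of the simplest of these languages, the sketch below builds a recipe fragment in XML with Python's standard library. The element names (`recipe`, `ingredient`, `cookingTime`) are illustrative assumptions, not a standard recipe vocabulary:

```python
import xml.etree.ElementTree as ET

# Build a tiny, hypothetical XML document carrying a recipe fact
# together with its meaning (what each value is and in which unit).
recipe = ET.Element("recipe")
ET.SubElement(recipe, "title").text = "Tomato soup"
ingredient = ET.SubElement(recipe, "ingredient", quantity="5", unit="piece")
ingredient.text = "tomato"
ET.SubElement(recipe, "cookingTime", unit="minute").text = "40"

xml_text = ET.tostring(recipe, encoding="unicode")
print(xml_text)
```

The same data a human reads in prose is now explicitly labelled, so a machine could, for instance, select every `cookingTime` under 60 minutes with a query language instead of guessing from keywords.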

$-$-$-$-     . /. /. /. /There is a consortium that actively helps to the achieving of the Semantic Web, and can beconsidered as one of its main supporters.

“The World Wide Web Consortium (W3C) develops interoperable technologies to lead the Web to its full potential. W3C is a forum for information, commerce, communication, and collective understanding” [Definition found at the official page of the consortium: http://www.w3.org/]

The director of the consortium is none other than the “father of the Web”, Tim Berners-Lee.

He invented the World Wide Web in 1989, creating the first WWW client and WWW server; he has also defined URLs [see Glossary], HTTP [see Glossary], and HTML [see Glossary]. The W3C group develops standards (in the form of recommendations) concerning the WWW (e.g. Web definition languages such as HTML, and Semantic Web languages such as OWL, RDF, XML, etc.)

The W3C's goals can be summarized in three points:
 Provide universal access to the Web, making it accessible for everybody.
 Develop the Semantic Web: make a software environment that allows the users to better use the resources available on the Web.
 Develop a Web of Trust: consider the legal, commercial, and social issues caused by the WWW technology.

This project has the ambitious aim of contributing to the second goal, trying to improve the current Web and raise it to the second Web generation: the Semantic Web.

The transition to the Semantic Web

Step by step, the current Web will hopefully turn into the new Semantic Web. But this is not something that will happen suddenly. A study about the future of the Web [http://www.aktors.org/technologies/gate/] reports that: “for at least the next decade more than 95% of human-to-computer information input will involve textual language […] by 2012 taxonomic and hierarchical knowledge mapping and indexing will be prevalent in almost all information-rich applications […] The web revolution has based on human language materials; making the shift to the next generation (knowledge-based web) human language will remain key” [2]

Most experts agree that this will be a slow change.
The users and developers of the Web will not switch to the Semantic Web unless they have enough motivation and/or facilities. The main challenge is to provide new tools (servers, editors, browsers) to construct and browse the new semantic Web pages in an easy way, so that developers do not have to spend much time and effort creating Web pages with semantic contents, and users do not even notice that they are looking for semantically related information. If they have to spend much time and effort, this change will never happen. Until the Web begins to grow semantically, there is a need to simulate the Semantic Web on the current Web using different language-based technologies, which are analyzed in depth in the next chapters.

Problem Analysis

This chapter presents an overview of the specific topic that has been chosen to develop this project.

The chosen topic

The topic chosen for this project is online cooking recipes. Several topics were discussed at the beginning, and after a detailed study this was the chosen one. Other topics considered were: a travel planner, a TV planner, and the world heritage. They were discarded for various reasons (such as their easiness, narrow relevant information, or the lack of personal motivation for these topics).

Recipes on the Web

There are a countless number of recipes all over the Web. This is a very common topic many people are interested in, which is why it is so spread out and why so many different web pages have been found about it. Some examples of different web pages from the consulted web sites are described in [Appendix-1], along with an explanation of the different parts and recognizable elements of a recipe. As the current Web agglomerates documents posted by many different people, without any restriction on the way of describing the contents, some discrepancies were found among the studied documents, making it a challenge for the IE process to cope with this data sparseness. Some of these differences are described below.

Differences found among the studied recipes

After studying a big number of online recipes, I found a lack of standards in this topic. Some of the differences found among recipes are explained in detail (they can also be observed in [Appendix-1]):

 The nutritional value of a recipe refers to different concepts depending on the consulted web page (e.g. some recipes state this value per 100 grams, others per fellow diner, others per serving, etc.).
 The measurement unit of the nutritional facts (cholesterol, fats, carbohydrates, etc.) varies from one recipe to another. (It is normally expressed in grams, but it can also be stated in kilograms, ounces, etc.) The IE process has to be able to recognize and relate all these different data types.
 Neither can the energy value be assumed to be in a certain unit; it can appear in different units (e.g. calories, kilocalories, kilojoules, etc.).

Page 21 of 118 Leticia Gutiérrez Villarías Technical University of Denmark (IMM)

 The same problem appears with the price of the recipe. As the Web agglomerates documents posted by all kinds of people from all over the world, the price may be expressed in many different currencies (euros, crowns, dollars, etc.).
 The time units do not follow a standard either (some recipes state the time in hours, others in minutes, others in hours and minutes, etc.).
 The way of expressing time also varies from one recipe to another (e.g. 1 hour and 30 minutes, 1 h and 30 min, 1:30 h, 90 min, one hour and thirty minutes, ninety minutes, etc.).
 The temperature unit is not standard either (it can be expressed in degrees Fahrenheit as well as in degrees Celsius).
 Finally, the numerical values (like the quantity of an ingredient, the number of fellow diners, etc.) are not expressed in a normalized way either. Some recipes express these quantities with digits (1, 2, 5) and others with words (one, two, five). Fractions are also expressed in many different ways: ½, half, 0.5, etc.

Some way of converting these data to a certain standard is needed to be able to operate on them and make comparisons.

Another big challenge is the non-standard way of defining the ingredients. There are no standards or common criteria to express the ingredients of a recipe; several ways were found among all the recipes consulted. The next subchapter goes more deeply into this problem, as it is very important to classify the ingredients of the recipes correctly.

9.2.1.1 No standardized way of referring to an ingredient

As there are no standards for describing an ingredient, several ways are used. Some recipes refer to the kind of ingredient, others to its origin, others to its parts, etc.

Kind of ingredient vs. its parts

It is very common to find the whole animal as an ingredient in a recipe description (e.g. "250 gr. of chicken"); sometimes this information is refined by stating the part of the animal that should be used (e.g. "8 chicken wings").
But many others describe only the part of the animal without referring to any animal in particular, for example "200 gr. of liver". In this kind of description, the decision about which kind of animal should be used is left to the cook. All these different ways of expressing ingredients are (unfortunately for the IE task) very common, and they are combined within different recipes.
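The unit discrepancies listed above can be smoothed over with a small normalization step before any comparison is attempted. As a minimal sketch (the function name and the pattern set are illustrative, not part of the thesis implementation), cooking times in several of the notations observed above could be converted to a single unit, minutes:

```python
import re

# Map the time notations observed in the recipes to a single unit (minutes).
# The pattern list is illustrative, not exhaustive.
_TIME_PATTERNS = [
    (re.compile(r"(\d+)\s*h(?:ours?)?\s*(?:and\s*)?(\d+)\s*min", re.I),
     lambda m: int(m[1]) * 60 + int(m[2])),
    (re.compile(r"(\d+):(\d+)\s*h", re.I),
     lambda m: int(m[1]) * 60 + int(m[2])),
    (re.compile(r"(\d+)\s*h(?:ours?)?", re.I),
     lambda m: int(m[1]) * 60),
    (re.compile(r"(\d+)\s*min(?:utes?)?", re.I),
     lambda m: int(m[1])),
]

def time_to_minutes(text):
    """Return the cooking time in minutes, or None if no notation matches."""
    for pattern, convert in _TIME_PATTERNS:
        match = pattern.search(text)
        if match:
            return convert(match)
    return None

print(time_to_minutes("1 hour and 30 minutes"))  # 90
print(time_to_minutes("1:30 h"))                 # 90
print(time_to_minutes("90 min"))                 # 90
```

The same idea extends to currencies, weights and temperatures: a table of patterns per concept, each paired with a conversion into the chosen canonical unit.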

Kind of ingredient vs. its origin or other characteristics

Another example of the lack of standards is explained below. It does not concern the parts of the ingredient, but its type, origin or other characteristics.

This is, for example, the problem faced by the cheese classification (among others). A large number of recipes describe the ingredients like this:

(Referring to the kind of ingredient) "100 gr. of cheese"; others present (the sub-classification of the ingredient) "100 gr. of mozzarella", "100 gr. of parmesan", "100 gr. of ricotta"; and others have both (the ingredient and the kind of ingredient): "100 gr. of ricotta cheese". It is also normal to find cheese classifications based on its kind, without specifying a concrete one: "200 gr. of firm cheese", "250 gr. of semi-firm cheese", etc. Sometimes classifications like "150 gr. of French cheese" are also found.

Another problem concerns the origin or other characteristics of the ingredient. For example, in the wine descriptions some recipes describe it just as "wine", others refer to its color ("red wine", "white wine", "rosé"), others to the origin of the wine ("Rioja", "Ribera del Duero", "Bordeaux"), and others to its age ("vintage wine", "new wine", "reserve"), etc.

A normalized way of expressing these ingredients would be "250 grams of soft Italian cheese named mozzarella" or "a red reserve wine from the region of Bordeaux", where the entities cheese and wine are detailed with other attributes referring to their origin, kind, or other characteristics. The IE task would then be very easy: it would recognize the main entity (the ingredient) and then add the additional information about the other characteristics.

The problem is that the majority of the ingredient descriptions do not explicitly state the kind of ingredient they are referring to (wine, cheese, chicken, etc.). This main word is left out because the reader is supposed to know what these features refer to, for example that "Rioja" refers to a wine and "Mozzarella" refers to a cheese.
The aim is to make the intelligent agent know this as well, but a great deal of information has to be carefully detailed in order to provide this knowledge.

This lack of standards or official sites has caused the greatest problems during the development of this project. But it was also the most interesting challenge I had to face, and it reflects the real state of the current Web: no standards, no consensus, no rules, just a free space where anyone can post their ideas; this is the ideal of the World Wide Web.

What this project intends to address is this lack of standards in the recipes field, by automatically understanding the different ways of expressing a recipe, extracting its relevant information and structuring it in such a way that a machine can easily understand its content.
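As an illustration of the kind of knowledge the agent needs, a minimal lexicon mapping specific terms to their implicit ingredient class could look like this (the dictionary entries and the function name are illustrative, not the thesis ontology):

```python
# A minimal sketch of the lexical knowledge an agent needs in order to
# resolve ingredient names to their implicit class. Entries are illustrative.
INGREDIENT_LEXICON = {
    "rioja": "wine",
    "bordeaux": "wine",
    "mozzarella": "cheese",
    "parmesan": "cheese",
    "ricotta": "cheese",
}

def ingredient_class(term):
    """Return the ingredient class a term implies, or the term itself."""
    return INGREDIENT_LEXICON.get(term.lower(), term.lower())

print(ingredient_class("Rioja"))       # wine
print(ingredient_class("Mozzarella"))  # cheese
print(ingredient_class("chicken"))     # chicken
```

A real system needs such a lexicon to be carefully curated per domain, which is exactly the detailed knowledge engineering effort the paragraph above refers to.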

././

./0,.10,.10,.10,.1****This chapter will analyze the different ways to perform Information Extraction within thecurrent unstructured Web.


Below, several current information extraction approaches are described in depth. They have all been compared, highlighting their weaknesses and strengths and explaining which kind of texts each one is focused on. All of them have been considered for this project's information extraction task. I will present the one my Master Thesis focuses on, explaining all the reasons that made me make this choice.

"""" ## ##

Although this approach does not really retrieve information from the unstructured current Web, it can be considered part of the incoming Semantic Web, because it improves the meaning of the current web pages. So it is fair to take it into account and explain it here.

10.1.1.1 What is an Annotation?

Annotations are commentaries, notes, texts or appended files made on an existing web file. These annotations are external documents that enrich the current source without changing the web code.

10.1.1.2 How does it work?

Everybody can leave annotations on a web page (if it allows it). The user needs an annotation client installed on his computer so he can introduce an annotation into the web page. Immediately afterwards this annotation is stored on an annotation server, so all the users that visit the page can see it.

10.1.1.3 Pros and cons

A summary of the advantages and disadvantages of using annotations to improve the meaning of the current web is shown below:

Advantages:
 The original web page stays the same; the annotations are attached to the web documents externally, without modifying the code. They are stored as independent documents on another server (the annotation server), so they do not interfere with or change the original web page, and the efficiency and download speed of the page are not damaged.
 There is a W3C open annotation project called Annotea.

Disadvantages:
 It is still difficult to annotate pages, and not everybody knows about it.
 The user needs to be aware of what annotations are and install an annotation client on his computer.
 It is time-consuming, and it does not assure that meaning is provided to the web page; the annotations can be just plain text that users post to give suggestions or extend the web contents, without providing any semantics to the page.
 They are also sometimes difficult to trust, since anybody can post an annotation.

10.1.1.4 Required document's features

Any kind of document can be annotated, as long as it is related to an annotation server. More information about the W3C annotation project can be found in [Appendix-2].
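The client/server flow described above can be pictured with a toy in-memory annotation store (a sketch only; real systems such as Annotea exchange annotations over HTTP, and the class and method names here are my own):

```python
from collections import defaultdict

# Toy annotation server: annotations live outside the annotated page,
# keyed by the page's URL, so the page itself is never modified.
class AnnotationServer:
    def __init__(self):
        self._store = defaultdict(list)

    def post(self, page_url, author, text):
        """Store an annotation for a page, as an annotation client would."""
        self._store[page_url].append({"author": author, "text": text})

    def annotations_for(self, page_url):
        """Return every annotation posted for a page."""
        return list(self._store[page_url])

server = AnnotationServer()
server.post("http://example.org/recipe", "user1", "Works well with less salt.")
print(server.annotations_for("http://example.org/recipe"))
```

This also makes the main drawback visible: the stored text is free-form, so nothing forces it to carry machine-usable semantics.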

10.1.2 Natural Language Processing (NLP)

Human natural language, its rules and characteristics are the backbone of the NLP approach. This approach tries to extract knowledge by deeply studying the characteristics of the texts. It is an old approach, used in the AI field a long time ago. It now aims to teach computers to understand human language as a human does; this way humans and computers could fully interact. Some research done in this field tries to carry on conversations between humans and machines, to make the machines able to answer questions, give advice, and a long list of etceteras.

Advantages:
 Highly effective on plain free text.

Disadvantages:
 Not effective with incomplete language structures.
 Difficult to apply, and unnecessary or ineffective on web pages, because of the extra-linguistic structures (HTML tags, document formatting, etc.).
 Laborious to develop.
 It is content search; it ignores the information the web structure provides.

10.1.2.4 Required document's features

It is necessary to have the data written in natural language, and the approach performs much better if the sentences are complete and follow the grammatical rules.

"""" *$*$ *$*$

10.1.3.1 What is an Ontology?

"An Ontology is a formal specification of a shared conceptualization" [Studer, R.; Benjamins, V.R.; Fensel, D. Knowledge Engineering: Principles and Methods. IEEE Transactions on Data and Knowledge Engineering]

10.1.3.2 How does it work?

Ontologies are conceptual models that describe the data of interest and control the information-extraction process. They do not rely on the underlying page structure; instead, they rely on recognizable constants that describe the document's content, so they are tied to a certain field of knowledge. The conceptual model instance describes the lexical appearance, the keywords and the relationships of the data of the domain of interest. The ontology provides the schema used to extract and structure the data: it guides the information extraction from the texts and its subsequent structuring.
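The idea of ontology concepts carrying recognizable lexical patterns can be sketched as follows (the concept names and patterns are illustrative, not the thesis ontology):

```python
import re

# Each ontology concept carries the lexical patterns that make its
# instances recognizable in free text, independently of page layout.
ONTOLOGY = {
    "quantity": re.compile(r"\b\d+(?:[.,]\d+)?\b"),
    "unit": re.compile(r"\b(?:gr?\.?|grams?|kg|ounces?|oz)\b", re.I),
    "time": re.compile(r"\b\d+\s*(?:min(?:utes)?|h(?:ours)?)\b", re.I),
}

def extract(text):
    """Return every (concept, matched string) pair found in the text."""
    found = []
    for concept, pattern in ONTOLOGY.items():
        for match in pattern.finditer(text):
            found.append((concept, match.group()))
    return found

print(extract("250 gr. of chicken, cooked for 90 min"))
```

Because the extraction is driven by the content patterns rather than by HTML tags, the same sketch works unchanged if the page layout is reshuffled, which is exactly the layout insensitivity claimed for this approach.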


10.1.3.3 Pros and cons

Advantages:
 The ontology is made manually, but only once for each domain (it covers all the web pages for that domain).
 It is insensitive to changes in web-page format.
 This approach does not rely on the order of the data.

Disadvantages:
 An ontology is only useful for the domain it was constructed for. If the domain changes, the ontology has to be redefined, with the additional work of making a different ontology for each topic.
 The pages need to have some particular characteristics for this approach to apply.
 Another inconvenience is the language it focuses on: an ontology is a conceptual model for a certain domain in a certain language.
 Also, great knowledge of the domain is required of the ontology developer, who has to know perfectly the entities of the subject and the relations between them.

This approach presents some inconveniences, but on the other hand several advantages are gained. It is very precise (very good performance rates can be obtained when a good implementation of the ontology is made). Since it relies on the data, if the data's appearance or its order changes (and web pages usually change very often), the same application can still extract information without a single change. The only domain-dependent module is the ontology model, so if it is necessary to retarget the knowledge-extraction system to another subject or another language, it is only necessary to change the ontology that describes the domain; the rest of the application remains the same.

10.1.3.4 Required document's features

Ontology conceptual modeling can be easily applied to unstructured documents with the following characteristics:

Required document's features:
 Data-rich: a document is rich in recognizable constants if it has several identifiable constants like dates, names, account numbers, ID numbers, part numbers, times, currency values, etc.
 Multiple-record: a text contains multiple records of information for the ontology if it contains a sequence of pieces of information about the main entity in the ontology.
 Narrow in ontological breadth: a text is narrow in ontological breadth if it is possible to describe the application domain with a relatively small ontology.

This is a very powerful approach, but it is not feasible to use it with all the pages posted on the Web (if good performance is desired). However, many of them do fulfil these characteristics, so if the domain web pages fit them, the Ontology approach is a very good candidate to extract their information.

10.1.4 Web query languages

This is not a method to extract information from unstructured documents, but from structured documents written in a suitable semantic language. Even so, it is described here because of its importance for this project: once the information is extracted from unstructured web pages, it can be transformed into a structured web language, on which queries can then be made in a very easy way.

10.1.4.1 What is a query language?

Web query languages address the Web as a big database that can be queried with a declarative language. Several query languages for semi-structured web languages have been developed.

10.1.4.2 Pros and cons

Advantages:
 Very effective at the query task.

Disadvantages:
 They can only be applied to structured or semi-structured webs.

10.1.4.3 Required document's features

The document has to be structured in some way the query language knows, so it can perform the extraction of the information.
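Once recipe data has been lifted into a structured form, the database-like querying this section refers to becomes trivial. A sketch over plain Python records, standing in for a real web query language (the record fields and values are invented for illustration):

```python
# Structured recipe records, as a web query language would see them
# after the information extraction step.
recipes = [
    {"name": "Pizza", "time_min": 30, "ingredients": ["mozzarella", "flour"]},
    {"name": "Risotto", "time_min": 45, "ingredients": ["rice", "parmesan"]},
    {"name": "Salad", "time_min": 10, "ingredients": ["lettuce", "tomato"]},
]

# Declarative-style query: quick recipes containing some kind of cheese.
cheeses = {"mozzarella", "parmesan", "ricotta"}
quick_cheese = [r["name"] for r in recipes
                if r["time_min"] <= 30 and cheeses & set(r["ingredients"])]
print(quick_cheese)  # ['Pizza']
```

The query itself is easy; the hard part, as the section notes, is producing the structured records in the first place.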

"/"/"/"/ ,&& ,&& ,&& ,&&Using wrappers to extract information from the Web was one of the most (or maybe the most)used way so far. The wrapper approach parses the unstructured data and maps it into astructured one, relying on the web page structure (HTML mark-up tags for instance) andpatterns.10.1.5.1 What is a wrapper?This approach builds a wrapper around the Web page and then uses traditional queries toextract the desired information. The wrappers use the underlying structure of the page toformat the information contained on it.Master Thesis: Ontology-based semantic querying of the WEB with respect to food recipes

10.1.5.2 How does it work?

There are several main tasks in developing a wrapper:

I. Structure the source

The first step aims to identify the sections and subsections of the page. This is done by identifying the tokens of interest, such as keywords or maybe complete sentences, that indicate the heading of a section, thus dividing the source into sections. For example, the sections of a recipe are the ingredients part and the way-of-doing part. This work is done relying on the HTML tags and the text appearance (like bold font, upper case, lower case, letter size, inclusion of special characters, etc.). The most common approach to this task is to use a lexical analyzer that parses the text looking for certain words that fit its regular expressions, identifying them as the page headings.
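The lexical-analyzer step could be sketched like this: scan the HTML for keywords that signal a recipe section heading (the keyword list and the assumption that headings sit in `<h*>` or `<b>` tags are mine):

```python
import re

# Keywords that, when highlighted by a heading or bold tag, signal
# the start of a recipe section. Illustrative, not exhaustive.
SECTION_KEYWORDS = re.compile(
    r"<(h\d|b)>\s*(ingredients|directions|preparation)\s*</\1>", re.I)

def find_sections(html):
    """Return the section names announced by highlighted keywords."""
    return [m.group(2).lower() for m in SECTION_KEYWORDS.finditer(html)]

page = "<h2>Ingredients</h2><ul>...</ul><h2>Directions</h2><p>...</p>"
print(find_sections(page))  # ['ingredients', 'directions']
```

Note how this sketch depends entirely on the markup: if the site stops using heading tags for its section titles, the wrapper silently finds nothing, which is precisely the fragility discussed below.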

The next step is finding out the nesting hierarchy of the Web page. For example, in the recipes context, the nesting structure of the ingredients part is that it is composed of several ingredient descriptions, each one having a quantity, a measurement unit and an ingredient name. The nesting hierarchy within the sections and subsections can be identified by the use of other heuristics. Most wrapper developers make use of these algorithms:

 Font size: it has been observed that in some Web pages (not all) the font size normally decreases as we go deeper into the nesting structure; headings tend to have a bigger font size than their sub-headings.
 Indentation space: indentation normally means that one section is nested inside another one.

This structuring task states which the interesting tokens are and what the nesting structure of the Web page is.
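The font-size heuristic can be sketched as: collect the distinct font sizes found on the page, sort them in decreasing order, and use the rank as the nesting depth (a simplification; real wrappers combine several heuristics, and the heading data here is invented):

```python
# Font-size heuristic: bigger headings sit higher in the nesting
# hierarchy, so ranking by decreasing font size yields the depth.
headings = [("Recipe", 24), ("Ingredients", 18), ("For the sauce", 14),
            ("Directions", 18)]

sizes = sorted({size for _, size in headings}, reverse=True)
depth = {size: level for level, size in enumerate(sizes)}

for title, size in headings:
    print("  " * depth[size] + title)
```

Running this prints the headings indented by their inferred depth, recovering a plausible outline of the page from layout clues alone.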

II. Build a parser for the source pages

The next task is to generate a parser for the selected source pages. This parser can be made automatically, to analyze the incoming pages according to the lexical (tokens of interest) and syntactical (grammar of the nesting structure) results obtained in the previous step. A parser can extract the desired sections from any source, as long as it follows the source structure determined in the previous step. For any other sources it is useless.


10.1.5.3 Pros and cons

Advantages:
 It is domain insensitive: when the domain changes, the wrapper remains the same.
 Valid for all kinds of data characteristics.
 The Web sources can be queried in a database-like manner, which is very familiar to many developers.
 Several web pages can be integrated with this approach, building a wrapper around them all.
 Effective when applied to highly structured HTML pages.

Disadvantages:
 It is sensitive to changes in web-page format: if the layout changes, the wrapper is useless and has to be rebuilt.
 It can easily fail to identify tokens or highlight tokens incorrectly, and it can also fail to guess the document nesting.
 It is very time-consuming to make a wrapper, and generating wrappers by hand is impractical and almost impossible.
 All the integrated pages have to be similar in layout to be handled by the same mediator.
 It is only valid for semi-structured texts; it is not effective when applied to unstructured (plain) texts, because of the data sparseness.
 It is structure-based: it ignores the context meaning.

10.1.5.4 Required document’s features

As can be guessed from the wrapper approach, the documents have to follow some strict structure. They need to be written in some markup language (HTML in my case of study), since the wrappers rely on the markup tags to guess the structure of the page; they are not meant to be used on plain texts, which would make the task more difficult. The pages also need to be well structured, with sections and subsections well defined and following a strict agreement on how to represent the different parts of the texts, so they can be easily recognized by their characteristics.

Data-rich

Studying a great number of recipes, I have found that all of them have several recognizable instances. They all have some fixed sections: the ingredient description part and the way-of-doing part. All the ingredient descriptions are composed of the name of the ingredient, the quantity of each ingredient, and the measurement unit. The way-of-doing part normally contains the cooking time, the cooking method, etc. Some of them also have additional information, like the season of the ingredients, the kilocalories of the dish, and further entities. So much recognizable data is found in the recipes context.

Multiple-record

All the recipes I have found so far have multiple ingredient descriptions.

Normally one ingredient description is found per line of writing, but this is irrelevant information for an Ontology (the contents are what guide the information extraction, not the layout). This information would be useful for the wrapper approach instead.

Narrow in ontological breadth

The recipes domain can be modeled with a relatively small Ontology. Everything depends on the level of detail wanted in the ingredients classification, but the general recipes model is easy to handle.

After a deep study of all the available methods to query the current Web, the Ontology-based approach was chosen. The reasons for following this conceptual-modeling extraction are basically the document features: since the recipes' structure perfectly fits the Ontology-based approach, this has been the one chosen, as it can be applied to all kinds of web pages (both highly structured ones and freer texts). The Ontology approach is not as tedious as the NLP one, and it is more web-oriented: while NLP is more oriented to plain texts, Ontologies are oriented to web texts. Wrappers have also been considered, but they were discarded because they focus only on the data structure, not the data meaning. The data layout of different recipes has been studied, finding that not all follow the same pattern: some are designed with some indentation, others with tables, and others with blank spaces, etc. So no fixed pattern can be applied to follow the wrapper approach.

Although this project focuses on HTML pages, because these are the most common pages posted on the net nowadays, this approach can be directly applied to any kind of unstructured texts posted on the net, as well as to plain text without any format at all, as long as the data is written as text, not as graphs, pictures, animations, or any other multimedia content.

The definition of Ontology has always carried a lot of controversy. It has been defined in very different fields, each of them focusing on the characteristics they want the Ontology for. Some of these definitions are presented below.

The most traditional description of the term Ontology can be found in any dictionary: "the science or study of being", as described in the Oxford English Dictionary. This agrees with the etymology of the word: it comes from Greek and means the science of beings, or the general doctrine of being ("onto" means existence, being).

Other fields also give their particular vision of what an Ontology is. The Ontology concept belongs to metaphysics; it is actually the main part of it. In the philosophical environment it is referred to as "the branch of metaphysics that deals with the nature of being" or "the study of the kinds of things that exist"; the philosopher Aristotle attempted to classify the things of the world. In logic circles an Ontology is known as "the set of entities presupposed by a theory".

In terms of Artificial Intelligence, an Ontology is defined as "the specification of a conceptualization"; that is, defining terms and the relationships between these terms in some formal way. This is the most useful definition, as AI is the field of this project. So from now on the Ontology will be referred to as "a set of knowledge terms, including the vocabulary, the semantic interconnections and some simple rules of inference and logic for some particular topic" [2].

Another definition refers to the AI systems' need to reuse and share knowledge. For this purpose it is necessary to define the common vocabulary in which this knowledge is represented.
For this purpose: "A specification of a representational vocabulary for a shared domain of discourse -- definitions of classes, relations, functions, and other objects -- is called an ontology" [Gruber, T. (1993). A Translation Approach to Portable Ontology Specifications. Knowledge Acquisition]

An Ontology specifies a conceptualization: it represents an abstract and simplified view (vocabulary, relationships and logical rules) of the piece of reality it wants to represent. By committing to an Ontology, Web agents know which field and vocabulary they are referring to. It facilitates knowledge sharing in a certain field, laying the foundations of the language in that context, so that several agents can interoperate in that field.

When two remote applications or agents hold a dialogue, there has to be an unambiguous frame and a common language to talk in. This can be achieved by sharing references to the Ontologies currently available on the net. The Ontologies are a consensus about a common domain of discourse: they lead the conversation between web agents and they give a possible interpretation of the data that is posted on the web, but they never constrain what can be published. This understanding is essential to accomplish automatic tasks on the Web, like transactions, e-commerce tasks, B2B, B2C, etc.

However, the commitment to an Ontology does not guarantee complete interoperation among agents. Some agents can have the capacity to assert answers to certain queries, while others can assert other kinds of knowledge. What the Ontology does guarantee is the coherence and consistency of the knowledge shared among different agents, not its completeness. [35]

Some of the Ontology utilities for the Semantic Web are:

 Web querying: how to query the web efficiently, to easily find the documents with the desired characteristics.
 Web sources integration: find similarities between different web pages about the same subject and integrate them all, increasing their knowledge.
 Restructure current sites.
 Present different views of the same thing.

There are two possible approaches to implementing these functionalities; both will be discussed in detail in chapter 23.

Several attempts to create an Ontology can be made at different degrees of detail. Depending on the level of detail, we can refer to different concepts:

 The simplest one is a simple group of lexicons and vocabularies.

 More complete is grouping together words that have a similar meaning, creating thesauri [see definition in the glossary].
 We can go further and create a taxonomy [see definition in the glossary]: a system where things are hierarchically organized and named in groups with similar characteristics, and where the groups can be given different properties.

 Finally, a complete Ontology can be defined when the concepts are related to other concepts. The most advanced stage of an Ontology is when it is capable of defining new knowledge.

The aim of this project is to create a complete Ontology, defining all the relationships among the concepts.

Several kinds of Ontologies can be defined based on different features. The next picture shows the different kinds of existing Ontologies based on different criteria, like the point of view, level, subject, language, etc. [Approaches to ontology design, by Jørgen Fischer]

[Figure: classification of ontologies by view (philosophical / pragmatic), level (top-level / specific), subject (universal, general, domain-specific, task-specific, task-independent, application-specific), language (language-dependent / language-independent) and formalizing (formal / not formal). Based on: Guarino, Nicola (1998). Formal Ontology and Information Systems. In: Formal Ontology in Information Systems, Proceedings of the First International Conference (FOIS'98), June 6-8, Trento, Italy, 3-15. Ed. Nicola Guarino. Amsterdam: IOS Press; and on Bodil Nistrup Madsen.]

I will explain only the level-based classification; there are several kinds of Ontologies depending on their level:

 Upper-level or universal Ontologies: describe the concepts and relationships of any information of any domain in natural language. They provide a unified upper-level vocabulary that allows different systems to communicate with each other.
 Top-level Ontologies: state fundamental categories and their connections. They ease and guide metadata representation and organization.
 Specific Ontologies: Ontologies specialized in a given domain.
o Regional Ontologies: describe a more concrete domain level, covering specific fields like medicine, cooking, business, etc. They normally comprise diverse local Ontologies.
o Local Ontologies: even more specific than regional Ontologies. The recipes Ontology can be classified as a local Ontology that is part of the culinary regional Ontology.


The upper-level Ontologies are a relatively new approach. It is very interesting, and very ambitious as well: it aims to create an Ontology that can be used for any context, defining standards for the Semantic Web. [5]

The ontology defined in this project has the following characteristics: view: pragmatic; level: specific; subject: specific; purpose: task-specific (so it is an application-specific ontology); language-dependent (only for the English language); and formal (it follows a methodology).

Every kind of Ontology has two main parts:

 Terminological component: this is the Ontology schema part, similar to a database schema. It defines the terms in the ontology and their structure (their relations).
 Assertion component: this is the instance data, the population of the ontology with individual instances. This part can be taken apart from the ontology and kept in a Knowledge Base. (See chapter 25.6)
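The two components can be illustrated with a minimal sketch: a schema (terminological part) and a separate population of instances (assertion part). All concept and relation names here are illustrative, not the thesis ontology:

```python
# Terminological component: concepts and their relations (the schema).
schema = {
    "Recipe": {"has_ingredient": "Ingredient", "cooking_time": "Time"},
    "Ingredient": {"quantity": "Quantity", "unit": "Unit"},
}

# Assertion component: individual instances populating the schema.
# These could live apart from the ontology, in a Knowledge Base.
instances = [
    {"type": "Recipe", "name": "Risotto",
     "has_ingredient": [{"type": "Ingredient", "name": "rice",
                         "quantity": 250, "unit": "gr"}]},
]

# The schema tells an agent which relations a Recipe instance may carry.
print(sorted(schema["Recipe"]))  # ['cooking_time', 'has_ingredient']
```

Keeping the two parts separate is what allows the instance data to be stored and grown in a Knowledge Base without touching the schema.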

The model instance can be defined with any design language, like ER diagrams or object-oriented languages (UML, for example). There are two kinds of objects in an ontology domain: lexical and non-lexical objects. They differ in their data frames: only the lexical objects describe constant patterns and lexicon patterns for their members.