Recent years have seen a dramatic growth of natural language text data, including web pages, news articles, scientific literature, emails, enterprise documents, and social media such as blog articles, forum posts, product reviews, and tweets. Text data are unique in that they are usually generated directly by humans rather than by computer systems or sensors, and are thus especially valuable for discovering knowledge about people's opinions and preferences, in addition to many other kinds of knowledge that we encode in text.
This course will cover search engine technologies, which play an important role in any data mining application involving text data, for two reasons. First, while the raw data may be large for any particular problem, it is often only a relatively small subset that is relevant, and a search engine is an essential tool for quickly discovering that small subset of relevant text in a large collection. Second, search engines help analysts interpret any patterns discovered in the data by allowing them to examine the relevant original text and make sense of each discovered pattern. You will learn the basic concepts, principles, and major techniques of text retrieval, which is the underlying science of search engines.

Reviews

GI

Excellent! Well organized, presented with attention to detail. Definitely will recommend and take further units in this specialization.

Thanks, Prof

GS

May 19, 2020

5/5 stars

A bit difficult to complete as the Quiz questions were tougher. But when you go through all, you might feel good.

From the lesson

Week 5

In this week's lessons, you will learn feedback techniques in information retrieval, including the Rocchio feedback method for the vector space model, and a mixture model for feedback with language models. You will also learn how web search engines work, including web crawling, web indexing, and how links between web pages can be leveraged to score web pages.
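The Rocchio feedback method mentioned above can be sketched roughly as follows (a minimal sketch, assuming term-weight vectors stored as plain dicts; the alpha/beta/gamma weights are illustrative defaults, not values prescribed by this course):

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio feedback in the vector space model.

    Moves the query vector toward the centroid of relevant documents and
    away from the centroid of non-relevant ones. `query` and each document
    are {term: weight} dicts.
    """
    updated = {t: alpha * w for t, w in query.items()}
    for doc in relevant:
        for t, w in doc.items():
            updated[t] = updated.get(t, 0.0) + beta * w / len(relevant)
    for doc in nonrelevant:
        for t, w in doc.items():
            updated[t] = updated.get(t, 0.0) - gamma * w / len(nonrelevant)
    # Negative weights are usually clipped: the updated query keeps only
    # terms with positive weight.
    return {t: w for t, w in updated.items() if w > 0}
```

Note how beta is typically set larger than gamma: positive (relevant) examples are considered more informative than negative ones.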

Instructors

ChengXiang Zhai

Video transcript

This lecture is about feedback in text retrieval. In this lecture, we continue the discussion of text retrieval methods; in particular, we're going to talk about feedback.

This diagram shows the retrieval process. The user types in a query, the query is sent to a retrieval engine (a search engine), and the engine returns results, which are presented to the user. After the user has seen these results, the user can make judgments. For example, the user says: this one is good, this document is not very useful, this one is good again, and so on. These are called relevance judgments, and using them is called relevance feedback, because we get feedback from the user based on the judgments. This can be very useful to the system, since it tells the system what exactly is interesting to the user. The feedback module takes these judgments as input, along with the document collection, and tries to improve the ranking. Typically it updates the query, so that the system can rank the results more accurately for the user. So this is relevance feedback: feedback based on relevance judgments made by the users. These judgments are reliable, but users generally don't want to make extra effort unless they have to, so the downside is that it requires extra effort from the user.

There's another form of feedback called pseudo relevance feedback, or blind feedback, also called automatic feedback. In this case, we don't have to involve users at all. We simply assume the top-ranked documents to be relevant; let's say we assume the top 10 are relevant. We then use these assumed-relevant documents to learn from and to improve the query. Now, you might wonder: how could this help, if we simply assume the top-ranked documents are relevant?
Well, you can imagine that these top-ranked documents are similar to relevant documents even if they are not actually relevant: they look like relevant documents. So it's possible to learn some terms related to the query from this set. In fact, you may recall that we talked about using language models to analyze word associations, to learn words related to the word "computer." There, what we did was first use "computer" to retrieve the documents that contain it. So imagine the query here is "computer," and the results are documents that contain "computer." We then take the top n results, which match "computer" very well, and count the terms in this set. Next, we use a background language model to choose the terms that are frequent in this set but not frequent in the whole collection. If we make a contrast between these two, what we find are terms related to the word "computer," as we have seen before. These related words can then be added to the original query to expand it, which helps us bring up documents that don't necessarily match "computer" but do match related words like "program" and "software." So this is very effective for improving search results. But of course, the assumed relevance judgments in pseudo feedback are not completely reliable, and we have to arbitrarily set a cutoff.

There's also something in between, called implicit feedback. In this case, we do involve users, but we don't ask them to make explicit judgments. Instead, we observe how the user interacts with the search results. Here we look at clickthroughs: the user clicked on this one, viewed this one, skipped this one, and viewed this one again. This, too, is a clue about whether a document is useful to the user.
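The pseudo-feedback term selection described above, choosing terms frequent in the top-ranked set but rare in the whole collection, can be sketched as a simple score contrasting a term's probability in the feedback set with its probability under the background (collection) model (a minimal sketch; the smoothing constant and cutoff are illustrative, not from the lecture):

```python
import math
from collections import Counter

def expansion_terms(feedback_docs, collection_docs, k=5, eps=1e-9):
    """Rank candidate expansion terms by log p(t|feedback) / p(t|collection).

    Each document is a list of tokens. Terms that are relatively more
    frequent in the feedback set than in the background score highest.
    """
    fb = Counter(t for doc in feedback_docs for t in doc)
    bg = Counter(t for doc in collection_docs for t in doc)
    fb_total = sum(fb.values())
    bg_total = sum(bg.values())
    scores = {
        t: math.log((fb[t] / fb_total) / (bg[t] / bg_total + eps))
        for t in fb
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Common words like "the" are frequent in the feedback set but equally frequent in the background, so the contrast pushes them down while topical words rise to the top.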
And we can even use only the snippet shown for a result, the text that was actually seen by the user, instead of the actual document. The link might even be broken so that the user can't fetch the document, but it doesn't matter: the displayed text that prompted the click is probably interesting to the user, so we can learn from such information. This is implicit feedback, and we can again use the information to update the query. It is a very important technique in modern search engines. Think about Google and Bing: they can collect a lot of user activity while they are serving us. They observe what documents we click on and what documents we skip, and this information is very valuable; they can use it to improve the search engine.

To summarize, we talked about three kinds of feedback. Relevance feedback, where the user makes explicit judgments: it takes some user effort, but the judgment information is reliable. Pseudo feedback, where we simply assume the top-ranked documents to be relevant: we don't have to involve the user, so we can actually do it before returning the results to the user. And third, implicit feedback, where we use clickthroughs: we involve the users, but the users don't have to explicitly make judgments.
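The clickthrough-based updating described above can be sketched by treating the snippets of clicked results as positive feedback text and adding their most frequent new terms to the query (a minimal sketch; the data structures, tokenization, and cutoff are illustrative assumptions, not how any production engine works):

```python
from collections import Counter

def update_query_from_clicks(query_terms, results, clicked_ids, top_n=3):
    """Expand a query using implicit feedback from clickthroughs.

    `results` is a list of (doc_id, snippet_text) pairs as shown to the
    user; `clicked_ids` is the set of doc_ids the user clicked. Only the
    displayed snippet text is used, never the full documents.
    """
    clicked_text = " ".join(
        snippet for doc_id, snippet in results if doc_id in clicked_ids
    )
    counts = Counter(clicked_text.lower().split())
    # Add the most frequent snippet terms not already in the query.
    new_terms = [t for t, _ in counts.most_common() if t not in query_terms]
    return list(query_terms) + new_terms[:top_n]
```

Skipped results (results shown but not clicked) could analogously serve as negative signal, in the spirit of the non-relevant set in Rocchio feedback.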