Key topics

Page Rank Overview

Brief description

Google and other search engines, including the one that you're going to run, don't actually want the web; they want their own copy of it to mine. This lesson walks through a simple crawler, the robots.txt file that tells search engines where to look and where not to look, the page rank computation, and the HTML and d3.js visualization in which the bigger dots are the pages with the better page rank.

English text of the lesson

So now we’re going to write a set of applications, and the code is in pagerank.zip. There’s a simple web page crawler and a simple web page indexer, and then we’re going to visualize the resulting network using a visualization tool called d3.js. In a search engine, there are three basic things that we do. First, we have a process, usually done when the computers are bored, that crawls the web: it retrieves a page, pulls out all the links, keeps an input queue of links, goes through those links one at a time, marks off the ones we’ve got, picks the next one, and on and on and on. That front-end process is called spidering or crawling. Then, once you have the data, you do what’s called index building, where you look at the links between the pages to get a sense of which pages are the most centrally located and the most respected, where respect is defined by who points to whom. And then we actually look through and search. In this case, we won’t really search it; we will visualize the index when we’re done.
A web crawler is a program that browses the web in some automated manner. The idea is that Google and other search engines, including the one that you’re gonna run, don’t actually want the web. They want a copy of the web, and then they can do data mining within their own copy. It’s so much more efficient than having to go out and look at the web each time; you just copy it all. So the crawler slowly but surely crawls and gets as good a copy of the web as it can. Like I said, its job is to repeat: retrieve a page, pull out all the links, add the links to the queue, pull the next one off, and do it again and again and again, saving all the text of those pages into storage. In our case, that storage will be a database. In Google’s case, it’s literally thousands or hundreds of thousands of servers. But for us, we’ll just do this in a database.
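The retrieve, extract, enqueue loop described above can be sketched in a few lines. This is only an illustration, not the crawler from pagerank.zip: the `crawl` function, the `LinkParser` class, and the three-page `FAKE_WEB` dictionary are all made up for the example, and a dictionary lookup stands in for a real network request so the sketch runs offline.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Retrieve pages in queue order; return {url: html} for retrieved pages.

    fetch is any callable that returns the HTML for a URL, or None on failure.
    """
    queue = deque([start_url])   # input queue of links still to visit
    seen = {start_url}           # every link we have already queued
    storage = {}                 # our own copy of the pages
    while queue and len(storage) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue             # page could not be retrieved; move on
        storage[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)   # resolve relative links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return storage


# Hypothetical three-page "web" standing in for real HTTP requests.
FAKE_WEB = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/">home</a>',
    "http://example.com/b": '<a href="/a">A</a> <a href="/missing">gone</a>',
}

local_copy = crawl("http://example.com/", FAKE_WEB.get)
print(sorted(local_copy))
```

To crawl real pages, `fetch` could be replaced by a function that downloads the URL, with `max_pages` keeping the run small.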
Now, web crawling is a bit of a science. We’re gonna be really simple. We’re just going to try to get to the point where we’ve crawled every page that we can find, once. That’s what this application is gonna do. But in the real world, you have to pick and choose how often to crawl and which pages are more valuable. Real search engines tend to revisit pages more often if they consider those pages more valuable, but they also don’t want to revisit them too often, because Google could crash your website and make it so that your users can’t use it ’cause Google is hitting you so hard. There’s also, in the world of web crawling, this file called robots.txt. It’s a simple file for search engines: when they see a domain or URL for the first time, they download this file, and it informs them where to look and where not to look. You can take a look at Pythonforeverybody.com, look at its robots.txt, and see what my website is telling all the spiders about where to go look and where the good stuff is.
So at some point you’ve built this, you have your own storage, and it’s time to build an index. The idea is to figure out which pages are better than other pages. Certainly, you start by looking at all the words in the pages: Python, word splits, etcetera, etcetera. But the other thing we’re going to do is look at the links between the pages and use those links as a way to ascribe value. So here’s the process that we’re going to run. There are a couple of different pieces, and the code for all of this is sitting in pagerank.zip. The way it works is that it only spiders a single website. You can spider dr-chuck.com, or you can actually spider Wikipedia. It’s kind of interesting, but on Wikipedia it takes a little longer before the links start to go back to one another. But Wikipedia is not a bad place to start.
That’s especially true if you want to run something long, because at least Wikipedia doesn’t get mad at you for using it too much. As with all of these data mining things, the crawler works from a list of URLs. Some of the URLs have data and some do not. It randomly picks one of the unretrieved URLs, goes and grabs that URL, parses it, and puts the data in for that URL, but then it also reads through the page to see if there are more links. So in this database there are a few pages that are retrieved and lots of pages yet to retrieve. Then it goes back and says, oh, let’s randomly pick another unretrieved page. Go get that one, pull it in, put the text for that one in, then look at all the links and add those links to our list. If you watch this, even if you do one or two documents at a time, you might think, whoa, that was a lot of links, and then you grab another page and, whew, there are 20 links or 60 links or 100 links. You’re not Google, so you don’t have the whole internet, but what you find is that as you touch any part of the internet, the number of links kind of explodes, and you end up with so many links that you haven’t retrieved. If you’re Google, after a year you’ve seen it all once, and then your data gets more dense. That’s why in this program we stay with one website: eventually some of those links get filled in and pages have more than one set of pointers. The other thing in here is that we keep track of which pages point to which pages, the little arrows. Each page then gets a number inside this database, like a primary key.
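One way to picture the storage described above is a pair of tables: one row per page, with the text left empty until the page is retrieved, and one row per arrow between pages. The sketch below uses Python’s built-in `sqlite3` module with an in-memory database; the table names, column names, and helper functions are assumptions made for illustration and may not match the schema in pagerank.zip.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway database for the example
cur = conn.cursor()
cur.executescript("""
CREATE TABLE Pages (
    id   INTEGER PRIMARY KEY,        -- the number each page gets
    url  TEXT UNIQUE,
    html TEXT                        -- NULL until the page is retrieved
);
CREATE TABLE Links (
    from_id INTEGER,                 -- page that contains the link
    to_id   INTEGER,                 -- page that the link points to
    UNIQUE (from_id, to_id)
);
""")


def page_id(url):
    """Return the primary key for url, adding an unretrieved row if it is new."""
    cur.execute("INSERT OR IGNORE INTO Pages (url, html) VALUES (?, NULL)", (url,))
    cur.execute("SELECT id FROM Pages WHERE url = ?", (url,))
    return cur.fetchone()[0]


def store_page(url, html, links):
    """Save the text of a retrieved page and record each outbound link."""
    from_id = page_id(url)
    cur.execute("UPDATE Pages SET html = ? WHERE id = ?", (html, from_id))
    for link in links:
        cur.execute(
            "INSERT OR IGNORE INTO Links (from_id, to_id) VALUES (?, ?)",
            (from_id, page_id(link)),
        )
    conn.commit()


def next_unretrieved():
    """Pick one unretrieved URL at random, or None when none are left."""
    cur.execute("SELECT url FROM Pages WHERE html IS NULL ORDER BY RANDOM() LIMIT 1")
    row = cur.fetchone()
    return row[0] if row else None


# One retrieved page that links to two pages we have not fetched yet.
store_page(
    "http://example.com/",
    "<p>home</p>",
    ["http://example.com/a", "http://example.com/b"],
)
print(next_unretrieved())   # one of the two unretrieved URLs, chosen at random
```

Because every link adds a row whether or not its target has been fetched, the count of unretrieved rows grows much faster than the count of retrieved ones, which is the explosion of links mentioned above.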
We’re going to use these inbound and outbound links to compute the page rank. The idea is that the more inbound links you have from sites that themselves have a good number of inbound links, the better we like that site. The page rank algorithm reads through this data and then writes the data, and it takes a number of passes through all of the data to get the page rank values to converge. These are numbers that converge toward the goodness of each page. You can run this as many times as you want. The ranking runs really quickly. The spidering runs really slowly, because it’s got to talk to the network and pull these things back, and that’s why we can restart it. The page rank is all just talking to data inside that database, and it’s super fast. If you want to reset the page ranks to the initial value, you can do that, and it just sets them all to the initial value. I think they start with a goodness of 1, and then some of these ended with goodnesses of 5 and 0.01. The more you run this, the more the data converges. The first few times the values jump around a bunch, and later they jump around less and less. At any point as you run this ranking application, you can pull the data out and dump it to look at the page rank values; this particular page, for example, has a page rank value of 1. In this dump, someone has probably just run spreset, ’cause the pages all have the same page rank. After you’ve run the ranking, you’ll see when you run spdump that these numbers start to change. This stuff is all in the README file, which is sitting in the zip file.
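To make the idea of repeated passes concrete, here is a deliberately simplified ranking loop. It is a sketch under stated assumptions, not the algorithm shipped in pagerank.zip: every page starts at 1, each pass hands a page’s whole rank to the pages it links to in equal shares, there is no damping factor, and every page in the made-up three-page graph has at least one outbound link.

```python
def rank_once(ranks, links):
    """One pass: each page gives an equal share of its rank to every page it links to."""
    new_ranks = {page: 0.0 for page in ranks}
    for page, targets in links.items():
        share = ranks[page] / len(targets)
        for target in targets:
            new_ranks[target] += share
    return new_ranks


# Hypothetical graph: A links to B and C, B links to C, C links back to A.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}

ranks = {page: 1.0 for page in links}    # every page starts with a goodness of 1
for i in range(50):
    new_ranks = rank_once(ranks, links)
    # Total movement in this pass; it shrinks as the values settle down.
    change = sum(abs(new_ranks[p] - ranks[p]) for p in ranks)
    ranks = new_ranks
    if i < 3 or i == 49:
        print(i + 1, round(change, 6), {p: round(r, 3) for p, r in ranks.items()})
```

In this small graph the values settle near 1.2 for A and C and 0.6 for B. C ends up ahead of B because it has two inbound links, and A inherits all of C’s rank because it is the only page C points to.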
You unzip pagerank.zip, and spdump just reads the data and prints it out. spjson also reads through all the data that’s in here, takes the 20 or so links with the best page rank, and dumps them into a JavaScript file. And then there is some HTML and d3.js, which is a visualization that produces this pretty picture. The bigger dots are the ones with the better page rank, and you can grab this and move all this stuff around. It’s nice and fun and exciting. And so we visualize, right? So again, we have a multi-step process: a slow, restartable process, then a fast data analysis and cleanup process, and then a final output process that pulls stuff out of there. It’s another one of these multi-step data mining processes. The last thing that we’re gonna talk about is visualizing mail data. We’re going to go from mbox-short to mbox to mbox super gigantamatic. That’s what we’re going to do next.
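The output step described above, picking the best-ranked pages and writing them where a browser visualization can load them, might look roughly like this. The function name `to_js`, the variable name `spiderJson`, and the `nodes`/`links` layout are assumptions chosen for the example; the real file format in pagerank.zip may differ.

```python
import json


def to_js(ranks, links, top_n=20, var_name="spiderJson"):
    """Return JavaScript source that assigns the top-ranked pages and their links.

    ranks maps url -> page rank; links is a list of (from_url, to_url) pairs.
    """
    # Keep only the top_n pages with the best page rank.
    top = sorted(ranks, key=ranks.get, reverse=True)[:top_n]
    index = {url: i for i, url in enumerate(top)}

    nodes = [{"url": url, "rank": ranks[url]} for url in top]
    # Keep a link only when both of its ends survived the cut.
    edges = [
        {"source": index[src], "target": index[dst]}
        for src, dst in links
        if src in index and dst in index
    ]
    payload = json.dumps({"nodes": nodes, "links": edges}, indent=2)
    return var_name + " = " + payload + ";"


# Made-up ranks and links for three pages.
example_ranks = {"http://example.com/": 5.0,
                 "http://example.com/a": 1.0,
                 "http://example.com/b": 0.01}
example_links = [("http://example.com/", "http://example.com/a"),
                 ("http://example.com/a", "http://example.com/"),
                 ("http://example.com/b", "http://example.com/")]

js_source = to_js(example_ranks, example_links, top_n=2)
print(js_source)
```

Writing `js_source` to a `.js` file and loading it from an HTML page with a script tag would give the visualization code a ready-made object to draw, with the rank available for sizing each dot.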