Search Engines, Chapter 3: Crawls & Feeds (Felix Naumann)

Transcript

Slide 1: Search Engines, Chapter 3: Crawls & Feeds (Felix Naumann) [title slide]

Slide 2: What to crawl
- Every document answers at least one question: "Now where was that document again?"
- Poor-quality documents swamp the index and slow down query answering.
- But they have at least some value, and they have not hindered the success of commercial search engines.
- Thus: crawl everything.
- More important: keep documents up to date, and keep an archive of older material.

Slide 4: Web Crawler
- Synonym: a spider; it "spiders" the Web.
- A crawler finds and downloads web pages automatically and thereby provides the collection for searching.
- Web crawling is easy in one sense: pages are meant to be found and retrieved; other information is hard to come by.
- But the Web is huge and constantly growing: Google reported knowing 1 trillion pages (2008). The size is not really countable; what constitutes a page? Vast storage is needed.
- The Web is not under the control of the search engine providers: the set of pages at a particular web server is unknown, an owner might not want a page to be copied, parallel crawling is impolite, and pages may be hidden behind forms.
- Web pages are constantly changing.
- Crawlers are also used for other types of data.

Slide 5: Retrieving Web Pages
- Every page has a unique uniform resource locator (URL).
- Web pages are stored on web servers, which use HTTP to exchange information with client software.

Slide 6: Retrieving Web Pages
- A web crawler is similar to a web browser: both are web clients that fetch web pages in the same way.
- The crawler client program connects to a domain name system (DNS) server, which translates the hostname into an internet protocol (IP) address.
- The crawler then attempts to connect to the server host using a specific port, usually port 80.
- After connecting, the crawler sends an HTTP request to the web server to request a page, usually a GET request:
  GET /csinfo/people.html HTTP/1.0
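The following sketch walks through exactly these steps with Python's standard library (hostname and path are hypothetical; real crawlers use HTTP libraries and persistent connections):

```python
import socket

host, path = "www.example.com", "/csinfo/people.html"  # hypothetical

# DNS lookup: translate the hostname into an IP address.
ip = socket.gethostbyname(host)

# Connect to the web server, usually on port 80, and send a GET request.
with socket.create_connection((ip, 80), timeout=10) as conn:
    conn.sendall(f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii"))
    response = b""
    while chunk := conn.recv(4096):  # read until the server closes
        response += chunk

print(response.split(b"\r\n")[0].decode())  # status line, e.g. "HTTP/1.0 200 OK"
```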

Slide 7: Crawling the Web
- Job 1: find URLs.
- Job 2: download pages.

Slide 8: Web Crawler
- Start with a set of seeds: URLs given as parameters.
- The seeds are added to a URL request queue.
- The crawler starts fetching pages from the request queue.
- Downloaded pages are parsed to find link tags, which might contain other useful URLs to fetch.
- New URLs are added to the crawler's request queue, or frontier: a standard queue or a priority queue.
- Continue until there are no more new URLs or the disk is full.
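A minimal single-threaded version of this loop (a sketch: regex link extraction and the page budget are simplifications; a real crawler uses an HTML parser, politeness delays, and a priority queue):

```python
import re
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen

def crawl(seeds, max_pages=100):
    frontier = deque(seeds)        # URL request queue ("frontier")
    seen = set(seeds)
    while frontier and max_pages > 0:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue               # skip pages that fail to download
        max_pages -= 1
        for href in re.findall(r'href="([^"]+)"', html):  # parse link tags
            link = urljoin(url, href)
            if link not in seen:   # enqueue newly discovered URLs
                seen.add(link)
                frontier.append(link)

crawl(["https://www.example.com/"])  # hypothetical seed
```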

Slide 9: Web Crawling
- Web crawlers spend a lot of time waiting for responses to requests: the DNS server response, the connection to the web server, the page download.
- To reduce this inefficiency, web crawlers use threads and fetch hundreds of pages at once.
- Crawlers could then flood sites with requests for pages. To avoid this, web crawlers use politeness policies: no parallel requests to the same server, and a delay between requests to the same web server. Logically: one queue per web server.
- This requires a large frontier: a crawler fetching 100 pages/sec, at most 1 page per server per 30 sec, issues 3,000 requests per 30-second window, so the queue must span more than 3,000 distinct servers. Normally it is much larger, since many URLs in the queue come from the same server.
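One way to realize "one queue per web server" with a per-host delay (a sketch: the 30-second delay is the slide's example value, and production crawlers replace the linear scan with a priority queue keyed by the next allowed fetch time):

```python
import time
from collections import defaultdict, deque
from urllib.parse import urlparse

POLITENESS_DELAY = 30.0            # seconds between requests to one host

host_queues = defaultdict(deque)   # logically: one queue per web server
next_allowed = defaultdict(float)  # earliest permitted fetch time per host

def enqueue(url):
    host_queues[urlparse(url).netloc].append(url)

def next_url():
    """Return a URL whose host may be contacted again, or None."""
    now = time.monotonic()
    for host, queue in host_queues.items():
        if queue and next_allowed[host] <= now:
            next_allowed[host] = now + POLITENESS_DELAY
            return queue.popleft()
    return None                    # all queued hosts are still in their delay
```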

Slide 10: Controlling Crawling
- Even crawling a site slowly will anger some web server administrators, who object to any copying of their data.
- robots.txt lets administrators control crawlers: rules for all crawlers, rules for a special crawler, and hints that reveal parts of the hidden web.
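A sketch of all three uses, checked with Python's built-in robots.txt parser (the paths and the crawler names are hypothetical):

```python
from urllib import robotparser

ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: FavoredCrawler
Disallow:

Sitemap: https://www.example.com/sitemap.xml
"""  # rules for all crawlers, an exception for one crawler, a hidden-web hint

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())
print(rp.can_fetch("GenericBot", "https://www.example.com/private/a.html"))      # False
print(rp.can_fetch("FavoredCrawler", "https://www.example.com/private/a.html"))  # True
```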

Slide 11: HPI's robots.txt [screenshot not reproduced in the transcript]

Slide 12: Simple Crawler Thread [pseudocode figure; annotation: "Politeness timer starts here"]

Slide 13: Freshness
- Web pages are constantly being added, deleted, and modified.
- To maintain the freshness of the document collection, a web crawler must continually revisit pages it has already crawled and check whether they have changed.
- Stale copies no longer reflect the real contents of the web pages.

Slide 14: Freshness
- The HTTP protocol has a special request type called HEAD that makes it easy to check for page changes.
- HEAD returns information about the page (e.g., its last-modified date), not the page itself.
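For example (hypothetical URL), a HEAD request returns only the headers, so a crawler can compare the Last-Modified date against its stored copy:

```python
from urllib.request import Request, urlopen

req = Request("https://www.example.com/index.html", method="HEAD")
with urlopen(req, timeout=10) as resp:
    print(resp.headers.get("Last-Modified"))   # when the page last changed
    print(resp.headers.get("Content-Length"))  # size, without fetching the body
```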

Slide 15: Freshness
- It is not possible to constantly check all pages: the crawler must concentrate on important pages and pages that change frequently.
- Different types of pages update at different rates (homepage vs. news site), and there are differences even within a type (active vs. inactive blog).
- Freshness is the proportion of pages that are fresh, where fresh means the crawler holds the most recent copy.
- Optimizing for this metric can lead to bad decisions, such as not crawling popular, rapidly changing sites: one can never achieve freshness there, so under this metric crawling them is a waste of resources.
- Age is a better metric.

Slide 16: Freshness vs. Age [figure: freshness and age of a page plotted over time t]

Slide 17: Age
- Expected age of a page, t days after it was last crawled:

  $Age(\lambda, t) = \int_0^t P(\text{page changed at time } x)\,(t - x)\,dx = \int_0^t \lambda e^{-\lambda x} (t - x)\,dx$

  where $\lambda$ is the change frequency per day.
- If the page is crawled at time t and was changed at time x, then (t - x) is an age.
- Web page updates follow a Poisson distribution: on average, the time until the next update is governed by an exponential distribution.

Slide 18: Age
- The older a page gets, the more it costs not to crawl it.
- E.g., expected age with mean change frequency $\lambda = 1/7$ (one change per week): crawling each page once a week gives an expected age of about 2.6 days just before the crawl.
- [figure: expected age as a function of t]
- The second derivative of the age function is positive, so the cost of not crawling a page always increases.
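A quick check of the 2.6-day figure; integrating the age formula by parts gives the closed form used below:

```python
import math

lam = 1 / 7  # change frequency: one change per week
t = 7        # each page is crawled once a week

# Age(lam, t) = integral from 0 to t of lam * e^(-lam*x) * (t - x) dx
#             = t - (1/lam) * (1 - e^(-lam*t))   (integration by parts)
age = t - (1 / lam) * (1 - math.exp(-lam * t))
print(round(age, 1))  # 2.6 days, matching the slide
```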

Slide 19: Focused Crawling
- Also called topical crawling: attempts to download only those pages that are about a particular topic.
- Used by vertical search applications.
- Promise: higher accuracy. The user preselects relevance by choosing the vertical search engine, and the index is not cluttered with irrelevant documents.
- Relies on the fact that pages about a topic tend to link to other pages on the same topic; anchor texts help.
- Popular pages for a topic are typically used as seeds.
- The crawler uses a text classifier to decide whether a page is on topic (see Chapter 9).

Slide 20: Deep Web
- Sites that are difficult for a crawler to find are collectively referred to as the deep (or hidden) Web, which is much larger than the conventional Web.
- Three broad categories:
  - Private sites: no incoming links, or a login with a valid account may be required (exception: news sites).
  - Form results: sites that can be reached only after entering some data into a form (exception: online stores).
  - Scripted pages: pages that use JavaScript, Flash, or another client-side language to generate links; crawling them is possible but difficult.

Slide 21: Sitemaps
- Sitemaps contain lists of URLs and data about those URLs: modification time, modification frequency.
- They are generated by web server administrators.
- A sitemap tells the crawler about pages it might not otherwise find and gives it a hint about when to check a page for changes.
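A minimal sitemap in the sitemaps.org XML format (hypothetical URL and values), showing both kinds of hints:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/catalog/item42.html</loc>
    <lastmod>2011-02-14</lastmod>      <!-- modification time -->
    <changefreq>monthly</changefreq>   <!-- modification frequency -->
  </url>
</urlset>
```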

Slide 23: Distributed Crawling
- Three reasons to use multiple computers for crawling:
  1. It puts the crawler closer to the sites it crawls: low latency, high throughput.
  2. It reduces the number of sites each crawler has to remember (in the queue and in the index).
  3. It reduces the computing resources required per machine (parsing, network bandwidth).
- A distributed crawler uses a hash function to assign URLs to crawling computers; URLs are sent in batches.
- The hash function should be computed on the host part of each URL. This causes some imbalance, but most links are site-internal anyway, and the crawler must abide by per-host politeness rules in any case.
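Hashing on the host part can be sketched as follows (cluster size and URLs are hypothetical); every page of a site then lands on the same machine:

```python
import hashlib
from urllib.parse import urlparse

NUM_CRAWLERS = 8  # hypothetical cluster size

def assign_crawler(url: str) -> int:
    """Hash only the host part, so one site maps to one crawling machine."""
    host = urlparse(url).netloc
    digest = hashlib.md5(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_CRAWLERS

print(assign_crawler("http://example.org/a.html") ==
      assign_crawler("http://example.org/b/c.html"))  # True: same host
```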

Slide 24: Desktop Crawls
- Used for desktop search and enterprise search.
- Differences to web crawling:
  - It is much easier to find the data.
  - Responding quickly to updates is more important; expectations differ from the Web, and file systems can send change notifications.
  - Many different document formats, so document conversion is needed.
  - Data privacy is very important.
  - The crawler must be conservative in its disk and CPU usage.
  - Copies of documents need not be stored.
- GDS? (Google Desktop Search, next slide)

Slide 25: Google Desktop Search [screenshot]

Slide 26: Document Feeds
- Many documents are published: created at a fixed time and rarely updated again, e.g., news articles, blog posts, press releases; in general, time-sensitive content.
- Published documents from a single source can be ordered in a sequence called a document feed.
- New documents are found by examining the end of the feed: a single place, no crawling necessary, using HTTP GET requests to the web servers that host the feed.
- Two types:
  - A push feed alerts the subscriber to new documents (like a phone call). Expensive; used by news agencies ("ticker").
  - A pull feed requires the subscriber to check periodically for new documents.
- The most common format for pull feeds is RSS: Really Simple Syndication, RDF Site Summary, or Rich Site Summary (the acronym has several expansions).
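Pulling an RSS feed amounts to a periodic GET plus light XML parsing; a sketch with a hypothetical feed URL:

```python
import urllib.request
import xml.etree.ElementTree as ET

# Poll the feed (hypothetical URL); new items appear at one known place.
with urllib.request.urlopen("https://www.example.com/news/feed.rss") as resp:
    root = ET.fromstring(resp.read())

for item in root.iter("item"):          # RSS 2.0 <item> elements
    print(item.findtext("pubDate"), item.findtext("title"))
```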

Slide 59: Conversion
- Text is stored in hundreds of incompatible file formats: raw text, RTF, HTML, XML, Microsoft Word, ODF, PDF.
- Other types of files are also important: PowerPoint, Excel, and old, obsolete file formats.
- Typically a conversion tool is used that converts the document content into a tagged text format such as HTML or XML. The result is readable in a browser and retains some of the important formatting information, but is often barely readable for humans.

Slide 60: These slides as seen by GDS [screenshot]

Slide 61: Character Encoding
- A character encoding is a mapping between bits and glyphs, i.e., how to get from bits in a file to characters on a screen. It can be a major source of incompatibility.
- ASCII is the basic character encoding scheme for English: it encodes 128 letters, numbers, special characters, and control characters in 7 bits, extended with an extra bit for storage in bytes.
- Other languages can have many more glyphs; e.g., Chinese has more than 40,000 characters, with over 3,000 in common use.
- Many languages have multiple encoding schemes, e.g., the CJK (Chinese-Japanese-Korean) family of East Asian languages, Hindi, Arabic.
- The encoding ("code page") must be specified, and one cannot have multiple languages in one file.
- Unicode was developed to address these encoding problems.

Slide 62: ASCII (American Standard Code for Information Interchange) [table]

Slide 63: Unicode
- A single mapping from numbers to glyphs that attempts to include all glyphs in common use in all known languages.
- Unicode is a mapping between numbers and glyphs; it does not uniquely specify the bits-to-glyph mapping! That is the job of encodings such as UTF-8, UTF-16, and UTF-32.
- The proliferation of encodings comes from a need for compatibility and to save space.
- UTF-8 uses one byte for English (ASCII) and up to 4 bytes for some traditional Chinese characters: a variable-length encoding, which makes string operations more difficult.
- UTF-32 uses 4 bytes for every character.
- Many applications use UTF-32 for internal text encoding (fast random lookup) and UTF-8 for disk storage (less space).
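The trade-off is easy to see in Python (the byte counts are properties of the encodings; the sample characters are arbitrary):

```python
text = "aä中"  # ASCII letter, Latin letter with umlaut, CJK character

# UTF-8 is variable length: these characters take 1, 2, and 3 bytes.
print([len(ch.encode("utf-8")) for ch in text])  # [1, 2, 3]

# UTF-32 spends 4 bytes per character (plus a 4-byte byte-order mark).
print(len(text.encode("utf-32")))                # 16 = 4 + 3 * 4
```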

Slide 67: Storing the Documents
- Requirements for a document storage system:
  - Random access, but only simple lookups: request the content of a document based on its URL. A hash function on the URL is typical: the hash number identifies the file and/or server, and a secondary index then locates the page within the file.
  - Compression and large files: reducing storage requirements and enabling efficient access.
  - Update handling: large volumes of new and modified documents; adding new anchor text.

Slide 68: Large Files
- Store many documents in large files rather than each document in its own file: this avoids the overhead of opening and closing files and reduces seek time relative to read time (seek time ~10 ms, read time ~70 kB/ms).
- Compound document formats are used to store multiple documents in one file, e.g., the TREC Web format.

Slide 69: TREC Web Format [example figure]

Slide 70: Compression
- Text is highly redundant (or, equivalently, predictable). Entropy: native English speakers guess the next letter with 69% accuracy (Shannon 1951).
- Compression techniques exploit this redundancy to make files smaller without losing any of the content: less disk space, faster read time. (Compression of indexes is covered later.)
- Popular algorithms can compress HTML and XML text by 80%, e.g., DEFLATE (zip, gzip) and LZW (UNIX compress, PDF).
- Compressed files allow no random access, so large files may be compressed in blocks to make access faster.
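Block compression in miniature (a sketch: zlib stands in for DEFLATE, and the newline-joined block layout is an assumption for the illustration); fetching a document decompresses one block, not the whole file:

```python
import zlib

def compress_blocks(docs, block_size=16):
    """Compress documents in fixed-size blocks to regain random access."""
    blocks = []
    for i in range(0, len(docs), block_size):
        raw = "\n".join(docs[i:i + block_size]).encode("utf-8")
        blocks.append(zlib.compress(raw))
    return blocks

def fetch(blocks, doc_id, block_size=16):
    # Decompress only the block containing the requested document.
    raw = zlib.decompress(blocks[doc_id // block_size]).decode("utf-8")
    return raw.split("\n")[doc_id % block_size]

docs = [f"<html>document {i}</html>" for i in range(100)]
blocks = compress_blocks(docs)
print(fetch(blocks, 42))  # <html>document 42</html>
```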

Slide 72: BigTable
- No query language, hence no complex queries to optimize; only row-level transactions, hence no complex locking mechanisms.
- Tablets are stored in a replicated file system that is accessible by all BigTable servers; any changes to a tablet are recorded in a transaction log, which is also stored in a shared file system.
- Data is kept in immutable files: the data in a file is never changed, so a file cannot be corrupt, only incomplete.
- If a tablet server crashes, another server can immediately read the tablet data and transaction log from the file system and take over.
- Table updates are collected in RAM and periodically merged with the files on disk.

Slide 73: BigTable
- Logically organized into rows; a row stores the data for a single web page as many attribute-value pairs.
- The combination of a row key, a column key, and a timestamp points to a single cell in the row.
- Timestamp: an extra dimension for versioning.

Slide 74: BigTable [figure: example table with rows, column families, and timestamps]

Slide 75: BigTable
- A BigTable can have a huge number of columns per row. All rows have the same column families, but not all rows have the same columns.
- Families group columns of the same type; there are not too many families, and all of them are known in advance. This is important for reducing the disk reads needed to access document data.
- Rows are partitioned into tablets based on their row keys (i.e., URLs), which simplifies determining which server is responsible.
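The logical model of rows, columns, and timestamps can be mimicked in a few lines (a toy illustration of the lookup semantics only, not of Google's implementation; the reversed-domain row key follows BigTable's convention):

```python
# (row key, column key, timestamp) -> cell value
table = {}

def put(row, column, timestamp, value):
    table[(row, column, timestamp)] = value

def get(row, column):
    """Return the most recent version of a cell, using the timestamp
    dimension for versioning."""
    versions = {ts: v for (r, c, ts), v in table.items()
                if (r, c) == (row, column)}
    return versions[max(versions)] if versions else None

put("com.example.www/index.html", "contents", 1, "<html>v1</html>")
put("com.example.www/index.html", "contents", 2, "<html>v2</html>")
print(get("com.example.www/index.html", "contents"))  # <html>v2</html>
```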

Slide 78: Detecting Duplicates
- Duplicate and near-duplicate documents occur in many situations: multiple links point to the same page; copies, versions, plagiarism, spam, mirror sites.
- 30% of the web pages in a large crawl are exact or near-duplicates of pages in the other 70% (Fetterly et al., http://portal.acm.org/citation.cfm?doid= ).
- Duplicates consume significant resources during crawling, indexing, and search, yet have little value for most users.

Slide 79: Duplicate Detection
- Exact duplicate detection is relatively easy with checksum techniques.
- A checksum is a value computed from the content of the document, e.g., the sum of the bytes in the document file.
- Files with different text can have the same checksum: otherwise "acfhiiloprs T" would be a duplicate of "Tropical fish", since it contains the same bytes in a different order.
- Functions such as the cyclic redundancy check (CRC) have therefore been developed that also consider the positions of the bytes.
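The anagram example, checked with a naive byte-sum checksum versus a position-sensitive CRC (zlib.crc32 stands in here for the CRC family):

```python
import zlib

def byte_sum(text: str) -> int:
    """Naive checksum: sum of the bytes, blind to their positions."""
    return sum(text.encode("utf-8"))

a, b = "Tropical fish", "acfhiiloprs T"  # same bytes, different order
print(byte_sum(a) == byte_sum(b))        # True: falsely flagged as duplicates
print(zlib.crc32(a.encode()) == zlib.crc32(b.encode()))  # False: CRC sees positions
```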

Slide 80: Near-Duplicate Detection
- A more challenging task: Are web pages with the same text content but different advertising or formatting near-duplicates? Do small revisions or updates constitute a new page?
- A near-duplicate document is defined using a threshold value for some similarity measure between pairs of documents; e.g., document D1 is a near-duplicate of document D2 if more than 90% of the words in the documents are the same.

Slide 82: Fingerprints (shingling)
1. The document is parsed into words. Non-word content, such as punctuation, HTML tags, and additional whitespace, is removed.
2. The words are grouped into contiguous n-grams, for some n: usually overlapping sequences of words, although some techniques use non-overlapping sequences.
3. Some of the n-grams are selected to represent the document. This selection policy is the distinguishing feature of the different algorithms.
4. The selected n-grams are hashed to improve retrieval efficiency and further reduce the size of the representation.
5. The hash values are stored, typically in an inverted index.
6. Documents are compared using the overlap of their fingerprints: the absolute or the relative number of shared n-grams (Jaccard coefficient).
(A code sketch of this pipeline follows the example on the next slide.)

Slide 83: Fingerprint Example
- Example text: "Tropical fish include fish found in tropical environments around the world, including both freshwater and salt water species."
- Overlapping 3-grams: tropical fish include, fish include fish, include fish found, fish found in, found in tropical, in tropical environments, tropical environments around, environments around the, around the world, the world including, world including both, including both freshwater, both freshwater and, freshwater and salt, and salt water, salt water species
- Hash values (hypothetical) [table not reproduced in the transcript]
- n-gram selection: 0 mod 4, i.e., keep the n-grams whose hash value is 0 modulo 4
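Steps 1-4 and 6 in compact form (a sketch under the slide's assumptions: overlapping 3-grams, 0-mod-4 selection applied to the hash values, relative overlap for the comparison; MD5 is an arbitrary hash choice):

```python
import hashlib
import re

def fingerprint(text, n=3, mod=4):
    words = re.findall(r"[a-z0-9]+", text.lower())        # 1. parse into words
    grams = (" ".join(words[i:i + n])                     # 2. overlapping n-grams
             for i in range(len(words) - n + 1))
    hashes = {int.from_bytes(hashlib.md5(g.encode()).digest()[:4], "big")
              for g in grams}                             # 4. hash the n-grams
    return {h for h in hashes if h % mod == 0}            # 3. 0 mod 4 selection

def jaccard(f1, f2):
    return len(f1 & f2) / len(f1 | f2) if f1 | f2 else 1.0  # 6. relative overlap

d1 = ("Tropical fish include fish found in tropical environments around "
      "the world, including both freshwater and salt water species.")
d2 = d1.replace("salt water", "saltwater")
print(jaccard(fingerprint(d1), fingerprint(d2)))  # close to 1 for near-duplicates
```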

Slide 84: Simhash
- Similarity comparisons using word-based representations are more effective at finding near-duplicates; the problem is efficiency.
- Simhash combines the advantages of word-based similarity measures with the efficiency of fingerprints based on hashing: similar documents have similar hash values!
- Property of simhash: the similarity of two pages, as measured by the cosine correlation measure, is proportional to the number of bits that are the same in their simhash fingerprints.

Slide 85: Simhash
1. Process the document into a set of features with associated weights. Assume the simple case: features are words weighted by their frequency.
2. Generate a hash value with b bits (the desired size of the fingerprint) for each word. The hash value should be unique for each word; b is typically several hundred.
3. In a b-dimensional vector V, update the components: for each feature, add its weight to every component for which the corresponding bit in the word's hash value is 1, and subtract its weight where that bit is 0.
4. After all words have been processed, generate a b-bit fingerprint by setting the i-th bit to 1 if the i-th component of V is positive, and to 0 otherwise.

Slide 86: Simhash Example
- Original text: "Tropical fish include fish found in tropical environments around the world, including both freshwater and salt water species."
1. Words with weights (here: term frequency): tropical 2, fish 2, include 1, found 1, environments 1, around 1, world 1, including 1, both 1, freshwater 1, salt 1, water 1, species 1
2. 8-bit hash values per word (in practice hundreds of bits) [values not reproduced in the transcript]
3. Vector V formed by summing the weights: add the weight for 1-bits, subtract it for 0-bits [values not reproduced]
4. 8-bit fingerprint formed from V: 1 for positive components, 0 otherwise
- Two documents are duplicates if a certain (high) percentage of their fingerprint bits agree.
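The four steps as a sketch (64-bit fingerprints and MD5-based word hashes are arbitrary choices for this illustration; the slides suggest several hundred bits in practice):

```python
import hashlib
import re
from collections import Counter

def simhash(text: str, b: int = 64) -> int:
    words = Counter(re.findall(r"[a-z]+", text.lower()))  # 1. words + tf weights
    v = [0] * b                                           # 3. b-dimensional V
    for word, weight in words.items():
        h = int.from_bytes(hashlib.md5(word.encode()).digest()[:b // 8], "big")  # 2.
        for i in range(b):
            v[i] += weight if (h >> i) & 1 else -weight
    return sum(1 << i for i in range(b) if v[i] > 0)      # 4. sign bits of V

def same_bits(f1: int, f2: int, b: int = 64) -> int:
    """Number of agreeing fingerprint bits: the duplicate criterion."""
    return b - bin(f1 ^ f2).count("1")

d1 = "Tropical fish include fish found in tropical environments around the world."
d2 = "Tropical fish are found in tropical environments around the world."
print(same_bits(simhash(d1), simhash(d2)))  # high for similar documents
```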

Slide 88: Removing Noise
- Many web pages contain text, links, and pictures that are not directly related to the main content of the page.
- This additional material is mostly noise that could negatively affect the ranking of the page through the presence of large numbers of irrelevant words.
- Techniques have been developed to detect the content blocks in a web page; non-content material is either ignored or reduced in importance during indexing.

Slide 89: Noise example [screenshot]

Slide 90: Noise example [screenshot]

Slide 91: Finding Content Blocks
- Intuition: there are fewer HTML tags in the main content.
- Plot the cumulative distribution of tags across the example web page: the document slope curve.
- The main text content of the page corresponds to the plateau in the middle of the distribution.
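One classic way to find that plateau is to pick the token span that maximizes tag tokens outside it plus text tokens inside it; a sketch of that optimization (the regex tokenizer and the O(n^2) scan are deliberate simplifications):

```python
import re

def content_span(html: str) -> str:
    """Tokenize into tags (b=1) and text (b=0), then choose the span (i, j)
    maximizing tags outside the span plus text tokens inside it; the
    optimum lies on the plateau of the cumulative tag curve."""
    tokens = re.findall(r"<[^>]+>|[^<\s]+", html)
    b = [1 if t.startswith("<") else 0 for t in tokens]
    n = len(tokens)
    tags = [0] * (n + 1)                   # prefix sums of the tag counts
    for k in range(n):
        tags[k + 1] = tags[k] + b[k]
    best, best_ij = -1, (0, n - 1)
    for i in range(n):
        for j in range(i, n):
            inside_text = (j - i + 1) - (tags[j + 1] - tags[i])
            outside_tags = tags[i] + (tags[n] - tags[j + 1])
            if inside_text + outside_tags > best:
                best, best_ij = inside_text + outside_tags, (i, j)
    i, j = best_ij
    return " ".join(t for t in tokens[i:j + 1] if not t.startswith("<"))

html = ("<html><body><a>navigation</a>"
        "<p>Main content of the page.</p>"
        "<a>footer links</a></body></html>")
print(content_span(html))  # Main content of the page.
```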
