As you can see from the attached screenshot, many YaCy search results are duplicates or near-duplicates, and they appear next to each other throughout the results list. They are so similar as to be practically indistinguishable to the user, and having them occupy ranking slots serves no purpose.

I think it would be best for YaCy to recognize these duplicates and tidy them up.

I think this isn't a YaCy topic, because YaCy has no control over page titles.

If you only check the page titles shown in the result list, then you are right to call them duplicates. But YaCy (like any other search engine/crawler) uses the actual URL as the unique identifier for a search result.

As you can see in your result list, the page title is the same, but the URL itself differs from result to result.

Interesting. I have a similar problem with Twitter showing up in many languages, with no way to limit it. Maybe use a delimiter character in the URL string when you do a crawl, e.g. cut the URL short, or process it on the peer later on.

Twitter has the annoying habit of sending a page in the language that is specified in the "Accept-Language" request-header. Which means that if you crawl the same Twitter-URL with different languages listed in the accept-language request-header, you will get different results.

This may be nice if a page is requested with an actual browser, but can be really confusing if you do this with a crawler.
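One way to keep a crawler's fetches deterministic is to pin the Accept-Language request header to a single fixed value, so the server cannot vary the response by locale. A minimal sketch using the standard Java 11 HttpClient (the URL and the choice of "en" are just illustrative assumptions, not YaCy's actual crawler code):

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class FixedLanguageRequest {
    public static void main(String[] args) {
        // Pin Accept-Language so repeated crawls of the same URL
        // always ask for the same language variant.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://example.com/"))
                .header("Accept-Language", "en")
                .GET()
                .build();
        System.out.println(
                request.headers().firstValue("Accept-Language").orElse(""));
    }
}
```

Sending this request via `HttpClient.send(...)` would then return the same language variant on every crawl of that URL.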

There is a much better scraper for Twitter now, but I'm not going to share my peer to the YaCy network because my VM runs out of space every 2 days. It's http://loklak.org, using RSS feeds into YaCy or a reader.

Winter_fox wrote: I think Google solves this by not showing multiple pages from the same domain on the same results page.

That sounds like an interesting approach. 2nd-level ICANN-approved domains are somewhat expensive, which acts as a rate limiter on spamming the same content across domains. 3rd-level domains on the same 2nd-level domain, however, are very cheap for the owner of that 2nd-level domain. Does Google require the 2nd-level domain to be unique?
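That per-domain collapse could be sketched as follows, using a naive "last two host labels" guess at the registrable domain. This is an assumption for illustration only; a real implementation would consult the Public Suffix List, since e.g. `example.co.uk` needs three labels:

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DomainCollapse {
    // Naive registrable-domain guess: keep the last two host labels.
    static String registrableDomain(String url) {
        String host = URI.create(url).getHost();
        String[] labels = host.split("\\.");
        int n = labels.length;
        return n <= 2 ? host : labels[n - 2] + "." + labels[n - 1];
    }

    // Keep at most maxPerDomain results per registrable domain,
    // preserving the original ranking order.
    static List<String> collapse(List<String> rankedUrls, int maxPerDomain) {
        Map<String, Integer> seen = new HashMap<>();
        List<String> out = new ArrayList<>();
        for (String url : rankedUrls) {
            String dom = registrableDomain(url);
            int count = seen.getOrDefault(dom, 0);
            if (count < maxPerDomain) {
                seen.put(dom, count + 1);
                out.add(url);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> results = List.of(
                "https://a.example.com/1",
                "https://b.example.com/2",
                "https://c.example.com/3",
                "https://other.org/x");
        // The third example.com subdomain is dropped; other.org survives.
        System.out.println(collapse(results, 2));
    }
}
```

Because the collapse runs over the already-ranked list, the highest-ranked pages from each domain are the ones that survive.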

I suppose another approach would be to run a similarity algorithm over the content in the Solr fields for the pages. For example, you could construct a float vector of words/phrases and collapse groups that have a very high cosine similarity. This idea totally fails the KISS test compared to your approach, though.
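A minimal sketch of that cosine-similarity idea, using plain term-frequency vectors over titles (this is standalone illustrative Java, not YaCy or Solr code; the tokenizer and the 0.7 threshold are arbitrary assumptions):

```java
import java.util.HashMap;
import java.util.Map;

public class TitleSimilarity {
    // Term-frequency vector of a title (lower-cased, split on non-alphanumerics).
    static Map<String, Integer> termFreq(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String tok : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!tok.isEmpty()) tf.merge(tok, 1, Integer::sum);
        }
        return tf;
    }

    // Cosine similarity: 1.0 means identical word distributions,
    // 0.0 means no words in common.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
            na += e.getValue() * e.getValue();
        }
        for (int v : b.values()) nb += (double) v * v;
        if (na == 0 || nb == 0) return 0;
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        double s = cosine(termFreq("YaCy decentralized web search"),
                          termFreq("YaCy decentralized search engine"));
        // Three of four tokens shared -> similarity 0.75, above a
        // hypothetical collapse threshold of 0.7.
        System.out.println(s > 0.7);
    }
}
```

Results whose pairwise similarity exceeds the threshold would be collapsed into one entry, keeping the higher-ranked of the pair.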

What about hashing the snippet result data, such as title, body, or URL, with a perceptual hashing library like pHash (which is not written in Java) and checking for "visually" duplicate entries among the results?

For example, once all the titles of the returned results are hashed, their fingerprints can be compared to obtain a linear indication of how much the titles differ from each other. If the difference is below a threshold, they are very similar and hence duplicates. The hashing is so fast it can be performed on the fly when results are returned.
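pHash itself targets images, but for short text snippets such as titles the same "small bit-distance means near-duplicate" behavior can be sketched with a simhash-style fingerprint. This is an illustration of the principle only, not the pHash library; the 64-bit FNV-1a token hash is an arbitrary choice:

```java
public class SimHash {
    // 64-bit simhash: each token's hash casts a +1/-1 vote per bit position.
    static long simhash(String text) {
        int[] votes = new int[64];
        for (String tok : text.toLowerCase().split("[^a-z0-9]+")) {
            if (tok.isEmpty()) continue;
            long h = fnv1a(tok);
            for (int i = 0; i < 64; i++) {
                votes[i] += ((h >>> i) & 1L) == 1L ? 1 : -1;
            }
        }
        long sig = 0;
        for (int i = 0; i < 64; i++) if (votes[i] > 0) sig |= 1L << i;
        return sig;
    }

    // FNV-1a 64-bit hash of a token; any well-mixed 64-bit hash would do.
    static long fnv1a(String s) {
        long h = 0xcbf29ce484222325L;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 0x100000001b3L;
        }
        return h;
    }

    // Hamming distance between fingerprints: below a small threshold
    // (e.g. 3 bits) the titles would be treated as duplicates.
    static int distance(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    public static void main(String[] args) {
        long a = simhash("Twitter - it's what's happening");
        long b = simhash("Twitter: it's what's happening!");
        // Punctuation is stripped by the tokenizer, so the token streams
        // are identical and the distance is 0.
        System.out.println(distance(a, b));
    }
}
```

Comparing each result's fingerprint against its neighbors in the ranked list is cheap, since the distance check is a single XOR plus a popcount.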

The advantage of using a perceptual hash library to process an already ranked list of results before it is presented to the user is that such a hash can be indiscriminately extracted from the text snippets which accompany the results, as well as from thumbnail images, and can be used to numerically determine the visual difference between results presented by YaCy.

For text results, this could be effective at detecting and deleting results which look very similar, and for image results it would detect identical images which differ only by resolution or canvas ratio.

To demonstrate how simple the principle is, check this program I wrote years ago using the pHash library. It takes as arguments the filenames of two images to compare, and reports via its exit status whether the images are almost identical, differing only by resolution or cropping.

It's good you brought this subject up again. The country probably has a very small primary industry. It looks like Google has rejected their site (not too sure yet). They sell just about anything related to computers. Their site is massive, it has improved over time, and it has valid email addresses for contact, unlike some of the sites.

If I want to buy things online now, I run a web portal on the site of interest with the category I need as a start point; then the bargains are easier to locate. In (/IndexSchema_p.html) there are some settings that may help.