Can YaCy be the new Google?

When Google was a mere babe among search giants like Yahoo, AltaVista, Lycos and WebCrawler, geeks across the world promoted its use thanks to its cleaner interface and the fact that it focussed only on the task at hand. In many ways, Google owes a lot of its success to IT professionals who quickly introduced it to the less IT-savvy as a less cluttered interface to finding the things they were looking for.

Rowan Puttergill is a technology evangelist and software engineer with a long career working in enterprise environments. He brings with him the experience of being the Technical...


Times have changed. Google’s rise to world dominance in the search market, its advertising-oriented business model and its history of collecting user data are slowly pushing it from geeky favour. Of course, since Google has spent fourteen years building up a vast collection of search results and related data, geeks are hard-pressed to find an alternative. Enter YaCy.

YaCy is a fully decentralized P2P search engine. While it is possible to search the YaCy network without running the software*, to properly experience YaCy at work you need to install it locally. That might seem a little off-putting at first, but it’s no different from running other P2P software like Skype. Running locally, YaCy can be used in a variety of ways.

First, it can be installed within an intranet to provide a local search engine for all of your internal pages. In general, though, you will probably run YaCy as a node within its global network. Running as a node, YaCy is able to share results with other peers. That means that instead of storing billions of links and search metadata within a huge server farm, the YaCy software simply uses all of the computers on the network to store its search results. The more nodes within the network, the better the results for a query.
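To make the idea of spreading search results across peers concrete, here is a minimal Python sketch. It is not YaCy's actual implementation. It assumes a toy scheme in which every search term is hashed onto a ring and stored by whichever peer sits next on that ring. The names `PeerRing`, `ToyNetwork`, the peer labels and the example URLs are all hypothetical.

```python
import hashlib
from bisect import bisect_right


def ring_position(key: str) -> int:
    """Map a string to a position on a hash ring (assumption: first 8 bytes of SHA-256)."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


class PeerRing:
    """Decides which peer is responsible for a given search term."""

    def __init__(self, peer_names):
        if not peer_names:
            raise ValueError("at least one peer is required")
        self._ring = sorted((ring_position(name), name) for name in peer_names)
        self._positions = [pos for pos, _ in self._ring]

    def peer_for(self, term: str) -> str:
        # The owner is the first peer clockwise from the term's position,
        # wrapping around to the start of the ring if necessary.
        idx = bisect_right(self._positions, ring_position(term.lower())) % len(self._ring)
        return self._ring[idx][1]


class ToyNetwork:
    """Hypothetical in-memory stand-in for a network of search peers."""

    def __init__(self, peer_names):
        self.ring = PeerRing(peer_names)
        # Each peer holds only its own slice of the index: {term: set of URLs}.
        self.stores = {name: {} for name in peer_names}

    def index_page(self, url: str, text: str) -> None:
        # Send each distinct word of the page to the peer that owns it.
        for term in set(text.lower().split()):
            owner = self.ring.peer_for(term)
            self.stores[owner].setdefault(term, set()).add(url)

    def search(self, term: str) -> list:
        # Only the owning peer needs to be asked about a term.
        owner = self.ring.peer_for(term)
        return sorted(self.stores[owner].get(term.lower(), set()))


if __name__ == "__main__":
    net = ToyNetwork(["peer-a", "peer-b", "peer-c"])
    net.index_page("http://example.org/yacy", "decentralized search engine")
    net.index_page("http://example.org/p2p", "peer search")
    print(net.search("search"))
```

No single peer in this sketch holds the whole index, yet any term can be looked up by asking exactly one peer. A real network would also have to cope with peers going offline, which this toy version ignores.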

Since YaCy has only just been released, the number of nodes on its network is still fairly small. Currently within my copy of the software, I am seeing around 1 200 unique nodes, and the network has apparently indexed around 800-million pages. Obviously, while YaCy remains in the territory of IT-oriented users, all those pages that are getting indexed are more heavily weighted toward IT-related topics. As YaCy gains more mainstream adoption, the results will tend to cover wider areas of interest.

So, you’re probably wondering about this indexing. What is it going to do to your network, your computer, your disk space and all of that? Okay, technically disk space should not be a major concern. The whole idea behind the project is that nodes share links, which means that to have access to a huge number of search results, you don’t need to store them all locally. In fact, YaCy’s administration panel allows you to define many of these sorts of parameters to start with. So you can limit the maximum and minimum amount of RAM dedicated to its running processes, and you can define the maximum number and size of pages that you decide to crawl yourself. Disk space is another story though.

Currently YaCy doesn’t provide options to limit disk space usage, although there is a discussion about implementing this on one of the YaCy forums. On the other hand, YaCy’s FAQ suggests that for around 10-million web pages indexed you would need around 20GB of disk space. However, the fact that you can’t limit this does mean that many users will be unhappy about installing the search engine on their average workstation.
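As a rough back-of-the-envelope check on those figures, the short Python sketch below works out what the FAQ's estimate implies per page, and what the network-wide numbers quoted earlier would mean if storage scaled linearly. The assumptions are mine, not YaCy's: decimal gigabytes, an even spread across nodes and no redundant copies.

```python
GB = 10**9  # assumption: decimal gigabytes


def bytes_per_page(disk_bytes: float, pages: int) -> float:
    """Average disk footprint of one indexed page."""
    return disk_bytes / pages


def per_node_bytes(total_pages: int, nodes: int, page_bytes: float) -> float:
    """Disk needed per node if the index is spread evenly with no replication."""
    return total_pages * page_bytes / nodes


if __name__ == "__main__":
    # FAQ figure: around 20GB for around 10-million pages.
    page_cost = bytes_per_page(20 * GB, 10_000_000)
    print(f"about {page_cost:.0f} bytes per indexed page")

    # Figures observed in the article: ~800-million pages, ~1 200 nodes.
    total = 800_000_000 * page_cost
    share = per_node_bytes(800_000_000, 1_200, page_cost)
    print(f"about {total / GB:.0f} GB across the network")
    print(f"about {share / GB:.2f} GB per node under these assumptions")
```

Under these assumptions that is roughly 2 KB per page, about 1.6 TB across the network and a little over 1.3 GB per node. Actual usage on any given machine will depend on how much it crawls and stores itself.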

There are a number of approaches that techies are suggesting in order to limit resource usage from outside of YaCy itself. The most obvious is to run YaCy within a virtual machine. I like the approach, and have decided to do something similar for my own installation, but this is putting YaCy adoption right outside of the mainstream. That said, the Admin pages for your local YaCy install are well beyond the comprehension of your average web-surfer, so it’s unlikely that YaCy is going to gain any foothold with these sorts of users in the near future.

YaCy is also not the fastest search engine. Results don’t pop up near-instantaneously. That will obviously improve as the network grows, but it’s quite a drawback right now. The fact that you are also giving up precious bandwidth in order to share information with other peers will not appeal to most people either. While outgoing traffic is actually pretty small, and the majority of your incoming traffic will depend on how much you decide to crawl the web, both activities use bandwidth that you could do without. That makes YaCy slightly less ‘free’ than Google, since bandwidth is something you pay for.

So, by now you are probably thinking YaCy is something to steer well clear of. Despite all of the negatives, I actually think that YaCy has a lot of potential, and could ultimately be the way of search in the future. It offers much better privacy and security than any of the search engines that are currently in mainstream use. It has fantastic scalability. It offers the ability to integrate localized intranet search with more global searches. And it is open-source.

There are some obvious things that need to be sorted out. For one, I think that there needs to be some kind of reward for peers that store larger indexes and stay online for longer periods. Another big point would be to provide a highly simplified interface to the Administration side of the software. But most importantly it needs better controls for resource usage. Let’s hope that the devs make some improvements fast!

*Since YaCy’s press release, its demo search portal has been hard hit, and last I checked it was down. That’s understandable, since YaCy is not really designed to be used like this.

Extremely interesting article. I’ve always felt that open-source search is a vastly superior solution to proprietary search companies and search algorithms. For example, how do we know that Google doesn’t skew its top 10 results to favour its own web properties or those of its paying partners and the like? With open-source P2P search, this issue is eliminated and search becomes much more egalitarian and fair.

A company as rich and powerful as Google cannot be trusted with the world’s information as that is something which transcends corporate profit-making and is too important a task to leave it in the hands of a company that has a profit-motive.

Ultimately though, it comes down to quality of search results, so until the YaCy team comes up with a way to solve this problem and dramatically improve search result quality, as well as allow the public to submit URLs which they would like to have indexed, the project will remain in techie-land and never go mainstream.

I think that nodes can make revenue by publishing ads from ad networks for PPC clicks generated from searches off their nodes and results from the pages which they serve. This will surely incentivise the addition of more nodes and have a net improvement of the service as nodes begin popping up all over the world.

Personally, my search engine of choice is http://www.ads4trees.com as I get to query both Google and Bing from one interface and contribute to the saving of the environment in the process. The quality of the results is as good, if not better, than using Google or Bing alone and the value proposition is pretty compelling.

Puttergill, you are a moron. Comparing Y to G is moronic. Y is not going to supplant G ever. You are a moron for implying that it will or even has a remote outside chance. And G has not fallen out of favour with geeks.

And Skype is not an example of P2P software. uTorrent is. You are dumb.

http://twitter.com/ksredelinghuys Kyle Redelinghuys

Wow, so cool.

As the internet becomes more closed by corporate interests and legislation (SOPA etc), people will find alternatives. This is one such alternative. Google is falling out of favour with the geeks and others due to privacy concerns and questionable ethics, albeit slowly.

I am amazed that anybody could actually get so upset about any implication that Google might eventually find competition from any other arena. Certainly, I felt that I was very fair in my review of YaCy, pointing out a lot of the hurdles that it has ahead of it. If anything, I implied that it had a long way to go still, but of all the approaches that I have seen so far, it has a lot of potential.

As for the comment about Skype not being P2P… well, that was just ignorant. I purposefully chose Skype as an example, because it is an intelligent use of P2P technology that goes beyond simple filesharing. Using examples like BitTorrent just tends to give the wrong idea about what P2P can be used for. The fact that there are people on the planet who are unaware that Skype relies on P2P just goes to show how poorly understood the technology actually is. Bitcoin was another possible example, and certainly merits mention, if only because they have found a way to solve the problem of rewarding mining peers.

In the end, YaCy has a LONG way to go to offer any real competition to Google. Of course, by Google I am only referring to Search and not their other products. Still, I would love to use a decent search engine that didn’t treat me as a commodity. On the downside, so many sites on the Net use Google code now (e.g. Analytics etc) that the dream of being a private surfer is probably long dead.

Rowan Puttergill

Thanks for your comments Charles.
Your attitude toward Google is not that unusual, and certainly there is a growing concern that where things like privacy are concerned, Google cannot be trusted not to do ‘evil’. Google’s attempt to enforce a Real Name policy on Google+ early this year was certainly an indication that they don’t just want user trend data, but are actually keen on user specifics. It is this sort of information that gives projects like YaCy a lot more clout, and I really hope that they make the necessary headway to provide us with alternatives to using the search behemoth.

Your suggestion to use PPC ads within YaCy’s search results is one that I have seen mentioned on the YaCy forums. It is certainly something that they should look into exploring. I would love to see other recommendations on solving the reward problem, so if anybody else has anything to add, please do comment.

I checked out your link. It’s pretty sweet, but it is not clear how they handle privacy. One alternative that basically proxies Google results, but helps to keep your privacy intact, is https://ixquick.com/.

Charles Ash

I think you have a very appropriate username based on the drivel you just spewed. Your comment was not worth the read and as a technology columnist myself, I find Rowan’s article balanced, informative and his feedback on the comments section pretty exemplary.