In the context of a YaCy redesign towards a YaCy2, I plan to rip YaCy apart into stand-alone modules, make room for funded (and commercially usable) plug-in parts, and then pack the resulting modules together again into different appliances. This could lead to a 'new' YaCy which is compatible with the old network but is composed of the new modules. Another target is to create professional appliance packages which can consist of parts that are not applicable for p2p search but are necessary for customers.

One of the tasks in creating that architecture is the identification of standards which the modules of YaCy2 should support. I have identified WARC as really amazing and important; it would fit the YaCy users' demand to collect large amounts of web data. WARC is the file standard of the Internet Archive, http://archive.org. There are a lot of interesting applications available to create and process WARC:

I tried the webarchiveplayer https://github.com/ikreymer/webarchiveplayer and liked it a lot. It starts a web server, and in your browser you get all the content from a specific WARC file served (unfortunately it starts on our default port 8090, so you might have to stop YaCy to try it... or change the port).
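For readers who have not looked inside the format yet: a WARC file is just a sequence of records, each with plain-text headers and a length-prefixed payload. Below is a minimal round-trip sketch using only the Python standard library; it covers only the headers needed for the demo, while real tools (and the WARC/1.0 spec, ISO 28500) add things like per-record gzip compression and payload digests.

```python
import io
import uuid
from datetime import datetime, timezone

def write_warc_record(out, target_uri, payload, record_type="resource"):
    """Append one minimal WARC/1.0 record (headers + payload + separator) to `out`."""
    headers = (
        "WARC/1.0\r\n"
        f"WARC-Type: {record_type}\r\n"
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>\r\n"
        f"WARC-Date: {datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')}\r\n"
        f"WARC-Target-URI: {target_uri}\r\n"
        f"Content-Length: {len(payload)}\r\n"
        "\r\n"
    )
    out.write(headers.encode("ascii") + payload + b"\r\n\r\n")

def read_warc_records(stream):
    """Yield (headers_dict, payload_bytes) for each record in a WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return
        if line.strip() != b"WARC/1.0":
            continue  # skip the blank separator lines between records
        headers = {}
        while True:
            h = stream.readline().strip()
            if not h:
                break
            key, _, value = h.partition(b": ")
            headers[key.decode()] = value.decode()
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload

# round-trip demo
buf = io.BytesIO()
write_warc_record(buf, "http://example.com/", b"<html>hello</html>")
buf.seek(0)
for hdrs, body in read_warc_records(buf):
    print(hdrs["WARC-Target-URI"], body)
```

Note that wget can produce real WARC files directly (its --warc-file option), which is a more practical way to create archives than hand-rolling records like this.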

I also like the idea that in a YaCy2 architecture we should be able to share on two levels: in addition to p2p index sharing, we could do WARC sharing as well. I am considering adding a BitTorrent tracker for that, together with a WARC archive management, to the list of modules which could be glued together into YaCy2.
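To make the BitTorrent idea concrete: a peer announcing a WARC archive would be identified by the torrent's info-hash, i.e. the SHA-1 of the bencoded 'info' dictionary. The sketch below (stdlib only) shows a minimal bencoder and how such an info-hash could be derived for a single WARC file; the function and file names are illustrative, not part of any existing YaCy code.

```python
import hashlib

def bencode(value):
    """Minimal bencoding (the BitTorrent wire encoding) for bytes, int, list, dict."""
    if isinstance(value, bytes):
        return str(len(value)).encode() + b":" + value
    if isinstance(value, int):
        return b"i" + str(value).encode() + b"e"
    if isinstance(value, list):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):  # keys must be bytes and sorted, per the spec
        return b"d" + b"".join(
            bencode(k) + bencode(v) for k, v in sorted(value.items())
        ) + b"e"
    raise TypeError(f"cannot bencode {type(value)}")

def warc_info_hash(name, data, piece_length=262144):
    """Build a single-file torrent 'info' dict and return its SHA-1 info-hash."""
    pieces = b"".join(
        hashlib.sha1(data[i:i + piece_length]).digest()
        for i in range(0, len(data), piece_length)
    )
    info = {
        b"name": name.encode(),
        b"length": len(data),
        b"piece length": piece_length,
        b"pieces": pieces,
    }
    return hashlib.sha1(bencode(info)).hexdigest()

print(warc_info_hash("crawl-2016-01.warc.gz", b"example archive bytes"))
```

A tracker (or DHT) keys swarms on exactly this hash, so two peers who archived the same WARC bytes would automatically join the same swarm.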

What do you think? Please try the wget command above, and maybe start to collect WARC archives which we can share to bootstrap a huge YaCy2 index once the software modules are ready!

Since 2014, I have been automatically recording nearly all the pages I visit with the Shelve Firefox add-on, and I am planning to convert the data to WARC files using Wget and then build my private Wayback Machine with OpenWayback. Of course, I will use YaCy as the search engine.

Shouldn't every public website be encouraged to provide its own up-to-date WARC archive of its contents, and even its own reverse index: in fact, to run its own YaCy node instance with a WARC and index related only to its contents? Maybe a specific YaCy distribution would help with that... Or maybe I am just not aware that websites already do this with the current YaCy release?

luc wrote: At Common Crawl they also use the WARC format to store huge crawl archives on Amazon AWS: http://commoncrawl.org/the-data/get-started/. Maybe their data could also be a source to feed some YaCy nodes?

They have a strange way of presenting links to their WARC files ("replace xxx with yyy", which then is a link to a gzipped file which in turn contains strings that are parts of links), but I finally managed to load one of these dumps. What I saw then was a mixture of documents from different domains, originating from a 'wide crawl' over many domains in the same way YaCy does it. One month amounts to about 50 terabytes in those archives. It could be interesting to download all of that and filter it according to some rules, but indexing would be hard work requiring a lot of servers. It's possible, but not easy...
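The "strings which are parts of links" are relative paths: Common Crawl publishes gzipped path listings, and you prepend their download base URL to each line to get a fetchable link. A small stdlib sketch of that expansion, with the base URL as an assumption that should be checked against their current documentation:

```python
import gzip
import io

# Download prefix assumed here; verify against the Common Crawl docs,
# since the hosting location has changed over the years.
BASE = "https://data.commoncrawl.org/"

def warc_urls(paths_gz_bytes, base=BASE):
    """Expand a gzipped warc.paths listing into full download URLs."""
    with gzip.open(io.BytesIO(paths_gz_bytes), "rt") as f:
        return [base + line.strip() for line in f if line.strip()]

# demo with a fake two-line listing (real listings contain thousands of entries)
fake = gzip.compress(
    b"crawl-data/CC-MAIN-2016-07/segments/x/warc/a.warc.gz\n"
    b"crawl-data/CC-MAIN-2016-07/segments/x/warc/b.warc.gz\n"
)
for url in warc_urls(fake):
    print(url)
```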

luc wrote: Shouldn't every public website be encouraged to provide its own up-to-date WARC archive of its contents, and even its own reverse index: in fact, to run its own YaCy node instance with a WARC and index related only to its contents?

That's a good point, since there is also a copyright issue: while it is simple to create a WARC with wget, it could be a copyright problem to publish a full crawl of a single domain, because it amounts to re-distribution. I don't know if that is actually a problem yet, but I am unsure. For a YaCy2 architecture I would suggest implementing a 'closed-group' torrent-based file-sharing infrastructure for such files. Your suggestion could then lead to a marker in the web pages containing a copyright note. I believe such markers exist, but no one uses them. For domains with free licenses we could create open-group sharing instead of closed-group sharing.

To eventually build an index upon these Common Crawl datasets, I suppose it would require some Hadoop programming skills to run the jobs on Amazon EC2 or another cloud. It would certainly take some time but seems very interesting... despite the fact that these data are stored on a commercial and centralized cloud system.

Modular is fine. Is there already an idea of a framework, or will it be handcrafted... or do you really mean stand-alone (communicating over a file system ;-( )?

Orbiter wrote: I identified that WARC is really amazing and important

I don't get the discussion about WARC. Is it about the idea to distribute (sell?) an index without crawling? Does that work for us? Or is it just... basically about having a module that writes the crawler cache in a different (reusable) format...?

To extend the idea of a modular architecture based on standards, don't you think integrating the Apache Nutch web crawler (https://nutch.apache.org/) and the Apache Tika parsers (https://tika.apache.org/) would be a good idea? Any parts of the YaCy crawling or parsing system not already in these libraries could be contributed to those projects... It would allow even more code review and testing on such core components.

I worked with a commercial partner who selected YaCy over Nutch as a crawler because, already some years ago, they considered Nutch old and badly maintained. Since then I have worked with these partners to enhance the YaCy crawler even further. Because of this experience, turning to Nutch would be a huge step back.

Apache Tika is a component for Solr which bundles a set of parsers and unifies their metadata into a common metadata structure. YaCy does the same, and YaCy uses a superset of the parsers that are in Tika. Furthermore, the metadata structure in YaCy is much richer than the one used in Tika. That means: Tika is great, but its functions are already subsumed by YaCy.

What is great about Nutch and Tika is the 'thinking in modules'. That's exactly what the idea behind the YaCy2 components is.

reger wrote: I don't get the discussion about WARC. Is it about the idea to distribute (sell?) an index without crawling?

WARC is a great format and there are already a lot of tools for it, so it's simply a good choice. This is not about 'selling' data. By 'distribution' I mean usability for the (YaCy!) community.

reger wrote: Or is it just... basically about having a module that writes the crawler cache in a different (reusable) format...?