Query [ cablegate ] in this miner. Then locate the result whose URL is https://wikileaks.org/cablegate.html and click the Search Inside tool icon below that result. By recursively searching inside a result, you will be walking a portion of Wikileaks' link graph.

Open Source Projects is a new miner available at http://www.minerazzi.com/osp. It allows you to find or submit all kinds of open source projects and to access open source community resources. Search by software, hardware, or project name.

Looking for open source projects relevant to Apache, Linux, or Windows? Need to be more specific in your search (e.g., Weka, JQuery, Aptana, JNode, Ubuntu, or Mozilla)? Want to build your own open source collections? If so, this miner is for you.

Wikileaks.org is one of those huge sites where researchers and investigative reporters can feel like they are in heaven.

That is, provided that they have a way to move across Wikileaks' complex link structure. Simply put, they need a tool that allows them to understand the relationships between links and to move quickly in and out of specific link paths of interest. This needs to be done at different levels of the link graph, while current resources are pulled out of that structure in almost real time.

That is hard to do by just searching or by crafting site, command, or custom searches, or even by using Wikileaks' own search engine.

Fortunately, you can do the above with Minerazzi's recrawling features, at least to some degree.

Although Minerazzi technology is evolving and not perfect, moving from searching indexes to mining user-driven recrawls is a step in the right direction.

However, there might be a broad spectrum of starting experimental conditions, each requiring a different crawling strategy.

The purpose of this post is not to discuss solutions for all possible experimental conditions. It is assumed that users are familiar with Minerazzi’s complementary tools, Recrawl It (RI) and Search Inside (SI). To keep things simple, the recrawls here are done with SI.

A good starting point is the result whose URL is https://www.wikileaks.org/wiki, as it contains links to the latest leaks and recent analyses. Click the Search Inside tool icon below this result.

That should retrieve all links from this result, with the tool icon now at the right of each of the new results. You should see three output sections. The first one logs the current URL being crawled. The other two are the External Links and Internal Links sections.
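The split into External and Internal Links sections can be illustrated with a minimal sketch. This is not Minerazzi's implementation; it only assumes that a link is "internal" when its host matches the host of the page being crawled. The function name split_links, the class LinkCollector, and the example.org URLs are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every anchor tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def split_links(page_url, html):
    """Return (internal, external) lists of absolute http(s) links in html.

    Assumption: a link is internal when its host equals the page's host.
    """
    collector = LinkCollector()
    collector.feed(html)
    page_host = urlparse(page_url).netloc.lower()
    internal, external = [], []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar
        target = internal if parsed.netloc.lower() == page_host else external
        if absolute not in target:  # keep first occurrence only
            target.append(absolute)
    return internal, external


# Placeholder page content, used only for illustration.
SAMPLE_HTML = """
<a href="/about">About</a>
<a href="https://example.org/news">News</a>
<a href="https://other.example.net/">Other</a>
<a href="mailto:someone@example.org">Mail</a>
"""

internal_links, external_links = split_links(
    "https://example.org/start", SAMPLE_HTML
)
```

A real crawler would also have to decide whether subdomains count as internal; the sketch treats them as external.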

You can now recursively recrawl results by clicking their SI icon and, again, check how the above sections are updated. That is, you will be walking a portion of Wikileaks' link graph. At any given step you can walk backward or forward along the link graph by clicking the SI icons in the above sections.
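Walking backward and forward along a link graph can be modeled as a path with a cursor, much like browser history. The sketch below is only an illustration under that assumption; the class LinkWalk, its methods, and the toy graph with example.org URLs are hypothetical and do not describe how the SI tool stores its state.

```python
class LinkWalk:
    """Tracks a walk over a link graph given as {url: [linked urls]}."""

    def __init__(self, graph, start):
        self.graph = graph
        self.path = [start]   # URLs visited so far, in order
        self.position = 0     # index of the current URL in the path

    @property
    def current(self):
        return self.path[self.position]

    def links(self):
        """Links available from the current URL (empty list if none)."""
        return self.graph.get(self.current, [])

    def follow(self, url):
        """Step to a link found on the current page."""
        if url not in self.links():
            raise ValueError(f"{url} is not linked from {self.current}")
        # Taking a new branch discards any steps ahead of the cursor.
        del self.path[self.position + 1:]
        self.path.append(url)
        self.position += 1
        return self.current

    def back(self):
        if self.position > 0:
            self.position -= 1
        return self.current

    def forward(self):
        if self.position < len(self.path) - 1:
            self.position += 1
        return self.current


# Toy graph with placeholder URLs, used only for illustration.
TOY_GRAPH = {
    "https://example.org/": [
        "https://example.org/leaks",
        "https://example.org/analyses",
    ],
    "https://example.org/leaks": ["https://example.org/leaks/1"],
}

walk = LinkWalk(TOY_GRAPH, "https://example.org/")
walk.follow("https://example.org/leaks")
walk.follow("https://example.org/leaks/1")
walk.back()  # the cursor is now on https://example.org/leaks
```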

This mechanism works as expected with the latest versions of Firefox, Opera, Safari, and Chrome browsers. However, sometimes the state of the logged section is not preserved in IE. We are working on fixing this anomaly.

You can always submit a particular Wikileaks URL for indexing in the above miner. Once indexed, you can use it as a starting point.

Example 3: What if I still want to combine searching with recrawling?

You can always do that. Wikileaks' link graph has many URLs with the pattern [keyword].wikileaks.org, which can be easily mined.
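As a small illustration of the [keyword].wikileaks.org pattern, the following sketch extracts the keyword from a URL's host name. The function keyword_subdomain is hypothetical, and treating "www" as a non-keyword is an assumption made for this example.

```python
from urllib.parse import urlparse


def keyword_subdomain(url, base_domain="wikileaks.org"):
    """Return the subdomain label(s) in front of base_domain, or None.

    Assumption: "www" is not considered a keyword.
    """
    host = (urlparse(url).hostname or "").lower()
    suffix = "." + base_domain
    if host.endswith(suffix):
        label = host[: -len(suffix)]
        if label and label != "www":
            return label
    return None


candidates = [
    "https://file.wikileaks.org/file",
    "https://www.wikileaks.org/wiki",
    "https://wikileaks.org/cablegate.html",
]
keywords = [keyword_subdomain(u) for u in candidates]
```

Only the first URL in the list matches the pattern; the other two yield None.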

For instance, search for [file wikileaks] and use SI to recrawl the result whose URL is https://file.wikileaks.org. Next, from the results page, recrawl the result whose URL is https://file.wikileaks.org/file. You will be presented with over a thousand interesting results. Have a field day!

What is next?

Because Wikileaks is so huge, perhaps it is time for us to start building a miner exclusively for mining the Wikileaks.org site. Such a miner would help us address the starting point and link walk issues.

With the US and Cuba reaching out to each other, there is increasing interest in data mining resources from that beautiful Caribbean island.

Companies interested in jumping on the bandwagon (e.g., marketing, tourism, and technology companies) might find the following relevant.

We have added a whole new set of newspapers to the News miner (http://www.minerazzi.com/news), covering not just Cuba but all the Caribbean islands and the 50 US states. Whether you want to build a curated collection of resources from Cuba, the Dominican Republic, or the Virgin Islands, use this miner to your heart's content.

The news miner (http://www.minerazzi.com/news) was built for indexing and mining newspapers. However, you can use it to mine news aggregation sites like HuffingtonPost, DrudgeReport, Topix, Google News, Yahoo News, Bing News, and many more. Just visit the above link and search for any of those sites.

After that you can recursively crawl these with Minerazzi’s Search Inside and Recrawl It tools. These are complementary tools so if one returns no results, try the other one.

To illustrate, HuffingtonPost and DrudgeReport are two of the most user-friendly and content-rich news sites on the Web. They are great sources for building news collections about relevant topics like politics.

By searching for [ huffingtonpost ] or for [ drudgereport ] you can discover additional news services and even follow specific authors and their posts. You can then start building curated collections of news services, authors, and their posts.

When building collections from news services, if a remote host is busy you may want to retry it at another time. However, if the remote host denies you service, you are out of luck. This is not really a drawback: there are zillions of friendly hosts out there that will provide you with rich content, so the ones that refuse connections are expendable.
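The retry-or-skip policy described above can be sketched as follows. This is an illustration only: the exception classes HostBusy and HostDenied, the collect function, and the fake_fetch stand-in are all hypothetical, and no real network access is performed.

```python
import time


class HostBusy(Exception):
    """Hypothetical signal that a remote host is temporarily busy."""


class HostDenied(Exception):
    """Hypothetical signal that a remote host refuses service."""


def collect(urls, fetch, retries=2, delay=0.0):
    """Fetch each URL, retrying busy hosts and skipping hosts that deny us."""
    collected, skipped = {}, []
    for url in urls:
        for attempt in range(retries + 1):
            try:
                collected[url] = fetch(url)
                break
            except HostBusy:
                if attempt == retries:
                    skipped.append(url)  # still busy after all retries
                else:
                    time.sleep(delay)    # wait before trying again
            except HostDenied:
                skipped.append(url)      # expendable host, move on
                break
    return collected, skipped


# Stand-in fetcher for illustration: "busy" hosts succeed on the second try.
calls = {}


def fake_fetch(url):
    calls[url] = calls.get(url, 0) + 1
    if "denied" in url:
        raise HostDenied(url)
    if "busy" in url and calls[url] < 2:
        raise HostBusy(url)
    return "content of " + url


collected, skipped = collect(
    [
        "https://ok.example.com",
        "https://busy.example.com",
        "https://denied.example.com",
    ],
    fake_fetch,
)
```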

3. Note from the output of step 2 that, for Google Scholar, some of the links discovered by the tool are co-authors identified by Google. Locate a co-author and click the Search Inside tool icon, this time located at the right of that result.

4. You will be presented with a new list of results, each with the Search Inside icon. Some of these include co-authors.

By recursively using Search Inside you can build a curated collection on the pagerank topic or a curated collection of co-authors, without having to resubmit the query.

This approach assumes that the initial Google Scholar URL to be mined is already in the IRC microindex. For other queries, you need to query Google Scholar and submit the search results URL for indexing in IRC. Once indexed, it can be mined as described above.

However, if a user discovers a Google Scholar URL when using the Search Inside tool on a previous result, that URL can be recrawled and mined as described above, so it does not need to be in the IRC microindex at all.

In general, any URL searchable with Search Inside can be mined, unless the tool hits a dead end (no links accessible or to follow).
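The idea of mining recursively until a dead end is reached can be sketched with a depth-limited walk over an in-memory link graph. This is a simplified model, not the tool's algorithm: the function mine, the max_depth parameter, and the toy graph with example.org URLs are assumptions made for illustration.

```python
def mine(graph, start, max_depth=2):
    """Walk graph ({url: [linked urls]}) level by level from start.

    Returns (discovered, dead_ends): newly found URLs in discovery order,
    and examined URLs that had no links to follow.
    """
    discovered, dead_ends = [], []
    seen = {start}
    frontier = [start]
    for _ in range(max_depth):
        next_frontier = []
        for url in frontier:
            links = graph.get(url, [])
            if not links:
                dead_ends.append(url)  # nothing accessible to follow
                continue
            for link in links:
                if link not in seen:
                    seen.add(link)
                    discovered.append(link)
                    next_frontier.append(link)
        frontier = next_frontier
    return discovered, dead_ends


# Toy graph with placeholder URLs, used only for illustration.
TOY_GRAPH = {
    "https://example.org/": ["https://example.org/a", "https://example.org/b"],
    "https://example.org/a": ["https://example.org/c"],
    "https://example.org/b": [],
}

found, dead = mine(TOY_GRAPH, "https://example.org/", max_depth=2)
```

With max_depth=2, the URL https://example.org/c is discovered but never examined, so it is not reported as a dead end.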

To start a curated collection about the Death Penalty, search for [ cornell ] in the Human and Civil Rights Collection miner (http://www.minerazzi.com/hcrc). Use the Search Inside tool on the third result whose URL is

You should get over 2,800 links, enough to start a curated collection on the topic. This is a practical example of building collections with Minerazzi. It is great for attorneys, law students, or others interested in the above topic.

Today we are adding a new tool that reports to users which sites they have visited while walking the link graph of recrawled web pages. The tool works when they use the Search Inside tool of any miner built with Minerazzi (http://www.minerazzi.com). At this time we are limiting it to the last 10 visits per result per query session.
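A cap of "the last 10 visits per result per query session" maps naturally onto a bounded queue. The sketch below shows one way such a log could be kept; the class VisitLog, its method names, and the session and URL values are hypothetical and are not taken from the Minerazzi code base.

```python
from collections import defaultdict, deque

MAX_VISITS = 10  # limit stated in the post: last 10 visits


class VisitLog:
    """Keeps the most recent visits per (session, result) pair."""

    def __init__(self, limit=MAX_VISITS):
        # A deque with maxlen drops the oldest entry once it is full.
        self._visits = defaultdict(lambda: deque(maxlen=limit))

    def record(self, session_id, result_url, visited_url):
        self._visits[(session_id, result_url)].append(visited_url)

    def report(self, session_id, result_url):
        """Return visited URLs, oldest first, without creating new entries."""
        return list(self._visits.get((session_id, result_url), ()))


log = VisitLog()
for i in range(12):
    log.record("session-1", "https://example.org/", f"https://example.org/page{i}")

recent = log.report("session-1", "https://example.org/")
```

After 12 recorded visits, only the last 10 remain, so the two oldest pages have been dropped.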

Click the Search Inside icon at the right of this result to recrawl said result. By doing so, you already have access to almost the entire online collection of human rights resources from the University of Minnesota Library. By recursively using the Search Inside tool, you can keep discovering new resources.

You can do similar searches in large repositories like the Library of Congress. You just need to find said repository in the miner or as a secondary URL while recrawling.

The discovery and data mining capabilities of Minerazzi have prompted us to launch a new and ambitious project: The World Libraries Recrawl Project.

In general, the goal of Minerazzi is to turn searchers into data miners. We try to accomplish this with the dozens of tools the platform comes with.

For instance, we have two complementary tools: Recrawl It (a URL crawler) and Search Inside (a link crawler). In this post, we want to discuss Search Inside so you can grasp its power.

Go to the seominer (http://www.minerazzi.com/seominer) and search for [ sesconference.com ] to find that site's link. Click the Search Inside icon (a black square of contiguous arrows) of that result.

You should see a list of external and internal links. In the internal links list, locate the SES London result and again click the Search Inside icon, this time located at the right of that result. The tool will retrieve more results.

By recursively searching inside, you can extract more resources and, if you are lucky, hit a blog, discussion forum, portal, or similar site, which you can then keep mining.

If you prefer, you can select results by clicking the {S} red link at the top right of the lists. You can then export the results by copying and pasting them into an external destination (e.g., Excel or a txt file). In this way you can build your own curated collection or organize links to your heart's content.
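Once links have been copied out, they can be saved in a form that spreadsheet programs open directly. The following is a minimal sketch, assuming the selected results are available as (section, url) pairs; the function export_links and the sample links are hypothetical.

```python
import csv
import io


def export_links(links, out):
    """Write (section, url) pairs as CSV rows to a file-like object."""
    writer = csv.writer(out)
    writer.writerow(["section", "url"])
    for section, url in links:
        writer.writerow([section, url])


# Placeholder selection, used only for illustration.
selected = [
    ("internal", "https://example.org/speakers"),
    ("external", "https://other.example.net/forum"),
]

buffer = io.StringIO()
export_links(selected, buffer)
csv_text = buffer.getvalue()
```

To write to disk instead of memory, the same function can be given a file opened with open("links.csv", "w", newline="").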

You can also start the above mining by just searching for [ speaker profiles ] in the seominer, doing a Search Inside on the SES London result, and then doing a Search Inside on one of the new results, as done before.

In some tests, we were able to get inside a discussion forum and reach all the replies to a specific post. In another test, we were able to mine zillions of posts from Twitter and Facebook users.

Proceeding as described above, you will be walking and mining the link structure of sites across the Web. That is, you will be mining while searching, which, as said at the beginning of this post, is our primary goal: to turn searchers into data miners.