I like the proposals; to my mind they fit really well with the YaCy philosophy ("Web Search by the People, for the People").

However, for now I also have no clear idea or inspiring examples of what the understandable info should look like. About the feedback option, maybe the + button could be used to register "votes" in a Solr field, which could then be used like any other field in ranking (note that I do not know this function very well, and I am not sure whether such a field does not already exist...).

Another idea I was wondering about: could bookmarked pages also be used in the ranking process? (I am also not very sure how that currently works in detail...)

By the way, it would be great if some people without developer skills would also share their point of view.

UX is not really my specialty, but my guess is that to really define "understandable", we would need some idea of what the user's goal is.

Some examples:

1. "I think YaCy is ranking a website too high or low by accident; I want to figure out how this can be fixed by changing the ranking rules."
2. "I think YaCy is ranking a website too high because that website is using abusive SEO / spam techniques; I want to figure out what the website is doing so we can make YaCy penalize such sites."
3. "I have a new idea for a YaCy ranking method and I want to figure out whether it would be beneficial, and which pages would be most affected."
4. "I have a website and I want to figure out how to change the site to make it rank more highly in YaCy using the default YaCy settings."

There may be some information that is beneficial for some of these use cases but superfluous for others. It might be useful to consider these use cases independently for the purpose of figuring out what information should be highlighted and how it should be visualized. I think a good first step is to simply make the raw data available, since this allows people to experiment with layers on top of it, but I fully agree that making raw data available is not really sufficient by itself for most real-world use cases. (Although for my particular use cases, it's sufficient, given that I'm willing to code some Python scripts to do my additional analysis.)

In terms of optimization/learning, a common technique in machine learning is backpropagation. Basically, this uses the partial derivative of an output variable with respect to some input variable, to determine how to change the input variable in order to optimize the output variable. I'm playing around with this technique in the context of YaCy ranking, but I don't have any results to share yet. The important takeaway here is that because backpropagation needs partial derivatives, it needs to know what calculations were used to get the final ranking score.
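To make the backpropagation idea concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the feature vectors, the linear scoring formula, and the learning rate are invented for illustration, and YaCy's actual ranking attributes would have to be substituted. The point is simply that once the score is an explicit calculation, its partial derivatives with respect to each weight tell us which direction to nudge the weights.

```python
# Hypothetical sketch: gradient-based tuning of linear ranking weights.
# Feature values and the scoring formula are invented for illustration;
# they are not YaCy's actual ranking calculation.

def score(weights, features):
    """Ranking score as a weighted sum of per-document features."""
    return sum(w * f for w, f in zip(weights, features))

def gradient_step(weights, better, worse, lr=0.1):
    """If the 'better' document does not outscore the 'worse' one, nudge
    each weight along the partial derivative of the margin
    (score(better) - score(worse)) with respect to that weight."""
    margin = score(weights, better) - score(weights, worse)
    if margin <= 0:
        # d(margin)/d(w_i) = better_i - worse_i
        weights = [w + lr * (b - x) for w, b, x in zip(weights, better, worse)]
    return weights

# Two toy documents described by two invented features each;
# the initial weights rank them in the wrong order.
relevant = [0.9, 0.2]
irrelevant = [0.1, 0.8]
w = [0.1, 0.9]
for _ in range(20):
    w = gradient_step(w, relevant, irrelevant)
print(score(w, relevant) > score(w, irrelevant))  # True after convergence
```

Note that this only works because the score is a differentiable function of the weights; a ranking pipeline with hard cutoffs or opaque steps would not expose the partial derivatives that backpropagation needs, which is why knowing the full calculation matters.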

One potentially useful way to get data for deciding how to optimize ranking would be to use clickthrough data. There's not much of a privacy implication to collecting clickthrough data as long as it's not shared with peers, but my guess is that multiple nodes' clickthrough data would need to be combined in some way to get sufficiently noise-free data. There are some ways that this could be done; I'm investigating using a social graph method where users' own nodes do optimization using their own clickthrough data but share a weighted sum of their own optimizations and their friends' optimizations. This is reasonably private (users effectively act as a blind for the users in their social graph), and reasonably Sybil-resistant (social graphs have been reasonably well-studied for Sybil-resistance, including in the context of Freenet, for which the stakes are a lot higher). I don't have any practical results to share on that yet.
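As a rough illustration of the social-graph blending I have in mind, here is a sketch. The trust weights, vectors, and the 50/50 split between own and friends' contributions are all invented parameters, not a worked-out protocol: each node computes a weight adjustment from its own clickthrough data, then publishes a weighted average of that adjustment and the values its friends published.

```python
# Hypothetical sketch of blending a node's own ranking-weight adjustment
# with friends' shared adjustments. Trust weights and the own/friends
# split are invented illustration parameters.

def blend(own_adjustment, friends, own_weight=0.5):
    """friends: list of (trust, adjustment_vector) pairs.
    Returns own_weight * own + (1 - own_weight) * trust-weighted
    average of the friends' adjustments."""
    total_trust = sum(trust for trust, _ in friends)
    blended = [own_weight * x for x in own_adjustment]
    for trust, adj in friends:
        share = (1.0 - own_weight) * trust / total_trust
        blended = [b + share * a for b, a in zip(blended, adj)]
    return blended

# A node's own adjustment plus two friends' shared adjustments,
# one trusted twice as much as the other:
mine = [0.2, -0.1]
friends = [(2.0, [0.4, 0.0]), (1.0, [-0.2, 0.3])]
shared = blend(mine, friends)
print(shared)
```

Since the published vector is already a mixture, an observer cannot tell which component came from the node's own clicks, which is where the blinding effect comes from.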

biolizard89 wrote: I'm curious what its intended use case was and why it was removed.

The primary intended use was to make sure a search result that the user found worthwhile to look at (click on) is used to improve the local index.

FYI: intro of the servlet https://github.com/yacy/yacy_search_server/commit/d44d8996d03ecec0e3c78fb54ab39ae22caef7c1

Past TODO list of actions (0- = not implemented yet):
- crawl/recrawl the url
- crawl all links on page (with depth) / site
0- increase/create rating
0- add to a collection
0- connect query and url
0- learn and classify content
- promote rating
0- add to click statistic url/cnt (maybe to use for boost)

reger wrote: The primary intended use was to make sure a search result that the user found worthwhile to look at (click on) is used to improve the local index.

FYI: intro of the servlet https://github.com/yacy/yacy_search_server/commit/d44d8996d03ecec0e3c78fb54ab39ae22caef7c1

Past TODO list of actions (0- = not implemented yet):
- crawl/recrawl the url
- crawl all links on page (with depth) / site
0- increase/create rating
0- add to a collection
0- connect query and url
0- learn and classify content
- promote rating
0- add to click statistic url/cnt (maybe to use for boost)

P.S. a veto by Orbiter is then good enough for a delete.

Hi reger,

Okay, so it looks to me like there are 2 different questions here, which I think are orthogonal.

1. How do we want to collect information that can be used to improve YaCy's results?
2. What do we want to do with that information once it's collected?

The various possible answers to (1) would seem to include the following:

* Include UI elements next to results (e.g. upvote/downvote buttons).
* A browser add-on that adds UI elements (e.g. upvote/downvote) while actually visiting the page.
* Allow users to opt into receiving notifications in the YaCy UI (with configurable frequency, e.g. once per week), of the form "Do you have a few seconds to help YaCy improve? If so, choose which of these 2 results for the search 'foo' is more relevant to you."
* Clickthrough data (not really any need for this to be part of YaCy if you guys don't wish it to be; it could easily be done as a browser add-on, e.g. a Greasemonkey script, for users who wish to opt in).

The possible answers to (2) would include the things you listed; the ones that occurred to me include:

* Recrawl URL or site (possibly with depth).
* Use as input to machine learning to improve ranking rules.

Both of these questions imply a 3rd question: what should the structure of the collected data be?

I would suggest that the collected data be a set of pairs (q, d), where q is a query and d is a DAG (directed acyclic graph) of URLs. A DAG seems like a good fit because it can be traversed to find pairs of URLs where the first URL is more relevant than the second URL.
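To sketch what I mean, here is a minimal Python illustration of the structure (the URL names are placeholders). An edge (a, b) means "a is more relevant than b" for the query, and walking the DAG yields all the preference pairs it implies, including transitive ones:

```python
# Sketch of the proposed per-query DAG. The dict maps each URL to the
# set of URLs it is directly more relevant than; traversal recovers
# every implied (more-relevant, less-relevant) pair.

def preference_pairs(dag):
    """dag: {url: set of less-relevant urls}. Returns all (more, less)
    pairs implied by the edges, including transitive ones."""
    pairs = set()
    def descend(root, node):
        for child in dag.get(node, ()):
            if (root, child) not in pairs:
                pairs.add((root, child))
                descend(root, child)
    for url in dag:
        descend(url, url)
    return pairs

dag = {"a.example": {"b.example"}, "b.example": {"c.example"}}
print(sorted(preference_pairs(dag)))
# includes the transitive pair ("a.example", "c.example")
```

The nice property is that UIs only ever need to add edges; anything consuming the data works from the derived pairs and doesn't care which UI produced them.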

There are a number of UIs that could feed a DAG. For example, a simple UI could assign each URL to one of 3 categories: "exactly what I wanted", "relevant", "irrelevant". (Last I checked, this is what Google does to train their ranking.) Since these 3 categories are inherently ordered, the DAG would in effect have 3 layers, with each URL in the "exactly what I wanted" layer being more relevant than each URL in the "relevant" layer, etc.
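The layered case reduces to DAG edges mechanically; a sketch (category names and URLs are placeholders):

```python
# Sketch: turning ordered feedback categories into DAG edges. Each URL
# in a layer is linked as more relevant than each URL in the next layer
# down; linking adjacent layers suffices, since transitivity covers the rest.

def layered_dag(layers):
    """layers: list of URL lists, ordered most to least relevant.
    Returns {url: set of less-relevant urls}."""
    dag = {}
    for upper, lower in zip(layers, layers[1:]):
        for u in upper:
            dag.setdefault(u, set()).update(lower)
    return dag

layers = [["exact.example"],                 # "exactly what I wanted"
          ["rel1.example", "rel2.example"],  # "relevant"
          ["junk.example"]]                  # "irrelevant"
print(layered_dag(layers))
```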

Alternatively, a UI could offer a "rank this URL more highly" button next to a search result; this would create a link in the DAG that makes the result more relevant than the result that appeared directly above it. If that button is clicked, the UI could immediately swap those 2 results in the results page, and if the user clicks the button again, the action would be repeated with whatever result is above it this time.
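The button behavior is just a swap plus an edge; a sketch of the logic (the function name and URLs are invented for illustration):

```python
# Sketch of the "rank this URL more highly" button: clicking swaps the
# result with the one above it and records a DAG edge saying the clicked
# URL is more relevant than the one it displaced.

def promote(results, index, dag):
    """Move results[index] up one slot and record the preference edge.
    Returns the result's new index."""
    if index == 0:
        return index  # already at the top; nothing to record
    above = results[index - 1]
    clicked = results[index]
    dag.setdefault(clicked, set()).add(above)
    results[index - 1], results[index] = clicked, above
    return index - 1

results = ["a.example", "b.example", "c.example"]
dag = {}
pos = promote(results, 2, dag)    # promote c.example once
pos = promote(results, pos, dag)  # a second click repeats against the new neighbor
print(results, dag)
```

After the two clicks, c.example sits at the top and the DAG records that it beat both results it passed on the way up.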

In terms of how algorithms would use the DAG, a recrawl could be initiated whenever a URL is assigned to the "more relevant" side of a link between 2 URLs. For machine learning algorithms, backpropagation could be used to try to decrease the distance between any URLs whose ranking is in the inverse of the order that the DAG specifies. If a genetic algorithm is used, then the fitness score could be the fraction of URL pairs from the DAG which are in the correct order under the given ranking.
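For the genetic-algorithm case, the fitness function I'm describing is tiny; a sketch (URLs are placeholders, and pairs involving unranked URLs are simply skipped here, which is one of several possible choices):

```python
# Sketch of the fitness idea: the fraction of DAG preference pairs that
# a candidate ranking puts in the correct order.

def fitness(ranking, pairs):
    """ranking: list of URLs, best first.
    pairs: iterable of (more_relevant, less_relevant) URL pairs."""
    position = {url: i for i, url in enumerate(ranking)}
    scored = [(m, l) for m, l in pairs if m in position and l in position]
    if not scored:
        return 0.0
    correct = sum(1 for m, l in scored if position[m] < position[l])
    return correct / len(scored)

pairs = [("a.example", "b.example"), ("b.example", "c.example")]
print(fitness(["a.example", "b.example", "c.example"], pairs))  # 1.0
print(fitness(["c.example", "a.example", "b.example"], pairs))  # 0.5
```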

So yeah, lots of possibilities here, but basically all of the use cases I can think of can be met by a DAG-per-query structure, and the methods of feeding data into the DAG are orthogonal to the methods of using the DAG to improve results.

biolizard89 wrote:
1. How do we want to collect information that can be used to improve YaCy's results?
2. What do we want to do with that information once it's collected?

Right, we could divide the topic into those 2 sections/questions.

biolizard89 wrote: Alternatively, a UI could offer a "rank this URL more highly" button next to a search result;

FYI, for (1): that is what came to my mind too, and I'm experimenting with it, with a focus on the button and its effect (but I am far from happy with what I've tested so far (it's 2 buttons, up/down, a pie chart and 3 numbers), and I stumbled over other things to look at in the RWI ranking area). In regards to how to represent it (internal structure), I started with the RWI (reverse word index) and deal here just with result pairs for the ranking parameters (as that is what machine learning could optimize). I'll have to read your nice reply a couple more times, and probably have to get closer to the how-to-represent details (to fully understand your query/URL/DAG idea, but I think I will get it; as with my sentence above, I'm in the context of a search, which includes query & URL). But spitting out rows of numbers etc. without an answer to your question (2), which includes "... is handled within YaCy by ... or with tool xyz", is not of benefit for me.