On Dec 23, 2009, at 2:37 PM, Mark Bennett wrote:
> Hello Simon and Robert,
>
> Robert, yes, I do have a private corpus and truth table. At this point I can't share
> it, though I'll ask my client at some point.
>
> I did find some code in JIRA (your patches); here are the links "for the record":
> https://issues.apache.org/jira/browse/ORP-1
> https://issues.apache.org/jira/browse/ORP-2
>
> Top languages for testing are English, Japanese, French and German.
>
> I'm excited to have others to talk to! I have some general comments / questions:
>
> 1: Although the qrels data format was originally binary yes/no, apparently there were
> more flexible dialects used in later years, that allowed for some weighting. Was there a
> particular dialect that y'all were considering?
I think it would be nice to have both binary and something like: relevant, somewhat relevant,
not relevant, embarrassing or a scale of 1-5 or 1-10 depending on how hard core you want to
be.
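
To make the binary-versus-graded question concrete, here is a minimal sketch that reads graded judgments in the four-column TREC qrels layout (topic, iteration, document id, relevance) and collapses them to binary on demand. The function names, the sample ids and the 0-3 grade scale are assumptions for illustration only, not something ORP has settled on.

```python
from collections import defaultdict


def parse_qrels(lines):
    """Parse qrels lines of the form 'topic iteration docno relevance'."""
    judgments = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        topic, _iteration, docno, rel = line.split()
        judgments[topic][docno] = int(rel)  # graded value, may be > 1
    return judgments


def binarize(judgments, threshold=1):
    """Collapse graded judgments to 0/1: relevant if grade >= threshold."""
    return {
        topic: {doc: int(grade >= threshold) for doc, grade in docs.items()}
        for topic, docs in judgments.items()
    }


# Hypothetical sample: grades on an assumed 0-3 scale.
sample = [
    "101 0 doc-a 3",
    "101 0 doc-b 1",
    "101 0 doc-c 0",
]
graded = parse_qrels(sample)
binary = binarize(graded, threshold=2)
```

Keeping the graded value in the file and choosing the threshold at evaluation time would let the same judgments serve both the binary and the weighted use.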
>
> 2: CAN WE use the TREC qrels format(s)?
>
> I believe TREC has various restrictions on the use of test results, source content and
> evaluation code (annoying since TREC is supposed to foster research and NIST is paid for by
> US tax dollars, but that's another whole rant). But do we think the file format is "open"
> or "closed"?
We should be able to use the format. I think the only thing closed about TREC is the need
to pay a small sum for the collection, but that isn't NIST's fault.
>
> 3: I do favor an optional "analog" scale. Do you agree?
>
> Our assertions are on an A-F scale; I can elaborate if you're interested. A floating
> point scale is more precise perhaps, but we have human graders, and explaining letter grades
> that approximate academic rankings was less confusing, plus we were already using numbers
> in two other aspects of the grading form.
>
> 4: Generally do you guys favor a simple file format (one line per record), or an XML
> format?
>
> TREC was born in the early 90's I guess, so is record oriented, and probably more efficient.
> We have our tests in an XML format, which though more verbose, affords a lot more flexibility
> including comments and optionally self-contained content. It also sidesteps encoding issues, as
> XML declares its encoding and defaults to UTF-8. I've found that "text files" from other
> countries tend to come in numerous encodings. And Excel, which is often used for CSV and other
> delimited files, sadly does NOT do UTF-8 in CSV files.
Pretty wide open at this point
>
> 5: How much do you value interoperability with Excel?
>
> It's VERY handy for non-techies, and the .xlsx format is a set of zipped XML files, so
> perhaps acceptably "open". I would not propose .xlsx as the standard format, but it'd be
> nice to inter-operate with it. We'd need some type of template.
That would be great.
>
> 6: "quiescent" vs. "dynamic" strategies
>
> Content: During in-house testing it's sometimes been hard to maintain a static set of
> content. You can have a separate system, but I suspect in some scenarios it won't be feasible
> to lock down the content. See item 10 below. My suggestion is to mix this into the thinking.
> Some researchers wouldn't accept the variables it adds, but for others if it's a choice between
> imperfect checks and no checks at all, they'll take the imperfect.
>
> Grades / Evaluations: It's VERY hard to get folks to grade an MxN matrix. I had a matrix
> of just 57 x 25 (> 1,400 spots) and, trust me, it's hard to do in one sitting. It'd be
> nice to handle spotty valuations.
I think it's pretty important to be able to reproduce experiments across users/machines/etc.
which means the content needs to be versioned. This is the one big issue I have w/ simply
pointing at other data sets. Ultimately, we will need our own collection that we can version.
>
> 7: fuzzy evaluations vs. "unit testing"
>
> Given the variabilities (covered in other points), it'd be nice to come up with fuzzier
> assertions.
>
> Examples:
> * "Doc1 is more relevant to Search1 than Doc2"
> * "I'd like to see at least 3 of these docs in the top 10 matches for this search"
Nice to have, but likely further down the road. However, the door is wide open at this point,
so scratch that itch!
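
For what it's worth, the two example assertions above are easy to prototype against a ranked result list. This is only a sketch: the helper names are hypothetical, and treating a document that is missing from the results as ranked last is one assumption among several possible ones.

```python
def ranks_above(ranking, better, worse):
    """True if `better` appears before `worse` in the ranked list.

    Assumption: a document absent from the ranking counts as ranked last.
    """
    def position(doc):
        return ranking.index(doc) if doc in ranking else len(ranking)

    return position(better) < position(worse)


def at_least_in_top(ranking, wanted, minimum, k=10):
    """True if at least `minimum` of the `wanted` docs are in the top k."""
    top = set(ranking[:k])
    return len(top & set(wanted)) >= minimum


# Hypothetical ranked result ids for one search.
results = ["d7", "d1", "d4", "d2", "d9"]

doc1_beats_doc2 = ranks_above(results, "d1", "d2")
two_of_three = at_least_in_top(results, ["d1", "d2", "d3"], minimum=2)
three_of_three = at_least_in_top(results, ["d1", "d2", "d3"], minimum=3)
```

Assertions of this kind tolerate small changes in the index, since they only constrain relative order or membership in the top k rather than exact positions.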
>
> 8: URLs as keys (optional, handy in some contexts)
> Various technical issues here, just wanted to bring it up.
>
> 9: Ideas for an "(e)valuation console" / "crowd sourcing"
>
> There are several ways to present searches and answers to users in a somewhat reasonable
> way, to make it a bit easier / fun for them to make assertions. Lots of ways to go here,
> but we'd need some UI resources.
Yep, this has been kicked around and would be quite nice.
>
> 10: "academic" vs. "real-world" focus, can't we serve both!?
>
> Some areas of search R&D aren't applicable to real world / commercial usage. TREC
> is a perfect example of this. Some open source licenses also prevent commercial participation.
> And I can imagine some testing standards that, while very well thought out and thorough,
> might be impractical to actually use.
It's an open source project at Apache, so anybody who is fine w/ the ASL can participate.
Meaning both academics and commercial companies. Frankly, I don't care much about p@1000,
but p@5 and p@10 are quite interesting, so I tend to be more real world focused, but a healthy
cross-fertilization will be great.
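
For readers new to the notation, p@k is precision over the top k results: the fraction of those k that are judged relevant. A minimal sketch follows, with hypothetical document ids and judgments; dividing by k even when fewer than k results come back is the usual convention, but it is stated here as an assumption.

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top k results that are in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc in ranking[:k] if doc in relevant)
    return hits / k  # missing results count as non-relevant


# Hypothetical ranked list and relevance judgments for one search.
ranking = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"]
relevant = {"d1", "d3", "d8"}

p_at_5 = precision_at_k(ranking, relevant, 5)    # 2 relevant in top 5
p_at_10 = precision_at_k(ranking, relevant, 10)  # 3 relevant in top 10
```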
>
> I really think we can serve both groups, and will get better results for our efforts.
>
> 11: Task appropriateness
>
> There are different tasks that folks might want to use Relevancy Testing Tools for:
> * Engine A vs. Engine B
> * Configuration A vs. Configuration B (same engine)
> * "normal variable" vs. "acceptable" vs. "unacceptable"
> Etc.
>
> We should keep these different use cases in mind.
Indeed, not to preclude others, but I know I'm focused on how to use it for Lucene/Mahout
etc. In other words, the latter two in the list. If other vendors want to participate, that
is great too. All are welcome. Still, it's pretty hard to really do Engine A vs. Engine
B tests in a fair way.
>
> 12: Clusters and Relevancy grading: Do you agree with the following assertion:
>
> If you manually cluster documents by subject, then presumably using one of those documents
> (perhaps a shorter one) as a query, you'd expect it to generally find the other documents in
> that cluster, presumably at a higher relevance than documents from other clusters.
>
> *If* this were true, it suggests some automated testing methods.
>
> I'll say that I don't think it's entirely true, but I think it's one technique to keep
> in mind, for some use cases. This by itself is probably a long discussion.
Maybe. Many clustering algorithms calculate distance much the same way that the engine scores,
so it may just be a case of self-fulfilling prophecy.
>
> 13: Problems with measurements...
>
> Just listing some of the stuff I've been worried about:
> * Individual opinion drift (me before coffee on test # 10 vs. me after 8 cups of coffee
>   on test # 500; if I go back to test 10, would I grade it the same)
> * Tester variance (how closely would 2 coworkers grade the same search against the same
>   docset)
> * Language drift - if I translate both the questions and searches into French, then have
>   a French speaker evaluate the results, how close should I expect them to be?
> * Ordering drift - I've seen this myself, your mind and habits can change as you go through
>   many tests, also sorted vs. unsorted data
>
> And if using clustering as part of your testing:
> * Cluster drifts - cluster started out as "Windows", but as docs are added becomes more
>   about "Windows installation and drivers", etc
> * Cluster spans - a small cluster might be about Microsoft Office applications, but one
>   particular document is about ease of use of Powerpoint 2007 vs. another about problems
>   installing Office on a Mac; but other 3 docs in the cluster gradually span this gamut
> * Cluster split / merge - Windows cluster is now about "Windows installation" vs. "Windows
>   applications"
Reproducibility is paramount. One of the biggest issues w/ these types of evaluations is
the problem of managing the output of the tests and keeping track of them. I imagine we'll
develop tools for that, too.
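
On the tester-variance worry in item 13, one common way to put a number on how closely two graders agree is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. Nobody in this thread proposed it; it is offered here only as a sketch, with a hypothetical function name and made-up letter grades.

```python
from collections import Counter


def cohens_kappa(grades_a, grades_b):
    """Cohen's kappa for two graders labelling the same items."""
    if len(grades_a) != len(grades_b) or not grades_a:
        raise ValueError("need two non-empty grade lists of equal length")
    n = len(grades_a)
    # Observed agreement: share of items given the same grade by both.
    observed = sum(a == b for a, b in zip(grades_a, grades_b)) / n
    # Chance agreement from each grader's own grade distribution.
    count_a = Counter(grades_a)
    count_b = Counter(grades_b)
    expected = sum(count_a[g] * count_b[g] for g in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both graders used one identical grade throughout
    return (observed - expected) / (1 - expected)


# Hypothetical A-F style grades from two coworkers on the same four items.
same = cohens_kappa(["A", "B", "C", "A"], ["A", "B", "C", "A"])
chance = cohens_kappa(["A", "A", "B", "B"], ["A", "B", "A", "B"])
```

A kappa of 1 means perfect agreement and 0 means no better than chance. The same function could compare one grader's first and second pass over the same tests to gauge individual drift.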
>
> 14: Long tail / variability / Poisson issues
>
> Sample A is 1,000 test docs and 100 searches from a particular web site.
> Sample B is another 1,000 test docs and 100 searches (non-overlapping) from the same
> web site, over the same period of time
> Sample C is 10,000 test docs and 1,000 searches (also not overlapping, and from the same
> time-frame / site)
>
> You'll see very common themes in all sets of docs and searches, that will clearly overlap.
>
> However, the long tail quickly gets into 1 item samples, and you'll find the 2 tails
> do not overlap. This has something to do with testing... but is probably a long subject for
> another day.
>
> And since sample C is 10x samples A and B, what variance can be explained simply due
> to that fact? For example, is the tail 10x longer, or maybe sqrt(10) longer? etc.
>
> 15: Participation in ORP
>
> As some of you know we're active in SearchDev.org, plus our newsletter, plus we work
> with Lucid on webinars sometimes. So there's a bunch of ways we could publicize this group,
> when we're ready.
Sure, the more the merrier. Getting the word out is important, as is setting expectations
on what they will find once they arrive.
>
> Any of you Bay Area?
Sometimes
>
> And should we "take on TREC" ?
See other response.
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search