> On 16 Aug 2007, at 04:40, Robert Burns wrote:
>
>> A scientific approach would involve several things. It would be
>> conducted with a goal to retrieve unbiased data. That means giving
>> every HTML document an equal probability of selection.
FWIW it is seldom possible to select unbiased data in "scientific studies".
Consider astronomy, for example, where surveys of particular types of object are
typically limited by the sensitivity of the instrument being used (so you can
only detect faint things if they are nearby), by the angular resolution
available (so you cannot distinguish objects that are close together on the sky)
and a variety of other factors which depend on what you are trying to measure.
Nevertheless it is possible to make progress in our understanding of the
universe through analysis of astronomical surveys. All that's required is care
in interpreting the data.
>> Genuine scientific statistical research also lays out methodology and
>> is reproducible.
It is actually often surprisingly difficult to reproduce scientific studies. For
example, it is considered perfectly permissible to publish scientific studies
based on closed-source code running on proprietary hardware. However, there is
generally enough methodology documented that someone could in principle
reproduce the results by making a similar study with their own code on their own
system.
>> From a scientific perspective, saying I searched a
>> cache that I have, that you can't search and I won't even show you the
>> code that produces that cache, would be the same as me saying the
>> following. "I have this 8-ball and when I ask it if we should drop
>> @usamap from |input| it tells me 'not likely'. You may say that sure,
>> 8-balls say that But the odd part is that it says that every time [cue
>> eerie music]." :-) The point though is that it can't be reproducible
>> at all if its all based on hidden data and methods.
It's neither based on hidden data nor a hidden method. The data is all publicly
accessible webpages. The methodology is: a) spider the webpages, b) run the
parsing algorithm in the HTML 5 spec over the resulting files, and c) extract
whatever data is of interest. That seems in principle pretty straightforward to
me, and at least as reproducible as many peer-reviewed scientific studies. Indeed
Phillip Taylor has already managed to reproduce the procedure on a smaller
dataset and thus independently verified many of Hixie's results.
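
To make the point concrete, here is a minimal sketch of steps (b) and (c),
assuming the data of interest is attribute usage. A real run would use a parser
that actually implements the HTML 5 parsing algorithm (html5lib, say); the
standard library's HTMLParser stands in here only to keep the sketch
self-contained, and the sample page is made up.

```python
# Sketch: tally attribute names across a set of already-fetched HTML pages.
# Step (a), spidering, is assumed to have produced the `documents` list.
from collections import Counter
from html.parser import HTMLParser

class AttributeCounter(HTMLParser):
    """Collects attribute-name counts as the parser walks each start tag."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # Step (c): extract whatever data is of interest --
        # here, just the attribute names on each element.
        self.counts.update(name for name, _ in attrs)

def attribute_counts(documents):
    # Step (b): parse each document and accumulate the counts.
    counter = AttributeCounter()
    for doc in documents:
        counter.feed(doc)
    return counter.counts

pages = ['<p class="intro">Hello <img src="x" alt="y"></p>']
print(attribute_counts(pages))
```

Anyone could run the equivalent over their own crawl and compare tallies, which
is the sense in which the study is reproducible.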
--
"Eternity's a terrible thought. I mean, where's it all going to end?"
-- Tom Stoppard, Rosencrantz and Guildenstern are Dead