Why? It is statistically more correct. The top 10,000 websites are a biased sample, and the distribution of languages (or whatever statistic you are interested in) among them most probably differs from the distribution across the whole Internet. If you choose a large enough number of websites uniformly at random, you get a sample that is close to the distribution of the Internet.

It would be interesting to see those scores alongside yours.

If you say "choosing websites randomly is statistically more correct", then I must say it's not as simple as that. If we were counting people speaking languages, I would completely agree with you. However, counting websites is different. Netcraft estimates that of the 1.8 billion websites, only 10% are "active", and even that definition of "active" is very generous, as it doesn't require any actual activity. Why is that? Because it is very easy to register a website and have a default page show up on it. People do this to reserve domain names, to participate in black hat SEO schemes, or for a number of other reasons. Depending on your definition of "website", you can create millions of sites with a mouse click. Now, if those sites happened to declare Neapolitan as their language, our statistics would be completely biased for no good reason at all.

While your approach is theoretically correct, it doesn't make sense when 90% or more of what you are measuring is garbage. Our way of filtering out that garbage is to require a minimum amount of traffic on the sites, as measured by the Alexa ranking. And, by the way, we use the top 10 million sites, not the top 10k. We are convinced that this leads to much more useful statistics.
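To make the point concrete, here is a toy simulation of the two sampling strategies. All numbers and language labels are illustrative assumptions (a scaled-down population, a 10% active share per the Netcraft figure above, and a made-up rule that every parked site carries the same placeholder language); it is a sketch of the argument, not real measurement:

```python
import random

random.seed(0)

# Toy population: 1,000,000 registered sites, only 10% "active".
# Assumption for illustration: parked sites all show one registrar
# placeholder language, active sites split between two real languages.
N = 1_000_000
ACTIVE_SHARE = 0.10

def site_language(i):
    # Indices below the active cutoff stand in for real, trafficked sites.
    if i < N * ACTIVE_SHARE:
        return "english" if i % 2 == 0 else "german"
    return "parked-default"  # registrar placeholder page

# Strategy 1: uniform random sample over ALL registered sites.
uniform = [site_language(random.randrange(N)) for _ in range(100_000)]
parked_share = uniform.count("parked-default") / len(uniform)
print(f"placeholder pages in uniform sample: {parked_share:.1%}")

# Strategy 2: sample only sites that pass a traffic filter
# (modelled here as restricting to the active portion).
filtered = [site_language(random.randrange(int(N * ACTIVE_SHARE)))
            for _ in range(100_000)]
filtered_share = filtered.count("parked-default") / len(filtered)
print(f"placeholder pages in traffic-filtered sample: {filtered_share:.1%}")
```

The uniform sample is dominated by placeholder pages (roughly the 90% inactive share), so whatever language they happen to declare swamps the statistic, while the traffic-filtered sample measures only sites someone actually visits.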