Roll call: Bjoern, Nick[niq] (half here), Ville[scop], Yan, Karl, Yves, Olivier[yod], Dom, Terje[xover] (arrived later)
last meeting: http://www.w3.org/mid/C798D705-8925-11D8-AEFA-000393A63FC8@w3.org
** Agenda 1 - checklink and robots **
[00:45:48:] I understood OT has been testing 3.9.3-dev a bit, what about others?
[00:46:49:] * bjoern_ fwiw, did not test checklink...
[00:47:22:] * yod happy with new feature, with the reservation that I wonder whether it should ignore the robots protocol for non-recursive mode
[00:47:58:] yod: I have a feeling that could be a bit hairy
[00:48:07:] (to implement, that is)
[00:48:34:] scop: because of different UA/ RobotUA classes?
[00:48:58:] * yod would like to know others' gut feeling about that too, beyond the implementation issue
[00:49:10:] yod: yep, might be possible to work around that though by directly accessing RobotRules, dunno
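(A rough sketch of the workaround scop mentions, assuming checklink could use WWW::RobotRules with a plain LWP::UserAgent instead of LWP::RobotUA in non-recursive mode; the flow below is hypothetical, not checklink's actual code, the agent name is illustrative, and @links_in_page stands in for whatever the parser extracted:)

  use strict;
  use LWP::UserAgent;
  use WWW::RobotRules;
  use URI;

  my $ua    = LWP::UserAgent->new(agent => 'W3C-checklink');
  my $rules = WWW::RobotRules->new('W3C-checklink');

  # Hypothetical helper: fetch and parse /robots.txt for a link's host.
  sub load_rules {
      my ($link) = @_;
      my $robots = URI->new($link);
      $robots->path('/robots.txt');
      $robots->query(undef);
      my $res = $ua->get($robots);
      $rules->parse($robots, $res->content) if $res->is_success;
  }

  # Non-recursive mode: the document itself is fetched unconditionally...
  my $doc     = 'http://example.org/page.html';
  my $doc_res = $ua->get($doc);

  # ...but robot rules are still honoured for the links found inside it.
  my @links_in_page = ('http://example.com/a', 'http://example.net/b');
  for my $link (@links_in_page) {
      load_rules($link);
      next unless $rules->allowed($link);   # skip what a robot may not visit
      my $res = $ua->head($link);
      # record $res->code for the report
  }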
[00:49:45:] * bjoern_ thinks that link checkers should ignore at least Disallow: *...
[00:50:29:] nope
[00:50:36:] it's after all just a HEAD and following robots.txt makes link checkers less useful
[00:50:59:] niq?
[00:51:15:] should display "forbidden by robot rules" with a link to a howto describing how to change that to allow the link checker
[00:51:57:] that'll work with the default implementation of RobotRules
[00:52:03:] I think in non-recursive mode, the linkchecker is hardly a robot
[00:52:09:] it's merely a browser
[00:52:27:] indeed
[00:52:29:] it is. And it falls straight into ban-me traps
[00:52:50:] and it subjects webservers to rapid-fire
[00:53:01:] how so?
[00:53:13:] hmm... actually, what I mean is a bit more precise: the link checker should not fail when the primary URI is excluded by robots rules
[00:53:28:] that too
[00:53:30:] ... only when checked URIs inside the page falls down onto these rules
[00:53:53:] * niq thinks it should
[00:54:12:] I would respect robots.txt... if someone puts a robots.txt with Disallow, it's because they have reasons for that; the same person in charge also has the possibility to tweak the configuration to let the link checker through if needed, via the User-Agent string for example
[00:54:17:] otherwise it's open to various attacks, like pointing it at a bad-crawler-trap page directly
[00:54:31:] well, as an author, if I want my links checked and the link checker says I should test manually, I would open the link in my browser, which results in much more traffic than the link checker causes (HEAD vs GET, style sheets, images, ...)
[00:54:34:] karlcow++
[00:55:08:] niq/karlcow++
[00:55:27:] bjoern_: as author is one thing, but an online robot can be pointed at a third-party webserver, including in a malicious attack
[00:56:13:] http://qa-dev.w3.org/wlc/checklink?uri=http%3A%2F%2Fkoti.welho.com%2Fvskytta%2Ft.html
[00:56:47:] what's missing is the link to a howto describing how to allow the link checker to access the site
[00:57:05:] yep
[00:57:12:] well, people usually edit robots.txt once and for all and use User-Agent: *
[00:57:30:] I want to validate external links (internal ones never break) and I cannot change the robots.txt of a foreign server.
[00:57:34:] scop++
[00:57:45:] I would use my own link checker that does not honor robots.txt instead
[00:58:21:] fine. so that can fall straight into a ban-me tarpit and start generating 403s on every page
[00:58:23:] yes bjoern_: but you can't force people if they don't want it. People really have the choice to allow it or not.
[00:58:24:] * yod agrees at least with Dom's point about not stopping when the (checked) page is disallowed
[00:58:31:] And typically you use robots.txt for things you don't want to show up on search engines...
[00:59:23:] * yod would like to make a distinction on recursive/nonrecursive
[00:59:36:] I don't think there is any disagreement with recursive mode, is there?
[00:59:45:] (bjoern?)
[00:59:50:] We could limit the number of pages/host to prevent malicious use
[01:00:42:] even one page could get the checker banned automatically from a site
[01:01:41:] # of pages/host is not too much different from "full" robots.txt "compliance", it also produces unsatisfactory results from the author's POV
[01:02:05:] http://www.robotstxt.org/wc/exclusion.html#robotstxt
[01:02:43:] btw, fwiw, LWP does not support the "revised internet-draft" version of the spec
[01:03:10:] ok the "spec" is clear
[01:03:23:] it's for all robots
[01:03:32:] it's for retrieved documents
[01:03:41:] no mention of indexing.
[01:03:46:] HEAD is not retrieval
[01:04:08:] [[ Robots are often used for maintenance and indexing purposes, by ]]
[01:04:25:] yep
[01:04:33:] maintenance ;) for example
[01:04:42:] HEAD retrieves meta-information, so it is partly retrieval
[01:04:49:] the "spec" talks about "visiting"
[01:04:59:] it says " Note that these instructions apply to any HTTP method on a URL."
[01:05:00:] GET retrieves data and metadata, so not only the content
[01:05:10:] * yod waiting in a corner for the spec bashing to start
[01:05:19:] ahaha
[01:05:33:] It's only a draft...
[01:05:34:] :)
[01:06:01:] if only crawlers were willing to start using OPTIONS * :)
[01:06:25:] yeah...
[01:06:40:] and means in web servers to configure OPTIONS...
[01:06:49:] * yod thinks... that we need to find a way to make checklink behave, and that robots.txt is one such mechanism
[01:06:59:] * yod would be happy with :
[01:07:03:] s/one/the/
[01:07:47:] 1 - inviting people to be nicer to checklink in their robots.txt
[01:08:15:] 2 - an option (not available in recursive mode?) to ignore the protocol
[01:08:31:] 2--
[01:08:43:] with default to follow it and a note on responsibility + other "behave mechanisms" (timer?)
[01:09:12:] the trouble is, any such option is an open invitation to the malicious
[01:09:24:] there's already the 1 sec delay, not bound to robots.txt as such
[01:09:59:] yeah I was thinking of increasing the delay when not following robots.txt
[01:10:07:] tell that to a slashdotted site
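(For reference, a minimal sketch of how that delay is handled with LWP::RobotUA; delay() is expressed in minutes, and the contact address and figures are illustrative, not checklink's actual settings:)

  use LWP::RobotUA;

  # RobotUA honours /robots.txt and sleeps between requests to the same host.
  my $ua = LWP::RobotUA->new('W3C-checklink', 'webmaster@example.org');
  $ua->delay(1/60);    # 1 second between requests (delay is in minutes)
  # a mode that ignored robots.txt could compensate with a longer delay, e.g.
  # $ua->delay(5/60);  # 5 seconds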
[01:11:58:] is recursive mode limited to the host of the original page uri?
[01:12:07:] For the malicious ones... Checklink is an open-source Perl program... a truly malicious geek will reactivate whatever he wants anyway. So I think the options can be minimal.
[01:12:22:] bjoern_: host + base uri
[01:12:44:] so it follows only "internal" links?
[01:12:58:] malicious ones don't need that to do a DoS
[01:13:02:] no, the restriction is for *documents*, not links
[01:13:28:] bet more on a user fumbling with a config than someone wanting to do evil things
[01:13:32:] I mean, if I have a link on x.org to www.w3.org, would it follow links on www.w3.org?
[01:14:04:] depends on definition of "follow", but yes, it would do the "link checking" on them, ie HEAD
[01:14:49:] why?
[01:14:55:] * dom__ wonders how a bot is supposed to react when robots.txt is forbidden from being visited by the robots.txt file itself
[01:15:09:] * dom__ knows he's looking for trouble :)
[01:15:41:] dom__: http://www.robotstxt.org/wc/norobots-rfc.html section 3.1
[01:15:44:] The bot would be ashamed and hide in the corner of the server...
[01:16:18:] * xover arrives...
[01:16:28:] dom: and you can make it forget this using a Cache-Control: no-cache, no-store
[01:16:28:] what do we do with ?
[01:16:45:] bjoern_: why what? /me lost...
[01:17:09:] Why it would check links on foreign sites in recursive mode
[01:17:49:] well, it is a link checker? note, that is not the same as recursing offsite
[01:17:59:] unhandled ATM
[01:18:42:] oops, misread, it does not check links *on* foreign sites. it does check links *to* foreign sites
[01:20:19:] ok
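(To spell that distinction out: a hypothetical sketch of the recursion test as just described, with documents restricted to the base URI while links are HEAD-checked wherever they point; not actual checklink code:)

  use URI;

  # Recurse into a document only if it lives under the starting base URI...
  sub should_recurse {
      my ($link, $base) = @_;
      my $l = URI->new($link)->canonical->as_string;
      my $b = URI->new($base)->canonical->as_string;
      return index($l, $b) == 0;
  }

  # ...while every link, foreign hosts included, still gets a plain HEAD check.
  # E.g. with base http://www.x.org/dir/ :
  #   http://www.x.org/dir/sub/page   -> HEAD + recurse
  #   http://www.w3.org/Overview.html -> HEAD only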
[01:22:19:] well, we don't seem to have an agreement on that
[01:22:45:] what do we do then?
[01:22:54:] launch 3.9.3 beta
[01:22:59:] get feedback
[01:23:03:] decide what to do
[01:23:13:] yod++
[01:23:18:] (I think the current behavior is fine, although I'd prefer the way I proposed)
[01:24:08:] I think this discussion had interesting points, will re-use that to steer feedback when we go to beta
[01:24:30:] speaking of which, please play with the instance on qa-dev, as well as the markup validator there
[01:24:43:] they have the latest lwp, which we need to try
[01:24:59:] [closing this item]
(later)
[02:22:29:] btw, first cut at documenting the /robots.txt stuff for checklink up @ http://qa-dev.w3.org/wlc/checklink?uri=http%3A%2F%2Fkoti.welho.com%2Fvskytta%2Ft.html
[02:22:36:] wording improvements welcome
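(For the howto wording, the usual recipe is an extra robots.txt record for the checker's own User-Agent; 'W3C-checklink' is assumed here to match the agent token checklink actually sends, and /private/ is just an example path:)

  User-agent: W3C-checklink
  Disallow:

  User-agent: *
  Disallow: /private/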
** Agenda 2 - CSS validator - progress and priorities **
[01:25:58:] dodji updated libcroco CVS, I am going to have a look at that
[01:26:07:] no progress on css schema
[01:26:28:] ok, so I recently closed some issues, partly by fixing the grammar (which is really thin) and by upgrading javaCC
[01:27:06:] would be nice to have a test suite (that can act as a regression TS as well)
[01:27:24:] __Yves, I can look at the bugs and prioritize them to some extent
[01:27:29:] also a list of "needs to be fixed in priority" would be nice :)
[01:27:52:] bjoern_: well, what may have a high priority for me might not have the same for others
[01:28:18:] so guidance from users and people interacting with users is welcome :)
[01:29:02:] Well, I would probably give those highest priority which most users complained about...
[01:29:16:] btw http://www.w3.org/Bugs/Public/buglist.cgi?product=CSSValidator
[01:29:40:] yep, saw this, I found the .not bug there (and fixed it)
[01:29:59:] P2 quite crowded
[01:30:50:] * scop notes that P2 is the default in Bugzilla
[01:31:07:] yeah and P1 is for a URI that has moved...
[01:31:24:] http://www.w3.org/Bugs/Public/show_bug.cgi?id=337
[01:31:40:] It should probably be closed as invalid
[01:32:12:] yes
[01:32:23:] so only P2 bugs remain
[01:33:04:] (if the mime type is good, there is no reason it wouldn't work, regardless of the URI)
[01:33:05:] there is a P5 http://www.w3.org/Bugs/Public/show_bug.cgi?id=399 which should probably have higher priority
[01:33:08:] so ACTION: bjoern to modify priorities in CSSValidator's bugzilla
[01:33:23:] and ACTION: Yves to fix bugs
[01:33:24:] :)
[01:33:32:] yeah :)
[01:33:34:] what do we do re test suite?
[01:33:42:] ACTION yod to start a test suite :)
[01:33:47:] !!!
[01:34:06:] I have not touched test suites for a while, my bad
[01:34:16:] bjoern : I have a set of files used to test some bugs, they can be used for regression testing, but not more
[01:34:29:] Yves: send that list to me
[01:34:29:] and we perhaps need more than that (from regular stuff to corner cases)
[01:34:37:] and this works also for the markup validator
[01:34:51:] I'll try to work on that within the next 2 weeks
[01:34:51:] (including weird encoding corner cases)
[01:35:07:] I also have a number of test pages/style sheets, a number of them linked from bugzilla...
[01:35:07:] yod: remind me so that I won't forget
[01:35:18:] I will...
[01:35:35:] ACTION: Yves send olivier list of "test" cases URI for the CSS validator
[01:35:48:] now I know I will remind you
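(A possible starting point for the regression suite, as a rough Perl sketch; the validator endpoint, its 'uri' parameter, the testcases.txt manifest and the baseline directory are all assumptions to adjust against the real interface:)

  #!/usr/bin/perl -w
  use strict;
  use LWP::UserAgent;
  use URI::Escape qw(uri_escape);

  # Assumed endpoint; point it at a local or qa-dev instance instead.
  my $validator = 'http://jigsaw.w3.org/css-validator/validator?uri=';
  my $baseline  = 'baseline';   # directory of previously blessed results
  my $ua = LWP::UserAgent->new(agent => 'css-validator-regression');

  open my $manifest, '<', 'testcases.txt' or die "testcases.txt: $!";
  while (my $case = <$manifest>) {          # one test case URI per line
      chomp $case;
      next unless $case;
      my $res = $ua->get($validator . uri_escape($case));
      my $out = $res->is_success ? $res->content
                                 : 'FETCH FAILED: ' . $res->status_line;
      (my $name = $case) =~ s![/:?&=]+!_!g; # crude but stable file name
      if (open my $old, '<', "$baseline/$name") {
          local $/;
          print "REGRESSION? $case\n" if $out ne <$old>;
      } else {
          print "no baseline yet for $case\n";
      }
  }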
[01:36:12:] on a related (to the CSS validator) note, the spanish office is motivated to handle translation of interfaces and errors
[01:36:14:] s/Yves/Yves and Bjoern/
[01:36:38:] I will (tomorrow I think) work on a plan for translations and maintenance thereof
[01:37:16:] anything else on the CSS validator?
[01:37:30:] should be straightforward for the css validator
[01:37:42:] bjoern_: I think so
[01:37:45:] I would like information from sijtsche/plh/whoever on how much CSS3 is supposed to be implemented
[01:37:49:] that should be it (note that with the new JavaCC, performance improved)
[01:38:02:] yeah, so do I, and information on support for other profiles
[01:38:29:] bjoern_: would you like to start a mail thread about it on qa-dev?
[01:38:36:] There are lots of things i am not sure about whether they are unimplemented or broken...
[01:38:42:] or w-v-c if you prefer
[01:39:18:] I would prefer if you send them a mail to summarize what's implemented / what they implemented / something like that
[01:39:30:] cc'ing w-v-c/qa-dev
[01:39:31:] fine
[01:39:33:] I will
[01:40:07:] oh, and probably w3c-css-wg
[01:40:10:] ACTION: olivier contact PLH/Sijtsche and ask them what is implemented / to what extent (esp. CSS3)
[01:40:58:] (btw, Bert has an ongoing action item to make sure CSS 2.1 is supported in the css validator...)
[01:40:41:] [closing item]
** Agenda 3 - Markup Validator **
[01:42:22:] Markup Validator : not much feedback on 0.6.5b2, beyond style issues
[01:42:41:] bjoern has been leading some interesting discussions
[01:42:58:] without much luck, as I expected...
[01:43:15:] there were answers... from the usual suspects
[01:43:45:] There wasn't much feedback on previous betas either (not considering my comments), it seems we have a general feedback issue
[01:44:08:] Well, this beta was pretty much low profile
[01:44:19:] compared to others, which were announced much more broadly
[01:44:47:] which did not yield much feedback either
[01:44:55:] add a with a