-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Ville Skyttä <ville.skytta@iki.fi> wrote:
>It's a bit hard to find the interesting entries since validator is quite
>an errorlog-trasher (still, even though I managed to get some of the
>noisiest bugs fixed for 0.6.6).
And we should probably make an effort to reduce this problem even further
fairly quickly.
><http://www.w3.org/TR/query-semantics/> (~1.6MB) [170MB]
><http://www.w3.org/TR/2003/WD-xsl11-20031217/> (~1.8MB) [107MB]
><http://www.go-mono.com/[…].Windows.Forms.html> (~0.9MB) [141MB]
>
>Normal, smallish validation cases seem to take 10MB or so per
>"check" process on my box, so 100+ MB is pretty much... ideas?
Process size will balloon with input document size (and hence complexity)
since each element has a gazillion attributes that will show up in the ESIS
whether they're in the physical markup or not. A normal document has a fairly
low markup:content ratio; the cited documents have an inordinate amount of markup
compared to the amount of data in them.
Well, or at least that's my theory. :-)
BTW, Björn has (on IRC) just suggested some optimizations that can be used to
avoid some of this overhead in a number of cases. I'll have a look at whether
that can reasonably be done for 0.6.7. The bug on this has been targeted for
0.7 IIRC.
>Running "top" on v.w.o suggests that it seems to kill the "check"
>process once its footprint reaches 100MB when validating any of the
>above URLs. I did not see any related configuration or limits in
>httpd.conf, and the box does not run out of memory or anything.
Which means these are probably either Apache compile-time limits or Debian
kernel ulimits.
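One way to narrow that down would be to print the resource limits from inside a process started the same way "check" is (e.g. a throwaway CGI under the same Apache). A minimal sketch, assuming Python with its standard "resource" module is available on the box; the helper names are made up for illustration and nothing here comes from the validator itself:

```python
import resource

# Limits that could plausibly cap a process's memory footprint.
# Not every platform defines all of them, hence the hasattr() guard below.
CANDIDATES = ("RLIMIT_AS", "RLIMIT_DATA", "RLIMIT_RSS")


def describe(limit):
    """Render a raw rlimit value as megabytes, or 'unlimited'."""
    if limit == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{limit / 2**20:.0f} MB"


def memory_limits():
    """Return {limit name: (soft, hard)} for the current process."""
    limits = {}
    for name in CANDIDATES:
        if hasattr(resource, name):
            soft, hard = resource.getrlimit(getattr(resource, name))
            limits[name] = (describe(soft), describe(hard))
    return limits


if __name__ == "__main__":
    for name, (soft, hard) in memory_limits().items():
        print(f"{name}: soft={soft} hard={hard}")
```

If this reports a limit of about 100 MB when run under the web server but "unlimited" from a login shell, the cap is being applied somewhere in the server's startup environment rather than by the kernel defaults.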
>There is also one 500 from what is apparently caused by someone
>repeatedly (7ish times) clicking the referer badge in the lower right
>hand corner of the results page after having validated a pretty large
>document with show source and show parse tree options on, causing
>obviously pretty heavy recursion and a URL with a length of about 2k...
>any ideas how we could prevent this?
Look for the User-Agent or a similar distinguishing characteristic of the
incoming request, and if it's ourselves we append an extra token ("recursive")
to our User-Agent string. If a request comes in with "recursive" we throw a
fatal error. Add in a configurable permitted recursion level perhaps...
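Roughly, the User-Agent scheme could look like the sketch below. This is Python for illustration only, not the validator's own code; BASE_AGENT, MAX_DEPTH, the "recursive=N" token format and the function names are all assumptions made up for the example:

```python
import re

# Placeholder agent string; the real validator's User-Agent may differ.
BASE_AGENT = "ExampleValidator/0.0"
# Configurable permitted recursion level (1 = may validate itself once).
MAX_DEPTH = 1

_TOKEN = re.compile(r"\brecursive=(\d+)")


class RecursionLimitError(Exception):
    """Raised when a request has already passed through us too often."""


def incoming_depth(user_agent):
    """Return the recursion level advertised by the request, 0 if none."""
    match = _TOKEN.search(user_agent or "")
    return int(match.group(1)) if match else 0


def outgoing_agent(incoming_user_agent):
    """Pick the User-Agent for our own fetch, given the incoming one."""
    if BASE_AGENT not in (incoming_user_agent or ""):
        # Request came from a browser or other client: no tagging needed.
        return BASE_AGENT
    depth = incoming_depth(incoming_user_agent)
    if depth >= MAX_DEPTH:
        # Request came from ourselves and is already tagged: fatal error.
        raise RecursionLimitError(f"recursion level {depth} not permitted")
    # Request came from ourselves: append the token to our own string.
    return f"{BASE_AGENT} recursive={depth + 1}"
```

With MAX_DEPTH set to 1, a browser request is fetched with the plain agent string, a request from the validator itself is fetched with "recursive=1" appended, and a request that already carries "recursive=1" is refused.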
- --
"Temper Temper! Mr. Dre? Mr. NWA? Mr. AK, comin´
straight outta Compton and y'all better make way?" -- eminem
-----BEGIN PGP SIGNATURE-----
Version: PGP SDK 3.0.3
iQA/AwUBQK/gHqPyPrIkdfXsEQLL5QCg1HJZgRVZhZtOEDaQ1B1Qwkrf4F0An3U4
SWGhS3bzWDuWdgTEBRlHLNo7
=6FgA
-----END PGP SIGNATURE-----