GISS Interruptus

GISS does not provide a coherent archive of station data (such as is available at GHCN). As reported before, Nicholas developed a technique for downloading station data within R. Downloading the entire data set (which takes a minute or so from GHCN on a high-speed network) is laborious but automatic. I ran this in the background and after about 8 hours had downloaded half of one version. In the middle of this, the program stopped working, and it was hard to figure out why. When I went back and tried to do things line by line, I found that it wasn't reading. Now, there had been a few missing records which caused my read program to fail and to require restarting at the next record (this could be fixed, but it seemed just as easy to restart if it didn't happen too often). After a while, I checked some records that I'd already downloaded and these failed too. I wrote to the GISS webmaster wondering about the 403 diagnostic.
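For readers who want to try this sort of bulk download themselves, here is a minimal sketch (in Python rather than R, purely for illustration). The URL pattern and script name are hypothetical placeholders, not the actual GISS addresses; the point is the handling of a missing record, which should be skipped rather than aborting the whole run, and of a 403 block, which is fatal.

```python
import time
import urllib.request
import urllib.error

# Hypothetical URL pattern -- the real GISTEMP pages are generated by a
# CGI script whose exact query format is not reproduced here.
BASE = "http://data.giss.nasa.gov/cgi-bin/gistemp/station.py?id={}"

def fetch_station(station_id, retries=2):
    """Fetch one station record; skip missing records instead of aborting."""
    url = BASE.format(station_id)
    for attempt in range(retries + 1):
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            if err.code == 404:          # missing record: skip this station
                return None
            if err.code == 403:          # blocked by the server: stop entirely
                raise
            time.sleep(5 * (attempt + 1))  # transient error: back off, retry
    return None
```

A wrapper like this is what lets an overnight run survive the occasional missing station instead of dying and needing a manual restart.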

Robert Schmunk promptly replied that my attempting to “scrape” data from their website constituted an “obvious and blatant violation” of their rules as set out in their robots.txt directory and they had accordingly blocked access.

Steve,

Although you did not provide any further details about your problem, I will assume that you are the person on the cable.rogers.com network who has been running a robot for the past several hours trying to scrape GISTEMP station data and who has made over 16000 (!) requests to the data.giss.nasa.gov website.

Please note that the robots.txt file on that website includes a list of directories which any legitimate web robot is _forbidden_ from trying to index. That list of off-limits directories includes the /work/ and /cgi-bin/ directories.

Because the robot running on the cable.rogers.com network has rather obviously and blatantly violated those rules, I placed a block on our server restricting its access to the server.

If you are indeed the person who has been running that particular web robot, and if you do need access to some large amount of the GISTEMP station data for a scientific purpose, then you should contact the GISTEMP research group to explain your needs. E-mail addresses for the GISTEMP research group are located at the bottom of the page at http://data.giss.nasa.gov/gistemp/

rbs

I wrote back to Schmunk stating that I was not using a “web robot” but was downloading data for legitimate scientific purposes, as follows:

I have been attempting to collate station data for scientific purposes. I have not been running a robot but have been running a program in R that collects station data.

However, even after confirming that this was not a web robot and that the data access was for scientific purposes, NASA GISS did not remove the block (which applies to many webpages besides the GISTEMP data that I was downloading).

I wrote to Reto Ruedy of NASA GISS this morning as follows:

Dear Dr Ruedy, I have been unable to locate an organized file of station data as used by GISS (such as is available from GHCN). In the absence of such a file, I attempted to download data on all the stations using a script in R. This was laborious as it required multiple calls. I was not using a “web robot” nor was I indexing files. During the course of this, your webmaster blocked my access to the site claiming that downloading the data in this fashion violated your policies. Would you please either restore my access to the site or provide me with an alternative method of downloading the entire data set of station data in current use. Thank you for your attention, Steve McIntyre

We’ll see what happens.

UPDATE: After a series of emails, GISS agreed to allow me to continue operating my download program exactly as I'd been doing it (after hours). I originally posted the correspondence in the comments below; here it is collected. I sent a further email to Schmunk noting that I was blocked from the pages identifying the email addresses of contact persons.

I am blocked from access to the page where the email addresses are located.

How can I download the data then?

The GISS webmaster replied:

Good point. That was foolish of me to suggest checking a page on which access had been turned off. I have turned off the restriction that I added to the server on data.giss.nasa.gov last night so that you can access the GISTEMP page and view the contact information.

> I have been attempting to collate station data for scientific purposes. I have not been running a robot but have been running a program in R that collects station data.

It is an automated process scraping content from the website, and if that isn’t what a web robot does, then it’s close enough.

The only "notice" of the supposed policy is their robots.txt file. Google, which can surely be regarded as authoritative on web robots, discusses the function of robots.txt files as follows:

A robots.txt file provides restrictions to search engine robots (known as “bots”) that crawl the web. These bots are automated, and before they access pages of a site, they check to see if a robots.txt file exists that prevents them from accessing certain pages.

My program was obviously not “crawling the web”, but was downloading specific data from GISS.
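For what it's worth, the check that a "legitimate" robot is expected to perform can be sketched in a few lines; Python's standard library even ships a parser for the file. The Disallow rules below mirror the two directories the webmaster cited (/work/ and /cgi-bin/); the station path is an illustrative guess. `parse()` accepts the file's lines directly, so this needs no network access.

```python
from urllib import robotparser

# Rules as cited by the webmaster: /work/ and /cgi-bin/ are off-limits
# to any robot. Everything else here is illustrative.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /work/",
    "Disallow: /cgi-bin/",
])

print(rp.can_fetch("*", "/cgi-bin/gistemp/station.py"))  # False: off-limits
print(rp.can_fetch("*", "/gistemp/"))                    # True: allowed
```

The irony is that robots.txt is a voluntary convention aimed at crawlers, not a stated access policy for human-directed downloads.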

The webmaster also said separately:

Please contact the GISTEMP group and inquire if they are willing to provide you with the dataset(s) from which the website applications extract information.

If they are not (I do not know what their current policy is on this), then you can go a step closer to the source and obtain station data from the same location that the GISTEMP group obtains the original "raw" datasets that they work from. That is the Global Historical Climatology Network at http://www.ncdc.noaa.gov/oa/climate/ghcn-monthly/index.php

I’m not sure which specific files from the GHCN site are used. But if the complete GISTEMP data are not available then perhaps Dr. Ruedy of the GISTEMP group could give you some tips on how to use the GHCN data.

I had already contacted the GISTEMP group. I replied that I wasn’t interested in “tips” on how to access GHCN data and re-iterated my request (copy to Ruedy):

I know how to use the GHCN data. I’m not interested in “tips” on how to use it.

I’m interested in the versions as used by GISS. The GHCN version is convenient to download and I see no reason why GISS versions should not be available on equivalent terms.

In response to my initial request to Ruedy about access to the data, I received the following response: the station files were "scratch pads", he asked why I needed the data, and he undertook to "try" to provide the necessary data (which was already online).

Dear Steve,

Our main focus working with observed data is creating a gridded set of temperature anomalies which gives reasonable means over comparatively large regions – the global mean average being one of the major goals. If you are interested in individual stations, you are much better off working directly with the GHCN data.

Our station data are really intermediate steps to obtain a global anomaly map, and are not to be viewed as an end result. A modified time series for a particular location may be more representative for the surrounding region than for that particular location. So it is important to use these data in the proper context.

All our publications and investigations deal with regional temperature anomalies and that is the only use these data are good for after our modifications.

If you still think that downloading our “scratch pads” is important to your investigations, please let me know exactly what stage after the raw GHCN data you need and maybe an indication why you need it, and I’ll try to provide you with the necessary data.

Again, we are not trying to compete with GHCN as provider of station data; we are using their data for a very specific project and we made – perhaps unwisely – some of our tools that we used to test the various steps of our process available on the web.

Reto

I promptly responded that I was interested in the data as it was available to the public, and asked for a copy of the program by which they generated their data from GHCN data (on the basis that the size of the file could not be held to be a consideration for this request):

Dear Reto, in that case, could you provide me with copies of the programs that you use on the GHCN data so that I can replicate these calculations for myself? Thanks, Steve McIntyre

In answer to your question, I’m interested in the data as it is presented to the public. All I was doing was downloading the data that is supposedly available to the public, but in a way that would not take 4 weeks of manual labor. If your version differs from the GHCN version, I’m interested in downloading your version so that I can assess the differences.

Later in the day, instead of providing a coherent file of the data or the source code, Ruedy said that I could continue downloading the data in the way that I had commenced, asking me to do so after hours or on weekends. (My original download was being done after hours and was interrupted at 11.30 at night; so they were asking me to observe a condition that had not been a problem in my initial download attempt.)

After a short meeting with Dr. Hansen, we were advised to let you download whatever you want as long as generally accepted protocols are observed. Please try to do so at a time that does not impact other users, i.e. late nights, weekends.

What we did with the GHCN data is carefully documented in the publications listed on our website. We are not creating an alternate version of the GHCN data, we are mainly combining their data in various steps to create our anomaly maps.

Sincerely,

Reto A. Ruedy

I replied to Ruedy thanking him for this and politely re-iterating my request for code:

Thank you for this. I will observe this condition.
I realize that you have provided some documentation of what you did. In econometrics, it is a condition of publication in journals that authors archive their code and data so that their results can be routinely replicated. I realize that no such standards apply to climate science. However, equally, there is no prohibition on individual climate scientists voluntarily adopting these best practice standards. In that spirit, I would appreciate it if I could inspect the code used to process the GHCN data. Thanks, Steve McIntyre

Ruedy replied not entirely cordially:

The block has been lifted as far as I know. As far as I’m concerned, this is the end of our correspondence.

I resumed downloading after hours and have finished one data version. While I was downloading, I tested browser access to the GISS site to see if the R-downloading program interfered with access to the GISS site by others and invited any readers online to verify this. I experienced no service degradation whatever when I attempted browser access simultaneous with R download, nor did another reader who tested it simultaneously.

I have no particular objection to the webmaster blocking access until he was assured that the inquiry was legitimate or even that the webmaster referred the matter to his bosses. I also have no objection to how long GISS took to remove the block. If all climate data access issues were resolved this quickly, it would be great. Reasonable people can differ about whether they would have been so responsive in the absence of blog publicity. I happen to think that the publicity to the issue facilitated resolution of the matter, but I can’t prove that they wouldn’t have resolved it anyway. On the other hand, I don’t think that any of my actions were unreasonable.

157 Comments

A CA reader had already downloaded the data, spreading his download over 36 hours. So I’ll get the data one way or another. However, I expected NASA GISS to remove the block once they got notice that the data was not being accessed by a web robot but for scientific research. The block is also much wider than the data set already downloaded.

Isn't the language interesting: my attempt to get data is described as "scraping" data. Perhaps that's a term of art in websites, but it seems an odd choice of word. Also to describe this activity as a "blatant" violation of their robots.txt policy. Imagine that – me "blatantly" violating a robots.txt policy. Sorrrrr-eeee. Will I have to write "I promise to be nice to Gavin Schmidt" 100 times on the blackboard?

It appears that your R program indeed constitutes a "robot" as that term is used in the context described.

Also, the files, such as the "monthly data as text" files, that you download are not stored as "permanent" files at GISS; they are generated dynamically in response to a request for them. So, your program is not only loading their system with thousands of downloads, but also with thousands of executions of programs to generate the files to be downloaded.

Perhaps you might request copies of their programs which generate those files from the GHCN and USHCN input data.

There is the separate problem of their USHCN input data being a custom version of USHCN adjusted data which does not include "adjustments" for missing data.

While you may not think of it that way, you were indeed running a web robot. The problem with webbots and CGI programs is that even 10 – 20 simultaneous CGI requests can bring a web server to its knees. It is also how denial-of-service attacks are launched.

Remember the time CA disappeared from Google after you had disabled bots? In that case, Googlebot was reading and obeying the directions in the robots.txt file for the site. Your automated retrieval was not. I would be surprised if NASA’s web admins were not required by security policy to block access to bots which do not obey robots.txt. I do not think this has anything to do with you personally.

Now, I am assuming the data you were scraping off the web pages are the same as the data that can be downloaded from ftp://data.giss.nasa.gov/pub/gistemp/download/. In that case, you might want to use the script I posted at http://www.unur.com/climate/giss-1880-2006.html to decode the data posted there. I will not mince words here: I think it is absolutely disgusting that the only files containing all the data are in binary format. There is absolutely no reason whatsoever for them not to provide the data in a simple text format (compressed, of course).

In the meantime, I have been meaning to do a frame-by-frame difference between this version and the one I animated earlier this year. However, I have grading and travel to do and I don't know when I will be able to get to it.

Hope this helps. By the way, I think it might be a good idea to change the title of this post.

#2. As noted above, I requested the underlying data when the R access was denied.

I did not regard my R program as a "web robot" in the sense in which that term is usually used, as I was not indexing their files, but I realize that reasonable people can disagree. There is no explicit notice on the webpages.
Jerry – could you post up what the robots.txt file says (as I am now blocked)?

The matter is easily resolved by GISS making an organized file available as GHCN already does.

#3. Sinan, those data files are for the gridded data versions, not the station data.

#2,3. I understand why a webmaster would block access from an unidentified person doing this. But once I had identified myself, I think that they should either provide me with the data or allow me to continue downloading.

IMHO, a reasonable admin should restore access to you if you modify the script to wait, say, a random period of 2 – 5 seconds between each request. That way, the server can keep up with the burst of requests while still serving other visitors.

Incidentally, I think the quick succession of requests was the reason for the intermittent failures you were seeing before.
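That suggestion is easy to retrofit onto any download loop. A sketch (Python for illustration; `station_ids` and `fetch` are placeholders for whatever download routine is already in use):

```python
import random
import time

# Sketch of the suggested pacing: a random 2-5 second pause between
# requests, so bursts never pile up on the server.
def paced_download(station_ids, fetch, min_s=2.0, max_s=5.0):
    results = {}
    for sid in station_ids:
        results[sid] = fetch(sid)
        time.sleep(random.uniform(min_s, max_s))  # spread out the load
    return results
```

The same idea in R is a single `Sys.sleep(runif(1, 2, 5))` inside the download loop.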

Screen scraping is a technique in which a computer program extracts data from the display output of another program. The program doing the scraping is called a screen scraper. The key element that distinguishes screen scraping from regular parsing is that the output being scraped was intended for final display to a human user, rather than as input to another program, and is therefore usually neither documented nor structured for convenient parsing.

Isn’t the language interesting: my attempt to get data is described as “scraping” data. Perhaps that’s a term of art in websites, but it seems an odd choice of word.

I've never heard that term. I can't promise to be an expert in everything web-centric, but I do create web applications professionally, so you'd think I would have heard it before.

Also to describe this activity as a "blatant" violation of their robots.txt policy. Imagine that – me "blatantly" violating a robots.txt policy. Sorrrrr-eeee. Will I have to write "I promise to be nice to Gavin Schmidt" 100 times on the blackboard?

I didn't check robots.txt, but I did spend quite a bit of time on more than one occasion looking for an Acceptable Use Policy or similar document on their web site. I read the "Privacy Statement & Important Information" link, which talked about quite a few issues, but not acceptable usage. As you say, this isn't really a robot, which is why I didn't think to look in robots.txt. You should ask them to make it clear that their policy forbids large-scale harvesting of data for scientific purposes, either on the page from which you access the downloads or in an obvious link from that page. Otherwise I don't see how they can complain that you weren't aware of this policy, as it was not obvious to either of us, and as I said, I did check.

Of course it’s ridiculous that we have to download data this way. It would be much easier if there were a way to download the whole database in a single compressed file, or something similar. Easier for their web server and easier for serious scientific researchers.

Sinan,

While you may not think of it that way, you were indeed running a web robot. The problem with webbots and CGI programs is that even 10 – 20 simultaneous CGI requests can bring a web server to its knees. It is also how denial-of-service attacks are launched.

The program only makes one request at a time. It's no different from Steve sitting there laboriously clicking all the links himself, really, it's just a lot less bothersome.
I don’t consider it a “robot”, it isn’t indexing anything, it’s merely trying to fetch the data from the archive, and isn’t that why the web page exists in the first place? It’s not our fault they made it hard to fetch en masse – the data IS required en masse for legitimate purposes!
James:

Re 1: “Scrape” is a term of art.

Well, maybe I’m wrong, but HTML is a form of SGML which is designed for both human and machine readability, no? So I’m not “scraping” the HTML, I’m merely parsing it and following the links. It’s no different from what a browser does, just automated.

Nicholas, while there may be only one request at a time, they are concentrated rather than being spread across time. Since each request spawns a CGI script which (I am assuming, as is typical) is not well written, the effect of running the R script is the same as a denial-of-service attack.

Sinan, look, as I said I write web apps for a living. I also maintain web servers and database servers, do database programming, etc. I know a reasonable amount about this kind of thing.

My program does NOT constitute a denial of service attack and it does NOT cause the web server to have to respond to multiple requests simultaneously, unless somebody has modified it or is running multiple copies simultaneously.

What it does is very simple.

Given a station ID or name, first it requests the data from the CGI. This returns an HTML page. The program waits until this page has been returned – and thus until the CGI script has finished running on the web server. The web server process will be finished and returned to the pool by the time my program gets the HTML.

Then, when it has the WHOLE HTML document, it finds the appropriate link, and fetches the data. Again, this causes a CGI to run on the server, and again, it waits until it receives all the data and thus the CGI is complete. Mr. McIntyre’s program then stores this data, and begins the process again for another site.

At no time is there more than one outstanding request, and some of the time there will be none (due to network lag, time spent locally storing data, etc.). The net result will be to add less than 1 to the load average of the web/database servers.

Unless NASA has really poor web servers, I can’t see how this would cause them any problems, other than perhaps some puzzlement about the volume of requests they are receiving. I can understand why they blocked Steve, but not why they did not remove the block when they found out nothing nefarious is going on.
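Nicholas's two-step, strictly sequential process can be sketched directly (Python for illustration; the link-matching regex and page layout are guesses, not the actual GISS markup):

```python
import re
import urllib.request
from urllib.parse import urljoin

def fetch_station_data(page_url):
    # Request 1: the station's HTML page. urlopen() blocks until the
    # server has finished, so the CGI run is over before we continue.
    with urllib.request.urlopen(page_url) as resp:
        html = resp.read().decode("utf-8", "replace")
    # Find the "monthly data as text" link in the returned page.
    m = re.search(r'href="([^"]+\.txt)"', html)
    if m is None:
        return None
    # Request 2: the data file itself -- again, only one outstanding
    # request at any moment.
    with urllib.request.urlopen(urljoin(page_url, m.group(1))) as resp:
        return resp.read()
```

Because each `urlopen()` returns only after the server has finished, the structure itself guarantees at most one outstanding request, which is the crux of the argument above.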

In a business, I can't imagine someone being as huffy as this – saying that this was a "blatant violation" of their robots policy. A business would ask: what is it that you want? How can we help you? You're a customer.

Sinan, having re-read your post.. yes, it’s true that if you are on a fast network, this script (without any kind of delay in the loop) could, in conjunction with a poorly written CGI script, make the CPU of the web server fairly busy.

However, it won’t cause it to be used 100%. It will still have time left for other requests. Other requests should also share the CPU evenly, so they will in effect slow down Steve’s fetching in order to temporarily serve other people’s requests.

As a result, I would be surprised if it had a significant impact. NASA's budget is in the billions of dollars a year. They can't spring $500 for a dual-core web server for GISS? This research is related to an issue which is supposedly one of the most critical of our time, which is having lots of money thrown at it, etc. What's worse, the reason we have to do this is because they have organized their web site in such a way as to make it hard to get to the data.

It certainly isn’t a “denial of service” because running this script will not make the web server unusable for others, nor even especially slow. They might experience small extra delays above what is normal. But, the data sets being served are so small, and the network latency such a large part of the time spent fetching the data, that they’d have to drop the ball really badly to implement this in such a way that it really couldn’t handle the extra requests gracefully.

I’m in the web search-engine business so I can probably speak to a few of the issues.

“scraping” is definitely a term of art and is what Steve was doing.

Any programmatic retrieval from a website (rather than manual browsing) is usually regarded as “robot” activity.

If you do programmatic retrieval, it is usually considered good manners to insert a delay of at least 1 second between requests so as not to load the server.

Webmasters are pretty much at the bottom of the IT foodchain. This one seems even worse than usual, since he can't even serve the robots.txt file properly. As a breed, they are generally very proprietary about their webservers; they get very huffy if you overload their servers; and they tend to shoot first and ask questions later (e.g., they will assume your intentions are malicious and block you rather than engage you in conversation).

All-in-all what has happened here is par for the course. If you promise to play nice in future (see above RE delaying requests), they should restore your service.

Using the standard Perl module CGI.pm, the maximum number of invocations per second was less than 20, whereas the number went up to 60–70 per second using CGI::Minimal.

Again, these figures are for a script that does nothing but load and terminate. While the execution time of a script can be significantly reduced by using faster CPUs, the invocation is disk I/O bound. That is why, once the empty script is executed via mod_perl, the number of invocations per second jumps to hundreds.

Now, I don't think Steve caused a denial of service. What I am saying is that, in the absence of other information, denying access would have been my first instinct as well, based on the fact that sustained, repeated requests can bring a server to its knees. It is not just a matter of CPU time; it is also disk I/O and swapping.

Should GISS provide these data without users having to do custom programming? Absolutely. I am not arguing the opposite.

My point was based on what I perceived to be a misunderstanding of why the admin had taken the step of banning Steve’s IP address.

#16

Steve, while I agree with your sentiment, my guess is that the webmaster of a business would also have acted this way. You would have had to take it up with a manager to get to the point of “how can we get you what you want?”.

I don’t think there is any need to get any more technical than this on CA regarding web server performance.

Usually there is interesting info here and I realize there is frustration on this matter, but to be blunt, I think you should move on to something else. This is clearly a bot. The web administrator is just following policy – no conspiracy. I had to draft network policy before, and the problem is that you have to treat everyone the same or the exceptions will open back doors that end up being flood gates. My advice is to follow the network procedures and you will have less trouble in the long run.

If you are potentially blocking other customers from getting service, then a business will get huffy.

I use mod_perl for Perl scripts, and mod_php for PHP scripts. I think it's silly to do otherwise. Most of my web servers have multiple cores (generally 4); even regular PCs these days tend to have at least two cores. Basically, what I am saying is, it's possible their web server is so poor it can't handle requests like Steve was making, but if so there is no excuse for it. Fast hardware is cheap, and setting up mod_perl or mod_php isn't difficult and is the best practice.

OK, let's say they don't use mod_perl for whatever reason, and their hardware is slow, so it can only handle, say, 5 requests per second. Mr. McIntyre downloaded fewer than 2000 files over 8 hours. That's about three per minute, or one every twenty seconds. That's consuming 1% of their meager resources. This is why I say I don't really think they were justified in blocking him unless they thought he was doing something malicious (and it would be a pretty pathetic DoS attack at that rate).

Anyway, I don’t know all the facts, so I can only guess, and maybe they have perfectly good reasons. But it smacks of stubbornness to me.

MarkW, yes, or something I’ve done in the past is to probe the windowing system’s display tree to extract information, or in DOS text mode, the text buffer. That sounds like what “scraping” is intended to mean. HTML is clearly meant to be machine-readable as well as human-readable, so I wouldn’t use that term to describe this process.

Dave Blair: How is it "clearly a bot"? Please read comment #10. This program is not crawling the web site; it is simply requesting data for a range of IDs and downloading the data returned. That means calling the same script repeatedly with a different argument and downloading the file returned, while crawling implies following any and all links so as to "explore" the whole site (or at least a reasonable fraction of it). I realize that we can disagree on the exact meaning of terms, but in that case I think you shouldn't use "clearly" to describe it. To say that it's arguably a bot would surely be more accurate.

Anyway, this doesn’t really matter… but it makes me sad that in this day and age, NASA can’t make a web server which can serve data at a reasonable rate without issues. This isn’t rocket science, people!

#8 That’s why when I wrote my code to download the whole thing, I slept for a full 4 seconds between each station request and ran it overnight on the weekends. While it’s obnoxious for them to make a CGI program and transient files the only way to get the data, my goal was to successfully get the data, not bring down the system or get banned….

rv, I wasn’t thinking about bringing the system down or getting banned. In retrospect, I would have put a sleep instruction – BTW how do you do that? – but it didn’t occur to me that my downloading requests would be material to them. In any event, I’ve received the following response from the GISS webmaster:

On May 16, 2007, at 23:44, Steve McIntyre wrote:

> I am blocked from access to the page where the email addresses are located.

Good point. That was foolish of me to suggest checking a page on which access had been turned off.

I have turned off the restriction that I added to the server on data.giss.nasa.gov last night so that you can access the GISTEMP page and view the contact information.

> I have been attempting to collate station data for scientific purposes.
> I have not been running a robot but have been running a program in R that collects station data.

It is an automated process scraping content from the website, and if that isn’t what a web robot does, then it’s close enough.

rbs

I’ve not received any reply from Ruedy about alternative means of accessing the data. I’ve written back to Schmunk asking how I’m supposed to access the data.

Please contact the GISTEMP group and inquire if they are willing to provide you with the dataset(s) from which the website applications extract information.

If they are not (I do not know what their current policy is on this), then you can go a step closer to the source and obtain station data from the same location that the GISTEMP group obtains the original "raw" datasets that they work from. That is the Global Historical Climatology Network at http://www.ncdc.noaa.gov/oa/climate/ghcn-monthly/index.php

I'm not sure which specific files from the GHCN site are used. But if the complete GISTEMP data are not available then perhaps Dr. Ruedy of the GISTEMP group could give you some tips on how to use the GHCN data.

rbs

I replied that I wasn’t interested in “tips” on how to access GHCN data and re-iterated my request (copy to Ruedy):

I know how to use the GHCN data. I’m not interested in “tips” on how to use it.

I’m interested in the versions as used by GISS. The GHCN version is convenient to download and I see no reason why GISS versions should not be available on equivalent terms.

In the first half of the GISS data set, before I got so rudely interrupted, there were 246 Chinese stations, of which only 24 contained records after 1992! It's all too weird. The impression that I have is that the GISS data is presented as being selected for rural-ness. That may well be true of the US network, but the US network has very high 1930s values. Here are the 24 sites that I identified (a few more may crop up in the rest of the network). These include such rustic locations as Beijing and Shanghai, with a town like Dulan, which we've discussed, being classified as "rural".

I asked politely for access to the data and received the following response from one of Hansen's team:

Dear Steve,

Our main focus working with observed data is creating a gridded set of temperature anomalies which gives reasonable means over comparatively large regions – the global mean average being one of the major goals. If you are interested in individual stations, you are much better off working directly with the GHCN data.

Our station data are really intermediate steps to obtain a global anomaly map, and are not to be viewed as an end result. A modified time series for a particular location may be more representative for the surrounding region than for that particular location. So it is important to use these data in the proper context.

All our publications and investigations deal with regional temperature anomalies and that is the only use these data are good for after our modifications.

If you still think that downloading our “scratch pads” is important to your investigations, please let me know exactly what stage after the raw GHCN data you need and maybe an indication why you need it, and I’ll try to provide you with the necessary data.

Again, we are not trying to compete with GHCN as provider of station data; we are using their data for a very specific project and we made – perhaps unwisely – some of our tools that we used to test the various steps of our process available on the web.

#36. Hans, it’s about the same size as the GHCN v2 database – max 20 MB or so for each version. I was at 8 MB halfway through one version. They have a couple of versions – so it’s not a big, big dataset.

Fortunately for me, there are not similar issues for elevation data. The USGS has a data archive for all of this type (i.e. imagery) of data at seamless.usgs.gov. It took a little effort to get the data in a format I can use, but not that much.

#33 “Reluctantly I’d have to agree with the NASA webmaster on this one. Apache really isn’t designed to be scraped like that.”

Of course it can be! Apache can even do load balancing, lol! There is no way a server can be “brought to its knees” by a bot accessing from a single IP address like Steve’s R script. Saying otherwise is a joke: any free video service delivers tens of megabytes to a single user in real time! Denial-of-service attacks are made by multiple accesses from thousands of separate computers (remotely controlled via worms and trojans), not from a single IP address.

There is NO technical justification for the attitude of the GISS except arrogance, bureaucracy, sheer incompetence or simply deliberate obfuscation.

#31 Steve, didn’t mean to imply that you intended to! I just meant that I was trying hard not to be a big drain on their system, and, since I work in software and web applications, I was aware of the possible load issues that a script could create.

I used Perl to download the data and used the sleep function to pause. In R, I believe you could use Sys.sleep(4) to get the same effect.
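
For readers who want to do the same, here is a minimal sketch of that pacing idea in Python (not the actual Perl or R script; the `fetch` callback and station IDs are hypothetical stand-ins for whatever request your own downloader makes):

```python
import time

def polite_fetch(station_ids, fetch, delay=4.0, sleep=time.sleep):
    """Download one record per station, pausing `delay` seconds
    between requests so the server never sees a burst of traffic."""
    results = {}
    for sid in station_ids:
        try:
            results[sid] = fetch(sid)
        except IOError:
            results[sid] = None  # missing record: note it and carry on
        sleep(delay)
    return results
```

In real use `fetch` would issue the HTTP request (e.g. via `urllib.request`); injecting it, along with the `sleep` function, keeps the pacing logic separate and easy to test.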

In response to Ruedy’s refusal to provide access to the data, I asked for a copy of the program by which they calculated their results from GHCN data. Now they have reversed course:

After a short meeting with Dr. Hansen, we were advised to let you download whatever you want as long as generally accepted protocols are observed. Please try to do so at a time that does not impact other users, i.e. late nights, weekends.

What we did with the GHCN data is carefully documented in the publications listed on our website. We are not creating an alternate version of the GHCN data, we are mainly combining their data in various steps to create our anomaly maps.

Ruedy says:
“If you still think that downloading our “scratch pads” is important to your investigations, please let me know exactly what stage after the raw GHCN data you need and maybe an indication why you need it, and I’ll try to provide you with the necessary data.”
Steve characterizes this as:
“refusal to provide access to the data”

How in the hell does “let me know what you want and I’ll try to get it for you” become a refusal? Especially when it is followed shortly after by a polite permission to access all their data?

FWIW — we recently changed host providers and learned a bit about server load management along the way. Our new host (unlike most) has explicit rules: in addition to the usual space and bandwidth, they require things like:
– don’t use more than 25% of server CPU for more than 90 seconds
– don’t send more than 500 emails per hour

— and they auto-block and auto-ban if you exceed these limits.

While your usage *alone* won’t overload the server, your usage most definitely was using *all it could* of the server. Other users would see a slowdown.

So yes, when massively downloading, it is good netiquette to include a few seconds’ delay between requests!

I realize that you have provided some documentation of what you did. In econometrics, it is a condition of publication in journals that authors archive their code and data so that their results can be routinely replicated. I realize that no such standards apply to climate science. However, equally, there is no prohibition on individual climate scientists voluntarily adopting these best practice standards. In that spirit, I would appreciate it if I could inspect the code used to process the GHCN data. Thanks, Steve McIntyre

He did not provide source code according to best practices standards. Instead he replied:

bender, drop the indignation. Steve got diagnosed as a bot, was blocked. Several of Steve’s readers here, people in the industry, say this was no surprise and standard IT practice – and that perhaps the IT person isn’t the best on the planet, but this wasn’t a violation of any kind of ethical practice. Steve wrote asking for access to the data, and was given access to the data. Now he is crowing that somehow he forced them to do something against their wishes – and he sounds like a paranoid jerk for it.

He then ramped up asking for the code, and is now going indignant about that.

I’m not playing any kind of victim. If anything, I’m quite amused at how transparent Steve has become.

It’s certainly enlightening, don’t you think, to observe the actions of GISS here. First, they ban Steve – fair enough, it seems. Then he tells them what he’s doing and they say “tell us what you want”. He tells them and asks if he can also see their methods. Their response is, essentially, “You are unblocked now – bye!”. Hmmm. If Steve’s actions were “bad”, and it’s better to get all the data in one lump, why not supply it so? Why tell him he has to get it in a way that they have previously said was not in their best interests? Why do they not show their work to a fellow researcher? All very odd. It sure *seems* like they don’t want anyone poking around finding out what they’re doing, which is at best odd, and at worst a cover-up.

It’s interesting that GISS describe the data that they provide to the public as “scratch pads” and ask for justification for downloading their “scratch pads”. What sort of answer do they expect? That I want to scratch it? If they don’t know what the relevance of the data is, why are they disseminating it to the public?

#52. The first interruption was fine. However, after I told them that I was not a bot, they did not immediately restore access or provide an alternative. That’s what they should have done. In fairness, it’s not like other situations where data is still unavailable after months or even years of effort, such as the Lonnie Thompson obstruction.

So your complaint, Steve, is that they didn’t immediately make an exception at 6 am to their bots policy for your bot, and it took them until 2 pm that same day to offer to get the data for you, and then another two hours to offer full access with a request to modify your practice a bit.

Dude, you ran a bot in violation of their policies, they blocked you, and it took less than a single working day to come to a solution and give you access to the data. Stop whining.

Uh, Lee; you’re the one whining, as usual. Still, it’s nice to see one of the old trolls at home, stirring in his cave and asking, “Who’s that walking over my bridge!” Why don’t you go down to the Exponential Growth thread and try giving Jimmy D a hand. He needs someone to complain that Gerald Browning is being mean to him.

I don’t think that I “whined” or was “indignant”. I’ve reviewed my postings, which, for the most part, merely recorded the progress of the correspondence. At the end of it, I expressed mild satisfaction that I had been successful in getting service restored and believe that blog power had something to do with it. I draw this conclusion in part because of the curtness of Ruedy’s closing remark. However, I cannot prove that this was not their intention all along, nor can you show the opposite. Nor am I worried about it very much. As noted above, I’m satisfied that, in this one case, I’ve been able to get access to data relatively promptly, however it came about. Lee, maybe you can help getting Lonnie Thompson’s data.

Ah, I’m glad to see this was resolved satisfactorily and we all learned something.

Unfortunately, it could have been resolved much better, for example if they were to make it easier to grab all the data in one go (it would be better for them and for us if they could), or if they could better document how they process the data so that it can be fully reproduced.

MrPete, those limitations sound like you are sharing a web server with other people. I would hope NASA has dedicated servers, although they might serve multiple sites from one server. If so, one would imagine they’ve invested a little bit of money in the hardware and the configuration so that it isn’t going to be noticeably slowed by serving a few extra requests per minute.

Re # 46: Also in other research fields it is not common to provide the source code (I don’t do it). The method should be described clearly and that is enough. Write your own code if you want to reproduce the results. That’s what scientists do in my research field and many other fields. Climate science is not unique.

Re # 46: Also in other research fields it is not common to provide the source code (I don’t do it). The method should be described clearly and that is enough. Write your own code if you want to reproduce the results. That’s what scientists do in my research field and many other fields. Climate science is not unique.

Unfortunately, in climate science part B of your prescription (the method should be described clearly) is all too often ignored, or treated only perfunctorily.

In addition, there are a lot of statistician wannabes in the climate field, who either make egregious mistakes (such as happened in the Hockeystick) or use “novel” or untested procedures which are neither fully explained nor adequately justified. Without the code, these are very difficult or impossible to diagnose.

In addition, the provision of the code saves hours and hours of work when trying to understand where a particular procedure might have gone off the rails.

My question is, in this day of easy communication and archiving … why not provide the code?

Why the reticence? What are you protecting? Why are you trying to justify withholding important information about the work which is being done? Why are you supporting wasting my time trying to guess how some scientist has come to a wrong answer?

w.

“Had we but world enough, and time,
This coyness, lady, were no crime.”

I was interested to discover above that I could look at surface air temperature records at NASA and get a nice graph. I probably haven’t been paying attention. Anyway, I’ve spent some time clicking away, choosing, where available, rural areas with 100 years or more of records. I’ve yet to see a graph that goes unambiguously up; some go obviously down; some go down and up; some go up and down. What am I doing wrong? Everybody agrees the world is getting warmer but every graph I look at says it isn’t – at least over the last 100 years.

I was checking out the GISS data myself and am dumbfounded at the apparent lack of any FGDC compliant metadata. It must be there and I missed it. Has anyone seen links to metadata for the various datasets?

62: The Idsos seem to find the same thing. See the “temperature record for the week” on CO2science.com. It’s cherry-picking, of course, but interesting, nonetheless. I think the UHI effects are far more important than is generally believed. EPA says temperatures in large cities can be up to 10 degrees higher than surrounding rural areas.

62 When I look up the stations for Switzerland (where I live) at GISS, there are four stations which go up to 2007. Two of them are the two biggest cities of Switzerland (Zurich and Geneva). All the other stations (around 7), which can partly be seen as rural, stopped being included in the late 80s or early 90s.
Switzerland is very small, but if this has happened in other countries, it would be a good explanation for GISS showing the largest warming.

#67. Gaudenz, this pattern seems to be almost universal in the GISS (and GHCN) data other than (perhaps) the US. Of course, the US is the location where the spread between the 1930s and the 2000s is the least.

There are only 24 GISS sites in China with data past 1992, mostly consisting of rustic locations like Shanghai and Beijing. The pattern seems universal.

Steve, I know you’ve been mistreated by other researchers, but that’s no reason to go into each new encounter with a chip on your shoulder. You were running what you now realize was a poorly-written site-scraping program because it didn’t include even so much as a one-second delay to allow other requests into the queue. That’s bad practice. Good webmasters put policies in place to interrupt programs like yours. (If you automatically and unquestioningly let people run programs that absorb even 1/4th or 1/16 of your bandwidth, pretty soon you’ve got no bandwidth left. Best to nip such problems in the bud.)

So they were correct to block you and when you complained about the block they resolved the issue in less than a day; not bad at all!

You should add an update to the main post (so people don’t have to read all the comments) noting that you’ve been allowed to download the rest of the data and thanking them for their assistance in quickly resolving the matter. (And even though you’ve been whitelisted, you should still add a small delay in your R script when you get the rest of the data.) Celebrate the small victories!

If you have other info you’d like to request, wait a week or two and ask for it as a separate transaction rather than a continuation of this one. Otherwise interacting with you may start to seem like interacting with your R program – an exchange with the potential to suck up all available bandwidth because the instant one request is fulfilled another one follows it, and another, and another with no end in sight. Humans have even less tolerance for that sort of thing than do webservers. 🙂

#69. your comments are fair enough. I note that the ultimate resolution of the issue was them telling me just to carry on with what I was doing after hours (which is what I was doing anyway) so it couldn’t have had that deleterious an impact on their operations.

If the roles were reversed, I’d like to think that I would have sent a relatively pleasant response – saying, thank you for identifying yourself. I’m glad that the access was being undertaken for a scientific purpose; however, the form of access is having an adverse impact on our system (if that actually was the case) and we would prefer to make the information available to you in a more efficient way that places less impact on our server.

I would not have responded the way that the GISS webmaster did. Having said that, there’s little doubt that John A would have responded as aggressively as the GISS webmaster, so maybe there’s a geek element to the aggressive language of the initial response.

Obviously I have had very poor luck in getting data from the Team. This is not a matter of asking politely or not. My own judgement is that blog publicity is helpful in getting data. Sciencemag only took steps to get data from Esper after I publicly criticized their hypocrisy. They contacted me to object and over the next year managed to get data from Esper.

In my judgement, I thought that I’d have better chance of getting the data by hitting the issue while the iron was hot. The matter was resolved. Reasonable people can disagree whether they would have resolved the situation without the unfavorable blog publicity.

But I’ll update the post to reflect that the matter was resolved.

BTW the data that I was accessing is data that should be routinely available.

Lee, once again space is wasted by your attempts to analyze the motivations, style and approaches of Steve M. Let the rest of us adults here make our own decisions on those matters so we can stick to the main point of this and other related threads, i.e. the willingness and cooperation of climate science related organizations and individuals to provide easy access to important scientific data. If you think Steve M and many of the posters here are wrong to expect more — and more in this specific case — just say so. Otherwise your remarks appear to me to be more diversionary than informative.

I think these bureaucracies have not yet figured out that this is not your everyday issue, that this is a trillion dollar issue. This culture of denying service and pretending that server performance is the issue when the real issue is data protection is ridiculous. It’s not about IT, it’s about IP, and bureaucratic raison d’être.

Lee would be half-right except that the normal standards do not apply in abnormal circumstances.

I would hope NASA has dedicated servers, although they might serve multiple sites from one server. If so, one would imagine they’ve invested a little bit of money in the hardware and the configuration so that it isn’t going to be noticeably slowed by serving a few extra requests per minute.

My point was that CPU-intensive requests do cause a heavy server load. The key parameter is not requests per minute but overall CPU load. If these files are being generated on the fly, Steve’s script could easily use 100% of the CPU. All it requires is that the CPU time be larger than the download time. That’s independent of shared/dedicated servers. And remember, most likely we’re not dealing with top-end industry sysadmins here 😉

#75. One of the results of this episode is that it appears to have prompted a new hack attack against climateaudit. In addition to the usual spam of poker, mortgage and porn sites, we’re being hit with random messages.

Mr. McIntyre, it could of course just be a co-incidence. There are a lot of people on the internet with too much time on their hands and just enough knowledge to cause mischief.

MrPete, you would be correct if Mr. McIntyre was sending requests very frequently, or requesting large or complex queries. He was not. There was about one station query every fifteen seconds, and each returns something like 1-2KB of data. Such a request should not take more than 0.2 seconds to return even on a fairly old and slow machine, unless the person who programmed the CGI did so very poorly. This would constitute an average load of just a few percent on their server. Possibly enough to be noticed, but unlikely to cause any interruptions of service.
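
Nicholas’s back-of-envelope figures can be checked directly (the 0.2 s CPU cost per request is his guess, not a measured value):

```python
# One station query every ~15 seconds, each costing perhaps 0.2 s
# of server CPU (a generous guess for a small CGI response).
request_interval_s = 15.0
cpu_per_request_s = 0.2

avg_load = cpu_per_request_s / request_interval_s
print(f"average CPU load: {avg_load:.1%}")  # prints "average CPU load: 1.3%"
```

At roughly one percent average load, the script would indeed be visible in the logs but nowhere near a denial of service.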

I think most likely what happened is the sysadmin was browsing through the web server logs and saw constant activity, assumed someone was up to no good and blocked them. It’s possible these repeated queries were causing problems but only, I think, if the CGI is very poorly written and incredibly slow for the small amount of work it does for each request. But as I said before I’m only making an educated guess, I don’t know what hardware they have, what database software they are using, how their scripts are written, etc. And, as I said before, if they want to disallow this type of access to their data they really should put a notice on their web page stating that users are not allowed to automatically download lots of data.

They said what Mr. McIntyre was doing was in violation of their policy, but where is the policy stated? Surely it’s superior for people to be able to read the policy and know whether what they are doing is going to violate it prior to actually commencing that activity. Perhaps it is written on their web page somewhere but if so I can’t find it.

There’s no need to speculate about whether the R downloading program adversely affects their server or not. You can check for yourself right now. I’m doing another download instalment as we speak, since it’s both the weekend and nighttime. To test whether this was adversely affecting GISS server performance, I went to the GISS site using a browser (while the R program was operating in the background) and got a response immediately. If anyone doubts this, go to GISS and see for yourself http://data.giss.nasa.gov/gistemp/ any time in the next few hours. Their server is operating just fine.

Steve, it does not matter if your activity adversely affected their server. What matters is that the kind of activity you were doing CAN have such an effect, so that kind of activity is forbidden. The fact is they DID create such an exception, on the same day. They did so despite your whinging at them here, about their valid application of a valid policy. You have no valid complaint about this treatment. None. So apologize, and stop complaining.

Lee, given the structural non-cooperation of climate scientists on releasing data, Steve had valid reasons for his initial suspicion.
And yes, 20 megabytes is peanuts, so don’t blame Steve for a poor server configuration at GISS.

#81 Nonsense. Bureaucracies like to protect their intellectual property. They are territorially assertive. This has nothing to do with technology limitations in serving the public and everything to do with information control. The only reason Hansen ok’d the request is because he knows how bad it would look if he did otherwise. No kudos are in order. The hierarchy made an exception because it fears a force that is more powerful: the democratic blogosphere.

Many of the threads are peppered with traces of mind reading, as well as with assumptions about why things should have been different than they are/were, but this one seems to be overflowing with them.

If you like, one could explicitly label speculations as such, but I think it is somewhat obvious when someone is speculating in these matters.
My unprovable conjectures are only a counter to the unprovable conjectures of others, such as Lee. Hence the outbreak.

Very true, but there also are things which we simply do not know, and about which we can only guess (and without any “confidence bars”).

Here’s one that has nothing to do with any of the previous guesses, but is not implausible: Hansen realized that Steve’s activity would raise the usage stats of the GISS server, and he could use those numbers in his next request for a bigger budget. 🙂

Yes, and more compelling, but the guess/assumption level of this thread has me cringing each time I notice another comment has been added.

Let me mention another aspect which, I would guess, Steve, and Nicholas (an IT guy, no less), seem not to have considered. Steve’s request for “the entire set of station data in current use” implies that he wants the thousands of .gif files, each commonly in excess of 200,000 bytes in size, generated by his thousands of download requests. I very much doubt that Steve wants them, or has considered the overhead of the thousands of executions of the ghostscript program that generates them, but his note to Reto Ruedy implies that he does. 😦

re 69, Glen Raphael. I notice that Steve has not added a clarifying note to his original article, as you suggested. Far be it from me to engage in “mind reading.” I’ll simply note the fact, for other fair observers to ponder.

re 81, Earle. If one were to RTFT, one would find frequent references to the robots.txt file for that site.

#93. When an issue arose, I promptly contacted Ruedy asking for access to an organized datafile that would not require separate access for every station. Instead of providing me with that file, GISS decided that they would rather permit me to continue downloading station by station. This was not my decision but theirs.

I’ve updated the head notes with subsequent correspondence showing that access has now been provided, as I’d undertaken to do (although the responses were already in the comments). It is a holiday in Canada and the grandkids were over, so my abject apologies that Lee had to wait for this collation. We at climateaudit are très désolé that Lee had to wait even a second for this vital collation.

Yes, Lee, sadly, I, for a brief moment, descended down to your level. I won’t do it again.
12
Lee says:
May 21st, 2007 at 12:36 pm

Steve,

Stop whinging. You may not have noticed, but in the GISS case, several of the CA regulars are also hitting you for exactly the same thing I am. You violated a robots policy, they blocked you, when you inquired they told you why you were blocked and directed you to the appropriate place to request data, and within less than a day they made an exception to their web policy just for you. You have yet to note this in your original article – one has to read into the comments to find that they acted fairly and that you have the script access you initiated to the data. And you are still implying that they somehow treated you unfairly, with posts like your “look, my activity is not stalling their server” post, and this one here.

Nice to see that you also continue to engage in making quotes out of context, with no link, indeed not even a mention of the thread from which the quote originated, as well. Good job.
13
Lee says:
May 21st, 2007 at 12:38 pm

bender, please point to where I engaged in “mind reading?” If you are going to engage in this, and Steve is going to allow it, you can damn well support it. Or shut up.
14
bender says:
May 21st, 2007 at 12:50 pm

Re #13 “Show me where I engaged in mind reading?” Well, call it what you like – they’re your words – but I didn’t do anything that you did not do first. That’s Jerry B’s observation, and I agree with him.

Regarding your use of language, it’s very impolite to threaten somebody who’s got no leverage on you. You have so little credibility, you have nothing to lose by engaging me that way. It’s a bit unfair, don’t you think? But I’ll “shut up”, as you suggest. I’ll crawl off to some other thread and let you carry on with your excellent work.
15
Lee says:
May 21st, 2007 at 12:55 pm

Do you think that all logical deduction is a form of mind-reading? Take a look at what Bender actually said:

But Lee will be along any second now to tell us that the purpose of this was to centralize services to make them more efficient.

This was a prediction based on your past behavior. It could have been a self-defeating prophecy if you’d had the sense to ignore him, but no, you had to prove him right by claiming sarcastically that he read your mind, which is, in its own way, a self-fulfilling prophecy.

But look at what you did. You claimed Bender read your mind and proved it by posting a message to Steve of the sort he predicted. Therefore you yourself are claiming that you’ve read Bender’s mind since otherwise how could you know he was reading yours rather than just making a deduction? And then you ask for proof you’ve ever done what you just did?

You need to pull out a copy of “Gödel, Escher, Bach” and read the “Chromatic Fantasy, And Feud”. As Achilles tells the Tortoise, “Oh, Mr. Tortoise, don’t put me through all this agony. You know very well….”
17
bender says:
May 21st, 2007 at 1:04 pm

No, you’re just too obtuse and have too short a memory span to realize that I’m referring to the other thread where your problem originated, GISS Interruptus. The thread where you tried to read Hansen’s mind as to his motive (protect the IT), and I merely followed suit (protect the IP).

IOW you are doing what you usually do: carrying your arguments across threads to take them all OT. Oh wait I forgot, you’re a [self-snip]. Carry on. Please take the last word. Make it a good one.

re: #78: steve, hits on a web server aren’t evenly distributed over time. Traffic comes in clusters; it’s random, chaotic, clumpy. The web server needs to have a maximum throughput that is larger than the average throughput, but it’s a waste of money to pay for bandwidth too much in excess of peak needs.

Suppose the GISS server hits its maximum load at 2pm. If you’re running your script *then*, there’s a chance access will be slow for other users. Response time would still be fine at 4am, hence the suggestion you run your script late at night.

Even if they do have the bandwidth to handle *your* process unencumbered throughout the day, there’s still an important scaling issue. They don’t just set their policy to deal with you; they set policy to deal with the entire class of potential users *like* you. Suppose their system can handle *four* users like you before normal use is affected. Then you’ll never notice a degradation when you run your test, and they’ll be fine as long as they keep the number of users like you under control. How do they do that? Exactly the way they did – watch for unusual usage, deny excess usage, and approve on a case-by-case basis.

Passing your sanity test (have a few people request pages while the script is running) is not a sufficient proof that they’ve got ample bandwidth such that they don’t need to worry about users like you. You don’t know how many OTHER people are likely to be running similar scripts at any given time, or how understaffed and overworked the IT department dealing with them is.

The key thing webmasters worry about is clumpiness – what happens when, by random chance, a LOT of people suddenly request pages from them at the same time. If you put a short “sleep” in between your requests, that gives the system time to deal with other requests in the queue. It also makes you look more like a normal user and less like a bot.

In the future try adding a line like: “Sys.sleep(2)” between page requests.
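
The `Sys.sleep(2)` suggestion above amounts to enforcing a minimum interval between requests. Here is a small illustrative sketch of that idea in Python (not the actual script; the clock and sleep functions are injected so the pacing can be tested), sleeping only for the remainder when the previous request already took some time:

```python
import time

def rate_limited(items, min_interval=2.0, now=time.monotonic, sleep=time.sleep):
    """Yield items no faster than one per `min_interval` seconds."""
    last = None
    for item in items:
        if last is not None:
            elapsed = now() - last
            if elapsed < min_interval:
                sleep(min_interval - elapsed)  # top up to the full interval
        last = now()
        yield item
```

Wrapping the request loop in `rate_limited(...)` caps the request rate regardless of how fast the server responds, which both spreads the load and makes the client look less like a runaway bot.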

I see Steve allows the attacks on me, made by himself and bender, to stand in that other thread, but is moving my responses to them. Transparent, steve.

Bender, if you didn’t want this to show up in that other thread, then you damn well should not have raised it there. Neither should SteveM have done so. Of course I know which thread you were referring to – where do you think the phrase ‘mind reading’ came from? You still have not defended your accusation.

lee…you have used the term “whinging” several times in a non-specific way. thinking i might have missed some new “cyber-terminology”, i looked it up on wikipedia. there is no such word there. could you please either define “whinging”, or use a real word to describe what you mean?

to complain, especially about something which does not seem important:

– Oh stop whinging, for heaven’s sake!

– She’s always whingeing (on) about something.

whinge

noun {C usually singular} UK INFORMAL DISAPPROVING

– We were just having a whinge about our boss – nothing new.

whinger

noun {C} UK INFORMAL DISAPPROVING

a person who complains continually

—
deanesmay.com
The beauty of this is that you can go decades as a native speaker of the language and still come across words you’ve never heard. I recently came across just such a word, and it astonished me. Because I’d seen it several times in print, including in comments left to articles here on Dean’s World. Yet every time I’d seen it, I assumed it was a mis-spelling. I thought the writers meant to say “whining,” but had just produced a typo. But no, the word is:

WHINGING: To complain or protest, especially in an annoying or persistent manner.

#102. For an example of whinging, you may read Lee’s posts on this thread. The GISS data is important and my requests for access to data were expressed appropriately. The form in which I was downloading data from GISS was eventually approved by them. I did not complain that they took an unduly long time to resolve things; on the contrary, I expressed some small satisfaction. I see little complaining in my notes, the majority of which merely document correspondence. On the other hand, Lee’s complaining about whether my initial downloading failed to observe their robots.txt policy and related complaining is surely a textbook example of “complaining, especially about something which does not seem important”.

#102 thank you, lee. the deanesmay blurb at the end was spot-on. i mistakenly thought it a mis-spelling. taking steve m’s advice, i looked over the thread, and i agree with him, you, lee, really are the one whinging here. thanks for the new piece of vocabulary.

I’ve often seen a similar phrase without the “h” where I assumed it was derived from “a wing and a prayer” and meant “just throwing something out and hoping it works.” Are there two separate terms or is the one with “h” just the British version and with what would appear to be a more negative connotation? Of course if it has an “e” at the end it would be pronounced differently than an avian arm would be.

JerryB, why would they be generating GIF files when we are explicitly requesting text data? We certainly aren’t DOWNLOADING any .GIF files. Only one HTML and one TXT file per site, and the HTML is only being downloaded because they force us to do that in order to access the TXT file. The actual data for a site is just a few kilobytes.

If they’re generating a GIF file even if we don’t request it, that’s a very poor implementation and I don’t see how you can blame us for that.

I’ve said repeatedly, I don’t know exactly how their system works and therefore I can’t judge exactly how much load Steve’s requests place on the server, but I do know that if I wrote something equivalent, one query every 10 seconds or so would be such an insignificant load on the server that I wouldn’t be able to notice it without looking at the logs.

There was about one station query every fifteen seconds, and each returns something like 1-2KB of data. Such a request should not take more than 0.2 seconds to return even on a fairly old and slow machine, unless the person who programmed the CGI did so very poorly. This would constitute an average load of just a few percent on their server. Possibly enough to be noticed, but unlikely to cause any interruptions of service.
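For what it’s worth, the arithmetic behind that estimate is easy to check. A sketch in Python; the figures are the ones assumed in the comment above, not measurements of the actual GISS server:

```python
# Back-of-envelope check of the server-load estimate discussed above.
# Both figures are assumptions quoted in this thread, not measured values.
request_interval_s = 15.0   # roughly one station query every fifteen seconds
cpu_per_request_s = 0.2     # generous per-request cost on an older machine

avg_load = cpu_per_request_s / request_interval_s  # fraction of one CPU busy
print(f"average server load: {avg_load:.1%}")      # about 1.3% of one CPU
```

At roughly 1.3% average utilization, such traffic would indeed be "possibly enough to be noticed, but unlikely to cause any interruptions of service."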

This answer is quite intriguing. Since the script has no artificial delays in its loop, the four queries per minute generating 4-8KB/minute of data are quite obviously CPU-bound. Yes, other queries could get in there and receive a response (particularly if viewing static data/pages)… but if there’s a 15 second response delay for downloading a few KB of generated data, then the system is in Very Bad Shape.

I agree 100%: the server script (or some other aspect of the server) must be written or configured incredibly badly. I won’t bother guessing what’s going on. We’ve beat this horse to death.

–MrPete

PS If it were my server, I’d install some performance/database analysis code to see what’s wrong. The inside guys are being hurt just as badly, particularly since this is a part of their internal data processing path!

MrPete, it’s possible, but don’t forget that establishing TCP connections and sending HTTP requests can take a little while as they involve several packets going back and forth and the network transportation isn’t instantaneous.

It doesn’t take more than about a second to load the station data when I request it via my browser, and my round-trip time to the GISS server is 270ms. Maybe the server was very busy during the time Mr. McIntyre was downloading and it was slower, I don’t know.

Now that I’ve investigated, I think Jerry may be right. I had a look at the URL it uses for the temperature graph on that page, and it appears to point to a temporary file. That probably means the site generates the graph image even if you never request it! Poor design, IMO; I would have set up the graph as a CGI so that it renders upon request, rather than generating the graph and the station data text simultaneously. Maybe that explains why fetching the data puts such a drain on their resources, but if you ask me, it’s a silly way to make the data available. It seems more like a toy than a serious scientific endeavor if it’s designed mainly to produce single station graphs rather than let you download the data for more rigorous research.

OFFICE OF MANAGEMENT AND BUDGET
Guidelines for Ensuring and Maximizing the Quality, Objectivity, Utility, and Integrity of Information Disseminated by Federal Agencies; Republication

The reproducibility standard applicable to influential scientific, financial, or statistical information is intended to ensure that information disseminated by agencies is sufficiently transparent in terms of data and methods of analysis that it would be feasible for a replication to be conducted.

“Reproducibility” means that the information is capable of being substantially reproduced, subject to an acceptable degree of imprecision. For information judged to have more (less) important impacts, the degree of imprecision that is tolerated is reduced (increased).

Administrative Complaint Process
This section establishes an administrative complaint process allowing affected persons to seek and obtain, where appropriate, timely correction of information maintained and disseminated by the FLETC. This administrative complaint process has been designed to be flexible, appropriate to the nature and timeliness of the disseminated information, and incorporated into FLETC’s information resources management and administrative practices.

Any “affected person” (“Person”) may request from the FLETC the timely correction of information that the FLETC has disseminated. For the purposes of these guidelines, affected persons are persons who may benefit or be harmed by the disseminated information. This includes persons who are seeking to address information about themselves as well as persons who use the disseminated information.

Documents and information disseminated but neither authored by the FLETC nor adopted as representing the FLETC’s views are not covered by these guidelines.

If an affected person believes that disseminated information does not conform to Office of Management and Budget (OMB), Department of Homeland Security, or FLETC’s guidelines, that person may submit a written request for correction to the Disclosure Officer at the following address:

Description of the information deemed to need correction.
Manner disseminated and, if available, date of dissemination.
Specific error(s) cited for correction and proposed correction or remedy, if any.
Manner in which the information does not comply with the information quality guidelines.
How the person was affected and how correction would benefit them.
Supporting documentary evidence, such as comparable data or research results on the same topic, which will help in evaluating the merits of the request.
The person’s contact information for the agency reply on whether and how correction will be made.

If matey won’t do it, why not ask what the relevant appeals procedure is?

Also, it is quite common to download large data files. Organisations use their own facility, or specialist download sites.

The idea that data files of a few megabytes are putting an unreasonable load on a system is ridiculous. If they don’t provide the raw data in a file, and make a query process necessary, then the consequences are down to them.

Steve’s downloading was both understandable and honorable, but unusual in the context — clearly the “inside” team had not run across this kind of data request. So he unknowingly was stirring a pot.

I sense Nicholas’ final functional analysis is correct. The “inside” team created this script not as a web service but as part of their workflow — to generate graphs and data. Clearly they did not expect outside “normal” web access; the method is not efficient enough to do that very well.

And I give them a break as well: sure, it’s poor design if the goal is to serve up web pages. But what if the goal is to scrape together something quickly that will give me the graphs I need and the data to go with it? I’m not thinking about web services, I’m thinking about my data analysis. I just need to get a job done. Writing a script is a quick and easy way to do that. So I write the script, generate the graphics and data I need, and forget about it. [Did anyone else notice that the script produces not just a few KB of data and a GIF graphic — it ALSO generates a PostScript version of the graphic!]

As they put it:

Our station data are really intermediate steps to obtain a global anomaly map, and are not to be viewed as an end result…our “scratch pads”…tools that we used to test the various steps of our process

Those of you who do NOT run commercial/public websites, but DO have access to intranet or personal server resources… have you ever created a page that was really for your own use, and then forgot about it?

This sounds to me like a working group that happens to have webhost-capable systems–over time it’s easy to blur the lines between internal-use pages and external-access pages. In January, Joe Scientist puts up a script for internal testing and data viewing; in March Jane Researcher tweaks it for something else (generating a PostScript version of the graphic); in September, Jill SysAdmin sees the script and turns it into a public resource, without considering the impact. I’m not saying it happened that way; doesn’t matter how. The point is working teams use web scripts much differently than teams focused on creating web sites for an outside audience.

“In answer to your question, I’m interested in the data as it is presented
to the public.”

left him little alternative but to imagine that you want over 18,000 .gif
images of plots of annual means (my quick estimate is over 3 gigabytes),
plus another more than 18,000 postscript images of the same plots, plus
the text files.

Having been reading your blog, I would guess that you do not want all
those images, but based on your emails to Ruedy his interpretation may be
that you actually do. In such a case he may imagine that you are a
rather strange person, and he may be happy to have as little to do with
such a person as possible.

Of course, I have no idea what his interpretations were, but they may
very plausibly have been very different than what you had in mind when
you wrote to him. Also, if you had specified that you want only the
text files, that may not have changed the outcome at all, but he may
have viewed your interest at least somewhat differently.

#111. Jerry, my first email to Ruedy requested an organized data file such as is available from GHCN. My subsequent letter was to clarify that I wanted the GISS version and that a link to GHCN was not adequate. He knew what I wanted. The initial request said:

Dear Dr Ruedy, I have been unable to locate an organized file of station data as used by GISS (such as is available from GHCN). In the absence of such a file, I attempted to download data on all the stations using a script in R. This was laborious as it required multiple calls. I was not using a “web robot” nor was I indexing files. During the course of this, your webmaster blocked my access to the site claiming that downloading the data in this fashion violated your policies. Would you please either restore my access to the site or provide me with an alternative method of downloading the entire data set of station data in current use. Thank you for your attention, Steve McIntyre

My reading of his emails to you is that he did not know
that you did not want the images. His phrase “scratch pads”
refers to temporary files generated when you
“Click a station’s name to obtain a plot for that station.”
and those files include the images which are among the “data
as it is presented to the public”.

Again, I have no idea what his interpretations were, but I
see no evidence to support an unambiguous conclusion that he
had in mind what you think he had in mind, and I see at least
hints to the contrary.

MrPete, fair enough. Please keep in mind I’m not trying to criticise anyone, I’m just interested in the technicalities of how they have set up their site, why they made the decisions they did, and what Mr. McIntyre can do in future to access data without such problems. Call it intellectual curiosity. I do think their implementation is sub-optimal but I’m not blaming anyone for this, just pointing it out.

I think that what we seem to have agreed upon – that the GISS web site does not appear to be designed primarily for disseminating information for scientific research – is an interesting point in and of itself. I don’t know what it means, but it’s worth pondering.

When I built this function for Mr. McIntyre it never crossed my mind that using it would be considered abuse, for two main reasons. One, it seems a perfectly reasonable thing to want to do, and two, if I implemented such a system it would not cause any serious problems. I guess I should have looked more closely at how they had set it up, rather than assuming they did it similarly to the way I would have. That could have given me some clues that it causes more load on their server than I thought previously. Oh well, live and learn…

Thanks for the note but I’m well aware of the robots.txt file. I’m curious where this valid policy is that Lee cites for the implementation of the robots.txt file, and what the agency’s criteria are for what constitutes a robot. I asked Lee for it and got nothing in response but ‘RTFT’, as if rancor and rudeness substitute for fact and references. If anyone else has suggestions about where to find this policy on the GISS web site I’d appreciate the link greatly.

Or is it that there is no policy posted on the GISS web site, solely a robots.txt file? That certainly appears to be the case.

#113. Jerry, Ruedy may not have understood things (although he should have) and, as noted above, I don’t have any particular objection to either how the matter was resolved or how long it took. However, if the roles were reversed, I go back to #70, which is how I hope I would have responded (and I think that I would have):

If the roles were reversed, I’d like to think that I would have sent a relatively pleasant response – saying, thank you for identifying yourself. I’m glad that the access was being undertaken for a scientific purpose, however, the form of access is having an adverse impact on our system (if that actually was the case) and we would prefer to make the information available to you in a more efficient way that places less impact on our server.

I am attempting something similar. I am trying to create a temperature trend for some 5° by 5° areas on the Earth’s surface. I am interested in long term trends so I decided to look at GISS records within the selected areas that extend from 1931 to 2000 with over 90% coverage during this period.

So far I have looked at three 5×5 sections of the globe.

The first was 30°N to 35°N and 95°W to 100°W. In this area I found 29 GISS temperature records that met my criteria. Of these 29, 16 were labelled “rural area”. Only one 1°x1° section within this area was devoid of a record. The warmest 11 year running average year was centred on 1934, the 11 year running average centred on 2001 was 0.31°C cooler than 1934.

The second was 35°N to 40°N and 80°W to 85°W. In this area I found 49 GISS temperature records that met my criteria. Of these 49, 30 were labelled “rural area”. All the one 1°x1° sections within this area had at least one temperature record. The warmest 11 year running average year was 1934, 2001 was 0.24°C cooler.

The third was 45°N to 50°N and 5°E to 10°E. In this area I found 6 GISS temperature records that met my criteria. Of these 6, 1 was labelled “rural area”. 19 of the 25 1°x1° sections within this area had no temperature record. The six I found were: Geneve-Cointr, Saentis (the “rural area”), Strasbourg, Stuttgard, Trier-Petrisb and Zurich. I am not sure if a temperature trend calculated for this area based upon my criteria would be representative of anything. The individual trends suggest the IPCC global trend.

The first two areas clearly show why some Americans are skeptical of the IPCC’s claims.
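As an aside, the centred 11-year running average used in these comparisons is straightforward to compute. A generic sketch in Python (illustrative only, not the commenter’s actual method):

```python
def running_mean(values, window=11):
    """Centred moving average over `window` points; returns None at the
    ends of the series, where a full centred window does not fit."""
    half = window // 2
    out = []
    for i in range(len(values)):
        if i < half or i + half >= len(values):
            out.append(None)                  # incomplete window: no value
        else:
            out.append(sum(values[i - half:i + half + 1]) / window)
    return out
```

Comparing, say, the value centred on 1934 with the value centred on 2001 is then a matter of subtracting the two entries, exactly as done in the comparisons above.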

Re: #63 Rich,

But see Fort Mcmurray.

The Fort McMurray trend ends in 1990 as do most Canadian records in the GISS. What was your point?

So why should my taxes be used to fund a Canadian hobby? More seriously, why does anyone assume that NASA has the servers and personnel needed to reach the level of service you are all saying it has to have?

#118. Eli, the only thing that I’ve requested is a dataset that is under 20 MB in size. The station data sets are under 20 MB. Data sets of that size are routinely exchanged around the world at negligible cost. The data should be available because the GISS data is cited in international studies. If GISS wishes to withdraw its data from consideration in international studies, then it would have no obligation to provide the data to people wishing to examine it.

It’s amazing that advocates of international cooperation and action should be as chauvinistic as you propose, making the nationality of the requester an issue. That sort of chauvinism, if exercised, would both look bad and be pointless, since an American would doubtless make the same request if such an attitude were taken.

As for NASA’s capacity to permit 20 MB to be downloaded, this proved to be well within the capacity of their billion-dollar computing system, as I’ve downloaded the relevant datasets uneventfully once the block was removed.

Eli, you’ve also falsely stated on your blog that my downloading “resulted in denial of service to everyone else”. The webmaster did not say this nor do you have any basis for fabricating this claim. I checked out access to GISS while the download program was in process and there was no difficulty in accessing GISS. A CA reader did so concurrently and vouched for this above. So please don’t make things up.

Steve, what you did was run a bot to scrape data from the GISS site, in violation of their robots.txt policy, apparently without even first making a polite request for the data. Your bot generated over 16,000 data requests on their server while it was running. That is considerably more than “the only thing that I’ve requested is a dataset that is under 20 MB in size.” That has been explained many times here in the comments thread, including by people on “your side.” Your error was a naive error, but it was nonetheless an error, and it got you blocked.

When you wrote to ask why you could not access the result of the graphic-generating scripts, and the webmaster informed you of your error, the proper response was NOT “I’m doing science, so make an exception and unhand your data” while simultaneously implying on your blog that they were doing something underhanded, and then pushing for not only the data (which you didn’t even bother to request, apparently, before running your bot) but the source code that generated that data. The proper response would have been a sincere “I’m so sorry – I didn’t know and I f****d up.”

But the fact is that Dr. Reto did offer to try to get the data for you, and you immediately inflated your request to asking for the source code instead – still with no hint of apology for violating their web policy in the first place. And all the while implying here several times that your blog was shining some kind of light and pushing them to do something they didn’t want to do.

Even better, you have already mentioned that the data has ALREADY been scraped by a CA reader, and is available to you. Why on earth are you making this kind of fuss, creating work for others, using their bandwidth, for data that you acknowledge you have easy access to from another source?

Oh get off it Lee. 16k requests over an 8 hour period is nothing unless you’re running on 1995 hardware or have the world’s worst sysadmin. I run web sites that get hundreds of thousands of hits per day, and the servers are highly underutilized. Web sites all over get scraped every day to retrieve data from them. These programs are not bots by any normal definition of the term.

Maybe Steve should have contacted them beforehand and let them know what he was doing. Perhaps they would have provided an alternative method of getting the data and that would have prevented the issue from occurring in the first place, and if you said that, I wouldn’t disagree with you. But as usual, you only criticize rather than offer helpful alternatives.

1) It is NOT a “bot” or a “web robot”. It is an automated download program which does no “spidering” whatsoever.
2) It does NOT “scrape” anything. It parses an HTML file and downloads a text file. Scraping involves reading data which is not intended to be machine-readable, while HTML, being a variant of SGML, clearly is.
3) I could find no policy document. robots.txt does not apply, as per #1.

I’ve no interest in getting into a fight but I don’t want to see you disseminating disinformation out of ignorance.
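For the record, the two-step retrieval Nicholas describes might look something like this in outline. A Python sketch; the original function was written in R, and the URL layout and href pattern here are illustrative assumptions, not GISS’s actual page structure:

```python
# Sketch of the "parse an HTML file, download a text file" pattern described
# above. The /work/ path pattern and base URL are assumptions for illustration.
import re
import time
import urllib.request

def find_data_link(html):
    """Find the generated text-file link on a station page.
    The href pattern is an assumption about the page layout."""
    m = re.search(r'href="(/work/[^"]+\.txt)"', html)
    return m.group(1) if m else None

def fetch_station(station_url, base="http://data.giss.nasa.gov"):
    """Fetch one station's text data: HTML page first, then the text file."""
    html = urllib.request.urlopen(station_url).read().decode("utf-8", "replace")
    link = find_data_link(html)
    if link is None:
        return None          # missing record: skip and continue, as the R script did
    time.sleep(1)            # polite delay between the two requests
    return urllib.request.urlopen(base + link).read()
```

Nothing here traverses links recursively; the script only follows the one known link per station, which is why it does no “spidering”.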

It appears that your R program indeed constitutes a “robot”
as that term is used in the context described.
—-

#11 Steve McIntyre says:
May 17th, 2007 at 8:03 am

#9. I guess “scraping” describes what was being done to a T. Objection to this word withdrawn.
—-
#21 Interested Observer says:
May 17th, 2007 at 10:01 am

I’m in the web search-engine business so I can probably speak to a few of the issues.

“scraping” is definitely a term of art and is what Steve was doing.

Any programmatic retrieval from a website (rather than manual browsing) is usually regarded as “robot” activity.

If you do programmatic retrieval, it is usually considered good manners to insert a delay of at least 1 second between requests so as not to load the server.
—–
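The one-second-delay etiquette mentioned above is simple to implement. A minimal throttle sketch in Python (illustrative only, not anyone’s actual script; the clock and sleep hooks exist so the logic can be exercised without real waiting):

```python
import time

class Throttle:
    """Guarantee at least min_interval seconds between successive requests,
    per the "good manners" convention for programmatic retrieval."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock      # injectable for testing
        self.sleep = sleep      # injectable for testing
        self._last = None

    def wait(self):
        """Call before each request; sleeps if the last one was too recent."""
        now = self.clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self.sleep(remaining)
        self._last = self.clock()
```

Calling `throttle.wait()` at the top of the download loop would have added the missing one-second spacing with two lines of code.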
#24 Dave Blair says:
May 17th, 2007 at 10:36 am

Usually there is interesting info here and I realize there is frustration on this matter but to be blunt, I think you should move on to something else. This is clearly a bot. The web administrator is just following policy – no conspiracy. I had to draft network policy before and the problem is that you have to treat everyone the same or the exceptions will open back doors that end up being flood gates. My advice is to follow the network procedures and you will have fewer troubles in the long run.
—–
#33 John A says:
May 17th, 2007 at 1:28 pm

Reluctantly I’d have to agree with the NASA webmaster on this one. Apache really isn’t designed to be scraped like that.

Could I suggest that you contact NASA and request ftp (or secure ftp access) to the data?
——
#69 Glen Raphael says:
May 20th, 2007 at 11:51 am

Unfortunately, Lee’s right.

Steve, I know you’ve been mistreated by other researchers, but that’s no reason to go into each new encounter with a chip on your shoulder. You were running what you now realize was a poorly-written site-scraping program because it didn’t include even so much as a one-second delay to allow other requests into the queue. That’s bad practice. Good webmasters put policies in place to interrupt programs like yours. (If you automatically and unquestioningly let people run programs that absorb even 1/4th or 1/16 of your bandwidth, pretty soon you’ve got no bandwidth left. Best to nip such problems in the bud.)

So they were correct to block you and when you complained about the block they resolved the issue in less than a day; not bad at all!
—–

I didn’t say that 16000 hits was excessive. I said that it constituted something more than simply requesting a 20 MB file, which is what Steve keeps claiming is all he did.

Steve was scraping data from a directory which was blocked by the robots.txt policy. THAT was his naive error, and what got him blocked – and it is his response to being blocked for violating that robots policy that is at issue in my post.

Lee, I suggest you leave the web programming discussion to those who know what they are talking about. Web servers can handle multiple simultaneous requests. Steve was not making multi-threaded requests en masse. It was a single-threaded program whose requests took at least half a second and maybe longer (not knowing the amount of time each request ran, it’s hard to say). If one single-threaded program can bring a web server to its knees, then the administrator should be fired. Whether there was a delay in Steve’s program or not would make no difference whatsoever to the ability of the server to service other requests, unless they only have a single connection to their database or file system to retrieve the data. Again, in any real organization, you would be fired for having such a system, even if it wasn’t designed for that exact purpose. In addition, there’s no way on God’s green earth that Steve’s program was using 1/16 of their bandwidth, or any number even close to that, unless their connection is a 56K ISDN line. Again, an organization like NASA would be very unlikely to have such a setup as this.

You can quote as many people on this thread as you want. It won’t change the facts and most of their comments are just plain wrong.

GISS appears to run on Apache, whose default installation sets MaxClients to 150 concurrent clients. Steve was using only one of these connections repeatedly. It is extremely unlikely that his program created any performance problems on the server, and it likely in no way prevented other users from using the site. Based on the correspondence, the sysadmin reviewed the logs, saw a large number of hits from one IP address, and blocked it because he didn’t know what was going on. Nothing in particular wrong with that and, as I stated in a previous thread, if Steve had contacted them prior to running the program, it likely would have been no big deal.

But let’s not make it out to be more than it is. It’s not a bot. It scrapes HTML looking for a particular link, then downloads the text file that link points to. Had they not looked at their logs, they wouldn’t have even known he was downloading the data, which, btw, is not against their policies.

Requesting a 20MB file in one chunk or in 2k chunks makes no difference. It’s still a 20MB file in the end. Anything else is just a semantic argument.

As to your other point, you seem to have a problem with Steve’s response to everything that happens to him. As he has stated repeatedly, he doesn’t get mad, and although he may make cynical or snide comments about his experiences, it’s just who he is. Why you have a problem with that is beyond me.

I’ll chime in briefly: a few people have put together a story based on a technical definition of ‘bot’/’robot’ and of robots.txt exclusion.

Here is the complete Web Robots definition of a ‘bot covered by robots.txt:

A robot is a program that automatically traverses the Web’s hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced.

Note that “recursive” here doesn’t limit the definition to any specific traversal algorithm; even if a robot applies some heuristic to the selection and order of documents to visit and spaces out requests over a long space of time, it is still a robot.

Normal Web browsers are not robots, because they are operated by a human, and don’t automatically retrieve referenced documents (other than inline images).

Web robots are sometimes referred to as Web Wanderers, Web Crawlers, or Spiders. These names are a bit misleading as they give the impression the software itself moves between sites like a virus; this is not the case, a robot simply visits sites by requesting documents from them.

Comments:

1) (Side note) This definition is a bit dated; modern browsers DO sometimes automatically retrieve referenced documents beyond images, “pre-caching” what they anticipate the user may wish to view.

2) AFAIK (Nicholas can say definitively), the script in use here is NOT automatically traversing the hypertext structure. It is processing a specific and limited list of object names retrieved from a text file, generating server requests based on that list, and interpreting the server results to retrieve specific text files. It is not unlimited.

3) Nor is the script recursively retrieving documents. It knows a specific set of expected query-response instructions required to retrieve the text document generated by the original request. That’s not recursive. The script would need to be modified if the GISS site added another layer of links to get to the data needed.

At best/worst (depending on your POV), the script is an intelligent agent. I just checked and Steve’s script’s ONLY “non best practice” is the lack of a delay loop. Other download agents pay zero attention to robots.txt! You can test this yourself by beginning at a page that DOES have multiple links to the GISS station data, such as http://global-warming.accuweather.com/2007/02/a_good_man_is_hard_to_findif_y_1.html and see what your favorite recursive download agent does! You’ll find it retrieves by default, without delays, MORE than one of the pages at a time.

The point is, Steve’s script was a human-initiated, very specific and limited data-grabbing tool designed for a specific purpose. Not recursive, not unlimited. A similar script could use simulated mouse clicks and would be indistinguishable from a human other than in speed of operation. Robots.txt is not designed to block such operations; it is simply a matter of politeness to introduce a delay in the script.
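Whether a given path is off-limits under a robots.txt file can at least be checked mechanically. A sketch using Python’s standard-library parser; the Disallow rules below mirror the two directories named in the webmaster’s email, not the full actual GISS robots.txt:

```python
# Check paths against robots.txt rules mechanically. The rules here are a
# reconstruction from the webmaster's email (/work/ and /cgi-bin/ forbidden),
# not the real file.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /work/
Disallow: /cgi-bin/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "http://data.giss.nasa.gov/cgi-bin/foo"))   # False
print(rp.can_fetch("*", "http://data.giss.nasa.gov/work/x.txt"))    # False
print(rp.can_fetch("*", "http://data.giss.nasa.gov/gistemp/"))      # True
```

Of course, this only tells you what the file forbids; whether a non-spidering download script is bound by it at all is exactly the question being argued in this thread.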

And there is no question: if the server script were designed for normal web access, it would use one or both of the following techniques: caching of the expensively-produced graphics, and delay of graphic generation until requested by the user agent. This script does neither.

This is why I recommend giving everybody a break: Steve’s script was completely reasonable for his purpose, and the server script is completely reasonable for its original (internal) purpose. The ONLY problem here is the “misconnect” of external access to a server script not properly designed for external access. And even that is not really a big issue.

(Even if the data-access script in its current form were distributed worldwide and used to simultaneously download multiple copies of the data, the server script could be easily adjusted to cache the generated results, completely eliminating any CPU load that results. That tiny change (a few lines of code) would make the script highly scalable. And yes, caching easily handles updates to the source data files.)
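The caching adjustment suggested in the previous paragraph could be sketched like this (Python, with illustrative names; keyed on the source file’s modification time so that updates to the source data invalidate the cache automatically):

```python
# Sketch of the "few lines of code" caching fix described above: regenerate a
# station's output only when its source data file has changed. Names are
# illustrative; this is not GISS's code.
import os

_cache = {}  # (station_id, source_mtime) -> generated output

def cached_generate(station_id, source_path, generate):
    """Return the generated output for station_id, calling generate()
    (the expensive graph/text step) only when the source file is new."""
    key = (station_id, os.path.getmtime(source_path))
    if key not in _cache:
        _cache[key] = generate(station_id)
    return _cache[key]
```

A production version would also evict stale entries for old mtimes, but even this simplification removes the repeated CPU cost of regenerating identical output for every request.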

It is a blatant mischaracterization. There was never any claim that Steve’s script caused a denial of service.

I tried to point out, from the admin’s perspective, repeated requests over a long period of time, using an “automated user agent”, to a CPU and memory intensive resource, can be seen as such an attempt and tried to explain the motivation behind blocking access to something other than active data hiding. Steve M. immediately saw and understood that point and fixed the title of the post.

I see first hand every day highly paid “professionals” with a lot of resources at their fingertips write dumb scripts, open security holes, fail to authenticate etc. Heck, the other day, I noticed that, due to a cut and paste error on the part of a Human Resources programmer, I was able to edit his online profile. I have no doubt NASA has its share of such people. In fact, this particular admin demonstrated some incompetence by blocking robots from accessing robots.txt! (See #6).

FWIW, I find #127 to be right on the mark.

I would like to see the GISS adopt data dissemination practices along the lines of the colossal (in comparison to the climate data at GISS) U.S. Health and Retirement Study. Every U.S. taxpayer (and I am one) ought to be proud of the way public monies are being used to inform research on health, income and wealth in retirement. A simple registration provides any researcher with all the public data, all the documentation, source code etc. All adjustments are documented, all changes in participants documented. “Data alerts” are available to document adjustments/corrections. I could go on and on. Everyone should check out http://hrsonline.isr.umich.edu/data/index.html and see an example of good data dissemination practice. Just look at the sheer number of variables and ask if there is any reason GISS cannot provide even a fraction of the volume of information that is routinely provided by the HRS?

And then ask, when GISS data or other research, funded by public monies, is being used to further quite intrusive policy proposals, shouldn’t data used in that research be just as readily available to any researcher as the HRS data set is without having to write scraping tools?

Or, take a look at the U.S. Census web site. Examples of good data stewardship by U.S. government agencies abound. The case of GISS is not one of them.

Steve M. or John A.: My earlier comment got tangled up in the spam blocker. I cannot access email right now, so I am posting this to ask you guys to check the queue and unblock it if the post is acceptable. Thank you. — Sinan

A problem with the definition that you found is that it is
limited to a particular kind, or class, of bots, and may
mislead those who may not be acquainted with wider usages
of the term.

Perhaps the discussion at http://www.atis.org/tg2k/_bot.html
will give you a broader sense of its usages, although, strictly
as definitions go, even that discussion is too narrow to
constitute a proper definition.

There are many kinds of bots. Email programs are kinds of bots.
Part of the GISS software that we have been discussing is
a kind of bot. Steve’s program, by automating the issuance of
requests to the GISS server, constitutes another kind of bot.

As for the GISS software under discussion, its primary output
is the .gif image; the text file is a secondary output. Let me
call to your attention the wording: “Click a station’s name
to obtain a plot for that station.”

As for why the GISS software does what it does in the way
that it does, IMHO such matters may be curiosities that
might be more suitable for a programmers blog than for CA.

The USGS seamless data service is one. Quite nice, with image/elevation data of the world down to 1 arcsecond. They even include entries with canopy information, as far as I can tell. You just draw a box on an image and it’ll format and then dump the data, with a variety of formats available (and plenty of description docs, though it took a bit of effort to figure it all out).

Geological surveys have long traditions of excellent archiving practice not just in the US. This does not necessarily extend to climate studies done by USGS scientists, where even USGS scientists can adopt reprehensible Team practices, as I will illustrate some time soon.

Yeah, but, as I’ve mentioned before, this data does not fall into the “controversial” category. In fact, it is really developed for hi-res uses such as radar, i.e. defense. The publicly available data is a subset, sort of, of the classified databases (which I have no need of).

The definition I quoted is the only pertinent one: it is the definition that relates to robots.txt policies. It doesn’t matter what other kinds of bots exist or may be created.

This is not a “someone should have known” issue. The issue is not a robots.txt issue at all. This script is in an entirely different category. If it WERE a robots.txt issue, you would see a variety of functions and tools to help script-writers handle robots.txt policies. They aren’t there to be found. The available programming toolkits for web page retrieval pay no attention to robots.txt! It is not even discussed. (Take a look at the most popular web programming language, PHP (www.php.net): lots of functions and tools for retrieving web pages and data, and nothing about robots.txt. The same goes for the various web-scrape extensions for web browsers; few pay any attention to robots.txt. “FasterFox” is one of the few that does, and it requires custom robots.txt support to do so!)
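Not every toolkit ignores it entirely, for what it’s worth: Python’s standard library, for one, ships a robots.txt parser, and honoring the policy takes only a few lines when such support exists. A sketch, using the two directories named in the webmaster’s letter quoted at the top of this post (everything else here is generic):

```python
from urllib.robotparser import RobotFileParser

# Parse a robots.txt body directly (offline here; RobotFileParser can
# also fetch one from a site via set_url() and read()).
ROBOTS_TXT = """\
User-agent: *
Disallow: /work/
Disallow: /cgi-bin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A well-behaved robot would test each URL before requesting it.
print(parser.can_fetch("*", "http://data.giss.nasa.gov/cgi-bin/somescript"))  # False
print(parser.can_fetch("*", "http://data.giss.nasa.gov/gistemp/"))            # True
```

Whether a one-off data-fetching script is the kind of “robot” that must do this is, of course, exactly what this thread is arguing about.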

As for the final snipe in #130… just as HOW the data is produced is only of interest to those who are not engineers, it’s just as immaterial which output is “primary” or secondary. Only our eyes know the difference; the server was actually less loaded by Steve’s script than by page “hits” from a normal web browser. Both request the base page plus another file — and the text file retrieved by Steve is smaller than the graphic! So his script saved on bandwidth even though it cost the same in CPU cycles.

Give it up. There’s no reason to berate anyone. Everyone was acting reasonably according to their own paradigm; we’re just looking at the unintended side effects of reusing an internal script for an external purpose that exposes the inefficiencies of quick “hacks.”

I apologize if, as seems to be the case, what I wrote appears to be
sniping, and/or berating. I had, and have, no wish to do either.

I had hoped concisely to convey some reasons for different interpretations
than the ones you had expressed about some items. Perhaps I was too concise.

For example let me try a different way to express my paragraph:

‘As for the GISS software under discussion, its primary output is the .gif
image; the text file is a secondary output. Let me call to your
attention the wording: “Click a station’s name to obtain a plot for that
station.”‘

Alternate form of expression:

1. Until a station is selected, no image or text file is generated.
2. When a station is selected, both image and text file are generated,
(unless a previous user had sufficiently recently selected that same
station (and with the same option) that the temporary files remain from
their request) and the image is presented, but not the text file.
3. The text file will not be presented unless, and until, an additional
request is made to “download monthly data as text”.
4. Therefore, it seems to me that the application treats the image as the
primary output, and the text file as (optional) secondary output.

End of alternate form of expression.

In any case, we have different interpretations of the items that I have
mentioned, as well as of others. It certainly is not necessary that
we reach agreement on any of them.

I do feel that throughout this thread, seemingly more than other threads
here, a relatively excessive number of firmly expressed convictions have
been based on weak evidence, and/or extemporaneous guesses. From time to
time that feeling prompts me to post something to suggest that alternate
interpretations may be at least as plausible. That, rather than sniping
or berating, is what I hoped to do.

“Web servers can handle multiple simultaneous requests. Steve was not making multi-threaded requests en masse. It was a single-threaded program that took at least 1/2 second per request and maybe longer (not knowing the amount of time it ran, it’s hard to say). If one single-threaded program can bring a web server to its knees, then the administrator should be fired.”

I never said otherwise. In fact, I explicitly said that the volume of requests is not a problem and was not THE problem. I suggest you re-read what I’ve actually said.
—
“Based on the correspondence, the sysadmin reviewed the logs, saw a large number of hits from one ip address, and blocked it because he didn’t know what was going on. Nothing in particular wrong with that and as I stated in a previous thread, if Steve had contacted them prior to running the program, it likely would have been no big deal.”
Which is basically what I said, of course. Steve’s response to being blocked, posting escalating “requests” here while implying an active attempt at blocking on GISS’s part, is the problem. And that is what my post said.

re 126 – Steve’s script was not just requesting data. Thus the 6,000 requests for less than half the file. But again, the issue isn’t the script; the issue is Steve’s response when his script was blocked by the web admin. And Steve didn’t just make a snide comment – he actively implied misconduct on the part of GISS. I agree that that is “just who he is.” It’s part of why I lost respect for him quite some time ago.

JerryB, I appreciate your attempt at expressing your thought more clearly. However, I understood you correctly the first time. Unfortunately, your guess as to how this works is incorrect.

It’s true that alternate interpretations are plausible for various aspects of this whodunit. However, not everything is up in the air. What you are proposing is simply untrue.

(You can verify the following with any web developer tool or “header extension” that displays the interaction between your browser and the server. The nicest one I’ve found is the “net” tab in the Firefox “Firebug” extension. It shows each interaction as well as the timing involved.)

Here’s a corrected version of your sequence:

1. Until a station is selected, no image or text file is generated.
2. When a station is selected, both image and text file are generated,
(unless a previous user had sufficiently recently selected that same
station (and with the same option) that the temporary files remain from
their request). Note: I’m pretty sure the PS, GIF and TXT files are always generated by their script, no matter what.**
2a. The main page html is sent to the browser, with links to the image, PS and TXT files. The only difference between the links is that the GIF link uses an “IMG” tag while the other links use an “A” tag.
2b. If the browser has image display enabled, it requests the GIF file from the server.
2c. The server sends the GIF file and the browser displays it appropriately.
3a. If the browser is pre-caching linked files, it may also choose to request the text file and the PS file from the server. Most browsers would not do this by default.

Can you see why I am saying your assumptions are wrong?

To explain myself even more clearly (perhaps this will be helpful for other readers’ understanding), this is simply how the web works:

* Your first “click” or whatever sends a request to the server. It responds with some HTML that may contain embedded links and such. (BTW, not just graphic and other links. Things like “CSS” and “JavaScript” (js) files are also linked… as well as audio files, flash animations and more.) There is no such thing as embedding graphics directly into that initial HTML download!

* It is then up to your browser to interpret what it received, and request any/all of the other files that it wants.
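For what it’s worth, the split described above, between links a browser fetches automatically (IMG tags) and links it fetches only on a click (A tags), can be seen mechanically by parsing the HTML. A minimal sketch in Python, with made-up file names standing in for whatever the GISS page actually serves:

```python
from html.parser import HTMLParser

class LinkScanner(HTMLParser):
    """Collect IMG sources (fetched automatically by browsers) separately
    from A hrefs (fetched only when the user clicks them)."""
    def __init__(self):
        super().__init__()
        self.auto_fetched = []   # requested by the browser on page load
        self.click_only = []     # requested only on user action

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and "src" in attrs:
            self.auto_fetched.append(attrs["src"])
        elif tag == "a" and "href" in attrs:
            self.click_only.append(attrs["href"])

# Hypothetical page shaped like the station page described in this thread.
html = """
<p>Click a station's name to obtain a plot for that station.</p>
<img src="work/station_123.gif">
<a href="work/station_123.txt">download monthly data as text</a>
<a href="work/station_123.ps">PostScript</a>
"""
scanner = LinkScanner()
scanner.feed(html)
print(scanner.auto_fetched)  # ['work/station_123.gif']
print(scanner.click_only)    # ['work/station_123.txt', 'work/station_123.ps']
```

The GIF is “primary” only in the sense that browsers request IMG sources without being asked; nothing in the HTML itself ranks the outputs.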

Final note from (**) above. Server scripts do NOT have to generate the graphics (or even the auxiliary text file) at the same time as that original HTML. This was part of the technical discussion in the thread above. Even more important, server scripts can easily remember all (or almost all) of the graphics generated by a page, and re-send what was done before. It’s called caching, it’s a simple technique, and it is very efficient. Even if there are 20k of these pages and each page generates 200kb of text/graphic data, that’s only 4GB of storage to cache 100% of everything. 4GB costs us about US$1 these days.
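The caching idea really is only a few lines. A toy sketch in Python (the generation function and station IDs are invented for illustration; the point is memoizing on the query parameters so repeat requests cost nothing):

```python
import functools

# Hypothetical stand-in for the expensive plot/text generation step.
def generate_station_outputs(station_id, option):
    generate_station_outputs.calls += 1  # count how often we really generate
    return {"gif": f"{station_id}.gif", "txt": f"{station_id}.txt"}
generate_station_outputs.calls = 0

# Memoize on the query parameters: identical requests are served from cache.
cached_generate = functools.lru_cache(maxsize=None)(generate_station_outputs)

cached_generate("403712600005", "monthly")
cached_generate("403712600005", "monthly")   # cache hit: no regeneration
cached_generate("425911650000", "monthly")   # new parameters: regenerate
print(generate_station_outputs.calls)        # 2
```

A real server-side cache would persist files to disk and evict old entries, but the principle is the same: only distinct parameter combinations ever cost CPU.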

Again, I’ve strayed from my consistent main point: Steve’s script was reasonable for its purpose. The web server script was reasonable for an internal purpose. It was NOT designed properly to handle external requests yet was pressed into use for that, probably inadvertently. It’s not a big deal. I’m sure these kinds of things happen a thousand times a day all over the planet.

And Steve didn’t just make a snide comment – he actively implied misconduct on the part of GISS.

I reviewed my postings. I don’t see any in which I have alleged or implied “misconduct” by GISS. As early as # 5, I said:

I understand why a webmaster would block access from an unidentified person doing this. But once I had identified myself, I think that they should either provide me with the data or allow me to continue downloading.

I insisted on one or the other and the matter was expeditiously resolved, as I’ve pointed out. Lee, I don’t understand your beef, other than for the pure pleasure of whinging.

Unlikely. Steve was not blocked because he was a bot. Steve was blocked because the sysadmin was alarmed at seeing a large volume of requests from a single IP address.

So your complaint, Steve, is that they didn’t immediately make an exception at 6 am to their bots policy for your bot, and that it took them until 2 pm that same day to offer to get the data for you, and then another two hours to offer full access with a request to modify your practice a bit. Dude, you ran a bot in violation of their policies, they blocked you, and it took less than a single working day to come to a solution and give you access to the data.

Again, unlikely to be a response to a bot. Bots wouldn’t make 16,000 requests to the same page, varying only query parameters. A simple examination of the logs would show this to be true.

Steve, it does not matter if your activity adversely affected their server. What matters is that the kind of activity you were doing CAN have such an effect, so that kind of activity is forbidden. The fact is they DID create such an exception, on the same day. They did so despite your whinging at them here, about their valid application of a valid policy. You have no valid complaint about this treatment.

Untrue. This kind of activity is not forbidden. Bots are forbidden. Steve’s program was not a bot. In addition, the type of activity Steve was doing should in no way have affected their server. If it did, then they need a) a new server, b) a new sysadmin, or c) both.

Steve, what you did was run a bot to scrape data from the GISS site, in violation of their robots.txt policy, apparently without even first making a polite request for the data. Your bot generated over 16,000 data requests on their server while it was running. That is considerably more than “the only thing that I’ve requested is a dataset that is under 20 MB in size.” That has been explained many times here in the comments thread, including by people on “your side.” Your error was a naive error, but it was nonetheless an error, and it got you blocked.

Again with the bots thing and again wrong. Also, should Steve have generated the requests when the server was NOT running? Perhaps GISS would also be better served to have such a 20MB file available for download on a FTP server, rather than their current method. Given that such doesn’t exist, this is one way to get the data.

You were running what you now realize was a poorly-written site-scraping program because it didn’t include even so much as a one-second delay to allow other requests into the queue. That’s bad practice. Good webmasters put policies in place to interrupt programs like yours. (If you automatically and unquestioningly let people run programs that absorb even 1/4th or 1/16 of your bandwidth, pretty soon you’ve got no bandwidth left. Best to nip such problems in the bud.)
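For readers wondering what “even so much as a one-second delay” amounts to in practice, here is a minimal sketch in Python; the fetch function is a hypothetical stand-in for whatever actually retrieves one record:

```python
import time

def polite_download(station_ids, fetch, delay=1.0):
    """Fetch one record at a time, pausing between requests so the
    server is free to interleave other visitors' work."""
    results = []
    for i, sid in enumerate(station_ids):
        if i:                    # no need to sleep before the first request
            time.sleep(delay)
        results.append(fetch(sid))
    return results

# Demo with a stand-in fetch function and a tiny delay.
records = polite_download(["a", "b", "c"],
                          fetch=lambda sid: sid.upper(),
                          delay=0.01)
print(records)  # ['A', 'B', 'C']
```

Whether such a delay was actually *needed* here is disputed in this thread; the point is only that adding one is trivial, and it is a cheap courtesy when hitting someone else’s server thousands of times.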

I never said otherwise. In fact, I explicitly said that the volume of requests is not a problem and was not THE problem. I suggest you re-read what I’ve actually said.

I have re-read what you said. First you said he was blocked because he was a bot. You also said inserting a delay was necessary to allow other requests into the queue. You were wrong about it being a bot. Steve’s program did not prevent others from getting into the queue. While Steve’s program is single-threaded, making synchronous calls to the server, the server is quite the opposite. It is multi-threaded, allowing for a number of simultaneous requests and perfectly capable of dealing with them up to its capacity to process.
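The multi-threading point is easy to demonstrate: Python’s standard library, for instance, ships a ThreadingHTTPServer that hands each connection its own thread, so a strictly sequential client ties up at most one of them. A toy demonstration (nothing to do with GISS’s actual stack):

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

class Echo(BaseHTTPRequestHandler):
    """Trivial handler that echoes the requested path back to the client."""
    def do_GET(self):
        body = self.path.encode()
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
    def log_message(self, *args):  # keep the demo quiet
        pass

# ThreadingHTTPServer spawns a thread per connection, so one slow or
# synchronous client cannot block the others.
server = ThreadingHTTPServer(("127.0.0.1", 0), Echo)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

# A single-threaded client like Steve's: strictly one request at a time.
replies = [urllib.request.urlopen(
               f"http://127.0.0.1:{port}/station/{n}").read().decode()
           for n in range(3)]
print(replies)  # ['/station/0', '/station/1', '/station/2']
server.shutdown()
```

While those three sequential requests run, every other thread of the server sits idle and available, which is the commenter’s point: a synchronous client consumes one slot, not the queue.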

Now let’s examine the rest of your comments.

Steve, you look paranoid and irrational here. And I’m being polite.
Stop whining.
So apologize, and stop complaining.
Your error was a naive error
Steve, I know you’ve been mistreated by other researchers, but that’s no reason to go into each new encounter with a chip on your shoulder.
And Steve didn’t just make a snide comment – he actively implied misconduct on the part of GISS. I agree that that is “just who he is.” Its part of why I lost respect for him quite some time ago.

So, by your words, I have read misinformation about the type of program Steve was running, misinformation about why Steve was likely blocked, and character assassination ranging from his being a whiner, a complainer, and a bad programmer, to having a chip on his shoulder and engaging in character assassination when things don’t go his way.

Lee, I think you are projecting.

Oh, and all of this followed up by “gladys” with Lee being pronounced fair and balanced. Lol.

When you wrote: “What you are proposing is simply untrue.”,
to which of my statements were you referring?

The essence of your “alternate form of expression” is not true. My response replicated your four-part statement with necessary corrections to bring it in line with reality. Your statement #1 was correct as-is but everything else needed changes or wholesale replacement.

When you wrote: “Can you see why I am saying your assumptions
are wrong?”, to which of my assumptions were you referring?

Uhh… the primary assumptions used to support your thesis (about primacy of the graphic file, to which I was originally responding) and the assumptions about how the web works as stated and expanded on in your message?

Your claim about primacy of the graphic file is built on a set of assumptions (not knowledge) about how the web works. You attempted to further explain your “model”. Unfortunately, your model is wrong. The web does not work the way you think it does.

This is not a matter of interpretation.

I’m not sure why you are attempting to parse this into ever finer detail.

For the moment, this thread has no specific knowledge about the server script, only educated and uneducated guesses.

On the other hand, when it comes to the info delivered to browsers, and how browsers handle that info, there’s no need for guessing or interpretation.

Your statements make it clear that you do not know how web pages are transferred to a browser. Hence, you are making guesses and assumptions, and attempting to create “room” for alternate interpretations. That is simply incorrect. It’s not worth arguing about. If you want to learn something about this, that’s great — as I recommended above, a great way to learn is to use appropriate monitoring tools to see how it works.

I have no compulsion to attempt to “force agreement” on anything. On the other hand, I do tend to respond when people make false assertions, or attempt to tear down the truth, in areas where I am able to bring light to a subject.

I don’t mind suggestions of alternate interpretations — when such is at least possible or plausible. However, when such suggestions are based on ignorance of reality, the suggestions feel more like an attempt to be argumentative where there’s really no room for argument.

Over the top analogy:

Person A: 2+2=4, and I think green is better than red
Person B: I don’t think 2+2=4, and I think red is better than green
Person A: You’re wrong about 2+2. It really is 4. If you want to think red is better,
more power to you. That’s a matter of interpretation.
Person C: A and B, isn’t it at least plausible that there’s an alternate
interpretation? What if 2+2 is actually 333?

JerryB, your “alternate interpretation” and further explanation was an attempt to explain
an alternate reality for how the web works. That’s a settled issue. And your suggestion is incorrect.

Want to comment on why the server script is written the way it is? Go right ahead. Any guess is as good as another. Doesn’t matter. The script would be fine for internal use but is badly written for external use. Period. And yes, professionals can state that with some certainty because we can see, to a large extent, how the system actually works.

You are right about one thing: some people expressed “convictions” firmly… that were based on a guess rather than understanding. To a large extent, they were just sheep following along what someone else said.

It’s human nature: we choose sides based on what we believe rather than what we know.

I hope this response is somewhat satisfying — I am unlikely to have time to check back in for a bit.

MrPete, Jerry, I think we’ve taken this discussion as far as can possibly be productive. I agree with MrPete’s point to “give everyone a break”. I did make a mistake – I didn’t realize that requesting the text file always caused the generation of a GIF and PS file. If I had looked more carefully at how their server was set up, I would have seen it, but that seemed an illogical way to implement the script, so I assumed it wasn’t so. This caused more load than I had intended. Unfortunately, there was no other way to fetch the data.

Let’s just say that, while I don’t agree with all the technical diagnoses we’ve seen here, there are some things we can all agree on. It was neither surprising nor totally unreasonable that Mr. McIntyre was blocked. It was unfortunate, however, since he was engaged in serious scientific research and that’s something that we, and I think NASA, want to promote. The situation has now been resolved – not brilliantly, but resolved, nonetheless. There’s little point continuing this.

It’s probably my fault this thread is still going; I was silly enough to respond to Lee, and it was an emotional response (although, I think, rationally argued). Who cares if he has the last word. And, while I don’t agree with JerryB, he has been polite and rational, and I understand his point of view. Let’s chalk this one up to experience and “move on”.

The additional details which you mention are not corrections; they are
simply additional details, none of which change the sequence in which the
plot, and the text file, are presented to the user.

From the page with the comment “Click a station’s name to obtain a plot
for that station”, the user can display the image of the plot with one
request. The user must issue a second request to display the text file.
The combination of the wording of that comment (which does not even
mention the text file), and the sequence of steps by which the user can
access the two kinds of files, indicate to me that the image of the plot
is the primary output. There is no alternate reality involved.

Nicholas,

I would say this thread has been one from which to “move on” for quite
some time, as you may surmise from my comment #93 to bender. I’d be
delighted to do so.

…the user can display the image of the plot with one
request. The user must issue a second request to display the text file…indicate to me that the image of the plot is the primary output.

Then our disconnect has been over the meaning of “primary”, and over who or what is a “user.”

Your response indicates your concern is with respect to a human user. I would certainly agree with you that to a human user, the GIF file is primary! The vast majority of people will read the intro, will have a browser that is set to automatically display images, and will see the GIF file.

But that has nothing to do with this thread which addresses the challenges of script-based web access and the robots.txt rules.

It has been shown that the “must obey robots.txt” line is NOT drawn between human and script-based access. Only scripts that automatically traverse a site and recursively retrieve pages are subject to it — clearly including the Yahoos and Googles of the world. Many browser-based tools should be subject to robots.txt in theory, but most ignore it, since a human has initiated the scan. Steve’s script does NOT traverse the site nor recursively retrieve whatever it finds, and is NOT subject to robots.txt, even in theory.

And, it has been shown that the server script generates all three outputs (graphic, PS and text data), whether or not any or all of them are retrieved. In that sense, one is not less “important” than another.

So. We better understand what is and is not subject to robots.txt. We understand that this web page was never designed for external traffic levels, and in fact is rather inefficient for anything beyond one-time data generation. And we understand that everyone involved needs a break.

People who have been involved with applications of general purpose
computers for a few decades may recall times when almost all such
computer applications were “batch” applications. Input files might be
decks of cards, or tape and/or disk files, and usually sorted in some
particular sequence. Output files usually went to tape and/or disk
and/or a printer.

Eventually, and at first very gradually, another kind of application
came about, perhaps called “online”, or “real time”, or “interactive”.
Instead of the input being a deck of cards, or files on tape and/or disk,
input would be typed “online” by a person using some kind of terminal
connected to the computer. Output would go to the same terminal from
which the input was received. The sequence of transactions would
commonly be relatively random in comparison to batch applications.

In both batch, and interactive, applications, some of the input would
consist of updates to other input files, and some of the output would be
updated versions of those files. Such files would commonly reside on
tape and/or disk.

While batch processing would usually be much more “efficient” per
transaction, the increases of computer speed, and decreases of cost per
transaction, which accompanied the changes of technologies over the
years, led to more, and more, interactive applications. Programming
techniques which would be considered unacceptably inefficient in batch
processing, may be considered acceptable in interactive processing. Some
of the comments in this thread might appear in a different light if such
differences of kind of applications are considered.

In the context of this thread, the GHCN files would fit a batch paradigm,
the GISS application would fit an interactive paradigm.

From this perspective, Steve, by automating his requests to the GISS
server, would be regarded as trying to use an interactive application as
if it were a batch application. I would guess that Steve grew up in some
place other than a computer room.

When the GISS webmaster, now website curator, noticed the quantity, and
frequency, of Steve’s requests, he accurately deduced they were from an
automated process, i.e. some kind of “robot”. He need not consider
whether Steve’s bot fit someone else’s definition of a “web robot”, he
need only consider whether its behavior fit his perception of the
behavior of an “illegitimate” robot. It seems that it did.

#150. Jerry, what’s your point? That undoubtedly was the webmaster’s perception and no one, including me, has argued against the webmaster blocking pending determination of what was going on. However, other things do not follow from that.

1) My downloading was not a violation of any notice on the GISS website. They didn’t say: “Downloading of data is for only recreational purposes, large amounts of data for scientific purposes is prohibited without the express written consent of the GISTEMP proprietors” or some equivalent explicit notice. If a notice is to be effective, it has to be made.

2) For all the reasons set out by others, the robots.txt policy simply doesn’t necessarily apply to the form of downloading which I was carrying out. The description of robots.txt policies links them to web crawlers, not to special purpose automated downloads and the robots.txt is explicit notice only to the web crawlers. If GISS ALSO wants to block automated downloads, then they need to place a different notice on their website explicitly prohibiting such behavior. Since there was no applicable notice, I wasn’t “violating” any policy of which I had notice. That doesn’t mean that webmaster’s initial blocking was unreasonable, only that the downloading was not a “violation” of their robots.txt policy, and, even if it was, it wasn’t a “blatant” violation.

3) There is no evidence that the downloading caused ANY disruption of service to anyone. No one has suggested that it did, including the webmaster, and later testing of GISS access while the download was operating, by a CA reader and myself, confirmed this. Despite this, Eli Rabett has claimed that it did on his blog. Not only did it not cause any disruption of service, there was no intention on my part to disrupt service, and I didn’t. It wasn’t a “disruption of service attack,” inadvertent or otherwise. It was simply behavior that attracted the attention of the webmaster as anomalous. GISS agreed that I could continue downloading in exactly the same fashion, which surely proves the point.

Again, we know why the webmaster intervened. But so what? what’s your point?

Not so much a particular point, but rather to suggest some considerations
which people might find useful in interpreting some of the comments in
this thread, and to do so without criticizing any particular comment, or
commenter.

Perhaps the nearest I got to a particular point was the statement: “Some
of the comments in this thread might appear in a different light if such
differences of kind of applications are considered.”

It’s not clear to me what usefulness the “different light” sheds; comments like this feel like fog, not sunshine:

In the context of this thread, the GHCN files would fit a batch paradigm, the GISS application would fit an interactive paradigm.

Actually, not. That’s the problem! The GISS application was also written using a batch paradigm: each query generates all the results. That works great in a batch run but is inefficient for interactive access.

Perhaps you’re not the only one who relates better to card decks and tapes? I too lived (actually, grew up as a kid) in that world… so let’s have some fun and play out this scenario in that world of long ago…

The GISS application was written as a card deck (web script) that processes GHCN data. The instructions for the deck tell you to provide a location to run it against. And the job control sheet allows you to request any or all of three output files: graphical in tape (GIF) or disk file (PS) form, or data on a few punched cards (TXT).

The internal team uses it occasionally; when they do, they always generate all three output files. Because they use it that way, nobody notices that there’s a bug in the deck: it ALWAYS produces all three outputs, but who cares, since that’s what they need anyway.

Then somebody made this card deck available to all visitors to the computing center, and you could take your output home at no charge. They also posted lists of the available input parameters so you could request runs for any and all local areas. (The lists were provided as a plain printed sheet, as well as a map with a lookup table.)

The computer center has lots of spare tapes, and everyone likes tape anyway (the reels make an interesting wall decoration, and are not too bad for playing frisbee)… so people would stop by and request a few runs with tape output… some only wanted one tape, some would create a bunch of runs and generate as many tapes as they could carry home. Probably some kid stopped by and requested a few hundred runs, just to watch the tape reels stack up ;). Nobody paid any attention to the few wasted disk files and output cards.

Steve’s got some work to do — he needs the ten punch cards produced for each run of the deck. No tapes, no disk file. So, he sends a friend (ScriptKid) down to the computer center to hang around and request runs for each location on the posted sheet, and collect the resulting cards.

One of the computer operators sees the steady level of computer activity, and sees one tape drive going crazy but never being unloaded. And he’s mystified why ScriptKid only wants the punched card output. So he denies further run requests from ScriptKid.

Hopefully, the analogy makes the problem clear. Back then, the waste of generating so much output would have been obvious. If people only want a tape, or a few cards, why generate everything every time? [Proper coding] And if most people are asking for certain local areas, why not just save the most popular tapes and dupe them, rather than regenerate everything every time? [Caching]

The GISS folks decided the easiest thing was to let Steve request the runs he needed and toss all output other than the punched cards. Inserting the proper cards to only produce needed output is not worth their time. They did ask that he request his runs during off hours.

Some onlookers complained that Steve’s request was illegal according to the I ROBOT policy posted in the cafeteria. But the I ROBOT policy only applies to commercial users who are attempting to use every publicly available card deck to generate every possible output file. Clearly, Steve is not doing that.

Others noted that the computer center would be less heavily loaded if Steve had ScriptKid wait five minutes before each run request. That’s true but actually immaterial: this is a large computer center with 10 multiprocessing mainframes and a large tape and disk farm. Nobody’s work is delayed by Steve’s project. Not even the busload of school kids who stopped by to each take a tape home 😉

And so we again come to the end of the story. The original purpose of the card deck was reasonable. Steve’s request was reasonable. The card deck was never designed for public access. But that’s what happened.

And let’s give everyone a break.

(That was fun! I haven’t dragged up card deck / computer center vocabulary in a loooong time. I wonder how many readers can even understand what I wrote!)

I don’t call it a batch run. I call it a script that’s more appropriate for batch use than interactive use. All it takes for a “batch run” is to call that script repeatedly with different input parameters.

As described above, it is missing the elements needed to make it effective for interactive or multiuser (web) interactive access. You can call it an interactive script if you wish; I would not, nor do the other web professionals who have commented above.

Imagine if Excel worked this way: go to save your file and it always saves in Excel, Comma Delimited, PostScript and a chart format! Pretty frustrating for a normal interactive user.

One Trackback

[…] when GISS shut off McIntyre’s access for download large reams of data in 24 hours. See GISS Interruptus. McIntyre wrote then: I have been attempting to collate station data for scientific purposes. I […]