How can I deal with that? I did not see any query to interrogate sample information. In addition, I cannot wget the web page and look into the source code because of the "?acc=GSM1442240" part in the URL. Finally, I did not find a clinical spreadsheet available on GEO or provided by the authors in their paper.

There are two problems with the GEOquery package from Bioconductor. First, GEOquery requires downloading the whole dataset again (unless I missed an option to fetch only the sample information?), and the raw dataset is nearly 100 GB. I had already downloaded the complete dataset, processed it, and deleted it because of its size. Second, I understand that SOFT-formatted files contain sample information, but GEOquery took ages to load a ~36 GB file (I had to download that one too). If the dataset were smaller, GEOquery would probably have been a convenient tool for this. In my case, however, it does not seem to be a viable option.

What I did: a basic UNIX grep command on the SOFT-formatted file. At some point (after the micro-array format definition), the sample information is listed. I matched on that pattern to get what I wanted:

Basically, this command captures two lines per sample: the sample name (the line starting with ^SAMPLE) and the sample ethnicity (the line starting with !Sample_characteristics_ch1). paste merges each pair of consecutive lines into a single line, and sed strips the unwanted patterns from it. Output (tab delimited):
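For readers who prefer a script to a shell pipeline, the same idea can be sketched in Python. This is not the command used above, only a minimal illustration under some assumptions: that sample blocks begin with a line like "^SAMPLE = GSM..." and that the characteristic of interest appears as "!Sample_characteristics_ch1 = ethnicity: ...". The key name "ethnicity", the function names, and the demo accessions are hypothetical and would need adjusting to the actual file. Like grep, it reads the file line by line, so the ~36 GB file never has to fit in memory.

```python
import gzip


def open_soft(path):
    """Open a SOFT file as text, transparently handling .gz (helper name is illustrative)."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8", errors="replace")
    return open(path, "r", encoding="utf-8", errors="replace")


def iter_sample_field(lines, field="ethnicity"):
    """Yield (sample_id, value) pairs, streaming over SOFT-formatted lines.

    Assumes characteristics are written as "key: value"; the default key
    "ethnicity" is an assumption about this particular dataset.
    """
    sample_id = None
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith("^SAMPLE"):
            # e.g. "^SAMPLE = GSM0000001"
            sample_id = line.partition("=")[2].strip()
        elif line.startswith("^"):
            # Any other entity (platform, series, ...) ends the sample block.
            sample_id = None
        elif sample_id and line.startswith("!Sample_characteristics_ch1"):
            key, sep, value = line.partition("=")[2].partition(":")
            if sep and key.strip().lower() == field:
                yield sample_id, value.strip()


# Small in-memory demo with made-up accessions and values.
demo_lines = """^PLATFORM = GPL0000
!Platform_title = hypothetical array
^SAMPLE = GSM0000001
!Sample_title = sample one
!Sample_characteristics_ch1 = gender: F
!Sample_characteristics_ch1 = ethnicity: group A
^SAMPLE = GSM0000002
!Sample_characteristics_ch1 = ethnicity: group B
""".splitlines()

rows = list(iter_sample_field(demo_lines))
for sample_id, value in rows:
    print(f"{sample_id}\t{value}")
```

On a real file, something like `with open_soft(path) as fh: rows = list(iter_sample_field(fh))` should produce the same kind of tab-delimited table, provided the assumptions about the line format hold.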

No problem. GEOquery actually does the job, but it requires downloading the SOFT files first; it does not query GEO directly. Moreover, it downloads a 100 MB file per sample if you process them individually. 873 samples at 100 MB each is about 87 GB, much more than the 36 GB file for the whole dataset available from GEO. This is because the micro-array format definition is repeated for each sample when you process them serially.

I have asked the NCBI folks about this, both directly and via channels that I think ought to reach the developers. The lack of response makes me believe that the Entrez Direct tools are not suited to reaching into the content of these files, because that content is not represented independently in the database. I think the files are stored as blobs of text, so the tools are unable to query them.