Re: st: selecting obs while reading in huge data set

--- In statalist, "Sascha O. Becker" <sascha.becker@g...> wrote:
> I have a huge data set A (2 GB in ASCII) with 40 mio. observations
> (workers) but only 10 variables. I have another data set B containing
> information on (a sub-set of) employers and want to select only workers
> from data set A that are employed in firms in data set B (firm IDs are
> one variable in data set A).
>
> Is there also a way to select observations while reading in data set A,
> i.e. something like
>
> -insheet datasetA using mydata if firmid in datasetB- ?
>
> and if yes, is it likely to be faster?
You can try this: Assuming your data set B is already in Stata format, and
the ID in that data set is -tid-; assuming data set A has the employer ID
variable called -firmid-; I want to make this distinction just in case
those are different variable names in two data sets; let's see...
clear
set memory ...m
* load the firm data set into memory
use B
keep firmid
* make sure there are unique observations for each tid
bysort tid : assert _N == 1
local iflist
local N = _N
forvalues n=1/`N' {
local thistid=tid[`n']
local iflist `iflist'|firmid==`thistid'
}
* local macro `iflist' now has the full list of IDs
* provided we did not hit Stata limits on the length
* of a macro (67784 in Intercooled, 1081511 in SE)
* it starts with "|" so we might want to strip it out
gettoken slash iflist : iflist , parse("|")
* we might have done this with the string functions, too,
* but this is probably faster and more reliable
* in terms of Stata limits on the length of strings
* alternatively, we might have used something like
* ... if 0`iflist'
* which would evaluate in the appropriate way
* now, try to get our data in
clear
set memory ...m
infile <list of variables> using A if `iflist'
* note that -insheet- does not support -if- qualifiers,
* so we would have to use -infile-
count
exit
I don't know how much faster this is going to be, but it seemed workable
to me. Unless Nick J. Cox finds a better solution -- say through his
-listutil- suit of commands :))
And yes, you need to be sure you don't hit Stata limits on the length of
the macro before your file B ends. If you have say 10 digit IDs, then you
are limited to about 65k/20 = 3000 firms. How many have you got?
--- Stas Kolenikov
-- Ph.D. student in Statistics at UNC-Chapel Hill
- http://www.komkon.org/~tacik/ -- Stas.Kolenikov@unc.edu
* This e-mail and all attachments to it are not intended to provide any
* reasonable point of view and was transmitted to you in error. It
* should be immediately deleted by all recipients unless they really
* enjoy communicating with the author :). Other restrictions apply.
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/