Hi there,
I have the following problem:
Because my dataset is too big (>20 GB; my computer crashes when I try to work with it despite 16 GB of RAM), I want to take a sample and work with that instead.
The problem is that for the sample to be useful, I have to sample clusters. That is, the data has more than 150,000,000 observations but only about 30,000 distinct values of a certain categorical variable. I want to sample about 8,000 of these 30,000 values and then keep all observations for which the variable takes one of those 8,000 values, so that overall I have roughly a 25% sample but every sampled cluster is still complete.
However, only bsample has a cluster option, and bsample samples with replacement. For my purposes I obviously don't want replacement, but sample (the command that samples without replacement) has no cluster option.
Do you wise people here have an idea of how I can solve this?
Additional complication: the whole dataset is currently split into several .dta files, each containing some of the observations, and unfortunately a single cluster is NOT confined to a single file; its observations can be spread across several files.
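To make clear what I am after, here is a rough sketch of the logic in Python (not Stata, and only a toy illustration). The function name sample_clusters, the variable name cid, and the small in-memory chunks standing in for the separate .dta files are all made up for the example. The idea is two passes: first collect the distinct cluster values across all files, draw the clusters without replacement, then go through each file again and keep only the rows belonging to a drawn cluster.

```python
import random

def sample_clusters(chunks, key, n_clusters, seed=12345):
    """Pick n_clusters distinct cluster ids without replacement, then keep
    every row from every chunk whose id was picked."""
    # Pass 1: collect the distinct cluster ids across all chunks.
    ids = set()
    for chunk in chunks:
        for row in chunk:
            ids.add(row[key])
    # Draw clusters without replacement; sorting first makes the draw
    # reproducible for a given seed.
    rng = random.Random(seed)
    chosen = set(rng.sample(sorted(ids), n_clusters))
    # Pass 2: filter each chunk on its own, so no chunk needs the others
    # in memory at the same time.
    kept = [[row for row in chunk if row[key] in chosen] for chunk in chunks]
    return chosen, kept

# Toy stand-in for the split files: cluster "b" spans two chunks.
chunks = [
    [{"cid": "a", "x": 1}, {"cid": "b", "x": 2}],
    [{"cid": "b", "x": 3}, {"cid": "c", "x": 4}, {"cid": "d", "x": 5}],
]
chosen, kept = sample_clusters(chunks, key="cid", n_clusters=2)
```

Because only the list of distinct cluster values has to be held in memory at once (about 30,000 values in my case), each file can be filtered separately and the filtered pieces combined afterwards. I assume something similar is possible in Stata, but that is exactly the part I have not figured out.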
Best,
Jakob