Efficient way to make draws for each observation and write to dataset in memory

Isaac Liu

Join Date: Sep 2020

Posts: 10
#1

Efficient way to make draws for each observation and write to dataset in memory

06 Oct 2020, 08:03

I am looking a for an efficient way to make draws from a dataset for each observation and write the data for those draws to that observation to new variables.

Example:
I have a dataset of pairs of jobs at two different locations within the same firm. Next I wish to add a randomly selected job (half of a pairing) at a different firm and with different other characteristics such as wage but with the same location, SOC, PayFrequency, SalaryType, and Year as the first job in the pairing. I would write the characteristics of this drawn job to new variables and do this for each observation (pairing).

More detail on what we tried:
So far we have tried looping through individual observations, which was not efficient (computationally impossible with millions of observations). We tried two methods looping through the observations. First, keeping an appropriate subset of the data and then sampling a single observation from that subset and saving it to disk. Then we also tried preserving the big dataset, keeping an appropriate subset and sampling an observation, writing all of the variable values to macros, restoring the big data, and then writing to that observation with the macro values.

We also tried joining by the desired same characteristics and then keeping only observations in which the draw was from a different firm and met other criteria, but this rapidly exceeded memory available.

Thanks in advance for any help. I should mention we are using Stata 15.

Last edited by Isaac Liu; 06 Oct 2020, 08:06.
Tags: None
Isaac Liu

Join Date: Sep 2020

Posts: 10
#2

06 Oct 2020, 08:13

Here is some sample code for the joinby attempt, which runs out of memory:

*create a job characteristics group variable
egen g_job_char = group(`2')

save draw_ready.dta, replace

*create a local of key variables for draws
local draw_vars = "State Employer CommutingZone Year OES_wage occ_weight FirmSector FirmNAICS3 Region census_region top_establishment establishment_size firm_size mean_salary_psy_trim mean_salary_psyo_trim mean_salary_no_trim ms_psy_trim_over_num ms_psyo_trim_over_num ms_no_trim_over_num vacs mean_edu min_range establishment_job_size SOC PayFrequency SalaryType Establishment obs_per_firm_period num_loc"

*randomly select the other jobs
*the first random draw:
*first draw a job with the desired characteristics and the location of job 1
*in the pairing
*remove variables for second items in pairs
drop *_2
*rename the draw varaibles
foreach var of varlist * {
rename `var' `var'_draw
}
rename g_job_char_draw g_job_char

save draws_set.dta, replace

use draw_ready.dta, clear

joinby g_job_char using draws_set.dta

*keep jobs with the desired characterstics relative to first job
keep if CommutingZone_draw == CommutingZone

keep if Employer_draw != Employer

*draw the observation
sample 1, count by(g_job_char)

save with_draw_one.dta
Comment
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#3

06 Oct 2020, 09:31

If you would post some example data, per the StataList FAQ (using -dataex-), your chances of getting a helpful response would be much increased. Your problem sounds like one that could draw interest, but it's hard (at least for me) to exactly understand what you have and what you want. If possible, you could help us help you by illustrating both what your input data is like, and what you want your result to be.
Comment
Jeph Herrin

Join Date: Apr 2014

Posts: 335
#4

06 Oct 2020, 10:57

I agree with Mike that your problem is not clearly stated. However, it sounds like a matching problem, not a 'draw' problem. This is going to take a long time no matter how you do it, but you might try using some of the matching tools that are available: -reclink-, -matchit-, -psmatch2- and -mahapick- are some relevant user written tools that might be useful (use -findit- to track them down).
Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

06 Oct 2020, 11:26

Code:

joinby g_job_char using draws_set.dta

*keep jobs with the desired characterstics relative to first job
keep if CommutingZone_draw == CommutingZone

suggests to me that CommutingZone be included in g_job_char since you discard every observation for which the value in the draw does not match the value in the main data. That would trim the size of the joinby results.

Then, the next step is to loop your joinby over subsets of your draw_ready_data dataset and append the resulting output datasets instead of trying to create the entire dataset at once. Here's a sketch of that process, obviously untested and not really proofread all that well. Maybe it can start you in a useful direction.

Code:

* create an empty output dataset
clear
save with_draw_one.dta, emptyok
* how many observations
quietly describe using draw_ready.dta
local nobs `r(N)'
* pick the size of the subsets
local batchsize = 100000
* and away we go
forvalues n1 = 1(`batchsize')`nobs' {
    local n2 = min(`n1'+`batchsize'-1,`nobs')
    use draw_ready.dta in `n1'/`n2', clear
    joinby g_job_char using draws_set.dta
    *keep jobs with the desired characterstics relative to first job
    *include CommutingZone in g_job_char, or add it to the joinby command, removing the _draw suffix it in the draw dataset.
    *keep if CommutingZone_draw == CommutingZone
    keep if Employer_draw != Employer
    *draw the observation
    sample 1, count by(g_job_char)
    append using with_draw_one.dta
    save with_draw_one.dta, replace
}

Comment

Isaac Liu

Join Date: Sep 2020

Posts: 10
#6

06 Oct 2020, 12:05

I can't really get this text box to format correctly, but here's a list of variables before the draw.

The point is that we are adding "_draw" variables.

SOC Region vacs Year_2 me~no_trim_2
State census_reg~n min_range OES_wage_2 mean_edu_2
PayFrequency top_establ~t estab~b_size occ_weight_2 ms_psy_tri~2
SalaryType estab~t_size obs_per_fi~d FirmSector_2 ms_psyo_tr~2
Employer firm_size num_loc FirmNAICS3_2 ms_no_trim~2
CommutingZ~e mean_~y_trim SOC_2 Region_2 vacs_2
Establishm~t mean~yo_trim State_2 census_reg~2 min_range_2
Year mean~no_trim PayFrequen~2 top_establ~2 est~b_size_2
OES_wage mean_edu SalaryType_2 est~t_size_2 obs_per_fi~2
occ_weight ms_psy_tri~m Employer_2 firm_size_2 num_loc_2
FirmSector ms_psyo_tr~m CommutingZ~2 mea~y_trim_2 g_job_char
FirmNAICS3 ms_no_trim~m Establishm~2 me~yo_trim_2

And here's after the draw:

SOC mean~yo_trim Year_2 min_range_2 census_reg~w
State mean~no_trim OES_wage_2 est~b_size_2 top_establ~w
PayFrequency mean_edu occ_weight_2 obs_per_fi~2 ~t_size_draw
SalaryType ms_psy_tri~m FirmSector_2 num_loc_2 firm_size_~w
Employer ms_psyo_tr~m FirmNAICS3_2 g_job_char ~y_trim_draw
CommutingZ~e ms_no_trim~m Region_2 SOC_draw mean_salar..
Establishm~t vacs census_reg~2 State_draw mean_salar..
Year min_range top_establ~2 PayFrequen~w mean_edu_d~w
OES_wage estab~b_size est~t_size_2 SalaryType~w ms_psy_tri~w
occ_weight obs_per_fi~d firm_size_2 Employer_d~w ms_psyo_tr~w
FirmSector num_loc mea~y_trim_2 CommutingZ~w ms_no_trim~w
FirmNAICS3 SOC_2 me~yo_trim_2 Establishm~w vacs_draw
Region State_2 me~no_trim_2 Year_draw min_range_~w
census_reg~n PayFrequen~2 mean_edu_2 OES_wage_d~w ~b_size_draw
top_establ~t SalaryType_2 ms_psy_tri~2 occ_weight~w obs_per_fi~w
estab~t_size Employer_2 ms_psyo_tr~2 FirmSector~w num_loc_draw
firm_size CommutingZ~2 ms_no_trim~2 FirmNAICS3~w
mean_~y_trim Establishm~2 vacs_2 Region_draw

Sorry that the dataex output is too large.

I got this to run on a small subsample.

Last edited by Isaac Liu; 06 Oct 2020, 12:15.
Comment
Isaac Liu

Join Date: Sep 2020

Posts: 10
#7

15 Oct 2020, 07:23

Just wanted to mark this as resolved and thank William. Adding commutingzone to the joinby was in fact in itself enough to make the draw manageable!
Comment

Announcement