Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Efficient way to make draws for each observation and write to dataset in memory

    I am looking a for an efficient way to make draws from a dataset for each observation and write the data for those draws to that observation to new variables.

    Example:
    I have a dataset of pairs of jobs at two different locations within the same firm. Next I wish to add a randomly selected job (half of a pairing) at a different firm and with different other characteristics such as wage but with the same location, SOC, PayFrequency, SalaryType, and Year as the first job in the pairing. I would write the characteristics of this drawn job to new variables and do this for each observation (pairing).

    More detail on what we tried:
    So far we have tried looping through individual observations, which was not efficient (computationally impossible with millions of observations). We tried two methods looping through the observations. First, keeping an appropriate subset of the data and then sampling a single observation from that subset and saving it to disk. Then we also tried preserving the big dataset, keeping an appropriate subset and sampling an observation, writing all of the variable values to macros, restoring the big data, and then writing to that observation with the macro values.

    We also tried joining by the desired same characteristics and then keeping only observations in which the draw was from a different firm and met other criteria, but this rapidly exceeded memory available.

    Thanks in advance for any help. I should mention we are using Stata 15.
    Last edited by Isaac Liu; 06 Oct 2020, 08:06.

  • #2
    Here is some sample code for the joinby attempt, which runs out of memory:

    *create a job characteristics group variable
    egen g_job_char = group(`2')

    save draw_ready.dta, replace

    *create a local of key variables for draws
    local draw_vars = "State Employer CommutingZone Year OES_wage occ_weight FirmSector FirmNAICS3 Region census_region top_establishment establishment_size firm_size mean_salary_psy_trim mean_salary_psyo_trim mean_salary_no_trim ms_psy_trim_over_num ms_psyo_trim_over_num ms_no_trim_over_num vacs mean_edu min_range establishment_job_size SOC PayFrequency SalaryType Establishment obs_per_firm_period num_loc"

    *randomly select the other jobs
    *the first random draw:
    *first draw a job with the desired characteristics and the location of job 1
    *in the pairing
    *remove variables for second items in pairs
    drop *_2
    *rename the draw varaibles
    foreach var of varlist * {
    rename `var' `var'_draw
    }
    rename g_job_char_draw g_job_char

    save draws_set.dta, replace

    use draw_ready.dta, clear

    joinby g_job_char using draws_set.dta

    *keep jobs with the desired characterstics relative to first job
    keep if CommutingZone_draw == CommutingZone

    keep if Employer_draw != Employer

    *draw the observation
    sample 1, count by(g_job_char)

    save with_draw_one.dta

    Comment


    • #3
      If you would post some example data, per the StataList FAQ (using -dataex-), your chances of getting a helpful response would be much increased. Your problem sounds like one that could draw interest, but it's hard (at least for me) to exactly understand what you have and what you want. If possible, you could help us help you by illustrating both what your input data is like, and what you want your result to be.

      Comment


      • #4
        I agree with Mike that your problem is not clearly stated. However, it sounds like a matching problem, not a 'draw' problem. This is going to take a long time no matter how you do it, but you might try using some of the matching tools that are available: -reclink-, -matchit-, -psmatch2- and -mahapick- are some relevant user written tools that might be useful (use -findit- to track them down).

        Comment


        • #5
          Code:
          joinby g_job_char using draws_set.dta
          
          *keep jobs with the desired characterstics relative to first job
          keep if CommutingZone_draw == CommutingZone
          suggests to me that CommutingZone be included in g_job_char since you discard every observation for which the value in the draw does not match the value in the main data. That would trim the size of the joinby results.

          Then, the next step is to loop your joinby over subsets of your draw_ready_data dataset and append the resulting output datasets instead of trying to create the entire dataset at once. Here's a sketch of that process, obviously untested and not really proofread all that well. Maybe it can start you in a useful direction.
          Code:
          * create an empty output dataset
          clear
          save with_draw_one.dta, emptyok
          * how many observations
          quietly describe using draw_ready.dta
          local nobs `r(N)'
          * pick the size of the subsets
          local batchsize = 100000
          * and away we go
          forvalues n1 = 1(`batchsize')`nobs' {
              local n2 = min(`n1'+`batchsize'-1,`nobs')
              use draw_ready.dta in `n1'/`n2', clear
              joinby g_job_char using draws_set.dta
              *keep jobs with the desired characterstics relative to first job
              *include CommutingZone in g_job_char, or add it to the joinby command, removing the _draw suffix it in the draw dataset.
              *keep if CommutingZone_draw == CommutingZone
              keep if Employer_draw != Employer
              *draw the observation
              sample 1, count by(g_job_char)
              append using with_draw_one.dta
              save with_draw_one.dta, replace
          }

          Comment


          • #6
            I can't really get this text box to format correctly, but here's a list of variables before the draw.

            The point is that we are adding "_draw" variables.

            SOC Region vacs Year_2 me~no_trim_2
            State census_reg~n min_range OES_wage_2 mean_edu_2
            PayFrequency top_establ~t estab~b_size occ_weight_2 ms_psy_tri~2
            SalaryType estab~t_size obs_per_fi~d FirmSector_2 ms_psyo_tr~2
            Employer firm_size num_loc FirmNAICS3_2 ms_no_trim~2
            CommutingZ~e mean_~y_trim SOC_2 Region_2 vacs_2
            Establishm~t mean~yo_trim State_2 census_reg~2 min_range_2
            Year mean~no_trim PayFrequen~2 top_establ~2 est~b_size_2
            OES_wage mean_edu SalaryType_2 est~t_size_2 obs_per_fi~2
            occ_weight ms_psy_tri~m Employer_2 firm_size_2 num_loc_2
            FirmSector ms_psyo_tr~m CommutingZ~2 mea~y_trim_2 g_job_char
            FirmNAICS3 ms_no_trim~m Establishm~2 me~yo_trim_2

            And here's after the draw:

            SOC mean~yo_trim Year_2 min_range_2 census_reg~w
            State mean~no_trim OES_wage_2 est~b_size_2 top_establ~w
            PayFrequency mean_edu occ_weight_2 obs_per_fi~2 ~t_size_draw
            SalaryType ms_psy_tri~m FirmSector_2 num_loc_2 firm_size_~w
            Employer ms_psyo_tr~m FirmNAICS3_2 g_job_char ~y_trim_draw
            CommutingZ~e ms_no_trim~m Region_2 SOC_draw mean_salar..
            Establishm~t vacs census_reg~2 State_draw mean_salar..
            Year min_range top_establ~2 PayFrequen~w mean_edu_d~w
            OES_wage estab~b_size est~t_size_2 SalaryType~w ms_psy_tri~w
            occ_weight obs_per_fi~d firm_size_2 Employer_d~w ms_psyo_tr~w
            FirmSector num_loc mea~y_trim_2 CommutingZ~w ms_no_trim~w
            FirmNAICS3 SOC_2 me~yo_trim_2 Establishm~w vacs_draw
            Region State_2 me~no_trim_2 Year_draw min_range_~w
            census_reg~n PayFrequen~2 mean_edu_2 OES_wage_d~w ~b_size_draw
            top_establ~t SalaryType_2 ms_psy_tri~2 occ_weight~w obs_per_fi~w
            estab~t_size Employer_2 ms_psyo_tr~2 FirmSector~w num_loc_draw
            firm_size CommutingZ~2 ms_no_trim~2 FirmNAICS3~w
            mean_~y_trim Establishm~2 vacs_2 Region_draw

            Sorry that the dataex output is too large.

            I got this to run on a small subsample.
            Last edited by Isaac Liu; 06 Oct 2020, 12:15.

            Comment


            • #7
              Just wanted to mark this as resolved and thank William. Adding commutingzone to the joinby was in fact in itself enough to make the draw manageable!

              Comment

              Working...
              X