  • Extending dataset to cover all pairwise combinations

    Dear Statalist users,

    I have data that resembles the following:
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float(year worker supervisor workervariable supervisorvariable y)
    1869 1 2  12 23  45
    1869 1 3  12  2  43
    1869 2 1 322 32   4
    1869 3 4   3  2  43
    end


    In the above, I have, by year, different worker-supervisor combinations, along with a worker-specific variable, a supervisor-specific variable, and a regressand. What I want to do is the following. First, I would like to estimate the OLS coefficients from a regression of y on the worker-specific variable and the supervisor-specific variable across all observations. This is simply
    Code:
    regress y workervariable supervisorvariable
    What I want to do afterwards, however, is to use the predicted values based on those coefficients to generate imputations for the worker-supervisor pairs that don't exist in a specific year. In the above example, this would mean augmenting 1869 with observations such as 2-3, 3-1, etc., taking their regressor values from other observations. In essence, I would like to have, for each year, every possible worker-supervisor combination, including those that don't exist as pairs but whose members appear in different pairs. For example, for the missing 1-4 combination, there would be a new row with workervariable and supervisorvariable equal to 12 and 2, respectively. Any suggestions would be much appreciated.


  • #2
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float(year worker supervisor workervariable supervisorvariable y)
    1869 1 2  12 23  45
    1869 1 3  12  2  43
    1869 2 1 322 32   4
    1869 3 4   3  2  43
    end
    
    preserve
    keep year worker workervariable
    duplicates drop
    tempfile workers
    save `workers'
    
    restore
    keep year supervisor supervisorvariable
    duplicates drop
    tempfile supervisors
    save `supervisors'
    
    use `workers'
    joinby year using `supervisors'
    Added: Before running this code you should verify that in each year, the value of workervariable is the same for all observations of any worker, and analogously for supervisors. Thus:

    Code:
    by year worker (workervariable), sort: assert workervariable[1] == workervariable[_N]
    by year supervisor (supervisorvariable), sort: ///
        assert supervisorvariable[1] == supervisorvariable[_N]
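
    As a rough check on the result, the sketch below rebuilds the crossed dataset from the example data and confirms that each year holds exactly (number of distinct workers) x (number of distinct supervisors) rows. The names wtag, stag, n_workers, n_supervisors, and n_rows are hypothetical helpers introduced only for this illustration, and the tempfile handling is a slight variation on the code above rather than a replacement for it.

    ```stata
    * Sketch: confirm that joinby produced every worker-supervisor pair within each year.
    clear
    input float(year worker supervisor workervariable supervisorvariable y)
    1869 1 2  12 23  45
    1869 1 3  12  2  43
    1869 2 1 322 32   4
    1869 3 4   3  2  43
    end

    tempfile original workers supervisors
    save `original'

    * One row per worker per year
    keep year worker workervariable
    duplicates drop
    save `workers'

    * One row per supervisor per year
    use `original', clear
    keep year supervisor supervisorvariable
    duplicates drop
    save `supervisors'

    * Cross workers with supervisors within each year
    use `workers', clear
    joinby year using `supervisors'

    * Count distinct workers and supervisors per year (hypothetical helper names)
    egen byte wtag = tag(year worker)
    egen byte stag = tag(year supervisor)
    bysort year: egen n_workers = total(wtag)
    bysort year: egen n_supervisors = total(stag)
    bysort year: generate n_rows = _N

    * Each year should hold exactly n_workers * n_supervisors rows,
    * and each year-worker-supervisor triple should appear once.
    assert n_rows == n_workers * n_supervisors
    isid year worker supervisor
    ```

    With the example data this gives 3 workers x 4 supervisors = 12 rows for 1869. Note that the cross also contains rows where worker and supervisor carry the same number (for example 1-1). If the two identifiers index the same set of people, which the thread does not state, those self-pairs may need to be dropped.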



    • #3
      Hi Clyde,

      Thank you for your continued help. In your version of the code, however, I lose the y variable. I need the y variable in order to run the OLS regression on the worker-supervisor pairs that are available, obtain the estimated coefficients, and then predict y for the newly created pairs. For instance, in the previous example, the 1-4 combination has been created, but after its creation I want to generate the predicted value of that pair's y based on the worker and supervisor variables. A couple of options come to mind. One is to merge this new file one-to-one with the original dataset (new pairs will be unmatched), run the regression (where the new pairs won't contribute because of their missing y values), and then use the predict command. I am unsure, however, whether predict would work for the new observations as well. Another option is to put these vectors in Mata and perform the element-by-element multiplication there. That would work for sure, but it seems like a brute-force solution. Do you have any suggestions?

      Many Thanks.



      • #4
        merge this new file one-to-one with the original dataset (new pairs will be unmatched), run the regression (where the new pairs won't contribute because of their missing y values), and then use the predict command. I am unsure, however, whether predict would work for the new observations as well.
        Yes, following -regress-, -predict- will do out-of-sample predictions. Try it:

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input float(year worker supervisor workervariable supervisorvariable y)
        1869 1 2  12 23  45
        1869 1 3  12  2  43
        1869 2 1 322 32   4
        1869 3 4   3  2  43
        end
        
        preserve
        keep year worker workervariable
        duplicates drop
        tempfile workers
        save `workers'
        
        restore, preserve
        keep year supervisor supervisorvariable
        duplicates drop
        tempfile supervisors
        save `supervisors'
        
        restore
        tempfile original
        save `original'
        
        use `workers'
        joinby year using `supervisors'
        * Match on year as well, so that pairs are not linked across years
        merge 1:1 year worker supervisor using `original'
        
        regress y workervariable supervisorvariable
        predict yhat, xb
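
        To tell the observed pairs apart from the newly created ones after prediction, something along the following lines might help. It is a self-contained sketch on the example data, not a definitive implementation. It matches on year as well as worker and supervisor, which assumes each year-worker-supervisor triple appears at most once in the original data. The variable names imputed and y_filled are hypothetical.

        ```stata
        * Sketch: flag out-of-sample predictions after regress and predict.
        clear
        input float(year worker supervisor workervariable supervisorvariable y)
        1869 1 2  12 23  45
        1869 1 3  12  2  43
        1869 2 1 322 32   4
        1869 3 4   3  2  43
        end

        tempfile original workers supervisors
        save `original'

        keep year worker workervariable
        duplicates drop
        save `workers'

        use `original', clear
        keep year supervisor supervisorvariable
        duplicates drop
        save `supervisors'

        * All worker-supervisor pairs within each year, with y attached where observed
        use `workers', clear
        joinby year using `supervisors'
        merge 1:1 year worker supervisor using `original'

        * Rows with missing y are excluded from estimation automatically
        regress y workervariable supervisorvariable
        predict yhat, xb

        * e(sample) is 1 for rows used in the estimation and 0 otherwise
        generate byte imputed = !e(sample)

        * Hypothetical convenience variable: observed y where available, prediction elsewhere
        generate y_filled = cond(imputed, yhat, y)

        list year worker supervisor y yhat y_filled if imputed, sepby(worker)
        ```

        Here the 4 original pairs enter the regression and the remaining 8 rows receive out-of-sample predictions only. With so few observations the coefficients are purely illustrative.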



        • #5
          Thank you so much, Clyde.
