  • How to produce an estimate of a variable's value over time based on its relationship with another variable?

    Greetings,

    I'm running Stata 15.1 on OSX. Here's my situation: I have a variable (X1) that has been measured every couple of years between 1986 and 2016. This variable is fairly strongly correlated (r=0.6) with another (X2), but the latter was only measured in 2016. My question: is it possible to use the relationship between X1 and X2 in 2016 to produce a rough estimate of the values of X2 in previous years? In other words, if the mean of X1 was 5 in 2016, and the correlation between X1 and X2 in 2016 was 0.6, can we estimate what the value of X2 was in 2004, when the mean of X1 was 3?
    If yes, how can I go about this in Stata?

    Here is sample data from the aforementioned 2016 survey in which both X1 (bsZ) and X2 (wgZ) were measured:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float(wgZ bsZ)
      -.7052631  1.5197775
      -.7052631  -.5593465
      -.7052631 -1.3775046
       .3473684  1.7334635
      -.7052631  -.9669351
      -.7052631  -.7616504
       2.803509  1.9471498
    -.003508814 -.14606696
      -.7052631 -1.3775046
      -.7052631  1.1008067
      -.7052631 -.53956276
      -.7052631  -.1357716
      -.7052631  -.3315677
      -.7052631  -.3426796
      -.7052631 -1.1722198
              .  -.9726263
      2.1017544   .0749338
      -.7052631  .05813077
      -.7052631  -.9596206
    -.003508814   .9023002
      -.7052631   .7083976
      -.7052631  .06382197
    -.003508814  1.3212707
      -.7052631  -.3337417
      1.0491229   .4844163
       .6982455   .4833293
      -.7052631 -.53956276
    -.003508814  1.3212707
      -.3543859 -.13414814
    -.003508814  1.1159863
       .6982455  1.9471498
       1.750877  1.7334635
      -.7052631 -1.3775046
      -.7052631  -.9726263
       .3473684   .9023002
      -.3543859  1.0945792
      -.3543859 -.14254972
      -.7052631   .4844163
      -.7052631 -.55474234
    -.003508814  -.7673417
      -.3543859  -.9726263
      -.7052631   .8949856
      -.7052631  -.3410563
      2.1017544  .27561432
      -.7052631 -.14715388
      -.7052631 -.55365545
      -.7052631 -1.1722198
              .  .08414225
            1.4  1.9471498
      -.7052631  -.9726263
      -.3543859 -.14037575
      2.4526315  1.7334635
      -.7052631 -.12737001
              .  1.3212707
    -.003508814 -1.1722198
              .   .6981023
      -.7052631 -1.3775046
            1.4    1.53387
      -.7052631  -.9596206
    -.003508814 -.12167896
      -.7052631  -.3681545
      2.4526315    .904474
      2.1017544  1.7475564
       .3473684   .7037936
              .  1.9471498
    -.003508814   .4985087
              .  .26450247
      1.0491229  1.9471498
      -.7052631 -1.1722198
      -.7052631 -1.3775046
      -.7052631   .2848226
            1.4  1.9471498
       .3473684  -.3174751
      -.7052631 -1.3775046
            1.4 -.12737001
      -.3543859  .05921783
       .6982455   .7037936
      -.7052631  -.3269636
      -.7052631 -1.3775046
      -.7052631  -.7616504
      -.3543859 -1.3775046
              .  -.7516255
              .  1.3082652
              .  1.5208644
      -.3543859  .06653236
      -.3543859 -.55365545
      -.3543859  .07953796
              .   .2848226
      1.0491229  1.9471498
      -.7052631 -1.3775046
      1.0491229  1.5349573
    -.003508814  1.1064978
      -.7052631 -1.3775046
      1.0491229  1.3212707
       1.750877  1.5208644
            1.4  1.7334635
      -.7052631 -1.1722198
      -.7052631  -.9669351
       .3473684  1.1159863
      -.7052631 -1.3775046
    end
    Here are also data from a 1986 survey in which only X1 was measured:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float(bsZ year)
    -.18299085 1986
     -.6595315 1986
     -1.637336 1986
     -1.378108 1986
     -1.637336 1986
      .3420923 1986
     1.8448405 1986
      .8325119 1986
     -.3797976 1986
     -.6390256 1986
      1.305739 1986
      .3354653 1986
     1.5856125 1986
     -.6390256 1986
     -.9220732 1986
      .5980068 1986
      .1074479 1986
     -.4767431 1986
    -.18630435 1986
     .05573131 1986
      .8391389 1986
     .08286423 1986
     1.6103355 1986
     1.8998705 1986
      .9015602 1986
     -.9532838 1986
     -.1970091 1986
      .3009411 1986
     -.9532838 1986
     1.0498245 1986
      .5980068 1986
     -.9015672 1986
      1.357595 1986
      .0548277 1986
    -.13127425 1986
    -.18299085 1986
     -1.378108 1986
    -1.1500906 1986
      .3116459 1986
     -.9532838 1986
     1.0950534 1986
     .09674313 1986
     1.0671563 1986
      .8013012 1986
    -1.3888127 1986
       .321447 1986
     -.9015672 1986
     .09674313 1986
     -.6432428 1986
     -.1517802 1986
      .3247605 1986
     -.3699964 1986
      2.107382 1986
     .07623722 1986
    -1.1607953 1986
      -.607815 1986
     .28284508 1986
      .5873021 1986
     1.3263844 1986
     -.6940558 1986
    -1.1295847 1986
     2.3559053 1986
     -.4348277 1986
      .3042546 1986
      .5839886 1986
    -.18630435 1986
     -.4045206 1986
    -.18299085 1986
     1.3263844 1986
     .07623722 1986
     -.3452734 1986
      .8424524 1986
      .8391389 1986
     -.8908625 1986
      .3140558 1986
     -.9015672 1986
    -.18630435 1986
      .3140558 1986
     .57659733 1986
     -.9327779 1986
     -.9532838 1986
      .5280549 1986
      .3592847 1986
     1.1016805 1986
      .8046147 1986
     .57328385 1986
     -1.378108 1986
    -1.1813012 1986
      .3354653 1986
     1.0671563 1986
    -1.1188799 1986
       .867036 1986
     -.6390256 1986
      .3354653 1986
     -1.378108 1986
      .3149594 1986
     -.2142015 1986
      .0515142 1986
      .3140558 1986
      .3797906 1986
    end
    Thank you for your help!

    -Zach

  • #2
    You could regress X2 on X1 in the 2016 data, and then use -predict- to get estimates of X2 conditional on X1 in the earlier years. How useful those estimates would be depends on what you want to do with them.



    • #3
      Hey Clyde,

      I fully realize that the accuracy of any estimate is going to be questionable, as it rests on the assumption that the relationship between the two variables is consistent over time (though, theoretically speaking, the two constructs have a lot of overlap, so some consistency would be expected). That said, this analysis is more exploratory than anything else. I want to visualize what the value of X2 might be in a given year, given the value of X1.

      Question: How would I go about doing what you're suggesting? Do I need to merge the two datasets first? Or do I first generate the prediction in the 2016 dataset and merge it with the other data?

      This is what I've done thus far (correct me if I'm going astray):

      Code:
      . regress bsZ wgZ if white==1 [pweight=weight], cluster(state)
      (sum of wgt is 861.6198629995133)
      
      Linear regression                               Number of obs     =        874
                                                      F(1, 49)          =     229.83
                                                      Prob > F          =     0.0000
                                                      R-squared         =     0.3339
                                                      Root MSE          =     .76419
      
                                       (Std. Err. adjusted for 50 clusters in state)
      ------------------------------------------------------------------------------
                   |               Robust
               bsZ |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      -------------+----------------------------------------------------------------
               wgZ |   .5654636   .0372991    15.16   0.000     .4905083    .6404188
             _cons |  -.2301547   .0397563    -5.79   0.000    -.3100481   -.1502614
      ------------------------------------------------------------------------------
      
      . predict bsZ_estimated
      (option xb assumed; fitted values)
      (325 missing values generated)
      Thanks again!



      • #4
        Bleh, just realized I should have regressed wgZ (X2) on bsZ (X1). My bad. But you get the idea of what I've done.



        • #5
          Ultimately, I'd like to generate a time series chart showing the estimated values of X2 over time.



          • #6
            I would operationalize this by -append-ing (not -merge-ing) the earlier data with the 2016 data. Then carry out the regression and run -predict-. The pre-2016 observations will not be in the estimation sample, because they will not have any values of X2, but -predict- applies to out-of-sample observations following -regress-.
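
            For illustration, the steps above might be sketched as follows. This assumes the 2016 data are in memory and that the earlier surveys are saved in a hypothetical file called earlier_years.dta (that name is not from this thread); the regression specification is copied from the posts above.

            ```stata
            * Assumption: the 2016 data are in memory;
            * "earlier_years.dta" is a hypothetical file name
            append using earlier_years.dta

            * Only observations with nonmissing wgZ (the 2016 survey) can enter
            * the estimation sample
            regress wgZ bsZ if white==1 [pweight=weight], cluster(state)

            * Fitted values for every observation with nonmissing bsZ,
            * whether or not it was in the estimation sample
            predict estimated_wgZ
            ```

            The variable name estimated_wgZ is arbitrary; any unused name would do.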



            • #7
              Okay. So what additional syntax do I need to include with -predict- to generate estimates for the years in which X2 is missing?

              I've appended the data. Here are the years in which data is available for X1 (with 2016 including measures of X1 & X2):
              Code:
                     year |      Freq.     Percent        Cum.
              ------------+-----------------------------------
                     1986 |      1,028        4.57        4.57
                     1988 |      1,721        7.65       12.22
                     1990 |        951        4.23       16.44
                     1992 |      2,197        9.76       26.20
                     1994 |      1,732        7.70       33.90
                     2000 |      1,514        6.73       40.63
                     2004 |      1,046        4.65       45.28
                     2008 |      2,059        9.15       54.43
                     2012 |      5,447       24.20       78.63
                     2016 |      4,809       21.37      100.00
              ------------+-----------------------------------
                    Total |     22,504      100.00
              Next I run...

              Code:
              . regress wgZ bsZ if white==1 [pweight=weight], cluster(state)
              (sum of wgt is 861.6198629995133)
              
              Linear regression                               Number of obs     =        874
                                                              F(1, 49)          =     121.83
                                                              Prob > F          =     0.0000
                                                              R-squared         =     0.3339
                                                              Root MSE          =     .78092
              
                                               (Std. Err. adjusted for 50 clusters in state)
              ------------------------------------------------------------------------------
                           |               Robust
                       wgZ |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
              -------------+----------------------------------------------------------------
                       bsZ |   .5904899   .0534987    11.04   0.000     .4829802    .6979996
                     _cons |    .105629   .0425178     2.48   0.016     .0201864    .1910716
              ------------------------------------------------------------------------------
              Now, for the predict command, what do I enter exactly? Is it -predict if year==2010- etc. etc. until predictions are generated for all the years? Ultimately, I'd like one variable that stores all the estimates, which I can then graph.

              Thanks again!

              -Zach




              • #8
                Code:
                predict estimated_wgZ
                That's all you need. This will create a new variable, called estimated_wgZ, that contains the estimated values for every observation in the data set, regardless of year. (The only observations that won't get a prediction are any that have missing values for bsZ.)
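
                To get from these predictions to the time series chart mentioned in #5, one possible sketch is below. It assumes the commands are run right after the -regress- and -predict- above, and that white and weight are also present in the earlier surveys (the thread does not confirm this); in_sample is a hypothetical variable name. Because the model was fitted with if white==1 but -predict- does not apply that restriction, the sketch applies it again when averaging.

                ```stata
                * Mark which observations were used to fit the model (2016 only)
                generate byte in_sample = e(sample)
                tabulate year in_sample

                * Weighted yearly means of the predictions, for the group the model
                * was fitted on (assumption: white and weight exist in every year)
                preserve
                collapse (mean) estimated_wgZ if white==1 [pweight=weight], by(year)
                twoway connected estimated_wgZ year, ytitle("Estimated mean of wgZ") xtitle("Year")
                restore
                ```

                Each plotted point is the mean predicted wgZ for that year under the 2016 relationship. Since the prediction is linear in bsZ, this equals the intercept plus the bsZ coefficient times that year's weighted mean of bsZ.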



                • #9
                  Got it. Thanks!

