Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Application of two stage procedure with endogeneity

    Hi all,

    Jeff Wooldridge
    it's been a while that I am trying to apply the two stage procedure described in "Testing and correcting for endogeneity in nonlinear unobserved effects models" (Lin, Wei, and Jeffrey M. Wooldridge. Panel Data Econometrics. Academic Press, 2019. 21-43) which you can find here: http://www.weilinmetrics.com/uploads...g_20170309.pdf in pdf.
    In particular I have a balanced panel dataset consisting of identifiers called idatc observed in Years from 2004 to 2010. From literature I know that when modeling innovation on size, one might incur in endogeneity of size. For this reason, since innovation is modeled by the number of clinical trials (and therefore a fixed effect poisson is needed) I am following the aforementioned procedure. Summarizing, I am estimating a "first stage" of the endogenous (called y in the following code) on the instrument (L.major_recalls_norm) and the other regressors (average_age_prodbyatc3 avg_prd_sq mean_agefirm_byatc mean_agefirm_squared hhi share_expired share_patented). Predicting the residuals of the fixed effects estimation and putting the residuals inside the "second stage" in a sort of control function approach. So in the second stage I am basically regressing the dependent (trials) on the residuals of thee first stage and the other regressors using a poisson f.e.:

    Code:
    xtreg y L.major_recalls_norm average_age_prodbyatc3 avg_prd_sq mean_agefirm_byatc mean_agefirm_squared hhi share_expired share_patented i.Year, fe vce(cluster newid)
            predict residuals, e
            xtpoisson trials y residuals average_age_prodbyatc3 avg_prd_sq mean_agefirm_byatc mean_agefirm_squared hhi share_expired share_patented i.Year, fe vce(robust)
    The issue comes now: what I would like to do is to perform bootstrap on such a procedure (so that residuals are computed in every drawn sample each time), but when I do it, the coefficients seem not to be close to the ones that I obtain by simply applying the two step procedure without bootstrap. I am therefore concerned about the possibility (and reliability) of applying bootstrap procedure to the two step approach introduced by the authors. Can you please help me figuring out?

    Thanks,

    Federico

  • #2
    As usual, you’ll get a better answer if you show commands and output. If you’re doing the bootstrap properly then the average of the bootstrapped estimates will be close to the estimate in the sample. Of course there will be variability—that’s what produces a standard error. You should use at least 1,000 bootstraps and 10,000 would be better. It should run pretty fast.

    Comment


    • #3
      Jeff Wooldridge
      Thank you for the useful reply. Sorry, you are right, I might have shown the bootstrap techniques adopted.

      So I tried two different approaches.
      1. The first one is based on Cameron and Trivedi (2010) book:
      Code:
      ***** BOOTSTRAP
      
      use "/Users/federiconutarelli/Desktop/DB_atc3.dta", clear //my DB based on idatc3 as ids and Year from 2004 to 2010. A balanced panel.
      *bysort idatc3 (Year): gen byte panelsize = _N
      *rename tot_count trials
      
      capture program drop onebootrep
      program define onebootrep, rclass
          preserve
              rename tot_count trials
              bsample, cluster(idatc3) idcluster(newid)
              xtset newid Year //setting the new as identifier in case of repetitions
              quietly xtreg y L.major_recalls_norm average_age_prodbyatc3 avg_prd_sq mean_agefirm_byatc mean_agefirm_squared hhi share_expired share_patented i.Year, fe vce(cluster newid) //here I cluster for newid since it is the unique id in the repetitions.
              predict residuals, e
              quietly xtpoisson trials y residuals average_age_prodbyatc3 avg_prd_sq mean_agefirm_byatc mean_agefirm_squared hhi share_expired share_patented i.Year, fe vce(robust)
              *local xvars y residuals average_age_prodbyatc3 avg_prd_sq mean_agefirm_byatc mean_agefirm_squared hhi share_expired share_patented
              *local i = 1
              *foreach var in `xvars'{
              *    return scalar b`i'= _b[`var']
              *    return scalar s`i'= _se[`var']*sqrt(e(N))
              *    local ++i
              *}    
          restore
      end
      
      
      preserve
          simulate, seed(1234) reps(500)  nodots saving(coeff_se, replace): onebootrep //this should compute the coefficients for N times WITHOUT exploiting the results stored in r() thanks to the above defined program.
      restore
      It basically build up a program performing the first and the second stage of the FE Poisson approach. Then it implements a Monte-Carlo simulation 500 times (maybe too few?).

      2. The second is a "manual brute-force approach" constructed by me:



      Code:
      set seed 1234
      set matsize 11000
      global filedata "DB_atc3.dta"
      global varx "y residuals average_age_prodbyatc3 avg_prd_sq mean_agefirm_byatc mean_agefirm_squared hhi share_expired share_patented"
      global nrepl 500
      
      
      local nx=0
      set more off
      
      foreach var in $varx {
          local nx=`nx'+1
      }
      
      global ncol= `nx' +2
      mat coeff = J($nrepl ,$ncol,.)
      
      set more off
      forval i=1/$nrepl {
          drop _all
          use "$filedata", clear
          xtset, clear
          xtset idatc3 Year
          rename tot_count trials
          gen lag_maj_rec = L.major_recalls_norm
          *bysort idatc3 (Year): gen byte panelsize = _N
          bsample, cluster(idatc3) idcluster(newid)
          xtset, clear
          xtset newid Year
          quietly xtreg y lag_maj_rec average_age_prodbyatc3 avg_prd_sq mean_agefirm_byatc mean_agefirm_squared hhi share_expired share_patented i.Year,fe vce(cluster newid)
          predict residuals, e
          *predict residuals, residu
          capture quietly xtpoisson trials y $resid average_age_prodbyatc3 avg_prd_sq mean_agefirm_byatc mean_agefirm_squared hhi share_expired share_patented i.Year, fe vce(robust)
      
          if ( _rc ) {
          
          continue
          }
          di "... the return code is " _rc " and the # of iterations is reported to be `i'"
          local conta=0
          set more off
          foreach var in $varx {
              local conta=`conta'+1
              mat coeff[`i',`conta']=_b[`var']
          
          *mat coeff[`i',`conta'+1]=_b[$resid]
          }
      
          *di as text " Replication # `i'"
          
          mat coeff[`i',$ncol]=`i'
      
      }
      It is a basic bootstrap within a for loop. So the issue is that they both give the same results, i.e.:

      Code:
      Variable | Obs Mean Std. Dev. Min Max
      -------------+--------------------------------------------------------
      A | 494 .3611457 5.130059 -32.17217 29.44407
      B | 494 .3349674 5.12483 -28.58649 32.8156
      C | 494 .0052883 .4578338 -3.4639 3.640835
      D | 494 .0024553 .0224125 -.149159 .1644048
      E | 494 .068939 .3264411 -3.457498 2.688154
      -------------+--------------------------------------------------------
      F | 494 -.0014377 .0042038 -.0362368 .0467434
      G | 494 -.4351782 5.421608 -40.58364 36.06022
      H | 494 1.020604 5.319493 -30.74711 42.10214
      I | 494 -1.037221 12.14937 -60.82576 123.775
      J | 494 .3349674 5.12483 -28.58649 32.8156
      -------------+--------------------------------------------------------
      K | 494 250.3664 144.8643 1 500
      Observations are just 594 since in the first stage I included a lagged variable (hence all observations of 2004 have been delated since their lag is not available).

      But when I perform a single "two-stage", i.e. only:

      Code:
      xtset idatc3 Year
      
      quietly xtreg y L.major_recalls_norm average_age_prodbyatc3 avg_prd_sq mean_agefirm_byatc mean_agefirm_squared hhi share_expired share_patented i.Year, fe vce(cluster newid)
      
      predict residuals, e
      
      quietly xtpoisson trials y residuals average_age_prodbyatc3 avg_prd_sq mean_agefirm_byatc mean_agefirm_squared hhi share_expired share_patented i.Year, fe vce(robust)
      the coefficients are completely different for the second stage (hence I guess the problem lies on the residual part):

      Code:
      Iteration 0:   log pseudolikelihood = -890.52999  
      Iteration 1:   log pseudolikelihood = -608.79514  
      Iteration 2:   log pseudolikelihood =  -574.8175  
      Iteration 3:   log pseudolikelihood = -572.35238  
      Iteration 4:   log pseudolikelihood = -572.33087  
      Iteration 5:   log pseudolikelihood = -572.33086  
      
      Conditional fixed-effects Poisson regression    Number of obs      =       680
      Group variable: idatc3                          Number of groups   =        85
      
                                                      Obs per group: min =         8
                                                                     avg =       8.0
                                                                     max =         8
      
                                                      Wald chi2(16)      =    272.25
      Log pseudolikelihood  = -572.33086              Prob > chi2        =    0.0000
      
                                                 (Std. Err. adjusted for clustering on idatc3)
      ----------------------------------------------------------------------------------------
                             |               Robust
                      trials |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
      -----------------------+----------------------------------------------------------------
                           y |  -.3015194   1.204209    -0.25   0.802    -2.661725    2.058686
                   residuals |   .7787056   1.238825     0.63   0.530    -1.649347    3.206758
      average_age_prodbyatc3 |   .0651748   .1125092     0.58   0.562    -.1553391    .2856887
                  avg_prd_sq |  -.0023611   .0052091    -0.45   0.650    -.0125707    .0078484
          mean_agefirm_byatc |   .0612808   .1220469     0.50   0.616    -.1779268    .3004884
        mean_agefirm_squared |  -.0010596   .0014785    -0.72   0.474    -.0039575    .0018383
                         hhi |   .0028483   1.298405     0.00   0.998    -2.541979    2.547676
               share_expired |   1.330289   1.839446     0.72   0.470    -2.274959    4.935537
              share_patented |  -.4722916   2.619808    -0.18   0.857    -5.607022    4.662439
                             |
                        Year |
                       2006  |  -.2822704   .1854015    -1.52   0.128    -.6456507    .0811098
                       2007  |  -.1689345   .2271054    -0.74   0.457    -.6140529    .2761839
                       2008  |   -.816972   .2738895    -2.98   0.003    -1.353786   -.2801584
                       2009  |  -1.221988   .2555353    -4.78   0.000    -1.722828   -.7211485
                       2010  |  -1.732823   .3902066    -4.44   0.000    -2.497614   -.9680317
                       2011  |  -3.448337    .553699    -6.23   0.000    -4.533567   -2.363107
                       2012  |  -3.137114   .4748049    -6.61   0.000    -4.067714   -2.206513
      ----------------------------------------------------------------------------------------
      Sorry for the verbose post. Maybe it's just me performing something wrong in the bootstrap procedure but I think that clarifying the isse might be useful both for STATA users and for people interested in the theoretical approach.

      Thanks a lot,

      Federico

      Comment


      • #4
        Errata corrige:

        of course the code here is with vce(cluster idatc3) i.e. as follows:

        Code:
         
         xtset idatc3 Year  quietly xtreg y L.major_recalls_norm average_age_prodbyatc3 avg_prd_sq mean_agefirm_byatc mean_agefirm_squared hhi share_expired share_patented i.Year, fe vce(cluster idatc3)  predict residuals, e  quietly xtpoisson trials y residuals average_age_prodbyatc3 avg_prd_sq mean_agefirm_byatc mean_agefirm_squared hhi share_expired share_patented i.Year, fe vce(robust)

        Comment


        • #5
          Errata corrige:

          of course the code here is with vce(cluster idatc3) i.e. as follows:

          Code:
          xtset idatc3 Year
           quietly xtreg y L.major_recalls_norm average_age_prodbyatc3 avg_prd_sq mean_agefirm_byatc mean_agefirm_squared hhi share_expired share_patented i.Year, fe vce(cluster idatc3)
           predict residuals, e  
          quietly xtpoisson trials y residuals average_age_prodbyatc3 avg_prd_sq mean_agefirm_byatc mean_agefirm_squared hhi share_expired share_patented i.Year, fe vce(robust)

          Comment


          • #6
            I'll find the Stata do file and post. The bootstrap should be relatively simple to apply.

            Comment


            • #7
              Jeff Wooldridge

              Thanks a lot and sorry for any disturb!

              Comment


              • #8
                Jeff Wooldridge a quick update. I tried to replicate step by step the procedure manually. I am not really sure about that but I guess the problem is in the xtpoisson, fe vce() part...

                Comment


                • #9
                  Here is the code I've used with an example on airfare data. You fan easily find the data set using Google; in fact, it is on the Stata website, I think.

                  Note that there are two other estimators that combine the correlate random effects and control function approaches. They tend to be more precise, but they impose more assumptions, too.

                  I hope this helps.

                  Code:
                  use airfare, clear
                  
                  egen double concenbar = mean(concen), by(id)
                  egen double lfarebar = mean(lfare), by(id)
                  
                  
                  capture program drop cre_cf
                  
                  program cre_cf, rclass
                  
                  xtreg lfare concen y98 y99 y00, fe
                  predict double u2ddh, e
                  reg lfare concen concenbar ldist ldistsq y98 y99 y00
                  predict double v2h, resid
                  
                  xtpoisson passen lfare u2ddh y98 y99 y00, fe
                  return scalar b_lfare_1 = _b[lfare]
                  glm passen lfare v2h concenbar lfarebar y98 y99 y00, fam(poiss)
                  return scalar b_lfare_2 = _b[lfare]
                  glm passen lfare v2h concenbar y98 y99 y00, fam(poiss)
                  return scalar b_lfare_3 = _b[lfare]
                  
                  drop u2ddh v2h
                  
                  end
                  
                  bootstrap r(b_lfare_1) r(b_lfare_2) r(b_lfare_3), ///
                      reps(500) seed(123) cluster(id) idcluster(newid): cre_cf

                  Comment


                  • #10
                    Jeff Wooldridge
                    Thanks a lot for your helpful advices and comments.

                    Comment

                    Working...
                    X