Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Program for cross validation for regression discontinuity in time not running

    Dear All,

    To estimate a regression discontinuity in time model I want to use cross validation to determine the optimal bandwidth. I have 7 weeks of pre-intervention and 7 weeks of post-intervention data. For cross validation, I retain only the 7 weeks pre-intervention.
    I use the leave-on-out procedure
    and then iterate through 2, 3, 4, 5, 6 , 7 weeks to see which bandwidth gives me the smallest mean square error.

    Following is the data:

    Code:
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str10 npi int year float week int userTRA
    "J338339LLR" 2014 4 0
    "J338339LLR" 2014 6 0
    "J338339LLR" 2014 7 0
    "J33833J3J3" 2014 2 0
    "J33833J99R" 2014 2 1
    "J33833JOLJ" 2014 4 0
    "J33833NF9L" 2014 5 0
    "J33833R8F8" 2014 1 0
    "J33833RLFF" 2014 7 0
    "J33833RO8R" 2014 2 0
    "J338383FRV" 2014 7 0
    "J338383R89" 2014 2 0
    "J33838FR9R" 2014 6 0
    "J33838LJ8R" 2014 3 0
    "J33838LVFO" 2014 6 1
    "J33838RFNL" 2014 1 1
    "J338393FOR" 2014 7 0
    "J338398J88" 2014 6 0
    "J3383998JF" 2014 4 0
    "J338399JRF" 2014 5 0
    "J338399N33" 2014 2 1
    "J338399V3R" 2014 7 0
    "J33839F99O" 2014 6 1
    "J33839F99O" 2014 7 0
    "J33839FNL3" 2014 5 0
    "J33839JFRV" 2014 6 0
    "J33839JLRL" 2014 5 0
    "J33839NNOL" 2014 6 0
    "J33839O383" 2014 4 0
    "J33839O8R8" 2014 2 0
    "J33839OR8R" 2014 6 2
    "J3383F33NJ" 2014 2 0
    "J3383F38JN" 2014 4 0
    "J3383F988V" 2014 2 0
    "J3383FN3VR" 2014 2 1
    "J3383FNFNL" 2014 1 0
    "J3383FNFNL" 2014 5 0
    "J3383FR8L9" 2014 2 0
    "J3383FROOF" 2014 2 0
    "J3383FVRVO" 2014 5 0
    "J3383J3983" 2014 1 0
    "J3383J3JV8" 2014 3 1
    "J3383J88FO" 2014 3 0
    "J3383J8RJV" 2014 3 0
    "J3383J8RJV" 2014 4 0
    "J3383J8VFV" 2014 5 0
    "J3383JONVF" 2014 5 0
    "J3383JRLJ8" 2014 2 1
    "J3383JRLJ8" 2014 7 1
    "J3383L3VJV" 2014 7 0
    "J3383L88NV" 2014 5 0
    "J3383LF888" 2014 1 0
    "J3383LFR3J" 2014 7 0
    "J3383LJJFO" 2014 2 0
    "J3383LL9RN" 2014 6 1
    "J3383LLN8N" 2014 5 0
    "J3383LLVFJ" 2014 1 0
    "J3383LLVFJ" 2014 5 0
    "J3383LRVO8" 2014 2 0
    "J3383LVFOR" 2014 7 0
    "J3383LVN93" 2014 3 0
    "J3383N83R8" 2014 5 0
    "J3383N888L" 2014 2 0
    "J3383N9LFJ" 2014 5 0
    "J3383NL93R" 2014 2 0
    "J3383NLV8O" 2014 7 1
    "J3383NNJFV" 2014 1 0
    "J3383NNJFV" 2014 6 1
    "J3383NO3RF" 2014 6 0
    "J3383NVJNJ" 2014 4 0
    "J3383O3LVV" 2014 7 0
    "J3383OFJJL" 2014 5 0
    "J3383ON8LO" 2014 3 1
    "J3383OOO9F" 2014 1 2
    "J3383OORLN" 2014 1 0
    "J3383ORLLO" 2014 2 0
    "J3383OVN8F" 2014 4 0
    "J3383OVRF3" 2014 6 0
    "J3383R3F9N" 2014 2 0
    "J3383RF9O9" 2014 6 0
    "J3383RF9O9" 2014 7 0
    "J3383RNV9R" 2014 1 0
    "J3383ROLLV" 2014 6 0
    "J3383V3OOO" 2014 4 0
    "J3383V3OOO" 2014 7 0
    "J3383V83FL" 2014 6 0
    "J3383V83N9" 2014 7 0
    "J3383V9J89" 2014 3 0
    "J3383VL398" 2014 1 1
    "J3383VL398" 2014 2 1
    "J338F8FJFV" 2014 4 0
    "J338FLNVF3" 2014 5 0
    "J338FR9RLL" 2014 3 0
    "J338FRR3LF" 2014 6 0
    "J338FRV38F" 2014 5 0
    "J338J33FOV" 2014 3 0
    "J338J33RNO" 2014 5 1
    "J338J38JR3" 2014 5 1
    "J338J398ON" 2014 1 1
    "J338J39OV3" 2014 5 0
    end
    Then the code I am trying to run is:

    Code:
    . // PREPARING DATA FOR CV FOR FEDERAL SCHEDULING
    .
    . use "C:\Users\Sumedha\Documents\OPTUM\fillsProviderPanel_weekly_DS_10.dta", clear
    
    . drop if dateWeek<2834 |dateWeek>2840 // only 7 weeks pre-intervention
    (5,265,166 observations deleted)
    
    . gen event_time_dateweek=dateWeek-2840
    
    . rename  prov_state state
    
    . sort state
    
    . drop _merge
    
    . merge m:1  state using "G:\Misc. Data\stateFIPS.dta"
    
        Result                           # of obs.
        -----------------------------------------
        not matched                         3,459
            from master                     3,459  (_merge==1)
            from using                          0  (_merge==2)
    
        matched                            66,609  (_merge==3)
        -----------------------------------------
    
    . drop if _merge~=3
    (3,459 observations deleted)
    
    . drop _merge
    
    . rename stateFIPS st_fips
    
    .
    .         gen ReschedTRA_treat=0
    
    .         replace ReschedTRA_treat=1 if st_fips==5| st_fips==13|st_fips==17|st_fips==21|
    > ///
    >         st_fips==       28|st_fips==35|st_fips==36 |st_fips==38|st_fips==40|st_fips==47
    > |st_fips==56|st_fips==39
    (15,658 real changes made)
    
    .
    . keep if ReschedTRA_treat==0
    (15,658 observations deleted)
    
    . gen week=dateWeek-2833
    
    .
    . /*
    > NOTES: cllr_crossval
    > The goal is to estimate the bandwidth that minimizes the IMSE of a local linear regress
    > ion.
    > A grid search is used and estimation is based on the cllr program described above.
    >
    > Arguments
    >   outcome: a stata variable containing the dependent variable
    >         x: a stata variable containing the independent variable
    >     start: a hardcoded number or local variable defining start of a sequence candidate
    > bandwidths
    >      step: a hardcoded number or local variable defining the stepsize of the sequence o
    > f candidate bandwidth
    >      stop: a hardcoded number or local variable defining the end of a sequence of candi
    > date bandwidths.
    >       sub: a stata variable set to 1 if the observation should be included in the analy
    > sis
    >
    > Returns
    >   A stata matrix and set of stata variables that contain the estimated IMSE for each ca
    > ndidate bandwidth.
    >
    > */
    .
    .
    . sort npi
    
    . gen N=_n if npi[_n]~=npi[_n-1]
    (6,945 missing values generated)
    
    . bysort npi: egen maxN=max(N)
    
    . replace N=maxN if N==.
    (6,945 real changes made)
    
    . bysort N week: gen counter=_n
    
    . drop if counter>1
    (141 observations deleted)
    
    . xtset N week
           panel variable:  N (unbalanced)
            time variable:  week, 1 to 7, but with gaps
                    delta:  1 unit
    
    .
    . gen outcome = userTRA
    
    . gen x = week
    
    .
    . capture program drop cllr_crossval
    
    . program define cllr_crossval
      1.         set more off
      2.         args outcome x start step stop sub narrowsub
      3.         tempvar cx ew e2 e2n
      4.
    .         local stop = 7
      5.         local start = 1
      6.         local step = 1
      7.         *make a matrix to store the estimated IMSE
    .         local size = ((`stop' - `start')/`step')+1
      8.         matrix M = J(`size', 3, .)
      9.        
    .        
    .         *Iterate over candidate bandwidths
    .         local count = 0
     10.         forvalues h = `start'(`step')`stop'{
     11.        
    .                 *increment counter
    .                 local count = `count' + 1
     12.                
    .                 *store location on the bandwidth grid
    .                 matrix M[`count', 1] = `h'
     13.                
    .                 *initialize the residual variable
    .                 gen `e2' = .
     14.                
    .                
    .                 *Iterate over observations
    .                 forvalues i = 1(1)`N'{
     15.                 capture quietly reghdfe /*regress*/ `outcome' `x' if  _n~=`i' & week
    > =<`h', absorb(npi)
     16.                                 replace `e2' = (`outcome' - _b[_cons])^2 in `i'
     17.                                         }
     18.                
    .                 *compute IMSE for the candidate bandwidth
    .                 su `e2'
     19.                 matrix M[`count',2] = r(mean)
     20.
    .                                
    .                 drop `e2'
     21.         }
     22.
    .         matrix list M
     23.         svmat M
     24. end
    
    .
    end of do-file
    
    .
    But it doesn't run at all! I will be grateful for any help in identifying what I am doing wrong.

    Sincerely,
    Sumedha.



  • #2
    When you say doesn't run at all, that is not very helpful. Actually, it is obvious from part of the results (not code) you posted that part did run. Knowing where it quit running is essential to diagnosis. Also, it is best if you can only post what you need to demonstrate the problem - the extra stuff may reduce the number of folks willing to take the time to help you.

    You have defined a program, but I don't see where in the output you run the program. You need to run it or you will get nothing. So add a line with
    cllr_crossval after the end statement at the end of the definition of the program. If this is not the problem, sometimes, I find it easier to debug the code as part of the full do file, and then put the debugged code into the program definition. Remember, you can add set trace on and it will tell you a lot of useful stuff (like how it is interpreting the macros) when it runs (but produces piles of output usually).



    Comment


    • #3
      Thank you Prof. Bromiley. Here is what I am running:

      Code:
      .
      . capture program drop cllr_crossval
      
      . program define cllr_crossval
        1.         set more off
        2.         args outcome x start step stop sub narrowsub npi
        3.         tempvar cx ew e2 e2n
        4.        
      .        
      .         matrix M = J(6, 3, .)
        5.         gen n=_n
        6.         egen maxn=max(n)
        7.         scalar nobs=maxn
        8.        
      .         *Iterate over candidate bandwidths
      .         local count = 0
        9.         forvalues h = 2(1)7{
       10.         display "`h'"
       11.                 *increment counter
      .                 local count = `count' + 1
       12.                 display `count'
       13.                 *store location on the bandwidth grid
      .                 matrix M[`count', 1] = `h'
       14.                
      .                 *initialize the residual variable
      .                 gen `e2' = .
       15.                 display "`e2'"
       16.                
      .                 *Iterate over observations
      .                 forvalues i = 1(2)_n-1{
       17.                 if `sub'[`i']==1 {
       18.                                 gen `cx' = `x' - `x'[`i']      
       19.                                 replace `cx' = `cx' - `x'[`i'+1]
       20.                                 capture noisily xtreg `outcome' `cx' if `sub'==1 & (
      > _n~=`i'|_n=`i'+1), fe vce(cluster `npi')
       21.                                 if _rc==0{
       22.                                         replace `e2' = (`outcome' - _b[_cons])^2 in
      > `i'
       23.                                 }
       24.                         }
       25.                         capture drop `cx'
       26.                                         }
       27.                
      .                 *compute IMSE for the candidate bandwidth
      .                 su `e2' if `sub'==1
       28.                 matrix M[`count',2] = r(mean)
       29.
      .                 su `e2' if `sub'==1 & `narrowsub'==1
       30.                 matrix M[`count',3] = r(mean)
       31.
      .                
      .                 drop `e2'
       32.         }
       33.
      .         matrix list M
       34.         svmat M
       35. end
      
      .
      . cllr_crossval
      2
      1
      (50,810 missing values generated)
      __000002
      invalid syntax
      r(198);

      Comment


      • #4
        You're running at least three threads at once, notably this one and https://www.statalist.org/forums/for...running-at-all and one on matsize , which I suggest you can and should close explicitly.

        In the thread I just referenced I explained a personal feeling that I don't understand the whole of your problem which is not the kind of thing I do at all. But that does not stop me trying to advise positively on different levels.


        1. Statalist etiquette. It's good practice to post a message in the thread I've just referenced saying "If interested please now follow <URL of this thread>". This is not arcane: we explain it in the FAQ Advice you are always prompted to read.
        https://www.statalist.org/forums/help#adviceextras #1 Note that the main FAQ Advice flags that as a place where "Bumping, closing threads, and starting new threads" are discussed.

        2. Help us to help yourself and us to debug. Posting a big chunk of code and asking why it isn't working with a bare error message is a slow and inefficient way to get help. You can


        Code:
        set trace on
        and as need be tune the depth of tracing with

        Code:
        set tracedepth 1
        or 2 or 3 or whatever.

        See

        Code:
        help trace
        for more. That allows you to see where the code fails and allows you to show us that. Phil Bromiley gave you this advice and you ignored it.

        3. Give more information. You made several changes to your program between #1 and #3 and didn't even mention that or try to explain any of them. It does seem possible that the changes might be important.

        4. Your latest bug. Fortunately it is possible to guess the immediate bug which lies in


        Code:
          
         forvalues i = 1(2)_n-1{
        forvalues will not evaluate _n - 1 for you. In any case that would be "the current observation MINUS 1". Is that what you want? If it is, that is problematic, as the current observation number you have in mind may change throughout the loop. What do you want there? Is the "the number of observations MINUS 1"? Perhaps. No point in my guessing. Unless someone reading this understands your total goal, you need to tell us.
        Last edited by Nick Cox; 26 Oct 2019, 03:45.

        Comment


        • #5
          Ok. Thank you. I have fixed the forvalues command. I have also set trace.

          My total goal is to decide how many waves of data to include in my estimation. I want to base my decision on the number of waves that will minimize the mean square prediction error. I have a total of 7 waves. So my options are to include only 2 waves, 3 waves, 4 wave...., all 7 waves. So I start by including only 2 waves (`h'=2) and estimate an individual fixed effects regression for all except individual `i' (leave out all waves of individual `i'). Then calculate the prediction residual for `i' as `outcome'-e(b)*`x' in each of the two wave . Then store that in variable e2. Then repeat same exercise by leaving out both waves of the next individual. Eventually my e2 will have the predicted residual for each individual in each of the two waves.

          Then I repeat the above exercise for panels of 3 waves, 4 waves....all 7 waves.

          Here is what I have:

          Code:
          . program define cllr_crossval
            1.         set more off
            2.         args outcome x /*start step stop*/ sub narrowsub npi
            3.         tempvar cx ew e2 e2n
            4.        
          .        
          .         matrix M = J(6, 3, .)
            5.         local nobs=c(N)
            6.         local nobsmin1=c(N)-1
            7.        
          .        
          .         *Iterate over candidate bandwidths
          .         local count = 0
            8.         forvalues h = 2(1)7{
            9.         display "`h'"
           10.        
          .                 *increment counter
          .                 local count = `count' + 1
           11.                 display `count'
           12.                 *store location on the bandwidth grid
          .                 matrix M[`count', 1] = `h'
           13.                
          .                 *initialize the residual variable
          .                 gen `e2' = .
           14.                 display "`e2'"
           15.                
          .                 *Iterate over observations
          .                 forvalues i = 1(2)`nobsmin1'{
           16.                 if `sub'[`i']==1 {
           17.                                 gen `cx' = `x' - `x'[`i']      
           18.                                 replace `cx' = `cx' - `x'[`i'+1]
           19.                                 capture noisily xtreg `outcome' `cx' if `sub'==1 & (
          > _n~=`i'|_n=`i'+1) , fe vce(cluster `npi')
           20.                                
          .                                                                         replace `e2' =
          > (`outcome' - e(b)`x')^2 in `i'
           21.                                                         }
           22.                         capture drop `cx'
           23.                                         }
           24.                
          .                 *compute IMSE for the candidate bandwidth
          .                 su `e2' if `sub'==1
           25.                 matrix M[`count',2] = r(mean)
           26.
          .                 su `e2' if `sub'==1 & `narrowsub'==1
           27.                 matrix M[`count',3] = r(mean)
           28.
          .                
          .                 drop `e2'
           29.         }
           30.
          .         matrix list M
           31.         svmat M
           32. end
          
          .
          . cllr_crossval
            --------------------------------------------------------------- begin cllr_crossval ---
            - set more off
            - args outcome x sub narrowsub npi
            - tempvar cx ew e2 e2n
            - matrix M = J(6, 3, .)
            - local nobs=c(N)
            - local nobsmin1=c(N)-1
            - local count = 0
            - forvalues h = 2(1)7{
            - display "`h'"
            = display "2"
          2
            - local count = `count' + 1
            = local count = 0 + 1
            - display `count'
            = display 1
          1
            - matrix M[`count', 1] = `h'
            = matrix M[1, 1] = 2
            - gen `e2' = .
            = gen __000002 = .
          (50,810 missing values generated)
            - display "`e2'"
            = display "__000002"
          __000002
            - forvalues i = 1(2)`nobsmin1'{
            = forvalues i = 1(2)50809{
            - if `sub'[`i']==1 {
            = if [1]==1 {
            - gen `cx' = `x' - `x'[`i']
            = gen __000000 =  - [1]
            - replace `cx' = `cx' - `x'[`i'+1]
            = replace __000000 = __000000 - [1+1]
          (50,810 real changes made)
            - capture noisily xtreg `outcome' `cx' if `sub'==1 & (_n~=`i'|_n=`i'+1) , fe vce(cluste
          > r `npi')
            = capture noisily xtreg  __000000 if ==1 & (_n~=1|_n=1+1) , fe vce(cluster )
              --------------------------------------------------------------------- begin xtreg ---
              - version 6, missing
              - local version : di "version " string(_caller()) ", missing:"
              - if replay() {
                if `"`e(cmd)'"'==`"xtreg"' {
                if _by() { error 190 }
                `version' xtreg_`e(model)' `0'
                exit `e(rc)'
                }
                else if `"`e(cmd2)'"' == "xtreg" {
                if _by() { error 190 }
                `version' `e(cmd)' `0'
                exit `e(rc)'
                }
                error 301
                }
              - syntax varlist(fv ts) [if] [in] [aw fw pw iw] [, I(varname) BE FE RE MLE PA GEE VCE
          > (passthru) * ]
          ==1 invalid name
              ----------------------------------------------------------------------- end xtreg ---
            - replace `e2' = (`outcome' - e(b)`x')^2 in `i'
            = replace __000002 = ( - e(b))^2 in 1
          matrix operators that return matrices not allowed in this context
              }
              capture drop `cx'
              }
              su `e2' if `sub'==1
              matrix M[`count',2] = r(mean)
              su `e2' if `sub'==1 & `narrowsub'==1
              matrix M[`count',3] = r(mean)
              drop `e2'
              }
            ----------------------------------------------------------------- end cllr_crossval ---
          r(509);
          
          end of do-file
          
          r(509);

          There are many errors in this. One for sure is that I am not restricting the sample to include only say 2 waves for when `h'=2 in the xtreg in the second loop. I am not sure how to do that. I will be grateful for any help you are able to offer.
          Sincerely,
          Sumedha.

          Comment


          • #6
            I was hoping that someone who does this kind of thing would pick up the thread, but I hoped in vain, All I can do is try to spot bugs, not to rewrite code completely,

            I made various points in #4. I just fixed adding a cross-reference in the other thread, as you didn't do that. You made some changes in the program, but as before you weren't specific in detailing them,

            You don't give the code that called cllr_crossval. That's a major omission,

            Your program expects 5 arguments but you only supplied 1.

            How do I know this when you don't type what you supplied?

            First, the trace shows clearly that sub is empty. Look at this pair of lines. Your program doesn't see a sub and a blank (empty string) is used instead, The - line is your code. The = line is how it is interpreted.
            Code:
            - if `sub'[`i']==1 {  
            = if [1]==1 {
            Stata can interpret what you implied as a test that 1 is equal to 1,

            The next pair of lines

            Code:
            - gen `cx' = `x' - `x'[`i']  
            = gen __000000 =  - [1]
            shows that you didn't supply x either, But Stata can interpret what you implied as an instruction to fill a variable with -1.

            When you get to


            Code:
             - capture noisily xtreg `outcome' `cx' if `sub'==1 & (_n~=`i'|_n=`i'+1) , fe vce(cluster `npi')  
            = capture noisily xtreg  __000000 if ==1 & (_n~=1|_n=1+1) , fe vce(cluster )
            errors catch up with you. Local macro sub is empty and Stata can't make sense of the code implied without it, The error message is explicit on what Stata finds illegal.

            Note that Stata bailed out before noticing that npi is empty. That would have been a fatal error too.
            Last edited by Nick Cox; 27 Oct 2019, 02:33.

            Comment


            • #7
              Correction: What you typed is visible. You supplied no arguments at all to the command, not even the outcome variable.

              Comment


              • #8
                hmm... Thank you prof. Cox. You are right. I figured that the program bit is a bit out of my league right now. So, I have tried to do the above simply using loops. It does mean that I have to copy and paste the code several times, as opposed to the program, but it seems to work. Here is what I have:


                sort npi
                egen ID=group(npi)
                bysort ID week: gen counter=_n
                drop if counter>1
                tsset ID week
                tsfill, full
                recode userTRA (.=0)

                xtset ID week
                gen outcome = userTRA
                gen x = week
                sort ID week

                sum ID
                local numID=r(max)
                //set trace on
                //set tracedepth 2
                matrix M = J(6, 2, .)


                forvalues h=2(1)7 {
                //local h=2
                preserve
                drop if week>`h'
                *increment counter
                local count = `h' - 1
                //display `count'
                *store location on the bandwidth grid
                matrix M[`count', 1] = `h'
                gen e2=.


                forvalues i=1(1)`numID' {
                gen coutcome=outcome
                replace coutcome=. if ID==`i'

                quietly xtreg coutcome week, vce(cluster ID)
                predict y_hat if ID==`i', xb
                replace e2=(outcome-y_hat)^2 if ID==`i'
                drop coutcome y_hat
                }
                *compute IMSE for the candidate bandwidth
                su e2
                matrix M[`count',2] = r(mean)

                restore
                }


                matrix list M
                svmat M

                Thank you helping me as I entered the new territory of Stata programming. I have learnt a lot.

                Sincerely,
                S.

                Comment

                Working...
                X