
  • How to run Nearest Neighbor Matching (nnmatch) on panel data?

    I am trying to run nearest-neighbor matching so that I can then run a difference-in-differences (DID) analysis on the matched sample.

    Background:

    Stata 13

    This dataset contains panel data from Compustat merged with Thomson Reuters M&A dataset.


    What I need to do:

    Due to the nature of this data set, I have to match firms on their size (ln(Assets)) and industry code (SIC).

    In order to match the firms by Asset and SIC, I will use Nearest Neighbor Matching (nnmatch).

    I have to match year and SIC code as exact (ematch)

    (Later on) After matching is done, I need to perform a DID estimation to see if there is a causal effect.


    However, when I use the teffects nnmatch code using ematch, I get an error.

    Code:
    gen treatment = 0
    replace treatment = 1 if merger==1
    teffects nnmatch (income industry_sic assets firm_year) (treatment), biasadj(assets) ematch(firm_year industry_sic) vce(robust) dmvariables


    Error: "12 observations have no exact matches"


    It runs fine if I don't use ematch but I need it in order to compare apples to apples. What do I do to deal with this problem?


    Dataset:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input int(id firm_year) byte industry_sic int(assets income merger_year) byte merger
    111 2000 22  10  10    . .
    111 2001 22  12  20 2002 1
    111 2002 22  30 400    . .
    111 2003 22  50 470    . .
    111 2004 22  60 490    . .
    333 2000 22  15  10 2001 1
    333 2001 22  40 100 2002 1
    333 2002 22  70 200    . .
    333 2003 22  80 260    . .
    333 2004 22  85 270 2007 1
    333 2005 22  90 280    . .
    333 2006 22  95 290    . .
    333 2007 22 120 700    . .
    555 2000 37  40  10 2001 1
    555 2001 37  60  50    . .
    555 2002 37  70  70    . .
    end
    Last edited by Tahseen Hasan; 05 Jan 2019, 10:29.

  • #2
    Caveat: I have rarely used -teffects-, and I have never used it with nearest-neighbor matching.

    But it seems as if Stata is simply telling you that there are twelve observations for which no exact match on year and SIC code exists in your data. If that is the case, then it seems to me you have the following possible workarounds:

    1. Get some additional data that will match the currently unmatchable observations.
    2. Omit the 12 unmatchable observations from your analysis.
    3. Loosen the requirement for an exact match on year. For example, settle for a match within 2 or 5 years or something like that.
    4. Loosen the requirement for an exact match on sic code. Accept a match between one industry and a closely related one (I don't know enough about SIC codes, nor, for that matter about industries, to suggest how you might implement "closely related.")
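
    Before choosing among these options, it may help to see which observations are affected. The sketch below assumes the variable names from the example in #1 and a 0/1 treatment variable with no missing values; any_treated, any_control and no_exact are names made up for illustration. It flags every observation whose year-by-SIC cell lacks either a treated or a control observation.

    ```stata
    * Sketch only: assumes treatment was created as in #1 (0/1, never missing).
    * Does the (firm_year, industry_sic) cell contain at least one treated row?
    bysort firm_year industry_sic: egen byte any_treated = max(treatment)
    * Does it contain at least one control row?
    bysort firm_year industry_sic: egen byte any_control = max(1 - treatment)
    * Flag rows in cells that lack one of the two groups
    gen byte no_exact = !(any_treated == 1 & any_control == 1)
    count if no_exact
    tabulate firm_year industry_sic if no_exact
    ```

    On the 16-observation example in #1, only the 2000 and 2004 cells for SIC 22 contain both groups, so 12 observations are flagged, which agrees with the count in the error message. Note that the flagged rows include controls as well as treated observations.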



    • #3
      Thank you so much for the advice, Clyde; I really appreciate it. In my case, point #2 is what I have to find a solution to. The other options won't work because I don't have access to any more data. For my DID to have a valid control group, even if I could loosen the industry matching, I couldn't do the same for year. Even if I run ematch on year or SIC alone, I still have unmatched observations.

      The only solution that I'm seeing is figuring out a way to omit the unmatchable firms from the analysis, as you suggested. However, I have over 5K unmatched observations in my full dataset so I need to figure out how to exclude those observations in order to make this work.

      I have looked through all the relevant forum posts, and people often suggest using the -osample()- option. However, even when I include it in my code I still get the same error, so I don't know what else to do.

      Code:
      gen treatment = 0
      replace treatment = 1 if merger==1
      teffects nnmatch (income industry_sic assets firm_year) (treatment), biasadj(assets) ematch(firm_year industry_sic) osample(Unobserved) vce(robust) dmvariables

      These are the two threads that relate to my problem, but unfortunately I didn't find a solution for nearest-neighbor matching in either of them:

      https://www.statalist.org/forums/for...ffects-nnmatch

      https://www.statalist.org/forums/for...ffects-nnmatch


      I really appreciate your help, Clyde. I've been struggling with this for days now and have not been able to find a solution.
      Last edited by Tahseen Hasan; 05 Jan 2019, 12:44.



      • #4
        What about -drop if Unobserved- and then trying the -teffects nnmatch- again? Does that go through?



        • #5
          Still no luck. The -osample()- option is supposed to create a new variable, "Unobserved", equal to 1 if an observation does not have an exact match.

          However, my code never runs far enough for the Unobserved variable to be created in the first place.

          This is the code I tried, in both orders, and the error I got each time.

          Code:
          drop if Unobserved==1
          teffects nnmatch (income industry_sic assets firm_year) (treatment), biasadj(assets) ematch(firm_year industry_sic) osample(Unobserved) vce(robust) dmvariables
          Unobserved not found
          Code:
          teffects nnmatch (income industry_sic assets firm_year) (treatment), biasadj(assets) ematch(firm_year industry_sic) osample(Unobserved) vce(robust) dmvariables
          drop if Unobserved==1
          12 observations have no exact matches
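
          One thing that might be worth trying here, offered as a sketch rather than a tested fix: as far as the -teffects- documentation describes it, -osample()- creates its indicator variable and the command then still exits with an error, which stops a do-file before the next line runs. Wrapping the first call in -capture noisily- lets execution continue, and the model can then be re-estimated on the observations that were not flagged. Unobserved2 is a made-up name for the second indicator.

          ```stata
          * Sketch only: assumes osample() creates Unobserved even though the
          * command itself ends with the "no exact matches" error.
          capture noisily teffects nnmatch (income industry_sic assets firm_year) (treatment), ///
              biasadj(assets) ematch(firm_year industry_sic) osample(Unobserved) ///
              vce(robust) dmvariables

          * Re-estimate only if the indicator variable actually exists.
          capture confirm variable Unobserved
          if _rc == 0 {
              teffects nnmatch (income industry_sic assets firm_year) (treatment) ///
                  if Unobserved == 0, biasadj(assets) ematch(firm_year industry_sic) ///
                  osample(Unobserved2) vce(robust) dmvariables
          }
          ```

          If the second call still reports unmatched observations, they would be identified in Unobserved2 and the same step could be repeated.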



          • #6
            OK. I guess I don't properly understand how the -osample()- option works, or something like that.

            Well, we can go back to basics and identify the unmatchable observations outside of -teffects- and then remove them before calling -teffects-. I assume that treatment is the name of the variable that distinguishes treatments from controls, and that it is 1 in the treatment group and 0 in the controls.
            Code:
            gen long uid = _n
            preserve
            keep if treatment == 0
            tempfile controls
            save `controls'
            restore, preserve
            
            keep if treatment == 1
            keep uid industry_sic firm_year
            joinby industry_sic firm_year using `controls'
            keep if _merge == 1 // THESE ARE THE UNMATCHABLES
            drop _merge
            tempfile unmatchable
            save `unmatchable'
            
            restore
            merge 1:1 uid using `unmatchable', keep(master) nogenerate
            At this point the data in memory will look like the original data set, except that it has a new variable uid (which you can now drop if you like) and it excludes those treatment = 1 observations that have no exact control that agrees with them on industry_sic and firm_year. If you now run -teffects-, I think it will go through.

            Note: I have not tested this code, so it may contain errors, but this is the gist of the approach.



            • #7
              Thanks again for the help Clyde.

              Following your code, when I run it up to the following line, the dataset looks like this:


              Code:
               joinby industry_sic firm_year using `controls' 

              Code:
              * Example generated by -dataex-. To install: ssc install dataex
              clear
              input int firm_year byte industry_sic long uid int(id assets income merger_year) byte merger float treatment
              2000 22  6 111 10  10 . . 0
              2004 22 10 111 60 490 . . 0
              end

              I am unsure if the data is supposed to look like that after rejoining it?


              Right after that if I run the code:

              Code:
              keep if _merge == 1 // THESE ARE THE UNMATCHABLES
              drop _merge
              tempfile unmatchable
              save `unmatchable'

              It gives me an error that _merge is not found.


              I tried repeating with _merger instead of _merge but I get the same error.


              I know I'm probably going wrong somewhere in tailoring the code to my dataset, but I can't identify where my mistake is, especially with regard to why _merge is not found.
              Last edited by Tahseen Hasan; 05 Jan 2019, 14:55.



              • #8
                Yes, sorry. I forgot that -joinby- only generates a _merge variable if the -unmatched()- option is also specified. So change the -joinby- command to

                Code:
                joinby industry_sic firm_year using `controls', unmatched(both) _merge(_merge)
                You cannot use -merge- instead of -joinby- here: the whole idea is to pair up each treatment observation with every control that agrees with it on industry_sic and firm_year. -merge- does not do that.



                • #9
                  Hey Clyde thanks for all your patience with me.

                  These are my results so far. This is the full code I have used:

                  Code:
                  drop _all
                  clear
                  use "C:\Users\Tahseen\Desktop\Temp\Diff.dta",clear
                  cd "C:\Users\Tahseen\Desktop\Temp"
                  gen treatment = 0
                  replace treatment = 1 if merger==1
                  
                  gen long uid = _n
                  preserve
                  
                  keep if treatment == 0
                  tempfile controls
                  save `controls'
                  restore, preserve
                  
                  keep if treatment==1
                  keep uid industry_sic firm_year
                  joinby industry_sic firm_year using `controls', unmatched(both) _merge(_merge)
                  
                  keep if _merge == 1
                  drop _merge
                  tempfile unmatchable
                  save `unmatchable'
                  
                  restore
                  merge 1:1 uid using `unmatchable', keep(master) nogenerate

                  It ran without any errors. Once I ran it, the data looks like this:


                  Code:
                  * Example generated by -dataex-. To install: ssc install dataex
                  clear
                  input int(id firm_year) byte industry_sic int(assets income merger_year) byte merger float treatment long uid
                  111 2000 22  10  10    . . 0  1
                  111 2002 22  30 400    . . 0  3
                  111 2003 22  50 470    . . 0  4
                  111 2004 22  60 490    . . 0  5
                  333 2000 22  15  10 2001 1 1  6
                  333 2002 22  70 200    . . 0  8
                  333 2003 22  80 260    . . 0  9
                  333 2004 22  85 270 2007 1 1 10
                  333 2005 22  90 280    . . 0 11
                  333 2006 22  95 290    . . 0 12
                  333 2007 22 120 700    . . 0 13
                  555 2001 37  60  50    . . 0 15
                  555 2002 37  70  70    . . 0 16
                  end


                  Once I run the teffects code then I end up getting the same error except it says "9 observations have no exact matches".

                  Code:
                  teffects nnmatch (income industry_sic assets firm_year) (treatment), osample(Unobserved) biasadj(assets) ematch(firm_year industry_sic) vce(robust) dmvariables

                  I am wondering if this is a genuine bug in Stata 13. I don't see why there wouldn't be a standard way to simply ignore the unmatched observations in nearest-neighbor matching. For large panel datasets (in my case I have over 500K observations) it seems nearly impossible to run exact matching.
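
                  One possible reading of the count of 9: the -joinby- step removes only treated observations that lack a control, while the default ATE also needs a treated match for every control. In the listing above, only uid 1, 5, 6 and 10 sit in a year-by-SIC cell that contains both groups, which leaves 13 - 4 = 9 controls without an exact match. A minimal sketch of a symmetric filter follows, assuming treatment is 0/1 with no missing values; both_arms is a made-up name. If only the effect on the treated is of interest, the -atet- option may relax the requirement on controls, but that is an assumption to check against the documentation.

                  ```stata
                  * Sketch only. Within each cell, sort by treatment so that the first
                  * row is a control (if the cell has one) and the last is treated (if any).
                  bysort firm_year industry_sic (treatment): ///
                      gen byte both_arms = (treatment[1] == 0 & treatment[_N] == 1)

                  * Keep only cells in which every row can be matched exactly.
                  keep if both_arms

                  teffects nnmatch (income industry_sic assets firm_year) (treatment), ///
                      biasadj(assets) ematch(firm_year industry_sic) vce(robust) dmvariables
                  ```

                  On an example this small the estimation itself may still fail for lack of data; the filter is the point of the sketch.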



                  • #10
                    I really don't know what to say here. As I indicated, I'm an infrequent user of -teffects- and have never used it with -nnmatch-, so it may be that there is something that we are both missing here. I just don't know. Sorry I can't be more helpful here.



                    • #11
                      Please don't feel bad about that at all! I scoured through probably 30+ threads on this forum and on Stack Exchange on matching topics. It seems like nearest-neighbor matching is not a popular command at all, because most of these threads either don't have a solution or the questions go unanswered. I am truly grateful for all the help you've provided and the patience you've had through this, and if anything I learned a new way of managing data from your code that I wasn't aware of before. Thank you, Professor Clyde!

                      My next step is to see if I can do something similar with Propensity Score Matching (psmatch2) and try to get the same results because that code is slightly more popular and has more support. Nnmatch would have been ideal for me for the purposes of my paper but I will try to see if it will be sufficient to use psmatch2 instead. I will update this thread if I find useful results.
                      Last edited by Tahseen Hasan; 05 Jan 2019, 16:57.



                      • #12
                        Dear Tahseen, As far as I know, most (or all) of the existing matching approaches do not respect the structure of panel data. As such, you might want to try entropy balancing (search ebalance) or coarsened exact matching (search cem).
                        Ho-Chuan (River) Huang
                        Stata 19.0, MP(4)



                        • #13
                          Just FYI, despite the name, psmatch2 also allows for nearest neighbour matching if you use the mahalanobis option. I would also suggest reshaping your data into a cross section before starting your matching. I.e. there should not be a year variable anymore, instead you should have assets2000 assets2002 etc. As River Huang mentioned, the matching modules don't support panel structures at all.

                          Once you have reshaped your data, you no longer need to exact match on firm_year, I think? (I did not read the first posts.) There is indeed no way to ignore errors from ematch (or caliper, for that matter). All you can do is use osample(newvar), which will create a new variable identifying the problem cases, and then run the command again restricted to newvar == 0 (I think 0, maybe it's 1). Then you have to hope the omitted observations weren't used to match any other observations, although I think that is more an issue with caliper than ematch.

                          Alternatively, you could do something along the following lines (code may need adjustment, didn't test)

                          Code:
                          * Generate sample used for estimation
                          teffects nnmatch ... [do not use ematch option]
                          gen esample = e(sample)
                          
                          * Group observations by FY/SIC
                          egen group_FY_SIC = group(firm_year industry_sic) if esample == 1
                          
                          * Count how many levels of the treatment value are present in each group
                          unique treatment, by(group_FY_SIC) gen(treatment_levels)
                          
                          * Generate new sample restriction (only keep groups with multiple treatment levels)
                          gen esample_em = esample if treatment_levels > 1
                          
                          teffects nnmatch ... if esample_em, ematch(...)
                          I am working on a wrapper for teffects that essentially changes these (to us) inconvenient design decisions. If I ever find the time, I'll add an option that fixes these ematch-issues.
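
                          A possible alternative to the -unique- step: -unique- is a community-contributed command (from SSC), and its gen() variable may be filled only in the first observation of each by-group (worth checking in -help unique-), which would leave most of esample_em missing. The sketch below uses only built-in commands in place of the -unique- and -gen esample_em- lines; t_min and t_max are illustrative names, and treatment is assumed to be 0/1.

                          ```stata
                          * Sketch only: assumes esample and group_FY_SIC exist as created above.
                          * Smallest and largest treatment value within each year-by-SIC group
                          bysort group_FY_SIC: egen byte t_min = min(treatment)
                          bysort group_FY_SIC: egen byte t_max = max(treatment)

                          * 1 only for estimation-sample rows in groups holding both
                          * controls (0) and treated (1); 0 otherwise
                          gen byte esample_em = (esample == 1 & t_min == 0 & t_max == 1)
                          tabulate esample_em treatment
                          ```

                          Because esample_em is then 0 or 1 on every row, the final call would use -if esample_em == 1- explicitly.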



                          • #14
                            Thank you, River and Jesse, for letting me know. I really appreciate the help.


                            I am not able to reshape my data to wide because Stata is giving me an error that my "Firm_Year values within ID are not unique". I could not find a solution for that even though I don't see how that is the case with my data. I will provide a screenshot of the data at the bottom.


                            Because of that issue I proceeded onwards with my long dataset.

                            This is the code I have used (closely following yours). My total sample size is 50,434.


                            Code:
                            teffects nnmatch (Dependent sic2 fyear X1 X2) (Treatment), biasadj(X1 X2) osample(match1) dmvariables vce(robust)        // No ematch here
                            gen esample = e(sample)
                            
                            egen group_FY_SIC = group(fyear sic2) if esample == 1
                            
                            unique Treatment, by(group_FY_SIC) gen(treatment_levels)
                            
                            gen esample_em = esample if treatment_levels > 1
                            
                            *This has worked fine so far
                            
                            _____________________________
                            
                            teffects nnmatch (Dependent sic2 fyear X1 X2) (Treatment) if esample_em==1, ematch(sic2 fyear) biasadj(X1 X2) osample(match2) dmvariables vce(robust)
                            
                            * Error: 19332 observations have no exact matches
                            
                            teffects nnmatch (Dependent sic2 fyear X1 X2) (Treatment) if esample_em, ematch(sic2 fyear) biasadj(size_w bm_lag1) osample(match2) dmvariables vce(robust)
                            
                            *Error: 20083 observations have no exact matches
                            
                            drop if esample_em != 1
                            
                            teffects nnmatch (Dependent sic2 fyear X1 X2) (Treatment), ematch(sic2 fyear) biasadj(size_w bm_lag1) osample(match2) dmvariables vce(robust)
                            
                            *Error: 20083 observations have no exact matches
                            
                            ______________________________


                            Just when I get to the final teffects estimation, I get the same error again about observations not having exact matches. When I run it with 'if esample_em==1', the number of unmatched observations decreases to 19,332.


                            I have also tried the following, where I match only the observations with match1==0 (the osample() variable from the first run), but I get the same error.

                            Code:
                            teffects nnmatch (Dependent sic2 fyear X1 X2) (Treatment), biasadj(X1 X2) osample(match1) dmvariables vce(robust)        // No ematch here, osample = match1
                            *Using osample, where assigned observations are 0 or missing
                            teffects nnmatch (Dependent sic2 fyear X1 X2) (Treatment) if match1==0, ematch(sic2 fyear) biasadj(X1 X2) osample(match2) dmvariables vce(robust)
                            
                            *Error: 20083 observations have no exact matches

                            I am willing to drop all the "non-exact matches" observations and proceed with the analysis if that is what's needed.


                            If possible could you see where I'm going wrong here? Is this error occurring specifically because my data is not in wide format?


                            Thank you so much again Jesse I really appreciate this. This is the rough format of my dataset.
                            ID Year Income Treat
                            111 2000 10 1
                            111 2001 40 0
                            111 2002 90 0
                            111 2003 100 1
                            111 2004 120 0
                            111 2005 190 0
                            333 2000 10 1
                            333 2001 45 1
                            333 2002 90 0
                            333 2003 110 1
                            333 2004 160 0
                            333 2005 240 1
                            333 2006 290 0
                            333 2007 380 0
                            555 2000 10 0
                            555 2001 20 1
                            555 2002 85 0
                            555 2003 195 0
                            555 2004 215 0
                            Last edited by Tahseen Hasan; 07 Jan 2019, 13:33.



                            • #15
                              Do you have any missing values in year or id? Those might be holding back your reshape. Try -duplicates tag fyear id, gen(dubs)- and see what you get.
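
                              A slightly fuller version of that check, as a sketch assuming the identifiers are named id and fyear as in the post above (dups is an arbitrary name). -reshape wide- needs each id-by-fyear combination to appear only once, so any row listed here would explain the "not unique" error.

                              ```stata
                              * Sketch only: variable names id and fyear are assumed.
                              * Rows with a missing identifier or year
                              count if missing(id) | missing(fyear)

                              * Summary of how many id-fyear combinations are repeated
                              duplicates report id fyear

                              * Tag repeated combinations and list them for inspection
                              duplicates tag id fyear, generate(dups)
                              sort id fyear
                              list id fyear if dups > 0, sepby(id)
                              ```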
