Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Missing values in Panel Data

    Hello everyone,

    I am working with panel data describing company code as id variable and financial year as time variable. The dependend variable (patents), as well as the independent variable (Numberofmergers) are complete within the observed time window (2000-2010). However, there are multiple (random) missing values within my moderator and control variable (xrdintensity and ln_emp) within the mentioned time window.
    My general question on this is: How are these missing values handled by Stata? Are Panels with missing values omitted as a whole? Or does Stata conduct the regression within a smaller observation window, if possible?
    If I had to exclude every panel with 1 or more missing value(s) in my Dataset, this would radically reduce the size of my Dataset.
    Best regards and thank you in advance for your time

    Christopher

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input double(PERMNO YearEffective) int patents float(xrdintensity ln_emp) int AcquirorPrimarySICCode str4 curcd byte Numberofmergers
    10078 2000  463    .116026  3.686376 3577 "USD" 0
    10078 2001  434  .11512018 3.7999735 3577 "USD" 1
    10078 2002  500  .11106404   3.69883 3577 "USD" 4
    10078 2003  564  .14177898  3.613617 3577 "USD" 3
    10078 2004  679   .1376267  3.514526 3577 "USD" 0
    10078 2005  653  .12579282  3.465736 3577 "USD" 0
    10078 2006  775  .13963665 3.6635616 3577 "USD" 0
    10078 2007  606  .12678368  3.561046 3577 "USD" 0
    10078 2008  508  .13005579  3.580737 3577 "USD" 0
    10078 2009  562  .14699075 3.4011974 3577 "USD" 0
    10104 2000   77  .10322525  3.782529 7372 "USD" 0
    10104 2001   67  .09962963   3.76134 7372 "USD" 0
    10104 2002   87   .1066522 3.7293015 7372 "USD" 0
    10104 2003   86   .1001332  3.753215 7372 "USD" 0
    10104 2004   99  .07429787  3.929313 7372 "USD" 1
    10104 2005   94   .0671742  4.045382 7372 "USD" 3
    10104 2006  179  .06785838 4.3264346 7372 "USD" 0
    10104 2007  169  .05849623  4.445389 7372 "USD" 0
    10104 2008  199  .05856673  4.465908 7372 "USD" 1
    10104 2009  207  .05284355 4.6634393 7372 "USD" 0
    10104 2010  305  .06145373  4.691348 7372 "USD" 0
    10107 2000  342 .072387345 3.6913764 7372 "USD" 1
    10107 2001  396  .07389844 3.8836236 7372 "USD" 0
    10107 2002  499  .06366969  3.941582 7372 "USD" 1
    10107 2003  499  .05855148 4.0253515 7372 "USD" 1
    10107 2004  629  .08419833  4.060443 7372 "USD" 0
    10107 2005  746  .08732613 4.1271343 7372 "USD" 3
    10107 2006 1459  .09460178  4.276666 7372 "USD" 3
    10107 2007 1632  .11272578 4.3820267 7372 "USD" 0
    10107 2008 2018  .11215364 4.5217886 7372 "USD" 1
    10107 2009 2901  .11567892  4.543295 7372 "USD" 1
    10107 2010 2743  .10119262 4.4998097 7372 "USD" 0
    10147 2000  149  .07368656  3.222868 3572 "USD" 1
    10147 2001  138   .0939065  3.049273 3572 "USD" 1
    10147 2002  147  .08148286  2.912351 3572 "USD" 0
    10147 2003  151  .05304885 3.0445225 3572 "USD" 1
    10147 2004  143   .0561048  3.165475 3572 "USD" 2
    10147 2005  140  .06088182  3.314186 3572 "USD" 1
    10147 2006  166  .06945954  3.468856 3572 "USD" 4
    10147 2007  177  .06857087   3.65584 3572 "USD" 1
    10147 2008  193  .07569182  3.763523 3572 "USD" 0
    10147 2009  260  .06070076  3.788725 3572 "USD" 0
    10147 2010  279  .06123302  3.901973 3572 "USD" 0
    10302 2000  113  .07810763 1.6928595 3674 "USD" 1
    10302 2001  107  .15411177 1.6409366 3674 "USD" 2
    10302 2002  122  .18445005 1.6294366 3674 "USD" 1
    10302 2003  183   .1604035 1.6160163 3674 "USD" 0
    10302 2004  175   .1762429  1.704748 3674 "USD" 0
    10302 2005  149  .14079961 1.8082888 3674 "USD" 1
    10302 2006  162  .11495226 1.9169226 3674 "USD" 0
    10302 2007  114  .05297389 2.1860514 3674 "USD" 0
    10302 2008  105   .2068327 1.6292405 3674 "USD" 0
    10302 2009   62  .19856155 1.5260563 3674 "USD" 0
    10302 2010  119  .16481714 1.5040774 3674 "USD" 0
    10696 2000    0          .   2.70805 7374 "USD" 0
    10696 2001    0          .   2.95491 7374 "USD" 1
    10696 2002    0          .  3.015535 7374 "USD" 0
    10696 2003    0          .  3.122365 7374 "USD" 0
    10696 2004    0          .  3.135494 7374 "USD" 0
    10696 2005    0          .  3.135494 7374 "USD" 0
    10696 2006    0          .  3.178054 7374 "USD" 1
    10696 2007    0          . 3.2580965 7374 "USD" 0
    10696 2008    1          . 3.0445225 7374 "USD" 0
    10696 2009    0          . 3.0445225 7374 "USD" 0
    10696 2010    2          .  2.995732 7374 "USD" 0
    11403 2000   11  .17900307  1.894617 7372 "USD" 0
    11403 2001   11   .1860511 1.8870697 7372 "USD" 1
    11403 2002   21   .1530078  1.820509 7372 "USD" 1
    11403 2003   41  .12336163  1.757858 7372 "USD" 3
    11403 2004   56  .12621783 1.7749523 7372 "USD" 0
    11403 2005   95  .11764284 1.7917595 7372 "USD" 0
    11403 2006  116  .13389133 1.8245493 7372 "USD" 0
    11403 2007   59  .12831071 1.8405496 7372 "USD" 1
    11403 2008   68   .2731335 1.7749523 7372 "USD" 1
    11403 2009   91  .25145772  1.686399 7372 "USD" 0
    11403 2010  115  .21731396 1.7227666 7372 "USD" 0
    11531 2000    0  .18822227 1.0130544 7372 "USD" 0
    11531 2001    0   .2282132 1.0112373 7372 "USD" 0
    11531 2002    1  .21989964  .9951019 7372 "USD" 1
    11531 2003    0  .19663237 1.0006319 7372 "USD" 0
    11531 2004    0  .15850393  .9582002 7372 "USD" 0
    11531 2005    0  .13468757  .9913982 7372 "USD" 0
    11552 2000    6  .15202783 .27459684 2834 "USD" 1
    11552 2001   11   .1911199 .32425505 2834 "USD" 0
    11552 2002    9   .4296657  .4446858 2834 "USD" 0
    11552 2003    9  .15505424  .5181984 2834 "USD" 0
    11552 2004    4  .14526598  .5687171 2834 "USD" 0
    11552 2005    7  .15307905  .6647477 2834 "USD" 0
    11552 2006   11  .09453244  .8272409 2834 "USD" 0
    11552 2007    9  .11037348  .9876807 2834 "USD" 0
    11552 2008   19   .6009125  1.235762 2834 "USD" 0
    11552 2009   12  .14748602 1.3384163 2834 "USD" 0
    11552 2010   12  .11088504  1.645191 2834 "USD" 0
    11976 2000    9  .07735918 2.1198635 7373 "USD" 0
    11976 2001    4  .17905986 1.9926574 7373 "USD" 0
    11976 2002    3  .17638376 2.0619137 7373 "USD" 0
    11976 2003    9  .19282055 2.0053914 7373 "USD" 0
    11976 2004    6  .06867174 2.0674956 7373 "USD" 0
    11976 2005    8  .06505964  1.974081 7373 "USD" 1
    11976 2006   12   .0646319   1.94591 7373 "USD" 0
    end

  • #2
    Cristopher:
    for all commands Stata applies listwise deletion to all the observations with missing values in any variables.
    Hence, Stata will not omit the whole panel, but only those observations plagued with missing values.
    Kind regards,
    Carlo
    (Stata 19.0)

    Comment


    • #3
      Thanks you very much for your reply! So stata will calculate with panels with at least two non-missing values in my case? Panels with only one observation can not be used, right?

      Comment


      • #4
        Cristopher:
        other things being equal, panes with one observations only will actually be included in e(sample), as you can see from the following, really teribble, toy-example:
        Code:
        . use "https://www.stata-press.com/data/r16/union.dta"
        (NLS Women 14-24 in 1968)
        
        . save "C:\Users\user\Desktop\CARLO_PAJNEL.dta"
        file C:\Users\user\Desktop\CARLO_PAJNEL.dta saved
        
        . drop if year>72 & idcode==1
        (6 observations deleted)
        
        
        . xtreg age i.grade if idcode<=2, fe
        note: 12.grade omitted because of collinearity
        
        Fixed-effects (within) regression               Number of obs     =         10
        Group variable: idcode                          Number of groups  =          2
        
        R-sq:                                           Obs per group:
             within  =      .                                         min =          1
             between = 0.5000                                         avg =        5.0
             overall =      .                                         max =          9
        
                                                        F(0,8)            =       0.00
        corr(u_i, Xb)  =      .                         Prob > F          =          .
        
        ------------------------------------------------------------------------------
                 age |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
            12.grade |          0  (omitted)
               _cons |       28.4   1.753568    16.20   0.000     24.35627    32.44373
        -------------+----------------------------------------------------------------
             sigma_u |  6.5996633
             sigma_e |  5.5452683
                 rho |  .58616822   (fraction of variance due to u_i)
        ------------------------------------------------------------------------------
        F test that all u_i=0: F(1, 8) = 2.55                        Prob > F = 0.1490
        
        .
        Obviously, a panel with one observation only is not a panel and this bears on coefficient estimates.
        Kind regards,
        Carlo
        (Stata 19.0)

        Comment


        • #5
          Carlo, thank you very much for your quick reply. I might get a little deep here but i do not understand the following behaviour of Stata then:

          given as above, my Variables "xrdintensity" and "ln_emp" contain missing values. Initially, before I read your first answer, I used the "fillin" command to rectangularize the observations and create a balanced Dataset. However, "fillin" only creates missing values, which should be ommited anyways if I understood you corectly. Hence, it should not matter for the regression, whether i use the "fillin" command, or just leave the dataset as it is, creating an unbalanced Panel. Is that correct?
          In the regression, Stata returns "888" as number of observations if i used the "fillin" data, and only "887" as number of observations if i do not use it. What happenes to the one observation that apparently is not considered when i am using the unbalanced panel model?
          I know this is only a slight difference, but I am here to understand the way Stata works in detail .

          Thank you very much for your time and best regards

          If you need more Info, just let me know! I will provide you with it.

          Christopher

          Comment


          • #6
            Cristopher:
            1) using -fillin- is perfectly legal, technically speaking, provided that you know that, as per its -helpfile- it basically creates interaction-based observations. Probably, the community-contributed module -ipolate- is more in line with what you had in mind. That said, the potential drawback rests on ending up with a dataset that is a pale representation of your original dataset. As it always happens with missing values, any imputation procedure should be precedeed by a diagnostic session aimed at investigating the mechanism (and the pattern(s)) related to missing values;
            2) 888 vs. 887. It simply mean that -fillin- created one more observation.
            Kind regards,
            Carlo
            (Stata 19.0)

            Comment


            • #7
              Hello Carlo,

              just to make sure:
              I used the following command to prepare the Data of "xrdintensity" and "ln_emp":
              fillin cusip fyear (cusip works as identifier, fyear as time variable)
              my interpretation of this command is, that every panel is extended, so it shows yearly values from 2000-2010 in my case, as this is the largest observed range in years. The added values (_fillin == 1) however, are classified as missing values for the variables "xrdintensity" and "ln_emp" by the nature of the command "fillin". If this holds true, and Stata applies listwise deletion to all the observations with missing values in any variables, it both should be regressed with 888 observations, regardless of whether I work with the fillin command or if I use the unbalanced form of my panel in my understanding.
              I am afraid I still dont quite understand the reason behind the discrepancy in observations based on these assumptions.
              Sorry for bothering you with with my questions and thank you in advance!

              Christopher

              Comment


              • #8
                Cristopher:
                you may find -e(sample)- useful.
                See https://stats.idre.ucla.edu/stata/fa...using-esample/.
                Kind regards,
                Carlo
                (Stata 19.0)

                Comment


                • #9
                  Thank you very much, that helped. I will create a new topic on the findings using e(sample).

                  Christopher

                  Comment

                  Working...
                  X