
  • Find out which observation is causing unbalanced panel

    I have a panel dataset with data for different countries for the years 1975, 1980, 1985, 1990 and 1995. When I run the xtset command, I get the following result which suggests that my panel is unbalanced. How do I find out the source of this "unbalancedness"?

    Code:
    xtset country year
           panel variable:  country (unbalanced)
            time variable:  year, 1975 to 1995, but with gaps
                    delta:  1 unit
    Here is the Stata output for xtdescribe:

    Code:
     xtdescribe
    
     country:  1, 2, ..., 106                                    n =        105
        year:  1975, 1980, ..., 1995                             T =          5
               Delta(year) = 1 unit
               Span(year)  = 21 periods
               (country*year uniquely identifies each observation)
    
    Distribution of T_i:   min      5%     25%       50%       75%     95%     max
                             4       5       5         5         5       5       5
    
         Freq.  Percent    Cum. |  Pattern*
     ---------------------------+----------
          103     98.10   98.10 |  11111
            2      1.90  100.00 |  1111.
     ---------------------------+----------
          105    100.00         |  XXXXX
     --------------------------------------
     *Each column represents 5 periods.
    If I understood correctly, the preceding results indicate that for 103 countries I have all five time periods, but for 2 countries not all of the time periods are available. How do I find out which countries those are? Also, can I still use a fixed-effects model with this unbalanced data?

  • #2
    Code:
    by country (year), sort: gen obs_count = _N
    tab country if obs_count < 5
    will show you the two countries that do not have complete data.

    You do not need balanced data to use basic fixed-effects models.

    Last edited by Clyde Schechter; 10 Feb 2018, 21:33.
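A possible follow-up, sketched on made-up data rather than the poster's dataset: once the incomplete countries are known, -fillin- can show which country-year cells are absent. The toy panel below (two countries, with country 2 lacking 1995) is an assumption for illustration only; the variable names country and year follow the original post.

```stata
* Toy panel (hypothetical data): country 2 has no 1995 observation
clear
set obs 10
gen country = ceil(_n/5)
gen year = 1975 + 5*mod(_n - 1, 5)
drop if country == 2 & year == 1995

* fillin adds every missing country-year combination
* and marks the added rows with _fillin == 1
fillin country year
list country year if _fillin == 1
```

Note that -fillin- changes the data in memory, so on real data it is safer to wrap it in -preserve- and -restore-.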



    • #3
      Clyde Schechter Thank you very much for the code. It was very helpful.



      • #4
        Taz: You need to specify delta(5) to xtset; otherwise some things may not work and others may "work" but produce puzzling or even incorrect results.
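A minimal sketch of why delta(5) matters, on invented data with the same five-year spacing as in the first post (the variable x is hypothetical). With the default delta of 1, L.x would look for an observation one year earlier, which never exists here, so every lag would be missing.

```stata
* Toy panel (hypothetical data): 2 countries, waves every 5 years
clear
set obs 10
gen country = ceil(_n/5)
gen year = 1975 + 5*mod(_n - 1, 5)
gen x = _n

* Tell Stata that consecutive periods are 5 years apart
xtset country year, delta(5)

* L.x now refers to the previous wave (5 years earlier);
* it is missing only for the first wave of each country
gen lag_x = L.x
count if missing(lag_x)
```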



        • #5
          Nick Cox Thank you for your input. A side question: Do I need xtset if I use "areg" instead of "xtreg"?



          • #6
            Come on: that's something that experiment will teach you quickly!



            • #7
              Nick Cox since you suggested the use of delta(5), I started wondering if anything like that is required for areg too. That's it.



              • #8
                There are, it seems, two quite different questions here, even if they sit close together in your mind.

                Whether areg requires xtset is a question of Stata syntax.

                Whether areg is a good idea for panel data is very different, and you need better advice than I can give, but open that up in a new thread with a different title if you want some views.

                As I understand it, areg ignores panel structure except in so far as you specify it indirectly. Whether that fits your research goals is key.
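The experiment suggested in #6 could look something like the sketch below. The data are simulated and the variable names y and x are invented; it only illustrates the syntax point that areg takes the panel identifier through absorb() and runs without xtset, and that its slope estimate matches the one from xtreg, fe.

```stata
* Simulated panel (hypothetical data): 10 countries, 5 waves each
clear
set seed 12345
set obs 50
gen country = ceil(_n/5)
gen year = 1975 + 5*mod(_n - 1, 5)
gen x = rnormal()
gen y = 2*x + country + rnormal()

* areg: no xtset beforehand; the panel enters only via absorb()
areg y x, absorb(country)
scalar b_areg = _b[x]

* xtreg, fe: needs xtset first
xtset country year, delta(5)
xtreg y x, fe

* The two slope estimates should agree up to rounding error
display reldif(_b[x], b_areg)
```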



                • #9
                  Nick Cox thanks again for the clarification. This is all useful information for a beginner like me. I highly appreciate your help.



                  • #10
                    Clyde Schechter Sorry for bothering you again. But I have another question related to your code. It seems that I can have the same output produced by your code by using the following code:

                    Code:
                    bysort country: gen obs=_N
                    My question is, what is the purpose of writing (year) in your code? I would appreciate it if you could explain.



                    • #11
                      You are quite right, it was not necessary to specify (year).

                      I did it out of habit. When you sort data in Stata, if the sort key variables do not uniquely identify observations in the data, then Stata sorts the data into random order within the sort key. This can cause some commands (but not -gen obs = _N-) to produce random, non-repeatable results. It's a very nasty bug to get bitten by, because tracking it down in the code can be very difficult--it tends to elude detection. So I have developed a general habit of specifying a full sort key that uniquely identifies observations in the data (or comes as close to doing so as possible), even if it is not necessary for the particular command. It is also often convenient, because in a situation like yours, it is often the case that the next command will require sorting on year within country--so it saves me having to do a second sort. (If you've ever sorted a large dataset, you know how painfully slow that can be.)

                      But in the end, this was done just out of habit and is not necessary here.
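To make the point about order-dependent commands concrete, here is a small sketch on made-up data. The variable first_year is hypothetical; it stands in for any calculation that depends on which observation comes first within a country.

```stata
* Toy panel (hypothetical data), deliberately shuffled
clear
set seed 42
set obs 10
gen country = ceil(_n/5)
gen year = 1975 + 5*mod(_n - 1, 5)
gen u = runiform()
sort u

* With (year) in the sort key, the first observation within each
* country is always the earliest year, so this is repeatable
bysort country (year): gen first_year = year[1]

* By contrast, "bysort country: gen ... = year[1]" would pick up
* whichever observation happened to land first within the country
```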



                      • #12
                        Clyde Schechter Thank you so much for the clarification.



                        • #13
                          Dear Clyde Schechter ,

                          As you can see, my data set has 2951 individuals but only 3 years. The data are from a panel survey.

                          Is there any way to summarise the characteristics of those individuals who dropped out of the survey each year? Essentially I am trying to do an intuitive test for attrition bias.

                          For example, for the 833 people that only responded to the survey in the first year I want to see what their mean age is (I use age as an example, but will test for other regressors too). I can then compare this to the overall mean for that year, and the overall mean across all years. I can then explain if there is/isn't attrition bias.

                          If the people who stopped responding to the survey scored higher on the dependent variable in the first year, then the mean of the dependent variable may be lower the next year because of this effect. I will include time dummies (using i.year). Will this correct for attrition bias in the independent variable?
                          [Attached screenshot: Screen Shot 2020-04-01 at 20.47.32.jpg]



                          • #14
                            So if you run something like this (change the names to match what's in your actual data):
                            Code:
                            use panel_dataset, clear
                            
                            frame copy default participation_years
                            frame change participation_years
                            keep panelid year
                            sort panelid year
                            forvalues y = 2016/2018 {
                                by panelid (year): egen participated`y' = max(year == `y')
                            }
                            drop year
                            duplicates drop
                            
                            frame change default
                            frlink m:1 panelid, link(participation_years)
                            frget participated*, from(participation_years)
                            frame drop participation_years
                            you will now have a data set that includes three variables indicating whether a given person participated in each of the years 2016, 2017, and 2018. If you wanted, for example, to look at the mean value of variable X in those who participated in 2016 and 2017 but not 2018, you could run
                            Code:
                            summ X if participated2016 & participated2017 & !participated2018
                            You can use those three variables to identify any of the 7 possible patterns of participation. (I assume that there are no observations for people who did not participate in any year.)
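One way to see all the participation patterns at once is to combine the three indicators into a single string, sketched here on a tiny invented dataset. The names panelid and year follow the code above; age, pattern, and the data values are assumptions for illustration.

```stata
* Toy data (hypothetical): person 1 answers all years,
* person 2 only 2016, person 3 answers 2016 and 2017
clear
input panelid year age
1 2016 30
1 2017 31
1 2018 32
2 2016 45
3 2016 50
3 2017 51
end

* Same participation indicators as in the code above
forvalues y = 2016/2018 {
    by panelid (year), sort: egen participated`y' = max(year == `y')
}

* "100" = 2016 only, "110" = 2016 and 2017, "111" = all three years
gen str3 pattern = string(participated2016) + string(participated2017) + string(participated2018)
tabulate pattern

* Mean age in 2016 of those who responded in 2016 only
summarize age if pattern == "100" & year == 2016
```

Because -tabulate- here counts person-years, not people, restricting it to one observation per person (for example with -egen tag()-) may be preferable on real data.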

                            As for the use of a year variable to adjust for attrition bias, it is better than nothing. Similarly, looking at the values of variables in people who dropped out versus those who persisted is a necessary step in the analysis, but all it can do is alert you to the presence of problems in variables you have measured. Even if everything looks the same in dropouts and persisters, there could still be differences between them on attributes that are not observed in the data but are nevertheless important. (And there could be differences in the associations and relationships among the variables, even if their overall levels are not different: in principle you could examine that as well, but the number of such relationships is usually far too large for this to be practical.) So, no matter what you do, you can never be certain you have adjusted out all of the biases resulting from attrition. You do your best with what you have.
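A rough sketch of the dropout-versus-persister comparison for a single measured variable, using simulated first-year data. The indicator dropout and the variable age are hypothetical; on real data dropout would be built from the participation indicators rather than drawn at random, and a t test is only one of several reasonable ways to make the comparison.

```stata
* Simulated first-wave data (hypothetical): one row per person
clear
set seed 2020
set obs 200
gen panelid = _n
gen age = rnormal(40, 10)

* 1 = responded in the first year only, 0 = stayed in the panel
gen byte dropout = runiform() < 0.3

* Compare mean first-year age between the two groups
ttest age, by(dropout)
```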

                            In addition to observing patterns in the data that reflect attrition, you also need to try to learn about the processes that led to the attrition in the first place. Do we know anything about why participants dropped out? If the mechanism is missing completely at random (rarely happens in real life), you don't have to worry about the attrition at all: your complete-data sample is unbiased. If the mechanism is missing at random (in the technical sense of the term; also not common, but not rare either), then you may benefit from using multiple imputation or a full-information maximum likelihood approach in your analysis. If the data are missing not at random, then probably the best you can do is some sensitivity analyses to see whether your findings remain robust to reasonable alternative scenarios about what the missing responses would have been.



                            • #15
                              Clyde Schechter Thank you very much for the input.

                              Yes, in fact the very reason I am doing this is that I noticed significant differences in the associations and relationships among the variables for 2018 versus the other 2 years, and I am trying to work out why.

                              It is a survey that people fill in, so, as you said, they are highly unlikely to be dropping out at random. I will check the details of how the survey was conducted, which will help me to think about the process that led to attrition. I will also be doing a fixed-effects regression, which, as I understand it, will eliminate the attrition problem since it measures within individuals.

                              Thank you for introducing me to the concept of frames, which I was not familiar with. Until now I have been using Excel spreadsheets to store multiple datasets! However, the line in your code:

                              frlink m:1 panelid, link(participation_years)

                              gave me the error: option frame() required. I changed it to the following:

                              frlink m:1 panelid, frame(participation_years)

                              which seemed to work and did the job you described.
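For reference, a compact sketch of the frame workflow with the frame() option, on invented data. It assumes Stata 16 or later; the variable n_years is hypothetical. By default -frlink- names the link variable after the linked frame, which is why from(participation_years) in -frget- still works.

```stata
* Toy data (hypothetical): person 1 has two years, person 2 has one
clear
input panelid year
1 2016
1 2017
2 2016
end

* Build a one-row-per-person summary in a second frame
capture frame drop participation_years
frame copy default participation_years
frame participation_years {
    bysort panelid: gen n_years = _N
    keep panelid n_years
    duplicates drop
}

* Link many person-years to one summary row, then copy n_years over
frlink m:1 panelid, frame(participation_years)
frget n_years, from(participation_years)
frame drop participation_years
list
```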

                              Last edited by Alessandro Bessadi; 01 Apr 2020, 17:40.

