  • xtsum and cleaning time invariant data

    Hi,

    I have a large unbalanced panel for 2010, 2013, 2016 and 2019. xtsum shows that the time-invariant variable 'male' has a within-person standard deviation of 0.08 and within min and max values of -.26 and 1.24. This suggests errors in the data. How can I figure out which person_ids are causing the problem?

    Code:
    . xtset person_id year
    
    . xtsum male
    
    Variable         |      Mean   Std. dev.       Min        Max |    Observations
    -----------------+--------------------------------------------+----------------
    male     overall |  .4853448   .4998003          0          1 |     N =   16547
             between |             .4947272          0          1 |     n =    8320
             within  |              .082274  -.2646552   1.235345 | T-bar = 1.98882

    Any advice is much appreciated.


  • #2
    Code:
    * sort male within person; flag persons whose values differ or fall outside [0, 1]
    by person_id (male), sort: gen byte flag = ///
        (male[1] != male[_N]) | male[1] < 0 | male[_N] > 1
    browse if flag
    will show them to you.
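
    To get a count and a list of the distinct offending identifiers rather than browsing every flagged row, a follow-up along these lines may help. This is only a sketch: the toy data and the variable name one_per_id are assumptions for illustration, not part of the original data set.

```stata
* Toy data (hypothetical), for illustration only
clear
input long person_id int year byte male
1 2010 1
1 2013 0
2 2010 0
2 2013 0
3 2016 1
end

* Same flag as above: male varies within person, or lies outside [0, 1]
by person_id (male), sort: gen byte flag = ///
    (male[1] != male[_N]) | male[1] < 0 | male[_N] > 1

* Mark one row per person so each id is counted and listed once
egen byte one_per_id = tag(person_id)

* Number of problem persons, then their ids
count if flag & one_per_id
list person_id if flag & one_per_id, noobs

* Full history of the problem persons, grouped by id
sort person_id year
list person_id year male if flag, sepby(person_id) noobs
```

    In the toy data only person_id 1 reports both values, so only that id is listed.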


    • #3
      Clyde Schechter, thank you! I am ever so grateful. Just one more question. This is an unbalanced panel, so the same person_id may occur in only 1, 2, 3 or all 4 rounds.

      Can you please suggest how I can do the following: for a person_id found in 3 rounds, male should take the value that occurs in 2 of the 3 rounds. If the person_id is found in all 4 rounds, male should take the dominant value, occurring in 3 of the 4 rounds.

      For example, in the extract below person_id 1 is male in 2 of 3 rounds. Therefore, I can reasonably assign male=1 for this person. But person_id 2, observed once with each value, remains ambiguous.


      Code:
      person_id    year    male
      1    2013    1
      1    2016    0
      1    2019    1
      2    2010    0
      2    2016    1
      3    2010    0
      3    2013    1
      3    2016    1
      4    2019    0





      • #4
        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input byte person_id int year byte male
        1 2013 1
        1 2016 0
        1 2019 1
        2 2010 0
        2 2016 1
        3 2010 0
        3 2013 1
        3 2016 1
        4 2019 0
        end
        
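        * freq = number of rounds in which this person reports this value of male
        by person_id male, sort: gen freq = _N
        * rounds = number of rounds in which this person is observed
        by person_id: gen rounds = _N
        * sort the most frequent value last; adopt it only if it is a strict majority
        by person_id (freq), sort: replace male = male[_N] if 2*freq[_N] > rounds[_N]
        sort person_id year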
        By the way, this kind of inconsistency is very common in large data sets, so it is useful to know these techniques for cleaning. Inconsistency in reporting race and ethnicity is even more frequent than it is with sex.
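
        Note that the strict-majority rule leaves ties untouched, for instance a person observed once with each value. A possible follow-up check is sketched below. It repeats the example data and cleaning steps so that it runs on its own, and the variable name unresolved is an assumption for illustration.

```stata
* Example data and majority-rule cleaning, repeated so the sketch is self-contained
clear
input byte person_id int year byte male
1 2013 1
1 2016 0
1 2019 1
2 2010 0
2 2016 1
3 2010 0
3 2013 1
3 2016 1
4 2019 0
end
by person_id male, sort: gen freq = _N
by person_id: gen rounds = _N
by person_id (freq), sort: replace male = male[_N] if 2*freq[_N] > rounds[_N]

* Flag persons whose male still varies after cleaning (ties with no majority)
by person_id (male), sort: gen byte unresolved = male[1] != male[_N]

* Show the remaining cases, grouped by person
sort person_id year
list person_id year male if unresolved, sepby(person_id) noobs
```

        In this example only person_id 2 is still flagged, because each value occurs once. Such cases need a decision outside the majority rule, for example checking another source or setting male to missing.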

        In the future, when showing data examples, please use the -dataex- command to do so, as I have done here. If you are running version 18, 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
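
        As a rough illustration of typical usage, the sketch below uses the auto data set shipped with Stata as a stand-in for real data. The variable list and the observation range are arbitrary choices made for this example.

```stata
* Built-in example data set stands in for the real data
sysuse auto, clear

* Print a small, postable extract: three variables, first five observations
dataex make price mpg in 1/5
```

        The output shown in the Results window can then be copied into the forum post as is.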
        Last edited by Clyde Schechter; 09 Jun 2023, 16:07.
