xtsum and cleaning time invariant data

Fathima Salih

Join Date: Jun 2020
Posts: 25


09 Jun 2023, 12:30

Hi,

I have a large unbalanced panel for 2010, 2013, 2016 and 2019. xtsum shows that the time-invariant variable 'male' has a within-person standard deviation of 0.08 and within min and max values of -.26 and 1.24. This suggests errors in the data. How can I figure out which person_ids are causing the problem?

Code:

. xtset person_id year

. xtsum male

Variable         |      Mean   Std. dev.       Min        Max |    Observations
-----------------+--------------------------------------------+----------------
male     overall |  .4853448   .4998003          0          1 |     N =   16547
         between |             .4947272          0          1 |     n =    8320
         within  |              .082274  -.2646552   1.235345 | T-bar = 1.98882

Any advice would be much appreciated.


Clyde Schechter

Join Date: Apr 2014
Posts: 30168

09 Jun 2023, 13:32

Code:

by person_id (male), sort: gen byte flag = ///
    (male[1] != male[_N]) | male[1] < 0 | male[_N] > 1
browse if flag

will show them to you.
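For context on why the xtsum output points to inconsistent records: the "within" row summarizes each observation's deviation from its own person mean, with the overall mean added back. A variable that is truly constant within person would therefore show a within standard deviation of 0 and min = max = the overall mean. As a worked check under the assumption that male only takes the values 0 and 1, the reported extremes match a person with mean 0.75 observed at 0 and a person with mean 0.25 observed at 1:

```latex
\[
\tilde{x}_{it} = x_{it} - \bar{x}_i + \bar{x}, \qquad \bar{x} = 0.4853448
\]
\[
\min:\; 0 - 0.75 + 0.4853448 = -0.2646552, \qquad
\max:\; 1 - 0.25 + 0.4853448 = 1.2353448
\]
```

With at most four rounds per person, a person mean of 0.75 would correspond to someone recorded as male in three rounds and not male in one.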


Fathima Salih

Join Date: Jun 2020

Posts: 25
#3

09 Jun 2023, 15:55

Clyde Schechter thank you! I am ever so grateful. Just one more question. This is an unbalanced panel, so the same person_id may appear in just 1, 2, 3 or all 4 rounds.

Can you please suggest how I can do the following: for a person_id found in 3 rounds, set male to the value that occurs in 2 of the 3 rounds. If the person_id is found in all 4 rounds, set male to the dominant value, the one occurring in 3 of the 4 rounds.

For example, in the extract below person_id 1 is male in 2 of 3 rounds. Therefore, I can reasonably assign male=1 for this person. But person_id 4 remains ambiguous.

Code:

person_id  year  male
1          2013  1
1          2016  0
1          2019  1
2          2010  0
2          2016  1
3          2010  0
3          2013  1
3          2016  1
4          2019  0
Clyde Schechter

Join Date: Apr 2014

Posts: 30168
#4

09 Jun 2023, 16:04

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input byte person_id int year byte male
1 2013 1
1 2016 0
1 2019 1
2 2010 0
2 2016 1
3 2010 0
3 2013 1
3 2016 1
4 2019 0
end

by person_id male, sort: gen freq = _N
by person_id: gen rounds = _N
by person_id (freq), sort: replace male = male[_N] if 2*freq[_N] > rounds[_N]
sort person_id year
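The rule above only overwrites male when one value holds a strict majority of a person's rounds, so ties (for example 1 of 2, or 2 of 4) are left as they were. A minimal sketch for listing the person_ids that are still inconsistent afterwards follows; it reuses the same first/last comparison as in #2, and the variable name still_mixed is only illustrative:

```stata
* Run after the majority-rule -replace- above.
* still_mixed is a hypothetical variable name chosen for this sketch.
* Sorting on male within person puts the smallest value first and the
* largest last, so the two differ only if the person still has mixed values.
by person_id (male), sort: gen byte still_mixed = male[1] != male[_N]

* Show the unresolved cases, one block per person.
list person_id year male if still_mixed, sepby(person_id)

* Restore the panel order.
sort person_id year
```

In the nine-row example, only person_id 2 (one 0 and one 1) would remain flagged. Missing values of male sort last in Stata, so a person with a missing value in some round would also be flagged by this check.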

By the way, this kind of inconsistency is very common in large data sets, so it is useful to know these techniques for cleaning. Inconsistency in reporting race and ethnicity is even more frequent than it is with sex.

In the future, when showing data examples, please use the -dataex- command to do so, as I have done here. If you are running version 18, 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

Last edited by Clyde Schechter; 09 Jun 2023, 16:07.