
  • Did I set up my panel data (with restrictions) correctly?

    Hi all,

    I would like to seek your advice regarding my analysis sample. Specifically, I have a three-wave panel dataset that is described as follows:
    - Wave 1: the baseline survey. I named this wave "wave1" in the data example and code (see below).
    - Wave 2: includes two different samples: i) a follow-up of the wave 1 sample (named wave1_followup); and ii) a fresh sample to compensate for attrition after wave 1 (named wave2_fresh).
    - Wave 3: similarly comprises two different samples: i) a follow-up of the wave 1 and wave 2 samples (named wave2_followup); and ii) a fresh sample to compensate for attrition after wave 2 (named wave3_fresh).

    I impose two restrictions on my analysis sample: I limit it to individuals who participated in two or more interviews and who had at least one living parent at the time of their first interview. The following code is how I construct my panel data with these two restrictions, but I am not sure that what I did is correct, so any advice would be highly appreciated.

    I first keep only individuals with at least one living parent at their first interview, meaning wave1, wave2_fresh, and wave3_fresh are used here.
    Code:
    use `wave1', clear
        keep if parent_alive==1   // keep only those with either father or mother alive
        sort id
        tempfile wave1_new
    save `wave1_new'
    
    use `wave2_fresh', clear
        keep if parent_alive==1   // keep only those with either father or mother alive
        tempfile wave2_fresh_new
    save `wave2_fresh_new'
    
    use `wave3_fresh', clear
        keep if parent_alive==1   // keep only those with either father or mother alive
        tempfile wave3_fresh_new
    save `wave3_fresh_new'
    Next, I merge data cleaned in the first step above with follow-up waves and keep only individuals who participated in two or more interviews
    Code:
    * Merge follow-up data of wave 1 to the fresh sample in wave 2
    use `wave1_followup', clear
        merge 1:1 id using `wave2_fresh_new', nogen
        tempfile wave1_2
    save `wave1_2'
    
    * Merge follow-up data of wave 1 + wave 2 to the fresh sample in wave 3
    use `wave2_followup', clear
        merge 1:1 id using `wave3_fresh_new', nogen
        tempfile wave2_3
    save `wave2_3'
    
    * Append dataset and drop those who are observed once
        append using `wave1_2'
        append using `wave1_new'
        
        isid id year, sort    
    by id: egen nwave = max(_N)
        drop if nwave==1      // drop those who are observed once
    The following are data examples for each wave.
    Code:
    *** Wave 1
    clear
    input int id float(year age) byte sex float parent_alive
     1 2007 62 1 1
     2 2007 75 1 0
     3 2007 58 1 0
     4 2007 64 1 1
     5 2007 52 0 1
     6 2007 65 1 0
     7 2007 54 0 0
     8 2007 54 1 0
     9 2007 64 0 0
    10 2007 71 0 0
    11 2007 56 0 0
    12 2007 66 1 0
    13 2007 68 1 0
    15 2007 57 1 1
    16 2007 58 1 1
    17 2007 69 1 0
    18 2007 58 0 0
    19 2007 71 1 1
    20 2007 66 1 0
    21 2007 68 0 0
    22 2007 65 1 1
    23 2007 73 0 0
    24 2007 62 1 0
    25 2007 57 1 1
    26 2007 64 0 0
    27 2007 73 1 1
    28 2007 51 0 1
    29 2007 65 0 0
    30 2007 54 0 0
    31 2007 51 0 0
    end
    
        tempfile wave1
    save `wave1'
    
    *** Wave 2 - Fresh sample
    clear
    input int id float(year age) byte sex float parent_alive
    3863 2009 50 1 0
    3864 2009 62 0 0
    3865 2009 55 1 1
    3866 2009 67 1 0
    3867 2009 68 0 0
    3868 2009 57 0 1
    3869 2009 55 1 1
    3870 2009 61 0 0
    3871 2009 75 1 0
    3872 2009 56 0 0
    3873 2009 50 0 1
    3874 2009 59 0 0
    3875 2009 51 0 0
    3876 2009 68 1 1
    3877 2009 53 1 1
    3878 2009 54 0 0
    3879 2009 68 0 0
    3880 2009 67 1 0
    3881 2009 73 1 0
    3882 2009 65 0 0
    3883 2009 58 1 0
    3884 2009 75 1 0
    3885 2009 57 1 1
    3886 2009 52 0 0
    3887 2009 50 1 0
    3888 2009 52 1 0
    3889 2009 69 0 0
    3890 2009 59 0 0
    3891 2009 58 0 0
    3892 2009 73 0 0
    end
    
        tempfile wave2_fresh
    save `wave2_fresh'
    
    *** Wave 3 - Fresh sample
    clear
    input int id float(year age) byte sex float parent_alive
    5303 2011 59 0 0
    5304 2011 54 0 0
    5305 2011 71 1 0
    5306 2011 59 1 0
    5307 2011 59 1 1
    5308 2011 52 1 1
    5309 2011 62 0 0
    5310 2011 75 0 0
    5311 2011 60 1 1
    5312 2011 62 0 0
    5313 2011 69 1 0
    5314 2011 57 1 1
    5315 2011 71 0 0
    5316 2011 60 0 0
    5317 2011 63 1 1
    5318 2011 55 0 0
    5319 2011 51 0 1
    5320 2011 54 1 0
    5321 2011 67 0 0
    5322 2011 66 1 0
    5323 2011 67 0 0
    5324 2011 63 0 0
    5325 2011 69 0 0
    5326 2011 74 1 0
    5327 2011 70 0 0
    5328 2011 68 0 0
    5329 2011 57 0 1
    5330 2011 55 0 0
    5331 2011 58 1 1
    5332 2011 52 1 1
    end
    
        tempfile wave3_fresh
    save `wave3_fresh'
    
    *** Follow-up of wave 1
    clear
    input int id float(year age sex parent_alive)
     1 2009 64 1 1
     2 2009 77 1 0
     3 2009 60 1 0
     5 2009 55 0 1
     6 2009 67 1 0
     8 2009 56 1 0
     9 2009 66 0 0
    12 2009 68 1 0
    13 2009 70 1 0
    15 2009 59 1 1
    16 2009 61 1 1
    17 2009 72 1 0
    18 2009 60 0 0
    19 2009 73 1 1
    20 2009 68 1 0
    21 2009 70 0 0
    22 2009 67 1 1
    25 2009 59 1 1
    26 2009 66 0 0
    27 2009 75 1 1
    29 2009 67 0 0
    30 2009 57 0 0
    31 2009 54 0 0
    34 2009 72 1 0
    36 2009 62 1 1
    38 2009 69 0 0
    39 2009 71 0 0
    42 2009 61 1 0
    43 2009 61 0 1
    45 2009 54 0 1
    end
    
        tempfile wave1_followup
    save `wave1_followup'
    
    *** Follow-up of wave 1 +  wave 2
    clear
    input int id float(year age sex parent_alive)
     1 2011 66 1 1
     2 2011 79 1 0
     3 2011 62 1 0
     5 2011 57 0 1
     6 2011 69 1 1
     8 2011 58 1 1
     9 2011 68 0 0
    13 2011 72 1 1
    15 2011 61 1 1
    16 2011 63 1 1
    17 2011 74 1 0
    18 2011 62 0 0
    19 2011 75 1 1
    20 2011 70 1 0
    21 2011 72 0 0
    22 2011 69 1 0
    25 2011 61 1 1
    26 2011 68 0 0
    27 2011 77 1 1
    29 2011 69 0 0
    30 2011 59 0 1
    31 2011 56 0 1
    34 2011 74 1 0
    36 2011 64 1 0
    38 2011 71 0 0
    39 2011 73 0 0
    42 2011 63 1 0
    43 2011 63 0 1
    45 2011 56 0 1
    46 2011 59 1 0
    end
    
        tempfile wave2_followup
    save `wave2_followup'

  • #2
    First, thank you for the superlative presentation of your data and code. It was a pleasure to work with. Sorry to say, I would take the code in an entirely different direction.

    As someone who has been working with panel data for some years now, here is how I would go about combining your 5 datasets.
    Code:
    clear
    append using `wave1' `wave2_fresh' `wave3_fresh' `wave1_followup' `wave2_followup'
    isid id year, sort
    by id: generate pa1 = parent_alive[1]
    by id: generate nwave = _N
    This reflects my experience working with panel data.

    First, I limit my preparation of the basic input files to issues of data cleaning and standardization. For example, I'd want to look at your two followup files and see if there are any fractional values of sex, and if not, recast the variable as a byte in those files, like it is in the fresh samples, just because. (This is indeed a trivial example, but a convenient one.) And if the variable name were Age in the fresh samples and age in the followup, I'd rename it in one set or the other so I don't wind up with two different variables in the appended data. Don't be thinking about things like "who will be in the analytic sample" at that point - start by getting your raw data cleaned and appended into a panel structure, where it is easier to work with en masse. But let me emphasize, this preparation is crucial to understanding your data. Appending it without prior review is poor practice.
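    As a minimal sketch of that standardization step for one of the input files (the assert line and the Age-to-age rename are hypothetical illustrations, not things your actual files necessarily need):
    Code:
    use `wave1_followup', clear
        assert sex == int(sex)    // confirm there are no fractional values of sex
        recast byte sex           // match the storage type used in the fresh samples
        capture rename Age age    // harmonize a hypothetical name difference across files
        tempfile wave1_followup_clean
    save `wave1_followup_clean'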

    Second, by constructing my derived variables in the appended data, I'm doing it once, and it's guaranteed to be consistent across all the waves, no typo in the code for the third of the five raw datasets.

    Third, I keep my options open until the last minute. Don't drop data early in the process. This is based on painful experience. If I were presenting an analysis of this data to the project leader, or worse yet, to a journal for review, I imagine I would hear something like "That looks good. However, the work of Dewey, Nailem and Howe (2013) suggests including the cases with just one wave, as long as it's the third wave. You should address that." Where "address" means "rerun everything that way and then add a footnote saying that in work not reported here you found that taking into account the approach of Dewey et al. (2013) did not yield an appreciable difference in the results." That is easier to do if you don't have to go back to square one. Or, think about your previous topic, which looked like you were going to be analyzing children's data, but the father's observation was needed to add the father's age to the child's observation.

    All this is a lot easier if you don't drop data, at least not data that is related in some sense to what you project your ultimate analytic sample to be. If you want to tab sex for your analytical sample, it's as easy as
    Code:
    tab sex if pa1==1 & nwave>1



    • #3
      Dear William,

      Thank you so much for the great experience sharing on how to manage panel data. It’s very helpful for my future work and I am grateful for your help and sharing.

      I know that my approach to panel data as shown in #1 is not a good one, I myself was not confident about it and that is why I created this topic. To be honest, I am pretty new to panel data and there are lots of things to learn. I am happy that I have learned useful things from you through this thread.

      As for the code in #2, could you please explain the following line to me? I do not clearly understand how it works. I guess that it is used to identify individuals whose parents were alive at their first interview, but I don't understand what [1] means.
      Code:
      by id: generate pa1 = parent_alive[1]
      Thank you.
      Last edited by Matthew Williams; 28 Aug 2021, 11:11.



      • #4
        With regard to your question, when a variable name is followed by a bracketed expression, the expression is treated as the observation number - a very specific application of Stata's subscript notation. So without the by: prefix
        Code:
        generate pa1 = parent_alive[1]
        would set pa1 equal, in every observation, to the value of parent_alive in the first observation of the dataset. Not very useful! But we sorted by id and year, so the
        Code:
        by id:
        prefix runs the code separately for each value of id, so that
        Code:
        by id: generate pa1 = parent_alive[1]
        sets pa1 equal, for every observation with the same id, to the value of parent_alive in the first observation for that id – which since we sorted by year, is the observation in the first year in which that id appeared.
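        To see it in miniature, here is a toy example (made-up data, not from your files):
        Code:
        clear
        input int id float(year parent_alive)
        1 2007 1
        1 2009 0
        2 2009 0
        2 2011 1
        end
        isid id year, sort
        by id: generate pa1 = parent_alive[1]
        list, sepby(id)
        Both observations for id 1 get pa1 = 1 (its 2007 value), and both observations for id 2 get pa1 = 0 (its 2009 value).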

        For more documentation, see the output of
        Code:
        help subscripting
        help by
        OK, having answered that, I'm going to share a thought with you. In thinking about your five datasets, and about the difference in the storage of sex, and for that matter the fact that your parent_alive variable is stored as a float (not that either of these make one bit of difference to the results), I wondered if you hadn't perhaps already done some significant massaging of your data before getting to the point of post #1 in this topic.

        In particular, I'm a little surprised to see separate datasets for new interviews and followups. I would have expected the following, which is what I commonly see.
        Code:
        *** Wave 1
        clear
        input int id float(year age) byte sex float parent_alive
         1 2007 62 1 1
         2 2007 75 1 0
         3 2007 58 1 0
         4 2007 64 1 1
         5 2007 52 0 1
         6 2007 65 1 0
         7 2007 54 0 0
         8 2007 54 1 0
         9 2007 64 0 0
        10 2007 71 0 0
        end
        
            tempfile wave1
        save `wave1'
        
        *** Wave 2
        clear
        input int id float(year age) byte sex float parent_alive
           1 2009 64 . 1
           2 2009 77 . 0
           3 2009 60 . 0
           5 2009 55 . 1
           6 2009 67 . 0
           8 2009 56 . 0
           9 2009 66 . 0
        3863 2009 50 1 0
        3864 2009 62 0 0
        3865 2009 55 1 1
        3866 2009 67 1 0
        3867 2009 68 0 0
        3868 2009 57 0 1
        3869 2009 55 1 1
        3870 2009 61 0 0
        3871 2009 75 1 0
        3872 2009 56 0 0
        end
        
            tempfile wave2
        save `wave2'
        
        *** Wave 3
        clear
        input int id float(year age) byte sex float parent_alive
           1 2011 66 . 1
           2 2011 79 . 0
           3 2011 62 . 0
           5 2011 57 . 1
           6 2011 69 . 1
           8 2011 58 . 1
           9 2011 68 . 0
        3863 2011 52 . 0
        3864 2011 64 . 0
        3865 2011 57 . 1
        3866 2011 69 . 0
        3867 2011 70 . 0
        3868 2011 59 . 1
        3869 2011 57 . 1
        3870 2011 63 . 0
        3871 2011 77 . 0
        3872 2011 58 . 0
        5303 2011 59 0 0
        5304 2011 54 0 0
        5305 2011 71 1 0
        5306 2011 59 1 0
        5307 2011 59 1 1
        5308 2011 52 1 1
        5309 2011 62 0 0
        5310 2011 75 0 0
        5311 2011 60 1 1
        5312 2011 62 0 0
        end
        
            tempfile wave3
        save `wave3'
        Separate datasets for each wave of the survey, with new additions identified by the range in which their IDs lie, and with missing values for sex in the followup observations, because sex was asked once in the initial wave for each individual, and (in most panel surveys up until now, but this will certainly be changing) was assumed to be a fixed characteristic and you don't want to waste the subject's time on questions they've already answered once in an earlier wave. (That could also be true for age, if year of birth is asked in the first wave.) I had to make up wave 3 followup data for those who joined the panel in wave 2.

        With that data structure, the code becomes
        Code:
        append using `wave1' `wave2' `wave3'
        isid id year, sort
        by id: replace sex = sex[1]
        by id: generate pa1 = parent_alive[1]
        by id: generate nwave = _N
        and you see that the missing values of sex are easily copied to the followup observations.

        With that, if you are working with panel data you should certainly familiarize yourself with the Stata Longitudinal-Data/Panel-Data Reference Manual PDF included in your Stata installation and accessible through Stata's Help menu, if you have not done so already.

        And with that said, since you were not familiar with subscript notation, let me add the following.

        When I began using Stata in a serious way, I started, as have others here, by reading my way through the Getting Started with Stata manual relevant to my setup. Chapter 18 then gives suggested further reading, much of which is in the Stata User's Guide, and I worked my way through much of that reading as well. All of these manuals are included as PDFs in the Stata installation and are accessible from within Stata - for example, through the PDF Documentation section of Stata's Help menu.

        In particular, the Getting Started manual recommends chapters 11, 12, and 13 of the User's Guide, and amongst those chapters you will find subscript syntax explained, and much more of the most fundamental parts of using Stata.

        The objective in doing the reading was not so much to master Stata - I'm still far from that goal - as to be sure I'd become familiar with a wide variety of important basic techniques, so that when the time came that I needed them, I might recall their existence, if not the full syntax, and know how to find out more about them in the help files and PDF manuals.

        Stata supplies exceptionally good documentation that amply repays the time spent studying it - there's just a lot of it. The path I followed surfaces the things you need to know to get started in a hurry and to work effectively.
        Last edited by William Lisowski; 28 Aug 2021, 12:22.



        • #5
          A million thumbs up for you: very clear and easy-to-understand explanations, a thoughtful data review, and excellent advice on where and how to read the Stata manuals. I find myself lucky to have you in this topic, indeed. Thanks a lot.

          As for missing values of sex, I was aware of that and have already solved that problem (not only for sex but also for age, weight, and height; my study subjects are older people, so these characteristics barely change over time) in my almost-final dataset. I am sorry for providing uncleaned data in post #1.

          I will follow your advice by starting with the sections you have suggested in the Stata manuals. There must be lots of interesting things to learn, I guess.

          I just have one follow-up question: I also observe inconsistent values of parent_alive between the initial wave and later waves. Take id=6 in post #4 as an example: parent_alive=0 in the first and second waves, but in the third wave it becomes 1. There must be something wrong here, and I will need to go back to the raw data to see what happened. However, assuming there were no problems with the raw data and I want to make the values of parent_alive consistent across waves, one way to do this is something like:
          Code:
          bys id: replace parent_alive = 0 if parent_alive[1]==0 & parent_alive[3]~=0
          The code above works, but if I have T=20 or more waves then it may not be a good approach. I think there must be a smarter way of handling this problem; do you have any suggestions?

          Thank you.



          • #6
            In post #4 a better version of the code would have been
            Code:
            append using `wave1' `wave2' `wave3'
            isid id year, sort
            by id (year): replace sex = sex[1]
            by id (year): generate pa1 = parent_alive[1]
            by id (year): generate nwave = _N
            to assert that the data was sorted by id and year, and then to run the commands separately for each id. Or, if we did not want to assume that the data was already sorted,
            Code:
            bysort id (year): replace sex = sex[1]
            bysort id (year): generate pa1 = parent_alive[1]
            bysort id (year): generate nwave = _N
            With that out of the way: assuming that an initial value of 0 for parent_alive is always correct, you could do
            Code:
            bysort id (year): replace parent_alive = 0 if parent_alive[1]==0
            to carry the value 0 through all further waves.
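            As a quick trace of that on the problem case from post #4 (id 6, whose values were 0, 0, 1 across the three waves; toy data only):
            Code:
            clear
            input int id float(year parent_alive)
            6 2007 0
            6 2009 0
            6 2011 1
            end
            bysort id (year): replace parent_alive = 0 if parent_alive[1]==0
            After this, parent_alive is 0 in all three observations for id 6; the inconsistent 2011 value of 1 has been overwritten.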



            • #7
              Let me add that
              Code:
              bysort id (year): generate first0 = parent_alive[1]==0   // explicit subscripts inside egen refer to the whole dataset, not the by-group, so build this flag with generate first
              by id: egen problem = max(parent_alive==1 & first0)
              browse if problem==1
              will set problem to 1 for every observation of any id for which the parent who was not alive in the initial wave was shown as alive in a subsequent wave, and then open the Data Browser window showing only those ids. This will help you get a handle on the magnitude of the problem, and will perhaps help you see some patterns. In particular, I would be concerned about individuals who were 0 in the initial wave and 1 in several waves immediately afterwards, which would suggest that the initial wave was in error.



              • #8
                Dear William,

                Thank you so much for your code and suggestions, very helpful. I will take a closer look at my data.

