How to restrict ages within a specific range (in one given wave) in panel data?

Guri Gray

Join Date: Jul 2022

Posts: 6
#1


02 Jul 2022, 12:28

I am having trouble including only individuals who were interviewed in the 1st, 2nd, 3rd and 4th waves of the SHARE dataset and who were 50–75 years old at the 2nd wave interview. Specifically, the part where the interviewed individuals have to be between 50 and 75 years old at the 5th wave interview is what I am failing to achieve without Stata automatically dropping all other waves.

This is what I have done so far:

1. In order to keep only those who participated in all waves (1-4) I did the following (which was successful):
recode wave (4=1) (5=2) (6=3) (7=4)
isid mergeid wave, sort
assert inlist(wave, 1, 2, 3, 4)
by mergeid (wave): keep if _N == 4

My sample now only includes individuals who are interviewed in all waves. I can see this by:

tab wave

       wave |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |     24,352       25.00       25.00
          2 |     24,352       25.00       50.00
          3 |     24,352       25.00       75.00
          4 |     24,352       25.00      100.00
------------+-----------------------------------
      Total |     97,408      100.00

Then I de-string the person identifier variable (called mergeid): encode mergeid, generate(id) label(id)

2. This is the critical part. I am now trying to include only those who were interviewed in all waves AND WHO WERE 50-75 YEARS OLD AT THE 5TH WAVE INTERVIEW (which is now labelled as wave 2).
There are 4 age variables in my dataset, one for each wave (age2011 for the 1st wave, age2013 for the 2nd wave, age2015 for wave 3, and age2016 for wave 4). I have tried 2 methods:

FIRST METHOD:
drop if age2013<50
drop if age2013>75

SECOND METHOD:
bysort id (wave): drop if age2013<50
bysort id (wave): drop if age2013>75

In both cases, when I then run tab wave, Stata has dropped the 1st, 3rd, and 4th waves, leaving me with only wave 2:

       wave |      Freq.     Percent        Cum.
------------+-----------------------------------
          2 |     19,177      100.00      100.00
------------+-----------------------------------
      Total |     19,177      100.00

I would be really thankful if someone could explain to me what I am doing wrong, and what I can do to perform this type of sample restriction, i.e. I want to keep only those individuals who are interviewed in all 4 waves, AND who were between 50–75 years old at the 2nd wave interview.

Many thanks,
Guri
Guri Gray

Join Date: Jul 2022

Posts: 6
#2

02 Jul 2022, 12:31

CORRECTION: In the second row, it should say "Specifically, the part where the interviewed individuals have to be between 50 and 75 years old at the 2nd wave interview is what I am failing to achieve without Stata automatically dropping all other waves". I accidentally wrote 5th.
Andrew Musau

Join Date: Oct 2014

Posts: 10482
#3

02 Jul 2022, 13:18

You need something like:

Code:

bys mergeid: egen tag = max(inrange(age2013, 50, 75))
keep if tag
William Lisowski

Join Date: Dec 2014

Posts: 10150
#4

02 Jul 2022, 13:18

I'm assuming that age2013 is a missing value in all the waves other than 2.

Code:

bysort mergeid (wave): egen tokeep = max(inrange(age2013,50,75))
keep if tokeep

For observations in which age2013 is between 50 and 75, the value of inrange() will be 1; for all other observations it will be 0.
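
To see the mechanics on a small scale, here is a minimal sketch on made-up data. The three people and their ages are hypothetical, and it relies on the assumption stated above that age2013 is missing outside wave 2. Under that assumption, a plain drop if age2013>75 would also remove the wave 1, 3 and 4 rows, because Stata treats a numeric missing value as larger than any number, which would explain why only wave 2 survived.

```stata
* Toy long-format panel: three hypothetical people, four waves each.
* age2013 is non-missing only on the wave-2 row (the assumption stated above).
clear
input str1 mergeid byte wave float age2013
"A" 1 .
"A" 2 62
"A" 3 .
"A" 4 .
"B" 1 .
"B" 2 48
"B" 3 .
"B" 4 .
"C" 1 .
"C" 2 80
"C" 3 .
"C" 4 .
end

* inrange() is 1 only on a wave-2 row with age 50-75 and 0 elsewhere
* (including rows where age2013 is missing); max() within person then
* copies that 1 to every row of a qualifying person.
bysort mergeid (wave): egen tokeep = max(inrange(age2013, 50, 75))
keep if tokeep

* Only person A (aged 62 at wave 2) should remain, with all four waves.
tab wave
```

The point of the egen step is that the keep decision is made once per person rather than once per row, so a qualifying person keeps every wave.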
Guri Gray

Join Date: Jul 2022

Posts: 6
#5

04 Jul 2022, 11:45

Thank you for your assistance!

Next, I am trying to create treatment and control groups. Specifically, I am doing some research on the effect of retirement on mental health, and I am using state pension ages as an instrumental variable for actually retiring.

As I have mentioned, my dataset contains 4 waves of interviews in total. I want to assign individuals who crossed the state pension age between the first and second interviews to the treatment group. Conversely, those who did not cross the state pension age during that period will be assigned to the control group.

What I am finding difficult is that I have 10 different countries, which in my case means several different state pension ages.
Individuals reaching their state pension age between the first and the second waves were between 60 and 65 years old, depending on country and gender. Therefore I am wondering how I could make Stata identify individuals who crossed their respective state pension age between waves 1 and 2. Do I need to do everything separately for every country, or can I use something that does this for all countries at once even though they have different state pension ages?
*I am not interested in the effects for any specific country; this is multi-country research.

Additionally, the dates of the interviews differ across waves, so it seems that there is no easy way to construct these groups manually.

*The relevant (I believe) variables I have are:
wave = wave identifier (1, 2, 3, 4)
mergeid = person identifier (fixed across modules and waves)
id = after "encoding" mergeid from string to numerical
country = country identifier
age = an age variable I defined by summing year of birth and month of birth to get more precise ages
age2011 = respondent's age in wave 1
age2013 = respondent's age in wave 2
age2015 = respondent's age in wave 3
age2016 = respondent's age in wave 4
age_int = age of respondent at the time of the interview
int_year = interview year
int_month = interview month

Without making this more confusing, I will summarize what I would like to get help with. What I am trying to do is the following:

I have 4 waves of interviews. I want to assign those who crossed the state pension age (in their respective country) between waves 1 and 2 to the treatment group. Conversely, I want to assign those who did not cross the state pension age (in their respective country) during that same period to the control group. I have data from 10 European countries that have different state pension ages. What is the best way to create these groups?

Many thanks,
Guri

William Lisowski

Join Date: Dec 2014
Posts: 10150
#6

04 Jul 2022, 12:07

Do you have gender in your dataset? Along with country, that is needed to determine the state pension age.

1) Create a dataset with country, gender, and the pension age in that country for that gender:

Code:

    country   gender   pension_age  
          A     Male            65  
          A   Female            65  
          B     Male            64  
          B   Female            60

2) Merge this dataset into the dataset you already have

Code:

use surveydata, clear
merge m:1 country gender using pensiondata

3) generate what you want.

Code:

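* Indicator for being at or above the state pension age in a given wave
generate older = age>=pension_age
* Count, per person, how many of two conditions hold: at or above pension age in wave 2, below it in wave 1
bysort mergeid (wave): egen both = total( (wave==2 & age>=pension_age) | (wave==1 & age<pension_age) )
* wanted is 1 for people who crossed the pension age between waves 1 and 2, and 0 otherwise
generate wanted = both==2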
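
The three steps can be tried end to end on made-up data. This is only a sketch: the numeric country and gender codes, the two people and their ages are hypothetical, and it assumes country and gender are stored with the same numeric codes in both datasets. The name treated is just an illustrative stand-in for wanted.

```stata
* Step 1: hypothetical pension-age lookup, one row per country-gender pair.
clear
input byte country byte gender float pension_age
11 1 65
11 2 60
end
tempfile pensiondata
save `pensiondata'

* Hypothetical survey rows: two people from country 11, waves 1 and 2 only.
clear
input str1 mergeid byte wave byte country byte gender float age
"A" 1 11 1 64.2
"A" 2 11 1 66.1
"B" 1 11 2 55.0
"B" 2 11 2 57.3
end

* Step 2: attach the pension age and check that every row found a match.
merge m:1 country gender using `pensiondata'
assert _merge == 3
drop _merge

* Step 3: treated = below pension age in wave 1 and at or above it in wave 2.
bysort mergeid (wave): egen both = total((wave==2 & age>=pension_age) | (wave==1 & age<pension_age))
generate treated = both==2

list mergeid wave age pension_age treated, sepby(mergeid)
```

Person A crosses 65 between the two interviews and ends up with treated equal to 1; person B stays below 60 in both waves and gets 0. In real data, a missing age would count as at or above the pension age (missing sorts above every number), so it may be worth adding a !missing(age) condition.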


Guri Gray

Join Date: Jul 2022

Posts: 6
#7

04 Jul 2022, 13:47

Hi again William! Thanks for your assistance, I appreciate it a lot!

To answer your question: Yes, I have gender in my dataset.

I did everything the way you explained, but I suspect I did something wrong when creating the dataset with those 3 variables. Unfortunately, the variables do not get matched when combining the datasets. This is what I get:

(variable country was byte, now long to accommodate using data's values)
(variable gender was byte, now long to accommodate using data's values)
(label country already defined)
(label gender already defined)

    Result                      Number of obs
    -----------------------------------------
    Not matched                        61,104
        from master                    61,084  (_merge==1)
        from using                         20  (_merge==2)

    Matched                                 0  (_merge==3)
    -----------------------------------------
-----------------------------------------

When coding the variables, I tried both using the name of the gender/country and their value.

This is how I did it the last time in the data editor:

pension_age country gender
65 11 1
60 11 2
65.08334 12 1
65.08334 12 2
65 13 1
65 13 2
65 15 1
65 15 2
66 16 1
66 16 2
65 17 1
65 17 2
65 18 1
65 18 2
65 20 1
64 20 2
65 23 1
65 23 2
62.5 28 1
61.33333 28 2

Can you see what I am doing wrong?

Many thanks,
Guri

Last edited by Guri Gray; 04 Jul 2022, 13:51.
Guri Gray

Join Date: Jul 2022

Posts: 6
#8

04 Jul 2022, 13:57

I apologise for the messy structure; this is only my first day as a Statalist user, so my skills here are not impressive. But as you can perhaps see, I have coded the countries and genders according to the values they have in the "master" file:

PHP Code:

pension_age country gender
65 11 1
60 11 2
65.08334 12 1
65.08334 12 2
65 13 1
65 13 2
65 15 1
65 15 2
66 16 1
66 16 2
65 17 1
65 17 2
65 18 1
65 18 2
65 20 1
64 20 2
65 23 1
65 23 2
62.5 28 1
61.33333 28 2

Last edited by Guri Gray; 04 Jul 2022, 14:07.
William Lisowski

Join Date: Dec 2014

Posts: 10150
#9

04 Jul 2022, 14:41

Please take a few moments to review the Statalist FAQ linked to from the top of the page, as well as from the Advice on Posting link on the page you used to create your post. Note especially sections 9-12 on how to best pose your question. It is particularly helpful to copy commands and output from your Stata Results window and paste them into your Statalist post using code delimiters [CODE] and [/CODE], and to use the dataex command to provide sample data, as described in section 12 of the FAQ.

To understand fully why your datasets did not match, we need to see example data from each dataset, presented as the output of the dataex command described below. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, dataex is already part of your official Stata installation. If not, run ssc install dataex to get it. Either way, run help dataex and read the simple instructions for using it. dataex includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

By default dataex will output the first 100 observations, but this can be changed using if and in and the obs() option. We don't need to see more than 20 observations from each dataset. You can also limit the output from the survey dataset to the variables wave, id, country, and age by specifying them as the variable list on the command - otherwise it will try to output every variable.
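
As a rough illustration of restricting both the variables and the observations, the sketch below builds a small stand-in dataset (entirely made up, 10 people by 4 waves) and then calls dataex with a variable list and an in or if qualifier. In your own session you would skip the data-building lines and run only the dataex commands on your survey data.

```stata
* Hypothetical stand-in for the survey data: 10 people x 4 waves.
clear
set obs 40
generate long id = ceil(_n/4)
generate byte wave = mod(_n - 1, 4) + 1
generate byte country = 11
generate float age = 50 + id + 2*(wave - 1)

* Show only the listed variables for the first 20 observations.
dataex wave id country age in 1/20

* Alternatively, restrict with -if-, here to the wave-2 rows only.
dataex wave id country age if wave == 2
```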

The output of dataex will look something like the following.

Code:

----------------------- copy starting from the next line -----------------------
[CODE]
* Example generated by -dataex-. To install: ssc install dataex
clear
input int(x1 x2 x3) float x4 int x5 byte x6
 4195 24 1   2 10 0
10371 16 3 3.5 17 0
 4647 28 3   2 11 0
...
 5079 24 4 2.5  8 1
 8129 21 4 2.5  8 1
 4296 21 3 2.5 16 1
end
label values x6 yesno
label def yesno 0 "No", modify
label def yesno 1 "Yes", modify
[/CODE]
------------------ copy up to and including the previous line ------------------

In your dataex output you will select the lines between, but not including, "copy starting from the next line" and "copy up to and including the previous line" and then paste that into your reply. The result presented in your post will look something like the following.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int(x1 x2 x3) float x4 int x5 byte x6
 4195 24 1   2 10 0
10371 16 3 3.5 17 0
 4647 28 3   2 11 0
...
 5079 24 4 2.5  8 1
 8129 21 4 2.5  8 1
 4296 21 3 2.5 16 1
end
label values x6 yesno
label def yesno 0 "No", modify
label def yesno 1 "Yes", modify
