Deleting ID that is not present in both datasets for Panel Data

Alyaa Ramli

Join Date: Apr 2023

Posts: 2
#1


09 Apr 2023, 15:24

Hi,

I am interested in doing regression analysis with panel data. The data I've chosen come in 2 waves, the 1st wave and the 9th wave. I have appended the 9th wave dataset to the 1st wave dataset and used the keep command in my best attempt to 'clean' the data so that it shows only the variables I need. For panel data tests, I need a variable like "year" or, in this case, "wave" to differentiate between the 2 recordings of a variable for the same person across the 2 waves. Here are my do-file commands for now:

Code:

/* 2ND WAVE DATASET
Please execute these for ci_indresp_w.dta on a separate Stata window
use "C:\Users\User\OneDrive\Desktop\ci_indresp_w.dta"
generate wave=., after(pidp)
replace wave=9 if wave==.
*save dataset and exit*
*/

//1ST WAVE DATASET//
use "C:\Users\User\OneDrive\Desktop\ca_indresp_w.dta"
generate wave=., after(pidp)
replace wave=1 if wave==.

//combine both waves of dataset//
append using "C:\Users\User\OneDrive\Desktop\ci_indresp_w.dta"
sort pidp
keep pidp wave ca_netpay_amount ca_netpay_period ci_netpay_amount ci_netpay_period ///
    ca_hours ci_hours ca_sex ci_sex ca_age ci_age ca_couple ci_couple ///
    ca_hhcompa ci_hhcompa ca_hhcompb ci_hhcompb ca_hhcompc ci_hhcompc ///
    ca_hhcompd ci_hhcompd ca_hhcompe ci_hhcompe

gen netpay=ca_netpay_amount, after(pidp)
replace netpay=ci_netpay_amount if netpay==.
gen netpayperiod=ca_netpay_period, after(pidp)
replace netpayperiod=ci_netpay_period if netpayperiod==.
gen hours=ca_hours, after(pidp)
replace hours=ci_hours if hours==.
gen sex=ca_sex, after(pidp)
replace sex=ci_sex if sex==.
gen couple=ca_couple, after(pidp)
replace couple=ci_couple if couple==.

For context, I want my data to look like this because my lecturer taught us panel data using this structure, so I thought it would be better to run the tests with the same kind of structure (forgive me if I'm wrong, as I am a Stata novice).

Here is a visual of the data I have now:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input long pidp float(couple sex hours netpayperiod netpay wave)
   76165 1 2 25  3 3200 1
   76165 1 2 38  3 3500 9
  280165 1 2  0  3 1700 1
  469205 2 2 16  3  750 9
  469205 2 2  0  3  650 1
  599765 1 2 37  3 2591 1
  599765 1 2 35  3 2617 9
  732365 2 1 -8 -8   -8 9
  732365 2 1 -8 -8   -8 1
 1587125 2 2 37  1  600 1
 1587125 2 2 37  3 2200 9
 3424485 2 2 -8 -8   -8 1
 3424485 2 2 -8 -8   -8 9
 4849085 1 1 38  3 3215 9
 4849085 1 1 46  3 3200 1
68002725 2 2 -8 -8   -8 9
68008847 2 2 39  3 1389 9
68008847 2 2 39  3 1202 1
68010887 1 2 37  3 1300 1
68031967 2 2 -8 -8   -8 1
68035365 2 1 -8 -8   -8 9
68035365 2 1 -8 -8   -8 1
end

This is only a snapshot; the dataset contains 30579 observations (including duplicate IDs).

The IDs that appear only once (in the snapshot above: 280165, 68002725, 68010887 and 68031967) are the ones I am trying to drop, because they are not present in both datasets. Is there a way to do this, or is it futile? Or will the presence of these unpaired IDs not affect testing later?

I also noticed that the wave values alternate, but not in a uniform way. For example, the pattern runs "1,9,9,1,1,9,9,1,..." and so on, then changes around row 11 before the alternating starts again. Is there a way to sort wave so that it alternates uniformly, like "1,9,1,9,1,9,..."?

I appreciate any help at all. I am sorry for the poor composition of this question and the messy Stata code and output.

Thank you and have a good day.

Last edited by Alyaa Ramli; 09 Apr 2023, 15:26. Reason: added tags
Tags: label, panel data, remove, sort, syntax
Clyde Schechter

Join Date: Apr 2014

Posts: 30177
#2

09 Apr 2023, 18:26

Code:

isid pidp wave, sort
assert inlist(wave, 1, 9)
by pidp (wave): drop if _N == 1

will do what you ask.
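To see what each of the three commands above does, here is a minimal sketch run on a few rows taken from the snapshot in #1. Only the pidp, hours, netpay and wave columns are kept for brevity; that subset is an illustrative assumption, not the full dataset.

```stata
* A few rows from the -dataex- snapshot in #1 (subset chosen for illustration)
clear
input long pidp float(hours netpay wave)
   76165 25 3200 1
   76165 38 3500 9
  280165  0 1700 1
  469205 16  750 9
  469205  0  650 1
68002725 -8   -8 9
end

* Verify that pidp and wave jointly identify each row, and sort by them.
* After this, rows for each person appear in wave order (1 before 9).
isid pidp wave, sort

* Stop with an error if any wave code other than 1 or 9 is present
assert inlist(wave, 1, 9)

* Within each pidp, _N is the number of rows for that person;
* _N == 1 means the person appears in only one wave
by pidp (wave): drop if _N == 1

list, sepby(pidp) noobs
```

In this subset, pidp 280165 and 68002725 each appear once and are dropped, while 76165 and 469205 remain with two rows each. The sort performed by `isid` also addresses the ordering question in #1: within each person the waves now run 1 then 9.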

That said, are you sure you want to do this? There is a good chance that those people who responded to both waves of the survey are different in some relevant way from those who only responded once. By dropping the singletons, you are likely selecting a biased subsample. Moreover, your results would not be generalizable beyond the data set because for any person not already in it, you do not know whether they would respond both times or only once, so their eligibility is indeterminate. There is no advantage to having balanced panel data here: nearly all Stata panel data analysis commands work just fine with unbalanced panels. So think carefully about whether this is really wise.
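One way to keep the unbalanced panel while still being able to identify the complete pairs is to flag them rather than drop the singletons. The following is a minimal sketch under that assumption, again using a few rows from the snapshot in #1; the variable name `both_waves` is hypothetical and introduced only for illustration.

```stata
* A few rows from the -dataex- snapshot in #1 (subset chosen for illustration)
clear
input long pidp float(hours netpay wave)
   76165 25 3200 1
   76165 38 3500 9
  280165  0 1700 1
  469205 16  750 9
  469205  0  650 1
68002725 -8   -8 9
end

isid pidp wave, sort

* Flag people observed in both waves instead of dropping the others
* (both_waves is a hypothetical name, not a variable in the original data)
by pidp (wave): gen byte both_waves = (_N == 2)

* Declare the panel structure; xtset accepts the unbalanced panel as is
xtset pidp wave

* How many rows belong to people seen in both waves
count if both_waves
```

With the flag in place, an analysis can be run on everyone and then repeated with `if both_waves` as a sensitivity check, without permanently discarding any observations.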
Alyaa Ramli

Join Date: Apr 2023

Posts: 2
#3

10 Apr 2023, 11:43

Thank you so much for your input on this. I didn't expect my plan would produce biased estimators, and that in this case, unbalanced panel data is the better option. I understand your clear explanation and I'll take note of it.
Again, I really appreciate this!!