Panel data: Multiple IDs, partly duplicated

Christopher Bratt

Join Date: May 2019

Posts: 144
#1

Panel data: Multiple IDs, partly duplicated

22 May 2021, 05:32

I have received panel data that need lots of cleaning. The most substantial challenge is finding a unique ID for each respondent.

At each measurement occasion, different service providers have been involved. They have used different variable names for IDs. Not a problem. But single panel waves also include lots of duplicated IDs, as identified for instance by

Code:

duplicates list idvar

The origin of these duplicate IDs: The questionnaire was online, so apparently several individuals have had repeated sessions with the questionnaire. Responses among such duplicated IDs are nearly identical, but not fully. So

Code:

duplicates report

does not identify any problem.

The combination of many waves, seven or so different ID variables, and duplicate use of IDs proves to be a larger challenge than I expected.

Stata has advanced features for handling panel data, so maybe it's possible to solve the problem without doing all by hand or developing a new function? (I can't program in Stata, only in R.)

My aim is to develop panel data (long format, with -xtset-) that has no duplicate IDs within single waves and an ID variable that identifies as many individuals as possible with repeated participation across waves. The data include not only dropouts but also new participants who join in at later measurement occasions
Tags: None
Rich Goldstein

Join Date: Mar 2014

Posts: 4466
#2

22 May 2021, 05:37

you have not shown a data example or shown us your xtset statement (or, really, much of anything) but this can probably be handled as

Code:

xtset idvar wave

(substituting whatever your actual variable names are of course); if that doesn't work, you may need to form a new variable for your id using, e.g., the "group" function of the -egen- command; see

Code:

help egen
Comment
Christopher Bratt

Join Date: May 2019

Posts: 144
#3

22 May 2021, 07:33

Thanks, but

Code:

xtset idvar wave

(ordinary setup for panel data) doesn't solve the issue of multiple id variables with duplicates. I think the description is clear enough for people to understand the problem and indicate whether any function/program exists that will help people experiencing this problem.

I'm not sure I personally understand how -egen- with group solves the problem, but I don't know Stata code well. Anyway, it seems using -egen- with group is moving towards doing all this by hand. Which might be the only solution.

I have been experimenting, starting with code such as

Code:

egen m_counts = count(m), by(m) bysort wave: tab m_counts

and

Code:

egen tag = tag(m) bysort wave: tab m if tag

but didn't get far. I guess I'll figure out something with R instead.

PS. If others find this thread later. Here are some posts that might help (but apparently less in my case, since I struggle with multiple ID variables).

https://www.statalist.org/forums/for...-in-panel-data

https://www.statalist.org/forums/for...scending-order

Last edited by Christopher Bratt; 22 May 2021, 07:45.
Comment
Christopher Bratt

Join Date: May 2019

Posts: 144
#4

23 May 2021, 11:17

Here is a suggested solution to the original problem.

The challenge is having (1) panel data with different ID variables and (2) duplicate IDs at a single measurement occasion.

So, there are two tasks:

1. Generate a single, consistent ID variable to be used both at a single measurement occasion and across all waves.
If each single respondent is identified with only one ID variable at all waves, then a simple replace ... if missing() will do.

2. 'Merge' data from duplicate observations, which are evident as duplicate IDs (Stata calls this collapse, as opposed to merging data files)

The first code chunk assumes data from each wave are in memory as a single data frame, and that there are "only" four different id variables in the original data. Again, an important assumption is also that each data provider uses the same id variable across all measurement occasions. (Each individual's scores at various across measurement occasions will help clarify that - by inspecting variables such as age and gender.)

Code:

frame Wave_1 { gen ID = id_var1 replace ID = id_var2 if missing(ID) replace ID = id_var3 if missing(ID) replace ID = id_var4 if missing(ID) duplicates list ID // Show duplicated IDs mdesc ID // Show number of missing IDs (mdesc package) }

Now each respondent has an ID. But the problem of duplicated IDs at a single measurement occasion remains. The duplicated response can be solved by collapsing observations.

How to collapse ('merge' one respondent's multiple responses into one observation, i.e. one row) will vary dependent on the specific study. Here, duplicates are collapsed by using the first non-missing value for a specific variable.

When the data include many variables, it will be easier to use a local macro to represent a varlist of all variables except for the ID variable.

Code:

// Collapse duplicates (for all variables except ID) frame Wave_1 { quietly describe, varlist // Develop varlist of all variables local vars `r(varlist)' local omit ID local all_vars : list vars - omit // Remove the ID variable from varlist collapse (firstnm) `all_vars', by(ID) // Collaps duplicates }

An alternative is looping through waves (measurement occasions). The code above focuses on a single wave.
Comment

Announcement

Panel data: Multiple IDs, partly duplicated

Comment

Comment

Comment