Issue with a variable when combining datasets to create a longitudinal dataset

Amal Khanolkar

Join Date: Feb 2015

Posts: 142
#1

Issue with a variable when combining datasets to create a longitudinal dataset

16 Dec 2015, 12:23

Hi All

I'm in the process of creating a very large longitudinal dataset by combining 8-9 smaller datasets, each one representing a calendar year. While the process is relatively straightforward for most variables across the dataset. I have a particular issue with the variable that denotes which clinic a subject belongs to. The number of clinics included in a dataset increases with time. For example, the dataset for 2008 has around 90 clinics which eventually increase to 180 clinics in the dataset for 2013 (as more clinics joined the research programme with each passing year).

The variable 'clinic' is numeric in all but one dataset (where it is a string var). I need to ensure that when I append the datasets together to create the final longitudinal dataset, the number of subjects in a particular clinic adds up as expected across all years (a small proportion of course change clinic with time) For example:

Code:

dataset 2008: ID clinic 1 1 2 4 3 6 4 7 5 8 6 . 7 9 dataset 2009 ID clinic 1 1 2 4 3 6 4 7 5 8 6 . 7 9 8 12 9 13 10 3 dataset 2010 ID clinic 1 1 2 4 3 6 4 7 5 8 6 . 7 9 8 12 9 13 10 3 12 2 13 4 dataset 2011 ID clinic 1 1 2 4 3 6 4 7 5 8 6 2 7 9 8 12 9 13 10 3 11 15 12 15 13 16 14 17

From the above one can see, that new clinics join in different years (clinic 3 not in 2008 is found in 2009 and onward etc). If I append the datasets as above I will get a wrong number of subjects in a particular clinic across time as the clinic variable is numeric (in almost all datasets).

Is the easiest approach to convert the clinic variable from numeric to string before appending the datasets? My thought here is that Stata will combine or append the clinic variable based on the label assigned to a particular clinic. Or is there a better way to achieve this?

Ensuring that the clinic variable is coded exactly the same across all 8 years of data is almost impossible due the large number of clinics and the addition of several new clinics each year (complicated by the fact a very small number of clinics tend to combine in some years).

Thanks

/Amal
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30063
#2

16 Dec 2015, 13:35

I'm not sure what you mean when you say that in dataset 2013 clinic is a string variable, since you don't show us any examples from 2013. Do you mean it is a variable with Stata data type string, but its values are in fact numerics displayed as strings, such as "1", "2", "17", "22", etc.? If so, then when you read in that data set, just -destring clinic, replace- and you should be fine. If you mean that the values of variable clinic in the 2013 data set are things like "Downtown Central", "St. Marshmallow's Hospital", etc., then you need to create a crosswalk between those names and the numbers that were used to encode them in the earlier years, then -merge- the 2013 data set with the crosswalk, drop the clinic variable and rename the numeric code clinc.
Comment
Amal Khanolkar

Join Date: Feb 2015

Posts: 142
#3

17 Dec 2015, 07:37

Hi Clyde

In the original datasets, the clinic variables are mostly strings but I used the 'encode' command to convert them all to numeric as it was easier for me to clean them as numeric variables.

I now have all the clinic variables for all years as numeric (storage type long). So if browse the clinic var for the year 2013, for the first 5 obs:

Code:

br clinic CLINIC PZ101 PZ102 PZ001 PZ005 PZ006

The above are clinics for the first five subjects in the 2013 data. I can also cross-check the total number of subjects in a particular clinic by either using tab or the count:

Code:

count if clinic==1 . 607 count if clinic==10 . 335

The '1' & '10' correspond to PZ001 (i.e.clinic 1) and PZ010 (i.e. clinic 10).

clinic==1 in the 2013 dataset is not the same as clinic==1 in the datasets for 2008-2012. So, appending or merging the 2013 dataset to a dataset for 2008-12 wouldn't work if I'm not mistaken. So, the crosswalk you suggested is what I should try in this case?

Or, could I tell Stata to read 'PZ101' from 2013 the same way as 'PZ101' from all other years even if the same clinic is not coded identically (i.e. '...==1') in the different datasets?

Hope, I'm clearer now.

Thanks

/Amal
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30063
#4

17 Dec 2015, 08:10

So, if I understand you correctly, when you initially read in the data, the clinic names are strings that look like PZ101, etc. Is that right? If so, I think the simplest thing is to keep them as strings and then do a single application of -encode- to the complete data set that includes all of the originals. That will guarantee consistency.

In other words, don't clean the clinic variable until after the files have all been put together. That's much easier than encoding each separately and taking steps to assure consistency of numeric encoding across files (though that can be done).
Comment
Amal Khanolkar

Join Date: Feb 2015

Posts: 142
#5

17 Dec 2015, 08:28

Yes, Clyde, the the original individual datasets have the clinic variable as a string. Will try what you suggested then.

Just out of curiosity, when I encode the clinic variable and convert it to numeric using -encode-, the variable still retains the PZ001, PZ002 so on and so forth but is essentially a numeric variable with each clinic coded as 1,2,3,4 etc. So, it retains the labels from the original string variable?
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30063
#6

17 Dec 2015, 08:52

Yes. That's exactly what -encode- does. It takes a string variable, and assigns an arbitrary numeric code (it starts with 1 and counts up from there taking the values of the string variable in alphabetic order), and then creates and applies a value label mapping each of those numeric values to the corresponding string. That way when the variable is -display-ed or -list-ed, or appears in the variable name column of a regression model as a factor variable, the string labels are shown. This makes the output much easier to read and understand. The thing to remember is that the numeric values are arbitrary (alphabetic rank order) and you should never use them in calculations.

-encode- also has an option, -label()-, which you can use to specify a pre-existing label rather than the default behavior of assigning values in alphabetical order. This is often helpful when there is an ordering among the string values that is more natural than alphabetic, or when starting with 1 is undesirable. It is often necessary to use this approach when merging or appending files in situations like yours, to enforce consistency. But when you can just apply -encode- once after all the -merge-s and -append-s are done, that's easier.
Comment

Announcement

Issue with a variable when combining datasets to create a longitudinal dataset

Comment

Comment

Comment

Comment

Comment