Hi All
I'm in the process of creating a very large longitudinal dataset by combining 8-9 smaller datasets, each one representing a calendar year. While the process is relatively straightforward for most variables, I have a particular issue with the variable that denotes which clinic a subject belongs to. The number of clinics included in a dataset increases with time. For example, the dataset for 2008 has around 90 clinics, which eventually increases to 180 clinics in the dataset for 2013 (as more clinics joined the research programme with each passing year).
The variable 'clinic' is numeric in all but one dataset (where it is a string variable). I need to ensure that when I append the datasets together to create the final longitudinal dataset, the number of subjects in a particular clinic adds up as expected across all years (a small proportion of subjects do, of course, change clinic over time). For example:
Code:
dataset 2008:
ID  clinic
1   1
2   4
3   6
4   7
5   8
6   .
7   9

dataset 2009:
ID  clinic
1   1
2   4
3   6
4   7
5   8
6   .
7   9
8   12
9   13
10  3

dataset 2010:
ID  clinic
1   1
2   4
3   6
4   7
5   8
6   .
7   9
8   12
9   13
10  3
12  2
13  4

dataset 2011:
ID  clinic
1   1
2   4
3   6
4   7
5   8
6   2
7   9
8   12
9   13
10  3
11  15
12  15
13  16
14  17

From the above, one can see that new clinics join in different years (for example, clinic 3 is absent in 2008 but appears from 2009 onward). If I append the datasets as they are, I will get the wrong number of subjects in a particular clinic across time, as the clinic variable is numeric (in almost all datasets).
Is the easiest approach to convert the clinic variable from numeric to string before appending the datasets? My thought here is that Stata will combine or append the clinic variable based on the label assigned to a particular clinic. Or is there a better way to achieve this?
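To make the question concrete, here is a rough sketch of the kind of workflow I have in mind. It is only an illustration under assumptions: the file names clinic2008.dta to clinic2011.dta and the variable year are hypothetical, and it assumes the clinic codes in the one string dataset contain only digits. As far as I understand, append stacks observations by variable name and stored value rather than by value label, and it will not combine a string variable with a numeric one unless force is specified, so the storage type has to be harmonised first. The sketch goes from string to numeric (which keeps a missing clinic as .); using tostring on the numeric datasets would be the mirror-image approach.

```stata
* Sketch only: file names and the variable "year" are hypothetical
clear
tempfile combined
local first = 1

forvalues y = 2008/2011 {
    use "clinic`y'.dta", clear

    * If clinic is stored as a string in this year, convert it to numeric
    * (destring leaves it unchanged if it finds nonnumeric characters)
    capture confirm string variable clinic
    if _rc == 0 {
        destring clinic, replace
    }

    * Record which calendar year each observation came from
    generate int year = `y'

    * Stack this year onto the years processed so far
    if `first' {
        save `combined', replace
        local first = 0
    }
    else {
        append using `combined'
        save `combined', replace
    }
}

use `combined', clear

* Subjects per clinic per year, including missing clinic codes
tabulate clinic year, missing
```

The final tabulate is just a check that each clinic's count per year looks as expected after appending. If the clinics are identified by value labels rather than by the stored numbers, something like decode clinic, generate(clinic_name) in each yearly file before appending would be the string-based alternative.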
Ensuring that the clinic variable is coded exactly the same across all 8 years of data is almost impossible due to the large number of clinics and the addition of several new clinics each year (complicated by the fact that a very small number of clinics tend to combine in some years).
Thanks
/Amal