Hello all,
I have never had to handle missing data before, so I'd appreciate any guidance. Thank you in advance.
I have a dataset with 10 variables and 399 observations. Gender, Q51 and PTSD are binary. Age, Race, Geography, Year of Residency and Q50 are categorical with more than 3 categories each.
The main goal of my project is to see whether age, gender, race/ethnicity, geography, year of residency, Q50 and Q51 differ between the PTSD-positive and PTSD-negative groups.
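To be concrete, the comparison I have in mind is roughly the following. This is only a sketch on the observed (unimputed) data, using the variable names from the example dataset further down; tabulate drops any observation with a missing value in either variable, which is why the missing data matter here.

[CODE]
* Sketch only: two-way table of each characteristic against PTSD,
* with column percentages and a Pearson chi-square test.
* Observations missing on either variable are excluded from each table.
foreach v of varlist Gender Age Race Geography Year_Residency Q50 Q51 {
    tabulate `v' PTSD, chi2 column
}
[/CODE]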
Since this survey data has some missing information, I need to fill in the gaps first. I am thinking of doing multiple imputation. What is the best way to do this in Stata? I got as far as mi setting the dataset, registering the variables as imputed or regular, and setting the number of imputations to 5, but then got confused about which imputation method to use. How should I approach this? Also, once I have the imputed dataset, is it valid to run a univariate analysis such as a chi-square test? Almost all the examples I have seen talk about fitting a predictive model, so I was curious whether a simple analysis can be run on an imputed dataset or not.
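The setup steps I describe would look roughly like this. It is only a sketch: the mlong style and the exact split of variables into imputed and regular are assumptions for illustration, chosen from which variables have missing values in the table below. No imputation model is specified yet, since that is the part I am unsure about.

[CODE]
* Declare the data as mi data (mlong is an assumed style;
* wide, flong and flongsep are the alternatives)
mi set mlong

* Variables with missing values, to be imputed
mi register imputed Gender Age Race Geography Q51 PTSD

* Complete variables, not to be imputed
mi register regular Residency_Type Year_Residency Q50

* Set the number of imputations to 5
mi set M = 5

* Check what has been registered and how many imputations exist
mi describe
[/CODE]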
The description of missing data is as follows.
       Variable |     Missing          Total     Percent Missing
----------------+-----------------------------------------------
   UniqueIden~r |           0            399           0.00
   Residency_~e |           0            399           0.00
         Gender |          10            399           2.51
            Age |           3            399           0.75
           Race |          19            399           4.76
      Geography |           4            399           1.00
   Year_Resid~y |           0            399           0.00
            Q50 |           0            399           0.00
            Q51 |           1            399           0.25
           PTSD |           5            399           1.25
----------------+-----------------------------------------------
Example dataset:
[CODE]
input int UniqueIdentifier long(Residency_Type Gender Age Race) float Geography long Year_Residency float Q50 long Q51 float PTSD
1 1 2 3 4 4 1 2 1 0
3 1 1 6 3 1 2 3 2 0
4 1 1 2 3 1 2 3 2 0
5 1 2 2 2 3 4 2 2 1
6 1 2 6 3 4 1 2 1 0
7 1 2 2 1 1 2 3 2 0
8 1 1 2 3 1 1 3 2 0
9 1 1 2 3 1 1 3 2 1
13 1 2 2 3 4 1 2 1 0
14 1 2 2 2 1 4 2 1 1
end
label values Residency_Type Residency_Type
label def Residency_Type 1 "Surgery-General", modify
label values Gender Gender
label def Gender 1 "Female", modify
label def Gender 2 "Male", modify
label values Age Age
label def Age 2 "25 - 29", modify
label def Age 3 "30 - 34", modify
label def Age 6 "55 - 59", modify
label def Age 7 "60 - 64", modify
label values Race Race
label def Race 1 "African American", modify
label def Race 2 "Asian/Pacific Islander", modify
label def Race 3 "Caucasian", modify
label def Race 4 "Hispanic/ Latino", modify
label def Race 5 "Other", modify
label values Geography Geography
label def Geography 1 "Northeast", modify
label def Geography 2 "Midwest", modify
label def Geography 3 "South", modify
label def Geography 4 "West", modify
label values Year_Residency Year_Residency
label def Year_Residency 1 "PGY 1", modify
label def Year_Residency 2 "PGY 2", modify
label def Year_Residency 4 "PGY 4", modify
label def Year_Residency 5 "PGY 5", modify
label values Q50 Q50
label def Q50 2 "3 to 5", modify
label def Q50 3 "6 to 10", modify
label def Q50 4 "11 to 20", modify
label values Q51 Q51
label def Q51 1 "Community", modify
label def Q51 2 "University", modify
label values PTSD PTSD
label def PTSD 0 "Negative (0, 1 or 2)", modify
label def PTSD 1 "Positive (3 or 4)", modify
[/CODE]
Thanks,
PA.