Dear all,
I am a Stata 14 user and I was wondering if you could help with the following.
I have a dataset containing survey information from teachers about their school. Some of them have taken the survey multiple times, and I need to remove duplicates from the dataset. I would like to keep the oldest response, ie the row corresponding to the first time they took the survey. However, in some cases I have information for a variable missing in the oldest response but not missing in the following ones.
Example:
As you can see, school_type and subjects_taught, for example, are missing for id == 0001 in the first row but not in the second. So I wrote the following code to replace the information in row 1 with that in row 2:
The code doesn't seem to be doing anything: when I look at the dataset before and after, the information hasn't been replaced. Could you please help?
Thanks in advance!
I am a Stata 14 user and I was wondering if you could help with the following.
I have a dataset containing survey information from teachers about their school. Some of them have taken the survey multiple times, and I need to remove duplicates from the dataset. I would like to keep the oldest response, ie the row corresponding to the first time they took the survey. However, in some cases I have information for a variable missing in the oldest response but not missing in the following ones.
Example:
id | school_type | years_teaching | subjects_taught | n. students |
0001 | . | 5 | . | |
0001 | 1 | 5 | 1,2,5 | . |
0002 | 2 | . | 6,7 | 24 |
0002 | . | 2 | 6,7 | . |
As you can see, school_type and subjects_taught, for example, are missing for id == 0001 in the first row but not in the second. So I wrote the following code to replace the information in row 1 with that in row 2:
Code:
//Deal with duplicates duplicates tag email survey_treated finished, g(dup) ta dup sort email teacher_startdate egen email_id = group(email) if dup>0 & dup<=2 // to identify with the same number those who have the same email fre email_id // shows it takes values from 1 to 172 forval i = 1(1)172 { ds, has(type string) // I seprated string and numeric variables because the code was producing a "type mismatch" error unab str_vars: `r(varlist)' unab vars_to_exclude1: consent teacher_startdate teacher_enddate ipaddress progress durationinseconds finished teacher_recordeddate responseid locationlatitude locationlongitude randomid foreach var in `:list str_vars - vars_to_exclude1' { replace `var' = "`=`var'[_n + 2]'" if `var' == "" & _n == 1 /// & `var'[_n + 1] == "" & `var'[_n + 2] != "" & dup == 2 & email_id == `i' // if the same person took the survey 3 times, replace the first row with information from the third if the second is empty replace `var' = "`=`var'[_n + 1]'" if `var' == "" & _n == 1 /// & `var'[_n + 1] != "" & `var'[_n + 2] == "" & dup == 2 & _n == `i' // if the same person took the survey 3 times, replace the first row with information from the second if the third is empty replace `var' = "`=`var'[_n + 1]'" if `var' == "" & _n == 1 & `var'[_n + 1] != "" & dup == 1 & email_id == `i' // if the same person took the survey 2 times, replace the first row with information from the second row } ds, has(type numeric) unab num_vars: `r(varlist)' unab vars_to_exclude2: teacher_startdate teacher_enddate ipaddress progress durationinseconds finished teacher_recordeddate responseid locationlatitude locationlongitude randomid foreach var in `:list num_vars - vars_to_exclude2' { replace `var' = `=`var'[_n + 2]' if `var' == . & _n == 1 /// & `var'[_n + 1] == . & `var'[_n + 2] != . & dup == 2 & email_id == `i' // if the same person took the survey 3 times, replace the first row with information from the third if the second is empty replace `var' = `=`var'[_n + 1]' if `var' == . & _n == 1 /// & `var'[_n + 1] != . & `var'[_n + 2] == . & dup == 2 & email_id == `i' // if the same person took the survey 3 times, replace the first row with information from the second if the third is empty replace `var' = `=`var'[_n + 1]' if `var' == . & _n == 1 & `var'[_n + 1] != . & dup == 1 & email_id == `i' // if the same person took the survey 2 times, replace the first row with information from the second row } }
Thanks in advance!
Comment