I appended two enrollment files, one for each of two years, to determine the number of unique enrollees across the two years. Appending seemed to copy the observations from one of the files twice, inflating the total number of observations. Using duplicates drop, I was still able to get the number I was looking for, but I was surprised by the additional observations in the intervening step, i.e., the 39,354,448 (about 39M) reported by the first dis _N below. Does anyone know why the append command seems to be duplicating the observations from the file with 13,344,000 obs? Here's my code:
. use "C:\Enrollment Summary\CCAEA054.DTA", clear
. keep enrolid
. duplicates drop enrolid, force
Duplicates in terms of enrolid
(0 observations are duplicates)
. save "C:\Enrollment Summary\CCAEA054_enrolidonly.dta", replace
file C:\Enrollment Summary\CCAEA054_enrolidonly.dta saved
. use "C:\Enrollment Summary\CCAEA063.DTA", clear
. keep enrolid
. duplicates drop enrolid, force
Duplicates in terms of enrolid
(0 observations are duplicates)
. save "C:\Enrollment Summary\CCAEA063_enrolidonly.dta", replace
file C:\Enrollment Summary\CCAEA063_enrolidonly.dta saved
. /* The goal now is to determine the number of unique enrolids across the two years. */
. append using "C:\Enrollment Summary\CCAEA054_enrolidonly.dta" "C:\Enrollment Summary\CCAEA063_enrolidonly.dta"
. dis _N
39354448
. use "C:\Enrollment Summary\CCAEA054_enrolidonly.dta", clear
. dis _N
12666448
. use "C:\Enrollment Summary\CCAEA063_enrolidonly.dta", clear
. dis _N
13344000
. dis 12666448 + 13344000
26010448
. dis 26010448 + 13344000
39354448
. append using "C:\Enrollment Summary\CCAEA054_enrolidonly.dta" "C:\Enrollment Summary\CCAEA063_enrolidonly.dta"
. duplicates drop enrolid, force
Duplicates in terms of enrolid
(23,029,669 observations deleted)
. dis _N
16324779