  • apparent anomaly when appending two files

    I appended two enrollment files, one for each of two years, with the goal of determining the number of unique enrollees across the two years. I discovered that appending seemed to copy the observations from one of the files twice, increasing the total number of observations. By using duplicates drop I was able to determine the number I was looking for, but I was surprised by the additional observations in my intervening step, i.e., the 39,354,448 reported by the first dis _N below. Does anyone know why the append command seems to be duplicating the observations from the file with 13,344,000 obs? Here's my code:


    . use "C:\Enrollment Summary\CCAEA054.DTA", clear

    . keep enrolid

    . duplicates drop enrolid, force

    Duplicates in terms of enrolid

    (0 observations are duplicates)

    . save "C:\Enrollment Summary\CCAEA054_enrolidonly.dta", replace

    file C:\Enrollment Summary\CCAEA054_enrolidonly.dta saved

    . use "C:\Enrollment Summary\CCAEA063.DTA", clear

    . keep enrolid

    . duplicates drop enrolid, force

    Duplicates in terms of enrolid

    (0 observations are duplicates)

    . save "C:\Enrollment Summary\CCAEA063_enrolidonly.dta", replace

    file C:\Enrollment Summary\CCAEA063_enrolidonly.dta saved

    . /* The goal now is to determine the number of unique enrolids across the two years. */


    . append using "C:\Enrollment Summary\CCAEA054_enrolidonly.dta" "C:\Enrollment Summary\CCAEA063_enrolidonly.dta"

    . dis _N

    39354448


    . use "C:\Enrollment Summary\CCAEA054_enrolidonly.dta", clear

    . dis _N

    12666448

    . use "C:\Enrollment Summary\CCAEA063_enrolidonly.dta", clear

    . dis _N

    13344000

    . dis 12666448 + 13344000

    26010448

    . dis 26010448 + 13344000

    39354448


    . append using "C:\Enrollment Summary\CCAEA054_enrolidonly.dta" "C:\Enrollment Summary\CCAEA063_enrolidonly.dta"

    . duplicates drop enrolid, force

    Duplicates in terms of enrolid

    (23,029,669 observations deleted)

    . dis _N

    16324779


  • #2
    I think you are misunderstanding what -append- does. When you write:

    Code:
    use file1, clear
    append using file2 file3
    Stata first loads file1 into memory. The append command then tells Stata to add the observations in file2 and file3 to whatever data is already in memory. In your case, file3 is the same as file1, and file1 is already in memory, so, of course, you end up with two copies of everything that was in file1. The append command does not in any way compare what is being brought into memory with what is already there. It just literally follows your instructions to add the second and third (in your case the same as the first) files to what is already there (namely, the first file).

    Specifically, to add the observations of 63_enrollid.dta to those of 54_enrollid.dta, the commands would be:
    Code:
    use 63_enrollid, clear
    append using 54_enrollid
    or

    Code:
    clear
    append using 63_enrollid 54_enrollid
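
    The same behavior can be reproduced on toy data. The following is only a minimal sketch, meant to be run as a do-file: the tempfile names a and b and the enrolid values are made up for illustration and have nothing to do with the real enrollment files.

    ```stata
    * Hypothetical toy file a: three ids
    clear
    input long enrolid
    1
    2
    3
    end
    tempfile a
    save `a'

    * Hypothetical toy file b: two ids, one of which (3) also appears in a
    clear
    input long enrolid
    3
    4
    end
    tempfile b
    save `b'

    * The pattern from the question: b is already in memory and is appended again
    use `b', clear
    append using `a' `b'
    * _N is now 2 + 3 + 2 = 7
    display _N

    * Starting from an empty dataset brings in each file exactly once
    clear
    append using `a' `b'
    * _N is now 3 + 2 = 5
    display _N

    * Dropping repeated ids leaves the 4 unique ids
    duplicates drop enrolid, force
    display _N
    ```

    The count of unique ids comes out the same either way, as it did in the original post, because duplicates drop removes the extra copy; only the intermediate _N differs.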


  • #3
    Thanks, this is very lucid. Obviously I missed a step, and I'll use the code you recommend when I'm back on site with the data.
