Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • append multiple dtas, but the total observations is not right

    Hi, I have been struggling with my appending results for a while. I appended multiple dtas in the same folder, and I know the observations for each dta, but after appending, the total observations is larger than just adding up the observations of each individual dta. Say, if I have three dtas with 10, 20, 30 observations respectively. After appending, the total observations is more than 60. I use duplicates drop command, and the observations finally matched. I was confused about this situation. Can somebody tell me why sometimes appending yields wrong result in terms of the total observations?
    I paste part of my do file in the following:

    HTML Code:
    cd "mypath"
    local dts: dir . files "*.dta"
    foreach dt in `dts' {
        append using `dt'
    }


  • #2
    My guess is that your full code might look something like this.
    Code:
    cd "mypath"
    local dts: dir . files "*.dta"
    foreach dt in `dts' {
        append using `dt'
    }
    save alldata, replace
    so that the second time you run the code, the output dataset was included among the input datasets at some point. That would explain duplicate observations.

    In any event, you might want to add to your code something to tell you what you are appending so you can confirm that you are appending only the datasets you meant to append.
    Code:
    cd "mypath"
    local dts: dir . files "*.dta"
    display "appending `dts'"
    clear
    foreach dt in `dts' {
        append using `dt'
    }

    Comment


    • #3
      Example:

      Code:
      sysuse auto, clear
      gen set = 1
      tempfile one
      save `one'
      replace set = 2
      tempfile two
      save `two'
      replace set = 3
      tempfile three
      save `three'
      
      use `one'
      
      tab set
      
      foreach x in `one' `two' `three' {
          append using `x'
          }
          
      tab set

      Comment


      • #4
        Brian Brosmer
        ​​​​​​​
        Code:
        use `one'
        
        tab set
        
        foreach x in `one' `two' `three' {
        You start with set 1 in memory, then you append another copy of set 1. Either replace the use command with clear, or remove `one' from the foreach command.

        Comment


        • #5
          @Brian Brosmer
          @William Lisowski
          Thanks. Both of you provides great example. Brian Brosmer example shows clearly that data 1 was appended twice, so the result of "tab set" shows there are 74+74=148 observation for set=1.

          I got the point why my problem happened. The master data in the memory was appended twice. If there are just several dtas in the folder, we can just exclude the master data from the using data, so it will not be appended twice. But if we have many dtas in the folder, we can still loop through all the dtas, but drop those duplicates by adding "duplicates drop"

          cd "mypath" using master_data,clear local dts: dir . files "*.dta" foreach dt in `dts' { append using `dt' } duplicates drop

          Comment


          • #6
            Noe Zhou, careful, I don't think that dropping duplicates is a generally safe approach to use. Of course, this depends on your specific data, which you have not shown to us.

            The safest approach is to use William Lisowski's suggestion to replace the initial `use' with a -clear- statement. This way, a fresh dataset is built by simply concatenating each dataset together. A master dataset when appending doesn't make sense to me, unlike when performing a -merge-, because no one dataset is more special than another.

            Comment


            • #7
              My recommendation is to delete the master dataset that you plan to replace before you generate the list of filenames, and to clear your working dataset.
              Code:
              cd "mypath"
              erase alldata.dta
              local dts: dir . files "*.dta"
              clear
              foreach dt in `dts' {
                  append using `dt'
              }
              save alldata, replace

              Comment


              • #8
                @Brian Brosmer
                @William Lisowski Yeah, I got it. The methods you offered is safer than just dropping the duplicates.

                Comment

                Working...
                X