Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Append data

    Dear Statalisters,

    The process of my coding is:
    1. I have 90 small datasets to be merged into a master data
    2. Let us say we start with first merge
    3. Compute some values after merging
    4. Save the data , repeat step 2 ,3,4 with the second dataset merge with master, append to last saved data.
    I made up some data below and present my code (showing very simple calculations), My code is intended to show my thought process. My real data involves complex calculation in step 3.

    I was expecting 4 observation after appending two datasets together, but I only got two.

    What did I do wrong?

    match1975_sic2834 (first data)
    clear
    input int(gvkey eyear) byte(prex postx)
    1000 1975 3 4
    1001 1975 2 3
    end

    match1975_sic3334
    clear
    input int(gvkey eyear) byte(prex postx)
    1002 1975 3 4
    1003 1975 2 3
    end


    master
    clear
    input int(gvkey eyear y)
    1000 1975 11
    1001 1975 232
    1002 1975 32
    1003 1975 56
    1004 1976 2
    end


    Code:
    foreach x of numlist 2834 3334 {
    
    clear
    save event`x'_prepost, replace emptyok
    
    quie {
        use match1975_sic`x',clear 
        merge using master
        keep if _merge==3
    
    * I omitted the computation of variables here, it should not affect the append process
        drop _merge
    
    
        append using event`x'_prepost
        save event`x'_prepost, replace
        }
    }

    question #2: above I presented 2 datasets match1975_sic2834 , match1975_sic3334. I have 88 other datasets such as match1976_sic2511, match1976_sic2512, you see the pattern of the name of my data, the year are changing (1975 1976 etc), and the 4 numbers after sic are changing. My task is to merge in all 90 data with the master, and do some computation, and save the data, and append all the saved ones together.

    What is the succinct way of doing this ? I think i need modify my code above with two foreach loop, given that the numbers after sic changes from year to year,so I was not sure if I can use two foreach loop , loop 1 takes year, loop 2 take number after sic.

    thank you,
    Rochelle

  • #2
    Before you start building a loop, you should fine tune what you want to happen in the loop with a single case. If you had done that, you would have noticed a big error in your code. Here's what you are doing:

    Code:
    clear
    input int(gvkey eyear) byte(prex postx)
    1000 1975 10 11
    1001 1975 12 13 
    end
    save "match1975_sic2834.dta", replace
    
    clear
    input int(gvkey eyear) byte(prex postx)
    1002 1975 14 15 
    1003 1975 16 17
    end
    save "match1975_sic3334.dta", replace
    
    clear
    input int(gvkey eyear y)
    1000 1975 21 
    1001 1975 232 
    1002 1975 32 
    1003 1975 56 
    1004 1976 2 
    end
    save "master.dta", replace
    
    
    use "match1975_sic3334.dta",clear 
    merge using "master.dta"
    keep if _merge==3
    and here's the output

    Code:
    . use "match1975_sic3334.dta",clear 
    
    . merge using "master.dta"
    (note: you are using old merge syntax; see [D] merge for new syntax)
    
    . keep if _merge==3
    (3 observations deleted)
    
    . 
    . list
    
         +---------------------------------------------+
         | gvkey   eyear   prex   postx     y   _merge |
         |---------------------------------------------|
      1. |  1002    1975     14      15    21        3 |
      2. |  1003    1975     16      17   232        3 |
         +---------------------------------------------+
    Your use of quietly in the loop masked the note that points out that you are using old merge syntax. Your omission of merge key variables resulted in a one-to-one merge by observation, which matched the 2 observations from the "match1975_sic3334.dta" dataset to the first two observation from the "master.dta", without regards to the values of gvkey and eyear.

    A minor point, I know that these are just toy datasets but try to select names that are consistent with the terminology of the commands you use. Your "master.dta" is the using dataset while your master dataset starts with "match" which is confusing as well since the results of a merge is a match.

    I also don't see why you can't simply append your bunch of small datasets and then merge the combined data once with "master.dta".

    You are up to 306 posts at this point so I think you should know by now that the first step is to go read the documentation on the commands you use

    Code:
    help merge
    and practice on small examples until you are convinced that you understand what you are doing.

    Comment


    • #3
      Dear Robert,

      Thanks for your post and critique !

      This morning before your post, I found out the mistake - merge without gvkey fyear. I was writing the post late last night, I guess my brain is not working efficiently.

      About the naming of dataset, your point is well taken. My reason for naming the data master, because my many small datasets are merged in to "master"-the main data. I will be careful in the future about naming data.

      You suggested appending my small data into one big data, then merge only once. Notice the name of my small data contains the year , e.g. 1975, 1976, as I said in #1 post, i do some calculations after merge, that is based on the year , e.g. if data is 1975, i will compute two year pre and post 1975 for certain variables, . Hence, appending all small data together, complicates my coding for such calculation given each data has a different reference year.

      Best,
      Rochelle

      Comment

      Working...
      X