Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merging files with a condition

    Hello, I am trying to merge several files but the merging should satisfy a specific condition. This is how my situation is:
    1. I have a main file which includes an ID number and a year indicating which survey wave a respondent participated in. See below.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    ID wave Salary
     1 2011 .
     2 2012 .
     3 2010 .
     4 2011 .
     5 2010 .
     6 2015 .
     7 2019 .
     8 2003 .
     9 2003 .
    10 2005 .
    end
    2) Now, I have to add a salary variable to my main file. The salary variable however are in different files. Each file contains salary information for a specific year and a lot of the respondents have salary information for multiple years.
    3) What I want to do is to merge the salary for the respondents BUT only if the salary year matches the wave year for that particular respondent.

    I tried merging each of the salary files and dropped cases that were not merged in the using file (see code below)
    Code:
    use "C:\Users\ASUS\OneDrive - mainfile.dta"
    merge ID using "C:\Users\ASUS\OneDrive - 2019_salary_data.dta", gen(merge2019)
    drop if merge2010==2
    merge ID using "C:\Users\ASUS\OneDrive - 2015_salary_data.dta", gen(merge2015)
    drop if merge2015==2
    merge ID using "C:\Users\ASUS\OneDrive - 2013_salary_data.dta", gen(merge2013)
    drop if merge2013==2
    This code however only retained the salary information from the 2019 files since all the respondents have 2019 salary files.

    I wanted to do something like the one below but Im getting an invalid syntax saying "if" is not allowed. Any suggestions on what syntax to use since the "if" condition is the only one I cant think of but obviously is not possible. Thank you in advance.

    Code:
    use "C:\Users\ASUS\OneDrive - mainfile.dta"
    merge ID using "C:\Users\ASUS\OneDrive - 2019_salary_data.dta" if wave==2019

  • #2
    it is impossible to give a definitive answer on the basis of the above but my guess is that you need to make 2 changes: (1) merge on both ID and wave; (2) use the "update" option; see
    Code:
    h merge

    Comment


    • #3
      Perhaps this example code will start you in a useful direction.
      Code:
      use "C:\Users\ASUS\OneDrive - 2019_salary_data.dta"
      generate year = 2019
      append using "C:\Users\ASUS\OneDrive - 2015_salary_data.dta"
      replace year = 2015 if missing(year)
      append using "C:\Users\ASUS\OneDrive - 2013_salary_data.dta"
      replace year = 2013 if missing(year)
      merge ID year using "C:\Users\ASUS\OneDrive - mainfile.dta"

      Comment


      • #4
        Rich Goldstein unfortunately, wave is not a unique identifier so I am not sure how it will work but but since I cant think of any other options to solve my problem, might as well try. Thanks for the suggestion!

        Also, my dataset can only be accessed in a data center that doesn't allow internet connection due to security regulations so I couldn't copy an example from it but I tried as much as possible to replicate how it looks like based on the issue at hand. Any specific point that I need to expound so that I can help readers understand my issue in a clearer way?

        Comment


        • #5
          William Lisowski , I am trying to add new variables though that's why I was using merge. Isn't append meant to add new observations? Please correct me if Im wrong since I am still learning STATA

          Comment


          • #6
            the question is not whether wave is a distinct identifier, the question is whether the combination of ID and wave are distinct identifiers - and be sure to use the "update" option

            I'm assuming here that your descriptive example in #1 of mainfile is correct - you could have done the same kind of realistic example for at least one of the other files

            Comment


            • #7
              Rich Goldstein re ID+wave = aha, I see your point now. In that case, yes the combination of ID + wave will now be distinct identifiers.
              And yes, I will update the data example for the other files.
              Thank you!

              Comment


              • #8
                Additional info -- this is how the salary files look like. Each file is named based on the year when the salary was reported.

                Code:
                * Example generated by -dataex-. For more info, type help dataex
                clear
                ID Salary Province
                 1 10000 "Ontario"      
                 2 30000 "Alberta"      
                 3 50000 "BC"          
                 4 20000 "Newfoundland"
                 5 15000 "Ontario"      
                 6 20000 "New Brunswick"
                 7 25345 "Manitoba"    
                 8 23384 "Ontario"      
                 9 32000 "Alberta"      
                10 30000 "Alberta"      
                end

                Comment


                • #9
                  Originally posted by gi peters View Post
                  William Lisowski , I am trying to add new variables though that's why I was using merge. Isn't append meant to add new observations? Please correct me if Im wrong since I am still learning STATA
                  You were using merge three times, I used append to combine three salary files into a single salary file. Since your salary files do not include a year variable, one is created in the appending process. Then the single merge at the end adds salaries from all three years and could merge, as Rich told you was necessary, on both ID and year.

                  Comment


                  • #10
                    William Lisowski Ah I see now what you're trying to point out. My bad, the salary variable has the year suffixed to it so the Salary should be Salary2015 (Salary2014, and so on)

                    Comment


                    • #11
                      And is Province really Province2015, and so on?

                      Comment


                      • #12
                        I'm going to assume your "Salary" datasets all look like this, just with different years on the end of each variable other than ID. In the future, spend more time making sure your example data is correct, and less time pretending it was produced by dataex by sticking in a comment that is untrue. At this point, I'm concerned there may be other variables, or that Province is in fact a numeric value with a value label, either of which could pose a problem.
                        Code:
                        * Example generated by -dataex-. For more info, type help dataex
                        clear
                        input int ID long Salary2015 str13 Province2015
                         1 10000 "Ontario"      
                         2 30000 "Alberta"      
                         3 50000 "BC"           
                         4 20000 "Newfoundland" 
                         5 15000 "Ontario"      
                         6 20000 "New Brunswick"
                         7 25345 "Manitoba"     
                         8 23384 "Ontario"      
                         9 32000 "Alberta"      
                        10 30000 "Alberta"      
                        end
                        Then perhaps this code will start you in a useful direction.
                        Code:
                        use  "C:\Users\ASUS\OneDrive - 2019_salary_data.dta", clear
                        merge ID using "C:\Users\ASUS\OneDrive - 2015_salary_data.dta", nogenerate
                        merge ID using "C:\Users\ASUS\OneDrive - 2013_salary_data.dta", nogenerate
                        reshape long Salary Province, i(ID) j(year)
                        merge ID year using "C:\Users\ASUS\OneDrive - mainfile.dta"

                        Comment


                        • #13
                          William Lisowski yes there are 30 other variables in the salary file but like what I've explained in my comment to Rich Goldstein earlier, my dataset can only be accessed in a data center that doesn't allow internet connection due to security regulations so I couldn't copy an example straight from it but I tried as much as possible to replicate how it looks like and included only relevant variables based on the issue at hand.
                          Thanks for the recommended codes - Im trying them actually but having IT issues in the center so I cant say yet whether they work or not. Hopefully they do.

                          Comment

                          Working...
                          X