Merging files with a condition

gi peters

Join Date: Sep 2020

Posts: 44
#1

Merging files with a condition

20 Jun 2022, 11:08

Hello, I am trying to merge several files but the merging should satisfy a specific condition. This is how my situation is:
1. I have a main file which includes an ID number and a year indicating which survey wave a respondent participated in. See below.

Code:

* Example generated by -dataex-. For more info, type help dataex clear ID wave Salary 1 2011 . 2 2012 . 3 2010 . 4 2011 . 5 2010 . 6 2015 . 7 2019 . 8 2003 . 9 2003 . 10 2005 . end

2) Now, I have to add a salary variable to my main file. The salary variable however are in different files. Each file contains salary information for a specific year and a lot of the respondents have salary information for multiple years.
3) What I want to do is to merge the salary for the respondents BUT only if the salary year matches the wave year for that particular respondent.

I tried merging each of the salary files and dropped cases that were not merged in the using file (see code below)

Code:

use "C:\Users\ASUS\OneDrive - mainfile.dta" merge ID using "C:\Users\ASUS\OneDrive - 2019_salary_data.dta", gen(merge2019) drop if merge2010==2 merge ID using "C:\Users\ASUS\OneDrive - 2015_salary_data.dta", gen(merge2015) drop if merge2015==2 merge ID using "C:\Users\ASUS\OneDrive - 2013_salary_data.dta", gen(merge2013) drop if merge2013==2

This code however only retained the salary information from the 2019 files since all the respondents have 2019 salary files.

I wanted to do something like the one below but Im getting an invalid syntax saying "if" is not allowed. Any suggestions on what syntax to use since the "if" condition is the only one I cant think of but obviously is not possible. Thank you in advance.

Code:

use "C:\Users\ASUS\OneDrive - mainfile.dta" merge ID using "C:\Users\ASUS\OneDrive - 2019_salary_data.dta" if wave==2019
Tags: None
Rich Goldstein

Join Date: Mar 2014

Posts: 4547
#2

20 Jun 2022, 11:46

it is impossible to give a definitive answer on the basis of the above but my guess is that you need to make 2 changes: (1) merge on both ID and wave; (2) use the "update" option; see

Code:

h merge
Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

20 Jun 2022, 11:55

Perhaps this example code will start you in a useful direction.

Code:

use "C:\Users\ASUS\OneDrive - 2019_salary_data.dta"
generate year = 2019
append using "C:\Users\ASUS\OneDrive - 2015_salary_data.dta"
replace year = 2015 if missing(year)
append using "C:\Users\ASUS\OneDrive - 2013_salary_data.dta"
replace year = 2013 if missing(year)
merge ID year using "C:\Users\ASUS\OneDrive - mainfile.dta"

Comment

gi peters

Join Date: Sep 2020

Posts: 44
#4

20 Jun 2022, 12:02

Rich Goldstein unfortunately, wave is not a unique identifier so I am not sure how it will work but but since I cant think of any other options to solve my problem, might as well try. Thanks for the suggestion!

Also, my dataset can only be accessed in a data center that doesn't allow internet connection due to security regulations so I couldn't copy an example from it but I tried as much as possible to replicate how it looks like based on the issue at hand. Any specific point that I need to expound so that I can help readers understand my issue in a clearer way?
Comment
gi peters

Join Date: Sep 2020

Posts: 44
#5

20 Jun 2022, 12:06

William Lisowski , I am trying to add new variables though that's why I was using merge. Isn't append meant to add new observations? Please correct me if Im wrong since I am still learning STATA
Comment
Rich Goldstein

Join Date: Mar 2014

Posts: 4547
#6

20 Jun 2022, 12:08

the question is not whether wave is a distinct identifier, the question is whether the combination of ID and wave are distinct identifiers - and be sure to use the "update" option

I'm assuming here that your descriptive example in #1 of mainfile is correct - you could have done the same kind of realistic example for at least one of the other files
Comment
gi peters

Join Date: Sep 2020

Posts: 44
#7

20 Jun 2022, 12:11

Rich Goldstein re ID+wave = aha, I see your point now. In that case, yes the combination of ID + wave will now be distinct identifiers.
And yes, I will update the data example for the other files.
Thank you!
Comment

gi peters

Join Date: Sep 2020
Posts: 44

20 Jun 2022, 12:19

Additional info -- this is how the salary files look like. Each file is named based on the year when the salary was reported.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
ID Salary Province
 1 10000 "Ontario"      
 2 30000 "Alberta"      
 3 50000 "BC"          
 4 20000 "Newfoundland"
 5 15000 "Ontario"      
 6 20000 "New Brunswick"
 7 25345 "Manitoba"    
 8 23384 "Ontario"      
 9 32000 "Alberta"      
10 30000 "Alberta"      
end

Comment

William Lisowski

Join Date: Dec 2014

Posts: 10150
#9

20 Jun 2022, 13:49

Originally posted by gi peters View Post

William Lisowski , I am trying to add new variables though that's why I was using merge. Isn't append meant to add new observations? Please correct me if Im wrong since I am still learning STATA

You were using merge three times, I used append to combine three salary files into a single salary file. Since your salary files do not include a year variable, one is created in the appending process. Then the single merge at the end adds salaries from all three years and could merge, as Rich told you was necessary, on both ID and year.
1 like
Comment
gi peters

Join Date: Sep 2020

Posts: 44
#10

20 Jun 2022, 21:28

William Lisowski Ah I see now what you're trying to point out. My bad, the salary variable has the year suffixed to it so the Salary should be Salary2015 (Salary2014, and so on)
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#11

21 Jun 2022, 04:04

And is Province really Province2015, and so on?
Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

#12

21 Jun 2022, 05:49

I'm going to assume your "Salary" datasets all look like this, just with different years on the end of each variable other than ID. In the future, spend more time making sure your example data is correct, and less time pretending it was produced by dataex by sticking in a comment that is untrue. At this point, I'm concerned there may be other variables, or that Province is in fact a numeric value with a value label, either of which could pose a problem.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input int ID long Salary2015 str13 Province2015
 1 10000 "Ontario"      
 2 30000 "Alberta"      
 3 50000 "BC"           
 4 20000 "Newfoundland" 
 5 15000 "Ontario"      
 6 20000 "New Brunswick"
 7 25345 "Manitoba"     
 8 23384 "Ontario"      
 9 32000 "Alberta"      
10 30000 "Alberta"      
end

Then perhaps this code will start you in a useful direction.

Code:

use  "C:\Users\ASUS\OneDrive - 2019_salary_data.dta", clear
merge ID using "C:\Users\ASUS\OneDrive - 2015_salary_data.dta", nogenerate
merge ID using "C:\Users\ASUS\OneDrive - 2013_salary_data.dta", nogenerate
reshape long Salary Province, i(ID) j(year)
merge ID year using "C:\Users\ASUS\OneDrive - mainfile.dta"

Comment

gi peters

Join Date: Sep 2020

Posts: 44
#13

21 Jun 2022, 14:17

William Lisowski yes there are 30 other variables in the salary file but like what I've explained in my comment to Rich Goldstein earlier, my dataset can only be accessed in a data center that doesn't allow internet connection due to security regulations so I couldn't copy an example straight from it but I tried as much as possible to replicate how it looks like and included only relevant variables based on the issue at hand.
Thanks for the recommended codes - Im trying them actually but having IT issues in the center so I cant say yet whether they work or not. Hopefully they do.
Comment

Announcement