Maximizing datapoints from 2 datasets with same variables, different observations

Chinmay Sharma

Join Date: Nov 2015

Posts: 351
#1

Maximizing datapoints from 2 datasets with same variables, different observations

01 Jun 2018, 11:05

Hi All,

I am currently working with two datasets, A and B. Both of these datasets look like the following (they have the exact same variables) and format:

----------------------- copy starting from the next line -----------------------

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input float(year firm employee department score) 1971 1 2 1 21 1971 1 3 1 32 1971 1 4 1 3 1971 2 1 1 32 1971 2 3 1 23 1971 1 2 2 32 1971 1 3 2 12 1971 1 4 2 32 1972 1 2 1 232 1972 1 3 1 12 end

------------------ copy up to and including the previous line ------------------

Listed 10 out of 10 observations

In the above dataset, each datapoint represents a year-employer-employee-department rating. So for instance, the first row corresponds to firm 1, employee number 2, in department number 1, who received a score of 21 in the year 1971. I have two datasets that have a significant overlap of coverage in terms of years considered, but potentially contain a lot of different data (conversely, they contain a lot of the same data as well). What I would like to do is to combine both datasets that would maximize information from either datasets and drop duplicates.
There is one issue that is forecastable- what would happen to those observations where the rating is different, but all else the same? Hopefully, there are not too many of those! In such cases, I would not mind picking one dataset over another, as there is no a priori I have about the credibility of data in both datasets. I would greatly appreciate any help!

Many thanks,
Chinmay
Tags: None
Chinmay Sharma

Join Date: Nov 2015

Posts: 351
#2

01 Jun 2018, 12:05

I think I may have figured it out- I use dataset A, append dataset B, and the drop duplicates using the force option. This should work, in my opinion.
Comment
Igor Paploski

Join Date: Oct 2014

Posts: 174
#3

01 Jun 2018, 12:19

Hi Chinmay,

Try appending datasets and later checking for duplicates. Append can be found on the GUI at Data -> Combine datasets -> Append datasets. Append will combine datasets vertically (it adds more rows to your dataset).

After that you can check for duplicates given a set of variables (read here for extra information).

Code:

sort year firm employee department score quietly by year firm employee department score: gen dup = cond(_N==1,0,_n)

This code will create a new variable called dup that count duplicates (duplicates observations will be those that have the same values for year, firm, employee, department and score). This dup variable has the following values:

dup = 0 record is unique
dup = 1 record is duplicate, first occurrence
dup = 2 record is duplicate, second occurrence
dup = 3 record is duplicate, third occurrence
And so on.

If you would like to count duplicates on the same criteria as above but removing score, just remove it from both lines of the code:

Code:

sort year firm employee department quietly by year firm employee department: gen dup = cond(_N==1,0,_n)
1 like
Comment

Announcement

Maximizing datapoints from 2 datasets with same variables, different observations

Comment

Comment