Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Maximizing datapoints from 2 datasets with same variables, different observations

    Hi All,

    I am currently working with two datasets, A and B. Both of these datasets look like the following (they have the exact same variables) and format:



    ----------------------- copy starting from the next line -----------------------
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float(year firm employee department score)
    1971 1 2 1  21
    1971 1 3 1  32
    1971 1 4 1   3
    1971 2 1 1  32
    1971 2 3 1  23
    1971 1 2 2  32
    1971 1 3 2  12
    1971 1 4 2  32
    1972 1 2 1 232
    1972 1 3 1  12
    end
    ------------------ copy up to and including the previous line ------------------

    Listed 10 out of 10 observations


    In the above dataset, each datapoint represents a year-employer-employee-department rating. So for instance, the first row corresponds to firm 1, employee number 2, in department number 1, who received a score of 21 in the year 1971. I have two datasets that have a significant overlap of coverage in terms of years considered, but potentially contain a lot of different data (conversely, they contain a lot of the same data as well). What I would like to do is to combine both datasets that would maximize information from either datasets and drop duplicates.
    There is one issue that is forecastable- what would happen to those observations where the rating is different, but all else the same? Hopefully, there are not too many of those! In such cases, I would not mind picking one dataset over another, as there is no a priori I have about the credibility of data in both datasets. I would greatly appreciate any help!


    Many thanks,
    Chinmay

  • #2
    I think I may have figured it out- I use dataset A, append dataset B, and the drop duplicates using the force option. This should work, in my opinion.

    Comment


    • #3
      Hi Chinmay,

      Try appending datasets and later checking for duplicates. Append can be found on the GUI at Data -> Combine datasets -> Append datasets. Append will combine datasets vertically (it adds more rows to your dataset).

      After that you can check for duplicates given a set of variables (read here for extra information).

      Code:
      sort year firm employee department score
      quietly by year firm employee department score:  gen dup = cond(_N==1,0,_n)
      This code will create a new variable called dup that count duplicates (duplicates observations will be those that have the same values for year, firm, employee, department and score). This dup variable has the following values:

      dup = 0 record is unique
      dup = 1 record is duplicate, first occurrence
      dup = 2 record is duplicate, second occurrence
      dup = 3 record is duplicate, third occurrence
      And so on.

      If you would like to count duplicates on the same criteria as above but removing score, just remove it from both lines of the code:

      Code:
      sort year firm employee department 
      quietly by year firm employee department:  gen dup = cond(_N==1,0,_n)

      Comment

      Working...
      X