Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to deal with "Duplicated observation" after merge command

    Dear all,

    I am quite new to Stata and I am trying to merge two datasets but I get unmatched observations classified as "Duplicated observation".
    In my master, I have groups (entryid) and under each group there are several country (countryid) -year (lmyear_n) observations. In my using, I have information on such country-year observations and I want to add it to my master. All goes well, a part from the "Duplicated obsevation" (_merge==1), which are not really duplicates because they are unique within each group of the master. For such duplicated observation, Stata does not provide the info I want from the using and leaves just a missing value "."

    The code I use is: merge m:1 lmyear_n countryid using "GCI_Hostcountries_4dim_v9.dta"

    I tried then to create a using file with group-country-year obsevations, but it does not work, as I get the same "Duplicated observation" for the same observations, which are not real duplicates, as they are unique within groups. Here I used the code: merge m:1 entryid lmyear_n countryid using "GCI_Hostcountries_4dim_v9_expanded.dta"

    So the question is: why do I get "Duplicated observation"? And how can I overcome this? I cannot deleted the "Duplicated observation" as they are not duplicated within groups (entryid).

    Just an info: I also tried the joinby (but I did not manage to make it work), the m:m (same problem arises, plus it is widely not suggested), I sorted my data by group in the master.

    Any suggestion is more than welcomed! Thank you a lot in advance!
    Ilaria


    Below an example of what I get. 1="Duplicated observation"

    input float(entryid lmyear_n countryid) double(gci_est gci_pp gci_rpm gci_wda) byte _merge
    1 2017 1 . . . . 1
    1 2017 2 3.3571429249999998 3.444444418 3.868686914 3.696969748 3
    1 2017 3 3.832371473 4.46684742 4.406903267 4.066738605 3
    1 2017 4 3.1862185 3.325876951 2.903080225 3.086344481 3
    1 2017 5 . . . . 1
    1 2017 6 2.64601779 2.345132828 2.333333254 2.739130497 3
    1 2017 7 . . . . 1
    1 2017 8 . . . . 1
    1 2017 9 3.755302429 3.348157644 4.520998001 4.363435745 3
    1 2017 10 3.626057863 4.069918633 4.216534615 3.898905277 3
    1 2017 11 4.150473595 4.806335449 4.579015732 4.464984417 3
    1 2017 12 3.568708181 3.797599792 3.57373786 3.393399 3
    1 2017 13 4.937436104 4.531143665 6.079502106 5.690037727 3
    1 2017 14 . . . . 1
    1 2017 15 4.711019993 4.712504864 4.790915966 4.673634052 3
    1 2017 16 3.322090626 3.761115551 3.915687323 3.750954866 3
    1 2017 17 4.198631784867267 3.5236213995485888 4.576725159859171 3.845379157936559 3
    1 2017 18 5.121985912 4.520885944 5.811362267 5.550701618 3
    2 2012 1 . . . . 1
    2 2012 2 3.064055829896907 3.3552341221649487 3.2863143329896904 3.984685267525773 3
    2 2012 3 4.361998715625 4.360532275 4.1845177875 3.9375299562499997 3
    2 2012 4 2.5709801416666664 2.3843297249999997 2.278467695833333 1.8959146333333334 3
    2 2012 5 . . . . 1
    2 2012 6 3.6628700703125006 3.2716203265625 2.8106725968750004 2.2806640218750003 3
    2 2012 7 . . . . 1
    2 2012 8 . . . . 1
    2 2012 9 3.846512918918919 2.7605287027027026 4.428788135135135 3.5951550270270274 3
    2 2012 10 3.6380153644171775 4.394995083435583 3.925456224539877 3.2193912392638033 3
    2 2012 11 4.078443947643979 4.60922019895288 3.783767640837696 3.757755537172775 3
    2 2012 12 3.057318725 3.6237620999999995 3.5452239125 3.11159975 3
    2 2012 13 4.6110981142857135 3.7754580857142854 5.9300087999999995 5.0241026857142845 3
    2 2012 14 . . . . 1
    2 2012 15 4.640976586206896 4.416021344827586 5.039093303448276 4.046210551724138 3
    2 2012 16 3.0197708677419355 3.5288765322580646 3.5142787290322577 2.8206980451612904 3
    2 2012 17 4.5774431009174315 3.690304073394495 5.183563392660551 4.18609726146789 3
    2 2012 18 4.830055855629139 3.6365744953642385 5.426200631788079 4.958060577483444 3
    2 2012 19 3.248048 3.691751 3.568578 3.124173 3






  • #2
    Welcome to Statalist.

    Can you run the following and post your results?

    In the Master dataset:
    Code:
    duplicates report entryid lmyear_n countryid
    In the Using dataset:
    Code:
    duplicates report entryid lmyear_n countryid
    That would let us know if there were cases with repeated entryid lmyear_n countryid.

    The phenomenon (that some cases got an index 1 with no information should not be called "duplicate". It'd be very confusing. A duplicate would mean that in the data, two or more cases have identical entryid lmyear_n, and countryid.

    An index 1 means that unique ID case exists in the "Master" file but not in the "Using" file. You can test that, go to the Using file, try:

    Code:
    list if entryid == 1 & lmyear_n == 2017 & countryid == 1
    and see that there should be no information returned. That case does not exist. So when it is merged to the "Master", missing values filled out the cells.

    To learn more, read pages 7-9 of https://www.stata.com/manuals/dmerge.pdf.
    Last edited by Ken Chui; 15 Sep 2023, 08:39.

    Comment


    • #3
      I though a _merge==1 implies the observations are only in the master data, not duplicates. Does it say "duplicated observation"?

      Comment


      • #4
        George is right about _merge==1. The documentation for -merge- includes this "match results" table:

        Code:
                   Numeric    Equivalent
                    code      word (results)     Description
                   -------------------------------------------------------------------
                      1       master             observation appeared in master only
                      2       using              observation appeared in using only
                      3       match              observation appeared in both
        
                      4       match_update       observation appeared in both,
                                                   missing values updated
                      5       match_conflict     observation appeared in both,
                                                   conflicting nonmissing values
                   -------------------------------------------------------------------
                   Codes 4 and 5 can arise only if the update option is specified.
                   If codes of both 4 and 5 could pertain to an observation, then 5 is
                   used.
        --
        Bruce Weaver
        Email: [email protected]
        Version: Stata/MP 19.5 (Windows)

        Comment


        • #5
          Dear all,

          Thanks a lot for your replies. Below the answers to your questions. However, I still do not know how to deal with the "Duplicated observation". They are not duplicated in the master, because even though they might have the same lmyear_n and countryid, they still have different entryid. Do you have any other ideas?

          @Kei Chui: This is what I got using the coding you wrote down.

          From master:
          duplicates report entryid lmyear_n countryid

          Duplicates in terms of entryid lmyear_n countryid

          --------------------------------------
          Copies | Observations Surplus
          ----------+---------------------------
          1 | 134524 0
          --------------------------------------

          From using:

          duplicates report lmyear_n countryid --> here I removed the variable entryid, because it is not present in the using file. Entryid is only for the master. What I want to do is to match each entryid lmyear_n countryid in the master with lmyear_n countryid in the using

          Duplicates in terms of lmyear_n countryid

          --------------------------------------
          Copies | Observations Surplus
          ----------+---------------------------
          1 | 2128 0


          This is the results if I type "list if countryid==6 & lmyear==2016 --> again I left out entryid, because it is not relevant for the using file.

          list if countryid==6 & lmyear_n==2016

          +-----------------------------------------------------------------------------------------------------------------+
          | count~ld countr~l countr~i CGIcou~e lmyear_n gci_est gci_pp gci_rpm gci_wda count~id |
          |-----------------------------------------------------------------------------------------------------------------|
          5. | 7 AGO AGO Angola 2016 2.7772926 2.5203245 2.1724785 2.4091228 6 |
          +-----------------------------------------------------------------------------------------------------------------+



          @George Ford: yes, it says "Duplicated observation" for the _merge==1 (I just removed the _merge==2 because they are of no interest for me).

          merge m:1 lmyear_n countryid using "GCI_Hostcountries_4dim_v9.dta"

          Result Number of obs
          -----------------------------------------
          Not matched 32,613
          from master 32,448 (_merge==1)
          from using 165 (_merge==2)

          Matched 102,076 (_merge==3)
          -----------------------------------------


          @Bruce Weaver: The observations are actually not only in the master but also in the using, as in master I have combinations of entryid lmyear_n countryid, while in the using I have combinations of lmyear_n countryid that should match the master. It is true though, that not all combinations in the master are also in using, but not as many as 32,448 (see above when I run the merge m:1).





          Comment


          • #6
            You are not explaining yourself very well. Provide a sample of the master dataset and the using dataset and show us exactly what the issue is.

            Code:
            dataex if  countryid<4 & lmyear_n<2018
            It is true though, that not all combinations in the master are also in using, but not as many as 32,448 (see above when I run the merge m:1).
            In particular, from the dataex sample, show us an example of an observation that should be matched and has not been.

            Comment

            Working...
            X