  • Finding duplicates in multiple variables and counting the ones that are duplicates in only 1 variable

    Hello

    I have this kind of dataset

    HHID herd_size ls_code exotic
    1 1 3 0
    2 3 3 1
    3 2 6 0
    1 3 3 1
    2 6 5 1
    2 1 3 0
    3 1 4 1

    First, I would like a list of all the observations that have the same HHID and the same ls_code and, if possible, the same herd_size,
    so I can drop those duplicates.
    Second, I would like to sum the herd_size values for the observations that have the same HHID and the same ls_code but a different herd_size.
    Once I have those sums, I would like to drop the rest of the duplicate HHID numbers.

    Can someone help me with this? Until now I haven't figured out how to do it.

    Thanks a lot

  • #2
    First, I would like a list of all the observations that have the same HHID and the same ls_code and, if possible, the same herd_size,
    so I can drop those duplicates.
    Code:
    clear
    input float(HHID herd_size ls_code exotic)
    1 1 3 0
    2 3 3 1
    3 2 6 0
    1 3 3 1
    2 6 5 1
    2 1 3 0
    3 1 4 1
    end
    
    duplicates tag HHID ls_code herd_size, gen(flag)
    tab HHID if flag
    duplicates drop HHID ls_code herd_size, force
    Note that the force option is necessary here because the observations still may differ on the variable exotic. Note also that this means that this information is being lost, and you should probably drop that variable to avoid confusion later.
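    To see how much information the force option would discard, one possible check, run before the -duplicates drop- line, is to flag groups in which exotic is not constant. This is only a sketch: the variable name exotic_differs is made up here for illustration, and the data are re-entered so the block runs on its own.

```stata
clear
input float(HHID herd_size ls_code exotic)
1 1 3 0
2 3 3 1
3 2 6 0
1 3 3 1
2 6 5 1
2 1 3 0
3 1 4 1
end

* Within each set of observations identical on HHID, ls_code and herd_size,
* sort on exotic and compare the first and last values: they differ only
* if exotic is not constant within the set.
bysort HHID ls_code herd_size (exotic): gen byte exotic_differs = exotic[1] != exotic[_N]

* List the observations for which -duplicates drop, force- would keep an
* arbitrary value of exotic.
list HHID ls_code herd_size exotic if exotic_differs, sepby(HHID)
```

    In the example data no two observations agree on all three variables, so nothing is listed; in a real dataset any listed observations are the ones worth inspecting before dropping.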
    Second, I would like to sum the herd_size values for the observations that have the same HHID and the same ls_code but a different herd_size.
    What you say here is different from the title of your post, so I can't tell if what you want is a count of these, or the total of the herd sizes. The code below gives both.
    Code:
    by HHID ls_code, sort: egen count_of_obs = count(herd_size)
    by HHID ls_code, sort: egen total_herd_size = total(herd_size)
    Once I have those sums, I would like to drop the rest of the duplicate HHID numbers.
    OK. At this point, the variable herd_size will be meaningless, because the herd sizes will be different depending on which one out of each set of duplicates is retained. So, to avoid confusion, I drop that variable first.
    Code:
    drop herd_size
    by HHID ls_code: keep if _n == 1
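    As a sketch of an alternative to the -egen- and -keep- steps, -collapse- can produce the count and the total and reduce the data to one observation per HHID and ls_code pair in a single command. It assumes that exact duplicates have already been dropped if they should not be counted twice, and it discards exotic and every other variable not named in the command. The data are re-entered so the block runs on its own.

```stata
clear
input float(HHID herd_size ls_code exotic)
1 1 3 0
2 3 3 1
3 2 6 0
1 3 3 1
2 6 5 1
2 1 3 0
3 1 4 1
end

* One observation per HHID ls_code pair, holding the summed herd size and
* the number of nonmissing herd_size values that went into the sum.
collapse (sum) total_herd_size = herd_size (count) count_of_obs = herd_size, by(HHID ls_code)

list, sepby(HHID)
```

    On the example data this leaves five observations; for instance, HHID 1 with ls_code 3 gets a total of 4 from herd sizes 1 and 3.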
    In the future, when showing data examples, please use the -dataex- command to do so, as I have done here. If you are running version 15.1 or a fully updated version 14.2, it is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.






    • #3
      Thanks, it helped me a lot.
      I can identify some of the duplicates now.
