Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Dropping duplicates with most missing and retaining non-missing cells from dropped duplicate

    Hello all,

    In a large dataset I'm currently working with there's a fair amount of duplicates of the variable 'bvd_id'. As I'm interested in merging this data on another dataset, using bvd_id, I need to drop duplicates.

    For most observations there's missing data on one variable or another, hence in an attempt to retain the rows with the most data i have the following code:

    bysort bvd_id: egen count = count if (country_iso != "." & bvd_sector != "." & incorporation_date != "." & employees0104 != "." & employees_last_avail != "." & revenue0104 != "." & revenue_last != ".")
    sort count
    duplicates drop bvd_id, force
    drop count


    When manually checking some of the duplicated rows, before dropping, i see that one row could have data on variable X where it is missing in the second row - but then second row has data on Y when first row doesn't. So what I'm looking to do is to combine the data to retain as much information as possible. In case of both rows is non-missing on X, i would want to prioritize the row with the least missing, i.e. the first one to be listed following my previously listed code.


    I'm a bit lost as to how to do this. I would greaty appreciate any help!

    Kind regards, Lorens









  • #2
    This kind of problem usually calls for collapse not duplicates. One strategy is to keep the minimum and maximum of numeric variables; if they are equal you have just one distinct non-missing value.

    Comment


    • #3
      Hi Nick,
      Thanks for your reply!

      I was unfamiliar with collapse, but looks like exactly what i want. I have a follow up question if you don't mind.

      For numeric values I'm leaning towards collapsing using means, and for string variables using first non-missing. Is there a way to collaps numeric and string variables simultaneously?

      I've attempted the following, but & is not allowed for collapse:
      local vars employees0104 employees_last revenue0104 revenue_last last_avail_year
      local vars1 country_iso bvd_sector incorporation_date
      collapse (mean) `vars' & collapse (firstnm) `vars1', by (bvd_id)








      Comment


      • #4
        Code:
        collapse (mean) `vars' (firstnm) `vars1', by (bvd_id)

        Comment


        • #5
          Cheers Nick!

          Sorting wasn't needed when collapsing. As I now need to apply duplicates drop, I'm once again interested in sorting my data in a way so that, for duplicated rows of 'company', the rows with the least missing cells appears first. This way i would drop the least informative rows when using duplicates drop. Alternatively, if there's a way for duplicates drop to not keep the first row.

          Listed below is an example of duplicated rows of a company. Here I want to keep the second row.

          Code:
          * Example generated by -dataex-. For more info, type help dataex
          clear
          input str486 company float employees0104 double employees_last float revenue0104 double revenue_last float last_avail_year
          "AIRBUS DEFENCE AND SPACE GMBH"      .     .        .        .    .
          "AIRBUS DEFENCE AND SPACE GMBH" 110662 70054 32075000 35925000 2019
          "AIRBUS DEFENCE AND SPACE GMBH"      .     1        .        . 2017
          "AIRBUS DEFENCE AND SPACE GMBH"      .  3104        .  1694116 2015
          end

          Thank you very much for your help so far

          Comment


          • #6
            I don't understand #5 as nothing in your data example looks like duplicated observations. It seems that you have panel data/

            Comment


            • #7
              It is not panel data although I understand it might look like it, specifically due to the variable 'last_avail_year'. I believe the rows represent different subsidiaries of a branch, and just so happens to have 'last_avail_year' sorted in a way that makes it look like panel.

              Here's a different example from my data, including more variables, which should hopefully make it clear that it is not panel data:

              Code:
              * Example generated by -dataex-. For more info, type help dataex
              clear
              input str486 company str17 bvd_id str56 bvd_sector str10 incorporation_date float employees0104 double employees_last float revenue0104 double revenue_last float last_avail_year
              "AJINOMOTO OMNICHEM" "BE2142568563" "Chemicals, Petroleum, Rubber & Plastic" " 1/10/1978" .   . .         .    .
              "AJINOMOTO OMNICHEM" "BE2036255276" "Chemicals, Petroleum, Rubber & Plastic" "."          .   . .         .    .
              "AJINOMOTO OMNICHEM" "BE0403078352" "Chemicals, Petroleum, Rubber & Plastic" " 4/19/1946" . 846 . 451205.55 2021
              "AJINOMOTO OMNICHEM" "BE2142568662" "Chemicals, Petroleum, Rubber & Plastic" " 7/16/1980" .   . .         .    .
              end


              Nonetheless, I'd still love your input on how to sort it so that the row with the least missing is listed first. I've played around with rmiss and rmiss2, but can't quite get it to work correctly.

              // Lorens

              Comment


              • #8
                There are function in egen to count missings across rows, meaning observations.

                Comment

                Working...
                X