Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to delete duplicate records

    Dear statalisters,
    I have a dataset that has records that may be duplicates by two variables , below I show you a small set of observations to make my question
    clear.

    Code:
    clear
    input int(firmid    fyear)    byte    csale
    1000 1980 19 
    1000 1980 . 
    1001 1980 2 
    end
    when observations have the same firmid and fyear, I want to keep the records that have non-missing csale values. I all records that share the same firmid and fyear have missing csale, I want to keep at least 1 record.

    I thought of using tag, but not sure if tag can be organized by csale value.

    Thanks,
    Rochelle


  • #2
    Well, the basic task is fairly simple:

    Code:
    by firmid fyear (csale), sort: keep if _n == 1
    This works because any non-missing value sorts before missing.

    But this is an unsafe thing to do without further investigation. You need to be sure that there aren't two or more records with a non-missing value of csale. If there are, the above code would always choose the one with the smallest value, which might not be correct. So you need to first verify that this condition is met:

    Code:
    // DO THIS BEFORE THE CODE ABOVE
    gen byte csale_missing = missing(csale)
    by firmid fyear csale_missing (csale), sort: assert csale[1] == csale[_N] if !csale_missing
    This will assure you that if there is more than one observation for a combination of firmid fyear with non-missing csale, the csale values are all the same. If the above assertion fails, then you need to look at the cases where it fails to figure out why you have inconsistent non-missing values of csale and how to resolve those inconsistencies.

    There is one additional risk in using even those conditions: what about other variables in your data set? In your example, these are the only ones, so no problem. But if your real data has others, you need to verify that those variables all agree on the same value before you remove them (and figure out what's going on and how to fix it if they don't). Without knowing more about the possible other variables, or even if there are any, I won't write code here.

    Comment


    • #3
      Thank you Clyde very much !!!!

      Your comments are very useful. My real data is rather complex and the assertion failed. I will make further investigation to my data before I proceed.


      Regards,
      Rochelle


      Comment


      • #4
        what about
        Code:
        duplicates drop firmid fyear, force
        Regards
        --------------------------------------------------
        Attaullah Shah, PhD.
        Professor of Finance, Institute of Management Sciences Peshawar, Pakistan
        FinTechProfessor.com
        https://asdocx.com
        Check out my asdoc program, which sends outputs to MS Word.
        For more flexibility, consider using asdocx which can send Stata outputs to MS Word, Excel, LaTeX, or HTML.

        Comment


        • #5
          One should always think twice, three time, and four or more times before using the -force- option in any command. Particularly with -duplicates drop-. This will, indeed, remove duplicates and leave you with just one observation for each combination of firmid and fyear. But the whole point raised in #2 and #3 is that the duplicate observations contain conflicting information about the other variables. Using -force- simply obscures the problem without solving it. So before just arbitrarily selecting one and using it, good statistical practice requires first studying those duplicate observations to determine, a) why they even exist (if they shouldn't), and b) which one is the right one to keep, or, perhaps whether they need to be combined (e.g. by averaging, medians, etc.) into a single observation. It is more important to have correct data than to use a "quick and dirty" workaround so that you can rapidly arrive at meaningless and useless results.

          Comment


          • #6
            In the replies above, I think that there is a presumption that non-missing duplicates of csale are unwanted. I read the original problem differently.

            when observations have the same firmid and fyear, I want to keep the records that have non-missing csale values. If all records that share the same firmid and fyear have missing csale, I want to keep at least 1 record.
            This suggests to me that observations may have the same firmid and fyear, but different csale values. Indeed, if I suppose "csale" refers to "customer sale", then there is nothing inherently wrong with the same firm in the same year making two or more sales (hence multiple csales). The only remark that I would make is that "at least 1" is ambiguous because it may include anything between one to the maximum number of missing observations. That said, I will provide code that does the following:

            1. If there are multiple non-missing csale values for the same firmid and fyear, keep all of them.
            2. If there is a non-missing csale value for a given firmid and fyear, delete the observations with missing csale values.
            3. For a group of observations with the same firmid and fyear, if all values for csale are missing, keep only one observation (you may change to clarify the "atleast one" ambiguity).

            The following expanded example illustrates:

            Code:
            clear
            input int(firmid    fyear)    byte    csale
            1000 1980 19
            1000 1980 .
            1001 1980 2
            1001 1980 3
            1002 1980 .
            1002 1980 .
            1002 1980 .
            1003 1980 .
            1003 1980 7
            1003 1980 .
            1004 1980 .
            end

            In the following, firmid 1002 has all missing values, so we keep one, whereas firmid 1001 has 2 non-missing csale values, so we keep both.

            Code:
            
            . list, sepby(firmid)
            
                 +------------------------+
                 | firmid   fyear   csale |
                 |------------------------|
              1. |   1000    1980      19 |
              2. |   1000    1980       . |
                 |------------------------|
              3. |   1001    1980       2 |
              4. |   1001    1980       3 |
                 |------------------------|
              5. |   1002    1980       . |
              6. |   1002    1980       . |
              7. |   1002    1980       . |
                 |------------------------|
              8. |   1003    1980       . |
              9. |   1003    1980       7 |
             10. |   1003    1980       . |
                 |------------------------|
             11. |   1004    1980       . |
                 +------------------------+

            Here is one way to proceed:

            Code:
            gen byte tag = missing(csale)
            bys  firmid fyear: egen tag2= mean(tag)
            drop if tag==1 & tag2!=1
            bys  firmid fyear: replace tag2= sum(tag2)
            drop if tag==1 & tag2!=1
            drop tag tag2
            list, sepby(firmid)

            Code:
            . list, sepby(firmid)
            
                 +------------------------+
                 | firmid   fyear   csale |
                 |------------------------|
              1. |   1000    1980      19 |
                 |------------------------|
              2. |   1001    1980       2 |
              3. |   1001    1980       3 |
                 |------------------------|
              4. |   1002    1980       . |
                 |------------------------|
              5. |   1003    1980       7 |
                 |------------------------|
              6. |   1004    1980       . |
                 +------------------------+

            Comment


            • #7
              Andrew may well be right. I understood the original post to imply that only one value of csale was to be retained, but on re-reading the post it does not actually say that. Here is another way to do what he has proposed that is a bit shorter, and to me, at least, a bit clearer:

              Code:
              gen byte missing_csale = missing(csale)
              by firmid fyear (missing_csale csale), sort: keep if _n == 1 | !missing_csale
              The logic works because if there are non-missing observations of csale will all be retained, and the missing observation(s) all sort at the end, so they are not retained by either part of the -if- clause. When, all observations of csale are missing, only the first is kept.
              Last edited by Clyde Schechter; 31 Jan 2016, 18:43.

              Comment


              • #8
                Thanks Clyde for a more efficient solution!

                Comment


                • #9
                  Actually, overnight I realized that it can be reduced to a single line:

                  Code:
                  by firmid fyear (csale), sort: keep if _n == 1 | !missing_csale
                  (The variable missing_csale that I used in #7 is not actually needed because the missing values automatically sort to the end anyhow.)

                  Comment


                  • #10
                    Thank you Clyde. Your suggestions are really helpful. Your post provides great help to me like student who are looking for ways to resolve issues. Thanks once again.

                    Comment

                    Working...
                    X