Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Deleting duplicate rows in a data panel

    I'm working with a database of a household survey with information for 27 states and 13 years. From the research variables I'm generating new databases with new variables, which generally represent averages or rates of participation, therefore, on that basis there are several lines (representing the same state and the same year) with the same value for these new variables. For the purposes of my work I just need a value, by year and state, for each of these new variables, and I want to export this information to excel, however, the bases have millions of observations can not export to excel, to then delete the repetitions.
    I would like your help to solve my problem.

  • #2
    Well, to get one observation per state and year you can run:
    Code:
    duplicates drop state year
    That will only work if all of the other variables have the same values for each observation having the same state and year.

    If you have more than one observation for a state and year, but some of those observations differ on other variables, then you need to specify some rules for deciding which one you want to keep (or if you want to combine them in some way).

    Comment


    • #3
      The observations have several variables for each state and year, but only one of these variables has the same value for the same combination of state and year, and it is this variable that I want to get.
      How would I do to establish a rule?

      I am thinking about what you said I tried to do the following:

      keep uf ano comodporpes

      duplicates drop uf ano
      where "uf" is the variable to state, "ano" is the variable to year and "comodporpes" is the variable that care for each year and state.

      But the Stata returned me the following message and error code:

      force option required with duplicates drop varlist
      r(198);

      Comment


      • #4
        Yes, that is right. Actuallyl, if you have kept only uf ano and comodporpes, and if comodporpes is, indeed, the same in every observation having the same uf and ano, you can do simply -duplicates drop-. If Stata finds some uf and ano where comodporpes varies, it will tell you about that and it will not drop anything.

        Comment


        • #5
          Clyde, thanks again!

          I followed your advice and it worked perfectly.

          Comment


          • #6
            Hello
            I got the same issue, but I dont know what to do with the "force option required.." message
            Need help ...


            Thanks
            Ludmila

            Comment


            • #7
              It would have been better had you showed the exact code you used and the exact response you got from Stata by copy/pasting from the Results windowor your log file into a code block here. But, I think I can answer your question:

              When you ask Stata to do -duplicates drop var1 var2-, Stata is concerned about the possibility that you will lose information. You might, for example, have two observations with the same values of var1 and var2 that have different values of var3. If that is true in your data, then by executing -duplicates drop var1 var2-, you will be left with a data set that has lost information and has arbitrarily retained one of several possible legitimate values of var3. So Stata wants to make sure you don't just blunder into that situation.

              There are three basic possibilities:

              1. It may be that you have no such situations like that: whenever var1 and var2 are both the same in some observations, then those observations agree on every other variable as well. In that case, the simplest solution is to rewrite the command as just -duplicates drop-.

              2. It may be that you have situations like that, but for your purposes, it won't matter if you lose information and have arbitrary values on other variables. In that case rewrite the command as -duplicates drop var1 var2, force- and Stata will do it without further complaint.

              3. It may be that you have situations like that and it does matter if you lose information and have arbitrary values on other variables: in that case attempting to drop duplicates on var1 and var2 is ill-considered and you should rethink your plan.

              I suppose there is also a fourth possibility: you don't know which of the above three possibilities is correct. In that case, you need to explore your data and find out before proceeding.

              Comment


              • #8
                Such an explanation !
                Thank you very much
                My problem solved .
                God bless you

                Comment


                • #9
                  Thank you so much Clyde! This helped me as well.

                  Comment

                  Working...
                  X