Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Delete the duplicates by column


    Hi, I have more than 3,000 dta files with same format and this is one of the file.dta below.
    As you can see, this is a wide type data set and I want to get rid of same value by each column. For your understanding, Var1 only contains 5 digit number value and the rest of cell is empty. For Var2, it only has date/month/year in the column and the rest is empty as well.
    Var5 and Var7 has the same value, which does not have to be seen more than one so, I need only one 360 ad 1 for them.
    The rest of variables (Var3, Var4, Var6 ...) have binary (0 or 1) values but I need only one value for each one. So, if there "1" in the Var3 column, I do not need to see zero. Otherwise, there should be one "0" by each column.
    I have been trying to use "duplicates", "carryforward" option, however, it is hard to eliminate them with holding all of the conditions that I mentioned.

    Thank you in advance!

    Current state of data
    ID Var1 Var2 Var3 Var4 Var5 Var6 Var7 ... Var99
    1 0 0 360 0 1 0
    2 0 0 360 1 1 0
    3 31/01/1948 0 0 360 0 1 0
    4 23789 1 0 360 0 1 1
    12/12/2014 0 0 360 0 1 1
    12345 0 0 360 0 1 1
    24/07/1999 0 0 360 0 1 0
    ... 98765 0 0 360 0 1 0
    3000 0 0 360 1 1 0
    It should be like this.
    ID Var1 Var2 Var3 Var4 Var5 Var6 Var7 ... Var99
    1 23789 31/01/1948 1 0 360 1 1 1
    2 12345 12/12/2014
    3 98765 24/07/1999
    Last edited by Priver JM; 04 Sep 2019, 11:09.

  • #2
    Not only is this one of the most bizarre data layouts I have ever seen, your description of it does not agree with what you show in the tables. Var1 does not have any five digit numbers. Var5 clearly does not have the same values as Var7.

    Most confusing: how do you know that the 12345 value of Var1 belongs, in the final result, with ID 2: there is nothing in the data you show that suggests that. Similarly, it is unclear why 24/907/1999 is to be linked with ID 3 instead of one of the others. I cannot make any sense of "The rest of variables (Var3, Var4, Var6 ...) have binary (0 or 1) values but I need only one value for each one. So, if there "1" in the Var3 column, I do not need to see zero. Otherwise, there should be one "0" by each column."

    I suggest you post back with a clearer explanation of what you have, what you want, and how Stata is supposed to decide which row's values get placed together with which ID. And please clarify what you mean about Var3.

    Comment


    • #3
      I apologize for making a confusion. As you mentioned, there is no relationship between ID and Variables so please ignore the ID. I just generated the ID to see if I can sort the variables, but it seems not working. Each variables are made from the first column as Var1 to Var99, if a certain keyword is found. For example, if there is a keyword "apple" in the first column, "1" is filled in the Var1 at the certain row position and otherwise it will be "0".

      But, this process made a current situation that is not able to see all of result at a glance since each column has 3,000 rows. That's why I asked this issue. I hope this makes clear to explain to you.

      Comment


      • #4
        I'm sorry, but I don't understand any better at this point. None of my questions in #2 have been answered.

        Comment


        • #5
          1) Var1 does not have any five digit numbers; the value ranges [00001] to [99999]; Numeric values should be kept in the column. Missing values are to be deleted.
          2) Var5 and Var7 has same value within the column respectively. It needs to have one "360" and one "1" in each row.
          3) Var3 has "1" and "0"; if there is "1", I need to delete all of the "0" in the column. If there is not "1", I want to see only one "0" in the column.

          Hope this helps understanding!
          ​​​​​​​

          Comment


          • #6
            I'm sorry, but I still don't understand this at all. Perhaps somebody else does and can help you.

            Comment


            • #7
              I also found your description, as well as the logic and reason for the output, unclear. Below is just a wild guess, targeting at your described output in #1.
              Code:
              tempfile Input Output
              save `Input', replace
              
              drop ID Var1 Var2
              collapse (max) *
              save `Output', replace
              
              use `Input', clear
              keep Var2
              drop if missing(Var2)
              merge 1:1 _n using `Output', nogen
              save `Output', replace
              
              use `Input', clear
              keep Var1
              drop if missing(Var1)
              merge 1:1 _n using `Output', nogen
              By the way, in the future, please use -dataex- to give out a small sample of your data. Read more detail on FAQ, section 12.2.

              Comment

              Working...
              X