Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Drop all the observations if there is one duplicate

    Good day to all!

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input byte obs_number int column str16 varname
    1  11 "Name of the firm"
    2  11 "Name 1"          
    1 253 "Name of the firm"
    2 253 "Name 1"          
    end

    I'm working on a reshaped dataset that have several """variables""" in one Stata variable in order to modify some strings such as "Name 1" here. Now I have two variables that are the same, namely, "Name of the firm", and they are duplicates. I want to drop all possible obs_number values for all duplicates. I could do drop if column == 253, but I have many other duplicates and I would like to do another thing than just looking at the data to report the corresponding column value. I need something like

    1. check if "Name of the firm" is repeated many times in the variable varname,
    2. store the values of column (for the duplicates only) in a macro

    So then I could do a loop for each value of the local and drop if column == this value.

    You help will be very appreciated!
    Last edited by Thomas Brot; 06 Mar 2023, 03:40.

  • #2
    Have you looked at the duplicates command?

    I don't really follow your data example, as there are no exact duplicates on observations.

    Comment


    • #3
      Nick :

      Apologies if it was unclear, I meant duplicates on varname only.

      I'm only a bit familiar with the duplicates command, however I wish to store the associated column values for the corresponding varname duplicates in a macro. Can they do this ?

      For instance, there are two times "Name of the firm" in my dataset. I wish to store in a macro the value of the duplicates' corresponding column value "253" (and many others in my real dataset) so I could do :

      foreach v of local dup {
      drop if column == `v'
      }
      Last edited by Thomas Brot; 06 Mar 2023, 03:45.

      Comment


      • #4
        I am familiar with the duplicates command. It doesn't do anything like what you sketch. If duplicates can be identified systematically, you wouldn't need a loop at all. This sounds like the x-y problem https://xyproblem.info/ -- you have part of a solution in your mind, but it is not obvious that it would be a move in the right direction.

        Please back-up and give a less abstract example and show the data before and after your intended drop.

        Comment


        • #5
          I'm going to give a more detailed example of my data structure first.

          Code:
          * Example generated by -dataex-. For more info, type help dataex
          clear
          input byte obs_number long colonne str732 varname
          1   2 "Date of submission"
          2   2 "2020-06-05 18:43:20"
          3   2 "2020-06-05 18:41:47"
          4   2 "2020-08-19 13:26:31"
          1  11 "Name of the firm"  
          2  11 "Name 1"            
          3  11 "Name 2"            
          4  11 "Name 3"            
          1 253 "Name of the firm"  
          2 253 "Name 1"            
          3 253 "Name 2"            
          4 253 "Name 3"            
          end
          Normally this isn't the type of data structure one works with. However I have several edits to do that are easier to do for me when everything is stored in one variable like varname. After this, I intend to rename each variable name (i.e. whenever obs_number == 1) into an Stata-compatible variable name, then reshape my dataset with

          reshape wide varname, I(obs_number) j(colonne)

          so that I have "Date of submission"'s ID as a first variable, "Name of the firm"'s ID as a second variable, etc. However as you can guess two variables cannot have the same name, therefore I'm trying to drop all the soon-to-be variable duplicates at this stage of my do-file.

          Doing duplicates drop varname would drop varname observations that may not belong to the same identical pair of variables. imagine if my dataset, for the sake of example, looked this this:

          Code:
          * Example generated by -dataex-. For more info, type help dataex
          clear
          input byte obs_number long colonne str732 varname
          1   2 "Date of submission"
          2   2 "2020-06-05 18:43:20"
          3   2 "2020-06-05 18:41:47"
          4   2 "Name 3"
          1  11 "Name of the firm"  
          2  11 "Name 1"            
          3  11 "Name 2"            
          4  11 "Name 3"            
          1 253 "Name of the firm"  
          2 253 "Name 1"            
          3 253 "Name 2"            
          4 253 "Name 3"            
          end
          Doing so would also drop the fourth line of the soon-to-be variable "Date of submission" which would mix up my entire data structure. That is why I aim at dropping all the 253 column values in a neat way so that in the reshape wide process, Stata would just skip the varname253 and I would have no error whenever I will try to rename my variables because of identical names. And to generalize this to all duplicates, I need to somehow end up having all the duplicates' column values in a macro so that I could drop observations easily.

          Code:
          * Example generated by -dataex-. For more info, type help dataex
          clear
          input byte obs_number long colonne str732 varname
          1  2 "Date of submission"
          2  2 "2020-06-05 18:43:20"
          3  2 "2020-06-05 18:41:47"
          4  2 "2020-08-19 13:26:31"
          1 11 "Name of the firm"  
          2 11 "Name 1"            
          3 11 "Name 2"            
          4 11 "Name 3"            
          end
          This is what I would like to achieve at the end, before reshaping wide and swapping my obs_number == 1 values with some defined IDs.

          Thank you for your help !
          Last edited by Thomas Brot; 06 Mar 2023, 04:17.

          Comment


          • #6
            Thanks for the extra detail, which is enough for me to think that your problem is not about duplicates at all, still less about using the duplicates command.

            The x-y problem is still evident!

            But while reshape wide is certainly what I would be inclined to use here, I am still fuzzy on your ultimate goal.

            This may help, and if not other people may understand better what you want.


            Code:
            * Example generated by -dataex-. For more info, type help dataex
            clear
            input byte obs_number long colonne str732 varname
            1   2 "Date of submission"
            2   2 "2020-06-05 18:43:20"
            3   2 "2020-06-05 18:41:47"
            4   2 "2020-08-19 13:26:31"
            1  11 "Name of the firm"  
            2  11 "Name 1"            
            3  11 "Name 2"            
            4  11 "Name 3"            
            1 253 "Name of the firm"  
            2 253 "Name 1"            
            3 253 "Name 2"            
            4 253 "Name 3"            
            end
            
            gen newname = strtoname(varname) if obs_number == 1 
            replace newname = newname[_n-1] if missing(newname)
            
            drop if obs_number == 1 
            
            reshape wide varname, i(colonne) j(obs_number)

            Comment


            • #7
              Nick : I can understand why it's still unclear. I'm afraid if I give more context it might end up being even more unclear. Let's still try!

              I already got the strtoname and reshape part covered. Suppose I changed my future variable names with appropriate IDs like below. if I do

              Code:
              * Example generated by -dataex-. For more info, type help dataex
              clear
              input byte obs_number long colonne str732 varname
              1   2 "A1"                 
              2   2 "2020-06-05 18:43:20"
              3   2 "2020-06-05 18:41:47"
              4   2 "2020-08-19 13:26:31"
              1  11 "A2"                 
              2  11 "Name 1"             
              3  11 "Name 2"             
              4  11 "Name 3"             
              1 253 "A2"                 
              2 253 "Name 1"             
              3 253 "Name 2"             
              4 253 "Name 3"             
              end
              
              reshape wide varname, i(obs_number) j(colonne)
              I end up with

              Code:
              * Example generated by -dataex-. For more info, type help dataex
              clear
              input byte obs_number str732(varname2 varname11 varname253)
              1 "A1"                  "A2"     "A2"    
              2 "2020-06-05 18:43:20" "Name 1" "Name 1"
              3 "2020-06-05 18:41:47" "Name 2" "Name 2"
              4 "2020-08-19 13:26:31" "Name 3" "Name 3"
              end
              Ordinarily, I'm looping across several files and the code was working fine. However for this file (the last one! how unlucky), there are 2 A2 variables and they are exactly the same, so the normal loop

              Code:
              foreach var of varlist varname* {
              rename `var' `=`var'[1]'
              drop in 1
              }
              will not work on this file because of the two A2. I need to drop all trace of this second A2 before reshaping wide, that can only be done by dropping all the colonne values corresponding to the duplicates' name

              My ultimate, ultimate goal is just to have an appended dataset with harmonized variable names across all the files I was given. But this is not the problem here, as I end up with the desired final product if I exclude this specific file. If it's impossible to do then I'll just do things manually, in any case your help is greatly appreciated.

              Comment


              • #8
                Following my comment in #6 I am leaving this wide open for other answers.

                Comment


                • #9
                  I understand. Thank you for this exchange, as it helped me better understand my own problems and come up with a satisfying solution that makes the do-file run.

                  Code:
                  gen dup_todrop = colonne if obs_number == 1 & varname == "Name of the firm"
                                  by varname (colonne), sort: replace dup_todrop = . if _n == 1 
                                  sort obs_number colonne
                                  levelsof dup_todrop, local(dup_todrop)
                                  foreach v of local dup_todrop {
                                      drop if colonne == `v'
                                  }
                                  drop dup_todrop
                  It may not the most efficient solution nor even what it's right to do but at least it solved what I wanted to do, and since it only refers to exactly similar variables, I don't think it will do any harm.

                  Comment

                  Working...
                  X