Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merge and create new variable for the non-matching observations

    Hi all,
    I was wondering if there is a way to merge identical observations & then create a new variable containing the non-matching observation e.g. "2" and "3" below.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str10 ID str9 var1 str8 var2 str10 var3 str5 var4 str15 var5
    "Id1" "a" "b" "c" "d" "2"
    "Id1" "a" "b" "c" "d" "3"
    end
    resulting in
    input str10 ID str9 var1 str8 var2 str10 var3 str5 var4 str15 var5 str15 newvar
    "Id1" "a" "b" "c" "d" "2" "3"

  • #2
    Kamil, it's not difficult to deal with your two-line example. But I wonder how things would go on with more complicated data. For example, what would be the final result for the data below?

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str10 ID str9 var1 str8 var2 str10 var3 str5 var4 str15 var5
    "Id1" "a" "b" "c" "d" "2"
    "Id1" "a" "b" "c" "d" "3"
    "Id1" "a" "b" "c" "e" "2"
    "Id1" "a" "b" "c" "f" "4"
    end

    Comment


    • #3
      Hi Fei, thanks for checking. At the moment, my other variables are exactly the same (matching), except for the last variable.
      If my dataset ever become complicated as your example above, I would probably then want to create a new variable for any non-matching observation. Hope that makes sense, thanks

      Comment


      • #4
        One more quick check, what if line 1 and line 3 are identical, but both differ from line 2 (as below)? Would there be "2", "3", "2", or "2", "3", or something else? More broadly, could you specify the circumstance and purpose of doing so? Maybe there are better ways of handling the issue.

        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input str10 ID str9 var1 str8 var2 str10 var3 str5 var4 str15 var5
        "Id1" "a" "b" "c" "d" "2"
        "Id1" "a" "b" "c" "d" "3"
        "Id1" "a" "b" "c" "d" "2"
        end
        Last edited by Fei Wang; 03 Nov 2021, 00:58.

        Comment


        • #5
          Sorry for not being any clearer, if I could just clarify again, both var4 and var5 are unique but I would prefer to add them whole e.g "2", "3" "4" & "d" "e" "f" in the new matched variable/s
          The context of this is to tidy up my data so that I have a single 'ID' for each observation for easy reporting and further analyses

          Code:
          * Example generated by -dataex-. For more info, type help dataex
          clear
          input str10 ID str9 var1 str8 var2 str10 var3 str5 var4 str15 var5
          "Id1" "a" "b" "c" "d" "2"
          "Id1" "a" "b" "c" "e" "3"
          "Id1" "a" "b" "c" "f" "4"
          end
          e.g
          Code:
          * final output something like below
          input str10 ID str9 var1 str8 var2 str10 var3 str5 var4 str15 var5 ...
          "Id1" "a" "b" "c" "d" "2" "e" "3" "f" "4"
          end

          Comment


          • #6
            The code below is only for the example with one value of "ID". It automatically detects which variables have identical values cross all lines and which don't. Then merge the identical values and expand the different ones.


            Code:
            * Example generated by -dataex-. For more info, type help dataex
            clear
            input str10 ID str9 v1 str8 v2 str10 v3 str5 v4 str15 v5
            "Id1" "a" "b" "c" "d" "2"
            "Id1" "a" "b" "c" "e" "3"
            "Id1" "a" "b" "c" "f" "4"
            end
            
            local vlist ""
            foreach var of varlist v1-v5 {
                egen diff = total(`var'!=`var'[_n-1])
                if diff[1] > 1 {
                    local vlist = "`vlist'" + "`var'" + " "
                }
                drop diff
            }
            
            gen idn = _n
            reshape wide `vlist', i(ID) j(idn)
            order ID v*, alpha
            Things would become difficult when you have multiple IDs, particularly the list of variables with identical cross-line values are different in different IDs -- I think, still, you may have to figure out first what they should look like, as for the example below.

            Code:
            * Example generated by -dataex-. For more info, type help dataex
            clear
            input str10 ID str9 v1 str8 v2 str10 v3 str5 v4 str15 v5
            "Id1" "a" "b" "c" "d" "2"
            "Id1" "a" "b" "c" "e" "3"
            "Id1" "a" "b" "c" "f" "4"
            "Id2" "a" "b" "e" "f" "5"
            "Id2" "a" "b" "f" "g" "6"
            end

            Comment


            • #7
              Hi Fei, thank you so much this worked very well ! My apologies I thought the code would accommodate additional duplicate IDs. My dataset consists of about 10, 000 observations with varying duplicate IDs (some are duplicate only twice whilst others can have up to 14 duplicates similarly with just v4 and v5 being different)

              Comment


              • #8
                Originally posted by Kamil Braima View Post
                Hi Fei, thank you so much this worked very well ! My apologies I thought the code would accommodate additional duplicate IDs. My dataset consists of about 10, 000 observations with varying duplicate IDs (some are duplicate only twice whilst others can have up to 14 duplicates similarly with just v4 and v5 being different)
                Kamil, if we know for sure that only v4 and v5 are different, then the code would be simple, as below.

                Code:
                bys ID: gen idn = _n
                reshape wide v4 v5, i(ID) j(idn)
                order ID v*, alpha

                Comment

                Working...
                X