Merge and create new variable for the non-matching observations

Kamil Braima

Join Date: Aug 2021

Posts: 6
#1

Merge and create new variable for the non-matching observations

02 Nov 2021, 23:56

Hi all,
I was wondering if there is a way to merge identical observations & then create a new variable containing the non-matching observation e.g. "2" and "3" below.

Code:

* Example generated by -dataex-. For more info, type help dataex clear input str10 ID str9 var1 str8 var2 str10 var3 str5 var4 str15 var5 "Id1" "a" "b" "c" "d" "2" "Id1" "a" "b" "c" "d" "3" end

resulting in
input str10 ID str9 var1 str8 var2 str10 var3 str5 var4 str15 var5 str15 newvar
"Id1" "a" "b" "c" "d" "2" "3"
Tags: None
Fei Wang

Join Date: Oct 2021

Posts: 726
#2

03 Nov 2021, 00:07

Kamil, it's not difficult to deal with your two-line example. But I wonder how things would go on with more complicated data. For example, what would be the final result for the data below?

Code:

* Example generated by -dataex-. For more info, type help dataex clear input str10 ID str9 var1 str8 var2 str10 var3 str5 var4 str15 var5 "Id1" "a" "b" "c" "d" "2" "Id1" "a" "b" "c" "d" "3" "Id1" "a" "b" "c" "e" "2" "Id1" "a" "b" "c" "f" "4" end
Comment
Kamil Braima

Join Date: Aug 2021

Posts: 6
#3

03 Nov 2021, 00:24

Hi Fei, thanks for checking. At the moment, my other variables are exactly the same (matching), except for the last variable.
If my dataset ever become complicated as your example above, I would probably then want to create a new variable for any non-matching observation. Hope that makes sense, thanks
Comment
Fei Wang

Join Date: Oct 2021

Posts: 726
#4

03 Nov 2021, 00:52

One more quick check, what if line 1 and line 3 are identical, but both differ from line 2 (as below)? Would there be "2", "3", "2", or "2", "3", or something else? More broadly, could you specify the circumstance and purpose of doing so? Maybe there are better ways of handling the issue.

Code:

* Example generated by -dataex-. For more info, type help dataex clear input str10 ID str9 var1 str8 var2 str10 var3 str5 var4 str15 var5 "Id1" "a" "b" "c" "d" "2" "Id1" "a" "b" "c" "d" "3" "Id1" "a" "b" "c" "d" "2" end

Last edited by Fei Wang; 03 Nov 2021, 00:58.
Comment
Kamil Braima

Join Date: Aug 2021

Posts: 6
#5

03 Nov 2021, 01:40

Sorry for not being any clearer, if I could just clarify again, both var4 and var5 are unique but I would prefer to add them whole e.g "2", "3" "4" & "d" "e" "f" in the new matched variable/s
The context of this is to tidy up my data so that I have a single 'ID' for each observation for easy reporting and further analyses

Code:

* Example generated by -dataex-. For more info, type help dataex clear input str10 ID str9 var1 str8 var2 str10 var3 str5 var4 str15 var5 "Id1" "a" "b" "c" "d" "2" "Id1" "a" "b" "c" "e" "3" "Id1" "a" "b" "c" "f" "4" end

e.g

Code:

* final output something like below input str10 ID str9 var1 str8 var2 str10 var3 str5 var4 str15 var5 ... "Id1" "a" "b" "c" "d" "2" "e" "3" "f" "4" end
Comment

Fei Wang

Join Date: Oct 2021
Posts: 726

03 Nov 2021, 02:37

The code below is only for the example with one value of "ID". It automatically detects which variables have identical values cross all lines and which don't. Then merge the identical values and expand the different ones.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str10 ID str9 v1 str8 v2 str10 v3 str5 v4 str15 v5
"Id1" "a" "b" "c" "d" "2"
"Id1" "a" "b" "c" "e" "3"
"Id1" "a" "b" "c" "f" "4"
end

local vlist ""
foreach var of varlist v1-v5 {
    egen diff = total(`var'!=`var'[_n-1])
    if diff[1] > 1 {
        local vlist = "`vlist'" + "`var'" + " "
    }
    drop diff
}

gen idn = _n
reshape wide `vlist', i(ID) j(idn)
order ID v*, alpha

Things would become difficult when you have multiple IDs, particularly the list of variables with identical cross-line values are different in different IDs -- I think, still, you may have to figure out first what they should look like, as for the example below.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str10 ID str9 v1 str8 v2 str10 v3 str5 v4 str15 v5
"Id1" "a" "b" "c" "d" "2"
"Id1" "a" "b" "c" "e" "3"
"Id1" "a" "b" "c" "f" "4"
"Id2" "a" "b" "e" "f" "5"
"Id2" "a" "b" "f" "g" "6"
end

Comment

Kamil Braima

Join Date: Aug 2021

Posts: 6
#7

03 Nov 2021, 04:33

Hi Fei, thank you so much this worked very well ! My apologies I thought the code would accommodate additional duplicate IDs. My dataset consists of about 10, 000 observations with varying duplicate IDs (some are duplicate only twice whilst others can have up to 14 duplicates similarly with just v4 and v5 being different)
Comment
Fei Wang

Join Date: Oct 2021

Posts: 726
#8

03 Nov 2021, 05:30

Originally posted by Kamil Braima View Post

Hi Fei, thank you so much this worked very well ! My apologies I thought the code would accommodate additional duplicate IDs. My dataset consists of about 10, 000 observations with varying duplicate IDs (some are duplicate only twice whilst others can have up to 14 duplicates similarly with just v4 and v5 being different)

Kamil, if we know for sure that only v4 and v5 are different, then the code would be simple, as below.

Code:

bys ID: gen idn = _n reshape wide v4 v5, i(ID) j(idn) order ID v*, alpha
Comment

Announcement