Comparing numbers in string variables

Paul Pelzl

Join Date: May 2016

Posts: 24
#1

Comparing numbers in string variables

28 Sep 2016, 09:45

Dear all,

I have a dataset including two string variables, str1 and str2. Both display various pairs of numbers, in the style "A:B , A:B , ..." Suppose that for a given observation (row), str1 equals

"15000:4500 , 17540:5000 , 16000:1000"

(The "" are not part of the string, I just use them here for clarification; however, the comma inbetween pairs is part of the string)

and str2 equals

"15000:2000 , 19000:2000 , 16000:900 , 16490:400".

I would like to create a dummy variable wich takes 1 (and 0 otherwise) if:

a) There is at least one pair in str1 and str2 that share the same A

AND

b) for at least one such case (i.e. common A), the associated B in str1 is larger than the associated B in str2.

In my above example, a) is satisfied because there are two cases of common A (where these A's are 15000 and 16000), and b) is satisfied because looking at the pairs with A=15000, we see that 4500 is larger than 2000. Therefore, the to-be-created dummy variable should take 1.

I have thought about this for a while and searched the Internet but could not find a hint or answer. I hope someone can help me out!
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 29967
#2

28 Sep 2016, 12:34

I believe this code will do it. I have changed the names of your variables from str1 and str2 to first and second, just to keep things a little clearer midway through the code.

The complication arises because you have ultra-wide data layout within your variables. Once we break those up and go long, it is not particularly hard.

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input str36 first str47 second "15000:4500 , 17540:5000 , 16000:1000" "15000:2000 , 19000:2000 , 16000:900 , 16490:400" end gen long id = _n // OR USE EXISTING ID VARIABLE split first, gen(first_) parse(",") split second, gen(second_) parse(",") reshape long first_ second_, i(id) j(seq) foreach v of varlist first_ second_ { split `v', gen(`v') parse(":") destring rename `v'1 `v'A rename `v'2 `v'B drop `v' } reshape long first_ second_, i(id seq) j(AB) string reshape long @_, i(id seq AB) j(source) string // NOW FIND IDS WHERE A FIRST A AND A SECOND A OVERLAP by id AB _ (source), sort: gen byte overlap = /// AB == "A" &(_N > 1) & source[1] != source[_N] by id seq source (overlap), sort: replace overlap = overlap[_N] // NOW AMONG THE OVERLAPS FIND ANY WHERE THE CORRESPONDING B VALUE // IS LARGER IN FIRST THAN SECOND reshape wide @_, i(id seq source) j(AB) string by id seq (source), sort: gen byte ordered_overlap = B_[1] > B_[2] & overlap by id (ordered_overlap), sort: replace ordered_overlap = ordered_overlap[_N] by id: keep if _n == 1 drop seq source A_ B_ overlap

Now, I don't know where you go from here. This code restores your data to its original layout, adding the indicator variable, which I've named ordered_overlap. But if you are going to need to work with the A and B values, you would be better off leaving the data fully long. That would just mean leaving out the last two lines of this code.

This code works correctly on the example you gave. But I did not create other test cases where the criteria are not met. So please test this out on your real data. If you find it is not doing what you intended, please respond with the following:

1. Example data, posted using -dataex- as I have done above, that includes cases where the result should be 0 and others where the result should be 1, as well as examples of where the code has given the incorrect result.

2. An explanation of why the incorrect results are not correct, if it isn't obvious.

and I will try to fix it.
2 likes
Comment
Paul Pelzl

Join Date: May 2016

Posts: 24
#3

30 Sep 2016, 06:36

Dear Clyde,

Thanks a lot for your quick answer and help.

I tried your code on my real data and have discovered that the code gives the incorrect result in some cases.

Following your instructions, below I indicate 1. example data and 2. explanations.

1. Example data:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input str36 first str47 second "15000:4500 , 17540:5000 , 16000:1000" "15000:2000 , 19000:2000 , 16000:900 , 16490:400" "17000:4500 , 17550:5000 , 16001:1000" "17000:5000 , 19000:2000 , 16000:900 , 16490:400" "" "16004:3400" "13000:3200" "" "13000:200" "13000:200 , 14000:340" "13000:200 , 14000:340" "13000:200" end

2. Explanations
1st row (id=1): That is the one that I used as example in my original post, and your code gives the correct result, i.e. ordered_overlap=1.

2nd row (id=2): This is another example. It should yield ordered_overlap=0: There is one matching A, but 4500<5000. Your code produced ordered_overlap=0 for this example, thus perfect.

3rd row (id=3): Note: The variable "first" is blank, i.e. missing. Therefore, there cannot be any common A, and ordered_overlap should thus be zero. However, your code produced ordered_overlap=1.

4th row (id=4): Note: The variable "second" is blank, i.e. missing. Thus again, ordered_overlap should thus be zero. Your code correctly produces ordered_overlap=0.

5th row (id=5): There is one common A, but 200=200 (and not 200>200), thus ordered_overlap should thus be zero. However, your code produced ordered_overlap=1.

6th row (id=6): There is one common A, but 200=200 (and not 200>200), thus ordered_overlap should thus be zero. Your code correctly produces ordered_overlap=0.

I would be happy if you could fix the code, thank you again!
Comment
Paul Pelzl

Join Date: May 2016

Posts: 24
#4

30 Sep 2016, 07:07

I think I found the "bug" in your code. The fourth line from the bottom in your code should not be

Code:

by id seq (source), sort: gen byte ordered_overlap = B_[1] > B_[2] & overlap

but

Code:

by id seq (source), sort: gen byte ordered_overlap = B_[1] > B_[2] & B_[1]!=. & overlap

As I see it, the underlying issue is that STATA regards "missing" as a huge number, and so "missing" is always (?) regarded as larger than an actual number. At least I have experienced that also in other occasions. Maybe someone can confirm or falsify this statement and provide the reasoning behind treating "missing" as larger than any number?
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35467
#5

30 Sep 2016, 07:15

If you sort numeric variables including missings, then the missings must go somewhere. In principle, they could be treated as arbitrarily large and negative, but Stata decided on arbitrarily large and positive. So any kind of numeric missing is indeed larger than any non-missing numeric value.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29967
#6

30 Sep 2016, 08:38

Paul Pelzl has coorrectly recognized that my code in #3 does not handle missing values correctly. His proposed fix in #4 looks correct to me. It would probably be safer to write -B_[1] != .- as -!missing(B_[1])-. In this case, the numeric values of B_ arise from a -destring- operation, and given the source of the data, the only missing values would be coded as -.-. So the precaution I'm suggesting is not necessary here. But in general, if you want to guard against the pitfalls of comparisons with missing values, using the more general -!missing()- rubric is safer, as all of Stata's missing values are sorted as larger than any non-missing number.
Comment
Paul Pelzl

Join Date: May 2016

Posts: 24
#7

30 Sep 2016, 09:04

Thank you to both of you for your answers. I have found an example in which Clyde Schechter's original code produces the incorrect result, and my correction does not solve it either:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input str36 first str47 second "12000:50 , 14000:200" "14000:100"

I fixed it by changing the way the observations are sorted; specifically, the line (I also incorporate Clyde's comment regarding my previous fix)

Code:

by id seq (source), sort: gen byte ordered_overlap = B_[1] > B_[2] & !missing(B_[1]) & overlap

should be replaced by

Code:

by id A_ (source), sort: gen byte ordered_overlap = B_[1] > B_[2] & !missing(B_[1]) & overlap
Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

01 Oct 2016, 10:09

Sorry if I'm late to the party but here's another way to parse the initial data using moss (from SSC). It assumes that both first and second only contain lists of complete A:B dyads (A or B is never missing).

With the AB dyads in long form per id, I use joinby for form all pairwise combinations of B given a common A value within each id. This allows for repeat values of A within first or second. No harm done if this is not possible, in which case the joinby command will work as if merge had been used.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float id str36 first str47 second
1 "15000:4500 , 17540:5000 , 16000:1000" "15000:2000 , 19000:2000 , 16000:900 , 16490:400"
2 "17000:4500 , 17550:5000 , 16001:1000" "17000:5000 , 19000:2000 , 16000:900 , 16490:400"
3 ""                                     "16004:3400"                                    
4 "13000:3200"                           ""                                              
5 "13000:200"                            "13000:200 , 14000:340"                          
6 "13000:200 , 14000:340"                "13000:200"                                      
7 "12000:50 , 14000:200"                 "14000:100"                                      
8 "14000:50 , 14000:200"                 "14000:100"                                      
9 "17000:4500 , 17000:5000 , 16001:1000" "17000:5000 , 17000:2000 , 16000:900 , 16490:400"
end
save "main_statalist.dta", replace

* parse input into first and second A/B dyads in long form
* to install -moss-, type in Command window: ssc install moss
moss first, match("([0-9]+)") regex prefix(first_)
moss second, match("([0-9]+)") regex prefix(second_)
drop *_pos* *_count*
foreach v of varlist *_match* {
    if regexm("`v'", "([^0-9]+)_match([0-9]+)") {
        local vname = regexs(1)
        local n = regexs(2)
    }
    if mod(`n',2) {
        rename `v' `vname'A`=int(`n'/2)+1'
    }
    else rename `v' `vname'B`=int(`n'/2)'
}
reshape long firstA firstB secondA secondB, i(id)
list id firstA firstB secondA secondB, sepby(id)
save "temp_statalist.dta", replace

* create all pairwise combinations of firstB secondB by id and common A
keep id secondA secondB
keep if !mi(secondA)
rename secondA joinA
save "B_statalist.dta", replace
use id firstA firstB using "temp_statalist.dta", clear
keep if !mi(firstA)
rename firstA joinA
joinby id joinA using "B_statalist.dta"

* reduce to cases where B is greater in first than in second
keep if real(firstB) &gt; real(secondB)

* merge id for these cases back into original data
by id: keep if _n == 1
keep id
merge 1:1 id using "main_statalist.dta"
gen flag = _merge == 3
drop _merge
sort id
list, noobs

Announcement

Comparing numbers in string variables

Comment

Comment

Comment

Comment

Comment

Comment

Comment