Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Comparing numbers in string variables

    Dear all,

    I have a dataset including two string variables, str1 and str2. Both display various pairs of numbers, in the style "A:B , A:B , ..." Suppose that for a given observation (row), str1 equals

    "15000:4500 , 17540:5000 , 16000:1000"

    (The "" are not part of the string, I just use them here for clarification; however, the comma inbetween pairs is part of the string)

    and str2 equals

    "15000:2000 , 19000:2000 , 16000:900 , 16490:400".

    I would like to create a dummy variable wich takes 1 (and 0 otherwise) if:

    a) There is at least one pair in str1 and str2 that share the same A

    AND

    b) for at least one such case (i.e. common A), the associated B in str1 is larger than the associated B in str2.

    In my above example, a) is satisfied because there are two cases of common A (where these A's are 15000 and 16000), and b) is satisfied because looking at the pairs with A=15000, we see that 4500 is larger than 2000. Therefore, the to-be-created dummy variable should take 1.

    I have thought about this for a while and searched the Internet but could not find a hint or answer. I hope someone can help me out!



  • #2
    I believe this code will do it. I have changed the names of your variables from str1 and str2 to first and second, just to keep things a little clearer midway through the code.

    The complication arises because you have ultra-wide data layout within your variables. Once we break those up and go long, it is not particularly hard.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str36 first str47 second
    "15000:4500 , 17540:5000 , 16000:1000" "15000:2000 , 19000:2000 , 16000:900 , 16490:400"
    end
    
    gen long id = _n // OR USE EXISTING ID VARIABLE
    split first, gen(first_) parse(",")
    split second, gen(second_) parse(",")
    reshape long first_ second_, i(id) j(seq)
    
    foreach v of varlist first_ second_ {
        split `v', gen(`v') parse(":") destring
        rename `v'1 `v'A
        rename `v'2 `v'B
        drop `v'
    }
    reshape long first_ second_, i(id seq) j(AB) string
    reshape long @_, i(id seq AB) j(source) string
    
    //    NOW FIND IDS WHERE A FIRST A AND A SECOND A OVERLAP
    by id AB _ (source), sort: gen byte overlap = ///
        AB == "A" &(_N > 1) & source[1] != source[_N]
    by id seq source (overlap), sort: replace overlap = overlap[_N]
    
        //    NOW AMONG THE OVERLAPS FIND ANY WHERE THE CORRESPONDING B VALUE
    //    IS LARGER IN FIRST THAN SECOND
    reshape wide @_, i(id seq source) j(AB) string
    by id seq (source), sort: gen byte ordered_overlap = B_[1] > B_[2] & overlap
    by id (ordered_overlap), sort: replace ordered_overlap = ordered_overlap[_N]
    by id: keep if _n == 1
    drop seq source A_ B_ overlap
    Now, I don't know where you go from here. This code restores your data to its original layout, adding the indicator variable, which I've named ordered_overlap. But if you are going to need to work with the A and B values, you would be better off leaving the data fully long. That would just mean leaving out the last two lines of this code.

    This code works correctly on the example you gave. But I did not create other test cases where the criteria are not met. So please test this out on your real data. If you find it is not doing what you intended, please respond with the following:

    1. Example data, posted using -dataex- as I have done above, that includes cases where the result should be 0 and others where the result should be 1, as well as examples of where the code has given the incorrect result.

    2. An explanation of why the incorrect results are not correct, if it isn't obvious.

    and I will try to fix it.

    Comment


    • #3
      Dear Clyde,

      Thanks a lot for your quick answer and help.

      I tried your code on my real data and have discovered that the code gives the incorrect result in some cases.

      Following your instructions, below I indicate 1. example data and 2. explanations.

      1. Example data:
      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input str36 first str47 second
      "15000:4500 , 17540:5000 , 16000:1000" "15000:2000 , 19000:2000 , 16000:900 , 16490:400"
      "17000:4500 , 17550:5000 , 16001:1000" "17000:5000 , 19000:2000 , 16000:900 , 16490:400"
      "" "16004:3400"
      "13000:3200" ""
      "13000:200" "13000:200 , 14000:340"
      "13000:200 , 14000:340" "13000:200"
      end
      2. Explanations
      1st row (id=1): That is the one that I used as example in my original post, and your code gives the correct result, i.e. ordered_overlap=1.

      2nd row (id=2): This is another example. It should yield ordered_overlap=0: There is one matching A, but 4500<5000. Your code produced ordered_overlap=0 for this example, thus perfect.

      3rd row (id=3): Note: The variable "first" is blank, i.e. missing. Therefore, there cannot be any common A, and ordered_overlap should thus be zero. However, your code produced ordered_overlap=1.

      4th row (id=4): Note: The variable "second" is blank, i.e. missing. Thus again, ordered_overlap should thus be zero. Your code correctly produces ordered_overlap=0.

      5th row (id=5): There is one common A, but 200=200 (and not 200>200), thus ordered_overlap should thus be zero. However, your code produced ordered_overlap=1.

      6th row (id=6): There is one common A, but 200=200 (and not 200>200), thus ordered_overlap should thus be zero. Your code correctly produces ordered_overlap=0.

      I would be happy if you could fix the code, thank you again!

      Comment


      • #4
        I think I found the "bug" in your code. The fourth line from the bottom in your code should not be

        Code:
         by id seq (source), sort: gen byte ordered_overlap = B_[1] > B_[2] & overlap

        but

        Code:
         by id seq (source), sort: gen byte ordered_overlap = B_[1] > B_[2] & B_[1]!=. & overlap
        As I see it, the underlying issue is that STATA regards "missing" as a huge number, and so "missing" is always (?) regarded as larger than an actual number. At least I have experienced that also in other occasions. Maybe someone can confirm or falsify this statement and provide the reasoning behind treating "missing" as larger than any number?





        Comment


        • #5
          If you sort numeric variables including missings, then the missings must go somewhere. In principle, they could be treated as arbitrarily large and negative, but Stata decided on arbitrarily large and positive. So any kind of numeric missing is indeed larger than any non-missing numeric value.

          Comment


          • #6
            Paul Pelzl has coorrectly recognized that my code in #3 does not handle missing values correctly. His proposed fix in #4 looks correct to me. It would probably be safer to write -B_[1] != .- as -!missing(B_[1])-. In this case, the numeric values of B_ arise from a -destring- operation, and given the source of the data, the only missing values would be coded as -.-. So the precaution I'm suggesting is not necessary here. But in general, if you want to guard against the pitfalls of comparisons with missing values, using the more general -!missing()- rubric is safer, as all of Stata's missing values are sorted as larger than any non-missing number.

            Comment


            • #7
              Thank you to both of you for your answers. I have found an example in which Clyde Schechter's original code produces the incorrect result, and my correction does not solve it either:

              Code:
              * Example generated by -dataex-. To install: ssc install dataex
              clear
              input str36 first str47 second
              "12000:50 , 14000:200" "14000:100"
              I fixed it by changing the way the observations are sorted; specifically, the line (I also incorporate Clyde's comment regarding my previous fix)

              Code:
              by id seq (source), sort: gen byte ordered_overlap = B_[1] > B_[2] & !missing(B_[1]) & overlap
              should be replaced by

              Code:
              by id A_ (source), sort: gen byte ordered_overlap = B_[1] > B_[2] & !missing(B_[1]) & overlap

              Comment


              • #8
                Sorry if I'm late to the party but here's another way to parse the initial data using moss (from SSC). It assumes that both first and second only contain lists of complete A:B dyads (A or B is never missing).

                With the AB dyads in long form per id, I use joinby for form all pairwise combinations of B given a common A value within each id. This allows for repeat values of A within first or second. No harm done if this is not possible, in which case the joinby command will work as if merge had been used.

                Code:
                * Example generated by -dataex-. To install: ssc install dataex
                clear
                input float id str36 first str47 second
                1 "15000:4500 , 17540:5000 , 16000:1000" "15000:2000 , 19000:2000 , 16000:900 , 16490:400"
                2 "17000:4500 , 17550:5000 , 16001:1000" "17000:5000 , 19000:2000 , 16000:900 , 16490:400"
                3 ""                                     "16004:3400"                                    
                4 "13000:3200"                           ""                                              
                5 "13000:200"                            "13000:200 , 14000:340"                          
                6 "13000:200 , 14000:340"                "13000:200"                                      
                7 "12000:50 , 14000:200"                 "14000:100"                                      
                8 "14000:50 , 14000:200"                 "14000:100"                                      
                9 "17000:4500 , 17000:5000 , 16001:1000" "17000:5000 , 17000:2000 , 16000:900 , 16490:400"
                end
                save "main_statalist.dta", replace
                
                * parse input into first and second A/B dyads in long form
                * to install -moss-, type in Command window: ssc install moss
                moss first, match("([0-9]+)") regex prefix(first_)
                moss second, match("([0-9]+)") regex prefix(second_)
                drop *_pos* *_count*
                foreach v of varlist *_match* {
                    if regexm("`v'", "([^0-9]+)_match([0-9]+)") {
                        local vname = regexs(1)
                        local n = regexs(2)
                    }
                    if mod(`n',2) {
                        rename `v' `vname'A`=int(`n'/2)+1'
                    }
                    else rename `v' `vname'B`=int(`n'/2)'
                }
                reshape long firstA firstB secondA secondB, i(id)
                list id firstA firstB secondA secondB, sepby(id)
                save "temp_statalist.dta", replace
                
                * create all pairwise combinations of firstB secondB by id and common A
                keep id secondA secondB
                keep if !mi(secondA)
                rename secondA joinA
                save "B_statalist.dta", replace
                use id firstA firstB using "temp_statalist.dta", clear
                keep if !mi(firstA)
                rename firstA joinA
                joinby id joinA using "B_statalist.dta"
                
                * reduce to cases where B is greater in first than in second
                keep if real(firstB) &gt; real(secondB)
                
                * merge id for these cases back into original data
                by id: keep if _n == 1
                keep id
                merge 1:1 id using "main_statalist.dta"
                gen flag = _merge == 3
                drop _merge
                sort id
                list, noobs

                Comment

                Working...
                X