  • "strpos" generates a weird number

    Hi,

    I encountered a strange phenomenon in Stata. var1 has the following content
    var1 = COGNITIVE DISORDER, GENERAL; PHASE I CLINICAL TRIAL

    When I use the following, it generates 1 for var2
    gen var2=1 if strpos(var1, "PHASE I")

    When I use the following, it generates 30 for var3
    gen var3=(strpos(var1, "PHASE I"))

    I am really curious how Stata gives 30. Can anyone help?

    Ray.

  • #2
    No mystery here. The strpos() function returns the number of the character position in the string where the searched-for string is found. In this case PHASE I begins at the 30th character of that longer phrase in var1. See -help strpos()-.

    The reason you get var2 = 1 is that when you say -if strpos(var1, "PHASE I")-, Stata, on seeing the -if-, treats -strpos(var1, "PHASE I")- as a Boolean expression. The rule for converting a number to a Boolean value is: 0 is false, anything other than 0 (including missing value) is true. Since 30 is not equal to 0, the expression evaluates to true. Consequently var2 is set to 1 for this observation.

    Note by the way that -gen var2 = 1 if ...- creates a 1/. variable, not a 1/0 variable. While that may be what is wanted, it usually isn't. If you want a 0/1 variable, the syntax would be:

    Code:
    gen var4 = (strpos(var1, "PHASE I") > 0)
    The logic is that if PHASE I does not occur in var1, strpos() returns 0. Since 0 is not greater than 0, the right-hand side of this command, a Boolean expression, is false, which is then represented numerically as 0. On the other hand, if PHASE I does occur in var1, strpos() returns the position of that P in var1, which will be some number greater than or equal to 1, and hence necessarily > 0. So the right-hand side is, in this case, true, which Stata encodes numerically as 1.
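
    To see the three forms side by side, here is a small sketch on made-up data. The second observation and the variable names pos, flag_miss and flag01 are invented purely for illustration.

    ```stata
    clear
    input str60 var1
    "COGNITIVE DISORDER, GENERAL; PHASE I CLINICAL TRIAL"
    "COGNITIVE DISORDER, GENERAL"
    end

    * position of the first match, or 0 if there is none
    gen pos = strpos(var1, "PHASE I")

    * 1 where matched, missing (.) where not
    gen flag_miss = 1 if strpos(var1, "PHASE I")

    * 1 where matched, 0 where not
    gen byte flag01 = strpos(var1, "PHASE I") > 0

    list pos flag_miss flag01
    ```

    The first observation should show 30, 1 and 1; the second should show 0, . and 0.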


    • #3
      As a footnote to Clyde's helpful explanations, note that with functions something like

      Code:
      help strpos()
      takes you straight to the help for that function.

      Even without looking at Stata

      http://www.stata.com/help.cgi?strpos()

      tells you what the function does

      strpos(s1,s2) Description: the position in s1 at which s2 is first found; otherwise, 0

      FAQ Advice #3 starts

      3. What should I do before I post?

      Before posting, consider other ways of finding information:
      • the online help for Stata
      Last edited by Nick Cox; 21 Apr 2016, 01:58.


      • #4
        I thought var3=(strpos(var1, "PHASE I")) would also generate a Boolean expression, but looking at it again, it does not. Thank you!


        • #5
          One way to remember this is to note that "pos" means "position".


          • #6
            Why does the following return 1?
            . di strpos(`"8° 41 ' 50""',""º"")
            1



            • #7
              Use functions ustrpos() and ustrrpos() to search based on characters rather than on bytes.
              Code:
              help strpos()
              and
              Code:
              help ustrpos()
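
              As a quick illustration of the difference (only a sketch; it needs Stata 14 or later, and the string is borrowed from #6): the degree sign takes two bytes in UTF-8, so every byte position after it is shifted by one relative to the character position.

              ```stata
              * "8" is byte 1, the degree sign occupies bytes 2-3, the space is byte 4
              * byte position of "4": 5
              di strpos("8° 41", "4")

              * character position of "4": 4
              di ustrpos("8° 41", "4")
              ```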


              • #8
                Code:
                strpos(`"8° 41 ' 50""',""º"")
                is equivalent to
                Code:
                strpos(`"8° 41 ' 50""',"")
                Internally, Stata stops processing the second string at the second double quote. For example

                Code:
                .  di strpos("abcd",""xyz)
                1
                By the way, you may use function tobytes() to investigate byte sequences, in this case:

                Code:
                . di tobytes(`"8° 41 ' 50""')
                \d056\d194\d176\d032\d052\d049\d032\d039\d032\d053\d048\d034
                
                . di tobytes(""º"")
                
                .
                
                . di strlen(""º"")
                0
                where the second call returns an empty result because, to Stata, its input is equivalent to "", as strlen() also indicates.
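
                If the intent in #6 was to search for a single character, one pair of double quotes around the search string is enough. A sketch, assuming the character searched for was the masculine ordinal indicator (º, U+00BA) rather than the degree sign (°, U+00B0) that appears in the first string:

                ```stata
                * the two characters look alike but have different byte sequences
                di tobytes("°")
                di tobytes("º")

                * degree sign: found at byte position 2
                di strpos(`"8° 41 ' 50""', "°")

                * ordinal indicator: not present in the string, so 0
                di strpos(`"8° 41 ' 50""', "º")
                ```

                Under that assumption the two tobytes() calls should differ in the second byte (\d176 versus \d186), which is why the second search finds nothing.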

                Last edited by Hua Peng (StataCorp); 22 May 2020, 08:17.


                • #9
                  why is
                  Code:
                  di strpos("a","") == 1


                  • #10
                    Originally posted by Bjarte Aagnes
                    why is
                    Code:
                    di strpos("a","") == 1
                    That's a good question. This returns zero when using -ustrpos()-

                    Code:
                    di ustrpos("a", "")


                    • #11
                      -help strpos()-
                      Code:
                         strpos(s1,s2)
                             Description:  the position in s1 at which s2 is first found; otherwise, 0
                      The description should be explicit saying that if s2 is empty (""), the return value is 1.

                      A non-technical explanation may be: the empty string is found before any string thus at position 1.

                      Compare with documentation of C# IndexOf(String).
                      Last edited by Bjarte Aagnes; 22 May 2020, 13:36.


                      • #12
                        Originally posted by Bjarte Aagnes
                        -help strpos()-
                        Code:
                         strpos(s1,s2)
                        Description: the position in s1 at which s2 is first found; otherwise, 0
                        The description should be explicit saying that if s2 is empty (""), the return value is 1.

                        A non-technical explanation may be: the empty string is found before any string thus at position 1.

                        Compare with documentation of C# IndexOf(String).
                        A problem with this is that the behaviours of -strpos()- and -ustrpos()- are not consistent. The same non-technical explanation should mean that -ustrpos()- also returns 1 when s2 is an empty string.


                        • #13

                          In post 11 my intention was only to describe how I understand the current behaviour of the function strpos(), and ask for an update of the -help strpos()- ; the description should be explicit saying that if s2 is empty (""), the return value is 1.



                          "A problem with this is that the expected behaviour between -strpos()- and -ustrpos()- are not consistent."
                          The quote seems like a wish for changing the strpos() function.

                          In principle strpos() and ustrpos() should not allow "empty string" as an argument; in the Stata context - calling strpos() and ustrpos() with "" as an argument could return missing.

                          In practice, changing the return value for strpos(string, "") to "0" may be an alternative.

                          But, changes, e.g. returning "0", will have consequences for running existing code, like true-or-false testing for string match using if strpos(s1, s2).


                          "The same non-technical explanation should mean that -ustrpos()- when s2 is an empty string should also return 1."
                          I disagree: A non-existing character can never match.



                          (To avoid problems when your strings contain multibyte characters, use the functions ustrlen() and ustrpos(). See -help strlen()- and -help strpos()-.)


                          • #14
                            Adding to #13 an example of using strpos() for true-or-false testing of string match :
                            Code:
                            clear
                            set obs 3
                            
                            gen s1 = "A"
                            gen s2 = cond(_n==1, s1, cond(_n==2, "", "B"))
                            
                            gen found10 = strpos(s1, s2)
                            
                            gen found11 = strpos(s1, s2) if strlen(s2) > 0
                            
                            gen found3 = ustrpos(s1, s2)  
                            
                            count if found10
                            
                            list
                            Code:
                            . count if found10
                              2
                            
                            .
                            . list
                            
                                 +--------------------------------------+
                                 | s1   s2   found10   found11   found3 |
                                 |--------------------------------------|
                              1. |  A    A         1         1        1 |
                              2. |  A              1         .        0 |
                              3. |  A    B         0         0        0 |
                                 +--------------------------------------+
                            Last edited by Bjarte Aagnes; 24 May 2020, 04:49.
