  • substr vs. bsubstr functions

    Can anyone tell me what the differences between the substr and bsubstr functions are?

    bsubstr is used in many official Stata commands (e.g., -streg-), but it is not documented in the help file (only substr is documented). I also tried to google it, to no avail.

  • #2
    Indeed not documented. Looks like an alias for substr() as both return a string made up of the requested bytes:

    Code:
    . dis bsubstr("ecole",1,3)
    eco
    
    . dis bsubstr("école",1,3)
    éc
    
    . dis substr("école",1,3)
    éc
    
    . dis usubstr("école",1,3)
    éco
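
The "éc" result above follows from how the string is stored: in UTF-8, "é" takes two bytes, so the first three bytes of "école" cover only two characters. A minimal sketch of that, assuming Stata 14 or later and a do-file saved as UTF-8 (the checks are illustrative, not part of the original post):

```stata
* Assumes Stata 14 or later, where strings are stored as UTF-8.
display strlen("école")     // length in bytes
display ustrlen("école")    // length in Unicode characters

* substr() counts bytes, usubstr() counts characters
assert substr("école", 1, 3)  == "éc"
assert usubstr("école", 1, 3) == "éco"
```

If the two lengths differ for a string, byte-based and character-based extraction can return different pieces of it.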


    • #3
      Robert Picard: Thank you for your answer. So, would you say that it is safe to use substr instead of bsubstr to match the minimum abbreviation of the distribution option against the available distributions, as streg does for example? (It would be interesting to know why StataCorp added the function bsubstr.)

      Also, I only have access to Stata 14. Could someone please check whether the function bsubstr works in Stata versions 13 and 12?

      Thank you in advance.


      • #4
        No, not in Stata 13.1:

        Code:
        . dis bsubstr("ecole",1,3)
        unknown function bsubstr()
        r(133);


        • #5
          Here is my guess of what might have happened. Robert is probably right on point. Perhaps StataCorp created the function bsubstr(), short for bytesubstr(), when they introduced Unicode support in Stata 14 and started updating their routines, e.g. streg, to work with Unicode. It was later decided to stick with substr() so that old code would not break, even without version control, and users would not need to learn a new function name for doing the things they are used to. Internally, removing bsubstr() would then not have been wise, given that some routines might already use it and you would not want to go back to check this. For users, however, the new name usubstr() was created to give the new Unicode-based results, and that is what was documented along with the other new u*() string functions. As an aside, I find it a bit inconvenient that the regex() machinery was renamed ustrregex*() instead of just uregex*().

          Anyway, to answer the question: yes, it is safe to use substr() instead of bsubstr(). In fact, it is not safe to use the latter, as non-documented stuff is not guaranteed to work in future versions, not even under version control. The situation is similar for undocumented commands, although those seem to be rather stable across releases. Andrea might need to think about whether usubstr() is what is needed, but that depends on what exactly is wanted.

          Best
          Daniel
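
The abbreviation matching asked about in #3 can be sketched with substr() alone. The program name check_abbrev, its arguments, and the minimum length of 1 are hypothetical and chosen for illustration; this is not how -streg- itself is written:

```stata
* Hypothetical helper (not taken from -streg-): checks whether the first
* argument is a valid abbreviation of the second, with a minimum length
* given by the third.
capture program drop check_abbrev
program define check_abbrev, rclass
    args input full minlen
    local len = strlen("`input'")
    * substr() compares leading bytes, which are also the leading
    * characters when both strings are plain ASCII
    return scalar ok = (`len' >= `minlen') & (substr("`full'", 1, `len') == "`input'")
end

check_abbrev weib weibull 1
display r(ok)
```

Because only substr() and strlen() are used, the sketch does not rely on any function introduced with Unicode support in Stata 14.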


          • #6
            @Jorrit: Thank you very much for checking that!

            @daniel: Thank you very much for your reply! I'll use substr then, as all the strings I work with contain only plain ASCII characters.
            Last edited by Andrea Discacciati; 17 Mar 2017, 03:31.
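
For plain ASCII strings such as those mentioned above, byte-based and character-based functions should agree, because every ASCII character occupies one byte in UTF-8. A small sketch of how that could be checked, assuming Stata 14 or later (the local macro s and its value are illustrative):

```stata
* Assumes Stata 14 or later (ustrlen() and usubstr() do not exist earlier).
local s "weibull"

* For plain ASCII, every character is one byte, so the two lengths agree
assert strlen("`s'") == ustrlen("`s'")

* ... and byte-based and character-based extraction return the same result
assert substr("`s'", 1, 3) == usubstr("`s'", 1, 3)
```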


            • #7
              daniel klein is right. bsubstr() behaves the same as substr(). We created the function during the transition to Unicode. It is used by the developers to signal the intention of working on bytes instead of characters. We will consider documenting the b*() functions in a future update.


              • #8
                Thanks for the clarification and the opportunity to learn from strategies at StataCorp.

                Originally posted by Hua Peng (StataCorp)
                It is used by the developers to signal the intention of working on bytes instead of characters.
                This is a clever way of emphasizing the intention without writing a comment.

                Best
                Daniel
