substr vs. bsubstr functions

Andrea Discacciati

Join Date: Feb 2016

Posts: 194
#1

16 Mar 2017, 13:13

Can anyone tell me what the differences between the substr and bsubstr functions are?

bsubstr is used in many official Stata commands (e.g.: -streg-), but it is not documented in the help file (only substr is documented). I also tried to google it to no avail.
Robert Picard

Join Date: Mar 2014

Posts: 1536
#2

16 Mar 2017, 13:32

Indeed not documented. Looks like an alias for substr() as both return a string made up of the requested bytes:

Code:

. dis bsubstr("ecole",1,3)
eco

. dis bsubstr("école",1,3)
éc

. dis substr("école",1,3)
éc

. dis usubstr("école",1,3)
éco
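
A minimal sketch of why the byte-based and character-based results differ, assuming Stata 14 or later (where strings are stored as UTF-8) and a precomposed "é". It uses only the documented functions strlen() and ustrlen(), and is meant to be run from a do-file:

Code:

* number of bytes: 6, because "é" takes two bytes in UTF-8
display strlen("école")
* number of Unicode characters: 5
display ustrlen("école")

So the first three bytes of "école" cover only the two characters "éc", while the first three characters are "éco".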
Andrea Discacciati

Join Date: Feb 2016

Posts: 194
#3

17 Mar 2017, 01:46

@Robert Picard: Thank you for your answer. So, would you say that it is safe to use substr instead of bsubstr to match the minimum abbreviation of the distribution option against the available distributions, as streg does? (It would be interesting to know why StataCorp added the function bsubstr.)

Also, I only have access to Stata version 14. Could someone please check whether the function bsubstr works in Stata versions 13 and 12?

Thank you in advance.
Jorrit Gosens

Join Date: Jan 2015

Posts: 1019
#4

17 Mar 2017, 02:28

Not on 13.1:

Code:

. dis bsubstr("ecole",1,3)
unknown function bsubstr()
r(133);
daniel klein

Join Date: Mar 2014

Posts: 3850
#5

17 Mar 2017, 02:53

Here is my guess of what might have happened. Robert is probably right on point. Perhaps StataCorp created the function bsubstr(), short for bytesubstr(), when they introduced Unicode support in Stata 14 and started updating their routines, e.g. streg, to work with Unicode. It was later decided to stick with substr() so old code would not break, even without version control, and users would not need to learn a new function name for doing the things they are used to. Internally, removing bsubstr() would then not have been wise, given that some routines might already use it and you would not want to go back to check this. For the users, however, the new name usubstr() was created to give the new Unicode-based results, and that is what was documented along with the other new u*() string functions. As an aside, I find it a bit inconvenient that the regex() machinery was renamed ustrregex*() instead of just uregex*().

Anyway, to answer the question: yes, it is safe to use substr() instead of bsubstr(). In fact, it is not safe to use the latter, as non-documented stuff is not guaranteed to work in future versions, not even under version control. The situation is similar for officially "undocumented" commands, although those seem to be rather stable across releases. Andrea might need to think about whether usubstr() is what is needed, but that depends on what exactly is wanted.

Best
Daniel
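
The abbreviation matching discussed above could be sketched as follows with the documented substr(). This is only an illustration, not the code used in streg: the local macro names, the example input "weib", and the minimum abbreviation length of 1 are all assumptions.

Code:

* hypothetical user input for the distribution option
local dist "weib"
* length of the input in bytes
local len = strlen("`dist'")
* accept the input if it equals the leading bytes of "weibull"
if "`dist'" == substr("weibull", 1, max(1, `len')) {
    display "distribution: weibull"
}

With plain ASCII input, bytes and characters coincide, so substr() and usubstr() would give the same result here.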
Andrea Discacciati

Join Date: Feb 2016

Posts: 194
#6

17 Mar 2017, 03:18

@Jorrit: Thank you very much for checking that!

@daniel: Thank you very much for your reply! I'll use substr then, as all the strings I work with contain only plain ASCII characters.

Last edited by Andrea Discacciati; 17 Mar 2017, 03:31.
Hua Peng (StataCorp)

StataCorp Employee

Join Date: Jun 2014

Posts: 346
#7

17 Mar 2017, 08:53

daniel klein is right. bsubstr() behaves the same as substr(). We created the function during the transition to Unicode. It is used by the developers to signal the intention of working on bytes instead of characters. We will consider documenting the b*() functions in a future update.
daniel klein

Join Date: Mar 2014

Posts: 3850
#8

17 Mar 2017, 09:13

Thanks for the clarification and the opportunity to learn from strategies at StataCorp.

Originally posted by Hua Peng (StataCorp):

"It is used by the developers to signal the intention of working on bytes instead of characters."

This is a clever way of emphasizing the intention without writing a comment.

Best
Daniel