Extract numbers (with varying lengths) from variable label

shem shen

Join Date: Mar 2016
Posts: 136

Extract numbers (with varying lengths) from variable label

19 Dec 2020, 14:34

Hi friends,

I now face the following problem:

Code:

webuse auto,clear
lab var make "STATA#1, 90"
lab var price "STATA#11, 90"
lab var mpg "STATA#22, 91"
lab var rep78 "STATA#2, 91"

lookfor make price mpg rep78
local varlist= r(varlist)
local counter=0
foreach var in `varlist' {
    local varlab: var lab `var'  //extract the whole variable label string
    //di "`varlab'"
    local varlaby = real(reverse(substr(reverse("`varlab'"),1,2))) + 1900  //extract the last two digits in the label and add it by 1900
    di "`varlaby'"

    local varlabm = real(substr("`varlab'"),7,2)   //extract the number after #.
    //This command fails in the cases of make and rep78 and succeeds for price and mpg.
    //I am looking for a general solution to this problem
    di "`varlabm'"
    }

So my job is to extract the numbers in each variable's label, including both the last two digits (90 and 91 in the example) and the first number after # (1, 11, 22, 2 in the example). I know how to do the former, but can't figure out how to extract the number after #, because they vary in length and are always surrounded by nonnumeric symbols.

Thank you!

Last edited by shem shen; 19 Dec 2020, 14:37.

Tags: label

William Lisowski

Join Date: Dec 2014
Posts: 10150

19 Dec 2020, 15:59

I think this demonstrates technique that does what you want.

Code:

webuse auto,clear
lab var make "STATA#1, 90"
lab var price "STATA#11, 90"
lab var mpg "STATA#22, 91"
lab var rep78 "STATA#2, 91"

lookfor #
local varlist `r(varlist)'

foreach var in `varlist' {
    local varlab: var lab `var'  //extract the whole variable label string
    local vl = ustrregexrf("`varlab'", "^.*#" ,"")
    local vl = ustrregexrf("`vl'", ",", " ")
    macro list _var _varlab _vl
    local varlabm : word 1 of `vl'
    local varlaby : word 2 of `vl'
    local varlaby 19`varlaby'
    macro list _varlabm _varlaby
}

Code:

. webuse auto,clear
(1978 Automobile Data)

. lab var make "STATA#1, 90"

. lab var price "STATA#11, 90"

. lab var mpg "STATA#22, 91"

. lab var rep78 "STATA#2, 91"

. 
. lookfor #

              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------------------------------
make            str18   %-18s                 STATA#1, 90
price           int     %8.0gc                STATA#11, 90
mpg             int     %8.0g                 STATA#22, 91
rep78           int     %8.0g                 STATA#2, 91

. local varlist `r(varlist)'

. 
. foreach var in `varlist' {
  2.     local varlab: var lab `var'  //extract the whole variable label string
  3.     local vl = ustrregexrf("`varlab'", "^.*#" ,"")
  4.     local vl = ustrregexrf("`vl'", ",", " ")
  5.     macro list _var _varlab _vl
  6.     local varlabm : word 1 of `vl'
  7.     local varlaby : word 2 of `vl'
  8.     local varlaby 19`varlaby'
  9.     macro list _varlabm _varlaby
 10. }
_var:           make
_varlab:        STATA#1, 90
_vl:            1 90
_varlabm:       1
_varlaby:       1990
_var:           price
_varlab:        STATA#11, 90
_vl:            11 90
_varlabm:       11
_varlaby:       1990
_var:           mpg
_varlab:        STATA#22, 91
_vl:            22 91
_varlabm:       22
_varlaby:       1991
_var:           rep78
_varlab:        STATA#2, 91
_vl:            2 91
_varlabm:       2
_varlaby:       1991

.

Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

19 Dec 2020, 16:24

Here's a version that avoids using regular expression functions, the results are identical to those shown in post #2. Sorry, I've been answering regular expression questions recently and I was still on autopilot.

Code:

webuse auto,clear
lab var make "STATA#1, 90"
lab var price "STATA#11, 90"
lab var mpg "STATA#22, 91"
lab var rep78 "STATA#2, 91"

lookfor #
local varlist `r(varlist)'

foreach var in `varlist' {
    local varlab: var lab `var'  //extract the whole variable label string
    local hloc = strpos("`varlab'","#")
    local vl = substr("`varlab'",`hloc'+1,.)
    local vl = subinstr("`vl'", ",", " ", .)
    macro list _var _varlab _vl
    local varlabm : word 1 of `vl'
    local varlaby : word 2 of `vl'
    local varlaby 19`varlaby'
    macro list _varlabm _varlaby
}

Comment

Bjarte Aagnes

Join Date: Apr 2014
Posts: 784

20 Dec 2020, 08:02

Adding two regex variants extracting both elements:

Code:

qui lookfor #

foreach v of varlist `r(varlist)' {
    
    local d1 = word(ustrregexra("`:var lab `v''", "[^\d\s]",""), 1)  
    local d2 = word(ustrregexra("`:var lab `v''", "[^\d\s]",""), 2)
    
    di "`d1' `d2'"
}

Code:

qui lookfor #

foreach v of varlist `r(varlist)' {
    
    if ustrregexm("`:var lab `v''", "#(\d+),\s(\d+)" ) {
    
        local d1 = ustrregexs(1) 
        local d2 = ustrregexs(2)
    }
    
     di "`d1' `d2'"
}

Comment

shem shen

Join Date: Mar 2016

Posts: 136
#5

20 Dec 2020, 09:53

Thank you William! Your code works for me!

Also thank you Bjarte! May I ask what "#(\d+),\s(\d+)" means? I guess #(\d+) means to search for numbers after #. But I can't figure out what \s means. I googled "\s" and got no relevant results. Similarly, could you also tell me what [^\d\s] means? Google result says it is dominance and submission.

Thank you!
Comment
Bjarte Aagnes

Join Date: Apr 2014

Posts: 784
#6

20 Dec 2020, 10:04

\s Match a white space character. White space is defined as [\t\n\f\r\p{Z}].

The ICU regular expressions are documented at https://unicode-org.github.io/icu/us...gs/regexp.html

The regular expressions can be explained using https://regex101.com/
Comment
shem shen

Join Date: Mar 2016

Posts: 136
#7

20 Dec 2020, 11:44

Originally posted by Bjarte Aagnes View Post

\s Match a white space character. White space is defined as [\t\n\f\r\p{Z}].

The ICU regular expressions are documented at https://unicode-org.github.io/icu/us...gs/regexp.html

The regular expressions can be explained using https://regex101.com/

Thank you so much! Have a good Sunday!
Comment

Announcement

Extract numbers (with varying lengths) from variable label

Comment

Comment

Comment

Comment

Comment

Comment