Evaluating strings in local macro for multiple stubs to be used in the reshape

Konrad Zdeb

Join Date: Apr 2014

Posts: 496
#1

18 Feb 2015, 10:20

Hi,

I've two questions, one is practical and the other educational. I'm following The Problem with Reshape FAQ. I have a set of variables following the naming convention var_2001 var_2002 var2_20012002 var_20032004. I would like to use the modified version of the code below to obtain my stubs for the reshape:

Code:

local stubs : subinstr local vars "_" "", all

Ideally, I would like to delete everything after the _ sign but the code

Code:

local stubs : subinstr local vars "_*" "", all

won't work. My two questions are:
1. How do I delete everything after the _ sign (including the _ sign itself)?

2. What actually happens in subinstr local vars "_" "", all ? My first hunch would be to evaluate the macro with something like local newtext: substr(`text',1,length(`1')-3). Why use subinstr local vars instead?

Kind regards,
Konrad
Version: Stata/IC 13.1
Tags: macro, string
daniel klein

Join Date: Mar 2014

Posts: 3859
#2

18 Feb 2015, 10:41

I do not completely follow, but

Code:

subinstr loc vars "_" "", all

will change all _ characters in vars to an empty string (i.e. ""). This is documented.
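As a minimal sketch of that behaviour (the macro names stubs and unchanged are only illustrative, and the variable names are the ones from post #1), the extended function treats its search text literally. It has no wildcards, so "_*" is looked up as the two characters _ and *, which never occur in these names:

```stata
* variable names taken from the original question
local vars var_2001 var_2002 var2_20012002 var_20032004

* every "_" is replaced by an empty string
local stubs : subinstr local vars "_" "", all
display "`stubs'"

* "_*" is searched for literally, so nothing is replaced
local unchanged : subinstr local vars "_*" "", all
display "`unchanged'"
```

The first display should show var2001 var2002 var220012002 var20032004, and the second should show the list exactly as it went in.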

I agree that subinstr will probably not help here. A looping solution would be

Code:

* keep each name up to and including its first "_"
foreach x of loc vars {
    loc add = substr("`x'", 1, strpos("`x'", "_"))
    loc stubs `stubs' `add'
}
* drop duplicate stubs, preserving their order
loc stubs : list uniq stubs
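For what it is worth, here is a rough sketch of how such stubs might then be passed to reshape. The toy dataset, the id variable and the j() name period are assumptions made for illustration only. Because the loop keeps the trailing underscore, the suffixes that reshape extracts stay purely numeric:

```stata
* toy data: 3 observations, an assumed identifier and the four variables
clear
set obs 3
gen id = _n
foreach v in var_2001 var_2002 var2_20012002 var_20032004 {
    gen `v' = _n
}

* collect the variable names and build the stubs as in the loop above
unab vars : var*_*
local stubs
foreach x of local vars {
    local add = substr("`x'", 1, strpos("`x'", "_"))
    local stubs `stubs' `add'
}
local stubs : list uniq stubs

* stubs is now "var_ var2_"; period is an assumed name for the j variable
reshape long `stubs', i(id) j(period)
describe
```

With stubs such as var and var2 (no underscore) the suffixes would start with "_" or "2_", which is one reason keeping the underscore can be convenient here.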

Best
Daniel
Konrad Zdeb

Join Date: Apr 2014

Posts: 496
#3

20 Feb 2015, 07:32

Daniel,

Thank you for getting back to me. Your solution works as required. I drafted the post rather hastily (just before leaving work), hence not the best readability.

Kind regards,
Konrad
Version: Stata/IC 13.1

Nick Cox

Join Date: Mar 2014
Posts: 35708

20 Feb 2015, 08:02

You could do this in Mata too. It's kind of a bizarre exercise to do once, but it would be worth thinking about if you needed to program such things.

Let's suppose you have a dataset with variables var_2001 var_2002 var2_20012002 var_20032004 and you want distinct stubs.

Code:

clear 
set obs 1
foreach v in var_2001 var_2002 var2_20012002 var_20032004 {
     gen `v' = 1
}

mata

stata("unab vl : var*_*")

v = tokens(st_local("vl"))'

: v
                   1
    +-----------------+
  1 |       var_2001  |
  2 |       var_2002  |
  3 |  var2_20012002  |
  4 |   var_20032004  |
    +-----------------+

: substr(v, 1, strpos(v, "_") :-  1)
          1
    +--------+
  1 |   var  |
  2 |   var  |
  3 |  var2  |
  4 |   var  |
    +--------+

: uniqrows(substr(v, 1, strpos(v, "_") :-  1))
          1
    +--------+
  1 |   var  |
  2 |  var2  |
    +--------+

: st_local("vl", invtokens(uniqrows(substr(v, 1, strpos(v, "_") :-  1))'))

: end


Konrad Zdeb

Join Date: Apr 2014

Posts: 496
#5

20 Feb 2015, 08:23

Originally posted by Nick Cox

You could do this in Mata too. It's kind of a bizarre exercise to do once, but it would be worth thinking about if you needed to program such things.

It may come to that, as more and more often I work with data sets where the variable-naming convention is some_indicator_timeseries, otherindicator_timeseries, etc. If it were possible to fish out the last "_", such a program could be useful.

Kind regards,
Konrad
Version: Stata/IC 13.1

daniel klein

Join Date: Mar 2014
Posts: 3859

20 Feb 2015, 08:45

Well, if you apply strreverse() to the name first, strpos() will find the last instead of the first occurrence.

Here is a draft

Code:

pr mystubs ,rclass
    vers 12.1
    
    syntax varlist(num)
    
    m : MyStubs("`varlist'")
    loc stubs : list uniq stubs
    
    ret loc stubs `stubs'
end

vers 12.1
m :
void MyStubs(string rowvector nams)
{
    nams = strreverse(tokens(nams)')
    nams = substr(nams, strpos(nams, "_") :+ 1 , .)
    nams = strreverse(nams)
    st_local("stubs", invtokens(nams'))
}
end

Usage is simply

Code:

mystubs varlist
return list

uniqrows() will sort the names and I am not sure this is wanted.
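The same strreverse() idea can also be sketched in plain Stata, without Mata. This is only an illustration: the list of names and the macro names p and stub are made up for the example. The position of the first "_" in the reversed name gives the number of characters to cut from the end of the original name:

```stata
local vars some_indicator_timeseries otherindicator_timeseries var_2001 year
local stubs
foreach v of local vars {
    * position of the last "_", counted from the end of the name (0 if none)
    local p = strpos(strreverse("`v'"), "_")
    local stub `v'
    if `p' > 0 {
        * cut the last "_" and everything after it
        local stub = substr("`v'", 1, strlen("`v'") - `p')
    }
    local stubs `stubs' `stub'
}
* list uniq keeps the first occurrence of each stub and does not sort
local stubs : list uniq stubs
display "`stubs'"
```

Names without an underscore, such as year, pass through unchanged.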

Best
Daniel

Last edited by daniel klein; 20 Feb 2015, 09:07.


Konrad Zdeb

Join Date: Apr 2014

Posts: 496
#7

20 Feb 2015, 10:06

Daniel, thank you very much for this, I appreciate it.

Kind regards,
Konrad
Version: Stata/IC 13.1

Robert Picard

Join Date: Mar 2014
Posts: 1536

20 Feb 2015, 10:27

You can also use regex functions to process variable names. The following removes the last underscore and all text that follows:

Code:

local vlist var_2001 var_2002 var2_20012002 var_20032004 ///
    some_indicator_timeseries otherindicator_timeseries ///
    year whatif_
    
foreach v in `vlist' {
    local stubs  = "`stubs' " + regexr("`v'","_[^_]*$","")
}

loc stubs : list uniq stubs
dis "`stubs'"


Konrad Zdeb

Join Date: Apr 2014

Posts: 496
#9

23 Feb 2015, 08:43

Robert,

Thank you for your contribution. With respect to the syntax that you suggested, what does the "_[^_]*$" do?

Kind regards,
Konrad
Version: Stata/IC 13.1
Robert Picard

Join Date: Mar 2014

Posts: 1536
#10

23 Feb 2015, 09:45

The pattern breaks down to:

1. "_" is the literal character "_"

2. "[^_]" matches a single character that is not found within the bracket expression, i.e. any character that is not "_" (the "^" character indicates that you are matching characters other than those in the square brackets)

3. "*" modifies #2 to match zero or more characters

4. "$" requires that the match that satisfies #1 to #3 extend up to the last character of the string.

In plain English, this translates to: match an underscore and then zero or more characters that are not an underscore, up to the end of the string.

I could have used a simpler pattern like "_.*" to strip an underscore and what comes afterwards (a period is a wild card that will match any character) but

Code:

. dis regexr("some_indicator_timeseries","_.*","")
some

would remove more than you cared for. That's because regular expression matching is greedy and the ".*" will just match everything that follows the first underscore. To make the match non-greedy, you can, for single characters, match the character and then anything but the character. So in this example

Code:

. dis regexr("some_indicator_timeseries","_[^_]*","")
some_timeseries

which is still not what you were looking for. To force the match to start at the last underscore, you add an extra requirement that a successful match extend to the end of the string

Code:

. dis regexr("some_indicator_timeseries","_[^_]*$","")
some_indicator
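As a small sketch of how the anchored pattern treats the edge cases in the earlier list (a name with no underscore and a name ending in one), the loop below simply displays each name next to its stub. The macros r1 to r3 are only there to hold the results for the example:

```stata
* show each name next to the result of stripping its last "_" and what follows
foreach v in year whatif_ var2_20012002 {
    display "`v' -> " regexr("`v'", "_[^_]*$", "")
}

* the same results stored in illustrative macros
local r1 = regexr("year", "_[^_]*$", "")
local r2 = regexr("whatif_", "_[^_]*$", "")
local r3 = regexr("var2_20012002", "_[^_]*$", "")
```

A name without an underscore has no match, so regexr() returns it unchanged. A trailing underscore is matched together with zero following characters, so only the underscore is removed.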
Konrad Zdeb

Join Date: Apr 2014

Posts: 496
#11

24 Feb 2015, 03:46

Thanks very much for the helpful answer, it clarifies a lot.

Kind regards,
Konrad
Version: Stata/IC 13.1