  • Evaluating strings in local macro for multiple stubs to be used in the reshape

    Hi,

    I have two questions, one practical and one educational. I'm following The Problem with Reshape FAQ. I have a set of variables following the naming convention var_2001 var_2002 var2_20012002 var_20032004. I would like to use a modified version of the code below to obtain my stubs for the reshape:
    Code:
    local stubs : subinstr local vars "_" "", all
    Ideally, I would like to delete everything after the _ sign but the code
    Code:
    local stubs : subinstr local vars "_*" "", all
    won't work. My two questions are:
    1. How to delete everything after the _ sign (including the _ sign)?
    2. What actually happens in subinstr local vars "_" "", all? My first hunch would be to evaluate a macro such as local newtext: substr(`text',1,length(`1')-3). Why use subinstr local vars instead?
    Kind regards,
    Konrad
    Version: Stata/IC 13.1

  • #2
    I do not completely follow, but

    Code:
    subinstr loc vars "_" "", all
    will change all _ characters in vars to an empty string (i.e. ""). This is documented.
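
    A minimal sketch of that behavior, using the variable names from the question as assumed macro content: subinstr treats its search text literally, so "_*" is not read as a wildcard, and since no name contains the two characters "_*", nothing is replaced.

    ```stata
    // assumed example list, copied from the question
    local vars var_2001 var_2002 var2_20012002 var_20032004

    // every literal "_" is replaced by nothing
    local stubs : subinstr local vars "_" "", all
    display "`stubs'"
    // var2001 var2002 var220012002 var20032004

    // "_*" is searched for as literal text, is not found, and vars comes back unchanged
    local stubs2 : subinstr local vars "_*" "", all
    display "`stubs2'"
    // var_2001 var_2002 var2_20012002 var_20032004
    ```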

    I agree that subinstr will probably not help here. A looping solution would be

    Code:
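    foreach x of loc vars {
        // keep everything up to and including the first "_"
        loc add = substr("`x'", 1, strpos("`x'", "_"))
        loc stubs `stubs' `add'
    }
    // drop duplicate stubs, keeping the order of first appearance
    loc stubs : list uniq stubs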
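
    As a rough sketch of how stubs built this way might feed into reshape, assuming a toy dataset with the variable names from the question: the identifier id and the new variable period are hypothetical names chosen for illustration. The stubs keep their trailing underscore (var_ and var2_), so the text after the underscore becomes the value of period.

    ```stata
    // toy data; id and period are hypothetical names, not from the thread
    clear
    set obs 3
    gen id = _n
    foreach v in var_2001 var_2002 var2_20012002 var_20032004 {
        gen `v' = _n
    }

    // collect the wide variables and derive the stubs as in the loop above
    unab vars : var*_*
    local stubs
    foreach x of local vars {
        local add = substr("`x'", 1, strpos("`x'", "_"))
        local stubs `stubs' `add'
    }
    local stubs : list uniq stubs

    // stubs now holds: var_ var2_
    reshape long `stubs', i(id) j(period)
    ```

    Combinations of stub and period that do not exist in the wide data are filled with missing values in the long layout.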


    Best
    Daniel


    • #3
      Daniel,

      Thank you for getting back to me. Your solution works as required. I drafted the post rather hastily (just before leaving work), hence the poor readability.
      Kind regards,
      Konrad
      Version: Stata/IC 13.1


      • #4
        You could do this in Mata too. It's kind of a bizarre exercise to do once, but it would be worth thinking about if you needed to program such things.

        Let's suppose you have a dataset with variables var_2001 var_2002 var2_20012002 var_20032004 and you want distinct stubs.

        Code:
        clear 
        set obs 1
        foreach v in var_2001 var_2002 var2_20012002 var_20032004 {
             gen `v' = 1
        }
        
        mata
        
        : stata("unab vl : var*_*")
        
        : v = tokens(st_local("vl"))'
        
        : v
                           1
            +-----------------+
          1 |       var_2001  |
          2 |       var_2002  |
          3 |  var2_20012002  |
          4 |   var_20032004  |
            +-----------------+
        
        : substr(v, 1, strpos(v, "_") :-  1)
                  1
            +--------+
          1 |   var  |
          2 |   var  |
          3 |  var2  |
          4 |   var  |
            +--------+
        
        : uniqrows(substr(v, 1, strpos(v, "_") :-  1))
                  1
            +--------+
          1 |   var  |
          2 |  var2  |
            +--------+
        
        : st_local("vl", invtokens(uniqrows(substr(v, 1, strpos(v, "_") :-  1))'))
        
        : end


        • #5
          Originally posted by Nick Cox
          You could do this in Mata too. It's kind of a bizarre exercise to do once, but it would be worth thinking about if you needed to program such things.
          It may come to that, as I increasingly work with data sets whose variable names follow patterns such as some_indicator_timeseries, otherindicator_timeseries, etc. If it were possible to fish out the last "_", such a program could be useful.

          Kind regards,
          Konrad
          Version: Stata/IC 13.1


          • #6
            Well, if you reverse each name with strreverse() first, strpos() will find the last instead of the first occurrence of "_".

            Here is a draft


            Code:
            pr mystubs ,rclass
                vers 12.1
                
                syntax varlist(num)
                
                m : MyStubs("`varlist'")
                loc stubs : list uniq stubs
                
                ret loc stubs `stubs'
            end
            
            vers 12.1
            m :
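            void MyStubs(string rowvector nams)
            {
                // split the varlist into a row vector of names and reverse each name
                nams = strreverse(tokens(nams))
                // drop everything up to and including the first "_" of the reversed
                // name, i.e. the last "_" of the original name
                nams = substr(nams, strpos(nams, "_") :+ 1 , .)
                nams = strreverse(nams)
                st_local("stubs", invtokens(nams))
            }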
            end
            Usage is simply

            Code:
            mystubs varlist
            return list
            uniqrows() will sort the names and I am not sure this is wanted.

            Best
            Daniel
            Last edited by daniel klein; 20 Feb 2015, 09:07.


            • #7
              Daniel, thank you very much for this, I appreciate it.
              Kind regards,
              Konrad
              Version: Stata/IC 13.1


              • #8
                You can also use regex functions to process variable names. The following removes the last underscore and all text that follows it:

                Code:
                local vlist var_2001 var_2002 var2_20012002 var_20032004 ///
                    some_indicator_timeseries otherindicator_timeseries ///
                    year whatif_
                    
                foreach v in `vlist' {
                    local stubs  = "`stubs' " + regexr("`v'","_[^_]*$","")
                }
                
                loc stubs : list uniq stubs
                dis "`stubs'"


                • #9
                  Robert,

                  Thank you for your contribution. With respect to the syntax that you suggested, what does the "_[^_]*$" do?
                  Kind regards,
                  Konrad
                  Version: Stata/IC 13.1


                  • #10
                    The pattern breaks down to:
                    1. "_" is the literal character "_"
                    2. "[^_]" matches a single character that is not found within the bracket expression, i.e. any character that is not "_" (the "^" character indicates that you are matching characters other than those in the square brackets)
                    3. "*" modifies #2 to match zero or more characters
                    4. "$" requires that the match that satisfies #1 to #3 extend up to the last character of the string.
                    In plain English: match an underscore followed by zero or more characters that are not an underscore, up to the end of the string.

                    I could have used a simpler pattern like "_.*" to strip an underscore and what comes afterwards (a period is a wild card that will match any character) but

                    Code:
                    . dis regexr("some_indicator_timeseries","_.*","")
                    some
                    would remove more than you cared for. That's because regular expression matching is greedy and the ".*" will just match everything that follows the first underscore. To make the match non-greedy, you can, for single characters, match the character and then anything but the character. So in this example

                    Code:
                    . dis regexr("some_indicator_timeseries","_[^_]*","")
                    some_timeseries
                    which is still not what you were looking for. To force the match to start at the last underscore, you add an extra requirement that a successful match extend to the end of the string

                    Code:
                    . dis regexr("some_indicator_timeseries","_[^_]*$","")
                    some_indicator
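
                    A few further checks may help, sketched here with names taken from the list in #8. The first two show the edge cases (no underscore at all, and a trailing underscore). The last one is an assumed variation, not part of the original suggestion: it uses regexm() and regexs() with the hypothetical pattern "^(.*)_([^_]*)$" to capture the stub and the suffix separately.

                    ```stata
                    // a name without an underscore does not match and is returned unchanged
                    display regexr("year", "_[^_]*$", "")
                    // year

                    // a trailing underscore matches with zero characters after it
                    display regexr("whatif_", "_[^_]*$", "")
                    // whatif

                    // capture both parts; the greedy ".*" backtracks to the last "_"
                    if regexm("some_indicator_timeseries", "^(.*)_([^_]*)$") {
                        display regexs(1)
                        display regexs(2)
                    }
                    // some_indicator
                    // timeseries
                    ```

                    Capturing the suffix as well can be handy when the part after the last underscore is needed later, for example to check which values the j variable of a reshape would take.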


                    • #11
                      Thanks very much for the helpful answer, it clarifies a lot.
                      Kind regards,
                      Konrad
                      Version: Stata/IC 13.1
