  • is there a command that returns maximum and minimum length of a string variable?

    I'm looking for a Stata command that stores the maximum and minimum length of a string variable, as a shortcut to the long way, which would go like this:
    gen vnew = strlen(v1)
    egen smax = max(vnew)
    etc.

    I want to write syntax that looks like

    local smax = r(maxlength)
    local smin = r(minlength)

    I need these locals as indicators of correct or incorrect string variables, as part of a testing routine for a host of files.

    My search for such a command turned up nothing. Does someone out there know better?

    Last edited by Klaudia Erhardt; 09 Dec 2016, 06:34. Reason: Forgot to add tags

  • #2
    Well, for what it's worth here's a way to get it down to two lines:
    Code:
    egen smax = max(strlen(v1))
    local sm = smax[1]


    • #3
      I would replace the second part of your long way with

      Code:
      summarize vnew
      after which r(min) and r(max) are defined. If you have to do this repeatedly, you can write a simple wrapper

      Code:
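      program mystrlen
          version 12.1
          syntax varname(string)
          // tempvar (rather than tempname) so the helper variable is dropped on exit
          tempvar tmpvar
          quietly generate long `tmpvar' = strlen(`varlist')
          // summarize, meanonly stores r(min) and r(max)
          summarize `tmpvar' , meanonly
      end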
      Best
      Daniel
      Last edited by daniel klein; 09 Dec 2016, 06:45.


      • #4
        Hello Klaudia,

        I'm unsure if there's a command that automatically stores the information you want. I looked into the ado file for codebook, and it looks as if codebook does pretty much the same thing as what you call the "long way." It seems as if the long way is about as many lines of code as any other way.

        If you are concerned about generating new variables and want to keep all the calculations separate from the dataset, you could do something with levelsof and a loop.

        I messed around and made the following. But again, Stata's codebook command itself merely generates a temp var with the strlen(), and then finds the max [EDIT: What Daniel proposes in #3 is pretty much exactly how codebook does it]. Perhaps somebody else knows of a command that returns something in r()?

        Code:
        sysuse auto
        levelsof make
        loc smax = 0
        loc smin = strlen(make)
        foreach lvl in `r(levels)' {
            if strlen(`"`lvl'"')<`smin' {
                loc smin = strlen(`"`lvl'"')
            }
            if strlen(`"`lvl'"')>`smax' {
                loc smax = strlen(`"`lvl'"')
            }
        }
        di "Max: `smax'"
        di "Min: `smin'"


        • #5
          Hi Daniel, you are completely right (slaps forehead): the egen part of my "long way" is dispensable, because I can get the returned results from the strlen() variable, which is numeric.
          Sometimes one does not see the obvious.
          Thanks a lot!

          Also thanks to Mike Lacy for his optimizing idea.

          So it still seems that a direct return of the max and min string length is not available in Stata?


          • #6
            What do you mean by "direct return"? You can write a program that returns the information in r() or in a local or whatever. There is no (string) function available for this purpose and you cannot write functions in Stata. You can write functions in Mata.

            Best
            Daniel
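
A minimal sketch of the "write a program that returns the information in r()" idea from #6, built on the wrapper in #3. The program name strlenrange and the result names r(minlength) and r(maxlength) are hypothetical, chosen only to match the syntax asked for in #1, and the small input dataset is made up for illustration.

```stata
* Hypothetical wrapper: name and returned result names are illustrative only
capture program drop strlenrange
program define strlenrange, rclass
    version 12.1
    syntax varname(string)
    tempvar len
    quietly generate long `len' = strlen(`varlist')
    summarize `len', meanonly
    * copy the results of summarize into this program's own r() results
    return scalar minlength = r(min)
    return scalar maxlength = r(max)
end

* Made-up example data with string lengths 1, 3 and 5
clear
input str5 var1
"A"
"AAA"
"AAAAA"
end

strlenrange var1
local smin = r(minlength)
local smax = r(maxlength)
display "min: `smin'  max: `smax'"
```

Declaring the program rclass keeps the returned names under the caller's control, so a testing routine can read them the same way it reads the results of summarize.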


            • #7
              Hello Daniel, exactly, that was my question: is there a command (not a function) in Stata that returns the maximum and minimum length of a string variable?
              It seems there is not.
              The shortcut you posted in #3 for my "long version" is a good enough alternative, then.


              • #8
                In addition, a one-liner using Mata interactively from Stata can return a Stata local:
                Code:
                mata: st_local("smax", strofreal(max(strlen(st_sdata(.,"var1")))))
                More useful may be a similar use of the Mata function colminmax(), shown in the last code block below, to return a Stata matrix minmax containing the minimum and maximum length for all string variables. (The matrix elements can be referred to in expressions either by row and column number, or by row and column names using the rownumb() and colnumb() functions.)
                Code:
                clear
                input str5 var1 str10 var2 str15 var3
                "A" "BB" "CCC"
                "AA" "BBBB" "CCCCCC"
                "AAA" "BBBBBB" "CCCCCCCCC"
                "AAAA" "BBBBBBBB" "CCCCCCCCCCCC"
                "AAAAA" "BBBBBBBBBB" "CCCCCCCCCCCCCCC"
                end
                describe
                Code:
                mata: st_local("smax", strofreal(max(strlen(st_sdata(.,"var1")))))
                display "`smax'"
                Code:
                ds , has(type string) /* store macro r(varlist) */  
                mata: st_matrix("minmax", colminmax(strlen(st_sdata(.,st_varindex(tokens("`r(varlist)'")))))')
                matlist minmax
                display el(minmax,3,2)
                matrix rownames minmax = `r(varlist)'
                matrix colnames minmax = "min" "max"
                matlist minmax
                display el(minmax, rownumb(minmax, "var3"), colnumb(minmax, "max"))

                NB: The Mata function st_sdata() makes a copy of the data. If the data are large and memory is an issue, adapt the code to use st_sview(). Hints: the arguments to the function must be defined, and may be defined inside the function call, as in st_sview(nameofview="",.,.); have a look at -help mf_st_view-.
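
The st_sview() adaptation hinted at above might look roughly like this. It is a sketch under assumptions: the view name V, the scalar names smin and smax, and the three-row dataset are made up for illustration, and V is defined on its own line rather than inside the call.

```stata
* Made-up example data with string lengths 1, 3 and 5
clear
input str5 var1
"A"
"AAA"
"AAAAA"
end

mata:
    V = ""                           // the view variable must exist before the call
    st_sview(V, ., "var1")           // V is now a view onto var1, not a copy
    st_numscalar("smin", min(strlen(V)))
    st_numscalar("smax", max(strlen(V)))
    mata drop V                      // do not leave the view behind
end

display "min: " scalar(smin) "  max: " scalar(smax)
```

Because V is a view, no second copy of the string data is held in memory; only the vector of lengths produced by strlen() is allocated.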


                • #9
                  Also, less parenthetically,

                  Code:
                   
                   mata: st_numscalar("smax", max(strlen(st_sdata(.,"var1"))))


                  • #10
                    Thanks to everybody for their answers and explanations.

                    In my syntax the max and min length of string variables are only two of several indicators that are to be written to a data set created by my syntax. Most of those can be captured as an r-value of some command, like summarize. That was the reason I wanted to know if there is a command returning the max and min length of variables.

                    As it seems there is none, I decided on the solution of creating a (temp) variable using length() or strlen() and summarizing the numeric variable containing the length of the strings (see post #3). This best fits my already finished and well-working syntax, to which the max and min length of string variables is an extension.

                    Probably the Mata solutions are the better ones in terms of "good programming". To my shame, I have to admit that I have so far shied away from learning and using Mata.
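
For a testing routine that covers several string variables, the approach from #3 can be combined with the ds call from #8. This is only a sketch: the two-variable dataset and the local names min_var1, max_var1 and so on are made up for illustration, and how the values are then written to the results data set is left open.

```stata
* Made-up example data: var1 has lengths 1, 3, 5; var2 has lengths 2, 4, 6
clear
input str5 var1 str10 var2
"A"     "BB"
"AAA"   "BBBB"
"AAAAA" "BBBBBB"
end

ds, has(type string)                 // r(varlist) now lists the string variables
foreach v in `r(varlist)' {
    tempvar len
    quietly generate long `len' = strlen(`v')
    summarize `len', meanonly
    * keep the results in locals named after the variable
    local min_`v' = r(min)
    local max_`v' = r(max)
    drop `len'
    display "`v': min = `min_`v'', max = `max_`v''"
}
```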
