  • Find min/max length of variable

    Dearest StataList

    I am having trouble finding the maximum and minimum length of every variable in a data set without creating new variables. I have a very large data set containing both string and numeric variables. I know that egen can do this with max(strlen(x)) for string variables, or max(strlen(string(x))) for numeric variables. However, I would like to avoid both creating new variables (because of the time it takes) and converting numeric variables to strings.

    Right now I have this piece of code working, but it takes a long time to run:

    foreach x of local xvars {
    cap egen maxl_temp = max(strlen(string(`x')))
    cap egen minl_temp = min(strlen(string(`x')))
    cap egen maxl_temp = max(strlen(`x'))
    cap egen minl_temp = min(strlen(`x'))
    local maxl = maxl_temp[1]
    local minl = minl_temp[1]

    cap drop maxl_temp minl_temp
    }


    Has anyone encountered this problem before and found a better, more efficient solution?

    Thank you!



  • #2
    This code is puzzling.

    Part of that code puts results in local macros, but each time around the loop you overwrite your existing local. So, the end results are just the results for the last variable processed.
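
    To illustrate the overwriting point, here is a minimal sketch that keeps one pair of locals per variable instead of reusing the same two names. The local names min_`v' and max_`v' and the scratch variable work are my own choices for illustration, and the sketch assumes variable names short enough that the prefixed macro names stay within Stata's limit on local macro names.

    Code:
    sysuse auto, clear

    // one scratch variable, reused for every variable processed
    gen work = .

    ds, has(type string)

    quietly foreach v in `r(varlist)' {
        replace work = length(`v')
        summarize work, meanonly
        // the variable name is part of the macro name, so nothing is overwritten
        local min_`v' = r(min)
        local max_`v' = r(max)
    }

    display "make: `min_make' to `max_make'"

    drop work

    After the loop, each variable's results remain available under its own macro names.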

    (*) The length of the string representation of a numeric variable is dependent on the format used in conversion. That may or may not bite you.

    (**) The length of the string representation of a numeric variable has nothing to do with storage or memory, if that is your concern.
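
    A quick way to see proviso (*) in action is to convert the same number with different formats and compare lengths. The number and the formats below are arbitrary choices for illustration only.

    Code:
    // default conversion, no format given
    display length(string(1234.5678))

    // same number, explicit formats
    display length(string(1234.5678, "%9.2f"))
    display length(string(1234.5678, "%12.0g"))

    If the lengths differ for your data, you will need to decide which format defines "length" for a numeric variable before comparing results.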

    Otherwise the minimum and maximum lengths of a string variable are both constants, and there is no obvious gain in putting either in a variable here, even briefly; and if you do that then using egen to do it is a very slow method. egen is flexible but you don't need any of the flexibility here, and its flexibility just implies replacing one line of code to be interpreted with many more.

    I don't know what you want to do with these results, but this code displays the minimum and maximum length of each variable with the provisos flagged (* **). Sure, that is not what you want, but I don't know what you do want. All your code shows is the range of lengths of the last variable you look at.

    Code:
    sysuse auto, clear
    
    // one scratch variable, reused for every variable
    gen work = .
    
    // string variables: length of the stored string
    ds, has(type string)
    
    quietly foreach v in `r(varlist)' {
        replace work = length(`v')
        su work, meanonly
        noisily display "`v'{col 36}"  r(min)  "{col 42}" r(max)
    }
    
    // numeric variables: length of the default string representation
    ds, has(type numeric)
    
    quietly foreach v in `r(varlist)' {
        replace work = length(string(`v'))
        su work, meanonly
        noisily display "`v'{col 36}"  r(min)  "{col 42}" r(max)
    }
    Output positions may need a tweak.
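
    If the aim is to avoid even the one scratch variable, the same calculation can be sketched in Mata. This is only an illustration under assumptions: it uses make and price from the auto data as examples, it picks "%12.0g" as the conversion format for the numeric variable (see proviso (*)), and st_sdata() and st_data() copy the column into Mata, so memory use on a very large data set should be checked.

    Code:
    sysuse auto, clear

    mata:
    // string variable: lengths computed directly, no new Stata variable
    lens = strlen(st_sdata(., "make"))
    printf("make: min %g, max %g\n", min(lens), max(lens))

    // numeric variable: convert with an explicit format, then measure
    lens = strlen(strofreal(st_data(., "price"), "%12.0g"))
    printf("price: min %g, max %g\n", min(lens), max(lens))
    end

    Whether this is faster than the loop above on your data is something to time rather than assume.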
