
  • I am trying to make something like -levelsof- that saves the levels of a variable not in a local, but in another variable.

    Good morning,

    I often use -levelsof- to loop through the levels of an unequally spaced variable. It is a fundamental loop that Nick Cox has described in various writings, e.g., Cox, N. J. 2002. Speaking Stata: How to face lists with fortitude. Stata Journal. 2: 202–222.
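
    As a minimal sketch of the loop in question (the choice of rep78 and price from the auto data is only for illustration), -levelsof- stores the distinct values in a local macro and -foreach- walks through them:

    Code:
    sysuse auto, clear
    * put the distinct nonmissing values of rep78 into the local macro levels
    levelsof rep78, local(levels)
    foreach l of local levels {
        * mean price within each level
        quietly summarize price if rep78 == `l'
        display "rep78 = `l': mean price = " r(mean)
    }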

    The problem with this type of loop is that with too many levels we run into Stata's limits on macro length. An easy solution is to save the levels not in a local, but in a new variable.

    Therefore I want to write a command that takes a variable ID and 1) preserves the current sort order of the data, and 2) creates a new variable LevelsOfID that contains the levels of ID in increasing order, with the first level in the first observation, the second level in the second observation, and the last (Nth) level in the Nth observation.

    About a week ago I wrote a double loop that gives me a headache when I read it now and, more importantly, fails in the following way: it calculates the levels, but it either messes up the original sort order or leaves the levels scattered all over the place. So my double loop violates either 1) or 2).

    Today I had a second go from scratch, and I am still failing. Here is my new attempt:

    Code:
    program define levelstovar, byable(onecall)
            version 11, missing
            syntax newvarname = /exp [if] [in] [, MISSing]
    
            tempvar thetag n var
            tempfile `thedata'
            quietly {
                    gen double `var' = `exp' `if' `in'
                    if "`byvars'"=="" gen `n' = _n
                    else by `byvars': gen `n' = _n
                    save `thedata', replace
                    keep `byvars' `var'        
                      egen `thetag' = tag(`byvars' `var') `if' `in', `missing'
                    keep if `thetag'
                    sort `byvars' `var'
                    if "`byvars'"=="" gen `n' = _n
                    else by `byvars': gen `n' = _n
                    ren `var' `varlist'
                    
                    merge 1:1 `n' using(`thedata'), nogen        
            
            } // Closes Quetly brace. 
    end
    This failed and I even managed to destroy the auto data on which I was trying it.

    Do you have suggestions for how to achieve what I want in some simple way? Alternatively, do you see why my program is causing so much damage and generally not doing what I want it to do?



  • #2
    Here's one approach:

    Code:
    //setup
    clear
    sysuse auto
    expand 5
    sort make
    //program
    cap program drop levelsvar
    program define levelsvar
        syntax varlist(min=1 max=1),GENerate(name)
        tempvar origsort
        quietly{
            gen `origsort' = _n
            preserve
            sort `varlist'
            levelsof `varlist',local(new`varlist')
            sort `origsort'
            gen `generate' = ""
            forval x = 1/`:word count `new`varlist'''{
                replace `generate' = "`:word `x' of `new`varlist'''" in `x'
            }
            cap destring `generate',replace
            tempfile tempfil
            save `tempfil'
            restore
            merge 1:1 `origsort' using `tempfil', nogen
        }
    end
    //test
    levelsvar make,gen(levelsmake)
    
    . list make levelsmake in 1/5
    
         +---------------------------+
         | make           levelsmake |
         |---------------------------|
      1. | AMC Concord   AMC Concord |
      2. | AMC Concord     AMC Pacer |
      3. | AMC Concord    AMC Spirit |
      4. | AMC Concord     Audi 5000 |
      5. | AMC Concord      Audi Fox |
         +---------------------------+



    • #3
      This defeats the purpose, Ali.

      I have nothing against having the results in a local, and I find the native -levelsof- reasonably fast and handy.

      But if we have too many levels, we run into Stata's limits when working through locals. Here is an example:

      Code:
      . set obs 1000000
      number of observations (_N) was 0, now 1,000,000
      
      . 
      . gen x = rnormal()
      
      
      
      . levelsvar x,gen(levelsmake)
      macro substitution results in line that is too long
          The line resulting from substituting macros would be longer than allowed.  The maximum allowed
          length is 645,216 characters, which is calculated on the basis of set maxvar.
      
          You can change that in Stata/SE and Stata/MP.  What follows is relevant only if you are using
          Stata/SE or Stata/MP.
      
          The maximum line length is defined as 16 more than the maximum macro length, which is currently
          645,200 characters.  Each unit increase in set maxvar increases the length maximums by 129.  The
          maximum value of set maxvar is 32,767.  Thus, the maximum line length may be set up to 4,227,159
          characters if you set maxvar to its largest value.
      macro substitution results in line that is too long
      r(920);
      
      .
      This is why I want to avoid the local in -levelsof-: it is not completely general, because it limits how many levels the variable can have.
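
      For reference, a small sketch of how these limits can be inspected. c(macrolen) and c(maxvar) are documented -creturn- values; the "+ 200" in the last line is only inferred from the numbers in the message above (129 per unit of maxvar, and 645,200 at what I assume is maxvar = 5000), not taken from the documentation:

      Code:
      * current maximum macro length and current maxvar
      display c(macrolen)
      display c(maxvar)
      * largest line length quoted above: 129*32767 + 200 + 16 = 4,227,159
      display %12.0fc (129*32767 + 200 + 16)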



      • #4
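        The other options you have:
        1. Add "sortpreserve" to the properties of the program.
        2. Use Mata to do the heavy lifting of sorting and identifying the levels.
        Here is an example of Mata code:
        Code:
        * read the variable into Mata
        mata: y = st_data(.,"headroom")
        * sort the values
        mata: y = sort(y,1)
        * first and last row of each run of equal values
        mata: info = panelsetup(y, 1)
        * keep the first row of each run, i.e. one row per level
        mata: y=y[info[,1],]
        * copy the levels back into a new variable y
        getmata y, force
        HTH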
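
        A possibly shorter variant of the Mata route, offered as a sketch: uniqrows() returns the sorted distinct rows in one step. The names lev and lev_headroom are made up for the example.

        Code:
        sysuse auto, clear
        * sorted distinct values of headroom as a Mata column vector
        mata: lev = uniqrows(st_data(., "headroom"))
        * copy the vector into a new variable; force allows fewer rows than observations
        getmata lev_headroom = lev, force
        list lev_headroom in 1/10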



        • #5
          I guess I misread the problem - here's another approach which seems to work and doesn't require levelsof:

          Code:
          program define levelsvar
              syntax varlist(min=1 max=1),GENerate(name)
              tempvar origsort
              quietly{
                  gen `origsort' = _n
                  preserve
                  sort `varlist'
                  collapse (first) `origsort',by(`varlist')
                  rename `varlist' `generate'
                  replace `origsort' = _n
                  tempfile tempfi
                  save `tempfi'
                  restore
                  merge 1:1 `origsort' using `tempfi',nogen
              }
          end
          Last edited by Ali Atia; 18 Apr 2021, 06:54.



          • #6
            Joro, here is an approach that uses frames and does what you want, with some minor features added (it can be modified along the lines of Ali's code to be compatible with earlier versions).

            The program preserves the original sort order and accepts if/in conditions. It always returns in r() the total number of observations, r(N), and the number of unique levels, r(r), analogous to -levelsof-. The levels are pre-sorted and merged back into the dataset in sequential order (from observation 1 to observation r(r)). By default the new variable is called -levels_<original_varname>-; it may optionally be given another name with -generate()-. String and numeric variables work just the same.

            Code:
            cap program drop makelevels
            program define makelevels, rclass sortpreserve
              version 16
              syntax varlist(min=1 max=1) [if] [in] [, GENerate(name)]
              marksample touse, strok
              unab myvar : `varlist'
             
              if "`generate'"=="" {
                  local newvarname = strtoname("levels_`myvar'")
              }
              else {
                  local newvarname `generate'
              }
             
              qui pwf
              local curframe `r(currentframe)'
             
              tempname Lvls
              tempfile fnlvls
              frame put `myvar' if `touse', into(`Lvls')
              frame change `Lvls'
              scalar N_obs = _N
              return scalar N = N_obs
             
              sort `myvar'
              qui by `myvar' : keep if _n==1
              scalar r_obs = _N
              return scalar r = r_obs
            
              rename `myvar' `newvarname'
              qui save `fnlvls', replace
            
              frame change `curframe'  
              qui merge 1:1 _n using `fnlvls', nogen
            end
            Example

            Code:
            sysuse auto
            
            makelevels rep78
            makelevels make, gen(auto_makes)
            makelevels make if strmatch(make, "Audi*"), gen(audi)
            list make rep78 levels_rep78 auto_makes audi in 1/5, abbrev(16)
            Results

            Code:
            . list make rep78 levels_rep78 auto_makes audi in 1/5, abbrev(16)
            
                 +----------------------------------------------------------------+
                 | make            rep78   levels_rep78   auto_makes    audi      |
                 |----------------------------------------------------------------|
              1. | AMC Concord         3              1   AMC Concord   Audi 5000 |
              2. | AMC Pacer           3              2   AMC Pacer     Audi Fox  |
              3. | AMC Spirit          .              3   AMC Spirit              |
              4. | Buick Century       3              4   Audi 5000               |
              5. | Buick Electra       4              5   Audi Fox                |
                 +----------------------------------------------------------------+
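
            One way such a variable might then be used, sketched under the assumption that -makelevels- above has been defined in the current session: loop over the first r(r) observations of the levels variable instead of over a macro. The local names k and lev are made up for the example.

            Code:
            sysuse auto, clear
            makelevels rep78
            * number of distinct levels returned by makelevels
            local k = r(r)
            forvalues i = 1/`k' {
                * the i-th level sits in observation i of levels_rep78
                local lev = levels_rep78[`i']
                quietly count if rep78 == `lev'
                display "rep78 = `lev': " r(N) " observations"
            }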



            • #7
              Note also valuesof (SSC)

              levelsof (briefly levels) is an official command that grew out of vallist (STB).

              STB-60 dm90 . . . . . . . . . . . . . Listing distinct values of a variable
              (help vallist if installed) . . . . . . . . . . . . . . . . N. J. Cox
              3/01 pp.8--11; STB Reprints Vol 10, pp.46--49
              see levels command incorporated into Stata 8.0


              Subsequently Patrick Joly was keen to take vallist further in different directions and I gave the command name to him, although that public thread (see SSC) peters out in 2003.
              Almost certainly, the ancient history here, in terms of my own code, is some mix of ideas and questions that appeared on Statalist and what I wanted for myself. The modern history is that StataCorp have rewritten the internals of levelsof quite drastically.

              My version of history -- the original Statalist posts in the late 1990s and so on have long since disappeared -- runs that the motivations were twofold, displaying a concise list of distinct values and producing a list that could be used for looping.

              Although levelsof has been quite popular it wasn't really my intention directly that it be used to cycle over a very long list of levels, especially if the levels are those of a string variable implying a need to worry about quotation marks, and so on.

              As at https://www.stata.com/support/faqs/d...-with-foreach/ I would stress rather using egen's group() function.
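
              A minimal sketch of that alternative, with the variable name g and the choice of rep78 and price being mine: group() maps the distinct values to the integers 1, 2, ..., so the loop runs over a simple range and no list of levels is ever held in a macro.

              Code:
              sysuse auto, clear
              * integer codes 1, 2, ... for the distinct nonmissing values of rep78
              egen long g = group(rep78)
              summarize g, meanonly
              * the range is fixed when the loop starts, so reusing r() inside is safe
              forvalues i = 1/`r(max)' {
                  quietly summarize price if g == `i', meanonly
                  display "group `i': mean price = " %9.2f r(mean)
              }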

              All that said, is this closer to what you want?

              Code:
              *! 1.0.0 NJC 18 April 2021
              program olevels, sort  
                  version 8.2
                  syntax varname [if] [in] , GENerate(name) [by(varlist)]
                  
                  marksample touse
                  quietly count if `touse'
                  if r(N) == 0 error 2000
                  
                  tempvar obsno
                  gen long `obsno' = _n
                  bysort `by' `varlist' (`obsno') : gen `generate' = `varlist' if _n == 1
              end
              Code:
              . sysuse auto, clear
              (1978 Automobile Data)
              
              . olevels rep78, gen(orep78)
              (69 missing values generated)
              
              . l orep78 if !missing(orep78)
              
                   +--------+
                   | orep78 |
                   |--------|
                1. |      3 |
                5. |      4 |
               12. |      2 |
               20. |      5 |
               40. |      1 |
                   +--------+
              
              . olevels rep78, gen(orep78_2) by(foreign)
              (66 missing values generated)
              
              . l foreign rep78 orep78_2 if !missing(orep78_2)
              
                   +-----------------------------+
                   |  foreign   rep78   orep78_2 |
                   |-----------------------------|
                1. | Domestic       3          3 |
                5. | Domestic       4          4 |
               12. | Domestic       2          2 |
               20. | Domestic       5          5 |
               40. | Domestic       1          1 |
                   |-----------------------------|
               53. |  Foreign       5          5 |
               54. |  Foreign       3          3 |
               55. |  Foreign       4          4 |
                   +-----------------------------+
              The program could be extended so that the non-missing values are in the first few observations -- but not aligned with the original data, or you could just sort the non-missing values to the start, thus keeping the alignment with the original data (which I would recommend much more strongly).

              NOTE: I started this before a meal when only #1 was visible and finished it afterwards. So I must now read #2 to #6.
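
              A sketch of the second suggestion, assuming -olevels- above has been defined and that "to the start" means sorting on the new variable; obsno is a made-up name used to get back to the original order afterwards:

              Code:
              sysuse auto, clear
              olevels rep78, gen(orep78)
              * remember the original order
              gen long obsno = _n
              * nonmissing levels come first, in increasing order; each stays on its own observation
              sort orep78
              list rep78 orep78 in 1/5
              * restore the original order when done
              sort obsno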



              • #8
                In the original program, in the line

                Code:
                  tempfile `thedata'
                the local macro `thedata' has not been defined, so the tempfile name is empty. As a result, -save `thedata', replace- becomes -save, replace-, which overwrites the original data.
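
                A small sketch of the difference, to be run from a do-file; it only displays the macro contents and saves nothing:

                Code:
                * thedata has not been defined yet, so `thedata' expands to nothing
                display "[`thedata']"
                * correct form: give the bare name; tempfile then defines the local
                tempfile thedata
                * now `thedata' holds the path of a temporary file
                display "[`thedata']"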




                • #9
                  Nick, I managed to achieve what you have done with a horrendous double loop.

                  However, your code suffers from the same defect as mine: it leaves the levels of the variable scattered all over the place. You preserved the original ordering, which is one condition I wanted to satisfy, but I also want the levels placed in the new variable as follows: the lowest level in the 1st observation, the 2nd level in the 2nd observation, etc.

                  Do you see any way to change your very nice and concise code to achieve this goal too, that is, to preserve the sort order and to have the levels neatly placed in observations 1, 2, 3, etc.?



                  Originally posted by Nick Cox (#7)



                  • #10
                    Thank you very much, Scott, for spotting this! Of course I did not mean what I wrote. I just meant

                    Code:
                    tempfile thedata
                    Thank you for letting me know how I destroyed the auto data which I was using.

                    Originally posted by Scott Merryman (#8)



                    • #11
                      I made my program work.

                      There were two silly things in the program that were derailing the whole business. First, as Scott pointed out, -tempfile `thedata'- was wrong. Second, I do not know where I got the syntax
                      -merge 1:1 var using(`thedata')-
                      from, but it is wrong too: the filename after -using- must not be in parentheses.
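
                      For comparison, a minimal sketch of the form -merge- expects, with the key variable(s) before -using- and the filename after it without parentheses. The names extra and price_copy are made up for the example.

                      Code:
                      sysuse auto, clear
                      tempfile extra
                      * build a small file keyed on make, which is unique in the auto data
                      preserve
                      keep make price
                      rename price price_copy
                      save `extra'
                      restore
                      * key variable before -using-, filename after it
                      merge 1:1 make using `extra', nogenerate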

                      So this is the version of my program that works:

                      Code:
                      program define levelstovar, byable(onecall)
                              version 11, missing
                              syntax newvarname = /exp [if] [in] [, MISSing]

                              tempvar thetag n var
                              tempfile thedata
                              quietly {
                                      gen double `var' = `exp' `if' `in'
                                      if "`byvars'"=="" gen `n' = _n
                                      else by `byvars': gen `n' = _n
                                      save `thedata', replace
                                      keep `byvars' `var'
                                      egen `thetag' = tag(`byvars' `var') `if' `in', `missing'
                                      keep if `thetag'
                                      sort `byvars' `var'
                                      if "`byvars'"=="" gen `n' = _n
                                      else by `byvars': gen `n' = _n
                                      ren `var' `varlist'

                                      merge 1:1 `n' using `thedata', nogen

                              } // Closes quietly brace.
                      end
                      The two lines that previously contained errors, the -tempfile- line and the -merge- line, were shown in red in the original post.



                      • #12
                        Dear Joro Kolev,

                        Do you mind providing an example of how to use your code (e.g., with the 'auto' dataset) to generate a variable containing the values of the typical macro that can be created using 'levelsof'?

                        Thank you.



                        • #13
                          Otavio, what you (seem to) want is exactly what Ali showed in his post #2.

                          Try Ali's programme in #2, I tried it, and it works fine.

                          What I am doing is different: I do not want to go through macros, because that is how I hit Stata's limits. The typical use of my program is when a variable has too many levels to fit into a macro.

                          On the other hand, Ali's code in #2 does what you describe: it converts the contents of a macro into the values of a variable.

                          Originally posted by Otavio Conceicao (#12)



                          • #14
                            Otavio Conceicao, the kernel of Ali's code in #2 is a loop of this type:

                            Code:
                            . sysuse auto
                            (1978 Automobile Data)
                            
                            . levelsof rep, local(lvrep)
                            1 2 3 4 5
                            
                            . gen levelsvar = .
                            (74 missing values generated)
                            
                            . local numobs = 1
                            
                            . foreach l of local lvrep {
                              2. replace levelsvar = `l' in `numobs'
                              3. local ++numobs
                              4. }
                            (1 real change made)
                            (1 real change made)
                            (1 real change made)
                            (1 real change made)
                            (1 real change made)
                            
                            . list rep levelsvar in 1/10, sep(0)
                            
                                 +------------------+
                                 | rep78   levels~r |
                                 |------------------|
                              1. |     3          1 |
                              2. |     3          2 |
                              3. |     .          3 |
                              4. |     3          4 |
                              5. |     4          5 |
                              6. |     3          . |
                              7. |     .          . |
                              8. |     3          . |
                              9. |     3          . |
                             10. |     3          . |
                                 +------------------+
                            which seems to be what you want, simply to transfer the content of a macro to a variable.



                            • #15
                              I already answered #9 in #7 to some extent:

                              "The program could be extended so that the non-missing values are in the first few observations -- but not aligned with the original data, or you could just sort the non-missing values to the start, thus keeping the alignment with the original data (which I would recommend much more strongly)."

                              I can't see that your desiderata are in general consistent.

                              More crucially, I don't yet see how such a variable would be used in a way that helps more than any existing approach.

