  • Counting used categorical variables

    Hello everyone.
    I'm using a categorical variable with the -i.- prefix. This categorical variable has over 3000 different categories.
    In the regression, probably many of those categories are ruled out.
    So, I would like to know if there is a (direct) way to count how many categories are used in the regression.

    For instance:

    Code:
    reg y x1 i.x2
    Suppose x2 has 3000 categories, but many of them have too few observations, so Stata rules them out and ends up using only 1500.
    How can I retrieve that number (1500)?
    Many thanks in advance.

  • #2
    I couldn't find a direct approach, but the example code below demonstrates a hack that you may be able to adapt to your purposes. If someone knows a less embarrassing approach I'd be glad to learn it.
    Code:
    sysuse auto, clear
    regress price length mpg i.rep78
    testparm i.rep78
    return list
    local n_rep = r(df)
    display "categories used: `n_rep'"
    Code:
    . testparm i.rep78
    
     ( 1)  2.rep78 = 0
     ( 2)  3.rep78 = 0
     ( 3)  4.rep78 = 0
     ( 4)  5.rep78 = 0
    
           F(  4,    62) =    1.12
                Prob > F =    0.3560
    
    . return list
    
    scalars:
                   r(drop) =  0
                   r(df_r) =  62
                      r(F) =  1.11834110850705
                     r(df) =  4
                      r(p) =  .3560277203360988
    
    . local n_rep = r(df)
    
    . display "categories used: `n_rep'"
    categories used: 4
    
    .
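
    A variant that skips -testparm-, offered as a sketch rather than a tested solution. It assumes that -_b[level.varname]- returns 0 for the base level and for any level omitted because of collinearity, and that no level has an estimated coefficient of exactly 0. The local names levels and used are only illustrative.
    Code:
    sysuse auto, clear
    regress price length mpg i.rep78

    // levels of rep78 that occur in the estimation sample
    quietly levelsof rep78 if e(sample), local(levels)

    // count the levels whose coefficient is not exactly zero
    local used = 0
    foreach l of local levels {
        if _b[`l'.rep78] != 0 local ++used
    }
    display "categories with a nonzero coefficient: `used'"
    On this reading the count should agree with r(df) from -testparm-, since both skip the base level and any omitted levels.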


    • #3
      Originally posted by William Lisowski
      I couldn't find a direct approach, but the example code below demonstrates a hack that you may be able to adapt to your purposes. If someone knows a less embarrassing approach I'd be glad to learn it.
      [code and output omitted; see post #2]
      It's a smart workaround, though.
      Thank you William Lisowski for the idea.


      • #4
        Here is a less embarrassing approach (assuming that at least 1 case per category is "enough" for -reg-):
        Code:
        // install -fre- from SSC if necessary:
        cap which fre
        if _rc ssc install fre
        
        sysuse auto, clear
        
        * ------------------------------------------------------------------------------
        * Example 1:
        
        qui fre rep78
        di _n as txt "Number of categories of rep78 to be used by -reg-: " as res `r(r_valid)'-1
        
        * ------------------------------------------------------------------------------
        * Example 2 (because -reg- will use listwise deletion of missing cases):
        
        replace foreign = . if rep78==1
        mark valid
        markout valid price i.rep78 foreign  // use all variables used in -reg-
        
        qui fre rep78 if valid
        di _n as txt "Number of categories of rep78 to be used by -reg-: " as res `r(r_valid)'-1
        Result of example 1:
        Code:
        . qui fre rep78
        . di _n as txt "Number of categories of rep78 to be used by -reg-: " as res `r(r_valid)'-1
        
        Number of categories of rep78 to be used by -reg-: 4
        Result of example 2:
        Code:
        . qui fre rep78 if valid
        . di _n as txt "Number of categories of rep78 to be used by -reg-: " as res `r(r_valid)'-1
        
        Number of categories of rep78 to be used by -reg-: 3
        Last edited by Dirk Enzmann; 20 May 2021, 16:47.
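
        For readers who cannot install -fre-, the same a priori count can be sketched with official commands only. This is a minimal sketch under stated assumptions: it runs the regression first and relies on e(sample) instead of -mark-/-markout-, and the variable name first is hypothetical.
        Code:
        sysuse auto, clear
        replace foreign = . if rep78==1
        regress price foreign i.rep78

        // tag one observation per level of rep78 within the estimation sample
        egen byte first = tag(rep78) if e(sample)
        quietly count if first == 1

        // subtract 1 for the base category
        display "levels of rep78 in the estimation sample, minus the base: " r(N) - 1
        drop first
        Like the -fre- version, this counts levels that are present in the estimation sample. It does not detect levels that -regress- later omits because of collinearity.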


        • #5
          I based my hack on the statement in post #1 that "In the regression, probably many of those categories are ruled out."
          The question is which number is wanted: the number of categories left once some are ruled out a priori, or the number that actually made it through the regression and had a coefficient estimated?

          Post #4 produces the first number, while post #2 produces the second number.

          Here's a variant of my previous example, with the addition of a variable constructed to be collinear with one of the categories, and thus not excluded until the regress command starts doing the math.
          Code:
          sysuse auto, clear
          generate collinear = rep78==3
          regress price length mpg collinear i.rep78
          testparm i.rep78
          local n_rep = r(df)
          display "categories used: `n_rep'"
          
          mark valid
          markout valid length mpg collinear i.rep78
          qui fre rep78 if valid
          di _n as txt "Number of categories of rep78 to be used by -reg-: " as res `r(r_valid)'-1
          Code:
          . regress price length mpg collinear i.rep78
          note: 3.rep78 omitted because of collinearity
          
          [results omitted]
          
          . testparm i.rep78
          
           ( 1)  2.rep78 = 0
           ( 2)  4.rep78 = 0
           ( 3)  5.rep78 = 0
          
                 F(  3,    62) =    1.44
                      Prob > F =    0.2400
          
          . local n_rep = r(df)
          
          . display "categories used: `n_rep'"
          categories used: 3
          
          . 
          . mark valid
          
          . markout valid length mpg collinear i.rep78
          
          . qui fre rep78 if valid
          
          . di _n as txt "Number of categories of rep78 to be used by -reg-: " as res `r(r_valid)'-1
          
          Number of categories of rep78 to be used by -reg-: 4
          
          .


          • #6
            Originally posted by William Lisowski
            The question is which number is wanted: the number of categories left once some are ruled out a priori, or the number that actually made it through the regression and had a coefficient estimated?
            [code and output omitted; see post #5]
            That's true. I'm interested in the number of categories post-estimation.
            I have already tried William's approach, and it works just fine, even though it is an indirect method.
            Thank you both for helping me out with this issue.


            • #7
              Just out of curiosity I tried to find a "direct" solution (i.e. without running -testparm-). Here is another alternative that uses more code but could serve as part of a program that reports not only how many categories were omitted but also which ones:
              Code:
              sysuse auto, clear
              generate collinear = rep78==3
              regress price length mpg collinear i.rep78
              
              local vnames : rownames(e(V))
              local k : word count `vnames'
              local ncat = 0
              local ocat = 0
              local catv = "rep78"
              local comit = ""
              forvalues i = 1/`=`k'-1' {
                 local vname : word `i' of `vnames'
                 if strpos("`vname'",".`catv'") > 0 {
                    local ++ncat
                    if strpos("`vname'","o.`catv'") > 0 {
                       local ++ocat
                       local comit = "`comit' " + substr("`vname'",1,strpos("`vname'","o.")-1)
                    }
                 }
              }
              di as res `ocat' as txt " out of " as res `ncat' as txt " categories of " as res "`catv'" as txt " omitted"
              di as txt "omitted categories: " as res "`comit'"
              The result is:
              Code:
              . di as res `ocat' as txt " out of " as res `ncat' as txt " categories of " as res "`catv'" as txt " omitted"
              1 out of 5 categories of rep78 omitted
              
              . di as txt "omitted categories: " as res "`comit'"
              omitted categories:  3
              Last edited by Dirk Enzmann; 22 May 2021, 21:49.
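
              The same idea can be turned around to count the categories that were kept rather than those omitted. The following is only a sketch: it assumes that a kept level appears in the column names of e(b) as a bare "<level>.<varname>", without the b (base) or o (omitted) operator, and it ignores interactions and time series operators. The local names catv, names and used are illustrative.
              Code:
              sysuse auto, clear
              generate collinear = rep78==3
              regress price length mpg collinear i.rep78

              local catv rep78
              local names : colnames e(b)
              local used = 0
              foreach name of local names {
                 // match e.g. 2.rep78, but not 1b.rep78 or 3o.rep78
                 if regexm("`name'", "^[0-9]+[.]`catv'$") local ++used
              }
              di as res `used' as txt " categories of " as res "`catv'" as txt " have an estimated coefficient"
              If the assumption about the column names holds, the count should match r(df) from -testparm i.rep78- in post #5.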


              • #8
                Following on post #7, this "direct" solution goes a little further, including the base category among those omitted. But I note that it, like the code in post #7, fails to account for the possibility of time series operators applied to a categorical variable.
                Code:
                sysuse auto, clear
                generate collinear = rep78==3
                regress price length mpg collinear i.rep78
                local catv rep78
                
                local omit : rownames(e(V))
                // remove non-categorical variables
                local omit = ustrregexra(`" `omit' "',`"\b(?<=[^\.])[A-Za-z_][A-Za-z0-9_]+\b"'," ")
                // remove non-omitted categories
                local omit = ustrregexra(`"`omit'"',`"\d+\.[A-Za-z][A-Za-z0-9\_]+"',"")
                display "omitted categories: `omit'"
                Code:
                . display "omitted categories: `omit'" 
                omitted categories:        1b.rep78  3o.rep78
                My earlier code in post #5 does deal successfully with time series operators, however.
                Code:
                sysuse auto, clear
                generate time = _n
                tsset time
                generate collinear = l.rep78==3
                regress price length mpg collinear li.rep78
                testparm li.rep78
                local n_rep = r(df)
                display "categories used: `n_rep'"
                Code:
                . testparm li.rep78
                
                 ( 1)  2L.rep78 = 0
                 ( 2)  4L.rep78 = 0
                 ( 3)  5L.rep78 = 0
                
                       F(  3,    61) =    0.16
                            Prob > F =    0.9256
                
                . local n_rep = r(df)
                
                . display "categories used: `n_rep'"
                categories used: 3
                I'm beginning to respect the code from post #5 a little more. It's at least more transparent than my regular expressions in the code above, which were even more arcane before I gave up trying to account for time series operators.
