Counting used categorical variables

Ariel Soto-Caro

Join Date: Mar 2021

Posts: 30
#1

20 May 2021, 08:21

Hello everyone.
I'm using a categorical variable with the -i.- prefix. This variable has over 3000 categories.
In the regression, probably many of those categories are ruled out.
So I would like to know whether there is a (direct) way to count how many categories are actually used in the regression.

For instance:

Code:

reg y x1 i.x2

Suppose x2 has 3000 categories, but many of them have too few observations, so Stata rules them out and ends up using only 1500.
I want to know how I can get that number (1500).
Many thanks in advance.

William Lisowski

Join Date: Dec 2014
Posts: 10150

20 May 2021, 13:34

I couldn't find a direct approach, but the example code below demonstrates a hack that you may be able to adapt to your purposes. If someone knows a less embarrassing approach I'd be glad to learn it.

Code:

sysuse auto, clear
regress price length mpg i.rep78
testparm i.rep78
return list
local n_rep = r(df)
display "categories used: `n_rep'"

Code:

. testparm i.rep78

 ( 1)  2.rep78 = 0
 ( 2)  3.rep78 = 0
 ( 3)  4.rep78 = 0
 ( 4)  5.rep78 = 0

       F(  4,    62) =    1.12
            Prob > F =    0.3560

. return list

scalars:
               r(drop) =  0
               r(df_r) =  62
                  r(F) =  1.11834110850705
                 r(df) =  4
                  r(p) =  .3560277203360988

. local n_rep = r(df)

. display "categories used: `n_rep'"
categories used: 4

.

Ariel Soto-Caro

20 May 2021, 13:41

It's a smart workaround, though.
Thank you William Lisowski for the idea.


Dirk Enzmann

Join Date: Apr 2014
Posts: 541

20 May 2021, 16:43

Here is a less embarrassing approach (assuming that at least 1 case per category is "enough" for -reg-):

Code:

// install -fre- from SSC if necessary:
cap which fre
if _rc ssc install fre

sysuse auto

* ------------------------------------------------------------------------------
* Example 1:

qui fre rep78
di _n as txt "Number of categories of rep78 to be used by -reg-: " as res `r(r_valid)'-1

* ------------------------------------------------------------------------------
* Example 2 (because -reg- will use listwise deletion of missing cases):

replace foreign = . if rep78==1
mark valid
markout valid price i.rep78 foreign  // use all variables used in -reg-

qui fre rep78 if valid
di _n as txt "Number of categories of rep78 to be used by -reg-: " as res `r(r_valid)'-1

Result of example 1:

Code:

. qui fre rep78
. di _n as txt "Number of categories of rep78 to be used by -reg-: " as res `r(r_valid)'-1

Number of categories of rep78 to be used by -reg-: 4

Result of example 2:

Code:

. qui fre rep78 if valid
. di _n as txt "Number of categories of rep78 to be used by -reg-: " as res `r(r_valid)'-1

Number of categories of rep78 to be used by -reg-: 3
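
The same a priori count can be sketched without the SSC dependency by running the regression first and restricting to e(sample), which applies the listwise deletion automatically. This is only an illustrative sketch, not part of the original posts: the local names levels, n_levels, and n_used are made up for the example. Like the -fre- approach, it counts the levels present in the estimation sample, not the coefficients that survive collinearity checks.

```stata
sysuse auto, clear
regress price length mpg i.rep78

// distinct levels of rep78 among the observations -regress- actually used
quietly levelsof rep78 if e(sample), local(levels)
local n_levels : word count `levels'

// one level serves as the base category, so it gets no coefficient
local n_used = `n_levels' - 1

display "levels in estimation sample: `n_levels'"
display "indicators entering the model: `n_used'"
```

Counting the words in the level list rather than reading a stored count keeps the sketch independent of which results a given version of -levelsof- returns.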

Last edited by Dirk Enzmann; 20 May 2021, 16:47.

William Lisowski

20 May 2021, 17:12

I based my hack on the statement in post #1 that

"In the regression, probably many of those categories are ruled out."

The question is whether what is wanted is the number of categories ruled out a priori, or the number that actually made it through the regression and had a coefficient estimated.

Post #4 produces the first number, while post #2 produces the second number.

Here's a variant of my previous example, with the addition of a variable constructed to be collinear with one of the categories, and thus not excluded until the regress command starts doing the math.

Code:

sysuse auto, clear
generate collinear = rep78==3
regress price length mpg collinear i.rep78
testparm i.rep78
local n_rep = r(df)
display "categories used: `n_rep'"

mark valid
markout valid length mpg collinear i.rep78
qui fre rep78 if valid
di _n as txt "Number of categories of rep78 to be used by -reg-: " as res `r(r_valid)'-1

Code:

. regress price length mpg collinear i.rep78
note: 3.rep78 omitted because of collinearity

[results omitted]

. testparm i.rep78

 ( 1)  2.rep78 = 0
 ( 2)  4.rep78 = 0
 ( 3)  5.rep78 = 0

       F(  3,    62) =    1.44
            Prob > F =    0.2400

. local n_rep = r(df)

. display "categories used: `n_rep'"
categories used: 3

. 
. mark valid

. markout valid length mpg collinear i.rep78

. qui fre rep78 if valid

. di _n as txt "Number of categories of rep78 to be used by -reg-: " as res `r(r_valid)'-1

Number of categories of rep78 to be used by -reg-: 4

.

Ariel Soto-Caro

21 May 2021, 07:00

That's true. I'm interested in the number of categories post-estimation.
I have already tried William's approach, and it works just fine, even though it is an indirect method.
Thank you both for helping me out with this issue.

Dirk Enzmann

22 May 2021, 21:32

Just out of curiosity I tried to find a "direct" solution (i.e. without running -testparm-). Here is another alternative that uses more code but could serve as part of a program that reports not only the number of categories omitted but also which categories they are:

Code:

sysuse auto, clear
generate collinear = rep78==3
regress price length mpg collinear i.rep78

local vnames : rownames(e(V))
local k : word count `vnames'
local ncat = 0
local ocat = 0
local catv = "rep78"
local comit = ""
forvalues i = 1/`=`k'-1' {
   local vname : word `i' of `vnames'
   if strpos("`vname'",".`catv'") > 0 {
      local ++ncat
      if strpos("`vname'","o.`catv'") > 0 {
         local ++ocat
         local comit = "`comit' " + substr("`vname'",1,strpos("`vname'","o.")-1)
      }
   }
}
di as res `ocat' as txt " out of " as res `ncat' as txt " categories of " as res "`catv'" as txt " omitted"
di as txt "omitted categories: " as res "`comit'"

The result is:

Code:

. di as res `ocat' as txt " out of " as res `ncat' as txt " categories of " as res "`catv'" as txt " omitted"
1 out of 5 categories of rep78 omitted

. di as txt "omitted categories: " as res "`comit'"
omitted categories:  3

Last edited by Dirk Enzmann; 22 May 2021, 21:49.

William Lisowski

23 May 2021, 08:11

Following on post #7, this "direct" solution goes a little further, including the base category among those omitted. But I note that it, like the code in post #7, fails to account for the possibility of time series operators applied to a categorical variable.

Code:

sysuse auto, clear
generate collinear = rep78==3
regress price length mpg collinear i.rep78
local catv rep78

local omit : rownames(e(V))
// remove non-categorical variables
local omit = ustrregexra(`" `omit' "',`"\b(?<=[^\.])[A-Za-z_][A-Za-z0-9_]+\b"'," ")
// remove non-omitted categories
local omit = ustrregexra(`"`omit'"',`"\d+\.[A-Za-z][A-Za-z0-9\_]+"',"")
display "omitted categories: `omit'"

Code:

. display "omitted categories: `omit'" 
omitted categories:        1b.rep78  3o.rep78

My earlier code in post #5 does deal successfully with time series operators, however.

Code:

sysuse auto, clear
generate time = _n
tsset time
generate collinear = l.rep78==3
regress price length mpg collinear li.rep78
testparm li.rep78
local n_rep = r(df)
display "categories used: `n_rep'"

Code:

. testparm li.rep78

 ( 1)  2L.rep78 = 0
 ( 2)  4L.rep78 = 0
 ( 3)  5L.rep78 = 0

       F(  3,    61) =    0.16
            Prob > F =    0.9256

. local n_rep = r(df)

. display "categories used: `n_rep'"
categories used: 3

I'm beginning to respect the code from post #5 a little more. It's at least more transparent than my regular expressions in the code above, which were even more arcane before I gave up trying to account for time series operators.
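
One possible way around the regular expressions is to let Stata parse the coefficient names itself. The sketch below assumes the programmer's command -_ms_parse_parts- is available and returns r(type), r(name), r(level), r(base), and r(omit) as described in its help file. That behavior should be checked on your own Stata version. The local names catv, n_est, dropped, and terms are illustrative only.

```stata
sysuse auto, clear
generate collinear = rep78==3
regress price length mpg collinear i.rep78

local catv rep78          // factor variable of interest
local n_est = 0           // levels with an estimated coefficient
local dropped             // base and omitted levels

local terms : colnames e(b)
foreach term of local terms {
    // split a name such as 3o.rep78 into type, variable name, and level
    _ms_parse_parts `term'
    if "`r(type)'" == "factor" & "`r(name)'" == "`catv'" {
        if r(base) == 1 | r(omit) == 1 {
            local dropped `dropped' `r(level)'
        }
        else {
            local ++n_est
        }
    }
}
display "categories used: `n_est'"
display "base or omitted levels: `dropped'"
```

Because the variable name is returned separately from any operator, the same loop should in principle also cope with names such as 2L.rep78, though that case is not verified here.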
