  • What is the c. operator in reg and set emptycells?

    Today, when I ran this regression:

    Code:
    did2s l_homicide [aweight=popwt], first_stage(i.sid i.year) second_stage(i.post) treatment(post) cluster(sid)
    The code above is equivalent to the following:
    Code:
    reg l_homicide i.sid i.year [aweight=popwt] if post == 0
    predict adj, residuals
    reg adj i.post [aweight=popwt], vce(cluster sid) nocons


    I got the error message below:

    Code:
    maxvar too small
        You have attempted to use an interaction with too many levels or attempted to fit a model with too many variables.  You need to increase maxvar; it is currently 5000.  Use set maxvar;
        see help maxvar.
    
        If you are using factor variables and included an interaction that has lots of missing cells, try set emptycells drop to reduce the required matrix size; see help set emptycells.
    
        If you are using factor variables, you might have accidentally treated a continuous variable as a categorical, resulting in lots of categories.  Use the c. operator on such variables.

    I have a few questions:

    1> I am still confused about "factor variables". In my panel dataset, I have indicators for region, country, firm, and year. Are the factor variables here just "firm" and "year", or all four of these?

    2> Does -set emptycells- affect the dataset (that is, does -set emptycells drop- delete some observations)? And how do I turn this setting off?

    3> "If you are using factor variables, you might have accidentally treated a continuous variable as a categorical". I do not think that if firms and years are factor variables here, they would not suffers this problem, am I falling into any fallacy?

    Many thanks and warmest regards.
    Last edited by Phuc Nguyen; 08 Jul 2021, 18:52.

  • #2
    1. Basically, a factor variable is any variable that has an i. prefix, or appears in an interaction term without a c. prefix. Factor variable notation is applied to discrete variables to tell Stata to create a series of indicators for the variables and use those in the command. So in the code you show, the factor variables are sid and year.
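
    A minimal sketch of the distinction, using hypothetical variables y, age, and group (not part of this thread):

    Code:
    regress y i.group           // i.group: one indicator per level of group (minus a base level)
    regress y c.age             // c.age: age enters as a single continuous regressor
    regress y c.age##i.group    // ## expands to both main effects plus the interaction; age stays continuous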

    2. -set emptycells- affects the way Stata sets up the matrices it uses to do regression estimations. It does not modify your data set. If you -set emptycells drop- and later want to turn that off, the command is -set emptycells keep-. (Also, if you exit Stata and re-start it, any previous -set emptycells drop- is forgotten.)
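
    In other words, the setting is session-scoped and reversible; a minimal sketch:

    Code:
    set emptycells drop    // build smaller estimation matrices by dropping empty interaction cells
    * ... run the estimation here ...
    set emptycells keep    // restore the default behavior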

    3. Well, what Stata is telling you here is a bit incomplete. The only reason a continuous variable creates this kind of problem is that it will typically have a very large number of distinct values, and Stata tries to create an indicator for each of those values (except one). But you can also exceed the number of allowable variables (or the maximum matrix size) with a single factor variable that has a huge number of levels. Country, region, and year are unlikely to do this. But if you have a very large number of firms in your data set, i.firm might cause this problem. That said, your code makes no reference to firm. What is sid? If sid has a large number of values, that could be the problem here.
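
    A quick way to check how many levels sid has (a sketch; assumes sid is in the data in memory):

    Code:
    quietly tabulate sid
    display r(r) " distinct levels of sid"    // tabulate leaves the number of rows in r(r)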



    • #3
      Originally posted by Clyde Schechter in #2
      Hi Clyde,

      I ran into exactly the same issue when trying to run cmp with random coefficients. My model has 5 equations, and each equation contains some variables with a large number of distinct values (one variable has more than 7,000 categories, and another more than 200). But these variables are important in my model and I cannot drop them. I wonder if there is any other way to get around this issue?

      Thank you very much, and I look forward to your reply.



      • #4
        It would be helpful if you showed the exact command you ran and the exact error message(s) you got from Stata. Also, for each factor variable, give the number of levels of that variable.

        That said, run -query memory- and see what your setting for maxvar is. In most flavors of Stata, the default is 5,000. The factor-variable "virtual" variables do count against this limit, so your 7,000-level variable blows through it all by itself. You can probably get past this by running -set maxvar 10000-, or some number that will accommodate this (assuming you are running a flavor of Stata that will let you do that--if you are running BE, you have no way to run a model this large). Of course, estimating coefficients for 7,000 indicators is going to be time-consuming, and, unless you have a lot of observations for each of them, the results are going to be uselessly imprecise. I also worry about the feasibility of estimating 7,000 random coefficients for these. I don't know whether random coefficients also involve creation of virtual variables that count against maxvar. If they do, then not only do you need maxvar large enough to accommodate the 7,000 indicators, but you will need yet another 7,000 on top of that for their random coefficients. So you may have to try a pretty large number. -help set maxvar- will tell you the largest value your particular Stata will allow.
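
        A minimal sketch of that check and adjustment (16000 is just an illustrative number chosen to cover 7,000+ indicators with room to spare):

        Code:
        query memory        // reports the current maxvar setting, among other things
        set maxvar 16000    // set before loading data; not available in Stata/BE -- see -help set maxvar-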

        So I don't think you're in for smooth sailing. But you can likely get past this particular stumbling block and see what happens. You might ultimately succeed. I am always skeptical about models that involve this many variables. It is more or less inconceivable that you will be able to do anything useful with 7,000 indicator coefficients. Indeed, even just managing the output will be a challenge. If you are not, in the end, able to run this model, give some serious thought to ways of simplifying it. For example, random intercepts at the level of this variable might make more sense. Or there may be a way to group the levels of this variable to get a more manageable coarse-grained version of the variable. Well, this is starting to get speculative, so I'll leave it at that.



        • #5
          Originally posted by Clyde Schechter in #4
          Hi Clyde,

          Thank you so much for your reply. The code that I use is below:

          Code:
          set maxvar 32767
          set matsize 11000
          clear
          use "*.dta"

          drop if id==""
          set emptycells drop
          ssc install estout, replace
          ssc install cmp
          ssc install ghk2, replace

          mata: mata mlib index
          eststo clear
          eststo: cmp ///
              (lndaily_quantity = lnnonc10 lncd210 lncovidcd110 lninter_timing daily_sku_avg_promotion_d1 daily_store_avg_promotion_d1 daily_price_avg_d1 lnl_purchase_quantity lnconsumertenure i.user_type i.status i.income_level new_id i.subcat3 lndelivery_cost i.delivery daily_weight_d daily_size_d i.public_holiday i.week lnlength lnrain lntemp i.brandi || id: lnnonc10 lncd210 lncovidcd110 lninter_timing daily_sku_avg_promotion_d1 daily_store_avg_promotion_d1 daily_price_avg_d1) ///
              (lninter_timing = lnnonc10 lncd210 lncovidcd110 daily_sku_avg_promotion_d1 daily_store_avg_promotion_d1 daily_price_avg_d1 lnl_inter_timing lnconsumertenure i.user_type i.status i.income_level new_id i.subcat3 lndelivery_cost i.delivery daily_weight_d daily_size_d i.public_holiday i.week lnlength lnrain lntemp i.brandi || id: lnnonc10 lncd210 lncovidcd110 daily_sku_avg_promotion_d1 daily_store_avg_promotion_d1 daily_price_avg_d1) ///
              (daily_price_avg_d1 = lnnonc10 lncd210 lncovidcd110 l_daily_price_avg_d1 lnconsumertenure i.user_type i.status i.income_level new_id i.subcat3 lndelivery_cost i.delivery daily_weight_d daily_size_d i.public_holiday i.week lnlength lnrain lntemp i.brandi || id: lnnonc10 lncd210 lncovidcd110) ///
              (daily_sku_avg_promotion_d1 = lnnonc10 lncd210 lncovidcd110 l_daily_sku_avg_promotion_d1 daily_price_avg_d1 lnconsumertenure i.user_type i.status i.income_level new_id i.subcat3 lndelivery_cost i.delivery daily_weight_d daily_size_d i.public_holiday i.week lnlength lnrain lntemp i.brandi || id: lnnonc10 lncd210 lncovidcd110 daily_price_avg_d1) ///
              (daily_store_avg_promotion_d1 = lnnonc10 lncd210 lncovidcd110 l_daily_store_avg_promotion_d1 daily_price_avg_d1 lnconsumertenure i.user_type i.status i.income_level new_id i.subcat3 lndelivery_cost i.delivery daily_weight_d daily_size_d i.public_holiday i.week lnlength lnrain lntemp i.brandi || id: lnnonc10 lncd210 lncovidcd110 daily_price_avg_d1) ///
              , ind($cmp_cont $cmp_cont $cmp_cont $cmp_cont $cmp_cont) qui
          The error I got is shown in the screenshot below:
          [Screenshot: convergence not achieved.PNG]


          There are many factor variables in the model, but i.subcat3 and i.brandi are the ones I mentioned that could lead to this error. I do not put random coefficients on all of these variables, only on 3-5 key IVs. I tried sureg before, and without random coefficients it can actually produce results. I don't know why adding random coefficients on these key variables has such a large impact.

          Please let me know if you have any thoughts on how to get through this. Thanks a lot!








          • #6
            So, what you are coming up against here is not maxvar but matsize. The default matsize in Stata SE is 400 x 400, which is not even in the ballpark of what you need for this model. Fortunately, Stata SE will allow you to -set matsize #-, where # can be up to 11,000. Since you have one variable with 7,000 levels and another with 200, you probably need to -set matsize 8000- or thereabouts. Now, while you might be tempted to just go all the way to 11,000 to start with, be warned. Your model has 5 equations, so there will be 5 such matrices, each of size 11000*11000*8 bytes (each cell of the matrix is a double, which requires 8 bytes of storage). So the five matrices would consume a little under 5GB of memory. Fine if you have enough RAM to do this (on top of the OS and any running programs) and if your OS will allocate that much to Stata. And remember that this demand on RAM goes up as the square of the value of matsize you set. So, use the smallest value that works (if any works at all).
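
            The arithmetic, as a quick sanity check (8000 is the rough estimate above, not a verified requirement):

            Code:
            display 11000*11000*8/(1024^3)    // ~0.9 GB per matrix; times 5 equations, ~4.5 GB
            set matsize 8000                  // start with the smallest value that might work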

            Seeing your code here, I don't think the random coefficients are a big problem. There are 8 in each equation, and they are all on continuous variables, so this is not going to demand an inordinate amount of memory. Estimating those coefficients (along with their standard errors) will be computationally intensive, but they are not adding materially to the amount of memory required. If you are able to get the needed memory to run this command with an increased setting of matsize, I suspect your next post will be worrying about how long it is taking to run. That will depend not only on the model complexity but also on the number of observations in the data set. But even with the smallest number of observations that could hope to support usable, meaningful estimates for a model with this many explanatory variables, I suspect you are looking at runtime measured in days to weeks. I hope I'm wrong about that, but I fear I am not.



            • #7
              Originally posted by Clyde Schechter in #6
              Hi Clyde,

              Thanks for your helpful insights. I have fewer than 40,000 observations. Theoretically, do you think that will be enough to generate any meaningful results?



              • #8
                That's fewer than 6 observations per explanatory variable. While people give various rules of thumb on the ratio of observations to variables, I don't think anybody thinks 6 is enough.
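
                For the record, the arithmetic behind that, assuming roughly 7,200 virtual variables from the level counts above:

                Code:
                display 40000/7200    // ~5.6 observations per explanatory variable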



                • #9
                  Originally posted by Clyde Schechter in #8
                  Thanks a lot for your reply, Clyde. It helps a lot. I'll discuss this with others to see what we can do about this issue.
