Collinearity when running regression

Alison Brickson

Join Date: May 2025

Posts: 4
#1

12 May 2025, 11:04

I have some collinearity problems when I run my regression in Stata.

The linear model looks like this: log(ΔTotal Costs) = β0 + β1 log(ΔSales) + β2 Pilot License + β3 Demand Uncertainty + β4 Financial Leverage + β5 (log(ΔSales) × Pilot License) + β6 (log(ΔSales) × Demand Uncertainty) + β7 (log(ΔSales) × Financial Leverage) + Σ βk (log(ΔSales) × Control Variables) + Σ βk (Control Variables) + log(ΔSales) × Industry fixed effects + log(ΔSales) × Year fixed effects + Industry fixed effects + Year fixed effects

This is how I code it in Stata:

Code:

reghdfe ln_ch_total_cost_w ///
    c.ln_ch_sale_wc#c.pilot_miss_to_zero c.ln_ch_sale_wc#c.UNCERT_c ///
    c.ln_ch_sale_wc#c.leverage_wc c.ln_ch_sale_wc#c.ceo_tenure_c ///
    c.ln_ch_sale_wc#c.ceo_age_c c.ln_ch_sale_wc#c.employee_intensity_wc ///
    c.ln_ch_sale_wc#c.capex_ratio_wc c.ln_ch_sale_wc#c.firmsize_wc ///
    c.ln_ch_sale_wc#c.ln_adj_asset_intensity_wc c.ln_ch_sale_wc#c.capital_intensity_wc ///
    c.ln_ch_sale_wc#c.roa_wc ///
    c.ln_ch_sale_wc c.pilot_miss_to_zero c.UNCERT_c c.leverage_wc c.ceo_tenure_c ///
    c.ceo_age_c c.employee_intensity_wc c.capex_ratio_wc c.firmsize_wc ///
    c.ln_adj_asset_intensity_wc c.capital_intensity_wc c.roa_wc ///
    c.ln_ch_sale_wc#i.fyear c.ln_ch_sale_wc#i.naics3, ///
    absorb(fyear naics3) vce(cluster gvkey)

Is it correct to use # instead of ## when interacting the variables with the log change in sales?
I also wanted to ask whether it is possible to interact the fixed effects with sales while also absorbing them. When I tried this, I got the message "warning: missing F statistic; dropped variables due to collinearity or too few clusters". Is there a way to solve this problem? I already tried mean-centering and standardizing the variables, but I still get the missing F statistic.

 
Nick Cox

Join Date: Mar 2014

Posts: 35685
#2

12 May 2025, 11:20

I don't have a direct answer to your question, primarily because I don't do this kind of modelling myself. But some more information might allow people who do to comment more helpfully.

What is the sample size here (or #panels x #periods)?

How many parameters are you estimating in total?

The implication of taking log of change in sales and of total costs is that sales or total costs never decrease (or even stay constant). Could you confirm that?

You seem to want to make your model yet more complicated. With collinearity problems that is not usually the right direction.
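
A minimal sketch of how the first two questions could be answered from the estimation results, assuming the -reghdfe- command from #1 has just been run with vce(cluster gvkey). The stored results e(N), e(N_clust) and e(df_m) are what -reghdfe- is expected to leave behind in that case; -ereturn list- shows what is actually stored in the installed version.

Code:

* Run immediately after the -reghdfe- command in #1
ereturn list
display "Observations used: " e(N)
display "Clusters (gvkey):  " e(N_clust)
display "Model df:          " e(df_m)

Comparing the number of estimated parameters with the number of clusters is a useful first check, since the warning itself mentions "too few clusters".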
Clyde Schechter

Join Date: Apr 2014

Posts: 30090
#3

12 May 2025, 11:40

Is it correct to use # instead of ## when interacting the variables with log change in sales?

The difference between # and ## is that X##Y is expanded to X Y X#Y. This has nothing to do with the real world meaning of the variables, nor whether they are in natural metric or log transformed. In your case, as far as I can see (I haven't checked every single one) all of the variables that you have involved in X#Y terms also appear by themselves. In the case of naics3 and fyear, they appear in the -absorb()- option, but that counts as an appearance! For my part, I would have done this more simply as:

Code:

reghdfe ln_ch_total_cost_w c.ln_ch_sale_wc##(c.pilot_miss_to_zero c.UNCERT_c ///
    c.leverage_wc c.ceo_tenure_c c.ceo_age_c c.employee_intensity_wc c.capex_ratio_wc ///
    c.firmsize_wc c.ln_adj_asset_intensity_wc c.capital_intensity_wc c.roa_wc ///
    i.fyear i.naics3), absorb(fyear naics3) vce(cluster gvkey)

This has the advantage of being much more readable. Additionally, if you did mistakenly omit one of the Y variables from your original code, this way of coding it automatically fixes that error for you.

But your code (unless you dropped one of the Y variables and I missed it) will work the same way and produce the same results. It's just a matter of better notation.
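
The equivalence of X##Y and X Y X#Y can be checked on a small example. The sketch below uses the auto dataset that ships with Stata, which has nothing to do with the data in #1; the variables price, mpg and weight and the stored-estimate names full and byhand are chosen only for illustration.

Code:

* Illustration only: auto.dta ships with Stata and is unrelated to the data in #1
sysuse auto, clear
* ## expands to both main effects plus their interaction
regress price c.mpg##c.weight
estimates store full
* The same model written out term by term with #
regress price c.mpg c.weight c.mpg#c.weight
estimates store byhand
* Put the two sets of coefficients side by side for comparison
estimates table full byhand

If the two columns match, the choice between # and ## is purely notational, provided every main effect is listed when # is used.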

I also wanted to ask if it was possible to interact fixed effects with sales while also absorbing them? because when i tried to do this i get this message "warning: missing F statistic; dropped variables due to collinearity or too few clusters" is there a way to solve this problem?

I don't see any obvious reason why this should have happened, unless sales is itself time-invariant across industries or industry-invariant within years (which, I suppose, is possible but seems unlikely based on the names of the variables).
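
A rough way to check these conditions, and to see how many firms stand behind each industry-specific slope, might look like the sketch below. It assumes the variable names from #1; sd_in_year, sd_in_ind, firm_tag, ind_tag and n_firms are hypothetical helper names. Thin industries are only one possibility worth ruling out, not a diagnosis.

Code:

* Does log change in sales vary within each year and within each industry?
bysort fyear: egen sd_in_year = sd(ln_ch_sale_wc)
bysort naics3: egen sd_in_ind = sd(ln_ch_sale_wc)
summarize sd_in_year sd_in_ind
* How many distinct firms (clusters) are there in each industry?
egen firm_tag = tag(naics3 gvkey)
egen n_firms = total(firm_tag), by(naics3)
* Summarize firms per industry, counting each industry once
egen ind_tag = tag(naics3)
summarize n_firms if ind_tag, detail

Standard deviations of zero or missing, or industries represented by only one or two firms, would be places to look first.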

I would want to see some example data that reproduces this problem to try to figure out what is going on. If possible, post an example that uses fewer variables just to make things simpler. In any case, please use the -dataex- command to post the example data so it will be usable. If you are running version 16 or later, or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

Added: Crossed with #2, which raises some excellent points that I overlooked.
Alison Brickson

Join Date: May 2025

Posts: 4
#4

12 May 2025, 11:52

Number of panels = 1932 and periods = 31.
I have to estimate all parameters included in the model.
And yes, that's correct.

Last edited by Alison Brickson; 12 May 2025, 12:49.
Alison Brickson

Join Date: May 2025

Posts: 4
#5

12 May 2025, 12:59

Originally posted by Clyde Schechter

I would want to see some example data that reproduces this problem to try to figure out what is going on. If possible, post an example that uses fewer variables just to make things simpler. In any case, please use the -dataex- command to post the example data so it will be usable. If you are running version 16 or later, or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

Added: Crossed with #2, which raises some excellent points that I overlooked.

This is an example of the data. I can't use the -dataex- command after running the regression because the output exceeds the linesize limit.
Attached Files

Last edited by Alison Brickson; 12 May 2025, 13:03.
Clyde Schechter

Join Date: Apr 2014

Posts: 30090
#6

12 May 2025, 14:20

An image of the -dataex- output is useless. There is no way to import it into Stata. I anticipated that your data set has too many variables to directly show with -dataex-: that's why I said to try to find an example using fewer variables. See if you can reduce the number of variables in the regression to maybe 6 or 8 and still reproduce the error you are getting. Then just use -dataex- with those variables.
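
A sketch of what that reduction could look like, using variable names from #1. Which covariates to keep and the 1/50 observation range are arbitrary choices made here for illustration, and a small extract may not reproduce the warning on its own.

Code:

* Reduced regression: check that the warning still appears with fewer covariates
reghdfe ln_ch_total_cost_w c.ln_ch_sale_wc##(c.pilot_miss_to_zero c.UNCERT_c c.leverage_wc) ///
    c.ln_ch_sale_wc#i.fyear c.ln_ch_sale_wc#i.naics3, absorb(fyear naics3) vce(cluster gvkey)
* Post only the variables used above; -dataex- accepts a varlist and an in range
dataex gvkey fyear naics3 ln_ch_total_cost_w ln_ch_sale_wc pilot_miss_to_zero UNCERT_c leverage_wc in 1/50

The -dataex- output should be copied from the Results window and pasted into the post as text, not attached as an image.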
Nick Cox

Join Date: Mar 2014

Posts: 35685
#7

12 May 2025, 14:36

The idea that change of sales and change of total costs are always positive across 31 periods and 1932 firms challenges this non-economist.

However, perhaps log(ΔSales) really means Δlog(Sales) -- and so on.

Last edited by Nick Cox; 12 May 2025, 15:00.
Andrew Musau

Join Date: Oct 2014

Posts: 10190
#8

12 May 2025, 15:39

Also, this does not look like panel data. Why is fyear (fiscal year?) in #5 constant across firms (gvkey)? If this is transactional data where you have multiple transactions within a fiscal year, what does a change imply in terms of a given transaction (observation)? In that case, you should have a transaction identifier. Before rushing into estimation, you should first understand the structure of your data and verify that the dependent variable (DV) and cost variables are changes in logs, not logs of changes. As Nick hints in #7, logs of changes do not make sense here, because the log is undefined whenever the change is zero or negative.
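
A minimal sketch of such a structure check, assuming gvkey and fyear are the intended panel and time identifiers and that gvkey is stored as a numeric variable (-xtset- does not accept a string panel variable).

Code:

* Is there at most one observation per firm-year?
duplicates report gvkey fyear
* Declare the panel; this fails with an error if firm-years are repeated
xtset gvkey fyear
* Show the pattern of years observed per firm
xtdescribe

If -duplicates report- shows surplus copies, the data are not one row per firm-year, and any "change" variable needs to be reconsidered before estimation.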

On a more econometric note, be aware of a potential endogeneity problem if you predict the change in cost using the change in sales. Costs and sales are often jointly determined. For example, a promotion or price cut may increase sales but also increase variable costs (such as production or distribution). Additionally, omitted variables (like input prices, advertising, or economic conditions) could affect both sales and costs simultaneously.
Alison Brickson

Join Date: May 2025

Posts: 4
#9

13 May 2025, 03:14

Originally posted by Nick Cox

The idea that change of sales and change of total costs are always positive across 31 periods and 1932 firms challenges this non-economist.

However, perhaps log(ΔSales) really means Δlog(Sales) -- and so on.

Sorry, I was mistaken. I believe the costs and sales can also decrease. The variable log(ΔSales) is interpreted as the log-change in deflated sales of firm i from year t-1 to year t.
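
If the variable is meant as the change in log sales, as #7 suggests, one way to build and check it might look like the sketch below. It assumes the data are one row per firm-year with a numeric gvkey, and that a variable named sale holds deflated sales; sale and ln_ch_sale_check are hypothetical names that do not appear in #1.

Code:

* Declare the panel so that the lag operator L. is available
xtset gvkey fyear
* Change in logs: ln(sale_t) - ln(sale_t-1), defined whenever both years are positive
generate ln_ch_sale_check = ln(sale) - ln(L.sale)
* Log of the change would be missing whenever sales fall or stay flat
count if missing(ln(sale - L.sale)) & !missing(sale, L.sale)
* Compare with the variable actually used in the regression
summarize ln_ch_sale_check ln_ch_sale_wc

A nonzero count in the third step shows how many firm-years would be lost under the log-of-change reading, which is why the distinction matters for the sample.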