Dear Statalist,
I am running a simple regression model and have three questions: when to use clustered standard errors, how to test whether clustering is needed, and how clustering standard errors differs from including interaction terms.
Model / setup:
I have a regression model estimating how various factors influence the shipping price of a specific good transported by sea. My data contain information on contracts signed by a large customer for shipping services (e.g. a large producer of shirts buying shipping services from China to Europe/US/AUS from various shipping companies). Each observation records which shipper (carrier) is responsible for the shipment, the route of the shipment and the shipping price. The goods are always the same, and the cost of transporting them is constant across time and contracts. The objective of my exercise is to analyse the impact of an industry shock that occurred at one point in time and has persisted until today (e.g. a merger between two large competitors leading to increased prices). This is modelled with a dummy variable (DVbreak) taking the value “1” for the period after the industry shock and “0” before.
In its simplified form, my model looks like this

where price is explained by demand and cost drivers, fixed effects and a dummy for the industry shock.
Question:
My expectation is that the error terms will have a different variance for each carrier, and maybe even for each route. I tested for heteroscedasticity using the Breusch-Pagan test, which rejected homoscedasticity. I also suspect that the error terms might be correlated within the same carrier and/or route. In that case, clustering the standard errors would be necessary.
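For concreteness, the steps above correspond roughly to the Stata sketch below. The variable names (price, demand, cost, carrier, route) are placeholders rather than my actual variable names, the covariate list is simplified, and carrier and route are assumed to be numeric identifiers (e.g. created with encode).

```stata
* Sketch only: price, demand, cost, carrier and route are placeholder names
* Baseline OLS with carrier and route fixed effects and the shock dummy
regress price demand cost i.carrier i.route i.DVbreak

* Breusch-Pagan / Cook-Weisberg test for heteroskedasticity after regress
estat hettest

* Same model with heteroskedasticity-robust standard errors
* (robust to unequal variances, but not to within-group correlation)
regress price demand cost i.carrier i.route i.DVbreak, vce(robust)
```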
- How can I test for correlation of the error terms within a specific group? Is there a rule of thumb for the level of correlation at which clustering becomes appropriate/necessary?
- What is the difference between clustering standard errors and including an interaction term in the model? E.g. if errors were clustered by carrier, I could instead include the interaction term “DVbreak * FEcarrier”. Would that lead to the same results as clustering by carrier?
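
To make the second question concrete, the two alternatives I have in mind would look roughly like this in Stata. Again, price, demand, cost, carrier and route are placeholder names, and carrier is assumed to be a numeric identifier.

```stata
* Sketch only: placeholder variable names, simplified covariate list

* (a) Original specification, standard errors clustered by carrier
regress price demand cost i.carrier i.route i.DVbreak, vce(cluster carrier)

* (b) Shock dummy interacted with the carrier fixed effects
*     (## includes both main effects and their interaction)
regress price demand cost i.route i.DVbreak##i.carrier, vce(robust)
```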