Control Variables - Logistic Regression Model

Aria Mendoza

Join Date: Aug 2022
Posts: 2


03 Sep 2022, 20:42

Hi!

I'm having a problem with my logistic regression output. My model has a dichotomous dependent variable, DataBreach (1 = had a data breach, 0 otherwise); a factor variable, i.Year (for the years 2007 to 2011); and two categorical variables: size of the firm (large=4, medium=3, small=2, micro=1) and sector (1=agriculture, 2=education, 3=finance, 4=healthcare). The purpose of this regression is to present the trend in data breaches between 2007 and 2011. I included size and sector because, in further analysis, I'm interested in comparing the proportion of data breaches across firm sizes and across sectors.

Originally, the command I used was:
logistic DataBreach i.Year i.size i.sector, coef

Logistic regression

DataBreach               Coef.   St.Err.  t-value  p-value  [95% Conf  Interval]  Sig
2007b                    0       .        .        .        .          .
2008                     .548    .148     3.70     0        .257       .838       ***
2009                     .634    .139     4.56     0        .362       .907       ***
2010                     .16     .139     1.15     .021     -.113      .433
2011                     .24     .143     2.32     .371     .051       .611       **
size: base Micro         0       .        .        .        .          .
Small                    .559    .117     4.76     0        .329       .789       ***
Medium                   .742    .135     5.49     0        .477       1.007      ***
Large                    .903    .142     6.35     0        .625       1.182      ***
Sector: bas~Agriculture  0       .        .        .        .          .
Education                .109    .116     0.94     .347     -.118      .337
Finance                  .299    .116     2.59     .01      .072       .526       ***
Healthcare               -.359   .188     -1.91    .056     -.727      .009       *
Constant                 -.612   .125     -4.89    0        -.857      -.367      ***

Mean dependent var    0.537       SD dependent var      0.499
Pseudo r-squared      0.038       Number of obs         2058
Chi-square            107.539     Prob > chi2           0.000
Akaike crit. (AIC)    2756.218    Bayesian crit. (BIC)  2818.142
*** p<.01, ** p<.05, * p<.1

The issue I have is that including size and sector seems to have drastically changed the coefficients for the years 2008 to 2011 (as shown above). If I instead use the command:
logistic DataBreach i.Year, coef
my regression output is:

Logistic regression

DataBreach               Coef.   St.Err.  t-value  p-value  [95% Conf  Interval]  Sig
2007b                    0       .        .        .        .          .
2008                     .26     .142     4.65     0        .382       .938       ***
2009                     .363    .135     4.92     0        .399       .926       ***
2010                     .2      .134     1.49     .136     -.063      .463
2011                     .2      .137     2.38     .017     .058       .597       **
Constant                 -.2     .091     -2.21    .027     -.378      -.022      **

Mean dependent var    0.537       SD dependent var      0.499
Pseudo r-squared      0.013       Number of obs         2058
Chi-square            35.651      Prob > chi2           0.000
Akaike crit. (AIC)    2816.106    Bayesian crit. (BIC)  2844.253
*** p<.01, ** p<.05, * p<.1

The coefficient values above seem more accurate, as they match the annual trend in my raw data (from calculating the proportion of data breaches manually in Excel), specifically for 2010 and 2011. My raw data showed that the percentage of firms that had a data breach was 31% in both 2010 and 2011, which is why the coefficients for 2010 and 2011 are equal. However, in my previous regression (the one that includes size and sector), the two coefficients are very different.

I've realised that perhaps this has something to do with having size and sector as indicator variables. I'm looking into whether it would be better to use control variables for size and sector. If so, does anyone know the relevant commands?

Thank you!

Last edited by Aria Mendoza; 03 Sep 2022, 20:46.

Tags: categorical, Controlling for variable, data, logit, regression

Clyde Schechter

Join Date: Apr 2014

Posts: 30355
#2

03 Sep 2022, 23:52

First, you should never be surprised when the addition or removal of variables to or from a model results in changes in the other coefficients. That is normal and expected. The question is which model is appropriate to your research question. In your post, your stated research question is "The purpose of this regression is to show/present the trend in Data Breaches between 2007 and 2011." If that is truly what your research is about, there is really no reason to include any variables other than year in the model. In fact, if that is the goal, there isn't even any reason to do a logistic regression: just show the proportions of firms having a data breach in each year in the raw data. Done!
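As a rough sketch of that suggestion (assuming the variable names DataBreach and Year from the original post, with DataBreach coded 0/1), the yearly proportions could be tabulated directly. The last two commands are optional and only illustrate that a year-only logistic model carries the same information:

```stata
* Row percentages: share of firms with and without a breach in each year
tabulate Year DataBreach, row

* The mean of a 0/1 variable is the proportion with a breach
tabstat DataBreach, by(Year) statistics(mean n)

* Optional check: predicted probabilities from the year-only model
logistic DataBreach i.Year, coef
margins Year
```

Because the year-only model has one parameter per year, the predicted probabilities reported by margins should coincide with the raw yearly proportions.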

But perhaps you have something else in mind, such as gaining an understanding of causal factors in data breaches and you have hypotheses about sector and size being relevant. If that is the case, then you can only approach that with a model that contains those variables.
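If that is the aim, a minimal sketch might look like the following. It reuses the variable names from the original post; testparm is shown only as one possible way to test the size (or sector) indicators as a group, not as a recommended analysis plan:

```stata
* Model with year, size, and sector entered as factor variables
logistic DataBreach i.Year i.size i.sector, coef

* Joint Wald tests: do the size indicators (or the sector indicators)
* contribute to the model as a group?
testparm i.size
testparm i.sector
```

Note that entering size and sector with the i. prefix is already how categorical variables are adjusted for in Stata; no separate "control variable" command is involved.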

And even if the time trend in data breaches is your primary focus, you might want to gain an understanding of what might lie behind the trend you observe in the raw data. In that case, you would want to adjust for variables that are changing over time and are also related to data breach risk--which sector or size might be (I have no idea--not my area).
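One way this could be explored, again assuming the variable names from the original post: first check whether the mix of firm sizes and sectors shifts across years, then look at the adjusted yearly predictions alongside the raw proportions.

```stata
* Does the composition of the sample change from year to year?
tabulate Year size, row
tabulate Year sector, row

* Adjusted model, followed by average predicted breach probabilities by year
logistic DataBreach i.Year i.size i.sector, coef
margins Year
marginsplot
```

Here margins Year averages the predictions over the observed distribution of size and sector in the whole sample, so differences between years in that output are not driven by year-to-year differences in composition.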

So you need to clarify just what your research question(s) is(are) and then model accordingly.