Hi!
I'm having a problem with my logistic regression output. My model has a dichotomous dependent variable, DataBreach (1 = the firm had a data breach, 0 otherwise), a factor variable i.Year (for the years 2007 to 2011), and two categorical variables: size of the firm (micro=1, small=2, medium=3, large=4) and sector (1=agriculture, 2=education, 3=finance, 4=healthcare). The purpose of this regression is to present the trend in data breaches between 2007 and 2011. I included size and sector because, in further analysis, I'm interested in comparing the proportion of data breaches across firm sizes and across sectors.
Originally, the command I used was:
logistic DataBreach i.Year i.size i.sector, coef
Logistic regression
| DataBreach | Coef. | St.Err. | t-value | p-value | [95% Conf | Interval] | Sig |
| 2007b | 0 | . | . | . | . | . | |
| 2008 | .548 | .148 | 3.70 | 0 | .257 | .838 | *** |
| 2009 | .634 | .139 | 4.56 | 0 | .362 | .907 | *** |
| 2010 | .16 | .139 | 1.15 | .021 | -.113 | .433 | |
| 2011 | .24 | .143 | 2.32 | .371 | .051 | .611 | ** |
| size: base Micro | 0 | . | . | . | . | . | |
| Small | .559 | .117 | 4.76 | 0 | .329 | .789 | *** |
| Medium | .742 | .135 | 5.49 | 0 | .477 | 1.007 | *** |
| Large | .903 | .142 | 6.35 | 0 | .625 | 1.182 | *** |
| sector: base Agriculture | 0 | . | . | . | . | . | |
| Education | .109 | .116 | 0.94 | .347 | -.118 | .337 | |
| Finance | .299 | .116 | 2.59 | .01 | .072 | .526 | *** |
| Healthcare | -.359 | .188 | -1.91 | .056 | -.727 | .009 | * |
| Constant | -.612 | .125 | -4.89 | 0 | -.857 | -.367 | *** |
| Mean dependent var | 0.537 | SD dependent var | 0.499 |
| Pseudo r-squared | 0.038 | Number of obs | 2058 |
| Chi-square | 107.539 | Prob > chi2 | 0.000 |
| Akaike crit. (AIC) | 2756.218 | Bayesian crit. (BIC) | 2818.142 |
| *** p<.01, ** p<.05, * p<.1 |
The issue I have is that including size and sector seems to have drastically changed the coefficients for the years 2008 to 2011 (as shown above). Whereas if I use the command:
logistic DataBreach i.Year, coef
My regression output is:
Logistic regression
| DataBreach | Coef. | St.Err. | t-value | p-value | [95% Conf | Interval] | Sig |
| 2007b | 0 | . | . | . | . | . | |
| 2008 | .26 | .142 | 4.65 | 0 | .382 | .938 | *** |
| 2009 | .363 | .135 | 4.92 | 0 | .399 | .926 | *** |
| 2010 | .2 | .134 | 1.49 | .136 | -.063 | .463 | |
| 2011 | .2 | .137 | 2.38 | .017 | .058 | .597 | ** |
| Constant | -.2 | .091 | -2.21 | .027 | -.378 | -.022 | ** |
| Mean dependent var | 0.537 | SD dependent var | 0.499 |
| Pseudo r-squared | 0.013 | Number of obs | 2058 |
| Chi-square | 35.651 | Prob > chi2 | 0.000 |
| Akaike crit. (AIC) | 2816.106 | Bayesian crit. (BIC) | 2844.253 |
| *** p<.01, ** p<.05, * p<.1 |
The coefficients above seem more accurate, as they match the annual trend in my raw data (I calculated the proportion of data breaches manually in Excel), specifically for 2010 and 2011. My raw data showed that the percentage of firms that had a data breach was 31% in both 2010 and 2011, so the coefficients for 2010 and 2011 are equal. However, in my previous regression (the one that includes size and sector), the two coefficients are very different.
I've realised that perhaps this has something to do with having size and sector as indicator variables. I'm looking into whether it would be better to use control variables for size and sector. If so, does anyone know the relevant commands?
Thank you!
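A minimal sketch of how the two sets of year effects could be compared on the probability scale, assuming the variable names used in the commands in this post (DataBreach, Year, size, sector). A model with only i.Year is saturated in year, so its predicted probabilities reproduce the raw yearly proportions. Once i.size and i.sector are added, each year coefficient is conditional on size and sector, so it need not track the raw trend. Note that i.size and i.sector already enter the model as controls. The tabulate, margins and testparm steps are suggestions, not part of the original analysis.

```stata
* Raw proportion of firms with a data breach in each year (row percentages)
tabulate Year DataBreach, row

* Year-only model: predicted probabilities by year should match the raw proportions
logistic DataBreach i.Year, coef
margins Year

* Model with size and sector: year effects are now adjusted for both covariates
logistic DataBreach i.Year i.size i.sector, coef
margins Year

* Joint test of the year indicators in the adjusted model
testparm i.Year
```

If the adjusted probabilities from the second margins call differ noticeably from the raw proportions, that may indicate that the mix of firm sizes or sectors in the sample changes from year to year.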
