  • Control Variables - Logistic Regression Model

    Hi!

    I'm having a problem with my logistic regression output. My model has a dichotomous dependent variable, DataBreach (1 = had a data breach, 0 otherwise); a factor variable, i.Year (for the years 2007 to 2011); and two categorical variables, size of the firm (large=4, medium=3, small=2, micro=1) and sector (1=agriculture, 2=education, 3=finance, 4=healthcare). The purpose of this regression is to present the trend in data breaches between 2007 and 2011. I included size and sector because, in further analysis, I'm interested in comparing the proportion of data breaches across firm sizes and across sectors.

    Originally, the command I used was:
    logistic DataBreach i.Year i.size i.sector, coef

    Logistic regression
    DataBreach               Coef.    St.Err.  t-value  p-value  [95% Conf  Interval]  Sig
    2007b                    0        .        .        .        .          .
    2008                     .548     .148     3.70     0        .257       .838       ***
    2009                     .634     .139     4.56     0        .362       .907       ***
    2010                     .16      .139     1.15     .021     -.113      .433
    2011                     .24      .143     2.32     .371     .051       .611       **
    size: base Micro         0        .        .        .        .          .
    Small                    .559     .117     4.76     0        .329       .789       ***
    Medium                   .742     .135     5.49     0        .477       1.007      ***
    Large                    .903     .142     6.35     0        .625       1.182      ***
    Sector: bas~Agriculture  0        .        .        .        .          .
    Education                .109     .116     0.94     .347     -.118      .337
    Finance                  .299     .116     2.59     .01      .072       .526       ***
    Healthcare               -.359    .188     -1.91    .056     -.727      .009       *
    Constant                 -.612    .125     -4.89    0        -.857      -.367      ***

    Mean dependent var   0.537       SD dependent var      0.499
    Pseudo r-squared     0.038       Number of obs         2058
    Chi-square           107.539     Prob > chi2           0.000
    Akaike crit. (AIC)   2756.218    Bayesian crit. (BIC)  2818.142
    *** p<.01, ** p<.05, * p<.1

    The issue is that including size and sector seems to have drastically changed the coefficients for the years 2008 to 2011 (as shown above). If I instead use the command:
    logistic DataBreach i.Year, coef

    my regression output is:

    Logistic regression
    DataBreach               Coef.    St.Err.  t-value  p-value  [95% Conf  Interval]  Sig
    2007b                    0        .        .        .        .          .
    2008                     .26      .142     4.65     0        .382       .938       ***
    2009                     .363     .135     4.92     0        .399       .926       ***
    2010                     .2       .134     1.49     .136     -.063      .463
    2011                     .2       .137     2.38     .017     .058       .597       **
    Constant                 -.2      .091     -2.21    .027     -.378      -.022      **

    Mean dependent var   0.537       SD dependent var      0.499
    Pseudo r-squared     0.013       Number of obs         2058
    Chi-square           35.651      Prob > chi2           0.000
    Akaike crit. (AIC)   2816.106    Bayesian crit. (BIC)  2844.253
    *** p<.01, ** p<.05, * p<.1

    The coefficients above seem more accurate, as they match the annual trend in my raw data (I calculated the proportion of data breaches manually in Excel), particularly for 2010 and 2011. My raw data showed that 31% of firms had a data breach in both 2010 and 2011, so I would expect the coefficients for those two years to be equal, as they are here. However, in my previous regression (the one that includes size and sector), the two coefficients are very different.

    I've realised that this may have something to do with entering size and sector as indicator variables. I'm looking into whether it would be better to include size and sector as control variables. If so, does anyone know the relevant commands?

    Thank you!
    Last edited by Aria Mendoza; 03 Sep 2022, 20:46.

  • #2
    First, you should never be surprised when the addition or removal of variables to or from a model results in changes in the other coefficients. That is normal and expected. The question is which model is appropriate to your research question. In your post, your stated research question is "The purpose of this regression is to show/present the trend in Data Breaches between 2007 and 2011." If that is truly what your research is about, there is really no reason to include any variables other than year in the model. In fact, if that is the goal, there isn't even any reason to do a logistic regression: just show the proportions of firms having a data breach in each year in the raw data. Done!
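
    As a rough sketch of the raw-data approach, the commands below tabulate the yearly breach proportions directly. They assume the variable names from the original post (DataBreach coded 0/1, and Year). The last two lines are only a cross-check: a logit with nothing but year indicators is saturated in Year, so its predicted probabilities should reproduce the raw yearly proportions.

```stata
* Raw proportion of firms with a breach in each year (row percentages)
tabulate Year DataBreach, row

* The same proportions with standard errors and confidence intervals
proportion DataBreach, over(Year)

* Cross-check: a year-only logit is saturated in Year, so the
* predicted probability for each year equals the raw proportion
logit DataBreach i.Year
margins Year
```

    If the figures from margins do not match the proportions calculated by hand, the discrepancy is more likely in the data handling than in the model.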

    But perhaps you have something else in mind, such as gaining an understanding of causal factors in data breaches and you have hypotheses about sector and size being relevant. If that is the case, then you can only approach that with a model that contains those variables.
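
    If size and sector are of interest in their own right, one possible way to examine them is sketched below, again assuming the variable names from the original post. Factor-variable notation (i.size, i.sector) is already how categorical covariates are adjusted for in Stata, so no separate "control variable" command is needed. The testparm lines test each set of indicators jointly.

```stata
* Model with year, size and sector; the or option reports odds ratios
logit DataBreach i.Year i.size i.sector, or

* Joint Wald tests: do the size indicators (or the sector indicators)
* contribute to the model as a group?
testparm i.size
testparm i.sector
```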

    And even if the time trend in data breaches is your primary focus, you might want to gain an understanding of what might lie behind the trend you observe in the raw data. In that case, you would want to adjust for variables that are changing over time and are also related to data breach risk--which sector or size might be (I have no idea--not my area).
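
    A minimal sketch of that kind of adjusted analysis, assuming the variable names from the original post: first check whether the mix of firm sizes and sectors actually shifts from year to year, then fit the adjusted model and ask margins for the predicted breach probability in each year, averaged over the observed distribution of size and sector.

```stata
* Does the composition of the sample change across years?
tabulate Year size, row
tabulate Year sector, row

* Adjusted model, then the average predicted breach probability by year
logit DataBreach i.Year i.size i.sector
margins Year

* Optional: plot the adjusted yearly probabilities with confidence intervals
marginsplot
```

    Comparing these adjusted probabilities with the raw yearly proportions gives a sense of how much of the observed trend might be attributable to changes in sample composition.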

    So you need to clarify just what your research question(s) is(are) and then model accordingly.
