Clustering variables

Luisa Márquez

Join Date: Apr 2014

Posts: 27
#1

19 Oct 2016, 09:14

Hello,

I am developing a model to analyze how the percentage of women in the founding team influences the goals, achievements and challenges of the business. I have three dependent variables (DVs), all of them continuous. So far, I have run OLS regressions. However, I would like to get a better picture of the phenomenon. I would like to split the sample into businesses with a high percentage of women in the founding team and businesses with a low percentage. Should I cluster the sample?

Thank you so much in advance.

Luisa
Sebastian Geiger

Join Date: Oct 2015

Posts: 124
#2

19 Oct 2016, 12:05

Luisa,

I don't quite know what you mean by clustering, but for a start, you may consider interactions between the share of women and other predictor variables. With interaction terms, the effect of the other covariates is allowed to depend on the share of women.

For example:

Code:

reg goals c.share c.numberofpersons i.sector c.share#c.numberofpersons c.share#i.sector
Luisa Márquez

#3

19 Oct 2016, 12:30

Hello Sebastian,

Thank you so much for your answer.

So far I have run different interactions between the share of women in the founding team and other predictor variables such as profits, the relevance of income for women, and the market. I have not tested the number of persons, but I will follow your suggestion!

I am interested in clustering the sample because, so far, the only information I can get from the OLS regressions is how the share of women affects the goals, achievements and challenges of the business, differentiating between social and economic ones. However, this does not tell me to what extent a specific share of women in the founding team affects having a more social or a more economic orientation.

So far, I have split the sample by specifying "if the_share_of_women>=0.5". However, I wonder if it would be better to use another method.

Thank you so much.

Luisa

Sebastian Geiger

#4

19 Oct 2016, 13:39

The variable -numberofpersons- was not a suggestion; I just made it up for the purpose of illustration. You should choose the variables and interactions according to your theoretical model.

Your dependent variables are continuous and reflect the goals, right? In this case, the coefficient of the share of women is the extent to which this variable affects the goals. Therefore, I'm still not sure why you should split the sample. If you expect that the share of women has a non-linear effect (e.g. stronger between 0 and 50 percent than between 50 and 100 percent), I would consider polynomials of higher order (e.g. the squared share of women as an additional covariate). The sign of the coefficient on the squared variable indicates whether the effect is increasing or decreasing as the share of women increases:

For example:

Code:

webuse cattaneo2
(Excerpt from Cattaneo (2010) Journal of Econometrics 155: 138-154)
r; t=1.48 21:30:19

reg bweight i.mmarried c.mage c.mage#c.mage

      Source |       SS           df       MS      Number of obs   =     4,642
-------------+----------------------------------   F(3, 4638)      =     53.30
       Model |  51820175.9         3    17273392   Prob > F        =    0.0000
    Residual |  1.5031e+09     4,638  324075.908   R-squared       =    0.0333
-------------+----------------------------------   Adj R-squared   =    0.0327
       Total |  1.5549e+09     4,641  335032.156   Root MSE        =    569.28

-------------------------------------------------------------------------------
      bweight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
     mmarried |
     married  |   209.3104   21.33721     9.81   0.000     167.4793    251.1414
         mage |   5.055696    12.2432     0.41   0.680    -18.94679    29.05818
              |
c.mage#c.mage |  -.0337501   .2209105    -0.15   0.879    -.4668398    .3993396
              |
        _cons |   3106.001   160.1259    19.40   0.000     2792.078    3419.924
-------------------------------------------------------------------------------

The effect of mage on the dependent variable (bweight) is positive. In particular, at mage = 0 the marginal effect of one more year is 5.06. But the effect of the next increase (starting from mage = 1) is smaller, since the coefficient of the squared value (c.mage#c.mage) is negative. Specifically, the marginal effect is now 4.99 [5.055696 + 2 * (-.0337501) * 1].
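The hand calculation can also be checked in Stata. The following is only a minimal sketch on the same -cattaneo2- example: -margins- with dydx(mage) evaluates the derivative _b[mage] + 2*_b[c.mage#c.mage]*mage at each value listed in at(), and -lincom- repeats the calculation at mage = 1 directly from the coefficients.

```stata
webuse cattaneo2, clear
reg bweight i.mmarried c.mage c.mage#c.mage

* Marginal effect of mage evaluated at mage = 0 and at mage = 1
margins, dydx(mage) at(mage=(0 1))

* The same derivative at mage = 1, computed from the coefficients
lincom _b[mage] + 2*_b[c.mage#c.mage]
```

Both commands should return about 4.99 for mage = 1, and -lincom- additionally reports a standard error and confidence interval for that combination.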

Is this what you mean?


Luisa Márquez

#5

20 Oct 2016, 03:48

Hello Sebastian,

Yes, my dependent variables are continuous and they reflect the goals of the business. And yes, your explanation is what I am trying to address. Following your example, please find attached my results. If I am right, an increase in the share of women increases the social goals of the business by 1.83. But the effect of the next increase is smaller because, as in your case, the coefficient on the squared term is negative. You mention: "If you expect that a higher share of women has a non-linear effect (e.g. higher between 0 and 50 percent than between 50 and 100 percent), I would consider polynomials of higher order (e.g. a squared value of the share of women as an additional covariate)." I wonder how, using this method, I can learn more about how a share of women above 50 percent affects the goals of the business.

Thank you so much for your help. I really appreciate it.

Luisa

Sebastian Geiger

#6

20 Oct 2016, 04:46

You are right. The negative coefficient on the squared value indicates that the effect of an additional increase becomes smaller as the share of women gets higher. However, I should have mentioned that you need, of course, to look at the significance level. Since your p-value is 0.386, the coefficient is not significant at any conventional level. Strictly speaking, we cannot say whether the effect is, in fact, non-linear. In practice, many researchers will conclude that the effect is rather linear (even though this conclusion may be a type II error, i.e. failing to reject the null hypothesis of no effect does not mean that the null hypothesis is necessarily true). The coefficient might become significant if you include additional control variables, but it can also become even less significant.

If you think that something changes once the share reaches 50 percent (i.e. a majority), then you can include a dummy that equals 1 if the share is at least 50 percent. The coefficient of this dummy indicates (if significant) whether there is a parallel shift at the threshold. For example: if the coefficient on the share of women is 1.8 and the coefficient on the dummy is 0.5, the slope is 1.8 on both sides of the threshold, but the predicted outcome is 0.5 higher for all shares of at least 50 percent. If you want the slope itself to differ (e.g. 1.8 below and 2.3 above the threshold), you also need an interaction between the dummy and the share.

You may report both model specifications in your paper to test whether there is a parallel shift or a non-linear increase/decrease in the effect.
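A minimal sketch of how the two specifications could be put side by side, using the -cattaneo2- example data rather than your own. Here -above30- (mother aged 30 or older) only stands in for a majority dummy such as "share of at least 50 percent", and the stored names quad and shift are arbitrary.

```stata
webuse cattaneo2, clear
gen above30 = mage >= 30 if !missing(mage)   // stand-in for a "majority" dummy

* Specification 1: non-linear (quadratic) effect
reg bweight i.mmarried c.mage c.mage#c.mage
estimates store quad

* Specification 2: parallel shift at the threshold
reg bweight i.mmarried c.mage i.above30
estimates store shift

* Coefficients and standard errors of both models in one table
estimates table quad shift, b(%9.3f) se(%9.3f) stats(N r2_a)
```

With your own data you would replace -bweight-, -mage- and -above30- by the goal variable, the share of women and the majority dummy, and keep whatever controls your theoretical model calls for.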

Regression analysis is even more flexible. You may include an interaction between the dummy and the squared value (in addition to the dummy and the squared value themselves). In this case, the coefficient on the squared value shows whether the effect is non-linear below 50 percent, while the sum* of the coefficients on the squared value and the interaction shows whether there is a non-linear effect above 50 percent. This lets you test whether there is a jump when women reach the majority even though the effect of each additional woman is actually decreasing.

* Note: It is not simply the sum, because the first derivative of a squared value (x²) is 2*x. Therefore, the coefficients on the squared value and on the interaction need to be multiplied by 2*x to obtain the marginal effect. Of course, this can become confusing if you include many squared values and interactions. Stata's margins command offers a convenient way to evaluate the marginal effect at different levels of the covariates.

To illustrate my explanations, I use the test dataset from above. The regression evaluates whether the weight of a baby (-bweight-) is related to the age of the mother (-mage-). I construct a dummy -above30- to indicate late pregnancies (I'm an economist, I don't know if that is accurate ;-) ). The regression includes mage, mage squared, the -above30- dummy, and an interaction between the dummy and the squared value. I also included a married dummy to show that this model can, of course, include additional control variables. The "i." and "c." prefixes mark categorical (dummy) and continuous variables, respectively (this helps to make sure that the marginal effects are computed correctly). The margins command needs to be called after the regression. I ran it twice. The first time, I computed the marginal effects with -mage- fixed at 40. The second time, I computed the average marginal effects, i.e. the effects averaged over all individuals in the sample. Note that an increase in age has a positive effect on the baby's weight overall, but a negative effect when -mage- is fixed at 40 (ignoring the fact that the latter is not significant).

Code:

. webuse cattaneo2, clear                                         // Load test dataset from across the web
(Excerpt from Cattaneo (2010) Journal of Econometrics 155: 138-154)
r; t=1.17 12:42:46

. gen above30 = mage>=30 if !missing(mage)        // Generate dummy
r; t=0.00 12:42:46

.
. reg bweight i.mmarried c.mage c.mage#c.mage i.above30 i.above30#c.mage#c.mage   // Run the regression

      Source |       SS           df       MS      Number of obs   =     4,642
-------------+----------------------------------   F(5, 4636)      =     32.47
       Model |  52608697.6         5  10521739.5   Prob > F        =    0.0000
    Residual |  1.5023e+09     4,636   324045.63   R-squared       =    0.0338
-------------+----------------------------------   Adj R-squared   =    0.0328
       Total |  1.5549e+09     4,641  335032.156   Root MSE        =    569.25

---------------------------------------------------------------------------------------
              bweight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
----------------------+----------------------------------------------------------------
             mmarried |
             married  |   208.1555    21.3643     9.74   0.000     166.2713    250.0397
                 mage |   33.20331   29.85098     1.11   0.266    -25.31881    91.72542
                      |
        c.mage#c.mage |  -.6277024   .6463295    -0.97   0.332    -1.894816     .639411
                      |
            1.above30 |  -258.0051   201.1127    -1.28   0.200    -652.2817    136.2714
                      |
above30#c.mage#c.mage |
                   1  |   .2682805   .2371341     1.13   0.258    -.1966152    .7331762
                      |
                _cons |   2785.358    336.873     8.27   0.000     2124.927    3445.789
---------------------------------------------------------------------------------------
r; t=0.05 12:42:46

.
. margins, dydx(*) at(mage=40)                    // Compute the marginal effects with mage fixed at 40

Average marginal effects                        Number of obs     =      4,642
Model VCE    : OLS

Expression   : Linear prediction, predict()
dy/dx w.r.t. : 1.mmarried mage 1.above30
at           : mage            =          40

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    mmarried |
    married  |   208.1555    21.3643     9.74   0.000     166.2713    250.0397
        mage |  -10.41973   16.78597    -0.62   0.535     -43.3282    22.48875
   1.above30 |   171.2436   183.3105     0.93   0.350    -188.1322    530.6195
------------------------------------------------------------------------------
Note: dy/dx for factor levels is the discrete change from the base level.
r; t=0.28 12:42:47

.
. margins, dydx(*)                                                // Compute the average marginal effects for every variable

Average marginal effects                        Number of obs     =      4,642
Model VCE    : OLS

Expression   : Linear prediction, predict()
dy/dx w.r.t. : 1.mmarried mage 1.above30

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    mmarried |
    married  |   208.1555    21.3643     9.74   0.000     166.2713    250.0397
        mage |   5.375576   2.585566     2.08   0.038     .3066361    10.44452
   1.above30 |  -61.07211   39.52539    -1.55   0.122    -138.5607    16.41645
------------------------------------------------------------------------------
Note: dy/dx for factor levels is the discrete change from the base level.
r; t=0.26 12:42:47
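
One caveat on the first -margins- call, offered as a sketch rather than as a correction of the output: at(mage=40) fixes only -mage-, so -above30- stays at its observed values, and the reported -10.42 is an average over women coded as below and above 30. To evaluate the effect for a woman who is actually 40 (and therefore has above30 = 1), both values can be set in at(). By hand from the coefficient table, this is 33.20331 + 2*40*(-.6277024 + .2682805), i.e. roughly 4.45.

```stata
webuse cattaneo2, clear
gen above30 = mage >= 30 if !missing(mage)
reg bweight i.mmarried c.mage c.mage#c.mage i.above30 i.above30#c.mage#c.mage

* Marginal effect of mage at age 40 with the dummy set consistently to 1
margins, dydx(mage) at(mage=40 above30=1)

* The same derivative by hand: b[mage] + 2*40*(b[mage^2] + b[above30 x mage^2])
lincom _b[mage] + 80*_b[c.mage#c.mage] + 80*_b[1.above30#c.mage#c.mage]
```

Whether above30 should be fixed or left at its observed values depends on the question being asked; for a threshold dummy that is a deterministic function of the covariate, fixing it consistently is usually easier to interpret.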


Luisa Márquez

#7

24 Oct 2016, 04:06

Hello Sebastian,

Thank you so much for the detailed explanation. I have tested my model following your example and it works very well. Thank you so much for your help! :-). Very grateful.
Sebastian Geiger

#8

24 Oct 2016, 06:43

I had another idea for how to show that the marginal effect depends on the share of women (in my example, the age of the woman): the marginsplot command.

Code:

webuse cattaneo2, clear
gen above30 = mage>=30 if !missing(mage)    // Generate dummy
reg bweight i.mmarried c.mage c.mage#c.mage i.above30 i.above30#c.mage#c.mage    // Run the regression
margins, dydx(mage) at(mage=(0(5)100)) over(above30)
marginsplot, xdimension(mage)
graph export "testgraph.png", width(2048) replace

You specify in the margins command what you would like to estimate. Subsequently, you call the marginsplot command. I chose to predict the marginal effects of -mage- over the dummy -above30-. Of course, it does not make much sense to evaluate the red line (above30) for ages below 30 years (and vice versa for the blue line). In my example, the initial effect of age is positive but becomes negative with increasing age (ignoring the fact that none of these effects is statistically significant). Late pregnancies shift the marginal effect of age upwards. This should not be confused with the generally negative effect of late pregnancies (see the coefficient of i.above30).
