  • Creating a New Variable from Within-Distribution Averages and its Implications

    Hello everybody,

    I'm deeply engaged in a research project focusing on the relationship between impunity (quality of justice) and the reporting of sexual violence across Italian provinces. My underlying hypothesis is that regions with higher impunity might, on average, have fewer reports of sexual violence. This supposition stems from potential factors like a lack of trust in the justice system and an ingrained sense of shame among victims. Preliminary visual analysis of my data, particularly using geographical mapping, suggests more reports of sexual violence in Northern Italy, juxtaposed with higher impunity levels in the South, somewhat reinforcing my hypothesis. My methodological inclination is towards fixed-effects regression, especially after some preliminary tests. However, instead of using the raw number of reports of sexual violence as my dependent variable, I considered deriving a new metric by subtracting the average of the top 10 provinces (based on reports) from each observation, year on year, and then reversing the sign. This would ideally yield a 'GAP' indicating the difference with the "best-performing" provinces.

    The measure I've devised is:

    GAP = Reports in a given province - Average of the top 10 provinces for reports

    My core questions are:
    1. Is this a conventional or previously employed methodology? Can I find pertinent literature or guidelines on it?
    2. Would it be methodologically right to exclude the top 10 provinces of any year from the model? I'm contemplating this due to potential divergent effects:
    a. The gap decreases with increasing impunity because there are more reports in all the other provinces.
    b. The gap decreases with rising impunity because the "high-performing" provinces report less.
    And, of course, the converse situations where the gap increases.
    3. Can I possibly address this by introducing a dummy variable accounting for provinces with high reporting rates and then crafting an interaction term with this variable?

    I genuinely appreciate your insights and assistance. I hope I've articulated my concerns comprehensively.

    Warm regards,
    Lorenzo

  • #2
    1. Subtracting the average of the top 10 provinces just re-centers the variable and shouldn't have any effect on the regression coefficient. Reversing the sign on every observation should just reverse the sign on the regression coefficient.
    2. You should not exclude the observations; you should model the divergence with an interaction.
    3. Yes, exactly.

    Just to expand a bit on 1: I don't know what your data look like, but suppose you have 10 years and 1000 provinces.

    Code:
    clear
    set seed 23456
    set obs 10000
    gen province = mod(_n, 1000)
    egen year = seq(), block(1000)
    
    gen impunity = runiform()
    bysort year: gen violence = 2 * impunity + rnormal()
    
    xtset province year
    xtreg violence impunity, fe
    Just using violence as the outcome, we get this model.

    Code:
    ------------------------------------------------------------------------------
        violence | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
        impunity |   1.985421   .0367834    53.98   0.000     1.913317    2.057525
           _cons |   .0135477   .0209594     0.65   0.518    -.0275376     .054633
    Now let's create a new variable like you describe:

    Code:
    bysort year (violence), sort: egen top_ten_mean =  mean(violence) if _n > 990
    bysort year (violence), sort: replace top_ten_mean = top_ten_mean[_N]
    gen diff_and_flip = -(violence - top_ten_mean)
    Look at that last line. Notice it is mathematically equivalent to subtracting each observation for violence from the top ten mean:

    Code:
    gen reverse_and_diff = top_ten_mean - violence
    assert diff_and_flip == reverse_and_diff
    rename reverse_and_diff new_violence
    Now, what happens when we plug this into the new model?

    Code:
    new_violence | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
        impunity |  -1.985924   .0369868   -53.69   0.000    -2.058427   -1.913422
           _cons |   3.977565   .0210753   188.73   0.000     3.936252    4.018877
    The sign of the coefficient changes because we flipped the sign of the outcome. We also have a new intercept. This makes sense, because the intercept is the predicted mean of the outcome when impunity is zero. We changed the mean of violence when we re-centered it, so we get a different intercept. Now the intercept represents something like the average amount of violence (number of violent acts in your case) less than the average of the top-10 most violent provinces.

    I considered deriving a new metric by subtracting the average of the top 10 provinces (based on reports) from each observation, year on year, and then reversing the sign. This would ideally yield a 'GAP' indicating the difference with the "best-performing" provinces.
    The wording here is a little confusing. By "best-performing", do you actually mean provinces with the fewest acts of violence? I would call those the bottom 10 provinces, at least when speaking in terms of acts of violence. If you mean the provinces with the fewest violent acts, the code above would need to be changed slightly, as would the interpretation of the intercept, but much of the above analysis stays the same. Or do you mean "top-ten" in terms of impunity? This is not really clear at all.
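
    If you do mean the provinces with the fewest violent acts, a rough sketch of that change, continuing from the simulated data above and using an explicit rank (illustrative only, not a definitive recipe), could look like this:

    Code:
    sort year violence
    by year: gen rank_low = _n
    * mean of the ten least violent provinces within each year
    by year: egen bottom_ten_mean = mean(cond(rank_low <= 10, violence, .))
    * difference from the bottom-ten mean
    gen gap_from_bottom = violence - bottom_ten_mean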

    If you want to do this (or a similar) manipulation on your outcome to bring the model more in line with theory or the hypothesis you want to test, then I think you should go ahead and do it, and you should feel confident that you aren't making any egregious statistical errors, at least with respect to this particular transformation. As for papers you can cite for this: I think you should be able to just explain clearly in your writeup what you're doing, and as long as you are clear, that should be enough. If you really feel you need a citation, I don't have one off hand, but you might just search for mean centering (preferably looking for sources broadly within your field) and go from there.
    Last edited by Daniel Schaefer; 18 Oct 2023, 13:58.



    • #3
      Thank you so much for the help. There are a few things I didn't explain clearly.

      I did some research; it appears to be a form of mean centering between groups (the years), but it's conditional, since I take the mean of only the top 10 provinces (out of 106), year by year.

      From your code, it seems to me that only the top 10 provinces of the total are calculated, not year by year. I managed to write some code, perhaps not pretty, but it works:

      Code:
      replace Sex_viol_100k = -1e10 if missing(Sex_viol_100k)
      sort year Sex_viol_100k
      by year: gen rank = _N - _n + 1
      by year: egen top10_avg = mean(cond(rank <= 10, Sex_viol_100k, .))
      gen GAP = Sex_viol_100k - top10_avg
      replace Sex_viol_100k = . if Sex_viol_100k == -1e10
      replace GAP = . if missing(Sex_viol_100k)
      replace GAP = -GAP
      Regarding the regression, the values change a lot, I guess because it's not a standard centering. These are the results with the original variable:

      Code:
       
      -------------------------------------------------------------------------------------
                          |               Robust
            Sex_viol_100k |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      --------------------+----------------------------------------------------------------
                 Impunity |
                      L1. |  -3.067214   1.539991    -1.99   0.049    -6.120732   -.0136961
      Which is as I expect: as "Impunity" increases, reports decrease.

      These are with the new "GAP" variable:

      Code:
      -------------------------------------------------------------------------------------
                          |               Robust
                      GAP |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      --------------------+----------------------------------------------------------------
                 Impunity |
                      L1. |  -7.407721     1.3642    -5.43   0.000    -10.11268   -4.702764

      NOTE: these are reported sexual assaults, not the actual number committed. My (debatable) idea rests on the fact that fewer assaults are reported than actually occur. That's why, for me, the "best-performing" provinces are the ones with more reports of sexual assaults, not fewer. For example, if Milan has 10 reports per 100k inhabitants and Rome has 7, then under the same conditions it is Rome that has fewer reports than there should be, and for this reason Milan performs better. I also use control variables.



      • #4
        From your code, it seems to me that only the top 10 provinces of the total are calculated, not year by year.
        No, that is not correct. These two lines from me:

        Code:
        bysort year (violence), sort: egen top_ten_mean =  mean(violence) if _n > 990
        bysort year (violence), sort: replace top_ten_mean = top_ten_mean[_N]
        are equivalent to these three lines from your code:

        Code:
        sort year violence
        by year: gen rank = _N - _n + 1
        by year: egen top10_avg = mean(cond(rank <= 10, violence, .))
        This assertion passes on my end:

        Code:
        assert top10_avg == top_ten_mean
        If you want to talk about regression models, you need to show me the lines of code you use to generate the results.

        For example, if Milan has 10 reports per 100k inhabitants and Rome has 7, then under the same conditions it is Rome that has fewer reports than there should be, and for this reason Milan performs better. I also use control variables.
        Okay, so do you take the per capita rate of sexual assaults, or do you use the raw number as the outcome?



        • #5
          Thank you again for your time. I use the number of reported sexual assaults per 100,000 inhabitants.

          For the regression code, I use:

          Code:
          clear all
          
          capture log close
          
          import excel "C:\Users\loren\OneDrive\Desktop\Università\PhD\Impunità\Database\PANEL_Sex_FDI.xlsx", sheet("panel") firstrow
          
          keep if year >= 2008 & year <= 2020
          
          replace Sex_viol_100k = -1e10 if missing(Sex_viol_100k)
          sort year Sex_viol_100k
          by year: gen rank = _N - _n + 1
          by year: egen top10_avg = mean(cond(rank <= 10, Sex_viol_100k, .))
          gen GAP = Sex_viol_100k - top10_avg
          replace Sex_viol_100k = . if Sex_viol_100k == -1e10
          replace GAP = . if missing(Sex_viol_100k)
          replace GAP = -GAP
          
          xtset id year
          
          xtreg Sex_viol_100k l.Impunity GAP_Employment_rate l.Homicides Pct_Foreigners, fe vce(cluster id)
          With the following results:

          Code:
          . xtreg Sex_viol_100k l.Impunity GAP_Employment_rate l.Homicides Pct_Foreigners, fe vce(cluster id)
          
          Fixed-effects (within) regression               Number of obs     =      1,263
          Group variable: id                              Number of groups  =        106
          
          R-sq:                                           Obs per group:
               within  = 0.0139                                         min =          9
               between = 0.1996                                         avg =       11.9
               overall = 0.0853                                         max =         12
          
                                                          F(4,105)          =       3.76
          corr(u_i, Xb)  = -0.6954                        Prob > F          =     0.0067
          
                                                    (Std. Err. adjusted for 106 clusters in id)
          -------------------------------------------------------------------------------------
                              |               Robust
                Sex_viol_100k |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
          --------------------+----------------------------------------------------------------
                     Impunity |
                          L1. |  -3.067214   1.539991    -1.99   0.049    -6.120732   -.0136961
                              |
          GAP_Employment_rate |   .0156553   .0298014     0.53   0.600    -.0434353    .0747459
                              |
                    Homicides |
                          L1. |   .1773109   .0945441     1.88   0.064    -.0101525    .3647743
                              |
               Pct_Foreigners |   -27.4156   14.26005    -1.92   0.057    -55.69065    .8594384
                        _cons |   10.97524   2.021957     5.43   0.000     6.966069     14.9844
          --------------------+----------------------------------------------------------------
                      sigma_u |  2.5575401
                      sigma_e |  2.0100704
                          rho |  .61816158   (fraction of variance due to u_i)
          -------------------------------------------------------------------------------------
          Or replacing Sex_viol_100k with GAP:


          Code:
          . xtreg GAP l.Impunity GAP_Employment_rate l.Homicides Pct_Foreigners, fe vce(cluster id)
          
          Fixed-effects (within) regression               Number of obs     =      1,263
          Group variable: id                              Number of groups  =        106
          
          R-sq:                                           Obs per group:
               within  = 0.0530                                         min =          9
               between = 0.2337                                         avg =       11.9
               overall = 0.1414                                         max =         12
          
                                                          F(4,105)          =      14.52
          corr(u_i, Xb)  = -0.1660                        Prob > F          =     0.0000
          
                                                    (Std. Err. adjusted for 106 clusters in id)
          -------------------------------------------------------------------------------------
                              |               Robust
                          GAP |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
          --------------------+----------------------------------------------------------------
                     Impunity |
                          L1. |  -7.407721     1.3642    -5.43   0.000    -10.11268   -4.702764
                              |
          GAP_Employment_rate |    .077659   .0282592     2.75   0.007     .0216262    .1336918
                              |
                    Homicides |
                          L1. |  -.2047511   .1093433    -1.87   0.064    -.4215587    .0120566
                              |
               Pct_Foreigners |  -35.39252   12.38293    -2.86   0.005    -59.94559   -10.83945
                        _cons |   12.19836   1.674713     7.28   0.000     8.877711      15.519
          --------------------+----------------------------------------------------------------
                      sigma_u |  1.7706093
                      sigma_e |  2.0315732
                          rho |  .43168653   (fraction of variance due to u_i)
          -------------------------------------------------------------------------------------

          Code:
          . summarize Sex_viol_100k GAP Impunity GAP_Employment_rate Homicides Pct_Foreigners
          
              Variable |        Obs        Mean    Std. Dev.       Min        Max
          -------------+---------------------------------------------------------
          Sex_vio~100k |      1,369    7.335156    2.765973   .9673963   23.69745
                   GAP |      1,369    5.788866    2.795932  -9.468582   12.94524
              Impunity |      1,369    .6912153    .0890384   .4281333   .9888552
          GAP_Employ~e |      1,372    19.07534    5.783479   4.550507    38.3802
             Homicides |      1,372    .7653262    .8987815          0   12.39038
          -------------+---------------------------------------------------------
          Pct_Foreig~s |      1,372    .0708101    .0338653   .0085237   .1852237



          • #6
            Hi Lorenzo,

            Sorry for the delay in getting back to you. I really wanted to take some time to think this through carefully. Setting aside my explanation below for a moment, I don't actually think you need to modify your outcome like you propose. You should be able to capture what you want to capture with an interaction. You'll still need to think carefully about how to correctly model an interaction with a time-invariant variable (assuming the same provinces are in the top-ten year over year) in a fixed-effects model, but it seems like it would be worth figuring that out.
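
            For concreteness, a minimal sketch of what that might look like (illustrative only, not a definitive specification). Here top10_ever is a hypothetical 0/1 indicator for provinces that are always in the top ten; in a fixed-effects model its main effect is absorbed by the province effects, but the interaction with impunity is still identified:

            Code:
            * assumes the data are already xtset as in #5;
            * top10_ever is a hypothetical time-invariant 0/1 indicator
            xtreg Sex_viol_100k c.Impunity##i.top10_ever, fe
            * Stata omits 1.top10_ever (collinear with the province fixed effects)
            * but still estimates the interaction with Impunity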

            Now on to the explanation. In #2 I mistakenly assumed there were no unobserved fixed effects. In my defense, I modeled this including fixed effects over the last couple of days, and it was still difficult to generate simulated data and a regression model that behave like what you have here. What I said in #2 still appears to be generally correct under the usual regression assumptions. Even increasing the year-over-year variability in the top-ten average doesn't explain nearly everything you have going on in #5.

            After a lengthy investigation, I think three things are happening here. First of all, model misspecification issues like omitted variable bias can increase the variability in the estimates generated across models. That could be part of the issue.

            Second, notice the relatively large change in the correlation between the time-invariant fixed effects and the time-varying part of the model (where it says corr(u_i, Xb) in the output). It's really surprising to me that such a change is possible (though it certainly is possible). Consider that the only part of the model that changes between these two outcomes is the time-varying part: within each year, the gap is just a shifted, sign-flipped copy of Sex_viol_100k. You can verify this by looping through each year and checking the correlation between gap and Sex_viol_100k; the correlation should be exactly -1 every year. The change in corr(u_i, Xb) suggests (though it does not prove) that there may be a nonlinear relationship between your unobserved fixed effects and your observed variables. You may be accounting for this nonlinearity when you translate your outcome by the mean of the top ten and then invert it around zero on the number line, which is surprising, given that this is a linear transformation of the outcome. That said, maybe the interaction you suggest in #1 also accounts for the non-linearity?

            There is a third possibility: many time-series models assume that the data are stationary, meaning that the time-varying portion of the variation in the outcome should be drawn from the same data-generating process year over year. If the distribution is non-stationary, it turns out that it is fairly easy to get results like those above, especially when the fixed effects are large and the relationship between the fixed effects and the time-varying part of the model is nonlinear.
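
            If you want to check stationarity in your own data, here is a minimal sketch of one possible test, using your variable names from #5 (a starting point, not the only option):

            Code:
            * Fisher-type (augmented Dickey-Fuller) panel unit-root tests;
            * the null hypothesis is that all panels contain a unit root.
            * Assumes the data are xtset as in #5.
            xtunitroot fisher Sex_viol_100k, dfuller lags(1)
            xtunitroot fisher GAP, dfuller lags(1)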

            I'm guessing you have some amount of all three of these issues going on with your model. I would talk to an advisor or senior colleague about this in more detail if you can. You might need a whole set of regression diagnostics to systematically figure out what's happening here.

            Here is the latest version of the script I've been playing around with over the last couple of days. It's not a Monte Carlo simulation yet, but at this point I should probably turn it into one.

            Code:
            clear
            set seed 23456
            set obs 10000
            egen province = seq(), from(1) to(100)
            egen year = seq(), block(100)
            
            gen fixed = 10 * rnormal() if year == 1
            bysort province (year), sort: replace fixed = fixed[1]
            
            gen time_varying = fixed^2 + fixed + rnormal(year, 3)
            scatter time_varying fixed
            
            gen outcome = fixed + time_varying + rnormal()
            
            bysort year (outcome), sort: egen top10_avg = mean(cond(_N - _n + 1 <= 10, outcome, .))
            gen gap = -(outcome - top10_avg)
            
            sum outcome gap time_varying fixed top10_avg
            corr outcome gap time_varying fixed top10_avg
            
            forv year = 1/100{
                qui corr gap outcome if year == `year'
                * within a year, gap is a sign-flipped shift of outcome, so the correlation is exactly -1
                assert abs(r(rho) + 1) < 1e-6
            }
            
            xtset province year
            
            xtreg outcome time_varying, fe
            
            xtreg gap time_varying, fe
            Last edited by Daniel Schaefer; 20 Oct 2023, 16:27.



            • #7
              Hi Daniel,

              Once again thank you very much.

              To clarify, the top 10 provinces vary over the years; they are not always the same.

              I've conducted several tests and evaluated non-linearity, and indeed, the relationship does not appear to be linear. At the moment, I've obtained results that appear promising, since they yield interesting and plausible findings. My primary concern is about the inclusion of a dummy variable that takes the value 1 for the provinces in the top 10 and 0 for all the others. By introducing the dummy variable, I aim to account for the effects of the top 10 provinces, distinguishing them from all the others. This setup should answer the question I initially posed: "Is the gap narrowing because the worst-performing provinces are improving, or because the best-performing ones are deteriorating?"


              I've also performed some general tests, which I've outlined below to seek further insights.


              Breusch-Pagan / Cook-Weisberg test for heteroskedasticity:
              • The test indicates no evidence of heteroskedasticity (p-value = 0.4968).
              • However, for safety, I will still employ Fixed Effects (FE) with clustered standard errors.

                Code:
                hettest
                	 
                	Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
                	         Ho: Constant variance
                	         Variables: fitted values of GAP
                	 
                	         chi2(1)      =     0.46
                	         Prob > chi2  =   0.4968

              Wooldridge test for autocorrelation:
              • The test suggests the presence of autocorrelation (p-value = 0.0004).
              • I'll use the vce(cluster) option to address this.
              Code:
              xtserial GAP II
               
              Wooldridge test for autocorrelation in panel data
              H0: no first-order autocorrelation
                  F(  1,     105) =     13.264
                         Prob > F =      0.0004

              Skewness/Kurtosis tests for Normality:
              • The residuals exhibit deviations from normality (p-value = 0.0003).
              • However, normality isn't a fundamental requirement, especially as I'm not aiming for inferences on individual independent variables.

                Code:
                sktest residuals
                	 
                	                    Skewness/Kurtosis tests for Normality
                	                                                          ------ joint ------
                	    Variable |        Obs  Pr(Skewness)  Pr(Kurtosis) adj chi2(2)   Prob>chi2
                	-------------+---------------------------------------------------------------
                	   residuals |      1,369     0.0000        0.3765       15.97         0.0003

              Ramsey RESET test:
              • The test does not reject the null of no omitted variables (p-value = 0.0916), suggesting the model is reasonably well specified.
              Code:
              estat ovtest
               
              Ramsey RESET test using powers of the fitted values of GAP
                     Ho:  model has no omitted variables
                              F(3, 1358) =      2.15
                                Prob > F =      0.0916
              Variance Inflation Factor (VIF):
              • There's high VIF between the independent variable and its square, which is to be expected. It should be okay.

                Code:
                . vif
                	 
                	    Variable |       VIF       1/VIF 
                	-------------+----------------------
                	       sq_II |    129.33    0.007732
                	          II |    126.29    0.007918
                	Pct_Foreig~s |      1.67    0.599989
                	GAP_Employ~e |      1.61    0.622461
                	   Homicides |      1.24    0.809555
                	   dummy_top |      1.07    0.933300
                	     Density |      1.06    0.944199
                	-------------+----------------------
                	    Mean VIF |     37.47

              Breusch and Pagan Lagrangian multiplier test for random effects:
              • The test suggests a preference for a Random Effects (RE) or Fixed Effects (FE) model over pooled OLS.
                Code:
                xttest0
                	 
                	Breusch and Pagan Lagrangian multiplier test for random effects
                	 
                	        GAP[id,t] = Xb + u[id] + e[id,t]
                	 
                	        Estimated results:
                	                         |       Var     sd = sqrt(Var)
                	                ---------+-----------------------------
                	                     GAP |   7.817237       2.795932
                	                       e |    3.65969       1.913032
                	                       u |   1.027117       1.013468
                	 
                	        Test:   Var(u) = 0
                	                             chibar2(01) =   394.65
                	                          Prob > chibar2 =   0.0000

              Hausman test:
              • The test further confirms the preference for a FE model.

                Code:
                hausman fe re
                	 
                	                 ---- Coefficients ----
                	             |      (b)          (B)            (b-B)     sqrt(diag(V_b-V_B))
                	             |       fe           re         Difference          S.E.
                	-------------+----------------------------------------------------------------
                	          II |    -18.8688    -15.56642       -3.302389        1.407948
                	       sq_II |    11.27862     8.464198        2.814425        1.123021
                	   dummy_top |   -5.033781    -5.509129        .4753475        .1117318
                	     Density |   -.0056516    -.0004654       -.0051862        .0030046
                	GAP_Employ~e |     .057368      .088577        -.031209        .0119168
                	   Homicides |   -.0620467    -.0086348       -.0534118        .0304741
                	Pct_Foreig~s |   -11.67048    -12.46384        .7933654        5.968137
                	------------------------------------------------------------------------------
                	                           b = consistent under Ho and Ha; obtained from xtreg
                	            B = inconsistent under Ha, efficient under Ho; obtained from xtreg
                	 
                	    Test:  Ho:  difference in coefficients not systematic
                	 
                	                  chi2(7) = (b-B)'[(V_b-V_B)^(-1)](b-B)
                	                          =       45.58
                	                Prob>chi2 =      0.0000
                The regression:

                Code:
                . xtreg GAP II sq_II dummy_top Density GAP_Employment_rate Homicides Pct_Foreigners, fe vce(cluster id)
                	
                	Fixed-effects (within) regression               Number of obs     =      1,369
                	Group variable: id                              Number of groups  =        106
                	
                	R-sq:                                           Obs per group:
                	     within  = 0.1624                                         min =         10
                	     between = 0.2298                                         avg =       12.9
                	     overall = 0.1721                                         max =         13
                	
                	                                                F(7,105)          =      16.58
                	corr(u_i, Xb)  = -0.6462                        Prob > F          =     0.0000
                	
                	                                          (Std. Err. adjusted for 106 clusters in id)
                	-------------------------------------------------------------------------------------
                	                    |               Robust
                	                GAP |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                	--------------------+----------------------------------------------------------------
                	                 II |   -18.8688    7.90572    -2.39   0.019    -34.54439   -3.193222
                	              sq_II |   11.27862   5.672944     1.99   0.049     .0302238    22.52702
                	          dummy_top |  -5.033781   .5621316    -8.95   0.000    -6.148384   -3.919178
                	            Density |  -.0056516   .0021355    -2.65   0.009    -.0098858   -.0014174
                	GAP_Employment_rate |    .057368   .0257632     2.23   0.028     .0062843    .1084517
                	          Homicides |  -.0620467    .071068    -0.87   0.385    -.2029614     .078868
                	     Pct_Foreigners |  -11.67048   9.883563    -1.18   0.240    -31.26776    7.926802
                	              _cons |   14.80785   3.038474     4.87   0.000     8.783113    20.83258
                	--------------------+----------------------------------------------------------------
                	            sigma_u |   2.377606
                	            sigma_e |  1.9130317
                	                rho |  .60702157   (fraction of variance due to u_i)
                	-------------------------------------------------------------------------------------
                I would be very interested to read your opinion.

                Lorenzo



              • #8
                A few things:
                • You are treating your dummy_top variable as if it were continuous in the regression at the end of #7. It should start with the i. prefix.
                • The vce(cluster id) is not necessary, nor do I think it correctly accounts for temporal autocorrelation. Temporal autocorrelation suggests that adjacent years should be autocorrelated, and that autocorrelation should decline as years get further apart. You are already accounting for the fact that every year within a province is correlated with every other year within a province. That's what the fixed-effects model is doing.
                • As per #6, you may want to check and see if your variables are stationary or not. A fixed effects regression might still be most appropriate, but it's worth thinking a bit about.
                • I don't think it's appropriate to rescale your outcome by the top ten and use a dummy indicating the top ten in the same model. Of course the dummy will be correlated with the gap - it defines the gap. You aren't learning anything by including the dummy, and you're probably introducing model specification issues.
                Your research question seems to have shifted a bit from #1. In #7 you are asking a question about a trend: does the gap get narrower? To rephrase that, does the gap depend on the year, such that later years have smaller gaps? This isn't a hypothesis about whether the effect of impunity on sexual assault is different for the top ten anymore. It seems like you need to include your year variable as a predictor to test this new hypothesis about a trend in the gap.

                In #1 it seems like you want to test the hypothesis that the size of the effect of impunity on cases of sexual assault is different for the top ten. In that case, just use sexual assault as the outcome and interact impunity with an indicator for the top ten. You can ignore the first-order coefficient on the top ten indicator because it is obviously associated with the number of sexual assaults: It is literally defined in terms of the outcome.
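
                For concreteness, a minimal sketch of that interaction model (illustrative, not a definitive specification), reusing the variable names from #5 and #7 and generating the lag explicitly to keep the factor-variable syntax simple:

                Code:
                * assumes the data are xtset as in #5; dummy_top is the 0/1 top-ten indicator from #7
                gen L_Impunity = L.Impunity
                xtreg Sex_viol_100k c.L_Impunity##i.dummy_top GAP_Employment_rate L.Homicides Pct_Foreigners, fe
                * the interaction coefficient tests whether the effect of impunity differs for the
                * top-ten provinces; the first-order coefficient on dummy_top can be ignored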



                • #9
                  Thank you, Daniel.

                  I'm providing the test results below, and at the end, I'll detail my research question. This might aid you in assisting me.

                  The stationarity test results (lags = 1) are:

                  GAP, II, GAP_Employment_rate, and Homicides: With a p-value of 0.000, these series are non-stationary.
                  dummy_top and Density: With a p-value of 1.000, these series are stationary.
                  Pct_Foreigners: With a p-value of 0.687, this series is stationary. (But I believe it might be beneficial to keep it in the model)

                  I'm also showing the regression results after removing the dummy and the stationary variable “Density” to give you a clearer picture.

                  Code:
                  . xtreg GAP II sq_II i.dummy_top Density GAP_Employment_rate Homicides Pct_Foreigners, fe
                  
                  Fixed-effects (within) regression               Number of obs     =      1,369
                  Group variable: id                              Number of groups  =        106
                  
                  R-sq:                                           Obs per group:
                       within  = 0.1624                                         min =         10
                       between = 0.2298                                         avg =       12.9
                       overall = 0.1721                                         max =         13
                  
                                                                  F(7,1256)         =      34.80
                  corr(u_i, Xb)  = -0.6462                        Prob > F          =     0.0000
                  
                  -------------------------------------------------------------------------------------
                                  GAP |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                  --------------------+----------------------------------------------------------------
                                   II |   -18.8688    7.13349    -2.65   0.008    -32.86367   -4.873934
                                sq_II |   11.27862   5.193185     2.17   0.030     1.090349     21.4669
                          1.dummy_top |  -5.033781   .3410927   -14.76   0.000    -5.702955   -4.364607
                              Density |  -.0056516   .0030195    -1.87   0.061    -.0115755    .0002723
                  GAP_Employment_rate |    .057368   .0207118     2.77   0.006     .0167344    .0980016
                            Homicides |  -.0620467    .085097    -0.73   0.466    -.2289947    .1049013
                       Pct_Foreigners |  -11.67048   6.963706    -1.68   0.094    -25.33225    1.991299
                                _cons |   14.80785   2.774837     5.34   0.000     9.364021    20.25167
                  --------------------+----------------------------------------------------------------
                              sigma_u |   2.377606
                              sigma_e |  1.9130317
                                  rho |  .60702157   (fraction of variance due to u_i)
                  -------------------------------------------------------------------------------------
                  F test that all u_i=0: F(105, 1256) = 5.30                   Prob > F = 0.0000
                  
                  
                  
                  . xtreg GAP II sq_II i.dummy_top GAP_Employment_rate Homicides Pct_Foreigners, fe
                   
                  Fixed-effects (within) regression               Number of obs     =      1,369
                  Group variable: id                              Number of groups  =        106
                   
                  R-sq:                                           Obs per group:
                       within  = 0.1601                                         min =         10
                       between = 0.6465                                         avg =       12.9
                       overall = 0.3703                                         max =         13
                   
                                                                  F(6,1257)         =      39.94
                  corr(u_i, Xb)  = 0.3793                         Prob > F          =     0.0000
                   
                  -------------------------------------------------------------------------------------
                                  GAP |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                  --------------------+----------------------------------------------------------------
                                   II |  -18.26564   7.133299    -2.56   0.011    -32.26012   -4.271154
                                sq_II |    10.8836   5.194059     2.10   0.036     .6936171    21.07358
                          1.dummy_top |  -4.926352   .3365633   -14.64   0.000     -5.58664   -4.266064
                  GAP_Employment_rate |   .0572027   .0207323     2.76   0.006      .016529    .0978763
                            Homicides |  -.0571469   .0851414    -0.67   0.502    -.2241818     .109888
                       Pct_Foreigners |  -9.653367   6.886654    -1.40   0.161    -23.16397    3.857235
                                _cons |   12.92054    2.58771     4.99   0.000     7.843828    17.99724
                  --------------------+----------------------------------------------------------------
                              sigma_u |  1.3569913
                              sigma_e |  1.9149356
                                  rho |  .33429383   (fraction of variance due to u_i)
                  -------------------------------------------------------------------------------------
                  F test that all u_i=0: F(105, 1257) = 5.31                   Prob > F = 0.0000
                  
                  
                  
                  
                  . xtreg GAP II sq_II Density GAP_Employment_rate Homicides Pct_Foreigners, fe
                   
                  Fixed-effects (within) regression               Number of obs     =      1,369
                  Group variable: id                              Number of groups  =        106
                   
                  R-sq:                                           Obs per group:
                       within  = 0.0172                                         min =         10
                       between = 0.0255                                         avg =       12.9
                       overall = 0.0179                                         max =         13
                   
                                                                  F(6,1257)         =       3.67
                  corr(u_i, Xb)  = -0.2349                        Prob > F          =     0.0013
                   
                  -------------------------------------------------------------------------------------
                                  GAP |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                  --------------------+----------------------------------------------------------------
                                   II |  -16.06501   7.721445    -2.08   0.038    -31.21335   -.9166736
                                sq_II |    9.35039   5.621432     1.66   0.096    -1.678032    20.37881
                              Density |    .001847    .003223     0.57   0.567    -.0044759      .00817
                  GAP_Employment_rate |   .0588982   .0224266     2.63   0.009     .0149004    .1028959
                            Homicides |   -.098551   .0921046    -1.07   0.285    -.2792467    .0821448
                       Pct_Foreigners |   -11.5942   7.540339    -1.54   0.124    -26.38724    3.198837
                                _cons |   11.59314   2.995337     3.87   0.000     5.716734    17.46956
                  --------------------+----------------------------------------------------------------
                              sigma_u |  2.0001192
                              sigma_e |   2.071442
                                  rho |  .48248807   (fraction of variance due to u_i)
                  -------------------------------------------------------------------------------------
                  F test that all u_i=0: F(105, 1257) = 7.68                   Prob > F = 0.0000
                  
                  
                  
                  . xtreg GAP II sq_II GAP_Employment_rate Homicides Pct_Foreigners, fe
                   
                  Fixed-effects (within) regression               Number of obs     =      1,369
                  Group variable: id                              Number of groups  =        106
                   
                  R-sq:                                           Obs per group:
                       within  = 0.0170                                         min =         10
                       between = 0.2820                                         avg =       12.9
                       overall = 0.1414                                         max =         13
                   
                                                                  F(5,1258)         =       4.34
                  corr(u_i, Xb)  = 0.2522                         Prob > F          =     0.0006
                   
                  -------------------------------------------------------------------------------------
                                  GAP |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                  --------------------+----------------------------------------------------------------
                                   II |  -16.24776   7.712798    -2.11   0.035    -31.37912   -1.116394
                                sq_II |   9.469414   5.616094     1.69   0.092    -1.548529    20.48736
                  GAP_Employment_rate |   .0589648   .0224203     2.63   0.009     .0149794    .1029501
                            Homicides |   -.100461   .0920197    -1.09   0.275      -.28099     .080068
                       Pct_Foreigners |  -12.27209   7.444993    -1.65   0.100    -26.87806    2.333878
                                _cons |   12.20485   2.797955     4.36   0.000      6.71568    17.69403
                  --------------------+----------------------------------------------------------------
                              sigma_u |  1.7292266
                              sigma_e |   2.070889
                                  rho |  .41081251   (fraction of variance due to u_i)
                  -------------------------------------------------------------------------------------
                  F test that all u_i=0: F(105, 1258) = 7.99                   Prob > F = 0.0000
                  Now, regarding my research question, I believe it's essential to clarify this aspect. I aim to demonstrate that Impunity (II), a proxy for the quality of the legal system, affects the number of reported cases. Specifically, my hypothesis is that in areas with higher impunity, victims report less due to a lack of trust in the justice system. My supervisor advised me to use the difference from the provinces with the highest number of reports, the top 10 (the GAP), to investigate this aspect.



                  • #10
                    Ideally, you want your variables to be stationary. I don't know why you would want to remove your stationary variable "Density" from the model. You should not include dummy_top unless you are estimating an interaction, but that has nothing to do with whether the variable is stationary or not.

                    Specifically, my hypothesis is that in areas with higher impunity, victims report less due to a lack of trust in the justice system.
                    Then the number of reported cases per capita should be your outcome, and you expect a negative coefficient on impunity in the model. No need for an interaction, because you don't hypothesize that the size of the effect of impunity changes by levels of some other variable - just that impunity is negatively associated with reported cases. I happen to disagree with your supervisor: I don't think it is necessary or helpful to convert your outcome to the "gap" to test this hypothesis as stated. Just look at the way you've worded your hypothesis: it is in terms of the number of reported cases. That's your stated outcome. It's perfectly conceptually clear as is, and I don't really see what you gain by transforming this into a hypothesis about the gap. That said, I don't think it is a statistically invalid transformation (barring some worries related to your non-stationary variables), so your supervisor isn't wrong; it's just not an analytic choice I would make. I think it confuses things; they probably think it clarifies things theoretically. It's up to you to work those details out with them.

                    I think it would be incorrect to interact impunity with dummy_top if the outcome is the gap variable, but that's a different story.



                    • #11
                      By the way, I am not advising you to drop your non-stationary variables from the model! Since N>T I think you should stick with the usual fixed effects regression. Just keep in mind that you have an assumption violation that may increase the size of your standard errors.



                      • #12

                        Hello Daniel,

                        I've come to a conclusion, and I believe it's a definitive one, thanks in part to your advice, so I think it's only fair to let you know what I've concluded:


                        I used the logarithm of the number of reports of sexual violence as the dependent variable to normalize the distribution.

                        The independent variables are Impunity, GDP per Capita, Female Unemployment, Density, Foreigners Rate, and Number of Homicides.


                        After the Ramsey test suggested specification problems or nonlinearity, I ran a linktest in Stata for each variable and found that Impunity, GDP per Capita, and Number of Homicides had a nonlinear relationship with the outcome. By adding the square of each of these variables, I addressed the issue. The VIF analysis reported high values for the variables and their squares, but I guess that's natural.
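
                        For reference, the squared terms used in the model below are simply the squares of the original variables, created along these lines (a minimal sketch):

                        Code:
                        gen sq_impunity = impunity^2
                        gen sq_GDP_Procapita = GDP_Procapita^2
                        gen sq_homicides = Homicides^2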


                        Both the White test and the Breusch-Pagan/Cook-Weisberg test show no signs of heteroskedasticity. The Wooldridge test for autocorrelation in panel data indicates evidence of first-order autocorrelation, with an F-statistic of 7.427 (degrees of freedom: 1, 105) and a p-value of 0.0075. As you said, temporal autocorrelation suggests that adjacent years should be autocorrelated, and that the autocorrelation should decline as years get further apart. For this reason, the use of a fixed-effects model to control for unobserved heterogeneity within each province may indirectly mitigate some of the effects of autocorrelation by absorbing time-invariant differences across provinces.

                        The Breusch and Pagan Lagrangian multiplier test ruled out the pooled OLS model. The Hausman test ruled out the RE model in favor of the FE model.

                        This should be the final model:

                        [Attachment: Immagine1.png - image of the final model specification]

                        And these are the results:


                        Code:
                        . xtreg log_sex_reports l.impunity l.sq_impunity GDP_Procapita sq_GDP_Procapita Homicides sq_homicides FUnemployment Density Pct_Foreigners, fe
                        
                        Fixed-effects (within) regression               Number of obs     =      1,263
                        Group variable: id                              Number of groups  =        106
                        
                        R-sq:                                           Obs per group:
                             within  = 0.0362                                         min =          9
                             between = 0.1744                                         avg =       11.9
                             overall = 0.0661                                         max =         12
                        
                                                                        F(9,1148)         =       4.80
                        corr(u_i, Xb)  = -0.7880                        Prob > F          =     0.0000
                        
                        ----------------------------------------------------------------------------------
                         log_sex_reports |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                        -----------------+----------------------------------------------------------------
                                impunity |
                                     L1. |  -2.468629   .9296035    -2.66   0.008    -4.292541   -.6447161
                                         |
                             sq_impunity |
                                     L1. |    1.60081   .6883595     2.33   0.020     .2502266    2.951394
                                         |
                           GDP_Procapita |   -.000056   .0000263    -2.13   0.034    -.0001077   -4.36e-06
                        sq_GDP_Procapita |   9.37e-10   4.02e-10     2.33   0.020     1.48e-10    1.73e-09
                               Homicides |  -.0323166   .0210059    -1.54   0.124     -.073531    .0088977
                            sq_homicides |   .0065274   .0046619     1.40   0.162    -.0026195    .0156742
                           FUnemployment |  -.0060995   .0027922    -2.18   0.029    -.0115778   -.0006212
                                 Density |  -.0003254   .0005407    -0.60   0.547    -.0013863    .0007355
                          Pct_Foreigners |  -3.752885   1.158392    -3.24   0.001    -6.025687   -1.480083
                                   _cons |   4.201845   .5901845     7.12   0.000     3.043884    5.359806
                        -----------------+----------------------------------------------------------------
                                 sigma_u |  .35830274
                                 sigma_e |   .2438755
                                     rho |  .68339991   (fraction of variance due to u_i)
                        ----------------------------------------------------------------------------------
                        F test that all u_i=0: F(105, 1148) = 7.01                   Prob > F = 0.0000

                        Residuals Test:
                        [Attachments: media residui.png (residual means), qnorm residuals.png (normal Q-Q plot of residuals), istogramma residui.png (histogram of residuals), residui test eteroshedasticita.png (heteroskedasticity check of residuals)]

                        I think there are no problems with these results. The nonlinear relationship suggests that at a low level of impunity there is no relationship between impunity and the number of reports, but that after a certain threshold, the number of reports starts to decrease as impunity increases.
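
                        As a rough check on that threshold, the turning point implied by the quadratic in lagged impunity can be computed right after the regression above (a sketch using the coefficient names from the output):

                        Code:
                        * implied turning point of the quadratic: -b(L.impunity) / (2 * b(L.sq_impunity))
                        nlcom -_b[L.impunity] / (2 * _b[L.sq_impunity])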

                        Thanks again for your help

                        Best,
                        Lorenzo
                        Last edited by Lorenzo Fabiani; 06 Nov 2023, 05:53.



                        • #13
                          Thanks Lorenzo. I am certainly not an expert in all of these statistical tests, but overall, I feel this is a thoughtful and well argued analysis.

                          I'm curious, why did you decide to take the log of the outcome? Why not use a Poisson or negative binomial regression?
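
                          Just to illustrate what I mean, here is a minimal sketch of a fixed-effects Poisson specification (illustrative only; n_reports and population are hypothetical names for the raw count of reports and the province population, which you would need to have or construct; the other regressors follow your model in #12):

                          Code:
                          * n_reports and population are hypothetical variable names
                          xtset id year
                          xtpoisson n_reports L.impunity GDP_Procapita FUnemployment Density Pct_Foreigners, fe exposure(population) vce(robust)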



                          • #14
                            Hello Daniel,

                            Actually, I have read several discordant papers regarding the necessity of a normal distribution for the dependent variable, so I tried both; with the logarithm, the results seem slightly better than with the original variable. Unfortunately, I am not familiar with Poisson or negative binomial regression methodologies; I have relied instead on the existing literature. Here are the results with the non-transformed variable:



                            Code:
                            . xtreg Sex_viol_100k l.impunity l.sq_impunity GDP_Procapita sq_GDP_Procapita Homicides sq_homicides FUnemployment Density Pct_Foreigners, fe
                            
                            Fixed-effects (within) regression               Number of obs     =      1,263
                            Group variable: id                              Number of groups  =        106
                            
                            R-sq:                                           Obs per group:
                                 within  = 0.0388                                         min =          9
                                 between = 0.0988                                         avg =       11.9
                                 overall = 0.0414                                         max =         12
                            
                                                                            F(9,1148)         =       5.15
                            corr(u_i, Xb)  = -0.7682                        Prob > F          =     0.0000
                            
                            ----------------------------------------------------------------------------------
                               Sex_viol_100k |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                            -----------------+----------------------------------------------------------------
                                    impunity |
                                         L1. |   -19.8069   7.470177    -2.65   0.008    -34.46363   -5.150173
                                             |
                                 sq_impunity |
                                         L1. |    12.7676    5.53157     2.31   0.021     1.914478    23.62072
                                             |
                               GDP_Procapita |  -.0005088   .0002115    -2.41   0.016    -.0009239   -.0000938
                            sq_GDP_Procapita |   9.20e-09   3.23e-09     2.84   0.005     2.85e-09    1.55e-08
                                   Homicides |  -.1579414    .168801    -0.94   0.350    -.4891345    .1732517
                                sq_homicides |   .0230998   .0374626     0.62   0.538    -.0504031    .0966026
                               FUnemployment |  -.0409989   .0224375    -1.83   0.068    -.0850219    .0030242
                                     Density |  -.0037301   .0043451    -0.86   0.391    -.0122552    .0047951
                              Pct_Foreigners |  -32.01802    9.30869    -3.44   0.001    -50.28197   -13.75407
                                       _cons |   25.13789   4.742648     5.30   0.000     15.83266    34.44312
                            -----------------+----------------------------------------------------------------
                                     sigma_u |   3.130042
                                     sigma_e |  1.9597528
                                         rho |  .71838328   (fraction of variance due to u_i)
                            ----------------------------------------------------------------------------------
                            F test that all u_i=0: F(105, 1148) = 8.51                   Prob > F = 0.0000

