  • Creating a New Variable from Within-Distribution Averages and its Implications

    Hello everybody,

    I'm deeply engaged in a research project focusing on the relationship between impunity (quality of justice) and the reporting of sexual violence across Italian provinces. My underlying hypothesis is that regions with higher impunity might, on average, have fewer reports of sexual violence. This supposition stems from potential factors like a lack of trust in the justice system and an ingrained sense of shame among victims. Preliminary visual analysis of my data, particularly using geographical mapping, suggests more reports of sexual violence in Northern Italy, juxtaposed with higher impunity levels in the South, somewhat reinforcing my hypothesis. My methodological inclination is towards fixed-effects regression, especially after some preliminary tests. However, instead of using the raw number of reports of sexual violence as my dependent variable, I considered deriving a new metric by subtracting the average of the top 10 provinces (based on reports) from each observation, year on year, and then reversing the sign. This would ideally yield a 'GAP' indicating the difference with the "best-performing" provinces.

    The measure I've devised is:

    GAP = Reports in a given province - Average of the top 10 provinces for reports

    My core questions are:
    1. Is this a conventional or previously employed methodology? Can I find pertinent literature or guidelines on it?
    2. Would it be methodologically right to exclude the top 10 provinces of any year from the model? I'm contemplating this due to potential divergent effects:
    a. The gap decreases with increasing impunity because there are more reports in all the other provinces.
    b. The gap decreases with rising impunity because the "high-performing" provinces report less.
    And, of course, the converse situations where the gap increases.
    3. Can I possibly address this by introducing a dummy variable accounting for provinces with high reporting rates and then crafting an interaction term with this variable?

    I genuinely appreciate your insights and assistance. I hope I've articulated my concerns comprehensively.

    Warm regards,
    Lorenzo

  • #2
    1. Subtracting the average of the top 10 provinces just re-centers the variable and shouldn't have any effect on the regression coefficient. Reversing the sign on every observation should just reverse the sign on the regression coefficient.
    2. You should not exclude the observations; you should model the divergence with an interaction.
    3. Yes, exactly.

    Just to expand a bit on 1: I don't know what your data look like, but suppose you have 10 years and 1000 provinces.

    Code:
    clear
    set seed 23456
    set obs 10000
    gen province = mod(_n, 1000)
    egen year = seq(), block(1000)
    
    gen impunity = runiform()
    bysort year: gen violence = 2 * impunity + rnormal()
    
    xtset province year
    xtreg violence impunity, fe
    Just using violence as the outcome, we get this model.

    Code:
    ------------------------------------------------------------------------------
        violence | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
        impunity |   1.985421   .0367834    53.98   0.000     1.913317    2.057525
           _cons |   .0135477   .0209594     0.65   0.518    -.0275376     .054633
    Now let's create a new variable like you describe:

    Code:
    bysort year (violence), sort: egen top_ten_mean =  mean(violence) if _n > 990
    bysort year (violence), sort: replace top_ten_mean = top_ten_mean[_N]
    gen diff_and_flip = -(violence - top_ten_mean)
    Look at that last line. Notice it is mathematically equivalent to subtracting each observation for violence from the top ten mean:

    Code:
    gen reverse_and_diff = top_ten_mean - violence
    assert diff_and_flip == reverse_and_diff
    rename reverse_and_diff new_violence
    Now, what happens when we plug this into the new model?

    Code:
    new_violence | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
        impunity |  -1.985924   .0369868   -53.69   0.000    -2.058427   -1.913422
           _cons |   3.977565   .0210753   188.73   0.000     3.936252    4.018877
    The sign of the coefficient changes because we flipped the sign of the outcome. We also have a new intercept. This makes sense, because the intercept is the predicted mean of the outcome when impunity is zero. We changed the mean of violence when we re-centered it, so we get a different intercept. Now the intercept represents something like the average amount of violence (number of violent acts in your case) less than the average of the top-10 most violent provinces.

    I considered deriving a new metric by subtracting the average of the top 10 provinces (based on reports) from each observation, year on year, and then reversing the sign. This would ideally yield a 'GAP' indicating the difference with the "best-performing" provinces.
    The wording here is a little confusing. By "best-performing", do you actually mean provinces with the fewest acts of violence? I would call those the bottom 10 provinces, at least when speaking in terms of acts of violence. If you mean the provinces with the fewest violent acts, the code above would need to be changed slightly, as would the interpretation of the intercept, but much of the above analysis stays the same. Or do you mean "top-ten" in terms of impunity? This is not really clear at all.
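
    If you do mean the provinces with the fewest violent acts, a rough sketch of that change, continuing from the simulated data above and using an explicit rank (illustrative only, not a definitive recipe), could look like this:

    Code:
    sort year violence
    by year: gen rank_low = _n
    * mean of the ten least violent provinces within each year
    by year: egen bottom_ten_mean = mean(cond(rank_low <= 10, violence, .))
    * difference from the bottom-ten mean
    gen gap_from_bottom = violence - bottom_ten_mean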

    If you want to do this (or a similar) manipulation on your outcome to bring the model more in line with theory or the hypothesis you want to test, then I think you should go ahead and do it, and you should feel confident that you aren't making any egregious statistical errors, at least with respect to this particular transformation. As for papers you can cite for this: I think you should be able to just explain clearly in your writeup what you're doing, and as long as you are clear, that should be enough. If you really feel you need a citation, I don't have one off hand, but you might just search for mean centering (preferably looking for sources broadly within your field) and go from there.
    Last edited by Daniel Schaefer; 18 Oct 2023, 13:58.



    • #3
      Thank you so much for the help. There are a few things I didn't explain clearly.

      I did some research; it appears to be a form of mean centering between groups (the years), but it's conditional, since I take the mean of only the top 10 provinces (out of 106), year by year.

      From your code, it seems to me that only the top 10 provinces of the total are calculated, not year by year. I managed to write some code, perhaps not pretty, but it works:

      Code:
      replace Sex_viol_100k = -1e10 if missing(Sex_viol_100k)
      sort year Sex_viol_100k
      by year: gen rank = _N - _n + 1
      by year: egen top10_avg = mean(cond(rank <= 10, Sex_viol_100k, .))
      gen GAP = Sex_viol_100k - top10_avg
      replace Sex_viol_100k = . if Sex_viol_100k == -1e10
      replace GAP = . if missing(Sex_viol_100k)
      replace GAP = -GAP
      Regarding the regression, the values change a lot, I guess because it's not a standard centering. These are the results with the original variable:

      Code:
       
      -------------------------------------------------------------------------------------
                          |               Robust
            Sex_viol_100k |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      --------------------+----------------------------------------------------------------
                 Impunity |
                      L1. |  -3.067214   1.539991    -1.99   0.049    -6.120732   -.0136961
      Which is as I expect: as "Impunity" increases, reports decrease.

      These are with the new "GAP" variable:

      Code:
      -------------------------------------------------------------------------------------
                          |               Robust
                      GAP |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      --------------------+----------------------------------------------------------------
                 Impunity |
                      L1. |  -7.407721     1.3642    -5.43   0.000    -10.11268   -4.702764

      NOTE: these are reported sexual assaults, not the actual number committed. My (debatable) idea rests on the fact that fewer assaults are reported than actually occur. That's why, for me, the "best-performing" provinces are the ones with more reports of sexual assaults, not fewer. For example, if Milan has 10 reports per 100k inhabitants and Rome has 7, then under the same conditions it is Rome that has fewer reports than there should be, and for this reason Milan performs better. I also use control variables.



      • #4
        From your code, it seems to me that only the top 10 provinces of the total are calculated, not year by year.
        No, that is not correct. These two lines from me:

        Code:
        bysort year (violence), sort: egen top_ten_mean =  mean(violence) if _n > 990
        bysort year (violence), sort: replace top_ten_mean = top_ten_mean[_N]
        are equivalent to these three lines from your code:

        Code:
        sort year violence
        by year: gen rank = _N - _n + 1
        by year: egen top10_avg = mean(cond(rank <= 10, violence, .))
        This assertion passes on my end:

        Code:
        assert top10_avg == top_ten_mean
        If you want to talk about regression models, you need to show me the lines of code you use to generate the results.

        For example, if Milan has 10 reports per 100k inhabitants and Rome has 7, then under the same conditions it is Rome that has fewer reports than there should be, and for this reason Milan performs better. I also use control variables.
        Okay, so do you take the per capita rate of sexual assaults, or do you use the raw number as the outcome?



        • #5
          Thank you again for your time. I use the number of reported sexual assaults per 100,000 inhabitants.

          For the regression code, I use:

          Code:
          clear all
          
          capture log close
          
          import excel "C:\Users\loren\OneDrive\Desktop\Università\PhD\Impunità\Database\PANEL_Sex_FDI.xlsx", sheet("panel") firstrow
          
          keep if year >= 2008 & year <= 2020
          
          replace Sex_viol_100k = -1e10 if missing(Sex_viol_100k)
          sort year Sex_viol_100k
          by year: gen rank = _N - _n + 1
          by year: egen top10_avg = mean(cond(rank <= 10, Sex_viol_100k, .))
          gen GAP = Sex_viol_100k - top10_avg
          replace Sex_viol_100k = . if Sex_viol_100k == -1e10
          replace GAP = . if missing(Sex_viol_100k)
          replace GAP = -GAP
          
          xtset id year
          
          xtreg Sex_viol_100k l.Impunity GAP_Employment_rate l.Homicides Pct_Foreigners, fe vce(cluster id)
          With the following results:

          Code:
          . xtreg Sex_viol_100k l.Impunity GAP_Employment_rate l.Homicides Pct_Foreigners, fe vce(cluster id)
          
          Fixed-effects (within) regression               Number of obs     =      1,263
          Group variable: id                              Number of groups  =        106
          
          R-sq:                                           Obs per group:
               within  = 0.0139                                         min =          9
               between = 0.1996                                         avg =       11.9
               overall = 0.0853                                         max =         12
          
                                                          F(4,105)          =       3.76
          corr(u_i, Xb)  = -0.6954                        Prob > F          =     0.0067
          
                                                    (Std. Err. adjusted for 106 clusters in id)
          -------------------------------------------------------------------------------------
                              |               Robust
                Sex_viol_100k |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
          --------------------+----------------------------------------------------------------
                     Impunity |
                          L1. |  -3.067214   1.539991    -1.99   0.049    -6.120732   -.0136961
                              |
          GAP_Employment_rate |   .0156553   .0298014     0.53   0.600    -.0434353    .0747459
                              |
                    Homicides |
                          L1. |   .1773109   .0945441     1.88   0.064    -.0101525    .3647743
                              |
               Pct_Foreigners |   -27.4156   14.26005    -1.92   0.057    -55.69065    .8594384
                        _cons |   10.97524   2.021957     5.43   0.000     6.966069     14.9844
          --------------------+----------------------------------------------------------------
                      sigma_u |  2.5575401
                      sigma_e |  2.0100704
                          rho |  .61816158   (fraction of variance due to u_i)
          -------------------------------------------------------------------------------------
          Or replacing Sex_viol_100k with GAP:


          Code:
          . xtreg GAP l.Impunity GAP_Employment_rate l.Homicides Pct_Foreigners, fe vce(cluster id)
          
          Fixed-effects (within) regression               Number of obs     =      1,263
          Group variable: id                              Number of groups  =        106
          
          R-sq:                                           Obs per group:
               within  = 0.0530                                         min =          9
               between = 0.2337                                         avg =       11.9
               overall = 0.1414                                         max =         12
          
                                                          F(4,105)          =      14.52
          corr(u_i, Xb)  = -0.1660                        Prob > F          =     0.0000
          
                                                    (Std. Err. adjusted for 106 clusters in id)
          -------------------------------------------------------------------------------------
                              |               Robust
                          GAP |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
          --------------------+----------------------------------------------------------------
                     Impunity |
                          L1. |  -7.407721     1.3642    -5.43   0.000    -10.11268   -4.702764
                              |
          GAP_Employment_rate |    .077659   .0282592     2.75   0.007     .0216262    .1336918
                              |
                    Homicides |
                          L1. |  -.2047511   .1093433    -1.87   0.064    -.4215587    .0120566
                              |
               Pct_Foreigners |  -35.39252   12.38293    -2.86   0.005    -59.94559   -10.83945
                        _cons |   12.19836   1.674713     7.28   0.000     8.877711      15.519
          --------------------+----------------------------------------------------------------
                      sigma_u |  1.7706093
                      sigma_e |  2.0315732
                          rho |  .43168653   (fraction of variance due to u_i)
          -------------------------------------------------------------------------------------

          Code:
          . summarize Sex_viol_100k GAP Impunity GAP_Employment_rate Homicides Pct_Foreigners
          
              Variable |        Obs        Mean    Std. Dev.       Min        Max
          -------------+---------------------------------------------------------
          Sex_vio~100k |      1,369    7.335156    2.765973   .9673963   23.69745
                   GAP |      1,369    5.788866    2.795932  -9.468582   12.94524
              Impunity |      1,369    .6912153    .0890384   .4281333   .9888552
          GAP_Employ~e |      1,372    19.07534    5.783479   4.550507    38.3802
             Homicides |      1,372    .7653262    .8987815          0   12.39038
          -------------+---------------------------------------------------------
          Pct_Foreig~s |      1,372    .0708101    .0338653   .0085237   .1852237



          • #6
            Hi Lorenzo,

            Sorry for the delay in getting back to you. I really wanted to take some time to think this through carefully. Setting aside my explanation below for a moment, I don't actually think you need to modify your outcome like you propose. You should be able to capture what you want to capture with an interaction. You'll still need to think carefully about how to correctly model an interaction with a time-invariant variable (assuming the same provinces are in the top-ten year over year) in a fixed-effects model, but it seems like it would be worth figuring that out.
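
            For concreteness, a minimal sketch of what that might look like (illustrative only, not a definitive specification). Here top10_ever is a hypothetical 0/1 indicator for provinces that are always in the top ten; in a fixed-effects model its main effect is absorbed by the province effects, but the interaction with impunity is still identified:

            Code:
            * assumes the data are already xtset as in #5;
            * top10_ever is a hypothetical time-invariant 0/1 indicator
            xtreg Sex_viol_100k c.Impunity##i.top10_ever, fe
            * Stata omits 1.top10_ever (collinear with the province fixed effects)
            * but still estimates the interaction with Impunity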

            Now on to the explanation. In #2 I mistakenly assumed there were no unobserved fixed effects. In my defense, I modeled this including fixed effects over the last couple of days, and it was still difficult to generate simulated data and a regression model that behave like what you have here. What I said in #2 still appears to be generally correct under the usual regression assumptions. Even increasing the year-over-year variability in the top-ten average doesn't explain nearly everything you have going on in #5.

            After a lengthy investigation, I think three things are happening here. First of all, model misspecification issues like omitted variable bias can increase the variability in the estimates generated across models. That could be part of the issue.

            Second, notice the relatively large change in the correlation between the time-invariant fixed effects and the time-varying part of the model (where it says corr(u_i, Xb) in the output). It's really surprising to me that such a change is possible (though it certainly is possible). Consider that the only part of the model that changes between these two outcomes is the time-varying part: within each year, the gap is just a shifted, sign-flipped copy of Sex_viol_100k. You can verify this by looping through each year and checking the correlation between gap and Sex_viol_100k; the correlation should be exactly -1 every year. The change in corr(u_i, Xb) suggests (though it does not prove) that there may be a nonlinear relationship between your unobserved fixed effects and your observed variables. You may be accounting for this nonlinearity when you translate your outcome by the mean of the top ten and then invert it around zero on the number line, which is surprising, given that this is a linear transformation of the outcome. That said, maybe the interaction you suggest in #1 also accounts for the non-linearity?

            There is a third possibility: many time-series models assume that the data are stationary, meaning that the time-varying portion of the variation in the outcome should be drawn from the same data-generating process year over year. If the distribution is non-stationary, it turns out that it is fairly easy to get results like those above, especially when the fixed effects are large and the relationship between the fixed effects and the time-varying part of the model is nonlinear.
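
            If you want to check stationarity in your own data, here is a minimal sketch of one possible test, using your variable names from #5 (a starting point, not the only option):

            Code:
            * Fisher-type (augmented Dickey-Fuller) panel unit-root tests;
            * the null hypothesis is that all panels contain a unit root.
            * Assumes the data are xtset as in #5.
            xtunitroot fisher Sex_viol_100k, dfuller lags(1)
            xtunitroot fisher GAP, dfuller lags(1)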

            I'm guessing you have some amount of all three of these issues going on with your model. I would talk to an advisor or senior colleague about this in more detail if you can. You might need a whole set of regression diagnostics to systematically figure out what's happening here.

            Here is the latest version of the script I've been playing around with over the last couple of days. It's not a Monte Carlo simulation yet, but at this point I should probably turn it into one.

            Code:
            clear
            set seed 23456
            set obs 10000
            egen province = seq(), from(1) to(100)
            egen year = seq(), block(100)
            
            gen fixed = 10 * rnormal() if year == 1
            bysort province (year), sort: replace fixed = fixed[1]
            
            gen time_varying = fixed^2 + fixed + rnormal(year, 3)
            scatter time_varying fixed
            
            gen outcome = fixed + time_varying + rnormal()
            
            bysort year (outcome), sort: egen top10_avg = mean(cond(_N - _n + 1 <= 10, outcome, .))
            gen gap = -(outcome - top10_avg)
            
            sum outcome gap time_varying fixed top10_avg
            corr outcome gap time_varying fixed top10_avg
            
            forv year = 1/100{
                qui corr gap outcome if year == `year'
                * within a year, gap is a sign-flipped shift of outcome, so the correlation is exactly -1
                assert abs(r(rho) + 1) < 1e-6
            }
            
            xtset province year
            
            xtreg outcome time_varying, fe
            
            xtreg gap time_varying, fe
            Last edited by Daniel Schaefer; 20 Oct 2023, 16:27.



            • #7
              Hi Daniel,

              Once again thank you very much.

              To clarify, the top 10 provinces vary over the years; they are not always the same.

              I've conducted several tests and evaluated non-linearity, and indeed, the relationship does not appear to be linear. At the moment, I've obtained results that appear promising, since they yield interesting and plausible findings. My primary concern is about the inclusion of a dummy variable that takes the value 1 for the provinces in the top 10 and 0 for all the others. By introducing the dummy variable, I aim to account for the effects of the top 10 provinces, distinguishing them from all the others. This setup should answer the question I initially posed: "Is the gap narrowing because the worst-performing provinces are improving, or because the best-performing ones are deteriorating?"


              I've also performed some general tests, which I've outlined below to seek further insights.


              Breusch-Pagan / Cook-Weisberg test for heteroskedasticity:
              • The test indicates no evidence of heteroskedasticity (p-value = 0.4968).
              • However, for safety, I will still employ Fixed Effects (FE) with clustered standard errors.

                Code:
                hettest
                	 
                	Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
                	         Ho: Constant variance
                	         Variables: fitted values of GAP
                	 
                	         chi2(1)      =     0.46
                	         Prob > chi2  =   0.4968

              Wooldridge test for autocorrelation:
              • The test suggests the presence of autocorrelation (p-value = 0.0004).
              • I'll use the vce(cluster) option to address this.
              Code:
              xtserial GAP II
               
              Wooldridge test for autocorrelation in panel data
              H0: no first-order autocorrelation
                  F(  1,     105) =     13.264
                         Prob > F =      0.0004

              Skewness/Kurtosis tests for Normality:
              • The residuals exhibit deviations from normality (p-value = 0.0003).
              • However, normality isn't a fundamental requirement, especially as I'm not aiming for inferences on individual independent variables.

                Code:
                sktest residuals
                	 
                	                    Skewness/Kurtosis tests for Normality
                	                                                          ------ joint ------
                	    Variable |        Obs  Pr(Skewness)  Pr(Kurtosis) adj chi2(2)   Prob>chi2
                	-------------+---------------------------------------------------------------
                	   residuals |      1,369     0.0000        0.3765       15.97         0.0003

              Ramsey RESET test:
              • The test does not reject the null of no omitted variables (p-value = 0.0916), suggesting the model is reasonably well specified.
              Code:
              estat ovtest
               
              Ramsey RESET test using powers of the fitted values of GAP
                     Ho:  model has no omitted variables
                              F(3, 1358) =      2.15
                                Prob > F =      0.0916
              Variance Inflation Factor (VIF):
              • There's high VIF between the independent variable and its square, which is to be expected. It should be okay.

                Code:
                . vif
                	 
                	    Variable |       VIF       1/VIF 
                	-------------+----------------------
                	       sq_II |    129.33    0.007732
                	          II |    126.29    0.007918
                	Pct_Foreig~s |      1.67    0.599989
                	GAP_Employ~e |      1.61    0.622461
                	   Homicides |      1.24    0.809555
                	   dummy_top |      1.07    0.933300
                	     Density |      1.06    0.944199
                	-------------+----------------------
                	    Mean VIF |     37.47

              Breusch and Pagan Lagrangian multiplier test for random effects:
              • The test suggests a preference for a Random Effects (RE) or Fixed Effects (FE) model over pooled OLS.
                Code:
                xttest0
                	 
                	Breusch and Pagan Lagrangian multiplier test for random effects
                	 
                	        GAP[id,t] = Xb + u[id] + e[id,t]
                	 
                	        Estimated results:
                	                         |       Var     sd = sqrt(Var)
                	                ---------+-----------------------------
                	                     GAP |   7.817237       2.795932
                	                       e |    3.65969       1.913032
                	                       u |   1.027117       1.013468
                	 
                	        Test:   Var(u) = 0
                	                             chibar2(01) =   394.65
                	                          Prob > chibar2 =   0.0000

              Hausman test:
              • The test further confirms the preference for a FE model.

                Code:
                hausman fe re
                	 
                	                 ---- Coefficients ----
                	             |      (b)          (B)            (b-B)     sqrt(diag(V_b-V_B))
                	             |       fe           re         Difference          S.E.
                	-------------+----------------------------------------------------------------
                	          II |    -18.8688    -15.56642       -3.302389        1.407948
                	       sq_II |    11.27862     8.464198        2.814425        1.123021
                	   dummy_top |   -5.033781    -5.509129        .4753475        .1117318
                	     Density |   -.0056516    -.0004654       -.0051862        .0030046
                	GAP_Employ~e |     .057368      .088577        -.031209        .0119168
                	   Homicides |   -.0620467    -.0086348       -.0534118        .0304741
                	Pct_Foreig~s |   -11.67048    -12.46384        .7933654        5.968137
                	------------------------------------------------------------------------------
                	                           b = consistent under Ho and Ha; obtained from xtreg
                	            B = inconsistent under Ha, efficient under Ho; obtained from xtreg
                	 
                	    Test:  Ho:  difference in coefficients not systematic
                	 
                	                  chi2(7) = (b-B)'[(V_b-V_B)^(-1)](b-B)
                	                          =       45.58
                	                Prob>chi2 =      0.0000
                The regression:

                Code:
                . xtreg GAP II sq_II dummy_top Density GAP_Employment_rate Homicides Pct_Foreigners, fe vce(cluster id)
                	
                	Fixed-effects (within) regression               Number of obs     =      1,369
                	Group variable: id                              Number of groups  =        106
                	
                	R-sq:                                           Obs per group:
                	     within  = 0.1624                                         min =         10
                	     between = 0.2298                                         avg =       12.9
                	     overall = 0.1721                                         max =         13
                	
                	                                                F(7,105)          =      16.58
                	corr(u_i, Xb)  = -0.6462                        Prob > F          =     0.0000
                	
                	                                          (Std. Err. adjusted for 106 clusters in id)
                	-------------------------------------------------------------------------------------
                	                    |               Robust
                	                GAP |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                	--------------------+----------------------------------------------------------------
                	                 II |   -18.8688    7.90572    -2.39   0.019    -34.54439   -3.193222
                	              sq_II |   11.27862   5.672944     1.99   0.049     .0302238    22.52702
                	          dummy_top |  -5.033781   .5621316    -8.95   0.000    -6.148384   -3.919178
                	            Density |  -.0056516   .0021355    -2.65   0.009    -.0098858   -.0014174
                	GAP_Employment_rate |    .057368   .0257632     2.23   0.028     .0062843    .1084517
                	          Homicides |  -.0620467    .071068    -0.87   0.385    -.2029614     .078868
                	     Pct_Foreigners |  -11.67048   9.883563    -1.18   0.240    -31.26776    7.926802
                	              _cons |   14.80785   3.038474     4.87   0.000     8.783113    20.83258
                	--------------------+----------------------------------------------------------------
                	            sigma_u |   2.377606
                	            sigma_e |  1.9130317
                	                rho |  .60702157   (fraction of variance due to u_i)
                	-------------------------------------------------------------------------------------
                I would be very interested to read your opinion.

                Lorenzo



              • #8
                A few things:
                • You are treating your dummy_top variable as if it were continuous in the regression at the end of #7. It should start with the i. prefix.
                • The vce(cluster id) is not necessary, nor do I think it correctly accounts for temporal autocorrelation. Temporal autocorrelation suggests that adjacent years should be autocorrelated, and that autocorrelation should decline as years get further apart. You are already accounting for the fact that every year within a province is correlated with every other year within a province. That's what the fixed-effects model is doing.
                • As per #6, you may want to check and see if your variables are stationary or not. A fixed effects regression might still be most appropriate, but it's worth thinking a bit about.
                • I don't think it's appropriate to rescale your outcome by the top ten and use a dummy indicating the top ten in the same model. Of course the dummy will be correlated with the gap - it defines the gap. You aren't learning anything by including the dummy, and you're probably introducing model specification issues.
                Your research question seems to have shifted a bit from #1. In #7 you are asking a question about a trend: does the gap get narrower? To rephrase that, does the gap depend on the year, such that later years have smaller gaps? This isn't a hypothesis about whether the effect of impunity on sexual assault is different for the top ten anymore. It seems like you need to include your year variable as a predictor to test this new hypothesis about a trend in the gap.

                In #1 it seems like you want to test the hypothesis that the size of the effect of impunity on cases of sexual assault is different for the top ten. In that case, just use sexual assault as the outcome and interact impunity with an indicator for the top ten. You can ignore the first-order coefficient on the top ten indicator because it is obviously associated with the number of sexual assaults: It is literally defined in terms of the outcome.
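
                For concreteness, a minimal sketch of that interaction model (illustrative, not a definitive specification), reusing the variable names from #5 and #7 and generating the lag explicitly to keep the factor-variable syntax simple:

                Code:
                * assumes the data are xtset as in #5; dummy_top is the 0/1 top-ten indicator from #7
                gen L_Impunity = L.Impunity
                xtreg Sex_viol_100k c.L_Impunity##i.dummy_top GAP_Employment_rate L.Homicides Pct_Foreigners, fe
                * the interaction coefficient tests whether the effect of impunity differs for the
                * top-ten provinces; the first-order coefficient on dummy_top can be ignored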



                • #9
                  Thank you, Daniel.

                  I'm providing the test results below, and at the end, I'll detail my research question. This might aid you in assisting me.

                  The stationarity test results (lags = 1) are:

                  GAP, II, GAP_Employment_rate, and Homicides: With a p-value of 0.000, these series are non-stationary.
                  dummy_top and Density: With a p-value of 1.000, these series are stationary.
                  Pct_Foreigners: With a p-value of 0.687, this series is stationary. (But I believe it might be beneficial to keep it in the model)

                  I'm also showing the regression results after removing the dummy and the stationary variable “Density” to give you a clearer picture.

                  Code:
                  . xtreg GAP II sq_II i.dummy_top Density GAP_Employment_rate Homicides Pct_Foreigners, fe
                  
                  Fixed-effects (within) regression               Number of obs     =      1,369
                  Group variable: id                              Number of groups  =        106
                  
                  R-sq:                                           Obs per group:
                       within  = 0.1624                                         min =         10
                       between = 0.2298                                         avg =       12.9
                       overall = 0.1721                                         max =         13
                  
                                                                  F(7,1256)         =      34.80
                  corr(u_i, Xb)  = -0.6462                        Prob > F          =     0.0000
                  
                  -------------------------------------------------------------------------------------
                                  GAP |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                  --------------------+----------------------------------------------------------------
                                   II |   -18.8688    7.13349    -2.65   0.008    -32.86367   -4.873934
                                sq_II |   11.27862   5.193185     2.17   0.030     1.090349     21.4669
                          1.dummy_top |  -5.033781   .3410927   -14.76   0.000    -5.702955   -4.364607
                              Density |  -.0056516   .0030195    -1.87   0.061    -.0115755    .0002723
                  GAP_Employment_rate |    .057368   .0207118     2.77   0.006     .0167344    .0980016
                            Homicides |  -.0620467    .085097    -0.73   0.466    -.2289947    .1049013
                       Pct_Foreigners |  -11.67048   6.963706    -1.68   0.094    -25.33225    1.991299
                                _cons |   14.80785   2.774837     5.34   0.000     9.364021    20.25167
                  --------------------+----------------------------------------------------------------
                              sigma_u |   2.377606
                              sigma_e |  1.9130317
                                  rho |  .60702157   (fraction of variance due to u_i)
                  -------------------------------------------------------------------------------------
                  F test that all u_i=0: F(105, 1256) = 5.30                   Prob > F = 0.0000
                  
                  
                  
                  . xtreg GAP II sq_II i.dummy_top GAP_Employment_rate Homicides Pct_Foreigners, fe
                   
                  Fixed-effects (within) regression               Number of obs     =      1,369
                  Group variable: id                              Number of groups  =        106
                   
                  R-sq:                                           Obs per group:
                       within  = 0.1601                                         min =         10
                       between = 0.6465                                         avg =       12.9
                       overall = 0.3703                                         max =         13
                   
                                                                  F(6,1257)         =      39.94
                  corr(u_i, Xb)  = 0.3793                         Prob > F          =     0.0000
                   
                  -------------------------------------------------------------------------------------
                                  GAP |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                  --------------------+----------------------------------------------------------------
                                   II |  -18.26564   7.133299    -2.56   0.011    -32.26012   -4.271154
                                sq_II |    10.8836   5.194059     2.10   0.036     .6936171    21.07358
                          1.dummy_top |  -4.926352   .3365633   -14.64   0.000     -5.58664   -4.266064
                  GAP_Employment_rate |   .0572027   .0207323     2.76   0.006      .016529    .0978763
                            Homicides |  -.0571469   .0851414    -0.67   0.502    -.2241818     .109888
                       Pct_Foreigners |  -9.653367   6.886654    -1.40   0.161    -23.16397    3.857235
                                _cons |   12.92054    2.58771     4.99   0.000     7.843828    17.99724
                  --------------------+----------------------------------------------------------------
                              sigma_u |  1.3569913
                              sigma_e |  1.9149356
                                  rho |  .33429383   (fraction of variance due to u_i)
                  -------------------------------------------------------------------------------------
                  F test that all u_i=0: F(105, 1257) = 5.31                   Prob > F = 0.0000
                  
                  
                  
                  
                  . xtreg GAP II sq_II Density GAP_Employment_rate Homicides Pct_Foreigners, fe
                   
                  Fixed-effects (within) regression               Number of obs     =      1,369
                  Group variable: id                              Number of groups  =        106
                   
                  R-sq:                                           Obs per group:
                       within  = 0.0172                                         min =         10
                       between = 0.0255                                         avg =       12.9
                       overall = 0.0179                                         max =         13
                   
                                                                  F(6,1257)         =       3.67
                  corr(u_i, Xb)  = -0.2349                        Prob > F          =     0.0013
                   
                  -------------------------------------------------------------------------------------
                                  GAP |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                  --------------------+----------------------------------------------------------------
                                   II |  -16.06501   7.721445    -2.08   0.038    -31.21335   -.9166736
                                sq_II |    9.35039   5.621432     1.66   0.096    -1.678032    20.37881
                              Density |    .001847    .003223     0.57   0.567    -.0044759      .00817
                  GAP_Employment_rate |   .0588982   .0224266     2.63   0.009     .0149004    .1028959
                            Homicides |   -.098551   .0921046    -1.07   0.285    -.2792467    .0821448
                       Pct_Foreigners |   -11.5942   7.540339    -1.54   0.124    -26.38724    3.198837
                                _cons |   11.59314   2.995337     3.87   0.000     5.716734    17.46956
                  --------------------+----------------------------------------------------------------
                              sigma_u |  2.0001192
                              sigma_e |   2.071442
                                  rho |  .48248807   (fraction of variance due to u_i)
                  -------------------------------------------------------------------------------------
                  F test that all u_i=0: F(105, 1257) = 7.68                   Prob > F = 0.0000
                  
                  
                  
                  . xtreg GAP II sq_II GAP_Employment_rate Homicides Pct_Foreigners, fe
                   
                  Fixed-effects (within) regression               Number of obs     =      1,369
                  Group variable: id                              Number of groups  =        106
                   
                  R-sq:                                           Obs per group:
                       within  = 0.0170                                         min =         10
                       between = 0.2820                                         avg =       12.9
                       overall = 0.1414                                         max =         13
                   
                                                                  F(5,1258)         =       4.34
                  corr(u_i, Xb)  = 0.2522                         Prob > F          =     0.0006
                   
                  -------------------------------------------------------------------------------------
                                  GAP |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                  --------------------+----------------------------------------------------------------
                                   II |  -16.24776   7.712798    -2.11   0.035    -31.37912   -1.116394
                                sq_II |   9.469414   5.616094     1.69   0.092    -1.548529    20.48736
                  GAP_Employment_rate |   .0589648   .0224203     2.63   0.009     .0149794    .1029501
                            Homicides |   -.100461   .0920197    -1.09   0.275      -.28099     .080068
                       Pct_Foreigners |  -12.27209   7.444993    -1.65   0.100    -26.87806    2.333878
                                _cons |   12.20485   2.797955     4.36   0.000      6.71568    17.69403
                  --------------------+----------------------------------------------------------------
                              sigma_u |  1.7292266
                              sigma_e |   2.070889
                                  rho |  .41081251   (fraction of variance due to u_i)
                  -------------------------------------------------------------------------------------
                  F test that all u_i=0: F(105, 1258) = 7.99                   Prob > F = 0.0000
                  Now, regarding my research question, I believe it's essential to clarify this aspect. I aim to demonstrate that Impunity (II), a proxy for the quality of the legal system, affects the number of reported cases. Specifically, my hypothesis is that in areas with higher impunity, victims report less due to a lack of trust in the justice system. My supervisor advised me to use the difference from the provinces with the highest number of reports, the top 10 (the GAP), to investigate this aspect.



                  • #10
                    Ideally, you want your variables to be stationary. I don't know why you would want to remove your stationary variable "Density" from the model. You should not include dummy_top unless you are estimating an interaction, but that has nothing to do with whether the variable is stationary or not.

                    Specifically, my hypothesis is that in areas with higher impunity, victims report less due to a lack of trust in the justice system.
                    Then the number of reported cases per capita should be your outcome, and you expect a negative coefficient on impunity in the model. No need for an interaction, because you don't hypothesize that the size of the effect of impunity changes by levels of some other variable - just that impunity is negatively associated with reported cases. I happen to disagree with your supervisor: I don't think it is necessary or helpful to convert your outcome to the "gap" to test this hypothesis as stated. Just look at the way you've worded your hypothesis: it is in terms of the number of reported cases. That's your stated outcome. It's perfectly conceptually clear as is, and I don't really see what you gain by transforming this into a hypothesis about the gap. That said, I don't think it is a statistically invalid transformation (barring some worries related to your non-stationary variables), so your supervisor isn't wrong; it's just not an analytic choice I would make. I think it confuses things; they probably think it clarifies things theoretically. It's up to you to work those details out with them.

                    I think it would be incorrect to interact impunity with dummy_top if the outcome is the gap variable, but that's a different story.



                    • #11
                      By the way, I am not advising you to drop your non-stationary variables from the model! Since N>T I think you should stick with the usual fixed effects regression. Just keep in mind that you have an assumption violation that may increase the size of your standard errors.



                      • #12

                        Hello Daniel,

                        I've come to a conclusion, and I believe it's a definitive one, thanks in part to your advice, so I think it's only fair to let you know what I've concluded:


                        I used the logarithm of the number of reports of sexual violence as the dependent variable to normalize the distribution.

                        The independent variables are Impunity, GDP per Capita, Female Unemployment, Density, Foreigners Rate, and Number of Homicides.


                        After the Ramsey test suggested specification problems or nonlinearity, I ran a linktest in Stata for each variable and found that Impunity, GDP per Capita, and Number of Homicides had a nonlinear relationship with the outcome. By adding the square of each of these variables, I addressed the issue. The VIF analysis reported high values for the variables and their squares, but I guess that's natural.
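
                        For reference, the squared terms used in the model below are simply the squares of the original variables, created along these lines (a minimal sketch):

                        Code:
                        gen sq_impunity = impunity^2
                        gen sq_GDP_Procapita = GDP_Procapita^2
                        gen sq_homicides = Homicides^2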


                        Both the White test and the Breusch-Pagan/Cook-Weisberg test show no signs of heteroskedasticity. The Wooldridge test for autocorrelation in panel data indicates evidence of first-order autocorrelation, with an F-statistic of 7.427 (degrees of freedom: 1, 105) and a p-value of 0.0075. As you said, temporal autocorrelation suggests that adjacent years should be autocorrelated, and that the autocorrelation should decline as years get further apart. For this reason, the use of a fixed-effects model to control for unobserved heterogeneity within each province may indirectly mitigate some of the effects of autocorrelation by absorbing time-invariant differences across provinces.

                        The Breusch and Pagan Lagrangian multiplier test ruled out the pooled OLS model. The Hausman test ruled out the RE model in favor of the FE model.

                        This should be the final model:

                        [Attachment: Immagine1.png - image of the final model specification]

                        And these are the results:


                        Code:
                        . xtreg log_sex_reports l.impunity l.sq_impunity GDP_Procapita sq_GDP_Procapita Homicides sq_homicides FUnemployment Density Pct_Foreigners, fe
                        
                        Fixed-effects (within) regression               Number of obs     =      1,263
                        Group variable: id                              Number of groups  =        106
                        
                        R-sq:                                           Obs per group:
                             within  = 0.0362                                         min =          9
                             between = 0.1744                                         avg =       11.9
                             overall = 0.0661                                         max =         12
                        
                                                                        F(9,1148)         =       4.80
                        corr(u_i, Xb)  = -0.7880                        Prob > F          =     0.0000
                        
                        ----------------------------------------------------------------------------------
                         log_sex_reports |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                        -----------------+----------------------------------------------------------------
                                impunity |
                                     L1. |  -2.468629   .9296035    -2.66   0.008    -4.292541   -.6447161
                                         |
                             sq_impunity |
                                     L1. |    1.60081   .6883595     2.33   0.020     .2502266    2.951394
                                         |
                           GDP_Procapita |   -.000056   .0000263    -2.13   0.034    -.0001077   -4.36e-06
                        sq_GDP_Procapita |   9.37e-10   4.02e-10     2.33   0.020     1.48e-10    1.73e-09
                               Homicides |  -.0323166   .0210059    -1.54   0.124     -.073531    .0088977
                            sq_homicides |   .0065274   .0046619     1.40   0.162    -.0026195    .0156742
                           FUnemployment |  -.0060995   .0027922    -2.18   0.029    -.0115778   -.0006212
                                 Density |  -.0003254   .0005407    -0.60   0.547    -.0013863    .0007355
                          Pct_Foreigners |  -3.752885   1.158392    -3.24   0.001    -6.025687   -1.480083
                                   _cons |   4.201845   .5901845     7.12   0.000     3.043884    5.359806
                        -----------------+----------------------------------------------------------------
                                 sigma_u |  .35830274
                                 sigma_e |   .2438755
                                     rho |  .68339991   (fraction of variance due to u_i)
                        ----------------------------------------------------------------------------------
                        F test that all u_i=0: F(105, 1148) = 7.01                   Prob > F = 0.0000

                        Residuals Test:
                        [Attachments: media residui.png (residual means), qnorm residuals.png (normal Q-Q plot of residuals), istogramma residui.png (histogram of residuals), residui test eteroshedasticita.png (heteroskedasticity check of residuals)]

                        I think there are no problems with these results. The nonlinear relationship suggests that at a low level of impunity there is no relationship between impunity and the number of reports, but that after a certain threshold, the number of reports starts to decrease as impunity increases.
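
                        As a rough check on that threshold, the turning point implied by the quadratic in lagged impunity can be computed right after the regression above (a sketch using the coefficient names from the output):

                        Code:
                        * implied turning point of the quadratic: -b(L.impunity) / (2 * b(L.sq_impunity))
                        nlcom -_b[L.impunity] / (2 * _b[L.sq_impunity])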

                        Thanks again for your help

                        Best,
                        Lorenzo
                        Last edited by Lorenzo Fabiani; 06 Nov 2023, 05:53.



                        • #13
                          Thanks Lorenzo. I am certainly not an expert in all of these statistical tests, but overall, I feel this is a thoughtful and well argued analysis.

                          I'm curious, why did you decide to take the log of the outcome? Why not use a Poisson or negative binomial regression?
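
                          Just to illustrate what I mean, here is a minimal sketch of a fixed-effects Poisson specification (illustrative only; n_reports and population are hypothetical names for the raw count of reports and the province population, which you would need to have or construct; the other regressors follow your model in #12):

                          Code:
                          * n_reports and population are hypothetical variable names
                          xtset id year
                          xtpoisson n_reports L.impunity GDP_Procapita FUnemployment Density Pct_Foreigners, fe exposure(population) vce(robust)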



                          • #14
                            Hello Daniel,

                            Actually, I have read several discordant papers regarding the necessity of a normal distribution for the dependent variable, so I tried both; with the logarithm, the results seem slightly better than with the original variable. Unfortunately, I am not familiar with Poisson or negative binomial regression methodologies; I have relied instead on the existing literature. Here are the results with the non-transformed variable:



                            Code:
                            . xtreg Sex_viol_100k l.impunity l.sq_impunity GDP_Procapita sq_GDP_Procapita Homicides sq_homicides FUnemployment Density Pct_Foreigners, fe
                            
                            Fixed-effects (within) regression               Number of obs     =      1,263
                            Group variable: id                              Number of groups  =        106
                            
                            R-sq:                                           Obs per group:
                                 within  = 0.0388                                         min =          9
                                 between = 0.0988                                         avg =       11.9
                                 overall = 0.0414                                         max =         12
                            
                                                                            F(9,1148)         =       5.15
                            corr(u_i, Xb)  = -0.7682                        Prob > F          =     0.0000
                            
                            ----------------------------------------------------------------------------------
                               Sex_viol_100k |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                            -----------------+----------------------------------------------------------------
                                    impunity |
                                         L1. |   -19.8069   7.470177    -2.65   0.008    -34.46363   -5.150173
                                             |
                                 sq_impunity |
                                         L1. |    12.7676    5.53157     2.31   0.021     1.914478    23.62072
                                             |
                               GDP_Procapita |  -.0005088   .0002115    -2.41   0.016    -.0009239   -.0000938
                            sq_GDP_Procapita |   9.20e-09   3.23e-09     2.84   0.005     2.85e-09    1.55e-08
                                   Homicides |  -.1579414    .168801    -0.94   0.350    -.4891345    .1732517
                                sq_homicides |   .0230998   .0374626     0.62   0.538    -.0504031    .0966026
                               FUnemployment |  -.0409989   .0224375    -1.83   0.068    -.0850219    .0030242
                                     Density |  -.0037301   .0043451    -0.86   0.391    -.0122552    .0047951
                              Pct_Foreigners |  -32.01802    9.30869    -3.44   0.001    -50.28197   -13.75407
                                       _cons |   25.13789   4.742648     5.30   0.000     15.83266    34.44312
                            -----------------+----------------------------------------------------------------
                                     sigma_u |   3.130042
                                     sigma_e |  1.9597528
                                         rho |  .71838328   (fraction of variance due to u_i)
                            ----------------------------------------------------------------------------------
                            F test that all u_i=0: F(105, 1148) = 8.51                   Prob > F = 0.0000

