Statalist

Connecting the means in a dot plot

Ben Littenberg — Mon, 08 Dec 2025 20:01:14 GMT

I have been asked to create a dot plot with two means on each line, one for group A and the other for group B. My collaborators want a line connecting just the two means, not extending below or above them. I can easily get a dotted line from 0 to the maximum, but I don't see how to draw it just from A to B. Here is what I have so far:

Code:

graph dot (mean) *_want, ///
    over(role,  relabel(1 "Mental Health*" 2 "Heat Illness*" 3 "Vector-borne Illness*" 4 "Air-quality*" ///
        5 "Pollen*" 6 "Storm/flood Trauma*" 7 "Storm/flood Illness*" 8 "Water-borne Illness*" 9 "Food-borne Illness*")) ///
    groupyvars ///
    ytitle("Percentage reporting resources would be moderately or very helpful", size(small)) ///
    legend(pos(4) ring(0) size(small)) ///
    subtitle("*P<0.05 for Physician vs. NP/PA", pos(7) span size(small)) ///
    name(want_db, replace)

Any suggestions? Thanks
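
One possible route, offered as a sketch rather than a tested solution for the graph above: -graph dot- draws its own guide lines, but a similar picture can be assembled with -twoway- once the means sit in one row per item. The dataset below is made up for illustration (item, meanA, meanB and itemlbl are hypothetical names and the percentages are invented); with real data the means might come from -collapse- followed by -reshape-.

```stata
* Toy data: one row per item, one column per group mean (all values invented)
clear
input byte item double(meanA meanB)
1 62 48
2 55 51
3 40 58
end
label define itemlbl 1 "Mental Health" 2 "Heat Illness" 3 "Air-quality"
label values item itemlbl

* pcspike takes y1 x1 y2 x2, so each segment runs from meanA to meanB
* at the height of its item and stops at the two markers
twoway (pcspike item meanA item meanB, lcolor(gs8))  ///
       (scatter item meanA, msymbol(O))               ///
       (scatter item meanB, msymbol(D)),              ///
       ylabel(1/3, valuelabel angle(0) nogrid)        ///
       yscale(reverse) ytitle("")                     ///
       xtitle("Percentage")                           ///
       legend(order(2 "Group A" 3 "Group B"))         ///
       name(connected_means, replace)
```

Drawing the segment first keeps the markers on top of it; the legend order skips plot 1 so the connecting line gets no legend entry.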

New command -regoptwgt- in SSC: estimate regression/IV coefficients accurately when you're considering weighting

David Price — Mon, 08 Dec 2025 15:21:00 GMT

This post is to introduce the new command -regoptwgt-, which is now available from the SSC.

Researchers often wonder whether to use weighted or unweighted estimators when they want accurate estimation but are unsure which estimator will be better. This new command uses maximum likelihood (LIML in the case of endogenous regressors) to select among a weighted, an unweighted, or an intermediate estimator, assuming uncorrelated error terms. For multi-equation models, it can use different weighting for each equation.

In many cases, it can easily replace commonly used commands. For example:

Code:

ssc install regoptwgt
set seed 0
clear
set obs 100
gen pop = 1/_n
gen z = rnormal()
gen xeta = rnormal()
gen x = z + rnormal() + .1/sqrt(pop)*(xeta + rnormal())
gen y = x + rnormal() + .1/sqrt(pop)*(xeta + rnormal())

reg y z [w=pop] // Commonly-used command
regoptwgt y z [w=pop] // This command can replace the one above it

ivregress 2sls y (x=z) [w=pop] // Commonly-used command
regoptwgt y (x=z) [w=pop] // This command can replace the one above it

For more information on the command, see:
https://davidjonathanprice.com/docs/...edasticity.pdf

That paper mainly details a common setting where -regoptwgt- substantially outperforms weighted or unweighted estimators: when observation accuracy is thought to roughly follow a power law, as when there is one observation per city or per firm. In that setting, weighted and unweighted estimators can fail to be consistent; and weighted estimators can fail to be asymptotically normal, have standard errors that are inconsistent and uninformative about estimators' standard deviations, and produce inaccurate inference in large samples.

Comments and suggestions on the Stata command (or the paper!) are, of course, appreciated.

csdid: ATT omitted

Harriet WANG — Mon, 08 Dec 2025 14:42:05 GMT

Hi all,

I am using csdid to estimate treatment effects in a staggered DID, and I am running into problems once I add control variables. My data are an unbalanced firm-year panel.

                         Obs       Mean   Std. Dev.        Min        Max
Independent Variables
  Treated                968     0.1384      0.3455     0.0000     1.0000
Dependent Variables
  Green Patent           532     1.1838      1.4638     0.0000     7.1017
Control Variables
  Firm Age               536     2.9363      0.2992     1.9459     3.8067
  Firm Size              541    22.8617      1.7382    19.3825    30.5711
  Leverage               541    -0.8723      0.5661    -3.0372     0.1113
  Liquidity              632    -0.6472      0.4451    -2.7989    -0.0285
  Shareholder Rate       508     3.5292      0.4721     1.9330     4.4129
(Treat=0: 834; Treat=1: 134)

My baseline specification is:

Code:

csdid ptotal size liq large age lev, ivar(firmid) time(year) gvar(firstyear) method(dripw) vce(cluster firmid)

In some specifications, the estimation results become almost empty, or the ATT estimates disappear for many (g,t) cells.

When I run estat simple, I do not get a single ATT estimate.

I wondered whether the issue could be due to the limited number of observations, but unfortunately, the dataset cannot be expanded for the core variables because of data availability constraints.

Is there any way to solve this problem?

Many thanks in advance for any guidance or suggestions.

Best,

Harriet

Help with National Travel Survey UK

Amy Plumb — Mon, 08 Dec 2025 14:36:02 GMT

I am working with Stata for the first time and I have been tasked with finding data on 'supercommuters'. I am working with data from the UK's National Travel Survey wave 6 dataset.

Basically, I have to find those commuters who have travelled for over 90 minutes (in the data, this shows up as 9 consecutive primary activities (pri) listed as 'travelling'). I have come across some issues that I do not understand how to solve.

- Respondents (mainid) may have two diary orders (diaryord), and I want to narrow this down to only one of their responses.
- I am trying to find those respondents who have travelled for 9 consecutive periods, but I am having difficulty understanding how to identify these individuals.

The time variable is tricky because each time period (pri = primary activity) is stored as its own variable.

- The values I am interested in are 111 to 116 [the ones labelled as Travelling].

- Each time unit is its own variable (e.g. pri1, pri2, pri3).

- Is there a way to find those individuals who have values in the range 111 to 116 for 9+ consecutive pri variables (e.g. pri1 to pri9, or pri112 to pri121)?

Any help in understanding this would be much appreciated. Thanks!
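
For readers facing the same layout, here is a minimal sketch of one way to approach both points, written against made-up data rather than the survey files. It assumes one row per diary, slot variables named pri1, pri2, and so on (only 20 here; substitute the real number of slots), and codes 111 to 116 for travelling as described above. The names run, maxrun and supercommuter are hypothetical, and keeping the lowest diaryord per respondent is just one possible rule.

```stata
* Toy data: 3 respondents with 2 diaries each, 20 time slots per diary
clear
set seed 12345
set obs 6
gen mainid   = ceil(_n/2)
gen diaryord = mod(_n - 1, 2) + 1
forvalues j = 1/20 {
    * roughly 70% of slots get a travelling code (111-116), the rest code 200
    gen pri`j' = cond(runiform() < 0.7, 111 + floor(6*runiform()), 200)
}

* Keep one diary per respondent (here: the lowest diaryord)
bysort mainid (diaryord): keep if _n == 1

* Walk through the slots in order, tracking the current and the longest run
gen run    = 0
gen maxrun = 0
forvalues j = 1/20 {
    replace run    = cond(inrange(pri`j', 111, 116), run + 1, 0)
    replace maxrun = max(maxrun, run)
}

* Flag respondents with at least 9 consecutive travelling slots
gen byte supercommuter = maxrun >= 9
list mainid diaryord maxrun supercommuter
```

The loop leaves the data in wide form, so no reshape is needed; a missing slot value fails the inrange() check and therefore breaks a run.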

JWDID estimation equation

Christiaan de Swardt — Mon, 08 Dec 2025 14:13:18 GMT

Dear Statalist Community,

I am interested in running some difference-in-differences models with one treatment group, multiple pre- and post-treatment periods, and nonlinear outcome variables. For this purpose, I want to use the 'jwdid' command, which is based on the extended two-way fixed effects approach developed by Wooldridge (2021, "Two-way fixed effects, the two-way Mundlak regression, and difference-in-differences estimators"; 2023, "Simple approaches to nonlinear difference-in-differences with panel data").

I implement it in Stata with the following command syntax:

Code:

jwdid y x1 x2, ivar(id) tvar(wave) gvar(treat) never method(logit) hettype(time)

My accompanying estimation equation is attached.

Has anyone worked with this DiD method and/or Stata package before? My question is simply whether, given my Stata syntax and data structure, my estimation equation is correctly specified, or whether I am missing anything important in the Stata syntax or estimation equation?

Thank you in advance for any feedback!

Testing instrument relevance for self-selection model with categorical nominal instrument and endogenous binary regressor

Felix Kaysers — Mon, 08 Dec 2025 13:41:43 GMT

I want to assess the three conditions required for instrumental variable estimation in a cross-sectional sample of individuals nested in regions (Bastardoz et al., 2023).

Data-generating process: I am interested in the effect of z on y, conditional on self-selection into treatment x, where y and x are binary variables and z is a categorical nominal variable.
Stata pseudocode:

Code:

eprobit y x, entreat(x=z)

Problem: To the best of my knowledge, the standard setup discussed in Bastardoz et al. (2023) is a linear regression in which both y and x are continuous. I have seen that, for a binary endogenous x, it is common to use linear IV regression to test for instrument relevance. However, I am uncertain how to test instrument relevance for a nominal instrument and whether such tests are available in Stata. For example, should I run a joint F-test on all the category dummies using linear IV regression, or is there a completely different test?
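
To make the joint-test idea concrete, the sketch below runs a linear first stage of x on the instrument's category dummies and tests them jointly. It uses simulated data, not the sample described here: the variable region, the four instrument categories and the coefficients are all invented for illustration. Whether this linear first-stage test is the appropriate relevance check for the eprobit setup remains the open question.

```stata
clear
set seed 2023
set obs 1000
gen region = ceil(_n/50)                 // 20 hypothetical regions
gen z = ceil(4*runiform())               // nominal instrument, categories 1-4
* binary treatment whose probability shifts with the category of z
gen x = (0.3*(z==2) + 0.6*(z==3) - 0.2*(z==4) + rnormal()) > 0

* linear first stage with one dummy per non-base category of z
regress x i.z, vce(cluster region)

* joint Wald (F) test that all category dummies are zero
testparm i.z
display "F = " r(F) ", p = " r(p)
```

With four categories the test has three numerator degrees of freedom, one per non-base dummy; clustering on region reflects the nesting mentioned above.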

References
Bastardoz, N., Matthews, M. J., Sajons, G. B., Ransom, T., Kelemen, T. K., & Matthews, S. H. (2023). Instrumental variables estimation: Assumptions, pitfalls, and guidelines. The Leadership Quarterly, 34(1), 101673. https://doi.org/10.1016/j.leaqua.2022.101673

Can I use orthogonal transformation to fix multicollinearity in FEM?

RASTI WIJAYANTI — Mon, 08 Dec 2025 06:42:18 GMT

Hello everyone. I am working with a panel dataset consisting of 219 observations (73 firms over 3 years). My model uses 5 independent variables (X1–X5) and 1 dependent variable (Y). I estimated the model using fixed effects (FEM).

When I checked multicollinearity using vif, I found that X1 and X5 have VIF values above 10, while the other variables are below the usual thresholds.

To address this, I orthogonalized X1 and X5 (to reduce their correlation with the other predictors). After orthogonalization, VIF values dropped to acceptable levels.

My questions:

Is orthogonalizing variables (like X1 and X5) acceptable practice in panel data models estimated via fixed effects?
After orthogonalization, should I still worry about multicollinearity when interpreting coefficients?
Is there a better approach to handle multicollinearity in panel data, besides orthogonalization (e.g., centering, dropping variables, or using ridge regression)?
Does the short time dimension (T=3) affect how VIF behaves in panel settings?

Any guidance on best practices in Stata for dealing with multicollinearity in panel FEM models would be very helpful.

Thank you.

First systematic benchmarking of LLMs on Stata

Khaled Eltokhy — Sun, 07 Dec 2025 22:59:16 GMT

It is no secret that many researchers and RAs have copy-pasted code from ChatGPT to Stata. It is also true that those same researchers often spend hours debugging "hallucinated" commands or syntax that simply doesn't run.

I created Stata Bench to measure which models are actually reliable for our work.

It is the first such benchmark systematically testing current models on real-world Stata tasks to see which ones generate runnable, accurate code.

The project covers:

A Leaderboard: Comparing pass rates on data cleaning and analysis tasks.
Fine-tuning: I am training a smaller model specifically on Stata syntax to reduce hallucinations.

You can see the results here: www.khaledeltokhy.com/benchmarks

Handling Heteroskedasticity and Autocorrelation in My Panel Regression Models

RASTI WIJAYANTI — Sun, 07 Dec 2025 15:37:02 GMT

Hello everyone,
I hope you are all doing well. I would like to ask for your guidance regarding an issue in my panel data analysis.

I am working with 219 observations (73 firms over 3 years). My variables include:

Y (firm value),
X1 (ESG),
X2 (ROA),
X3 (Size),
and FC as a mediation variable.

Because of the mediation, I have three regression models:

Model 1: Y on the independent variables — selected Fixed Effects
Model 2: Mediator (FC) on X — estimated with OLS
Model 3: Y on X and the mediator — selected Fixed Effects

I conducted the classical assumption tests using Model 3 because it contains all variables. I detected heteroskedasticity and autocorrelation, so I applied vce(robust).

My questions are:

Is using vce(robust) in this situation appropriate?
When I compared vce(cluster) to vce(robust), Models 1 and 3 show no change, but Model 2 shows different standard errors (although the direction of coefficients stays the same).
For a thesis, what is the correct reference or justification for choosing vce(cluster) instead of vce(robust)?

Thank you very much for your assistance.
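
One point that may explain the pattern in the second question, shown as a sketch on simulated data (the variables firm, year, u, x and y are made up and only mimic the 73 x 3 panel shape): for -xtreg, fe-, Stata treats vce(robust) as vce(cluster panelvar), so the two requests return the same standard errors, whereas for -regress- they are different estimators.

```stata
clear
set seed 7
set obs 219
gen firm = ceil(_n/3)                 // 73 firms
bysort firm: gen year = 2020 + _n     // 3 years each
xtset firm year
gen u = rnormal()
bysort firm (year): replace u = u[1]  // firm-level effect, constant over time
gen x = rnormal() + 0.5*u
gen y = 0.5*x + u + rnormal()

* Fixed effects: the two requests give identical standard errors
xtreg y x, fe vce(robust)
scalar se_fe_robust = _se[x]
xtreg y x, fe vce(cluster firm)
scalar se_fe_cluster = _se[x]

* Pooled OLS: robust and cluster-robust standard errors generally differ
regress y x, vce(robust)
scalar se_ols_robust = _se[x]
regress y x, vce(cluster firm)
scalar se_ols_cluster = _se[x]

display "FE:  " se_fe_robust "  " se_fe_cluster
display "OLS: " se_ols_robust "  " se_ols_cluster
```

This would be consistent with Models 1 and 3 (fixed effects) showing no change and Model 2 (OLS) showing different standard errors.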

Multi-level model with small number of groups at the highest level: ML with default SEs vs REML?

Sam Murgatroyd — Sun, 07 Dec 2025 10:09:56 GMT

Hello,

I have an unbalanced panel of country-level data: 909 observations from 180 countries. Data are available every two years over the period 2014-2024 (i.e., T=6). The data look as follows:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input double price_dispersion_use float TS_ce byte POWE double unem float(income region_id) int year
 44.44444444444444 1 18            18.055 3 4 2014
56.666666666666664 1 18            15.418 3 4 2016
              62.5 1 18            12.304 3 4 2018
 60.60606060606061 1 19             11.69 3 4 2020
                60 1 19            10.137 3 4 2022
                50 1 19             10.25 3 4 2024
 33.33333333333333 4 12            10.207 3 1 2014
35.714285714285715 6 12            10.202 3 1 2016
                15 6 13            12.137 3 1 2018
                50 6 13            14.057 3 1 2020
 48.57142857142857 3 13            12.346 3 1 2022
42.857142857142854 6 13            11.427 3 1 2024
 72.85714285714285 4 11               5.3 1 4 2014
 72.85714285714285 4 11               3.3 1 4 2016
 77.77777777777777 1 11               1.8 1 4 2018
 68.44993141289439 1 11               2.9 1 4 2020
 69.86301369863014 1 11               2.1 1 4 2022
 59.09090909090908 1 11               1.4 1 4 2024
                25 2 13             16.69 3 1 2020
                25 2 13            14.602 3 1 2022
 28.57142857142857 2 13            14.464 3 1 2024
                40 2 16 7.423938916311391 1 2 2024
41.935483870967744 2 18             7.268 3 2 2014
             37.75 2 18             8.085 3 2 2016
 45.34920634920635 2 18              9.22 3 2 2018
18.726114649681527 2 18            11.461 3 2 2020
13.384615384615383 2 18             6.805 3 2 2022
40.055248618784525 2 18             7.876 3 2 2024
                30 4 13            17.498 3 4 2014
26.666666666666668 4 14            17.617 3 4 2016
42.857142857142854 2 14            18.966 3 4 2018
              47.5 1 18            18.175 3 4 2020
 48.23529411764706 1 19            13.379 3 4 2022
                35 1 19            13.329 3 4 2024
 78.93318965517241 1 19             6.078 1 6 2014
 73.84341637010677 1 19             5.711 1 6 2016
 82.34126984126985 1 19               5.3 1 6 2018
 71.02189781021899 4 18             6.456 1 6 2020
 68.45524542829644 4 18             3.728 1 6 2022
 60.58098915241773 4 18             4.072 1 6 2024
 80.61224489795919 3 13             5.674 1 4 2014
                80 3 15             6.064 1 4 2016
                80 3 15             4.933 1 4 2018
 82.45614035087719 3 17             5.201 1 4 2020
 68.35820895522387 3 16             4.992 1 4 2022
 83.07692307692308 3 16             5.439 1 4 2024
                24 4 12              4.91 3 4 2014
             56.25 1 13                 5 3 4 2016
23.076923076923077 4 14               4.9 3 4 2018
 47.05882352941177 4 14              7.24 3 4 2020
 55.55555555555556 4 14              5.65 3 4 2022
 48.88888888888889 1 14             5.594 3 4 2024
48.658536585365916 1 10              13.8 1 2 2014
 40.22346368715088 1 10              12.7 1 2 2016
 61.08949416342412 1 10            12.027 1 2 2020
 81.76100628930817 1 10             8.463 1 2 2024
                40 2 15               1.2 1 3 2018
 34.78260869565218 2 15             1.781 1 3 2020
58.333333333333336 2 15             1.326 1 3 2022
58.333333333333336 2 15             1.102 1 3 2024
15.789473684210526 5 13             4.416 3 5 2014
22.727272727272727 5 15              4.35 3 5 2016
 33.33333333333333 5 15             4.407 3 5 2018
32.142857142857146 5 15             5.436 3 5 2020
                25 5 15             4.593 3 5 2022
30.864197530864196 5 15              4.68 3 5 2024
 79.32850559578671 1 13             12.17 1 2 2014
             81.25 1 13             8.247 1 2 2016
 45.23433385992628 1 16             8.322 1 2 2018
 78.84615384615384 1 16             8.365 1 2 2022
 79.98999499749875 1 15             7.529 1 2 2024
            35.625 4 14             5.902 3 4 2014
31.914893617021278 4 15             5.844 3 4 2016
30.645161290322577 4 15             4.763 3 4 2018
 25.71428571428572 4 15             4.049 3 4 2020
23.958333333333332 4 15             3.574 3 4 2022
 47.26027397260275 4 15             3.361 3 4 2024
 80.82901554404145 3 14             8.523 1 4 2014
 81.64556962025317 3 15              7.83 1 4 2016
 83.33333333333334 3 15             5.941 1 4 2018
 85.29411764705883 3 15             5.545 1 4 2020
              72.5 3 15              5.57 1 4 2022
 65.21739130434783 3 15             5.488 1 4 2024
 41.66666666666667 1 10                 7 3 2 2016
                40 1 10             7.896 3 2 2018
                50 1 10            10.784 3 2 2020
50.391644908616186 1 10             8.763 3 2 2022
 55.55555555555556 1 10                 7 3 2 2024
                20 2 15             1.784 3 1 2016
              22.5 2 15              1.41 3 1 2018
47.368421052631575 2 15             1.502 3 1 2020
                50 2 17             1.722 3 1 2024
35.714285714285715 2 16             2.021 3 2 2014
                24 5 16             3.498 3 2 2016
                40 4 14             3.519 3 2 2018
                50 4 18             3.552 3 2 2022
 36.40776699029126 4 16             3.091 3 2 2024
 67.44186046511628 3 10            27.517 3 4 2014
60.416666666666664 3 10            25.408 3 4 2016
 68.96551724137932 3 10              18.4 3 4 2018
end
label values TS_ce TS_ce_l
label def TS_ce_l 1 "1. specific uniform", modify
label def TS_ce_l 2 "2. advalorem uniform", modify
label def TS_ce_l 3 "3. mixed uniform", modify
label def TS_ce_l 4 "4. specific_tiered", modify
label def TS_ce_l 5 "5. advalorem tiered", modify
label def TS_ce_l 6 "6. mixed tiered", modify
label values income income_l
label def income_l 1 "1. High", modify
label def income_l 3 "3. Middle", modify
label values region_id region_id_l
label def region_id_l 1 "1. AFR", modify
label def region_id_l 2 "2. AMR", modify
label def region_id_l 3 "3. EMR", modify
label def region_id_l 4 "4. EUR", modify
label def region_id_l 5 "5. SEA", modify
label def region_id_l 6 "6. WPR", modify

I have run a two-level random intercepts model that includes two time-constant variables (income group and region_id), with standard errors clustered at the country level as shown below. This is my main model.

Code:

. mixed price_dispersion_use i.TS_ce unem POWE i.income i.region_id i.year || country: , vce(cluster country)

Performing EM optimization ...

Performing gradient-based optimization: 
Iteration 0:  Log pseudolikelihood = -3684.8293  
Iteration 1:  Log pseudolikelihood = -3684.8293  

Computing standard errors ...

Mixed-effects regression                             Number of obs    =    909
Group variable: country                              Number of groups =    180
                                                     Obs per group:
                                                                  min =      1
                                                                  avg =    5.0
                                                                  max =      6
                                                     Wald chi2(19)    = 485.11
Log pseudolikelihood = -3684.8293                    Prob > chi2      = 0.0000

                                       (Std. err. adjusted for 180 clusters in country)
---------------------------------------------------------------------------------------
                      |               Robust
 price_dispersion_use | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
----------------------+----------------------------------------------------------------
                TS_ce |
2. advalorem uniform  |  -12.79406   3.177566    -4.03   0.000    -19.02198    -6.56615
    3. mixed uniform  |  -.1719755   2.991038    -0.06   0.954    -6.034301     5.69035
  4. specific_tiered  |  -12.01681   3.243756    -3.70   0.000    -18.37445   -5.659162
 5. advalorem tiered  |  -14.82506   4.008941    -3.70   0.000    -22.68244   -6.967678
     6. mixed tiered  |  -12.34974   4.721909    -2.62   0.009    -21.60451   -3.094967
                      |
                 unem |  -.2082926   .1604439    -1.30   0.194    -.5227567    .1061716
                 POWE |   .3630399   .3414757     1.06   0.288    -.3062401     1.03232
                      |
               income |
              2. Low  |  -21.74503    3.60037    -6.04   0.000    -28.80162   -14.68843
           3. Middle  |  -15.07393   2.614105    -5.77   0.000    -20.19749   -9.950383
                      |
            region_id |
              2. AMR  |   7.216272   3.616945     2.00   0.046     .1271903    14.30535
              3. EMR  |    -3.9681   4.507714    -0.88   0.379    -12.80306    4.866857
              4. EUR  |   9.939152   4.166454     2.39   0.017     1.773052    18.10525
              5. SEA  |  -6.913486   5.509264    -1.25   0.210    -17.71144    3.884473
              6. WPR  |   6.218809   4.986658     1.25   0.212     -3.55486    15.99248
                      |
                 year |
                2016  |   1.136412   1.125618     1.01   0.313    -1.069759    3.342582
                2018  |   1.815113   1.429828     1.27   0.204    -.9872978    4.617523
                2020  |   2.493011   1.442856     1.73   0.084    -.3349358    5.320958
                2022  |   2.409131   1.480869     1.63   0.104    -.4933187    5.311581
                2024  |   3.506218    1.70807     2.05   0.040     .1584622    6.853975
                      |
                _cons |   60.35922   5.927152    10.18   0.000     48.74222    71.97623
---------------------------------------------------------------------------------------

------------------------------------------------------------------------------
                             |               Robust           
  Random-effects parameters  |   Estimate   std. err.     [95% conf. interval]
-----------------------------+------------------------------------------------
country: Identity            |
                  var(_cons) |   146.1695   23.78864      106.2493    201.0885
-----------------------------+------------------------------------------------
               var(Residual) |   135.2595   12.23187      113.2899    161.4894
------------------------------------------------------------------------------

. 
end of do-file

. estat ic

Akaike's information criterion and Bayesian information criterion

-----------------------------------------------------------------------------
       Model |          N   ll(null)  ll(model)      df        AIC        BIC
-------------+---------------------------------------------------------------
           . |        909          .  -3684.829      22   7413.659    7519.53
-----------------------------------------------------------------------------
Note: BIC uses N = number of observations. See [R] IC note.

. estat icc

Residual intraclass correlation

------------------------------------------------------------------------------
                       Level |        ICC   Std. err.     [95% conf. interval]
-----------------------------+------------------------------------------------
                     country |   .5193833   .0510441      .4198939    .6173588
------------------------------------------------------------------------------

I wanted to test the sensitivity of my results to accounting for higher levels of clustering in the data. To this end, I used the flexibility afforded by the mixed model and added another random intercept to account for additional levels of nesting. Specifically, I ran two three-level (random intercepts) models. In the first, I added random intercepts at both the region level and the country-within-region level (I excluded the region variable from the “fixed effects” component of the model for this specification). Results are shown below.

Code:

. mixed price_dispersion_use i.TS_ce unem POWE i.income i.year || region_id: || country:  

Performing EM optimization ...

Performing gradient-based optimization: 
Iteration 0:  Log likelihood = -3692.3964  
Iteration 1:  Log likelihood = -3692.3964  

Computing standard errors ...

Mixed-effects ML regression                             Number of obs =    909

        Grouping information
        -------------------------------------------------------------
                        |     No. of       Observations per group
         Group variable |     groups    Minimum    Average    Maximum
        ----------------+--------------------------------------------
              region_id |          6         34      151.5        302
                country |        180          1        5.0          6
        -------------------------------------------------------------

                                                        Wald chi2(14) = 175.87
Log likelihood = -3692.3964                             Prob > chi2   = 0.0000

---------------------------------------------------------------------------------------
 price_dispersion_use | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
----------------------+----------------------------------------------------------------
                TS_ce |
2. advalorem uniform  |  -13.26848    2.33598    -5.68   0.000    -17.84692    -8.69004
    3. mixed uniform  |   .0820774   2.364169     0.03   0.972    -4.551609    4.715763
  4. specific_tiered  |  -12.63683   2.414591    -5.23   0.000    -17.36934   -7.904314
 5. advalorem tiered  |  -15.75917   4.300878    -3.66   0.000    -24.18874   -7.329606
     6. mixed tiered  |  -12.74558   3.013626    -4.23   0.000    -18.65218    -6.83898
                      |
                 unem |  -.2302968   .1583598    -1.45   0.146    -.5406762    .0800826
                 POWE |   .3434487   .2773258     1.24   0.216    -.2000998    .8869972
                      |
               income |
              2. Low  |  -23.69504   4.213635    -5.62   0.000    -31.95362   -15.43647
           3. Middle  |  -15.98282   2.564233    -6.23   0.000    -21.00862   -10.95701
                      |
                 year |
                2016  |   1.107993   1.366686     0.81   0.418    -1.570663     3.78665
                2018  |   1.779171   1.420145     1.25   0.210    -1.004262    4.562603
                2020  |   2.426264   1.424773     1.70   0.089    -.3662399    5.218767
                2022  |   2.326682   1.457251     1.60   0.110    -.5294783    5.182841
                2024  |   3.404323    1.47028     2.32   0.021      .522627    6.286019
                      |
                _cons |   64.83817   5.434994    11.93   0.000     54.18578    75.49057
---------------------------------------------------------------------------------------

------------------------------------------------------------------------------
  Random-effects parameters  |   Estimate   Std. err.     [95% conf. interval]
-----------------------------+------------------------------------------------
region_id: Identity          |
                  var(_cons) |   19.51945   18.35721       3.08993    123.3066
-----------------------------+------------------------------------------------
country: Identity            |
                  var(_cons) |   153.1753   19.94595       118.672    197.7103
-----------------------------+------------------------------------------------
               var(Residual) |   135.2647   7.100675      122.0396    149.9229
------------------------------------------------------------------------------
LR test vs. linear model: chi2(2) = 361.60                Prob > chi2 = 0.0000

Note: LR test is conservative and provided only for reference.

. estat ic

Akaike's information criterion and Bayesian information criterion

-----------------------------------------------------------------------------
       Model |          N   ll(null)  ll(model)      df        AIC        BIC
-------------+---------------------------------------------------------------
           . |        909          .  -3692.396      18   7420.793   7507.415
-----------------------------------------------------------------------------
Note: BIC uses N = number of observations. See [R] IC note.

. estat icc

Residual intraclass correlation

------------------------------------------------------------------------------
                       Level |        ICC   Std. err.     [95% conf. interval]
-----------------------------+------------------------------------------------
                   region_id |   .0633832   .0563688      .0104144    .3032105
           country|region_id |   .5607712   .0399014      .4817034    .6368698
------------------------------------------------------------------------------

. 
end of do-file
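
As a point of comparison for the small number of regions, the sketch below fits the same kind of three-level random-intercept structure first by ML and then by REML with a Kenward-Roger degrees-of-freedom adjustment, using the reml and dfmethod(kroger) options of -mixed-. It runs on simulated data: the 6 regions, 30 countries per region, 6 waves, and all variances and coefficients are invented and are not estimates from the dataset above.

```stata
clear
set seed 2024
set obs 6
gen region_id = _n
gen u_region = rnormal(0, 4)            // region-level random intercept
expand 30
bysort region_id: gen country = (region_id - 1)*30 + _n
gen u_country = rnormal(0, 12)          // country-level random intercept
expand 6
bysort country: gen year = 2012 + 2*_n  // waves 2014, 2016, ..., 2024
gen x = rnormal()
gen y = 50 + 2*x + u_region + u_country + rnormal(0, 11)

* ML fit (the default) with large-sample z tests
mixed y x || region_id: || country:

* REML fit with Kenward-Roger small-sample degrees of freedom
mixed y x || region_id: || country:, reml dfmethod(kroger)
```

Note that dfmethod(kroger) requires REML and cannot be combined with vce(cluster), so it is a comparison for the three-level models rather than for the clustered two-level model.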

Second, I ran a model with random intercepts at both the income-group level and the country-within-income-group level. In this specification, I removed the income group variable from the "fixed effects" part of the model.

Code:

. mixed price_dispersion_use i.TS_ce unem POWE  i.region_id i.year || income: || id: 

Performing EM optimization ...

Performing gradient-based optimization: 
Iteration 0:  Log likelihood = -3690.9924  
Iteration 1:  Log likelihood = -3690.9924  

Computing standard errors ...

Mixed-effects ML regression                             Number of obs =    909

        Grouping information
        -------------------------------------------------------------
                        |     No. of       Observations per group
         Group variable |     groups    Minimum    Average    Maximum
        ----------------+--------------------------------------------
                 income |          3         98      303.0        498
                     id |        180          1        5.0          6
        -------------------------------------------------------------

                                                        Wald chi2(17) = 141.36
Log likelihood = -3690.9924                             Prob > chi2   = 0.0000

---------------------------------------------------------------------------------------
 price_dispersion_use | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
----------------------+----------------------------------------------------------------
                TS_ce |
2. advalorem uniform  |  -12.84855   2.354166    -5.46   0.000    -17.46263   -8.234469
    3. mixed uniform  |  -.1261898   2.394097    -0.05   0.958    -4.818533    4.566154
  4. specific_tiered  |  -12.12209   2.432416    -4.98   0.000    -16.88953    -7.35464
 5. advalorem tiered  |  -14.97557   4.313191    -3.47   0.001    -23.42927   -6.521872
     6. mixed tiered  |  -12.39842   3.032808    -4.09   0.000    -18.34262   -6.454229
                      |
                 unem |  -.2092789   .1600007    -1.31   0.191    -.5228745    .1043167
                 POWE |   .3769979     .27681     1.36   0.173    -.1655398    .9195356
                      |
            region_id |
              2. AMR  |   8.073306   3.468474     2.33   0.020     1.275223    14.87139
              3. EMR  |  -3.270375   4.061883    -0.81   0.421    -11.23152    4.690769
              4. EUR  |    11.1213    3.61721     3.07   0.002       4.0317     18.2109
              5. SEA  |  -6.254756   5.694607    -1.10   0.272    -17.41598    4.906468
              6. WPR  |   7.118405   3.968743     1.79   0.073    -.6601881      14.897
                      |
                 year |
                2016  |   1.127195   1.366599     0.82   0.409    -1.551289    3.805679
                2018  |   1.795766    1.42039     1.26   0.206    -.9881475     4.57968
                2020  |   2.475048   1.424955     1.74   0.082    -.3178118    5.267907
                2022  |    2.38783    1.45853     1.64   0.102     -.470836    5.246497
                2024  |   3.484633    1.47202     2.37   0.018     .5995274    6.369738
                      |
                _cons |   47.50363   6.870172     6.91   0.000     34.03834    60.96892
---------------------------------------------------------------------------------------

------------------------------------------------------------------------------
  Random-effects parameters  |   Estimate   Std. err.     [95% conf. interval]
-----------------------------+------------------------------------------------
income: Identity             |
                  var(_cons) |    70.4047   64.12905      11.81064    419.6911
-----------------------------+------------------------------------------------
id: Identity                 |
                  var(_cons) |   149.3076   19.26522      115.9447    192.2706
-----------------------------+------------------------------------------------
               var(Residual) |   135.2595   7.097993      122.0393     149.912
------------------------------------------------------------------------------
LR test vs. linear model: chi2(2) = 435.46                Prob > chi2 = 0.0000

Note: LR test is conservative and provided only for reference.

. estat ic

Akaike's information criterion and Bayesian information criterion

-----------------------------------------------------------------------------
       Model |          N   ll(null)  ll(model)      df        AIC        BIC
-------------+---------------------------------------------------------------
           . |        909          .  -3690.992      21   7423.985   7525.044
-----------------------------------------------------------------------------
Note: BIC uses N = number of observations. See [R] IC note.

. estat icc

Residual intraclass correlation

------------------------------------------------------------------------------
                       Level |        ICC   Std. err.     [95% conf. interval]
-----------------------------+------------------------------------------------
                      income |   .1983388   .1454561      .0395546    .5977974
                   id|income |    .618957   .0728953       .469873    .7485514
------------------------------------------------------------------------------

. 
end of do-file
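As a quick arithmetic check on the log above, the two ICCs reported by estat icc can be recovered from the three variance components in the random-effects table. This is only a sketch with the estimates typed in by hand (the local macro names are made up for illustration); it should give roughly 0.198 and 0.619.

```stata
* Variance components copied from the random-effects table above
local v_income = 70.4047     // var(_cons), income level
local v_id     = 149.3076    // var(_cons), id within income
local v_res    = 135.2595    // var(Residual)
local v_tot    = `v_income' + `v_id' + `v_res'

* Level-3 ICC: share of total variance between income groups
display "ICC(income)    = " %8.6f (`v_income' / `v_tot')

* Level-2 ICC: share of total variance between ids (income and id combined)
display "ICC(id|income) = " %8.6f ((`v_income' + `v_id') / `v_tot')
```

The wide confidence interval on the income-level ICC reflects that var(_cons) for income is estimated from only 3 groups.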

I am aware that I can’t specify vce(cluster country) in these three-level random intercept models (Stata shouts!), but I could have used “, robust”. However, I learned from the mixed helpfile that robust variances are clustered at the highest level in the multilevel model. This is no good for my data as my number of groups at the highest level is very small (3 income groups and 6 regions). So, at present, I have just used the default standard errors (as above).

Looking at the AIC and BIC statistics for my two-level and both of my three-level models, it is clear that the model with the random intercept for both income and country performs the worst. The comparison between the two-level random intercept model and the three-level random intercept model (|| region_id: || country: ) is ambiguous: AIC is lower for the two-level model, but BIC is lower for the three-level model. I can't run an LR test because Stata tells me the two-level and three-level random intercept models are not nested.
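For reference, the AIC and BIC in the estat ic table above follow directly from the log likelihood, the df, and N. The sketch below just redoes that arithmetic with the values typed in by hand; it should reproduce 7423.985 and 7525.044 up to rounding.

```stata
* Values copied from the -mixed- and -estat ic- output above
local ll = -3690.9924    // log likelihood
local k  = 21            // df reported by -estat ic-
local N  = 909           // number of observations

* AIC charges 2 per parameter; BIC charges ln(N) per parameter
display "AIC = " %9.3f (-2*(`ll') + 2*`k')
display "BIC = " %9.3f (-2*(`ll') + `k'*ln(`N'))
display "ln(N) = " %6.3f ln(`N')
```

With N = 909, BIC charges about 6.8 per parameter against 2 for AIC, so the two criteria can rank models differently when the models differ in their number of parameters.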

Fundamentally, I am concerned about whether it is okay to run a three-level model at all when the number of groups at the highest level is so low. The mixed helpfile talks about specifying , reml when the number of groups is small. However, I find this discussion confusing because I don’t believe that I have a small number of groups at level 2, but I certainly do at level 3. Does a small number of groups at the highest level alone still make the , reml option more appropriate than the default ML estimation?

Ultimately, my question has two parts. (1) Is it even sensible to add the third level of clustering given that the number of groups is so small? (2) If it is sensible to have this kind of robustness check, is it better to report the standard ML estimates with default SEs, or is it more correct to use the , reml specification?
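For concreteness, the REML variant referred to above could be requested along the following lines. This is a hedged sketch, not a recommendation: the fixed-effects list is read off the coefficient table in the log (17 terms, matching Wald chi2(17)), so treat the exact command as an assumption, and dfmethod(kroger) is an optional small-sample degrees-of-freedom adjustment that mixed only allows together with reml.

```stata
* Assumed re-fit of the three-level model shown in the log, now with REML.
* Variable names are taken from the output table; the stored name m_reml
* is hypothetical.
mixed price_dispersion_use i.TS_ce unem POWE i.region_id i.year ///
    || income: || id: , reml dfmethod(kroger)
estimates store m_reml

* Variance components and ICCs under REML, for comparison with the ML log
estat icc
```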

Thank you!

Sam

Package confreg - The confusion matrix estimated using regression models

Niels Henrik Bruun — Sun, 07 Dec 2025 03:46:50 GMT

Thanks to Kit Baum, a new package -confreg- is now available on SSC.

Description
The command -confreg- estimates sensitivity and specificity for a single modality by OLS regressing the binary values from the modality on the pathology using robust variance estimation.
The area under the ROC curve (AUC) is estimated here as the mean of sensitivity and specificity for the modality.
There are non-linear formulas for estimating the PPV, NPV, and accuracy using prevalence, sensitivity, and specificity (Bland, 2015, subsection 20.6).

To model more modalities, confreg stacks the values of each modality and the pathology and adds a categorical modality variable.
Using the stacked dataset, sensitivity and specificity are estimated by regressing the modality values on the pathology values and the categorical modality variable, with robust variance estimation.
If modalities are measured on the same patients, estimation uses random intercepts by ID.
The AUC, PPV, NPV, and accuracy are estimated from the prevalence, sensitivity, and specificity as described.

Examples (Stata example dataset)
A reviewer classified 109 tomographic images using a 5-point scale, from 1 = definitely normal to 5 = definitely abnormal.
Patients: 58 normal, 51 abnormal.

Data must be in long format.

Code:

. webuse hanley, clear
(Tomographic images)
. generate id = _n
. generate rating2 = rating >= 2
. generate rating3 = rating >= 3
. drop rating
. reshape long rating, i(id) j(point)
(j = 2 3)

Data                               Wide   ->   Long
-----------------------------------------------------------------------------
Number of observations              109   ->   218         
Number of variables                   5   ->   5           
j variable (2 values)                     ->   point
xij variables:
                        rating2 rating3   ->   rating
-----------------------------------------------------------------------------

. label define point 2 "rating 2" 3 "rating 3" 
. label values point point

Using -confreg- to report sensitivities, specificities, and AUCs at values 2 and 3:

Code:

. confreg disease rating point, id(id) vce(robust)

                                 |         N          p       [95%        CI] 
---------------------------------+-------------------------------------------
rating 2                         |                                           
           Sensitivity, P(TP|C+) |        51   94.11765   87.63018   100.6051 
           Specificity, P(TN|C-) |        58   56.89655   44.09288   69.70022 
              AUC, (sens+spec)/2 |       109    75.5071   68.33038   82.68382 
---------------------------------+-------------------------------------------
rating 3                         |                                           
           Sensitivity, P(TP|C+) |        51   90.19608   81.99713   98.39503 
           Specificity, P(TN|C-) |        58   67.24138   55.10703   79.37573 
              AUC, (sens+spec)/2 |       109   78.71873   71.39641   86.04104 
(results _se_sp_auc are active now)
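The rating 2 point estimates above can be cross-checked with a plain regression, following the idea in the Description: regress the binary test result on true disease status. This is a minimal sketch of that idea, not the package's internal code, and it ignores the stacking and random intercepts, so only the point estimates (not the confidence intervals) should be expected to agree.

```stata
webuse hanley, clear
generate byte rating2 = rating >= 2      // test positive at cut-off 2

* OLS of the binary test result on disease status, robust variance
regress rating2 i.disease, vce(robust)

* Sensitivity = P(test+ | disease+) = constant + disease coefficient
lincom _cons + 1.disease

* Specificity = P(test- | disease-) = 1 - constant
display "Specificity = " %7.5f (1 - _b[_cons])
```

Because the only regressor is binary, the fitted values are just the two group means, which is why the regression returns the sensitivity and one minus the specificity.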

Report all the accuracy measures:

Code:

. matlist r(confreg), tw(32)

                                 |         N          p       [95%        CI] 
---------------------------------+-------------------------------------------
                                 |                                           
                Prevalence, C+/N |         .   .4678899          .          . 
---------------------------------+-------------------------------------------
rating 2                         |                                           
           Sensitivity, P(TP|C+) |        51   94.11765   87.63018   100.6051 
           Specificity, P(TN|C-) |        58   56.89655   44.09288   69.70022 
              AUC, (sens+spec)/2 |       109    75.5071   68.33038   82.68382 
            Accuracy, P(TP + TN) |        73   74.31193   66.85336   81.77049 
                   PPV, P(TP|P+) |        36   65.75342   58.88674    72.6201 
                   NPV, P(TN|P-) |       109   91.66667   83.06838    100.265 
---------------------------------+-------------------------------------------
rating 3                         |                                           
           Sensitivity, P(TP|C+) |        51   90.19608   81.99713   98.39503 
           Specificity, P(TN|C-) |        58   67.24138   55.10703   79.37573 
              AUC, (sens+spec)/2 |       109   78.71873   71.39641   86.04104 
            Accuracy, P(TP + TN) |        65   77.98165    70.4712    85.4921 
                   PPV, P(TP|P+) |        44   70.76923   62.87928   78.65918 
                   NPV, P(TN|P-) |       109   88.63636   80.01908   97.25365

The sensitivities are highly correlated, likewise for the specificities.
Note that sensitivities and specificities are uncorrelated.

Code:

. matlist r(se_sp_auc_corr), tw(32)

                                 | rating 2                        | rating 3                       
                                 | Sensiti~)  Specifi~)  AUC, (s~2 | Sensiti~)  Specifi~)  AUC, (s~2 
---------------------------------+---------------------------------+--------------------------------
rating 2                         |                                 |                                
           Sensitivity, P(TP|C+) |         1          .          . |         .          .          . 
           Specificity, P(TN|C-) |         .          1          . |         .          .          . 
              AUC, (sens+spec)/2 |  .4519802   .8920279          1 |         .          .          . 
---------------------------------+---------------------------------+--------------------------------
rating 3                         |                                 |                                
           Sensitivity, P(TP|C+) |  .7582875          .    .342731 |         1          .          . 
           Specificity, P(TN|C-) |         .   .8019208   .7153357 |         .          1          . 
              AUC, (sens+spec)/2 |  .4245351   .6644611   .7845994 |  .5598603    .828587          1

Compare estimates:

Code:

. matlist r(confreg), tw(32)

                                 |         N          p       [95%        CI] 
---------------------------------+-------------------------------------------
                                 |                                           
                Prevalence, C+/N |         .   .4678899          .          . 
---------------------------------+-------------------------------------------
rating 2                         |                                           
           Sensitivity, P(TP|C+) |        51   94.11765   87.63018   100.6051 
           Specificity, P(TN|C-) |        58   56.89655   44.09288   69.70022 
              AUC, (sens+spec)/2 |       109    75.5071   68.33038   82.68382 
            Accuracy, P(TP + TN) |        73   74.31193   66.85336   81.77049 
                   PPV, P(TP|P+) |        36   65.75342   58.88674    72.6201 
                   NPV, P(TN|P-) |       109   91.66667   83.06838    100.265 
---------------------------------+-------------------------------------------
rating 3                         |                                           
           Sensitivity, P(TP|C+) |        51   90.19608   81.99713   98.39503 
           Specificity, P(TN|C-) |        58   67.24138   55.10703   79.37573 
              AUC, (sens+spec)/2 |       109   78.71873   71.39641   86.04104 
            Accuracy, P(TP + TN) |        65   77.98165    70.4712    85.4921 
                   PPV, P(TP|P+) |        44   70.76923   62.87928   78.65918 
                   NPV, P(TN|P-) |       109   88.63636   80.01908   97.25365

What if the population prevalence is 0.25 instead of the sample prevalence of .4678899?
(Changes in Accuracy, PPV, and NPV)

Code:

. qui confreg disease rating point, id(id) vce(robust) prevalence(0.25)

. matlist r(confreg), tw(32)

                                 |         N          p       [95%        CI] 
---------------------------------+-------------------------------------------
                                 |                                           
                Prevalence, C+/N |         .        .25          .          . 
---------------------------------+-------------------------------------------
rating 2                         |                                           
           Sensitivity, P(TP|C+) |        51   94.11765   87.63018   100.6051 
           Specificity, P(TN|C-) |        58   56.89655   44.09288   69.70022 
              AUC, (sens+spec)/2 |       109    75.5071   68.33038   82.68382 
            Accuracy, P(TP + TN) |        73   66.20183   56.46307   75.94058 
                   PPV, P(TP|P+) |        36   42.12438   34.69007   49.55868 
                   NPV, P(TN|P-) |       109   96.66858   93.04368   100.2935 
---------------------------------+-------------------------------------------
rating 3                         |                                           
           Sensitivity, P(TP|C+) |        51   90.19608   81.99713   98.39503 
           Specificity, P(TN|C-) |        58   67.24138   55.10703   79.37573 
              AUC, (sens+spec)/2 |       109   78.71873   71.39641   86.04104 
            Accuracy, P(TP + TN) |        65   72.98005   63.65132   82.30879 
                   PPV, P(TP|P+) |        44    47.8565   38.33883   57.37417 
                   NPV, P(TN|P-) |       109   95.36519    91.5837   99.14668
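The prevalence-adjusted point estimates for rating 2 can be checked by hand with the standard formulas that combine prevalence, sensitivity, and specificity. The sketch below types in the values from the table (as proportions) and uses made-up local macro names; it should give about 0.662, 0.421, and 0.967, in line with the Accuracy, PPV, and NPV rows above.

```stata
* Inputs copied from the rating 2 rows above, as proportions
local se   = 0.9411765   // sensitivity
local sp   = 0.5689655   // specificity
local prev = 0.25        // population prevalence supplied via prevalence(0.25)

* Accuracy = P(TP) + P(TN)
display "Accuracy = " %8.5f (`se'*`prev' + `sp'*(1 - `prev'))

* PPV = P(TP) / P(test+)
display "PPV      = " %8.5f (`se'*`prev' / (`se'*`prev' + (1 - `sp')*(1 - `prev')))

* NPV = P(TN) / P(test-)
display "NPV      = " %8.5f (`sp'*(1 - `prev') / (`sp'*(1 - `prev') + (1 - `se')*`prev'))
```

Sensitivity, specificity, and the AUC do not involve the prevalence, which is why those rows are unchanged between the two tables.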

Enjoy

Simpson's Paradox in PPML? Negative estimates for individual Sections vs. Positive pooled estimates (Sign reversal across multiple groups)

Jianli Ding — Sat, 06 Dec 2025 05:44:06 GMT

Hello Statalist,

I am estimating the effect of industrial agglomeration (Location Quotient, spec_lq) on new firm entry (new_entrants) using ppmlhdfe (Stata 17).
I am observing a consistent sign reversal (Simpson's Paradox) between my separate Section-level regressions and my pooled group regressions. I am looking for advice on the statistical mechanisms behind this reversal and how to properly handle/report it.

1. Data Structure & Industry Hierarchy

Unit of Analysis: Fine-grained spatial grids × 2-digit Industry × Year.
Hierarchy: The data follows a standard classification system where 2-digit Divisions are nested within broad 1-digit Sections. (e.g., Section I contains divisions 63, 64, and 65).
Grouping Strategy: I constructed aggregated groups (e.g., "High-Tech Services") by selecting specific 2-digit divisions. Note: For Section I, all its constituent 2-digit divisions are classified as High-Tech.

2. The Empirical Puzzle & Model Specification

To diagnose this, I use the following PPML specification controlling for Grid, Sub-industry, and Year fixed effects.

Stata Code:

* Controlling for Grid, Sub-industry (hydm2), and Year FEs; Clustering at Grid level
ppmlhdfe new_entrants $final_ivs if reg_group == "Your_Group_Filter", absorb(grid_id hydm2 year) vce(cluster grid_id)

The Conflict:

Result A (Section-Level Regressions): When I run the code above separately for any single Producer Service Section (e.g., restricting sample to Section I alone, or Section M alone), the coefficient for spec_lq is consistently negative or insignificant.
Result B (Pooled Group Regressions): When I pool these codes into the "High-Tech Service" group (using the exact same FE structure: absorb(grid_id hydm2 year)), the coefficient flips to positive and highly significant.

Robustness: This sign reversal also occurs for the "General Producer Service" group (flipping from negative/null to positive when pooled).

3. Linearity Check

I tested for non-linearity by adding a quadratic term (c.spec_lq##c.spec_lq) to the pooled High-Tech model. The squared term is negative but statistically insignificant, suggesting the positive linear effect dominates in the pooled specification and the result is not driven by a simple inverted-U shape.

4. Robustness Check (Interaction Model)

To investigate slope heterogeneity, I ran a pooled interaction model on the full dataset (ppmlhdfe ... c.spec_lq##i.Group_ID). Using lincom, the calculated slope for the "High-Tech Service" group remains positive and significant (+0.156).
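Written out, the interaction check described above might look like the sketch below. It is an illustration under assumptions: the other controls in $final_ivs are omitted, and Group_ID == 2 is a hypothetical code for the "High-Tech Service" group.

```stata
* Pooled model with group-specific slopes for spec_lq
* (main effects of Group_ID are expected to be absorbed by the hydm2 FEs)
ppmlhdfe new_entrants c.spec_lq##i.Group_ID, ///
    absorb(grid_id hydm2 year) vce(cluster grid_id)

* Slope for one group = base slope + that group's interaction term
* (2 is a hypothetical value of Group_ID)
lincom spec_lq + 2.Group_ID#c.spec_lq
```

Note that in this pooled model the grid, sub-industry, and year fixed effects are shared across all groups, whereas each Section-level regression estimates its own set, so the two slopes are identified from different variation.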

5. My Questions

Mechanisms: What are the likely econometric reasons for this sign reversal in a PPML HDFE setting? Is the positive pooled estimate driven by the change in the fixed-effect structure (absorbing average grid quality vs. industry-specific grid quality)?
Validity: Given that all constituent sections show negative congestion effects individually, is the positive pooled coefficient a valid measure of "Cluster Sorting Benefits," or is it simply an aggregation bias?

Outputs: Regression Comparison

Model (1) restricts sample to Section I (Info/Software) only.

Model (2) restricts sample to Section M (Science/R&D) only.

Model (3) is the pooled "High-Tech Service" group.

---------------------------------------------------------
                       (1)         (2)         (3)
                    Section_I   Section_M   Pooled_HT
                    (Subset)    (Subset)    (Aggregated)
---------------------------------------------------------
spec_lq             -0.114**    -0.073      0.156***
---------------------------------------------------------
Grid FE               Yes         Yes         Yes
Ind FE (2-digit)      Yes         Yes         Yes
Year FE               Yes         Yes         Yes
---------------------------------------------------------
* Standard errors clustered by Grid

Thank you for your advice.

Fine and Gray competing risk long or wide format

Kim Vaarts — Fri, 05 Dec 2025 22:56:04 GMT

Do I need to set my data in long or wide format to conduct the Fine & Gray model with competing risks?
I have longitudinal data with 10 years of follow-up. The outcome is a response variable measured several times a year, and the independent variable is medicine type; a patient can receive different medicine types within a year.

I can only find wide-format dataset examples for Stata online. Can anyone please provide a link to an example of a Fine and Gray analysis with data in long format? A Stata example, please.

New package crosswalk-countries

Ulrich Kohler — Fri, 05 Dec 2025 17:33:55 GMT

Do you have a dataset where countries are encoded using, say, ISO 3166 three-digit numeric codes, and you want to produce a table of EU member states as of, say, 1995—in, say, Chinese? This turns out to be quite easy with Ben Jann’s excellent crosswalk package and my own suite of crosswalk tables.

Assuming the required packages are installed, here is an example of how this works. First, let me load some example data where countries are encoded using ISO 3166 three-digit numeric codes. The file kountry.dta from Raciborski’s kountry package provides exactly this.

Code:

. use _ISO3N_ using http://www.stata-journal.com/software/sj10-4/dm0038_1/kountry.dta, clear

Now, using crosswalk with crosswalk-countries, we can generate two variables that record the year each country entered the EU and, where relevant, the year it exited:

Code:

. crosswalk join = cntry.iso3n_to_eu(_ISO3N_ 1)
. crosswalk quit = cntry.iso3n_to_eu(_ISO3N_ 2)

These variables allow us to create an indicator for EU membership in a given year. I’ll use 1995, as in the earlier example, but any other year could be used as well:

Code:

.  gen eu = join <= 1995 & quit > 1995
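The one-line indicator relies on how Stata orders missing values: a numeric missing counts as larger than any number. Assuming quit is missing for countries that never left and join is missing for countries that never joined, the toy example below (hypothetical data, years for illustration only) shows the four cases.

```stata
clear
input str12 country join quit
"Germany"  1958    .
"UK"       1973 2020
"Croatia"  2013    .
"Norway"      .    .
end

* quit > 1995 is true when quit is missing (never left);
* join <= 1995 is false when join is missing (never joined)
generate byte eu1995 = join <= 1995 & quit > 1995
list, noobs
```

Under these assumptions Germany and the UK are flagged as members in 1995, while Croatia (joined later) and Norway (never joined) are not.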

Finally, we need to convert the numeric country codes into a string variable in Chinese. For this, we can again use crosswalk together with crosswalk-countries:

Code:

. crosswalk name = cntry.iso3n_to_name_zh(_ISO3N_)

And here's the table:

Code:

. tab name if eu

                                   name |      Freq.     Percent        Cum.
----------------------------------------+-----------------------------------
                                   丹麦 |          1        6.67        6.67
                                 卢森堡 |          1        6.67       13.33
             大不列颠及北爱尔兰联合王国 |          1        6.67       20.00
                                 奥地利 |          1        6.67       26.67
                                   希腊 |          1        6.67       33.33
                                   德国 |          1        6.67       40.00
                                 意大利 |          1        6.67       46.67
                                 比利时 |          1        6.67       53.33
                                   法国 |          1        6.67       60.00
                                 爱尔兰 |          1        6.67       66.67
                                   瑞典 |          1        6.67       73.33
                                   芬兰 |          1        6.67       80.00
                               荷兰王国 |          1        6.67       86.67
                                 葡萄牙 |          1        6.67       93.33
                                 西班牙 |          1        6.67      100.00
----------------------------------------+-----------------------------------
                                  Total |         15      100.00

You can use other languages (Arabic, English, German, French, Spanish, Russian), other organizations (African Union, International Court of Justice, OECD, etc.), geographical regions, and of course many other country coding systems.

Installation of (most recent) crosswalk by Ben Jann:

Code:

. ssc install crosswalk

Installation of the crosswalk tables for countries is a bit more cumbersome, since the number of files exceeds what Stata's package management can handle. I recommend:

Code:

. cd "`c(sysdir_personal)'"
. net from https://gitup.uni-potsdam.de/ukohler/crosswalk-countries/-/raw/main/
. net get crosswalk-countries
. unzipfile crosswalk_countries_cwfcn.zip, replace

This installs the package into the personal Ado directory. You may want to choose another directory if you don't want the package for your entire account.