  • Double Bootstrap DEA (Simar & Wilson, 2007)

    Dear members,

I would like some help with a model I have decided to use. I still have some doubts, have not found much on the forum, and hope someone can read this and guide me.

I attach the main data and the do-file to this post. I want to perform the Simar & Wilson (2007) double bootstrap to analyse the efficiency of schools (simarwilson). I installed it as well as "ftruncreg".

I have 4 outputs, 5 inputs, and 9 environmental variables (some of them dummies). Only those 9 enter as independent variables. The model is input-oriented with variable returns to scale, using algorithm 2, meaning the efficiency scores are computed internally.

I restrict the sample to a single year. I added the following options: twosided reps(1000) bcreps(100) invert tebc(eff_vrs_o) level(95) dots

My first questions concern the code; the rest concern the interpretation of the output. I read the help file but still have some doubts. I added twosided because notwosided is not recommended with algorithm 2. Given that the command sets nounit internally, I did not add that option and let the model run.

I added invert, so the estimated efficiency scores are inverted: larger efficiency scores indicate greater inefficiency in the input-oriented model (which is what I have).

- Does the truncated regression consider only the environmental variables as independent, or are all variables (hence also outputs and inputs) included as regressors?

- Should the estimates after the second bootstrap loop (the bias-corrected estimates from the truncated regression) be inverted, or are those values already the final ones?

- Once I add my last input (school_size), the bootstrap takes a long time (more than 2 hours) and Stata somehow stops responding. What happens? Does anyone have an explanation for it?

- The Stata output shows "inefficient if eff_vrs_o > 1". Using summarize on the bias-corrected scores, all values, including the minimum, are above 1. This confuses me a bit because, unless I misread the help file, "...for (regular) scores within (0,1], the default (twosided) is to use a two-sided truncated regression model and to sample from the two-sided truncated normal distribution. With twosided, the procedure hence considers that input-oriented (Farrell) efficiency scores are not only less than or equal to 1 but also strictly positive..."

I would really appreciate help from anyone who has used or worked with this model.

    Thank you


  • #2
    Dear Simona,
just a quick answer to your questions regarding simarwilson (see the log file below, which illustrates some of my points).

1. Unlike regression analysis, DEA does not involve the concepts of dependent and independent variables. Inputs and outputs do not enter the truncated regression, so there is no reason to consider them independent variables; indeed, since inefficiency affects the quantities of inputs and outputs, they cannot (all) be independent.

    2. If you specify the invert option, simarwilson does the inversion (switching from Farrell to Shephard efficiency) internally, and there is no need to manually transform the efficiency scores - provided the Shephard measure is really what you want.

3. I added school_size as a third input to the model. With all other code unchanged, the run took 108 seconds on my laptop (see below). Hence, I have no idea what causes the excessive run time on your machine.

    4. Getting scores that are consistently greater than one is just the consequence of specifying the invert option (and calculating bias-corrected scores that do not take the value of exactly one). If you simply remove "invert" from your code, you will get estimated scores that are all within the unit interval. The "notwosided" option is irrelevant to the DEA, and thus to the scores that the DEA yields, but only affects the parametric bootstrap of the truncated regression.
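To make the inversion in points 2 and 4 concrete: Farrell and Shephard input efficiency are simply reciprocals of one another. A minimal sketch in Python (illustrative only, not the simarwilson internals; the numbers are hypothetical):

```python
def farrell_to_shephard(score):
    """Invert a Farrell input efficiency score in (0, 1] to a Shephard score >= 1."""
    assert 0 < score <= 1, "Farrell input-oriented scores lie in (0, 1]"
    return 1.0 / score

# A DMU that could proportionally shrink its inputs to 80% of observed levels:
farrell = 0.8
shephard = farrell_to_shephard(farrell)  # 1.25: inputs exceed the frontier level by 25%
```

This is why dropping invert yields scores within the unit interval, while keeping it yields scores that are all greater than (or equal to) one.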

I hope my answer is of at least some value to you.
    Best wishes,
    Harald

    Code:
    ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    name: <unnamed>
    log: C:\...\Response_SimonaFerraro_statalist.log
    log type: text
    opened on: 21 Feb 2024, 12:37:10
    
. * Double Bootstrap DEA (Simar & Wilson, 2007)
    .
    . * simarwilson conducts the internal efficiency analysis using Alg 2.
    . * If the DEA is carried out internally, simarwilson internally sets nounit (inefficiency if eff score < 1) and I do not need to add it.
    .
. * With invert (Shephard rather than Farrell efficiency measure), all estimated efficiency scores are inverted; scores larger than one indicate inefficiency in the input-oriented model.
    . * notwosided is not recommended with algorithm 2
    .
    . ** Set seed to ensure replicability
    . set seed 19023892
    
    .
    . ** Load Data
    . use Main_data.dta, clear
    
    .
. ** Original code of Simona Ferraro
. simarwilson (matemaatika eesti_keel continuing_studies reverse_dropout = teacher_training teacher_qualification ) median_income keel_oige typee municipality state linnakool maakool tallinn tartu if year_==2020, algorithm(2) twosided unit rts(vrs) base(in) reps(1000) bcreps(100) invert tebc(eff_vrs_o) level(95) dots
    
    Bootstrap (bias correction) replications (100)
    ----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
    .................................................. 50
    .................................................. 100
    
    Bootstrap (conf. intervals) replications (1000)
    ----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
    .................................................. 50
    .................................................. 100
    .................................................. 150
    .................................................. 200
    .................................................. 250
    .................................................. 300
    .................................................. 350
    .................................................. 400
    .................................................. 450
    .................................................. 500
    .................................................. 550
    .................................................. 600
    .................................................. 650
    .................................................. 700
    .................................................. 750
    .................................................. 800
    .................................................. 850
    .................................................. 900
    .................................................. 950
    .................................................. 1000
    
    Simar & Wilson (2007) eff. analysis Number of obs = 348
    (algorithm #2) Number of efficient DMUs = 0
    Number of bootstr. reps = 1000
    Wald chi2(9) = 50.96
    inefficient if eff_vrs_o > 1 Prob > chi2(9) = 0.0000
    
    ------------------------------------------------------------------------------
    Data Envelopment Analysis: Number of DMUs = 348
    Number of ref. DMUs = 348
    input oriented (Shephard) Number of outputs = 4
    variable returns to scale Number of inputs = 2
    bias corrected efficiency measure Number of reps (bc) = 100
    
    ------------------------------------------------------------------------------
    | Observed Bootstrap Percentile
    inefficiency | Coef. Std. Err. z P>|z| [95% Conf. Interval]
    -------------+----------------------------------------------------------------
    eff_vrs_o |
    median_inc~e | -.0076518 .0023666 -3.23 0.001 -.0123481 -.0031095
    keel_oige | .0015061 .0009857 1.53 0.127 -.0002973 .003461
    typee | -.0049645 .0336034 -0.15 0.883 -.0656221 .0655787
    municipality | .1804942 .0592178 3.05 0.002 .0724297 .3010025
    state | .1370724 .2032406 0.67 0.500 -.2836536 .531626
    linnakool | .2078946 .0966011 2.15 0.031 .0324536 .4192952
    maakool | .0385492 .0962633 0.40 0.689 -.138435 .248432
    tallinn | .2810482 .0944552 2.98 0.003 .1110525 .484972
    tartu | .1959864 .1033036 1.90 0.058 .004681 .4265907
    _cons | 1.459573 .1722798 8.47 0.000 1.083181 1.767136
    -------------+----------------------------------------------------------------
    /sigma | .2476732 .0098829 25.06 0.000 .2254373 .2642594
    ------------------------------------------------------------------------------
    
    .
    . ** Descriptive statistics for the bias-corrected efficiency scores
    . sum eff_vrs_o
    
    Variable | Obs Mean Std. dev. Min Max
    -------------+---------------------------------------------------------
    eff_vrs_o | 348 1.705392 .263333 1.063821 2.270815
    
    .
    . ** Add school_size as additional input variable
    . cap drop eff_vrs_o
    
    . timer clear 1
    
    . timer on 1
    
    . simarwilson (matemaatika eesti_keel continuing_studies reverse_dropout = teacher_training teacher_qualification school_size) /*
> */ median_income keel_oige typee municipality state linnakool maakool tallinn tartu if year_==2020, algorithm(2) twosided unit rts(vrs) base(in) reps(1000) bcreps(100) invert tebc(eff_vrs_o) level(95) dots
    
    Bootstrap (bias correction) replications (100)
    ----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
    .................................................. 50
    .................................................. 100
    
    Bootstrap (conf. intervals) replications (1000)
    ----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
    .................................................. 50
    .................................................. 100
    .................................................. 150
    .................................................. 200
    .................................................. 250
    .................................................. 300
    .................................................. 350
    .................................................. 400
    .................................................. 450
    .................................................. 500
    .................................................. 550
    .................................................. 600
    .................................................. 650
    .................................................. 700
    .................................................. 750
    .................................................. 800
    .................................................. 850
    .................................................. 900
    .................................................. 950
    .................................................. 1000
    
    Simar & Wilson (2007) eff. analysis Number of obs = 348
    (algorithm #2) Number of efficient DMUs = 0
    Number of bootstr. reps = 1000
    Wald chi2(9) = 81.44
    inefficient if eff_vrs_o > 1 Prob > chi2(9) = 0.0000
    
    ------------------------------------------------------------------------------
    Data Envelopment Analysis: Number of DMUs = 348
    Number of ref. DMUs = 348
    input oriented (Shephard) Number of outputs = 4
    variable returns to scale Number of inputs = 3
    bias corrected efficiency measure Number of reps (bc) = 100
    
    ------------------------------------------------------------------------------
    | Observed Bootstrap Percentile
    inefficiency | Coef. Std. Err. z P>|z| [95% Conf. Interval]
    -------------+----------------------------------------------------------------
    eff_vrs_o |
    median_inc~e | -.0059043 .0023091 -2.56 0.011 -.0104083 -.0015038
    keel_oige | .0011902 .0009883 1.20 0.228 -.0007657 .0032026
    typee | -.0457486 .0332903 -1.37 0.169 -.1085907 .0182037
    municipality | .1840001 .0585082 3.14 0.002 .0686036 .3069073
    state | .1015493 .1937717 0.52 0.600 -.3137163 .4680878
    linnakool | .2137572 .0911967 2.34 0.019 .043333 .3927926
    maakool | -.0189099 .0912014 -0.21 0.836 -.2001557 .1545485
    tallinn | .2527479 .0904463 2.79 0.005 .07103 .424427
    tartu | .1791965 .0978454 1.83 0.067 -.0097767 .3851756
    _cons | 1.501424 .168919 8.89 0.000 1.170017 1.834643
    -------------+----------------------------------------------------------------
    /sigma | .2453481 .0094185 26.05 0.000 .2230933 .2604626
    ------------------------------------------------------------------------------
    
    . timer off 1
    
    . timer list 1
    1: 107.81 / 1 = 107.8120
    
    .
    . ** Same as the original code just WITHOUT option INVERT
    . cap drop eff_vrs_o
    
. simarwilson (matemaatika eesti_keel continuing_studies reverse_dropout = teacher_training teacher_qualification ) median_income keel_oige typee municipality state linnakool maakool tallinn tartu if year_==2020, algorithm(2) twosided unit rts(vrs) base(in) reps(1000) bcreps(100) tebc(eff_vrs_o) level(95) dots
    
    Bootstrap (bias correction) replications (100)
    ----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
    .................................................. 50
    .................................................. 100
    
    Bootstrap (conf. intervals) replications (1000)
    ----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
    .................................................. 50
    .................................................. 100
    .................................................. 150
    .................................................. 200
    .................................................. 250
    .................................................. 300
    .................................................. 350
    .................................................. 400
    .................................................. 450
    .................................................. 500
    .................................................. 550
    .................................................. 600
    .................................................. 650
    .................................................. 700
    .................................................. 750
    .................................................. 800
    .................................................. 850
    .................................................. 900
    .................................................. 950
    .................................................. 1000
    
    Simar & Wilson (2007) eff. analysis Number of obs = 348
    (algorithm #2) Number of efficient DMUs = 0
    Number of bootstr. reps = 1000
    inefficient if eff_vrs_o < 1 Wald chi2(9) = 61.58
    twosided truncation Prob > chi2(9) = 0.0000
    
    ------------------------------------------------------------------------------
    Data Envelopment Analysis: Number of DMUs = 348
    Number of ref. DMUs = 348
    input oriented (Farrell) Number of outputs = 4
    variable returns to scale Number of inputs = 2
    bias corrected efficiency measure Number of reps (bc) = 100
    
    ------------------------------------------------------------------------------
    | Observed Bootstrap Percentile
    efficiency | Coef. Std. Err. z P>|z| [95% Conf. Interval]
    -------------+----------------------------------------------------------------
    eff_vrs_o |
    median_inc~e | .0022289 .0007628 2.92 0.003 .0007148 .0038124
    keel_oige | -.0008111 .0002959 -2.74 0.006 -.001378 -.0002127
    typee | -.0014844 .0104352 -0.14 0.887 -.0212229 .0199318
    municipality | -.060399 .0178857 -3.38 0.001 -.0979019 -.0261534
    state | -.2299003 .0614881 -3.74 0.000 -.3518595 -.104871
    linnakool | -.0594728 .0273929 -2.17 0.030 -.1110802 -.0069498
    maakool | -.0139055 .0277085 -0.50 0.616 -.0674768 .0412561
    tallinn | -.1040808 .0282378 -3.69 0.000 -.1574813 -.0491706
    tartu | -.0839063 .0321504 -2.61 0.009 -.144252 -.0207181
    _cons | .6742883 .0544838 12.38 0.000 .57141 .7807546
    -------------+----------------------------------------------------------------
    /sigma | .0805208 .0029744 27.07 0.000 .0734268 .0849844
    ------------------------------------------------------------------------------
    
    .
    . ** Descriptive statistics for the bias-corrected efficiency scores
    . sum eff_vrs_o
    
    Variable | Obs Mean Std. dev. Min Max
    -------------+---------------------------------------------------------
    eff_vrs_o | 348 .5496639 .0878639 .2806361 .896414
    
    . log close
    name: <unnamed>
log: C:\...\Response_SimonaFerraro_statalist.log
    log type: text
    closed on: 21 Feb 2024, 12:39:46
    ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    Last edited by Harald Tauchmann; 21 Feb 2024, 04:57.



    • #3
Dear Harald Tauchmann,
      I want to use the Simar and Wilson two-stage efficiency approach to estimate government spending efficiency for 40 countries, using one input and three output variables, over the period 2000–2023.
      I followed the code from this paper, but their example only shows results for 2014.
      My question:
      Should I first collapse the dataset (e.g., take averages over the years) and then run something like this:
      teradial exp_gdp = education health infrastructure if inst_quality < . & Inflation < . & GDP_PPC < ., te(te_vrs_1) rts(v) base(o) noprint
      sum te_vrs_1 if e(sample) /* te_vrs_1 = conventional DEA efficiency scores */

      tobit te_vrs_1 inst_quality Inflation GDP_PPC, ll(1) nolstretch vsquish
      qui margins, dydx($g_list) predict(ystar(1,.)) post
      estimates store tobit

      truncreg te_vrs_1 inst_quality Inflation GDP_PPC, ll(1) nolstretch vsquish
      qui margins, dydx($g_list) predict(e(1,.)) post
      estimates store truncreg

      simarwilson te_vrs_1 inst_quality Inflation GDP_PPC, reps(2000)
      qui margins, dydx($g_list) post
      estimates store alg_1
      simarwilson (exp_gdp = education health infrastructure) inst_quality Inflation GDP_PPC if Year == 2000, alg(2) rts(v) base(o) reps(2000) bcreps(1000) tebc(tebc_vrs_1)
      estimates store alg_2_raw
      qui margins, dydx($g_list) post
      estimates store alg_2
      sum tebc_vrs_1 /* tebc_vrs_1 = truncated bootstrap DEA efficiency scores */

      Or should I instead run the estimation year by year for 2000, 2001, …, 2023?
      If I go year by year, how can I summarize the regression results?

      Also, do you have any suggestions for panel data (time-series cross-sectional) efficiency analysis?



      • #4
        Dear Gabriel:
Well, what you should do is first of all consider what you want and which assumptions you are willing to make. Nevertheless:
(i) I would generally not recommend collapsing/averaging the data before running a DEA. DEA estimates a frontier, not a conditional mean. Averaging would therefore introduce bias, because any single observed input-output combination is potentially informative about what is technically possible.
(ii) If you pool the data over the years and run a DEA on the pooled sample, you assume that the production technology is constant over time. This is probably a rather strong assumption, given that you use data for 24 years. So I would regard separate DEAs for each year as more plausible, even though this approach does not exploit the fact that the technology will only change gradually over time.
(iii) If you want to run the DEA year by year but still want a single regression in the second-stage analysis - that is, a single uniform coefficient attached to, say, "Inflation" - this is technically straightforward. See the code below (I cannot tell whether it runs, since I do not have your data). Whether this makes sense as an economic model is, however, a different question. I would probably add year fixed effects to the right-hand side of the truncated regression (see below). Adding country fixed effects may also make sense from an economic perspective; however, I would not be surprised if this caused technical problems.
(iv) See also the discussions on Statalist regarding the Simar & Wilson (2007) estimator with panel data.
        Best wishes,
        Harald

        Code:
gen te_vrs_1 = .
forvalues yy = 2000(1)2023 {
    teradial exp_gdp = education health infrastructure if inst_quality < . & Inflation < . & GDP_PPC < . & year == `yy', te(te_vrs_1_tmp) rts(v) base(o) noprint
    replace te_vrs_1 = te_vrs_1_tmp if year == `yy'
    cap drop te_vrs_1_tmp
}
        simarwilson te_vrs_1 inst_quality Inflation GDP_PPC i.year, reps(2000)



        • #5
Dear Harald Tauchmann, thank you very much; it works.
I ran two DEA specifications as per your suggestion:
Year-by-year DEA: I perform a separate DEA for each year and then use a single regression in the second-stage analysis.
Pooled DEA: I pool the data over all years and run a DEA on the pooled sample:
          simarwilson (exp_gdp = education_index_rescaled health_index_rescaled infrastructure_index_rescaled) ///
          lnpercapita c.inst_qlty_ndex##c.hdindex debtgdp tradegdp cpi_inf taxgdp i.year i.id, ///
          alg(2) rts(v) base(o) reps(2000) bcreps(1000) tebc(tebc_vrs_1)

          I prefer to stick with the first approach because the regression results from the pooled DEA appear inflated compared to the year-by-year DEA.
          I have a couple of additional questions:
          Variable scaling in DEA:
          Most of my variables are on different scales. To address this, I initially used z-standardization, but that produced negative values. Since the DEA model requires positive inputs and outputs, I then rescaled the variables using min-max normalization.
          What would you recommend in this case? Should I use the original variables with different scales, or continue with the rescaling procedure?
Health index from PCA as an output (I use the same approach for the other outputs as well):
          I am using one input (government spending / GDP ratio) and three outputs (health, education, and infrastructure). The health index is derived using PCA from different health indicators. Here is the code I used:
* Mortality rate is a negatively oriented indicator, so I convert it into a survival rate
          gen infant_survival = 1000 / infant_mr
          label variable infant_survival "Survival rate, infant (per 1,000 live births)"
          *** PCA for health_index
          pca z_infant_survival z_lifeexp z_immun, components(1)
          factortest z_infant_survival z_lifeexp z_immun
          estat kmo
          predict health_index, score
          // Rescale health_index
          summarize health_index
          gen health_min = r(min)
          gen health_max = r(max)
          gen health_index_rescaled = (health_index - health_min) / (health_max - health_min) + 0.01


          Could you please advise if using the PCA-derived index as an output is appropriate in this DEA setup?
          Thank you very much for your guidance.
          Best



          • #6
            Dear Gabriel,
In principle, running a DEA does not require rescaling the input and output measures, since DEA is scale-invariant (in a proportional sense). Bear in mind that a DEA models a production technology; therefore, your input and output variables should be measured in a way that allows them to be interpreted as quantities of inputs and outputs. This is problematic if one of your outputs is 'health' (derived from three health measures using PCA), since it is unclear which scale to use for measuring health. One could argue that a scalar variable 'health' only exists in theory, not as a one-dimensional variable that can be observed in the real world. So why not use the three original health measures separately as outputs?
I also have serious doubts regarding your suggested data transformation health_index_rescaled = (health_index - health_min) / (health_max - health_min) + 0.01. By applying it, you normalize the smallest observed health value to zero, which is problematic in a DEA context, and then add a positive constant whose value seems arbitrarily chosen. (Its choice may have a substantial impact on the results your analysis yields.)
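The arbitrariness of the added constant can be illustrated with hypothetical numbers (a Python sketch; the values are made up, not taken from the attached data):

```python
def rescale(values, shift):
    """Min-max normalization followed by adding a constant, as in the post above."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) + shift for v in values]

health = [2.0, 3.0, 4.0]   # hypothetical PCA scores for three DMUs
a = rescale(health, 0.01)  # the smallest DMU gets exactly the constant, 0.01
b = rescale(health, 0.10)  # same data, different constant

# The relative output of DMU 2 versus DMU 1 depends entirely on the constant:
ratio_a = a[1] / a[0]      # roughly 51
ratio_b = b[1] / b[0]      # roughly 6
```

Since the rescaled values are not proportional rescalings of the originals, the choice of constant changes the relative standing of the DMUs, which is the sense in which it can drive the results.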
            Best,
            Harald



            • #7
Thanks, Harald Tauchmann, for the brief explanation.
My challenge is that I do not have disaggregated data on health expenditure; I only have data on total government expenditure, which covers all three sectors: health, education, and infrastructure. Because of this, I plan to use one input (total government expenditure) and three outputs (one per sector).
              For each sector, I want to construct indicators as follows:
              Health:
              * Immunization, measles (% of children ages 12–23 months)
              * Mortality rate, infant (per 1,000 live births) [I would use the inverse]
              * Life expectancy at birth, total (years)

              Education:
              * Pupil–teacher ratio, primary
              * Literacy rate, adult total (% of people ages 15 and above)
              * School enrollment, primary (% gross)
              * School enrollment, secondary (% gross)

              Infrastructure:
              * Mobile cellular subscriptions (per 100 people)
              * People using at least basic drinking water services (% of population)
              * Access to electricity (% of population)

              My main questions are:
              1. Can I use 10 outputs in DEA? As I mentioned earlier, I have 40 countries (DMUs).
              2. Some indicators are measured per 1,000, while others are percentages. Because of these scale differences, I am considering using Principal Component Analysis (PCA) to reduce each group of indicators into a single index (one for health, one for education, and one for infrastructure).
I have taken note of your point that "DEA is scale-invariant (in a proportional sense)".
              Thanks again
              Gabriel



              • #8
                Dear Gabriel:
(i) Considering ten outputs and one input when running a DEA with only 40 DMUs is technically possible. The number of DMUs only needs to be greater than the number of outputs + inputs. However, in such a small sample setting, the finite sample bias inherent in DEA will be quite substantial. (If you increase the number of inputs and outputs, at some point [almost] all DMUs will produce fully efficiently according to the efficiency scores DEA yields [particularly if you specify variable returns to scale].)
(ii) Get rid of all measurements in percentages and transform them to variables measured in levels. It does not matter whether you measure per 1000, per million, etc., but your variables must not be measured relative to something that varies across DMUs (e.g. the number of inhabitants). If you 'feel' that the different sizes of the countries should be taken into account, and that they are only 'comparable' in terms of per capita values, you may consider specifying constant returns to scale in the DEA. (However, according to the DEA model, DMUs are comparable if all inputs and outputs are considered, and if these are measured using absolute scales that are consistent across DMUs.)
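The percentage-to-level transformation suggested in (ii) can be sketched as follows (Python; the variable names and figures are hypothetical, not from Gabriel's data):

```python
def pct_to_level(pct, population):
    """Convert a '% of population' indicator into a head count (a level variable)."""
    return pct / 100.0 * population

# Hypothetical DMU (country): half the population has access to electricity.
electricity_pct = 50.0
population = 5_000_000
electricity_level = pct_to_level(electricity_pct, population)  # 2,500,000 people
```

Measured this way, inputs and outputs are on absolute scales that are consistent across DMUs; country size can then be handled through the returns-to-scale assumption rather than through per-capita normalization.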
                Best, Harald
