  • #16
    Clyde Schechter Thanks for the help. Yes, I had changed the code: I'd used it on other databases and made some changes in the process, and when I came back to use it again, it gave me that error message.
    Still, there is a problem here that I would like your advice on, if possible. The current database I'm using (the one in #14) is about the taxes people pay according to their revenues. However, some people (those whose revenue doesn't exceed a certain amount; the poor classes, if you want to call them that) pay no taxes: the variable "Impot_Payé" for those people is 0, even though they do have a revenue. The problem these people create when I apply the code is that dividing "Impot_Payé" by "Revenu_Année" is a division of 0 by a positive value, and that gives me a result that's a bit weird to analyse. Here's the result:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float bracket long id_num double Revenu_Année float(Impot_Payé effective_tax_rate) double total_income
       1  11293 327.03265967772097  74.21466  3.3798416  3693179.825740503
      10 105704 2061.5124999654527  31.88242  .02203993  217910117.2963482
      20 108870 3529.9655197542434  57.90738 .016248617  384307346.1356445
      40 225867   8787.90451409199   728.749  .06950825 1984897628.8844154
      60 225867   16971.8567085875  2620.085  .15248114  3833382359.198533
      80 225868  23487.14526649053  4180.503   .1777457  5304994527.051683
      90 112933 27809.098754913826  5194.637    .186663  3140564949.688683
      95  56469   35072.8734448177  6954.365   .1980317 1980530090.5554106
      99  45171  46170.90434383673  10492.78  .22471543  2085585920.115449
    99.9  10164  107625.7937056832 27840.965  .25883132  1093908567.224564
     100   1129  465566.7259929141  120939.2  .26597884      525624833.646
    end
    The problem is that, for the first class, it gives me a "wrong" "effective_tax_rate". I guess the "gen effective_tax_rate = Impot_Payé/Revenu_Année" command in the code created some distortion for that first class, since it had to take 0 in paid taxes and divide it by a positive revenue for that first, poor class.

    Is there a way to correct this, please? I do want to show the first class, and I would expect those people to have a really small "effective_tax_rate", close to zero, since they pay almost nothing in taxes relative to the small revenues they earn. Yet the code gives me an effective_tax_rate of about 337% for that first class, and that doesn't make sense.

    Any help please? With many thanks!



    • #17
      I see the anomalous result in what you show. But finding where the problem lies requires seeing the data from which those results were generated, along with the exact code that created them.

      I can tell you that when you divide 0 by a positive number you get 0; there is no problem associated with that. You may be confusing that with dividing by 0. But even that would just lead to a missing value, not an anomalously large result. The code shown in #12 of this thread should not produce this kind of problem. If you used it unmodified, I am quite confident that the problem arises from some erroneous data or data that just isn't what you expected it to be.
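
      A quick illustration you can run yourself (the numbers here are arbitrary):
      Code:
      * 0 divided by a positive number is simply 0
      display 0/327.03
      * dividing by 0 yields missing (.), not an anomalously large value
      display 74.21/0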

      Anyway, nothing more can be said without seeing the data and exact code that created these results.



      • #18
        Clyde Schechter Thanks for the help!

        Here's the data on which I've applied the code:

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input double Revenu_Année float Impot_Payé
        24015.379  3933.683
        42206.545  8434.434
         36889.06  6932.468
        26109.679  4336.566
        39681.416  7729.908
        14385.029  2004.793
        12904.748  1737.222
        14070.842  2079.853
        11686.995  1559.949
        13509.571  1858.438
        13506.436  1966.681
        29359.533   6183.24
        15564.943   2423.95
        24029.455  4661.007
        25140.046  4758.284
        14385.029  2059.021
        24805.308  3282.636
        21922.531  4111.383
        25267.761  4808.718
        30787.104  5680.851
        35680.101   7038.82
        42282.843  8149.774
        41347.412  8236.804
         43959.38  6236.476
        46051.024   9474.11
        41367.373  8273.574
        23090.514  3763.933
         43895.24  8518.322
        33677.781  4495.589
        30403.981   5257.65
        26020.927  4970.615
        23343.019  4331.012
        39376.535  6940.421
          23614.8    4039.6
        43191.675  8642.883
        22990.661  3832.333
        20275.437  3504.201
        56839.433 13338.104
        22727.017  4038.028
        21925.397  3812.582
        17463.539  2840.013
        17638.807  2813.337
        23433.331  4227.581
        24772.533  4616.985
        42997.801  8674.304
        42884.521  8994.131
        11704.479  1532.016
        23390.023  4305.743
        12352.238  1686.569
        26754.953  4635.622
        28581.334  4941.036
        29817.422  5400.736
        46083.094  9511.203
        33551.781  6353.342
        14695.325  2248.476
        23837.271  4595.424
         16364.11  1768.941
        14462.969  2118.728
        13629.119  1859.865
        13480.118  1961.048
        15225.171  2270.185
        25082.852    4702.7
        14436.699   1362.31
        14147.463  2003.094
        24736.987  4581.159
        21994.715  3833.441
        26076.943  4980.064
        29916.472  5291.995
        37155.938  7305.427
          26833.7  4531.231
        34362.458  6636.015
        41451.511  8198.047
         37614.57  7069.975
        21163.171  3528.521
        28564.741   5408.19
        32265.449  5778.196
         34556.51  6682.951
        30326.373  5451.877
        22364.445  3907.091
        22327.509  3955.969
         18349.26  3082.684
        24882.902  4554.667
        25433.153  4229.585
        41274.971  8100.187
        52445.719 11519.258
        28943.441  5034.317
        40466.357  8113.134
        32317.757  5882.521
         25738.39  4526.666
        28380.104  5033.999
        31892.922  5742.724
         44198.93  8860.961
        34875.191  6648.993
        41518.507  8156.467
        42724.921  7274.216
        40438.143   7952.28
        18711.021  3163.753
        24772.533  4733.406
         31029.76  5578.427
        42017.476  8412.897
        end
        It contains the Revenu (first variable) and the Tax Paid (second variable) for about 1,129,000 people; it is a fairly extensive dataset.

        Here's the code I've used, based on your earlier suggestions:

        Code:
        assert !missing(Revenu_Année)
        centile Revenu_Année, centile(1 10 20 40 60 80 90 95 99 99.9)
        label define bracket 1 "Bottom Centile" 2 "Bottom Decile" 3 "Bottom Quintile" 4 "Second Quintile" 5 "Middle Quintile" 6 "Fourth Quintile" 7 ">80th to 90th Percentile" 8 "> 90th to 95th Percentile" 9 "> 95th to 99th Percentile" 10 "> 99th to 99.9th Percentile" 11 "> 99.9th Percentile"
        gen bracket:bracket = .
        local centiles `r(centiles)'
        forvalues i = 1/`r(n_cent)' {
            replace bracket = `i' ///
                if Revenu_Année <= `r(c_`i')' & missing(bracket)
        }
        replace bracket = `r(n_cent)' + 1 if missing(bracket)

        gen effective_tax_rate = (Impot_Payé/Revenu_Année)*100
        egen id_num = group(id)

        collapse (count) id_num (mean) Revenu_Année Impot_Payé effective_tax_rate (sum) total_income = Revenu_Année, by(bracket)

        And here's the result of that code:

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input float bracket long id_num double Revenu_Année float(Impot_Payé effective_tax_rate) double total_income
         1  11293 327.03265967772154  74.21466  337.9842  3693179.825740509
         2 105704 2061.5124999654536  31.88242  2.203993  217910117.2963483
         3 108870 3529.9655197543693  57.90738 1.6248617  384307346.1356582
         4 225867  8787.904514089849   728.749  6.950825 1984897628.8839319
         5 225867 16971.856708587457  2620.085 15.248115  3833382359.198523
         6 225868 23487.145266491487  4180.503 17.774572  5304994527.051899
         7 112933 27809.098754914205  5194.637   18.6663  3140564949.688726
         8  56469 35072.873444818266  6954.365  19.80317 1980530090.5554426
         9  45171 46170.904343836875  10492.78  22.47154 2085585920.1154554
        10  10164 107625.79370568323 27840.965 25.883133 1093908567.2245643
        11   1129  465566.7259929141  120939.2 26.597883      525624833.646
        end
        label values bracket bracket
        label def bracket 1 "Bottom Centile", modify
        label def bracket 2 "Bottom Decile", modify
        label def bracket 3 "Bottom Quintile", modify
        label def bracket 4 "Second Quintile", modify
        label def bracket 5 "Middle Quintile", modify
        label def bracket 6 "Fourth Quintile", modify
        label def bracket 7 ">80th to 90th Percentile", modify
        label def bracket 8 "> 90th to 95th Percentile", modify
        label def bracket 9 "> 95th to 99th Percentile", modify
        label def bracket 10 "> 99th to 99.9th Percentile", modify
        label def bracket 11 "> 99.9th Percentile", modify
        The problem is that the effective tax rate for the first social class doesn't seem logical: the value is 337%, and since the people in that class are the poorest, they should be paying the smallest effective tax rate.

        I hope this explanation is clear. Please feel free to ask if you need any more clarification about my problem.

        With many thanks!



        • #19
          OK, thanks. The data example and code do not reproduce the results you show, for two reasons. One is that the example does not contain the variable id, which is used in the second-to-last command. The other is that it is only a small subset of the full data. (Given the size of the data set, where bracket 1 alone contains 11,293 observations according to your output, posting all of it here isn't feasible.)

          Nevertheless, eliminating the creation of variable id_num and its use in -collapse- from the code (id_num serves no purpose other than to provide a count of the number of people in each bracket), we already get a hint of the problem.
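          For concreteness, a sketch of that modified -collapse- call (assuming everything before it in the code from #18 runs unchanged):
          Code:
          * as in #18, but with the (count) of id_num removed
          collapse (mean) Revenu_Année Impot_Payé effective_tax_rate ///
              (sum) total_income = Revenu_Année, by(bracket)
          The actual output is then: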
          Code:
          . list, noobs clean abbrev(20)
          
                bracket   Revenu_Année   Impot_Payé   effective_tax_rate   total_income  
              Bottom Ce      11686.995     1559.949             13.34773      11686.995  
              Bottom De      13256.113     1853.865             13.96183      119305.01  
              Bottom Qu      15462.162     2190.975             14.11427      154621.62  
              Second Qu      22412.097     3960.457             17.64229      448241.94  
              Middle Qu      26566.913     4783.463              17.9877      531338.26  
              Fourth Qu      34339.965     6322.521             18.36276       686799.3  
               >80th to      41665.792     8134.953             19.53242      416657.92  
              > 90th to      43385.723     8213.223             18.95004      216928.62  
              > 95th to      47194.692     9841.383              20.8061      188778.77  
              > 99th to      56839.433      13338.1             23.46629      56839.433
          The hint is that the effective tax rate for the bottom bracket, which you expect to be close to zero, is of the order of magnitude of 13%. In the output you show it is much higher still, at about 337%. It isn't really possible to say much about these results. First of all, the bottom bracket value here is still lower than that of the second bracket, and by an amount that is fairly typical of the difference between consecutive brackets. But in the example data, the output for the bottom bracket comes from just a single original observation!
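
          If you want to check how many observations feed each bracket in your own data, a simple tabulation (run after bracket is created, before -collapse-) will show the counts:
          Code:
          tabulate bracket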

          The code looks correct to me, so I think the problem is in the data. It is most likely that the values of Revenu_Année or Impot_Payé (or both) are incorrect in one or more observations. If you have even just a few observations where the value of Revenu_Année is too low or that of Impot_Payé is too high, the effective tax rate calculated for that person will be falsely elevated, and a few extreme observations like that can pull up the average so the final results look like what you see. Here's what I suggest you do to try to find the source of the error(s).
          Code:
          assert !missing(Revenu_Année)
          * get the income cutoffs that define the brackets
          centile Revenu_Année, centile(1 10 20 40 60 80 90 95 99 99.9)
          label define bracket 1 "Bottom Centile" 2 "Bottom Decile" 3 "Bottom Quintile" 4 "Second Quintile" 5 "Middle Quintile" 6 "Fourth Quintile" 7 ">80th to 90th Percentile" 8 "> 90th to 95th Percentile" 9 "> 95th to 99th Percentile" 10 "> 99th to 99.9th Percentile" 11 "> 99.9th Percentile"
          gen bracket:bracket = .
          local centiles `r(centiles)'
          * assign each observation to the first bracket whose cutoff covers it
          forvalues i = 1/`r(n_cent)' {
              replace bracket = `i' ///
              if Revenu_Année <= `r(c_`i')' & missing(bracket)
          }
          replace bracket = `r(n_cent)' + 1 if missing(bracket)
          
          gen effective_tax_rate = (Impot_Payé/Revenu_Année)*100
          * keep only the bottom bracket, sorted from highest to lowest rate
          keep if bracket == 1
          gsort -effective_tax_rate
          browse
          This will show you the observations from bracket 1 in decreasing order of effective tax rate. So starting from the top you will see the individual observations with the highest calculated individual effective tax rate. At the very least, the first of these will be an order of magnitude or more higher than expected. Perhaps some others are as well. Then you can look at their recorded values for Revenu_Année and Impot_Payé, and it will likely be obvious which of these values is (are) incorrect. You will then have to correct that data and restart your analysis from the beginning.
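
          If you prefer a non-interactive look, something along these lines will list the largest rates (the choice of 20 observations is arbitrary):
          Code:
          * after the -gsort- above, the largest rates are at the top
          list Revenu_Année Impot_Payé effective_tax_rate in 1/20, noobs abbrev(20)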

          It would not surprise me to find that the problem arises either from a misplaced decimal point or from a digit transposition involving something like 19 vs 91 in leading digits.
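
          One quick screen for that kind of error, as a sketch (it treats tax paid in excess of income as a red flag, which is an assumption about your data):
          Code:
          * observations where recorded tax exceeds recorded income
          count if Impot_Payé > Revenu_Année & !missing(Impot_Payé)
          list Revenu_Année Impot_Payé if Impot_Payé > Revenu_Année & !missing(Impot_Payé)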



          • #20
            Clyde Schechter I do understand that. I've used that last code, and I got a table showing observations for individuals who have a low revenue but a tax that is really close to that revenue. That means they are in the poorest social class, yet if we calculate their effective tax rate, it will be high. I understand that.

            Is there a way to just delete those observations from the main dataset?



            • #21
              Well, as a Stata question, yes, of course there is a way to delete them. That's what the -drop- command does. If you're not sure exactly how to use it, read -help drop-, and also read the PDF manual entry for the drop command (which you can open using the link at the top of the -help drop- screen).
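
              For instance, a minimal sketch (the cutoff used here is an arbitrary placeholder, not a recommendation):
              Code:
              * drop observations whose computed rate exceeds a chosen cutoff;
              * 100 (percent) is just a placeholder for whatever rule you can justify
              drop if effective_tax_rate > 100 & !missing(effective_tax_rate)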

              But, don't do that yet. You should first try to investigate those values. There is a slight chance that they are actually correct, and, if so, they should be left alone. And if you can determine for sure that they are errors, you might also be able to get corrected information to overwrite these values in the data set. If, after looking into the matter, you cannot actually determine whether these are errors, or what the correct values are, then you can -drop- those observations. But also be sure that you can explain and defend your decisions about which ones to drop. And, to be sure you don't forget what those decisions were and the reasons for them, document it all as comments in your do-file.
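
              A hypothetical sketch of what that documentation might look like in a do-file (the dates, id values, and figures are invented placeholders):
              Code:
              * 2024-05-10 (hypothetical): data provider confirmed a misplaced decimal
              * point in Revenu_Année for id 12345; correcting per their email
              replace Revenu_Année = 24015.4 if id == 12345
              * 2024-05-12 (hypothetical): provider could not verify these observations;
              * dropping them (see correspondence in project notes)
              drop if inlist(id, 23456, 34567)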

              How you go about investigating this problem depends on how you came to have this data set. Typically there is an original source data set that was given to you, followed by some importation and data management that you did. You should start by tracing back these anomalous observations to the original source data set. If you see that the same errors are found in the source data set, then you should contact whoever provided it to you, call that person's attention to the implausibility of those particular observations, and ask that person to look into it and provide corrections if they are errors, or confirmation that they are correct. If you don't find the same anomalies in the original source data set, then you need to go through all of the data management steps you used to find the mistakes that introduced these errors. (And, it is entirely possible that when you do that you will also discover other errors that need to be fixed.) Then correct those mistakes, and re-run the data management from scratch to create a new, corrected data set.
