Distribution of taxation by "xtiles"

Aziz Essouaied

Join Date: Apr 2020
Posts: 182

Distribution of taxation by "xtiles"

18 Aug 2024, 08:54

Hello Stata community;

I'am working with some fiscal data about individual tax filings, here's the data:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input double(revenu_net_global revenu_net_imposable impt_d)
        0         0        0
        0         0        0
        0         0        0
        0         0        0
        0         0        0
 15285751  14786000  2544360
  5751606   5452000   117520
  4320000   4320000   300000
 12620394  12121000  1851460
  2000000   2000000   300000
        0         0   300000
  1500000   1200000        0
        0         0        0
   800000    420000   300000
        0         0   300000
        0         0   300000
  2000000   1700000        0
        0         0   300000
  3600000   3200000   300000
  6200000   5900000   300000
        0         0   300000
  1000000   1000000   300000
        0         0   300000
  2720000   2420000   300000
  2000000   2000000   300000
  1000000   1000000   300000
        0         0        0
  3000000   2400000   300000
        0         0   300000
  6056669   5557000   300000
        0         0   300000
  2500000   2500000        0
        0         0        0
        0         0   300000
  1000000   1000000   300000
        0         0   300000
        0         0        0
        0         0        0
        0         0        0
        0         0        0
        0         0        0
        0         0   300000
        0         0   300000
        0         0        0
  5000000   4700000        0
        0         0        0
        0         0   300000
  1300000   1000000   300000
  4000000   3700000        0
  7500000   6900000   494000
  2625427   2326000   300000
  6510678   6211000   314860
  4320000   3820000        0
 21620000  21620000  4353600
  4000000   3700000        0
 84372350  84373000 25130550
        0         0   300000
 93500000         0        0
  5558000   5058000    15080
        0         0        0
        0         0        0
        0         0   300000
        0         0   300000
 65951600         0        0
  9000000   8500000   910000
        0         0        0
        0         0        0
        0         0   300000
        0         0        0
  1445767    946000        0
        0         0   300000
  6000000   6000000   300000
        0         0   300000
        0         0   300000
        0         0   300000
        0         0        0
        0         0        0
        0         0        0
        0         0   300000
  6000000   5400000   300000
  6945156   6646000   427960
 10896027  10597000  1455220
140111625 133947000 42481450
        0         0   300000
 74449275  74150000 21552500
198813514 191597000 62658950
 15450000  14950000  2587000
 38758617         0   300000
        0         0   300000
 24300000         0        0
 12000000  11990000  1817400
        0         0        0
  9712438   9413000  1147380
        0         0   300000
        0         0   300000
1.890e+08         0        0
        0         0   300000
105674393  34825000  8244000
        0         0   300000
 13600000  13300000  2158000
end

The first variable is called "revenu_net_global", which is basically the total net revenu of that individual before taxation.
The second variable is "revenu_net_imposable", which is the taxable revenu after considering forms of deductions, in other words, the individual will pay taxes on that taxable revenu, not on his total net revenu
The third variable is "impt_d", which is basically the monetary amount that that individual paid (or should pay), so basically, it is the taxes that the individual has really paid on his taxable revenu.

What I want to have as a result is a table in which:
The first variabble shows the social categories of this population by the"revenu_net_global" variable, i other words, I want to have the Bottom Quitile, Second Quintile, Middle Quintile, Forth Quintile, 81% to 90%, 91% to 95%, 96% to 99%, and the top richest 1% according to their "revenu_net_global". I know that there is a STATA command called "xtile", yet I don't know how I can define the social classes by myself.
The second variable I want to have on the table is the number of individuals (or observations n) for each social class
The third variable is the median (or the average) income before txation for each social class, that is the median (or the average) revenu_net_imposable for each class
The fourth variable I would like to have is the median (or the average) taxation that is paid by the median (or the average) individual of each class, that is the variable "impt_d"
As for the fifth and last variable, I want it to be about the average tax rate of each social class, that is the amount of taxation (the variable "impt_d") each average individual of each social class paid on his taxable revenu.

I really would appreciate it if I could get some help with this. Thanks in advance to whoever would like to help!

Tags: None

Clyde Schechter

Join Date: Apr 2014
Posts: 30066

18 Aug 2024, 11:35

Code:

_pctile revenu_net_global, percentiles(20 40 60 80 90 95 99)

label define social_class    1    "Bottom Quintile"    ///
                            2    "2nd Quintile"    ///
                            3    "Middle Quintile"    ///
                            4    "4th Quintile"    ///
                            5    ">80 to 90th Percentile"    ///
                            6    "> 90 to 95th Percentile"    ///
                            7    "> 95 to 99th Percentile"    ///
                            8    "> 99th Percentile"
gen social_class:social_class = .
forvalues i = 1/7 {
    replace social_class = `i' if missing(social_class) & revenu_net_global <= r(r`i')
}
replace social_class = 8 if missing(social_class) & !missing(revenu_net_global)
gen percent_tax_paid = 100*impt_d/revenu_net_imposable

collapse (count) class_size = revenu_net_global ///
        (median) class_median_pretax_income = revenu_net_imposable ///
        (mean) percent_tax_paid, by(social_class)

In your example data, there are only 100 observations, so fine-graining the percentiles the way you do produces strange results. For example, the top percentile is based on only one person. And all the social classes above the quintiles are based on small numbers. Presumably your real data set is large enough to overcome those things.

There is another problem, however, that your real data may or may not overcome. In the example data, over half of the observations have 0 income. Consequently, it is impossible to identify a first and second quintile, because all of the 0-income observations must be classified the same way. So in the output here you will see that the "first quintile" is in fact a bit more than the lower half of the data, and there is no second quintile at all. If the distribution of income in your real data resembles the example data, you will encounter this same issue.

Comment

Aziz Essouaied

Join Date: Apr 2020
Posts: 182

18 Aug 2024, 12:36

Clyde Schechter Thanks for the help, it works well.

The data I've presented in my post is just a data of the "dataex" command in Stata, my full data has about 143 227 observations. Yet even so, you're right, the poorest 1% and the poorest 10% of this population (according to their global net revenu) showed some weird results in their class_median_pretax_income and percent_tax_paid, those variables for those two social classes showed zeros. So, even with a large set of data, I don't understand what happened,

Also, I don't know if it is possible to classify all the 0 income observations in just one classe, just before the bottom quitile for instance, just to make it clear that that class is for people who don't have any revenu whatsoever.

Here is the result of the code you've just helped me with so you could get a clearer view:

Social_Class	class_size	class_median_pretax_income	percent_tax_paid
Bottom Ce	1433.00	0.00
Bottom De	53897.00	0.00
2nd Quint	1961.00	236000.00	2007.546
Middle Qu	28646.00	3523500.00	20.07466
4th Quint	28645.00	8309000.00	17.03967
>80 to 90	14323.00	22656000.00	22.2686
> 90 to 9	7161.00	47447000.00	25.94216
> 95 to 9	5729.00	95301000.00	30.51006
> 99th Pe	1432.00	241345000.00	32.95389

Already, there is a little problem with the calculation of the "percent_tax_paid" value of the 3 class (2nd Quint), I don't know what happened.

Again, thanks for the help, I really appreciate it!

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30066
#4

18 Aug 2024, 13:27

I think putting the people with zero income in a separate group is a good idea, and have done that in this revision.

I don't know what you mean by "...a little problem with the calculation of the "percent_tax_paid" value of the 3 class..." What, specifically are you seeing that is contrary to your expectations. What I do see is that when I wrote the code, I neglected to include your desired variable representing median or average tax paid. That is now included in the results.

Let me also point out that the average (mean, in this case) percent tax paid is not going to equal the median tax paid divided by the median pre-tax income. Nor would the median percent tax paid equal that ratio. Medians simply do not work that way. And that inequality persists if we use mean tax paid and mean pre-tax income. The calculation has to be made by first calculating the percentage of pre-tax income paid in tax for each individual and then taking the average (we did it as mean, it could be median--same problem either way). This is simply mathematically different from taking the ratio of mean (or median) tax paid to mean (or median) pre-tax income.

Here is modified code, with changes in bold face.

Code:

_pctile revenu_net_global if revenu_net_global > 0, percentiles(20 40 60 80 90 95 99) label define social_class 0 "Zero Income" /// 1 "Bottom Quintile" /// 2 "2nd Quintile" /// 3 "Middle Quintile" /// 4 "4th Quintile" /// 5 ">80 to 90th Percentile" /// 6 "> 90 to 95th Percentile" /// 7 "> 95 to 99th Percentile" /// 8 "> 99th Percentile" gen social_class:social_class = 0 if revenu_net_global == 0 forvalues i = 1/7 { replace social_class = `i' if missing(social_class) & revenu_net_global <= r(r`i') } replace social_class = 8 if missing(social_class) & !missing(revenu_net_global) gen percent_tax_paid = 100*impt_d/revenu_net_imposable collapse (count) class_size = revenu_net_global /// (median) class_median_pretax_income = revenu_net_imposable /// class_median_tax_paid = impt_d /// (mean) percent_tax_paid, by(social_class)
Comment
Aziz Essouaied

Join Date: Apr 2020

Posts: 182
#5

18 Aug 2024, 13:45

Clyde Schechter Thanks for the respond.

As for the individuals who don't have any revenu whatsoever, I'am still slightly ambivalent about treating them as a separate class, I understand that if treated as part of the class of the poorest 1% or the poorest 10%, they could hurt the stats of that class, but also, I don't believe that treating them as a separate class (the one you've called "Zero Income") is somewhat a correct way of doing things in terms of taxation and revenu, but, I'll think about it anyway.

When I said "...a little problem with the calculation of the "percent_tax_paid" value of the 3 class...", what I meant is that that value shows 2007.546, although the variable "percent_tax_pad" is a rate, so that doesn't go along.

Again, thanks for the help!
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30066
#6

18 Aug 2024, 15:22

When I said "...a little problem with the calculation of the "percent_tax_paid" value of the 3 class...", what I meant is that that value shows 2007.546, although the variable "percent_tax_pad" is a rate, so that doesn't go along.

OK. I think that's a problem with your data set. Nothing like that happens in your example. Here are the results of running the code in #4 on the example data in #1:

Code:

* Example generated by -dataex-. For more info, type help dataex clear input float social_class long class_size double class_median_pretax_income float percent_tax_paid 1 52 0 . 3 10 1000000 22.142857 4 18 4510000 4.950842 5 10 12055500 14.509416 6 5 0 29.42557 7 4 17412500 27.693884 8 1 191597000 32.703514 end label values social_class social_class label def social_class 1 "Bottom Quintile", modify label def social_class 3 "Middle Quintile", modify label def social_class 4 "4th Quintile", modify label def social_class 5 ">80 to 90th Percentile", modify label def social_class 6 "> 90 to 95th Percentile", modify label def social_class 7 "> 95 to 99th Percentile", modify label def social_class 8 "> 99th Percentile", modify

You can see that the percent_tax_paid is in the expected range for every group.

Looking at the code itself, you can see that the percent tax paid at the individual level is calculated by dividing impt_d by revenu_net_imposable, and multiplying by 100. As long as impt_d lies between 0 and revenu_net_imposable, the result will necessarily between 0 and 100. So I think that you have some bad data in your full data set. You can find the offending observations with:

Code:

browse if !inrange(impt_d, 0, revenu_net_imposable) & !missing(impt_d)

applied to the original data.
That will show you observations of the individuals who have an impt_d that is outside the range of possibility. Then you have to figure out why you have those and how to fix them.
Comment
Aziz Essouaied

Join Date: Apr 2020

Posts: 182
#7

20 Aug 2024, 00:10

Clyde Schechter Thanks for the help.

What I also want to say is that I'am trying to do this work to study the distribution of the fiscal charge on a population of taxpayers devided into social classes by their revenu (the "revenu_net_global" variable), in other words, I guess the "percent_tax_paid" should go from the lowest to the highest, it should be classified from the lowest percent to the highest one, since it is based on the taxable revenu as well as taxes paid.
I don't know if the first command of the code you've suggested could help sort the revenu and the taxes paid in that way.
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30066

20 Aug 2024, 09:05

I'm not sure I understand what you're saying in #7, but perhaps changing the -collapse- command as follows will do it:

Code:

collapse (count) class_size = revenu_net_global ///
      (median) class_median_pretax_income = revenu_net_imposable ///
      class_median_tax_paid = impt_d ///
     (min) lowest_pct_tax_paid = percent_tax_paid ///
     (mean) mean_percent_tax_paid = percent_tax_paid ///
     (max) highest_pct_tax_paid = percent_tax_paid, by(social_class)

Comment

Aziz Essouaied

Join Date: Apr 2020
Posts: 182

29 Aug 2024, 13:56

Clyde Schechter Thanks again for considering this little study.

I've worked on my data a bit more, and using the code you've suggested, this is the result i've got:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float social_class long class_size double(class_median_revenu_net_global class_median_pretax_income class_median_tax_paid) float tax_paid_in_percent
 0 41212         0         0    300000         .
 1 15011  -2561000         0    300000 -455.1441
 2  7954   1311000   1080000    300000  31.32941
 3  8749   3191000   2925000    300000 10.610744
 4 17573   5882000   5515000    300000  6.074727
 5 17575   8765000   8121000    819000  9.655637
 6 17578  18702500  17339500   3217370 17.078936
 7  8787  42305000  38776000   9521120 22.292946
 8  4394  79853500  72341000  20922675  25.37377
 9  3516 146992500 127248000  40136800  26.52573
10   878 356696000 298782000 100173700  26.64242
end
label values social_class social_class
label def social_class 0 "Zero Income", modify
label def social_class 1 "Bottom Centile", modify
label def social_class 2 "Bottom Decile", modify
label def social_class 3 "Bottom Quintile", modify
label def social_class 4 "Second Quintile", modify
label def social_class 5 "Middle Quintile", modify
label def social_class 6 "Fourth Quintile", modify
label def social_class 7 ">80th to 90th Percentile", modify
label def social_class 8 "> 90th to 95th Percentile", modify
label def social_class 9 "> 95th to 99th Percentile", modify
label def social_class 10 "> 99th Percentile", modify

As you can see, there are some point to consider in this:

- Some individuals have a negative "revenu_net_global", yet they paid about 300 000 000.00 in taxes, so when I've created the variable "percent_tax_paid = 100*impt_d/revenu_net_imposable", I've ended up with the resut of -455.14407, which is an odd value since that variable is a percentage, I guess it is because the "revenu_net_global" was negative for that class, so, is there a way to treat that? Should I use the absolute value in order to get an acceptable result? Or should I have the individuals (the observations) who have a negative "revenu_net_global" in an independant social class?

- What I meant in #7 is that since the observations are sorted by their levels of "revenu_net_global", shouldn't all the variables in the result table be sorted in the same way too? Shouldn't the social clases be sorted by the variable "tax_paid_in_percent"?

Again, thanks for the help Clyde Schechter

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30066
#10

29 Aug 2024, 14:21

Some individuals have a negative "revenu_net_global", yet they paid about 300 000 000.00 in taxes, so when I've created the variable "percent_tax_paid = 100*impt_d/revenu_net_imposable", I've ended up with the resut of -455.14407, which is an odd value since that variable is a percentage, I guess it is because the "revenu_net_global" was negative for that class, so, is there a way to treat that?

Well, I have to say I am stunned that people with negative net taxable income are assessed a positive amount of tax to pay. What country has such a regressive, really beyond regressive, tax system? (Or am I misunderstanding what the tax represents? Are you counting here not just income tax but all taxes combined, including perhaps sales taxes or VAT and property tax?)

Be that as it may, from a mathematical perspective, if the tax paid is positive and the income is negative, then the tax paid as a percent of income will, in fact, be a negative number. That is the mathematically correct way to do it. If you find it confusing to include these people in classes that also have people with positive net taxable income, then it might be better to have them as a separate class. I don't really know what to say about that as I don't know how any of this will impact what you will subsequently do with these data.

What I meant in #7 is that since the observations are sorted by their levels of "revenu_net_global", shouldn't all the variables in the result table be sorted in the same way too? Shouldn't the social clases be sorted by the variable "tax_paid_in_percent"?

Well, the behavior of -collapse- is that the data come out sorted in order of its -by()- variable(s). So in this case, sorted in order of class. If you want them sorted by percent tax paid, then just sort them that way after -collapse-. -sort tax_paid_in_percent- will do that.
Comment

Announcement

Distribution of taxation by "xtiles"

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment