Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Distribution of taxation by "xtiles"

    Hello Stata community;

    I'am working with some fiscal data about individual tax filings, here's the data:
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input double(revenu_net_global revenu_net_imposable impt_d)
            0         0        0
            0         0        0
            0         0        0
            0         0        0
            0         0        0
     15285751  14786000  2544360
      5751606   5452000   117520
      4320000   4320000   300000
     12620394  12121000  1851460
      2000000   2000000   300000
            0         0   300000
      1500000   1200000        0
            0         0        0
       800000    420000   300000
            0         0   300000
            0         0   300000
      2000000   1700000        0
            0         0   300000
      3600000   3200000   300000
      6200000   5900000   300000
            0         0   300000
      1000000   1000000   300000
            0         0   300000
      2720000   2420000   300000
      2000000   2000000   300000
      1000000   1000000   300000
            0         0        0
      3000000   2400000   300000
            0         0   300000
      6056669   5557000   300000
            0         0   300000
      2500000   2500000        0
            0         0        0
            0         0   300000
      1000000   1000000   300000
            0         0   300000
            0         0        0
            0         0        0
            0         0        0
            0         0        0
            0         0        0
            0         0   300000
            0         0   300000
            0         0        0
      5000000   4700000        0
            0         0        0
            0         0   300000
      1300000   1000000   300000
      4000000   3700000        0
      7500000   6900000   494000
      2625427   2326000   300000
      6510678   6211000   314860
      4320000   3820000        0
     21620000  21620000  4353600
      4000000   3700000        0
     84372350  84373000 25130550
            0         0   300000
     93500000         0        0
      5558000   5058000    15080
            0         0        0
            0         0        0
            0         0   300000
            0         0   300000
     65951600         0        0
      9000000   8500000   910000
            0         0        0
            0         0        0
            0         0   300000
            0         0        0
      1445767    946000        0
            0         0   300000
      6000000   6000000   300000
            0         0   300000
            0         0   300000
            0         0   300000
            0         0        0
            0         0        0
            0         0        0
            0         0   300000
      6000000   5400000   300000
      6945156   6646000   427960
     10896027  10597000  1455220
    140111625 133947000 42481450
            0         0   300000
     74449275  74150000 21552500
    198813514 191597000 62658950
     15450000  14950000  2587000
     38758617         0   300000
            0         0   300000
     24300000         0        0
     12000000  11990000  1817400
            0         0        0
      9712438   9413000  1147380
            0         0   300000
            0         0   300000
    1.890e+08         0        0
            0         0   300000
    105674393  34825000  8244000
            0         0   300000
     13600000  13300000  2158000
    end
    The first variable is called "revenu_net_global", which is basically the total net revenu of that individual before taxation.
    The second variable is "revenu_net_imposable", which is the taxable revenu after considering forms of deductions, in other words, the individual will pay taxes on that taxable revenu, not on his total net revenu
    The third variable is "impt_d", which is basically the monetary amount that that individual paid (or should pay), so basically, it is the taxes that the individual has really paid on his taxable revenu.

    What I want to have as a result is a table in which:
    The first variabble shows the social categories of this population by the"revenu_net_global" variable, i other words, I want to have the Bottom Quitile, Second Quintile, Middle Quintile, Forth Quintile, 81% to 90%, 91% to 95%, 96% to 99%, and the top richest 1% according to their "revenu_net_global". I know that there is a STATA command called "xtile", yet I don't know how I can define the social classes by myself.
    The second variable I want to have on the table is the number of individuals (or observations n) for each social class
    The third variable is the median (or the average) income before txation for each social class, that is the median (or the average) revenu_net_imposable for each class
    The fourth variable I would like to have is the median (or the average) taxation that is paid by the median (or the average) individual of each class, that is the variable "impt_d"
    As for the fifth and last variable, I want it to be about the average tax rate of each social class, that is the amount of taxation (the variable "impt_d") each average individual of each social class paid on his taxable revenu.

    I really would appreciate it if I could get some help with this. Thanks in advance to whoever would like to help!

  • #2
    Code:
    _pctile revenu_net_global, percentiles(20 40 60 80 90 95 99)
    
    label define social_class    1    "Bottom Quintile"    ///
                                2    "2nd Quintile"    ///
                                3    "Middle Quintile"    ///
                                4    "4th Quintile"    ///
                                5    ">80 to 90th Percentile"    ///
                                6    "> 90 to 95th Percentile"    ///
                                7    "> 95 to 99th Percentile"    ///
                                8    "> 99th Percentile"
    gen social_class:social_class = .
    forvalues i = 1/7 {
        replace social_class = `i' if missing(social_class) & revenu_net_global <= r(r`i')
    }
    replace social_class = 8 if missing(social_class) & !missing(revenu_net_global)
    gen percent_tax_paid = 100*impt_d/revenu_net_imposable
    
    collapse (count) class_size = revenu_net_global ///
            (median) class_median_pretax_income = revenu_net_imposable ///
            (mean) percent_tax_paid, by(social_class)
    In your example data, there are only 100 observations, so fine-graining the percentiles the way you do produces strange results. For example, the top percentile is based on only one person. And all the social classes above the quintiles are based on small numbers. Presumably your real data set is large enough to overcome those things.

    There is another problem, however, that your real data may or may not overcome. In the example data, over half of the observations have 0 income. Consequently, it is impossible to identify a first and second quintile, because all of the 0-income observations must be classified the same way. So in the output here you will see that the "first quintile" is in fact a bit more than the lower half of the data, and there is no second quintile at all. If the distribution of income in your real data resembles the example data, you will encounter this same issue.

    Comment


    • #3
      Clyde Schechter Thanks for the help, it works well.

      The data I've presented in my post is just a data of the "dataex" command in Stata, my full data has about 143 227 observations. Yet even so, you're right, the poorest 1% and the poorest 10% of this population (according to their global net revenu) showed some weird results in their class_median_pretax_income and percent_tax_paid, those variables for those two social classes showed zeros. So, even with a large set of data, I don't understand what happened,

      Also, I don't know if it is possible to classify all the 0 income observations in just one classe, just before the bottom quitile for instance, just to make it clear that that class is for people who don't have any revenu whatsoever.

      Here is the result of the code you've just helped me with so you could get a clearer view:
      Social_Class class_size class_median_pretax_income percent_tax_paid
      Bottom Ce 1433.00 0.00
      Bottom De 53897.00 0.00
      2nd Quint 1961.00 236000.00 2007.546
      Middle Qu 28646.00 3523500.00 20.07466
      4th Quint 28645.00 8309000.00 17.03967
      >80 to 90 14323.00 22656000.00 22.2686
      > 90 to 9 7161.00 47447000.00 25.94216
      > 95 to 9 5729.00 95301000.00 30.51006
      > 99th Pe 1432.00 241345000.00 32.95389
      Already, there is a little problem with the calculation of the "percent_tax_paid" value of the 3 class (2nd Quint), I don't know what happened.

      Again, thanks for the help, I really appreciate it!

      Comment


      • #4
        I think putting the people with zero income in a separate group is a good idea, and have done that in this revision.

        I don't know what you mean by "...a little problem with the calculation of the "percent_tax_paid" value of the 3 class..." What, specifically are you seeing that is contrary to your expectations. What I do see is that when I wrote the code, I neglected to include your desired variable representing median or average tax paid. That is now included in the results.

        Let me also point out that the average (mean, in this case) percent tax paid is not going to equal the median tax paid divided by the median pre-tax income. Nor would the median percent tax paid equal that ratio. Medians simply do not work that way. And that inequality persists if we use mean tax paid and mean pre-tax income. The calculation has to be made by first calculating the percentage of pre-tax income paid in tax for each individual and then taking the average (we did it as mean, it could be median--same problem either way). This is simply mathematically different from taking the ratio of mean (or median) tax paid to mean (or median) pre-tax income.

        Here is modified code, with changes in bold face.
        Code:
        _pctile revenu_net_global if revenu_net_global > 0, percentiles(20 40 60 80 90 95 99)
        
        label define social_class   0     "Zero Income"     ///
                                    1    "Bottom Quintile"    ///
                                    2    "2nd Quintile"    ///
                                    3    "Middle Quintile"    ///
                                    4    "4th Quintile"    ///
                                    5    ">80 to 90th Percentile"    ///
                                    6    "> 90 to 95th Percentile"    ///
                                    7    "> 95 to 99th Percentile"    ///
                                    8    "> 99th Percentile"
        gen social_class:social_class = 0 if revenu_net_global == 0
        forvalues i = 1/7 {
            replace social_class = `i' if missing(social_class) & revenu_net_global <= r(r`i')
        }
        replace social_class = 8 if missing(social_class) & !missing(revenu_net_global)
        gen percent_tax_paid = 100*impt_d/revenu_net_imposable
        
        collapse (count) class_size = revenu_net_global ///
                (median) class_median_pretax_income = revenu_net_imposable ///
                        class_median_tax_paid = impt_d ///
                (mean) percent_tax_paid, by(social_class)

        Comment


        • #5
          Clyde Schechter Thanks for the respond.

          As for the individuals who don't have any revenu whatsoever, I'am still slightly ambivalent about treating them as a separate class, I understand that if treated as part of the class of the poorest 1% or the poorest 10%, they could hurt the stats of that class, but also, I don't believe that treating them as a separate class (the one you've called "Zero Income") is somewhat a correct way of doing things in terms of taxation and revenu, but, I'll think about it anyway.

          When I said "...a little problem with the calculation of the "percent_tax_paid" value of the 3 class...", what I meant is that that value shows 2007.546, although the variable "percent_tax_pad" is a rate, so that doesn't go along.

          Again, thanks for the help!

          Comment


          • #6
            When I said "...a little problem with the calculation of the "percent_tax_paid" value of the 3 class...", what I meant is that that value shows 2007.546, although the variable "percent_tax_pad" is a rate, so that doesn't go along.
            OK. I think that's a problem with your data set. Nothing like that happens in your example. Here are the results of running the code in #4 on the example data in #1:
            Code:
            * Example generated by -dataex-. For more info, type help dataex
            clear
            input float social_class long class_size double class_median_pretax_income float percent_tax_paid
            1 52         0         .
            3 10   1000000 22.142857
            4 18   4510000  4.950842
            5 10  12055500 14.509416
            6  5         0  29.42557
            7  4  17412500 27.693884
            8  1 191597000 32.703514
            end
            label values social_class social_class
            label def social_class 1 "Bottom Quintile", modify
            label def social_class 3 "Middle Quintile", modify
            label def social_class 4 "4th Quintile", modify
            label def social_class 5 ">80 to 90th Percentile", modify
            label def social_class 6 "> 90 to 95th Percentile", modify
            label def social_class 7 "> 95 to 99th Percentile", modify
            label def social_class 8 "> 99th Percentile", modify
            You can see that the percent_tax_paid is in the expected range for every group.

            Looking at the code itself, you can see that the percent tax paid at the individual level is calculated by dividing impt_d by revenu_net_imposable, and multiplying by 100. As long as impt_d lies between 0 and revenu_net_imposable, the result will necessarily between 0 and 100. So I think that you have some bad data in your full data set. You can find the offending observations with:
            Code:
            browse if !inrange(impt_d, 0, revenu_net_imposable) & !missing(impt_d)
            applied to the original data.
            That will show you observations of the individuals who have an impt_d that is outside the range of possibility. Then you have to figure out why you have those and how to fix them.

            Comment


            • #7
              Clyde Schechter Thanks for the help.

              What I also want to say is that I'am trying to do this work to study the distribution of the fiscal charge on a population of taxpayers devided into social classes by their revenu (the "revenu_net_global" variable), in other words, I guess the "percent_tax_paid" should go from the lowest to the highest, it should be classified from the lowest percent to the highest one, since it is based on the taxable revenu as well as taxes paid.
              I don't know if the first command of the code you've suggested could help sort the revenu and the taxes paid in that way.

              Comment


              • #8
                I'm not sure I understand what you're saying in #7, but perhaps changing the -collapse- command as follows will do it:
                Code:
                collapse (count) class_size = revenu_net_global ///
                      (median) class_median_pretax_income = revenu_net_imposable ///
                      class_median_tax_paid = impt_d ///
                     (min) lowest_pct_tax_paid = percent_tax_paid ///
                     (mean) mean_percent_tax_paid = percent_tax_paid ///
                     (max) highest_pct_tax_paid = percent_tax_paid, by(social_class)

                Comment


                • #9
                  Clyde Schechter Thanks again for considering this little study.

                  I've worked on my data a bit more, and using the code you've suggested, this is the result i've got:

                  Code:
                  * Example generated by -dataex-. To install: ssc install dataex
                  clear
                  input float social_class long class_size double(class_median_revenu_net_global class_median_pretax_income class_median_tax_paid) float tax_paid_in_percent
                   0 41212         0         0    300000         .
                   1 15011  -2561000         0    300000 -455.1441
                   2  7954   1311000   1080000    300000  31.32941
                   3  8749   3191000   2925000    300000 10.610744
                   4 17573   5882000   5515000    300000  6.074727
                   5 17575   8765000   8121000    819000  9.655637
                   6 17578  18702500  17339500   3217370 17.078936
                   7  8787  42305000  38776000   9521120 22.292946
                   8  4394  79853500  72341000  20922675  25.37377
                   9  3516 146992500 127248000  40136800  26.52573
                  10   878 356696000 298782000 100173700  26.64242
                  end
                  label values social_class social_class
                  label def social_class 0 "Zero Income", modify
                  label def social_class 1 "Bottom Centile", modify
                  label def social_class 2 "Bottom Decile", modify
                  label def social_class 3 "Bottom Quintile", modify
                  label def social_class 4 "Second Quintile", modify
                  label def social_class 5 "Middle Quintile", modify
                  label def social_class 6 "Fourth Quintile", modify
                  label def social_class 7 ">80th to 90th Percentile", modify
                  label def social_class 8 "> 90th to 95th Percentile", modify
                  label def social_class 9 "> 95th to 99th Percentile", modify
                  label def social_class 10 "> 99th Percentile", modify
                  As you can see, there are some point to consider in this:

                  - Some individuals have a negative "revenu_net_global", yet they paid about 300 000 000.00 in taxes, so when I've created the variable "percent_tax_paid = 100*impt_d/revenu_net_imposable", I've ended up with the resut of -455.14407, which is an odd value since that variable is a percentage, I guess it is because the "revenu_net_global" was negative for that class, so, is there a way to treat that? Should I use the absolute value in order to get an acceptable result? Or should I have the individuals (the observations) who have a negative "revenu_net_global" in an independant social class?

                  - What I meant in #7 is that since the observations are sorted by their levels of "revenu_net_global", shouldn't all the variables in the result table be sorted in the same way too? Shouldn't the social clases be sorted by the variable "tax_paid_in_percent"?

                  Again, thanks for the help Clyde Schechter

                  Comment


                  • #10
                    Some individuals have a negative "revenu_net_global", yet they paid about 300 000 000.00 in taxes, so when I've created the variable "percent_tax_paid = 100*impt_d/revenu_net_imposable", I've ended up with the resut of -455.14407, which is an odd value since that variable is a percentage, I guess it is because the "revenu_net_global" was negative for that class, so, is there a way to treat that?
                    Well, I have to say I am stunned that people with negative net taxable income are assessed a positive amount of tax to pay. What country has such a regressive, really beyond regressive, tax system? (Or am I misunderstanding what the tax represents? Are you counting here not just income tax but all taxes combined, including perhaps sales taxes or VAT and property tax?)

                    Be that as it may, from a mathematical perspective, if the tax paid is positive and the income is negative, then the tax paid as a percent of income will, in fact, be a negative number. That is the mathematically correct way to do it. If you find it confusing to include these people in classes that also have people with positive net taxable income, then it might be better to have them as a separate class. I don't really know what to say about that as I don't know how any of this will impact what you will subsequently do with these data.

                    What I meant in #7 is that since the observations are sorted by their levels of "revenu_net_global", shouldn't all the variables in the result table be sorted in the same way too? Shouldn't the social clases be sorted by the variable "tax_paid_in_percent"?
                    Well, the behavior of -collapse- is that the data come out sorted in order of its -by()- variable(s). So in this case, sorted in order of class. If you want them sorted by percent tax paid, then just sort them that way after -collapse-. -sort tax_paid_in_percent- will do that.

                    Comment

                    Working...
                    X