  • #16
    Clyde Schechter Thanks for the help. Yes, I had changed the code: I'd used it on other databases and made some changes in the process, and when I came back to use it again, it gave me that error message.
    Still, there is a problem here that I would like your advice on, if possible. The current database I'm using (the one in #14) is about the taxes people pay according to their revenues. However, some people (those whose revenue doesn't exceed a certain amount; the poor classes, if you want to call them that) pay no taxes: the variable "Impot_Payé" for those people is 0, even though they do have a revenue. The problem these people create when I apply the code is that dividing "Impot_Payé" by "Revenu_Année" is a division of 0 by a positive value, and that gives me a result that's a bit weird to analyse. Here's the result:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float bracket long id_num double Revenu_Année float(Impot_Payé effective_tax_rate) double total_income
       1  11293 327.03265967772097  74.21466  3.3798416  3693179.825740503
      10 105704 2061.5124999654527  31.88242  .02203993  217910117.2963482
      20 108870 3529.9655197542434  57.90738 .016248617  384307346.1356445
      40 225867   8787.90451409199   728.749  .06950825 1984897628.8844154
      60 225867   16971.8567085875  2620.085  .15248114  3833382359.198533
      80 225868  23487.14526649053  4180.503   .1777457  5304994527.051683
      90 112933 27809.098754913826  5194.637    .186663  3140564949.688683
      95  56469   35072.8734448177  6954.365   .1980317 1980530090.5554106
      99  45171  46170.90434383673  10492.78  .22471543  2085585920.115449
    99.9  10164  107625.7937056832 27840.965  .25883132  1093908567.224564
     100   1129  465566.7259929141  120939.2  .26597884      525624833.646
    end
    The problem is that, for the first class, it gives me a "wrong" "effective_tax_rate". I guess the "gen effective_tax_rate = Impot_Payé/Revenu_Année" command in the code created some distortion for that first class, since it had to take 0 in paid taxes and divide it by a positive revenue for that first, poor class.

    Is there a way to correct this, please? I do want to show the first class, and I would expect those people to have a really small "effective_tax_rate", close to zero, since they pay almost nothing in taxes relative to the small revenues they earn. Yet the code gives me an effective_tax_rate of about 337% for that first class, and that doesn't make sense.

    Any help please? With many thanks!



    • #17
      I see the anomalous result in what you show. But finding where the problem lies requires seeing the data from which those results were generated, along with the exact code that created them.

      I can tell you that when you divide 0 by a positive number you get 0; there is no problem associated with that. You may be confusing that with dividing by 0. But even that would just lead to a missing value, not an anomalously large result. The code shown in #12 of this thread should not produce this kind of problem. If you used it unmodified, I am quite confident that the problem arises from some erroneous data or data that just isn't what you expected it to be.
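
      A quick illustration you can run yourself (the numbers here are arbitrary):
      Code:
      * 0 divided by a positive number is simply 0
      display 0/327.03
      * dividing by 0 yields missing (.), not an anomalously large value
      display 74.21/0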

      Anyway, nothing more can be said without seeing the data and exact code that created these results.



      • #18
        Clyde Schechter Thanks for the help!

        Here's the data on which I've applied the code:

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input double Revenu_Année float Impot_Payé
        24015.379  3933.683
        42206.545  8434.434
         36889.06  6932.468
        26109.679  4336.566
        39681.416  7729.908
        14385.029  2004.793
        12904.748  1737.222
        14070.842  2079.853
        11686.995  1559.949
        13509.571  1858.438
        13506.436  1966.681
        29359.533   6183.24
        15564.943   2423.95
        24029.455  4661.007
        25140.046  4758.284
        14385.029  2059.021
        24805.308  3282.636
        21922.531  4111.383
        25267.761  4808.718
        30787.104  5680.851
        35680.101   7038.82
        42282.843  8149.774
        41347.412  8236.804
         43959.38  6236.476
        46051.024   9474.11
        41367.373  8273.574
        23090.514  3763.933
         43895.24  8518.322
        33677.781  4495.589
        30403.981   5257.65
        26020.927  4970.615
        23343.019  4331.012
        39376.535  6940.421
          23614.8    4039.6
        43191.675  8642.883
        22990.661  3832.333
        20275.437  3504.201
        56839.433 13338.104
        22727.017  4038.028
        21925.397  3812.582
        17463.539  2840.013
        17638.807  2813.337
        23433.331  4227.581
        24772.533  4616.985
        42997.801  8674.304
        42884.521  8994.131
        11704.479  1532.016
        23390.023  4305.743
        12352.238  1686.569
        26754.953  4635.622
        28581.334  4941.036
        29817.422  5400.736
        46083.094  9511.203
        33551.781  6353.342
        14695.325  2248.476
        23837.271  4595.424
         16364.11  1768.941
        14462.969  2118.728
        13629.119  1859.865
        13480.118  1961.048
        15225.171  2270.185
        25082.852    4702.7
        14436.699   1362.31
        14147.463  2003.094
        24736.987  4581.159
        21994.715  3833.441
        26076.943  4980.064
        29916.472  5291.995
        37155.938  7305.427
          26833.7  4531.231
        34362.458  6636.015
        41451.511  8198.047
         37614.57  7069.975
        21163.171  3528.521
        28564.741   5408.19
        32265.449  5778.196
         34556.51  6682.951
        30326.373  5451.877
        22364.445  3907.091
        22327.509  3955.969
         18349.26  3082.684
        24882.902  4554.667
        25433.153  4229.585
        41274.971  8100.187
        52445.719 11519.258
        28943.441  5034.317
        40466.357  8113.134
        32317.757  5882.521
         25738.39  4526.666
        28380.104  5033.999
        31892.922  5742.724
         44198.93  8860.961
        34875.191  6648.993
        41518.507  8156.467
        42724.921  7274.216
        40438.143   7952.28
        18711.021  3163.753
        24772.533  4733.406
         31029.76  5578.427
        42017.476  8412.897
        end
        It contains the Revenu (first variable) and the Tax Paid (second variable) for about 1,129,000 people; it is a fairly extensive dataset.

        Here's the code I've used, based on your earlier suggestions:

        Code:
        assert !missing(Revenu_Année)
        centile Revenu_Année, centile(1 10 20 40 60 80 90 95 99 99.9)
        label define bracket 1 "Bottom Centile" 2 "Bottom Decile" 3 "Bottom Quintile" 4 "Second Quintile" 5 "Middle Quintile" 6 "Fourth Quintile" 7 ">80th to 90th Percentile" 8 "> 90th to 95th Percentile" 9 "> 95th to 99th Percentile" 10 "> 99th to 99.9th Percentile" 11 "> 99.9th Percentile"
        gen bracket:bracket = .
        local centiles `r(centiles)'
        forvalues i = 1/`r(n_cent)' {
            replace bracket = `i' ///
                if Revenu_Année <= `r(c_`i')' & missing(bracket)
        }
        replace bracket = `r(n_cent)' + 1 if missing(bracket)

        gen effective_tax_rate = (Impot_Payé/Revenu_Année)*100
        egen id_num = group(id)

        collapse (count) id_num (mean) Revenu_Année Impot_Payé effective_tax_rate (sum) total_income = Revenu_Année, by(bracket)

        And here's the result of that code:

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input float bracket long id_num double Revenu_Année float(Impot_Payé effective_tax_rate) double total_income
         1  11293 327.03265967772154  74.21466  337.9842  3693179.825740509
         2 105704 2061.5124999654536  31.88242  2.203993  217910117.2963483
         3 108870 3529.9655197543693  57.90738 1.6248617  384307346.1356582
         4 225867  8787.904514089849   728.749  6.950825 1984897628.8839319
         5 225867 16971.856708587457  2620.085 15.248115  3833382359.198523
         6 225868 23487.145266491487  4180.503 17.774572  5304994527.051899
         7 112933 27809.098754914205  5194.637   18.6663  3140564949.688726
         8  56469 35072.873444818266  6954.365  19.80317 1980530090.5554426
         9  45171 46170.904343836875  10492.78  22.47154 2085585920.1154554
        10  10164 107625.79370568323 27840.965 25.883133 1093908567.2245643
        11   1129  465566.7259929141  120939.2 26.597883      525624833.646
        end
        label values bracket bracket
        label def bracket 1 "Bottom Centile", modify
        label def bracket 2 "Bottom Decile", modify
        label def bracket 3 "Bottom Quintile", modify
        label def bracket 4 "Second Quintile", modify
        label def bracket 5 "Middle Quintile", modify
        label def bracket 6 "Fourth Quintile", modify
        label def bracket 7 ">80th to 90th Percentile", modify
        label def bracket 8 "> 90th to 95th Percentile", modify
        label def bracket 9 "> 95th to 99th Percentile", modify
        label def bracket 10 "> 99th to 99.9th Percentile", modify
        label def bracket 11 "> 99.9th Percentile", modify
        The problem is that the effective tax rate for the first social class doesn't seem logical: the value is 337%, and since the people in that class are the poorest, they should be paying the smallest effective tax rate.

        I hope this explanation is clear. Please feel free to ask if you need any more clarification about my problem.

        With many thanks!



        • #19
          OK, thanks. The data example and code do not reproduce the results you show, for two reasons. One is that the example does not contain the variable id, which is used in the second-to-last command. The other is that it is only a small subset of the full data. (Given the size of the data set, where bracket 1 alone contains 11,293 observations according to your output, posting all of it here isn't feasible.)

          Nevertheless, eliminating the creation of variable id_num and its use in -collapse- from the code (id_num serves no purpose other than to provide a count of the number of people in each bracket), we already get a hint of the problem.
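          For concreteness, a sketch of that modified -collapse- call (assuming everything before it in the code from #18 runs unchanged):
          Code:
          * as in #18, but with the (count) of id_num removed
          collapse (mean) Revenu_Année Impot_Payé effective_tax_rate ///
              (sum) total_income = Revenu_Année, by(bracket)
          The actual output is then: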
          Code:
          . list, noobs clean abbrev(20)
          
                bracket   Revenu_Année   Impot_Payé   effective_tax_rate   total_income  
              Bottom Ce      11686.995     1559.949             13.34773      11686.995  
              Bottom De      13256.113     1853.865             13.96183      119305.01  
              Bottom Qu      15462.162     2190.975             14.11427      154621.62  
              Second Qu      22412.097     3960.457             17.64229      448241.94  
              Middle Qu      26566.913     4783.463              17.9877      531338.26  
              Fourth Qu      34339.965     6322.521             18.36276       686799.3  
               >80th to      41665.792     8134.953             19.53242      416657.92  
              > 90th to      43385.723     8213.223             18.95004      216928.62  
              > 95th to      47194.692     9841.383              20.8061      188778.77  
              > 99th to      56839.433      13338.1             23.46629      56839.433
          The hint is that the effective tax rate for the bottom bracket, which you expect to be close to zero, is of the order of magnitude of 13%. In the output you show it is much higher still, at about 337%. It isn't really possible to say much about these results. First of all, the bottom bracket value here is still lower than that of the second bracket, and by an amount that is fairly typical of the difference between consecutive brackets. But in the example data, the output for the bottom bracket comes from just a single original observation!
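
          If you want to check how many observations feed each bracket in your own data, a simple tabulation (run after bracket is created, before -collapse-) will show the counts:
          Code:
          tabulate bracket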

          The code looks correct to me, so I think the problem is in the data. It is most likely that the values of Revenu_Année or Impot_Payé (or both) are incorrect in one or more observations. If you have even just a few observations where the value of Revenu_Année is too low or that of Impot_Payé is too high, the effective tax rate calculated for that person will be falsely elevated, and a few extreme observations like that can pull up the average so the final results look like what you see. Here's what I suggest you do to try to find the source of the error(s).
          Code:
          assert !missing(Revenu_Année)
          * get the income cutoffs that define the brackets
          centile Revenu_Année, centile(1 10 20 40 60 80 90 95 99 99.9)
          label define bracket 1 "Bottom Centile" 2 "Bottom Decile" 3 "Bottom Quintile" 4 "Second Quintile" 5 "Middle Quintile" 6 "Fourth Quintile" 7 ">80th to 90th Percentile" 8 "> 90th to 95th Percentile" 9 "> 95th to 99th Percentile" 10 "> 99th to 99.9th Percentile" 11 "> 99.9th Percentile"
          gen bracket:bracket = .
          local centiles `r(centiles)'
          * assign each observation to the first bracket whose cutoff covers it
          forvalues i = 1/`r(n_cent)' {
              replace bracket = `i' ///
              if Revenu_Année <= `r(c_`i')' & missing(bracket)
          }
          replace bracket = `r(n_cent)' + 1 if missing(bracket)
          
          gen effective_tax_rate = (Impot_Payé/Revenu_Année)*100
          * keep only the bottom bracket, sorted from highest to lowest rate
          keep if bracket == 1
          gsort -effective_tax_rate
          browse
          This will show you the observations from bracket 1 in decreasing order of effective tax rate. So starting from the top you will see the individual observations with the highest calculated individual effective tax rate. At the very least, the first of these will be an order of magnitude or more higher than expected. Perhaps some others are as well. Then you can look at their recorded values for Revenu_Année and Impot_Payé, and it will likely be obvious which of these values is (are) incorrect. You will then have to correct that data and restart your analysis from the beginning.
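
          If you prefer a non-interactive look, something along these lines will list the largest rates (the choice of 20 observations is arbitrary):
          Code:
          * after the -gsort- above, the largest rates are at the top
          list Revenu_Année Impot_Payé effective_tax_rate in 1/20, noobs abbrev(20)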

          It would not surprise me to find that the problem arises either from a misplaced decimal point or from a digit transposition involving something like 19 vs 91 in leading digits.
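
          One quick screen for that kind of error, as a sketch (it treats tax paid in excess of income as a red flag, which is an assumption about your data):
          Code:
          * observations where recorded tax exceeds recorded income
          count if Impot_Payé > Revenu_Année & !missing(Impot_Payé)
          list Revenu_Année Impot_Payé if Impot_Payé > Revenu_Année & !missing(Impot_Payé)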



          • #20
            Clyde Schechter I do understand that. I've used that last code, and I got a table showing observations for individuals who have a low revenue but a tax that is really close to that revenue. That means they are in the poorest social class, yet if we calculate their effective tax rate, it will be high. I understand that.

            Is there a way to just delete those observations from the main dataset?



            • #21
              Well, as a Stata question, yes, of course there is a way to delete them. That's what the -drop- command does. If you're not sure exactly how to use it, read -help drop-, and also read the PDF manual entry for the drop command (which you can open using the link at the top of the -help drop- screen).
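
              For instance, a minimal sketch (the cutoff used here is an arbitrary placeholder, not a recommendation):
              Code:
              * drop observations whose computed rate exceeds a chosen cutoff;
              * 100 (percent) is just a placeholder for whatever rule you can justify
              drop if effective_tax_rate > 100 & !missing(effective_tax_rate)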

              But, don't do that yet. You should first try to investigate those values. There is a slight chance that they are actually correct, and, if so, they should be left alone. And if you can determine for sure that they are errors, you might also be able to get corrected information to overwrite these values in the data set. If, after looking into the matter, you cannot actually determine whether these are errors, or what the correct values are, then you can -drop- those observations. But also be sure that you can explain and defend your decisions about which ones to drop. And, to be sure you don't forget what those decisions were and the reasons for them, document it all as comments in your do-file.
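
              A hypothetical sketch of what that documentation might look like in a do-file (the dates, id values, and figures are invented placeholders):
              Code:
              * 2024-05-10 (hypothetical): data provider confirmed a misplaced decimal
              * point in Revenu_Année for id 12345; correcting per their email
              replace Revenu_Année = 24015.4 if id == 12345
              * 2024-05-12 (hypothetical): provider could not verify these observations;
              * dropping them (see correspondence in project notes)
              drop if inlist(id, 23456, 34567)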

              How you go about investigating this problem depends on how you came to have this data set. Typically there is an original source data set that was given to you, followed by some importation and data management that you did. You should start by tracing back these anomalous observations to the original source data set. If you see that the same errors are found in the source data set, then you should contact whoever provided it to you, call that person's attention to the implausibility of those particular observations, and ask that person to look into it and provide corrections if they are errors, or confirmation that they are correct. If you don't find the same anomalies in the original source data set, then you need to go through all of the data management steps you used to find the mistakes that introduced these errors. (And, it is entirely possible that when you do that you will also discover other errors that need to be fixed.) Then correct those mistakes, and re-run the data management from scratch to create a new, corrected data set.
