  • binning

    Hi all!

    So this is a general question: is there a command in Stata that collapses by bins?
    I have two variables, namely the age of the product and its average sales. I would like to collapse by bins and keep the mean of sales within each bin. Since this is a request from a professor, I would first like to understand what he meant, which is why the question is general.

  • #2
    There is no single command to do this. First you have to create the binning variable itself. Then you can -collapse- -by()- that variable.

    Creating the binning variable can be done in a number of ways. The most direct is just with a series of -generate- and -replace- commands conditioned on the binning variable falling in between two cutpoints. There is also a command -egen, cut()- that is sometimes used for this purpose. In some circumstances, -recode- can be useful here as well.
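
    As a minimal sketch of those two steps, assuming the variables are named age and sales (hypothetical names, since the question does not give them) and using arbitrary 5-unit cutpoints, something along these lines should work:

    Code:
    * Sketch only: age and sales are assumed variable names, and the cutpoints are arbitrary.
    * at(0(5)50) defines bins [0,5), [5,10), ..., [45,50); values outside [0,50) get a missing bin.
    egen age_bin = cut(age), at(0(5)50)

    * -collapse- replaces the data in memory, so -preserve- first if the original data are still needed.
    preserve
    collapse (mean) mean_sales = sales, by(age_bin)
    list
    restore

    Each value of age_bin is the lower cutpoint of its bin. Equal-width bins could also be built directly with -generate-, for example gen age_bin = 5 * floor(age/5).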

    All of that said, remember that when you impose categories on a continuous variable you discard information. Unless the cutpoints correspond to real discontinuities in the relationships of the binned variable to other variables, the result is to make your analyses noisier, less reliable, and sometimes biased. While binned means and the like can be suggestive to look at, you should be extremely reluctant to use them in real analyses. Treating continuous variables as continuous is almost always the better way to go.

    • #3
      I agree with everything Clyde said. For more discussion (although only some of what could be said), see these papers:

      SJ-18-3 dm0095: Speaking Stata: From rounding to binning
      N. J. Cox. Q3/18, SJ 18(3):741--754 (no commands)
      Basic review of how to bin variables in Stata, meaning how to divide their range or support into disjoint intervals.

      SJ-18-1 gr0072: Speaking Stata: Logarithmic binning and labeling (help niceloglabels)
      N. J. Cox. Q1/18, SJ 18(1):262--286
      Introduces the niceloglabels command for helping (even automating) label choice.

      • #4
        People often collapse variables down into quartiles or quintiles. As Clyde and Nick both mention, in general this is not a good practice because it means you are "throwing away" information, particularly if you use the bins (instead of the continuous version of the variable) in a regression.

        Where it can be useful is in interpreting your results:
        * (e.g. "Replacing a teacher in the bottom quartile with a teacher in the top quartile is associated with a 10 point gain in a child's reading scores...").

        * Or in conveying how skewed the underlying data are (e.g. "the average startup in our sample has 33.7 employees; however, this average masks the skewness of the data: 50% of the firms in our sample reach a max of 5 employees over the sample period, and 25% never have more than 2 employees..."). I suspect your sample of product sales will have a similar "long tail" distribution.

        Obviously, you don't need bins for either case. For example, in the latter case you could just type -summarize product_sales, detail- to get a sense of what the bins will look like.

        Code:
        . summ max_emp if target_real==1 & sample==1, detail
        
        
        -------------------------------------------------------------
              Percentiles      Smallest
         1%            1              1
         5%            1              1
        10%            1              1       Obs               1,932
        25%            2              1       Sum of Wgt.       1,932
        
        50%            5                      Mean           33.73344
                                Largest       Std. Dev.      160.4271
        75%           15           1800
        90%           50           2292       Variance       25736.86
        95%          100           2990       Skewness        11.6985
        99%          625           3000       Kurtosis       172.2819

        • #5
          Many thanks for the useful suggestions. I will definitely keep them in mind.
          My distribution is not that fat-tailed, so maybe I could try binning by categorizing age and then collapsing by age category.
          I'll try and let you know.

          Many thanks!
