binning

Federico Nutarelli

Join Date: Sep 2018

Posts: 430
#1

14 Nov 2018, 07:14

Hi all!

So this is a general question: is there a command in Stata that collapses by bins?
I have two variables: the age of the product and its average sales. I would like to collapse by bins and keep the mean of sales within each bin. Since this is a request from a professor, I would first like to understand what he meant, which is why the question is general.
Clyde Schechter

Join Date: Apr 2014

Posts: 30119
#2

14 Nov 2018, 12:05

There is no single command to do this. First you have to create the binning variable itself. Then you can -collapse- -by()- that variable.

Creating the binning variable can be done in a number of ways. The most direct is a series of -generate- and -replace- commands, each conditioned on the variable being binned falling between two cutpoints. There is also a command -egen, cut()- that is sometimes used for this purpose. In some circumstances, -recode- can be useful here as well.
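
A minimal sketch of the two steps, bin first and then -collapse-. The variable names age and sales and all of the cutpoints are assumptions made up for illustration; they are not from the original question.

```stata
* Hypothetical variables: age (product age) and sales (average sales).
* The cutpoints 0, 5, 10, 20 and the upper limit 100 are illustrative only.

* Option 1: -generate- and -replace- conditioned on cutpoints
generate byte agebin = .
replace agebin = 1 if age >= 0  & age < 5
replace agebin = 2 if age >= 5  & age < 10
replace agebin = 3 if age >= 10 & age < 20
replace agebin = 4 if age >= 20 & !missing(age)

* Option 2: -egen, cut()- codes each bin by its lower cutpoint;
* values below 0 or at or above 100 become missing
egen agebin2 = cut(age), at(0,5,10,20,100)

* Then keep the mean of sales within each bin
collapse (mean) mean_sales = sales, by(agebin)
```

Note that -collapse- replaces the dataset in memory, so it may be worth wrapping it in -preserve- and -restore- if the original data are still needed.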

All of that said, remember that when you impose categories on a continuous variable you discard information. Unless the cutpoints correspond to real discontinuities in the relationships of the binned variable to other variables, the result is to make your analyses noisier, less reliable, and sometimes biased. While binned means and the like can be suggestive to look at, you should be extremely reluctant to use them in real analyses. Treating continuous variables as continuous is almost always the better way to go.
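
As one possible illustration of keeping the variable continuous, here is a sketch that assumes the same hypothetical variables age and sales. The choice of a smoother and of a quadratic term is only an example, not a recommendation for any particular dataset.

```stata
* Hypothetical variables: sales and age, both kept continuous.

* Smoothed relationship between sales and age, with no binning
lowess sales age

* A regression with a quadratic term in age as one flexible alternative
regress sales c.age##c.age
```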
Nick Cox

Join Date: Mar 2014

Posts: 35726
#3

14 Nov 2018, 12:13

I agree with everything Clyde said. For more discussion, although covering only some of what could be said, available papers include

SJ-18-3 dm0095 . . . . . . . . . . . Speaking Stata: From rounding to binning
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . N. J. Cox
Q3/18 SJ 18(3):741--754 (no commands)
basic review of how to bin variables in Stata, meaning how to
divide their range or support into disjoint intervals

SJ-18-1 gr0072 . . . . . . . Speaking Stata: Logarithmic binning and labeling
(help niceloglabels) . . . . . . . . . . . . . . . . . . . N. J. Cox
Q1/18 SJ 18(1):262--286
introduces the niceloglabels command for helping (even automating)
label choice
David Benson

Join Date: Oct 2018

Posts: 489
#4

14 Nov 2018, 13:12

People often collapse variables down into quartiles or quintiles. As Clyde and Nick both mention, in general this is not good practice because it means you are "throwing away" information, particularly if you use the bins (instead of the continuous version of the variable) in a regression.

Where it can be useful is in interpreting your results:
* In describing an effect (e.g. "Replacing a teacher in the bottom quartile with a teacher in the top quartile is associated with a 10 point gain in a child's reading scores...").

* Or in conveying how skewed the underlying data are (e.g. "the average startup in our sample has 33.7 employees; however, this average masks the skewness of the data: 50% of the firms in our sample reach a max of 5 employees over the sample period, and 25% never have more than 2 employees..."). I suspect your sample of product sales will have a similar "long tail" distribution.

Obviously, you don't need bins for either case. For example, in the latter case you could just type summarize product_sales, detail to get a sense of what the bins will look like.

Code:

. summ max_emp if target_real==1 & sample==1, detail

-------------------------------------------------------------
      Percentiles      Smallest
 1%            1              1
 5%            1              1
10%            1              1       Obs               1,932
25%            2              1       Sum of Wgt.       1,932

50%            5                      Mean           33.73344
                        Largest       Std. Dev.      160.4271
75%           15           1800
90%           50           2292       Variance       25736.86
95%          100           2990       Skewness        11.6985
99%          625           3000       Kurtosis       172.2819
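
If quartile bins are wanted purely for description, one way to build them is -xtile-. This is a sketch only: product_sales is the placeholder name used above, and quart is a hypothetical new variable.

```stata
* Hypothetical variable: product_sales.
* quart takes the values 1 to 4, one for each quartile of product_sales
xtile quart = product_sales, nquantiles(4)

* Count, mean, median and range of sales within each quartile
tabstat product_sales, by(quart) statistics(n mean median min max)
```

Changing nquantiles(4) to nquantiles(5) would give quintiles instead.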
Federico Nutarelli

Join Date: Sep 2018

Posts: 430
#5

15 Nov 2018, 07:03

Many thanks for the useful suggestions. I will definitely keep them in mind.
My distribution is not that fat-tailed, so maybe I could try binning by categorizing age and then collapsing by age category.
I'll try and let you know.

Many thanks!