
  • Difficulties understanding some smoothing code using the scalar command

    Hi,

    I'm still new to Stata and am having trouble following the code below. I've included my step-by-step interpretation and hope you can let me know whether I've understood it properly.

    Code:
    gen smooth = 0
    quietly summarize mass
    scalar p1 = r(p1)
    scalar p99 = r(p99)
    scalar interval = 10
    forvalues i = `=scalar(p1)'(`=scalar(interval)')`=scalar(p99)' {
        scalar begin = `i'
        scalar end = `i' + `=scalar(interval)'
        if `i' == `=scalar(p1)' {replace smooth = `i' if mass <= `=scalar(end)'}
        else {replace smooth = `i' if mass > `=scalar(begin)' & mass <= `=scalar(end)'}
    }
    From what I can gather, this loop generates begin and end for every i at an absolute interval of 10, between the 1st and 99th percentiles of mass.
    At each iteration of the loop, Stata replaces smooth with i if mass falls within the current interval (i, i + 10].
    So for example if i=50 and mass=53, smooth would be replaced by i=50. This also holds for mass=60, right?
    Effectively this means that at i=50, observations for mass=51 to mass=60 will receive smooth=50, correct?
    This is close to, but not exactly the same as, rounding down to the nearest 10, yes?
    Is there a statistical term for this type of smoothing (e.g. smoothing by rounding down)?

    Thanks, I look forward to your feedback.

  • #2
    This code (what is its source?) wouldn't run even with data supplied: summarize won't leave r(p1) and r(p99) in its wake unless the detail option is specified.

    Apart from that crucial correction, I had to rewrite it to understand it. Using scalars here rather than locals is unlikely to be worth the extra precision.

    This is a quick translation:

    Code:
    gen smooth = 0
    quietly summarize mass, detail 
    local  p1 = r(p1)
    local p99 = r(p99)
    local interval = 10
    forvalues i = `p1'(`interval')`p99' {
        if `i' == `p1' replace smooth = `i' if mass <= `i' + `interval'
        else replace smooth = `i' if mass > `i' & mass <= `i' + `interval' 
    }
    Seems very ad hoc.

    I wouldn't call this smoothing at all. It's binning with intervals of width 10 based on the 1st and 99th percentiles, except that values below the 1st percentile get put in the first bin regardless, and values greater than the 99th percentile + 10 will be mapped to 0. Perhaps that rule never bit the authors or previous users.
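
    To make those edge cases concrete, here is a toy sketch with made-up numbers, pretending the 1st and 99th percentiles are 20 and 60:

    Code:
    clear
    input mass
    15
    25
    75
    end
    gen smooth = 0
    local p1 20
    local p99 60
    local interval 10
    forvalues i = `p1'(`interval')`p99' {
        if `i' == `p1' replace smooth = `i' if mass <= `i' + `interval'
        else replace smooth = `i' if mass > `i' & mass <= `i' + `interval'
    }
    list    // mass = 15 lands in the first bin (smooth = 20); mass = 75 stays at 0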

    Note that in samples of size below 100, the 1st and 99th percentiles returned by summarize are just the minimum and maximum in any case.

    Why follow it at all? What do you want to do? What's the rationale for 10? If 10 is a good bin width, then something like 10 * floor(x/10) or 10 * ceil(x/10) is immensely simpler as a rule. More on that in a Speaking Stata column in the Stata Journal later this year.
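
    If the floor-based rule sounds abstract, here is a minimal sketch on made-up data (the variable name mass is just carried over from above):

    Code:
    clear
    set obs 5
    gen mass = 10 * _n + 3            // 13, 23, 33, 43, 53
    gen smooth = 10 * floor(mass/10)  // bins down to 10, 20, 30, 40, 50
    list mass smooth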





    • #3
      Thanks for the lightning-fast response!

      The code is not mine and is, as you say, very much ad hoc. I do not have access to the source data, only the output data and the do-file used to generate it (this code is the part of it I did not understand). I assume the code was run with the detail option specified, as the output does bin the observations as you describe.

      Thanks also for your suggestions of other code that could be used to the same effect. I will keep them in mind.
