
  • Estimates of (un-)conditional quantiles through -qreg- versus other commands

    Dear all,

    I am using Stata 13.1 (current update level: 19 Dec 2014) and would like to know why, under certain conditions, Stata's -qreg- yields different estimates for (unconditional and conditional) quantiles than commands such as -sum, d-, -tabstat-, or -centile-. Below I focus on the median, but the problem can be replicated with other quantiles.

    There are no differences between -qreg- and the other commands when the median is estimated for (sub-)groups with an odd number of cases. Please run the following lines of code:

    Code:
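    clear
    set obs 1001                         // odd number of observations
    set seed 54321

    gen y = rnormal()                    // standard normal draws

    qreg y, quant(.5)                    // median as the constant of a quantile regression
    centile y, centile(50)               // two alternatives are -sum y, d- and -tabstat y, stat(median)-

    sort y
    list in 500/502                      // middle observation (501) and its two neighbors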
    In the scenario above, -qreg y- and -centile y- (and the alternatives) report the same median of y, namely -.0041289. This is not surprising, since there is a single middle observation. After -sort-ing on y, -list in 500/502- shows this middle value and its neighbors.

    Differences arise when the median is estimated for an even number of observations. Simply change the second line of the code above to -set obs 1000- and rerun the example. Now -qreg y- yields -.0026933 as the estimate of the median (the constant of the model), whereas -centile y- yields -.0034111. The output of -list- suggests that -qreg- reports the larger of the two middle observations, -.0026933. (Why?) -centile-, in contrast, does what probably most people here would do: it takes the values of the two middle observations and calculates their mean: -dis (-.0041289-.0026933)/2- = -.0034111. It turns out that, in univariate scenarios, -qreg- always reports an empirical value (the value of an observation in the analyzed sample) for the constant of the model.
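
    For convenience, here is the even-n variant written out in full. It is only a sketch of the change described above; the last line computes the midpoint of the two middle observations by hand so that it can be compared with the output of -centile- and -qreg-.

    Code:
    clear
    set obs 1000                         // even number of observations
    set seed 54321

    gen y = rnormal()

    qreg y, quant(.5)                    // constant = -qreg-'s estimate of the median
    centile y, centile(50)               // median as reported by -centile-

    sort y
    list in 500/501                      // the two middle observations
    display (y[500] + y[501])/2          // midpoint of the two middle observations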

    Again, my question is why -qreg- behaves this way. The differences I described also affect estimates of conditional quantiles (see also the quote from the Stata Manual below). I have not found any explanation for this behavior in the literature on quantile regression, in the Stata Manual, or on Statalist, nor can I think of a good reason from a statistical or computational point of view.
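
    To look at the conditional case, something along the following lines can be used. This is a minimal sketch: the variable -group- is a hypothetical 0/1 indicator that I made up for illustration, constructed so that each group has an even number of cases (500). The group-wise medians implied by -qreg- can then be compared with those from -tabstat-.

    Code:
    clear
    set obs 1000
    set seed 54321

    gen y = rnormal()
    gen byte group = mod(_n, 2)          // hypothetical grouping variable, 500 cases per group

    qreg y group, quant(.5)
    display _b[_cons]                    // -qreg-'s conditional median for group == 0
    display _b[_cons] + _b[group]        // -qreg-'s conditional median for group == 1

    tabstat y, by(group) stat(median)    // group-wise medians for comparison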

    Some additional remarks:

    (1) In older releases of Stata, commands such as -sum, d- (and probably -centile-) would not calculate the median as the mean of the two middle values in the case of an even number of observations, but would pick one of the two middle observations. Could it be that StataCorp at some point changed the behavior of -sum, d-, -centile-, and others but simply forgot -qreg-?

    (2) The Stata Manual entry for -qreg- comments on a value reported in -qreg-'s output ("Raw sum of deviations 71102.5 (about 4934)") as follows: "This value [4934, that is] is a median (one of the two center observations), not the median, which would typically be defined as the midpoint of the two center observations." ([R] p. 1769)

    (3) Some of you might think that this question could have, and maybe should have, been directed to StataCorp's Tech Support. I agree. However, I reckoned that some of you might also want to know about this issue, and some might be able to help me out with it anyway.
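
    Related to point (2), here is a minimal sketch that compares the sum of absolute deviations of y around the lower middle observation, the upper middle observation, and their midpoint. The scalar names lo, hi, and mid are my own. With an even number of observations, the sum of absolute deviations is the same for any value between the two middle observations, so the three sums should agree up to rounding. I do not know whether this is what drives -qreg-'s choice.

    Code:
    clear
    set obs 1000
    set seed 54321

    gen y = rnormal()
    sort y

    scalar lo  = y[500]                          // lower of the two middle observations
    scalar hi  = y[501]                          // upper of the two middle observations
    scalar mid = (scalar(lo) + scalar(hi))/2     // midpoint of the two

    foreach m in lo hi mid {
        gen double ad = abs(y - scalar(`m'))     // absolute deviations around the candidate
        quietly summarize ad
        display "`m': sum of absolute deviations = " %14.8f r(sum)
        drop ad
    }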

    Any ideas?

    Best,
    Sebastian