A numerical representation of distance from a regression line?

Mark Richards

Join Date: Aug 2022

Posts: 3
#1

29 Aug 2022, 01:26

Hello

I am a novice Stata user/statistician. I have created a twoway scatter plot of aggregated travel distances (y axis) against passenger counts (x axis) for suburbs of Cape Town. I created this from aggregated data bins from the original dataset using the collapse command (I think this was the correct method). I have a regression line. The data have a non-normal distribution. I am trying to identify the 'worst outliers' for travel distance per capita. Is there a way to represent this numerically? I can 'see' them on the graph, but for accuracy I would like to isolate the top 5 locations and describe them in more detail.

Many thanks. Mark
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17851
#2

29 Aug 2022, 02:14

Mark:
see the Leverage statistics paragraph in the -regress postestimation- entry of the Stata .pdf manual.

Kind regards,
Carlo
(Stata 19.0)
Nick Cox

Join Date: Mar 2014

Posts: 36054
#3

29 Aug 2022, 02:50

The appropriate measure of distance of any point from a regression line is the residual, obtainable through the predict command.

Working on log scale is surely indicated. Some bigger deviations from the regression line are likely to be just a side-effect of heteros{c|k}edasticity and would be moderated by working on log scale.
Hemanshu Kumar

Join Date: Mar 2015

Posts: 1548
#4

29 Aug 2022, 03:10

Carlo Lazzaro: aren't leverage statistics more a measure of how much of an outlier the predictor values are, rather than how far the predicted values are from the observed regressand? For what Mark has asked for, I am rather more inclined to suggest looking directly for large (absolute values of) residuals. You might want to run

Code:

predict e, resid

and then look into outliers, perhaps using a boxplot, or less visually, using Nick's command:

Code:

extremes e

NB. The command is available from SSC, via

Code:

ssc install extremes

That said, the graph seems to show an almost canonical case of heteroskedasticity, with the variance of travel distance increasing in passenger count, and the suggestion in #3 of working with logs would mitigate this issue.
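
Putting the pieces together, a minimal sketch might look like the following. Everything about the data here is an assumption for illustration: the simulated observations and the variable names suburb, passengers, distance, ln_distance, ln_passengers and abs_e are hypothetical stand-ins, not Mark's actual variables. With the real collapsed dataset in memory, the simulation block would be dropped and the names swapped.

```stata
* Hypothetical stand-in data (assumption): replace with the collapsed suburb dataset
clear
set seed 12345
set obs 50
generate suburb = "Suburb " + string(_n)
generate passengers = ceil(1000 * runiform())
generate distance = passengers * exp(rnormal(1, 0.5))

* Work on log scale, as suggested in #3
generate ln_distance = ln(distance)
generate ln_passengers = ln(passengers)
regress ln_distance ln_passengers

* The residual is the vertical distance of each point from the fitted line
predict e, residuals
generate abs_e = abs(e)

* Sort by absolute residual, largest first, and list the five most extreme suburbs
gsort -abs_e
list suburb distance passengers e in 1/5
```

Sorting on the absolute residual treats points far above and far below the line alike; sorting on e itself instead would pick out only the suburbs with unexpectedly long (or short) travel distances.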

Last edited by Hemanshu Kumar; 29 Aug 2022, 03:14.
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17851
#5

29 Aug 2022, 03:26

Hemanshu:
I thought that Mark had already taken a look at his regression residuals.

Kind regards,
Carlo
(Stata 19.0)
Mark Richards

Join Date: Aug 2022

Posts: 3
#6

30 Aug 2022, 06:05

Many thanks everyone for your thoughts. Much appreciated.

Time to hit the books. Will also amend to logs and see how that unfolds.

Mark