
  • A numerical representation of distance from a regression line?

    Hello

    I am a novice Stata user and statistician. I have created a twoway scatter plot of aggregated travel distances (y axis) against passenger counts (x axis) for suburbs of Cape Town. I built it from data aggregated into bins from the original dataset using the collapse command (I think this was the correct method). I have fitted a regression line. The data are not normally distributed. I am trying to identify the 'worst outliers' for travel distance per capita. Is there a way to quantify this numerically? I can 'see' them on the graph, but for accuracy I would like to isolate the top 5 locations and describe them in more detail.

    Many thanks. Mark
    [Attached image: Main place travel (2).jpg]

  • #2
    Mark:
    See the "Leverage statistics" paragraph of the -regress postestimation- entry in the Stata .pdf manual.
    Kind regards,
    Carlo
    (Stata 19.0)



    • #3
      The appropriate measure of distance of any point from a regression line is the residual, obtainable through the predict command.

      Working on log scale is surely indicated. Some bigger deviations from the regression line are likely to be just a side-effect of heteros{c|k}edasticity and would be moderated by working on log scale.
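
      As a rough sketch of that idea, the lines below fit the regression on log scale, save the residuals with predict, and list the five observations with the largest absolute residuals. The auto dataset shipped with Stata is only a stand-in so that the code runs as written: price and weight play the roles of travel distance and passenger count, make plays the role of the suburb name, and ln_y, ln_x, res and abs_res are names made up for illustration.

      ```stata
      * Stand-in data: auto.dta ships with Stata. price and weight stand in for
      * travel distance and passenger count; these are NOT the poster's variables.
      sysuse auto, clear

      * Work on log scale
      generate double ln_y = ln(price)
      generate double ln_x = ln(weight)

      regress ln_y ln_x

      * Residual = observed minus fitted value, here on the log scale
      predict double res, residuals
      generate double abs_res = abs(res)

      * Largest absolute residuals first, then list the top 5
      gsort -abs_res
      list make price weight res in 1/5
      ```

      With the real data, the identifier in the list command would be whatever variable holds the suburb name. Residuals on log scale measure relative rather than absolute departures from the line, so the top 5 may differ from the points that look most extreme on the original graph.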



      • #4
        Carlo Lazzaro, aren't leverage statistics more a measure of how much of an outlier the predictor values are, rather than of how far the predicted values are from the observed regressand? For what Mark has asked for, I am more inclined to suggest looking directly for large (absolute values of) residuals. You might want to

        Code:
        predict e, resid
        and then look into outliers, perhaps using a boxplot, or less visually, using Nick's command:

        Code:
        extremes e
        NB. The command is available from SSC, via
        Code:
        ssc install extremes
        That said, the graph seems to show an almost canonical case of heteroskedasticity, with the variance of travel distance increasing in passenger count, and the suggestion in #3 of working with logs would mitigate this issue.
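
        One possible way to check whether logs help is to compare residual-versus-fitted plots and a heteroskedasticity test before and after the transformation. This is only a sketch: it uses the auto dataset shipped with Stata as a stand-in (price for travel distance, weight for passenger count), and ln_price and ln_weight are names chosen for illustration.

        ```stata
        * Stand-in data, not the poster's dataset
        sysuse auto, clear

        * Raw scale: fit, test for heteroskedasticity, plot residuals vs fitted
        regress price weight
        estat hettest            // Breusch-Pagan / Cook-Weisberg test
        rvfplot, yline(0)

        * Log scale: same checks after transforming both variables
        generate double ln_price  = ln(price)
        generate double ln_weight = ln(weight)
        regress ln_price ln_weight
        estat hettest
        rvfplot, yline(0)
        ```

        A fan shape in the first plot that narrows in the second would be consistent with the log scale moderating the problem, though the plots and tests should be read together rather than relying on either alone.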
        Last edited by Hemanshu Kumar; 29 Aug 2022, 03:14.



        • #5
          Hemanshu:
          I thought that Mark had already taken a look at his regression residuals.
          Kind regards,
          Carlo
          (Stata 19.0)



          • #6
            Many thanks everyone for your thoughts. Much appreciated.

            Time to hit the books. I will also switch to logs and see how that unfolds.

            Mark
