
  • How to run a non-parametric regression when you want the dependent variable to be a density?

    I want to run a non-parametric regression with a count as the dependent variable (say, the number of crimes) and a continuous variable as the key regressor (say, distance from subway stations). This would be the regression equivalent of a kernel density plot (twoway kdensity) of the same data. If I run this regression with npregress, it averages the dependent variable within each kernel bandwidth, and that variable equals 1 in every row (in my dataset, each reported crime is a row). As in twoway kdensity, I would like the non-parametric regression to count the events (crimes) within the bandwidth instead of averaging them.

    An alternative way to run the above regression would be to aggregate the dependent variable (crimes) into (spatial) units and then run an OLS regression. However, I am trying to use a non-parametric approach in my analysis. Any ideas on how to run my analysis in a non-parametric fashion? Thanks!

  • #2
    You didn't get a quick answer. You'll increase your chances of a helpful answer by following the FAQ on asking questions - provide Stata code in code delimiters, readable Stata output, and sample data using dataex.

    You might look at a recent post by Wooldridge and the 2nd edition of his text.



    • #3
      Thanks Phil. Another way of asking my previous question is asking how to subtract two kernel density plots. This was treated in this thread eleven years ago.

      The code for the two kernel density plots is:
      twoway (kdensity distancekm4 if year==2005, color(sand) lcolor(sand)) ///
      (kdensity distancekm4 if year==2007, ///
      fcolor(none) color(black)), ///
      title(2007-2005) ///
      xlabel(0 .5 1 2 3 4 5 6, grid) ///
      xtitle("Km from the new subway network") ///
      ytitle("Kernel density of the crime category") ///
      legend(off) name(Year2008)
      The previous command draws two kernel density plots with this output.

      You can get a sense of the data with the following ten observations:
      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input float(distancekm4 year crime)
      .04580309 2005 -1
        .302739 2006  1
       3.433682 2006  1
       1.483485 2008  1
       .6612761 2008  1
       2.501483 2008  1
      .25662634 2009  1
      .25662634 2007  1
       2.411719 2010  1
      .04629777 2010  1
      end
      label values crime crimelabel
      label def crimelabel -1 "Year 2005", modify
      label def crimelabel 1 "Year>=2006", modify

      A wrong approach to subtract the two previous kernel density plots is to run the following code:

      npregress kernel crime distancekm4 if (year==2007 | year==2005), vce(bootstrap, reps(100) seed(123))

      In the previous code, the variable crime is an indicator that takes value 1 for the endline (the year 2007) and value -1 for the baseline (the year 2005). Although the code runs, it doesn't do what I want: it estimates a conditional expectation of 1s and -1s, whereas I want the difference between the density of crimes (rows) where crime==1 (crimes in the endline) and the density where crime==-1 (crimes in the baseline).

      I wonder whether the current npregress command or other user-written commands could give me estimates and standard errors for the difference between two kernel density plots.
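A minimal sketch of the point estimate only (no standard errors), assuming the official kdensity command with its at() and generate() options. The names x, fpool, f0, f1, diff_density and diff_count are hypothetical. Because each density integrates to 1, the sketch also rescales by the number of crimes in each year, which is closer to a difference in counts than a difference in shapes:

```stata
* Sketch only: all generated variable names below are hypothetical
preserve
keep if inlist(year, 2005, 2007)

* Common grid of 100 evaluation points from the pooled sample
kdensity distancekm4, nograph n(100) generate(x fpool)
local bw = r(bwidth)          // reuse one bandwidth for both years

* Year-specific densities evaluated on the same grid
kdensity distancekm4 if year == 2005, nograph at(x) bwidth(`bw') generate(f0)
kdensity distancekm4 if year == 2007, nograph at(x) bwidth(`bw') generate(f1)

* Densities integrate to 1, so rescale by the number of crimes to compare counts
count if year == 2005
local n0 = r(N)
count if year == 2007
local n1 = r(N)
generate diff_density = f1 - f0
generate diff_count   = `n1'*f1 - `n0'*f0

twoway line diff_count x, sort yline(0) ///
    xtitle("Km from the new subway network") ///
    ytitle("Endline minus baseline crimes per km")
restore
```

Using the pooled bandwidth for both years is a design choice, not a requirement; it keeps the two curves equally smooth so that the difference is not driven by different bandwidths.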

      Thanks!
      Kenzo

      PS: The output for the npregress kernel command described above (that does not do what I want) running it with my full sample of 18,409 obs. is as follows:

      . npregress kernel crime distancekm4 if (year==2007 | year==2005), vce(bootstrap, reps(100) seed(123))
      (running npregress on estimation sample)

      Bootstrap replications (100)
      ----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
      ..................................................    50
      ..................................................   100

      Bandwidth
      ------------------------------------
                   |      Mean     Effect
      -------------+----------------------
       distancekm4 |  .1435896   .2565698
      ------------------------------------

      Local-linear regression                    Number of obs      =      5,651
      Kernel   : epanechnikov                    E(Kernel obs)      =        811
      Bandwidth: cross validation                R-squared          =     0.0534
      ------------------------------------------------------------------------------
                   |   Observed   Bootstrap                          Percentile
             crime |   Estimate   Std. Err.      z    P>|z|     [95% Conf. Interval]
      -------------+----------------------------------------------------------------
      Mean         |
             crime |   .1878787     .013184    14.25   0.000     .1582148    .2111885
      -------------+----------------------------------------------------------------
      Effect       |
       distancekm4 |   .3854463    .0818289     4.71   0.000     .2340934    .5651261
      ------------------------------------------------------------------------------
      Note: Effect estimates are averages of derivatives.





      • #4
        Hi, I would greatly appreciate feedback on how to improve my question so that somebody is more likely to take an interest in it. Should I re-post it under the subject "How to subtract two density plots"? Thanks!



        • #5
          Hi Kenzo
          I think your question is not completely clear because it's difficult to see the big picture of what you want to do.
          Your density example, for instance, does not use the "crime" variable at all, while your npregress example regresses crime against distance.

          Perhaps what you are trying to compare are not marginal densities of crime or densities of distance, but rather joint densities of crime and distance across time. Unfortunately, as far as I know, there is no method or command for that. But it would be "relatively" easy to program something like that, at least in terms of joint cumulative distributions, which you could use to see whether one stochastically dominates the other.
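As a rough illustration of the cumulative comparison, here is a sketch that simplifies to the marginal distribution of distance by year (not a joint distribution), assuming the official cumul and ksmirnov commands. F2005 and F2007 are hypothetical variable names:

```stata
* Sketch only: F2005 and F2007 are hypothetical variable names
cumul distancekm4 if year == 2005, generate(F2005)
cumul distancekm4 if year == 2007, generate(F2007)

* Overlay the two empirical distribution functions of distance
twoway (line F2005 distancekm4 if year == 2005, sort) ///
       (line F2007 distancekm4 if year == 2007, sort), ///
       xtitle("Km from the new subway network") ///
       ytitle("Cumulative share of crimes") ///
       legend(order(1 "2005" 2 "2007"))

* Two-sample Kolmogorov-Smirnov test of equal distance distributions
ksmirnov distancekm4 if inlist(year, 2005, 2007), by(year)
```

If one curve lies above the other everywhere, crimes in that year are concentrated closer to the network; the plot says nothing about the total number of crimes in each year.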

          So, if you provide more details about your research question and what you are thinking of doing, it may be easier to give you a better answer.
          HTH
          Fernando




          • #6
            Hi Fernando,

            Thanks for your message. My research question is whether closer proximity to a subway station has an effect on crime rates. For this, I have a georeferenced dataset with every reported crime in a city (Santiago, Chile). The density is of total crimes and of different types of crime. So the dependent variable in one analysis is total crimes, and in other analyses the dependent variables are different types of crime (robbery, larceny, burglary). The key variable is proximity to the subway network. I am analysing the effect of closer proximity to the subway network brought about by a subway expansion.

            I want to compare the density of crimes in the baseline (pre-expansion) with the density of crimes in the endline (post-expansion), where in both cases the key variable is distance to the post-expansion subway network. As I explain in post #3, when I plot both densities together, I obtain this output. In this output, the black line is the density of crimes in the endline (after the subway expansion) and the sand line is the density of crimes in the baseline (before the subway expansion). I want to obtain the difference between the two lines, with a confidence interval around this difference at a certain level (say, a 95% confidence interval).

            I know how to run this analysis by defining a spatial unit of aggregation, counting the crimes per unit (say, census block) and then running a regression. What I don't know is how to run a regression without aggregating the data into spatial units. In other words, I would like to run the analysis in a continuous fashion using a kernel (say, an Epanechnikov kernel with a specific bandwidth).
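One possible sketch of such a continuous analysis, assuming a hand-written helper and the official bootstrap prefix. The program kwdiff is hypothetical (it is not an existing command), and the evaluation point (1 km) and bandwidth (0.25 km) are arbitrary examples. It computes the kernel-weighted count of endline crimes minus that of baseline crimes at a single distance and bootstraps that difference:

```stata
* Sketch only: kwdiff is a hypothetical helper, not an official or user-written command
capture program drop kwdiff
program define kwdiff, rclass
    args x0 h                       // evaluation point and bandwidth, both in km
    tempvar u k
    generate double `u' = (distancekm4 - `x0') / `h'
    * Epanechnikov weight (the form Stata calls epan2), zero outside the bandwidth
    generate double `k' = cond(abs(`u') < 1, 0.75 * (1 - `u'^2) / `h', 0)
    summarize `k' if year == 2007, meanonly
    local s1 = r(sum)               // kernel-weighted count of endline crimes
    summarize `k' if year == 2005, meanonly
    return scalar diff = `s1' - r(sum)
end

preserve
keep if inlist(year, 2005, 2007)
* Difference at 1 km with a 0.25 km bandwidth; both values are arbitrary examples
bootstrap diff = r(diff), reps(100) seed(123): kwdiff 1 0.25
estat bootstrap, percentile
restore
```

Resampling the pooled rows lets the number of crimes per year vary across replications, which seems natural when the counts themselves are random; adding strata(year) would instead hold each year's total fixed. Looping over a grid of evaluation points would trace out the whole difference curve, and estimates very close to distance zero may suffer from the usual kernel boundary bias.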

            Are there any other relevant details that I have missed in this and previous posts?

            Many thanks,
            Kenzo



            • #7
              Hi Kenzo
              OK, the picture is a bit clearer, but some of the "results" you show do not match your description of the data.
              Do you mind producing the following figures?
              Code:
              two scatter crime distance if year==2005  || scatter crime distance if year==2007 
              two lpoly crime distance if year==2005  || lpoly crime distance if year==2007
              I think that will give a better idea of the type of data you have and of what you may need to do.
              Fernando



              • #8
                Hi Kenzo
                So crime doesn't really measure "crime density" or the number of crimes; it's just an indicator.
                And based on what you show, it doesn't matter how far or close you are in terms of distance: crime is always either 1 or -1.
                So perhaps you are trying to do something different from what you explain, or from what your data allow you to do.
                I think this is one of those questions that would be better answered in a discussion with a colleague or advisor who can sit down with you and see what kind of things the data are telling you.



                • #9
                  Hi,

                  What I would like to do is theoretically simple but I have not been able to implement it. I would like to modify the ado file of "npregress" by considering as the dependent variable not the y-variable itself (in my case, the categorical variable "crime"--the variable "crime" takes value 1 in the endline (2007) and -1 in the baseline (2005)--), but the sum of the elements of "crime" within a certain bandwidth. The key variable would continue being the same as in "npregress" (just the x-var, which in my case is the distance between the reported crime and the closest subway station in the endline).

                  Let us call this new command "npregdepsumker" (short for "non-parametric regression whose dependent variable is the sum of the y-var within a kernel"). If such a command existed, the non-parametric regression I would like to run would be:

                  npregdepsumker crime distancekm4 if (year==2007 | year==2005), vce(bootstrap, reps(100) seed(123))

                  We would interpret the effect reported for distancekm4 as the average derivative, with respect to distance to the closest subway station in the endline, of "the difference between the number of crimes in the endline and the number of crimes in the baseline".

                  I would greatly appreciate feedback if somebody could point out where (and if possible how) in the ado-file of npregress I should intervene (I got lost trying to navigate npregress's ado) to get what I would like to get. Maybe my question is out of the scope of this forum. If this is so, I would understand perfectly well.

                  Thanks!

                  ------

                  PS: (Answer to post #7) Hi Fernando

                  1. This is the figure for
                  two scatter crime distance if year==2005 || scatter crime distance if year==2007
                  2. This is the figure for
                  two lpoly crime distance if year==2005 || lpoly crime distance if year==2007
                  I defined that the variable "crime" takes value 1 after the subway expansion (2007) and value -1 before the subway expansion (2005). The previous two plots show that the instructions "scatter" and "lpoly" don't do what I want because:
                  1. "scatter" plots +1 and -1 in the y-axis and distance to the subway network in the x-axis.
                  2. "lpoly" calculates a conditional expectation of +1 in the endline and -1 in the baseline, not a density. In other words, I don't want a function that sums all y-values within a certain bandwidth and divides by that same sample size (expectation). Instead, I want a sum of y-values without dividing by the local sample size (something analogous to a density).
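The distinction in point 2 can be written out as follows, assuming the local-mean (degree 0) smoother that lpoly uses by default, a kernel $K$ with bandwidth $h$, and $y_i = +1$ for the $n_1$ endline crimes and $y_i = -1$ for the $n_0$ baseline crimes. The symbols $\hat m$, $\hat S$ and $\hat f_g$ are notation introduced here for illustration. npregress kernel fits a local-linear model by default, so the first expression is only its local-constant analogue:

```latex
% Local mean (what lpoly estimates): a kernel-weighted average of y
\hat m(x) = \frac{\sum_{i=1}^{n} K_h(x_i - x)\, y_i}{\sum_{i=1}^{n} K_h(x_i - x)},
\qquad K_h(u) = \frac{1}{h} K\!\left(\frac{u}{h}\right)

% Kernel-weighted sum (what is wanted): the numerator alone
\hat S(x) = \sum_{i=1}^{n} K_h(x_i - x)\, y_i
          = n_1 \hat f_1(x) - n_0 \hat f_0(x),
\qquad \hat f_g(x) = \frac{1}{n_g} \sum_{i \in g} K_h(x_i - x)

% Link between the two: the local mean divides the sum by the pooled intensity
\hat m(x) = \frac{\hat S(x)}{n_1 \hat f_1(x) + n_0 \hat f_0(x)}
```

So the undivided sum is the difference between two kernel density estimates, each scaled by its own number of crimes, when both are computed with the same kernel and bandwidth.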
                  Clarification: Post #8 by Fernando replies to the second part of this post (starting from "PS" in this post).

