  • lpoly - problems using large dataset

    The lpoly command works fine as long as my dataset is not too large. Using a dataset with about 45,000,000 observations, I get gaps in the graph. I stored the smoothing grid and the smoothed points and saw that there seems to be a problem saving these two variables.

    Stata version: 15
    command: lpoly varA varB if varB >= -.4 & varB <= .4 [aweight=varC], noscatter ci bwidth(0.05) generate(gvarB gvarA)

    For 45 of the smoothing grid points, the corresponding smoothed point is saved, but for 5 grid points the smoothed point is missing. At the same time, Stata saved 5 smoothed points without a corresponding grid point. Increasing the number of observations in the dataset results in more affected grid points. (My dataset has 90,000,000 observations; using a 5% or 10% sample works fine. The first gaps in the graph appear with a 20% sample.)

    Is there an alternative command or a solution to the problem? Thank you in advance, Heiko

  • #2
    Hi Heiko
    I think your best option is to do the smoothing yourself, especially because you have already selected your bandwidth. As in many other nonparametric analyses, bandwidth selection tends to be the most time-consuming task, but once the bandwidth is chosen, the smoothing can be replicated "easily".
    Below is an example using the DUI dataset.
    Code:
    webuse dui, clear
    * The example below uses the Gaussian kernel because it is the easiest to implement without additional programming. It can be changed to suit your needs.
    * Lpoly example:
    lpoly citations fines, bw(0.4) gen(vx vy) kernel(gaussian)
    * vx can be recreated as follows (assuming you want to follow the lpoly recipe)
    sum fines
    gen double vvx=r(min)+(r(max)-r(min))*(_n-1)/(50-1) if _n<=50
    * Then you can loop over these grid values, running a simple OLS weighted by the Gaussian kernel.
    * Below is the example for the standard lpoly, which assumes degree zero (intercept only),
    * with the smoothed outcome saved in a new variable.
    gen double vvy=.
    forvalues i=1/50 {
        quietly reg citations [aw=normalden(fines,vvx[`i'],0.4)]
        quietly replace vvy=_b[_cons] in `i'
    }
    
    * You can modify this code to do the smoothing only for the values of interest. If you need to add your own sample weights, you can do it as follows:
    gen double vvy2=.
    forvalues i=1/50 {
        capture drop nweight
        gen double nweight=normalden(fines,vvx[`i'],0.4)*csize
        quietly reg citations [aw=nweight]
        quietly replace vvy2=_b[_cons] in `i'
    }
    * which should give you the same as
    lpoly citations fines [aw=csize], kernel(gaussian) bw(0.4) gen(vxx2 vyy2)
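
    The same loop can be extended beyond degree zero. The following is only a sketch under stated assumptions, not a replacement for lpoly: it assumes the Gaussian kernel and the 0.4 bandwidth from the example above, and the names gx, kw, dx, vy0, vy1, lx and ly1 are made up for illustration. For degree zero the kernel-weighted mean from summarize is enough, which may be cheaper than regress on a very large dataset. For degree one, the intercept of a kernel-weighted regression on the centered x variable is the smoothed value.

    ```stata
    * Sketch only: gx, kw, dx, vy0, vy1, lx, ly1 are hypothetical names
    webuse dui, clear
    quietly summarize fines
    generate double gx = r(min) + (r(max) - r(min))*(_n-1)/(50-1) if _n <= 50

    generate double kw  = .
    generate double dx  = .
    generate double vy0 = .
    generate double vy1 = .

    forvalues i = 1/50 {
        * Gaussian kernel weight and x centered at the i-th grid point
        quietly replace kw = normalden(fines, gx[`i'], 0.4)
        quietly replace dx = fines - gx[`i']

        * Degree 0: the smooth is the kernel-weighted mean
        quietly summarize citations [aw = kw], meanonly
        quietly replace vy0 = r(mean) in `i'

        * Degree 1: the smooth is the intercept of the weighted regression
        quietly regress citations dx [aw = kw]
        quietly replace vy1 = _b[_cons] in `i'
    }

    * For comparison, the built-in local-linear fit
    lpoly citations fines, degree(1) kernel(gaussian) bwidth(0.4) generate(lx ly1) nograph
    list gx lx vy1 ly1 in 1/5
    ```

    If the grids line up, vy1 and ly1 should be close. Listing a few rows is a cheap way to confirm that before running the loop on the full data.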
    Hope this helps
    Fernando



    • #3
      Thank you very much Fernando. I will try that. Cheers Heiko



      • #4
        You're welcome. One thing I wanted to let you know: if you copy the code directly into Stata, you may get some "errors". The reason is that spaces copied from this editor (the forum) are not always recognized as spaces by the Stata editor. At least not for me.
        So it is better to retype the code rather than just copy and paste it.
        Fernando



        • #5
          I trust the OP has resolved his issue, but I thought I'd add a comment to this 6+ year old thread.

          There may be a bug in the lpoly command, or possibly the documentation. To get the smoothed variable as output, use the following syntax:

          lpoly yvar time, at(time) gen(yhat)

          I'll copy in some code demonstrating what does and does not work with one of Stata's sample datasets:

          Code:
          sysuse dir
          sysuse gnp96.dta
          
          * gnp96.dta is a time series dataset:
          describe
          tsset
          
          * Quarterly percent changes and annualized quarterly percent changes:
          gen double pcgnp96 = (gnp96/L.gnp96) -1
          gen double pcagnp96 = (gnp96/L.gnp96)^4 -1
          gen long t=_n
          
          list in 1/4, clean
          list in 141/142, clean
          sum
          
          * This code creates less helpful variables, filled for only 50 observations:
          lpoly pcagnp96 date , name(gnp01a) gen(gengrid1a genhat1a)
          
          /* Although t is perfectly correlated with date, this will create
            a variable that is filled for only a subsample of observations:   */
          
          lpoly pcagnp96 date , name(gnp01b) at(t) gen(genhat1b)
          
          /* This code creates a workable smoothed variable:   */
          
          lpoly pcagnp96 date , name(gnp01c) at(date) gen(genhat1c)
          
          * genhat1c is the output of the correct coding:
          sum
          
          /* Note: when I've worked with a dataset where yvar is nonmissing for
            fewer observations than the entire dataset, the genhat1c run has
            produced a less helpful graph that covers the full dataset range.
            To address this, use this sort of overlay code:
            */
          
          twoway scatter pcagnp96 date  || ///
            line  genhat1c date, name(combo1c) legend(col(2) position(6))
          
          des, short
          exit
          I've noticed that the appearance of the fitted line changes when the generate option is included. That may or may not be a graphic artifact. I am running Stata 18.0 on Windows 11.
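
          To see how many values each form of generate() actually fills in, a quick count may help. This is only a sketch: x_grid, h_grid and h_at are names I made up for illustration, and nograph is used simply to skip the plots.

          ```stata
          * Sketch only: x_grid, h_grid, h_at are hypothetical names
          sysuse gnp96, clear
          tsset date
          generate double pcagnp96 = (gnp96/L.gnp96)^4 - 1

          * Without at(): the grid and the smooth are stored at n() points
          lpoly pcagnp96 date, nograph generate(x_grid h_grid)

          * With at(date): the smooth is evaluated at each value of date
          lpoly pcagnp96 date, nograph at(date) generate(h_at)

          * Compare how many observations each variable fills
          count if !missing(h_grid)
          count if !missing(h_at)
          count if !missing(pcagnp96)
          ```

          Comparing the three counts shows directly whether a generated variable covers the whole series or only the smoothing grid.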

