lpoly - problems using large dataset

Heiko Stueber

Join Date: Feb 2019

Posts: 2
#1


21 Feb 2019, 03:51

The lpoly command works fine as long as my dataset is not too large. Using a dataset with about 45,000,000 observations, I get gaps in the graph. I stored the smoothing grid and the smoothed points and saw that there seems to be a problem saving these two variables.

Stata version: 15
command: lpoly varA varB if varB >= -.4 & varB <= .4 [aweight=varC], noscatter ci bwidth(0.05) generate(gvarB gvarA)

For 45 of the grid points, the corresponding smoothed value is saved, but for 5 grid points the smoothed value is missing. At the same time, Stata saved 5 smoothed values without a corresponding grid point. Increasing the number of observations in the dataset results in more such mismatches. (My dataset has 90,000,000 observations; using a 5% or 10% sample works fine. The first gaps in the graph appear with a 20% sample.)
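One way to quantify the mismatch described above is to count the rows where only one of the two generated variables is filled. This is only a sketch, and it assumes the variables keep the names given in generate(), i.e. gvarB for the grid and gvarA for the smoothed values.

```stata
* Grid points stored without a smoothed value
count if !missing(gvarB) & missing(gvarA)

* Smoothed values stored without a grid point
count if missing(gvarB) & !missing(gvarA)

* Show every row where at least one of the two variables is filled
list gvarB gvarA if !missing(gvarB) | !missing(gvarA), noobs
```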

Is there an alternative command or a solution to the problem? Thank you in advance, Heiko

FernandoRios

Join Date: Apr 2014
Posts: 2459

21 Feb 2019, 06:28

Hi Heiko
I think your best option is to do the smoothing yourself, especially because you have already chosen your bandwidth. As in many other nonparametric analyses, bandwidth selection tends to be the most time-consuming task, but once the bandwidth is fixed, the estimate can be replicated "easily".
Below is an example using the DUI dataset.

Code:

webuse dui, clear
* The example below uses the Gaussian kernel because it is the easiest to implement without additional programming. It can be changed to suit your needs.
* lpoly example:
lpoly citations fines, bw(0.4) gen(vx vy) kernel(gaussian)
* vx can be recreated as follows (assuming you want to follow lpoly's recipe)
sum fines
gen double vvx=r(min)+(r(max)-r(min))*(_n-1)/(50-1) if _n<=50
* Then you can loop through these grid values, running a simple OLS weighted by the Gaussian kernel
* Below is the example for the standard lpoly, which assumes degree zero (intercept only)
* and the smoothed outcome will be saved in a new variable
gen double vvy=.
forvalues i=1/50 {
    reg citations [aw=normalden(fines,vvx[`i'],0.4)]
    replace vvy=_b[_cons] in `i'
}

* You can modify this code to do the smoothing only at the values of interest. If you need to add your own sample weights, you can do it as follows:
gen double vvy2=.
forvalues i=1/50 {
    capture drop nweight
    gen double nweight=normalden(fines,vvx[`i'],0.4)*csize
    reg citations [aw=nweight]
    replace vvy2=_b[_cons] in `i'
}
* which should give you the same as
lpoly citations fines [aw=csize], kernel(gaussian) bw(0.4) gen(vxx2 vyy2)
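For the command in the original question, the same idea might look like the sketch below. It assumes the names varA, varB and varC from that command, the bandwidth of 0.05, and lpoly's default of 50 grid points. Note two deliberate differences: it uses the Gaussian kernel (lpoly's default is Epanechnikov, so the curve will not match exactly), and it does not reproduce the ci option. Because a degree-zero fit is just a kernel-weighted mean, summarize with the meanonly option is used in place of regress, which should be lighter on a very large dataset.

```stata
* Assumed names from the original question: varA (outcome), varB (x), varC (weight)
local h = 0.05      // bandwidth from the original command
local n = 50        // number of grid points (lpoly's default)

* Build the grid over the range of varB in the estimation sample
summarize varB if varB >= -.4 & varB <= .4, meanonly
gen double gx = r(min) + (r(max) - r(min)) * (_n - 1) / (`n' - 1) in 1/`n'
gen double gy = .

forvalues i = 1/`n' {
    * Degree-zero local fit = kernel-weighted mean of varA around gx[i]
    summarize varA [aw = normalden(varB, gx[`i'], `h') * varC] ///
        if varB >= -.4 & varB <= .4, meanonly
    replace gy = r(mean) in `i'
}

* Plot the smoothed curve
line gy gx in 1/`n'
```

Since the loop writes each result with an explicit replace ... in, every grid point gets its smoothed value in the same row, which makes gaps easy to spot.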

Hope this helps
Fernando


Heiko Stueber

Join Date: Feb 2019

Posts: 2
#3

21 Feb 2019, 06:30

Thank you very much Fernando. I will try that. Cheers Heiko
FernandoRios

Join Date: Apr 2014

Posts: 2459
#4

21 Feb 2019, 06:42

You're welcome. I just wanted to let you know: if you copy the code directly into Stata, you may get some "errors". The reason is that spaces are not recognized as such when moving between this editor (the forum) and the Stata editor. At least not for me.
So it is better to retype the code rather than copy and paste it.
Fernando

David Howe

Join Date: Jan 2015
Posts: 6

09 May 2025, 21:18

I trust the OP has resolved his issue, but I thought I'd add a comment to this 6+ year old thread.

There may be a bug in the lpoly command, or possibly the documentation. To get the smoothed variable as output, use the following syntax:

lpoly yvar time, at(time) gen(yhat)

I'll copy in some code demonstrating what does and does not work with one of Stata's sample datasets:

Code:

sysuse dir
sysuse gnp96.dta

* gnp96.dta is a time series dataset:
describe
tsset

* Quarterly percent changes and annualized quarterly percent changes:
gen double pcgnp96 = (gnp96/L.gnp96) -1
gen double pcagnp96 = (gnp96/L.gnp96)^4 -1
gen long t=_n

list in 1/4, clean
list in 141/142, clean
sum

* Without at(), generate() fills only the first 50 observations (the default grid size), which is less helpful:
lpoly pcagnp96 date , name(gnp01a) gen(gengrid1a genhat1a)

/* Although t is perfectly correlated with date, this will create
  a variable with only a subsample of observations:   */

lpoly pcagnp96 date , name(gnp01b) at(t) gen(genhat1b)

/* This code creates a workable smoothed variable   */  

lpoly pcagnp96 date , name(gnp01c) at(date) gen(genhat1c)

* genhat1c is the output of the correct coding:
sum

/* Note: when I've worked with a dataset where yvar has fewer
  observations than the entire dataset, the genhat1c run has
  presented a less helpful graph which covers the full dataset range.
  To address this, use this sort of overlay code:
  */

twoway scatter pcagnp96 date  || ///
  line  genhat1c date, name(combo1c) legend(col(2) position(6))

des, short
exit

I've noticed that the appearance of the fitted line changes when the generate option is included. That may or may not be a graphic artifact. I am running Stata 18.0 on Windows 11.
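A likely explanation for the at(t) result in the code above, offered as a reading of the option rather than a confirmed diagnosis: at() supplies evaluation points on the scale of the x variable, and t runs 1, 2, ... while date is stored as a quarterly date number. Evaluation points that fall outside the range of date have no nearby data, so the smoothed value is missing there. The sketch below compares the two scales and generates the smoothed series without drawing a graph. The name yhat is only illustrative.

```stata
sysuse gnp96, clear
gen long t = _n

* Compare the two scales: date is a quarterly date number, t is 1, 2, ...
summarize date t

* Evaluate the smooth at every observed date and skip the graph
lpoly gnp96 date, at(date) generate(yhat) nograph

* Check how many observations did not receive a smoothed value
count if missing(yhat)
```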
