Regression discontinuity using local linear regression

Anton Ivanov

Join Date: Sep 2014

Posts: 267
#1

20 Aug 2020, 14:44

Hello!

I am studying the relationship between average online ratings and sales. To address endogeneity caused by omitted variable bias, I would like to adapt the regression discontinuity identification approach employed by Li, X. (2018). Impact of average rating on social media endorsement: The moderating role of rating dispersion and discount threshold. Information Systems Research, 29(3), 739-754.

Similar to the platform design described by Li (2018), my platform of interest has an institutional feature in how it displays the aggregated product rating, and this feature provides an opportunity for identification. For a given product with at least two review ratings (ranging from 1 to 5 stars), the platform calculates the average of these ratings and rounds it to the nearest star. For example, one product with an average user rating of 4.47 is rounded down and displayed as a 4-star rating, while another one with a mean rating of 4.59 is rounded up and displayed as 5-star. Consequently, there is a one-star difference between the displayed average ratings of the two products, although their true average ratings are relatively close. As in the study by Li (2018), the rounded average rating is centrally displayed on the product's page, while the true mean rating is not displayed.
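The rounding rule can be illustrated with a minimal Stata sketch. It assumes the platform applies plain rounding to the nearest integer, which is how the description above reads; the platform's actual implementation is not known.

```stata
* Assumption: displayed rating = true mean rounded to the nearest star
display round(4.47)   // 4
display round(4.59)   // 5
```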

I further adapt the identification approach used by Li (2018). Let r_i be the true mean user rating of product deal i, which may fall within a small (e.g., 0.2-star) bandwidth of a cutoff c. The value of c can be 1.5, 2.5, 3.5, or 4.5. Each cutoff, combined with the 0.2-star bandwidth, defines one rating range, such as 4.5 ± 0.2. In principle there are four rating ranges between 1 and 5 stars; in my data there are only three, because the minimum r_i is 2.6. I pool data from these rating ranges and use the 0.2-star bandwidth in the main part of the analysis. Following the approach employed by Li (2018), I would like to use standard local linear regression as specified in Equation (1) to estimate the causal effect of the displayed mean rating:

y_i = a0 + b*I(r_i >= c) + a1*(r_i - c) + a2*(r_i - c)*I(r_i >= c) + e_i, (1)

where the dependent variable y_i is the natural logarithm of the number of sales of product i. If r_i ≥ c, then I(r_i ≥ c) = 1 and the product's mean rating is rounded up to the nearest star; otherwise, I(r_i ≥ c) = 0, and it is rounded down. Because the discontinuity in the outcome y_i at the cutoff is likely to be induced solely by the indicator function I(r_i ≥ c), the coefficient b estimates the causal effect of an extra star in the displayed average rating (Li 2018).
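To make the role of b explicit, Equation (1) can be read as two separate lines, one on each side of the cutoff. The sketch below assumes that the error term has conditional mean zero within the bandwidth, which is the usual sharp-RD reading of such a specification and not something taken from Li (2018):

```latex
\begin{align*}
E[y_i \mid r_i] &= a_0 + a_1 (r_i - c), & r_i &< c \\
E[y_i \mid r_i] &= (a_0 + b) + (a_1 + a_2)(r_i - c), & r_i &\ge c \\
\lim_{r \downarrow c} E[y_i \mid r_i = r] - \lim_{r \uparrow c} E[y_i \mid r_i = r] &= b
\end{align*}
```

So b is the vertical gap between the two fitted lines at r_i = c, while a1 and a1 + a2 are the slopes below and above the cutoff.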

Given the example data provided below, what would be the correct way of estimating Equation (1) in Stata?

Thank you for your feedback!

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte id float(y_i r_i c)
 1  2.890372  4.285714 4.5
 3  2.890372      3.76 3.5
 4  2.397895  4.404762 4.5
 5 2.3025851 4.1666665 4.5
 6  2.397895 3.9333334 3.5
 7 1.7917595 4.4102564 4.5
 9 2.0794415 4.4102564 4.5
10 2.3025851      3.75 3.5
11 2.0794415  4.446429 4.5
13 2.3025851  3.789474 3.5
15 2.0794415  3.666667 3.5
16 2.1972246  4.285714 4.5
17 2.0794415  2.666667 2.5
19 2.6390574  4.122807 4.5
20  2.772589 4.7272725 4.5
22 2.0794415  4.652174 4.5
23  3.367296  4.263158 4.5
24  2.484907 4.4444447 4.5
25 1.3862944       4.6 4.5
27 1.7917595      4.75 4.5
28  1.609438       3.8 3.5
30 1.7917595  4.446429 4.5
31 2.3025851  4.209677 4.5
32 1.7917595 3.3243244 3.5
33 1.7917595 4.4745765 4.5
34 1.3862944      3.92 3.5
35  1.609438    4.9375 4.5
37 2.0794415  3.333333 3.5
39 2.1972246         5 4.5
40 1.7917595      4.75 4.5
41 2.6390574 4.6666665 4.5
42 2.3025851       3.5 3.5
44 2.0794415  4.402985 4.5
45 2.1972246 4.4329896 4.5
47 1.3862944         5 4.5
49  1.609438  4.560606 4.5
51 1.3862944 4.4329896 4.5
53 1.0986123      3.51 3.5
55   1.94591       3.6 3.5
56 2.3025851 4.3333335 4.5
58   1.94591  4.710836 4.5
59  1.609438  3.448276 3.5
62 1.0986123 4.6312847 4.5
63 1.7917595      4.04 4.5
64   1.94591 4.2222223 4.5
65  1.609438  4.714286 4.5
68 1.0986123         3 2.5
69 1.0986123 4.2222223 4.5
70 1.3862944         5 4.5
71  .6931472  3.666667 3.5
72  1.609438  4.428571 4.5
73 1.0986123 4.3488374 4.5
74 1.7917595 4.7412586 4.5
75  1.609438         5 4.5
76 1.7917595       4.4 4.5
77 1.0986123      3.88 3.5
79  2.397895  3.568282 3.5
80 1.0986123         5 4.5
82 1.0986123  4.769231 4.5
83 1.3862944 4.6153846 4.5
84   1.94591  4.708122 4.5
85 1.0986123         5 4.5
86 1.0986123      3.92 3.5
87  1.609438         5 4.5
88 1.0986123  4.428571 4.5
end
Tags: endogeneity, local linear regression, omitted variable bias, Regression Discontinuity
Salvatore Lattanzio

Join Date: Feb 2018

Posts: 20
#2

21 Aug 2020, 02:36

Hi Anton,

There are several ways to estimate Equation (1): parametrically or non-parametrically, and with or without an optimal bandwidth.

First, if you want to estimate a parametric local linear regression, you can simply run a linear regression on the observations within the bandwidth you choose, or within the optimal bandwidth (e.g., the one computed using the algorithm of Calonico et al. (2017)).

Code:

* Normalize running variable and generate treatment variable
* (>= matches I(r_i >= c) in Equation (1); missing n_i is left missing)
gen n_i = r_i - c
gen d = n_i >= 0 if !missing(n_i)

* Parametric local linear regression with arbitrary bandwidth
reg y_i i.d##c.n_i if abs(n_i) < .2

* Parametric local linear regression with optimal bandwidth (Calonico et al. 2017)
rdbwselect y_i n_i
global h = e(h_mserd)
reg y_i i.d##c.n_i if abs(n_i) < $h

Second, you can run a non-parametric local linear regression using rdrobust. It has a number of options; here I provide just a basic example.

Code:

* Non-parametric local linear regression with optimal bandwidth (Calonico et al. 2017)
rdrobust y_i n_i
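As a slightly fuller sketch of the same idea, the commands below spell out some of the rdrobust options explicitly. They assume that the rdrobust package is installed and that n_i is the normalized running variable r_i - c; the option values are illustrative choices, not recommendations from this thread.

```stata
* Assumes n_i = r_i - c already exists and the rdrobust package is installed

* Spell out the basic call: cutoff at 0, local linear fit (p = 1),
* triangular kernel, MSE-optimal common bandwidth
rdrobust y_i n_i, c(0) p(1) kernel(triangular) bwselect(mserd)

* Same estimator, but with the fixed 0.2-star bandwidth from the question
rdrobust y_i n_i, c(0) p(1) h(0.2)

* Visual check of the discontinuity at the pooled, normalized cutoff
rdplot y_i n_i, c(0)
```

With only the 65 example observations, these commands may stop for lack of observations near the cutoff or give very imprecise estimates; they are meant to be run on the full sample.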
Anton Ivanov

Join Date: Sep 2014

Posts: 267
#3

21 Aug 2020, 08:28

Salvatore, thank you very much for your response. It is very helpful!