Hello!
I am studying the relationship between average online ratings and sales. To address endogeneity caused by omitted variable bias, I would like to adapt the regression discontinuity identification approach employed by Li (2018): Li, X. (2018). Impact of average rating on social media endorsement: The moderating role of rating dispersion and discount threshold. Information Systems Research, 29(3), 739-754.
Similar to the platform design described by Li (2018), my platform of interest has an institutional feature, the way it displays the aggregated product rating, that provides an opportunity for identification. For a given product with at least two review ratings (ranging from 1 to 5 stars), the platform calculates the average of these ratings and rounds it to the nearest star. For example, one product with an average user rating of 4.47 is rounded down and displayed as a 4-star rating, while another one with a mean rating of 4.59 is rounded up and displayed as a 5-star rating. Consequently, there is a one-star difference between the displayed average ratings of the two products, although their true average ratings are close. As in the study by Li (2018), the rounded average rating is centrally displayed on the product’s page, while the true mean rating is not displayed.
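To make the display rule concrete, here is a minimal sketch using the two ratings from the example above. The variable name displayed is hypothetical, and I am assuming that the platform's rounding behaves like Stata's round() function.

```stata
* Sketch only: "displayed" is a hypothetical variable name
clear
input float r_i
4.47
4.59
end

* Round the true mean rating to the nearest whole star
generate byte displayed = round(r_i)
list r_i displayed
```

Here 4.47 is displayed as 4 stars and 4.59 as 5 stars.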
I further adapt the identification approach used by Li (2018). Let r_i be the true mean user rating of product deal i, which may fall within a small (e.g., 0.2-star) bandwidth of a cutoff c. The value of c can be 1.5, 2.5, 3.5, or 4.5. Each of these cutoffs, together with its bandwidth, corresponds to one rating range, such as (4.5 ± 0.25). In total, there are four such rating ranges between 1 and 5 stars (in my data there are effectively three, because the minimum of r_i is 2.6). I pool the data from these rating ranges and use the 0.2-star bandwidth in the main part of the analysis. Following Li (2018), I would like to use a standard local linear regression, as specified in Equation (1), to estimate the causal effect of the displayed mean rating:
y_i = a0 + b*I(r_i >= c) + a1*(r_i - c) + a2*(r_i - c)*I(r_i >= c) + e_i, (1)
where the dependent variable y_i is the natural logarithm of the number of sales of product i. If r_i ≥ c, then I(r_i ≥ c) = 1 and the product’s mean rating is rounded up to the nearest star; otherwise, I(r_i ≥ c) = 0 and it is rounded down. Because the discontinuity in the outcome y_i at the cutoff is likely to be induced solely by the indicator function I(r_i ≥ c), the coefficient b estimates the causal effect of an extra star in the displayed average rating (Li 2018).
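For concreteness, here is a minimal sketch of one way Equation (1) might be set up in Stata, using a few rows taken from the example data. The names dist and above and the scalar h are my own hypothetical choices, and the robust standard errors are an assumption rather than something taken from Li (2018). I am not sure this is the right specification.

```stata
* Sketch only: dist, above, and h are hypothetical names
clear
input byte id float(y_i r_i c)
 1 2.890372  4.285714  4.5
 4 2.397895  4.404762  4.5
 7 1.7917595 4.4102564 4.5
11 2.0794415 4.446429  4.5
22 2.0794415 4.652174  4.5
25 1.3862944 4.6       4.5
32 1.7917595 3.3243244 3.5
37 2.0794415 3.333333  3.5
41 2.6390574 4.6666665 4.5
42 2.3025851 3.5       3.5
53 1.0986123 3.51      3.5
55 1.94591   3.6       3.5
59 1.609438  3.448276  3.5
end

* Bandwidth around each cutoff (0.2 stars, as in the main analysis)
scalar h = 0.2

* Centered running variable (r_i - c) and indicator I(r_i >= c)
generate double dist = r_i - c
generate byte above = r_i >= c if !missing(r_i, c)

* Equation (1), pooled over cutoffs and restricted to the bandwidth:
* b is the coefficient on 1.above, a1 on dist, a2 on 1.above#c.dist
regress y_i i.above c.dist i.above#c.dist if abs(dist) <= scalar(h), vce(robust)
```

The factor-variable terms i.above, c.dist, and i.above#c.dist correspond to the three regressors in Equation (1), so separate slopes are fitted on each side of the cutoff.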
Given the exemplar data provided below, what would be the correct way of estimating Equation (1) using Stata?
Thank you for your feedback!
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input byte id float(y_i r_i c)
 1  2.890372  4.285714 4.5
 3  2.890372      3.76 3.5
 4  2.397895  4.404762 4.5
 5 2.3025851 4.1666665 4.5
 6  2.397895 3.9333334 3.5
 7 1.7917595 4.4102564 4.5
 9 2.0794415 4.4102564 4.5
10 2.3025851      3.75 3.5
11 2.0794415  4.446429 4.5
13 2.3025851  3.789474 3.5
15 2.0794415  3.666667 3.5
16 2.1972246  4.285714 4.5
17 2.0794415  2.666667 2.5
19 2.6390574  4.122807 4.5
20  2.772589 4.7272725 4.5
22 2.0794415  4.652174 4.5
23  3.367296  4.263158 4.5
24  2.484907 4.4444447 4.5
25 1.3862944       4.6 4.5
27 1.7917595      4.75 4.5
28  1.609438       3.8 3.5
30 1.7917595  4.446429 4.5
31 2.3025851  4.209677 4.5
32 1.7917595 3.3243244 3.5
33 1.7917595 4.4745765 4.5
34 1.3862944      3.92 3.5
35  1.609438    4.9375 4.5
37 2.0794415  3.333333 3.5
39 2.1972246         5 4.5
40 1.7917595      4.75 4.5
41 2.6390574 4.6666665 4.5
42 2.3025851       3.5 3.5
44 2.0794415  4.402985 4.5
45 2.1972246 4.4329896 4.5
47 1.3862944         5 4.5
49  1.609438  4.560606 4.5
51 1.3862944 4.4329896 4.5
53 1.0986123      3.51 3.5
55   1.94591       3.6 3.5
56 2.3025851 4.3333335 4.5
58   1.94591  4.710836 4.5
59  1.609438  3.448276 3.5
62 1.0986123 4.6312847 4.5
63 1.7917595      4.04 4.5
64   1.94591 4.2222223 4.5
65  1.609438  4.714286 4.5
68 1.0986123         3 2.5
69 1.0986123 4.2222223 4.5
70 1.3862944         5 4.5
71  .6931472  3.666667 3.5
72  1.609438  4.428571 4.5
73 1.0986123 4.3488374 4.5
74 1.7917595 4.7412586 4.5
75  1.609438         5 4.5
76 1.7917595       4.4 4.5
77 1.0986123      3.88 3.5
79  2.397895  3.568282 3.5
80 1.0986123         5 4.5
82 1.0986123  4.769231 4.5
83 1.3862944 4.6153846 4.5
84   1.94591  4.708122 4.5
85 1.0986123         5 4.5
86 1.0986123      3.92 3.5
87  1.609438         5 4.5
88 1.0986123  4.428571 4.5
end