Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Improving Code Efficiency for Calculating Cumulative Sums on Panel Data

    Hello Statalists,

    I am trying to generate a variable, 'total_votes_cum,' which calculates the cumulative 'total_votes' received in previous reviews for the same product ('prouct_parent') before the given review was posted. Please note that some products may have multiple reviews posted on the same date. I have provided sample data below:

    product_parent review_date total_votes total_votes_cum
    1 3/2/2015 10 0
    1 3/2/2015 5 0
    1 3/3/2015 5 15
    1 3/4/2015 5 20
    1 4/1/2015 7 25
    2 5/1/2015 8 0
    2 5/2/2015 6 8
    2 5/2/2015 6 8
    2 5/5/2015 6 20
    2 6/1/2015 7 26
    2 7/1/2015 12 33


    I have tested the following code to create 'total_votes_cum,' and it works fine. But the issue is that it is taking too long to process for a very large dataset with millions of reviews:


    by product_parent: gen total_votes_cum = total_votes

    local n = _N
    forval i = 1/`n' {
    forval j = 1/`n' {
    if product_parent[`i'] == product_parent[`j'] & review_date[`i'] > review_date[`j'] {
    replace total_votes_cum = total_votes_cum + total_votes[`j'] if _n == `i'
    }
    }
    }

    replace total_votes_cum = total_votes_cum - total_votes


    Is there any way that I can modify the code to expedite the process?


    Thank you,
    Sun


  • #2
    Well, I have to agree that if I wanted to find a slower way to do this than the code in #1, I'm not sure I could, or that it's even possible.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input byte product_parent float review_date byte total_votes
    1 20149 10
    1 20149  5
    1 20150  5
    1 20151  5
    1 20179  7
    2 20209  8
    2 20210  6
    2 20210  6
    2 20213  6
    2 20240  7
    2 20270 12
    end
    format %tdDDMonCCYY review_date
    
    rangestat (sum) wanted = total_votes, by(product_parent) interval(review_date . -1)
    replace wanted = 0 if missing(wanted)
    Notes:
    1. It is absolutely crucial that your date variable be a true Stata internal format date variable. The code will not run with a string date variable. So if your date variable is a string, you have to convert it first. Use the -date()- function; see -help date()- if that is not familiar. Also, with the particular dates you chose, it is not possible to know if they are MDY or DMY.
    2. -rangestat- is written by Robert Picard, Nick Cox, and Roberto Ferrer. It is available from SSC.
    3. The helpful way to show example data is with the -dataex- command. Had you done that, there would be no doubt whether your date is a string or a proper Stata date variable. Nor, if it is a proper Stata date variable, would there be any doubt whether it is MDY or DMY. So in the future, whenever you seek help with code, please use the -dataex- command to show example data, as I have done here. If you are running version 18, 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

    Comment

    Working...
    X