Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Excel´s =SUMIF for STATA

    I have the followin set of data (excerpt):

    likes comments date date1
    1 0 27jul2009 14apr2009
    0 0 19sep2009 15apr2009
    1 0 21sep2009 16apr2009
    1 3 21sep2009 17apr2009
    2 8 21nov2009 18apr2009
    1 0 24nov2009 19apr2009
    2 0 25nov2009 20sep2009
    1 0 21dec2009 21sep2009
    3 5 22dec2009 22apr2009
    1 0 30dec2009 23apr2009
    1 0 30dec2009 24apr2009
    1 0 14jan2010 25apr2009
    1 0 14jan2010 26apr2009


    I would like to sum the variables likes and comments, if the date in date1 appears in date.

    for Excel that woud be =SUMIF (range, criteria, [sum_range])

    For example: 21sep2009 is in date1 and appears in date. So now all likes and commentst are summed up so in sum i have 2 likes and 3 comments for that date. I wish to have a new variable showing "2" and another variable showing "3" next to the date 21sep2009 in date1.

    I hope someone can halp me on that!!
    Last edited by Christian Beer; 14 Nov 2019, 10:49.

  • #2
    It's not clear to me why your variables are aligned at all.

    Where do date1 and date2 come from? Perhaps date2 is really a different but related dataset and you should think about a merge with the other variables.

    I can't myself imagine a good code suggestion without understanding how the data arise.

    This is an excerpt but I have to guess that you have left out detail important to understanding this. No identifier variables at all?

    Comment


    • #3
      I have a dataset containing millions of social media posts with the related date, number of likes and number of comments for a long period from 2006 to 2019.

      I only want to analyse all posts from 2009/01/01 until 2016/12/31 (but without touching the dataset), this is why I creaded the variable date1 which shows all days for that period. I´m sorry to have confused with my excerpt. One problem is, that there has not been a post for each day of this time window.

      I need to count the total number of likes and comments for each day of the window. If there has been no post on that day, hence the date in date 1 does not appear in date, I would have 0 likes and 0 comments. At the end I wish to have three variables: The date in that period and the number of likes and comments for each date.

      I hope I could point out clearer what I need. Thank you for your help!!
      Last edited by Christian Beer; 14 Nov 2019, 11:55.

      Comment


      • #4
        I think you have to loop over -date1-, because you are searching the whole file for a match. This is not really the kind of think Stata is good at, and may cause you trouble if you millions of observations, but:

        Code:
        gen total_likes=.
        gen total_comments=.
        levelsof date1, local(datelist)
        foreach D in `datelist' {
            egen tmp1 = total(like *(`D'==date))
            egen tmp2 = total(comm*(`D'==date))
            replace total_likes = tmp1 if date1==`D'
            replace total_comments=tmp2 if date1==`D'
            drop tmp1 tmp2
        }
        should get you there.

        Comment


        • #5
          Or, more directly

          Code:
          save tmpfile
          collapse (sum) total_likes=likes total_comments=comments, by(date)
          ren date date1
          merge 1:m date1 using tmpfile
          drop if _merge==1   // ignore totals if there is no date1 in the original file
          drop _merge

          will be much faster and robust with respect to memory.

          hth,
          Jeph

          Comment


          • #6
            Nr. 2 worked perfectly!!! Thank you so much for your help!!! I might catch my papers submission deadline now!! :D

            Comment

            Working...
            X