Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Countifs-like Function?

    I performed a sequential merge of two datasets, Dataset A and Dataset B, as I merely wished to horizontally append Dataset A to Dataset B.

    Dataset A contains the following variables:
    range_var_1
    range_var_2
    daterange_var

    Dataset B contains the following variables:
    obs_var_1
    date_begin_var_obs
    date_end_var_obs

    I have also created variablename1, variablename2, variablename3….variablename304.

    After the merging and the variablename creation, I wish replace the value for each variablename(1 to 304) with the result of something similar the following Excel countifs function:

    =COUNTIFS(DatasetA!$A:$A,$C2,DatasetA!$M:$M,F$1,Da tasetA!$N:$N,">="&$A2,DatasetA!$N:$N,"<="&$B2)

    Including the variables in the formula:
    =COUNTIFS(DatasetA!range_var_1,obs_var_1, DatasetA!range_var_2, variablename, DatasetA!daterange_var, ">="date_begin_var_obs, DatasetA!daterange_var, "<=”date_end_var_obs)

    So, basically, within the sequentially merged dataset, I have four criteria and four ranges on which to perform a count function using the variable values of each observation. In replacing the value for each variablename(1 to 304) with the result for the formula, variablename in the formula above would take on the name of the respective variable (literally variablename1…variablename304); ie as part of the count function, the formula would look up each variablename and count the number of occurrences of the name within the range_var_2.

    I have tried egen count but this does not produce what I am looking for. Any help would be appreciated.

  • #2

    Stata programmers here, I guess, often will know little or nothing about the details of MS Excel. I can guess that COUNTIFS is some kind of conditional counting function, but I don't know what it does precisely.

    This has gone unanswered for over 6 hours, which may just be a time of week/time of day effect, but an easier question might be to show part of your data and explain in words what calculation you want done.

    I have tried egen count but this does not produce what I am looking for.
    Sorry, but that means nothing without details.

    Please do read http://www.statalist.org/forums/help#stata

    Comment


    • #3
      Hey Nick,

      Thanks for the response. I have found out how to get what I want but it is taking hours to compute and I had to stop it without completing. The code I tried is below:

      Code:
      local N = _N
      gen count=.
      foreach var of varlist x_* {
      qui forval i = 1/`N'{
            count if inrange(dateofpost,datadate_begin[`i'],datadate_end[`i']) & engagedauditorkey1==engagedauditorkey[`i'] & ratingcity==`var'[`i']
            replace `var'_2 = r(N) in `i'   
      }
      }
      I have about 9,000 observations and I wish to run a loop through the variables x_1 to x_304, counting for each row of data the number of observations that meet three specific criteria (date of post within date range, engagedauditorkey within engagedauditorkey1, and variable x_1 to x_304 within ratingcity) and placing the respective result in x_1_2 to x_304_2. I am not that familiar with loop construction, but is there a faster way to complete the desired action?

      Comment


      • #4
        Note that

        Code:
        gen count=.
        does nothing useful, although that is not what makes your code slow. Looping over observations is usually very slow.

        I think this might be a problem rangestat (SSC). Do search the forum for mentions.

        I don't understand your data structure. I suspect it's not very fit for purpose and that you should think about a reshape long. The thought that non-trivial analyses might typically entail a loop over 304 variables suggests that they might logically be better as blocks of observations.
        Last edited by Nick Cox; 29 Jan 2017, 01:52.

        Comment


        • #5
          Thanks, Nick. Here is an example of my dataset:
          datadate_begin datadate_end engagedauditorkey x_1 x_2 x_1_2 x_2_2 dateofpost engagedauditorkey1 ratingcity overallrating
          31-Dec-14 31-Dec-15 3 Aberdeen, SD Akron, OH 0 2 14-Jul-11 3 Aberdeen, SD 4
          31-Jan-11 30-Jan-12 4 Aberdeen, SD Akron, OH 0 0 30-Nov-12 3 San Jose, CA 5
          31-Oct-10 30-Oct-11 3 Aberdeen, SD Akron, OH 1 0 15-Jul-15 3 Akron, OH 4
          31-Mar-11 30-Mar-12 1 Aberdeen, SD Akron, OH 0 1 19-Feb-16 3 San Francisco, CA 4
          7-May-15 1 Aberdeen, SD 1
          3-Dec-15 3 Akron, OH 3
          13-Jan-10 1 Boston, MA 3
          9-Nov-11 1 Akron, OH 3
          9-Oct-12 1 New York, NY 5

          The variables datadate_end, datadate_begin, and engagedauditorkey came from one database and dateofpost, engagedauditorkey1, ratingcity, and overallrating came from another database. Since I performed a sequential merge on the data, a row containing data from the first dataset has no relationship to a row of data from the second dataset. I created x_1, x_2...x_304 after the merge and they are each filled with one city. I would like to check for a match of daterange(datadate_begin & datadate_end), engagedauditorkey, and x_1, x_2, etc against dataofpost, engagedauditorkey1, and ratingcity and return the match count in x_1_2, x_2_2...x_304_2. I would also like to obtain the average of the overallrating if the match count just calculated is above 2. The ratings would be entered into variables x_1_3, x_2_3...x_304_3 (not shown). Unfortunately rangestat would not provide what I need in this case as the merge of the two initial datasets is intentionally sequential and do not have a direct link with eachother.
          Last edited by michael joe; 29 Jan 2017, 22:35.

          Comment


          • #6
            Sorry, but I can't read your data example. But the fact that different variables are for different cities seems to match my earlier guess that you need a different data structure.

            Comment

            Working...
            X