Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Splitting a dataset into subsets based on active dates

    Hi all,

    I am using Stata 17/SE on Mac and am trying to generate subsets of my panel data based on dates.

    For context, I have 50 years of tennis data... around 6 million observations where each observation is one match. Variables include things like tournament number, tournament start/end dates, player name, score, etc..
    I am attempting to construct my own rank within the dataset. Still, I need a little help and can't find similar examples.

    My objective is to:
    * Generate points won (based on tournament type, what round, whether the player won or lost, etc.).. this is the points awarded for that particular match/observation
    * Calculate the player's start and end date (e.g., career period)
    * Generate subsets based on mm-yy to only include players who were active within this given month (this will give approx 540 smaller datasets...daily would be better but would be way too many files)
    * Take the monthly datasets back into one dataset
    * Then generate a 12-month cumulative sum of points
    * Collapse into a player-level dataset, taking the max cumulative sum
    * Then merge onto the larger panel dataset using m:1 name_merge and month variable

    Noting that the first tournament is 06jan1976 and the final tournament date is 27dec2022.

    So far, I have the player points and career dates.

    However, what I am struggling with is how to generate a loop to capture each month in its own dataset and then generate the cumulative sum.

    E.g., Should I be starting with something like the following:

    Setting up:
    Code:
    gen month = month(tourn_date_start)
    gen year = year(tourn_date_start)
    gen date_monthly = ym(year,month)
    format %tm date_monthly
    Loop:
    Code:
    forval m=1/12 {
      forval y=1976/2022 {
      keep if month == `m' & year == `y'
      gen active = inrange(tourn_date_start, career_start, career_end)
      ....???
      save `m'-`y', replace
      }
    }
    Appending:
    Code:
    cd "...."
    clear
    append using `: dir . files "*.dta"'
    Cumulative sum:
    ...

    Noting that I likely will need to save these files to my drive (rather than temporary files) as I am using my little Macbook and it struggles at the best of times, haha.

    I am a little lost in some areas... Any help is appreciated!

    Data is as per the below (noting this is only for the first month/year given the size of the data hard to find a subset to show the context..):
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input int tourn_date_start float month int year float date_monthly str64 name_merge float(career_start career_end est_points)
    5868 1 1976 192 "jiri hrebec"         5857  7969  2
    5868 1 1976 192 "jan kodes"           5857 13639  2
    5868 1 1976 192 "marty riessen"       5857  9181  2
    5868 1 1976 192 "dick stockton"       5849 10734 32
    5868 1 1976 192 "bob giltinan"        5857  7299  .
    5868 1 1976 192 "allan stone"         5857  7299  .
    5868 1 1976 192 "bob lutz"            5857 10363 32
    5868 1 1976 192 "erik van dillen"     5863 10363  1
    5868 1 1976 192 "dick stockton"       5849 10734  .
    5868 1 1976 192 "ilie nastase"        5857 14668  2
    5868 1 1976 192 "sherwood stewart"    5849 11032  1
    5868 1 1976 192 "jan kodes"           5857 13639  2
    5868 1 1976 192 "billy martin"        5849  8354  .
    5868 1 1976 192 "arthur ashe"         5849 10363  .
    5868 1 1976 192 "sandy mayer"         5849 10734  2
    5868 1 1976 192 "bjorn borg"          5857 12365 64
    5868 1 1976 192 "bob lutz"            5857 10363  .
    5868 1 1976 192 "tom okker"           5849 10363  4
    5868 1 1976 192 "jaime fillol"        5868 15627  .
    5868 1 1976 192 "mark cox"            5849 10363  .
    5868 1 1976 192 "dick crealy"         5857 10363  .
    5868 1 1976 192 "frew mcmillan"       5849 10363  .
    5868 1 1976 192 "tom gorman"          5849 14668  .
    5868 1 1976 192 "rod laver"           5868 10734  .
    5868 1 1976 192 "dennis ralston"      5849  8278  2
    5868 1 1976 192 "aleksandr metreveli" 5849  6905  .
    5868 1 1976 192 "brian gottfried"     5849 10804  4
    5868 1 1976 192 "karl meiler"         5857  9138  1
    5868 1 1976 192 "tom okker"           5849 10363  1
    5868 1 1976 192 "rod laver"           5868 10734 32
    5868 1 1976 192 "vitas gerulaitis"    5849 10741  2
    5868 1 1976 192 "ismail el shafei"    5868 10363  .
    5868 1 1976 192 "vitas gerulaitis"    5849 10741  2
    5868 1 1976 192 "jan kodes"           5857 13639  1
    5868 1 1976 192 "jan kodes"           5857 13639  1
    5868 1 1976 192 "allan stone"         5857  7299  .
    5868 1 1976 192 "ray moore"           5868  9166  .
    5868 1 1976 192 "fred stolle"         5868  8984  .
    5868 1 1976 192 "wojtek fibak"        5857 11525  .
    5868 1 1976 192 "ilie nastase"        5857 14668  1
    5868 1 1976 192 "frew mcmillan"       5849 10363  1
    5868 1 1976 192 "dennis ralston"      5849  8278 64
    5868 1 1976 192 "jiri hrebec"         5857  7969  .
    5868 1 1976 192 "charlie pasarell"    5857  8452  2
    5868 1 1976 192 "dennis ralston"      5849  8278  2
    5868 1 1976 192 "jiri hrebec"         5857  7969  1
    5868 1 1976 192 "jan kodes"           5857 13639  .
    5868 1 1976 192 "geoff masters"       5849  8030  1
    5868 1 1976 192 "tom okker"           5849 10363  .
    5868 1 1976 192 "harold solomon"      5849 11403  .
    5868 1 1976 192 "andrew pattison"     5849  8655  .
    5868 1 1976 192 "billy martin"        5849  8354  .
    5868 1 1976 192 "bjorn borg"          5857 12365  2
    5868 1 1976 192 "kim warwick"         5849 13658  .
    5868 1 1976 192 "stan smith"          5857 14668  .
    5868 1 1976 192 "jimmy connors"       5863 14668  .
    5868 1 1976 192 "jeff borowiak"       5857 10734  .
    5868 1 1976 192 "dennis ralston"      5849  8278  1
    5868 1 1976 192 "sandy mayer"         5849 10734  1
    5868 1 1976 192 "andrew pattison"     5849  8655  .
    5868 1 1976 192 "allan stone"         5857  7299  .
    5868 1 1976 192 "vijay amritraj"      5849 12057  .
    5868 1 1976 192 "allan stone"         5857  7299  2
    5868 1 1976 192 "tom okker"           5849 10363  .
    5868 1 1976 192 "jimmy connors"       5863 14668  .
    5868 1 1976 192 "jiri hrebec"         5857  7969  .
    5868 1 1976 192 "frew mcmillan"       5849 10363  2
    5868 1 1976 192 "jan kodes"           5857 13639  .
    5868 1 1976 192 "harold solomon"      5849 11403  .
    5868 1 1976 192 "bob hewitt"          5849  7634  .
    5868 1 1976 192 "dick stockton"       5849 10734  1
    5868 1 1976 192 "tom okker"           5849 10363  2
    5868 1 1976 192 "erik van dillen"     5863 10363  1
    5868 1 1976 192 "brian gottfried"     5849 10804  .
    5868 1 1976 192 "bob giltinan"        5857  7299  1
    5868 1 1976 192 "dick crealy"         5857 10363  .
    5868 1 1976 192 "tom gorman"          5849 14668  4
    5868 1 1976 192 "arthur ashe"         5849 10363  .
    5868 1 1976 192 "rod laver"           5868 10734 32
    5868 1 1976 192 "karl meiler"         5857  9138  .
    5868 1 1976 192 "stan smith"          5857 14668  .
    5868 1 1976 192 "dick stockton"       5849 10734  .
    5868 1 1976 192 "wojtek fibak"        5857 11525  1
    5868 1 1976 192 "charlie pasarell"    5857  8452  .
    5868 1 1976 192 "roscoe tanner"       5857 15187  .
    5868 1 1976 192 "tom okker"           5849 10363  .
    5868 1 1976 192 "bob lutz"            5857 10363  4
    5868 1 1976 192 "bob hewitt"          5849  7634  2
    5868 1 1976 192 "tom gorman"          5849 14668  2
    5868 1 1976 192 "jimmy connors"       5863 14668  1
    5868 1 1976 192 "ismail el shafei"    5868 10363  .
    5868 1 1976 192 "bjorn borg"          5857 12365  .
    5868 1 1976 192 "roscoe tanner"       5857 15187  2
    5868 1 1976 192 "dick stockton"       5849 10734 64
    5868 1 1976 192 "brian gottfried"     5849 10804  2
    5868 1 1976 192 "erik van dillen"     5863 10363  .
    5868 1 1976 192 "marty riessen"       5857  9181  .
    5868 1 1976 192 "tony roche"          5868 10734  .
    5868 1 1976 192 "cliff richey"        5849 14668  .
    5868 1 1976 192 "alex metreveli"      5849  6905  1
    end
    format %td tourn_date_start
    format %tm date_monthly
    format %td career_start
    format %td career_end
    Thanks Statalist team.







  • #2
    Code:
    preserve
    forvalues md = `=tm(1976m1)'/`=tm(2022m12)' {
        local filename: display %tm `md'
        by name_merge, sort: egen byte active_this_month ///
            = max(mofd(tourn_date_start) == `md')
        keep if active_this_month
        if _N > 0 {
            drop active_this_month
            save `filename', replace        
        }
        restore, preserve
    }
    Note: This is slightly different from what you asked for. Instead of saving the filename as 1-1976.dta, it uses the filename 1976m1.dta, etc.

    It isn't clear to me just what you want saved in each subset file. The code above selects all players who are active in the focused month (i.e. have any tourn_date_start in that month) and saves all of the observations in the entire data set that pertain to those players. If that isn't what you intended, the code will need modification. If you can't make that change yourself, please post back with an explanation of what observations you wish to keep in each file.

    The block that saves the files is guarded by -if _N > 0- on the chance that there are some months in the range from January 1976 to December 2022 in which there were no active players. That keeps you from saving empty files. (If there are no such months of total inactivity, nothing is lost by doing it this way.)

    Comment

    Working...
    X