Generating new variables based on information in other observations in dataset

Noah Spencer

Join Date: Jan 2019

Posts: 125
#1

06 Oct 2019, 08:46

I'm working with some NCAA football player data and have run into a data cleaning/organization problem that is difficult (for me) and a bit tricky to explain.

Here's what the data looks like:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str25 player_ncaa_pfr int year str22 school_pfr str5 pos_guess_ncaa float pos_guess_rank_ncaa int pass_yards_ncaa double(rush_yards_ncaa rec_yards_ncaa)
"Ahmaad Galloway"  2001 "Alabama" "RB"    2    0  881  20
"Antonio Carter"   2001 "Alabama" "WR/TE" 1    0    0 428
"Derrick Hamilton" 2001 "Clemson" "WR/TE" 3    0   21 590
"Freddie Milons"   2001 "Alabama" "WR/TE" 2    0   10 626
"J.J. McKelvey"    2001 "Clemson" "WR/TE" 2    0    0 392
"Roscoe Crosby"    2001 "Clemson" "WR/TE" 1    0    0 396
"Santonio Beard"   2001 "Alabama" "RB"    1    0  633   8
"Travis Zachery"   2001 "Clemson" "RB"    1    0  576 414
"Tyler Watts"      2001 "Alabama" "QB"    1 1325  564   0
"Woodrow Dantzler" 2001 "Clemson" "QB"    1 2360 1004   0
end

What I want to do is create new variables holding each player's teammates' statistics. For example, in 2001, Alabama had two running backs (in this example dataset), Galloway and Beard. For Galloway, I want to create a new variable with Beard's statistics, and vice versa. I was thinking of using the position rank variable (pos_guess_rank_ncaa) to help with this. In the example data, Beard is the first-ranked RB on Alabama in 2001 (pos_guess_rank_ncaa of 1), while Galloway is the second-ranked (a higher number indicates a lower ranking). My idea is to create two new variables here: one holding rush_yards_ncaa for the first-ranked RB on a team and one holding rush_yards_ncaa for the second-ranked RB on a team.

I'm just not sure how to actually implement this in Stata. Is there a way to do this? I don't really know where to start on coding it.

Clyde Schechter

Join Date: Apr 2014
Posts: 30065

06 Oct 2019, 10:18

I'm not sure I understand what you want, but I think it's the following. If not, please post back showing what the results for the example data should look like and a detailed explanation of how you arrived at them.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str25 player_ncaa_pfr int year str22 school_pfr str5 pos_guess_ncaa float pos_guess_rank_ncaa int pass_yards_ncaa double(rush_yards_ncaa rec_yards_ncaa)
"Ahmaad Galloway"  2001 "Alabama" "RB"    2    0  881  20
"Antonio Carter"   2001 "Alabama" "WR/TE" 1    0    0 428
"Derrick Hamilton" 2001 "Clemson" "WR/TE" 3    0   21 590
"Freddie Milons"   2001 "Alabama" "WR/TE" 2    0   10 626
"J.J. McKelvey"    2001 "Clemson" "WR/TE" 2    0    0 392
"Roscoe Crosby"    2001 "Clemson" "WR/TE" 1    0    0 396
"Santonio Beard"   2001 "Alabama" "RB"    1    0  633   8
"Travis Zachery"   2001 "Clemson" "RB"    1    0  576 414
"Tyler Watts"      2001 "Alabama" "QB"    1 1325  564   0
"Woodrow Dantzler" 2001 "Clemson" "QB"    1 2360 1004   0
end

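* For each school-year-position group, copy the rushing yards of the
* player ranked i (1 or 2) onto every row in that group
forvalues i = 1/2 {
    by school_pfr year pos_guess_ncaa, sort:  ///
        egen rush_yards_rank_`i' = ///
        max(cond(pos_guess_rank_ncaa == `i', rush_yards_ncaa, .))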
}
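
If the same thing is wanted for more than one statistic, the loop can probably be wrapped in an outer loop over variables. The following is only a sketch on a few rows of the example data; the new variable names (rush_yards_ncaa_rank1 and so on) are made up for illustration, and it assumes pos_guess_rank_ncaa is unique within each school, year, and position.

```stata
* Sketch: the egen max(cond()) idea, looped over several statistics.
* New variable names below are illustrative, not from the original post.
clear
input str25 player_ncaa_pfr int year str22 school_pfr str5 pos_guess_ncaa float pos_guess_rank_ncaa int pass_yards_ncaa double(rush_yards_ncaa rec_yards_ncaa)
"Ahmaad Galloway"  2001 "Alabama" "RB"    2    0  881  20
"Santonio Beard"   2001 "Alabama" "RB"    1    0  633   8
"Tyler Watts"      2001 "Alabama" "QB"    1 1325  564   0
"Travis Zachery"   2001 "Clemson" "RB"    1    0  576 414
end

foreach v of varlist rush_yards_ncaa rec_yards_ncaa {
    forvalues i = 1/2 {
        * cond() keeps the value only on the row holding rank i;
        * max() then spreads it to every row in the group
        by school_pfr year pos_guess_ncaa, sort: ///
            egen `v'_rank`i' = max(cond(pos_guess_rank_ncaa == `i', `v', .))
    }
}

list player_ncaa_pfr pos_guess_rank_ncaa rush_yards_ncaa_rank1 ///
    rush_yards_ncaa_rank2, sepby(school_pfr pos_guess_ncaa)
```

Note that each player's own value also appears in the column matching his own rank, and a group with no second-ranked player gets missing values in the rank-2 variables.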

Noah Spencer

Join Date: Jan 2019

Posts: 125
#3

13 Oct 2019, 15:25

Thanks for your response, Clyde! This worked for me.