Generating many new variables at once

Taylor Walter

Join Date: Mar 2018
Posts: 80


01 Jun 2018, 14:00

I'm hoping to generate a large number (likely 20-30) of new variables at once. Essentially, I have a long list of courses, and have generated flags for the top 10 courses based on a few variables (such as the number of times taken, retaken, etc.).

I now need to create variables for those courses. So essentially if *any* of the three variables listed below is flagged with a 1, then that course would become a new variable. Some courses are flagged on more than one variable, but plenty are not, so doing this manually is certainly possible but the potential to overlook one is high. Thanks much.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str98 course float(course_starts1 five_min1 retakes1)
"New Survey "                                                                    . . .
"New Handbook"                                                             1 . .
"Miller Reentry Resources "                                                  . . .
"E-Car"                                             . . .
"Interpretation Quiz"                                 . . .
"Tennessee Test"                              . . .
"Introduction to Math "                                                         1 . .
"MA Reading"                                                           1 . .
"Baseline Test"                                                                      . . 1
"New Resources"                                                    . 1 .
"Intro English"                                                          . . 1
"Targeting Success "                                                               . . .
"Infographics"                                                         . 1 .
"Christian Prayer - Part V" . . .
"Christian Prayer - Part I"                                                        . . .
"Georgia Test"                                1 . .
"History 101"                   . . .
"Astronomy 101"                                               . . .
end

Last edited by Taylor Walter; 01 Jun 2018, 14:02.


Clyde Schechter

Join Date: Apr 2014

Posts: 30063
#2

01 Jun 2018, 14:12

So essentially if *any* of the three variables listed below are flagged with a 1, then that course would become a new variable.

What does that mean? How would you calculate the values of that variable from the other information in your data?
Taylor Walter

Join Date: Mar 2018

Posts: 80
#3

01 Jun 2018, 14:20

The new variable would then just be flagged with a "1". So in the above example, "New Handbook" would become its own variable, flagged with a 1. So would "Introduction to Math", "MA Reading", and so on. I'll later just drop everything from the data set other than these new course-variables that are flagged.
Taylor Walter

Join Date: Mar 2018

Posts: 80
#4

01 Jun 2018, 14:28

I'm re-reading what I wrote and that may not have been clear... basically, what I have right now is a bunch of course names stored as string values of the variable "course". What I really want is for courses to become their own variables, *if* they have been flagged (meaning they were one of the 10 most taken, most retaken, etc.).

eric_a_booth

Join Date: Apr 2014
Posts: 290

01 Jun 2018, 15:02

Thanks for providing the -dataex- example. Without fully understanding what you are asking for or why, I think what you probably want is just:

Code:

keep if inlist(1, course_starts1, five_min1, retakes1)
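
If you would rather mark the flagged courses than drop the others right away, the same inlist() test can feed an indicator variable. This is only a minimal sketch on three rows of the example data; the variable name anyflag is a hypothetical choice, not something from the posts above.

```stata
* Sketch only: anyflag is a hypothetical variable name, not from the thread
clear
input str20 course float(course_starts1 five_min1 retakes1)
"New Survey"    . . .
"New Handbook"  1 . .
"Baseline Test" . . 1
end

* inlist(1, a, b, c) is true when at least one of a, b, c equals 1
gen byte anyflag = inlist(1, course_starts1, five_min1, retakes1)
list course anyflag, noobs
```

Here "New Survey" gets anyflag = 0 and the other two courses get 1, so -keep if anyflag- would then keep the same observations as the one-liner.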

But here are some more ideas:

Code:

clear
input str98 course float(course_starts1 five_min1 retakes1)
"New Survey "                                                                    . . .
"New Handbook"                                                             1 . .
"Miller Reentry Resources "                                                  . . .
"E-Car"                                             . . .
"Interpretation Quiz"                                 . . .
"Tennessee Test"                              . . .
"Introduction to Math "                                                         1 . .
"MA Reading"                                                           1 . .
"Baseline Test"                                                                      . . 1
"New Resources"                                                    . 1 .
"Intro English"                                                          . . 1
"Targeting Success "                                                               . . .
"Infographics"                                                         . 1 .
"Christian Prayer - Part V" . . .
"Christian Prayer - Part I"                                                        . . .
"Georgia Test"                                1 . .
"History 101"                   . . .
"Astronomy 101"                                               . . .
"Georgia Test"                                . . .
end

*--note: an extra "Georgia Test" observation (with no flags) was added to make the example harder
compress //we don't need this to be str98


**quick way to keep any observation where a flag is 1:
******keep if inlist(1, course_starts1, five_min1, retakes1)
    *the only downside is that this drops unflagged observations of a flagged course; to keep every observation of any course that is ever flagged, use the code below.

foreach x in course_starts1 five_min1 retakes1 {
bys course: egen `x'_max = max(`x')
}

*Identify which courses were flagged overall, all in one column
cap drop flag
egen flag = rowmax(course_starts1 five_min1 retakes1)
egen betterflag = rowmax(*_max) //takes into account courses flagged in any observation, in case you are keeping other attributes from those observations ... now the Georgia Test course has a flag for that second obs. This will be a constant once you drop all courses with no flags!
drop *_max //don't need these

*quick list of courses with flags
levelsof course if betterflag==1, loc(keepcourses)

di `"`keepcourses'"'


*now separate variables flagging obs of interest.
foreach x in course_starts1 five_min1 retakes1 {
bys course: egen `x'_flag = max(`x')  
}

keep if betterflag==1

Eric A. Booth | Senior Director of Research | Far Harbor | Austin TX


Taylor Walter

Join Date: Mar 2018

Posts: 80
#6

03 Jun 2018, 15:18

Thanks, Eric. This was very helpful and got me right to the brink of where I need to be (in a much more efficient way than I had previously set it up).

Now my end goal is to get each of these courses to be its own variable. I can do this by writing out 45ish separate blocks of code that would look something like this:

Code:

gen Baseline_Test_retakes = .
replace Baseline_Test_retakes = 1 if course == "Baseline Test" & retakes1 != .

The part at the end:

Code:

&retakes1 !=.

is just a double check against potential human error: it won't make any changes if I try to replace for the wrong variable.

You can see the example I used in my -dataex- above, where "Baseline Test" is flagged with a 1 for retakes1 only. If it was also flagged for course_starts1, then I'd just replicate that code above for another variable called Baseline_Test_starts. My original (poorly worded) question was for advice on streamlining this process.

eric_a_booth

Join Date: Apr 2014
Posts: 290

04 Jun 2018, 14:52

I still think your question is a bit vague. My advice is to give a more concrete explanation of why you are doing this or, at minimum, to show an example of the target dataset that should result (for your example above) after the manipulation. Regardless, I don't think there is any reason to write 45ish separate blocks of code, even if some manual renaming of variables is needed to make them work. That renaming really is the major obstacle: Stata cannot automatically create a new variable from a course name if the name is (1) too long or (2) contains spaces or symbols that are not legal in varnames, so you have to name these new variables manually or pick a new naming convention (you can also use -strtoname()-).

Below, the code takes the first 15 characters of the course name (you can change this), substitutes underscores for some characters that are not legal in varnames, and labels the variable with the original course name. This assumes that, when you extend the code to your full dataset, no two course names share the same first 15 characters. If some do, you need to shorten the names of your 3 flag variables [course_starts1 five_min1 retakes1] to make more room and adjust which characters you substitute out, or use the strtoname() function.

Here's an example that I think gets you where you want with fewer lines of code and in a way that is more adaptable to the particularities of your dataset.

Code:

clear
input str98 course float(course_starts1 five_min1 retakes1)
"New Survey "                                                                    . . .
"New Handbook"                                                             1 . .
"Miller Reentry Resources "                                                  . . .
"E-Car"                                             . . .
"Interpretation Quiz"                                 . . .
"Tennessee Test"                              . . .
"Introduction to Math "                                                         1 . .
"MA Reading"                                                           1 . .
"Baseline Test"                                                                      . . 1
"New Resources"                                                    . 1 .
"Intro English"                                                          . . 1
"Targeting Success "                                                               . . .
"Infographics"                                                         . 1 .
"Christian Prayer - Part V" . . .
"Christian Prayer - Part I"                                                        . . .
"Georgia Test"                                1 . .
"History 101"                   . . .
"Astronomy 101"                                               . . .
"Georgia Test"                                . . .
end

*--note: an extra "Georgia Test" observation (with no flags) was added to make the example harder
compress //we don't need this to be str98

 
*gen Baseline_Test_retakes = .
*replace Baseline_Test_retakes = 1 if course == "Baseline Test" & retakes1 != .



foreach j in course_starts1 five_min1 retakes1 {
    levelsof course if `j'==1, loc(clist)
    foreach c in `clist' {
        *create an abbreviated version that can be a varname (at most 15 chars, to leave room for the suffix within the 32-char varname limit)
        loc abbrev ""
        loc abbrev `"`=substr(`"`c'"', 1, 15)'"'
        **fix symbols in the course name to make it a legal varname:
        foreach sym in " " "-" {
            loc abbrev `"`=subinstr(`"`abbrev'"', `"`sym'"', "_", .)'"'
        } //end sym loop
        di as err `" `c' (`abbrev') & variable: `j' "'
        g `abbrev'_`j' = 1 if course==`"`c'"' & `j'==1
        cap lab var `abbrev'_`j' `"`c' [`j']"'
    } //end clist loop
} //end j loop

desc, fulln
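
As an alternative to the substr()/subinstr() steps in the loop, the strtoname() function mentioned earlier can do the character clean-up in one call. The following is only a sketch on three course names from the example; the variable name vname is a hypothetical choice, and strtrim() assumes Stata 14 or later.

```stata
* Sketch only: vname is a hypothetical variable name, not from the thread
clear
input str30 course
"Christian Prayer - Part V"
"E-Car"
"Introduction to Math "
end

* strtrim() removes leading and trailing blanks, substr() keeps the first 15 characters,
* and strtoname() replaces characters that are illegal in a Stata name with underscores
gen str32 vname = strtoname(substr(strtrim(course), 1, 15))
list, noobs
```

Inside the loop, the same expression could be assigned to the local macro abbrev in place of the two substitution steps. Note that strtoname() does not guarantee unique names, so two courses with the same first 15 characters would still collide.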

Last edited by eric_a_booth; 04 Jun 2018, 14:57.

Eric A. Booth | Senior Director of Research | Far Harbor | Austin TX


Taylor Walter

Join Date: Mar 2018

Posts: 80
#8

04 Jun 2018, 15:14

Well, it may have been vague, but it did exactly the right thing, so thank you very much.

If I was going to manually fix the course names (which was likely happening anyway), so that they were all compliant with Stata var name length limits, then would this be the only code needed?

Code:

foreach j in course_starts1 five_min1 retakes1 {
    levelsof course if `j' ==1 , loc(clist)
}

Or do I still need the -di as err- piece?
eric_a_booth

Join Date: Apr 2014

Posts: 290
#9

04 Jun 2018, 15:24

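The part of the code you quote (with -levelsof-) only collects the names of the courses for which each flag variable `j' equals 1 and stores them in the local macro clist; it does not create any variables. The -di as err- line is a -display- command that helps you keep track of which loop iteration is running, so it is optional. The line that generates the new variable with the truncated variable name is: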

Code:

g `abbrev'_`j' = 1 if course==`"`c'"' & `j'==1

This line is equivalent to your code:

Code:

gen Baseline_Test_retakes = .
replace Baseline_Test_retakes = 1 if course == "Baseline Test" & retakes1 != .

but with macros substituting into the command as it iterates across the loops.
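
To see the substitution for a single pass, you can set the macros by hand and display the name that would be built. This is only a sketch: the three locals below mimic one iteration of the loops for the "Baseline Test" / retakes1 pair and are set manually here instead of by -foreach- and -levelsof-.

```stata
* Sketch only: the locals mimic one pass of the loops for "Baseline Test" and retakes1
local c "Baseline Test"
local j retakes1
local abbrev = subinstr(substr("`c'", 1, 15), " ", "_", .)

* name of the variable that -g `abbrev'_`j' = ...- would create on this pass
display "`abbrev'_`j'"
```

This displays Baseline_Test_retakes1, so the generated variable carries the full name of the flag variable as its suffix, where your hand-written version used Baseline_Test_retakes.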

Last edited by eric_a_booth; 04 Jun 2018, 15:27.

Eric A. Booth | Senior Director of Research | Far Harbor | Austin TX
Taylor Walter

Join Date: Mar 2018

Posts: 80
#10

04 Jun 2018, 15:48

Makes sense. Thanks, Eric.
