Collapsing data

Todd Motiwalla

Join Date: Mar 2018
Posts: 27

07 Mar 2018, 15:17

Hello, it is first time using Statalist. I am hoping to get advice on how to collapse data, which I have made some progress on but need to figure out a few last bits. I fear this may be easier than I am making it out to be, but I am in quite a mental pickle.

Currently my data looks something like this:

Teacher ID	School ID	Year	Semester	Subject Taught
333	123	2014	Fall	Geometry 1
333	123	2014	Fall	English Literature
333	123	2014	Fall	Modern History
333	777	2014	Spring	Modern History
333	123	2015	Fall	Geometry
333	123	2015	Fall	Homeroom
333	123	2015	Spring	AP English
333	123	2015	Spring	Homeroom

This is, as you can see, all for one teacher (same teacher ID) and each subject they have taught, by semester by school year. I have 4 years of data (2014, 2015, 2016, 2017). Essentially, the goal is to collapse this so that there is one row per teacher per year (four rows per teacher if they taught all 4 years we have data for).

The first step taken by the person who handed this off to me was to create dummy variables to capture all the varieties of classes within each subject (algebra, geometry, and stats all classify as "math", for example).

There is also the issue of semester: we want to collapse the two semesters into one, and also create some type of flag if they are in a different school in one semester than the other (you can see this happens in 2014 Spring in my example, where the School ID is 777).

So what I would ultimately want the data to look like is this, and I have just made up some results for 2016 and 2017 which I didn't include in the above table:

Teacher ID	School Year	Math	English	Science	History	Homeroom	School ID	Change schools?
333	2014	1	1	0	1	0	123	1
333	2015	1	1	0	0	1	123	0
333	2016	0	0	0	1	1	123	0
333	2017	0	1	0	0	1	123	0

Here is what I have done so far, in the pursuit of this:

Code:

// local list of vars to use
local varlist math english science history art homeroom

// collapse to get teacher x schoolyear x school records
collapse (sum) `varlist', by(teacherid schoolyear schoolid)

// these are just flags, so anything greater than 0 should be 1
foreach var in `varlist' {
    replace `var' = 1 if `var' > 0
}

I am getting an error message for the middle code, however, where I am trying to collapse. And I don't think this will collapse quite enough anyway--I'm relatively new to the collapse command. I essentially wish to get it to be year by school (where they started the year) by teacher. The other significant problem I have yet to be able to solve is how to collapse the semesters into one year, while creating some dummy variable flag for if the School ID is different for the two semesters within each unique year.

Sorry for this massively long entry. Thank you in advance for your help.

Last edited by Todd Motiwalla; 07 Mar 2018, 15:29.

Tags: None

Clyde Schechter

Join Date: Apr 2014

Posts: 30084
#2

07 Mar 2018, 15:53

The following code will work with your example:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input int(teacherid schoolid year) str6 semester str18 subjecttaught 333 123 2014 "Fall" "Geometry 1" 333 123 2014 "Fall" "English Literature" 333 123 2014 "Fall" "Modern History" 333 777 2014 "Spring" "Modern History" 333 123 2015 "Fall" "Geometry" 333 123 2015 "Fall" "Homeroom" 333 123 2015 "Spring" "AP English" 333 123 2015 "Spring" "Homeroom" end by teacherid year (schoolid), sort: egen math = /// max(inlist(subjecttaught, "Geometry", "Geometry 1", "Algebra")) by teacherid year: egen english = max(inlist(subjecttaught, /// "English Literature", "AP English", "English 1")) by teacherid year: egen science = max(inlist(subjecttaught, /// "Chemistry", "Physics", "Biology")) by teacherid year: egen history = max(inlist(subjecttaught, /// "Modern History", "American History", "World History")) by teacherid year: egen homeroom = max(inlist(subjecttaught, /// "Homeroom", "Cluster")) by teacherid year (schoolid): gen byte change_schools = /// schoolid[1] != schoolid[_N] by teacherid year: keep if _n == 1 keep teacherid year math english science history homeroom schoolid change_schools

Note that each of the general subject variables is created by using -egen, max()- with an -inlist()- expression embedded. The -inlist()- expressions I have put there include the various subjects shown in your example, and I have supplemented those, for demonstration purposes, with others that are not. You will need to replace the subject names shown with actual subject names that appear in your data. You may also hit a snag here. -inlist()- will only accept a maximum of 9 strings to look for. If any of the general subject areas have more than 9 courses, this limitation will bite you. In that case, I recommend using -encode- to create a numerically encoded version of the subject variable, and then use that variable and the relevant numbers in -inlist()-. The -inlist()- function will allow up to 255 numbers, which I guess will be more than enough for any subject area.

In the future, when showing data examples, please use the -dataex- command to do so. If you are running version 15.1 or a fully updated version 14.2, it is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code.

When asking for help with code, always show example data. When showing example data, always use -dataex-.
Comment

Todd Motiwalla

Join Date: Mar 2018
Posts: 27

07 Mar 2018, 16:19

Thank you, Clyde.

I think this is very close to what I need. The issue is that I already have dummy variables for each subject, so in reality, I already have those columns in the first example data table I gave, it's just that they're still for each unique course. Does that mean perhaps -inlist- will work on its own, if I tell it to just read "math", which is the name I gave to that dummy variable? And then drop the existing course column, once they're all completed?

Here's some example code that reads each possible variation of course title and sticks it in an overarching subject; because it's an extremely large dataset so there is a chance that even -encode- may not capture every specific course and I think I may have to stick with this:

Code:

// MATH
gen math = 1 if regexm(title, ///
    "(ALGEBRA)|(GEOMETRY)|(ALG THINKING)|(ADV MATHTOPIC)|(CALCULUS)")
replace math = 1 if regexm(title, ///
    "(STATISTICS)|(CAHSEE MATH)|(PRECALC)|(ALG 2)|(ALG2)|(MATH)|(ECONOMICS)")
replace math = 1 if regexm(title, ///
    "(GEOM THINKING)|(PRE-CALC)|(MACRO ECON)|(MICRO ECON)|(PROB & STAT)")
replace math = 0 if math == .
replace math = 0 if regexm(title, "HOME ECONOMICS")
    
// ELA
gen ela = 1 if regexm(title, ///
    "(ACAD LIT)|(ACCELERATED LIT)|(AMER LIT)|(AMERICAN LIT)|(ENG LIT)")
replace ela = 1 if regexm(title, ///
    "(ENGLANG)|(ENGLIT)|(ENGLNG)|(CAHSEE LA)|(LANG ART)|(CREATIVE WR)")
replace ela = 1 if regexm(title, ///
    "(ENG EURO LIT)|(ENG FOR ACAD)|(ENG/EURO)|(ENGLISH)|(EXPO WRITING)")
replace ela = 1 if regexm(title, ///
    "(LANGUAGE ARTS)|(NOVEL)|(READ 180)|(READING)|(SHAKESPEAR)|(WRITING)")
replace ela = 0 if ela == .
replace ela = 0 if regexm(title, "(SPANISH LANGUAGE ARTS)")
replace ela = 0 if regexm(title, "(MANDARIN LANGUAGE ARTS)")
replace ela = 0 if regexm(title, "(CANTONESE LANGUAGE ARTS)")
replace ela = 0 if regexm(title, "(ELD READING)")

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30084
#4

07 Mar 2018, 17:57

I have always found regular expressions difficult to work with, so I have avoided them, and I can't vouch for the correctness of the code you show in #3. I will just assume that it is correct and that it properly creates 0/1 dichotomous variables for math and ela (and I assume you have similar code for history, homeroom, science, etc.) On re-reading #1 I see that you actually said there that you already had these variables, created by somebody who handed the project off to you. This actually considerably simplifies the code I offered in #2.

Code:

foreach v of varlist math ela science history homeroom { by teacherid year, sort: egen taught_`v' = max(`v') } by teacherid year (schoolid), sort: gen byte changed_school = /// schoolid[1] != schoolid[_N] by teacherid year: keep if _n == 1 keep teacherid year taught_* schoolid change_schools rename taught_* *
Comment
Todd Motiwalla

Join Date: Mar 2018

Posts: 27
#5

07 Mar 2018, 18:17

This looks somewhat similar to the code I wrote in #1. Clyde, do you prefer this type of method over trying to use -collapse (sum)- in an effort to make this long data more wide?
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30084
#6

07 Mar 2018, 22:50

The problem with using -collapse- here is that it can't do what needs to be done to calculate the changed_schools variable. It can easily do everything involving the subject matter variables. Of course one could just generate changed_schools as I suggested and then finish off with -collapse-, using (first) for changed_schools. The end result would be the same.

-collapse- because it has to figure out what you want it to do before doing it, and some additional overhead, is slower in execution than the -egen- approach. On the other hand, coding one -collapse- command takes less time and effort than writing all the other code. The choice between them is largely based on that tradeoff, which, in turn, depends on the size of the data set, and whether the code will just be run once or will be used repeatedly.

In this case, I started with the -egen- approach because, a) it was needed to generate changed_schools, and b) -collapse- would not have been able to properly handle the string variables involved. When you reminded me that you already had numeric indicators for the subjects, that made it possible to use -collapse- (after dealing with changed_schools). But I didn't really consider the option, because at that point I was just in the mode of modifying the code I had started with. I think if I had understood the problem correctly the first time, I might have gone for -collapse-. But I don't think it really makes much difference in any case.
1 like
Comment
Todd Motiwalla

Join Date: Mar 2018

Posts: 27
#7

08 Mar 2018, 08:46

Clyde, if possible, could you show me what the -collapse- code you would have done looks like? I haven't used it before (obviously) and would like to try to learn and maybe apply it in the future, as I know I'll have somewhat similar but slightly less messy datasets where we'll need to collapse the data and make it wide.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30084
#8

08 Mar 2018, 09:43

Code:

collapse (max) math ela science history homeroom (first) change_schools, by(teacherid year)
1 like
Comment

Announcement

Collapsing data

Comment

Comment

Comment

Comment

Comment

Comment

Comment