Issues Collapsing Data

Johny Daniel

Join Date: May 2017
Posts: 11

Issues Collapsing Data

02 Jul 2017, 10:26

Hi,
I'm using STATA 14.2 (apologies in advance as I am using stata on a university server and cannot use dataex on the university version)

Below is an example of the data I have:

Unique Code (CDS)	District	School	Ethnicity	Gender	Total Enrollment
100001	ABC	XYZ	1	1	4
100001	ABC	XYZ	2	1	5
100001	ABC	XYZ	4	1	3
100001	ABC	XYZ	4	2	4
100002	ABC	PQR	1	1	5
100002	ABC	PQR	1	2	3
100002	ABC	PQR	3	1	3
100002	ABC	PQR	3	2	8
100002	ABC	PQR	5	1	1
100002	ABC	PQR	5	2	5

I want to collapse the data, so each row provides demographics for each school. Example of what I am trying to do:

CDS Code	Dist	Sch	Eth1	Eth2	Eth3	Eth4	Eth5	Male	Fem	EnrTotal
100001	ABC	XYZ	4	5	0	7	0	12	4	16
10002	ABC	PQR	8	0	11	0	6	9	16	25

What I have tried:
egen NOREPORT = sum(enr_total) if ethnic==0, by(cds_code)
egen WHITE = sum(enr_total) if ethnic==1, by(cds_code)
egen ASIAN = sum(enr_total) if ethnic==2, by(cds_code)
egen PAC = sum(enr_total) if ethnic==3, by(cds_code)
egen AFR = sum(enr_total) if ethnic==4, by(cds_code)
egen HIS = sum(enr_total) if ethnic==5, by(cds_code)
egen TWOMORE = sum(enr_total) if ethnic==9, by(cds_code)

egen MALE = sum(enr_total) if gender=="M", by(cds_code)
egen FEMALE = sum(enr_total) if gender=="F", by(cds_code)
egen ENRTOT = sum(enr_tota), by(cds_code)

collapse (first) county district school (mean) NOREPORT ASIAN PAC HIS AFR WHITE TWOMORE MALE FEMALE ENRTOT, by(cds_code)

Problem:
After running the above code, the enrollment total does not equal to the actual total for each school. The numbers are much larger (no sure why).

Can anyone tell me if I am doing something wrong in collapsing the data or if there is another way I should collapse the data?

Thanks

Tags: None

Clyde Schechter

Join Date: Apr 2014
Posts: 30117

02 Jul 2017, 10:39

I can't replicate your problem. When I run

Code:

clear
input long cds_code str3(district school) byte(ethnic gender enr_total)
100001 "ABC" "XYZ" 1 1 4
100001 "ABC" "XYZ" 2 1 5
100001 "ABC" "XYZ" 4 1 3
100001 "ABC" "XYZ" 4 2 4
100002 "ABC" "PQR" 1 1 5
100002 "ABC" "PQR" 1 2 3
100002 "ABC" "PQR" 3 1 3
100002 "ABC" "PQR" 3 2 8
100002 "ABC" "PQR" 5 1 1
100002 "ABC" "PQR" 5 2 5
end

egen NOREPORT = sum(enr_total) if ethnic==0, by(cds_code)
egen WHITE = sum(enr_total) if ethnic==1, by(cds_code)
egen ASIAN = sum(enr_total) if ethnic==2, by(cds_code)
egen PAC = sum(enr_total) if ethnic==3, by(cds_code)
egen AFR = sum(enr_total) if ethnic==4, by(cds_code)
egen HIS = sum(enr_total) if ethnic==5, by(cds_code)
egen TWOMORE = sum(enr_total) if ethnic==9, by(cds_code)

egen MALE = sum(enr_total) if gender==1, by(cds_code)
egen FEMALE = sum(enr_total) if gender==2, by(cds_code)
egen ENRTOT = sum(enr_tota), by(cds_code)

collapse (first) district school (mean) NOREPORT ASIAN PAC HIS AFR WHITE TWOMORE MALE FEMALE ENRTOT, by(cds_code)

Stata gives me:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long cds_code str3(district school) float(NOREPORT ASIAN PAC HIS AFR WHITE TWOMORE MALE FEMALE ENRTOT)
100001 "ABC" "XYZ" . 5  . . 7 4 . 12  4 16
100002 "ABC" "PQR" . . 11 6 . 8 .  9 16 25
end

as the end result.

Now, I am quite sure that what you show us in #1 is not what you're actually doing. It can't be, because, for example, you refer to -gender == "F"-, but the example data you show has gender as a 1/2 numeric variable. So you have made some "translations" between what you actually did and what you reported here. My guess is that your error has gotten "lost in translation."

Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

02 Jul 2017, 11:27

I was mystified by the original code and thought the following code to be more obvious about what it's doing. Building on the dataex sample data presented by Clyde,

Code:

generate NOREPORT = cond(ethnic==0,enr_total,0)
generate WHITE    = cond(ethnic==1,enr_total,0)
generate ASIAN    = cond(ethnic==2,enr_total,0)
generate PAC      = cond(ethnic==3,enr_total,0)
generate AFR      = cond(ethnic==4,enr_total,0)
generate HIS      = cond(ethnic==5,enr_total,0)
generate TWOMORE  = cond(ethnic==9,enr_total,0)

generate MALE   = cond(gender==1,enr_total,0)
generate FEMALE = cond(gender==2,enr_total,0)

collapse (first) district school ///
         (sum) NOREPORT ASIAN PAC HIS AFR WHITE TWOMORE MALE FEMALE enr_total, ///
         by(cds_code)

It yields the following result, again presented with dataex to make it unlikely to wrap the data onto multiple lines.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long cds_code str3(district school) double(NOREPORT ASIAN PAC HIS AFR WHITE TWOMORE MALE FEMALE enr_total)
100001 "ABC" "XYZ" 0 5  0 0 7 4 0 12  4 16
100002 "ABC" "PQR" 0 0 11 6 0 8 0  9 16 25
end

Comment

Johny Daniel

Join Date: May 2017

Posts: 11
#4

02 Jul 2017, 11:40

Sorry about the gender error you clearly pointed out. My data file actually has male as M and female as F. I misrepresented it in the first table above.

But I ran the codes again. To illustrate the problem I am running into,

The Enrollment total BEFORE running the collapse code:

Code: sum enr_total
variable Obs Mean SD Min Max

enr_total 127715 48.70 107.43 0 2218

Enrollment total AFTER collapsing data:
Code: sum ENRTOT
Variable Obs Mean SD Min Max

ENRTOT 127715 386580.7 446157.9 2 1317804

The mean, min and max change drastically after running all the above commands followed by the collapse command.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#5

02 Jul 2017, 12:04

Well, it appears there is something wrong with your -collapse- command. I say that because you have the same N (Obs) before and after--so you are not getting any actual collapse of the data. Or perhaps at the point you give the -collapse- command you have somehow reduced your data set to a single observation per value of cds_code so there is no collapsing to do.

As I said earlier, based on what you have actually shown of data and code, I cannot reproduce your problem, and I also know that what you have shown is not what you did. If you want help, I think you have to show a representative example data set (that reproduces the problem) and the actual exact and complete code that you ran, along with an actual exact copy of the output of -list- at the end. I understand that you can't use -dataex-. Perhaps you can attach a .dta file instead. To get actual exact copies of Stata code and output you should do that by copy/paste from Stata's Results window or your log file into the Forum editor between code delimiters (see FAQ #12 if you don't know how to get those.) DO NOT EDIT IT IN ANY WAY. The problem clearly lies in some detail(s) that you have not yet disclosed.
2 likes
Comment

Johny Daniel

Join Date: May 2017
Posts: 11

02 Jul 2017, 13:16

RAW DATA

Code:

cds_code    county    district              school                                ethnic  gender  enr_total
1.610e+12    Alameda    Berkeley Unified    Berkeley Technology Academy    2    1    2
1.610e+12    Alameda    Berkeley Unified    Berkeley Technology Academy    6    2    33
1.610e+12    Alameda    Berkeley Unified    Berkeley Technology Academy    5    1    18
1.610e+12    Alameda    Berkeley Unified    Berkeley Technology Academy    7    1    3
1.610e+12    Alameda    Berkeley Unified    Berkeley Technology Academy    0    1    3
1.610e+12    Alameda    Berkeley Unified    Berkeley Technology Academy    9    1    7
1.610e+12    Alameda    Berkeley Unified    Berkeley Technology Academy    6    1    48
1.610e+12    Alameda    Berkeley Unified    Berkeley Technology Academy    0    2    5
1.610e+12    Alameda    Berkeley Unified    Berkeley Technology Academy    1    2    1
1.610e+12    Alameda    Berkeley Unified    Berkeley Technology Academy    5    2    12
1.610e+12    Alameda    Berkeley Unified    Berkeley Technology Academy    9    2    4
1.610e+12    Alameda    Castro Valley Unified    Castro Valley Elementary    4    1    5
1.610e+12    Alameda    Castro Valley Unified    Castro Valley Elementary    5    1    79
1.610e+12    Alameda    Castro Valley Unified    Castro Valley Elementary    9    2    22
1.610e+12    Alameda    Castro Valley Unified    Castro Valley Elementary    2    1    21
1.610e+12    Alameda    Castro Valley Unified    Castro Valley Elementary    6    1    16
1.610e+12    Alameda    Castro Valley Unified    Castro Valley Elementary    7    1    50
1.610e+12    Alameda    Castro Valley Unified    Castro Valley Elementary    5    2    88
1.610e+12    Alameda    Castro Valley Unified    Castro Valley Elementary    1    2    3
1.610e+12    Alameda    Castro Valley Unified    Castro Valley Elementary    1    1    1
1.610e+12    Alameda    Castro Valley Unified    Castro Valley Elementary    4    2    7
1.610e+12    Alameda    Castro Valley Unified    Castro Valley Elementary    3    2    2
1.610e+12    Alameda    Castro Valley Unified    Castro Valley Elementary    3    1    2
1.610e+12    Alameda    Castro Valley Unified    Castro Valley Elementary    2    2    41
1.610e+12    Alameda    Castro Valley Unified    Castro Valley Elementary    6    2    15
1.610e+12    Alameda    Castro Valley Unified    Castro Valley Elementary    0    1    2
1.610e+12    Alameda    Castro Valley Unified    Castro Valley Elementary    9    1    24
1.610e+12    Alameda    Castro Valley Unified    Castro Valley Elementary    7    2    52
1.610e+12    Alameda    Castro Valley Unified    Castro Valley Elementary    0    2    3
1.610e+12    Alameda    Livermore Valley Joint Unified    William Mendenhall Middle    6    2    4
1.610e+12    Alameda    Livermore Valley Joint Unified    William Mendenhall Middle    4    2    10
1.610e+12    Alameda    Livermore Valley Joint Unified    William Mendenhall Middle    9    2    29
1.610e+12    Alameda    Livermore Valley Joint Unified    William Mendenhall Middle    9    1    36
1.610e+12    Alameda    Livermore Valley Joint Unified    William Mendenhall Middle    2    1    19
1.610e+12    Alameda    Livermore Valley Joint Unified    William Mendenhall Middle    4    1    12
1.610e+12    Alameda    Livermore Valley Joint Unified    William Mendenhall Middle    7    2    298
1.610e+12    Alameda    Livermore Valley Joint Unified    William Mendenhall Middle    2    2    28
1.610e+12    Alameda    Livermore Valley Joint Unified    William Mendenhall Middle    3    2    2
1.610e+12    Alameda    Livermore Valley Joint Unified    William Mendenhall Middle    5    2    68
1.610e+12    Alameda    Livermore Valley Joint Unified    William Mendenhall Middle    1    2    2
1.610e+12    Alameda    Livermore Valley Joint Unified    William Mendenhall Middle    7    1    305
1.610e+12    Alameda    Livermore Valley Joint Unified    William Mendenhall Middle    6    1    5
1.610e+12    Alameda    Livermore Valley Joint Unified    William Mendenhall Middle    5    1    92
1.610e+12    Alameda    Livermore Valley Joint Unified    William Mendenhall Middle    3    1    1
1.610e+12    Alameda    New Haven Unified    Core Learning Academy at Conley-Caraballo High    2    1    5
1.610e+12    Alameda    New Haven Unified    Core Learning Academy at Conley-Caraballo High    6    2    29
1.610e+12    Alameda    New Haven Unified    Core Learning Academy at Conley-Caraballo High    6    1    16
1.610e+12    Alameda    New Haven Unified    Core Learning Academy at Conley-Caraballo High    5    2    39
1.610e+12    Alameda    New Haven Unified    Core Learning Academy at Conley-Caraballo High    5    1    66
1.610e+12    Alameda    New Haven Unified    Core Learning Academy at Conley-Caraballo High    9    2    1

CODE USED:

Code:

egen NOREPORT = sum(enr_total) if ethnic==0, by(cds_code)
egen AMIND = sum(enr_total) if ethnic==1, by(cds_code)
egen ASIAN = sum(enr_total) if ethnic==2, by(cds_code)
egen PAC = sum(enr_total) if ethnic==3, by(cds_code)
egen FIL = sum(enr_total) if ethnic==4, by(cds_code)
egen HIS = sum(enr_total) if ethnic==5, by(cds_code)
egen AFR = sum(enr_total) if ethnic==6, by(cds_code)
egen WHITE = sum(enr_total) if ethnic==7, by(cds_code)
egen TWOMORE = sum(enr_total) if ethnic==9, by(cds_code)

egen MALE = sum(enr_total) if gender==1, by(cds_code)
egen FEMALE = sum(enr_total) if gender==2, by(cds_code)
egen ENRTOT = sum(enr_tota), by(cds_code)
collapse (first) county district school (mean) NOREPORT AMIND ASIAN PAC FIL HIS AFR WHITE TWOMORE MALE FEMALE ENRTOT, by(cds_code)

Link to .dta file: https://utexas.box.com/s/7ezjnexhak3...gojaneesn5u34i

Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30117

02 Jul 2017, 13:51

Well, thanks for trying. The example data you show is really difficult to wrestle into Stata. I've already spent 15 minutes trying to do that in various ways, including "laundering" it through Excel, and it's really not working out well. The layout is completely irregular.

Here's what I get when I just do the first school:

Code:

. * Example generated by -dataex-. To install: ssc install dataex
. clear

. input double cds_code str7 county str16 district str27 school byte(ethnic gender enr_total)

       cds_code     county          district                       school    ethnic    gender  enr_to~l
  1. 1.610e+12 "Alameda" "Berkeley Unified" "Berkeley Technology Academy" 2 1  2
  2. 1.610e+12 "Alameda" "Berkeley Unified" "Berkeley Technology Academy" 6 2 33
  3. 1.610e+12 "Alameda" "Berkeley Unified" "Berkeley Technology Academy" 5 1 18
  4. 1.610e+12 "Alameda" "Berkeley Unified" "Berkeley Technology Academy" 7 1  3
  5. 1.610e+12 "Alameda" "Berkeley Unified" "Berkeley Technology Academy" 0 1  3
  6. 1.610e+12 "Alameda" "Berkeley Unified" "Berkeley Technology Academy" 9 1  7
  7. 1.610e+12 "Alameda" "Berkeley Unified" "Berkeley Technology Academy" 6 1 48
  8. 1.610e+12 "Alameda" "Berkeley Unified" "Berkeley Technology Academy" 0 2  5
  9. 1.610e+12 "Alameda" "Berkeley Unified" "Berkeley Technology Academy" 1 2  1
 10. 1.610e+12 "Alameda" "Berkeley Unified" "Berkeley Technology Academy" 5 2 12
 11. 1.610e+12 "Alameda" "Berkeley Unified" "Berkeley Technology Academy" 9 2  4
 12. end

. 
. egen NOREPORT = sum(enr_total) if ethnic==0, by(cds_code)
(9 missing values generated)

. egen AMIND = sum(enr_total) if ethnic==1, by(cds_code)
(10 missing values generated)

. egen ASIAN = sum(enr_total) if ethnic==2, by(cds_code)
(10 missing values generated)

. egen PAC = sum(enr_total) if ethnic==3, by(cds_code)
(11 missing values generated)

. egen FIL = sum(enr_total) if ethnic==4, by(cds_code)
(11 missing values generated)

. egen HIS = sum(enr_total) if ethnic==5, by(cds_code)
(9 missing values generated)

. egen AFR = sum(enr_total) if ethnic==6, by(cds_code)
(9 missing values generated)

. egen WHITE = sum(enr_total) if ethnic==7, by(cds_code)
(10 missing values generated)

. egen TWOMORE = sum(enr_total) if ethnic==9, by(cds_code)
(9 missing values generated)

. 
. egen MALE = sum(enr_total) if gender==1, by(cds_code)
(5 missing values generated)

. egen FEMALE = sum(enr_total) if gender==2, by(cds_code)
(6 missing values generated)

. egen ENRTOT = sum(enr_tota), by(cds_code)

. collapse (first) county district school (mean) NOREPORT AMIND ASIAN PAC FIL HIS AFR WHITE TWOMORE MALE FEMALE ENRTOT, by(cds_code)

. 
. list, noobs clean abbrev(16)

     cds_code    county           district                        school   NOREPORT   AMIND   ASIAN   PAC   FIL   HIS   AFR   WHITE   TWOMORE   M
> ALE   FEMALE   ENRTOT  
    1.610e+12   Alameda   Berkeley Unified   Berkeley Technology Academy          8       1       2     .     .    30    81       3        11    
>  81       55      136

and those results are correct. So I am still unable to replicate your problem thus far.

But I think I see the problem now. You are collapsing -by(cds_code)-. But notice that in your posted data, cds_code is the same for all observations. So Stata will add up the data for all of the observations, not just those in a given school.

Now, I know what you're thinking. You're thinking, well I put up some -list- output there and Stata just took my detailed cds_code number (I'm guessing 12 digits) and wrote it out approximately in scientific notation. The actual numbers are not the same for all schools. But, you would be wrong to think that unless you had the foresight to make sure that you created the cds_code numbers as doubles. If you just accepted the usual defaults for creating this variable, the numbers are probably stored as floats--and floats only can carry 9 decimal digits of information. So many of your distinct 12-digit numbers will in fact be stored as identical by Stata, as they all get truncated to 9 digits. For numbers this large, you need to use either -double-s or strings. Since you probably will not need to do any arithmetic on the cds_code variable, strings would probably be better. They may not all turn out to have the same first 9 digits, but any group of schools that do will be treated as if they were a single school by this code applied to this data set.

The solution is to go back and recreate the data set, being sure to import or -generate- (or whatever you did to create this data) cds_code as a double or a string variable.

Then everything should go fine.

Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

02 Jul 2017, 14:09

I downloaded your complete data from the link in #6 and I ran the code in #6 on it, preceding and following the collapse with appropriate diagnostic commands.

Code:

. use "ENRtotupdate.dta", clear

. do "/var/folders/xr/lm5ccr996k7dspxs35yqzyt80000gp/T//SD17170.000000"

. egen NOREPORT = sum(enr_total) if ethnic==0, by(cds_code)
(119810 missing values generated)

. egen AMIND = sum(enr_total) if ethnic==1, by(cds_code)
(115955 missing values generated)

. egen ASIAN = sum(enr_total) if ethnic==2, by(cds_code)
(112029 missing values generated)

. egen PAC = sum(enr_total) if ethnic==3, by(cds_code)
(117760 missing values generated)

. egen FIL = sum(enr_total) if ethnic==4, by(cds_code)
(114549 missing values generated)

. egen HIS = sum(enr_total) if ethnic==5, by(cds_code)
(107911 missing values generated)

. egen AFR = sum(enr_total) if ethnic==6, by(cds_code)
(110878 missing values generated)

. egen WHITE = sum(enr_total) if ethnic==7, by(cds_code)
(108327 missing values generated)

. egen TWOMORE = sum(enr_total) if ethnic==9, by(cds_code)
(114501 missing values generated)

.
. egen MALE = sum(enr_total) if gender==1, by(cds_code)
(62810 missing values generated)

. egen FEMALE = sum(enr_total) if gender==2, by(cds_code)
(64905 missing values generated)

. egen ENRTOT = sum(enr_total), by(cds_code)

. summarize enr_total

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
   enr_total |    127,715    48.70996    107.4337          0       2218

. display r(sum)
6220993

. collapse (first) county district school (mean) NOREPORT AMIND ASIAN PAC FIL HIS AFR WHITE TWOM
> ORE MALE FEMALE ENRTOT, by(cds_code)

. summarize ENRTOT

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      ENRTOT |        166    37475.86      125226          2    1317804

. display r(sum)
6220993

.

This does not conform to the results you posted in #4 above.

Added in edit: When I restrict myself to the first 11 observations, I reproduce Clyde's results at #7.

Last edited by William Lisowski; 02 Jul 2017, 14:24.

Comment

Johny Daniel

Join Date: May 2017

Posts: 11
#9

02 Jul 2017, 14:16

Clyde, thanks a lot for taking the time to work on my problem. I just checked the cds_code in stata and it is a "double" type variable. And you are right that several schools have the exact same 13-digit code. I am not sure where to go from here as it would take me hours to manually change the cds_code for schools and I may make errors in doing that.

But thanks for your help! I really appreciate you helping me find what the exact issue was.
Comment
Johny Daniel

Join Date: May 2017

Posts: 11
#10

02 Jul 2017, 14:21

William:

It's weird that your observation count was lower. I ran the same codes again and still get the same number of observations:

Variable Obs Mean Std. Dev. Min Max

enr_total 127,715 48.70996 107.4337 0 2218
ENRTOT 127,715 386580.7 446157.9 2 1317804
Comment
Johny Daniel

Join Date: May 2017

Posts: 11
#11

02 Jul 2017, 14:36

Thank you thank you thank you!!! I went back to to my raw data excel file, and formatted the cds_code numbers from "general" to "numbers" and then imported the data to stata. Ran the same codes. And viola!!!

sum ENRTOT

Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
ENRTOT | 10,221 608.2577 569.734 0 4922
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#12

02 Jul 2017, 14:38

Well, it's not weird that my observation count is lower, because that's what collapse is all about. What's weird is that yours does not change.

Run the code, starting with the use command that reads in the dataset and ending with the summarize command that follows the collapse command. While you're at it, add the following command to your code.

Code:

which collapse

and add

Code:

display r(sum)

after each collapse command, as I did in my post.

Copy all the commands and output from your Results window and paste them into a code block. Then perhaps we can see where the results diverge.

Crossed with post #11: This does not explain your results in #4, unless they were run on a different dataset than the one you posted in post #6. In which case, it would have been nice to have been told so. Also, note that the application of

Code:

format %20.0f cds_code

would have allowed you to look at your data in the Data Browser and determine that the cds_codes in your dataset (or at least in the dataset given in post #6) were not at all what you expected them to be, since almost all of them grouped all the schools in a county into a single cds_code.

Last edited by William Lisowski; 02 Jul 2017, 14:48.
Comment

Announcement

Issues Collapsing Data

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment