Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Issues Collapsing Data

    Hi,
    I'm using STATA 14.2 (apologies in advance as I am using stata on a university server and cannot use dataex on the university version)

    Below is an example of the data I have:

    Unique Code (CDS) District School Ethnicity Gender Total Enrollment
    100001 ABC XYZ 1 1 4
    100001 ABC XYZ 2 1 5
    100001 ABC XYZ 4 1 3
    100001 ABC XYZ 4 2 4
    100002 ABC PQR 1 1 5
    100002 ABC PQR 1 2 3
    100002 ABC PQR 3 1 3
    100002 ABC PQR 3 2 8
    100002 ABC PQR 5 1 1
    100002 ABC PQR 5 2 5
    I want to collapse the data, so each row provides demographics for each school. Example of what I am trying to do:

    CDS Code Dist Sch Eth1 Eth2 Eth3 Eth4 Eth5 Male Fem EnrTotal
    100001 ABC XYZ 4 5 0 7 0 12 4 16
    10002 ABC PQR 8 0 11 0 6 9 16 25

    What I have tried:
    egen NOREPORT = sum(enr_total) if ethnic==0, by(cds_code)
    egen WHITE = sum(enr_total) if ethnic==1, by(cds_code)
    egen ASIAN = sum(enr_total) if ethnic==2, by(cds_code)
    egen PAC = sum(enr_total) if ethnic==3, by(cds_code)
    egen AFR = sum(enr_total) if ethnic==4, by(cds_code)
    egen HIS = sum(enr_total) if ethnic==5, by(cds_code)
    egen TWOMORE = sum(enr_total) if ethnic==9, by(cds_code)

    egen MALE = sum(enr_total) if gender=="M", by(cds_code)
    egen FEMALE = sum(enr_total) if gender=="F", by(cds_code)
    egen ENRTOT = sum(enr_tota), by(cds_code)

    collapse (first) county district school (mean) NOREPORT ASIAN PAC HIS AFR WHITE TWOMORE MALE FEMALE ENRTOT, by(cds_code)

    Problem:
    After running the above code, the enrollment total does not equal to the actual total for each school. The numbers are much larger (no sure why).

    Can anyone tell me if I am doing something wrong in collapsing the data or if there is another way I should collapse the data?

    Thanks


  • #2
    I can't replicate your problem. When I run

    Code:
    clear
    input long cds_code str3(district school) byte(ethnic gender enr_total)
    100001 "ABC" "XYZ" 1 1 4
    100001 "ABC" "XYZ" 2 1 5
    100001 "ABC" "XYZ" 4 1 3
    100001 "ABC" "XYZ" 4 2 4
    100002 "ABC" "PQR" 1 1 5
    100002 "ABC" "PQR" 1 2 3
    100002 "ABC" "PQR" 3 1 3
    100002 "ABC" "PQR" 3 2 8
    100002 "ABC" "PQR" 5 1 1
    100002 "ABC" "PQR" 5 2 5
    end
    
    egen NOREPORT = sum(enr_total) if ethnic==0, by(cds_code)
    egen WHITE = sum(enr_total) if ethnic==1, by(cds_code)
    egen ASIAN = sum(enr_total) if ethnic==2, by(cds_code)
    egen PAC = sum(enr_total) if ethnic==3, by(cds_code)
    egen AFR = sum(enr_total) if ethnic==4, by(cds_code)
    egen HIS = sum(enr_total) if ethnic==5, by(cds_code)
    egen TWOMORE = sum(enr_total) if ethnic==9, by(cds_code)
    
    egen MALE = sum(enr_total) if gender==1, by(cds_code)
    egen FEMALE = sum(enr_total) if gender==2, by(cds_code)
    egen ENRTOT = sum(enr_tota), by(cds_code)
    
    collapse (first) district school (mean) NOREPORT ASIAN PAC HIS AFR WHITE TWOMORE MALE FEMALE ENRTOT, by(cds_code)
    Stata gives me:
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input long cds_code str3(district school) float(NOREPORT ASIAN PAC HIS AFR WHITE TWOMORE MALE FEMALE ENRTOT)
    100001 "ABC" "XYZ" . 5  . . 7 4 . 12  4 16
    100002 "ABC" "PQR" . . 11 6 . 8 .  9 16 25
    end
    as the end result.

    Now, I am quite sure that what you show us in #1 is not what you're actually doing. It can't be, because, for example, you refer to -gender == "F"-, but the example data you show has gender as a 1/2 numeric variable. So you have made some "translations" between what you actually did and what you reported here. My guess is that your error has gotten "lost in translation."

    Comment


    • #3
      I was mystified by the original code and thought the following code to be more obvious about what it's doing. Building on the dataex sample data presented by Clyde,
      Code:
      generate NOREPORT = cond(ethnic==0,enr_total,0)
      generate WHITE    = cond(ethnic==1,enr_total,0)
      generate ASIAN    = cond(ethnic==2,enr_total,0)
      generate PAC      = cond(ethnic==3,enr_total,0)
      generate AFR      = cond(ethnic==4,enr_total,0)
      generate HIS      = cond(ethnic==5,enr_total,0)
      generate TWOMORE  = cond(ethnic==9,enr_total,0)
      
      generate MALE   = cond(gender==1,enr_total,0)
      generate FEMALE = cond(gender==2,enr_total,0)
      
      collapse (first) district school ///
               (sum) NOREPORT ASIAN PAC HIS AFR WHITE TWOMORE MALE FEMALE enr_total, ///
               by(cds_code)
      It yields the following result, again presented with dataex to make it unlikely to wrap the data onto multiple lines.
      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input long cds_code str3(district school) double(NOREPORT ASIAN PAC HIS AFR WHITE TWOMORE MALE FEMALE enr_total)
      100001 "ABC" "XYZ" 0 5  0 0 7 4 0 12  4 16
      100002 "ABC" "PQR" 0 0 11 6 0 8 0  9 16 25
      end

      Comment


      • #4
        Sorry about the gender error you clearly pointed out. My data file actually has male as M and female as F. I misrepresented it in the first table above.

        But I ran the codes again. To illustrate the problem I am running into,

        The Enrollment total BEFORE running the collapse code:

        Code: sum enr_total
        variable Obs Mean SD Min Max
        enr_total 127715 48.70 107.43 0 2218

        Enrollment total AFTER collapsing data:
        Code: sum ENRTOT
        Variable Obs Mean SD Min Max
        ENRTOT 127715 386580.7 446157.9 2 1317804
        The mean, min and max change drastically after running all the above commands followed by the collapse command.



        Comment


        • #5
          Well, it appears there is something wrong with your -collapse- command. I say that because you have the same N (Obs) before and after--so you are not getting any actual collapse of the data. Or perhaps at the point you give the -collapse- command you have somehow reduced your data set to a single observation per value of cds_code so there is no collapsing to do.

          As I said earlier, based on what you have actually shown of data and code, I cannot reproduce your problem, and I also know that what you have shown is not what you did. If you want help, I think you have to show a representative example data set (that reproduces the problem) and the actual exact and complete code that you ran, along with an actual exact copy of the output of -list- at the end. I understand that you can't use -dataex-. Perhaps you can attach a .dta file instead. To get actual exact copies of Stata code and output you should do that by copy/paste from Stata's Results window or your log file into the Forum editor between code delimiters (see FAQ #12 if you don't know how to get those.) DO NOT EDIT IT IN ANY WAY. The problem clearly lies in some detail(s) that you have not yet disclosed.

          Comment


          • #6
            RAW DATA

            Code:
            cds_code    county    district              school                                ethnic  gender  enr_total
            1.610e+12    Alameda    Berkeley Unified    Berkeley Technology Academy    2    1    2
            1.610e+12    Alameda    Berkeley Unified    Berkeley Technology Academy    6    2    33
            1.610e+12    Alameda    Berkeley Unified    Berkeley Technology Academy    5    1    18
            1.610e+12    Alameda    Berkeley Unified    Berkeley Technology Academy    7    1    3
            1.610e+12    Alameda    Berkeley Unified    Berkeley Technology Academy    0    1    3
            1.610e+12    Alameda    Berkeley Unified    Berkeley Technology Academy    9    1    7
            1.610e+12    Alameda    Berkeley Unified    Berkeley Technology Academy    6    1    48
            1.610e+12    Alameda    Berkeley Unified    Berkeley Technology Academy    0    2    5
            1.610e+12    Alameda    Berkeley Unified    Berkeley Technology Academy    1    2    1
            1.610e+12    Alameda    Berkeley Unified    Berkeley Technology Academy    5    2    12
            1.610e+12    Alameda    Berkeley Unified    Berkeley Technology Academy    9    2    4
            1.610e+12    Alameda    Castro Valley Unified    Castro Valley Elementary    4    1    5
            1.610e+12    Alameda    Castro Valley Unified    Castro Valley Elementary    5    1    79
            1.610e+12    Alameda    Castro Valley Unified    Castro Valley Elementary    9    2    22
            1.610e+12    Alameda    Castro Valley Unified    Castro Valley Elementary    2    1    21
            1.610e+12    Alameda    Castro Valley Unified    Castro Valley Elementary    6    1    16
            1.610e+12    Alameda    Castro Valley Unified    Castro Valley Elementary    7    1    50
            1.610e+12    Alameda    Castro Valley Unified    Castro Valley Elementary    5    2    88
            1.610e+12    Alameda    Castro Valley Unified    Castro Valley Elementary    1    2    3
            1.610e+12    Alameda    Castro Valley Unified    Castro Valley Elementary    1    1    1
            1.610e+12    Alameda    Castro Valley Unified    Castro Valley Elementary    4    2    7
            1.610e+12    Alameda    Castro Valley Unified    Castro Valley Elementary    3    2    2
            1.610e+12    Alameda    Castro Valley Unified    Castro Valley Elementary    3    1    2
            1.610e+12    Alameda    Castro Valley Unified    Castro Valley Elementary    2    2    41
            1.610e+12    Alameda    Castro Valley Unified    Castro Valley Elementary    6    2    15
            1.610e+12    Alameda    Castro Valley Unified    Castro Valley Elementary    0    1    2
            1.610e+12    Alameda    Castro Valley Unified    Castro Valley Elementary    9    1    24
            1.610e+12    Alameda    Castro Valley Unified    Castro Valley Elementary    7    2    52
            1.610e+12    Alameda    Castro Valley Unified    Castro Valley Elementary    0    2    3
            1.610e+12    Alameda    Livermore Valley Joint Unified    William Mendenhall Middle    6    2    4
            1.610e+12    Alameda    Livermore Valley Joint Unified    William Mendenhall Middle    4    2    10
            1.610e+12    Alameda    Livermore Valley Joint Unified    William Mendenhall Middle    9    2    29
            1.610e+12    Alameda    Livermore Valley Joint Unified    William Mendenhall Middle    9    1    36
            1.610e+12    Alameda    Livermore Valley Joint Unified    William Mendenhall Middle    2    1    19
            1.610e+12    Alameda    Livermore Valley Joint Unified    William Mendenhall Middle    4    1    12
            1.610e+12    Alameda    Livermore Valley Joint Unified    William Mendenhall Middle    7    2    298
            1.610e+12    Alameda    Livermore Valley Joint Unified    William Mendenhall Middle    2    2    28
            1.610e+12    Alameda    Livermore Valley Joint Unified    William Mendenhall Middle    3    2    2
            1.610e+12    Alameda    Livermore Valley Joint Unified    William Mendenhall Middle    5    2    68
            1.610e+12    Alameda    Livermore Valley Joint Unified    William Mendenhall Middle    1    2    2
            1.610e+12    Alameda    Livermore Valley Joint Unified    William Mendenhall Middle    7    1    305
            1.610e+12    Alameda    Livermore Valley Joint Unified    William Mendenhall Middle    6    1    5
            1.610e+12    Alameda    Livermore Valley Joint Unified    William Mendenhall Middle    5    1    92
            1.610e+12    Alameda    Livermore Valley Joint Unified    William Mendenhall Middle    3    1    1
            1.610e+12    Alameda    New Haven Unified    Core Learning Academy at Conley-Caraballo High    2    1    5
            1.610e+12    Alameda    New Haven Unified    Core Learning Academy at Conley-Caraballo High    6    2    29
            1.610e+12    Alameda    New Haven Unified    Core Learning Academy at Conley-Caraballo High    6    1    16
            1.610e+12    Alameda    New Haven Unified    Core Learning Academy at Conley-Caraballo High    5    2    39
            1.610e+12    Alameda    New Haven Unified    Core Learning Academy at Conley-Caraballo High    5    1    66
            1.610e+12    Alameda    New Haven Unified    Core Learning Academy at Conley-Caraballo High    9    2    1
            CODE USED:
            Code:
            egen NOREPORT = sum(enr_total) if ethnic==0, by(cds_code)
            egen AMIND = sum(enr_total) if ethnic==1, by(cds_code)
            egen ASIAN = sum(enr_total) if ethnic==2, by(cds_code)
            egen PAC = sum(enr_total) if ethnic==3, by(cds_code)
            egen FIL = sum(enr_total) if ethnic==4, by(cds_code)
            egen HIS = sum(enr_total) if ethnic==5, by(cds_code)
            egen AFR = sum(enr_total) if ethnic==6, by(cds_code)
            egen WHITE = sum(enr_total) if ethnic==7, by(cds_code)
            egen TWOMORE = sum(enr_total) if ethnic==9, by(cds_code)
            
            egen MALE = sum(enr_total) if gender==1, by(cds_code)
            egen FEMALE = sum(enr_total) if gender==2, by(cds_code)
            egen ENRTOT = sum(enr_tota), by(cds_code)
            collapse (first) county district school (mean) NOREPORT AMIND ASIAN PAC FIL HIS AFR WHITE TWOMORE MALE FEMALE ENRTOT, by(cds_code)
            Link to .dta file: https://utexas.box.com/s/7ezjnexhak3...gojaneesn5u34i

            Comment


            • #7
              Well, thanks for trying. The example data you show is really difficult to wrestle into Stata. I've already spent 15 minutes trying to do that in various ways, including "laundering" it through Excel, and it's really not working out well. The layout is completely irregular.

              Here's what I get when I just do the first school:

              Code:
              . * Example generated by -dataex-. To install: ssc install dataex
              . clear
              
              . input double cds_code str7 county str16 district str27 school byte(ethnic gender enr_total)
              
                     cds_code     county          district                       school    ethnic    gender  enr_to~l
                1. 1.610e+12 "Alameda" "Berkeley Unified" "Berkeley Technology Academy" 2 1  2
                2. 1.610e+12 "Alameda" "Berkeley Unified" "Berkeley Technology Academy" 6 2 33
                3. 1.610e+12 "Alameda" "Berkeley Unified" "Berkeley Technology Academy" 5 1 18
                4. 1.610e+12 "Alameda" "Berkeley Unified" "Berkeley Technology Academy" 7 1  3
                5. 1.610e+12 "Alameda" "Berkeley Unified" "Berkeley Technology Academy" 0 1  3
                6. 1.610e+12 "Alameda" "Berkeley Unified" "Berkeley Technology Academy" 9 1  7
                7. 1.610e+12 "Alameda" "Berkeley Unified" "Berkeley Technology Academy" 6 1 48
                8. 1.610e+12 "Alameda" "Berkeley Unified" "Berkeley Technology Academy" 0 2  5
                9. 1.610e+12 "Alameda" "Berkeley Unified" "Berkeley Technology Academy" 1 2  1
               10. 1.610e+12 "Alameda" "Berkeley Unified" "Berkeley Technology Academy" 5 2 12
               11. 1.610e+12 "Alameda" "Berkeley Unified" "Berkeley Technology Academy" 9 2  4
               12. end
              
              . 
              . egen NOREPORT = sum(enr_total) if ethnic==0, by(cds_code)
              (9 missing values generated)
              
              . egen AMIND = sum(enr_total) if ethnic==1, by(cds_code)
              (10 missing values generated)
              
              . egen ASIAN = sum(enr_total) if ethnic==2, by(cds_code)
              (10 missing values generated)
              
              . egen PAC = sum(enr_total) if ethnic==3, by(cds_code)
              (11 missing values generated)
              
              . egen FIL = sum(enr_total) if ethnic==4, by(cds_code)
              (11 missing values generated)
              
              . egen HIS = sum(enr_total) if ethnic==5, by(cds_code)
              (9 missing values generated)
              
              . egen AFR = sum(enr_total) if ethnic==6, by(cds_code)
              (9 missing values generated)
              
              . egen WHITE = sum(enr_total) if ethnic==7, by(cds_code)
              (10 missing values generated)
              
              . egen TWOMORE = sum(enr_total) if ethnic==9, by(cds_code)
              (9 missing values generated)
              
              . 
              . egen MALE = sum(enr_total) if gender==1, by(cds_code)
              (5 missing values generated)
              
              . egen FEMALE = sum(enr_total) if gender==2, by(cds_code)
              (6 missing values generated)
              
              . egen ENRTOT = sum(enr_tota), by(cds_code)
              
              . collapse (first) county district school (mean) NOREPORT AMIND ASIAN PAC FIL HIS AFR WHITE TWOMORE MALE FEMALE ENRTOT, by(cds_code)
              
              . 
              . list, noobs clean abbrev(16)
              
                   cds_code    county           district                        school   NOREPORT   AMIND   ASIAN   PAC   FIL   HIS   AFR   WHITE   TWOMORE   M
              > ALE   FEMALE   ENRTOT  
                  1.610e+12   Alameda   Berkeley Unified   Berkeley Technology Academy          8       1       2     .     .    30    81       3        11    
              >  81       55      136
              and those results are correct. So I am still unable to replicate your problem thus far.

              But I think I see the problem now. You are collapsing -by(cds_code)-. But notice that in your posted data, cds_code is the same for all observations. So Stata will add up the data for all of the observations, not just those in a given school.

              Now, I know what you're thinking. You're thinking, well I put up some -list- output there and Stata just took my detailed cds_code number (I'm guessing 12 digits) and wrote it out approximately in scientific notation. The actual numbers are not the same for all schools. But, you would be wrong to think that unless you had the foresight to make sure that you created the cds_code numbers as doubles. If you just accepted the usual defaults for creating this variable, the numbers are probably stored as floats--and floats only can carry 9 decimal digits of information. So many of your distinct 12-digit numbers will in fact be stored as identical by Stata, as they all get truncated to 9 digits. For numbers this large, you need to use either -double-s or strings. Since you probably will not need to do any arithmetic on the cds_code variable, strings would probably be better. They may not all turn out to have the same first 9 digits, but any group of schools that do will be treated as if they were a single school by this code applied to this data set.

              The solution is to go back and recreate the data set, being sure to import or -generate- (or whatever you did to create this data) cds_code as a double or a string variable.

              Then everything should go fine.

              Comment


              • #8
                I downloaded your complete data from the link in #6 and I ran the code in #6 on it, preceding and following the collapse with appropriate diagnostic commands.
                Code:
                . use "ENRtotupdate.dta", clear
                
                . do "/var/folders/xr/lm5ccr996k7dspxs35yqzyt80000gp/T//SD17170.000000"
                
                . egen NOREPORT = sum(enr_total) if ethnic==0, by(cds_code)
                (119810 missing values generated)
                
                . egen AMIND = sum(enr_total) if ethnic==1, by(cds_code)
                (115955 missing values generated)
                
                . egen ASIAN = sum(enr_total) if ethnic==2, by(cds_code)
                (112029 missing values generated)
                
                . egen PAC = sum(enr_total) if ethnic==3, by(cds_code)
                (117760 missing values generated)
                
                . egen FIL = sum(enr_total) if ethnic==4, by(cds_code)
                (114549 missing values generated)
                
                . egen HIS = sum(enr_total) if ethnic==5, by(cds_code)
                (107911 missing values generated)
                
                . egen AFR = sum(enr_total) if ethnic==6, by(cds_code)
                (110878 missing values generated)
                
                . egen WHITE = sum(enr_total) if ethnic==7, by(cds_code)
                (108327 missing values generated)
                
                . egen TWOMORE = sum(enr_total) if ethnic==9, by(cds_code)
                (114501 missing values generated)
                
                .
                . egen MALE = sum(enr_total) if gender==1, by(cds_code)
                (62810 missing values generated)
                
                . egen FEMALE = sum(enr_total) if gender==2, by(cds_code)
                (64905 missing values generated)
                
                . egen ENRTOT = sum(enr_total), by(cds_code)
                
                . summarize enr_total
                
                    Variable |        Obs        Mean    Std. Dev.       Min        Max
                -------------+---------------------------------------------------------
                   enr_total |    127,715    48.70996    107.4337          0       2218
                
                . display r(sum)
                6220993
                
                . collapse (first) county district school (mean) NOREPORT AMIND ASIAN PAC FIL HIS AFR WHITE TWOM
                > ORE MALE FEMALE ENRTOT, by(cds_code)
                
                . summarize ENRTOT
                
                    Variable |        Obs        Mean    Std. Dev.       Min        Max
                -------------+---------------------------------------------------------
                      ENRTOT |        166    37475.86      125226          2    1317804
                
                . display r(sum)
                6220993
                
                .
                This does not conform to the results you posted in #4 above.

                Added in edit: When I restrict myself to the first 11 observations, I reproduce Clyde's results at #7.
                Last edited by William Lisowski; 02 Jul 2017, 14:24.

                Comment


                • #9
                  Clyde, thanks a lot for taking the time to work on my problem. I just checked the cds_code in stata and it is a "double" type variable. And you are right that several schools have the exact same 13-digit code. I am not sure where to go from here as it would take me hours to manually change the cds_code for schools and I may make errors in doing that.

                  But thanks for your help! I really appreciate you helping me find what the exact issue was.

                  Comment


                  • #10
                    William:

                    It's weird that your observation count was lower. I ran the same codes again and still get the same number of observations:

                    ​​​​​​​Variable Obs Mean Std. Dev. Min Max

                    enr_total 127,715 48.70996 107.4337 0 2218
                    ENRTOT 127,715 386580.7 446157.9 2 1317804


                    Comment


                    • #11
                      Thank you thank you thank you!!! I went back to to my raw data excel file, and formatted the cds_code numbers from "general" to "numbers" and then imported the data to stata. Ran the same codes. And viola!!!

                      sum ENRTOT

                      Variable | Obs Mean Std. Dev. Min Max
                      -------------+---------------------------------------------------------
                      ENRTOT | 10,221 608.2577 569.734 0 4922

                      Comment


                      • #12
                        Well, it's not weird that my observation count is lower, because that's what collapse is all about. What's weird is that yours does not change.

                        Run the code, starting with the use command that reads in the dataset and ending with the summarize command that follows the collapse command. While you're at it, add the following command to your code.
                        Code:
                        which collapse
                        and add
                        Code:
                        display r(sum)
                        after each collapse command, as I did in my post.

                        Copy all the commands and output from your Results window and paste them into a code block. Then perhaps we can see where the results diverge.

                        Crossed with post #11: This does not explain your results in #4, unless they were run on a different dataset than the one you posted in post #6. In which case, it would have been nice to have been told so. Also, note that the application of
                        Code:
                        format %20.0f cds_code
                        would have allowed you to look at your data in the Data Browser and determine that the cds_codes in your dataset (or at least in the dataset given in post #6) were not at all what you expected them to be, since almost all of them grouped all the schools in a county into a single cds_code.
                        Last edited by William Lisowski; 02 Jul 2017, 14:48.

                        Comment

                        Working...
                        X