Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Reshaping "triply" long nested data into wide format

    Hello,

    I would like to reshape data that I would characterize as "triply" long. The primary variable, 'measure', is a count of how many people in each organization fall into some specific category. I would like to separate the categories nested within the variable data_category into variables in and of themselves and list them in wide form. Employee type breaks each respective data category down into two additional subsets. What I want to do is this: I would like to combine the three variables: 'data_category', 'data_subcategory', and 'employee_type' to create a single variable. For example, I want to create a variable that combines 'certificate level' with the counts for the subcategories '4 yr bachelors', '5 yr masters', etc. and then also includes the employee type category. I have aggregate data; the data follows this general pattern seen below, but it is important to note that there are a longer list of sub-categories.

    school_year dstrct instn instn_name data_category data_sub_category employee_type measure
    1 2010-11 601 0103 A High School Certificate Level 4 Yr Bachelor's Administrators 1
    2 2010-11 601 0103 A High School Certificate Level 5 Yr Master's Administrators 1
    3 2010-11 601 0103 A High School Certificate Level 6 Yr Specialist's Administrators 3
    4 2010-11 601 0103 A High School Certificate Level 7 Yr Doctoral Administrators 1
    5 2010-11 601 0103 A High School Certificate Level Other * Administrators 0
    6 2010-11 601 0103 A High School Certificate Level 4 Yr Bachelor's PK-12 Teachers 21
    7 2010-11 601 0103 A High School Certificate Level 5 Yr Master's PK-12 Teachers 22
    8 2010-11 601 0103 A High School Certificate Level 6 Yr Specialist's PK-12 Teachers 15
    9 2010-11 601 0103 A High School Certificate Level 7 Yr Doctoral PK-12 Teachers 2
    10 2010-11 601 0103 A High School Certificate Level Other * PK-12 Teachers 1
    11 2010-11 601 0103 A High School Certified Personnel Professional Administrators 6
    12 2010-11 601 0103 A High School Certified Personnel Provisional Administrators 0
    13 2010-11 601 0103 A High School Certified Personnel Professional PK-12 Teachers 61
    14 2010-11 601 0103 A High School Certified Personnel Provisional PK-12 Teachers 0

    So, in its most basic form, I would expect the data would go from what is above to:

    school_year dstrct instn instn_name cl_4ba_ad cl_5ma_ad cl_6sp_ad cl_7doc_ad cl_o_ad
    1 2010-11 601 103 A High School 1 1 3 1 0
    2 2010-11 601 104 B Middle School
    3 2010-11 601 105 C Lower School
    4 2010-11 602 106 X High School
    5 2010-11 602 107 Y Middle School
    6 2010-11 602 108 Z Lower School


    Where 'cl_4ba_ad' represents certificate level 4 year bachelors for administrators, etc.


    In all, these subcategories account for 81 rows of data for each organization. So rather than having 81 rows, in its most basic (though not most compact) form, I would like to have 1 row and 81 columns for each organization. Also, because there are so many rows, I don't see it feasible for me to write syntax listing all the variable names in the reshape command. Is there a general way for me to write the reshape syntax?

    Thank you. I am not clear on where to start with such a dense array of data.

  • #2
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input byte seq str12 school_year int(dstrct instn) str14 instn_name str20 data_category str18 data_sub_category str15 employee_type byte measure
     1 "2010-11" 601 103 "A High School" "Certificate Level"   "4 Yr Bachelor's"   "Administrators"  1
     2 "2010-11" 601 103 "A High School" "Certificate Level"   "5 Yr Master's"     "Administrators"  1
     3 "2010-11" 601 103 "A High School" "Certificate Level"   "6 Yr Specialist's" "Administrators"  3
     4 "2010-11" 601 103 "A High School" "Certificate Level"   "7 Yr Doctoral"     "Administrators"  1
     5 "2010-11" 601 103 "A High School" "Certificate Level"   "Other *"           "Administrators"  0
     6 "2010-11" 601 103 "A High School" "Certificate Level"   "4 Yr Bachelor's"   "PK-12 Teachers" 21
     7 "2010-11" 601 103 "A High School" "Certificate Level"   "5 Yr Master's"     "PK-12 Teachers" 22
     8 "2010-11" 601 103 "A High School" "Certificate Level"   "6 Yr Specialist's" "PK-12 Teachers" 15
     9 "2010-11" 601 103 "A High School" "Certificate Level"   "7 Yr Doctoral"     "PK-12 Teachers"  2
    10 "2010-11" 601 103 "A High School" "Certificate Level"   "Other *"           "PK-12 Teachers"  1
    11 "2010-11" 601 103 "A High School" "Certified Personnel" "Professional"      "Administrators"  6
    12 "2010-11" 601 103 "A High School" "Certified Personnel" "Provisional"       "Administrators"  0
    13 "2010-11" 601 103 "A High School" "Certified Personnel" "Professional"      "PK-12 Teachers" 61
    14 "2010-11" 601 103 "A High School" "Certified Personnel" "Provisional"       "PK-12 Teachers"  0
    end
    
    replace data_category = "cl" if data_category == "Certificate Level"
    replace data_category = "cp" if data_category == "Certified Personnel"
    
    replace data_sub_category = "4ba" if data_sub_category == "4 Yr Bachelor's"
    replace data_sub_category = "5ma" if data_sub_category == "5 Yr Master's"
    replace data_sub_category = "6sp" if data_sub_category == "6 Yr Specialist's"
    replace data_sub_category = "7doc" if data_sub_category == "7 Yr Doctoral"
    replace data_sub_category = "o" if data_sub_category == "Other *"
    
    replace employee_type = "ad" if employee_type == "Administrators"
    replace employee_type = "pk12" if employee_type == "PK-12 Teachers"
    
    egen suffix = concat(data_category data_sub_category employee_type), punct("_")
    
    drop data*category employee_type seq
    reshape wide measure, i(school_year dstrct instn) j(suffix) string
    The above will do what you ask. There are a few special things in Stata that work best with wide data--are you sure you are going to be doing some of those? If not, the resulting data set will be unwieldy and difficult to use for most things in Stata, and you will probably wish you hadn't done it, but there it is.

    Also, because there are so many rows, I don't see it feasible for me to write syntax listing all the variable names in the reshape command.
    You are probably confusing this with the syntax of -reshape long- which would require many more stubs (not variables) to be listed. The real work here is changing the data_category, data_sub_category, and employee_type variables to their abbreviations. As the original values of these are unsystematic, and the abbreviations are not systematically derived from the originals, this cannot be automated. If the number of different values of these variables is large, you will, I'm afraid, be stuck with a large amount of typing. There isn't a way around that under the circumstances. The good news is that these variables appear to be crossed with each other, so 81 combinations could arise as simply as a 9x3x3 classification, which would only require 9+3+3 = 15 lines of code to abbreviate--a bit tedious, but not terrible.

    In the future, when showing data examples, please use the -dataex- command to do so, as I have here. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.



    Comment


    • #3
      Thank you so much! This worked wonderfully. The additional lines of code were no problem. I did get an error message that prompted me to encode 'suffix' as a numeric variable (not sure why). But other than that the code worked fine.

      The only other issue was that 'reshape' won't keep the variable names - those suffixes. Instead I get measure1, measure2... measure81. I am fine with writing the code to rename the variables if necessary. But I thought there might be a simpler method.

      Thank you again.

      Comment


      • #4
        I did get an error message that prompted me to encode 'suffix' as a numeric variable (not sure why).
        What was the error message? What command did it come from? It seems likely that you neglected to include the "string" option on your reshape wide command. Is this what you saw?
        Code:
        . reshape wide measure, i(school_year dstrct instn) j(suffix) 
        variable suffix is string; specify option string
        r(109);
        The command should have read
        Code:
        reshape wide measure, i(school_year dstrct instn) j(suffix) string
        That is necessary for the suffixes - which are not numeric - to be kept as strings. Using the encode command to create a numeric variable was an incorrect choice. Encoding suffix turned it into numbers, which is why your variables are now named measure1 through measure81.

        See
        Code:
        help reshape
        help encode
        to learn more.
        Last edited by William Lisowski; 06 Oct 2022, 11:31.

        Comment


        • #5
          Here is the error message:

          . reshape wide measure, i( long_school_year school_dstrct_nm instn_number) j(suffix) string
          string j variable suffix not allowed to contain spaces;
          encode string variable suffix into a numeric variable and use the new numeric variable as the j variable
          r(111);

          Does this mean my suffixes cannot contain the _character?

          Comment


          • #6
            Does this mean my suffixes cannot contain the _character?
            If that were true the code in post #2 would not have worked when Clyde tested it.

            It means exactly what is says - at least one of the values of the suffix variable has a space within it.
            Code:
            tab suffix if strpos(suffix," ")>0
            should show the values have spaces within them.

            Note that in post #2 none of the components of the suffix (data_category data_sub_category employee_type) had a space within them. In adding the remaining values for those components, did you include any with spaces? Did you accidentally include any trailing spaces ("xyz " instead of "xyz" for example)?

            Comment


            • #7
              Thank you! The 'tab suffix" syntax helped me find the issue. Finding and fixing the spaces solved the problem!


              Thanks again!

              Comment


              • #8
                Hello,

                I am trying to complete a similar task with a different data set. Here is one individual from my data with scores on different assessments over a period of 4 years:

                input int year long(ad_date st_ud in_num dist_num) str5 subj str52 test_nm int score byte st_lev
                2007 200706 155000 100 744 "CT" "Social Studies" 521 6
                2007 200706 155000 100 744 "CT" "English Language Arts" 648 6
                2007 200706 155000 100 744 "CT" "Mathematics" 852 6
                2007 200706 155000 100 744 "CT" "Reading" 752 6
                2007 200706 155000 100 744 "CT" "Science" 744 6
                2008 200806 155000 100 744 "CT" "Reading" 740 7
                2008 200806 155000 100 744 "CT" "Mathematics" 993 7
                2008 200806 155000 100 744 "CT" "Social Studies" 920 7
                2008 200806 155000 100 744 "CT" "Science" 962 7
                2008 200806 155000 100 744 "CT" "English Language Arts" 731 7
                2009 200906155000 100 744 "CT" "English Language Arts" 741 8
                2009 200906 155000 100 744 "CT" "Social Studies" 981 8
                2009 200906 155000 100 744 "MA" "Writing" 253 8
                2009 200906 155000 100 744 "CT" "Reading" 845 8
                2009 200906 155000 100 744 "CT" "Mathematics" 900 8
                2009 200906 155000 100 744 "CT" "Science" 855 8
                2010 201006 155000 105 744 "ET" "Biology" 455 9
                2010 201006 155000 105 744 "ET" "9th Grade Literature and Composition" 473 9
                2010 201006 155000 105 744 "ET" "Mathematics-1" 333 9
                2011 201011 155000 105 744 "STA" "Sta Reading" 531 0
                2011 201011 155000 105 744 "STA" "Sta Writing" 640 0
                2011 201106 155000 105 744 "STA" "Sta Reading" 740 0
                2011 201106 155000 105 744 "STA" "Sta Writing" 520 0
                2011 201106 155000 105 744 "STA" "Sta Mathematics" 670 0
                2011 201011 155000 105 744 "STA" "Sta Mathematics" 420 0
                end
                [/CODE]

                I am trying to do two things: first, to create a single variable using test_nm (test name), assmt_cd (essentially the test type), and st_lev (grade level) as variables. Based on previous replies to this thread, I have been able to do that successfully with the following command:

                egen suffix=concat(test_nm, assmt_cd st_lev), punt("_")

                Combining the three of these variables will create a unique variable with one exception: The STA score. The variable STA scores have no grade level associated with them, so have tried to account for this by replacing st_lev with ad_ date before concatentating the three variables with:

                replace st_lev=ad_date if st_lev==0


                It works, but it creates an unwieldy number of variables. Is there an easier way to keep each administration of STA, possibly keeping the highest score before concatentating the variables? I think this also comes to play in the second issue that I am having. Because it looks like there may be other places where students have taken other assessments more than once in a grade level.


                The second issue is to reshape this data from wide to long after creating these variables. When I try to reshape I get a message saying there is a reshape error. The error reads:
                "Your data are currently long. You are performing a reshape wide. You specified i(year dist_num in_num stu_id) and j(suffix). There are observations within i(year dist_num in_num stu_id) with the same value of j(suffix)."

                When I type reshape error, I get a super long list of cases with examples of what I mention previously - individuals with two administrations of the same test. See:
                3012630 2013 110 488000624 ecobfre_eoct_12
                3012631 2013 110 488000624 ecobfre_eoct_12
                3012632 2013 110 488000624 ushisteoc_eoct_12
                3012633 2013 110 488000624 ushisteoc_eoct_12
                3020874 2013 275 286778107 amlico_eoct_12
                3020875 2013 275 286778107 amlico_eoct_12
                3021155 2013 192 284048658 ecobfre_eoct_12
                3021156 2013 192 284048658 ecobfre_eoct_12
                3029053 2013 5052 284001288 ecobfre_eoct_12
                3029054 2013 5052 284001288 ecobfre_eoct_12
                3031666 2013 206 682001484 ecobfre_eoct_12
                3031667 2013 206 682001484 ecobfre_eoct_12
                where the last column is the concatenated suffix and the column to the left of it is stu_id (I am sorry that the columns are not labeled).

                However, Stata is telling me there are no duplicates. So I think that somehow trying to identify the highest score on multiple administrations on an assessment might work but I am not sure how to do that or if that is even the problem.

                I would appreciate any advice you could offer. Thank you.

                Comment


                • #9
                  There are observations within i(year dist_num in_num stu_id) with the same value of j(suffix)
                  It seems to me that you should use ad_date rather than year in the reshape command. Presumably the same test won't have been administered to the same student twice in the same month. So then when the same test is administered to the same student more than once in a single year, you can choose the first adminstration, the final one, the highest score, or whatever.

                  Comment

                  Working...
                  X