Reshaping "triply" long nested data into wide format

May Smith

Join Date: Aug 2022
Posts: 15

Reshaping "triply" long nested data into wide format

05 Oct 2022, 11:23

Hello,

I would like to reshape data that I would characterize as "triply" long. The primary variable, 'measure', is a count of how many people in each organization fall into some specific category. I would like to separate the categories nested within the variable data_category into variables in and of themselves and list them in wide form. Employee type breaks each respective data category down into two additional subsets. What I want to do is this: I would like to combine the three variables: 'data_category', 'data_subcategory', and 'employee_type' to create a single variable. For example, I want to create a variable that combines 'certificate level' with the counts for the subcategories '4 yr bachelors', '5 yr masters', etc. and then also includes the employee type category. I have aggregate data; the data follows this general pattern seen below, but it is important to note that there are a longer list of sub-categories.

	school_year	dstrct	instn	instn_name	data_category	data_sub_category	employee_type	measure
1	2010-11	601	0103	A High School	Certificate Level	4 Yr Bachelor's	Administrators	1
2	2010-11	601	0103	A High School	Certificate Level	5 Yr Master's	Administrators	1
3	2010-11	601	0103	A High School	Certificate Level	6 Yr Specialist's	Administrators	3
4	2010-11	601	0103	A High School	Certificate Level	7 Yr Doctoral	Administrators	1
5	2010-11	601	0103	A High School	Certificate Level	Other *	Administrators	0
6	2010-11	601	0103	A High School	Certificate Level	4 Yr Bachelor's	PK-12 Teachers	21
7	2010-11	601	0103	A High School	Certificate Level	5 Yr Master's	PK-12 Teachers	22
8	2010-11	601	0103	A High School	Certificate Level	6 Yr Specialist's	PK-12 Teachers	15
9	2010-11	601	0103	A High School	Certificate Level	7 Yr Doctoral	PK-12 Teachers	2
10	2010-11	601	0103	A High School	Certificate Level	Other *	PK-12 Teachers	1
11	2010-11	601	0103	A High School	Certified Personnel	Professional	Administrators	6
12	2010-11	601	0103	A High School	Certified Personnel	Provisional	Administrators	0
13	2010-11	601	0103	A High School	Certified Personnel	Professional	PK-12 Teachers	61
14	2010-11	601	0103	A High School	Certified Personnel	Provisional	PK-12 Teachers	0

So, in its most basic form, I would expect the data would go from what is above to:

	school_year	dstrct	instn	instn_name	cl_4ba_ad	cl_5ma_ad	cl_6sp_ad	cl_7doc_ad	cl_o_ad
1	2010-11	601	103	A High School	1	1	3	1	0
2	2010-11	601	104	B Middle School	…	…	…	…	…
3	2010-11	601	105	C Lower School	…	…	…	…	…
4	2010-11	602	106	X High School	…	…	…	…	…
5	2010-11	602	107	Y Middle School	…	…	…	…	…
6	2010-11	602	108	Z Lower School	…	…	…	…	…

Where 'cl_4ba_ad' represents certificate level 4 year bachelors for administrators, etc.

In all, these subcategories account for 81 rows of data for each organization. So rather than having 81 rows, in its most basic (though not most compact) form, I would like to have 1 row and 81 columns for each organization. Also, because there are so many rows, I don't see it feasible for me to write syntax listing all the variable names in the reshape command. Is there a general way for me to write the reshape syntax?

Thank you. I am not clear on where to start with such a dense array of data.

Tags: None

Clyde Schechter

Join Date: Apr 2014
Posts: 30100

05 Oct 2022, 12:41

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input byte seq str12 school_year int(dstrct instn) str14 instn_name str20 data_category str18 data_sub_category str15 employee_type byte measure
 1 "2010-11" 601 103 "A High School" "Certificate Level"   "4 Yr Bachelor's"   "Administrators"  1
 2 "2010-11" 601 103 "A High School" "Certificate Level"   "5 Yr Master's"     "Administrators"  1
 3 "2010-11" 601 103 "A High School" "Certificate Level"   "6 Yr Specialist's" "Administrators"  3
 4 "2010-11" 601 103 "A High School" "Certificate Level"   "7 Yr Doctoral"     "Administrators"  1
 5 "2010-11" 601 103 "A High School" "Certificate Level"   "Other *"           "Administrators"  0
 6 "2010-11" 601 103 "A High School" "Certificate Level"   "4 Yr Bachelor's"   "PK-12 Teachers" 21
 7 "2010-11" 601 103 "A High School" "Certificate Level"   "5 Yr Master's"     "PK-12 Teachers" 22
 8 "2010-11" 601 103 "A High School" "Certificate Level"   "6 Yr Specialist's" "PK-12 Teachers" 15
 9 "2010-11" 601 103 "A High School" "Certificate Level"   "7 Yr Doctoral"     "PK-12 Teachers"  2
10 "2010-11" 601 103 "A High School" "Certificate Level"   "Other *"           "PK-12 Teachers"  1
11 "2010-11" 601 103 "A High School" "Certified Personnel" "Professional"      "Administrators"  6
12 "2010-11" 601 103 "A High School" "Certified Personnel" "Provisional"       "Administrators"  0
13 "2010-11" 601 103 "A High School" "Certified Personnel" "Professional"      "PK-12 Teachers" 61
14 "2010-11" 601 103 "A High School" "Certified Personnel" "Provisional"       "PK-12 Teachers"  0
end

replace data_category = "cl" if data_category == "Certificate Level"
replace data_category = "cp" if data_category == "Certified Personnel"

replace data_sub_category = "4ba" if data_sub_category == "4 Yr Bachelor's"
replace data_sub_category = "5ma" if data_sub_category == "5 Yr Master's"
replace data_sub_category = "6sp" if data_sub_category == "6 Yr Specialist's"
replace data_sub_category = "7doc" if data_sub_category == "7 Yr Doctoral"
replace data_sub_category = "o" if data_sub_category == "Other *"

replace employee_type = "ad" if employee_type == "Administrators"
replace employee_type = "pk12" if employee_type == "PK-12 Teachers"

egen suffix = concat(data_category data_sub_category employee_type), punct("_")

drop data*category employee_type seq
reshape wide measure, i(school_year dstrct instn) j(suffix) string

The above will do what you ask. There are a few special things in Stata that work best with wide data--are you sure you are going to be doing some of those? If not, the resulting data set will be unwieldy and difficult to use for most things in Stata, and you will probably wish you hadn't done it, but there it is.

Also, because there are so many rows, I don't see it feasible for me to write syntax listing all the variable names in the reshape command.

You are probably confusing this with the syntax of -reshape long- which would require many more stubs (not variables) to be listed. The real work here is changing the data_category, data_sub_category, and employee_type variables to their abbreviations. As the original values of these are unsystematic, and the abbreviations are not systematically derived from the originals, this cannot be automated. If the number of different values of these variables is large, you will, I'm afraid, be stuck with a large amount of typing. There isn't a way around that under the circumstances. The good news is that these variables appear to be crossed with each other, so 81 combinations could arise as simply as a 9x3x3 classification, which would only require 9+3+3 = 15 lines of code to abbreviate--a bit tedious, but not terrible.

In the future, when showing data examples, please use the -dataex- command to do so, as I have here. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

Comment

May Smith

Join Date: Aug 2022

Posts: 15
#3

06 Oct 2022, 09:58

Thank you so much! This worked wonderfully. The additional lines of code were no problem. I did get an error message that prompted me to encode 'suffix' as a numeric variable (not sure why). But other than that the code worked fine.

The only other issue was that 'reshape' won't keep the variable names - those suffixes. Instead I get measure1, measure2... measure81. I am fine with writing the code to rename the variables if necessary. But I thought there might be a simpler method.

Thank you again.
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#4

06 Oct 2022, 11:28

I did get an error message that prompted me to encode 'suffix' as a numeric variable (not sure why).

What was the error message? What command did it come from? It seems likely that you neglected to include the "string" option on your reshape wide command. Is this what you saw?

Code:

. reshape wide measure, i(school_year dstrct instn) j(suffix) variable suffix is string; specify option string r(109);

The command should have read

Code:

reshape wide measure, i(school_year dstrct instn) j(suffix) string

That is necessary for the suffixes - which are not numeric - to be kept as strings. Using the encode command to create a numeric variable was an incorrect choice. Encoding suffix turned it into numbers, which is why your variables are now named measure1 through measure81.

See

Code:

help reshape help encode

to learn more.

Last edited by William Lisowski; 06 Oct 2022, 11:31.
Comment
May Smith

Join Date: Aug 2022

Posts: 15
#5

06 Oct 2022, 13:03

Here is the error message:

. reshape wide measure, i( long_school_year school_dstrct_nm instn_number) j(suffix) string
string j variable suffix not allowed to contain spaces;
encode string variable suffix into a numeric variable and use the new numeric variable as the j variable
r(111);

Does this mean my suffixes cannot contain the _character?
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#6

06 Oct 2022, 13:27

Does this mean my suffixes cannot contain the _character?

If that were true the code in post #2 would not have worked when Clyde tested it.

It means exactly what is says - at least one of the values of the suffix variable has a space within it.

Code:

tab suffix if strpos(suffix," ")>0

should show the values have spaces within them.

Note that in post #2 none of the components of the suffix (data_category data_sub_category employee_type) had a space within them. In adding the remaining values for those components, did you include any with spaces? Did you accidentally include any trailing spaces ("xyz " instead of "xyz" for example)?
Comment
May Smith

Join Date: Aug 2022

Posts: 15
#7

07 Oct 2022, 11:18

Thank you! The 'tab suffix" syntax helped me find the issue. Finding and fixing the spaces solved the problem!

Thanks again!
Comment

May Smith

Join Date: Aug 2022
Posts: 15

24 Oct 2022, 13:08

Hello,

I am trying to complete a similar task with a different data set. Here is one individual from my data with scores on different assessments over a period of 4 years:

input int year long(ad_date st_ud in_num dist_num) str5 subj str52 test_nm int score byte st_lev
2007 200706 155000 100 744 "CT" "Social Studies" 521 6
2007 200706 155000 100 744 "CT" "English Language Arts" 648 6
2007 200706 155000 100 744 "CT" "Mathematics" 852 6
2007 200706 155000 100 744 "CT" "Reading" 752 6
2007 200706 155000 100 744 "CT" "Science" 744 6
2008 200806 155000 100 744 "CT" "Reading" 740 7
2008 200806 155000 100 744 "CT" "Mathematics" 993 7
2008 200806 155000 100 744 "CT" "Social Studies" 920 7
2008 200806 155000 100 744 "CT" "Science" 962 7
2008 200806 155000 100 744 "CT" "English Language Arts" 731 7
2009 200906155000 100 744 "CT" "English Language Arts" 741 8
2009 200906 155000 100 744 "CT" "Social Studies" 981 8
2009 200906 155000 100 744 "MA" "Writing" 253 8
2009 200906 155000 100 744 "CT" "Reading" 845 8
2009 200906 155000 100 744 "CT" "Mathematics" 900 8
2009 200906 155000 100 744 "CT" "Science" 855 8
2010 201006 155000 105 744 "ET" "Biology" 455 9
2010 201006 155000 105 744 "ET" "9th Grade Literature and Composition" 473 9
2010 201006 155000 105 744 "ET" "Mathematics-1" 333 9
2011 201011 155000 105 744 "STA" "Sta Reading" 531 0
2011 201011 155000 105 744 "STA" "Sta Writing" 640 0
2011 201106 155000 105 744 "STA" "Sta Reading" 740 0
2011 201106 155000 105 744 "STA" "Sta Writing" 520 0
2011 201106 155000 105 744 "STA" "Sta Mathematics" 670 0
2011 201011 155000 105 744 "STA" "Sta Mathematics" 420 0
end
[/CODE]

I am trying to do two things: first, to create a single variable using test_nm (test name), assmt_cd (essentially the test type), and st_lev (grade level) as variables. Based on previous replies to this thread, I have been able to do that successfully with the following command:

egen suffix=concat(test_nm, assmt_cd st_lev), punt("_")

Combining the three of these variables will create a unique variable with one exception: The STA score. The variable STA scores have no grade level associated with them, so have tried to account for this by replacing st_lev with ad_ date before concatentating the three variables with:

replace st_lev=ad_date if st_lev==0

It works, but it creates an unwieldy number of variables. Is there an easier way to keep each administration of STA, possibly keeping the highest score before concatentating the variables? I think this also comes to play in the second issue that I am having. Because it looks like there may be other places where students have taken other assessments more than once in a grade level.

The second issue is to reshape this data from wide to long after creating these variables. When I try to reshape I get a message saying there is a reshape error. The error reads:
"Your data are currently long. You are performing a reshape wide. You specified i(year dist_num in_num stu_id) and j(suffix). There are observations within i(year dist_num in_num stu_id) with the same value of j(suffix)."

When I type reshape error, I get a super long list of cases with examples of what I mention previously - individuals with two administrations of the same test. See:

3012630	2013	110	488000624	ecobfre_eoct_12
3012631	2013	110	488000624	ecobfre_eoct_12
3012632	2013	110	488000624	ushisteoc_eoct_12
3012633	2013	110	488000624	ushisteoc_eoct_12
3020874	2013	275	286778107	amlico_eoct_12
3020875	2013	275	286778107	amlico_eoct_12
3021155	2013	192	284048658	ecobfre_eoct_12
3021156	2013	192	284048658	ecobfre_eoct_12
3029053	2013	5052	284001288	ecobfre_eoct_12
3029054	2013	5052	284001288	ecobfre_eoct_12
3031666	2013	206	682001484	ecobfre_eoct_12
3031667	2013	206	682001484	ecobfre_eoct_12

where the last column is the concatenated suffix and the column to the left of it is stu_id (I am sorry that the columns are not labeled).

However, Stata is telling me there are no duplicates. So I think that somehow trying to identify the highest score on multiple administrations on an assessment might work but I am not sure how to do that or if that is even the problem.

I would appreciate any advice you could offer. Thank you.

Comment

William Lisowski

Join Date: Dec 2014

Posts: 10150
#9

24 Oct 2022, 13:37

There are observations within i(year dist_num in_num stu_id) with the same value of j(suffix)

It seems to me that you should use ad_date rather than year in the reshape command. Presumably the same test won't have been administered to the same student twice in the same month. So then when the same test is administered to the same student more than once in a single year, you can choose the first adminstration, the final one, the highest score, or whatever.
Comment

Announcement

Reshaping "triply" long nested data into wide format

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment