Reshaping - long to wide - "variable id does not uniquely identify the observations" error - how to fix?

Matt Theron

Join Date: Apr 2017
Posts: 4

Reshaping - long to wide - "variable id does not uniquely identify the observations" error - how to fix?

02 Apr 2017, 14:42

Hi there I am using Stata13 on Windows 10 and trying to convert my data from long to wide format.

I am busy working with a panel dataset which has racial (black, indian, white, etc.) demographics for schools over a period 1996-2011 (year). I am needing to look at racial confirgurations of schools over the period to track possible racial integration. The main variables I am working with, and trying to reshape, are school_id (unique school identifier); dummy variables for each race group (b c i w) - their numeric value representing the total learners of that racial group for the specific year; and lastly year (categorical variable).

I have run the following command: reshape wide b c i w, i(school_id) j(year)... I get:

values of variable year not unique within school_id
Your data are currently long. You are performing a reshape wide. You specified i(school_id) and j(year). There are observations
within i(school_id) with the same value of j(year). In the long data, variables i() and j() together must uniquely identify the
observations.

long wide
+---------------+ +------------------+
| i j a b | | i a1 a2 b1 b2 |
|---------------| <--- reshape ---> |------------------|
| 1 1 1 2 | | 1 1 3 2 4 |
| 1 2 3 4 | | 2 5 7 6 8 |
| 2 1 5 6 | +------------------+
| 2 2 7 8 |
+---------------+
Type reshape error for a list of the problem variables.

However the duplicates of year within school_id are valid because for each year there are various grades within a specific school, and data on the quantity of each race group for that grade. I have run the following code to aggregate my data for a race group across grades for a specific year for each school_id: by school_id race year, sort: egen totalsum = total (quantity). Hence the data is structured as follows: e.g.

school_id	b	c	i	w	year
1	33	.	.	.	2000
1	.	56	.	.	2000
1	.	.	4	.	2000
1	.	.	.	70	2000
1	.	23	.	.	2001
1	.	.	6	.	2001
1	.	.	.	63	2001
2	81	.	.	.	1998
2	.	47	.	.	1998
2	.	.	12	.	1998
2	.	.	.	44	1998

etc.

How could i structure my data (what code should i use) so that i can have the racial information for a given year for a school_id on a single line? As below:

school_id	b	c	i	w	year
1	33	56	4	70	2000
1	.	23	6	63	2001
2	81	47	12	44	1998

Any help and suggestions would be much appreciated. Thank you in advance

Tags: panel data, reshape

Clyde Schechter

Join Date: Apr 2014

Posts: 30095
#2

02 Apr 2017, 15:13

Given the layout of the data set you are starting with (which is not in long layout), this is not a -reshape- problem at all:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input float(school_id b c i w year) 1 33 . . . 2000 1 . 56 . . 2000 1 . . 4 . 2000 1 . . . 70 2000 1 . 23 . . 2001 1 . . 6 . 2001 1 . . . 63 2001 2 81 . . . 1998 2 . 47 . . 1998 2 . . 12 . 1998 2 . . . 44 1998 end // FIRST VERIFY THAT THE COUNT FOR EACH RACE // IS REPORTED CONSISTENTLY (OR ONLY ONCE) FOR // EACH SCHOOL-YEAR COMBINATION foreach v of varlist b c i w { by school_id year (`v'), sort: assert `v' == `v'[1] if !missing(`v') } // NOW COLLAPSE collapse (firstnm) b c i w, by(school_id year)

This is what a long layout of the same data would look like:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input float(school_id year) str1 race float count_ 1 2000 "b" 33 1 2000 "c" 56 1 2000 "i" 4 1 2000 "w" 70 1 2001 "b" . 1 2001 "c" 23 1 2001 "i" 6 1 2001 "w" 63 2 1998 "b" 81 2 1998 "c" 47 2 1998 "i" 12 2 1998 "w" 44 end

In fact, I don't know what you plan to do with this data, but if you intend further analysis in Stata, you would probably be better off having your data in this long layout than in the wide layout you requested. Most analyses in Stata are awkward or impossible in wide layout, but simple in long. There are exceptions, but they are not common. So, to get to the long layout, after you do the -collapse-, I recommend you also run:

Code:

rename (b c i w) count_= reshape long count_, i(school_id year) j(race) string

Your life will probably be easier going forward if you do.

By the way, in the future, please do not post data using HTML tables. Yours happened to be easy enough to bring into Stata, but, often they are not. And even then, they even leave unanswered important questions like whether a certain variable is a string or a numeric with value labels. So to assure that you provide those who want to help you with a completely faithful replica of the Stata example, use the -dataex- command (as I have done twice above.) To install -dataex-, run -ssc install dataex-. Then run -help dataex- to read the simple instructions for using it.
Comment
Matt Theron

Join Date: Apr 2017

Posts: 4
#3

03 Apr 2017, 05:16

Hi Clyde, thank you for the response. Apologies for wrong formatting - first time poster.

Your collapse command helped alot, my data now looks:
* Example generated by -dataex-. To install: ssc install dataex
clear
input long school_id byte year float(b c i w)
100000030 15 1 23 4 .
100000036 15 32 . . .
100000037 15 179 28 . .
100000038 15 3 4 1 76
100000054 15 . . . 11
100000055 15 48 61 11 3
100000056 15 25 12 . .
100000065 15 29 59 32 437
100000066 15 7 14 . 28
100000070 15 11 11 23 83
100000103 7 11 . . 118
100000103 8 14 19 1 116
100000103 9 13 12 1 113
100000103 10 10 17 . 107
100000103 11 16 24 1 108
100000103 13 16 37 4 143
100000103 14 11 40 1 163
100000103 15 11 35 . 174
101000045 8 3 3 . 79

For analysis I am trying to get each school_id to be a single row, so something like:
* Example generated by -dataex-. To install: ssc install dataex
clear
input float(school_id b_1 c_1 i_1 w_1 b_2 c_2 i_2 w_2)
55 48 61 5 60 50 57 6 67
56 20 15 2 50 22 17 4 53
65 33 40 10 55 30 42 9 56
66 60 10 3 . 58 11 5 .
70 . . . . 30 20 . .
103 44 32 3 14 . . . .

What could I do to shape the data as such?
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35694
#4

03 Apr 2017, 05:37

Please use the CODE identifiers that dataex automatically produces for you.

At a guess you want to collapse (total) b c i w, by(school_id)

where it seems that you want to combine different years.
Comment
Matt Theron

Join Date: Apr 2017

Posts: 4
#5

03 Apr 2017, 07:23

As per my second post:
input long school_id byte year float(b c i w)
100000030 15 1 23 4 .
100000036 15 32 . . .
100000037 15 179 28 . .
100000038 15 3 4 1 76
100000054 15 . . . 11
100000055 15 48 61 11 3
100000056 15 25 12 . .
100000065 15 29 59 32 437
100000066 15 7 14 . 28
100000070 15 11 11 23 83
100000103 7 11 . . 118
100000103 8 14 19 1 116
100000103 9 13 12 1 113
100000103 10 10 17 . 107
100000103 11 16 24 1 108
100000103 13 16 37 4 143
100000103 14 11 40 1 163
100000103 15 11 35 . 174
101000045 8 3 3 . 79

the data is setup as school_id year race(b c i w) - where a school_id will have many observations (rows) as there are years of data captured. I would like the data to be setup so that a school_id has instead the year and race variables combined as new variables (coloumns). E.g.:
school_id b_1 c_1 i_1 w_1 b_2 c_2 i_2 w_2 etc.

65 33 40 10 55 30 42 9 56

103 44 32 3 14 . . . .

.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35694
#6

03 Apr 2017, 07:39

Sorry, but this is no clearer to me. As Clyde pointed out already in #2 HTML tables are not as helpful as you might hope.

What's the correspondence between year (currently values like 8 ...15; presumably years 2008 ... 2015) and your variables? What's the significance of the subscripting 1 2 etc.? The numbers just seem made up as (e.g.) the values for school ...65 in the first dataset don't map on to the values in the second datset.

This may sound blunt or even brutal, but many of us keep half an eye on Statalist while doing our day jobs. A question has to be at least half clear to get much attention. If a thread makes little or no progress, one shrugs shoulders sadly and moves back to the task in hand.
Comment

school_id	b_1	c_1	i_1	w_1	b_2	c_2	i_2	w_2	etc.
65	33	40	10	55	30	42	9	56
103	44	32	3	14	.	.	.	.

Announcement

Reshaping - long to wide - "variable id does not uniquely identify the observations" error - how to fix?

Comment

Comment

Comment

Comment

Comment