Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Reshaping - long to wide - "variable id does not uniquely identify the observations" error - how to fix?

    Hi there I am using Stata13 on Windows 10 and trying to convert my data from long to wide format.

    I am busy working with a panel dataset which has racial (black, indian, white, etc.) demographics for schools over a period 1996-2011 (year). I am needing to look at racial confirgurations of schools over the period to track possible racial integration. The main variables I am working with, and trying to reshape, are school_id (unique school identifier); dummy variables for each race group (b c i w) - their numeric value representing the total learners of that racial group for the specific year; and lastly year (categorical variable).

    I have run the following command: reshape wide b c i w, i(school_id) j(year)... I get:

    values of variable year not unique within school_id
    Your data are currently long. You are performing a reshape wide. You specified i(school_id) and j(year). There are observations
    within i(school_id) with the same value of j(year). In the long data, variables i() and j() together must uniquely identify the
    observations.

    long wide
    +---------------+ +------------------+
    | i j a b | | i a1 a2 b1 b2 |
    |---------------| <--- reshape ---> |------------------|
    | 1 1 1 2 | | 1 1 3 2 4 |
    | 1 2 3 4 | | 2 5 7 6 8 |
    | 2 1 5 6 | +------------------+
    | 2 2 7 8 |
    +---------------+
    Type reshape error for a list of the problem variables.


    However the duplicates of year within school_id are valid because for each year there are various grades within a specific school, and data on the quantity of each race group for that grade. I have run the following code to aggregate my data for a race group across grades for a specific year for each school_id: by school_id race year, sort: egen totalsum = total (quantity). Hence the data is structured as follows: e.g.
    school_id b c i w year
    1 33 . . . 2000
    1 . 56 . . 2000
    1 . . 4 . 2000
    1 . . . 70 2000
    1 . 23 . . 2001
    1 . . 6 . 2001
    1 . . . 63 2001
    2 81 . . . 1998
    2 . 47 . . 1998
    2 . . 12 . 1998
    2 . . . 44 1998
    etc.

    How could i structure my data (what code should i use) so that i can have the racial information for a given year for a school_id on a single line? As below:
    school_id b c i w year
    1 33 56 4 70 2000
    1 . 23 6 63 2001
    2 81 47 12 44 1998
    Any help and suggestions would be much appreciated. Thank you in advance

  • #2
    Given the layout of the data set you are starting with (which is not in long layout), this is not a -reshape- problem at all:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float(school_id b c i w year)
    1 33  .  .  . 2000
    1  . 56  .  . 2000
    1  .  .  4  . 2000
    1  .  .  . 70 2000
    1  . 23  .  . 2001
    1  .  .  6  . 2001
    1  .  .  . 63 2001
    2 81  .  .  . 1998
    2  . 47  .  . 1998
    2  .  . 12  . 1998
    2  .  .  . 44 1998
    end
    
    //    FIRST VERIFY THAT THE COUNT FOR EACH RACE
    //    IS REPORTED CONSISTENTLY (OR ONLY ONCE) FOR
    //    EACH SCHOOL-YEAR COMBINATION
    foreach v of varlist b c i w {
        by school_id year (`v'), sort: assert `v' == `v'[1] if !missing(`v')
    }
    
    //    NOW COLLAPSE
    collapse (firstnm) b c i w, by(school_id year)
    This is what a long layout of the same data would look like:
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float(school_id year) str1 race float count_
    1 2000 "b" 33
    1 2000 "c" 56
    1 2000 "i"  4
    1 2000 "w" 70
    1 2001 "b"  .
    1 2001 "c" 23
    1 2001 "i"  6
    1 2001 "w" 63
    2 1998 "b" 81
    2 1998 "c" 47
    2 1998 "i" 12
    2 1998 "w" 44
    end
    In fact, I don't know what you plan to do with this data, but if you intend further analysis in Stata, you would probably be better off having your data in this long layout than in the wide layout you requested. Most analyses in Stata are awkward or impossible in wide layout, but simple in long. There are exceptions, but they are not common. So, to get to the long layout, after you do the -collapse-, I recommend you also run:

    Code:
    rename (b c i w) count_=
    reshape long count_, i(school_id year) j(race) string
    Your life will probably be easier going forward if you do.

    By the way, in the future, please do not post data using HTML tables. Yours happened to be easy enough to bring into Stata, but, often they are not. And even then, they even leave unanswered important questions like whether a certain variable is a string or a numeric with value labels. So to assure that you provide those who want to help you with a completely faithful replica of the Stata example, use the -dataex- command (as I have done twice above.) To install -dataex-, run -ssc install dataex-. Then run -help dataex- to read the simple instructions for using it.

    Comment


    • #3
      Hi Clyde, thank you for the response. Apologies for wrong formatting - first time poster.

      Your collapse command helped alot, my data now looks:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input long school_id byte year float(b c i w)
      100000030 15 1 23 4 .
      100000036 15 32 . . .
      100000037 15 179 28 . .
      100000038 15 3 4 1 76
      100000054 15 . . . 11
      100000055 15 48 61 11 3
      100000056 15 25 12 . .
      100000065 15 29 59 32 437
      100000066 15 7 14 . 28
      100000070 15 11 11 23 83
      100000103 7 11 . . 118
      100000103 8 14 19 1 116
      100000103 9 13 12 1 113
      100000103 10 10 17 . 107
      100000103 11 16 24 1 108
      100000103 13 16 37 4 143
      100000103 14 11 40 1 163
      100000103 15 11 35 . 174
      101000045 8 3 3 . 79

      For analysis I am trying to get each school_id to be a single row, so something like:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input float(school_id b_1 c_1 i_1 w_1 b_2 c_2 i_2 w_2)
      55 48 61 5 60 50 57 6 67
      56 20 15 2 50 22 17 4 53
      65 33 40 10 55 30 42 9 56
      66 60 10 3 . 58 11 5 .
      70 . . . . 30 20 . .
      103 44 32 3 14 . . . .

      What could I do to shape the data as such?



      Comment


      • #4
        Please use the CODE identifiers that dataex automatically produces for you.

        At a guess you want to collapse (total) b c i w, by(school_id)

        where it seems that you want to combine different years.

        Comment


        • #5
          As per my second post:
          input long school_id byte year float(b c i w)
          100000030 15 1 23 4 .
          100000036 15 32 . . .
          100000037 15 179 28 . .
          100000038 15 3 4 1 76
          100000054 15 . . . 11
          100000055 15 48 61 11 3
          100000056 15 25 12 . .
          100000065 15 29 59 32 437
          100000066 15 7 14 . 28
          100000070 15 11 11 23 83
          100000103 7 11 . . 118
          100000103 8 14 19 1 116
          100000103 9 13 12 1 113
          100000103 10 10 17 . 107
          100000103 11 16 24 1 108
          100000103 13 16 37 4 143
          100000103 14 11 40 1 163
          100000103 15 11 35 . 174
          101000045 8 3 3 . 79

          the data is setup as school_id year race(b c i w) - where a school_id will have many observations (rows) as there are years of data captured. I would like the data to be setup so that a school_id has instead the year and race variables combined as new variables (coloumns). E.g.:
          school_id b_1 c_1 i_1 w_1 b_2 c_2 i_2 w_2 etc.
          65 33 40 10 55 30 42 9 56
          103 44 32 3 14 . . . .
          .

          Comment


          • #6
            Sorry, but this is no clearer to me. As Clyde pointed out already in #2 HTML tables are not as helpful as you might hope.

            What's the correspondence between year (currently values like 8 ...15; presumably years 2008 ... 2015) and your variables? What's the significance of the subscripting 1 2 etc.? The numbers just seem made up as (e.g.) the values for school ...65 in the first dataset don't map on to the values in the second datset.

            This may sound blunt or even brutal, but many of us keep half an eye on Statalist while doing our day jobs. A question has to be at least half clear to get much attention. If a thread makes little or no progress, one shrugs shoulders sadly and moves back to the task in hand.


            Comment

            Working...
            X