Reshaping data

Ronny Kalinka

Join Date: Jan 2023

Posts: 6
#1

Reshaping data

18 Jan 2023, 00:33

Hey everyone,

I am new to Stata and despite intensive research in this forum I am facing an issue where I need some help.

I have a dataset with variables like state_1999 state_2001 state_2003 ... state_2019, family_id_1999 ... family_id_2019, College_1999 ... College_2019, you get the idea (total of 170+ variables).

state_1999 state_2001 ... state_2019 family_id_1999 ... family_id_2019 College...

Now I want to reshape the dataset so that I have a "year" variable and the variable "state", "family_id" and "College" (+ the others I didn't mention).

year state family_id College ...

I guess the way is to reshape long. But think one problem is that I do not have a variable that uniquely specifies the observations, but tbh I am not sure.

I would be so happy If someone could give me a hint how to tackle this issue.

Best regards,
Ronny
Tags: None

Andrew Musau

Join Date: Oct 2014
Posts: 10219

18 Jan 2023, 01:26

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str10(state_1999 state_2001 state_2019) str14(family_id_1999 family_id_2019) str12(College_1999 College_2019) str8(xxx_1999 xxx_2019)
"1" "2" "3" "4" "5" "6" "7" "8" "9"
end

*EXTRACT STUBNAMES
qui ds
local list= ustrregexra("`r(varlist)'", "_(\d{4})", "_")
local list: list uniq list
*RESHAPE
gen obs_no=_n
reshape long `list', i(obs_no) j(year)
*CLEAN UP
rename *_ *
drop obs_no

Res.:

Code:

. l

     +-----------------------------------------+
     | year   state   family~d   College   xxx |
     |-----------------------------------------|
  1. | 1999       1          4         6     8 |
  2. | 2001       2                            |
  3. | 2019       3          5         7     9 |
     +-----------------------------------------+

Last edited by Andrew Musau; 18 Jan 2023, 01:28.

Comment

William Lisowski

Join Date: Dec 2014

Posts: 10150
#3

18 Jan 2023, 01:27

You can add a variable that uniquely identifies observations. While it is a requirement for reshape (so you can tell where each observation in the long dataset originated) it does not have to be meaningful. Your code will be something like the following.

Code:

generate seq = _n reshape long state family_id College, i(seq) j(year)
1 like
Comment
Hemanshu Kumar

Join Date: Mar 2015

Posts: 1411
#4

18 Jan 2023, 01:33

Welcome to Statalist, Ronny!

You might want to try something like this:

Code:

gen `c(obs_t)' n = _n // creates a unique identifier for every row reshape long state_ family_id_ College_, i(n) j(year) rename *_ * drop n

In the future, it would be helpful for you to provide a data extract using the dataex command. For this and other tips on how to post questions effectively, see the Statalist FAQ, esp. #12.

Added: crossed with #2 and #3. All three solutions are similar, take your pick!

Last edited by Hemanshu Kumar; 18 Jan 2023, 01:37.
1 like
Comment
Ronny Kalinka

Join Date: Jan 2023

Posts: 6
#5

18 Jan 2023, 01:46

Thanks to you both for your quick answers! @Andrew: Your code worked perfectly for my situation. You saved me so much time and nerves, big thanks!
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30119
#6

18 Jan 2023, 10:27

Sorry for being late to the party. While the code in #2 worked (as would that in #3), the code in #4 is better. The reason is that -gen unique_identifier = _n- will, by default, create the unique_identifier as a float. If your data set is very large (more than about 16,000,000 observations), the required values at the end of the data set will be too large to fit in float storage and they will be truncated. As a result the identifiers will no longer be unique.

For a unique identifier, -gen long unique_identifier = _n- is safer, because -long- has enough available* bits to accommodate any possible value up to about 2 billion. It is rare to have a data set with more observations than that (and impossible in BE or SE flavors of Stata). But routinely using -long- can be wasteful of memory.

So by using -gen `c(obs_t)' unique_identifier = _n-, as recommended in #4, you get the best of both worlds. Stata will identify the smallest storage type that will accommodate the number of observations in the data set and use that for you.

(*long and float are both 4 bytes, but in float, one byte is lost to hold the exponent, so the largest integer that can be represented precisely is limited to 3 bytes).
Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

18 Jan 2023, 11:52

Here are the limits on storage of decimal integers with full accuracy in the various numeric storage types. The fixed-point variables lose the 27 largest positive values to missing value codes; the similar loss for floating point variables occurs only for the largest exponent, so it doesn't affect the much smaller integer values.

byte - 7 bits	-127	100
int - 15 bits	-32,767	32,740
long - 31 bits	-2,147,483,647	2,147,483,620
float - 24 bits	-16,777,216	16,777,216
double - 53 bits	-9,007,199,254,740,992	9,007,199,254,740,992

state_1999	state_2001	...	state_2019	family_id_1999	...	family_id_2019	College...

Announcement

Comment

Comment

Comment

Comment

Comment

Comment