Reshape large data set

Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#16

28 Apr 2018, 12:06

This one is a tad tricky, because the 1s and 2's also occur in the _2011 suffix, as well as in the infixed position where you want to isolate them. But it can be done as follows:

Code:

ds w1_* local stubs `r(varlist)' local stubs: subinstr local stubs "1_" "@_", all display `"`stubs'"' reshape long `stubs', i(pid) j(_j)

Presumably the 1, 2, 3, 4 represent something like year, or wave, or something like that, so you might want to pick a more informative name for that variable than _j. But as I didn't know what it is, I just went generic.
Comment

Aidan Horn

Join Date: Sep 2017
Posts: 30

#17

29 Apr 2018, 04:24

Dear Clyde

I didn't see the point of only working with the w1_ variables. Yes, the numbers correspond to waves. I have managed to achieve what I wanted to do, with your help. This is the code I used:

Code:

// First get all the variable names into a local macro
ds pid csm cluster wave_died, not
local stubs `r(varlist)'

//  Take the wave numbers out of the list of stubs
forvalues i = 1/4 { 
    local stubs: subinstr local stubs "w`i'_" "w@_", all
}

// Now eliminate duplicate mentions of the stubs
local stubs: list uniq stubs

// Pass the list of stubs to -reshape- (this takes half an hour on a fast machine)
reshape long `stubs', i(pid) j(Wave)

/* In order to rename w_* *, I need to drop the w_cluster variable, which 
doesn't give any more information than "cluster" */
count if cluster!=w_cluster & w_cluster!=. // 0
drop w_cluster
rename w_* *

The only command syntax I don't understand is the one for subinstr.

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#18

29 Apr 2018, 10:25

Thanks for posting the solution you used.

As for why I chose to work with only the w1* series, I assumed that all of the w2*, w3*, and w4* variables have a w1* counterpart. (And a weakness of the solution I proposed is that it will fail if this is not true.) By doing that, I avoid having to have

Code:

// Now eliminate duplicate mentions of the stubs local stubs: list uniq stubs

After all, those duplicates in the stubs macro come from the fact that for every w1* variable there is a corresponding w2* variable, etc.

Skipping that step of eliminating duplicate stubs is not a big advantage, however, and given that it relies on an assumption about the data, perhaps it is not as good a way to do it. If we were dealing with a data set with thousands of variables, then my approach (assuming the assumption is true) would perhaps be more efficient or avoid exceeding maximum length of a macro. In most situations, the two approaches are equivalent.

As for the -subinstr- step, one of the complications in your data is that the wave number appears as an infix, rather than a suffix in the variable names. -reshape- by default looks for the -j()- values as suffixes. Thus if your variables were abc1, abc2, abc3, etc. you would use -reshape long abc, i(whatever_identifiers) j(_j)-. But you have something more like a1bc a2bc a3bc. The @ symbol in the stublist of the comment tells Stata where to find the values of j(). So, in a short simple case, this latter case would be handled by -reshape long a@bc, i(whatever_identifiers) j(_j)-. What the -subinstr- step in the code does is place the @ symbol in the appropriate place in the stubs, to wit, where the 1_'s (or, for your code, 2_'s, 3_'s, and 4_'s are. ) After that step local macro stubs will expand to a suitable set of stubs that contain @ symbols at the location of the infixed wave numbers.
Comment
Abiodun Salako

Join Date: Feb 2017

Posts: 14
#19

13 Jan 2019, 23:38

I have a dataset (example below) with 51 observations (the 50 states and DC). The dataset has four variables: f00011, state, abbreviation and fullsopyear. The first three variables are respectively the state fips code (a unique state identifier), the state's name and an abbreviation of the state name. The last variable "fullsopyear" specifies the year when a state passed a law within the period 1970-2015 inclusive. For states that did not pass the law within that period, fullsopyear is coded as 0.

I would like to 'reshape' this dataset to the long format with a state-year observation for each of the 50 states and DC for the years 1970-2015 inclusive (i.e. 46 years) and create a variable "fullsop" that is a binary indicator of whether the law had been passed in that state-year i.e. for Alaska, fullsop=1 for years 1987-2015 and 0 for years 1970-1986 and for Alabama, fullsop=0 for all the years. I would like to avoid having to do this manually for each state so any help with a shorter code will be appreciated.

Thanks.

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input str2 f00011 str20 state str2 abbreviation str4 fullsopyear "01" "Alabama" "AL" "0" "02" "Alaska" "AK" "1987" "04" "Arizo." "AZ" "2000" "05" "Arkansas" "AR" "0" "06" "California" "CA" "0" "08" "Colorado" "CO" "2010" "09" "Connecticut" "CT" "2014" "10" "Delaware" "DE" "0" "11" "District of Columbia" "DC" "1995" "12" "Florida" "FL" "0" end label var f00011 "f00011" label var state "State" label var abbreviation "Abbreviation" label var fullsopyear " fullsop year "
Comment

David Benson

Join Date: Oct 2018
Posts: 489

#20

14 Jan 2019, 00:59

So you're going to use tsfill

Code:

destring fullsopyear, gen(year)
destring f00011, gen(id)
order id, first
replace year=2015 if year==.  // tsfill won't work if year is missing

* I added 2 obs at the end (so I have going from 1970 to 2015)
. list, noobs abbrev(14)

  +------------------------------------------------------------------------+
  | id   f00011                  state   abbreviation   fullsopyear   year |
  |------------------------------------------------------------------------|
  |  1       01                Alabama             AL             0   2015 |
  |  2       02                 Alaska             AK          1987   1987 |
  |  4       04                 Arizo.             AZ          2000   2000 |
  |  5       05               Arkansas             AR             0   2015 |
  |  6       06             California             CA             0   2015 |
  |------------------------------------------------------------------------|
  |  8       08               Colorado             CO          2010   2010 |
  |  9       09            Connecticut             CT          2014   2014 |
  | 10       10               Delaware             DE             0   2015 |
  | 11       11   District of Columbia             DC          1995   1995 |
  | 12       12                Florida             FL             0   2015 |
  |------------------------------------------------------------------------|
  | 98                           Bogus                                1970 |
  | 99                              B2                                2015 |
  +------------------------------------------------------------------------+

xtset id year  // have to xtset or tsset the data
tsfill, full  // this fills it in from 1970-2015

sort id state  // since blanks are first, this puts obs with state name filled in last [_N]
drop if inlist(id, 98, 99)
destring fullsopyear, gen(year_passed)

* Carrying down the data for all of the new observations
foreach v of varlist f00011 state abbreviation fullsopyear {
by id: replace `v' = `v'[_N] if `v'==""
}
by id: replace year_passed = year_passed[_N] if year_passed==.

gen had_passed=0
replace had_passed=1 if year>=year_passed & year_passed!=0
sort id year

. list id state fullsopyear year_passed year had_passed if year>=2010 & inrange(id,5,9), noobs sepby(id) abbrev(12)

  +------------------------------------------------------------------+
  | id         state   fullsopyear   year_passed   year   had_passed |
  |------------------------------------------------------------------|
  |  5      Arkansas             0             0   2010            0 |
  |  5      Arkansas             0             0   2011            0 |
  |  5      Arkansas             0             0   2012            0 |
  |  5      Arkansas             0             0   2013            0 |
  |  5      Arkansas             0             0   2014            0 |
  |  5      Arkansas             0             0   2015            0 |
  |------------------------------------------------------------------|
  |  6    California             0             0   2010            0 |
  |  6    California             0             0   2011            0 |
  |  6    California             0             0   2012            0 |
  |  6    California             0             0   2013            0 |
  |  6    California             0             0   2014            0 |
  |  6    California             0             0   2015            0 |
  |------------------------------------------------------------------|
  |  8      Colorado          2010          2010   2010            1 |
  |  8      Colorado          2010          2010   2011            1 |
  |  8      Colorado          2010          2010   2012            1 |
  |  8      Colorado          2010          2010   2013            1 |
  |  8      Colorado          2010          2010   2014            1 |
  |  8      Colorado          2010          2010   2015            1 |
  |------------------------------------------------------------------|
  |  9   Connecticut          2014          2014   2010            0 |
  |  9   Connecticut          2014          2014   2011            0 |
  |  9   Connecticut          2014          2014   2012            0 |
  |  9   Connecticut          2014          2014   2013            0 |
  |  9   Connecticut          2014          2014   2014            1 |
  |  9   Connecticut          2014          2014   2015            1 |
  +------------------------------------------------------------------+

Comment

Abiodun Salako

Join Date: Feb 2017
Posts: 14

#21

14 Jan 2019, 13:45

Originally posted by David Benson View Post

So you're going to use tsfill

Code:

destring fullsopyear, gen(year)
destring f00011, gen(id)
order id, first
replace year=2015 if year==. // tsfill won't work if year is missing

* I added 2 obs at the end (so I have going from 1970 to 2015)
. list, noobs abbrev(14)

+------------------------------------------------------------------------+
| id f00011 state abbreviation fullsopyear year |
|------------------------------------------------------------------------|
| 1 01 Alabama AL 0 2015 |
| 2 02 Alaska AK 1987 1987 |
| 4 04 Arizo. AZ 2000 2000 |
| 5 05 Arkansas AR 0 2015 |
| 6 06 California CA 0 2015 |
|------------------------------------------------------------------------|
| 8 08 Colorado CO 2010 2010 |
| 9 09 Connecticut CT 2014 2014 |
| 10 10 Delaware DE 0 2015 |
| 11 11 District of Columbia DC 1995 1995 |
| 12 12 Florida FL 0 2015 |
|------------------------------------------------------------------------|
| 98 Bogus 1970 |
| 99 B2 2015 |
+------------------------------------------------------------------------+

xtset id year // have to xtset or tsset the data
tsfill, full // this fills it in from 1970-2015

sort id state // since blanks are first, this puts obs with state name filled in last [_N]
drop if inlist(id, 98, 99)
destring fullsopyear, gen(year_passed)

* Carrying down the data for all of the new observations
foreach v of varlist f00011 state abbreviation fullsopyear {
by id: replace `v' = `v'[_N] if `v'==""
}
by id: replace year_passed = year_passed[_N] if year_passed==.

gen had_passed=0
replace had_passed=1 if year>=year_passed & year_passed!=0
sort id year

. list id state fullsopyear year_passed year had_passed if year>=2010 & inrange(id,5,9), noobs sepby(id) abbrev(12)

+------------------------------------------------------------------+
| id state fullsopyear year_passed year had_passed |
|------------------------------------------------------------------|
| 5 Arkansas 0 0 2010 0 |
| 5 Arkansas 0 0 2011 0 |
| 5 Arkansas 0 0 2012 0 |
| 5 Arkansas 0 0 2013 0 |
| 5 Arkansas 0 0 2014 0 |
| 5 Arkansas 0 0 2015 0 |
|------------------------------------------------------------------|
| 6 California 0 0 2010 0 |
| 6 California 0 0 2011 0 |
| 6 California 0 0 2012 0 |
| 6 California 0 0 2013 0 |
| 6 California 0 0 2014 0 |
| 6 California 0 0 2015 0 |
|------------------------------------------------------------------|
| 8 Colorado 2010 2010 2010 1 |
| 8 Colorado 2010 2010 2011 1 |
| 8 Colorado 2010 2010 2012 1 |
| 8 Colorado 2010 2010 2013 1 |
| 8 Colorado 2010 2010 2014 1 |
| 8 Colorado 2010 2010 2015 1 |
|------------------------------------------------------------------|
| 9 Connecticut 2014 2014 2010 0 |
| 9 Connecticut 2014 2014 2011 0 |
| 9 Connecticut 2014 2014 2012 0 |
| 9 Connecticut 2014 2014 2013 0 |
| 9 Connecticut 2014 2014 2014 1 |
| 9 Connecticut 2014 2014 2015 1 |
+------------------------------------------------------------------+

Thanks. This worked. The only change I needed to make was to change:

Code:

 replace year=2015 if year==.

Code:

 replace year=2015 if year==0

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#22

16 Jan 2019, 09:19

So this is made more complicated by the -j()- part of the variable name being at the beginning rather than the end. Also, you probably don't want to have all these variables end up with names beginning with w_ after the -reshape-. So you have to do a little juggling:

Code:

ds w1_* local to_lengthen `r(varlist)' local to_lengthen: subinstr local to_lengthen "w1_" "@", all reshape long `to_lengthen', i(pid) j(wave) string destring wave, ignore("w_") replace

Note: This code assumes that all of the variables involved are instantiated with the w1_ prefix. If there are some that occur, say, with w2_ but not w1_ then this code will miss them. Post back with examples if that is the case.
Comment
Angela Jiang

Join Date: Feb 2023

Posts: 22
#23

18 Sep 2023, 18:52

Hi all,

I am trying to use the code Clyde posted in his initial reply to this post to reshape a dataset, but I can't seem to get the "peeling" part to work. stubs before and after the peeling looked the same, my variable list looks like "resident1 ageykid1 ageokid1 num_son1 num_dau1 num_kids_10miles1 num_kid_workft1 num_kid_workpt1 num_kids_resident2 ageykid2 ageokid2...". Also it seems that local stubs r(varlist) works but local stubs `r(varlist)' did not work for me for getting a list of variables.

// FIRST GET ALL THE VARIABLE NAMES INTO A LOCAL MACRO
ds hhidpn, not
local stubs r(varlist)

// NOW PEEL OFF THE YEARS AT THE END
forvalues i = 1/14 { // OR WHATEVER YOUR RANGE OF YEARS IS
local stubs: subinstr local stubs "`i'" "", all
}

local stubs: list uniq stubs
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#24

19 Sep 2023, 09:50

-local stubs r(varlist)- is definitely wrong: it puts the string "r(varlist)" into local macro stubs, not any variable names. From there all is lost.

In any case, because some of your variables, e.g. num_kids_10miles1, contain embedded numbers, not just numbers at the end, the approach outlined earlier in this thread will fail anyway, because when you get to i = 10, the 10 in the middle will get deleted. So a different way is needed. Try this:

Code:

* Example generated by -dataex-. For more info, type help dataex clear input float(resident1 ageykid1 ageokid1 num_son1 num_dau1 num_kids_10miles1 num_kid_workft1 num_kid_workpt1 num_kids_resident2 ageykid2 ageokid2 hhid) .3488717 .028556867 .07110509 .875991 .5844658 .3913819 .6950234 .4548882 .9748848 .7459667 .859742 1 .2668857 .8689333 .32336795 .20470947 .3697791 .11966132 .6866152 .0674011 .7264384 .4961259 .13407555 2 .1366463 .3508549 .5551032 .8927587 .8506309 .7542434 .9319346 .3379889 .04541512 .7167162 .48844185 3 end rename (*#) *_# ds hhid, not local stubs `r(varlist)' local stubs `"`stubs' "' // NEED EXTRA SPACE AT END forvalues i = 1/2 { // REPlACE 2 BY ACTUAL LARGEST TERMINATING NUMBER local stubs: subinstr local stubs "_`i' " "_ ", all display `"``i': `stubs''"' } local stubs: list uniq stubs reshape long `stubs', i(hhid) j(series) rename (*_) *

Note: If the *1, *2, *3, .... stubs are all the same, or if there is one value of n such that *n covers all of the stubs, then a simpler approach can be used (replacing 1 by the number that actually covers all of the stubs):

Code:

ds *1 local stubs `r(varlist)' local stubs `"`stubs' "' local stubs: subinstr local stubs "1 " " ", all reshape long `stubs', i(hhid) j(series)

By the way, it is not clear to me why you said that -local stubs `r(varlist)'- didn't work for you. I cannot see how this command could have gone wrong. But let me just issue one caution to you. Do not attempt to run the code shown in this post one line at a time, or in blocks. You must run the whole thing uninterrupted from beginning to end. Failure to do that will result in local macros disappearing at each interruption, so that subsequent references to them will trigger errors.
1 like
Comment
Angela Jiang

Join Date: Feb 2023

Posts: 22
#25

19 Sep 2023, 15:26

Hi Clyde,

Thanks again for the clarification. By not working I was forgetting to wrap my list in "" when I was trying to display the content of stubs, which resulted in a syntax error, as it is a list and not a single item. Thanks for the new code suggestion too. I ended up just changing the 10 in my variable name to ten because there were just a few variables with that problem. But will keep your suggestion in mind for future reference.

Best,
Angela
Comment

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment