Reshaping a large dataset with awkward variale names

Julian Rebien

Join Date: Feb 2023

Posts: 3
#1

Reshaping a large dataset with awkward variale names

13 Feb 2023, 11:47

Hello Everyone,

I am currently working with a rather lagre Panel Dataset (n = 35000, 15 waves, ~400 Variables per Wave). The dataset is in a wide format so I want to reshape into long.
The problem is the names of the variables. They follow this scheme: kpX_1234
kpX is the prefix to determine the Wave number. Whereas the four following numbers identify the specific variable.

I tried reshaping with the following code:

reshape long kp1 kp2 kp3 ... kpX, i(id) j(wave)

But this creates a subobservation identifier (j) for every variable due to the way the variables are named. Due to the volume of Variable I would obviously like to avoid having to enter them all individually.
I thought about changing the variable names to 1234_kpX but this would still leave me having to enter every variable at a time.

Is there any quick way of reshaping the dataset?

Thank you,
Julian
Tags: data, label, panel data, reshape

William Lisowski

Join Date: Dec 2014
Posts: 10150

13 Feb 2023, 12:55

I think this example may start you in a useful direction.

Code:

ds
// swap the location of the wave and the variable
rename (kp#_#) (kp(#)[2]_(#)[1])
ds
unab kps : kp*
local kps = ustrregexra("`kps'","_\d+","_")
local kps : list uniq kps
macro list _kps
reshape long `kps', i(id) j(wave)
rename (kp*_) (kp*)
ds
xtset id wave

Code:

. ds
id        kp1_1002  kp1_1004  kp2_1001  kp2_1003  kp2_1005  kp3_1002  kp3_1004
kp1_1001  kp1_1003  kp1_1005  kp2_1002  kp2_1004  kp3_1001  kp3_1003  kp3_1005

. // swap the location of the wave and the variable
. rename (kp#_#) (kp(#)[2]_(#)[1])

. ds
id        kp1002_1  kp1004_1  kp1001_2  kp1003_2  kp1005_2  kp1002_3  kp1004_3
kp1001_1  kp1003_1  kp1005_1  kp1002_2  kp1004_2  kp1001_3  kp1003_3  kp1005_3

. unab kps : kp*

. local kps = ustrregexra("`kps'","_\d+","_")

. local kps : list uniq kps

. macro list _kps
_kps:           kp1001_ kp1002_ kp1003_ kp1004_ kp1005_

. reshape long `kps', i(id) j(wave)
(j = 1 2 3)

Data                               Wide   ->   Long
-----------------------------------------------------------------------------
Number of observations                2   ->   6           
Number of variables                  16   ->   7           
j variable (3 values)                     ->   wave
xij variables:
             kp1001_1 kp1001_2 kp1001_3   ->   kp1001_
             kp1002_1 kp1002_2 kp1002_3   ->   kp1002_
             kp1003_1 kp1003_2 kp1003_3   ->   kp1003_
             kp1004_1 kp1004_2 kp1004_3   ->   kp1004_
             kp1005_1 kp1005_2 kp1005_3   ->   kp1005_
-----------------------------------------------------------------------------

. rename (kp*_) (kp*)

. ds
id      wave    kp1001  kp1002  kp1003  kp1004  kp1005

. xtset id wave

Panel variable: id (strongly balanced)
 Time variable: wave, 1 to 3
         Delta: 1 unit

.

Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30357

13 Feb 2023, 12:58

Here is a toy data set that, to the extent I understand your description, is a miniature version of yours. The code is somewhat complicated because it allows for the possibility that some variables appear in only some waves. (There is a much simpler approach if the variables are exactly the same for all waves.)

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input int id float(kp1_01 kp1_02 kp1_03 kp2_01 kp2_02 kp3_01 kp3_02 kp3_03 kp4_01 kp4_02 kp4_03 kp5_01 kp5_02 kp5_03 kp5_04)
 1   -.4868504   -.6211073   .4324252  1.4292794    1.418919 -1.0091912  -.6478488   .4354094  -.6658268   -1.29429 -1.6691283    .9876445 -.04783701  .13352603   -.962458
 2  -1.4760038   .13218145 -2.3337057    -.77772   .05279221   .5900462  -.8133085  1.6562247  -.9548236    .691761  -1.303175    1.806899 -1.1514199  1.5513942  -.6238794
 3 -.010911492   2.0378487  -.7645378   .3661323  -.14697625   .5971413  -.3801724  .41377705  -.1895006  1.1144592 -.03385835  -.54118717  -.3051975  -1.049907  -.4821892
 4 -.013760252   1.1648929  1.0160784  -.1939826    .8652378 -1.6475668  1.0691564  1.4551728 -1.4043055 -2.3067133   .4437251    .3597733  1.5742495   .9167848  1.6736826
 5    .5155788   1.0495144  .18720382  1.6394366   -.7876329  -1.148713   .9284888   .6767642   .6313429 -.29211426  1.0566341   -1.755562   .3703801   2.520513  .20477623
 6  -.13912173    .3293923  -.3145866  -.3533281  -1.6520803 -.05945954 -.26202136  -2.445083 -1.0745265   1.810242  .56974715    .4301121  -.2826256   .7168143 -.29396102
 7   1.2873173    .7287014   .4677534   -.149388 -.029363904 -.05453772   .8345145   .6460763    .680746  .28123158  -.5975326 -.005834702  -.8274989  -.2430358  2.2851815
 8  -.58777297    .8152171   .8098378  -.7269634  -2.2438629 -.16784404 .009923752    .840189  -.4044435  2.1420016   .7723381    .4447461  -.1531042  -.7514588  -1.851878
 9 -.030600455    .8828248  -.5765787  -.4719959   -.2996832 -.56888497    1.29752  1.0651693   1.618652  -.5647308   -.496987   -.0561219  -1.473499  1.2691363  1.0321714
10  -2.1968436 -.021391753 -1.3777484  .58884954   -.4705126   .2996863   1.389477 -1.3293905  .06572019   .4935814    .266472   -.6170233     .17205  -.4231777  -.8290966
11   -.9155009    .9599126  .04223201  .23385265 .0023159687 -.12103387   -.584632   .5391571  .24641147  -.9057139  -.5198731   .28306606  -.3384258  2.2876053   .8863388
12   -.2230017    .6054105  1.1632676  -.6359429    .6715717 -2.0140722  -.4651928  -.2333257 -1.3160325  -.8274933  -.9717938   -.1713765   .8923253   .4719555  1.7225384
13    .4945496   -.6521159 .014696945  .35979015   -1.609707  -.5642818  .08011564   .5968156  .09929268   1.306392 -1.4420376  -1.0013876  -.1564685 .070919015  -1.834222
14  -.26231077    .7883045 .000711149  1.0419201   -.6880803 -1.2260357   .3401581  -.2264112  1.4602727 -1.0972396  -.4132413  -1.5947353  -2.527048  1.7420542 -1.0552567
15    .3661239   -2.079254  -.6886311 -1.0310671  -.54471886   .6767082  -.6534075   .6107397 -1.0000384 -.08958297   .3053431  -.37762335     1.7379 -2.9995944   .7526055
16    -.962465    .4353854   .6953909  .22454466   -.7130303  .18569472  -.2482205  -.2132934 -1.1054157 -.19497195 -2.0184422   1.1813127   1.439231    .970898 -1.4356583
17   .25329232    .8039326  -.0475542  -3.217909    1.506732   2.091645   .3127741 -1.0356842   -.244828 -.04305877  .15496953   .02744569   .5860421  .50971663  -.9005138
18   2.2842488   1.9533952  .46003765 -.13631526    .9514928  -.8494245   .6407731  -.3206576   -.903191   .6428951  .53760386   1.3699054  -.5978242   .1706619   .5247311
19    .3223809  -1.2994204   .6942165   .3730735     .190155  -.8536293 -2.6800895  1.0551893 -.12321397   .3994437  1.7923543   -.8648146  2.1030679   .2856318   -1.22212
20  -.40590465  -.12833437  -.6036441  -.4330619   1.2029504   .5746796    .631211 -.03979383   .7894307  -.2371126   -.736102   1.1558506 -1.8836583  -.7919921 -1.7791284
end


rename (kp#_#) v#[2]_#[1]
ds v*
local stubs `r(varlist)'
foreach s of local stubs {
    local breakpoint = strpos("`s'", "_")
    local short_s = substr("`s'", 1, `breakpoint')
    local stubs: subinstr local stubs "`s'" "`short_s'"
}
local stubs: list uniq stubs

reshape long `stubs', i(id) j(wave)
rename *_ *

The code will work equally well regardless the number of distinct variables or waves. All that is required is that the variable names in question be of the form kp#_#, where the first # is the wave number and the second one is the variable number. The end result is a long data set, with variables named v#, where # is the second # from the original variable names.

Added: Crossed with #2. The solutions given are effectively the same. The code in #2 uses regular expressions, which simplifies some of the fiddling around with the variable names.

Last edited by Clyde Schechter; 13 Feb 2023, 13:06.

Comment

Daniel Shin

Join Date: Mar 2020

Posts: 146
#4

13 Feb 2023, 12:59

Maybe there is a more elegant solution, but try this:

Code:

reshape long kp1_ kp2_ kp3_, i(id) j(var) ren kp*_ kp* reshape long kp, i(id var) j(wave) reshape wide kp, i(id wave) j(var)
Comment
Julian Rebien

Join Date: Feb 2023

Posts: 3
#5

14 Feb 2023, 03:51

Thank you!,
your code helped me a lot in renaming my variables. they are no named like this: kp_1234_x
However when I try to run the reshape STATA says that a lot of the variables are "already defined". I have no idea why that is happening.
Can anybody help me with that?
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#6

14 Feb 2023, 08:22

You apparently changed the renamed variables by adding an underscore after the kp.

The code in both post #2 and post #3 assume the variable names have a single underscore in them.

For what it's worth, I now prefer the approach in post #3 of renaming the variables from beginning with kp to beginning with v, so that if there are other variables beginning with kp (say, identifiers that are the same from wave to wave) the reshape will not risk confusing them.

Last edited by William Lisowski; 14 Feb 2023, 08:26.
Comment

Announcement