Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Reshaping a large dataset with awkward variale names

    Hello Everyone,

    I am currently working with a rather lagre Panel Dataset (n = 35000, 15 waves, ~400 Variables per Wave). The dataset is in a wide format so I want to reshape into long.
    The problem is the names of the variables. They follow this scheme: kpX_1234
    kpX is the prefix to determine the Wave number. Whereas the four following numbers identify the specific variable.


    I tried reshaping with the following code:

    reshape long kp1 kp2 kp3 ... kpX, i(id) j(wave)

    But this creates a subobservation identifier (j) for every variable due to the way the variables are named. Due to the volume of Variable I would obviously like to avoid having to enter them all individually.
    I thought about changing the variable names to 1234_kpX but this would still leave me having to enter every variable at a time.

    Is there any quick way of reshaping the dataset?

    Thank you,
    Julian

  • #2
    I think this example may start you in a useful direction.
    Code:
    ds
    // swap the location of the wave and the variable
    rename (kp#_#) (kp(#)[2]_(#)[1])
    ds
    unab kps : kp*
    local kps = ustrregexra("`kps'","_\d+","_")
    local kps : list uniq kps
    macro list _kps
    reshape long `kps', i(id) j(wave)
    rename (kp*_) (kp*)
    ds
    xtset id wave
    Code:
    . ds
    id        kp1_1002  kp1_1004  kp2_1001  kp2_1003  kp2_1005  kp3_1002  kp3_1004
    kp1_1001  kp1_1003  kp1_1005  kp2_1002  kp2_1004  kp3_1001  kp3_1003  kp3_1005
    
    . // swap the location of the wave and the variable
    . rename (kp#_#) (kp(#)[2]_(#)[1])
    
    . ds
    id        kp1002_1  kp1004_1  kp1001_2  kp1003_2  kp1005_2  kp1002_3  kp1004_3
    kp1001_1  kp1003_1  kp1005_1  kp1002_2  kp1004_2  kp1001_3  kp1003_3  kp1005_3
    
    . unab kps : kp*
    
    . local kps = ustrregexra("`kps'","_\d+","_")
    
    . local kps : list uniq kps
    
    . macro list _kps
    _kps:           kp1001_ kp1002_ kp1003_ kp1004_ kp1005_
    
    . reshape long `kps', i(id) j(wave)
    (j = 1 2 3)
    
    Data                               Wide   ->   Long
    -----------------------------------------------------------------------------
    Number of observations                2   ->   6           
    Number of variables                  16   ->   7           
    j variable (3 values)                     ->   wave
    xij variables:
                 kp1001_1 kp1001_2 kp1001_3   ->   kp1001_
                 kp1002_1 kp1002_2 kp1002_3   ->   kp1002_
                 kp1003_1 kp1003_2 kp1003_3   ->   kp1003_
                 kp1004_1 kp1004_2 kp1004_3   ->   kp1004_
                 kp1005_1 kp1005_2 kp1005_3   ->   kp1005_
    -----------------------------------------------------------------------------
    
    . rename (kp*_) (kp*)
    
    . ds
    id      wave    kp1001  kp1002  kp1003  kp1004  kp1005
    
    . xtset id wave
    
    Panel variable: id (strongly balanced)
     Time variable: wave, 1 to 3
             Delta: 1 unit
    
    .

    Comment


    • #3
      Here is a toy data set that, to the extent I understand your description, is a miniature version of yours. The code is somewhat complicated because it allows for the possibility that some variables appear in only some waves. (There is a much simpler approach if the variables are exactly the same for all waves.)

      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input int id float(kp1_01 kp1_02 kp1_03 kp2_01 kp2_02 kp3_01 kp3_02 kp3_03 kp4_01 kp4_02 kp4_03 kp5_01 kp5_02 kp5_03 kp5_04)
       1   -.4868504   -.6211073   .4324252  1.4292794    1.418919 -1.0091912  -.6478488   .4354094  -.6658268   -1.29429 -1.6691283    .9876445 -.04783701  .13352603   -.962458
       2  -1.4760038   .13218145 -2.3337057    -.77772   .05279221   .5900462  -.8133085  1.6562247  -.9548236    .691761  -1.303175    1.806899 -1.1514199  1.5513942  -.6238794
       3 -.010911492   2.0378487  -.7645378   .3661323  -.14697625   .5971413  -.3801724  .41377705  -.1895006  1.1144592 -.03385835  -.54118717  -.3051975  -1.049907  -.4821892
       4 -.013760252   1.1648929  1.0160784  -.1939826    .8652378 -1.6475668  1.0691564  1.4551728 -1.4043055 -2.3067133   .4437251    .3597733  1.5742495   .9167848  1.6736826
       5    .5155788   1.0495144  .18720382  1.6394366   -.7876329  -1.148713   .9284888   .6767642   .6313429 -.29211426  1.0566341   -1.755562   .3703801   2.520513  .20477623
       6  -.13912173    .3293923  -.3145866  -.3533281  -1.6520803 -.05945954 -.26202136  -2.445083 -1.0745265   1.810242  .56974715    .4301121  -.2826256   .7168143 -.29396102
       7   1.2873173    .7287014   .4677534   -.149388 -.029363904 -.05453772   .8345145   .6460763    .680746  .28123158  -.5975326 -.005834702  -.8274989  -.2430358  2.2851815
       8  -.58777297    .8152171   .8098378  -.7269634  -2.2438629 -.16784404 .009923752    .840189  -.4044435  2.1420016   .7723381    .4447461  -.1531042  -.7514588  -1.851878
       9 -.030600455    .8828248  -.5765787  -.4719959   -.2996832 -.56888497    1.29752  1.0651693   1.618652  -.5647308   -.496987   -.0561219  -1.473499  1.2691363  1.0321714
      10  -2.1968436 -.021391753 -1.3777484  .58884954   -.4705126   .2996863   1.389477 -1.3293905  .06572019   .4935814    .266472   -.6170233     .17205  -.4231777  -.8290966
      11   -.9155009    .9599126  .04223201  .23385265 .0023159687 -.12103387   -.584632   .5391571  .24641147  -.9057139  -.5198731   .28306606  -.3384258  2.2876053   .8863388
      12   -.2230017    .6054105  1.1632676  -.6359429    .6715717 -2.0140722  -.4651928  -.2333257 -1.3160325  -.8274933  -.9717938   -.1713765   .8923253   .4719555  1.7225384
      13    .4945496   -.6521159 .014696945  .35979015   -1.609707  -.5642818  .08011564   .5968156  .09929268   1.306392 -1.4420376  -1.0013876  -.1564685 .070919015  -1.834222
      14  -.26231077    .7883045 .000711149  1.0419201   -.6880803 -1.2260357   .3401581  -.2264112  1.4602727 -1.0972396  -.4132413  -1.5947353  -2.527048  1.7420542 -1.0552567
      15    .3661239   -2.079254  -.6886311 -1.0310671  -.54471886   .6767082  -.6534075   .6107397 -1.0000384 -.08958297   .3053431  -.37762335     1.7379 -2.9995944   .7526055
      16    -.962465    .4353854   .6953909  .22454466   -.7130303  .18569472  -.2482205  -.2132934 -1.1054157 -.19497195 -2.0184422   1.1813127   1.439231    .970898 -1.4356583
      17   .25329232    .8039326  -.0475542  -3.217909    1.506732   2.091645   .3127741 -1.0356842   -.244828 -.04305877  .15496953   .02744569   .5860421  .50971663  -.9005138
      18   2.2842488   1.9533952  .46003765 -.13631526    .9514928  -.8494245   .6407731  -.3206576   -.903191   .6428951  .53760386   1.3699054  -.5978242   .1706619   .5247311
      19    .3223809  -1.2994204   .6942165   .3730735     .190155  -.8536293 -2.6800895  1.0551893 -.12321397   .3994437  1.7923543   -.8648146  2.1030679   .2856318   -1.22212
      20  -.40590465  -.12833437  -.6036441  -.4330619   1.2029504   .5746796    .631211 -.03979383   .7894307  -.2371126   -.736102   1.1558506 -1.8836583  -.7919921 -1.7791284
      end
      
      
      rename (kp#_#) v#[2]_#[1]
      ds v*
      local stubs `r(varlist)'
      foreach s of local stubs {
          local breakpoint = strpos("`s'", "_")
          local short_s = substr("`s'", 1, `breakpoint')
          local stubs: subinstr local stubs "`s'" "`short_s'"
      }
      local stubs: list uniq stubs
      
      reshape long `stubs', i(id) j(wave)
      rename *_ *
      The code will work equally well regardless the number of distinct variables or waves. All that is required is that the variable names in question be of the form kp#_#, where the first # is the wave number and the second one is the variable number. The end result is a long data set, with variables named v#, where # is the second # from the original variable names.

      Added: Crossed with #2. The solutions given are effectively the same. The code in #2 uses regular expressions, which simplifies some of the fiddling around with the variable names.
      Last edited by Clyde Schechter; 13 Feb 2023, 13:06.

      Comment


      • #4
        Maybe there is a more elegant solution, but try this:

        Code:
        reshape long kp1_ kp2_ kp3_, i(id) j(var)
        ren kp*_ kp*
        reshape long kp, i(id var) j(wave)
        reshape wide kp, i(id wave) j(var)

        Comment


        • #5
          Thank you!,
          your code helped me a lot in renaming my variables. they are no named like this: kp_1234_x
          However when I try to run the reshape STATA says that a lot of the variables are "already defined". I have no idea why that is happening.
          Can anybody help me with that?

          Comment


          • #6
            You apparently changed the renamed variables by adding an underscore after the kp.

            The code in both post #2 and post #3 assume the variable names have a single underscore in them.

            For what it's worth, I now prefer the approach in post #3 of renaming the variables from beginning with kp to beginning with v, so that if there are other variables beginning with kp (say, identifiers that are the same from wave to wave) the reshape will not risk confusing them.

            Last edited by William Lisowski; 14 Feb 2023, 08:26.

            Comment

            Working...
            X