Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Reshaping with Two Sets of Variables Ordered Differently

    Dear Statalist Community,

    After I tried a crude fix for which I received help on here yesterday already (it wasn't the help that was crude, but rather my approach), I now have to use a better approach relying on job identifiers to correctly match all variables to the corresponding job. I'm working with the NLSY data (which I just wanted to point out in case anybody has experienced the same difficulties with it before).

    My data are referring to jobs held by various people over the years (so the combination of caseid, year, and jobnumber should serve as a unique identifier after everything has been reshaped). Essentially my problem is that there are two sets of variables in the data: One for which the jobs are numbered 1-5, each year starting from 1 again, where job number 1 is the most recent job. The other for which jobs are simply numbered with an increasing number which does not reset over the years (i.e. job 25 is the 25th job held by the person over the years in the data). The data are currently in wide format with the ending representing the year and jobnumber (i.e. VAR_1980_02 meaning job number 2 [whether the second overall or that year depending on the variable] from 1980). So if I just were to reshape them as is, the information on the same job might not end up in the same line as the second job in 1980 could have been the fourth overall, and therefore some of the information will be under job number 4 and some under job number 2.

    In the example data below HRP* represents a variable for which the job numbers repeat every year and EmpAllTenure* one for which the job number keeps increasing.

    There are further two variables which provide job-identifying information. The JOB_UID_EMPROSTER variable provides information for the variables for which job numbers repeat from 1-5 every year (here: HRP, there are more variables like this in the whole data set). For example for CASEID=17, JOB_UID_EMPROSTER_1982_01 = 19790100 means that the information on the most recent job in 1982 (same scale as HRP) refers to information about the first job in 1979 (because the 19790100 number represents the year, then the job number, and two zeros). The JOB_UID_EMPROSTER variables link the round-specific job 1-5 employers to the roster of the other variables (those with numbers who keep increasing). So that means that that number (here: 19790100) can be used to find the same number among the list of EmpAllUID* variables. The ending of the name of the matching EmpAllUID variable then represents the job number the 1-5 repeating variable would take on the constantly increasing scale. I then need to figure out how to make use of that information to reshape my data.

    I'll try to rephrase this quickly, since I don't find the information super intuitive. Taking JOB_UID_EMPROSTER_1980_01 as an example, I need to find the value among the EmpAllUID_## variables (which don't vary by year, only job number), which equals JOB_UID_EMPROSTER_1980_01. So when JOB_UID_EMPROSTER_1980_01 == (e.g.) EmpAllUID_04, I know that information for all the variables from 1980 with the job number "1" which are on the repeating 1-5 scale (here: HRP_1980_01) refer to the overall job number "04" (the ending from EmpAllUID_04) for that specific respondent. Now I need help using that information to reshape my data so that it is in long format, both in terms of years as well as job numbers and has the correctly corresponding information in each line.

    I know that the data aren't super intuitive to read in wide format, so I've tried to be precise in my description and find a good balance between providing enough data that they work as an example and not too much so that one can still get a feel for them. I hope that my description is useful enough, otherwise please ask me to clarify and I will try my best.

    I've been going over this for quite a while now, so any help would be incredibly highly appreciated.


    Oh and: negative numbers are missing values (I didn't recode them yet, but I don't think that should matter at this stage) and due to the wide format I've had to manually adapt the -dataex output into two parts but it seems to work this way.








    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input int CASEID_1979 long(HRP_1979_01 HRP_1979_02 JOB_UID_EMPROSTER_1979_01 JOB_UID_EMPROSTER_1979_02) int(EmpAllTENURE_1979_01 EmpAllTENURE_1979_02 HRP_1980_01 HRP_1980_02) long JOB_UID_EMPROSTER_1980_01
    14 237  -4 19790100       -4  9 -4  -4  -4 19800100
    15  -4  -4       -4       -4 -4 -4 100  -4 19800100
    16 455  -4 19790100       -4 59 -4 571 489 19800100
    17 420  -4 19790100       -4 62 -4 585  -4 19790100
    18 625 350 19790100 19790200 30  5 416  -4 19800100
    end
    
    
    input long JOB_UID_EMPROSTER_1980_02 int(EmpAllTENURE_1980_01 EmpAllTENURE_1980_02 EmpAllTENURE_1980_03) byte EmpAllTENURE_1980_04 long(EmpAllUID_01 EmpAllUID_02 EmpAllUID_03 EmpAllUID_04)
    19790100  -3  7 -4 -4 19790100 19800100 19810100 19820100
          -4  50 -4 -4 -4 19800100 19810200 19830100 19840100
    19790100  89 20 -4 -4 19790100 19800100 19810300 19810200
          -4 111 -4 -4 -4 19790100 19830100 19850100 19850200
    19790100  -4  7 35 -4 19790200 19790100 19800100 19870100
    end

  • #2
    I start to get a little confused from the fourth paragraph on, but in terms of the general problem as I understand it, you have some cases where the wide variables would be named:

    var_1980_01
    var_1980_02
    var_1981_01
    ...

    And some cases where variables representing the same reference period (year/job no) would be named:
    var_1980_01
    var_1980_02
    var_1981_03
    ...

    Is that correct? If so, I don't see why you couldn't reshape them & then update the j variable. So to convert everything to a running number by year:

    Code:
    reshape long var_1980_ var_1981_, i(caseid) j(jobno)
    ren *_ *
    reshape long var_1980 var1981, i(caseid jobno) j(year)
    sort caseid year jobno
    by caseid year: replace jobno = _n

    Comment


    • #3
      Thanks for getting back to me Mike. I did finally find a solution myself. I ended up splitting the data set and working things out separately. Your way sounds a lot simple than what I did, but I managed to incorporate the identifiers.

      Also: thanks for trying to wrap your head around the problem. I realize that it wasn't described perfectly (, but in my defense the data just are a little confusing).

      Comment

      Working...
      X