Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Reshaping a panel dataset from wide to long

    Hello everyone,

    I was wondering if anyone would be able to help me understand what I may be doing wrong with my ID variables, I would like to reshape my dataset from wide to long.

    I have a panel dataset with 4 waves (but I'm only wanting to use waves 3 and 4, each wave has a person variable and a case/household variable. I merged the two waves using these person and case variables.

    I have attached a screenshot of a section of what these variables look like:
    Household ID variables.tiff Person ID variables.tiff

    I then created one ID variable:
    egen personID = group(person household)

    And finally tried to reshape using this:
    reshape long PresyrW PresmonW TypeW sexw DVGIEMPw PermJbW EmpStYW EdLevelW DVHasDCW PFTyp1W POMeth1W POEmFr1W ORetIncW OriskaW OriskcW OSaferetW OunderW OPenSavW DVAge9W PDCVal1W SpendMW LvTdayW, i(personID) j(wave)

    When I try and reshape the dataset I get the error (r(9)) says that my variable ID does not uniquely identify the observations.

    I was wondering how I might deal with duplicates in my ID variable or how to generate one that doesn't contain duplicates so that I am able to reshape.

    Any help is greatly appreciated, thanks,

    ​​​​​​​Alice

  • #2
    The information you have provided is too incomplete, and, by the way, provided in a very difficult format to work with. I doubt anybody can answer your question based on what you've shown.

    You have created a variable personID based on person and household. But there is no variable person or household in your data set, at least not in what you show of it. We have no clue what the numerous other variables mentioned in the -reshape- command look like.

    Suffice it to say, that when Stata says that a variable fails to identify the observations uniquely, I have never known it to be wrong. It must be the case that you have more than one observation with the same value of personID. Actually, it probably wouldn't have made sense for you to generate that variable the way you did if it were just going to identify individual observations anyway. But since the data you show in your pictures contains neither person, nor household, nor personID, you really leave us without a clue about the relevant aspects of your data set.

    So we need to see an example of your data, and, so that we can try to experiment with it, it needs to be in a usable form. (Pictures are not usable: how would you import the examples you showed into Stata?) The way to show data examples is with the -dataex- command. You can install it by running -ssc install dataex-, and then read the instructions by running -help dataex-. Use -dataex- to post an example of your data. Be sure to include all of the relevant variables. You don't need to show a large number of observations. Probably a handful will suffice for this.

    Finally, I'm just wondering one thing. If you want the data in long layout (and, that is almost always the best layout for Stata analyses, so you should want it that way), did you consider -append-ing the wave data sets instead of -merge-ing them? Then it would have come out long in the first place.



    Comment


    • #3
      Hello Clyde,

      Thank you for your response. Sorry for not including the proper information in my first post. I have included a section of my data below from the 4th wave (before merging or appending them), I have only included a few variables. I will also try appending the two waves instead of merging them.

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input byte(PersonW3 Personw4) int(CASEw3 CASEw4 PresyrW4) byte PresmonW4 long DVGIEMPW4
       1 1 14178  9156 2012  9     0
       1 1  5535 10350 2012  9     0
      -7 3 12638  5337   -7 -7 18600
       1 1 12638  5337 2012 11 14000
       2 2 12638  5337   -7 -7     0
       1 1 21368   175 2012  9     0
       1 1 20302  6792 2012  9 36000
       2 2 20302  6792 2012  9 10800
       4 3 20302  6792 2012  9 28000
       3 1 20302  1951 2012  9 27600
       1 1 14097 12885 2012  9 93000
       2 2 14097 12885 2012  9 36000
       5 3 14097 12885   -7 -7     0
       4 4  2211  9213   -7 -7     0
       1 1  2211  9213 2012  9 42000
       2 2  2211  9213 2012  9 75000
       3 3  2211  9213   -7 -7     0
       1 1 15167 10746 2012 10     0
       2 2 15167 10746 2012 10     0
       3 3 15167 10746   -7 -7     0
      end
      label values PersonW3 PersonW3
      label def PersonW3 -7 "Not applicable", modify
      label values Personw4 Personw4
      label values PresyrW4 PresyrW4
      label def PresyrW4 -7 "Does not apply", modify
      label values PresmonW4 PresmonW4
      label def PresmonW4 -7 "Does not apply", modify
      label def PresmonW4 9 "September", modify
      label def PresmonW4 10 "October", modify
      label def PresmonW4 11 "November", modify
      Thank you for your help,

      Alice

      Comment


      • #4
        Thanks. This is clearer, but I am puzzled by the data. It is unclear what is the unit that is represented by a single observation (row) of your data. As it stands it looks like when you merged the wave you somehow threw together data from one wave 3 person with some other arbitrary wave 4 person. Or are these Person numbers serial numbers within households, and each observation represents the same household in both waves? Also it seems odd that Presyr and Presmon (which appear to be year and month variables) are not available in wave 3, though I suppose that may be just the way the data came to you.

        Anyway, if I have it right about an observation being a household, the following code will do the -reshape- for you:

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input byte(PersonW3 Personw4) int(CASEw3 CASEw4 PresyrW4) byte PresmonW4 long DVGIEMPW4
         1 1 14178  9156 2012  9     0
         1 1  5535 10350 2012  9     0
        -7 3 12638  5337   -7 -7 18600
         1 1 12638  5337 2012 11 14000
         2 2 12638  5337   -7 -7     0
         1 1 21368   175 2012  9     0
         1 1 20302  6792 2012  9 36000
         2 2 20302  6792 2012  9 10800
         4 3 20302  6792 2012  9 28000
         3 1 20302  1951 2012  9 27600
         1 1 14097 12885 2012  9 93000
         2 2 14097 12885 2012  9 36000
         5 3 14097 12885   -7 -7     0
         4 4  2211  9213   -7 -7     0
         1 1  2211  9213 2012  9 42000
         2 2  2211  9213 2012  9 75000
         3 3  2211  9213   -7 -7     0
         1 1 15167 10746 2012 10     0
         2 2 15167 10746 2012 10     0
         3 3 15167 10746   -7 -7     0
        end
        label values PersonW3 PersonW3
        label def PersonW3 -7 "Not applicable", modify
        label values Personw4 Personw4
        label values PresyrW4 PresyrW4
        label def PresyrW4 -7 "Does not apply", modify
        label values PresmonW4 PresmonW4
        label def PresmonW4 -7 "Does not apply", modify
        label def PresmonW4 9 "September", modify
        label def PresmonW4 10 "October", modify
        label def PresmonW4 11 "November", modify
        
        gen long household = _n
        reshape long Person CASE Presyr Presmon DVGIEMP, i(household) j(wave) string
        destring wave, ignore("wW") replace
        The code is made a bit more baroque by virtue of the designation of wave in the original variable names with sometimes a lower case w and sometimes a W, but that is not a big deal.

        One other thing: in several of the variables you are using -7 as a code for missing values. That will cause headaches when you analyze the data: Stata will treat them as actual numerical values of -7. And unlike certain other statistical languages, there is no way to tell a Stata analysis command to pretend that -7 really means missing when doing calculations. So you will need to replace those -7's by actual Stata missing values before doing any calculations. Fortunately, that's very easy to do:

        Code:
        mvdecode Person CASE Presyr Presmon DVGIEMP, mv(-7)
        Do the same for any other "magic numbers" that really mean missing value.

        Comment


        • #5
          Thank you Clyde, this was really helpful. My data now looks like this:

          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input float person byte wave int PresyrW byte(PresmonW EmpNW TypeW) long DVGIEMPw byte PermJbW
           1 3 2012  1  6 . 29400  1
           1 4 2014  4  3 2     .  1
           2 3 2011 12  4 . 13800  1
           2 4 2014  4  1 6     . -7
           3 3 2011  8 -6 .     0 -7
           3 4 2012 11  5 2     .  1
           4 3 2011  8  3 . 19900  1
           4 4 2014  7  3 2     .  4
           5 3 2010  8  4 .  5640  1
           5 4 2013  8  3 2     .  1
           6 3 2011  6 -6 .     0 -7
           6 4 2014  5 -7 2     . -7
           7 3 2010 10 -7 .     0 -7
           7 4 2012  9  6 2     .  1
           8 3 2011  6  4 .     0 -7
           8 4 2014  6  5 2     . -7
           9 3 2011  3  1 .     0 -7
           9 4 2013  9  1 2     . -7
          10 3 2011 10  1 .  6940  1
          10 4 2013 11  3 2     . -7
          11 3 2011  5 -7 .     0 -7
          11 4 2014  5  1 6     . -7
          12 3 2011  5  3 . 18900  1
          12 4 2013 10  3 2     . -7
          13 3 2011 10 -6 .     0 -7
          13 4 2012 10  6 6     . -7
          14 3 2011  8 -6 .     0 -7
          14 4 2013  1  4 2     .  1
          15 3 2011 12 -6 .     0 -7
          15 4 2013  4 -7 2     . -7
          end
          label values PresyrW PresyrW4
          label values PresmonW PresmonW4
          label def PresmonW4 1 "January", modify
          label def PresmonW4 3 "March", modify
          label def PresmonW4 4 "April", modify
          label def PresmonW4 5 "May", modify
          label def PresmonW4 6 "June", modify
          label def PresmonW4 7 "July", modify
          label def PresmonW4 8 "August", modify
          label def PresmonW4 9 "September", modify
          label def PresmonW4 10 "October", modify
          label def PresmonW4 11 "November", modify
          label def PresmonW4 12 "December", modify
          label values EmpNW EmpNW4
          label def EmpNW4 -7 "Does not apply", modify
          label def EmpNW4 -6 "Error / partial", modify
          label def EmpNW4 1 "1 to 10,", modify
          label def EmpNW4 3 "25 to 49,", modify
          label def EmpNW4 4 "50 to 249,", modify
          label def EmpNW4 5 "250 to 499,", modify
          label def EmpNW4 6 "or, 500 or more?", modify
          label values TypeW TypeW4
          label def TypeW4 2 "W3-W4 Linked case", modify
          label def TypeW4 6 "W4 New household", modify
          label values PermJbW PermJbW4
          label def PermJbW4 -7 "Does not apply", modify
          label def PermJbW4 1 "Permanent", modify
          label def PermJbW4 4 "No employment contract", modify
          I think I understand what you mean by the data still being a bit unclear. The data I attached previously was just from wave 4, I hadn't merged it because I couldn't due to not having an ID that uniquely identified my observations. The way I got it to merge the two waves before doing the reshape commands you suggested was to do this:

          use "DataW3"
          egen personID = group(personW3 CASEw3)
          save "DataW3.1"

          use "DataW4"
          egen personID = group(personW4 CASEw4)
          save "DataW4.1"

          use "DataW3.1"
          merge 1:1 ID using "DataW4.1"

          I don't know if this is right and I fear that as you said I may be putting a wave 3 person with an arbitrary wave 4 person. In the original wave dataset I think the person variable does just number the members no. 1,2,3 etc. of a particular household and the household variable then uniquely identifies each household but give the same household ID to those in the same household. But it did reshape successfully so I think it has worked.

          Thank you for your help,

          Alice


          Comment


          • #6
            use "DataW3"
            egen personID = group(personW3 CASEw3)
            save "DataW3.1"

            use "DataW4"
            egen personID = group(personW4 CASEw4)
            save "DataW4.1"
            is problematic. Even assuming that the same person numbers and case numbers are used consistently for the same person/case in each wave, the use of -egen, group- will not produce the same results in each data set if there are people who appear in one data set but not the other (which is nearly always the case with surveys). When you then -merge- on personID, you would then be combining the data from different people into the same observation.

            I think that the best way to combine these data sets is to -append- them, not -merge- them. Not only will it give you a data set in long layout to start with, it completely avoids the problem of creating a unique identifiers on different groups of people in the two waves. If for other reasons you need to combine the person and CASE variables into a single personID variable, you can do that after -append- and there will be no problem of inconsistency.

            If there are other reasons why it is better to -merge- than append, then the solution is not to use -egen, group-. Rather, the solution is to use both person and CASE in the merge key. Thus

            Code:
            use "DataW3"
            rename personW3 person
            rename CASEW3 CASE
            save "DataW3.1"
            
            use "DataW4"
            rename personW4 person
            rename CASEW4 CASE
            save "DataW4.1"
            
            merge 1:1 person CASE using DataW3.1

            Comment


            • #7
              Thank you Clyde. I will try both options, appending and merging using person and case as the merge key. Thank you for all your help.

              Comment

              Working...
              X