Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Combining several datasets to create one large longitudinal dataset

    Hi

    I would like some advice on the best way to combine (or merge) several datasets to create one large dataset with longitudinal data. I have ten datasets, each corresponding to a calendar year between 2005 and 2014. A dataset may contain between 12,000 and 25,000 subjects and up to 15 variables. When combined, one should be able to follow a subject over the years. There will be new subjects included each year and several will also drop out. There is an ID variable that identifies subjects as well as a variable indicating the year. The datasets contain both time-constant and time varying variables. Some time-constant variables such as birthdate, date of diagnosis, ethnicity are recorded each year and should be comparable over the years once all datasets are merged together.

    Examples of three datasets (made-up):

    dataset 1:

    Code:
    input ID  birthdate05  diagdate05 bmi05  chol05  year05
    
    1     1981         1981           18     180    2005
    2     1979         1997           21     169    2005
    3     2001         2004           17     188    2005
    4     2000         2005           23     179    2005
    5     1982         1994           .      159    2005
    6     1999         2001           22     173    2005
    7     1996         2000           21     181    2005
    dataset 2:

    Code:
    input ID  birthdate07  diagdate07   bmi07  chol07   year07
    
    1       1981            1981        18     180     2007
    2       1979            1997        21     169     2007
    3       2001            2004        17     188     2007
    
    5       1982            1994        19     159     2007
    6       1999            2001        22     173     2007
    16      2000            2004        21     181     2007
    dataset 3:

    Code:
    input ID  birthdate11  diagdate11  bmi11  chol11   year011
    
    1       1981         1981        18        180     2011
    2       1979         1997        21        169     2011
    3       2001         2004        17        188     2011
    4       2000         2005        23        .       2011
    5       1982         1994        19        159     2011
    16      1999         2001        22        173     2011
    19      1996         2000        21        181     2011
    21      2009         2010        21       .        2011

    I would like to merge the above three datasets to have one dataset in the long format so I can run longitudinal analysis (for e.g. multilevel or GEE models).
    Ideally, I would like each subject to have as many records (or 'rows') as data measurements over time:

    Combined dataset:

    Code:
    input ID  birthdate diagdate   bmi   chol   year
    
    1       1981         1981       18     180     2005
    1       1981         1981       18     180     2007
    1       1981         1981       18     180     2011
    2       1979         1997       21     169     2005
    2       1979         1997       21     169     2007
    2       1979         1997       21     169     2011
    3       2001         2004       17     188     2005
    3       2001         2004       17     188     2007
    3       2001         2004       17     188     2011
    4       2000         2005       23     179     2005
    4       2000         2005       23     .       2011
    5       1982         1994       .      159     2005
    5       1982         1994       19     159     2007
    5       1982         1994       19     159     2011
    6       1999         2001       22     173     2005
    6       1999         2001       22     173     2007
    7       1996         2000       21     181     2005
    16      2000         2004       21     181     2007
    16      1999         2001       22     173     2011
    19      1996         2000       21      81     2011
    21      2009         2010       21     .       2011

    1. My first approach was to use append but this results in a dataset being added below the previous (master) dataset and I don't obtain the above structure I would like.

    2.Additionally, I'm guessing the names of the variables have to be identical across all the datasets (they currently differ in the suffix - perhaps this is what is creating the problem with append?).

    Thanks

    /Amal

  • #2
    since you already have them in long form, maybe you could just "reshape" into wide form - assuming you really want wide form - many forms of analysis are simpler with the data in long form (though you will need to do some renaming)

    Comment


    • #3
      Thanks Rich - but I do want the ''final'' dataset in the long format. My question was more about the best way to combine the three smaller datasets (the first datasets above) to get one combined dataset (the last one above).

      Thanks

      /Amal

      Comment


      • #4
        I think you will first want something like the following. This assumes that you are using a recent enough version of Stata that rename group has been implemented (I know it's in 13 and think it's in 12).
        Code:
        clear
        save data_all, emptyok replace
        use data2005, clear
        rename *05 *
        append using data_all
        save data_all, replace
        use data2007, clear
        rename *07 *
        append using data_all
        save data_all, replace
        use data2011, clear
        rename *11 *
        append using data_all
        sort ID year
        save data_all, replace
        list, clean

        Comment


        • #5
          rename groups was introduced in Stata 12. Regardless of which version people are currently using, old whatsnew help files are accessible for some versions back from the StataCorp website.

          It can take some rummaging to find out, but in this case see http://www.stata.com/help.cgi?whatsnew11to12 (item 16)

          Comment


          • #6
            Hi William

            Thanks for the example code. I understand that the rename gets rid of the suffix that distinguishes variables between the different years but what is the use of the line:

            Code:
            save data_all, emptyok replace
            Shouldn't the append line read as:

            Code:
            append using data2007
            and so on and so forth...

            I thought I should open the first dataset (data2005), append the data2007 to this, followed by data2011...?


            Thanks

            /Amal

            Comment


            • #7
              Apologies - I figured it out. I wasn't entering the right directory pathway and thus wasn't able to start with an ''empty dataset'' such as the data_all above.

              I see that when one uses the append command to join datasets, the variables retain the labels of the latest appended dataset. Is there anyway to retain the original labels for the various variables from both the master and using datasets (and all subsequent datasets appended to the data_all dataset?).

              Thanks

              /Amal

              Comment


              • #8
                I'm not sure what you're asking here. If you want to retain the labels from the first data set, just add the -nolabel- option to all of the -append- commands except the first.

                But you talk about retaining the original labels from both master and using datasets. I don't even understand what that means. If the labels are different from one data set to the next, you have to make a choice which you want to retain. A variable can only have one label within a data set. (Well, that's not strictly true: see -label language-, and perhaps what you want to do could be accomplished by storing each dataset's labels as a different "language." But you would still have only one version of the label "in play" at any time.)

                But I don't think that's what you mean. Could you perhaps show us an example of two data sets with clashing labels and what you would want the result to look like?

                Comment

                Working...
                X