Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Combining duplicate obs across multiple vars in wide format

    Hello,

    I have a dataset that is currently in wide format in the following form:


    ID name dob ID1_1 name1_1 dob1_1 ID2_1 name2_1 dob2_1 ID3_1 name3_1 dob3_1
    i123 Hubert 1940
    i234 Jenny 1970 i345 Sam 1963
    i235 Paul 1962 i768 Katy 1950 i768 Katy 1950
    i435 Angela 1980 i980 Megan 1942 i980 Megan 1942

    As you can see, there are obs that are duplicated within rows that are associated with a given ID. You can see that there are also gaps in between the obs, i.e. the data is stretched longer across the rows than it could be. The dataset is unique at the level of ID, and I'd like to keep the associated relationships in wide format. What I'm trying to do is concatenate / combine the duplicate obs that occur across the rows (and the different var names), so that I can have a dataset that would look like this:
    ID name dob ID1_1 name1_1 dob1_1 ID2_1 name2_1 dob2_1 ID3_1 name3_1 dob3_1
    i123 Hubert 1940
    i234 Jenny 1970 i345 Sam 1963
    i235 Paul 1962 i768 Katy 1950
    i435 Angela 1980 i980 Megan 1942 i980 Megan 1942

    Essentially I just want to move everything "up" towards the identifying ID for ease of inspection / viewing, and most importantly to remove the duplicates that occur across the different var names (which are associated by the var`i'_`num')

    Any and all help is much appreciated!

  • #2
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str5 id str7 name int dob str5 id1_1 str6 name1_1 int dob1_1 str5(id2_1 name2_1) int dob2_1 str5 id3_1 str6 name3_1 int dob3_1
    "i123 " "Hubert " 1940 ""      ""          . ""      ""         . ""      ""          .
    "i234 " "Jenny "  1970 ""      ""          . ""      ""         . "i345 " "Sam "   1963
    "i235 " "Paul "   1962 "i768 " "Katy "  1950 "i768 " "Katy " 1950 ""      ""          .
    "i435 " "Angela " 1980 "i980 " "Megan " 1942 ""      ""         . "i980 " "Megan " 1942
    end
    
    gen long obs_no = _n
    rename (id name dob) =0_1
    reshape long id@_1 name@_1 dob@_1, i(obs_no) j(_j)
    drop if missing(id_1) & missing(name_1) & missing(dob_1)
    by obs_no id_1 name_1 dob_1 (_j), sort: keep if _n == 1
    by obs_no (_j), sort: replace _j = _n-1
    reshape wide
    rename *0_1 *
    Frankly, unless you have a clear and persuasive reason for retaining the wide layout, you should skip the last two commands and stick with long layout. There aren't many things that can be done easily (or at all) with wide data in Stata. As you can see, this task of pruning the data, was quite simple and short once the data were -reshape-d to long. Doing it directly in wide layout would have required several loops and a lot of more or less unreadable code. Most things in Stata are like that.

    In the future, when showing data examples, please use the -dataex- command to do so, as I have here. If you are running version 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

    Comment


    • #3
      Thank you, Clyde, for your response. My data is sensitive and as such I cannot post the direct output from the dataex command, unfortunately.

      I tried the code that you provided, but it seems to have gotten me back to exactly where I was before I merged the datasets (in wide format) onto the IDs and did not do anything to help. I think the issue here is that you have "0_1" and my data goes from name1_1 to name22_5 (similarly for dob and ID). Can you please explain how the
      reshape long id@_1 name@_1 dob@_1, i(obs_no) j(_j) accounts for the numbers as they grow? I also thought of reshaping long, but thought this would need to be done in a loop to account for all the vars as they grow from 1- 22, and 1-5, respectively (i.e. name1_1 to name22_5)

      Comment


      • #4
        Let me point out that the problem stems from the fact that your example data and your description in post #1 gave no indication at all that the variable names could end in anything other than "_1". Not until post #3 did you make that apparent.

        You could help us help you by providing a data example - using dataex as Clyde recommended - that better represents the variable names that occur in your data.

        Comment


        • #5
          Also, in light of the new information given in #3, you need to clarify how the results should look. Suppose in some observation everything between id1_1 name1_1 dob1_1 and id1_2 name1_2 dob1_2 is missing. Do you want id1_2 name1_2 dob1_2 to "move left" to become id2_1 name2_1 dob2_1? Or should the _2, _3, _4, and _5 series maintain their separate "identities" with data "moving left" only within, but not among, these series?

          I am familiar with confidentiality issues, and understand that you cannot post your actual data. But just as you made up a table of fake data to show in #1, you can make a table of fake data in the Stata data editor having the structure of your actual data set and then use -dataex- to show that. Also, no need to show all of the id* dob* and name* variables. Just showing enough series that one can see the variation of both numerical indices in the names would suffice.

          Comment

          Working...
          X