Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • String variables - Same string value in many columns to be collapsed

    Hello,

    I am working with a large data set (5 million + rows), of individuals. I have a column for every month of the year for their spouse in that month, identified by a string identifier (Column: S_M1 is for Spouse in month 1). Most individuals have the same spouse from months 1 to 12. However, some change their spouse between months. I want to change the data to have just max 3 columns of first spouse, second spouse , etc, during the year

    Current format:
    Individual S_M1 S_M2 S_M3 S_M4 S_M5 S_M6 S_M7 S_M8 S_M9 S_M10 S_M11 S_M12 All other variables related to household
    XD309 GG345 GG345 GG345 GG345 GG345 GG345 GG345 GG345 GG345 GG345 GG345 GG345
    DG23$ WD111 WD111 WD111 WD111 . TT343 TT343 TT343 TT343 . QW123 QW123
    SA098 . . . . . . . . . . . .
    GF203 FD123 FD123 . . . . AA999 AA999 AA999 AA999 AA999 AA999

    Preferred format:

    Individual Spouse1 Spouse2 Spouse3 All other variables related to household
    XD309 GG345 . .
    DG23$ WD111 TT343 QW123
    SA098 . . .
    GF203 FD123 AA999 .

    If I can get it to the preferred format above, I will reshape it and it and every spouse will have their own row with all the associated households variables. I wonder if its possible to skip step 2 above. The ideal end format is as follows:

    Individual All other variables related to household
    XD309
    GG345
    DG23$
    WD111
    TT343
    QW123
    SA098
    GF203
    FD123
    AA999

    Please let me know if anyone is able to solve this problem. Thank you!









  • #2
    For your future posts, present data examples using the dataex command (refer to FAQ Advice #12). Dot entries within what seem to be string variables imply that the names of spouses may be value labels and not actual string entries. This will then require changes to the code below.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str5(individual s_m1 s_m2 s_m3 s_m4 s_m5 s_m6 s_m7 s_m8 s_m9 s_m10 s_m11 s_m12) byte allothervariablesrelatedtohouseh
    "XD309" "GG345" "GG345" "GG345" "GG345" "GG345" "GG345" "GG345" "GG345" "GG345" "GG345" "GG345" "GG345" .
    "DG23$" "WD111" "WD111" "WD111" "WD111" "."     "TT343" "TT343" "TT343" "TT343" "."     "QW123" "QW123" .
    "SA098" "."     "."     "."     "."     "."     "."     "."     "."     "."     "."     "."     "."     .
    "GF203" "FD123" "FD123" "."     "."     "."     "."     "AA999" "AA999" "AA999" "AA999" "AA999" "AA999" .
    end
    
    rename s_m* spouse*
    reshape long spouse, i(individual) j(which)
    bys individual spouse (which): keep if _n==1
    drop if spouse=="."
    by individual: replace which=_n
    reshape wide spouse, i(individual) j(which)
    Res.:

    Code:
    . l
    
         +---------------------------------------------------+
         | indivi~l   spouse1   spouse2   spouse3   alloth~h |
         |---------------------------------------------------|
      1. |    DG23$     QW123     TT343     WD111          . |
      2. |    GF203     AA999     FD123                    . |
      3. |    XD309     GG345                              . |
         +---------------------------------------------------+
    Last edited by Andrew Musau; 28 Jun 2023, 14:27.

    Comment


    • #3
      Thank You very much for this I have not tried it yet but I hope my computer will be able to handle a reshape of 5 million x 10

      Comment


      • #4
        Thank You very much for this I have not tried it yet but I hope my computer will be able to handle a reshape of 5 million x 10

        Comment


        • #5
          If you are wary of reshaping such a large dataset, consider this code. I don't know if it is more efficient, though.

          Code:
          clear
          input str5(individual s_m1 s_m2 s_m3 s_m4 s_m5 s_m6 s_m7 s_m8 s_m9 s_m10 s_m11 s_m12) byte allothervariablesrelatedtohouseh
          "XD309" "GG345" "GG345" "GG345" "GG345" "GG345" "GG345" "GG345" "GG345" "GG345" "GG345" "GG345" "GG345" .
          "DG23$" "WD111" "WD111" "WD111" "WD111" "."     "TT343" "TT343" "TT343" "TT343" "."     "QW123" "QW123" .
          "SA098" "."     "."     "."     "."     "."     "."     "."     "."     "."     "."     "."     "."     .
          "GF203" "FD123" "FD123" "."     "."     "."     "."     "AA999" "AA999" "AA999" "AA999" "AA999" "AA999" .
          end
          
          gen last_one = "."
          gen all_spouses = ""
          
          forval i = 1/12 {
              replace all_spouses = all_spouses + " " + s_m`i' if s_m`i' != "." & s_m`i' != last_one
              replace last_one = s_m`i' if s_m`i' != "." & s_m`i' != last_one
          }
          
          split all_spouses, gen(spouse_)
          drop all_spouses last_one
          which produces:

          Code:
          . list individual spouse_*, noobs abbrev(10)
          
            +---------------------------------------------+
            | individual   spouse_1   spouse_2   spouse_3 |
            |---------------------------------------------|
            |      XD309      GG345                       |
            |      DG23$      WD111      TT343      QW123 |
            |      SA098                                  |
            |      GF203      FD123      AA999            |
            +---------------------------------------------+

          Comment


          • #6
            Out of curiosity, I checked the two procedures on a dataset of 40 million unique individuals, and it turns out my method in #5 takes only one fourth of the time that the method in #2 takes -- specifically, 395.2 seconds versus 1,653.5 seconds on my machine (Apple M2 Pro) using Stata 18 MP/12-core.

            Comment


            • #7
              Hello! Thank you so much and sorry for the late reply. I was redirected to another project and I also didn't think this would run so fast, I was expecting my computer to be unavailable over-night for this so I didn't run it.

              Anyways, It worked! I used method in #5 and it worked in seconds using Stata 17 and HP Elite-book. I am not sure how you check how many seconds, but VERY QUICK.

              Now, the ultimate goal was to get to "ideal" format in the last version, from my post 1 (I named the other "preferred", which was probably a bad idea). Aka, every individual and every spouse has their own row with the associated household variables. They are no longer connected (spouse and individual). As below:
              Individual All other variables related to household
              XD309
              GG345
              DG23$
              WD111
              TT343
              QW123
              SA098
              GF203
              FD123
              AA999

              The re-shape formats the data into the "preferred format" as is shown in your table in the posts. However, the reshape still only has the spouse as an attached variable to the individual and not as their own person in a row with their own household variables.

              I have an idea of how to move forward, but maybe you have some magic idea that is an easier way to do this. Here are my next steps (I could make a new post if this is not ok).

              1. Create a Unique identifier for each family (individual+spouse)
              2. Export all spouses and the family identifier into a separate dataset (SPOUSE DATASET)
              3. Use SPOUSE DATASET and merge all household info from the master file using the family identifier
              4. Amend the two files where each individual and spouse will now be their own row

              Thank you for all the help. I cant believe this is a free resource. Where do I donate please?


              Lexi

              Comment


              • #8
                Consider this code:

                Code:
                clear
                input str5(individual s_m1 s_m2 s_m3 s_m4 s_m5 s_m6 s_m7 s_m8 s_m9 s_m10 s_m11 s_m12) byte hhvar
                "XD309" "GG345" "GG345" "GG345" "GG345" "GG345" "GG345" "GG345" "GG345" "GG345" "GG345" "GG345" "GG345" 1
                "DG23$" "WD111" "WD111" "WD111" "WD111" "."     "TT343" "TT343" "TT343" "TT343" "."     "QW123" "QW123" 2
                "SA098" "."     "."     "."     "."     "."     "."     "."     "."     "."     "."     "."     "."     1
                "GF203" "FD123" "FD123" "."     "."     "."     "."     "AA999" "AA999" "AA999" "AA999" "AA999" "AA999" 3
                end
                
                
                gen `c(obs_t)' hh = _n
                
                rename individual ind0
                rename s_m# ind#
                
                reshape long ind , i(hh hhvar) j(ind_num)
                drop if ind == "."
                
                bysort hh ind (ind_num): keep if _n == 1
                drop ind_num
                which produces:
                Code:
                . list, noobs sep(0)
                
                  +--------------------+
                  | hh   hhvar     ind |
                  |--------------------|
                  |  1       1   GG345 |
                  |  1       1   XD309 |
                  |  2       2   DG23$ |
                  |  2       2   QW123 |
                  |  2       2   TT343 |
                  |  2       2   WD111 |
                  |  3       1   SA098 |
                  |  4       3   AA999 |
                  |  4       3   FD123 |
                  |  4       3   GF203 |
                  +--------------------+
                For the reshape to work, we do need a unique household identifier. I have made one up (called hh) here which does the job. If you already have such a variable, you could use that instead. I have also made up a household variable called hhvar with arbitrary values. This is a placeholder for "all other variables related to the household".

                Note also that the code above is designed to keep an observation from each household a person appears in (as "individual" or spouse), and so there could be multiple observations for a person. If you are sure a person only ever appears in a single household, then this code will still do the job but can be slightly simplified by changing the bysort command to

                Code:
                bysort ind: keep if _n == 1
                Last edited by Hemanshu Kumar; 06 Sep 2023, 21:27.

                Comment


                • #9
                  Incredible, thank you. Great thinking to rename the individual as zero, and the spouses are 1-3, which allows for this task to be done by re-shape.

                  THANK YOU

                  Comment

                  Working...
                  X