String variables - Same string value in many columns to be collapsed

Lexi Mastalerz

Join Date: Jun 2023
Posts: 22

String variables - Same string value in many columns to be collapsed

28 Jun 2023, 13:17

Hello,

I am working with a large data set (5 million + rows), of individuals. I have a column for every month of the year for their spouse in that month, identified by a string identifier (Column: S_M1 is for Spouse in month 1). Most individuals have the same spouse from months 1 to 12. However, some change their spouse between months. I want to change the data to have just max 3 columns of first spouse, second spouse , etc, during the year

Current format:

Individual

S_M1

S_M2

S_M3

S_M4

S_M5

S_M6

S_M7

S_M8

S_M9

S_M10

S_M11

S_M12

All other variables related to household

XD309

GG345

DG23$

WD111

TT343

QW123

SA098

GF203

FD123

AA999

Preferred format:

Individual	Spouse1	Spouse2	Spouse3	All other variables related to household
XD309	GG345	.	.
DG23$	WD111	TT343	QW123
SA098	.	.	.
GF203	FD123	AA999	.

If I can get it to the preferred format above, I will reshape it and it and every spouse will have their own row with all the associated households variables. I wonder if its possible to skip step 2 above. The ideal end format is as follows:

Individual	All other variables related to household
XD309
GG345
DG23$
WD111
TT343
QW123
SA098
GF203
FD123
AA999

Please let me know if anyone is able to solve this problem. Thank you!

Tags: None

Andrew Musau

Join Date: Oct 2014
Posts: 10275

28 Jun 2023, 14:11

For your future posts, present data examples using the dataex command (refer to FAQ Advice #12). Dot entries within what seem to be string variables imply that the names of spouses may be value labels and not actual string entries. This will then require changes to the code below.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str5(individual s_m1 s_m2 s_m3 s_m4 s_m5 s_m6 s_m7 s_m8 s_m9 s_m10 s_m11 s_m12) byte allothervariablesrelatedtohouseh
"XD309" "GG345" "GG345" "GG345" "GG345" "GG345" "GG345" "GG345" "GG345" "GG345" "GG345" "GG345" "GG345" .
"DG23$" "WD111" "WD111" "WD111" "WD111" "."     "TT343" "TT343" "TT343" "TT343" "."     "QW123" "QW123" .
"SA098" "."     "."     "."     "."     "."     "."     "."     "."     "."     "."     "."     "."     .
"GF203" "FD123" "FD123" "."     "."     "."     "."     "AA999" "AA999" "AA999" "AA999" "AA999" "AA999" .
end

rename s_m* spouse*
reshape long spouse, i(individual) j(which)
bys individual spouse (which): keep if _n==1
drop if spouse=="."
by individual: replace which=_n
reshape wide spouse, i(individual) j(which)

Res.:

Code:

. l

     +---------------------------------------------------+
     | indivi~l   spouse1   spouse2   spouse3   alloth~h |
     |---------------------------------------------------|
  1. |    DG23$     QW123     TT343     WD111          . |
  2. |    GF203     AA999     FD123                    . |
  3. |    XD309     GG345                              . |
     +---------------------------------------------------+

Last edited by Andrew Musau; 28 Jun 2023, 14:27.

Comment

Lexi Mastalerz

Join Date: Jun 2023

Posts: 22
#3

07 Jul 2023, 07:49

Thank You very much for this I have not tried it yet but I hope my computer will be able to handle a reshape of 5 million x 10
Comment
Lexi Mastalerz

Join Date: Jun 2023

Posts: 22
#4

07 Jul 2023, 08:02

Thank You very much for this I have not tried it yet but I hope my computer will be able to handle a reshape of 5 million x 10
Comment

Hemanshu Kumar

Join Date: Mar 2015
Posts: 1474

07 Jul 2023, 09:27

If you are wary of reshaping such a large dataset, consider this code. I don't know if it is more efficient, though.

Code:

clear
input str5(individual s_m1 s_m2 s_m3 s_m4 s_m5 s_m6 s_m7 s_m8 s_m9 s_m10 s_m11 s_m12) byte allothervariablesrelatedtohouseh
"XD309" "GG345" "GG345" "GG345" "GG345" "GG345" "GG345" "GG345" "GG345" "GG345" "GG345" "GG345" "GG345" .
"DG23$" "WD111" "WD111" "WD111" "WD111" "."     "TT343" "TT343" "TT343" "TT343" "."     "QW123" "QW123" .
"SA098" "."     "."     "."     "."     "."     "."     "."     "."     "."     "."     "."     "."     .
"GF203" "FD123" "FD123" "."     "."     "."     "."     "AA999" "AA999" "AA999" "AA999" "AA999" "AA999" .
end

gen last_one = "."
gen all_spouses = ""

forval i = 1/12 {
    replace all_spouses = all_spouses + " " + s_m`i' if s_m`i' != "." & s_m`i' != last_one
    replace last_one = s_m`i' if s_m`i' != "." & s_m`i' != last_one
}

split all_spouses, gen(spouse_)
drop all_spouses last_one

which produces:

Code:

. list individual spouse_*, noobs abbrev(10)

  +---------------------------------------------+
  | individual   spouse_1   spouse_2   spouse_3 |
  |---------------------------------------------|
  |      XD309      GG345                       |
  |      DG23$      WD111      TT343      QW123 |
  |      SA098                                  |
  |      GF203      FD123      AA999            |
  +---------------------------------------------+

Comment

Hemanshu Kumar

Join Date: Mar 2015

Posts: 1474
#6

07 Jul 2023, 10:22

Out of curiosity, I checked the two procedures on a dataset of 40 million unique individuals, and it turns out my method in #5 takes only one fourth of the time that the method in #2 takes -- specifically, 395.2 seconds versus 1,653.5 seconds on my machine (Apple M2 Pro) using Stata 18 MP/12-core.
Comment
Lexi Mastalerz

Join Date: Jun 2023

Posts: 22
#7

06 Sep 2023, 15:33

Hello! Thank you so much and sorry for the late reply. I was redirected to another project and I also didn't think this would run so fast, I was expecting my computer to be unavailable over-night for this so I didn't run it.

Anyways, It worked! I used method in #5 and it worked in seconds using Stata 17 and HP Elite-book. I am not sure how you check how many seconds, but VERY QUICK.

Now, the ultimate goal was to get to "ideal" format in the last version, from my post 1 (I named the other "preferred", which was probably a bad idea). Aka, every individual and every spouse has their own row with the associated household variables. They are no longer connected (spouse and individual). As below:

Individual All other variables related to household

XD309

GG345

DG23$

WD111

TT343

QW123

SA098

GF203

FD123

AA999

The re-shape formats the data into the "preferred format" as is shown in your table in the posts. However, the reshape still only has the spouse as an attached variable to the individual and not as their own person in a row with their own household variables.

I have an idea of how to move forward, but maybe you have some magic idea that is an easier way to do this. Here are my next steps (I could make a new post if this is not ok).

1. Create a Unique identifier for each family (individual+spouse)
2. Export all spouses and the family identifier into a separate dataset (SPOUSE DATASET)
3. Use SPOUSE DATASET and merge all household info from the master file using the family identifier
4. Amend the two files where each individual and spouse will now be their own row

Thank you for all the help. I cant believe this is a free resource. Where do I donate please?

Lexi
Comment

Hemanshu Kumar

Join Date: Mar 2015
Posts: 1474

06 Sep 2023, 21:07

Consider this code:

Code:

clear
input str5(individual s_m1 s_m2 s_m3 s_m4 s_m5 s_m6 s_m7 s_m8 s_m9 s_m10 s_m11 s_m12) byte hhvar
"XD309" "GG345" "GG345" "GG345" "GG345" "GG345" "GG345" "GG345" "GG345" "GG345" "GG345" "GG345" "GG345" 1
"DG23$" "WD111" "WD111" "WD111" "WD111" "."     "TT343" "TT343" "TT343" "TT343" "."     "QW123" "QW123" 2
"SA098" "."     "."     "."     "."     "."     "."     "."     "."     "."     "."     "."     "."     1
"GF203" "FD123" "FD123" "."     "."     "."     "."     "AA999" "AA999" "AA999" "AA999" "AA999" "AA999" 3
end


gen `c(obs_t)' hh = _n

rename individual ind0
rename s_m# ind#

reshape long ind , i(hh hhvar) j(ind_num)
drop if ind == "."

bysort hh ind (ind_num): keep if _n == 1
drop ind_num

which produces:

Code:

. list, noobs sep(0)

  +--------------------+
  | hh   hhvar     ind |
  |--------------------|
  |  1       1   GG345 |
  |  1       1   XD309 |
  |  2       2   DG23$ |
  |  2       2   QW123 |
  |  2       2   TT343 |
  |  2       2   WD111 |
  |  3       1   SA098 |
  |  4       3   AA999 |
  |  4       3   FD123 |
  |  4       3   GF203 |
  +--------------------+

For the reshape to work, we do need a unique household identifier. I have made one up (called hh) here which does the job. If you already have such a variable, you could use that instead. I have also made up a household variable called hhvar with arbitrary values. This is a placeholder for "all other variables related to the household".

Note also that the code above is designed to keep an observation from each household a person appears in (as "individual" or spouse), and so there could be multiple observations for a person. If you are sure a person only ever appears in a single household, then this code will still do the job but can be slightly simplified by changing the bysort command to

Code:

bysort ind: keep if _n == 1

Last edited by Hemanshu Kumar; 06 Sep 2023, 21:27.

Comment

Lexi Mastalerz

Join Date: Jun 2023

Posts: 22
#9

07 Sep 2023, 08:58

Incredible, thank you. Great thinking to rename the individual as zero, and the spouses are 1-3, which allows for this task to be done by re-shape.

THANK YOU
Comment

Announcement

String variables - Same string value in many columns to be collapsed

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment