Mirroring Variables using Loops

David Kapaon

Join Date: Nov 2019
Posts: 30

Mirroring Variables using Loops

01 Apr 2020, 18:23

Hi,
I have a wide dataset (about 5000 observations) containing data on age and marital status from the years 2014-2019 – (I was only able to include 2014-2017 due to the limit of dataex though). However, not everyone has data for all four (in reality, six) years. Specifically, many age and marital status values are left-justified (so to speak) and don’t actually correspond to their ‘correct’ column. As an example, ID “PLMIB” has data for year 2016 only, but their respective yearly age and marital status values appear in column 2014. I need to transfer/shift their age (84) and marital status (Widowed) values, which are currently in the 2014 column, to the correct 2016 columns where they belong. However, I need to do this for every age and marital status value that is in the incorrect column. I included a short example of the data since visually looking at it might be easier to understand.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str6 ID int(Census_2014 Census_2015 Census_2016 Census_2017 Age_Census_2014 Age_Census_2015 Age_Census_2016 Age_Census_2017) str14(Marital_Status_Census_2014 Marital_Status_Census_2015 Marital_Status_Census_2016 Marital_Status_Census_2017)
"URHDJ" 2014 2015 2016 2017 62 63 64 65 "Married" "Married" "Married" "Married"
"MKLRD" 2014 2015 2016 2017 58 59 60 61 "Married" "Married" "Married" "Married"
"BCGDT"    . 2015 2016 2017 53 54 55  . "Married" "Widowed" "Married" ""      
"AQWSD"    . 2015 2016 2017 46 47 48  . ""        ""        ""        ""      
"ZFHKA"    . 2015 2016 2017 47 48 49  . ""        ""        ""        ""      
"TYRHF"    . 2015 2016 2017 60 61 62  . ""        ""        ""        ""      
"WRYUT" 2014    . 2016 2017 52 53 54  . "Married" "Married" "Married" ""      
"BHNDF" 2014    . 2016 2017 51 52 53  . "Married" "Married" "Widowed" ""      
"AWESR" 2014 2015    . 2017 60 61 62  . ""        ""        ""        ""      
"MCNFR" 2014 2015    . 2017 54 55 57  . "Married" "Married" "Married" ""      
"HNJGB" 2014 2015    . 2017 58 59 60  . "Widowed" "Widowed" "Widowed" ""      
"MNBVC" 2014    .    . 2017 52 53  .  . ""        ""        ""        ""      
"VBCNF"    .    .    . 2017 76  .  .  . "Widowed" ""        ""        ""       
"XCDVS" 2014 2015 2016    . 52 53 54  . ""        ""        ""        ""      
"IOPUT" 2014 2015 2016    . 44 45 46  . "Married" "Married" "Married" ""      
"NMDRF" 2014 2015 2016    . 56 57 58  . "Married" "Married" "Widowed" ""      
"LKDLF" 2014 2015 2016    . 85 86 87  . "Married" "Married" "Married" ""      
"TYRUE"    . 2015 2016    . 81 82  .  . "Widowed" "Widowed" ""        ""      
"SRIUO"    . 2015 2016    . 57 58  .  . ""        ""        ""        ""      
"FCRVT"    .    . 2016    . 60  .  .  . "Widowed" ""        ""        ""      
"PLMIB"    .    . 2016    . 84  .  .  . "Widowed" ""        ""        ""      
"YHNBG" 2014 2015    .    . 43 44  .  . ""        ""        ""        ""      
"LIKMD" 2014 2015    .    . 44 45  .  . ""        ""        ""        ""      
"DDERF"    . 2015    .    . 65  .  .  . "Married" ""        ""        ""      
"VGVBF" 2014    .    .    . 71  .  .  . ""        ""        ""        ""      
"ASDFC" 2014    .    .    . 50  .  .  . ""        ""        ""        ""      
end

Essentially, I need to shift all of the age and marital status values to their correct column so they ‘line up’ according to whichever years a respective ID/person has data on. I was trying to do some sort of loop(s) that individually shifted incorrect values one cell to the right until they reached the correct location. However, my issue was twofold: 1. how to account for the different types of ‘shifts’ (1 cell to the right, 2 cells to the right, etc.) in the loop, and 2. How to code a break so the loop doesn’t shift cells when it doesn’t need to.

Code:

levelsof ID, local(Key)             // For 1 cell shifts
foreach p of local Key {                                                  
local c 6          
forvalues j = 7(-1)4 {
capture replace Age_Census_201`j' = Age_Census_201`c' if Census_201`j' != . & ID == "`p'"
capture replace Age_Census_201`j' = . Census_201`j' == . & ID == "`p'"
local `c--'            
if (Year`j' != . & Year`c' == .) {
continue
}
}
}
 
levelsof ID, local(Key)             // For 2 cell shifts
foreach p of local Key {                                                  
local c 5          
forvalues j = 7(-1)4 {
capture replace Age_Census_201`j' = Age_Census_201`c' if Census_201`j' != . & ID == "`p'"
capture replace Age_Census_201`j' = . Census_201`j' == . & ID == "`p'"
local `c--'            
if (Year`j' != . & Year`c' == .) {
continue
}
}
}

David

Tags: None

Clyde Schechter

Join Date: Apr 2014
Posts: 30174

01 Apr 2020, 20:23

It's funny. Nearly every time I see a post with "loop" in the title, it turns out that the solution doesn't involve loops at all. This is true here as well. The key here is the data layout: this task is close to impossible in wide layout, but it is only a tad tricky in long layout.

Code:

clear*
* Example generated by -dataex-. To install: ssc install dataex
clear
input str6 ID int(Census_2014 Census_2015 Census_2016 Census_2017 Age_Census_2014 Age_Census_2015 Age_Census_2016 Age_Census_2017) str14(Marital_Status_Census_2014 Marital_Status_Census_2015 Marital_Status_Census_2016 Marital_Status_Census_2017)
"URHDJ" 2014 2015 2016 2017 62 63 64 65 "Married" "Married" "Married" "Married"
"MKLRD" 2014 2015 2016 2017 58 59 60 61 "Married" "Married" "Married" "Married"
"BCGDT"    . 2015 2016 2017 53 54 55  . "Married" "Widowed" "Married" ""     
"AQWSD"    . 2015 2016 2017 46 47 48  . ""        ""        ""        ""     
"ZFHKA"    . 2015 2016 2017 47 48 49  . ""        ""        ""        ""     
"TYRHF"    . 2015 2016 2017 60 61 62  . ""        ""        ""        ""     
"WRYUT" 2014    . 2016 2017 52 53 54  . "Married" "Married" "Married" ""     
"BHNDF" 2014    . 2016 2017 51 52 53  . "Married" "Married" "Widowed" ""     
"AWESR" 2014 2015    . 2017 60 61 62  . ""        ""        ""        ""     
"MCNFR" 2014 2015    . 2017 54 55 57  . "Married" "Married" "Married" ""     
"HNJGB" 2014 2015    . 2017 58 59 60  . "Widowed" "Widowed" "Widowed" ""     
"MNBVC" 2014    .    . 2017 52 53  .  . ""        ""        ""        ""     
"VBCNF"    .    .    . 2017 76  .  .  . "Widowed" ""        ""        ""      
"XCDVS" 2014 2015 2016    . 52 53 54  . ""        ""        ""        ""     
"IOPUT" 2014 2015 2016    . 44 45 46  . "Married" "Married" "Married" ""     
"NMDRF" 2014 2015 2016    . 56 57 58  . "Married" "Married" "Widowed" ""     
"LKDLF" 2014 2015 2016    . 85 86 87  . "Married" "Married" "Married" ""     
"TYRUE"    . 2015 2016    . 81 82  .  . "Widowed" "Widowed" ""        ""     
"SRIUO"    . 2015 2016    . 57 58  .  . ""        ""        ""        ""     
"FCRVT"    .    . 2016    . 60  .  .  . "Widowed" ""        ""        ""     
"PLMIB"    .    . 2016    . 84  .  .  . "Widowed" ""        ""        ""     
"YHNBG" 2014 2015    .    . 43 44  .  . ""        ""        ""        ""     
"LIKMD" 2014 2015    .    . 44 45  .  . ""        ""        ""        ""     
"DDERF"    . 2015    .    . 65  .  .  . "Married" ""        ""        ""     
"VGVBF" 2014    .    .    . 71  .  .  . ""        ""        ""        ""     
"ASDFC" 2014    .    .    . 50  .  .  . ""        ""        ""        ""     
end

//  VERIFY THAT THE AMOUNT OF NON-MISSING DATA IS THE SAME FOR
//  YEARS, MARITAL STATUS AND AGES
foreach x in Census Age_Census Marital_Status_Census {
    egen m_`x' = rownonmiss(`x'*), strok
}
assert m_Census == m_Age_Census
assert m_Age_Census == m_Marital_Status_Census | m_Marital_Status_Census == 0
drop m_*

//  SPLIT THE DATA INTO TWO SETS, ONE HAVING THE YEARS
//  AND THE OTHER HAVING THE AGE AND MARITAL VARIABLES
frame put ID Age_Census* Marital*, into(age_marital)
drop Age_Census* Marital*
reshape long Census_, i(ID)
drop if missing(Census_)
by ID: gen seq = _n

frame change age_marital
reshape long Age_Census_ Marital_Status_Census_, i(ID) j(year)
drop if missing(Age_Census_)
by ID: gen seq = _n
count

frame change default
frlink 1:1 ID seq, frame(age_marital)
frget Age_Census_ Marital_Status_Census_, from(age_marital)

//  SEE DISCUSSION; IT IS PROBABLY BEST TO LEAVE IT EXACTLY THE WAY IT IS NOW
//  BUT IF YOU WANT TO REVERT TO WIDE LAYOUT
drop seq age_marital
reshape wide Census_ Age_Census_ Marital_Status_, i(ID) j(_j)
order Marital_Status_Census*, sequential after(ID)
order Age_Census*, sequential after(ID)
order Census*, sequential after(ID)

Just as this task was facilitated (made possible) by going to long layout, most of whatever you plan to do with the data from this point on will also be easier, or perhaps only possible, in long layout. So I recommend that you omit the code that comes after the -frget- command. But if you are certain that you will be doing some of those rare things that are easier in wide layout than long in Stata, the remainder of the code show how to get back there.

Comment

David Kapaon

Join Date: Nov 2019

Posts: 30
#3

02 Apr 2020, 08:49

For whatever reason - (maybe because they look flashy?) - I seem to think loops can solve any problem. I guess I need to start breaking myself of that habit...
Thank you for the help Clyde! It's a good thing you're so good at Stata because I would have never ever ever came up with that solution! I would have been trying loops till the end of time
Thanks again!
David
Comment

Announcement

Mirroring Variables using Loops

Comment

Comment