Generating new variables based on a variable naming convention

Marcus Green

Join Date: Mar 2019

Posts: 4
#1

Generating new variables based on a variable naming convention

05 Mar 2019, 08:36

Dear Statalist,

I hope this finds you well - my first post so please be gentle.
I regularly use data-sets which contain a very methodical naming convention and I'm trying to write some script to combine them.

The naming convention is variable_e1_c1 (incomplete observations 1-10), variable_e1_c2 (incomplete observations 11-20), variable_e1_c3 (incomplete observations 21-30), variable_e1_c4 (incomplete observations 31-40) where the variables in the data set I want to combine will all contain the same variable-name_e-number. I've started with a few approaches, however, I don't know how to pass the variable name into logical test after running say:

- ds, has(varl *_e1*).
If varname == "*_c*" {then do}.

I've also tried the approach with:

- foreach x of varlist *_e2_* {

but again I'm lost as to how to apply logic gates to the value of `x' as it takes the value of the first observation where .

Ultimately I'm keen on generating a new variable where all the observations take a similar convention and I currently use the tedious approach of copy and paste a few lines of code as such:

gen Bar=.
replace Bar= real(biadlyn_e1_c1) if biadlyn_e1_c2==.
replace Bar= biadlyn_e1_c2 if real(biadlyn_e1_c1)==.

I apologies is this thread has been covered and would appreciate a link and also do note although I'm not new to programming I am new to STATA and it's handling power so any advice would be most welcome.

Kind regards
Tags: None

Maarten Buis

Join Date: Mar 2014
Posts: 3467

05 Mar 2019, 08:58

Code:

sysuse auto, clear

// get me a list of all variables with names containing the letter m
unab mlist : *m*
di `"|`mlist'|"'

// get me a list of all variables with names containing the letter e
unab elist : *e*
di `"|`elist'|"'

// get the intersection of these lists (see: help macrolists)
local combined_list : list mlist & elist
di `"|`combined_list'|"'

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------

Comment

Marcus Green

Join Date: Mar 2019

Posts: 4
#3

11 Mar 2019, 11:58

Thanks Maarten,

I've started implementing the code and have gotten to the next gap in knowledge so far I've got this:

unab camlist : *_E2_*
unab datlist : BRTHDAT*
local combined_list : list camlist & datlist
egen BRTHDAT = rowmax("`combined_list'")

The egen command is the step which I'm struggling with I essentially would want to count through the macro list (combined_list) and assign the variables to a complete variable (BRTHDAT) with the command and augment this

I also want to run this for all variables in the database so envisaged running it thought a loop and a variable naming truncation:

forvalues i = 1/9{
foreach n of varlist *_E`i'_*{
local vargen = `"regexs(0) regexm("`n'", "[A-Z][A-Z][A-Z][A-Z][A-Z][A-Z]")"'
display `vargen'
}
}

A sample of the birthday fields below shows the naming convention which I'm keen on amalgamating (to generate a BRTHDAT_E2 = BRTHDAT_E2_C15 + BRTHDAT_E2_C16 + BRTHDAT_E2_C19 + BRTHDAT_E2_C20 if BRTHDAT == . output)

BRTHDAT_E1_C1
BRTHDAT_E1_C2
BRTHDAT_E2_C15
BRTHDAT_E2_C16
BRTHDAT_E2_C19
BRTHDAT_E2_C20
BRTHDAT_E3_C21
BRTHDAT_E3_C22
BRTHDAT_E3_C25
BRTHDAT_E3_C26
BRTHDAT_E4_C27
BRTHDAT_E4_C28
BRTHDAT_E4_C32
BRTHDAT_E4_C33
BRTHDAT_E5_C34
BRTHDAT_E5_C35
BRTHDAT_E5_C43
BRTHDAT_E5_C44
BRTHDAT_E6_C45
BRTHDAT_E6_C46

The combination of the two codes at this point isn't even close to working - any assistance would be really great. I feel I'm chasing my tale with the macro tags atm `""' / `' / "`'" / ``'' / ``""'' ....

Thanks again for your time,
kind regards,

Marcus
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35793
#4

11 Mar 2019, 12:27

I find this very hard to follow. egen, rowmax() is not for string manipulations. It's for finding numerical maxima, across observations, of several variables.

I can't see that the variables named in #3 have any relation to what's in #1. Here's a very wild guess at what you're aiming at

Code:

forval j = 1/6 { egen BRTHDAT_E`j' = rowtotal(BRTHDAT_E`j'_C*) }

If that is way off, and you don't get a better answer, http://xyproblem.info/ seems to apply.

1. Tell us about your data. Give a data example. If your data are complicated, give us a simplified analogue that shows the problem.

2. Tell us what you want to do. The clearest part of #3 is, or seems to be, that you want add variables of a certain kind, given by a shared name prefix.

As you say, you have programming knowledge, but only some of will apply literally to Stata and you don't know which part at present.
Comment

Marcus Green

Join Date: Mar 2019
Posts: 4

13 Mar 2019, 09:26

Thanks for following this up Nick and apologies for causing any frustration.

Your code certainly helps with some elements of the dataset - and you're right I should have been more descriptive about my data-set and problem so apologies for that.

- So the data I have is fragmented, across a number of variables in the data set, I want to combine (or de-fragment these variables into one or fewer variables) all the variables contain the same variable name (I've called variablecode) an event number(I've called En₁) and a version or capture number(I've called Cn₂). Across the database it has the repeated naming convention:- VARIABLECODE_En₁_Cn₂(Where in the example above post#3 - the variables downloaded from Openclinica are BRTHDAT_En₁_Cn₂).

- As you've shown above (in post #4) I'm keen to capture all of the observations in BRTHDAT_E1 from the BRTHDAT_E1_C* variables - BRTHDAT_En₁ should contain all of the Birthday Observations (not the sum of the rows - apologies again) which will be recorded in the Cn₂convention.

- I've started with the variables for BRTHDAT and BIADLYN and to help I'll give a visual key of what I have and how I'm keen to transform the dataset:

Table 1 - The data I have

BRTHDAT_E4_C28	BRTHDAT_E4_C32	BRTHDAT_E5_C34	BRTHDAT_E5_C35	BRTHDAT_E5_C43	BIADLYN_E1_C1	BIADLYN_E1_C2	BIADLYN_E5_C34	BIADLYN_E5_C35
01/11/2025				01/11/2025		1		1
01/12/2022	01/12/2022		01/12/2022	01/12/2022		1		1
01/03/2027		01/03/2027	01/03/2027			1		1
01/07/2014				01/07/2014		1
04/06/2019		04/06/2019	04/06/2019			1		1
03/08/2019			03/08/2019	03/08/2019		1		1
03/09/2019		03/09/2019	03/09/2019			1		1
01/07/2013				01/07/2013		1
01/09/2020					1
10/09/2019		10/09/2019	10/09/2019		1		1
08/08/2019		08/08/2019	08/08/2019		1		1
01/07/2025		01/07/2025	01/07/2025		1		1
01/10/2019			01/10/2019	01/10/2019	1
01/03/2027	01/03/2027		01/03/2027	01/03/2027	1		1
01/06/2028		01/06/2028	01/06/2028		1		1
01/06/2027				01/06/2027	1

Table 2: The output of I'm keen to receive:

BRTHDAT_E4	BRTHDAT_E5	BIADLYN_E1	BIADLYN_E5
01/11/2025	01/11/2025	1	1
01/12/2022	01/12/2022	1	1
01/03/2027	01/03/2027	1	1
01/07/2014	01/07/2014	1
04/06/2019	04/06/2019	1	1
03/08/2019	03/08/2019	1	1
03/09/2019	03/09/2019	1	1
01/07/2013	01/07/2013	1
01/09/2020		1
10/09/2019	10/09/2019	1	1
08/08/2019	08/08/2019	1	1
01/07/2025	01/07/2025	1	1
01/10/2019	01/10/2019	1
01/03/2027	01/03/2027	1	1
01/06/2028	01/06/2028	1	1

- I was keen to understand if there are/is a program or function in STATA which can transform all of my data like this i.e. loop for all 1700+ variables and defragment into VARIABLECODE_En1. from VARIABLECODE_En₁_Cn₂.

My thought process was to:
- loop through *_E`1 - 9'_* which is the only sequential part to the naming convention.

- The second thought was to truncate the VARIABLECODE out from the variable name to generate a new variable.

- Before combining, "defragmenting" the various versions of the variable VARIABLECODE_En1_C* together

Thanks again for your time on my problem and sincerely, apologies for confusing a few of my ideas, scripts and concepts.
Kind regards,

Marcus

Announcement