multencode: similar command that creates only one new output variable?

Kate Dubberley

Join Date: Nov 2018

Posts: 9
#1

21 May 2019, 10:14

Hello everyone,

Description (taken from Stata help):

multencode creates new numeric variables newvarlist, with value labels defined and attached, based on the string variables strvarlist. The same set of value labels is used for all the new variables. By default a new set of labels will be created with the same name as the first variable in strvarlist. Optionally, a name may be specified using the label() option. In either case, no existing set of value labels with the same name will be used, unless the force option is specified, in which case those value labels will be over-written. In this way, the user is assured that the new variables will have the same alphabetically ordered set of value labels, provided as usual that the request does not breach any limits that apply.

This is from SSC.

Question:
Is there a command that creates one new variable based on multiple string variables? The definition of such a command may read as follows:

-[command]- creates a new numeric variable newvar, with value labels defined and attached, based on the string variables strvarlist.

I'm assuming code is not required for this discussion; however, I am happy to provide it if requested.

Thank you and kind regards,
Kate

Last edited by Kate Dubberley; 21 May 2019, 10:56.
Nick Cox

Join Date: Mar 2014

Posts: 35694
#2

21 May 2019, 10:40

multencode is from SSC. Please note FAQ Advice #12 and #18.

egen, group() could work well here. See https://www.stata-journal.com/sjpdf....iclenum=dm0034 as one miniature survey.
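
As a minimal sketch of the egen, group() suggestion, assuming two hypothetical string variables s1 and s2 on toy data (neither the names nor the values come from the thread), the following creates one numeric variable with value labels from several string variables:

```stata
* Toy data; s1, s2 and combo are hypothetical names used only for illustration
clear
input str1 s1 str1 s2
"a" "x"
"a" "y"
"b" "x"
"a" "x"
end

* One numeric variable indexing each distinct combination of s1 and s2;
* the label option attaches value labels built from the original strings
egen combo = group(s1 s2), label
tabulate combo
```

Each distinct combination of the strings gets its own integer code, so this suits the question as first posed rather than a yes/no indicator.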

Kate Dubberley

Join Date: Nov 2018
Posts: 9

22 May 2019, 12:03

Hi Nick, thank you for the quick reply and clarifying the multencode and encode commands. After reading the document you cited, it became clear that I should have included more information.

I'm working with Stata version 14.1 with a wide, line-level database of 98 variables and 7,503 observations.

44 of these variables are string variables named diagnosis1, diagnosis2, diagnosis3 and so on up to diagnosis44. These contain mutually exclusive ICD10-CA codes.
Fall-related diagnoses always begin with 'W', from the W00 ICD block from ICD chapter XX. They always contain three characters.
A fall-related diagnosis can be in any of the diagnosis variables (i.e. 1 to 44).
- In order to write code for routine reporting, I do not want to limit my code to include diagnosis occurrences up to only 44. That is, in a future extract, although unlikely, I could have a patient with more than 44 diagnoses.

My goal is to create a binary, numeric Fall variable (fall-related diagnosis or no fall-related diagnosis) per patient in order to generate counts for rates and, at times, to keep or drop patients. For example, a patient either has a fall-related diagnosis, regardless of the occurrence, or they do not. The ideal code will handle any number of diagnosis occurrences and generate one output variable.

Unless I'm mistaken, the encode command does not appear to be relevant here as I want one output variable (i.e., Fall).
The below multencode attempt resulted in a type mismatch error
- Code:
```
 multencode diagnosis1-diagnosis44 if "W", gen(fall)
```

I have also tried more "simplified", albeit cumbersome, commands and functions, such as the below but they result in no counts of falls.

Code:

gen fall=0.
        replace fall=1 if diagnosis1 == "W" | diagnosis2 == "W" | diagnosis3 == "W" | diagnosis4 == "W" | diagnosis5 == "W"    | diagnosis6 == "W" | diagnosis7 == "W" | diagnosis8 == "W" | diagnosis9 == "W" | diagnosis10 == "W" | diagnosis11 == "W" | diagnosis12 == "W" | diagnosis13 == "W" | diagnosis14 == "W" | diagnosis15 == "W" | diagnosis16 == "W"  | diagnosis17 == "W" | diagnosis18 == "W" | diagnosis19 == "W" | diagnosis20 == "W"  | diagnosis21 == "W" | diagnosis22 == "W" | diagnosis23 == "W" | diagnosis24 == "W"  | diagnosis25 == "W" | diagnosis26 == "W"  | diagnosis27 == "W" | diagnosis28 == "W"  | diagnosis29 == "W" | diagnosis30 == "W"  | diagnosis31 == "W" | diagnosis32 == "W"  | diagnosis33 == "W" | diagnosis34 == "W"  | diagnosis35 == "W" | diagnosis36 == "W"  | diagnosis37 == "W" | diagnosis38 == "W"  | diagnosis39 == "W" | diagnosis40 == "W"  | diagnosis41 == "W"  | diagnosis42 == "W"  | diagnosis43 == "W" | diagnosis44 == "W"  
    
label define falllabel 1 "Fall" 0 "OtherICD", modify
  label values fall falllabel
  label variable fall "Fall"

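. tab fall

       fall |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      7,503      100.00      100.00
------------+-----------------------------------
      Total |      7,503      100.00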

I have also tried the below, but I won't share any more attempts because it would make this post unnecessarily long.

Code:

gen fall=1 if  strpos(diagnosis1, "W*") > 0 | strpos(diagnosis2, "W*") > 0 | strpos(diagnosis3, "W*") > 0 | strpos(diagnosis4, "W*") > 0 | strpos(diagnosis5, "W*") > 0 | strpos(diagnosis6, "W*") > 0 | strpos(diagnosis7, "W*") > 0 | strpos(diagnosis8, "W*") > 0 | strpos(diagnosis9, "W*") > 0 | strpos(diagnosis10, "W*") > 0 | strpos(diagnosis11, "W*") > 0 | strpos(diagnosis12, "W*") > 0 | strpos(diagnosis13, "W*") > 0 | strpos(diagnosis13, "W*") > 0 | strpos(diagnosis14, "W*") > 0 | strpos(diagnosis15, "W*") > 0 | strpos(diagnosis16, "W*") > 0 | strpos(diagnosis17, "W*") > 0 | strpos(diagnosis18, "W*") > 0 | strpos(diagnosis19, "W*") > 0 | strpos(diagnosis20, "W*") > 0 | strpos(diagnosis21, "W*") > 0 | strpos(diagnosis22, "W*") > 0 | strpos(diagnosis23, "W*") > 0 | strpos(diagnosis24, "W*") > 0 | strpos(diagnosis25, "W*") > 0 | strpos(diagnosis26, "W*") > 0 | strpos(diagnosis27, "W*") > 0 | strpos(diagnosis28, "W*") > 0 | strpos(diagnosis29, "W*") > 0 | strpos(diagnosis30, "W*") > 0 | strpos(diagnosis31, "W*") > 0 | strpos(diagnosis32, "W*") > 0 | strpos(diagnosis33, "W*") > 0 | strpos(diagnosis34, "W*") > 0 | strpos(diagnosis35, "W*") > 0 | strpos(diagnosis36, "W*") > 0 | strpos(diagnosis37, "W*") > 0 | strpos(diagnosis38, "W*") > 0 | strpos(diagnosis39, "W*") > 0 | strpos(diagnosis40, "W*") > 0 | strpos(diagnosis41, "W*") > 0 | strpos(diagnosis42, "W*") > 0 | strpos(diagnosis43, "W*") > 0 | strpos(diagnosis44, "W*") > 0
  tab fall
  **no obs

Code:

gen fall = diagnosis1-diagnosis44 if strmatch(diagnosis1-diagnosis44, "W*")
  **error type mismatch

Code:

gen fall = diagnosis* if strmatch(diagnosis*, "W*")
  **error diagnosis ambiguous abbreviation

Kind regards,
Kate


Nick Cox

Join Date: Mar 2014

Posts: 35694
#4

22 May 2019, 13:26

I suggest

Code:

gen wanted = 0
quietly forval j = 1/44 {
    replace wanted = substr(diagnosis`j', 1, 1) == "W" if !wanted
}

or

Code:

gen wanted = 0
quietly forval j = 1/44 {
    replace wanted = max(wanted, substr(diagnosis`j', 1, 1) == "W")
}

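What's wrong with your code? Precisely no values will be exactly "W" or literally contain "W*" (strpos() is entirely literal; it doesn't use any pattern matching syntax at all).

strmatch() does accept the wildcards * and ? in its pattern (second) argument, but its first argument must be a single string expression; it doesn't take a varlist such as diagnosis* or diagnosis1-diagnosis44.

In an expression to be evaluated, diagnosis1-diagnosis44 is the difference between diagnosis1 and diagnosis44, not a range of variables. Subtraction is not defined for strings, hence the type mismatch.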
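
To make the difference between literal and pattern matching concrete, here is a small sketch using the literal value "W01" as a stand-in for a diagnosis code (the value is only an illustration). Note that strmatch() does interpret * and ? in its second argument; the attempts above failed because a varlist was passed where a single string expression is needed.

```stata
* strpos() looks for the literal two-character text "W*", which is absent
display strpos("W01", "W*")

* strmatch() treats * in its second argument as a wildcard
display strmatch("W01", "W*")

* Comparing the first character only
display substr("W01", 1, 1) == "W"

* Comparing the whole three-character string with "W" is false
display "W01" == "W"
```

The four lines display 0, 1, 1 and 0 respectively.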
Kate Dubberley

Join Date: Nov 2018

Posts: 9
#5

22 May 2019, 15:30

Hi Nick,

Extremely helpful and it worked, thank you!

Is there any way to adapt the code you provided

Code:

gen wanted = 0
quietly forval j = 1/44 {
    replace wanted = max(wanted, substr(diagnosis`j', 1, 1) == "W")
}

so that I can capture diagnoses between W00 and W19? (This is actually the fall sub-chapter I need, there was an error above - the chapter should have read "Other external causes of accidental injury". Nevertheless, the code you provided I still require.)

I tried the following, changing what I understood to be the length of n2 (substr(s,n1,n2); the substring of s, starting at n1, for a length of n2)

Code:

gen wanted = 0
quietly forval j = 1/44 {
    replace wanted = substr(diagnosis`j', 1, 3) == "W" if !wanted
}

However, it returned zero observations. I expect this is an issue related to substr being literal, am I correct?

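     wanted |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      7,503      100.00      100.00
------------+-----------------------------------
      Total |      7,503      100.00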

I also tried this, but received a type mismatch error (noting the 'W''s would go up to W19):

Code:

quietly forval j = 1/44 {
    replace wanted = substr(diagnosis`j', 1, 3) == "W00" | "W01" | "W02" | "W03" | "W04" if !wanted
}

Last, is there any way to have an infinite number of diagnosis variables (i.e. beyond the 44) or shall I just make it 99 knowing that the overflow of occurrences will never be that high?

I see now my errors with strmatch() and diagnosis1-diagnosis44; thank you for explaining this.

Thank you,
Kate
Nick Cox

Join Date: Mar 2014

Posts: 35694
#6

22 May 2019, 23:33

The first three characters of a string three or more characters long are never going to be equal to "W" any more than any number between 100 and 999 is ever going to equal 7. As said, strpos() is literal, and a comparison with == is just as exact. I think you are seeing that, however.

This may be what you want:

Code:

gen wanted = 0
quietly forval j = 1/44 {
    replace wanted = max(wanted, inrange(diagnosis`j', "W00", "W19"))
}
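
On the question of not hard-coding 44: one possible approach, sketched here on toy data rather than taken from the thread, is to loop over whatever variables currently match diagnosis* instead of over the numbers 1 to 44. The toy values, the local macro name dxvars and the use of ds to keep only string variables are assumptions for illustration; the labels reuse those defined in #3.

```stata
* Toy data standing in for the real extract (values are made up)
clear
input str3 diagnosis1 str3 diagnosis2 str3 diagnosis3
"W01" "I10" ""
"J45" "W19" "E11"
"W20" "X59" ""
"I10" ""    ""
end

* Collect every string variable whose name starts with diagnosis,
* however many there happen to be in this extract
quietly ds diagnosis*, has(type string)
local dxvars `r(varlist)'

* Flag a patient if any diagnosis lies between W00 and W19
gen byte fall = 0
foreach v of local dxvars {
    quietly replace fall = 1 if inrange(`v', "W00", "W19")
}

label define falllabel 0 "OtherICD" 1 "Fall", modify
label values fall falllabel
tabulate fall
```

Here the first two toy patients are flagged and the last two are not (W20 lies outside the range). Like the loop above, this relies on the codes being exactly three characters long, since a longer code such as "W19.0" sorts after "W19".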

Note that something like

Code:

substr(diagnosis`j', 1, 3) == "W00" | "W01"

is never going to be interpreted as

Code:

substr(diagnosis`j', 1, 3) == "W00" | substr(diagnosis`j', 1, 3) == "W01"

Stata parses that as

Code:

(substr(diagnosis`j', 1, 3) == "W00") | "W01"

The first expression is evaluated as 0 or 1 and thus has numeric value, but the second expression is just a string, hence the type mismatch.
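
For a short list of codes, inlist() is one way to write the intended multi-way comparison without repeating substr(). This is a sketch on literal values rather than the real data. As far as I recall, inlist() accepts at most 10 arguments when they are strings, so the full W00 to W19 list would not fit in a single call, which is one reason inrange() is convenient here.

```stata
* 1 if the first argument equals any of the later arguments, 0 otherwise
display inlist("W03", "W00", "W01", "W02", "W03", "W04")
display inlist("W07", "W00", "W01", "W02", "W03", "W04")
```

The first line displays 1 and the second displays 0.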

See also https://www.stata-journal.com/articl...article=dm0058
Kate Dubberley

Join Date: Nov 2018

Posts: 9
#7

18 Feb 2020, 15:00

Nick, thank you for this. I've only just been able to confirm that your code works.