Create two new variables based on logic of string variables

Blake Lee

Join Date: Sep 2021
Posts: 11

Create two new variables based on logic of string variables

29 Oct 2021, 15:04

Hello

Please consider my data:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str4 Y1 str11(Option1 Option2 Option3) str9 Option4 str11(Option5 Option6 Option7 Option8 Option9 Option10 Option11 Option12 Option13 Option14 Y2) str9 Y3
"Cow"  "Cow milk"    "Cow grass"   "Cow grass"   "Cow grass" "Iced coffee" "Iced coffee" "Iced coffee" "Iced coffee" "Iced coffee" "Iced coffee" "Iced coffee" "Iced coffee" "Iced coffee" "Iced coffee" "Cow grass"   ""         
"Cow"  "Cow milk"    "Cow milk"    "Cow grass"   "Cow milk"  "Iced tea"    "Iced tea"    "Iced tea"    "Iced tea"    "Iced tea"    "Iced tea"    "Iced tea"    "Iced tea"    "Iced tea"    "Iced tea"    "Cow milk"    ""         
"Cow"  "Cow milk"    "Cow milk"    "Cow grass"   "Cow grass" "Iced coffee" "Iced coffee" "Iced coffee" "Iced coffee" "Iced coffee" "Iced coffee" "Iced coffee" "Iced coffee" "Iced coffee" "Iced coffee" "Cow milk"    ""         
"Iced" "Iced coffee" "Iced coffee" "Iced tea"    "Iced tea"  "Cow milk"    "Cow milk"    "Cow milk"    "Cow milk"    "Cow milk"    "Cow grass"   "Cow grass"   "Cow grass"   "Cow grass"   "Cow grass"   "Iced coffee" "Cow milk" 
"Iced" "Iced coffee" "Iced tea"    "Iced tea"    "Iced tea"  "Cow milk"    "Cow milk"    "Cow milk"    "Cow milk"    "Cow milk"    "Cow milk"    "Cow grass"   "Cow grass"   "Cow grass"   "Cow grass"   "Iced tea"    "Cow milk" 
"Iced" "Iced coffee" "Iced coffee" "Iced coffee" "Iced tea"  "Cow grass"   "Cow grass"   "Cow grass"   "Cow grass"   "Cow grass"   "Cow grass"   "Cow milk"    "Cow milk"    "Cow milk"    "Cow milk"    "Iced coffee" "Cow grass"
end

The goal is to create variables Y2 and Y3 seen in the data based on the variables Option*.

Y2:
If Y1 is 'Cow', Y2's value should begin with 'Cow' and would be the more common one between the two options, 'Cow milk' and 'Cow grass'. If it is an equal number like in the third observation, the first value (amongst Options 1 to 4) is chosen.
If Y1 is 'Iced', Y2's value should begin with 'Iced' and would be the more common one between the two options, 'Iced coffee' and 'Iced tea'. If it is an equal number like in the sixth observation, the first value (amongst Options 1 to 4) is chosen.

Y3:
If Y1 is 'Cow', Y3's value would be a blank.
If Y2 is 'Iced', Y3's value would be the most common (amongst Options 1 to 14). If there is an equal number like in the fourth observation, the first value (amongst Options 5 to 14) is chosen.

I have only been able to get Y3's value if Y1 is 'Cow' with

Code:

replace Y3 = "" if strpos(Y1,"Co") > 0

Thank you.

Tags: None

Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#2

29 Oct 2021, 16:08

Well, part of the reason you are not making headway on Y3 is because your criteria for it don't make sense. Y2 can only take on values from the original Option* variables, so it is never "Iced" It mightbe "Iced tea" or "Iced coffee," but it cannot be "Iced." Based on analogy with how you defined Y2, I wonder if you meant to say: "If Y1 is 'Iced', Y3's value would be...."

While we're clarifying, are you certain that you want Y3's value to be the most common of all among all of Options 1 to 14, but in the event of a tie there, you want the one that comes first among only options 5 to 14?

And one more question: In the example data, Option1 through Option4 are always precisely the ones that start with the word that appears in Y1. Is this true in your real data? I would like to rely on that assumption in writing code--but that obviously wouldn't be a good idea if it's just a fluke of this example.
Comment
Blake Lee

Join Date: Sep 2021

Posts: 11
#3

30 Oct 2021, 07:18

Thank you Clyde Schechter .

You are right Y2 can take on values only from the original Option* variables. If Y1 is 'Cow', Y2 has to be either 'Cow milk' or 'Cow grass'. Y2 would then be the more frequent between the two. So Y2 for the first observation would be 'Cow grass' because it appears three times compared with 'Cow milk' which appears only once.

For Y3, I need the value to be the most frequent from Options 1 to 14. If there is a tie, it would simply be the first one which appears looking from left to right. For example in observation 4, 'Cow milk' and 'Cow grass' both appears five times and both are the most frequent but 'Cow milk' appears first, so Y3 would be 'Cow milk'.

Option 1 to Option 4 do not always have to start with the word which appears in Y1. They can be in the mix from Options 1 to 14. (I did it that way so it would be easier to illustrate my example which I hoped it did.)

Thank you for your help.
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30117

30 Oct 2021, 10:41

Great, much clearer!

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str4 Y1 str11(Option1 Option2 Option3) str9 Option4 str11(Option5 Option6 Option7 Option8 Option9 Option10 Option11 Option12 Option13 Option14)
"Cow"  "Cow milk"    "Cow grass"   "Cow grass"   "Cow grass" "Iced coffee" "Iced coffee" "Iced coffee" "Iced coffee" "Iced coffee" "Iced coffee" "Iced coffee" "Iced coffee" "Iced coffee" "Iced coffee"
"Cow"  "Cow milk"    "Cow milk"    "Cow grass"   "Cow milk"  "Iced tea"    "Iced tea"    "Iced tea"    "Iced tea"    "Iced tea"    "Iced tea"    "Iced tea"    "Iced tea"    "Iced tea"    "Iced tea"   
"Cow"  "Cow milk"    "Cow milk"    "Cow grass"   "Cow grass" "Iced coffee" "Iced coffee" "Iced coffee" "Iced coffee" "Iced coffee" "Iced coffee" "Iced coffee" "Iced coffee" "Iced coffee" "Iced coffee"
"Iced" "Iced coffee" "Iced coffee" "Iced tea"    "Iced tea"  "Cow milk"    "Cow milk"    "Cow milk"    "Cow milk"    "Cow milk"    "Cow grass"   "Cow grass"   "Cow grass"   "Cow grass"   "Cow grass"  
"Iced" "Iced coffee" "Iced tea"    "Iced tea"    "Iced tea"  "Cow milk"    "Cow milk"    "Cow milk"    "Cow milk"    "Cow milk"    "Cow milk"    "Cow grass"   "Cow grass"   "Cow grass"   "Cow grass"  
"Iced" "Iced coffee" "Iced coffee" "Iced coffee" "Iced tea"  "Cow grass"   "Cow grass"   "Cow grass"   "Cow grass"   "Cow grass"   "Cow grass"   "Cow milk"    "Cow milk"    "Cow milk"    "Cow milk"   
end


gen long obs_no = _n
reshape long Option, i(obs_no) j(order)
split Option, gen(option_word_)
gen byte match = (Y1 == option_word_1)

by obs_no match Option (order), sort: gen count = _N
gsort obs_no match count -order
by obs_no: gen Y2 = Option[_N]

gsort obs_no count -order
by obs_no: gen Y3 = Option[_N]
replace Y3 = "" if Y1 == "Cow"

//  IF YOU WANT TO RETURN TO WIDE LAYOUT
drop option_word_* match count
reshape wide

As with most things in Stata, this is not very difficult when the data are in long layout, but extremely complicated in wide. So the solution begins with a -reshape- of the data. If you want to go back to wide layout, the code for that is at the end. But it is likely that whatever else you want to do with this data will also be facilitated by staying in long layout; Stata has only a handful of commands that work best with wide layout. So think carefully about it.

In your clarification in #3, you appear to have dropped the provision that Y3 should be "blank" (which I interpreted to mean missing value ""), but I retained it in the final line before the code for returning to wide layout. If you don't want that provision, you can just drop that -replace Y3 = ""...- command.

Comment

Blake Lee

Join Date: Sep 2021

Posts: 11
#5

30 Oct 2021, 10:53

Thank you very much Clyde Schechter . I had reshaped my data to long but could not figure how to decide when the most common value was a tie. Yours is fantastic. Thank you.
Comment

Announcement

Create two new variables based on logic of string variables

Comment

Comment

Comment

Comment