Generating an indicator variable to denote the highest value from a set of variables in an observation

Sara Saltzer

Join Date: Mar 2022

Posts: 6
#1

05 Mar 2022, 21:06

I am working with a dataset from a poll of registered voters. For each observation (one voter), respondents were asked to use "feeling thermometers" to indicate their attitudes towards a series of different candidates. Each feeling thermometer variable is coded as a numeric variable with an integer value of 1, 2, 3, or 4. I am looking for a way to create an indicator variable for each observation to denote which feeling thermometer variable has the highest value (in order to eventually determine a correlation between positive attitudes and voting behavior). I haven't been able to figure out how to create such an indicator variable with the limited Stata knowledge I currently have. I have tried using the rowsort command and generating new variables, which is close to what I want to do, but I don't simply want the numeric values of the variables; rather, I want to identify which variable has that maximum value. Advice appreciated!

Last edited by Sara Saltzer; 05 Mar 2022, 21:42.
Clyde Schechter

Join Date: Apr 2014

Posts: 30148
#2

05 Mar 2022, 22:37

I think I grasp the gist of what you want to do, but I doubt anybody can help you (and I'm sure I can't) without example data to work with. At best I could give you some generic advice that would probably prove unhelpful as so much depends on the details of how your data is organized. Please use the -dataex- command to post a representative example from your Stata data set. Be sure to explain which variables correspond to the various things that play a role in your problem statement.

If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

When asking for help with code, always show example data. When showing example data, always use -dataex-.
Nick Cox

Join Date: Mar 2014

Posts: 35756
#3

06 Mar 2022, 01:36

Clyde Schechter gives excellent advice, to which I will add a guess.

rowsort is from the Stata Journal (FAQ Advice #12)

SJ-20-2 pr0046_1 . . . . . . . . . . . Speaking Stata: More ways for rowwise
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . N. J. Cox
Q2/20 SJ 20(2):481--488 (no commands)
focuses on returning which variable or variables are equal
to the maximum or minimum in a row

SJ-9-1 pr0046 . . . . . . . . . . . . . . . . . . . Speaking Stata: Rowwise
(help rowsort, rowranks if installed) . . . . . . . . . . . N. J. Cox
Q1/09 SJ 9(1):137--157
shows how to exploit functions, egen functions, and Mata
for working rowwise; rowsort and rowranks are introduced

Why didn't I discuss what you want in pr0046? One reason was ties. What should happen if voters give the same maximum to two or more candidates?

Anyway, https://www.stata-journal.com/articl...ticle=pr0046_1 may help. If it's behind a paywall, email the author for a copy.

Sara Saltzer

Join Date: Mar 2022
Posts: 6
#4

06 Mar 2022, 15:04

Thank you both! Here is a sample of the data--for example, I would want to create an indicator to show which of fav_biden_2019Nov, fav_sanders_2019Nov, and fav_warren_2019Nov has the highest value.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input byte(fav_biden_2019Nov fav_sanders_2019Nov fav_warren_2019Nov)
2 2 1
1 3 1
2 2 1
1 1 1
2 3 1
1 1 1
3 2 1
2 2 1
4 4 5
1 1 1
4 2 2
4 4 4
4 4 4
4 4 4
4 3 4
5 2 1
2 2 1
3 3 2
4 4 4
2 2 1
end
label values fav_biden_2019Nov Q8_f_2019Nov
label def Q8_f_2019Nov 1 "Very favorable", modify
label def Q8_f_2019Nov 2 "Somewhat favorable", modify
label def Q8_f_2019Nov 3 "Somewhat unfavorable", modify
label def Q8_f_2019Nov 4 "Very unfavorable", modify
label def Q8_f_2019Nov 5 "Don't know", modify
label values fav_sanders_2019Nov Q8_g_2019Nov
label def Q8_g_2019Nov 1 "Very favorable", modify
label def Q8_g_2019Nov 2 "Somewhat favorable", modify
label def Q8_g_2019Nov 3 "Somewhat unfavorable", modify
label def Q8_g_2019Nov 4 "Very unfavorable", modify
label values fav_warren_2019Nov Q8_h_2019Nov
label def Q8_h_2019Nov 1 "Very favorable", modify
label def Q8_h_2019Nov 2 "Somewhat favorable", modify
label def Q8_h_2019Nov 4 "Very unfavorable", modify
label def Q8_h_2019Nov 5 "Don't know", modify

I'm frankly not sure how I would handle ties at the moment; it's probably alright if the new indicator variable has multiple values, I believe I could clean that up and use it for the rest of my analysis later (though I may return to Statalist with my questions at that point)!

Last edited by Sara Saltzer; 06 Mar 2022, 15:09.


Nick Cox

Join Date: Mar 2014

Posts: 35756
#5

06 Mar 2022, 15:17

You can’t have a single indicator having multiple values — at most you can have multiple indicators, one for each candidate.
Clyde Schechter

Join Date: Apr 2014

Posts: 30148
#6

06 Mar 2022, 15:19

Umm, more clarification needed here. By "highest value" would you be referring to numbers nearest to 1 representing the most favorable value, or numbers nearest to 4 representing the highest numerical value in the variable's numeric coding scheme for an expressed evaluation (least favorable value), or do we even include 5 which is, in effect, a missing value? I ask because there are substantial differences in how one would code this.

Added: Crossed with #4.

Last edited by Clyde Schechter; 06 Mar 2022, 15:22.
Sara Saltzer

Join Date: Mar 2022

Posts: 6
#7

06 Mar 2022, 19:27

Thank you both for the clarification! The "highest" value would be referring to the most favorable--my apologies for the confusion there, it is indeed the lowest numeric value. Additionally, I would not want to include 5 since, as you mention, it is a missing value.
Clyde Schechter

Join Date: Apr 2014

Posts: 30148
#8

06 Mar 2022, 20:42

The following code will give you a new variable, preferred_candidates, which contains the names of all candidates that received the most favorable rating, separated by commas where there is more than one (which happens very often in your example data).

Code:

gen long obs_no = _n
reshape long fav_@_2019Nov, i(obs_no) j(candidate) string
by obs_no (fav__2019Nov), sort: gen byte most_favored = (fav__2019Nov == fav__2019Nov[1])
by obs_no: gen preferred_candidates = candidate if _n == 1
by obs_no (fav__2019Nov): replace preferred_candidates = preferred_candidates[_n-1] ///
    + cond(most_favored, ", " + candidate, "") if _n > 1
drop most_favored
by obs_no (fav__2019Nov): replace preferred_candidates = preferred_candidates[_N]
reshape wide

You might want to use -encode- to convert that into a value labeled numeric variable (just like the fav_*_2019Nov variables).

Because you were asking for the lowest numeric rating, it was not necessary to do anything special with 5. However, 5 is essentially a non-response to the question, and having it as a number is likely to get you into trouble later during analysis commands. And if it doesn't, it will only be because you have been extremely meticulous about including -if ... != 5- on many, many lines of code. To avoid that and have these non-responses automatically excluded from calculations, it is best to convert them to missing values. See -help mvdecode- for details.
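
A minimal sketch of the -mvdecode- step, combined with the one-indicator-per-candidate layout suggested in #5. This is not the code from this post: it works on the wide data without -reshape-, it uses only a few rows of the -dataex- example, and the names best_rating and top_* are made up for illustration.

```stata
clear
input byte(fav_biden_2019Nov fav_sanders_2019Nov fav_warren_2019Nov)
2 2 1
1 3 1
4 4 5
5 2 1
4 4 4
end

* Recode 5 ("Don't know") to system missing so it drops out of calculations
mvdecode fav_*_2019Nov, mv(5)

* Most favorable rating in each row = lowest nonmissing value
egen byte best_rating = rowmin(fav_*_2019Nov)

* One 0/1 indicator per candidate (hypothetical names top_*);
* a tie simply sets several indicators to 1 in the same row
foreach c in biden sanders warren {
    gen byte top_`c' = (fav_`c'_2019Nov == best_rating) if !missing(best_rating)
}

list, noobs
```

A row with no valid rating at all would get missing in best_rating and in every top_* indicator, rather than being counted as a tie.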

Sara Saltzer

Join Date: Mar 2022
Posts: 6
#9

06 Mar 2022, 22:02

Thank you very much, Clyde! Updated to add a question about this code: I ran it and on the line

Code:

by obs_no (fav__2019Nov): replace preferred_candidates = preferred_candidates[_n-1] ///

I received a 198 error, "invalid syntax." On the line

Code:

+ cond(most_favored, ", " + candidate, "") if _n > 1

, I received error 199, "+ is not a valid command name." In the end, after running all of the code, the column "preferred_candidates" is empty. I'm certain this is an issue with my reading of your code and understanding of how to implement it, but I would be so grateful for a bit of guidance and appreciate your patience with my ignorance.

This thread has helped me think a bit more clearly about what I am ultimately looking to do with this data, and how to do that. My eventual goal is to compare actual vote choice with the favorability ratings, and determine if voters actually choose to vote for (one of) the candidates whom they favor the most. If I used encode as you noted, would I then be able to create another dummy variable where the value is 1 if any of the numbers contained in most_favored equals the value of magicdempres_2019Nov (see data below)? What I am looking for is to have a dichotomous variable indicating whether the candidate a voter chose in magicdempres_2019Nov was one that they rated the most favorably. How would I go about coding that? Thank you in advance!

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input byte(fav_biden_2019Nov fav_sanders_2019Nov fav_warren_2019Nov) int magicdempres_2019Nov
2 2 1   4
1 3 1   2
2 2 1   1
1 1 1 999
2 3 1   1
1 1 1   1
3 2 1   1
2 2 1  16
4 4 5 999
1 1 1   2
4 2 2 999
4 4 4 999
4 4 4 999
4 4 4 999
4 3 4 999
5 2 1   1
2 2 1   1
3 3 2   4
4 4 4 999
2 2 1   4
end
label values fav_biden_2019Nov Q8_f_2019Nov
label def Q8_f_2019Nov 1 "Very favorable", modify
label def Q8_f_2019Nov 2 "Somewhat favorable", modify
label def Q8_f_2019Nov 3 "Somewhat unfavorable", modify
label def Q8_f_2019Nov 4 "Very unfavorable", modify
label def Q8_f_2019Nov 5 "Don't know", modify
label values fav_sanders_2019Nov Q8_g_2019Nov
label def Q8_g_2019Nov 1 "Very favorable", modify
label def Q8_g_2019Nov 2 "Somewhat favorable", modify
label def Q8_g_2019Nov 3 "Somewhat unfavorable", modify
label def Q8_g_2019Nov 4 "Very unfavorable", modify
label values fav_warren_2019Nov Q8_h_2019Nov
label def Q8_h_2019Nov 1 "Very favorable", modify
label def Q8_h_2019Nov 2 "Somewhat favorable", modify
label def Q8_h_2019Nov 4 "Very unfavorable", modify
label def Q8_h_2019Nov 5 "Don't know", modify
label values magicdempres_2019Nov Q16_2019Nov
label def Q16_2019Nov 1 "Elizabeth Warren", modify
label def Q16_2019Nov 2 "Joe Biden", modify
label def Q16_2019Nov 4 "Pete Buttigieg", modify
label def Q16_2019Nov 16 "Joe Sestak", modify
label def Q16_2019Nov 999 "not asked", modify

Last edited by Sara Saltzer; 06 Mar 2022, 22:37.


Nick Cox

Join Date: Mar 2014

Posts: 35756
#10

07 Mar 2022, 00:38

The syntax

Code:

///

isn't allowed interactively. Clyde Schechter means for you to copy and paste all the syntax into the do-file editor and run it all at once.

See 16.1.2 at https://www.stata.com/manuals/u16.pdf

If you type it line by line interactively, the triple slashes must be omitted.

This is all one command

Code:

by obs_no (fav__2019Nov): replace preferred_candidates = preferred_candidates[_n-1] + cond(most_favored, ", " + candidate, "") if _n > 1
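
To illustrate the point, here is a minimal do-file sketch. The toy variables x and y are made up and are not part of the thread's data. Run from the do-file editor, the line ending in /// and the line after it are read as one command.

```stata
clear
set obs 3
gen x = _n

* In a do-file, /// joins this line and the next into one command
gen y = x ///
    + 10

* Typed interactively, the same command must go on a single line:
* gen y = x + 10
list
```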
Clyde Schechter

Join Date: Apr 2014

Posts: 30148
#11

07 Mar 2022, 11:49

My eventual goal is to compare actual vote choice with the favorability ratings, and determine if voters actually choose to vote for (one of) the candidates whom they favor the most. If I used encode as you noted, would I then be able to create another dummy variable where the value is 1 if any of the numbers contained in most_favored equals the value of magicdempres_2019Nov (see data below)? What I am looking for is to have a dichotomous variable indicating whether the candidate a voter chose in magicdempres_2019Nov was one that they rated the most favorably. How would I go about coding that?

Well, encoding that variable would not be very helpful for that purpose. But it is not hard to do by a slightly different route.

Code:

gen long obs_no = _n
reshape long fav_@_2019Nov, i(obs_no) j(candidate) string
by obs_no (fav__2019Nov), sort: gen byte most_favored = (fav__2019Nov == fav__2019Nov[1])
by obs_no: gen preferred_candidates = candidate if _n == 1

label define n_candidate 1 "warren" ///
    2 "biden" ///
    3 "sanders" /// PUT THE RIGHT NUMBER HERE; I MADE 3 UP
    4 "buttigieg" ///
    16 "sestak"
encode candidate, gen(n_candidate)
by obs_no: egen voted_for_a_favorite = max(most_favored & magicdempres_2019Nov == n_candidate)

by obs_no (fav__2019Nov): replace preferred_candidates = preferred_candidates[_n-1] ///
    + cond(most_favored, ", " + candidate, "") if _n > 1
drop most_favored n_candidate
by obs_no (fav__2019Nov): replace preferred_candidates = preferred_candidates[_N]
reshape wide

I want to call your attention to the -label define- command in the above code, as you will need to adapt it to your actual situation. This label has to "substantively" match the value label Q16_2019Nov that is already in your data set. By substantively match, I mean this: the variable names in your data set present the candidates' surnames in lower case, whereas the label Q16_2019Nov has given names and surnames in proper case. The key here is that you must assign the same numbers in label n_candidate as that candidate gets in label Q16_2019Nov. For Warren and Biden I have done that. In your -dataex- output, label Q16_2019Nov doesn't show a value for Sanders, but I imagine in the real data set it does. I put 3 in my version of n_candidate because there needs to be something there for each candidate. But you must use the actual number assigned to Sanders in Q16_2019Nov for this code to work properly. Similarly, since, I assume, there are more candidates than just biden, sanders, and warren represented among the fav_*_2019Nov variables in your real data set, you must be sure to include each candidate in label n_candidate, with the corresponding number from label Q16_2019Nov.

If you are not sure what label Q16_2019Nov looks like, running -label list Q16_2019Nov- will show it to you.
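
The matching of candidate names to Q16_2019Nov codes can also be sketched in the wide layout, as an alternative to the -reshape- route above rather than a restatement of it. The sketch assumes only the three candidates in the -dataex- example and is meant to be run as a do-file. The name best_rating and the locals names and codes are made up for illustration, and the code 3 for Sanders is still invented and must be replaced with the real value.

```stata
clear
input byte(fav_biden_2019Nov fav_sanders_2019Nov fav_warren_2019Nov) int magicdempres_2019Nov
2 2 1   4
1 3 1   2
2 2 1   1
5 2 1   1
4 4 4 999
end

* Treat 5 ("Don't know") as missing, then find each row's most favorable rating
mvdecode fav_*_2019Nov, mv(5)
egen byte best_rating = rowmin(fav_*_2019Nov)

* Candidate names paired, in order, with their codes in Q16_2019Nov.
* 1 = Warren and 2 = Biden come from the example; 3 for Sanders is MADE UP.
local names biden sanders warren
local codes 2 3 1

gen byte voted_for_a_favorite = 0
forvalues i = 1/3 {
    local name : word `i' of `names'
    local code : word `i' of `codes'
    * Flag rows where the chosen candidate is among the most favorably rated
    replace voted_for_a_favorite = 1 ///
        if fav_`name'_2019Nov == best_rating ///
        & magicdempres_2019Nov == `code' & !missing(best_rating)
}
list, noobs
```

In this sketch, rows where the vote question was not asked (999) end up as 0; whether those should instead be set to missing depends on the analysis.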
Sara Saltzer

Join Date: Mar 2022

Posts: 6
#12

07 Mar 2022, 12:26

Thank you both so much! I really appreciate your help here.
