Create Variable Z=Variable Y if Variable X is the highest value among all Variable X's

Jonathan Horowitz

Join Date: Apr 2015

Posts: 102
#1

Create Variable Z=Variable Y if Variable X is the highest value among all Variable X's

05 Oct 2020, 14:51

Dear Statalisters:

I apologize if this has been asked somewhere and already answered but I don't know where to begin with this one.

I have a list of nine possible values for Variable Y. It's an occupation code where, for example, 1=CEO, 2=Manager, etc. A person can hold up to 9 possible jobs, hence there are 9 variables (y_1, y_2, y_3...y_9).
I also have a list of nine possible values for Variable X. Variable X could be any number of things, although hourly rate of pay or number of hours worked at this job are two possibilities. As a person can hold up to 9 possible jobs, there are 9 variables here too (x_1, x_2...x_9).

Here's what I want to do: I want to create a single variable (Variable Z) equal to the value from Variable Y that has the highest value of Value X.

So, for example, let's say a person held 9 jobs, where they were paid between $11 to $19 per hour (with x_9 being the $19 / hour one). I would want that person to have Z=y_9.
Another example: If a different person held 9 jobs, where they were paid between $21 and $29 per month (with x_6 being the $29 / hour one), then I want that person to have Z=y_6.

I know how to do this if I wanted Variable Z to be equal to the highest value of X, using egen and rowmax. I don't know how to do it if I want Z to equal the Y with the highest value of X.

Has anyone figured out an easy solution to this? Thank you in advance.
Tags: None
Joro Kolev

Join Date: Aug 2018

Posts: 3050
#2

05 Oct 2020, 15:04

Reshape your data to long -reshape-, so that Y and X become only two variables and for each there are 9 observations.

Then use -egen, max- to find maxX, the maximum value of X within this group of 9 observations.

Then -gen Z = Y if X==maxX. For the maximum X it will be a value, for the rest it will be missing.

Spread this single nonmissing value to all others.

Reshape back to wide.

If you cannot implement the procedure, provide a sample data using -dataex-.
Comment
Jonathan Horowitz

Join Date: Apr 2015

Posts: 102
#3

05 Oct 2020, 15:40

That is an interesting idea and one I hadn't thought of, but it seems like the reshaping is unnecessary? Would the following work?

Code:

egen maxX=rowmax(x1 x2...x9) gen Z=. forval i=1/9 { replace Z=Y_`i' if X_`i'==maxX }
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30147
#4

05 Oct 2020, 18:07

The suggestion in #2 and the code in #3 will both work, sort of. But there is a problem. What is to be done if there is a tie for the maximum value of X? For example, a person might have been paid the same at two jobs, and those might both have been the highest paying. The algorithm in #2 will give multiple values of Z, one for each of the tied observations, and then will break on attempting to -reshape wide-. The code in #3 will use the Y_`i' corresponding to the last (i.e. highest value of i) such value of X. The question itself, as posed in #1, appears to not have contemplated the possibility. (If the data set is large enough to care about wanting to avoid -reshape-, the probability that there will be some ties with this kind of data is appreciable.)

Added: In #3, -rowmax(x1 x2...x9)- will throw a syntax error. The correct syntax would be -rowmax(x1 x2 x3 x4 x5 x6 x7 x8 x9)- or, if the x_i variables are located consecutively in the data set that can be abbreviated to -rowmax(x1-x9)-.

Added: in #2, the missing values generated for Z (even when there are no ties) will prevent the -reshape wide- command from working because Z will not be constant within persons. What needs to be done instead is -egen Z = max(cond(X == max_X, Y, .))-. This will populate all of a person's observations of Z with "the" value of Y corresponding to the largest X. This will permit -reshape wide- to run. But it still chokes on the problem of tied values of X--in this case it will report the largest Y among those that occur with the largest value of X.

Bottom line: Unless there is some assurance that ties will not occur, O.P. needs to decide how ties are to be handled, and then modify code accordingly.

Last edited by Clyde Schechter; 05 Oct 2020, 18:15.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35754
#5

06 Oct 2020, 02:19

Clyde's point about ties is discussed at greater length in a 2020 column in the Stata Journal

SJ-20-2 pr0046_1 . . . . . . . . . . . Speaking Stata: More ways for rowwise
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . N. J. Cox
Q2/20 SJ 20(2):481--488 (no commands)
focuses on returning which variable or variables are equal
to the maximum or minimum in a row

SJ-9-1 pr0046 . . . . . . . . . . . . . . . . . . . Speaking Stata: Rowwise
(help rowsort, rowranks if installed) . . . . . . . . . . . N. J. Cox
Q1/09 SJ 9(1):137--157
shows how to exploit functions, egen functions, and Mata
for working rowwise; rowsort and rowranks are introduced

Here's the abstract of the second paper:

A previous column (Cox, 2009, Stata Journal 9: 137–157) gave a review of methods for working rowwise in Stata. Here rows means observations in a dataset, and the concern is calculations in each observation with a bundle of variables. For example, a row mean variable can be generated as the mean of some numeric variables in each observation. This column is an update. It is briefly flagged that official Stata now has rowmedian() and rowpctile() functions for egen. The main focus is on returning which variable or variables are equal to the maximum or minimum in a row. The twist that requires care is that two or more variables can tie for minimum or maximum. That may entail a decision on what is to be recorded, such as all of them or just the first or last occurrence of an extreme.
Comment
Jonathan Horowitz

Join Date: Apr 2015

Posts: 102
#6

24 Oct 2020, 09:12

Thanks for this, everyone. Sorry for the late reply. With COVID this semester has been chaotic.

So I would need to add another tiebreaker; so let's say that one possible decision mechanism is hours and the other is hours worked. So I could use one as the primary and another as the secondary choice (it is, in theory, possible that both would be tied, but it seems far less likely).

So let's try something like this? (and yes, thanks Clyde, I was just getting bored with typing out every variable but the "-" option is syntactically correct):

Code:

egen max_hrs=rowmax(h1-h9) egen max_wage=rowmax(w1-w9) gen Z=. gen tie=. forval i=1/9 { replace tie=. replace tie=1 if Z==Y_`i' & H_`i'==max_hrs replace Z=Y_`i' if H_`i'==max_hrs & tie!=1 replace Z=Y_`i' if W_`i'==max_wage & tie==1 }

The idea being, we set an indicator variable if Z is already the same as Y_`i', meaning that we've already got an entry here. In that case, we don't replace it based on hours, we replace it based on hourly wage.

I can already tell this idea won't work, though. The problem with this is that it's possible that there is a third option with a higher hourly wage than the two that are tied, in which case the wrong one would be replaced. I'm not really sure how to get out of that one. My instinct is that I need an -if- loop inside of the loop but I can't work out how it would be done, so perhaps that's not the right way to go.

Any ideas are appreciated and thank you for the help so far.

Last edited by Jonathan Horowitz; 24 Oct 2020, 09:16.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30147
#7

24 Oct 2020, 11:55

My instinct is that I need an -if- loop inside of the loop but I can't work out how it would be done, so perhaps that's not the right way to go.

It's not the right way to go. (Actually ,there is no such thing as an -if- loop anyway.) The right way to go is to reshape the data to long. Then the whole thing becomes very easy.

Code:

reshape long x y hours wage, i(person_id) // FIND VALUE OF Y CORRESPONDING TO LARGEST VALUE OF X // BREAK TIES WITH LARGEST VALUES OF HOURS AND WAGES // AND SAVE THAT IN Z by person_id (x hours wage), sort: gen z = y[_N]

If there is some good reason to return to wide layout once this is done, a simple -reshape wide- will do it. But don't do that unless you have a good reason. Nearly all data management and analysis in Stata is easier, often only possible, with the data in long layout. So long should be the default layout for your data, venturing into wide territory only when doing something that specifically calls for it.

Last edited by Clyde Schechter; 24 Oct 2020, 11:59.
Comment

Announcement

Create Variable Z=Variable Y if Variable X is the highest value among all Variable X's

Comment

Comment

Comment

Comment

Comment

Comment