Problem combining multiple IF qualifiers which reference value of other variable entries

Harry Coleman

Join Date: May 2019
Posts: 4

Problem combining multiple IF qualifiers which reference value of other variable entries

21 May 2019, 09:49

Hi all,

First time posting - very grateful for this well-resourced community that has helped me thus far.

Using Stata13.0. Please see below for excerpt of dataset I am using; treatmentsourcex can take categorical values (e.g. 1 = hospital) while treatmentsourcex_exp is a continuous monetary amount, detailing the expenditure at that treatment source.

treatmentsource1	treatmentsource1_exp	treatmentsource2	treatmentsource2_exp	treatmentsource3	treatmentsource3_exp	formal_public
1	700	2	1000	3	400	.
1	200	3	500	8	.	.
1	200	3	800	8	.	.
1	200	3	500	.	.	.
1	200	3	500	.	.	.
1	200	3	300	.	.	.
8	300	.	500	.	.	.
3	.	.	.	.	.	.
1	800	.	.	.	.	.

I am trying to create a dummy variable (formal_public) that evaluates to 1 when either treatmentsource1, treatmentsource2, or treatmentsource3 are 1 and their relative expenditure value (treatmentsource1_exp, treatmentsource2_exp2, treatmentsource3_exp3) is larger than the other two expenditure values.

My understanding of this would be something like:

replace formal_public=1 if (treatmentsource1==1 & treatmentsource1_exp>treatmentsource2_exp & treatmentsource1_exp>treatmentsource3_exp)|(treatm entsource2==1 & treatmentsource2_exp>treatmentsource1_exp & treatmentsource2_exp>treatmentsource3_exp)|(treatm entsource3==1 & treatmentsource3_exp>treatmentsource1_exp & treatmentsource3_exp>treatmentsource2_exp)

However, when I ran this, it only made one real change, when there are numerous instances when the category 1 treatment source has the largest relative expenditure compared to other treatment sources. I am wondering if this could be due to the missing values (i.e., will Stata evaluate an expenditure to 1 if it also compares this expenditure to missing values, which is what I would like it to do)?

Further, if possible, I would like to distinguish between individuals who only list one treatment source and its expenditure, as this should to evaluate to 1 (e.g. the last row shown here) and individuals who list multiple treatment sources but did not also list the associated expenditure (this should evaluate to missing).

Hope this explanation provides enough detail.

Many thanks,

Harry

Edit: apologies Nick, changed my data excerpt to a table to clarify missing values and added some more explanation

Last edited by Harry Coleman; 21 May 2019, 10:32.

Tags: None

Nick Cox

Join Date: Mar 2014

Posts: 35634
#2

21 May 2019, 09:57

Sorry, but I can't follow your data example. I see 7 variable names and variously 6, 5 and 4 values. Too much guesswork required for me to make sense of that.

Please follow https://www.statalist.org/forums/help#stata and give a data example using dataex.
Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

21 May 2019, 11:32

A data example using dataex would have been more helpful, but I was able to beat yours into shape, taking the opportunity to substitute shorter simpler variable names to make my life easier and lend clarity to this example, and to change the second observation to better test the code.

As with so many things in Stata this problem appears complicated only because you are working with your data in wide layout. If you switch to long, it is easy:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte ts1 int te1 byte ts2 int te2 byte ts3 int te3 byte formal_public
1 700 2 1000 3 400 .
7 200 1  500 8   . .
1 200 3  800 8   . .
1 200 3  500 .   . .
1 200 3  500 .   . .
1 200 3  300 .   . .
8 300 .  500 .   . .
3   . .    . .   . .
1 800 .    . .   . .
end

generate id = _n
reshape long ts te, i(id) j(j)
bysort id: egen maxte = max(te)
generate big1 = ts==1 & te==maxte
bysort id: egen wanted = max(big1)
drop maxte big1
reshape wide ts te, i(id) j(j)
list, clean noobs

Code:

. list, clean noobs

    id   ts1   te1   ts2    te2   ts3   te3   formal~c   wanted  
     1     1   700     2   1000     3   400          .        0  
     2     7   200     1    500     8     .          .        1  
     3     1   200     3    800     8     .          .        0  
     4     1   200     3    500     .     .          .        0  
     5     1   200     3    500     .     .          .        0  
     6     1   200     3    300     .     .          .        0  
     7     8   300     .    500     .     .          .        0  
     8     3     .     .      .     .     .          .        0  
     9     1   800     .      .     .     .          .        1

With that, I will add that I did the second reshape only to ease comparison with the input data. If there is some compelling reason you need to go back to wide layout, you can do that. But think twice before doing that. There are only a small number of things in Stata that are easier to do with wide data. Unless you know you're going to do one of them, stick with long.

Comment

Harry Coleman

Join Date: May 2019

Posts: 4
#4

22 May 2019, 02:54

Hi William,

I appreciate you persisting with my data; I realise it will have been more laborious on your end. I have just got my head round each of the steps you performed and (hopefully) how I can amend these for subsequent actions.

This is my first encounter with the reshape command and the use of ordering data in this fashion (aside from time-series data). I was wondering if reshaping to long has an effect on analysis. I will conduct some form of multivariate regression to measure the effect of other variables on, for example, primary treatment source (named 'wanted' in your example). Given that reshaping this data to long would, at the basic level, triple the number of observations (and cause other, non-time-series-like data to be repeated in triplicate), would this affect the effect estimations or other analysis?

Best,

Harry
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35634
#5

22 May 2019, 03:12

Your last question is more general. A regression based on a long layout can't fail to be a different regression from one based on a wide layout. If you have say a bunch of related predictors les that correspond to different times, as compared with a single predictor and a time variable, the models implied, including the error structure, are quite different.

I think you may need to give concrete examples to get better advice. For example, you say multivariate regression, but usually when people say that they don't mean mvreg, just (multiple) regression. These days, it is hardly necessary to mention "multiple" any more, except for organizing material in text books.
1 like
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#6

22 May 2019, 14:07

I agree with Nick's advice. But I would say that whether wide or long is needed for your analysis, I would use a long layout for as long as possible while attempting tasks like the one at hand,

Taking a step back from what you asked in post #1, I've given some thought to what you might need for your analysis. For sure, having two sets of variables to describe one set of observations of expense is cumbersome in Stata, unlike some alternative statistical packages that make it easy to loop across variables in an observation (while making other things much more different). Consider the dataset that results from the following code, which starts with the example data in post #3 having been read in.

Code:

drop formal_public generate id = _n reshape long ts te, i(id) j(j) drop if missing(te) replace ts = 99 if missing(ts) fillin id ts drop _fillin replace te = 0 if missing(te) bysort id: egen te_total = total(te) bysort id: egen te_max = max(te) bysort id: egen ts_max = min(cond(te==te_max,ts,.)) drop j rename te te_ts reshape wide te_ts, i(id) j(ts) list, clean noobs

Code:

. list, clean noobs id te_ts1 te_ts2 te_ts3 te_ts7 te_ts8 te_ts99 te_total te_max ts_max 1 700 1000 400 0 0 0 2100 1000 2 2 500 0 0 200 0 0 700 500 1 3 200 0 800 0 0 0 1000 800 3 4 200 0 500 0 0 0 700 500 3 5 200 0 500 0 0 0 700 500 3 6 200 0 300 0 0 0 500 300 3 7 0 0 0 0 300 500 800 500 99 9 800 0 0 0 0 0 800 800 1

You now have for each observation te_ts1 having the treatment expense for treatment source 1, and so forth. So the variable names are now meaningful, no more need to consult a ts variable to find out what treatment the te variable represents. Then te_total gives the total of the treatment expenses for that observation, while te_max and ts_max give the maximum treatment expense and the treatment source for which it occurs (with ties going to the smallest-numbered source). So ts_max==1 corresponds to the observations for which formal_public==1 in the previous arrangement.
Comment

Announcement

Problem combining multiple IF qualifiers which reference value of other variable entries

Comment

Comment

Comment

Comment

Comment