Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Problem combining multiple IF qualifiers which reference value of other variable entries

    Hi all,

    First time posting - very grateful for this well-resourced community that has helped me thus far.

    Using Stata13.0. Please see below for excerpt of dataset I am using; treatmentsourcex can take categorical values (e.g. 1 = hospital) while treatmentsourcex_exp is a continuous monetary amount, detailing the expenditure at that treatment source.
    treatmentsource1 treatmentsource1_exp treatmentsource2 treatmentsource2_exp treatmentsource3 treatmentsource3_exp formal_public
    1 700 2 1000 3 400 .
    1 200 3 500 8 . .
    1 200 3 800 8 . .
    1 200 3 500 . . .
    1 200 3 500 . . .
    1 200 3 300 . . .
    8 300 . 500 . . .
    3 . . . . . .
    1 800 . . . . .
    I am trying to create a dummy variable (formal_public) that evaluates to 1 when either treatmentsource1, treatmentsource2, or treatmentsource3 are 1 and their relative expenditure value (treatmentsource1_exp, treatmentsource2_exp2, treatmentsource3_exp3) is larger than the other two expenditure values.

    My understanding of this would be something like:

    replace formal_public=1 if (treatmentsource1==1 & treatmentsource1_exp>treatmentsource2_exp & treatmentsource1_exp>treatmentsource3_exp)|(treatm entsource2==1 & treatmentsource2_exp>treatmentsource1_exp & treatmentsource2_exp>treatmentsource3_exp)|(treatm entsource3==1 & treatmentsource3_exp>treatmentsource1_exp & treatmentsource3_exp>treatmentsource2_exp)

    However, when I ran this, it only made one real change, when there are numerous instances when the category 1 treatment source has the largest relative expenditure compared to other treatment sources. I am wondering if this could be due to the missing values (i.e., will Stata evaluate an expenditure to 1 if it also compares this expenditure to missing values, which is what I would like it to do)?

    Further, if possible, I would like to distinguish between individuals who only list one treatment source and its expenditure, as this should to evaluate to 1 (e.g. the last row shown here) and individuals who list multiple treatment sources but did not also list the associated expenditure (this should evaluate to missing).

    Hope this explanation provides enough detail.

    Many thanks,

    Harry

    Edit: apologies Nick, changed my data excerpt to a table to clarify missing values and added some more explanation
    Last edited by Harry Coleman; 21 May 2019, 10:32.

  • #2
    Sorry, but I can't follow your data example. I see 7 variable names and variously 6, 5 and 4 values. Too much guesswork required for me to make sense of that.

    Please follow https://www.statalist.org/forums/help#stata and give a data example using dataex.

    Comment


    • #3
      A data example using dataex would have been more helpful, but I was able to beat yours into shape, taking the opportunity to substitute shorter simpler variable names to make my life easier and lend clarity to this example, and to change the second observation to better test the code.

      As with so many things in Stata this problem appears complicated only because you are working with your data in wide layout. If you switch to long, it is easy:
      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input byte ts1 int te1 byte ts2 int te2 byte ts3 int te3 byte formal_public
      1 700 2 1000 3 400 .
      7 200 1  500 8   . .
      1 200 3  800 8   . .
      1 200 3  500 .   . .
      1 200 3  500 .   . .
      1 200 3  300 .   . .
      8 300 .  500 .   . .
      3   . .    . .   . .
      1 800 .    . .   . .
      end
      
      generate id = _n
      reshape long ts te, i(id) j(j)
      bysort id: egen maxte = max(te)
      generate big1 = ts==1 & te==maxte
      bysort id: egen wanted = max(big1)
      drop maxte big1
      reshape wide ts te, i(id) j(j)
      list, clean noobs
      Code:
      . list, clean noobs
      
          id   ts1   te1   ts2    te2   ts3   te3   formal~c   wanted  
           1     1   700     2   1000     3   400          .        0  
           2     7   200     1    500     8     .          .        1  
           3     1   200     3    800     8     .          .        0  
           4     1   200     3    500     .     .          .        0  
           5     1   200     3    500     .     .          .        0  
           6     1   200     3    300     .     .          .        0  
           7     8   300     .    500     .     .          .        0  
           8     3     .     .      .     .     .          .        0  
           9     1   800     .      .     .     .          .        1
      With that, I will add that I did the second reshape only to ease comparison with the input data. If there is some compelling reason you need to go back to wide layout, you can do that. But think twice before doing that. There are only a small number of things in Stata that are easier to do with wide data. Unless you know you're going to do one of them, stick with long.

      Comment


      • #4
        Hi William,

        I appreciate you persisting with my data; I realise it will have been more laborious on your end. I have just got my head round each of the steps you performed and (hopefully) how I can amend these for subsequent actions.

        This is my first encounter with the reshape command and the use of ordering data in this fashion (aside from time-series data). I was wondering if reshaping to long has an effect on analysis. I will conduct some form of multivariate regression to measure the effect of other variables on, for example, primary treatment source (named 'wanted' in your example). Given that reshaping this data to long would, at the basic level, triple the number of observations (and cause other, non-time-series-like data to be repeated in triplicate), would this affect the effect estimations or other analysis?

        Best,

        Harry

        Comment


        • #5
          Your last question is more general. A regression based on a long layout can't fail to be a different regression from one based on a wide layout. If you have say a bunch of related predictors les that correspond to different times, as compared with a single predictor and a time variable, the models implied, including the error structure, are quite different.

          I think you may need to give concrete examples to get better advice. For example, you say multivariate regression, but usually when people say that they don't mean mvreg, just (multiple) regression. These days, it is hardly necessary to mention "multiple" any more, except for organizing material in text books.

          Comment


          • #6
            I agree with Nick's advice. But I would say that whether wide or long is needed for your analysis, I would use a long layout for as long as possible while attempting tasks like the one at hand,

            Taking a step back from what you asked in post #1, I've given some thought to what you might need for your analysis. For sure, having two sets of variables to describe one set of observations of expense is cumbersome in Stata, unlike some alternative statistical packages that make it easy to loop across variables in an observation (while making other things much more different). Consider the dataset that results from the following code, which starts with the example data in post #3 having been read in.

            Code:
            drop formal_public
            generate id = _n
            reshape long ts te, i(id) j(j)
            drop if missing(te)
            replace ts = 99 if missing(ts)
            fillin id ts
            drop _fillin
            replace te = 0 if missing(te)
            bysort id: egen te_total = total(te)
            bysort id: egen te_max = max(te)
            bysort id: egen ts_max = min(cond(te==te_max,ts,.))
            drop j
            rename te te_ts
            reshape wide te_ts, i(id) j(ts)
            list, clean noobs
            Code:
            . list, clean noobs
            
                id   te_ts1   te_ts2   te_ts3   te_ts7   te_ts8   te_ts99   te_total   te_max   ts_max  
                 1      700     1000      400        0        0         0       2100     1000        2  
                 2      500        0        0      200        0         0        700      500        1  
                 3      200        0      800        0        0         0       1000      800        3  
                 4      200        0      500        0        0         0        700      500        3  
                 5      200        0      500        0        0         0        700      500        3  
                 6      200        0      300        0        0         0        500      300        3  
                 7        0        0        0        0      300       500        800      500       99  
                 9      800        0        0        0        0         0        800      800        1
            You now have for each observation te_ts1 having the treatment expense for treatment source 1, and so forth. So the variable names are now meaningful, no more need to consult a ts variable to find out what treatment the te variable represents. Then te_total gives the total of the treatment expenses for that observation, while te_max and ts_max give the maximum treatment expense and the treatment source for which it occurs (with ties going to the smallest-numbered source). So ts_max==1 corresponds to the observations for which formal_public==1 in the previous arrangement.

            Comment

            Working...
            X