
  • Assigning value "1" to a dummy variable when another variable reaches its max, for each individual across choice alternatives

    I have a dataset of about 3800 observations. This dataset is in long form: I have about 760 individuals (760*5=3800) who have five choice alternatives (in my case, hours of labour in a week). With a mixed logit model, I have obtained predictions for each number of hours worked, by each individual (idcode). I want to check the accuracy of my logit predictions with what the individuals actually worked, and wanted to do this in 3 steps, listed below.

    My problem lies at step 2: when I generate the match dummy with the line starting with by idperson, Stata generates a missing value for all 3800 observations. I have made dummies before and have tried to make this one in different ways as well, but each time the Stata output says: (3,800 missing values generated).

    Any help is greatly appreciated,
    Olivier


    1. With the following code, I created a variable that lists the maximum probability across the 5 labour quantity alternatives for each idcode:

    Code:
    bysort idcode: egen max_prediction=max(pr)

    Here pr holds the probabilities generated by the logit model.

    2. Now I want to create a dummy "match" that equals 1 when the choice alternative for a row (pr) equals max_prediction. In other words, a dummy that equals 1 for the choice alternative with the highest probability across the 5 options.

    Code:
    by idperson: gen match = 1 if max_prediction==pr
    replace match=0 if match==.

    3. See how often "match" equals one & the variable "choice" equals one, choice being a dummy that is 1 when the individual's true hours worked is that choice alternative.
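
Step 3 could be sketched roughly as follows, once match has been created correctly. The toy data (two individuals, three alternatives) and the variables alt, hit, correct and tag are made up for illustration and are not part of the original dataset; match is built here with a sort-based rule rather than an equality test.

```stata
clear
* Hypothetical toy data: choice = 1 marks the alternative actually chosen
input idcode alt pr choice
1 1 .2 0
1 2 .5 1
1 3 .3 0
2 1 .6 0
2 2 .3 1
2 3 .1 0
end

* Flag the alternative with the highest predicted probability per individual
bysort idcode (pr alt): generate byte match = _n == _N

* Cross-tabulate predicted against actual choices
tabulate match choice

* Share of individuals whose chosen alternative is also the predicted one
generate byte hit = match == 1 & choice == 1
egen byte correct = max(hit), by(idcode)
egen byte tag = tag(idcode)
summarize correct if tag
```

The cell of the table where match and choice both equal 1 counts the correctly predicted individuals, and the mean reported by summarize is the corresponding share.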

  • #2
    Comparing two fractions for equality often leads to problems of precision, especially if one of your variables is type double and the other type float. You don't show us sample data, so I can't be sure that's your problem, but the code below demonstrates the problem.
    Code:
    . set obs 5
    number of observations (_N) was 0, now 5
    
    . generate double pr=_n/9
    
    . egen float max_pr = max(pr)
    
    . list if pr==max_pr
    
    . list
    
         +----------------------+
         |        pr     max_pr |
         |----------------------|
      1. | .11111111   .5555556 |
      2. | .22222222   .5555556 |
      3. | .33333333   .5555556 |
      4. | .44444444   .5555556 |
      5. | .55555556   .5555556 |
         +----------------------+
    A more robust alternative is to replace your first two steps with the following, assuming choice is the variable that identifies the five alternatives.
    Code:
    bysort idperson (pr choice) : generate max_prediction = pr[_N]
    bysort idperson (pr choice) : generate match = _n==_N
    sort idperson choice
    This has the side effect of ensuring that if there is a tie for the largest value of pr only one observation will be chosen as matching.
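
The tie-breaking behaviour can be seen in a small sketch. The data below are invented, and alt is a hypothetical variable numbering the alternatives; it plays the role of the second sort key in the code above.

```stata
clear
* Hypothetical toy data with a tie for the largest pr
input idperson alt pr
1 1 .2
1 2 .4
1 3 .4
end

* Equality-based flag: both tied alternatives are marked
egen max_pr = max(pr), by(idperson)
generate byte match_eq = pr == max_pr

* Sort-based flag: only the last observation after sorting is marked
bysort idperson (pr alt): generate byte match_sort = _n == _N

list, sepby(idperson)
```

Because pr and max_pr share the same storage type here, the equality test succeeds, but it flags two rows; the sort-based rule flags only the tied alternative that sorts last.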



    • #3
      Thank you for your answer. The choice variable identifies which option the individual actually chose in real life: say he works 0 hours, choice would be 1 for line 1 in the data example you gave, and 0 for row 2, 3, 4 and 5.

      Indeed max_prediction was a float type and pr a double. I did the following now:
      Code:
      recast double max_prediction
      So now they are both type double. However, now I still have the same problem, that Stata does not find any observations where max_prediction equals pr, or to use your data example, it does not pick up that pr(row5)=.555556 equals the max_prediction value in the right column. Stata still generates missing values for each observation. Do you know what else I might be doing that is causing this problem?

      Thank you,
      Olivier



      • #4
        Once you store a value in a float variable, you have lost precision, and recasting the variable as double will not somehow rediscover the missing bits. You need to create the variable as a double. This is no different from storing the value of pi in an int, then recasting the int as double and expecting the digits 14159 to appear to the right of the decimal point.
        Code:
        . set obs 5
        number of observations (_N) was 0, now 5
        
        . generate double pr=_n/9
        
        . egen double max_pr = max(pr)
        
        . list if pr==max_pr
        
             +-----------------------+
             |        pr      max_pr |
             |-----------------------|
          5. | .55555556   .55555556 |
             +-----------------------+
        Do read the output of help precision and better still the Stata blog post at https://blog.stata.com/2012/04/02/th...-to-precision/ to get a better understanding of precision issues.
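
Applied to the variable names from the original question, the fix might look like the sketch below. The toy data are made up, and max_float and match_float are hypothetical names. The second half uses Stata's float() function, which rounds its argument to float precision, as a possible alternative when the maximum is already stored as a float and cannot easily be recreated.

```stata
clear
set obs 5
generate idcode = 1
generate double pr = _n/9

* Create the maximum as a double from the start, so it matches pr exactly
bysort idcode: egen double max_prediction = max(pr)
generate byte match = pr == max_prediction

* Alternative when the maximum is stored as a float:
* round pr to float precision before comparing
bysort idcode: egen float max_float = max(pr)
generate byte match_float = float(pr) == max_float

list
```

Writing the comparison as a logical expression gives 0 or 1 directly, so no separate replace step is needed to turn missing values into zeros.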



        • #5
          Thank you for the useful answer and links! My issue is solved now.

          Olivier
