Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Creating a new variable as percentage of cases above a numerical treshold

    Hi everyone,
    I have a question (I'm working full time on Stata during this period and not always I'm able to solve the problems).

    I would like to create a new variable that measure, for each job, the percentage of workers more satisfied with respect to the median of satisfaction of the whole labor market.
    In other words, I have the median of job satisfaction (let's say, value 7, from 0 to 10). For each job I would like to know the percentage of workers that rate a job_satisfaction above the value of 7. I would opt to this, instead of the mean, because it is less sensitive to outliers.

    Here some code (variable of job satisfaction)

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input double job_sat
     6
     8
     5
    10
     9
     9
    10
     6
     8
     8
     0
     0
     8
    10
    10
     7
     7
     9
     6
    10
     3
     8
    10
     6
     5
     9
     3
     8
     7
     8
     7
     8
     7
     0
     8
     9
     7
    10
    10
     6
     9
     8
    10
     8
     4
     9
    10
     8
     8
     9
     9
     5
     4
     9
     6
     8
    10
     3
     4
     8
     9
     3
     7
     5
     8
     8
     2
     5
     9
     7
    10
     6
     8
     7
     9
     9
     2
     0
     4
     5
    10
     3
     8
     2
     5
     7
    10
     8
     5
     6
     5
     7
     8
     4
     7
     5
     7
     3
     7
     1
    end
    label values job_sat stfjb
    label def stfjb 0 "Extremely dissatisfied", modify
    label def stfjb 1 "1", modify
    label def stfjb 2 "2", modify
    label def stfjb 3 "3", modify
    label def stfjb 4 "4", modify
    label def stfjb 5 "5", modify
    label def stfjb 6 "6", modify
    label def stfjb 7 "7", modify
    label def stfjb 8 "8", modify
    label def stfjb 9 "9", modify
    label def stfjb 10 "Extremely satisfied", modify
    Thanks a lot, as always.
    Best, G.P.

  • #2
    You don't have an identifier for "Job" in your example data, so it's not clear if you are doing this separately for different jobs or just for one.

    If just for one:
    egen pct_gt_7=mean((job_sat>7 & !mi(job_sat))
    If you have many different jobs, indicated by job_id, then

    bys job_id : egen pct_gt_7=mean((job_sat>7)& !mi(job_sat))
    hth,
    Jeph

    Comment


    • #3
      I agree with Jeph, but not on one detail. He's right that missing values need care, but I would do it differently. In his code the expression

      Code:
      (job_sat>7)& !mi(job_sat)
      does not ignore missing values as missings get scored 0 (for false) and then the function averages over the 1s and 0s.

      Consider this fabricated example. Half the known values are above 7 and half below, and there are some missing values.

      Code:
      clear 
      input job_id job_sat 
      1  9
      1  9
      1  9 
      1  9    
      1  9 
      1  5
      1  5
      1  5
      1  5
      1  5 
      1  . 
      1  .
      1  . 
      end 
      bys job_id : egen pct_gt_7 = mean((job_sat>7)& !mi(job_sat))
      bys job_id : egen pct_gt_7_2 = mean((job_sat>7) / !mi(job_sat))
      
      tabdisp job_id, c(pct_gt_7*) 
      
      ----------------------------------
         job_id |   pct_gt_7  pct_gt_7_2
      ----------+-----------------------
              1 |   .3846154          .5
      ----------------------------------
      Jeph's code gives a mean 0.385 = 5/13. I have added another way to do it, which really does ignore the missings, and so gives 0.5, which I think is the mean desired here. .

      Yet another way is

      Code:
      bys job_id : egen pct_gt_7_3 = mean(cond(missing(job_sat), ., (job_sat>7)))
      Discussion in sections 9 and 10 of

      SJ-11-2 dm0055 . . . . . . . . . . . . . . Speaking Stata: Compared with ...
      . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . N. J. Cox
      Q2/11 SJ 11(2):305--314 (no commands)
      reviews techniques for relating values to values in other
      observations

      http://www.stata-journal.com/sjpdf.h...iclenum=dm0055


      On the other hand, if missings are known substantively always to mean a very low score, or you want to presume that, Jeph's code is what you want.

      Comment


      • #4
        Maybe I didn't understand it correctly, but when you say

        I would like to create a new variable that measure, for each job, the percentage of workers more satisfied with respect to the median of satisfaction of the whole labor market. [...] In other words, I have the median of job satisfaction (let's say, value 7, from 0 to 10). For each job I would like to know the percentage of workers that rate a job_satisfaction above the value of 7.
        You are basically dealing with the median itself as a cut off.

        There is an example, below:

        Code:
        . sysuse auto
        (1978 Automobile Data)
        
        . sum price, detail
        
                                    Price
        -------------------------------------------------------------
              Percentiles      Smallest
         1%         3291           3291
         5%         3748           3299
        10%         3895           3667       Obs                  74
        25%         4195           3748       Sum of Wgt.          74
        
        50%       5006.5                      Mean           6165.257
                                Largest       Std. Dev.      2949.496
        75%         6342          13466
        90%        11385          13594       Variance        8699526
        95%        13466          14500       Skewness       1.653434
        99%        15906          15906       Kurtosis       4.819188
        
        . tabstat price, statistics( p50 )
        
            variable |       p50
        -------------+----------
               price |    5006.5
        ------------------------
        
        . tabstat price, statistics( p50 ) by(foreign)
        
        Summary for variables: price
             by categories of: foreign (Car type)
        
         foreign |       p50
        ---------+----------
        Domestic |    4782.5
         Foreign |      5759
        ---------+----------
           Total |    5006.5
        --------------------
        
        . quiet sum price, detail
        
        . return list
        
        scalars:
                          r(N) =  74
                      r(sum_w) =  74
                       r(mean) =  6165.256756756757
                        r(Var) =  8699525.97426879
                         r(sd) =  2949.49588476892
                   r(skewness) =  1.653433511704859
                   r(kurtosis) =  4.819187528464004
                        r(sum) =  456229
                        r(min) =  3291
                        r(max) =  15906
                         r(p1) =  3291
                         r(p5) =  3748
                        r(p10) =  3895
                        r(p25) =  4195
                        r(p50) =  5006.5
                        r(p75) =  6342
                        r(p90) =  11385
                        r(p95) =  13466
                        r(p99) =  15906
        
        
        . egen float mymedian = median(1) if price > 5006.5
        (37 missing values generated)
        
        . replace mymedian = 0 if price <=5006.5 & price !=.
        (37 real changes made)
        
        . tab mymedian
        
           mymedian |      Freq.     Percent        Cum.
        ------------+-----------------------------------
                  0 |         37       50.00       50.00
                  1 |         37       50.00      100.00
        ------------+-----------------------------------
              Total |         74      100.00
        
        . tab foreign mymedian
        
                   |       mymedian
          Car type |         0          1 |     Total
        -----------+----------------------+----------
          Domestic |        29         23 |        52
           Foreign |         8         14 |        22
        -----------+----------------------+----------
             Total |        37         37 |        74
        
        . by foreign, sort : egen float mymedian2 = median(1) if price > 5006.5
        (37 missing values generated)
        
        . replace mymedian2 = 0 if mymedian2==.
        (37 real changes made)
        
        . tab mymedian2
        
          mymedian2 |      Freq.     Percent        Cum.
        ------------+-----------------------------------
                  0 |         37       50.00       50.00
                  1 |         37       50.00      100.00
        ------------+-----------------------------------
              Total |         74      100.00
        
        . tabulate foreign mymedian, cell
        
        +-----------------+
        | Key             |
        |-----------------|
        |    frequency    |
        | cell percentage |
        +-----------------+
        
                   |       mymedian
          Car type |         0          1 |     Total
        -----------+----------------------+----------
          Domestic |        29         23 |        52
                   |     39.19      31.08 |     70.27
        -----------+----------------------+----------
           Foreign |         8         14 |        22
                   |     10.81      18.92 |     29.73
        -----------+----------------------+----------
             Total |        37         37 |        74
                   |     50.00      50.00 |    100.00
        
        
        . tabulate foreign mymedian2, cell
        
        +-----------------+
        | Key             |
        |-----------------|
        |    frequency    |
        | cell percentage |
        +-----------------+
        
                   |       mymedian2
          Car type |         0          1 |     Total
        -----------+----------------------+----------
          Domestic |        29         23 |        52
                   |     39.19      31.08 |     70.27
        -----------+----------------------+----------
           Foreign |         8         14 |        22
                   |     10.81      18.92 |     29.73
        -----------+----------------------+----------
             Total |        37         37 |        74
                   |     50.00      50.00 |    100.00
        In short, we get the same results, and we have just the median itself. After all, that's what the median is all about.

        Best,

        Marcos
        Last edited by Marcos Almeida; 20 Oct 2016, 12:24.
        Best regards,

        Marcos

        Comment


        • #5
          Dear Jeph,
          thanl you very much, your command works perfectly! Just a question, what does the part "!mi(job_sat) mean?

          Thanks a lot, G

          Comment


          • #6
            Thanks a lot actually to everyone, I did not notice your further responses!

            Comment


            • #7
              Still, I found very interesting your interventions especially with respect to missing values, that it is always an hot issue... in this case anyway, I dropped all the missing, so I guess there is no nedd to care them... But I was wondering, how Stata recognize that "mi" refers to missing values (as far as I understood), if missing values could be coded differently from dataset to dataset?

              Comment


              • #8
                Originally posted by Giorgio Piccitto View Post
                ... But I was wondering, how Stata recognize that "mi" refers to missing values (as far as I understood), if missing values could be coded differently from dataset to dataset?
                I would advice you to take a look on the missing help file (see help missing or go directly to the manual).

                To answer your question: mi() and !mi() do not directly refer to missing value, but rather to the missing() function (see help missing() ), which returns 1 if a missing value is reported (!mi does the reverse).
                And, missing values are not coded differently from dataset to dataset, because a missing value in Stata has a very precise meaning, and should be coded with a period (.) for numerical variables, and with empty quotes ("") for string variables (without mentioning extended missing values).

                Otherwise Stata will not consider them as missing, and the missing() function won't find them.

                Comment


                • #9
                  Thanks Nick for the elaboration. I typically emply the -cond()- construction in this context, but here I thought Giorgio truly wanted the % with a valid score > 7, so only tried to rule out spurious ">7": values. Yet another example where -missing- is an unknown unknown....

                  Comment

                  Working...
                  X