Creating a new variable as percentage of cases above a numerical treshold

Giorgio Piccitto

Join Date: Oct 2016

Posts: 238
#1

Creating a new variable as percentage of cases above a numerical treshold

20 Oct 2016, 11:20

Hi everyone,
I have a question (I'm working full time on Stata during this period and not always I'm able to solve the problems).

I would like to create a new variable that measure, for each job, the percentage of workers more satisfied with respect to the median of satisfaction of the whole labor market.
In other words, I have the median of job satisfaction (let's say, value 7, from 0 to 10). For each job I would like to know the percentage of workers that rate a job_satisfaction above the value of 7. I would opt to this, instead of the mean, because it is less sensitive to outliers.

Here some code (variable of job satisfaction)

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input double job_sat 6 8 5 10 9 9 10 6 8 8 0 0 8 10 10 7 7 9 6 10 3 8 10 6 5 9 3 8 7 8 7 8 7 0 8 9 7 10 10 6 9 8 10 8 4 9 10 8 8 9 9 5 4 9 6 8 10 3 4 8 9 3 7 5 8 8 2 5 9 7 10 6 8 7 9 9 2 0 4 5 10 3 8 2 5 7 10 8 5 6 5 7 8 4 7 5 7 3 7 1 end label values job_sat stfjb label def stfjb 0 "Extremely dissatisfied", modify label def stfjb 1 "1", modify label def stfjb 2 "2", modify label def stfjb 3 "3", modify label def stfjb 4 "4", modify label def stfjb 5 "5", modify label def stfjb 6 "6", modify label def stfjb 7 "7", modify label def stfjb 8 "8", modify label def stfjb 9 "9", modify label def stfjb 10 "Extremely satisfied", modify

Thanks a lot, as always.
Best, G.P.
Tags: None
Jeph Herrin

Join Date: Apr 2014

Posts: 335
#2

20 Oct 2016, 11:24

You don't have an identifier for "Job" in your example data, so it's not clear if you are doing this separately for different jobs or just for one.

If just for one:
egen pct_gt_7=mean((job_sat>7 & !mi(job_sat))
If you have many different jobs, indicated by job_id, then

bys job_id : egen pct_gt_7=mean((job_sat>7)& !mi(job_sat))
hth,
Jeph
1 like
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35727
#3

20 Oct 2016, 11:41

I agree with Jeph, but not on one detail. He's right that missing values need care, but I would do it differently. In his code the expression

Code:

(job_sat>7)& !mi(job_sat)

does not ignore missing values as missings get scored 0 (for false) and then the function averages over the 1s and 0s.

Consider this fabricated example. Half the known values are above 7 and half below, and there are some missing values.

Code:

clear input job_id job_sat 1 9 1 9 1 9 1 9 1 9 1 5 1 5 1 5 1 5 1 5 1 . 1 . 1 . end bys job_id : egen pct_gt_7 = mean((job_sat>7)& !mi(job_sat)) bys job_id : egen pct_gt_7_2 = mean((job_sat>7) / !mi(job_sat)) tabdisp job_id, c(pct_gt_7*) ---------------------------------- job_id | pct_gt_7 pct_gt_7_2 ----------+----------------------- 1 | .3846154 .5 ----------------------------------

Jeph's code gives a mean 0.385 = 5/13. I have added another way to do it, which really does ignore the missings, and so gives 0.5, which I think is the mean desired here. .

Yet another way is

Code:

bys job_id : egen pct_gt_7_3 = mean(cond(missing(job_sat), ., (job_sat>7)))

Discussion in sections 9 and 10 of

SJ-11-2 dm0055 . . . . . . . . . . . . . . Speaking Stata: Compared with ...
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . N. J. Cox
Q2/11 SJ 11(2):305--314 (no commands)
reviews techniques for relating values to values in other
observations

http://www.stata-journal.com/sjpdf.h...iclenum=dm0055

On the other hand, if missings are known substantively always to mean a very low score, or you want to presume that, Jeph's code is what you want.
Comment

Marcos Almeida

Join Date: Apr 2014
Posts: 4047

20 Oct 2016, 12:11

Maybe I didn't understand it correctly, but when you say

I would like to create a new variable that measure, for each job, the percentage of workers more satisfied with respect to the median of satisfaction of the whole labor market. [...] In other words, I have the median of job satisfaction (let's say, value 7, from 0 to 10). For each job I would like to know the percentage of workers that rate a job_satisfaction above the value of 7.

You are basically dealing with the median itself as a cut off.

There is an example, below:

Code:

. sysuse auto
(1978 Automobile Data)

. sum price, detail

                            Price
-------------------------------------------------------------
      Percentiles      Smallest
 1%         3291           3291
 5%         3748           3299
10%         3895           3667       Obs                  74
25%         4195           3748       Sum of Wgt.          74

50%       5006.5                      Mean           6165.257
                        Largest       Std. Dev.      2949.496
75%         6342          13466
90%        11385          13594       Variance        8699526
95%        13466          14500       Skewness       1.653434
99%        15906          15906       Kurtosis       4.819188

. tabstat price, statistics( p50 )

    variable |       p50
-------------+----------
       price |    5006.5
------------------------

. tabstat price, statistics( p50 ) by(foreign)

Summary for variables: price
     by categories of: foreign (Car type)

 foreign |       p50
---------+----------
Domestic |    4782.5
 Foreign |      5759
---------+----------
   Total |    5006.5
--------------------

. quiet sum price, detail

. return list

scalars:
                  r(N) =  74
              r(sum_w) =  74
               r(mean) =  6165.256756756757
                r(Var) =  8699525.97426879
                 r(sd) =  2949.49588476892
           r(skewness) =  1.653433511704859
           r(kurtosis) =  4.819187528464004
                r(sum) =  456229
                r(min) =  3291
                r(max) =  15906
                 r(p1) =  3291
                 r(p5) =  3748
                r(p10) =  3895
                r(p25) =  4195
                r(p50) =  5006.5
                r(p75) =  6342
                r(p90) =  11385
                r(p95) =  13466
                r(p99) =  15906


. egen float mymedian = median(1) if price > 5006.5
(37 missing values generated)

. replace mymedian = 0 if price <=5006.5 & price !=.
(37 real changes made)

. tab mymedian

   mymedian |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         37       50.00       50.00
          1 |         37       50.00      100.00
------------+-----------------------------------
      Total |         74      100.00

. tab foreign mymedian

           |       mymedian
  Car type |         0          1 |     Total
-----------+----------------------+----------
  Domestic |        29         23 |        52
   Foreign |         8         14 |        22
-----------+----------------------+----------
     Total |        37         37 |        74

. by foreign, sort : egen float mymedian2 = median(1) if price > 5006.5
(37 missing values generated)

. replace mymedian2 = 0 if mymedian2==.
(37 real changes made)

. tab mymedian2

  mymedian2 |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         37       50.00       50.00
          1 |         37       50.00      100.00
------------+-----------------------------------
      Total |         74      100.00

. tabulate foreign mymedian, cell

+-----------------+
| Key             |
|-----------------|
|    frequency    |
| cell percentage |
+-----------------+

           |       mymedian
  Car type |         0          1 |     Total
-----------+----------------------+----------
  Domestic |        29         23 |        52
           |     39.19      31.08 |     70.27
-----------+----------------------+----------
   Foreign |         8         14 |        22
           |     10.81      18.92 |     29.73
-----------+----------------------+----------
     Total |        37         37 |        74
           |     50.00      50.00 |    100.00


. tabulate foreign mymedian2, cell

+-----------------+
| Key             |
|-----------------|
|    frequency    |
| cell percentage |
+-----------------+

           |       mymedian2
  Car type |         0          1 |     Total
-----------+----------------------+----------
  Domestic |        29         23 |        52
           |     39.19      31.08 |     70.27
-----------+----------------------+----------
   Foreign |         8         14 |        22
           |     10.81      18.92 |     29.73
-----------+----------------------+----------
     Total |        37         37 |        74
           |     50.00      50.00 |    100.00

In short, we get the same results, and we have just the median itself. After all, that's what the median is all about.

Best,

Marcos

Last edited by Marcos Almeida; 20 Oct 2016, 12:24.

Best regards,

Marcos

Comment

Giorgio Piccitto

Join Date: Oct 2016

Posts: 238
#5

20 Oct 2016, 13:10

Dear Jeph,
thanl you very much, your command works perfectly! Just a question, what does the part "!mi(job_sat) mean?

Thanks a lot, G
Comment
Giorgio Piccitto

Join Date: Oct 2016

Posts: 238
#6

20 Oct 2016, 13:12

Thanks a lot actually to everyone, I did not notice your further responses!
Comment
Giorgio Piccitto

Join Date: Oct 2016

Posts: 238
#7

20 Oct 2016, 13:24

Still, I found very interesting your interventions especially with respect to missing values, that it is always an hot issue... in this case anyway, I dropped all the missing, so I guess there is no nedd to care them... But I was wondering, how Stata recognize that "mi" refers to missing values (as far as I understood), if missing values could be coded differently from dataset to dataset?
Comment
Charlie Joyez

Join Date: Dec 2014

Posts: 421
#8

20 Oct 2016, 13:57

Originally posted by Giorgio Piccitto View Post

... But I was wondering, how Stata recognize that "mi" refers to missing values (as far as I understood), if missing values could be coded differently from dataset to dataset?

I would advice you to take a look on the missing help file (see help missing or go directly to the manual).

To answer your question: mi() and !mi() do not directly refer to missing value, but rather to the missing() function (see help missing() ), which returns 1 if a missing value is reported (!mi does the reverse).
And, missing values are not coded differently from dataset to dataset, because a missing value in Stata has a very precise meaning, and should be coded with a period (.) for numerical variables, and with empty quotes ("") for string variables (without mentioning extended missing values).

Otherwise Stata will not consider them as missing, and the missing() function won't find them.
Comment
Jeph Herrin

Join Date: Apr 2014

Posts: 335
#9

25 Oct 2016, 08:21

Thanks Nick for the elaboration. I typically emply the -cond()- construction in this context, but here I thought Giorgio truly wanted the % with a valid score > 7, so only tried to rule out spurious ">7": values. Yet another example where -missing- is an unknown unknown....
Comment

Announcement