
  • A unit free approach when variables are coded differently?

    Hi all,

    I want to compare two different variables across waves, both indicators of dissatisfaction.


    Usually I would simply code this as:

    Code:
    egen dissatisfaction_wave1=std(dissatisfactionmeas1) , mean(0) std(1)
    
    egen dissatisfaction_wave2=std(dissatisfactionmeas2) , mean(0) std(1)
    But here's the rub. Measure 1 is a standard measure, where higher numbers mean higher levels of dissatisfaction. Measure 2, meanwhile, is awkward: a score of 7-20 indicates low satisfaction, 20-27 moderate satisfaction, and 28-35 high satisfaction.

    Thus I'm not sure what to do with Measure 2. Sure, I can do the above and run a continuous regression, but if I get a positive result of 0.06*** then what does that mean? If both measures ran in the same direction I would say the probability of poor satisfaction increased by 0.06 standard deviations, significant at the *** level. However, the second measure is coded such that an increase in an individual's score could mean they are doing better or worse, i.e. their score may be moving up past the low-satisfaction range and into the high-satisfaction range...

    I tried reverse coding measure 2 but that didn't work well as you can see below:

    Code:
    
    revv dissatisfactionwave2
    
    . tab dissatisfactionwave2
    
    Warwick-Edi |
         nburgh |
          Scale |
    Post-Transf |
       ormation |      Freq.     Percent        Cum.
    ------------+-----------------------------------
          16.36 |          3        0.66        0.66
          16.88 |          1        0.22        0.88
          17.43 |          3        0.66        1.54
          17.98 |          5        1.10        2.64
          19.25 |         19        4.19        6.83
          19.98 |         11        2.42        9.25
          20.73 |         22        4.85       14.10
          21.54 |         36        7.93       22.03
          22.35 |         30        6.61       28.63
          23.21 |         34        7.49       36.12
          24.11 |         60       13.22       49.34
          25.03 |         66       14.54       63.88
          26.02 |         41        9.03       72.91
          27.03 |         30        6.61       79.52
          28.13 |         23        5.07       84.58
          29.31 |         20        4.41       88.99
           30.7 |         24        5.29       94.27
          32.55 |          8        1.76       96.04
             35 |         18        3.96      100.00
    ------------+-----------------------------------
          Total |        454      100.00
    
    . tab rvdissatisfactionwave2
    
    Warwick-Edi |
         nburgh |
          Scale |
    Post-Transf |
       ormation |      Freq.     Percent        Cum.
    ------------+-----------------------------------
          16.36 |         18        3.96        3.96
          16.88 |          8        1.76        5.73
          17.43 |         24        5.29       11.01
          17.98 |         20        4.41       15.42
          19.25 |         23        5.07       20.48
          19.98 |         30        6.61       27.09
          20.73 |         41        9.03       36.12
          21.54 |         66       14.54       50.66
          22.35 |         60       13.22       63.88
          23.21 |         34        7.49       71.37
          24.11 |         30        6.61       77.97
          25.03 |         36        7.93       85.90
          26.02 |         22        4.85       90.75
          27.03 |         11        2.42       93.17
          28.13 |         19        4.19       97.36
          29.31 |          5        1.10       98.46
           30.7 |          3        0.66       99.12
          32.55 |          1        0.22       99.34
             35 |          3        0.66      100.00
    ------------+-----------------------------------
          Total |        454      100.00
    
    .
    I'm not even sure that reversing this would actually do much to fix this problem.

    I also include descriptives on the first variable below to make things a bit clearer:


    Code:
    . tab dissatisfactionwave1
    
    GHQ Score A |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |        470       44.63       44.63
              1 |        207       19.66       64.29
              2 |        105        9.97       74.26
              3 |         54        5.13       79.39
              4 |         60        5.70       85.09
              5 |         35        3.32       88.41
              6 |         33        3.13       91.55
              7 |         24        2.28       93.83
              8 |         24        2.28       96.11
              9 |         11        1.04       97.15
             10 |         12        1.14       98.29
             11 |          9        0.85       99.15
             12 |          9        0.85      100.00
    ------------+-----------------------------------
          Total |      1,053      100.00
    

    I would be grateful if someone could please advise?

    Very best,

    John

  • #2
    Dear John,
    I gather from your problem description that you have two measures on very different scales (0-12 and 7-35). Assuming that both measures were administered to the same persons (i.e. wave1 and wave2), I am somewhat disturbed by the difference in frequencies of these measures. If indeed both scales are supposed to vary between 'minimum' and 'maximum' dissatisfaction, how would you explain that the wave2 frequency pattern is rather different from that of wave1? Has some intervention taken place in between? I mean to ask: do you expect that at wave2 your cohort is 'much more' dissatisfied than at the wave1 measurement?
    http://publicationslist.org/eric.melse



    • #3
      Dear Eric,

      Thank you for your response. Yes, by wave 2 ten years have passed, most respondents have experienced their own unemployment or that of their partner, and the corresponding mental health effects fit with the literature.

      Thank you,

      John



      • #4
        Well, I suppose that makes sense.

        My suggestion would be to follow this procedure with each variable:
        * create a variable which is a percentile rank of the source scale (pcr in my code);
        * next, create a normalized variable using it;
        * next, standardize the normalized variable (if needed).

        Here is an example of this procedure:
        Code:
        ssc install pshare    // Jann, B. (2015). pshare: Stata module to compute and graph percentile share
        * Data set up
        sysuse nlsw88 , clear
        * Repeat the code below with each variable
        local var wage
        pshare estimate `var', percent nquantiles(100)
        gen byte touse = e(sample)  // to make sure that the same observations are used
        bysort touse (`var'): gen p = _n/_N * 100
        gen pcr_`var' = .    // to create percentile rank
        local bins = e(bins)
        forvalues i = 1/`bins' {
            qui replace pcr_`var' = `i' if p>el(e(ll),1,`i') & p<=el(e(ul),1,`i')
        }
        label var pcr_`var' "Percentile share coding of `var'"
        generate zpcr_`var' = invnorm(pcr_`var'/101) // to create Normalized scale
        label var zpcr_`var' "`var' Z-score percentile 1st rank"
        zval zpcr_`var'     // to create Standardized scale
        label var z_zpcr_`var' "`var' Standardized score"
        ren z_zpcr_`var' z_`var'
        order pcr_`var' , a(`var')
        order z_`var' zpcr_`var' , a(pcr_`var')
        format %8.0g pcr_`var'
        format %8.0g zpcr_`var'
        
        tab pcr_`var'    // to inspect the result
        kdensity pcr_`var' , name(pcr_`var')    // to inspect the distributions
        kdensity zpcr_`var' , name(zpcr_`var')
        kdensity z_`var' , name(z_`var')
        For some background about the above, you might want to visit this UCLA web page.

        I am interested to learn about your result.

        Best,
        Eric
        http://publicationslist.org/eric.melse



        • #5
          Dear Eric,

          Thank you for your help,

          I have been concentrating on getting the above code to work with my data, and now that it does, I must confess that I don't know what it does for my problem of comparing well-being measures, where one has higher scores indicating worse well-being and the other has lower scores indicating worse well-being.

          Could you explain that part to me?

          Thanks for your help and sorry for my ignorance in this area!

          All the best,

          John



          • #6
            Dear John,

            Good, you now have both measurements available as standardized scales. This means that you can study their association with your outcome measure. If, as you point out, the original scales run in opposite directions, then their associations will likewise be opposite (+ vs. - or - vs. +). To get these associations expressed in the same direction you have to invert one of the two standardized scales, depending on your analytical objective. Assuming that 'worse well-being' should increase as the scale increases (i.e. with the 'worst' at the higher end), you should invert the standardized scale on which 'worse well-being' currently decreases as the scale increases.
            Following the example of #4, this is what you should do:
            Code:
            sum zpcr_wage , d    // inspect 1% & 99% & mean values
            gen zpcrIwage = zpcr_wage*-1    // create inverted scale
            sum zpcrIwage , d    // inspect 1% & 99% & mean values
            I suppose your regression model should now include either 'scale1 scale2inverted' or 'scale1inverted scale2', depending on your needs. Whether or not the coefficients then make sense is something that you have to substantiate. Keep in mind that their effect is quantified in your model as the effect of a one-standard-deviation increase (or decrease). For more on this you could read Michael N. Mitchell's Interpreting and Visualizing Regression Models Using Stata, and I also recommend his Stata for the Behavioral Sciences.
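            For concreteness, here is a minimal sketch of such a model, where outcome, z_scale1 and z_scale2 are hypothetical placeholders standing in for your own variables:
            Code:
            gen z_scale2_inv = -1*z_scale2            // invert one standardized scale so both run in the same direction
            regress outcome z_scale1 z_scale2_inv     // each coefficient is the effect of a one-SD increase
            This only illustrates the direction-alignment step; which scale to invert, and the substantive specification, depend on your analytical objective.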

            Best,
            Eric
            http://publicationslist.org/eric.melse



            • #7
              Dear Eric,

              Thank you, this works perfectly and has been added to my .do file of helpful Stata commands! However, I now think a standardized approach may, more generally, not be as appropriate for my analysis as I once thought.

              This is because the Warwick-Edinburgh scale starts at a minimum value of 7, while the scale I am trying to standardize it against, the GHQ, starts at a minimum value of 0. Thus, I think a mean/standard-deviation approach across these two is generally going to be mismatched; see the pre-standardized summaries below:


              Code:
              sum dissatisfactionwave1
              
                  Variable |        Obs        Mean    Std. Dev.       Min        Max
              -------------+---------------------------------------------------------
                 ghqscorea |      2,106    1.878443    2.688666          0         12
              
              
              sum dissatisfactionwave2
                  Variable |        Obs        Mean    Std. Dev.       Min        Max
              -------------+---------------------------------------------------------
              PostWarwi~gh |        908    24.93725    3.817796      16.36         35

              Can I ask what your general feelings on this are? And if you could suggest another way of comparing across two continuous but different scales?

              All the best,

              John



              • #8
                To add some clarification,


                The maximum GHQ score is 12, with a score of 2 or greater indicating poor well-being and a score of 6 indicating severely diminished well-being. The SWEMWBS has low well-being (7–19.3), medium well-being (20.0–27.0) and high well-being (28.1–35). It strikes me that these are ordinal scales, and thus may need to be standardized to each other differently.

                And of course, as mentioned above, the SWEMWBS begins at 7 while the GHQ starts at 0.....
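
                To illustrate what I mean, something like the following rough banding is what I have in mind; the variable names (ghq_total, swemwbs) are just placeholders, and the band boundaries and labels are my own reading of the published cutoffs above:

                Code:
                * hypothetical banding into three ordered categories; higher band = better well-being
                gen byte ghq_band = .
                replace ghq_band = 3 if ghq_total <= 1                        // good well-being
                replace ghq_band = 2 if inrange(ghq_total, 2, 5)              // poor well-being
                replace ghq_band = 1 if ghq_total >= 6 & !missing(ghq_total)  // severely diminished
                gen byte swemwbs_band = .
                replace swemwbs_band = 1 if swemwbs < 20                      // low well-being (7-19.3)
                replace swemwbs_band = 2 if inrange(swemwbs, 20, 27)          // medium well-being (20.0-27.0)
                replace swemwbs_band = 3 if swemwbs > 27 & !missing(swemwbs)  // high well-being (28.1-35)
                Whether that banding is defensible is, of course, part of my question.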

                Thanks again,

                John



                • #9
                  Dear John,

                  I think it is rather difficult to align these two very different original measurement scales.
                  Hence my suggested code in #4. What it accomplishes is the 'ranking' of cases along each of these scales (separately, of course). This ranking is based on the percentiles of the responses as measured by each scale. It is the percentile ranking that is subsequently normalized and standardized (not the original scales).
                  Your objective was to compare the two measurements using a 'unit free approach', and that is now possible, but, of course, only with the interpretation in terms of these two standardized scales (with a zero mean and a standard deviation of one), i.e. the comparison of the percentile position of each case after transformation to a standardized distribution. You could use the percentiles instead of the standardized scales, but, in my opinion, a standardized scale is more useful for modelling because of its distributional properties (compare the kernel density plots in #4).
                  Now, for me it is impossible to judge the value of this 'unit free approach' for your study. You have to do some further analysis comparing (groups of) cases using these scales and determine whether you get results that make sense (or not). In any case, you should not try to interpret these 'unit free' scales in terms of the original data on which they are based.
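                  As a small illustration only, continuing the nlsw88 example from #4 (collgrad is merely a stand-in for whatever grouping is relevant in your study), such a comparison could look like:
                  Code:
                  tabstat pcr_wage zpcr_wage , by(collgrad) statistics(mean sd n)
                  ttest zpcr_wage , by(collgrad)    // two-group comparison on the normalized percentile-rank scale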

                  Best,
                  Eric
                  http://publicationslist.org/eric.melse

