Transformation of values

Birgitte Tholin

Join Date: Feb 2022

Posts: 5
#1


31 May 2022, 03:15

Hi,

I have a data set on tumors with the variables Karnofsky index (KI; categorical with values 40, 60, 80, 90, 100) and gross tumor volume (GTV; continuous, m3), among others. I have fitted a linear regression that shows a significant negative relationship between the two (coefficient -0.16, p = 0.007).

1. I want to visualise the relationship, e.g. with a scatter plot, but as KI is categorical and GTV is continuous, the plot ends up looking very strange and I cannot see a linear relationship. I have transformed the GTV variable with a cube root because it was very left-skewed, but KI has a normal distribution. Do I still have to transform KI to get a linear relationship? Or convert it to a continuous variable?

2. I want to calculate Pearson's correlation coefficient, but as far as I know the assumptions are that the variables must be continuous and have a linear relationship. How do I solve this? By transforming one or both variables?

Thanks in advance,
Best regards

Carlo Lazzaro

Join Date: Apr 2014
Posts: 17854

31 May 2022, 03:30

Birgitte:
1) if one of the two variables is categorical, how can you see a linear relationship (y = mx + q) between them?
2) I would not transform your variables, as normality is a weak requirement and concerns the distribution of the OLS residuals only. That said, I would not go with a simple OLS here; I would switch from a coefficient of determination (as per OLS) to a nonparametric correlation coefficient instead;
3) as far as correlation is concerned, I would go -ktau-.
The above can be condensed into a toy example that saves a lot of words:

Code:

. sysuse auto
(1978 automobile data)

. regress price i.rep78

      Source |       SS           df       MS      Number of obs   =        69
-------------+----------------------------------   F(4, 64)        =      0.24
       Model |  8360542.63         4  2090135.66   Prob > F        =    0.9174
    Residual |   568436416        64     8881819   R-squared       =    0.0145
-------------+----------------------------------   Adj R-squared   =   -0.0471
       Total |   576796959        68  8482308.22   Root MSE        =    2980.2

------------------------------------------------------------------------------
       price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
       rep78 |
          2  |   1403.125   2356.085     0.60   0.554    -3303.696    6109.946
          3  |   1864.733   2176.458     0.86   0.395    -2483.242    6212.708
          4  |       1507   2221.338     0.68   0.500    -2930.633    5944.633
          5  |     1348.5   2290.927     0.59   0.558    -3228.153    5925.153
             |
       _cons |     4564.5   2107.347     2.17   0.034     354.5913    8774.409
------------------------------------------------------------------------------

. twoway (scatter price rep78)


. ktau price rep78, stats(taua taub obs p)

  Number of obs =      69
Kendall's tau-a =       0.0648
Kendall's tau-b =       0.0767
Kendall's score =     152
    SE of score =     182.223   (corrected for ties)

Test of H0: price and rep78 are independent
     Prob > |z| =       0.4073  (continuity corrected)

.

Kind regards,
Carlo
(Stata 19.0)


Nick Cox

Join Date: Mar 2014

Posts: 36058
#3

31 May 2022, 05:58

Your KI variable is like many variables in medical or social or environmental science: optimists will see it as a measurement that is a little crude but conventional, and pessimists will regard it as at best an ordinal scale.

What's hard to know from a scatter plot is the extent of overplotting of marker symbols, so plotting e.g. density or dot plots or histograms may help too.
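A minimal sketch of such plots, using the auto data from Carlo's toy example as a stand-in (price playing the role of the continuous variable and rep78 the categorical one; neither is the poster's data). The jitter amount and the particular choice of plots are assumptions for illustration only:

```stata
sysuse auto, clear

* scatter plot with jittering, so that coincident markers become visible
scatter price rep78, jitter(5)

* dot plot of the continuous variable for each level of the categorical one
dotplot price, over(rep78)

* one histogram per level of the categorical variable
histogram price, by(rep78)

* box plots as a compact summary of each conditional distribution
graph box price, over(rep78)
```

Which display works best depends on the sample size and on how many observations share the same values.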

I think the description of Pearson correlation here is a little backwards. To my mind it's not primarily a test of whether a linear relationship exists but more a measurement of how far it does.
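As a hedged illustration, again with the auto data standing in for the real variables, the Pearson correlation can be set beside rank-based measures that use only the ordering of the values. Treating rep78 as numeric in correlate is exactly the optimist's assumption mentioned above, not something the data can confirm:

```stata
sysuse auto, clear

* Pearson correlation: how closely the points follow a straight line
correlate price rep78

* rank-based alternatives that rely only on the ordering of the values
spearman price rep78
ktau price rep78
```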

This should be familiar to you but I guess there is a selection problem about who gets into the sample: only people known to have tumours which have been measured.

Your statements about distributions are not clear to me.

1. Volume I would expect to be right-skewed (positively skewed) and so I would expect cube rooting to reduce that skewness.

2. KI can't be normally distributed as on your evidence the distribution consists of 5 spikes. Perhaps you mean that the distribution is roughly symmetric, which is not at all the same as being normal.
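Both points can be checked directly. The sketch below uses the auto data as a stand-in, with price as the positive, skewed variable and rep78 as the discrete one; the variable name curt_price is made up for the example:

```stata
sysuse auto, clear

* skewness on the original scale
summarize price, detail
display "skewness, raw:       " %6.3f r(skewness)

* cube root transformation (price is strictly positive here)
generate double curt_price = price^(1/3)
summarize curt_price, detail
display "skewness, cube root: " %6.3f r(skewness)

* a frequency table shows the spikes of a discrete variable
tabulate rep78
```

A positive skewness statistic indicates right skew; if the cube root is doing its job, the value after transformation should be closer to zero.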
