EPI STUDIES - log transform data

Melissa Bujtor

Join Date: Jul 2017

Posts: 29
#1

27 Jul 2017, 13:43

Dear All,

I am running GLM in an epidemiological study of the association between an individual's genetic diplotype (exposure) and various dietary intake outcomes (which were derived from an FFQ). Below is an example of the raw data: the individual's diplotype and Total Vegetable Intake (g/day):

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float Infant_diplo double total_veg_intake_gday_5Y
1 53.94285714285714
. 9.857142857142856
2 79.54571428571425
2 101.2214285714286
2 115.6
end
label values Infant_diplo Infant_diplo
label def Infant_diplo 1 "Heterozygous", modify
label def Infant_diplo 2 "Tasters", modify

However, a number of the dietary intake outcomes have been log-transformed as they are not normally distributed.

An example of the GLM model is as follows:

Code:

putexcel set "File Name", sheet("total_veg_intake_gday_5Y") modify
putexcel A1=("Variable") B1=("b") C1=("ll") D1=("ul") E1=("P")
loc row = 2
foreach x of varlist Maternal_diplo Infant_diplo {
    glm total_veg_intake_gday_5Y_log `x', family(gaussian) link(identity) vce(robust)
    putexcel A`row' = ("`x'") ///
        B`row' = (_b[`x']) ///
        C`row' = (_b[`x']-1.96*_se[`x']) ///
        D`row' = (_b[`x']+1.96*_se[`x']) ///
        E`row' = (2*ttail(e(df), abs(_b[`x']/_se[`x']))) ///
        F`row' = matrix(e(N))
    loc row = `row' + 1
}

My questions are as follows:

1. Andy Field states in his book "I know that taking the logarithm of a set of numbers squashes the right tail of the distribution therefore it’s a good way to reduce positive skew. However, you can’t get a log value of zero or negative numbers, so if your data tend to zero..." What is the interpretation of "tend to zero"? Is there a cut-off percentage of missing values or values of zero within a variable beyond which a log-transformation of the data is no longer viable?

For example, for the variable depicted above "total_veg_intake_gday_5Y" :
Of a sample size for the variable of 803:
- There are 14 values that are very small (between 0 and 1, e.g. 0.7845)
- There are 47 missing values "."

Is it still possible to use log-transform on this variable to manage the fact that it is not normally distributed and receive valid output, provided of course the skew is positive?

2. Below is an example of the output of my model:

In variables that are not log-transformed, where I am running the same GLM model I understand that what the model is suggesting is that for every unit increase in diplotype there is an increase / decrease (depending on the direction of the beta coef) in g/day of that outcome variable consumed. However given this variable for Total Vegetable Intake has been log-transformed I am having difficulty interpreting exactly what the results are telling me, and how to present them / write them up and would value any insights or thoughts.

3. Lastly, for those experienced in Epi studies that have encountered this issue before with dietary data, is there another way to manage it rather than log-transform. I have a considerable number of variables to run the model for.

Thanks in advance for your time,
Mel

Clyde Schechter

Join Date: Apr 2014

Posts: 30097
#2

27 Jul 2017, 14:43

Before turning to the questions you asked, I'll start by observing that -glm- with family(gaussian) and link(id) is just a long-winded way of saying -regress-. And it has the drawback that you lose the availability of some post-regress commands that you might find helpful (e.g. -rvfplot- to name just one.) So is there some particular reason you are doing that?
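
The equivalence can be checked with a minimal sketch. It assumes Stata's bundled auto dataset, with price and mpg as stand-in variables rather than the poster's data:

```stata
sysuse auto, clear

* Ordinary least squares
regress price mpg

* Same model via -glm-: the coefficients should match those from -regress-
glm price mpg, family(gaussian) link(identity)

* A post-estimation plot that is available after -regress-
quietly regress price mpg
rvfplot, yline(0)
```

Only the point estimates are being compared here; the way standard errors are reported can differ between the two commands.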

Now, just as a log transformation compresses the right tail of the data, it also inflates the left tail if there are values near zero. All numbers from 1 down to (but not including) 0 map to the range from 0 all the way out to negative infinity.
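
A few values make the stretching of the left tail concrete. This is only illustrative arithmetic; 0.7845 is the small value quoted in the first post, the other inputs are arbitrary:

```stata
* Values at or above 1 are compressed
display ln(100)     // about 4.61
display ln(1)       // 0

* Values between 0 and 1 are stretched toward minus infinity
display ln(0.7845)  // about -0.24
display ln(0.01)    // about -4.61

* Zero has no logarithm: Stata returns missing
display ln(0)       // .
```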

I am not a fan of using a log transform to "normalize" the data. First of all, there is no requirement that the dependent variable of a linear regression be normalized. That is some kind of urban legend; and it's a zombie--it refuses to die. In the classical analysis of linear regression, it is often assumed that the residuals are normally distributed. From that assumption one can derive the t-distributions of the coefficients divided by their standard errors. But, even without that residual normality assumption, if you look at the calculation of the regression coefficients carefully you can realize that the central limit theorem applies to show that the coefficients divided by their standard errors follow an asymptotically normal distribution. So if your sample is reasonably large (as appears to be the case here) you really don't have to worry about this distribution at all. It's only an issue in small samples.
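
One way to see the large-sample argument at work is a small simulation. The sketch below is an illustration under assumed settings, not part of the original analysis: the program name skewsim, the sample size of 800, the gamma-distributed outcome, and the true slope of 10 are all hypothetical choices.

```stata
clear all
set seed 12345

* Hypothetical simulation: right-skewed outcome, dichotomous predictor
program define skewsim, rclass
    drop _all
    set obs 800
    generate byte x = runiform() < 0.5
    generate double y = rgamma(1, 50) + 10*x   // skewed errors, true slope = 10
    regress y x, vce(robust)
    * 1 if the nominal 95% interval covers the true slope, 0 otherwise
    return scalar covered = (_b[x] - 1.96*_se[x] < 10) & (10 < _b[x] + 1.96*_se[x])
end

simulate covered = r(covered), reps(1000) nodots: skewsim
summarize covered
```

If the asymptotic argument holds for a sample of this size, the mean of covered should come out close to 0.95 even though the outcome is far from normal.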

The drawbacks to log-transforming are that: 1) you cannot apply it to zero or negative values, 2) if you have values that are close to zero, things get really distorted, and 3) explaining the results is, as you have noted, less than obvious. So it's pretty much all downside with no upside to speak of.

Now, there are other reasons to log-transform data. If you are using a continuous predictor, then the functional relationship between them might actually be a power law, in which case log-transforming the outcome variable linearizes the relationship and leads to a properly specified model. But you are using dichotomous predictors here, so none of this applies.

So if this were my project, I would jettison the log-transformed analyses and just proceed with analysis of the untransformed data.

As an aside, in the future, please don't post screenshots of Stata output. This particular one was easy enough to read, but often they are not readable. And, if it were necessary to extract some numbers from that output to do some calculations with, you can't copy/paste from a graphical image. The helpful way to post Stata output is to copy it into the forum editor between code delimiters. If you don't know about code delimiters, please read FAQ #12.
Melissa Bujtor

Join Date: Jul 2017

Posts: 29
#3

27 Jul 2017, 14:46

As always, Clyde, thank you for your detailed advice. Noted regarding screenshots of Stata output; I will not post it in that manner again.
Melissa Bujtor

Join Date: Jul 2017

Posts: 29
#4

27 Jul 2017, 15:07

Apologies, I hit post prematurely:

1. Would you have any solid references you could point me to regarding "So if your sample is reasonably large (as appears to be the case here) you really don't have to worry about this distribution at all. It's only an issue in small samples.", which I could use to build this argument with my supervisors?

2. Re: regress vs GLM, no particular reason, other than that I am fairly green at all of this and it was some advice that I received. Would you suggest it is better to use -regress-? Are there instances where -glm- is more appropriate?
Clyde Schechter

Join Date: Apr 2014

Posts: 30097
#5

27 Jul 2017, 16:47

1. William H. Greene, Econometric Analysis (4th ed.). Prentice-Hall 2000, Chapter 9. This is an old version of Greene's classic text. I don't have a more modern version of the book, but on Greene's website, he posts Chapter 4 of the current edition--look near the end of the chapter for large-sample behavior, which has essentially the same content (identical for present purposes.) http://people.stern.nyu.edu/wgreene/...neChapter4.pdf.

2. People generally use -glm- when they want to use one of the other families or links. It enables you to fit more general models and was one of the great advances in statistical computing in the last half of the 20th century (in my opinion). I use it frequently in my own work. But using -glm- to do an ordinary linear regression is a bit like taking a helicopter to get to the corner grocery store. It'll get you there, but it's way overkill for the task. I can't think of any advantages it offers for this task, and using it does prevent you from running some of the simpler post-estimation commands available after -regress-.
Nick Cox

Join Date: Mar 2014

Posts: 35696
#6

27 Jul 2017, 19:39

Alternatively -glm- with log link gives all the advantages of log transformation without any of the difficulties of dealing with zeros. The assumption is that means are always positive, not the raw data.

Whether it matches your generating process in terms of functional form is hard to say. You certainly have different error families to choose from.
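
A minimal sketch of the log-link approach on simulated data follows. The variable names diplo and intake, the 0/1/2 coding, and the choice of family(poisson) with robust standard errors are assumptions made for illustration; the family that suits real dietary data is a separate judgment.

```stata
clear
set seed 2017
set obs 800

* Hypothetical data: 0/1/2 predictor and a right-skewed outcome with some zeros
generate byte diplo = floor(3*runiform())
generate double intake = rgamma(2, 30) * exp(-0.1*diplo)
replace intake = 0 in 1/20

* Log link models the log of the mean, so zeros in the raw data are allowed.
* -eform- reports exp(b): the ratio of mean intake per one-unit step in diplo.
glm intake diplo, family(poisson) link(log) vce(robust) eform
```

The gamma family with a log link, discussed later in the thread, is fitted the same way by swapping the family() option. How it treats exact zeros is worth checking in the -glm- documentation before relying on it.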
Melissa Bujtor

Join Date: Jul 2017

Posts: 29
#7

28 Jul 2017, 13:41

Thank you both, for your input, will locate and read up the references now. Appreciated.
Melissa Bujtor

Join Date: Jul 2017

Posts: 29
#8

30 Jul 2017, 23:24

Dear Nick and Clyde,

I read the reference from Clyde, thank you.

Where Nick makes the comment

Alternatively -glm- with log link gives all the advantages of log transformation without any of the difficulties of dealing with zeros. The assumption is that means are always positive, not the raw data.

Would you have a reference you could point me to on this topic, Nick?

Some points to note about the data I am working with for only some of the dependent variables in my analysis:
- Positively skewed to the right;
- Sample sizes can vary from 200 to 1200 depending on the dependent variable (outcome variable) in question;

Would you have any other thoughts, comments or suggestions regarding how to manage the variables that are not normally distributed?

Kind regards,
Mel
Melissa Bujtor

Join Date: Jul 2017

Posts: 29
#9

31 Jul 2017, 00:13

One more point: one of my data sets has approx. 70% zero values. If log-transforming is still the best option (although it seems it isn't), the paper in the link below suggests adding a constant to the raw data to handle the zeros. So, say, add 1 to each value before transforming and then use inverse Gaussian to analyze?

What would your thoughts be regarding that?

https://link.springer.com/article/10...651-005-6817-1
Maarten Buis

Join Date: Mar 2014

Posts: 3456
#10

31 Jul 2017, 01:26

Don't do that. It is also not what your linked article proposed; it explicitly presents a method to avoid that transformation. The log transformation is a non-linear transformation, so adding and subtracting are no longer as safe as we are used to. Instead use the log link function as Nick proposed. You asked for some references, so here are some:

Nicholas J. Cox, Jeff Warburton, Alona Armstrong, Victoria J. Holliday (2007) "Fitting concentration and load rating curves with generalized linear models" Earth Surface Processes and Landforms, 33(1):25--39. DOI: 10.1002/esp.1523

Santos Silva, J.M.C., S. Tenreyro 2006. "The log of gravity", The Review of Economics and Statistics 88(4):641-658. http://dx.doi.org/10.1162/rest.88.4.641
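
To see why adding a constant before taking logs is not innocuous, here is a minimal sketch on simulated data. Everything in it is hypothetical: the variable names, the share of zeros, and the two constants (1 and 0.01) are arbitrary choices for illustration.

```stata
clear
set seed 2017
set obs 800

* Hypothetical skewed outcome with many zeros and a dichotomous predictor
generate byte x = runiform() < 0.5
generate double y = rgamma(0.5, 20) * exp(0.3*x)
replace y = 0 if runiform() < 0.3

* Two equally arbitrary "fixes" for the zeros
generate double ln_y_1   = ln(y + 1)
generate double ln_y_001 = ln(y + 0.01)

* The estimated slope on x will generally differ between the two
regress ln_y_1 x
regress ln_y_001 x
```

Because the answer depends on the constant chosen, the estimate partly reflects that arbitrary choice rather than the data.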

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------
Melissa Bujtor

Join Date: Jul 2017

Posts: 29
#11

01 Aug 2017, 14:42

Nick, Maarten and Clyde,

Thanks all for your input to date.

I am coming up against some resistance to using the gamma log-link method, I think simply because it is foreign to those involved, i.e. it is a change.

1. Would you have any solid references specifically around its use in epidemiological studies of dietary intake (nutrition based)?

2. When reviewing the data once more, some of the variables are in fact still not normally distributed after log-transformation anyhow. It has also been suggested that we categorise the dietary intake into quantiles and run a linear regression. In your experience, what are the drawbacks and/or the upsides of approaching the data in this manner? Do you have to split ALL dependent variables in your analysis into quantiles, or only those that are not normally distributed? If choosing this method, is it appropriate to use GLM, and if so do you run it with family gaussian, link identity?

Appreciate any thoughts you can offer up.

Regards
Mel
Clyde Schechter

Join Date: Apr 2014

Posts: 30097
#12

01 Aug 2017, 15:50

1. Here is one that I turned up with a quick PubMed search:

Bigio RS, Verly E Jr, de Castro MA, Cesar CL, Fisberg RM, Marchioni DM. Are plasma homocysteine concentrations in Brazilian adolescents influenced by the intake of the main food sources of natural folate? Ann Nutr Metab. 2013;62(4):331-8. doi: 10.1159/000348883. Epub 2013 Jul 2. PubMed PMID: 23838397.

My guess is that with some concerted effort you could hunt down some others. That said, few statistical procedures are restricted or even tailored to application in any particular scientific field (though many have originated in the study of some particular discipline). The essence of statistics, it seems to me, is that it is cross-disciplinary. It is the science of using data to understand the world. Any statistical technique, in principle, could be used in any subject if its mathematical assumptions are met (or close to met). Moreover, it's not as if log-link and gamma family are new and radical. Generalized linear models have been around for several decades now. They are dealt with in every basic biostatistics text. The people who are resisting this--where have they been?

2. I don't even understand this suggestion. First, whence this quest for normality? Are you dealing with people who have not yet emerged from the early days of ANOVA in small samples? Second, if you categorize these variables and use them as outcome variables (which is what I understand to be their role in your research) then you would not run a linear regression as your outcome is polychotomous. You would be looking, depending on how the quantiles were cut, at a discrete outcome model such as logistic, or ordered logistic or multinomial logistic, or perhaps other techniques altogether.

Be that as it may, it's a terrible idea. Let's say your variable is X and that it ranges in realistic data from, say, 10 to 50. Suppose you categorize it by cutting at some value, say, 30. Then you are treating subjects where X = 10 and X = 29 as identical but subjects where X = 30 and X = 31 as radically different. Clearly that makes no sense unless, in reality, there is something abrupt and discontinuous that happens when you cross that line at 30. But the real world seldom works that way, and I would be willing to bet large sums that it does not work that way with dietary intake measures (if only because they are much too noisy.) So I wouldn't do this with any variable unless I had clear scientific evidence of such discontinuous effects. To my mind, the only purpose for which this sort of thing is occasionally helpful is if you want to graphically display the distributions of some other variables conditional on the values of X and you think something like a bar graph or series of boxplots or a panel of histograms/density curves is nicer than a scatterplot or the like. (As you may imagine, I wouldn't really like that approach to graphing conditional distributions either, but I would find it less irksome than the current context.)
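
The information loss from cutting at 30 can be seen in a tiny sketch. The five values of x and the variable name x_high are made up to mirror the example above:

```stata
clear
input x
10
29
30
31
50
end

* Dichotomize at 30, as in the hypothetical cut described above
generate byte x_high = x >= 30
list
```

After the cut, 10 and 29 share one category while 29 and 30 fall in different ones, which is the distortion being described.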
Melissa Bujtor

Join Date: Jul 2017

Posts: 29
#13

01 Aug 2017, 17:06

Hi Clyde,

Thank you for the advice above.

The variables that we are discussing for categorization are indeed the outcome variables...

Bringing it back to one earlier point you made (and I apologise in advance if this is a stupid question... I am very new to this):

Now, there are other reasons to log-transform data. If you are using a continuous predictor, then the functional relationship between them might actually be a power law, in which case log-transforming the outcome variable linearizes the relationship and leads to a properly specified model. But you are using dichotomous predictors here, so none of this applies.

The predictors in my models are the variants of the genes (the diplotypes) and we have been told to model them continuously.

I believe the way I have written the code achieves this: while there are 3 categories (coded as 0, 1 and 2), I have not included "i." in front of the variables (Maternal Diplotype and Infant Diplotype), so I am telling Stata to treat them as continuous variables. Is this correct?

Code:

putexcel set "File Name", sheet("bw_for_ga_zscore_gusto") modify
putexcel A1=("Variable") B1=("b") C1=("ll") D1=("ul") E1=("P")

* Crude models
loc row = 2
foreach x of varlist Maternal_diplo Infant_diplo {
    glm bw_for_ga_zscore_gusto `x', family(gaussian) link(identity) vce(robust)
    putexcel A`row' = ("`x'") ///
        B`row' = (_b[`x']) ///
        C`row' = (_b[`x']-1.96*_se[`x']) ///
        D`row' = (_b[`x']+1.96*_se[`x']) ///
        E`row' = (2*ttail(e(df), abs(_b[`x']/_se[`x']))) ///
        F`row' = matrix(e(N))
    loc row = `row' + 1
}

* Models adjusted for mother's ethnicity
loc row = 10
foreach x of varlist Maternal_diplo Infant_diplo {
    glm bw_for_ga_zscore_gusto `x' i.MothersEthnicity_Nu, family(gaussian) link(identity) vce(robust)
    putexcel A`row' = ("`x'") ///
        B`row' = (_b[`x']) ///
        C`row' = (_b[`x']-1.96*_se[`x']) ///
        D`row' = (_b[`x']+1.96*_se[`x']) ///
        E`row' = (2*ttail(e(df), abs(_b[`x']/_se[`x']))) ///
        F`row' = matrix(e(N))
    loc row = `row' + 1
}

Hence, does that change your view at all on an approach to the data? Would a log transform, if it were to normalize the data (although Shapiro-Wilk is suggesting it doesn't), then be the best approach, or would it still be GLM with gamma family and log link?

thanks in advance as always
Mel

Last edited by Melissa Bujtor; 01 Aug 2017, 17:42.
Clyde Schechter

Join Date: Apr 2014

Posts: 30097
#14

01 Aug 2017, 17:22

I believe the way I have written the code achieves this: while there are 3 categories (coded as 0, 1 and 2), I have not included "i." in front of the variables (Maternal Diplotype and Infant Diplotype), so I am telling Stata to treat them as continuous variables. Is this correct?

Yes that's correct.

Hence, does that change your view at all on an approach to the data? Would a log transform, if it were to normalize the data (although Shapiro-Wilk is suggesting it doesn't), then be the best approach, or would it still be GLM with gamma family and log link?

It wouldn't change my view at all.
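
For readers unsure about the difference between entering a 0/1/2 variable with and without the i. prefix, here is a minimal sketch on simulated data. The names diplo and y and the true slope of -5 are hypothetical:

```stata
clear
set seed 2017
set obs 600

* Hypothetical 0/1/2 predictor and a continuous outcome
generate byte diplo = floor(3*runiform())
generate double y = 50 - 5*diplo + rnormal(0, 20)

* No prefix: treated as continuous, one slope per one-unit step
regress y diplo

* c. prefix: explicitly continuous, same model as above
regress y c.diplo

* i. prefix: treated as categorical, separate contrasts for 1 vs 0 and 2 vs 0
regress y i.diplo
```

The continuous coding imposes equal steps from 0 to 1 and from 1 to 2. The i. version does not, so comparing the two is one informal check on that assumption.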

Melissa Bujtor

Join Date: Jul 2017
Posts: 29

#15

01 Aug 2017, 17:56

Thank you Clyde.

Attached is a summary of the results, using the Infant Diplotype as the predictor and the outcome variable "Dark Green Leafy Vegetable Intake (g/day)", which is positively skewed. Shown are:

Table 1: the glm using family gaussian link identity, however these could now be considered "incorrect" as the log transform did not normalise the data.

Table 2: raw data using family gamma link log

Table 1: Log Transformed
Predictor: Infant Diplo
Outcome: Dark Green Leafy Vegetables (g/day) - 5Y

Model                                  Beta   Lower Limit  Upper Limit  P      # Obs
Crude Model                            -0.13  -0.27         0.01        0.065  522
Adjusted Model (i.Mothers Ethnicity)   -0.21  -0.36        -0.06        0.005  521

Table 2: Family (gamma), link (log)
Predictor: Infant Diplo
Outcome: Dark Green Leafy Vegetables (g/day) - 5Y

Model                                  Beta   Lower Limit  Upper Limit  P      # Obs
Crude Model                            -0.08  -0.25         0.09        0.332  522
Adjusted Model (i.Mothers Ethnicity)   -0.18  -0.36        -0.06        0.042  521

1. Assuming the log transform HAD normalized the data, would my interpretation of these results be correct:

With taster as the reference, the intake of dark green leafy vegetables per day at 5 years old is about 12% lower in non-tasters than in tasters (P=0.064, 95% CI [0.77, 1.00]), holding other variables constant, since exp(-0.129)=0.88. But this result is not significant. The intake of dark green leafy vegetables per day at 5 years old is about 19% lower in non-tasters than in tasters in the Chinese (P=0.005, 95% CI [0.70, 0.94]), holding other variables constant, since exp(-0.21)=0.81.
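
The back-transformation behind those percentages can be reproduced with simple arithmetic. The sketch below uses the rounded adjusted-model values from Table 1 (-0.21, -0.36, -0.06), so results may differ slightly from figures computed on unrounded estimates:

```stata
* exp(b) is the multiplicative change in the outcome per one-unit step
* in the predictor, on the original (g/day) scale
display exp(-0.21)               // about 0.81
display exp(-0.36)               // lower limit, about 0.70
display exp(-0.06)               // upper limit, about 0.94

* Expressed as a percentage change
display 100*(exp(-0.21) - 1)     // about -19
```

Because the model was fitted to log intake, this ratio is usually read as a ratio of geometric means rather than of arithmetic means.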

2. What drives the difference between the log transformed data and the raw data with gamma/log. What are the differences saying here?
