Dear All,
I am running a GLM in an epidemiological study to examine the association between an individual's genetic diplotype (exposure) and various dietary intake outcomes (derived from an FFQ). A number of the dietary intake outcomes have been log-transformed because they are not normally distributed.
Below is an example of the raw data, showing the individual's diplotype and Total Vegetable Intake (g/day):
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input float Infant_diplo double total_veg_intake_gday_5Y
1 53.94285714285714
. 9.857142857142856
2 79.54571428571425
2 101.2214285714286
2 115.6
end
label values Infant_diplo Infant_diplo
label def Infant_diplo 1 "Heterozygous", modify
label def Infant_diplo 2 "Tasters", modify
An example of the GLM model is as follows:
Code:
putexcel set "File Name", sheet("total_veg_intake_gday_5Y") modify
putexcel A1=("Variable") B1=("b") C1=("ll") D1=("ul") E1=("P")
loc row = 2
foreach x of varlist Maternal_diplo Infant_diplo {
    glm total_veg_intake_gday_5Y_log `x', family(gaussian) link(identity) vce(robust)
    putexcel A`row' = ("`x'") ///
        B`row' = (_b[`x']) ///
        C`row' = (_b[`x']-1.96*_se[`x']) ///
        D`row' = (_b[`x']+1.96*_se[`x']) ///
        E`row' = (2*ttail(e(df), abs(_b[`x']/_se[`x']))) ///
        F`row' = matrix(e(N))
    loc row = `row' + 1
}
My questions are as follows:
1. Andy Field states in his book: "I know that taking the logarithm of a set of numbers squashes the right tail of the distribution therefore it’s a good way to reduce positive skew. However, you can’t get a log value of zero or negative numbers, so if your data tend to zero..." What is the interpretation of "tend to zero"? Is there a cut-off percentage of missing values or zeros within a variable beyond which a log-transformation of the data is no longer viable?
For example, for the variable shown above, "total_veg_intake_gday_5Y":
Out of a sample size of 803 for this variable:
- There are 14 values that are very small (between 0 and 1, e.g. 0.7845)
- There are 47 missing values "."
Is it still valid to log-transform this variable to deal with its non-normal distribution, provided of course that the skew is positive?
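For reference, a minimal sketch of how counts like these could be tabulated, and how many observations a natural-log transform would drop. The toy values are made up for illustration (two are copied from the dataex example), and `veg_log` is a hypothetical variable name. Note that values between 0 and 1 still have a defined (negative) log; only zeros and negative values become missing under `ln()`.

```stata
* Hypothetical toy data standing in for the real variable
clear
input double total_veg_intake_gday_5Y
53.94285714285714
9.857142857142856
0.7845
0
.
115.6
end

* Count missing values, exact zeros, and small positive values separately
count if missing(total_veg_intake_gday_5Y)
local n_missing = r(N)
count if total_veg_intake_gday_5Y == 0
local n_zero = r(N)
count if total_veg_intake_gday_5Y > 0 & total_veg_intake_gday_5Y < 1
local n_small = r(N)

* ln() returns missing for zero, negative, or missing input,
* so those observations would drop out of any model on the log scale
generate double veg_log = ln(total_veg_intake_gday_5Y)
count if missing(veg_log)
display "missing: `n_missing'  zero: `n_zero'  between 0 and 1: `n_small'  lost after ln(): " r(N)
```

Comparing the last count with the number of values that were already missing shows how many observations the transformation itself removes.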
2. Below is an example of the output of my model:
For outcomes that are not log-transformed, I understand the same GLM to say that each one-unit increase in diplotype is associated with an increase or decrease (depending on the sign of the beta coefficient) in the g/day of that outcome consumed. However, because Total Vegetable Intake has been log-transformed, I am having difficulty interpreting exactly what the results are telling me and how to present and write them up. I would value any insights or thoughts.
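To make the question concrete, here is a sketch of the back-transformation as it is usually described, assuming the outcome was transformed with the natural log. The symbols are illustrative and not taken from my output: $Y$ is intake in g/day, $X$ is the diplotype code, $\beta_1$ is its coefficient, and $\mathrm{GM}$ denotes the geometric mean.

```latex
\begin{align*}
\ln(Y) &= \beta_0 + \beta_1 X + \varepsilon \\
\frac{\mathrm{GM}(Y \mid X = x + 1)}{\mathrm{GM}(Y \mid X = x)} &= e^{\beta_1} \\
\text{percentage change in GM per unit of } X &= 100\,\bigl(e^{\beta_1} - 1\bigr)
\end{align*}
```

With a hypothetical coefficient of $\beta_1 = 0.10$, this gives $e^{0.10} \approx 1.105$, which would read as a geometric mean intake about 10.5% higher per one-unit increase in diplotype. The confidence limits would be exponentiated in the same way.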
3. Lastly, for those experienced in epidemiological studies who have encountered this issue with dietary data: is there another way to manage it other than a log-transformation? I have a considerable number of variables to run the model for.
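One alternative sometimes suggested for positively skewed outcomes is to leave the outcome on its original scale and use a log link rather than transforming the variable. A rough sketch follows, assuming a gamma family is reasonable for the data (an assumption, not something checked here). It uses Stata's bundled auto dataset as stand-in data, with `price` and `foreign` as placeholders for a dietary outcome and a diplotype.

```stata
* Stand-in data: price plays the role of a skewed positive outcome,
* foreign plays the role of the exposure (both are placeholders)
sysuse auto, clear

* Log link on the untransformed outcome; eform reports exp(b),
* i.e. the ratio of arithmetic means between exposure groups
glm price foreign, family(gamma) link(log) vce(robust) eform
```

A gamma family is generally intended for strictly positive outcomes, so any zero intakes would still need separate consideration under this approach.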
Thanks in advance for your time,
Mel