  • Latent profile analysis - the observed variables are skewed with zero-values

    Hi Statalists,

    I am starting to explore the -gsem- command in Stata. Recently I tried to run a latent profile analysis (LPA) and found that all my observed variables are skewed and have at least 10-20% zero values. I ended up with one very large class from the LPA, accounting for 95+% of the observations. This class emerged no matter how many classes I specified in -gsem-. I wonder if the problem is the skewed variables.

    I wonder if I could transform the data, for example using something like the inverse hyperbolic sine (IHS) transformation. If so, how should I interpret my LPA results?

    Any thoughts are welcome, and thank you in advance!
    Yingyi

  • #2
    An inverse hyperbolic sine transformation isn't going to help you. You will still have 20% zeroes (asinh(0) = 0); it will just shrink the skewness on the right. In fact, by bringing some near-zero values even closer to zero, it may worsen the problem.

    More generally, no transformation is going to save you: whatever function f you use will still leave 20% of your observations stacked up at the value f(0), whatever that happens to be, so you are no better off.
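    To see this numerically, here is a quick sketch (Python, with made-up skewed data, since no sample data was posted): the IHS transform leaves the zero spike exactly where it was, because asinh(0) = 0.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical skewed variable: ~20% exact zeros, the rest lognormal.
x = np.where(rng.random(10_000) < 0.2, 0.0, rng.lognormal(0.0, 1.0, 10_000))

# Inverse hyperbolic sine (IHS) transform.
y = np.arcsinh(x)

# asinh(0) = 0, so the spike at zero is untouched; only the right tail shrinks.
print(np.arcsinh(0.0))                     # 0.0
print(np.mean(x == 0) == np.mean(y == 0))  # True: same share of zeros
```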

    So the question is why you have so many zero values. Is this a variable that can only take on non-negative values? If so, does the zero represent a floor effect in the measurement? Or are you really dealing with a mixture distribution consisting of a component where the values are always zero and another component that has a broader distribution?

    If it is a floor effect, you then need to consider whether this represents censoring of a latent variable that is potentially negative or whether, on the other hand, it is best to model the outcome variable as coming from a truncated distribution of some kind. -gsem- has options that support both of these possibilities.
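    A small simulation sketch of the censoring story (Python; the parameters are invented for illustration): a latent Gaussian that can go negative, recorded with a floor at zero, produces exactly this kind of spike.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(1)

# Latent variable that is potentially negative...
z = rng.normal(loc=1.0, scale=1.2, size=100_000)

# ...but observed with a floor at zero (left-censoring, tobit-style).
y = np.maximum(z, 0.0)

# The share of exact zeros is P(z <= 0) = Phi(-1.0/1.2), about 0.20.
phi = 0.5 * (1 + erf((-1.0 / 1.2) / sqrt(2)))
print(round(np.mean(y == 0), 3), round(phi, 3))
```

    If this censored-Gaussian story fits, -gsem- can, if I recall correctly, express it through the gaussian family's left-censoring option; see -help gsem- for the exact syntax.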

    To be honest, I'm not sure whether you could fit a latent class model of a distribution that is itself a mixture in -gsem-, but there are several Forum members who are much more knowledgeable about -gsem- and might offer ways to do that, if it can be done and that is what you need.


    • #3
      #2

      Hi Clyde,

      Thank you for your always prompt and detailed reply; it is very helpful!

      1) You are right that there is a floor effect in my data. The data consist of 5 variables (v1, v2, v3, v4 and v5).

      2) For each variable, there are 10-20% zero values, and these variables cannot take negative values.

      3) However, there is no observation with v1=0 & v2=0 & v3=0 & v4=0 & v5=0; i.e., at least one of the variables is > 0.


      Regarding your comments, I have a few questions below:

      Q1)
      Is the variable property described above a good reason for me to consider the IHS?
      I thought about dichotomizing my data to 0/1, but in that case I feel I would lose a lot of information.
      Indeed, I just transformed the data and re-ran the LPA, and the class results are no longer skewed.

      Q2)
      Now I get a couple of classes using the transformed data, and I plotted them out.
      But in this case the y-axis shows the IHS-transformed values;
      I wonder if there is a way in Stata to show the y-axis as the original data means before the transformation.

      My code is:

      Code:
      margins, ///
      predict(outcome(v1) class(1)) ///
      predict(outcome(v2) class(1)) ///
      predict(outcome(v3) class(1)) ///
      predict(outcome(v4) class(1)) ///
      predict(outcome(v5) class(1)) 
      
      
      marginsplot, recast(bar) title("Class 1") xtitle("") ///
      xlabel(1 "v1" 2 "v2" 3 "v3" 4 "v4" 5 "v5", angle(45) labsize(small)) ///
      ytitle("Predicted ihs_mean") ylabel(0(1)6) name(class1)
      
      margins, ///
      predict(outcome(v1) class(2)) ///
      predict(outcome(v2) class(2)) ///
      predict(outcome(v3) class(2)) ///
      predict(outcome(v4) class(2)) ///
      predict(outcome(v5) class(2)) 
      
      marginsplot, recast(bar) title("Class 2") xtitle("") ///
      xlabel(1 "v1" 2 "v2" 3 "v3" 4 "v4" 5 "v5", angle(45) labsize(small)) ///
      ytitle("Predicted ihs_mean") ylabel(0(1)6) name(class2)
      
      margins, ///
      predict(outcome(v1) class(3)) ///
      predict(outcome(v2) class(3)) ///
      predict(outcome(v3) class(3)) ///
      predict(outcome(v4) class(3)) ///
      predict(outcome(v5) class(3)) 
      
      marginsplot, recast(bar) title("Class 3") xtitle("") ///
      xlabel(1 "v1" 2 "v2" 3 "v3" 4 "v4" 5 "v5", angle(45) labsize(small)) ///
      ytitle("Predicted ihs_mean") ylabel(0(1)6) name(class3)
      
      graph combine class1 class2 class3
      Last edited by Yingyi Lin; 11 Nov 2018, 13:51.


      • #4
        Is my variable property described above a good reason for me to consider IHS?
        No. As I said in #2, that won't help with this problem, and could actually make things worse.

        I wonder if there is a way in STATA to show the y-axis as the original data means before the transformation.
        No direct way. You would need to go back and undo the transformations, then rerun the model and all the -margins- commands.
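        That said, if all you want is axis values on the raw scale, a rough shortcut (a sketch only, not a substitute for refitting; the numbers below are made up) is to push the predicted IHS-scale means back through sinh(), keeping in mind that sinh of a mean is not the mean of the raw variable:

```python
import numpy as np

# Hypothetical predicted means on the IHS scale (e.g., copied from -margins-).
ihs_means = np.array([0.8, 2.1, 3.4])

# sinh() is the inverse of asinh(), so this maps the fitted means back
# to the raw scale of the variable...
raw_scale = np.sinh(ihs_means)

# ...but sinh(E[asinh(y)]) != E[y] in general (Jensen's inequality), so
# treat these as back-transformed class locations, not means of y.
print(np.round(raw_scale, 2))
```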


        • #5
          #4

          Hi Clyde,

          Thanks for the response.

          Does that mean transforming my original data to categories (e.g. 0/1) would be an option here?


          Thanks,
          Yingyi


          • #6
            It is one possibility, but it discards a lot of information. If it is reasonable to think of the data as coming from a truncated parametric distribution, or as censored, I think that would be better.


            • #7
              #6

              Thank you very much for all these valuable thoughts Clyde!
              I'll start with the 0/1 option and see how it goes.

              Thanks again,
              Yingyi


              • #8
                Originally posted by Yingyi Lin View Post
                Hi Statalists,

                I am starting to explore the -gsem- command in Stata. Recently I tried to run a latent profile analysis (LPA) and found that all my observed variables are skewed and have at least 10-20% zero values. I ended up with one very large class from the LPA, accounting for 95+% of the observations. This class emerged no matter how many classes I specified in -gsem-. I wonder if the problem is the skewed variables.

                I wonder if I could transform the data, for example using something like the inverse hyperbolic sine (IHS) transformation. If so, how should I interpret my LPA results?

                Any thoughts are welcome, and thank you in advance!
                Yingyi
                Yingyi,

                One other thought: if, after most analyses, you have one class with 95% of your observations, that's a good sign that your sample could be pretty homogeneous, in which case an LPA isn't very strongly justified. I agree that I don't see most transformations being helpful in your case. It might help if you told us a bit more about what these variables are. Maybe they are count variables, or something that could be modeled with Poisson or negative binomial distributions? Count variables, as you know, have support from zero to infinity, and they can frequently have a large mass at zero depending on their rate parameter.
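                For instance, under a plain Poisson the chance of an exact zero is exp(-rate), so a low rate alone puts a big mass at zero (quick check in Python):

```python
import math

# P(Y = 0) under Poisson(rate) is exp(-rate): large at small rates.
for rate in (0.5, 1.0, 2.0, 5.0):
    print(rate, round(math.exp(-rate), 3))
```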

                To be honest, if the data can't take values less than zero, only 10-20% of the observations are zero, and the estimated mean for each of your classes would sit well above zero, then modeling them as Gaussian doesn't seem terribly wrong to me.

                For completeness, I should point out that -gsem- has an option (see the presentation by Stata's Rafal Raciborski) where you can specify that one or more of the latent classes is a point mass at zero. This can be used to fit zero-inflated models (e.g. zero-inflated Poisson). It's mainly used in finite mixture models, and I'm not sure if or how well it works in latent profile analysis.
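                To illustrate the zero-inflation idea (a Python sketch with invented parameters): mixing a point-mass-at-zero class with an ordinary Poisson class yields far more zeros than the Poisson alone.

```python
import numpy as np

rng = np.random.default_rng(2)
n, pi, lam = 100_000, 0.3, 3.0  # weight of the zero class, Poisson rate

# Zero-inflated Poisson: with probability pi the observation is a
# structural zero; otherwise it is drawn from Poisson(lam).
structural_zero = rng.random(n) < pi
y = np.where(structural_zero, 0, rng.poisson(lam, n))

# Expected share of zeros: pi + (1 - pi) * exp(-lam), about 0.335,
# versus exp(-lam), about 0.05, for a plain Poisson(3).
print(round(np.mean(y == 0), 3))
```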
                Be aware that it can be very hard to answer a question without sample data. You can use the -dataex- command for this. Type -help dataex- at the command line.

                When presenting code or results, please use code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.


                • #9
                  Hi Weiwen,

                  Thanks for your thought here.

                  To be honest, if the data can't take values less than zero, only 10-20% of the observations are zero, and the estimated mean for each of your classes would sit well above zero, then modeling them as Gaussian doesn't seem terribly wrong to me.
                  I was also thinking that Gaussian might be an option here. But I have some variables that have ~50% zeros.


                  For completeness, I should point out that -gsem- has an option (see the presentation by Stata's Rafal Raciborski) where you can specify that one or more of the latent classes is a point mass at zero. This can be used to fit zero-inflated models (e.g. zero-inflated Poisson). It's mainly used in finite mixture models, and I'm not sure if or how well it works in latent profile analysis.
                  This is super helpful! I feel like my original data fits a zero-inflated Poisson better, since it is count data collected over a fixed time period.


                  Thanks again,
                  Yingyi
