Latent class mixed models – with graphics

[This article was first published on Matt's Stats n stuff » R, and kindly contributed to R-bloggers.]

Foreword

Apologies that the output isn’t showing well as far as spacing and alignment are concerned. I’m working on fixing that!

Background

Generally when we have a set of data, we have known groupings: three treatment groups, two sex groups, four ethnicity groups and so on. There is also the possibility of unobserved groupings within your data. Some examples (I’m clutching at straws here) are those who are vegetarians and those who aren’t, those who regularly exercise and those who don’t, or those who have a family history of a certain condition and those who don’t (assuming those data were not collected). There is an approach for investigating exactly this.

Model and implementation

I’m no expert on anything, let alone the literature for this, yet. The terms used are Latent Class Analysis (LCA) and Mixture Modeling (MM) (sometimes Finite MM). From what I can tell these are largely the same; certainly they come from the same ‘school’ of approaches to investigating this sort of structure within a data set. Reading the first few lines of the Wikipedia pages for LCA and MM would lead you to think that the differences are technical (minimal?). It would be nice if, with that, I’ve put my foot in it enough for someone to comment and provide a very clear, concise explanation of the differences!

It makes sense then, given the above, that the R package I am about to use is called lcmm, for ‘latent class mixed models’. As you would expect, modeling of this nature is implemented in other software as well: LatentGold and Mplus are (paid) options, and there are implementations within most common statistics packages. Applying this methodology to longitudinal data can present further challenges; ProcTraj, implemented through SAS, is a popular option for this.

What I, personally, found clumsy in both the ProcTraj and Mplus environments was handling data where subjects are not all observed at the same time points. In tightly designed studies where all subjects are measured at baseline, 6 months and then 12 months, these packages seem straightforward. However, they do not seem intuitive with the sort of data that arises in an observational epidemiology setting, where people are diagnosed at different ages, where people leave the study (area) or pass away so that data collection ends at different times, or where you simply choose a point in time from which to analyse your data, so that subjects have differing amounts of data available.

My example

To see the lcmm package in action, we have the following data set:

> str(dat)
'data.frame': 43108 obs. of 10 variables:
$ id : chr "I072028" "I072028" "I072028" "I072028" ...
$ x : num 155 157 158 159 161 ...
$ x0 : num 155 155 155 155 155 ...
$ y : num 29.1 23.4 21.6 22.1 25.7 ...
$ rf1 : Factor w/ 2 levels "A","B": 2 2 2 2 2 2 2 2 2 2 ...
$ rf2 : chr "2003" "2003" "2003" "2003" ...
$ rf3 : Factor w/ 6 levels "pre 94","95-97",..: 4 4 4 4 4 4 4 4 4 4 ...
$ rf4 : num 0.134 1.474 2.278 4.02 5.628 ...
$ totobs: int 33 33 33 33 33 33 33 33 33 33 ...
$ x2 : num 24162 24580 24833 25385 25900 ...

> head(dat)
id x x0 y rf1 rf2 rf3 rf4 totobs x2
1 I072028 155.440 155.3336 29.12 B 2003 01-03 0.134 33 24161.59
2 I072028 156.780 155.3336 23.40 B 2003 01-03 1.474 33 24579.97
3 I072028 157.584 155.3336 21.58 B 2003 01-03 2.278 33 24832.72
4 I072028 159.326 155.3336 22.10 B 2003 01-03 4.020 33 25384.77
5 I072028 160.934 155.3336 25.74 B 2003 01-03 5.628 33 25899.75
6 I072028 162.408 155.3336 23.66 B 2003 01-03 7.102 33 26376.36

Here x is our independent variable (think age), y is our dependent outcome (say cholesterol level), and id is our id variable, identifying which data come from which subject. x0 is each subject’s first (baseline) value of x, and x2 is simply x squared (e.g. 155.44² ≈ 24161.6). rf1 and rf2 are two risk factors that we may look to incorporate at a later time, and totobs lets us know how many observations each subject has.

So with >2000 subjects and up to 80 time points per subject, where do you start? Well, let’s plot everyone’s data.

[Figure 1: four panels — all subjects’ raw trajectories; 300 random subjects’ raw data; their straight-line fits; their loess curves]

Some might say the first graph (top left) shows you nothing. I don’t. You can clearly see the top right portion is far denser than the top left, so more people finish with a high value than start with one. Interesting. The next three plots show just 300 random subjects with at least 5 time points each (to avoid the clutter a bit). The top right is a plot of their raw data, the bottom left is a straight-line fit for each person, and the bottom right is a smoothed (loess) curve for each person, fitted using the default settings of geom_smooth() from the beyond-awesome ggplot2 package. Removing the clutter further highlights the increased spread at the right of the graph compared to the left. A sketch of how such a figure could be built follows.
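The original script isn’t included in this post (see the note at the end), so here is a minimal sketch, assuming only that dat has the columns shown by str(dat) above; the seed, sample handling and panel titles are my own.

# A sketch only: reproduces the spirit of the four panels, not the original script
library(ggplot2)
library(gridExtra)

# Top left: everyone's raw data
p1 <- ggplot(dat, aes(x, y, group = id)) +
  geom_line(alpha = 0.2) +
  ggtitle("All subjects, raw")

# 300 random subjects with at least 5 time points
set.seed(1)  # arbitrary; just makes the sample reproducible
keep <- sample(unique(dat$id[dat$totobs >= 5]), 300)
sub  <- dat[dat$id %in% keep, ]

# Top right: raw data; bottom left: straight-line fits; bottom right: loess curves
p2 <- ggplot(sub, aes(x, y, group = id)) +
  geom_line(alpha = 0.3) +
  ggtitle("300 subjects, raw")
p3 <- ggplot(sub, aes(x, y, group = id)) +
  geom_smooth(method = "lm", se = FALSE, size = 0.3) +
  ggtitle("300 subjects, linear fits")
p4 <- ggplot(sub, aes(x, y, group = id)) +
  geom_smooth(method = "loess", se = FALSE, size = 0.3) +
  ggtitle("300 subjects, loess curves")

grid.arrange(p1, p2, p3, p4, ncol = 2)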

Segue

I just wanted to quickly show (visually) how the loess smoothing algorithm works. Here are 3 subjects with a reasonable amount of data. The figure shows their raw (jagged) data, their smoothed curves, and the two combined, highlighting how the trend is accurately portrayed while the noise is removed, allowing the eye to follow the path of the trend with ease. A sketch of code for this follows the figure.

[Figure 2: three subjects’ raw data, their loess curves, and the two combined]
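Again, the original code isn’t shown, but a minimal sketch under the same assumptions about dat (and with ggplot2 and gridExtra loaded as above) might look like this; picking the three most-observed subjects is my own choice:

# Pick the 3 subjects with the most observations (any 3 data-rich ids would do)
ids3 <- names(sort(table(dat$id), decreasing = TRUE))[1:3]
sub3 <- dat[dat$id %in% ids3, ]

raw  <- ggplot(sub3, aes(x, y, colour = id)) +
  geom_line() + ggtitle("Raw")
sm   <- ggplot(sub3, aes(x, y, colour = id)) +
  geom_smooth(method = "loess", se = FALSE) + ggtitle("Loess")
both <- ggplot(sub3, aes(x, y, colour = id)) +
  geom_line(alpha = 0.4) +
  geom_smooth(method = "loess", se = FALSE) + ggtitle("Combined")

grid.arrange(raw, sm, both, ncol = 3)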

Back on track

You can see all the input options for the lcmm function by looking at the PDF manual or using ?lcmm after loading the library. The user is able to specify how many latent classes (unobserved groups) they would like the algorithm to find. I used the following code to find 2 in my data* (dummy below is the reduced subset described in the footnote).

d2 <- lcmm(y ~ x, random = ~x, subject = 'id', mixture = ~x, ng = 2,
           idiag = TRUE, data = dummy, link = "linear")
summary(d2)

Here I am modeling the dependent y with the independent x, allowing x to vary per person (random slope) and using (linear) x as the variable in the mixture term. Below I will look at more appropriate models; I’m keeping it simple for now. In the output (below) we see the results from the main section of the summary. The ‘x 1’ and ‘x 2’ lines highlight that each class has a different slope, one slightly positive and one slightly negative, with (see ‘intercept 2’) quite different intercepts.

Fixed effects in the class-membership model:
(the class of reference is the last class)

                  coef        Se     Wald     p-value
intercept    0.4665299 0.1712701 2.723943 0.006450762

Fixed effects in the longitudinal model:

                       coef           Se       Wald     p-value
intercept 1 (not estimated)                              0.000000000
intercept 2   -4.476219941 0.2091687393 -21.400043 0.000000000
x 1           -0.002339034 0.0008328176  -2.808579 0.004976061
x 2            0.023911923 0.0011908951  20.078950 0.000000000

Goodness-of-fit statistics are also provided. Of interest are the posterior probabilities of group membership: each subject is assigned a probability of belonging to each (of the 2) class(es).

postprob(d2)
Posterior classification:
class1 class2
N 152.00000 93.00000
% 62.04082 37.95918

Posterior classification table:
--> mean of posterior probabilities in each class
prob1 prob2
class1 0.9330 0.0665
class2 0.0934 0.9070

We see one class has 62% of subjects and the other has 38%. Of those in class 1, the mean posterior probability (PP) of belonging to class 1 was 0.93, while their mean PP of belonging to class 2 was 0.07. Similarly, of those assigned to class 2, the mean PP of belonging to class 2 was 0.91 and of belonging to class 1 was 0.09. We can look further at this with the code below. The minimums are close to 0.5 (somewhat expected), so we know that some subjects’ trajectories could have really been in either class. I take solace in both lower quartiles being above 0.80.

> round(summary(as.numeric(d2$pprob[d2$pprob[,"class"]==1,"prob1"])),2)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.51 0.94 1.00 0.93 1.00 1.00

> round(summary(as.numeric(d2$pprob[d2$pprob[,"class"]==2,"prob2"])),2)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.54 0.83 0.99 0.91 1.00 1.00

Given we are dealing with trends and trajectories, we might like to visualise what this means for our data. Using the code below, we can pull out which class subjects were assigned to, and then plot their data, colouring based on class membership.

# Note: the original used ggplot2's long-deprecated opts() and grid.arrange(main=);
# labs()/theme() and grid.arrange(top=) are the modern equivalents.
dummy$id <- as.character(dummy$id)
people2  <- as.data.frame(d2$pprob[, 1:2])
dummy$group2 <- factor(people2$class[sapply(dummy$id, function(x) which(people2$id == x))])

# Raw trajectories, coloured by assigned class, with a thick loess per class
p1 <- ggplot(dummy, aes(x, y, group = id, colour = group2)) +
  geom_line() +
  geom_smooth(aes(group = group2), method = "loess", size = 2, se = FALSE) +
  scale_y_continuous(limits = c(13, 37)) +
  labs(x = "x", y = "y", colour = "Latent Class", title = "Raw")

# Per-subject smooths, again with a thick loess per class
p2 <- ggplot(dummy, aes(x, y, group = id, colour = group2)) +
  geom_smooth(aes(group = id, colour = group2), size = 0.5, se = FALSE) +
  geom_smooth(aes(group = group2), method = "loess", size = 2, se = TRUE) +
  scale_y_continuous(limits = c(13, 37)) +
  labs(x = "x", y = "y", colour = "Latent Class", title = "Smoothed") +
  theme(legend.position = "none")

grid.arrange(p1, p2, ncol = 2, top = "2 Latent Class")

This produces the following graph.

[Figure 3: raw (left) and smoothed (right) trajectories coloured by latent class, 2-class model]

Here we see what we wanted to see. For starters, the thick lines (smoothing across all data within a class) intersect, suggesting we are seeing a real effect of classes. I would not be too interested, myself, in seeing two parallel lines, as that would suggest everyone has the same ‘kind’ of trajectory, and I would suspect the classes were purely an artifact of the algorithm having to ‘go out and find two classes’. That said, we knew from the coefficients that we would see something like this. What I particularly like is that we can see a group (class), class 2, who clearly rocket up at the end: when x hits 200, the vast majority of the subjects in class 2 rapidly increase, whereas class 1 generally remain consistent (in y).

If we try 3 classes:

[Figure 4: trajectories coloured by latent class, 3-class model]

And 4 classes?

[Figure 5: trajectories coloured by latent class, 4-class model]

As far as choosing how many classes is appropriate, the generally accepted method is to use the BIC (as opposed to the AIC), with the lower BIC indicating the preferred model. With this data we get a BIC of 32941 for 4 classes, 32972 for 3 classes and 33084 for 2 classes. This suggests 4 classes offers a slightly better fit than 3. Truth be told, 5 classes achieves a slightly lower BIC again. However, like all modeling, you have to have some human (or really, really smart robot circa 2035) input. With 4 classes, the 4th class holds only 10% of subjects, which in our reduced modeling subset of the data is 25 people. You have to decide whether this is appropriate for your data or not. Here I certainly wouldn’t go any lower than 10%. A sketch of scripting this comparison follows.
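For completeness, here is a sketch of how the ng = 2..5 comparison could be scripted. Two caveats are mine, not the post’s: I’m relying on a fitted lcmm object storing its BIC in $BIC, and on recent lcmm versions wanting initial values (the B argument, or a gridsearch() call) whenever ng > 1, which a one-class fit can supply; the original 2011 code did without.

# One-class fit (no mixture term is allowed when ng = 1); also supplies initial values
m1 <- lcmm(y ~ x, random = ~x, subject = 'id', idiag = TRUE,
           data = dummy, link = "linear")

# Fit 2 to 5 classes and collect the BICs
fits <- lapply(2:5, function(g)
  lcmm(y ~ x, random = ~x, subject = 'id', mixture = ~x, ng = g,
       idiag = TRUE, data = dummy, link = "linear", B = m1))
data.frame(ng = 2:5, BIC = sapply(fits, function(m) m$BIC))

# Class sizes, to check none gets uncomfortably small
postprob(fits[[3]])  # the 4-class model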

Refining the model

As I mentioned earlier, a linear effect may not be the best way to model this data. Here is my preferred number of classes, 3, with a quadratic term introduced. This takes substantially longer to complete (converge). Compare the graph below to the corresponding 3-class graph above; there are some subtle differences.

[Figure 6: trajectories coloured by latent class, 3-class model with quadratic term]
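The exact call for this fit isn’t shown in the post; a plausible specification (my guess, using I(x^2), which the precomputed x2 column could equally stand in for) is:

# 3 classes with linear + quadratic fixed and mixture effects.
# Whether the quadratic also belongs in random = ~ is a modelling choice;
# here the random part is kept linear, as in the earlier model.
d32 <- lcmm(y ~ x + I(x^2), mixture = ~x + I(x^2), random = ~x,
            subject = 'id', ng = 3, idiag = TRUE,
            data = dummy, link = "linear")
summary(d32)
postprob(d32)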

As far as posterior probabilities and percentage of class membership are concerned, the model with only the linear term and the model that also has the quadratic term differ quite a bit. It looks like the quadratic model has a tougher time determining which class each subject falls into. See below.

Just linear term

Posterior classification:
class1 class2 class3
N 139.00000 59.00000 47.00000
% 56.73469 24.08163 19.18367

Posterior classification table:
--> mean of posterior probabilities in each class
prob1 prob2 prob3
class1 0.8800 0.08060 0.039600
class2 0.1390 0.86000 0.000981
class3 0.0887 0.00293 0.908000

> summary(as.numeric(d3$pprob[d3$pprob[,"class"]==1,"prob1"]))
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.5024 0.8210 0.9402 0.8798 0.9931 1.0000

> summary(as.numeric(d3$pprob[d3$pprob[,"class"]==2,"prob2"]))
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.5136 0.7642 0.9549 0.8597 0.9972 1.0000

> summary(as.numeric(d3$pprob[d3$pprob[,"class"]==3,"prob3"]))
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.4962 0.8986 0.9898 0.9083 0.9994 1.0000

Quadratic term included

Posterior classification:
class1 class2 class3
N 110.00000 44.00000 91.00000
% 44.89796 17.95918 37.14286

Posterior classification table:
--> mean of posterior probabilities in each class
prob1 prob2 prob3
class1 0.7990 0.0237 0.1770
class2 0.0465 0.8670 0.0862
class3 0.1630 0.0482 0.7890

> summary(as.numeric(d32$pprob[d32$pprob[,"class"]==1,"prob1"]))
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.3598 0.6565 0.8524 0.7992 0.9688 0.9999

> summary(as.numeric(d32$pprob[d32$pprob[,"class"]==2,"prob2"]))
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.4284 0.7217 0.9847 0.8673 0.9997 1.0000

> summary(as.numeric(d32$pprob[d32$pprob[,"class"]==3,"prob3"]))
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.4159 0.6348 0.7822 0.7890 0.9618 1.0000
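A further check, my addition rather than the post’s, is to cross-tabulate the hard class assignments from the two models to see how many subjects change class when the quadratic term is added. Bear in mind that class labels are arbitrary between fits, so read the table as a measure of agreement rather than matching by label:

# Rows: class under the linear-only model (d3); columns: under the quadratic (d32).
# pprob holds the assigned class alongside the posterior probabilities.
table(linear = d3$pprob[, "class"], quadratic = d32$pprob[, "class"])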

However, the BIC actually reduces from 32972 to 32594, which is a much bigger difference than we saw by changing the number of classes. In this instance I would say a) further model exploration is required and b) if I had to, I would choose the model based on the linear term only, giving preference to the better posterior probability performance over the BIC performance.

Hopefully this has been useful as an introduction to latent class modeling, and/or an introduction to the lcmm package, and/or to plotting and visualising longitudinal latent class mixture models.

Final note

I have been in contact with the author of this package, who aims to enhance some of its features early next year.

*Prior to running the analysis, I limited the data used from here on to a subset based on a few criteria, mainly to do with year and initial x value (x0). This was purely to eliminate known calendar and other effects affecting the efficiency of the algorithm. I can provide more information on this if contacted by anyone interested.

**********************************************************************************

Here is the script used in this example. I haven’t included the loading of the data, as it was a data set from a project; I have subsequently applied a coefficient to the x and y variables so it is not obvious what they are, for obvious reasons. NOTE: I had to switch between the id column being a factor and a character depending on which function I was running at the time. For lcmm() it needed to be a factor, or the postprob() function would give warnings (not partitioning the data correctly), but it needed to be a character for the sapply() function to work properly. It also needed to be a character for colour=id in ggplot, to avoid the legend ‘blowing out’ with 2000-odd levels available (despite only 3 being plotted). I’m sure there was a better way to do this (probably rewriting the sapply(); I just ran out of time). Got a question on this? Just ask. One tidier alternative is sketched below.
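For instance, match() works directly on character vectors and sidesteps the per-element which() (my suggestion, untested against the original data):

# match() returns, for each element of dummy$id, its row in people2,
# so the class lookup becomes a single vectorised indexing step
people2 <- as.data.frame(d2$pprob[, 1:2])
dummy$group2 <- factor(people2$class[match(as.character(dummy$id),
                                           as.character(people2$id))])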

Data used is available here. Use `load("2011-10-04_lcmm_post_mc.rdata")`.


