Introduction

In this document we present a simple workflow using the ProfileGLMM pacakge applied to an Exposure inspired dataset.

The (simulated) dataset can be loaded directly from the package:

library(ProfileGLMM)
data("exposure_data")
exp_data = exposure_data$df
head(exp_data)
#>             X          t indiv        Exp1       Exp2         Y
#> 1 -0.89691455 0.77952297     1 -0.04384600  1.2588400 2.6520873
#> 2  0.18484918 1.70443803     2 -0.66430177 -1.2629737 0.1597814
#> 3  1.58784533 2.22057485     3 -1.40052637 -0.1769742 5.3494451
#> 4 -1.13037567 0.08683717     4 -1.36745207  0.7744430 2.5207198
#> 5 -0.08025176 1.09472696     5  1.19371755  1.4146036 6.6933655
#> 6  0.13242028 2.02427989     6  0.06508781  0.3339544 0.2785757

This dataset comprises 1500 individuals each observed at three different dates (4500 observations). For each of these observations we observe an individual characteristic X, two exposures Exp1 and Exp2 and an health outcome Y. The goal of this tutorial is to show how the ProfileGLMM package can perform a profile regression using a linear mixed model as an output. In other words the goal, here, is to identify clusters of exposures affecting the outcome and its relationship with the individual characteristics. The motivations for the analysis of such dataset can be found in this paper.

Model specifications

The first step to perform profile regression is to preprocess the data:

covList = {}
covList$FE = c('X')

covList$RE = c('t')
covList$REunit = c('indiv')

covList$Lat = c('X')

covList$Assign$Cont = c('Exp1','Exp2')
covList$Assign$Cat = NULL

covList$Y = c('Y')
dataProfile = profileGLMM_preprocess(regType = 'linear',
                                     covList = covList,
                                     dataframe = exp_data,
                                     nC = 30,
                                     intercept = list(FE = T, RE = F, Lat = T))
print(dataProfile)
#> --- pglmm Data summary ---
#> - Number of observations :  4500 
#> 
#> -- Clustering model summary --
#> - Continuous clustering variables : ' Exp1 Exp2 '
#> 
#> -- Outcome model summary --
#> - Model type:  linear mixed model
#> - Outcome  : ' Y ' 
#> - Fixed effects  : ' Intercept X '
#> - ' indiv ' level random effects  : ' t '
#> - Latent clusters interacting with: ' Intercept X '

The structure of the model is defined by the covList entry. The first set of variables define the outcome mixed model structure: the FE field contains the fixed effects covariates, the RE field the random effects covariates and Reunit its associated statistical unit. To complete the outcome model Lat indicates which covariates will interact with the latent clusters. The covariates to be clustered are indicated in the Assign field, with Cont and Cat fields respectively containing the continuous and categorical clustering covariates.

To complete the outcome model definition, the regType entry indicates that we want our outcome model to be a linear mixed model (as the output is continuous), and the intercept field which parts of the model require an intercept.

Note that the profileGLMM_preprocess function initialises the parameters of the priors to default uninformative values that can be found here. Changing the default parameters can be easily changed on site.

Posterior sampling of the parameters and postprocessing of the MCMC chain

Once the model is set up the we can sample parameters from the posterior:

MCMC_Obj = profileGLMM_Gibbs(model = dataProfile,
                             nIt = 5000,
                             nBurnIn = 2000)
#> Iteration: 0
#> Iteration: 1000
#> Iteration: 2000
#> Iteration: 3000
#> Iteration: 4000
#> Iteration: 5000

MCMC_Obj returns all the nIt - nBurnIn = 3000 MCMC samples. To verify the convergence of the MCMC chain it is a good practice to plot the traces of population constant parameters such as \(\zeta\) or \(\beta\):

par(mfrow = c(1,2), col.sub="blue", cex.sub=2)
plot(MCMC_Obj$zeta,ylab = 'zeta',xlab = 'samples')
plot(MCMC_Obj$beta[1,],ylab = 'beta_1',xlab = 'samples')

These raw samples are then passed to the profileGLMM_postProcess function that estimates the population level parameters and their uncertainty as well as the characteristics of the representative clustering.

post_Obj = profileGLMM_postProcess(MCMC_Obj)
#> egap optimal K: 9
#> Ng spectral clustering.
#> Gaussian Mixture Modelling clustering.
#Fixed effects estimates
summary(post_Obj)
#>         Length   Class     Mode
#> coocMat 20250000 dsCMatrix S4  
#> clust         10 -none-    list
#> pop            1 -none-    list

One can also plot a representation of the representative clustering in the exposure space:

plot(post_Obj,color = hcl.colors(9, palette = "Set 2"))

One can also predict cluster memberships of new points in a pre-processed format. Here we predict cluster membership on the training data and and, given that we have access to the true labels, we compare the resulting classification with the true one:

pred = predict(post_Obj,dataProfile$d) 
table(pred$classPred,exposure_data$theta0$Lat)
#>    
#>     -11 -10  -9  -1   0   1   9  10  11
#>   1   0   1  15   0  31 415   0   2  19
#>   2  20   0   0 371  35   0  24   4   0
#>   3  22 448  16   1  30   1   0   0   0
#>   4   0  23 476   0   1  40   0   0   0
#>   5   0   0   0   0   0  21   0  23 483
#>   6   0  12   0  10 423  17   0  17   0
#>   7 456  14   0  26   2   0   0   0   0
#>   8   0   0   0   1  20   2  25 407  21
#>   9   0   0   0  35   2   0 457  31   0

Introduction to ProfileGLMM

Introduction

Model specifications

Posterior sampling of the parameters and postprocessing of the MCMC chain