Documentation for wfmm

 

This document describes the basics of how to use the code provided for implementing the Bayesian wavelet-based functional mixed models methodology introduced in Morris and Carroll (2004).  The code implements the Markov Chain Monte Carlo (MCMC) procedure described in Section 5 of that paper and outputs posterior samples for many model quantities.

 

Morris, J. S. and Carroll, R. J. (2004). Wavelet-based functional mixed models. Technical Report UTMDABTR-006-04, Department of Biostatistics and Applied Mathematics, The University of Texas MD Anderson Cancer Center.

 

We apologize that this initial documentation is incomplete.  More extensive documentation will be provided in the near future, along with scripts for performing some of the most commonly used Bayesian estimation, inference, and prediction procedures using the MCMC output from this program.  We also plan to develop an R interface for this program, which we will post when it is ready.

 

Sample program call (from a DOS window):

wfmm c:\work\input.mat output.mat > log_file.log
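
If preferred, the same call can also be issued from within a Matlab session.  The line below is a sketch that assumes the wfmm executable is on the system path or in the current directory and uses the same illustrative file names as above:

system('wfmm c:\work\input.mat output.mat > log_file.log');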

 

Input:  The program requires an input file (“input.mat” in the sample call) that is a Matlab file containing the following objects (a minimal Matlab construction sketch follows the list):

 

1. Y: N-by-T matrix, each row containing one of the observed functions on an equally spaced grid of length T

2. model: Matlab structure giving the details of the model.  The following elements must be included in the structure:

   a. X: N-by-p matrix containing the desired covariates for the p fixed effect functions in the model.

   b. Hstar: integer indicating the number of levels of random effect functions

   c. Z{h}, h=1, …, H: N-by-m_h matrices containing the covariates for each set of random effect functions.

   d. The following three variables should be included exactly as given here:

      i. covq=2

      ii. covs=2

      iii. covT=1

These are artifacts of some exploration of different ways of doing the modeling; they will be removed from the code soon, but for now they must be included.

3. wavespecs: Matlab structure describing the wavelet settings to be used.  The following elements must be included:

   a. wavelet: text string indicating the wavelet basis to use.  Abbreviations for wavelet bases in the Matlab wavelet toolbox are used (e.g. ‘db4’).  A list of available bases will be provided in future documentation.

   b. nlevels: integer indicating the number of levels of decomposition

   c. boundary: boundary correction method assumed (‘reflection’ is the default)

4. MCMCspecs: Matlab structure describing the details of the MCMC.  The following elements must be included:

   a. B = number of MCMC samples to obtain

   b. burnin = burn-in length

   c. thin = thinning parameter; e.g. if 10, then every 10th MCMC sample is kept

   d. propvarTheta = multiple of var(MLE) to use as the proposal variance for the variance components in step 2 of the MCMC; 1.5 by default

   e. nj_nosmooth = number of lowest-frequency wavelet levels for which a vague prior (no smoothing) is desired

The following parameters are simply for numerical stability:

   f. minp = minimum value for any p_ij; 10^-14 by default

   g. minT = minimum value for T_ij; 1 by default

   h. bigT = value to use for T_ij when a vague prior (no smoothing) is desired

   i. maxO = maximum odds ratio; 10^20 by default (prevents overflow)

   j. minVC = minimum value of a variance component; 10^-6 by default (prevents instability from variance components wandering near zero)

   k. VC0_thresh = minimum size for an important variance component (= minVC by default)

   l. delta_theta = multiple for the prior on theta: the “number of datasets of information” in the prior (see the discussion in Morris et al., 2003, JASA); 10^-4 by default

   m. thetaMOM_maxiter = maximum number of iterations when finding MOM starting values for the variance components; 100 by default

   n. thetaMOM_convcrit = convergence criterion for the iterative procedure for finding MOM starting values for the variance components; 10^-3 by default
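
For illustration, an input file of this form could be built in Matlab roughly as follows.  The dimensions, design matrices, and numeric settings below (including the ‘db4’ basis, the nj_nosmooth value, and the value used for bigT, for which no default is given above) are purely illustrative; only the object and field names follow the descriptions above.

N = 60; T = 256;                            % illustrative dimensions
Y = randn(N, T);                            % replace with the N-by-T matrix of observed functions

model.X     = [ones(N,1) randn(N,1)];       % N-by-p fixed effect design matrix (p = 2 here)
model.Hstar = 1;                            % one level of random effect functions
model.Z{1}  = kron(eye(20), ones(3,1));     % N-by-m1 random effect design matrix (m1 = 20)
model.covq  = 2;                            % include these three exactly as given above
model.covs  = 2;
model.covT  = 1;

wavespecs.wavelet  = 'db4';                 % wavelet basis (Matlab toolbox abbreviation; illustrative choice)
wavespecs.nlevels  = 5;                     % number of levels of decomposition
wavespecs.boundary = 'reflection';          % boundary correction method

MCMCspecs.B            = 1000;              % number of MCMC samples to obtain
MCMCspecs.burnin       = 1000;              % burn-in length
MCMCspecs.thin         = 10;                % keep every 10th sample
MCMCspecs.propvarTheta = 1.5;               % default from this document
MCMCspecs.nj_nosmooth  = 1;                 % lowest-frequency levels given a vague prior (illustrative)
MCMCspecs.minp         = 1e-14;             % numerical-stability defaults listed above
MCMCspecs.minT         = 1;
MCMCspecs.bigT         = 1000;              % no default stated above; illustrative value
MCMCspecs.maxO         = 1e20;
MCMCspecs.minVC        = 1e-6;
MCMCspecs.VC0_thresh   = MCMCspecs.minVC;
MCMCspecs.delta_theta  = 1e-4;
MCMCspecs.thetaMOM_maxiter  = 100;
MCMCspecs.thetaMOM_convcrit = 1e-3;

save('c:\work\input.mat', 'Y', 'model', 'wavespecs', 'MCMCspecs');

The program is then run on the resulting file as in the sample call above.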

 

Output:

 

The output of the program is a Matlab data file (“output.mat” in the sample call) containing the Matlab objects listed below; a log file is also produced (via the redirection to “log_file.log” in the sample call).  A short Matlab post-processing sketch follows the list.

 

· Y, model, wavespecs, MCMCspecs: as input

· D: matrix containing the wavelet coefficients for the observed data

· PI: matrix containing the p_ij estimated by the empirical Bayes procedure described in Section 4.4 of the paper.

· Tau: matrix containing the T_ij estimated by the empirical Bayes procedure described in Section 4.5 of Morris and Carroll (2004).

· theta_MOM: matrix containing method-of-moments starting values for the wavelet-space variance components q_jk and s_jk in model (3).

· theta_MLE: matrix containing profile maximum likelihood starting values for the wavelet-space variance components.

· se_Theta: estimate of the variance of theta_MLE, used for the automatic proposal variances in the Metropolis-Hastings procedure described in step 2 of Section 5 of Morris and Carroll (2004).

· betans: matrix containing non-shrunken estimates of the wavelet coefficients for the fixed effects, conditional on the starting values of the variance components, given by equation (6) in Morris and Carroll (2004).

· Vbetans: matrix containing the variances of these wavelet-space estimates, given by equation (7) in Morris and Carroll (2004).

· alpha: matrix containing starting values for the shrinkage weights for the wavelet coefficients of the fixed effect functions, which are their posterior probabilities of being “nonzero”, conditional on theta_MLE for the variance components.

· betahat: betans*alpha; shrinkage starting values for the betas.

· prior_Theta_a, prior_Theta_b: matrices containing the prior hyperparameters for the inverse gamma distributions on the wavelet-space variance components

· Wv: structure containing, for each wavelet coefficient, the following statistics, using starting values of the variance components for S_jk:

  o XvX = X' (S_jk)^-1 X

  o XvZ = X' (S_jk)^-1 Z

  o XvD = X' (S_jk)^-1 D

  o ZvZ = Z' (S_jk)^-1 Z

  o ZvD = Z' (S_jk)^-1 D

  o dvd = diag(D' (S_jk)^-1 D)

  o L1 = det(S_jk)

  o L2 = (d_jk - X B_jk)' (S_jk)^-1 (d_jk - X B_jk)

· MCMC_wbeta: matrix containing MCMC posterior samples for the wavelet coefficients of the fixed effects.  Each row contains all B*_ijk, i=1, …, p, j=1, …, J, k=1, …, K_j, for one iteration of the MCMC.

· MCMC_beta: matrix containing MCMC posterior samples for the data-space fixed effect functions.  Each row contains B_ij, i=1, …, p, j=1, …, T, for one iteration of the MCMC.

· MCMC_theta: matrix containing MCMC posterior samples for the variance components in wavelet space.  Each row contains {q_hjk, h=1, …, H, s_jk}, in that order, for j=1, …, J, k=1, …, K_j.

· Paccept_newtheta: vector containing the Metropolis-Hastings acceptance probabilities for the set of variance components for each wavelet coefficient, indexed by scale j and location k.

· ghat: p-by-T matrix containing the posterior mean for each fixed effect function.

· Q05_ghat: p-by-T matrix containing the pointwise 0.05 quantile for each fixed effect function, which is the lower bound of the 90% posterior credible interval.

· Q95_ghat: p-by-T matrix containing the pointwise 0.95 quantile for each fixed effect function, which is the upper bound of the 90% posterior credible interval.
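
For illustration, the following Matlab sketch loads the output file and plots each fixed effect function with its pointwise 90% credible band.  The object names are those listed above; the grid, figure layout, and the assumed within-row ordering of MCMC_beta are illustrative and should be checked against your own output.

load('output.mat');                         % loads ghat, Q05_ghat, Q95_ghat, MCMC_beta, etc.

[p, T] = size(ghat);
t = 1:T;                                    % replace with the actual sampling grid if known

figure;
for i = 1:p
    subplot(p, 1, i);
    plot(t, ghat(i,:), 'b-', t, Q05_ghat(i,:), 'r--', t, Q95_ghat(i,:), 'r--');
    title(sprintf('Fixed effect function %d: posterior mean and pointwise 90%% band', i));
end

% Other posterior summaries can be computed directly from the MCMC samples; for
% example, the pointwise posterior means of the fixed effect functions can be
% recovered from MCMC_beta (each row holds B_ij, i=1..p, j=1..T, for one iteration):
ghat_check = reshape(mean(MCMC_beta, 1), T, p)';   % assumes j varies fastest within a row; compare against ghat to confirm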

 

Comments:

 

· This code has been compiled and used on Microsoft Windows systems.  In the near future, we will test it on Linux as well.

· The current interface assumes that you create the input files and post-process the output files in Matlab.  We will write an R wrapper for this program in the near future and post it on our web site when it is ready.

· The current version of the code assumes:

1. You want to estimate the shrinkage hyperparameters using the empirical Bayes method.

2. You want vague proper priors for the variance components, centered at the starting values with information equivalent to delta_theta observations.

3. The random effect functions are independent and identically distributed, so P=R=I.

Future versions of the code will allow users to specify the shrinkage hyperparameters and the priors for the variance components, and will also allow other covariance structures on the random effect functions.

· This code yields MCMC samples for the quantities in the wavelet-space model, (3) in Morris and Carroll (2004), plus MCMC samples for the fixed effect functions B in the data-space model (2).

· MCMC samples of the Q_h and S_i matrices can be obtained by applying the 2-D IDWT to the corresponding diagonal wavelet-space matrices (a rough sketch is given at the end of this section).  They are not generated by default because they are frequently not of direct interest, and their large size can cause memory problems for large data sets.  We are currently writing code to compute and store the diagonal elements of these matrices more efficiently; these diagonal elements are useful for certain purposes that we will discuss in future publications.

· MCMC samples of the random effect functions and their wavelet coefficients are not computed by default, again to save memory and computing resources.  We have Matlab scripts to compute these, which we are in the process of converting to C and will post once they are complete.

· Our scripts for post-processing the posterior samples from the MCMC to perform various types of Bayesian inference are written in Matlab.  We are in the process of converting some of them to C, at which time they will be posted along with their documentation.

· It is possible to parallelize the MCMC step if parallel processing is available; a parallel-processing version of the code will be posted in due time.

· For extremely large data sets, users may need to compute the memory requirements of fitting their data and compare them with the capabilities of their machine.  Again, future documentation will give guidelines here.

· There are numerous extensions and improvements in the works for this method; stay tuned for updated code.
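
As noted above, the diagonal wavelet-space covariance matrices can be mapped back to the data space with the 2-D IDWT.  The rough Matlab sketch below (not part of the distributed code; it requires the Matlab wavelet toolbox) reconstructs one posterior draw of Q_h for h=1.  It assumes Y, model, wavespecs, and MCMC_theta have been loaded from output.mat, that each row of MCMC_theta stores H+1 values per wavelet coefficient in the order described in the Output section, and that dwtmode('sym') is an adequate stand-in for the ‘reflection’ boundary setting; the choice of iteration b and the explicit construction of the DWT matrix W (feasible only for moderate T) are for illustration only.

T = size(Y, 2);                              % length of the sampling grid
H = model.Hstar;                             % number of random effect levels
b = 1;                                       % which MCMC iteration to reconstruct (illustrative)

theta = reshape(MCMC_theta(b, :), H + 1, []);   % (H+1)-by-K: rows 1..H hold q_hjk, row H+1 holds s_jk (assumed layout)
q = theta(1, :)';                               % K-by-1 wavelet-space variances for h = 1

dwtmode('sym');                              % symmetric extension, used here to approximate 'reflection'
c0 = wavedec(zeros(T, 1), wavespecs.nlevels, wavespecs.wavelet);
K = numel(c0);
if K ~= numel(q)
    error('Wavelet settings do not reproduce the coefficient layout assumed here.');
end

W = zeros(K, T);                             % build the K-by-T DWT matrix column by column
for j = 1:T
    e = zeros(T, 1); e(j) = 1;
    c = wavedec(e, wavespecs.nlevels, wavespecs.wavelet);
    W(:, j) = c(:);
end

Qh = W' * diag(q) * W;                       % T-by-T data-space covariance: the 2-D IDWT of diag(q)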