# Seminar - Updating established prediction models with new biomarkers

### School of Mathematics and Statistics Research Seminar

**Speaker:** Professor Jeremy M G Taylor, Department of Biostatistics, University of Michigan.

**Time:**
Monday 23rd April 2018 at 01:00 PM - 02:00 PM

**Location:**
Cotton Club, Cotton 350

**URL:** https://sph.umich.edu/faculty-profiles/taylor-jeremy.html

**Groups:**
"Mathematics", "Statistics and Operations Research"

## Abstract

We consider the situation where there is a known, established regression model that can be used to predict an important outcome, Y, from a set of commonly available predictor variables X. There are many examples of this in the medical and epidemiologic literature. A new variable B is thought to be important and to enhance the prediction of Y. A modest-sized dataset of size n containing Y, X and B is available, and the challenge is to build a good model for [Y|X,B] that uses both the available dataset and the known model for [Y|X]. Proposals in the literature to achieve this include Bayesian approaches and constrained and empirical likelihood-based methods (Grill et al 2015 J Clin Epi, Chatterjee et al 2016 JASA, Cheng et al 2018 Stat in Med). The constrained approach is to maximize the likelihood for [Y|X,B] subject to the constraints on the parameters implied by the known model for [Y|X]. We compare these approaches and illustrate them on a prostate cancer dataset.

We also propose a synthetic data approach. The approach consists of creating m additional synthetic data observations, and then analyzing the combined dataset of size n+m to estimate the parameters of the model [Y|X,B]. The synthetic data are created by replicating X and then generating a synthetic value of Y from the known [Y|X] distribution. This combined dataset has missing values of B for m of the observations, and is analyzed using methods that can handle missing data. One such analysis approach is multiple imputation; in special cases, exact methods can be used. In the special cases where [Y,X,B] is trivariate normal or where all of Y, X and B are binary, we show that the synthetic data approach with very large m gives the same asymptotic variance for the parameters of the [Y|X,B] model as the constrained maximum likelihood estimation approach. This provides some theoretical justification for the synthetic data approach and, given its broad applicability, makes the approach very appealing.
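The synthetic data steps described in the abstract can be illustrated with a small sketch. This is not the speakers' implementation. It assumes scalar X and B, a trivariate normal [Y,X,B], and a known linear model for [Y|X] with a known residual standard deviation. The function name `synthetic_data_fit`, its arguments, and all simulated values are hypothetical. Rather than multiple imputation, it uses one possible exact method for the normal case: the likelihood factors into [X,Y], estimated from all n+m rows, and [B|X,Y], estimated from the n complete rows.

```python
import numpy as np


def synthetic_data_fit(y, x, b, known_intercept, known_slope, known_sd, m, rng):
    """Sketch: estimate (g0, gx, gb) in E[Y | X, B] = g0 + gx*X + gb*B.

    Assumes trivariate normality and a known model
    Y | X ~ N(known_intercept + known_slope * X, known_sd**2).
    """
    n = len(y)

    # Step 1: replicate the observed X values cyclically (one reading of
    # "replicating X"), then draw synthetic Y from the known [Y|X] model.
    x_syn = np.resize(x, m)
    y_syn = known_intercept + known_slope * x_syn + known_sd * rng.standard_normal(m)

    # Step 2: combined dataset of size n + m; B is missing for synthetic rows.
    w_all = np.column_stack([np.concatenate([x, x_syn]),
                             np.concatenate([y, y_syn])])

    # Step 3a: moments of (X, Y) estimated from all n + m rows.
    mu_w = w_all.mean(axis=0)
    sigma_w = np.cov(w_all, rowvar=False, bias=True)

    # Step 3b: regression of B on (X, Y) using only the n complete rows.
    design = np.column_stack([np.ones(n), x, y])
    coef, *_ = np.linalg.lstsq(design, b, rcond=None)
    c0, c = coef[0], coef[1:]
    resid = b - design @ coef
    tau2 = resid @ resid / n

    # Step 4: assemble the joint moments of (X, Y, B) and convert them into
    # the regression of Y on (X, B).
    cov_wb = sigma_w @ c            # Cov((X, Y), B)
    var_b = c @ sigma_w @ c + tau2  # Var(B)
    mu_b = c0 + c @ mu_w            # E[B]
    s_pp = np.array([[sigma_w[0, 0], cov_wb[0]],
                     [cov_wb[0], var_b]])      # Cov of predictors (X, B)
    s_py = np.array([sigma_w[0, 1], cov_wb[1]])  # Cov of (X, B) with Y
    slopes = np.linalg.solve(s_pp, s_py)
    intercept = mu_w[1] - slopes @ np.array([mu_w[0], mu_b])
    return intercept, slopes[0], slopes[1]


if __name__ == "__main__":
    # Simulated illustration only; every number here is made up.
    rng = np.random.default_rng(0)
    n = 200
    x = rng.standard_normal(n)
    b = 0.5 * x + rng.standard_normal(n)
    y = 1.0 + 2.0 * x + 1.5 * b + rng.standard_normal(n)
    # Under this simulation, Y | X has mean 1 + 2.75*X and variance 3.25.
    est = synthetic_data_fit(y, x, b, 1.0, 2.75, np.sqrt(3.25), m=20000, rng=rng)
    print("estimated (g0, gx, gb):", est)
```

With a large m, the (X, Y) moments are driven almost entirely by the known [Y|X] model, while the small dataset contributes the information about B. For non-normal outcomes or higher-dimensional X and B, a general missing-data method such as multiple imputation would replace Steps 3 and 4.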

This is joint work with Wenting Cheng, Bhramar Mukherjee, Jason Estes and Tian Gu.