

Biostatistics & Bioinformatics offers a vibrant seminar series featuring researchers and statisticians from UC San Diego and other academic institutions. Seminars are typically held from 1:00-2:00 PM PST on the third Wednesday of each month in MTF 168. Check out the upcoming and past presentations below. 

Upcoming Seminars



Recent Seminars


"Surrogate Assisted Semi-Supervised Inference for High Dimensional Risk Prediction" 

Jue “Marquis” Hou, PhD
1:00 PM – 2:00 PM PST

BACKGROUND: Precise risk prediction is vitally important for clinical care. High-risk patients can be monitored more intensively, and more aggressive treatment options should be considered for those unlikely to respond to the standard therapy. Developing generalizable precision medicine strategies requires real world data (RWD). A valuable source of RWD is electronic health records (EHR), which accrue detailed clinical data in broad patient populations. Risk modeling with EHR data, however, is challenging due to a lack of direct observations on the disease outcome, Y, and the high dimensionality of the candidate predictors, X. In this paper, we develop a surrogate-assisted semi-supervised learning (SAS) approach to risk modeling with high dimensional predictors, leveraging a large number N of unlabeled observations on X and on surrogates S of Y, as well as a small number n of labeled observations on X, S, and Y. The SAS procedure borrows information from S along with X to impute the unobserved Y via a sparse working imputation model, with moment conditions to achieve robustness against misspecification in the imputation model and a one-step bias correction to enable interval estimation for the predicted risk. We demonstrate that the SAS procedure provides valid inference for the predicted risk derived from a high dimensional working model, P(Y = 1|X), even when the underlying population parameter is dense and the risk model is mis-specified. We present an extensive simulation study to demonstrate the superiority of our semi-supervised learning (SSL) approach compared to existing supervised methods. We apply the method to derive genetic risk prediction of type-2 diabetes mellitus using an EHR biobank cohort.
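The impute-then-fit idea in this abstract can be illustrated with a toy simulation. The sketch below is not the SAS procedure from the talk: it assumes a single predictor, replaces the sparse regularized fits with ordinary least squares on a linear working model, and omits the one-step bias correction. All names (`solve`, `ols`, `draw`, the simulation settings) are hypothetical. Because the imputation model includes X, its labeled-data residuals are uncorrelated with X, which is the kind of moment condition the abstract refers to.

```python
import math
import random

def solve(A, b):
    """Solve a small linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def ols(rows, y):
    """Least-squares coefficients for a design given as a list of feature rows."""
    p = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(XtX, Xty)

rng = random.Random(0)

def draw():
    """One simulated patient: predictor x, noisy surrogate s, binary outcome y (assumed setup)."""
    x = rng.gauss(0.0, 1.0)
    p = 1.0 / (1.0 + math.exp(-(-0.5 + x)))
    y = 1 if rng.random() < p else 0
    s = y + rng.gauss(0.0, 0.5)
    return x, s, y

labeled = [draw() for _ in range(300)]            # small n: X, S and Y observed
unlabeled = [draw()[:2] for _ in range(20000)]    # large N: only X and S observed

# Step 1: working imputation model for Y given (X, S), fitted on the labeled set.
gamma = ols([[1.0, x, s] for x, s, _ in labeled], [y for _, _, y in labeled])

# Step 2: impute the unobserved outcome for every unlabeled record.
y_imputed = [gamma[0] + gamma[1] * x + gamma[2] * s for x, s in unlabeled]

# Step 3: fit the working risk model of the imputed outcome on X alone.
beta_semi_supervised = ols([[1.0, x] for x, _ in unlabeled], y_imputed)

# Supervised baseline that uses only the labeled records.
beta_supervised = ols([[1.0, x] for x, _, _ in labeled], [y for _, _, y in labeled])

print("semi-supervised slope:", beta_semi_supervised[1])
print("supervised slope:     ", beta_supervised[1])
```

Both fits target the same working-model slope. The semi-supervised version draws on the 20,000 unlabeled records through the surrogate, whereas the baseline sees only the 300 labeled ones.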


"Proximal Causal Learning for Complex Longitudinal Studies" 

Andrew Ying, PhD
1:00 PM – 2:00 PM PST

BACKGROUND: A standard assumption for causal inference with longitudinal data is that at each follow-up time, one has measured a sufficiently rich set of covariates to ensure that within covariate strata, subjects are exchangeable across observed treatment values, also known as the “sequential randomization assumption (SRA)”. Skepticism about SRA in observational studies is often warranted because it hinges on investigators’ ability to accurately measure covariates over time that capture all potential sources of time-varying confounding. Realistically, confounding mechanisms can rarely, if ever, be learned with certainty from measured covariates. One can therefore only ever hope that covariate measurements are at best proxies of the true underlying confounding mechanisms operating in an observational study, thus invalidating causal claims made on the basis of SRA. In this paper, we extend the proximal causal inference framework of Tchetgen Tchetgen, Ying, et al. (2020) and Cui et al. (2020) to the longitudinal setting under a semiparametric marginal structural mean model (MSMM). The longitudinal proximal inference approach we propose offers an opportunity to learn about joint causal effects in settings where SRA on the basis of measured time-varying covariates fails, by formally accounting for the covariate measurements as imperfect proxies of underlying confounding mechanisms. We establish sufficient conditions for nonparametric identification with the aid of a pair of time-varying proxies when sequential randomization fails to hold due to unmeasured confounding. We provide a characterization of all regular and asymptotically linear estimators of the parameter indexing the MSMM, including a rich class of doubly robust estimators, and establish the corresponding semiparametric efficiency bound for the MSMM.
Our approach is illustrated via extensive simulation studies and a data application on potential protective effects of the anti-rheumatic therapy Methotrexate (MTX) among patients with rheumatoid arthritis.


"Learning survival from EMR/EHR data to estimate treatment effects using high dimensional claims codes" 

Lily (Ronghui) Xu, PhD
1:00 PM – 2:00 PM PST

BACKGROUND: Our work was motivated by analysis projects using the linked US SEER-Medicare database to study treatment effects in men aged 65 years or older who were diagnosed with prostate cancer. Such data sets contain up to 100,000 human subjects and over 20,000 claims codes. The data were not randomized with regard to the treatment of interest, for example, radical prostatectomy versus conservative treatment. Informed by previous instrumental variable (IV) analysis, we know that confounding most likely exists beyond the commonly captured clinical variables in the database, while the high dimensional claims codes have been shown to contain rich information about the patients’ survival. Hence we aim to incorporate the high dimensional claims codes into the estimation of the treatment effect. The orthogonal score method can be used for treatment effect estimation and inference despite the bias induced by regularization under the high dimensional hazards outcome model and the high dimensional treatment model. In addition, we show that with cross-fitting the approach has the rate double robustness property in high dimensions.
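The combination of an orthogonal score with cross-fitting can be sketched in a much simpler setting than the one in the talk. The example below assumes a partially linear model with a continuous outcome and a single confounder, in place of the hazards outcome model and the high dimensional claims codes. It also uses plain least squares for the nuisance fits where the abstract describes regularized estimators. The function names and simulation settings are hypothetical.

```python
import random

def fit_line(xs, ys):
    """Least-squares line through (xs, ys); returns (intercept, slope)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def cross_fit_theta(x, d, y, n_folds=5):
    """Cross-fitted estimate of the treatment coefficient via the partialling-out score.

    Nuisance models E[Y|X] and E[D|X] are fitted on the training folds only;
    residuals are then formed on the held-out fold, so no record is used both
    to fit a nuisance model and to evaluate the score.
    """
    n = len(x)
    num = 0.0
    den = 0.0
    for k in range(n_folds):
        held_out = [i for i in range(n) if i % n_folds == k]
        train = [i for i in range(n) if i % n_folds != k]
        a_y, b_y = fit_line([x[i] for i in train], [y[i] for i in train])
        a_d, b_d = fit_line([x[i] for i in train], [d[i] for i in train])
        for i in held_out:
            res_y = y[i] - (a_y + b_y * x[i])   # outcome residual
            res_d = d[i] - (a_d + b_d * x[i])   # treatment residual
            num += res_d * res_y
            den += res_d * res_d
    return num / den

# Simulated data with a confounder x that drives both treatment d and outcome y.
rng = random.Random(1)
true_theta = 2.0
x = [rng.gauss(0.0, 1.0) for _ in range(2000)]
d = [0.8 * xi + rng.gauss(0.0, 1.0) for xi in x]
y = [true_theta * di + 1.5 * xi + rng.gauss(0.0, 1.0) for di, xi in zip(d, x)]

theta_hat = cross_fit_theta(x, d, y)
_, naive_slope = fit_line(d, y)   # ignores the confounder entirely

print("cross-fitted orthogonal-score estimate:", theta_hat)
print("naive regression of y on d:            ", naive_slope)
```

In this simulation the naive slope absorbs part of the confounder's effect, while the residual-on-residual estimate stays close to the value used to generate the data.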


Past Seminars