Research statement

During my career, I have developed varied research interests in statistics, data science, and probability. I have made multiple contributions in several major areas: foundations of statistical inference and generalized fiducial inference; methodology and theory for data integration; applications of statistics to forensic science, psychology, genomics, engineering, and finance; and analytical probability.

My contributions and future plans are summarized by area below:

Generalized Fiducial Inference and related topics
(funded by NSF grants DMS 1007520, 0707037, 1512945, 1916115, and 2210388)

A large percentage of my current effort is devoted to studying the theoretical properties of generalized fiducial inference. R. A. Fisher's fiducial inference has been the subject of many discussions and controversies ever since he introduced the idea in the 1930s. The idea experienced a bumpy ride, to say the least, during its early years, and one can safely say that it eventually fell into disfavor among mainstream statisticians. However, it has recently made a resurgence under various labels such as generalized inference, confidence distributions, and Dempster-Shafer calculus and its derivatives. In these new guises, fiducial inference has proved to be a useful tool for deriving statistical procedures for problems where frequentist methods with good properties were previously unavailable.

The aim of my work is to revisit Fisher's fiducial idea from a fresh angle. I do not attempt to derive a new paradox-free theory of fiducial inference, as I do not believe this is possible. Instead, with minimal assumptions, I present a simple new fiducial recipe that can be applied to conduct statistical inference via the construction of generalized fiducial distributions. This recipe is designed to be easily implementable in various practical applications and can be applied regardless of the dimension of the parameter space, including nonparametric problems. I term the resulting inference generalized fiducial inference (GFI). A reader interested in learning more about generalized fiducial inference can consult our review article and short course slides.
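A minimal sketch of the recipe's core formulas, in the notation of the review article (regularity conditions and the limiting "switching" argument omitted): the data are expressed through a data generating equation

    Y = G(U, \theta),

where U has a fully known distribution not depending on \theta. Under smoothness assumptions, the generalized fiducial distribution then has the density

    r(\theta \mid y) = \frac{f(y \mid \theta)\, J(y, \theta)}{\int_{\Theta} f(y \mid \theta')\, J(y, \theta')\, d\theta'},
    \qquad
    J(y, \theta) = D\Big( \nabla_{\theta} G(u, \theta)\,\big|_{u = G^{-1}(y, \theta)} \Big),

where f is the likelihood and, for the L2 norm, D(A) = (\det A^\top A)^{1/2}.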

From the very beginning, our work has been motivated by important applications in areas such as pharmaceutical statistics and metrology. Jointly with several of my students and collaborators at other institutions, we have applied the generalized fiducial methodology to important applied problems with great success. In addition to methodological research, I have also analyzed mathematical properties of generalized fiducial distributions, proving that they often give rise to statistical procedures with good asymptotic properties. Fiducial-based statistical procedures are also very competitive for small samples, as shown by mounting evidence from several simulation studies, giving practitioners an exciting new data analysis tool.

In particular, my contributions naturally cluster into four areas. The first is the definition and theoretical properties of generalized fiducial inference. My initial significant contribution was a 2006 paper that connected fiducial inference to the then-new field of generalized inference, sparking a number of subsequent publications. The current formal definition of generalized fiducial inference can be found in a 2016 review article that contains many useful results and formulas. In a series of publications, my students and I have provided proofs of Bernstein-von Mises theorems establishing the asymptotic correctness of generalized fiducial inference for a large class of parametric models. We are currently studying the behavior of fiducial distributions on manifolds. Additionally, my coauthors and I have shown how generalized fiducial distributions can be used for prediction, connected generalized fiducial distributions with the growing field of confidence distributions, and addressed computational issues.
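For orientation, the flavor of these Bernstein-von Mises results is, stated informally (precise assumptions are in the papers), that the generalized fiducial distribution merges with the usual normal approximation,

    \sup_{A} \left| \int_{A} r(\theta \mid y)\, d\theta \;-\; N\!\big(\hat{\theta}_n,\; I(\theta_0)^{-1}/n\big)(A) \right| \;\longrightarrow\; 0 \quad \text{in probability},

so that, just as for Bayesian posteriors, fiducial credible sets are asymptotically correct frequentist confidence sets.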

The second area of interest is the application of generalized fiducial inference to statistical problems of practical interest. For example, GFI provides statistical tools for inference in linear mixed models whose properties compare very favorably with other methodologies available in the literature. Other applications include inference for extremes, a novel method for volatility estimation in high frequency financial data, and an application to psychology (item response modeling). Finally, several papers address specific applications in measurement science. Some of these ideas have a direct impact on government policymaking regarding international inter-laboratory experiments and the assessment of measurement capabilities by the U.S. National Institute of Standards and Technology (NIST). Current applications of interest include the use of fiducial ideas in the analysis of genetic and RNA-Seq data, with special attention paid to the parts of the DNA used in forensic science.

The third area is the use of the generalized fiducial distribution for model selection and non-parametric models. The flexibility of generalized fiducial inference allows us to move beyond parametric problems, e.g., large sparse linear systems and the estimation of a non-parametric survival function. As a first step in this direction, we investigated the use of generalized fiducial inference for constructing wavelet regression confidence intervals and a generalized fiducial solution to the ultra-high-dimensional regression problem using an EBIC-like penalty. Current projects approach this problem from a completely new angle, performing model selection without the use of an arbitrary penalty. We have successfully completed a project on high dimensional regression and multidimensional time series model selection and are currently pursuing fiducial model selection for Gaussian graphical models. Another direction we are investigating is applying a non-parametric version of GFI to deconvolution problems motivated by applications such as differential privacy.

The last area is the practical computation of fiducial distributions. Initially, the main computational tools for generalized fiducial inference were Markov chain Monte Carlo and sequential Monte Carlo. Recently, we have also developed an algorithm that computes the fiducial distribution for massive data sets using a divide-and-conquer approach. Historically, however, each application of fiducial inference required a bespoke implementation of a computational procedure. For fiducial approaches to be useful to data scientists, there is a need to develop general purpose probabilistic programming software applicable to most data science problems. We are investigating several avenues in this direction.
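As a toy illustration of what such general purpose software needs to automate, the sketch below samples the generalized fiducial distribution for the normal model, where the data generating equation X_i = mu + sigma * Z_i admits a closed-form solution; the function name is illustrative, not from any released package. For models without closed forms, this sampling step is exactly where MCMC, sequential Monte Carlo, or the divide-and-conquer algorithm enters.

    import numpy as np

    def fiducial_normal_sample(x, n_draws=10000, seed=None):
        """Draw (mu, sigma) from the generalized fiducial distribution of
        the N(mu, sigma^2) model.  For this model the GFD has a closed
        form: sigma* = s * sqrt((n-1)/V) with V ~ chi^2_{n-1}, and
        mu* = xbar + sigma* * Z / sqrt(n) with Z ~ N(0, 1)."""
        rng = np.random.default_rng(seed)
        x = np.asarray(x, dtype=float)
        n = x.size
        xbar, s = x.mean(), x.std(ddof=1)
        v = rng.chisquare(n - 1, size=n_draws)
        z = rng.standard_normal(n_draws)
        sigma = s * np.sqrt((n - 1) / v)
        mu = xbar + sigma * z / np.sqrt(n)
        return mu, sigma

    # A 95% fiducial interval for mu reproduces the classical t-interval.
    x = np.random.default_rng(1).normal(loc=2.0, scale=3.0, size=25)
    mu, sigma = fiducial_normal_sample(x, n_draws=100000)
    print(np.percentile(mu, [2.5, 97.5]))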

 

Big Data, Data Integration and SiZer
(funded by NSF grants IIS 1633074 and DMS 2113404)

Data science provides a natural place for collaborations between statisticians, computer scientists, and mathematicians. One of the current challenges is caused by data heterogeneity. This phenomenon is frequent in Big Data because it naturally arises when data sets are merged. So far there has been rather little thought or discussion within the quantitative communities, in statistics or elsewhere, about the impact of combining data sets, yet this is a crucial issue. Integrative analysis of disparate data blocks measured on a common set of experimental subjects is an example of this major challenge in modern data analysis.

A natural goal of integrative analysis is the simultaneous exploration of the joint variation and the individual variation within each data block, resulting in new insights. For instance, there is a strong desire to integrate the multiple genomic data sets in The Cancer Genome Atlas to characterize the common and the unique aspects of cancer genetics and cell biology for each source. Our first contribution to this topic was a method termed Angle-Based Joint and Individual Variation Explained (AJIVE), capturing both the joint and the individual variation within each data block. AJIVE provides a major improvement over earlier approaches to this challenge in terms of a new conceptual understanding, much better adaptation to data heterogeneity, and fast linear algebra computation. AJIVE has been used to better understand underlying common driving forces or, on the opposite side of the spectrum, to eliminate batch effects.
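A schematic of the two-stage linear algebra at AJIVE's core is sketched below; it is a bare-bones version in which the analyst supplies the ranks, whereas the published method selects the joint rank through a principled principal-angle/random-direction analysis.

    import numpy as np

    def ajive_sketch(blocks, ranks, joint_rank):
        """Schematic two-stage SVD behind AJIVE.
        blocks: list of (d_k x n) data matrices on the same n objects.
        ranks: initial signal rank chosen for each block.
        joint_rank: number of joint components to keep."""
        # Stage 1: extract each block's signal row space (object/score space).
        score_bases = []
        for X, r in zip(blocks, ranks):
            _, _, Vt = np.linalg.svd(X, full_matrices=False)
            score_bases.append(Vt[:r])           # r x n orthonormal rows
        # Stage 2: SVD of the stacked bases; singular values near sqrt(K)
        # flag score directions (nearly) shared by all K blocks.
        M = np.vstack(score_bases)
        _, svals, Vt = np.linalg.svd(M, full_matrices=False)
        joint = Vt[:joint_rank]                   # joint score directions
        # Split each block into a joint part and an individual-plus-noise part.
        joint_parts = [X @ joint.T @ joint for X in blocks]
        indiv_parts = [X - J for X, J in zip(blocks, joint_parts)]
        return joint_parts, indiv_parts, svals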

Currently we are developing a methodology termed Data Integration via Analysis of Subspaces (DIVAS) that completely redesigns the statistical and optimization underpinnings of our method to allow for partially shared joint structures and valid statistical inference, while using information in both the object (scores) and feature (loadings) spaces. We are applying DIVAS to many different data sets beyond the original cancer genomics, e.g., integrating connection information with behavioral data in neuroscience and integrating various types of measurements in quantitative psychology.

Another basic question in statistics is finding a functional relationship between predictors and response variables, known under the technical term regression. When the relationship cannot be described by a simple function, e.g., a line, a more flexible, non-parametric method is sought. Such methods generally require the selection of a smoothing parameter. At the turn of the millennium, Chaudhuri and Marron argued that instead of selecting a single tuning parameter one should work with several of them, identifying features visible at different levels of smoothing. These features were distinguished using a large number of statistical tests summarized in a special graphic termed the SiZer (SIgnificant ZERo crossings) map. Since then, SiZer maps have found their use in many exploratory data analysis situations. Because the SiZer map is based on a large number of statistical tests, a multiple testing adjustment is required to make the map less susceptible to false positives. The original SiZer of Chaudhuri and Marron used an ad-hoc adjustment that left it prone to false positive results.
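In outline, a SiZer map colors each (location, bandwidth) pixel by the sign of the smoothed derivative whenever that derivative is statistically significant. A bare-bones sketch, using only a naive pointwise test and deliberately omitting the multiple testing adjustment discussed next:

    import numpy as np

    def sizer_map(x, y, grid, bandwidths, z=1.96):
        """Bare-bones SiZer: for each (gridpoint, bandwidth) pixel, test
        whether the local-linear slope differs significantly from zero.
        Returns +1 (significantly increasing), -1 (significantly
        decreasing), or 0 (not significant) for each pixel."""
        out = np.zeros((len(bandwidths), len(grid)))
        for i, h in enumerate(bandwidths):
            for j, x0 in enumerate(grid):
                w = np.exp(-0.5 * ((x - x0) / h) ** 2)   # Gaussian kernel weights
                xm = np.sum(w * x) / np.sum(w)
                ym = np.sum(w * y) / np.sum(w)
                sxx = np.sum(w * (x - xm) ** 2)
                slope = np.sum(w * (x - xm) * (y - ym)) / sxx
                resid = y - ym - slope * (x - xm)
                sigma2 = np.sum(w * resid ** 2) / np.sum(w)
                se = np.sqrt(sigma2 * np.sum(w ** 2 * (x - xm) ** 2)) / sxx
                if abs(slope) > z * se:
                    out[i, j] = np.sign(slope)
        return out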

My first contribution to this area was to provide a rigorous multiple testing adjustment based on extreme value theory, substantially improving the validity of SiZer maps. I worked out a second order approximation to the extreme value distribution of the Gaussian random process implied by SiZer. In order to make the SiZer idea practical beyond the original i.i.d. setup, one needs to extend it to other models, and my subsequent contributions have concentrated on such extensions. For example, we used quantile regression and M-estimation to provide a robust SiZer map capable of dealing with outliers, and proposed a version of SiZer for dependent data. We also provided a tool for the rigorous comparison of SiZer maps. The general philosophy of SiZer, i.e., looking at the results of a statistical algorithm for a number of possible tuning parameters and coupling this with a rigorous statistical test to make sense of the changes in the outputs, is transferable to other data science applications.

 

Applications
(funded by NSF grants DMS 1016441, 1916115, and ECCS 0700559)

Forensic science:
My initial interest in the application of statistics to forensic science was the quality assessment and development of numerical summaries of evidence called "likelihood ratios". The use of likelihood ratios for quantifying the strength of forensic evidence in criminal cases is gaining widespread acceptance in many forensic disciplines. Legal requirements on the reliability of expert evidence have encouraged researchers to develop likelihood ratio systems based on statistical modeling using relevant empirical data. Many such systems exhibit exceptional power to discriminate between the scenario presented by the prosecution and an alternative scenario implying the innocence of the defendant. However, such systems are not necessarily well-calibrated. My collaborators and I developed a GFI-based approach for assessing the calibration discrepancy of likelihood ratio systems using empirical data with known ground truth. Additionally, I work on the development of well-calibrated likelihood ratios for several forensic problems, e.g., DNA mixture deconvolution, glass evidence attribution, and others.
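For readers outside forensics, the likelihood ratio compares how probable the evidence E is under the prosecution hypothesis H_p and the defense hypothesis H_d, and one necessary condition for calibration can be stated, informally, as a moment identity satisfied by the ideal system:

    \mathrm{LR}(E) = \frac{P(E \mid H_p)}{P(E \mid H_d)},
    \qquad
    E\big[\mathrm{LR}(E) \,\big|\, H_d\big] = 1.

Systematic departures from identities of this kind on ground-truth-known data, e.g., reported likelihood ratios that overstate the evidence, are what a calibration assessment is designed to detect.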

Social Sciences:
Given the current replication crisis, the social sciences are looking beyond the p-value for the quantification of uncertainty and statistical evidence. One popular new direction is to use Bayesian methods, such as Bayes factors. However, there is often a lack of understanding of the effect of the prior and other choices inherent in this approach. My collaborators and I are comparing various statistical paradigms, e.g., Bayesian, fiducial, and frequentist, and their value in answering questions concerning null effects, small and interrupted studies, factor analysis, and item response modeling.

Biology and Genomics:
I am working with collaborators in biology on the statistical analysis of high throughput DNA sequencing data. One problem of interest is detecting changes in the genome in response to environmental pressure. The techniques my student and I are currently applying combine SiZer-type ideas with generalized fiducial and objective Bayesian methods. Another problem we are currently working on is the basic analysis of uncertainties for a new sequencing method that measures the lengths of segments between pre-determined sequences.

Engineering:
The first engineering application I was part of was concerned with the modeling and simulation of extremely large networks using time-dependent partial differential equations (PDEs). In many applications, numerical simulation is the tool of choice for the design and evaluation of large networks. However, the computational overhead associated with direct simulation severely limits the size and complexity of networks that can be studied in this fashion. Performing numerical simulations of large stochastic networks has been widely recognized as a major hurdle to future progress in understanding and evaluating large networks. Our modeling approach is based on an asymptotic analysis of a stochastic system that provides a probabilistic description of the network dynamics. The team working on this project comprised an electrical engineer, a PDE specialist, and myself (a statistician). In a series of papers, we developed technical tools, provided a rigorous mathematical proof of convergence of the random process modeling a class of communication networks to the limiting PDE, and applied the ideas to various network protocols. Recently, a student and I addressed a related inverse problem for a stochastic ODE. This work combined ideas from applied mathematics with GFI.

The second application was simultaneous target tracking. The main idea is to provide an algorithm that, based on limited information from sensors or images, provides a location (or a sequence of locations called a track) for each of the targets with high fidelity. We developed a model-based algorithm for tracking multiple moving objects extracted from an image sequence, allowing for birth, death, splitting, and merging of targets. This is an important problem that finds numerous applications in science and engineering.

Finally, I collaborated on a statistical algorithm for detecting anthrax from laser-induced spectroscopy data and on the detection of misclassified compounds in chemical libraries.

Finance:
The presence or absence of jumps in financial time series data, such as stock prices, has been of interest to researchers and practitioners because of the effect the presence of jumps has on the pricing of various financial instruments. We provided a new test for detecting jumps in financial time series. I also used the ideas of generalized fiducial inference and non-parametric smoothing to provide new statistical inference procedures for volatility in financial data.

 

Analytical probability
(funded by NSF grant DMS 0504737)

A large portion of my early career was spent working on problems from analytical probability. The main area of my interest was small deviations for Gaussian processes, i.e., understanding the behavior of the probability that a stochastic process X(t) stays in a small ball of radius ε around the origin during a time interval [0, T]. As ε tends to 0, this probability clearly tends to zero, and the question is at what rate. Answers to small deviation questions are used in other fields of mathematics, such as the analysis of non-parametric Bayes estimators, quantization, and metric entropy. My collaborators and I made contributions to the theory of small deviations under the L2 norm. We characterized the precise L2 small deviation rates for a large class of continuous Gaussian processes. We also provided a comparison theorem for the lower tail of sums of positive random variables.
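In symbols, the L2 small deviation problem asks for the asymptotic behavior of

    \varphi(\varepsilon) = -\log P\!\left( \int_0^T X(t)^2 \, dt \le \varepsilon^2 \right), \qquad \varepsilon \to 0.

For instance, for standard Brownian motion on [0, 1] the classical answer is

    -\log P\!\left( \int_0^1 W(t)^2 \, dt \le \varepsilon^2 \right) \sim \frac{1}{8\varepsilon^2},

and for more general Gaussian processes the rate is governed by the decay of the Karhunen-Loeve eigenvalues of the covariance operator.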

Another area of interest was the analysis of several stochastic search algorithms related to simulated annealing. The convergence of simulated annealing was established by Hajek in the 1980s. Our work uses an alternative, simple approach based on the relative frequency with which simulated annealing visits the various states of the system. We also provide a particular type of rate of convergence not available before.

Finally, my dissertation studied the properties of filtrations supporting only purely discontinuous martingales. The main result could be paraphrased as follows: if all martingales have at least one jump, then all the information available in the system is contained in the timing and sizes of the jumps.

