Research statement

During my career, I have developed diverse research interests in both theoretical statistics and probability. In particular, there are four major areas in which I have made multiple contributions: generalized fiducial inference; methodology and theory for SiZer; analytical probability; and applications to engineering and finance. Currently I am also developing a new research interest in biological applications. I will now describe my contributions and future plans by area.


Generalized Fiducial Inference and related topics
(funded by NSF grants DMS 1007520, DMS 0707037 and DMS 1512945)

A large percentage of my current effort is devoted to studying the theoretical properties of generalized fiducial inference. R. A. Fisher's fiducial inference has been the subject of many discussions and controversies ever since he introduced the idea during the 1930s. The idea experienced a bumpy ride, to say the least, during its early years, and one can safely say that it eventually fell into disfavor among mainstream statisticians. However, it has made a resurgence recently under various labels such as generalized inference, confidence distributions, Dempster-Shafer calculus and its derivatives. In these new guises, fiducial inference has proved to be a useful tool for deriving statistical procedures for problems where frequentist methods with good properties were previously unavailable.

The aim of my work is to revisit Fisher's fiducial idea from a fresh angle. I do not attempt to derive a new paradox-free theory of fiducial inference, as I do not believe this is possible. Instead, with minimal assumptions, I present a simple new fiducial recipe that can be applied to conduct statistical inference via the construction of generalized fiducial distributions. This recipe is designed to be easily implementable in practical applications and can be applied regardless of the dimension of the parameter space, including nonparametric problems. I term the resulting inference generalized fiducial inference (GFI). A reader interested in learning more about generalized fiducial inference can consult our review article.
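To give a flavor of the recipe, here is a toy illustration of my own (the numbers and variable names are purely illustrative, not from any paper): in the normal location model the data-generating equation is X_i = theta + U_i with U_i standard normal; solving for theta and replacing the unobserved errors by independent copies U* yields samples from the generalized fiducial distribution.

```python
import numpy as np

# Toy generalized fiducial recipe for the normal location model
# X_i = theta + U_i, U_i ~ N(0,1): solve the data-generating equation
# for theta and replace the unobserved U by independent copies U*,
# giving fiducial samples theta* = xbar - mean(U*).
rng = np.random.default_rng(0)
n, xbar = 25, 1.3                      # sample size and observed mean (illustrative)
u_star = rng.normal(0.0, 1.0, size=(200_000, n))
theta_samples = xbar - u_star.mean(axis=1)

# A 95% fiducial interval; in this simple model it coincides with the
# classical z-interval xbar +/- 1.96/sqrt(n).
lo, hi = np.quantile(theta_samples, [0.025, 0.975])
```

In richer models the data-generating equation cannot be inverted exactly, which is where the general definition and the computational work discussed below come in.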

From the very beginning our work has been motivated by important applications in areas such as pharmaceutical statistics and metrology. Jointly with several of my students and collaborators at other institutions, we applied the generalized fiducial methodology to important applied problems with great success. In addition to methodological research, I have also analyzed mathematical properties of the generalized fiducial distribution, proving that it often gives rise to statistical procedures with good asymptotic properties. Fiducial-based statistical procedures are also very competitive for small samples, as shown by mounting evidence from several simulation studies, giving practitioners an exciting new data analysis tool.

In particular, my contributions naturally cluster into three areas. The first is the definition and theoretical properties of generalized fiducial inference. My initial significant contribution was a 2006 paper that connected fiducial inference to the new field of generalized inference, sparking a number of subsequent publications. The current formal definition of generalized fiducial inference can be found in a 2016 review article that contains a number of useful results and formulas. In a series of publications, my students and I have proved Bernstein-von Mises theorems establishing the asymptotic correctness of generalized fiducial inference for a large class of parametric models. We are currently studying higher order asymptotic results, and the first paper on this topic is being revised. Additionally, my coauthors and I show how the generalized fiducial distribution can be used for prediction, connect it with the growing field of confidence distributions, and address computational issues.

The second area of interest is the application of generalized fiducial inference to statistical problems of practical interest. For example, GFI provides statistical tools for inference in linear mixed models with properties that compare very favorably with other methodologies available in the literature. Other applications include inference for extremes, a novel method for volatility estimation in high frequency financial data, and an application to psychology (item response modeling). Finally, several papers address specific applications in measurement science. Some of these ideas have a direct impact on government policy-making with regard to international inter-laboratory experiments and assessment of measurement capabilities by the U.S. National Institute of Standards and Technology (NIST).
Current applications of interest include the use of the fiducial idea in the analysis of genetics and RNA-Seq-like data.

The third and most recent area is the use of the generalized fiducial distribution for model selection and non-parametric models. My initial contribution was fiducial model selection in more classical setups.
However, the flexibility of generalized fiducial inference also allows us to move beyond parametric problems, e.g., large sparse linear systems and estimation of a non-parametric survival function. As a first step in this direction, we investigated the use of generalized fiducial inference for constructing wavelet regression confidence intervals and a generalized fiducial solution to the ultra-high-dimensional regression problem using an EBIC-like penalty.
Current projects approach this problem from a completely new angle: doing model selection without the use of an arbitrary penalty. We have successfully completed a pilot project on high dimensional regression and are currently pursuing multidimensional time series model selection. We are also studying properties of GFI in known challenging non-parametric problems such as estimation of a non-decreasing density.

In the long term, I plan to continue working on applications of fiducial inference to important current problems of statistical inference. One particularly exciting direction is the development of deep fiducial inference: the use of deep learning in fiducial computations.


Big Data, Data Integration, Non-parametric smoothing and SiZer
(funded by NSF grant IIS-1633074)

Big Data has become a popular fad among statisticians, computer scientists and mathematicians. However, not enough attention has yet been paid to large scale Big Data challenges such as data heterogeneity, which frequently arises in Big Data sets because it occurs naturally whenever data sets are combined. So far there has been rather little thought or discussion within the quantitative communities (neither in statistics, nor elsewhere) about the impact of combining data sets, yet that is a crucial issue.

Integrative analysis of disparate data blocks measured on a common set of experimental subjects is a major challenge in modern data analysis. This data structure naturally motivates the simultaneous exploration of the joint and individual variation within each data block, resulting in new insights. For instance, there is a strong desire to integrate the multiple genomic data sets in The Cancer Genome Atlas to characterize the common and also the unique aspects of cancer genetics and cell biology for each source. We introduce a method termed Angle-Based Joint and Individual Variation Explained (AJIVE) that captures both joint and individual variation within each data block.

AJIVE provides a major improvement over earlier approaches to this challenge in terms of a new conceptual understanding, much better adaptation to data heterogeneity and fast linear algebra computation. Moreover, the tools developed allow one to compare several approximations of the data, creating a scale space view. This can be used to better understand underlying common driving forces or, on the opposite side of the spectrum, to eliminate batch effects.
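The angle idea can be sketched in a few lines. This is my own toy construction, not the published AJIVE algorithm: after low-rank approximation of each block, directions of joint variation show up as small principal angles between the score subspaces of the blocks, while individual variation produces large angles.

```python
import numpy as np

def principal_angles_deg(A, B):
    """Principal angles (in degrees) between the column spaces of A and B."""
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    s = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return np.degrees(np.arccos(np.clip(s, -1.0, 1.0)))

rng = np.random.default_rng(0)
n = 100                                            # common experimental subjects
z = rng.normal(size=n)                             # joint score shared by both blocks
a1, a2 = rng.normal(size=n), rng.normal(size=n)    # individual scores
X1 = np.outer(rng.normal(size=50), z) + np.outer(rng.normal(size=50), a1)
X2 = np.outer(rng.normal(size=60), z) + np.outer(rng.normal(size=60), a2)
X1 += 0.01 * rng.normal(size=X1.shape)             # small measurement noise
X2 += 0.01 * rng.normal(size=X2.shape)

# rank-2 score subspaces: top right singular vectors of each block
V1 = np.linalg.svd(X1)[2][:2].T
V2 = np.linalg.svd(X2)[2][:2].T
angles = principal_angles_deg(V1, V2)   # small first angle <=> joint direction
```

In this simulated example the first principal angle is close to zero (the shared score z) while the second is large (the unrelated individual scores), which is exactly the separation the angle-based method exploits.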

Our future work will extend the AJIVE methodology to include more complex intermediate variation and provide theoretical studies of its properties. We also plan to apply this tool to various datasets and develop a supervised version of JIVE.


Another basic question in statistics is finding a functional relationship between predictors and response variables, known under the technical term regression. When the relationship cannot be described by a simple function, e.g., a line, a more flexible, non-parametric method is sought. One such non-parametric method is local polynomial smoothing. The idea of local polynomial smoothing is to fit a simple function, e.g., a polynomial, to the data in a narrow sliding window. The size of the window is called the bandwidth, and the practical performance of local polynomial smoothing is critically influenced by its choice. There has been a lot of competing literature on how best to select the bandwidth in various situations, mainly in the 1990s.
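A minimal sketch of the sliding-window idea, assuming a Gaussian kernel and degree-one (local linear) fits; the data and bandwidth below are illustrative only:

```python
import numpy as np

def local_linear(x, y, x0, h):
    """Fit a weighted line in a window of width ~h around x0 and
    return the fitted value at x0 (the intercept of the local fit)."""
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)          # Gaussian kernel weights
    X = np.column_stack([np.ones_like(x), x - x0])
    WX = X * w[:, None]
    beta = np.linalg.solve(WX.T @ X, WX.T @ y)      # weighted least squares
    return beta[0]

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 400)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.1, size=x.size)
fit = np.array([local_linear(x, y, x0, h=0.05) for x0 in x])
max_err = np.max(np.abs(fit - np.sin(2 * np.pi * x)))
```

Shrinking h toward zero reproduces the noisy data, while a very large h flattens the fit toward a single global line; this is precisely the trade-off that bandwidth selection tries to resolve.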

At the turn of the millennium, Chaudhuri and Marron argued that instead of selecting a single ``best'' bandwidth one should work with a range of possible bandwidths, identifying features at a number of different scales. These features are distinguished using a large number of statistical tests summarized in a special graphic termed the SiZer (SIgnificant ZERo crossings of derivatives) map. Since then, SiZer maps have found their use in many exploratory data analysis situations.

Because the SiZer map is based on a large number of statistical tests, it requires a multiple testing adjustment to make the map less susceptible to false positives. The original SiZer of Chaudhuri and Marron used an ad-hoc adjustment that left it prone to false positive results. My first contribution to this area was to provide a rigorous multiple testing adjustment based on extreme value theory, substantially improving the validity of SiZer maps. I worked out a ``second order'' approximation to the extreme value distribution of the Gaussian random process implied by SiZer.
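A toy version of the map makes the multiple testing issue concrete. The sketch below is my own simplification: it marks each (location, bandwidth) cell by the sign of a significantly nonzero smoothed derivative using a naive pointwise z-test, i.e., exactly the kind of unadjusted testing that the extreme-value-based adjustment is designed to correct.

```python
import numpy as np

def toy_sizer(x, y, x_grid, bandwidths, z_crit=1.96):
    """Toy SiZer map: +1/-1 where the smoothed derivative is significantly
    positive/negative, 0 otherwise.  Uses a naive pointwise z-test; a proper
    SiZer adjustment is based on extreme value theory."""
    sigma2 = np.mean(np.diff(y) ** 2) / 2.0        # difference-based noise variance
    out = np.zeros((len(bandwidths), len(x_grid)), dtype=int)
    for i, h in enumerate(bandwidths):
        for j, x0 in enumerate(x_grid):
            w = np.exp(-0.5 * ((x - x0) / h) ** 2)
            X = np.column_stack([np.ones_like(x), x - x0])
            A = np.linalg.solve((X * w[:, None]).T @ X, (X * w[:, None]).T)
            slope = A[1] @ y                        # local linear derivative estimate
            se = np.sqrt(sigma2 * np.sum(A[1] ** 2))
            if abs(slope) > z_crit * se:
                out[i, j] = 1 if slope > 0 else -1
    return out

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 300)
y = x + rng.normal(0.0, 0.05, size=x.size)         # truly increasing trend
sz = toy_sizer(x, y, x_grid=np.array([0.5]), bandwidths=[0.05, 0.1])
```

Run over a full grid of locations, such unadjusted tests are performed hundreds of times per map, which is why controlling the family-wise error of the whole map, rather than each cell, is essential.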

In order to make the SiZer idea practical beyond the original i.i.d. setup, one needs to extend it to other models, and my next contributions concentrated on such extensions. For example, we use quantile regression and M-estimation to provide a robust SiZer map capable of dealing with outliers, and we propose a version of SiZer for dependent data. We also provided a tool for the rigorous comparison of SiZer maps. This is important because currently there is no rigorous way of deciding under what conditions one of the many versions of SiZer in the literature will outperform the others.


Applications to Engineering and Finance
(funded by NSF grants DMS 1016441 and ECCS 0700559)

Another important active area of my interest is the application of statistics and probability to engineering and finance. Here I have worked on several interesting applications.

The first application I am part of is concerned with the modeling and simulation of extremely large networks using time-dependent partial differential equations (PDEs). In many applications, numerical simulation is the tool of choice for the design and evaluation of large networks. However, the computational overhead associated with direct simulation severely limits the size and complexity of networks that can be studied in this fashion. Performing numerical simulations of large stochastic networks has been widely recognized as a major hurdle to future progress in understanding and evaluating large networks. Our modeling approach is based on asymptotic analysis of a stochastic system that provides a probabilistic description of the network dynamics. This approach appears particularly promising for networks like a wireless ad hoc network.

In this kind of network, nodes send to and receive from other nodes that are within transmission range. Transmission success is affected by interference; e.g., nodes are often so simple that they can receive only one message at a time, and propagation losses are often modeled by a power law dependence on distance. In this situation, we believe that it is possible to formulate the flow of information through the network using hydrodynamic scaling limits for the behavior of the individual packets or particles. The technical details involve defining a probability structure that describes the likelihood of information from one node passing to nearby nodes and then passing from this local probability structure to a diffusion limit description of the motion. The team working on this project consists of an electrical engineer, a probabilist and a PDE specialist. In a series of papers we develop technical tools, provide a rigorous mathematical proof of convergence of the random process modeling a class of communication networks to the limiting PDE, and apply the ideas to various network protocols.
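The passage from a local probability structure to a diffusion limit can be illustrated with the simplest possible example (a generic random walk, not our actual network model): many independent packets hopping on a one-dimensional lattice produce an empirical density whose spread matches the heat-equation prediction.

```python
import numpy as np

# Toy scaling-limit illustration: n_packets independent packets each make
# t equally likely +/-1 hops on a 1-D lattice.  The diffusion (heat equation)
# limit predicts Var(position) ~ 2*D*t with D = 1/2 for this walk, i.e. ~ t.
rng = np.random.default_rng(0)
n_packets, t = 20_000, 400
steps = rng.choice([-1, 1], size=(n_packets, t))
pos = steps.sum(axis=1)
empirical_var = pos.var()          # should be close to t
```

In the network setting the hop probabilities are spatially inhomogeneous and interference-dependent, which is what makes the rigorous convergence proofs nontrivial.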

The second application is simultaneous target tracking. The main idea is to provide an algorithm that, based on limited information from sensors or images, provides a location (or sequence of locations, called a track) for each of the targets with high fidelity. My student, a collaborator at another institution, and I provide a model-based algorithm for tracking multiple moving objects extracted from an image sequence, allowing for birth, death, splitting and merging of targets. This is an important problem with numerous applications in science and engineering. We also establish the almost sure convergence of the estimators based on our model to the truth. In other words, we proved that the model should work well under the conditions it was designed for, provided we have enough data. This consistency property of the tracking estimates was empirically verified by numerical experiments. From a somewhat different angle, another group of collaborators and I study the tracking of targets using sensors with limited communication capacity using information-theoretic tools.

The third application is financial data. The presence or absence of jumps in financial time series data, such as stock prices, has been of interest among researchers and practitioners due to the effect the presence of jumps has on the pricing of various financial instruments. We provide a new test for detecting jumps in financial time series. In the future I plan to use the ideas of generalized fiducial inference and non-parametric smoothing to provide new statistical inference procedures for volatility in financial data.
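To illustrate what a jump test measures, here is a standard benchmark statistic from the literature (the Barndorff-Nielsen and Shephard comparison of realized variance with bipower variation), not the new test described above: bipower variation is robust to a single jump, so the difference of the two isolates the jump contribution.

```python
import numpy as np

def rv_minus_bv(r):
    """Realized variance minus bipower variation for a return series r:
    near zero without jumps, roughly (jump size)^2 when a jump is present."""
    rv = np.sum(r ** 2)
    bv = (np.pi / 2.0) * np.sum(np.abs(r[1:]) * np.abs(r[:-1]))
    return rv - bv

rng = np.random.default_rng(0)
r = rng.normal(0.0, 0.01, size=2000)       # purely diffusive intraday returns
stat_no_jump = rv_minus_bv(r)
r_jump = r.copy()
r_jump[1000] += 0.3                        # add a single large jump
stat_jump = rv_minus_bv(r_jump)
```

A formal test then compares this difference to its asymptotic standard error; the illustrative jump size above is chosen large so the effect is visible by eye.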

Next, we deal with some aspects of statistical analysis of internet traffic data. In particular, we address a controversy about the distribution of sizes of files transferred over the internet. We also study the relationship between statistical summaries (such as the sample covariance) of internet traffic data computed by aggregating the data at different resolutions.

Lastly, my students, collaborators at the EPA and I provide a statistical algorithm for detecting anthrax from laser-induced spectroscopy data. This manuscript is the first among several forthcoming manuscripts dealing with applications to chemical statistics and pharmacology. One of them will develop an algorithm for finding misclassified compounds in chemical libraries.


Analytical probability
(funded by NSF grant DMS 0504737)

A large portion of my early career was spent working on problems from analytical probability. The main area of my interest was small deviations for Gaussian processes, i.e., understanding the behavior of the probability that a stochastic process X(t) stays in a small ball of radius $\epsilon$ around the origin during a time interval [0,T]. As $\epsilon$ tends to 0, this probability clearly tends to zero, and the question is at what rate. Answers to small deviation questions are used in other fields of mathematics such as the analysis of non-parametric Bayes estimators, quantization and metric entropy. My collaborators and I made contributions to the theory of small deviations under the L2 norm: we characterize the precise L2 small deviation rates for a large class of continuous Gaussian processes. We also provide a comparison theorem for the lower tail of sums of positive random variables.
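To make the rate question concrete, the classical benchmark (a standard textbook fact, not one of our results) is standard Brownian motion $W$ under the L2 norm on $[0,1]$:

```latex
\[
  -\log P\!\left( \int_0^1 W(t)^2 \, dt \le \epsilon^2 \right)
  \;\sim\; \frac{1}{8\epsilon^2}, \qquad \epsilon \to 0 .
\]
```

Our work identifies such exact logarithmic rates, and in many cases the constants, for much wider classes of Gaussian processes.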

Another area of interest was the analysis of several stochastic search algorithms related to simulated annealing. The convergence of simulated annealing was established by Hajek in the 1980s. Our work uses an alternative simple approach based on the relative frequency with which simulated annealing visits the various states of the system. We also provide a particular type of rate of convergence not available before.
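For readers unfamiliar with the algorithm, a minimal sketch of simulated annealing on a toy landscape (my own illustrative setup; the energy, neighborhood and cooling schedule are not the ones analyzed in our papers): as the temperature decays, the chain spends an increasing fraction of its time near low-energy states.

```python
import math
import random

def anneal(energy, neighbors, state, n_steps=20_000, t0=1.0):
    """Minimal simulated annealing: propose a random neighbor, always accept
    downhill moves, accept uphill moves with probability exp(-dE/T), and let
    the temperature T decay over time (logarithmic cooling here)."""
    rng = random.Random(0)
    for k in range(1, n_steps + 1):
        t = t0 / math.log(k + 1)                  # cooling schedule
        cand = rng.choice(neighbors(state))
        d = energy(cand) - energy(state)
        if d <= 0 or rng.random() < math.exp(-d / t):
            state = cand
    return state

# toy landscape: integers -50..50 with energy |x|, global minimum at 0
best = anneal(energy=abs,
              neighbors=lambda x: [max(-50, x - 1), min(50, x + 1)],
              state=40)
```

The relative-frequency viewpoint mentioned above studies exactly how the occupation times of such a chain concentrate on the minimizers as the temperature decreases.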

Next, we provide a definition of continuous-time ARMA(p,q) processes in the case q >= p, in which case the process does not exist in the classical sense.

Finally, my dissertation studied the properties of filtrations supporting only purely discontinuous martingales. The main result could be paraphrased as follows: if all martingales have at least one jump then all the information available in the system is included in the timing and sizes of the jumps.


copyright © Jan Hannig, designed by JanAltonDesign, 2008