Big data has created new challenges for data analysis because of the sheer size of modern datasets, which can contain millions of observations and/or thousands of variables. A common work-around is to conduct the analysis on a random sample of the dataset; a more recent proposal replaces the random sample with a set of “support vectors”. However, these solutions may be inadequate, since they are not guaranteed to capture the structure of the dataset well, particularly at its tails or edges. The first method I will present is a new approach to analyzing such large datasets based on the concept of “data nuggets”. Data nuggets reduce a very large dataset to a small collection of nuggets, each consisting of a center, a weight, and a scale parameter. Once the data are re-expressed as data nuggets, we can apply algorithms that compute standard unsupervised and supervised statistical methods, such as principal components analysis (PCA), clustering, and linear models.
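To make the idea concrete, here is a minimal sketch of the nugget representation. The published data-nugget algorithm is more refined than this; as an illustrative assumption, I simply assign observations to a set of provisional centers drawn from the data, then summarize each group by a center (mean), weight (count), and scale (within-group variance), and run a weight-aware PCA on the reduced set.

```python
import numpy as np

rng = np.random.default_rng(0)

# A "large" dataset for illustration: 100,000 observations, 5 variables.
X = rng.normal(size=(100_000, 5))
k = 200  # number of nuggets (illustrative choice)

# Provisional nugget centers: k observations drawn from the data.
centers = X[rng.choice(len(X), size=k, replace=False)]

def assign(X, centers, chunk=10_000):
    """Assign each observation to its nearest center, in memory-friendly chunks."""
    labels = np.empty(len(X), dtype=int)
    for i in range(0, len(X), chunk):
        d = ((X[i:i + chunk, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels[i:i + chunk] = d.argmin(1)
    return labels

labels = assign(X, centers)

# Each nugget: center (mean), weight (count), scale (within-nugget variance).
weights = np.bincount(labels, minlength=k).astype(float)
nugget_centers = np.vstack([X[labels == j].mean(0) for j in range(k)])
scales = np.array([X[labels == j].var() for j in range(k)])

# Weighted PCA on the nuggets: eigen-decompose the weight-weighted covariance
# of the nugget centers instead of touching all 100,000 rows again.
mu = np.average(nugget_centers, axis=0, weights=weights)
Z = (nugget_centers - mu) * np.sqrt(weights)[:, None]
cov = Z.T @ Z / weights.sum()
eigvals, eigvecs = np.linalg.eigh(cov)
print(eigvals[::-1][:2])  # two leading weighted principal-component variances
```

The weighting step is what keeps the reduced analysis faithful to the original data: a nugget summarizing 2,000 points contributes 2,000 times the influence of a singleton, and the scale parameter records how much within-nugget spread the reduction discarded.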
The second problem is how to deal with the high-dimensionality issues that arise in many genomics experiments, where the dimension is very high and the sample size is small. Such datasets expose the limitations of standard penalized methods for model selection and model building. I will present a soft dimension reduction methodology called enrichment: a weighting scheme applied to the variables, rather than the observations, that emphasizes the important ones. The enrichment method provides a new modeling approach for very high dimensional data.
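The contrast with hard variable selection can be sketched in a few lines. The talk's enrichment method estimates the variable weights in its own way; as a stand-in assumption, the toy example below uses marginal correlations with the response as importances, rescales the columns by those weights, and fits ridge regression, so that unimportant variables are shrunk toward zero softly rather than dropped outright.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy high-dimensional setting: n = 50 samples, p = 200 variables,
# with signal only in the first 5 variables.
n, p = 50, 200
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = 2.0
y = X @ beta + rng.normal(size=n)

# Hypothetical variable-importance weights (the enrichment method would
# supply these): normalized absolute marginal correlations with y.
w = np.abs(X.T @ (y - y.mean())) / n
w /= w.max()

# Soft dimension reduction: rescale each column by its weight, then fit
# ridge regression on the weighted design.
Xw = X * w
lam = 1.0
gamma = np.linalg.solve(Xw.T @ Xw + lam * np.eye(p), Xw.T @ y)
beta_w = gamma * w  # coefficients mapped back to the original variable scale

# Unweighted ridge fit, for comparison.
beta_r = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```

Because every variable stays in the model with a continuous weight, no hard inclusion threshold has to be chosen; down-weighted variables simply contribute less, which is the sense in which the reduction is "soft".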
To illustrate these methods, I will present two applications. The first is an analysis of a flow cytometry dataset with millions of observations. The second is a lupus clinical study that combines genetic (high dimensional) and clinical (low dimensional) data.