Generalised factor models are a family of linear and non-linear latent variable models widely used to model the joint distribution of multivariate data, with wide applications in the social and behavioural sciences. With advances in information technology, large-scale data, involving large numbers of observations and manifest variables, are increasingly common. For such data, traditional generalised factor models and their estimation procedures are no longer suitable, due to several statistical and computational barriers brought by the high dimensionality of the data.
To make generalised factor models scalable to high-dimensional multivariate data, we propose to revisit the joint maximum likelihood estimator, a vintage estimation approach in the latent variable model literature. This approach treats the latent variables (factors) as fixed parameters rather than as random variables, in contrast with the convention in modern psychometrics. The joint maximum likelihood estimator is statistically inconsistent under a low-dimensional setting where the number of manifest variables is fixed, because the number of model parameters diverges with the sample size. The story changes under a high-dimensional setting where the numbers of observations and manifest variables are both large: the estimator becomes consistent as the two grow to infinity simultaneously, even though the number of model parameters still diverges. We have developed theories and methods for the estimation and model selection of generalised factor models under high-dimensional regimes. This line of research has several future directions, especially regarding statistical inference based on the joint maximum likelihood estimator.
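To fix ideas, the display below sketches the joint log-likelihood being maximised; the notation (responses y_ij, factor scores theta_i, loadings a_j, intercepts d_j, and an exponential-family density f) is generic and introduced here for illustration.

```latex
% Joint log-likelihood for N observations and J manifest variables, treating
% both the item parameters (a_j, d_j) and the factor scores theta_i as fixed
% parameters; the JML estimator maximises this over all of them jointly.
\ell_{\mathrm{JML}}\bigl(\{\theta_i\},\{a_j,d_j\}\bigr)
  = \sum_{i=1}^{N}\sum_{j=1}^{J} \log f\bigl(y_{ij}\mid d_j + a_j^{\top}\theta_i\bigr).
```

Because each observation contributes its own theta_i, the parameter count grows with the sample size, which is the source of the inconsistency in the fixed-dimensional regime described above.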
Relevant Publications:
Outliers in multivariate data -- extreme observations or manifest variables -- are commonly encountered. One example is cheating in standardised educational tests, where cheating test takers and leaked items are regarded as outlying observations and manifest variables that need to be detected. Unlike in regression analysis, outliers are less well defined and harder to detect in multivariate analysis settings.
We have developed methods for detecting outliers in multivariate data. In our framework, a latent variable model is imposed as a baseline model based on substantive knowledge and historical data, and outliers are defined as observations and manifest variables that deviate from the baseline model. We have also developed a compound decision theory for detecting outlying observations and manifest variables while controlling FDR-type compound risks.
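As a concrete (and heavily simplified) illustration of FDR-type control at the final decision stage, the sketch below applies a standard Benjamini-Hochberg step-up rule to per-observation p-values; it assumes the fitted baseline model supplies such p-values, and it is not the compound decision procedure developed in our work.

```python
import numpy as np

def detect_outlying_rows(p_values, alpha=0.05):
    """Benjamini-Hochberg-style thresholding of per-observation p-values.

    p_values[i] is assumed to measure how much observation i deviates from
    the fitted baseline latent variable model (smaller = more outlying).
    Returns the indices flagged as outliers at nominal FDR level alpha.
    """
    p = np.asarray(p_values)
    n = len(p)
    order = np.argsort(p)
    # Largest k such that p_(k) <= k * alpha / n.
    below = p[order] <= alpha * (np.arange(1, n + 1) / n)
    if not below.any():
        return np.array([], dtype=int)
    k = np.max(np.nonzero(below)[0]) + 1
    return order[:k]
```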
Relevant Publications:
We have developed a sequential decision framework for detecting changes in parallel data streams, a problem widely encountered when analysing large-scale real-time streaming data. In this problem, there are multiple parallel streams, in each of which data are observed sequentially, and each stream has its own change point. At each time point, we must decide whether changes have occurred in the streams. Once a stream is declared to have changed, it is deactivated permanently, so that its future data are no longer collected. This framework is motivated by item quality monitoring in standardised educational tests, where each data stream corresponds to an item in the item pool of a test, the change point may correspond to the leakage of the item, and each time point corresponds to one administration of the item pool. The goal is to detect and remove changed items quickly and accurately, balancing test fairness against the financial cost of maintaining the item pool.
This is a compound decision problem because we may want to optimise compound performance metrics that concern all the streams as a whole. For example, in detecting changed items in educational tests, we are often interested in maximising the expected number of true detections while controlling the quality of the remaining item pool (e.g., the expected proportion of leaked items in the remaining pool). With a compound criterion, the decisions are not independent of each other, so we cannot simply run a classical change detection procedure on each stream separately. We have developed a general decision framework and computationally efficient procedures for making sequential decisions and established optimality results. We have applied the method to item quality monitoring in educational tests.
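The sketch below illustrates only the sequential structure: each active stream carries a CUSUM-type statistic, and a stream is permanently deactivated once its statistic crosses a threshold. The callback get_observations, the Gaussian pre/post-change model, and the fixed threshold are placeholders introduced here; calibrating the decisions to control a compound risk is the part our procedures address and is omitted from this sketch.

```python
import numpy as np

def monitor_streams(get_observations, n_streams, horizon, threshold,
                    pre_mean=0.0, post_mean=1.0, sigma=1.0):
    """Schematic sequential monitoring loop with per-stream CUSUM scores.

    get_observations(t, active) is a hypothetical callback returning a dict
    mapping each currently active stream index to its new observation at
    time t.  A stream is deactivated permanently once its CUSUM statistic
    exceeds `threshold`.  Returns the detection time for each stream
    (None if never flagged).
    """
    active = set(range(n_streams))
    cusum = np.zeros(n_streams)
    detected_at = [None] * n_streams
    for t in range(horizon):
        obs = get_observations(t, active)
        for k in list(active):
            # Log-likelihood ratio of post-change vs pre-change Gaussian model.
            x = obs[k]
            llr = ((x - pre_mean) ** 2 - (x - post_mean) ** 2) / (2 * sigma ** 2)
            cusum[k] = max(0.0, cusum[k] + llr)
            if cusum[k] > threshold:
                detected_at[k] = t
                active.discard(k)   # stop collecting data from this stream
    return detected_at
```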
Relevant Publications:
We have developed a pairwise maximum likelihood (PML) estimation and testing framework for factor analysis models with binary, ordinal, and mixed data, in both exploratory and confirmatory set-ups, when data are missing at random. The advantage of PML over full-information maximum likelihood (FIML) is mainly computational: the computational complexity of FIML grows with the number of factors or observed variables, depending on the model formulation, while that of PML is affected by neither. In addition to estimation and testing (goodness of fit and model selection), we have proposed methods for reducing the computational cost by sampling pairs and for increasing the efficiency of the estimates by weighted sampling.
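For reference, the pairwise log-likelihood takes the following general form (notation introduced here for illustration): the full multivariate likelihood is replaced by a sum of bivariate marginal log-likelihoods over all pairs of manifest variables, with the sum restricted to pairs observed for each respondent under missingness at random.

```latex
% Pairwise log-likelihood: sum of bivariate log-likelihoods over all pairs of
% manifest variables (j, k) and over respondents i for whom both variables
% are observed.
\ell_{\mathrm{PML}}(\theta)
  = \sum_{i=1}^{N} \sum_{j<k} \log \Pr\bigl(Y_{ij}=y_{ij},\, Y_{ik}=y_{ik}\,;\,\theta\bigr).
```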
Relevant Publications:
Interpretability has become increasingly important in machine learning, especially in unsupervised learning, and is closely related to algorithmic fairness. Many unsupervised learning algorithms, such as cluster analysis, principal component analysis, and topic models, can be viewed as methods that estimate certain latent variable models. Thus, structure learning of latent variable models, which dates back to the rotation approach to exploratory factor analysis, provides a route to interpretable unsupervised learning.
Structure learning of latent variable models aims to learn a sparse graphical representation (in the sense of conditional independence) of the relationship between the latent variables and the manifest variables, so that the latent variables can be interpreted through their associated manifest variables. Traditionally, structure learning in exploratory factor analysis is achieved by post-estimation rotation methods. We have developed penalised estimation methods that simultaneously learn the sparse structure and estimate the model parameters for various latent variable models. More recently, we have been studying the connections and differences between the rotation approach and the penalised estimation approach through theoretical and numerical analyses.
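A common form of such penalised estimation, shown below in generic notation introduced here for illustration, attaches a lasso-type penalty to the loading matrix so that small loadings are shrunk exactly to zero; the specific penalties and models used in our work may differ.

```latex
% Penalised estimation of a J x K loading matrix Lambda: an L1 penalty on the
% loadings produces a sparse structure directly, without a post-estimation
% rotation step.
\hat{\Lambda} = \arg\min_{\Lambda}
  \Bigl\{ -\ell(\Lambda) + \gamma \sum_{j=1}^{J}\sum_{k=1}^{K} \lvert \lambda_{jk} \rvert \Bigr\},
  \qquad \gamma > 0.
```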
Relevant Publications:
Latent variable models are typically estimated using the marginal likelihood, where the latent variables are treated as random variables and integrated out. Traditionally, the marginal likelihood is optimised with the Expectation-Maximisation (EM) algorithm, in which the integrals with respect to the latent variables typically need to be approximated numerically. Since the computational complexity of this numerical integration grows exponentially with the number of latent variables, the EM algorithm becomes computationally unaffordable when the dimension of the latent space is large.
To reduce the computational burden, one solution is to use stochastic optimisation methods that replace the numerical integrals with Monte Carlo samples of the latent variables (drawn under the posterior law). We have considered two general computational frameworks: the stochastic EM framework and the stochastic approximation framework. Under the former, we have developed an improved stochastic EM algorithm for solving large-scale full-information item factor analysis problems; this algorithm can be extended to general latent variable models. Under the latter, we have proposed a quasi-Newton stochastic proximal gradient algorithm that achieves a nearly optimal theoretical convergence rate, converges fast in practice, and can handle a wide range of non-smooth penalties and constraints.
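The sketch below shows one iteration of a plain stochastic EM scheme for a logistic (M2PL-type) item factor model, as a minimal illustration of the idea: a Metropolis move replaces the intractable E-step, and a gradient step on the complete-data log-likelihood replaces the M-step. The function name, step sizes, and these simple update rules are choices made here for illustration and do not reproduce the improved algorithms mentioned above.

```python
import numpy as np

def stochastic_em_step(Y, theta, a, d, rng, mh_scale=0.5, lr=0.01):
    """One illustrative stochastic EM iteration for a logistic item factor
    model with binary responses Y (N x J), factor scores theta (N x K),
    loadings a (J x K), and intercepts d (J,).

    Stochastic E-step: one random-walk Metropolis move of theta under its
    posterior (standard normal prior).  M-step: one gradient-ascent update
    of (a, d) given the sampled theta.
    """
    N, J = Y.shape

    def loglik(th):
        logits = th @ a.T + d                                # N x J
        return (Y * logits - np.logaddexp(0.0, logits)).sum(axis=1)

    # --- stochastic E-step ---
    proposal = theta + mh_scale * rng.standard_normal(theta.shape)
    log_ratio = (loglik(proposal) - loglik(theta)
                 - 0.5 * (proposal ** 2).sum(axis=1)
                 + 0.5 * (theta ** 2).sum(axis=1))
    accept = np.log(rng.uniform(size=N)) < log_ratio
    theta = np.where(accept[:, None], proposal, theta)

    # --- M-step: gradient ascent on the complete-data log-likelihood ---
    probs = 1.0 / (1.0 + np.exp(-(theta @ a.T + d)))         # N x J
    resid = Y - probs
    a = a + lr * resid.T @ theta / N                          # J x K step
    d = d + lr * resid.mean(axis=0)
    return theta, a, d
```

In practice, such a step is iterated many times, and the item-parameter trajectory is averaged (or a decreasing step size is used) to stabilise the estimates.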
Relevant Publications:
In the information age, students need not only traditional skills like mathematics and reading but also more advanced skills such as complex problem-solving and collaboration. Unlike the traditional skills, which can be measured by paper-and-pencil tests, these advanced skills are better measured by computer-simulated tasks or educational games. Logfile process data from such simulated tasks or games provide a unique opportunity to learn students' behavioural patterns in task-solving and to measure their proficiency in advanced skills. However, logfile process data have a non-standard, irregular structure (e.g., action sequences of varying length), for which traditional dimension reduction tools and measurement models are no longer suitable, making it challenging to extract useful information from the data. Making use of latent variable modelling and event history analysis, we have developed dimension reduction tools and measurement models for making sense of logfile process data.
Relevant Publications:
Personalised learning refers to instruction in which the pace of learning and the instructional approach are optimised for the needs of each learner. With the latest advances in information technology and data science, personalised learning is becoming possible for anyone with a personal computer, supported by a data-driven recommendation system that automatically schedules the learning sequence. The engine of such a system is a recommendation strategy that, based on data from other learners and the performance of the current learner, recommends suitable learning materials to optimise certain learning outcomes. A powerful engine balances making the best possible recommendations based on current knowledge against exploring new learning trajectories that may pay off.
We have proposed a Markov decision framework for sequential recommendation in a personalised learning system. Under this framework, the optimal recommendation of learning materials becomes a sequential decision rule that maximises a certain utility function (defined at a future time point) that measures the learning achievement. We have proposed a reinforcement learning approach to learn the optimal sequential decision rule from data.
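As a minimal illustration of the reinforcement learning idea (not the specific method in our papers), the sketch below runs tabular Q-learning on a hypothetical environment whose states index a learner's knowledge state and whose actions index learning materials; the env interface, hyper-parameters, and epsilon-greedy exploration are placeholder choices introduced here.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.95, epsilon=0.1, rng=None):
    """Tabular Q-learning sketch for a learning-material recommendation MDP.

    env is a hypothetical environment with reset() -> state and
    step(state, action) -> (next_state, reward, done), where the reward
    reflects the learning outcome being optimised.
    """
    rng = rng or np.random.default_rng()
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy: balance exploiting current knowledge vs exploring.
            if rng.uniform() < epsilon:
                action = int(rng.integers(n_actions))
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done = env.step(state, action)
            # One-step temporal-difference update of the action-value table.
            target = reward + gamma * (0.0 if done else np.max(Q[next_state]))
            Q[state, action] += alpha * (target - Q[state, action])
            state = next_state
    return Q
```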
Relevant Publications:
Most existing latent variable models rely on strong parametric assumptions that may not be flexible enough for various applications. To fill this gap, we are developing general statistical frameworks for linear and non-linear latent variable models. In particular, we are extending the Generalised Additive Models for Location, Scale and Shape (GAMLSS) framework (Rigby and Stasinopoulos, 2005) to models with latent variables (Bartholomew et al., 2011). The proposed framework allows for more flexible functional forms of the measurement equations for the mean and higher-order moments.
We are also developing a semi-parametric multi-dimensional non-linear factor model framework by combining single-index regression with the non-linear multiple-factor model. This model is more flexible than traditional parametric models while enjoying essentially the same interpretation as the classical models. We propose a sieve estimator for this model.
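In generic notation introduced here for illustration, the measurement equation in such a model can be written with a variable-specific unknown link applied to a linear index of the factors, with the link approximated by a sieve such as a spline basis.

```latex
% Single-index-type measurement equation: g_j is an unknown, variable-specific
% link function applied to a linear index of the latent factors theta_i; it is
% approximated by a finite basis expansion (a sieve), e.g. B-splines b_m.
\mathbb{E}\bigl(y_{ij}\mid\theta_i\bigr) = g_j\bigl(a_j^{\top}\theta_i\bigr),
  \qquad g_j(u) \approx \sum_{m=1}^{M} \beta_{jm}\, b_m(u).
```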
Relevant Publications: