Generalised factor models are a family of linear and non-linear latent variable models widely used to model the joint distribution of multivariate data, with wide applications in the social and behavioural sciences. With advances in information technology, large-scale data, involving large numbers of observations and manifest variables, are increasingly common. For such data, traditional generalised factor models and their estimation procedures are no longer suitable, due to several statistical and computational barriers brought by the high dimensionality of the data.
To make generalised factor models scalable to high-dimensional multivariate data, we propose to revisit the joint maximum likelihood estimator, a vintage estimation approach in the latent variable model literature. This approach treats the latent variables (factors) as fixed parameters rather than as random variables, in contrast with the convention in modern psychometrics. The joint maximum likelihood estimator is statistically inconsistent under a low-dimensional setting where the number of manifest variables is fixed, because the number of model parameters diverges with the sample size. The story changes under a high-dimensional setting where the numbers of observations and manifest variables are both large: the estimator becomes consistent as the two grow to infinity simultaneously, even though the number of model parameters still diverges. We have developed theories and methods for the estimation and model selection of generalised factor models under high-dimensional regimes. This line of research has several future directions, especially regarding statistical inference based on the joint maximum likelihood estimator.
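To fix ideas, the display below sketches the joint log-likelihood being maximised; the notation (responses y_ij, factor scores theta_i, loadings a_j, intercepts d_j, and an exponential-family density f) is generic and introduced here for illustration.

```latex
% Joint log-likelihood for N observations and J manifest variables, treating
% both the item parameters (a_j, d_j) and the factor scores theta_i as fixed
% parameters; the JML estimator maximises this over all of them jointly.
\ell_{\mathrm{JML}}\bigl(\{\theta_i\},\{a_j,d_j\}\bigr)
  = \sum_{i=1}^{N}\sum_{j=1}^{J} \log f\bigl(y_{ij}\mid d_j + a_j^{\top}\theta_i\bigr).
```

Because each observation contributes its own theta_i, the parameter count grows with the sample size, which is the source of the inconsistency in the fixed-dimensional regime described above.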
Relevant Publications:
Outliers in multivariate data -- extreme observations or manifest variables -- are commonly encountered. One example is cheating in standardised educational tests, where cheating test takers and leaked items are regarded as outlying observations and manifest variables that need to be detected. Unlike in regression analysis, outliers are less well defined and harder to detect in multivariate analysis settings.
We have developed methods for detecting outliers in multivariate data. In our framework, a latent variable model is imposed as a baseline model based on substantive knowledge and historical data, and outliers are defined as observations and manifest variables that deviate from the baseline model. We have also developed a compound decision theory for detecting outlying observations and manifest variables while controlling FDR-type compound risks.
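As a concrete (and heavily simplified) illustration of FDR-type control at the final decision stage, the sketch below applies a standard Benjamini-Hochberg step-up rule to per-observation p-values; it assumes the fitted baseline model supplies such p-values, and it is not the compound decision procedure developed in our work.

```python
import numpy as np

def detect_outlying_rows(p_values, alpha=0.05):
    """Benjamini-Hochberg-style thresholding of per-observation p-values.

    p_values[i] is assumed to measure how much observation i deviates from
    the fitted baseline latent variable model (smaller = more outlying).
    Returns the indices flagged as outliers at nominal FDR level alpha.
    """
    p = np.asarray(p_values)
    n = len(p)
    order = np.argsort(p)
    # Largest k such that p_(k) <= k * alpha / n.
    below = p[order] <= alpha * (np.arange(1, n + 1) / n)
    if not below.any():
        return np.array([], dtype=int)
    k = np.max(np.nonzero(below)[0]) + 1
    return order[:k]
```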
Relevant Publications:
We have developed a sequential decision framework for detecting changes in parallel data streams, a problem widely encountered when analysing large-scale real-time streaming data. In this problem, there are multiple parallel streams, in each of which data are observed sequentially, and each stream has its own change point. At each time point, we must decide whether changes have occurred in the streams. Once a stream is declared to have changed, it is deactivated permanently, so that its future data are no longer collected. This framework is motivated by item quality monitoring in standardised educational tests, where each data stream corresponds to an item in the item pool of a test, the change point may correspond to the leakage of the item, and each time point corresponds to one administration of the item pool. The goal is to detect and remove changed items quickly and accurately, balancing test fairness against the financial cost of maintaining the item pool.
This is a compound decision problem because we may want to optimise compound performance metrics that concern all the streams as a whole. For example, in detecting changed items in educational tests, we are often interested in maximising the expected number of true detections while controlling the quality of the remaining item pool (e.g., the expected proportion of leaked items in the remaining pool). With a compound criterion, the decisions are not independent of each other, so we cannot simply run a classical change detection procedure on each stream separately. We have developed a general decision framework and computationally efficient procedures for making sequential decisions and established optimality results. We have applied the method to item quality monitoring in educational tests.
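The sketch below illustrates only the sequential structure: each active stream carries a CUSUM-type statistic, and a stream is permanently deactivated once its statistic crosses a threshold. The callback get_observations, the Gaussian pre/post-change model, and the fixed threshold are placeholders introduced here; calibrating the decisions to control a compound risk is the part our procedures address and is omitted from this sketch.

```python
import numpy as np

def monitor_streams(get_observations, n_streams, horizon, threshold,
                    pre_mean=0.0, post_mean=1.0, sigma=1.0):
    """Schematic sequential monitoring loop with per-stream CUSUM scores.

    get_observations(t, active) is a hypothetical callback returning a dict
    mapping each currently active stream index to its new observation at
    time t.  A stream is deactivated permanently once its CUSUM statistic
    exceeds `threshold`.  Returns the detection time for each stream
    (None if never flagged).
    """
    active = set(range(n_streams))
    cusum = np.zeros(n_streams)
    detected_at = [None] * n_streams
    for t in range(horizon):
        obs = get_observations(t, active)
        for k in list(active):
            # Log-likelihood ratio of post-change vs pre-change Gaussian model.
            x = obs[k]
            llr = ((x - pre_mean) ** 2 - (x - post_mean) ** 2) / (2 * sigma ** 2)
            cusum[k] = max(0.0, cusum[k] + llr)
            if cusum[k] > threshold:
                detected_at[k] = t
                active.discard(k)   # stop collecting data from this stream
    return detected_at
```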
Relevant Publications:
We have developed a pairwise maximum likelihood (PML) estimation and testing framework for factor analysis models with binary, ordinal, and mixed data, in both exploratory and confirmatory set-ups, when data are missing at random. The advantage of PML over full-information maximum likelihood (FIML) is mainly computational: the computational complexity of FIML grows with the number of factors or observed variables, depending on the model formulation, while that of PML is affected by neither. In addition to estimation and testing (goodness of fit and model selection), we have proposed methods for reducing the computational cost by sampling pairs and for increasing the efficiency of the estimates by weighted sampling.
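For reference, the pairwise log-likelihood takes the following general form (notation introduced here for illustration): the full multivariate likelihood is replaced by a sum of bivariate marginal log-likelihoods over all pairs of manifest variables, with the sum restricted to pairs observed for each respondent under missingness at random.

```latex
% Pairwise log-likelihood: sum of bivariate log-likelihoods over all pairs of
% manifest variables (j, k) and over respondents i for whom both variables
% are observed.
\ell_{\mathrm{PML}}(\theta)
  = \sum_{i=1}^{N} \sum_{j<k} \log \Pr\bigl(Y_{ij}=y_{ij},\, Y_{ik}=y_{ik}\,;\,\theta\bigr).
```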
Relevant Publications:
Interpretability has become increasingly important in machine learning, especially in unsupervised learning, and is closely related to algorithmic fairness. Many unsupervised learning algorithms, such as cluster analysis, principal component analysis, and topic models, can be viewed as methods that estimate certain latent variable models. Thus, structure learning of latent variable models, which dates back to the rotation approach to exploratory factor analysis, provides a route to interpretable unsupervised learning.
Structure learning of latent variable models aims to learn a sparse graphical representation (in the sense of conditional independence) of the relationship between the latent variables and the manifest variables, so that the latent variables can be interpreted through their associated manifest variables. Traditionally, structure learning in exploratory factor analysis is achieved by post-estimation rotation methods. We have developed penalised estimation methods that simultaneously learn the sparse structure and estimate the model parameters for various latent variable models. More recently, we have been studying the connections and differences between the rotation approach and the penalised estimation approach through theoretical and numerical analyses.
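A common form of such penalised estimation, shown below in generic notation introduced here for illustration, attaches a lasso-type penalty to the loading matrix so that small loadings are shrunk exactly to zero; the specific penalties and models used in our work may differ.

```latex
% Penalised estimation of a J x K loading matrix Lambda: an L1 penalty on the
% loadings produces a sparse structure directly, without a post-estimation
% rotation step.
\hat{\Lambda} = \arg\min_{\Lambda}
  \Bigl\{ -\ell(\Lambda) + \gamma \sum_{j=1}^{J}\sum_{k=1}^{K} \lvert \lambda_{jk} \rvert \Bigr\},
  \qquad \gamma > 0.
```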
Relevant Publications:
Latent variable models are typically estimated using the marginal likelihood, where the latent variables are treated as random variables and integrated out. Traditionally, the marginal likelihood is optimised with the Expectation-Maximisation (EM) algorithm, in which the integrals with respect to the latent variables typically need to be approximated numerically. Since the computational complexity of this numerical integration grows exponentially with the number of latent variables, the EM algorithm becomes computationally unaffordable when the dimension of the latent space is large.
To reduce the computational burden, one solution is to use stochastic optimisation methods that replace the numerical integrals with Monte Carlo samples of the latent variables (drawn under the posterior law). We have considered two general computational frameworks: the stochastic EM framework and the stochastic approximation framework. Under the former, we have developed an improved stochastic EM algorithm for solving large-scale full-information item factor analysis problems; this algorithm can be extended to general latent variable models. Under the latter, we have proposed a quasi-Newton stochastic proximal gradient algorithm that achieves a nearly optimal theoretical convergence rate, converges fast in practice, and can handle a wide range of non-smooth penalties and constraints.
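The sketch below shows one iteration of a plain stochastic EM scheme for a logistic (M2PL-type) item factor model, as a minimal illustration of the idea: a Metropolis move replaces the intractable E-step, and a gradient step on the complete-data log-likelihood replaces the M-step. The function name, step sizes, and these simple update rules are choices made here for illustration and do not reproduce the improved algorithms mentioned above.

```python
import numpy as np

def stochastic_em_step(Y, theta, a, d, rng, mh_scale=0.5, lr=0.01):
    """One illustrative stochastic EM iteration for a logistic item factor
    model with binary responses Y (N x J), factor scores theta (N x K),
    loadings a (J x K), and intercepts d (J,).

    Stochastic E-step: one random-walk Metropolis move of theta under its
    posterior (standard normal prior).  M-step: one gradient-ascent update
    of (a, d) given the sampled theta.
    """
    N, J = Y.shape

    def loglik(th):
        logits = th @ a.T + d                                # N x J
        return (Y * logits - np.logaddexp(0.0, logits)).sum(axis=1)

    # --- stochastic E-step ---
    proposal = theta + mh_scale * rng.standard_normal(theta.shape)
    log_ratio = (loglik(proposal) - loglik(theta)
                 - 0.5 * (proposal ** 2).sum(axis=1)
                 + 0.5 * (theta ** 2).sum(axis=1))
    accept = np.log(rng.uniform(size=N)) < log_ratio
    theta = np.where(accept[:, None], proposal, theta)

    # --- M-step: gradient ascent on the complete-data log-likelihood ---
    probs = 1.0 / (1.0 + np.exp(-(theta @ a.T + d)))         # N x J
    resid = Y - probs
    a = a + lr * resid.T @ theta / N                          # J x K step
    d = d + lr * resid.mean(axis=0)
    return theta, a, d
```

In practice, such a step is iterated many times, and the item-parameter trajectory is averaged (or a decreasing step size is used) to stabilise the estimates.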
Relevant Publications:
In the information age, students need not only traditional skills like mathematics and reading but also more advanced skills such as complex problem-solving and collaboration. Unlike the traditional skills, which can be measured by paper-and-pencil tests, these advanced skills are better measured by computer-simulated tasks or educational games. Logfile process data from such simulated tasks or games provide a unique opportunity to learn students' behavioural patterns in task-solving and to measure their proficiency in advanced skills. However, logfile process data have a non-standard, irregular structure (e.g., action sequences of varying length), for which traditional dimension reduction tools and measurement models are no longer suitable, making it challenging to extract useful information from the data. Making use of latent variable modelling and event history analysis, we have developed dimension reduction tools and measurement models for making sense of logfile process data.
Relevant Publications:
Personalised learning refers to instruction in which the pace of learning and the instructional approach are optimised for the needs of each learner. With the latest advances in information technology and data science, personalised learning is becoming possible for anyone with a personal computer, supported by a data-driven recommendation system that automatically schedules the learning sequence. The engine of such a system is a recommendation strategy that, based on data from other learners and the performance of the current learner, recommends suitable learning materials to optimise certain learning outcomes. A powerful engine balances making the best possible recommendations based on current knowledge against exploring new learning trajectories that may pay off.
We have proposed a Markov decision framework for sequential recommendation in a personalised learning system. Under this framework, the optimal recommendation of learning materials becomes a sequential decision rule that maximises a certain utility function (defined at a future time point) that measures the learning achievement. We have proposed a reinforcement learning approach to learn the optimal sequential decision rule from data.
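As a minimal illustration of the reinforcement learning idea (not the specific method in our papers), the sketch below runs tabular Q-learning on a hypothetical environment whose states index a learner's knowledge state and whose actions index learning materials; the env interface, hyper-parameters, and epsilon-greedy exploration are placeholder choices introduced here.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.95, epsilon=0.1, rng=None):
    """Tabular Q-learning sketch for a learning-material recommendation MDP.

    env is a hypothetical environment with reset() -> state and
    step(state, action) -> (next_state, reward, done), where the reward
    reflects the learning outcome being optimised.
    """
    rng = rng or np.random.default_rng()
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy: balance exploiting current knowledge vs exploring.
            if rng.uniform() < epsilon:
                action = int(rng.integers(n_actions))
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done = env.step(state, action)
            # One-step temporal-difference update of the action-value table.
            target = reward + gamma * (0.0 if done else np.max(Q[next_state]))
            Q[state, action] += alpha * (target - Q[state, action])
            state = next_state
    return Q
```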
Relevant Publications:
Most existing latent variable models rely on strong parametric assumptions that may not be flexible enough for various applications. To fill this gap, we are developing general statistical frameworks for linear and non-linear latent variable models. In particular, we are extending the Generalised Additive Models for Location, Scale and Shape (GAMLSS) framework (Rigby and Stasinopoulos, 2005) to models with latent variables (Bartholomew et al., 2011). The proposed framework allows for more flexible functional forms of the measurement equations for the mean and higher-order moments.
We are also developing a semi-parametric multi-dimensional non-linear factor model framework by combining single-index regression with the non-linear multiple-factor model. This model is more flexible than traditional parametric models while enjoying essentially the same interpretation as the classical models. We propose a sieve estimator for this model.
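In generic notation introduced here for illustration, the measurement equation in such a model can be written with a variable-specific unknown link applied to a linear index of the factors, with the link approximated by a sieve such as a spline basis.

```latex
% Single-index-type measurement equation: g_j is an unknown, variable-specific
% link function applied to a linear index of the latent factors theta_i; it is
% approximated by a finite basis expansion (a sieve), e.g. B-splines b_m.
\mathbb{E}\bigl(y_{ij}\mid\theta_i\bigr) = g_j\bigl(a_j^{\top}\theta_i\bigr),
  \qquad g_j(u) \approx \sum_{m=1}^{M} \beta_{jm}\, b_m(u).
```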
Relevant Publications: