The paper “Consensus-Driven Active Model Selection” introduces CODA, a method that selects the best machine learning model using the predictions of many candidate models and minimal labeled data. CODA builds a probabilistic framework that leverages model agreement and disagreement to guide which examples should be labeled next.
## 🚀 Key Concepts

- **Active model selection:** Instead of labeling a full validation set, CODA selectively chooses which data points to label by estimating which would be most informative.
- **Consensus modeling:** CODA uses a Bayesian adaptation of the Dawid-Skene model to evaluate model performance based on agreement among models.
- **PBest distribution:** Represents the current belief about which model is best, updated with each newly labeled data point.

## 🧪 How Does CODA Work?

1. Model predictions are collected over unlabeled data.
2. A consensus label for each data point is computed as a weighted sum of the predictions from all models.
3. Each model is assigned a confusion-matrix prior using a Dirichlet distribution:

$$
\theta_{k, c, c'} = \frac{\beta_{c, c'} + \alpha \hat{M}_{k, c, c'}}{T}
$$

4. CODA maintains a probabilistic estimate over which model is best:

$$
PBest(h_k) = \int_0^1 f_k(x) \prod_{l \ne k} F_l(x) \, dx
$$

5. It selects the next data point to label by maximizing the expected information gain:

$$
EIG(x_i) = H(PBest) - \sum_c \hat{\pi}(c \mid x_i) \, H(PBest^c)
$$

## 📊 Results

- CODA outperforms previous state-of-the-art methods on 18 of 26 benchmark tasks.
- It identifies the best model with up to 70% fewer labels than baselines.
- It is especially effective on multi-class tasks (e.g., DomainNet, WILDS).

## ❗ Limitations

- In binary classification with heavy class imbalance, CODA may underperform due to biased early estimates (e.g., on the CivilComments and CoLA datasets).
- CODA assumes that consensus is meaningful; highly divergent models may reduce its effectiveness.

## 🔮 Future Work

- Better priors from human knowledge or unsupervised features.
- Extension to non-classification tasks and alternative metrics.
- Integration with active learning and active testing frameworks.

## Links

Based on the publication: 📄 arXiv:2507.23771
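To make the selection loop concrete, here is a minimal sketch, not the paper's implementation: it approximates each model's accuracy posterior with a Beta distribution (a simplification of CODA's full Dawid-Skene confusion-matrix treatment), estimates $PBest$ by Monte Carlo, and computes $EIG(x_i)$ by hypothetically labeling a point with each class. All function names, and the Beta simplification itself, are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def consensus(probs, weights):
    # Weighted soft vote over model predictions for one data point.
    # probs: (n_models, n_classes) per-model class probabilities.
    w = np.asarray(weights, dtype=float)
    return (w / w.sum()) @ probs

def pbest(successes, failures, n_samples=20000):
    # Monte Carlo estimate of PBest(h_k) = P(model k has the highest
    # accuracy), using a Beta(successes+1, failures+1) posterior per model.
    draws = rng.beta(successes + 1.0, failures + 1.0,
                     size=(n_samples, len(successes)))
    best = draws.argmax(axis=1)
    return np.bincount(best, minlength=len(successes)) / n_samples

def entropy(p):
    # Shannon entropy of a discrete distribution (0 * log 0 := 0).
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def expected_info_gain(pi_hat, successes, failures, preds_i):
    # EIG(x_i) = H(PBest) - sum_c pi_hat(c|x_i) * H(PBest^c):
    # expected entropy reduction in PBest if x_i were labeled.
    gain = entropy(pbest(successes, failures))
    for c, p_c in enumerate(pi_hat):
        # Hypothetically assign x_i the label c: models that predicted
        # c gain a success, all other models gain a failure.
        correct = (preds_i == c).astype(float)
        gain -= p_c * entropy(pbest(successes + correct,
                                    failures + (1.0 - correct)))
    return gain
```

In a full loop, one would score every unlabeled point with `expected_info_gain`, query the label for the argmax, update each model's success/failure counts, and repeat until `pbest` concentrates on a single model.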