Have you ever wondered why the latest artificial intelligence models, like GPT-4 or Claude 3 Opus, are so enormous? We’re talking hundreds of billions or even trillions of parameters. These are digital monsters requiring massive amounts of energy and data-center-level infrastructure.
For years, AI followed a simple rule:
“Bigger means better.”
Want a smarter model? Add more layers, more data, more GPUs.
But — what if this is a dead end?
What if instead of one giant that tries to know everything, it’s better to build a team of experts, agile and specialized?
This is the vision introduced in the recent work from Purdue University:
“Experts are all you need: A Composable Framework for Large Language Model Inference” (arXiv:2511.22955).
The proposed architecture, Comp-LLM, allows combining small and mid-sized models into an intelligent, composable system that:
- is faster,
- is cheaper,
- runs in parallel,
- and rivals giant models in quality.
Sounds like a revolution?
This article walks through the idea, from simple analogies to the mathematical and architectural details.
Comp-LLM “in simple terms”
The Problem: One-Man Band vs. Professional Crew
Imagine you’re doing a major home renovation. Two options:
Option 1: Monolithic Model (GPT-4, Llama-70B)
You hire one person — Mr. Jack-of-all-Trades.
He knows everything: plumbing, electricity, painting, poetry, quantum physics.
Sounds great, but:
- he’s slow,
- he’s gigantic,
- his “brain” is full of knowledge irrelevant to the current task,
- to fix a faucet, he must search his entire “universe of knowledge.”
This is your classic LLM.
Option 2: Agent Systems (AutoGen, ReAct)
You hire a manager who summons specialists one by one:
- Plumber arrives → works → leaves.
- Electrician arrives → works → leaves.
Quality is good, but it takes forever.
Everyone waits for the previous one.
These are classic agent systems — sequential.
Option 3: Comp-LLM — A New Paradigm
Here comes the innovation.
You’ve got a Super-Manager (Sub-query Generator).
You give a task:
“Fix the faucet and paint the living room.”
How does it work?
- Splits the task into two subproblems: plumbing + painting.
- Checks dependencies — they’re independent.
- Runs both in parallel.
- Collects two results.
- Merges them into a final answer.
Sounds simple?
That’s Comp-LLM: parallelism + specialization + intelligent routing.
Architecture & Technical Details
Comp-LLM consists of three pillars:
1. Sub-query Generator
Responsible for:
decomposing the original query $Q$ into sub-queries:
$$ Q \rightarrow \{\, q_1, q_2, \dots, q_n \,\} $$
building a dependency graph (DAG),
routing queries to the right experts.
Routing is zero-shot (no training required), based on embedding similarity; a minimal sketch follows below:
$$ \text{Expert}(q_i) = \arg\max_{E_j} \frac{v_{q_i} \cdot v_{E_j}}{\lVert v_{q_i} \rVert \, \lVert v_{E_j} \rVert} $$
The similarity threshold is 0.7.
This means:
- you can add any model as an expert,
- no system-wide retraining is needed,
- experts may come from different sources (Meta, Google, HF).
This is true composability.
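To make the routing step concrete, here is a minimal sketch in Python. The paper only specifies zero-shot cosine-similarity routing with a 0.7 threshold; the `cosine` and `route` helpers, the expert names, and the fallback behavior below threshold are illustrative assumptions:

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def route(v_query: np.ndarray,
          expert_embeddings: dict[str, np.ndarray],
          threshold: float = 0.7) -> str | None:
    """Zero-shot routing: pick the expert whose profile embedding is most
    similar to the sub-query embedding, if it clears the threshold."""
    best = max(expert_embeddings, key=lambda n: cosine(v_query, expert_embeddings[n]))
    # Below the threshold no expert matches; what happens then
    # (e.g. a generalist fallback) is our assumption, not the paper's.
    return best if cosine(v_query, expert_embeddings[best]) >= threshold else None

# Toy usage with hand-made 3-d "embeddings" (a real system would use a
# sentence-embedding model to produce these vectors).
experts = {
    "plumbing": np.array([1.0, 0.1, 0.0]),
    "painting": np.array([0.0, 1.0, 0.1]),
}
print(route(np.array([0.9, 0.2, 0.0]), experts))  # -> "plumbing"
```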
2. Query Executor (Parallel Engine)
Executes sub-queries in parallel while obeying the DAG (see the sketch after this list):
- finds nodes with in-degree $0$,
- dispatches them to experts,
- frees dependent nodes once their results arrive.
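A minimal sketch of that dispatch loop, assuming each expert call is a blocking function we can run on a thread pool (the execution backend and function names are our assumptions, not the paper's):

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def execute_dag(nodes, parents, run_expert):
    """Run sub-queries in parallel while respecting DAG dependencies.

    nodes:      iterable of sub-query ids
    parents:    dict mapping node id -> set of prerequisite node ids
    run_expert: callable(node_id, parent_results) -> expert's answer
    """
    results, pending = {}, {}
    remaining = set(nodes)
    with ThreadPoolExecutor() as pool:
        while remaining or pending:
            # Dispatch every node whose prerequisites are all completed
            # (i.e. in-degree 0 in the remaining graph).
            ready = [n for n in remaining if parents[n].issubset(results)]
            for n in ready:
                remaining.discard(n)
                deps = {p: results[p] for p in parents[n]}
                pending[pool.submit(run_expert, n, deps)] = n
            if not pending:  # nothing running and nothing ready: cycle
                raise ValueError("dependency graph contains a cycle")
            # Block until at least one expert finishes, freeing its children.
            done, _ = wait(pending, return_when=FIRST_COMPLETED)
            for fut in done:
                results[pending.pop(fut)] = fut.result()
    return results
```

Here `run_expert` would wrap the actual LLM call for the routed expert; thread-based parallelism fits because each call is network- or I/O-bound rather than CPU-bound.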
Speedup observed in the paper:
1.1× – 1.7× faster than sequential agent systems.
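That range is consistent with the standard critical-path bound for DAG execution (a general observation, not a formula from the paper): with per-sub-query latencies $t_i$,
$$ \text{speedup} \;\le\; \frac{\sum_i t_i}{\max_{P \in \text{paths}(G)} \sum_{i \in P} t_i} $$
so queries whose sub-queries are mostly independent benefit the most, while a fully sequential dependency chain gains nothing.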
3. Response Aggregator
Combines:
- the original query $Q$,
- expert outputs,
- dependency graph context,
into a single coherent answer.
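As a rough illustration, aggregation can be as simple as assembling one final prompt for a generalist model. The template below is our assumption, not the paper's exact format:

```python
def build_aggregation_prompt(query: str, expert_outputs: dict[str, str],
                             edges: list[tuple[str, str]]) -> str:
    """Assemble the aggregator's input: original query, per-expert answers,
    and the dependency edges as lightweight graph context."""
    lines = [f"Original question: {query}", "", "Expert findings:"]
    for sub_query, answer in expert_outputs.items():
        lines.append(f"- [{sub_query}] {answer}")
    if edges:
        lines.append("")
        lines.append("Dependencies (answered in this order):")
        lines += [f"- {src} -> {dst}" for src, dst in edges]
    lines.append("")
    lines.append("Combine the findings above into one coherent answer.")
    return "\n".join(lines)
```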
Why does this work?
Because it separates concerns:
- Router → general reasoning
- Experts → specialized domain knowledge
- Aggregator → final logic
Monolithic models must do all three at the same time — and often fail.
Results: The Numbers Speak
Benchmark: MultiExpertQA-P
| Model | Parameters | F1 Score |
|---|---|---|
| Llama-2 7B | 7B | 0.56 |
| Llama-2 13B | 13B | 0.67 |
| Llama-2 34B | 34B | 0.75 |
| Llama-2 70B | 70B | 0.85 |
| Comp-LLM | ~35B | 0.83 |
Conclusions:
- A composite system totaling ~35B parameters nearly matches the 70B monolith (0.83 vs. 0.85 F1),
- and it clearly outperforms the monolithic 34B model (0.83 vs. 0.75),
- with a resource reduction of 1.67×–3.56× at comparable quality.
Real-World Applications
1. Business (On-Premise AI)
You can deploy:
- an HR expert,
- a finance expert,
- an IT expert,
running in parallel on affordable hardware.
2. Medicine
Experts in:
- cardiology,
- endocrinology,
- neurology,
can analyze a patient simultaneously.
Because each sub-answer comes from a named expert, the system is explainable (XAI) by construction.
3. Education
Question:
“How did the Industrial Revolution influence Victorian literature?”
The system calls:
- a history expert,
- a literature expert.
The result is interdisciplinary and accurate.
4. Edge AI — Your Phone as ‘AI with Swap-In Brains’
A smartphone loads on demand:
- a cooking expert,
- a navigation expert,
- a music expert.
Comp-LLM provides the architecture for such modularity.
Summary: Why Comp-LLM Matters
Because it demonstrates that:
- Architecture > Parameters
- Synergy > Monoliths
- Modules > Giants
This is the future of AI:
- modular,
- composable,
- energy-efficient,
- open.
Appendix: Mathematical Analysis (for experts)
1. Generating the DAG
The Sub-query Generator maps:
$$ f_{SG}: Q \rightarrow G(V, E) $$
where:
- $V = \{ s_1, \dots, s_k \}$ — sub-queries,
- $E = \{ (s_i, s_j) \mid s_i \text{ is required for } s_j \}$.
Classic CoT produces a chain:
$$ s_1 \rightarrow s_2 \rightarrow \dots \rightarrow s_k $$
Comp-LLM produces a true DAG, enabling parallel execution.
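For the renovation example from earlier, CoT would force an arbitrary order on the two jobs, while Comp-LLM's graph has no edge between them:
$$ V = \{\, s_{\text{plumbing}},\; s_{\text{painting}} \,\}, \qquad E = \varnothing $$
Both nodes are immediately ready, so both experts run at the same time.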
2. Scheduling
The set of ready tasks at time $t$:
$$ C_t = \{\, s \in V \setminus \text{Completed} \;\mid\; \forall p \in \text{Parents}(s):\; p \in \text{Completed} \,\} $$
GPU memory constraint on the scheduled subset $S_t \subseteq C_t$:
$$ \sum_{s \in S_t} M(\text{Expert}(s)) \le R_{\text{total}} $$
Solution: a greedy heuristic (a variant of the Resource-Constrained Project Scheduling Problem, RCPSP).
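A minimal sketch of one such greedy step, assuming known per-expert memory footprints; the particular ordering (smallest footprint first) is our assumption, not the paper's:

```python
def greedy_schedule(ready, memory_of, budget):
    """Greedily pack ready sub-queries onto the GPU: take tasks in order of
    increasing expert memory footprint until the budget R_total is exhausted.

    ready:     ready sub-query ids (the set C_t)
    memory_of: dict mapping sub-query id -> M(Expert(s)) in GB
    budget:    total GPU memory R_total in GB
    """
    scheduled, used = [], 0.0
    for task in sorted(ready, key=lambda s: memory_of[s]):
        if used + memory_of[task] <= budget:
            scheduled.append(task)
            used += memory_of[task]
    return scheduled  # the subset S_t actually dispatched this round

# Example: three ready sub-queries, 24 GB of GPU memory.
print(greedy_schedule(["q1", "q2", "q3"],
                      {"q1": 14.0, "q2": 8.0, "q3": 14.0}, 24.0))
# -> ['q2', 'q1']  (q3 waits for the next round)
```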
Links
- 📄 “Experts are all you need: A Composable Framework for Large Language Model Inference”, arXiv:2511.22955 (PDF)