Have you ever wondered why the latest artificial intelligence models, like GPT-4 or Claude 3 Opus, are so enormous? We’re talking hundreds of billions or even trillions of parameters. These are digital monsters requiring massive amounts of energy and data-center-level infrastructure.

For years, AI followed a simple rule:
“Bigger means better.”
Want a smarter model? Add more layers, more data, more GPUs.

But — what if this is a dead end?

What if, instead of one giant that tries to know everything, we built a team of agile, specialized experts?

This is the vision introduced in the recent work from Purdue University:

“Experts are all you need: A Composable Framework for Large Language Model Inference” (arXiv:2511.22955).

The proposed architecture, Comp-LLM, allows combining small and mid-sized models into an intelligent, composable system that:

  • is faster,
  • is cheaper,
  • runs in parallel,
  • and rivals giant models in quality.

Sounds like a revolution?
This article walks through everything — from analogies to mathematical and architectural details.


Comp-LLM “in simple terms”

The Problem: One-Man Band vs. Professional Crew

Imagine you’re doing a major home renovation. Two options:

Option 1: Monolithic Model (GPT-4, Llama-70B)

You hire one person — Mr. Jack-of-all-Trades.

He knows everything: plumbing, electricity, painting, poetry, quantum physics.
Sounds great, but:

  • he’s slow,
  • he’s gigantic,
  • his “brain” is full of knowledge irrelevant to the current task,
  • to fix a faucet, he must search his entire “universe of knowledge.”

This is your classic LLM.


Option 2: Agent Systems (AutoGen, ReAct)

You hire a manager who summons specialists one by one:

  1. Plumber arrives → works → leaves.
  2. Electrician arrives → works → leaves.

Quality is good, but it takes forever.
Each specialist waits for the previous one to finish.

These are classic agent systems — sequential.


Option 3: Comp-LLM — A New Paradigm

Here comes the innovation.

You’ve got a Super-Manager (Sub-query Generator).

You give a task:
“Fix the faucet and paint the living room.”

How does it work?

  1. Splits the task into two subproblems: plumbing + painting.
  2. Checks dependencies — they’re independent.
  3. Runs both in parallel.
  4. Collects two results.
  5. Merges them into a final answer.

Sounds simple?
That’s Comp-LLM: parallelism + specialization + intelligent routing.


Architecture & Technical Details

Comp-LLM consists of three pillars:

1. Sub-query Generator

Responsible for:

  • decomposing the original query $Q$ into sub-queries:

    $$ Q \rightarrow \{ q_1, q_2, \dots, q_n \} $$

  • building a dependency graph (DAG),

  • routing queries to the right experts.

Routing is zero-shot (no training), based on embeddings:

$$ \text{Expert}(q_i) = \arg\max_{E_j} \frac{v_{q_i} \cdot v_{E_j}}{\|v_{q_i}\| \, \|v_{E_j}\|} $$

Similarity threshold: 0.7 (a sub-query is matched to an expert only if their cosine similarity clears this value).
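Here is a minimal sketch of that routing rule in Python. The expert registry, the embedding dimension, and the random placeholder vectors are illustrative assumptions; in a real system the embeddings would come from an actual sentence encoder:

```python
import numpy as np

# Hypothetical expert registry: each expert is represented by an embedding
# of its domain description. Random vectors are placeholders only; a real
# system would encode the descriptions with a sentence encoder.
EXPERTS = {
    "plumbing": np.random.rand(384),
    "painting": np.random.rand(384),
    "finance":  np.random.rand(384),
}
THRESHOLD = 0.7  # similarity cutoff reported in the paper

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def route(query_emb: np.ndarray) -> str | None:
    """Pick the arg-max expert; below the threshold, return None.
    What happens then (e.g. a generalist fallback) is an assumption,
    not something the paper summary above specifies."""
    best = max(EXPERTS, key=lambda name: cosine(query_emb, EXPERTS[name]))
    return best if cosine(query_emb, EXPERTS[best]) >= THRESHOLD else None
```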

This means:

  • you can add any model as an expert,
  • no system-wide retraining is needed,
  • experts may come from different sources (Meta, Google, HF).

This is true composability.


2. Query Executor (Parallel Engine)

Executes sub-queries in parallel, obeying the DAG:

  • finds nodes with in-degree $0$,
  • dispatches them to experts,
  • frees dependent nodes once results arrive.
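A minimal sketch of this scheduling loop, assuming a toy dependency graph and a stubbed `ask_expert` call (both are illustrative, not the paper's code):

```python
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

# Toy DAG: each node maps to the set of parents that must finish first.
DEPS = {
    "q1": set(),
    "q2": set(),
    "q3": {"q1", "q2"},
}

def ask_expert(node: str) -> str:
    """Stand-in for a real expert-model call."""
    return f"answer({node})"

def execute(deps: dict[str, set[str]]) -> dict[str, str]:
    indeg = {n: len(ps) for n, ps in deps.items()}
    children = {n: {c for c, ps in deps.items() if n in ps} for n in deps}
    results: dict[str, str] = {}
    with ThreadPoolExecutor() as pool:
        # Start with all nodes that have in-degree 0.
        running = {pool.submit(ask_expert, n): n
                   for n, d in indeg.items() if d == 0}
        while running:
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for fut in done:
                node = running.pop(fut)
                results[node] = fut.result()
                for child in children[node]:  # free dependent nodes
                    indeg[child] -= 1
                    if indeg[child] == 0:
                        running[pool.submit(ask_expert, child)] = child
    return results

print(execute(DEPS))  # q1 and q2 run in parallel; q3 runs once both finish
```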

Speedup observed in the paper:

1.1× – 1.7× faster than sequential agent systems.


3. Response Aggregator

Combines:

  • the original query $Q$,
  • expert outputs,
  • dependency graph context,

into a single coherent answer.
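A sketch of what the aggregation step might look like. The prompt template and the `call_llm` stub are assumptions; the paper's actual prompt is not reproduced here:

```python
def call_llm(prompt: str) -> str:
    """Stand-in for the aggregator model call."""
    return f"[aggregated answer for a prompt of {len(prompt)} chars]"

def aggregate(query: str, answers: dict[str, str],
              deps: dict[str, set[str]]) -> str:
    # Serialize each sub-answer together with its dependency context.
    context = "\n".join(
        f"- {node} (depends on: {', '.join(sorted(parents)) or 'nothing'}): "
        f"{answers[node]}"
        for node, parents in deps.items()
    )
    prompt = (
        f"Original question: {query}\n\n"
        f"Expert findings:\n{context}\n\n"
        "Combine these findings into one coherent answer."
    )
    return call_llm(prompt)
```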


Why does this work?

Because it separates concerns:

  • Router → general reasoning
  • Experts → specialized domain knowledge
  • Aggregator → final logic

Monolithic models must do all three at the same time — and often fail.


Results: The Numbers Speak

Benchmark: MultiExpertQA-P

| Model       | Parameters | F1 Score |
|-------------|------------|----------|
| Llama-2 7B  | 7B         | 0.56     |
| Llama-2 13B | 13B        | 0.67     |
| Llama-2 34B | 34B        | 0.75     |
| Llama-2 70B | 70B        | 0.85     |
| Comp-LLM    | ~35B       | 0.83     |

Conclusions:

  • A composite system totaling ~35B parameters nearly matches the 70B model.
  • And it crushes the monolithic 34B.
  • Resource reduction: 1.67× – 3.56× at comparable quality.

Real-World Applications

1. Business (On-Premise AI)

You can deploy:

  • an HR expert,
  • a finance expert,
  • an IT expert,

running in parallel on affordable hardware.


2. Medicine

Experts in:

  • cardiology,
  • endocrinology,
  • neurology,

can analyze a patient simultaneously.
The system is also explainable (XAI): each expert's sub-answer remains visible, so you can trace which specialist contributed what.


3. Education

Question:
“How did the Industrial Revolution influence Victorian literature?”

The system calls:

  • a history expert,
  • a literature expert.

The result is interdisciplinary and accurate.


4. Edge AI — Your Phone as ‘AI with Swap-In Brains’

A smartphone loads on demand:

  • a cooking expert,
  • a navigation expert,
  • a music expert.

Comp-LLM provides the architecture for such modularity.


Summary: Why Comp-LLM Matters

Because it demonstrates that:

  • Architecture > Parameters
  • Synergy > Monoliths
  • Modules > Giants

This is the future of AI:

  • modular,
  • composable,
  • energy-efficient,
  • open.

Appendix: Mathematical Analysis (for experts)

1. Generating the DAG

The Sub-query Generator maps:

$$ f_{SG}: Q \rightarrow G(V, E) $$

where:

  • $V = \{ s_1, \dots, s_k \}$ is the set of sub-queries,
  • $E = \{ (s_i, s_j) \mid s_i \text{ is required for } s_j \}$ is the set of dependency edges.

Classic CoT produces a chain:

$$ s_1 \rightarrow s_2 \rightarrow \dots \rightarrow s_k $$

Comp-LLM produces a true DAG, enabling parallel execution.
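As a toy contrast, here is the dependency structure for the Education example from earlier, in the same dict form used by the executor sketch above (the exact sub-query split is an assumption):

```python
# Chain-of-Thought: a strict chain, so the critical path has length 3.
cot_chain = {"s1": set(), "s2": {"s1"}, "s3": {"s2"}}

# Comp-LLM DAG: s1 (history) and s2 (literature) are independent and can
# run in parallel; only the synthesis step s3 waits for both, so the
# critical path has length 2.
comp_llm_dag = {
    "s1": set(),         # effects of the Industrial Revolution (history expert)
    "s2": set(),         # themes of Victorian literature (literature expert)
    "s3": {"s1", "s2"},  # interdisciplinary synthesis
}
```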


2. Scheduling

The set of ready tasks:

$$ C_t = \{\, s \in V \setminus \text{Completed} \mid \forall p \in \text{Parents}(s): p \in \text{Completed} \,\} $$

GPU memory constraint, where $S_t \subseteq C_t$ is the subset of ready tasks actually dispatched at step $t$:

$$ \sum_{s \in S_t} M(\text{Expert}(s)) \le R_{\text{total}} $$

Solution: a greedy heuristic (a variant of the Resource-Constrained Project Scheduling Problem, RCPSP).
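A minimal sketch of one greedy step under the memory constraint. The per-expert memory figures, the budget, and the largest-first tie-breaking rule are all assumptions; the paper's exact heuristic may differ:

```python
# Assumed per-expert GPU memory footprints (GiB) and total budget (GiB).
MEM = {"q1": 14.0, "q2": 26.0, "q3": 14.0}
R_TOTAL = 40.0

def greedy_batch(ready: set[str]) -> list[str]:
    """From the ready set C_t, pack sub-queries into the memory budget.
    Largest-first is one reasonable ordering, not necessarily the paper's."""
    batch, used = [], 0.0
    for node in sorted(ready, key=lambda n: MEM[n], reverse=True):
        if used + MEM[node] <= R_TOTAL:
            batch.append(node)
            used += MEM[node]
    return batch

print(greedy_batch({"q1", "q2", "q3"}))  # e.g. ['q2', 'q1'] -> 40.0 GiB used
```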