Why the AI rocket ride suddenly feels like it has slowed down
July 8, 2025
Eighteen months ago it seemed as though each new model ripped a hole in the sky. ChatGPT hit a million users in five days after its 30 November 2022 debut, stunned educators, and triggered Google’s now-famous internal “code-red.”
Within a single year GPT-3.5 morphed into GPT-4; Anthropic’s original Claude sprouted two successors; Google’s Bard gave way to Gemini; and Meta open-sourced Llama 2. The leap from “autocomplete on steroids” to passing the bar exam felt instant.
Fast-forward to mid-2025 and the headlines look… quieter. What changed?
- Law of diminishing returns. For years, simply doubling a model’s “parameters” (the adjustable weights inside it) delivered eye-popping gains. Recent papers show those gains flattening, a phenomenon some call the scaling wall. (medium.com, businessinsider.com)
- Data-center bottlenecks. Training today’s giants eats thousands of high-end GPUs and megawatts of power. Microsoft alone is spending $80 billion this fiscal year on new AI facilities just to keep the pipeline flowing. (reuters.com)
- Longer safety pit-stops. Frontier-model makers now put releases through red-team drills and external audits that add weeks or months.
- Shift from showy demos to plumbing. Instead of bigger model files, teams are busy wiring in memory, tool-use and agents—advances you don’t “see” until developers build products on top.
The result: progress is still there, it’s just happening under the hood—so the ride feels slower even while the engine is being rebuilt.
Sidebar: What is a Frontier Model?
A frontier model is the very largest, most experimental system a lab can currently build—often running into the hundreds of billions or even trillions of parameters.
Because these models sit on the cutting edge, they’re where brand-new capabilities first emerge, whether that’s live video understanding, chain-of-thought reasoning, or the ability to call external tools and act autonomously. They demand enormous compute budgets and megawatts of power to train, so you won’t build one yourself, but you may fine-tune or host an open-weight release (Meta’s Llama line is the prime example). Their power also means they pass through extra-rigorous safety reviews and launch more slowly.
In short: frontier models preview the skills—and the guard-rails—that will trickle into everyday enterprise AI over the next year or so.
What’s next on the runway (and why it matters in plain English)
OpenAI – GPT-5 (target: summer 2025).
Sam Altman has said the next flagship arrives “this summer” (youtube.com). Expect an assistant that remembers past conversations, works natively in voice and video, and can juggle 200,000 tokens of context, enough to read an entire mortgage guideline binder in one go.
Sidebar: Why Does The Number Of Tokens Matter?
At launch in late 2022, ChatGPT’s underlying model could keep track of roughly 3,000 English words at a time, just enough for a long email. Today, state-of-the-art engines swallow entire books in one gulp. GPT-4’s original 32k-token limit jumped to 128k with GPT-4 Turbo, and in spring 2025 OpenAI’s GPT-4.1 stretched all the way to a one-million-token context window. Google followed suit: Gemini 1.5 already offers a million-token window, with a two-million-token tier on the roadmap. Anthropic’s Claude 4 sits at 200k tokens, still enough to ingest a full closing package plus conversation history.
A “token” is just a bite-sized piece of text—often a syllable or short word. The more tokens a model can juggle, the longer the passage it can read and remember in one shot. For a mortgage executive that means an AI can now scan every page of a borrower’s file, compare it against underwriting guidelines, and spot inconsistencies without splitting the work into smaller chunks. Bigger windows also unlock true conversation memory: an assistant can recall what it told a borrower three Zoom calls ago, or notice that today’s pay stub contradicts last month’s. In short, swelling token capacity turns generative AI from a clever paragraph-writer into a system-wide auditor and continuous partner.
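To make that concrete, here is a minimal sketch, using OpenAI’s open-source tiktoken tokenizer, of how you might check whether a borrower file fits a given context window. The file path and the encoding choice are illustrative assumptions, not a fixed recipe.

```python
# Count tokens in a borrower file with OpenAI's open-source tiktoken
# library (pip install tiktoken); the file path here is illustrative.
import tiktoken

# cl100k_base is the encoding used by GPT-4-era models; newer models may
# differ, so treat the count as an estimate rather than an exact budget.
enc = tiktoken.get_encoding("cl100k_base")

with open("borrower_file.txt", encoding="utf-8") as f:
    text = f.read()

n_tokens = len(enc.encode(text))
print(f"{n_tokens:,} tokens")   # e.g. 187,432 -> fits a 200k window
```

Run against a sample closing package, a count like this tells you whether the file fits today’s 200k windows in one shot or still needs chunking until the million-token engines land.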
Google DeepMind – Gemini 3 (late 2025 preview).
Demis Hassabis hints at a model that fuses Gemini’s language chops with Veo’s video understanding (techcrunch.com). Picture an AI that can watch an intro call with a prospect and notify the loan officer, before the call ends, that they have not yet asked about the prospect’s marital status.
Anthropic – Claude “5” (early 2026).
Claude 4 already handles multi-step reasoning; the company’s public responsible-scaling roadmap points to an even larger frontier model inside two years (anthropic.com). The goal is safer autonomy: think an AI that can draft, file and track an appraisal request end-to-end while logging its every decision.
Meta – Llama 4 Behemoth (Q4 2025).
Meta’s own blog teases a mixture-of-experts system with nearly two trillion total parameters yet lower run-time cost (ai.meta.com). Crucially, Meta plans to release the weights, letting banks deploy a high-end model inside their firewall: no borrower data leaves the building.
Sidebar: What’s a “Mixture-of-Experts” Model?
Instead of one huge neural network doing every task, a mixture-of-experts (MoE) model is more like a call-center full of specialists. Each “expert” is a smaller sub-network trained to excel at a particular slice of language or reasoning. A lightweight router sees your prompt, decides which few experts are most relevant, and sends the work their way—so only a fraction of the total parameters are active on any given request.
The payoff is two-fold: labs can pack trillions of parameters into a single system (because they’re not all firing at once) and run it for a fraction of the compute and energy. For business users that means GPT-4-class quality at lower cost and latency, plus room for ever-larger context windows and domain-specific experts—imagine a mortgage underwriter expert, a compliance expert, and a servicing expert all inside the same model, each waking up only when you need them.
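The routing trick is easier to see in code. Below is a toy sketch of top-k expert routing; the expert count, dimensions and random weights are all illustrative, not any lab’s actual architecture.

```python
# Toy mixture-of-experts forward pass: a router scores the experts,
# only the top-k fire, and their outputs are blended by router weight.
import numpy as np

rng = np.random.default_rng(0)
n_experts, d_model, top_k = 8, 64, 2

router_w = rng.normal(size=(d_model, n_experts))                 # router weights
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

def moe_forward(x: np.ndarray) -> np.ndarray:
    logits = x @ router_w                        # score every expert
    top = np.argsort(logits)[-top_k:]            # keep only the top-k
    weights = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax over winners
    # Only top_k of n_experts matrices are used: roughly k/n of the compute.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

y = moe_forward(rng.normal(size=d_model))
print(y.shape)   # (64,)
```

Here 2 of 8 experts fire per request, which is why a trillion-parameter MoE can cost far less to run than its headline size suggests.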
Mistral – next Mixtral (Q4 2025).
The French upstart ships fast; insiders talk about an 8-expert, 45B-parameter model that rivals GPT-4 quality but runs on a single server-class GPU node (mistral.ai). Ideal for branch-level or mobile applications where every millisecond counts.
What this means for mortgage leaders
- Document automation will leap again. Longer “context windows” let models read an entire closing package at once, spotting mismatched names across 500 pages instead of 50. Start feeding today’s 200k-token engines sample files so workflows are in place when GPT-5 lands.
- Voice-first borrower co-pilots. Multimodal models will speak, listen and annotate PDFs in real time. Begin capturing chat and call transcripts now—they’re the training gold you’ll need for a fully conversational LOS front-end.
- Agentic operations. The next wave will act, not just advise: rerunning DU, ordering VOEs, updating LOS fields. Map every API your team touches and build audit trails so regulators can see why the AI clicked the button.
- On-prem privacy options. Open-weight giants like Llama 4 Behemoth make it feasible to keep generative AI under your roof. Budget a modest GPU rack (four to eight H100s) and work with InfoSec on inference gateways before the hardware rush.
- Avoid single-vendor lock-in. Each roadmap is different; hedge by integrating through open frameworks (LangChain, OpenAI Assistants, etc.) so swapping a model is a config change, not a rewrite (see the sketch after this list).
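What does “a config change, not a rewrite” look like in practice? A minimal framework-agnostic sketch follows; every class, model name and config key here is hypothetical.

```python
# Provider-agnostic model access: the vendor choice lives in config, not in
# application code. All class, model and config names here are hypothetical.
from dataclasses import dataclass
from typing import Protocol

class ChatModel(Protocol):
    def complete(self, prompt: str) -> str: ...

@dataclass
class OpenAIModel:
    model: str
    def complete(self, prompt: str) -> str:
        # a real implementation would call the vendor's API here
        return f"[openai:{self.model}] {prompt[:40]}..."

@dataclass
class OnPremLlama:
    endpoint: str
    def complete(self, prompt: str) -> str:
        # a real implementation would POST to the on-prem inference server
        return f"[llama@{self.endpoint}] {prompt[:40]}..."

CONFIG = {"provider": "openai", "model": "gpt-4o"}  # swap vendors by editing this line

def build_model(cfg: dict) -> ChatModel:
    if cfg["provider"] == "openai":
        return OpenAIModel(model=cfg["model"])
    return OnPremLlama(endpoint=cfg.get("endpoint", "http://localhost:8000"))

print(build_model(CONFIG).complete("Summarize this closing package."))
```

The application code only ever sees the ChatModel interface; which vendor answers is decided in one place, so a new frontier model is a one-line edit rather than a migration project.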
Sidebar: What Is An Inference Gateway and Why Do You Need One?
An inference gateway is a specialized traffic cop that sits between your apps and one or more large language models. Instead of every service calling OpenAI, Gemini or an on-prem Llama cluster directly, everything flows through this single door. The gateway authenticates each request, strips or masks sensitive data, decides which model (or version) should handle the prompt, caches repeat answers, logs every token, and enforces cost or rate limits, giving you one place to secure, observe and control all model traffic. Vendors position it as the AI-specific counterpart to the traditional API gateway, tuned for token counting and model fail-over rather than plain REST endpoints (solo.io). A toy sketch follows the list below.
Why bother?
- Compliance & privacy: scrub borrower PII or block disallowed content before it leaves your firewall.
- Cost and latency control: route routine questions to a cheaper model, let GPT-5 handle the tough ones, and cache common prompts.
- Vendor agility: swap OpenAI for an on-prem Llama or a new provider by changing a routing rule, not rewriting every integration.
- Full audit trail: one log shows who asked what, which model answered, how many tokens were burned and how long it took.
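To ground those bullets, here is a toy gateway that masks one kind of PII, routes by prompt size, and writes an audit log. The model names, the SSN pattern and the routing rule are illustrative assumptions; a production gateway would do far more.

```python
# Toy inference gateway: mask PII, route by difficulty, log every call.
# Model names, the SSN pattern and the routing rule are illustrative only.
import logging
import re
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("gateway")

SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def call_model(model: str, prompt: str) -> str:
    return f"[{model}] response"                  # placeholder for the real backend

def handle(user: str, prompt: str) -> str:
    masked = SSN.sub("[SSN REDACTED]", prompt)    # compliance: scrub borrower PII
    # cost control: short prompts go to a cheap model, long ones to the frontier
    model = "small-local-model" if len(masked) < 500 else "frontier-model"
    start = time.time()
    answer = call_model(model, masked)
    log.info("user=%s model=%s chars=%d latency=%.3fs",   # audit trail
             user, model, len(masked), time.time() - start)
    return answer

print(handle("lo-officer-17", "Borrower SSN 123-45-6789; verify income docs."))
```

Even this stub shows the payoff: the PII never reaches a vendor, the routing rule is one line to change, and every request leaves a log entry a regulator can read.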
Real-world options today
- Gloo AI Gateway (solo.io).
- Azure API Management’s “AI gateway” features (learn.microsoft.com).
- Kubernetes Gateway API Inference Extension (cncf.io).
- Inference-gateway, an open-source project (github.com).
For any mortgage lender moving sensitive borrower data through AI models, an inference gateway is the quickest way to wrap enterprise-grade governance around today’s—and tomorrow’s—LLMs.
The bottom line
The tempo of AI discovery hasn’t slowed—it’s maturing. 2025-26 will bring another surge of capability, just packaged in memory, autonomy and deployment flexibility rather than raw shock value. Mortgage executives who prep data pipelines, security posture and change-management now will be ready to flip the switch on day one, while competitors wait for version 6.
Six to nine months is a short runway in banking IT. Throttle up before the tower clears the next departure.