AI Outlook 2025

The AI sector in 2024 has been one of the most intense periods of tech innovation I can recall. Almost every week there was a notable new release of capability, product, or research, making what was already a highly dynamic space – early-stage tech startups – even more chaotic. Countless startups have boomed and gone bust. AGI timelines have been adjusted by many. Yet we’re still so incredibly early.

I’ve been building on some version of GPT for nearly 5 years, the last 2 with our portfolio at Sibylline Labs. The following is a retrospective of our experiences building with AI this year, and the assumptions and outlook we’ll be building from as we go into 2025.

The Frontier Labs

A purposeful design choice made early on was to be model agnostic, selecting the most appropriate model for each use case based on a qualitative (vibes) and quantitative (cost-benefit) analysis. This decision has shown outsized returns for us, especially as the gap between the major players' state-of-the-art models has closed.
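
To make that concrete, here's a minimal sketch of what per-use-case selection looks like in code. The use cases, registry contents, and prices (rough late-2024 input-token rates) are illustrative, not our actual configuration:

```typescript
// A minimal sketch of per-use-case model selection. Use-case names,
// registry contents, and prices are illustrative, not our real setup.
type UseCase = "summarization" | "extraction" | "codegen";

interface ModelChoice {
  provider: "anthropic" | "google" | "openai";
  model: string;
  costPerMTokUSD: number; // feeds the quantitative side of the review
}

// Each entry is the current winner of a vibes + cost-benefit review,
// revisited whenever a lab ships something new.
const registry: Record<UseCase, ModelChoice> = {
  summarization: { provider: "google",    model: "gemini-1.5-flash",           costPerMTokUSD: 0.075 },
  extraction:    { provider: "anthropic", model: "claude-3-5-sonnet-20241022", costPerMTokUSD: 3.0 },
  codegen:       { provider: "openai",    model: "gpt-4o",                     costPerMTokUSD: 2.5 },
};

export function pickModel(useCase: UseCase): ModelChoice {
  return registry[useCase];
}
```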

This is not an exhaustive list of the frontier labs and their efforts. Notably, in recent weeks the Chinese lab DeepSeek has released an incredibly impressive model, albeit one that we have no direct experience with (nor with its predecessors), so I won't go into detail on it here.

Anthropic

Anthropic will continue to win by innovating with builders on "just good enough tooling", underpinned by clearly state-of-the-art models. I think the first signal from them here was optimizing around the use of XML tags as the underlying parsing mechanism for responses. These are demonstrably easier to parse out of complex and often inconsistent text responses than JSON. This was, at least to my knowledge, unprecedented at the time and has proven to be a trivial yet powerful interface for testing and integrating new prompts that might not yet have consistent structured output. If I remember correctly, there was even a paper demonstrating how forcing a JSON response from LLMs actually handicapped their reasoning capabilities, leading many teams (us included) to perform a 2-step LLM call: first execute the reasoning, then format the final structure using a weaker model.
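
Here's a minimal sketch of both patterns, assuming a hypothetical `callModel` helper in place of any specific SDK; the model names and prompts are illustrative:

```typescript
// Sketch of XML-tag parsing plus the 2-step reason-then-format pattern.
// callModel is a stand-in for whichever provider SDK you use.
async function callModel(model: string, prompt: string): Promise<string> {
  throw new Error(`wire ${model} up to your SDK of choice here`);
}

// Pulling a tagged span out of free-form text tolerates surrounding
// chatter far better than demanding the entire response be valid JSON.
function extractTag(text: string, tag: string): string | null {
  const match = text.match(new RegExp(`<${tag}>([\\s\\S]*?)</${tag}>`));
  return match ? match[1].trim() : null;
}

async function classifyTicket(ticket: string): Promise<unknown> {
  // Step 1: let the strong model reason freely, asking only for a tagged answer.
  const reasoning = await callModel(
    "strong-model",
    `Classify this support ticket. Think step by step, then put your final ` +
      `answer inside <answer></answer> tags.\n\n${ticket}`
  );
  const answer = extractTag(reasoning, "answer");

  // Step 2: a weaker, cheaper model turns the free-text answer into strict JSON.
  const formatted = await callModel(
    "weak-model",
    `Convert this into JSON with keys "category" and "priority". ` +
      `Respond with JSON only.\n\n${answer}`
  );
  return JSON.parse(formatted);
}
```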

Google

Google will get its platform offering figured out, and with it will become the real "startup killer" that OpenAI isn't turning out to be. RAG-as-a-Service, code execution, internet search, and data connectors (to GSuite, at least) are all niches where nascent startups are still finding their market positioning, and those startups now face direct competition from Google. This is enabled by both their maturing studio offering and the highly robust Google Cloud Platform, capable of serving not just their own demonstrably SOTA models but open-source and partner models in a single, battle-tested platform, where these kinds of integration points are a config item, not a whole third-party integration.
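
To illustrate, here is roughly what "search as a config item" looks like with the @google/generative-ai JS SDK. I'm sketching from memory of the late-2024 API surface, so treat the tool field name as an assumption rather than gospel:

```typescript
// Hedged sketch: search grounding as a config entry rather than a custom
// integration. Field names follow the @google/generative-ai SDK as of
// late 2024 and may have shifted in newer versions.
import { GoogleGenerativeAI } from "@google/generative-ai";

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);

const model = genAI.getGenerativeModel({
  model: "gemini-1.5-pro",
  // What a startup might sell as an entire product is one tool entry here.
  tools: [{ googleSearchRetrieval: {} }],
});

const result = await model.generateContent(
  "Summarize this week's notable AI platform announcements."
);
console.log(result.response.text());
```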

Even now, I believe the market is underpricing just how robust and thought-through Google has made its vertical integration of the AI stack. They are capable of running SOTA models at a fraction of the cost of other labs and PaaS providers thanks to incredible custom silicon. From the outset with their first Gemini release, their "free tier" offering was so generous for the model's performance that it was nonsensical, at least to us, not to deploy it for some use cases.

As Google reorients its execution capacity and its ability to attract and retain outlier researchers around its re-found "mission", even if it consistently lags other labs on SOTA releases, its ability to deploy unit economics that no one else can come close to will force competitors into difficult decisions. It's worth noting that many researchers, notably in the UK, highly value the stability and "established" offering of larger firms over the scale-stage environment other labs offer.

OpenAI

OpenAI will continue to focus on its position as the "leader" in research on the models themselves, rather than a holistic "AI platform" offering. Custom GPTs and other "side effort" retail-oriented developments will continue, but not as the core thesis. Relative to Anthropic, OpenAI reportedly sees significantly more usage on its retail ChatGPT offering than on its API, though I believe these numbers are speculative rather than official. Accordingly, my thesis here is:

  1. There are enough "ideas and research" threads being explored or backlogged, and more than enough capital (and, notably, the ability to raise more if needed) to chase them down and actually reach a near-AGI model.
  2. ChatGPT is still a data asset class to them. This is reaffirmed by copious evidence that synthetic data works, and by the introduction of "thinking" models, as released with o1, whose long chain-of-thought (LCOT) outputs can be used for training. We know of at least one open-source lab that is training its smaller models on the LCOT outputs of its reasoning models.
  3. Deploying an operating fleet of weak/strong models to optimize for cost and performance (a pattern sketched below) faces as many, if not more, SRE challenges, which they're keen to battle-test. Their most recent outage, a well-known Kubernetes DNS-based issue, demonstrates that they're still climbing the SRE maturity curve, and this provides a great testbed for them to refine it.
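
Here's a minimal sketch of that weak/strong cascade, again with a hypothetical `callModel` helper and a deliberately naive escalation check:

```typescript
// Sketch of a weak/strong cascade: route to the cheap model first and
// escalate only when its answer fails a check. Helper and model names
// are illustrative stand-ins.
async function callModel(model: string, prompt: string): Promise<string> {
  throw new Error(`wire ${model} up to your SDK of choice here`);
}

const WEAK = "weak-cheap-model";
const STRONG = "strong-expensive-model";

async function answer(prompt: string): Promise<string> {
  const draft = await callModel(WEAK, prompt);

  // Naive self-check; in production this is where the SRE pain lives:
  // validators, retries, fallbacks, routing, and observability.
  const verdict = await callModel(
    WEAK,
    `Does this answer fully address the question? Reply YES or NO.\n\n` +
      `Q: ${prompt}\nA: ${draft}`
  );
  return verdict.trim().toUpperCase().startsWith("YES")
    ? draft
    : callModel(STRONG, prompt);
}
```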

High-profile implementation deals with major firms will continue at pace. The Bay Area might understand tech, but it has always struggled to understand the "irrationality" of every other major business sector. High-profile tailored implementations for Magic Circle Law and Big-4 Professional Services firms, even if via a "GPT wrapper", provide not only market positioning but insight for researchers into how the "AGI we already have" performs under real market forces.

It's easy for AI labs to optimize for "frontier math" and coding benchmarks, but do they know how to design effective post-training for the intricacies of cross-border, cross-jurisdiction, cross-language, cross-precedent, subjective-outcome problems such as legal disputes between Germany and the UAE? This example is not random: it's a requested use case we've been working on with a regional partner for seven months now, one that has forced us to design and implement a myriad of interesting techniques to address a non-trivial list of edge cases and considerations just to get a base-level result.

Meta

Meta will obviously continue to double down on leading open-source SOTA models and the ecosystem benefits that both they and we receive from that, as seen in Zuck's many public statements about why this is their strategy. In fact, I expect the "lean" on the OSS community to build around the models to only increase. I doubt they will bother designing and releasing a long-chain-of-thought framework or tooling for Llama 4 when within a week someone will have shipped a better one "for free". In line with that, I wouldn't be surprised if we see some major capital deployments from them out into the OSS community for projects they consider strategic or at least worth taking a bet on.

I also get the gut feeling we'll see some incredible work in the small/local model space from Meta. A footnote found in their new strategy is an acknowledgment of the shifting terrain around them regarding data and privacy. Zuck even once hinted in passing at a federated social platform akin to BlueSky. Small local models would be essential to any strategy here, as would the models necessary to create a good UX for their re-entry to the hardware market via wearables.

The Vertical Platforms

It’s been a great year for platform players that had successfully positioned themselves for the unfolding landscape. We’ve found ourselves increasingly leaning on two of them as core strategic capabilities.

Vercel

Vercel has consistently proven to be a platform we can double down on throughout 2024. The team there is clearly switched on and has a sharp view of where they want to go and how to get there. v0 has become one of the most essential tools in our workflow, and yet a rounding error on our total tooling bill. Because it was designed and built by serious builders with builders in mind, its adoption into our workflows was so natural it felt like a foregone conclusion.

For a long time, Vercel's AI SDK offering alongside NextJS failed to deliver enough value for us to justify "yet another framework", though demonstrably not for others. That said, it has rapidly matured into a powerful and production-ready SDK. This is a clear testament to the underlying culture within Vercel about how to build developer tools.
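
As a flavor of why it eventually won us over: with the AI SDK, swapping providers is essentially a one-line change (model IDs here are illustrative):

```typescript
// Minimal Vercel AI SDK usage; swapping providers means swapping the
// provider import and the model id, nothing else.
import { generateText } from "ai";
import { anthropic } from "@ai-sdk/anthropic";
// import { openai } from "@ai-sdk/openai"; // alternative provider

const { text } = await generateText({
  model: anthropic("claude-3-5-sonnet-20241022"),
  // model: openai("gpt-4o"),
  prompt: "In two sentences, explain XML-tagged outputs versus strict JSON.",
});
console.log(text);
```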

I fully expect Vercel to eat away at use cases both above and below them in their vertical with thoughtful, production-ready offerings delivered with their usual impressive velocity, which we shall be all too happy to consume.

Replit

Replit has clearly gone from strength to strength as its full end-to-end thesis continues to be realized. It has become my go-to recommendation when clients and friends ask where to start actually shipping things. I downloaded Replit whilst on a night out in London to ship a moment of inspiration, a moment that really drove home the value proposition of its well-executed AI Agent product.

It's not quite at a Soften level of noob UX, nor as flexible and powerful when you want that (which is fine, as that's clearly not the target user), but I don't think it's far away. Whereas Vercel started from the developer platform and moved vertically, Replit started with the best web-based IDE and moved vertically. My intuition is that this interface choice will become constricting should they double down on it.

Their AI Assistant product is, while well executed, just a chat AI over the repo. I think there are so many opportunities for exploration here, such as a thin abstraction layer over confusing repo directory structures for beginners, or a more proactive, Clippy-like experience that attempts to understand ancillary actions, not just code generation. From personal experience, a trivial piece of low-hanging fruit they've not addressed is an agent that can actually perform Replit/repo-level actions: I spent more than 30 minutes in disparate documentation and unintuitive settings trying to change a simple config line that could easily be an AI utility.

I look forward to putting more leverage into this tooling. As the barriers to entry and inertia around "technical work" continue to fall, our senior engineers and product people can move toward outcomes at compounded velocity. Where v0 provides neither a persistence layer nor a reference backend implementation, Replit does, enabling executives and other customer-facing resources to independently create and iterate on concepts and sales assets without forcing context switching onto engineers.

Developer Tools

One of the most immediately obvious sets of use cases for language models was code generation and developer tooling. This reached an almost mimetic level this year as YC announced the backing of almost a dozen different “AI Code Copilots” or similar.

Devin

The impressive general-availability launch of Devin led to one of the fastest purchases for the studio this year. As a purposely lean team, we use Slack as the control plane for orchestration and communication, and Devin aligns with that as a first-class citizen. Where we would normally need custom integrations into our issue-tracking tooling, being able to connect that tooling to Slack and simply import the context makes Devin feel like an incredibly natural addition to the team. I purposely chose "team" here, not "workflows", where new tooling normally intersects, as Devin truly introduces a whole new set of workflows and resourcing capabilities that would typically require a full-time hire.

Popular complaints about Devin are largely a result of their own marketing positioning it as a whole software developer asset, when pragmatically right now it's a talented, book-smart junior developer, and thus needs to be treated as such. Many enterprise teams who've had this trust thrust upon them may vent their frustration at how it's not the magical AI developer they were sold. As someone who is viscerally aware of how much it costs (I pay the damn invoice) and what we get for it, Devin is a fantastic product that I eagerly await future versions of.

Cursor

Public sentiment around Cursor is fascinating to watch. I don't think I've seen a middle-ground take that says "eh, I'll stick with it and see". People either instantly fall in love with it or outright reject it. Those in love are the less interesting of the two groups: theirs is a good experience of a well-executed product, built on the popular and powerful open-source software that is VSCode. Cursor clearly has at least some insight into effective AI co-piloting.

It's easy to dismiss those who hate Cursor as dinosaurs who hate all AI code assistance. I put zero stock in that thesis. My intuition is that these are the same people who simply do not enjoy "dynamic pair programming", that is to say, having someone (or something) modifying the environment you are working in. They've built AI assistance flows around an independent mechanism (chat or otherwise) where they control the flow of change. I am one of those people: I originally started with just Anthropic (both console and chat) and then progressively leaned on aider, which is in my opinion the perfect interface for a code assistant.

I wonder how Cursor's choice to build exclusively around a standalone VSCode fork, rather than a cross-editor plugin, will impact their longevity as the 20 other YC-backed co-pilot startups chase them down. Perhaps there is a market for N winners here, but as touched on vis-a-vis Replit, as the interface gradually moves away from the editor, what does their future look like?

Closing

We’re still incredibly early across foundation models, infrastructure, tooling, and product design. Despite the velocity of releases and the shifting landscape, we don’t expect a slowdown any time soon. Our thesis is that vertically integrated, AI-native apps with newly enabled interfaces and experiences present the most exciting opportunities, and that is where we’ll be spending most of our time.