AI is all about Software Engineering
There was a story posted on Hacker News a few days ago titled: 73% of AI startups are just prompt engineering .
I think that we might easily get fooled by the fact that quickly repurposing an AI model with some smart prompts can give you a working prototype very quickly. But creating a long-lasting, effective product takes a lot of traditional software engineering skills. Here’s why.
A whole new paradigm
AI poses a brand-new challenge: dealing with non-determinism. In software, we’re used to dealing with deterministic scenarios (aside from parallelism). AI models are non-deterministic, and we’ll need to learn how to deal with this.
For example, for some products that I have advised, we’ve spent MOST of the product development time NOT on the creation of the actual product (the prompts, the screens), but on internal frameworks to assess that the product is working “correctly”.
The “correctly” here is highlighted, because given their non-deterministic nature, you CANNOT assume that an AI model will work as expected 100% of the time. So, for example, while in traditional Software Testing you expect a “boolean” result (tests are passing or not), in AI Products you have a confusion matrix, and your metrics will be Precision and Recall.

The complexity here lies in the “explosion of dimensions”. We went from something boolean (product works or not), to something “multi-class” with all its permutations:
- The agent replied correctly but forgot something (is this wrong? is it acceptable?)
- The agent found the right sources but hallucinated towards the end
- etc…
Things are always changing
In addition to the “explosion of dimensions” of different outcomes you also have to add the abundance of options of models to choose from. Some are smarter but more expensive, some might be dumber and obviously cheaper, some are faster, some can reason, some are multi-modal, some are local, some store your data, etc.
For example, you might be tempted to choose a “dumber” but cheaper model, only to realize that to obtain the desired behavior your prompts have to be a lot longer and detailed (Few-shot prompting, for example) increasing the final cost, as you have more input tokens. In comparison, a more expensive model might achieve the same results with a much shorter prompt, and even though the cost per token is higher, the overall cost decreases because your prompts get shorter.
Latency and reliability are also big factors. Not all models work the same all the time. Even the “pinned-down” versions of GPT gpt-5-nano-2025-08-07 seem to “behave differently” from time to time.
To all this explosion of dimensions and choices, you have to add your own variations:
- Different prompts
- Different parameters (temperature, reasoning, etc)
- Different providers (Bedrock, Open Router, etc)
- etc.
This makes the creation of an AI app an Engineering marvel. You’ll end up with a high-dimensional spreadsheet with multiple prompts and their performance with different models, costs, latency, etc.
Vulnerabilities are also non-deterministic
Most of the startups I advise are surprised when I make them test their prompts for vulnerabilities on a “traditional machine learning basis”. It’s YET another confusion matrix. You can NO LONGER say “your prompt is safe or not”, it’s not boolean, you’ll ALWAYS be subject to vulnerabilities. The best you can do is say: “we can mitigate prompt injection attacks to a 99.9% level”.
For this, you’ll have to create multiple experiments beforehand, combining:
- Prompts
- Model
- Parameters
- Providers
- Etc.
And you’ll need to run the same experiment MULTIPLE times. The same combination of prompt/model/parameter/etc, run 10 times, might show a vulnerability in 2 out of those 10 runs.
Software Supply Chain is a Pain in the A
Software Engineering has “enjoyed” the blessing of ~20 years of stability. How many web frameworks are out there? 10? 50? The number doesn’t matter: it’s finite, known and stable. But do you remember the time when your programming language made it to the web? For example, for Python we had Zope, Pylons, Pyramid, CherryPy, TurboGears, Twisted, web.py, Bottle, etc… (and those are the ones I just remember off the top of my head). A LOT of people back then made “the wrong decision” and chose a framework that eventually ended up deprecated. For example, I built my first startup’s API using Django Tastypie, when the clear winner was DRF. Needless to say we had to migrate our entire API after a few years.
There are hundreds (or thousands?) of models to choose from and libraries/frameworks to use. We know that only a small percentage will survive the hype, and if we don’t choose right, we’ll have to rewrite, refactor, etc.
On top of that, the room for vulnerabilities and supply chain attacks has increased dramatically. Have you seen how to install LangChain in Python? You’d think it’s a simple:
$ pip install langchain
Oh no, no. It’s at least 5 different packages to have a working app:
* langchain-core
* langgraph
* langchain-openai
* langchain-anthropic
* toolbox-langchain
* ...
And the worst part is that this might have already changed (from the time I wrote it)! There are dependency conflicts EVERYWHERE and solving it means manually changing dependency versions or just doing nasty dependency tricks (huge props to uv).
I see this becoming a cybersecurity risk pretty soon: abandoned frameworks and packages being taken over by attackers, and naive programmers leaving them in their pyproject.toml “just in case”, only to be bitten in a few years from now.
Spaghetti Code
I won’t mention much here because it’s a completely subjective view. But I invite you to take a look at the source code of some of the Python libraries and frameworks out there. It’s not pretty. Everything seems to be either crafted by inexperienced software architects, or just rushed out.
As this Reddit user put it “LiteLLM’s __init__.py is 1200 lines long” (1600 at the time of this writing).
As I said, this is highly subjective, so take it with a pinch of salt and do your own assessment.
Given this “concerning low-quality” of libraries, and the fact that things are changing all the time (see the previous point), I have “forbidden” the use of some AI frameworks to my developers. Do we need some particular feature? Code it yourself. It’ll take you one extra day, it’s ok, but we won’t deal with dependency nightmares, vulnerabilities and deprecated frameworks in the future.
State Machines, Parallelism, and everything nice…
Crafting an effective agent will require you to master 2 main things: State Machines and Parallelism. Your agent will have a single entry point, but to work effectively, it’ll probably need to run a bunch of things in parallel (the main LLM prompts, searches, LLMs as judges, etc).
Doing this at scale, with proper error handling, recovery and observability is a HUGE challenge (and always has been) in Software Engineering. Before you know it, you’ll need to use advanced architectural designs like the Saga Pattern.
This is far from just “prompt engineering”.
Final words
I’ll just leave you with this. McDonald’s and Starbucks succeeded because they mastered “location” (and real estate). Gatorade and Red Bull because they mastered brand awareness and extreme distribution. Coca-Cola, Heineken, P&G, etc. succeeded because they mastered finances and consolidation. None of the mentioned products are the best in their category: McDonald’s is not the best burger, Starbucks doesn’t have the best coffee, Red Bull and Gatorade are “just sugared water”, etc.
But they did one thing right: focused on the fundamentals. What are the fundamentals of digital products? It’s Software Engineering.
