Key Insights from the Microsoft AI Tour on LLMs

Things I wish I had realized sooner

On January 30, 2025, I attended the Microsoft AI Tour in NYC. The event featured some fascinating discussions and knowledge sharing that I believe are valuable for anyone developing AI applications — regardless of platform or vendor.

This post focuses on my reflections on what Large Language Models (LLMs) can and can’t do, largely inspired by one of the breakout sessions, “Prompty, AI Foundry, and Practical End-to-End Development,” led by Seth Juarez from Microsoft. Seth offered some refreshing insights that challenged mainstream perspectives on LLMs.

**What LLMs Are Not Meant to Do**

One of Seth’s most compelling points was:

“LLMs are not databases or repositories of knowledge — they are language calculators.”

This struck a chord with me. Despite knowing that LLMs only generate responses based on probability, we still find ourselves tempted to use them as search engines or knowledge bases. As models improve in efficiency and cost per token drops, the temptation grows even stronger.

Three Takeaways About LLM Limitations:

1. LLMs should not be used as databases. They don’t “retrieve” facts the way databases do; they generate text based on learned patterns.
2. Hallucinations are inherent. The term “hallucination” is misleading: LLMs don’t “make mistakes”; they simply generate statistically likely next tokens, whether or not the result is true. In that sense, LLMs are making things up ALL the time.
3. Grounding LLMs is essential. One common approach to mitigating errors is to supply external sources at query time (e.g., retrieval-augmented generation, RAG) and to build a review system for outputs.
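
To make the grounding idea concrete, here is a minimal sketch in Python. It is not how any particular product implements RAG: the `DOCUMENTS` list, the word-overlap ranking in `retrieve`, the prompt wording, and the `needs_review` rule are all assumptions made for illustration, and no model is actually called. The point is only the shape of the flow: fetch sources, put them in the prompt, and flag answers that cite nothing.

```python
import re

# Illustrative stand-in for a knowledge base; a real system would
# use a search index or vector store instead of a Python list.
DOCUMENTS = [
    "The Microsoft AI Tour stop in NYC took place on January 30, 2025.",
    "Seth Juarez led the session on Prompty, AI Foundry, and practical end-to-end development.",
    "Retrieval-augmented generation supplies source text to the model at query time.",
]


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, documents, top_k=2):
    """Rank documents by word overlap with the question (a toy retriever)."""
    question_words = tokenize(question)
    scored = [(len(question_words & tokenize(doc)), doc) for doc in documents]
    scored = [pair for pair in scored if pair[0] > 0]  # drop unrelated documents
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


def build_grounded_prompt(question, sources):
    """Place the retrieved sources in the prompt and ask for answers based only on them."""
    numbered = "\n".join(f"[{i}] {s}" for i, s in enumerate(sources, start=1))
    return (
        "Answer using only the sources below. "
        "If they do not contain the answer, say you do not know.\n\n"
        f"Sources:\n{numbered}\n\nQuestion: {question}"
    )


def needs_review(answer, sources):
    """Flag an answer for human review when it cites none of the numbered sources."""
    return not any(f"[{i}]" in answer for i in range(1, len(sources) + 1))
```

A caller would pass the string from `build_grounded_prompt` to whichever model it uses, then run `needs_review` on the reply before showing it to anyone. Checking for a citation marker is a deliberately crude review step; it shows where a review hook would sit, not what a thorough one looks like.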

That said, I’m still grappling with how these principles apply to new reasoning models like OpenAI o1 (released in Dec 2024) and DeepSeek R1 (released in Jan 2025). These models are designed for logical reasoning, math, coding, and problem-solving, pushing LLMs beyond just predicting words.

So at what point do we stop calling them “language calculators” and start treating them as something more?

Here are the five levels of AI as defined by OpenAI:

Image designed by the Author

Level 1: Conversational AI

This is already widely adopted, from chatbots (e.g., ChatGPT, Google Gemini) to customer support assistants. These models engage in human-like conversation but don’t reason deeply.

Level 2: Reasoning AI

This is where models start excelling at logic, math, and structured problem-solving. OpenAI o1 and DeepSeek R1 are examples of this new generation, improving at tasks like coding and analytical reasoning.

Level 3: Agentic AI

At this level, AI can autonomously take on tasks and make decisions for extended periods without human intervention. Industry insiders predict this will be a major focus for AI companies and startups in 2025.

Level 4: Innovative AI

AI reaches the point where it can propose new ideas and contribute to scientific breakthroughs beyond what humans initially program.

Level 5: Organizational AI

Imagine an AI functioning like an organization — with multiple specialized AI agents collaborating, communicating, and orchestrated by a leadership AI to deliver value, just like a structured company.

With AI advancing rapidly, I ask myself: At what point do we stop treating LLMs as mere “language calculators” and recognize them as something fundamentally different?

What are your thoughts on this shift?

**What LLMs Can Do**

Another of Seth’s most compelling points was:

“LLMs can soften the edge around human-computer interaction.”

I am unsure if Seth’s statement would remain conceptually sound when the day of AGI (Artificial General Intelligence) comes, but until then, I think it is a proven concept.

The primary challenge in human-computer interaction lies in the rigid input-output structure of traditional software — computers process only well-defined inputs and generate corresponding outputs. This creates friction because if a user fails to provide the exact expected input, they may not get the desired results.

For example, in a traditional system, if a doctor wants to find patients with uncontrolled diabetes, a report developer has to write a specific query with criteria such as “HbA1c > 9%” or “Diagnosis: Type 2 Diabetes AND HbA1c > 9%”. If the doctor instead searches for “patients with high blood sugar issues”, the system returns no results because it doesn’t recognize that phrase as a valid query.

With LLM-powered search, the system can interpret intent. It can infer that “high blood sugar issues” relates to “elevated HbA1c,” “diabetes complications,” or even ICD-10 codes, and deliver relevant results without strict keyword matching.

This flexibility bridges the gap between human expression and machine processing.
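
Here is a rough sketch of that contrast in Python. Everything in it is hypothetical: the `Patient` records, the `SAVED_REPORTS` table, and especially `interpret_intent`, which stands in for an LLM call. The stub uses simple keyword hints only so the example runs offline; a real model would handle far more varied phrasing.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    name: str
    diagnosis: str
    hba1c: float  # HbA1c in percent


# Made-up records for illustration only.
PATIENTS = [
    Patient("A", "Type 2 Diabetes", 9.8),
    Patient("B", "Type 2 Diabetes", 6.9),
    Patient("C", "Hypertension", 5.4),
]

# The traditional system: only predefined report names are valid input.
SAVED_REPORTS = {
    "uncontrolled_diabetes": lambda p: p.diagnosis == "Type 2 Diabetes" and p.hba1c > 9.0,
}


def rigid_search(query):
    """Return matching patient names, or nothing if the query is not a known report."""
    predicate = SAVED_REPORTS.get(query)
    if predicate is None:
        return []
    return [p.name for p in PATIENTS if predicate(p)]


def interpret_intent(free_text):
    """Stand-in for an LLM call that maps free text to a known report name.

    Hypothetical stub: a real implementation would prompt a model here.
    """
    hints = ("blood sugar", "hba1c", "diabetes")
    if any(hint in free_text.lower() for hint in hints):
        return "uncontrolled_diabetes"
    return None


def flexible_search(free_text):
    """Translate the user's wording into a structured report, then run it as usual."""
    report = interpret_intent(free_text)
    return rigid_search(report) if report else []
```

With these definitions, `rigid_search("patients with high blood sugar issues")` finds nothing, while `flexible_search` with the same phrase runs the existing report. The structured query still does the actual retrieval; the language model only softens the way in.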

“You are in control of the input of the prompt. You are in control of the output of the prompt.”

This principle is a risk control strategy that ensures users and developers can leverage the power of LLMs while minimizing the potential for unintended consequences or errors.
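
One way to picture this principle in code is a fixed prompt template on the way in and strict validation on the way out. The sketch below is my own illustration, not something shown in the session: the field names in `ALLOWED_FIELDS`, the length limit, and the template wording are all assumptions, and the model call itself is left out.

```python
import json

# Hypothetical contract for what the model is allowed to return.
ALLOWED_FIELDS = {"diagnosis", "hba1c_min"}
MAX_QUESTION_CHARS = 200  # arbitrary limit chosen for illustration


def build_prompt(question):
    """Control the input: normalize whitespace, bound the length, use a fixed template."""
    cleaned = " ".join(question.split())
    if not cleaned or len(cleaned) > MAX_QUESTION_CHARS:
        raise ValueError("question is empty or too long")
    return (
        "Return only a JSON object with the keys 'diagnosis' (string) "
        "and 'hba1c_min' (number).\n"
        f"Question: {cleaned}"
    )


def parse_output(raw):
    """Control the output: accept only JSON with exactly the expected keys and types."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or set(data) != ALLOWED_FIELDS:
        return None
    if not isinstance(data["diagnosis"], str):
        return None
    threshold = data["hba1c_min"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(threshold, bool) or not isinstance(threshold, (int, float)):
        return None
    return data
```

Anything `parse_output` rejects returns `None`, so downstream code never acts on free-form text, extra fields, or wrong types. What happens next (retry, fall back, or hand off to a person) is a decision for the application rather than the model.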

Even as AI advances to Level 5 Organizational AI, where complex systems of AI agents autonomously work together toward achieving goals, I think the principle of controlling both the input and output will remain relevant.

The “Leader” AI in a Level 5 AI system will still need to set parameters and goals for the AI agents, ensuring they are working within predefined frameworks and continuously verifying the actions and outcomes of those agents to avoid risks associated with unmonitored behavior.

Ultimately, we will still have the authority to decide whether or not to trust the output.