How to Build a Chatbot with an LLM from Scratch

Founder of Picasso IA

May 26, 2026 - 5:31 PM

Building a chatbot used to mean years of NLP research and mountains of training data. Today, you can wire up a working conversational bot in an afternoon by calling an LLM API. The hard part is no longer getting it to respond. The hard part is getting it to respond well, consistently, without drifting off-topic or hallucinating facts. This article walks you through every layer of that problem, from picking the right model to designing system prompts that hold up in production.

What an LLM Actually Does in a Chatbot

Most people treat LLMs like a black box. You send text in, text comes out. But understanding what happens inside changes how you build, and more importantly, changes how you debug when things go wrong.

Tokens, context windows, and memory

An LLM does not read words. It reads tokens, which are chunks of text roughly 3-4 characters each. Every model has a context window: the maximum number of tokens it can process in one request. GPT-4o handles 128,000 tokens. Llama 2 7B Chat caps at 4,096. That difference matters enormously when you are building a multi-turn conversation.

💡 Rule of thumb: 1,000 tokens is roughly 750 words. A long conversation hits the context limit faster than most developers expect.

When the context window fills up, the model cannot remember what came before it. That is not a bug. That is how transformers work. Your application code must handle memory, not the model.

Developer hands typing on mechanical keyboard with LLM code on screen

Stateless vs stateful conversations

Here is something that catches every first-time builder: LLMs are stateless. Each API call is completely independent. The model has no idea what was said five messages ago unless you explicitly include those messages in the current request.

This means your chatbot's "memory" is entirely your responsibility. You maintain a list of messages. You pass the entire list on every call. The model sees context, not history. This single architectural fact changes everything about how you build a chatbot application.

Why this matters for architecture

If you forget this and store conversation state only on the frontend, you will have a chatbot that loses all memory on page refresh. If you store too much history without trimming, you will hit context limits in long sessions. The right approach is a server-side session store that persists the message list, trims it intelligently, and sends the right slice to the model on every call.

Choosing the Right LLM

The model you pick determines the ceiling of your chatbot's capability. There is no single right answer, but there are clear trade-offs worth understanding before you commit to an infrastructure direction.

Open-source vs proprietary models

	Open-Source	Proprietary
Cost	Free or cheap hosting	Pay per token
Privacy	Data stays on your servers	Data sent to provider
Performance	Varies by model size	Generally stronger
Control	Full fine-tuning access	API-only
Setup time	Needs infrastructure	Ready in minutes

For prototyping, proprietary models win on speed. GPT-5 and Claude 4 Sonnet get you to a working demo fast. For production with strict privacy requirements or very high volume, models like Llama 4 Maverick Instruct or Mistral 7B v0.1 running on your own infrastructure cut costs significantly.

The models worth knowing

GPT-5: Best general reasoning, highest cost per token
Claude 4 Sonnet: Exceptional at following complex, multi-part instructions
Gemini 2.5 Flash: Fast, multimodal, well-suited for high-throughput bots
DeepSeek R1: Strong reasoning at a fraction of GPT-5's cost
Kimi K2 Instruct: Excellent for coding tasks and agentic workflows
Llama 4 Maverick Instruct: Open-source, high capability, fully self-hostable
Mistral 7B v0.1: Lightweight, fast, runs on modest hardware

Chatbot conversation UI displayed on large desktop monitor

The Core Architecture

A chatbot is three things working together: a message store, a system prompt, and an API call. Nail those three and everything else is polish.

System prompt design

The system prompt is your chatbot's personality and its operating rules. It runs before every conversation and sets the frame for all responses. This is where most chatbot projects succeed or fail.

A weak system prompt: "You are a helpful assistant."

A strong system prompt:

You are a customer support agent for a SaaS product called Orbit.
You only answer questions about Orbit's features, pricing, and troubleshooting.
If a user asks about something outside Orbit, politely redirect them.
Keep responses under 150 words unless the user explicitly asks for more detail.
Never make up pricing figures. If you do not know the answer, say so clearly
and offer to escalate to a human agent.

The difference is specificity. The model needs constraints, not just a role. Constraints produce consistent, predictable behavior. Vague roles produce vague outputs.

💡 Test your system prompt by asking the bot to do things it should refuse. A prompt that holds under adversarial input will hold in production.

Write your system prompt like a job description for a very literal employee who does exactly what you say and nothing more. Every sentence that adds a constraint is a sentence that reduces unpredictability.

Message history management

Your message store is a list of objects, each with a role (system, user, or assistant) and content. A typical conversation looks like this:

messages = [
    {"role": "system", "content": "You are a helpful support agent..."},
    {"role": "user", "content": "How do I reset my password?"},
    {"role": "assistant", "content": "To reset your password, click..."},
    {"role": "user", "content": "I do not see that button anywhere."},
]

You append each new user message and each assistant response to this list. On every API call, you send the full list. The model reads the entire conversation and generates the next reply in context.

Trimming strategy: When the list grows large, trim it before sending to avoid context window errors. Three options:

Sliding window: Drop the oldest user/assistant pairs while always keeping the system prompt.
Summarization: Use a separate LLM call to compress old messages into a short summary, then inject that summary as a pseudo-message.
Fixed turn limit: Allow only the last N conversation turns in the payload.

Sliding window is simplest to implement and works for most use cases. Summarization preserves the most context at the cost of extra API calls and latency.

Temperature and parameters

Temperature controls randomness. At 0.0, responses are deterministic and often repetitive. At 1.0, they are creative and prone to drifting off-topic. For most chatbots, 0.3 to 0.7 is the right range.

Parameter	What it controls	Typical value
`temperature`	Randomness and creativity	0.3 to 0.7
`max_tokens`	Hard cap on response length	512 to 2048
`top_p`	Token sampling breadth	0.9
`frequency_penalty`	Penalizes repeated phrases	0.1 to 0.3

Young woman using chatbot app on smartphone at cafe table

Building the Backend in Python

The actual code is simpler than most tutorials suggest. Here is a minimal working implementation using the OpenAI client, which is compatible with most LLM providers.

Setting up the API client

pip install openai

from openai import OpenAI

client = OpenAI(api_key="your-api-key-here")

For open-source models served via a local API (like Ollama or LM Studio), you swap the base_url parameter. The rest of the code stays identical, which is one of the real benefits of the OpenAI-compatible API standard most providers have adopted.

Handling conversation state

class Chatbot:
    def __init__(self, system_prompt: str, max_turns: int = 20):
        self.system_prompt = system_prompt
        self.max_turns = max_turns
        self.history = []

    def chat(self, user_message: str) -> str:
        self.history.append({"role": "user", "content": user_message})

        # Keep only the last N turns to avoid context overflow
        recent = self.history[-(self.max_turns * 2):]

        messages = [
            {"role": "system", "content": self.system_prompt}
        ] + recent

        response = client.chat.completions.create(
            model="gpt-5",
            messages=messages,
            temperature=0.5,
            max_tokens=1024,
        )

        reply = response.choices[0].message.content
        self.history.append({"role": "assistant", "content": reply})
        return reply

This is the core loop. Every production chatbot is a variation of this pattern, with additional logic layered on top.

Streaming responses

Users tolerate waiting for responses much better when they see text arriving in real time. Streaming is built into every major LLM API and is worth implementing from day one:

stream = client.chat.completions.create(
    model="gpt-5",
    messages=messages,
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)

Stream the response to your frontend via Server-Sent Events (SSE) or WebSockets. Perceived latency drops from several seconds to nearly instant, and users report significantly higher satisfaction with streamed responses compared to waiting for a full reply.

Developer sitting cross-legged on couch with laptop showing LLM config JSON

How to Use LLMs on PicassoIA

PicassoIA hosts 65+ large language models available directly in the browser, with no API setup, no billing configuration, and no infrastructure required. This makes it practical for testing your system prompts and chatbot logic before writing a single line of code.

Finding the right model for your use case

Navigate to the Large Language Models section on PicassoIA. You will find every major model family: OpenAI's GPT-5 and GPT-4o, Anthropic's Claude 4 Sonnet, Google's Gemini 2.5 Flash, Meta's Llama 4 Maverick Instruct, and reasoning specialists like DeepSeek R1 and Kimi K2 Instruct.

Step-by-step: test your chatbot system prompt on PicassoIA

Open the model page for the LLM you want to test. Start with GPT-4o or Claude 4 Sonnet for broad capability testing.
Paste your system prompt into the system message field. Use the exact production version, not a simplified draft.
Send test messages that cover typical use cases, edge cases, and adversarial inputs like "ignore all previous instructions."
Iterate on the prompt: adjust specificity, add refusal instructions, tighten the scope. Reload and retest after each change.
Compare models: run the same test set on Llama 4 Maverick Instruct and DeepSeek R1 to see which handles your use case best, without paying for API calls during development.
Lock in your configuration: note the winning model, system prompt text, and temperature setting before you start coding.

💡 Use Kimi K2 Instruct when your chatbot needs to write or review code. Use DeepSeek R1 when it needs to reason through multi-step problems step by step.

Aerial flat-lay of developer workspace with MacBook, notebook, and coffee

Common Mistakes That Break Chatbots

Context overflow with no fallback

The number one failure mode in production chatbots is hitting the context window limit with no handling code. When that happens, the API throws a context_length_exceeded error and your chatbot crashes or returns a generic error page. Always implement a trim or summarize strategy before you ship.

Simple rule: if your conversation history is approaching 80% of the model's context limit based on token count, start dropping the oldest non-system messages from the history list.

Libraries like tiktoken (for OpenAI models) let you count tokens precisely before making each API call. Use them.

Vague system prompts

Vague instructions produce vague behavior. Three patterns to avoid:

Too generic: "Be helpful and friendly." This tells the model nothing about what it should or should not do.
Too long: A 2,000-word system prompt eats context on every call and often contradicts itself in ways that confuse the model.
No refusal instructions: If you do not tell the bot what to refuse, it will attempt to answer everything, including things it absolutely should not.

Ignoring token costs in production

At scale, every unnecessary token costs money. A system prompt that is 800 tokens longer than necessary costs those 800 tokens on every single API call. At 10,000 daily conversations, that is 8 million extra input tokens per day billed at whatever the model's input rate is. Audit your system prompt aggressively before launch and remove any redundant or verbose instructions.

Two developers collaborating on LLM chatbot architecture at standing desk

Deploying and Scaling Your Chatbot

Rate limits and costs

Every LLM API has rate limits: requests per minute, tokens per minute, and in some cases daily caps. Plan for these before launch. At low traffic, proprietary APIs are the right call for quality and speed. At high traffic, self-hosting an open-source model like Llama 4 Maverick Instruct often becomes significantly cheaper.

Rough cost comparison for a chatbot with 10,000 daily conversations averaging 500 output tokens each:

Model	Approx cost per 1M output tokens	Daily output cost
GPT-4o	~$15	~$75
GPT-4o Mini	~$0.60	~$3
Claude 4 Sonnet	~$15	~$75
Llama 4 (self-hosted)	Infrastructure only	Variable

GPT-4o Mini deserves serious evaluation for high-volume use cases where the full capability of GPT-4o is not strictly required. Many chatbot tasks, like FAQ answering or simple routing, do not need a frontier model.

Monitoring response quality

Shipping is not the end. You need visibility into what your bot is actually saying to real users. At minimum, log every conversation turn with:

Timestamp and session ID
User message text
Model response text
Response latency in milliseconds
Token count for the full request

Review a random sample of conversations daily in the first week. You will catch prompt failures, hallucinations, and edge cases that did not surface during testing.

💡 Set up review triggers: if a user sends a phrase like "that is wrong" or "you made that up," flag that conversation for immediate manual review.

Close-up of monitor showing Python LLM API integration code

3 Things to Add After Your Bot Works

Once your chatbot is functional and deployed, these three additions move it from demo to product:

1. Retrieval Augmented Generation (RAG): Connect your bot to a vector database of your own documents. Instead of relying on the model's training data, it retrieves relevant passages and uses them as context before generating a reply. This is how you build a chatbot that accurately answers questions about your specific product, internal policies, or knowledge base without hallucinating details.

2. Guardrails: Add a secondary check that validates both user inputs and bot outputs before anything reaches the user. This catches prompt injection attempts, policy violations, and off-topic responses before they become visible to your audience. A simple rules layer or a second lightweight model call handles most cases.

3. Evaluation suite: Write a test set of 50-100 input/expected output pairs covering your chatbot's core use cases. Run it automatically after every system prompt change. This is the only reliable way to catch regressions before users encounter them. Manual testing at that scale is not realistic after week one.

Developer testing chatbot in browser on laptop at sunlit kitchen island

How to Build a Chatbot with an LLM: The Real Starting Point

The architecture is clear. The code is not complicated. What separates a chatbot that impresses people in a demo from one that works reliably in production is iteration. More specifically: iteration on the system prompt, the trimming strategy, the model choice, and the monitoring setup, all driven by real conversation data.

None of that iteration requires writing code first. You can do the most valuable part of it, finding the right model and a system prompt that actually holds up, in a browser before you open your IDE.

Woman developer reviewing chatbot analytics dashboard on monitor

Try It With Your Own Idea Right Now

PicassoIA gives you direct access to 65+ large language models in your browser, including GPT-5, Claude 4 Sonnet, Llama 4 Maverick Instruct, Kimi K2 Instruct, DeepSeek R1, and Gemini 2.5 Flash.

Write your system prompt. Pick a model. See how it actually responds. Test the edge cases. Break it intentionally. Then refine. That iteration loop is where real chatbot quality gets built, and you can run the entire process on PicassoIA before you commit to a single API call or line of code.

The models are available. Your chatbot is not going to build itself.

Share this article

How to Build a Chatbot with an LLM: From Zero to Working Bot