How to Build a Chatbot with an LLM: From Zero to Working Bot
Everything you need to build a real chatbot powered by a large language model. This article covers model selection, system prompt engineering, conversation memory, API integration in Python, common pitfalls that crash production bots, and how to test and scale without burning your API budget.
Building a chatbot used to mean years of NLP research and mountains of training data. Today, you can wire up a working conversational bot in an afternoon by calling an LLM API. The hard part is no longer getting it to respond. The hard part is getting it to respond well, consistently, without drifting off-topic or hallucinating facts. This article walks you through every layer of that problem, from picking the right model to designing system prompts that hold up in production.
What an LLM Actually Does in a Chatbot
Most people treat LLMs like a black box. You send text in, text comes out. But understanding what happens inside changes how you build, and more importantly, changes how you debug when things go wrong.
Tokens, context windows, and memory
An LLM does not read words. It reads tokens, which are chunks of text roughly 3-4 characters each. Every model has a context window: the maximum number of tokens it can process in one request. GPT-4o handles 128,000 tokens. Llama 2 7B Chat caps at 4,096. That difference matters enormously when you are building a multi-turn conversation.
💡 Rule of thumb: 1,000 tokens is roughly 750 words. A long conversation hits the context limit faster than most developers expect.
When the context window fills up, the model cannot remember what came before it. That is not a bug. That is how transformers work. Your application code must handle memory, not the model.
Stateless vs stateful conversations
Here is something that catches every first-time builder: LLMs are stateless. Each API call is completely independent. The model has no idea what was said five messages ago unless you explicitly include those messages in the current request.
This means your chatbot's "memory" is entirely your responsibility. You maintain a list of messages. You pass the entire list on every call. The model sees context, not history. This single architectural fact changes everything about how you build a chatbot application.
Why this matters for architecture
If you forget this and store conversation state only on the frontend, you will have a chatbot that loses all memory on page refresh. If you store too much history without trimming, you will hit context limits in long sessions. The right approach is a server-side session store that persists the message list, trims it intelligently, and sends the right slice to the model on every call.
Choosing the Right LLM
The model you pick determines the ceiling of your chatbot's capability. There is no single right answer, but there are clear trade-offs worth understanding before you commit to an infrastructure direction.
Open-source vs proprietary models
Open-Source
Proprietary
Cost
Free or cheap hosting
Pay per token
Privacy
Data stays on your servers
Data sent to provider
Performance
Varies by model size
Generally stronger
Control
Full fine-tuning access
API-only
Setup time
Needs infrastructure
Ready in minutes
For prototyping, proprietary models win on speed. GPT-5 and Claude 4 Sonnet get you to a working demo fast. For production with strict privacy requirements or very high volume, models like Llama 4 Maverick Instruct or Mistral 7B v0.1 running on your own infrastructure cut costs significantly.
The models worth knowing
GPT-5: Best general reasoning, highest cost per token
Claude 4 Sonnet: Exceptional at following complex, multi-part instructions
Gemini 2.5 Flash: Fast, multimodal, well-suited for high-throughput bots
DeepSeek R1: Strong reasoning at a fraction of GPT-5's cost
Kimi K2 Instruct: Excellent for coding tasks and agentic workflows
A chatbot is three things working together: a message store, a system prompt, and an API call. Nail those three and everything else is polish.
System prompt design
The system prompt is your chatbot's personality and its operating rules. It runs before every conversation and sets the frame for all responses. This is where most chatbot projects succeed or fail.
A weak system prompt: "You are a helpful assistant."
A strong system prompt:
You are a customer support agent for a SaaS product called Orbit.
You only answer questions about Orbit's features, pricing, and troubleshooting.
If a user asks about something outside Orbit, politely redirect them.
Keep responses under 150 words unless the user explicitly asks for more detail.
Never make up pricing figures. If you do not know the answer, say so clearly
and offer to escalate to a human agent.
The difference is specificity. The model needs constraints, not just a role. Constraints produce consistent, predictable behavior. Vague roles produce vague outputs.
💡 Test your system prompt by asking the bot to do things it should refuse. A prompt that holds under adversarial input will hold in production.
Write your system prompt like a job description for a very literal employee who does exactly what you say and nothing more. Every sentence that adds a constraint is a sentence that reduces unpredictability.
Message history management
Your message store is a list of objects, each with a role (system, user, or assistant) and content. A typical conversation looks like this:
messages = [
{"role": "system", "content": "You are a helpful support agent..."},
{"role": "user", "content": "How do I reset my password?"},
{"role": "assistant", "content": "To reset your password, click..."},
{"role": "user", "content": "I do not see that button anywhere."},
]
You append each new user message and each assistant response to this list. On every API call, you send the full list. The model reads the entire conversation and generates the next reply in context.
Trimming strategy: When the list grows large, trim it before sending to avoid context window errors. Three options:
Sliding window: Drop the oldest user/assistant pairs while always keeping the system prompt.
Summarization: Use a separate LLM call to compress old messages into a short summary, then inject that summary as a pseudo-message.
Fixed turn limit: Allow only the last N conversation turns in the payload.
Sliding window is simplest to implement and works for most use cases. Summarization preserves the most context at the cost of extra API calls and latency.
Temperature and parameters
Temperature controls randomness. At 0.0, responses are deterministic and often repetitive. At 1.0, they are creative and prone to drifting off-topic. For most chatbots, 0.3 to 0.7 is the right range.
Parameter
What it controls
Typical value
temperature
Randomness and creativity
0.3 to 0.7
max_tokens
Hard cap on response length
512 to 2048
top_p
Token sampling breadth
0.9
frequency_penalty
Penalizes repeated phrases
0.1 to 0.3
Building the Backend in Python
The actual code is simpler than most tutorials suggest. Here is a minimal working implementation using the OpenAI client, which is compatible with most LLM providers.
Setting up the API client
pip install openai
from openai import OpenAI
client = OpenAI(api_key="your-api-key-here")
For open-source models served via a local API (like Ollama or LM Studio), you swap the base_url parameter. The rest of the code stays identical, which is one of the real benefits of the OpenAI-compatible API standard most providers have adopted.
This is the core loop. Every production chatbot is a variation of this pattern, with additional logic layered on top.
Streaming responses
Users tolerate waiting for responses much better when they see text arriving in real time. Streaming is built into every major LLM API and is worth implementing from day one:
stream = client.chat.completions.create(
model="gpt-5",
messages=messages,
stream=True,
)
for chunk in stream:
delta = chunk.choices[0].delta.content
if delta:
print(delta, end="", flush=True)
Stream the response to your frontend via Server-Sent Events (SSE) or WebSockets. Perceived latency drops from several seconds to nearly instant, and users report significantly higher satisfaction with streamed responses compared to waiting for a full reply.
How to Use LLMs on PicassoIA
PicassoIA hosts 65+ large language models available directly in the browser, with no API setup, no billing configuration, and no infrastructure required. This makes it practical for testing your system prompts and chatbot logic before writing a single line of code.
Step-by-step: test your chatbot system prompt on PicassoIA
Open the model page for the LLM you want to test. Start with GPT-4o or Claude 4 Sonnet for broad capability testing.
Paste your system prompt into the system message field. Use the exact production version, not a simplified draft.
Send test messages that cover typical use cases, edge cases, and adversarial inputs like "ignore all previous instructions."
Iterate on the prompt: adjust specificity, add refusal instructions, tighten the scope. Reload and retest after each change.
Compare models: run the same test set on Llama 4 Maverick Instruct and DeepSeek R1 to see which handles your use case best, without paying for API calls during development.
Lock in your configuration: note the winning model, system prompt text, and temperature setting before you start coding.
💡 Use Kimi K2 Instruct when your chatbot needs to write or review code. Use DeepSeek R1 when it needs to reason through multi-step problems step by step.
Common Mistakes That Break Chatbots
Context overflow with no fallback
The number one failure mode in production chatbots is hitting the context window limit with no handling code. When that happens, the API throws a context_length_exceeded error and your chatbot crashes or returns a generic error page. Always implement a trim or summarize strategy before you ship.
Simple rule: if your conversation history is approaching 80% of the model's context limit based on token count, start dropping the oldest non-system messages from the history list.
Libraries like tiktoken (for OpenAI models) let you count tokens precisely before making each API call. Use them.
Vague system prompts
Vague instructions produce vague behavior. Three patterns to avoid:
Too generic: "Be helpful and friendly." This tells the model nothing about what it should or should not do.
Too long: A 2,000-word system prompt eats context on every call and often contradicts itself in ways that confuse the model.
No refusal instructions: If you do not tell the bot what to refuse, it will attempt to answer everything, including things it absolutely should not.
Ignoring token costs in production
At scale, every unnecessary token costs money. A system prompt that is 800 tokens longer than necessary costs those 800 tokens on every single API call. At 10,000 daily conversations, that is 8 million extra input tokens per day billed at whatever the model's input rate is. Audit your system prompt aggressively before launch and remove any redundant or verbose instructions.
Deploying and Scaling Your Chatbot
Rate limits and costs
Every LLM API has rate limits: requests per minute, tokens per minute, and in some cases daily caps. Plan for these before launch. At low traffic, proprietary APIs are the right call for quality and speed. At high traffic, self-hosting an open-source model like Llama 4 Maverick Instruct often becomes significantly cheaper.
Rough cost comparison for a chatbot with 10,000 daily conversations averaging 500 output tokens each:
GPT-4o Mini deserves serious evaluation for high-volume use cases where the full capability of GPT-4o is not strictly required. Many chatbot tasks, like FAQ answering or simple routing, do not need a frontier model.
Monitoring response quality
Shipping is not the end. You need visibility into what your bot is actually saying to real users. At minimum, log every conversation turn with:
Timestamp and session ID
User message text
Model response text
Response latency in milliseconds
Token count for the full request
Review a random sample of conversations daily in the first week. You will catch prompt failures, hallucinations, and edge cases that did not surface during testing.
💡 Set up review triggers: if a user sends a phrase like "that is wrong" or "you made that up," flag that conversation for immediate manual review.
3 Things to Add After Your Bot Works
Once your chatbot is functional and deployed, these three additions move it from demo to product:
1. Retrieval Augmented Generation (RAG): Connect your bot to a vector database of your own documents. Instead of relying on the model's training data, it retrieves relevant passages and uses them as context before generating a reply. This is how you build a chatbot that accurately answers questions about your specific product, internal policies, or knowledge base without hallucinating details.
2. Guardrails: Add a secondary check that validates both user inputs and bot outputs before anything reaches the user. This catches prompt injection attempts, policy violations, and off-topic responses before they become visible to your audience. A simple rules layer or a second lightweight model call handles most cases.
3. Evaluation suite: Write a test set of 50-100 input/expected output pairs covering your chatbot's core use cases. Run it automatically after every system prompt change. This is the only reliable way to catch regressions before users encounter them. Manual testing at that scale is not realistic after week one.
How to Build a Chatbot with an LLM: The Real Starting Point
The architecture is clear. The code is not complicated. What separates a chatbot that impresses people in a demo from one that works reliably in production is iteration. More specifically: iteration on the system prompt, the trimming strategy, the model choice, and the monitoring setup, all driven by real conversation data.
None of that iteration requires writing code first. You can do the most valuable part of it, finding the right model and a system prompt that actually holds up, in a browser before you open your IDE.
Write your system prompt. Pick a model. See how it actually responds. Test the edge cases. Break it intentionally. Then refine. That iteration loop is where real chatbot quality gets built, and you can run the entire process on PicassoIA before you commit to a single API call or line of code.
The models are available. Your chatbot is not going to build itself.