Building internal AI agents for our teams on Upsun
At Upsun, we believe the best way to understand a technology is to build something real with it. That's why we created AI-Brain, a multi-agent AI system designed to help our teams create high-quality technical content and understand our product better. This article is the story of how we built it, why we made specific architectural decisions, and what we learned along the way.
More importantly, this article shows you how to build your own internal AI agents on Upsun: agents that understand your domain, follow your guidelines, and produce consistent, high-quality outputs. We focus on building a content writer agent for a marketing team, but the concepts apply to any other role.
The challenge: Beyond generic AI responses
Generic AI assistants are impressive, but they fall short when you need domain-specific expertise. Ask ChatGPT to write a technical article about your company, and you’ll get something generic—accurate in structure perhaps, but lacking the nuance, brand voice, and technical depth your team needs.
The problem isn’t the model’s capability, it’s the context. Large language models don’t inherently know about your company’s:
- Technical documentation and best practices
- Brand voice and messaging guidelines
- Content templates and formatting standards
- Product-specific features and terminology
- Team personas and target audiences
We needed an agent that could write content and produce answers to questions as if it were a member of our team. Someone who had read all our documentation, understood our brand voice, and could reference our existing articles for technical accuracy.
The solution: Context building, not prompt engineering
Our key insight was this: building better agents isn’t about writing better prompts, it’s about building better context.
Traditional prompt engineering tries to cram everything into the user’s message:
# Traditional approach - everything in the prompt
user_message = f"""
Write an article about Redis caching.
Here are our brand guidelines: {fifty_pages_of_guidelines}
Here are code examples: {hundreds_of_examples}
Here's our style guide: {detailed_style_guide}
Now write the article...
"""
This approach has fundamental problems:
- Token limits: You hit model context windows quickly or run into context rot
- Cost: Every query pays for the same massive context
- Maintenance: Updating guidelines means updating every prompt
- Relevance: Most of that context isn’t relevant to the specific request
Instead, we built a system that separates static context from dynamic context:
graph LR
    A[User Request] --> B[Agent]
    C[Static Knowledge<br/>Guidelines, Templates<br/>Brand Voice] --> B
    D[Dynamic Retrieval<br/>Relevant Docs<br/>Code Examples] --> B
    B --> E[Generated Response]
    style A fill:#D0F302,stroke:#000,color:#000
    style B fill:#6046FF,stroke:#000,color:#fff
    style C fill:#fff,stroke:#000,color:#000
    style D fill:#fff,stroke:#000,color:#000
    style E fill:#D0F302,stroke:#000,color:#000
- Static Knowledge: Injected once into the agent’s instructions—brand voice, writing standards, personas
- Dynamic Retrieval: Retrieved on-demand using RAG (Retrieval-Augmented Generation)—relevant documentation, code examples, technical details
This hybrid approach gives us the best of both worlds: consistency from static context and freshness from dynamic retrieval.
Architecture overview: The big picture
Before diving into implementation details, let’s understand the system architecture:
graph LR
    A[GitHub Repos<br/>blog, docs, etc.] --> B[Daily Ingestion<br/>Cron Job]
    B --> C[ChromaDB<br/>Vector Database]
    D[Static Knowledge<br/>Rules & Guidelines] --> E[Google ADK Agent]
    C --> F[RAG Query Engine]
    F --> E
    E --> G[Web Interface]
    E --> H[CLI Interface]
    style A fill:#fff,stroke:#000,color:#000
    style B fill:#D0F302,stroke:#000,color:#000
    style C fill:#6046FF,stroke:#000,color:#fff
    style D fill:#fff,stroke:#000,color:#000
    style E fill:#6046FF,stroke:#000,color:#fff
    style F fill:#D0F302,stroke:#000,color:#000
    style G fill:#fff,stroke:#000,color:#000
    style H fill:#fff,stroke:#000,color:#000
The system has five main components:
- Document Ingestion: Daily cron job that syncs Git repositories, processes markdown files, and updates embeddings in ChromaDB
- ChromaDB Vector Database: Stores document embeddings for semantic search
- RAG Query Engine: Retrieves relevant content using vector similarity search
- Google ADK Agent: Orchestrates responses using static knowledge and dynamic retrieval
- User Interfaces: Web and CLI interfaces for interaction
Let’s explore each component in detail.
Component 1: ChromaDB Vector Database
ChromaDB is an open-source embedding database designed for building AI applications with semantic search capabilities. Think of it as a specialized database that stores text alongside vector representations (embeddings) that capture semantic meaning.
Why vector databases?
Traditional keyword search fails with semantic queries:
- Query: “How do I cache data for better performance?”
- Keyword match: Looks for exact words like “cache”, “data”, “performance”
- Semantic match: Understands concepts like caching, Redis, performance optimization, CDNs
Vector databases solve this by:
- Converting text to high-dimensional vectors (embeddings) that capture meaning
- Finding similar content using vector similarity (cosine distance)
- Returning results ranked by semantic relevance
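Here is a minimal, self-contained sketch of what that looks like against a local in-memory ChromaDB collection (illustrative only; the production setup described below runs ChromaDB as its own Upsun application):

import chromadb

client = chromadb.Client()  # in-memory client, fine for a quick experiment
collection = client.get_or_create_collection(name="demo_docs")

collection.add(
    ids=["1", "2"],
    documents=[
        "Redis is an in-memory data store often used to cache expensive queries.",
        "Our CDN configuration controls how static assets are cached at the edge.",
    ],
)

# Semantic search: no exact keyword overlap is required
results = collection.query(
    query_texts=["How do I cache data for better performance?"],
    n_results=2,
)
print(results["documents"][0])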
ChromaDB Setup on Upsun
We deploy ChromaDB as a separate application on Upsun. Here's the configuration from our .upsun/config.yaml:
applications:
  chroma:
    type: "python:3.12"
    source:
      root: "chroma"
    dependencies:
      python3:
        uv: "*"
    hooks:
      build: |
        uv init
        uv add chromadb
    web:
      commands:
        start: "uv run --no-sync chroma run --host 0.0.0.0 --port $PORT --path /app/.db"
    mounts:
      ".db":
        source: "storage"
        source_path: "db"
      ".chroma":
        source: "storage"
        source_path: "chroma"
Key aspects of this configuration:
- Python 3.12: ChromaDB requires Python 3.11+
- UV package manager: Fast, modern Python dependency management
- Persistent storage: The mounts section ensures embeddings survive deployments
- Dynamic port: Upsun automatically assigns the $PORT variable
Document schema
Each document chunk in ChromaDB has this structure:
{
    "id": "a1b2c3d4e5f6g7h8",  # MD5 hash (16 chars)
    "document": "Actual text content of the chunk...",
    "embedding": [0.123, -0.456, ...],  # 1536-dimensional vector
    "metadata": {
        # File metadata
        "file_path": "/path/to/documents/repo/file.md",
        "file_name": "redis-caching.md",
        "directory": "posts",

        # Repository metadata
        "repository": "upsun-devcenter",
        "relative_path": "dev/content/posts/redis-caching.md",

        # Frontmatter metadata (from YAML)
        "title": "Optimizing Performance with Redis Caching",
        "description": "Learn how to implement Redis caching...",
        "date": "2024-01-01",
        "author": "Platform Team",
        "tags": "performance,redis,caching",

        # Chunk metadata
        "chunk_index": 0,
        "total_chunks": 3,
        "chunk_token_count": 1200,

        # Ingestion metadata
        "ingestion_timestamp": "2024-01-15T10:30:00"
    }
}
This rich metadata makes it possible to add criteria when querying the database:
- Filtering: “Only search documentation, not blog posts”
- Attribution: “This content came from article X”
- Recency: “Prioritize recently updated content”
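For example, a filtered query that only searches official documentation might look like this (a sketch assuming a collection handle like the one used in the ingestion code; the where clause is standard ChromaDB filtering and the repository value matches the metadata shown above):

results = collection.query(
    query_texts=["redis cache configuration"],
    n_results=5,
    # Filtering: only search official documentation, not blog posts
    where={"repository": "platformsh-docs"},
    include=["documents", "metadatas", "distances"],
)
for meta, dist in zip(results["metadatas"][0], results["distances"][0]):
    print(dist, meta["relative_path"])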
Component 2: Document ingestion system
The ingestion system is responsible for keeping ChromaDB synchronized with the Git repositories that hold most of our technical content (documentation, the devcenter, blogs). It runs daily via an Upsun cron job and processes markdown files with intelligent chunking.
The ingestion flow
graph LR
    A[Cron Trigger<br/>3 AM UTC] --> B[Clone/Pull<br/>Git Repos]
    B --> C[Detect<br/>Changed Files]
    C --> D[Extract<br/>Frontmatter]
    D --> E[Chunk Text<br/>1500 tokens]
    E --> F[Generate<br/>Embeddings]
    F --> G[Store in<br/>ChromaDB]
    G --> H[Save State]
    style A fill:#D0F302,stroke:#000,color:#000
    style B fill:#fff,stroke:#000,color:#000
    style C fill:#fff,stroke:#000,color:#000
    style D fill:#fff,stroke:#000,color:#000
    style E fill:#fff,stroke:#000,color:#000
    style F fill:#6046FF,stroke:#000,color:#fff
    style G fill:#6046FF,stroke:#000,color:#fff
    style H fill:#D0F302,stroke:#000,color:#000
SSH authentication for private repositories
One challenge was securely accessing private GitHub repositories. We solved this with SSH key authentication:
class GitRepositoryManager:
    def setup_ssh_key_file(self) -> bool:
        """Create temporary SSH key file from environment variable."""
        key_content = os.getenv("GITHUB_KEY").strip()
        key_content = key_content.replace('\\n', '\n')

        # Validate BEGIN/END format
        if not (key_content.startswith('-----BEGIN') and
                key_content.endswith('-----')):
            return False

        # Create temporary file with proper permissions (0400)
        with tempfile.NamedTemporaryFile(mode='w', delete=False) as temp_file:
            temp_file.write(key_content)
            self.temp_key_file = temp_file.name
        os.chmod(self.temp_key_file, stat.S_IRUSR)
        return True

    def clone_or_update_repo(self, repo_config: RepositoryConfig):
        """Clone or pull repository with SSH authentication."""
        git_env = os.environ.copy()
        if self.temp_key_file:
            ssh_command = f'ssh -i {self.temp_key_file} -o StrictHostKeyChecking=no'
            git_env['GIT_SSH_COMMAND'] = ssh_command

        if repo_path.exists():
            # Pull updates and track changes
            repo = git.Repo(repo_path)
            old_commit = repo.head.commit.hexsha
            with repo.git.custom_environment(**git_env):
                origin = repo.remotes.origin
                origin.pull()
            new_commit = repo.head.commit.hexsha
            changed_files = self._get_changed_files(repo, old_commit, new_commit)
        else:
            # Clone new repository
            repo = git.Repo.clone_from(repo_config.url, repo_path, env=git_env)
The SSH key is stored as an environment variable in Upsun, and the Git commands use a custom environment to point at that key. The script tracks commit hashes to detect changes and make sure we only ingest new content.
Intelligent text chunking
Not all text chunks are created equal. We use a token-aware chunking strategy that balances context preservation with embedding model limits:
class DocumentProcessor:
    def __init__(self, chunk_size: int = 1500, chunk_overlap: int = 200):
        self.chunk_size = chunk_size  # tokens
        self.chunk_overlap = chunk_overlap  # tokens
        self.tokenizer = tiktoken.get_encoding("cl100k_base")

    def chunk_text(self, text: str, metadata: Dict, file_path: Path) -> List[DocumentChunk]:
        """Split text into overlapping chunks optimized for embeddings."""
        # Tokenize entire document
        tokens = self.tokenizer.encode(text)

        # If shorter than chunk_size, return single chunk
        if len(tokens) <= self.chunk_size:
            return [self._create_single_chunk(text, tokens, metadata, file_path)]

        chunks = []
        chunk_index = 0

        # Create sliding window chunks with overlap
        for start in range(0, len(tokens), self.chunk_size - self.chunk_overlap):
            end = min(start + self.chunk_size, len(tokens))
            chunk_tokens = tokens[start:end]
            chunk_text = self.tokenizer.decode(chunk_tokens)

            # Skip very small end chunks (< 100 tokens)
            if len(chunk_tokens) < 100:
                break

            chunk_id = self._generate_chunk_id(file_path, chunk_index, chunk_text)
            chunks.append(DocumentChunk(
                content=chunk_text,
                metadata={
                    **metadata,
                    'chunk_index': chunk_index,
                    'chunk_token_count': len(chunk_tokens),
                    'chunk_start_token': start,
                    'chunk_end_token': end
                },
                chunk_id=chunk_id,
                file_path=str(file_path),
                chunk_index=chunk_index
            ))
            chunk_index += 1

        return chunks
Chunking strategy explained:
Chunk size: 1500 tokens (~6000 characters)
- Large enough to preserve context
- Small enough for embedding model limits
- Roughly 2-3 paragraphs of technical content
Overlap: 200 tokens (~800 characters)
- Ensures concepts split across chunks remain findable
- Provides context continuity
- Helps with concepts mentioned near chunk boundaries
Token encoding: cl100k_base
- Same tokenizer used by GPT-3.5 and GPT-4
- Ensures accurate token counting
- Prevents embedding model errors
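To get a feel for these numbers, here is a small, self-contained sketch that mirrors the sliding-window logic above and reports how many chunks a document would produce:

import tiktoken

CHUNK_SIZE, CHUNK_OVERLAP = 1500, 200
tokenizer = tiktoken.get_encoding("cl100k_base")

def count_chunks(text: str) -> int:
    """Mirror the chunker's sliding window to estimate the chunk count."""
    tokens = tokenizer.encode(text)
    if len(tokens) <= CHUNK_SIZE:
        return 1
    count = 0
    for start in range(0, len(tokens), CHUNK_SIZE - CHUNK_OVERLAP):
        window = tokens[start:start + CHUNK_SIZE]
        if len(window) < 100:  # same small-tail cutoff as the chunker
            break
        count += 1
    return count

print(count_chunks("Redis caching on Upsun. " * 1000))  # repeated filler just to produce a long document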
Incremental updates
Efficiency matters. We only process files that have changed:
class ChromaDBManager:
    def replace_file_chunks(self, file_path: str, chunks: List[DocumentChunk]) -> bool:
        """Replace all chunks for a file: delete old, add new."""
        # Step 1: Query for existing chunks
        results = self.collection.query(
            query_texts=["dummy"],
            n_results=1000,
            where={"file_path": file_path},
            include=["metadatas"]
        )

        # Step 2: Delete old chunks (query returns a list of ID lists)
        if results["ids"] and results["ids"][0]:
            self.collection.delete(ids=results["ids"][0])

        # Step 3: Add new chunks with embeddings
        self.collection.upsert(
            documents=[chunk.content for chunk in chunks],
            metadatas=[self._prepare_metadata(chunk.metadata) for chunk in chunks],
            ids=[chunk.chunk_id for chunk in chunks]
        )
        return True
This approach queries ChromaDB for the file's existing chunks, deletes the old chunks in a single operation, then adds the new chunks with fresh embeddings. The state is tracked in repo_state.json so unchanged files can be skipped on the next run.
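A minimal sketch of what that state tracking can look like (the actual repo_state.json schema in our project is richer; this only shows the commit-hash check):

import json
from pathlib import Path

STATE_FILE = Path("repo_state.json")

def load_state() -> dict:
    # Map of repository name -> last ingested commit hash
    return json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}

def save_state(state: dict) -> None:
    STATE_FILE.write_text(json.dumps(state, indent=2))

def needs_ingestion(repo_name: str, head_commit: str, state: dict) -> bool:
    # Re-process a repository only when its HEAD moved since the last run
    return state.get(repo_name) != head_commit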
Component 3: RAG (Retrieval-Augmented Generation)
RAG is the secret sauce that makes our agent domain-aware. Instead of relying solely on the model’s training data, we dynamically retrieve relevant information from our vector database.
How RAG Works
graph LR
    A[User Query:<br/>'Redis caching'] --> B[Generate<br/>Keywords]
    B --> C[Create<br/>Embeddings]
    C --> VRP
    subgraph VRP[Vector Retrieval Pipeline]
        D[Vector<br/>Similarity Search]
        E[Rank by<br/>Relevance]
        F[Format<br/>as Context]
        D --> E
        E --> F
    end
    F --> G[Agent Response]
    style A fill:#D0F302,stroke:#000,color:#000
    style B fill:#fff,stroke:#000,color:#000
    style C fill:#6046FF,stroke:#000,color:#fff
    style D fill:#6046FF,stroke:#000,color:#fff
    style E fill:#fff,stroke:#000,color:#000
    style F fill:#D0F302,stroke:#000,color:#000
    style G fill:#D0F302,stroke:#000,color:#000
Two-Stage RAG Implementation
Our RAG system uses a two-stage approach:
Stage 1: Keyword Generation
Instead of searching with the raw user query, we use OpenAI to generate optimal search keywords:
def _generate_keywords(self, user_request: str) -> List[str]:
    """Use OpenAI to generate optimal search keywords."""
    response = self.openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": """Generate 3-5 specific search keywords for finding
                relevant documentation. Focus on technical terms and concepts."""
            },
            {
                "role": "user",
                "content": user_request
            }
        ]
    )
    # Parse keywords from response
    keywords = self._parse_keywords(response.choices[0].message.content)
    return keywords
For example, the query “How do I cache data?” might generate keywords:
redis caching
upsun cache configuration
performance optimization
cache eviction policies
This improves retrieval quality significantly compared to searching with the raw query.
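For completeness, here is one way the keyword-parsing step can be implemented (a sketch; the _parse_keywords helper in our project may differ):

def parse_keywords(raw: str) -> list[str]:
    """Turn the model's response into a clean list of search keywords."""
    keywords = []
    for line in raw.replace(",", "\n").splitlines():
        # Drop list markers and numbering, keep the keyword itself
        keyword = line.strip().lstrip("-*0123456789. ").strip()
        if keyword:
            keywords.append(keyword.lower())
    return keywords[:5]  # cap at 5, matching the system prompt

print(parse_keywords("1. redis caching\n2. upsun cache configuration"))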
Stage 2: Vector Retrieval
For each keyword, we query ChromaDB:
def _retrieve_chunks(self, keyword: str, n_results: int) -> List[Dict]:
    """Query ChromaDB with keyword embedding."""
    results = self.collection.query(
        query_texts=[keyword],
        n_results=n_results,
        include=["documents", "metadatas", "distances"]
    )
    return self._process_results(results)
Under the hood, ChromaDB:
- Creates an embedding for the keyword using OpenAI's text-embedding-3-small
- Computes cosine similarity against all document embeddings
- Returns the top N most similar chunks
- Ranks results by similarity score (distance)
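For this to work, the collection needs to be configured with the same embedding model used at ingestion time, so queries and documents land in the same vector space. A minimal sketch (our own helpers may wire this up differently):

import os
import chromadb
from chromadb.utils import embedding_functions

# Use the same embedding model as the ingestion pipeline
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.environ["OPENAI_API_KEY"],
    model_name="text-embedding-3-small",
)

client = chromadb.Client()  # swap for the HTTP client when talking to the chroma app
collection = client.get_or_create_collection(
    name="upsun_documents",
    embedding_function=openai_ef,
)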
Formatting retrieved context
Raw chunks aren’t usable directly. We format them into structured markdown that the agent can understand:
def _format_as_markdown(self, results: List[Dict]) -> str:
    """Format retrieved chunks as structured markdown."""
    output = "## Retrieved Upsun Content\n\n"

    for result in results:
        output += f"### {result['metadata']['title']}\n"
        output += f"**Source**: {result['metadata']['repository']}\n"
        output += f"**File**: {result['metadata']['file_name']}\n\n"
        output += f"{result['document']}\n\n"
        output += "---\n\n"

    return output
This gives the agent the source of the information (metadata), the actual content, and a clear separation between sources.
RAG in action: An example
Let’s trace a real query through the system:
User Query: “Write an article about Redis caching on Upsun”
Generated Keywords:
[
    "redis cache configuration",
    "upsun redis service",
    "cache performance optimization",
    "redis best practices"
]
Retrieved Chunks (top 6 across all keywords):
- Redis service configuration from platformsh-docs
- Caching strategies article from devcenter
- Redis performance tuning from blog post
- Cache eviction policies documentation
- Redis connection pooling example
- Real-world Redis implementation case study
Formatted Context (sent to agent):
## Retrieved Upsun Content
### Configuring Redis as a service
**Source**: platformsh-docs
**File**: add-services/redis.md
To add Redis to your Upsun project, add the following to your
`.upsun/config.yaml` file:
[... content continues ...]
---
### Performance optimization with caching
**Source**: upsun-devcenter
**File**: posts/caching-strategies.md
Caching is one of the most effective ways to improve application
performance...
[... content continues ...]
This context is now combined with static knowledge and the user’s request to generate the final response.
Component 4: Static knowledge base
While RAG provides dynamic and relevant content, static knowledge ensures consistency and brand alignment. Our static knowledge lives in two directories:
/knowledge/ directory
Contains foundational content injected into every agent interaction. The list of injected files can be customized per agent, since they don't all need the same knowledge.
Here are some examples from our AI-Brain internal project:
- upsun_writing_guidelines.md: Tone, style, voice
- upsun_technical_context.md: Platform capabilities, terminology
- content_templates.md: Article structures, formats
- personas.md: Target audience personas
- icp.md: Ideal customer profile
- messaging_positioning.md: Value propositions, key messages
- talking_points.md: Common themes and narratives
/rules/ directory
Contains operational rules and constraints:
- content-creator.md: Role definition and responsibilities
- format.md: Formatting requirements
- styleguide.md: Language conventions
- upsun.md: Additional rules and guidelines about the brand
Loading static knowledge
The agent loads this content once at initialization:
def _load_knowledge_base() -> str:
    """Load all static knowledge files from knowledge/ directory."""
    knowledge_path = Path(__file__).parent.parent.parent.parent.parent / "knowledge"
    knowledge_content = ""

    knowledge_files = [
        "upsun_writing_guidelines.md",
        "upsun_technical_context.md",
        "content_templates.md",
        "personas.md",
        "icp.md",
        "messaging_positioning.md",
        "talking_points.md"
    ]

    for file_name in knowledge_files:
        file_path = knowledge_path / file_name
        if file_path.exists():
            with open(file_path, 'r', encoding='utf-8') as f:
                knowledge_content += f"## {file_name}\n\n{f.read()}\n\n"

    return knowledge_content
This content (~50KB) is loaded once and reused across all queries, providing consistency, efficiency and authority. Every response follows the same guidelines and the agent “knows” company standards inherently.
Component 5: Google ADK Agent
Now we bring everything together with Google’s Agent Development Kit (ADK), a framework for building production-ready AI agents.
Why did we choose Google ADK?
Google ADK stood out for several compelling reasons. First, it’s model-agnostic, meaning we can work with OpenAI, Anthropic, Google, or even local models without changing our core implementation. The framework excels at tool integration, making it trivial to add custom capabilities like our RAG research function.
For complex workflows, ADK supports multi-agent orchestration, allowing multiple specialized agents to collaborate seamlessly. We also appreciated the deployment flexibility. The same agent code can be deployed via CLI, web UI, or API without modification. Finally, ADK handles conversation management automatically, taking care of state, history, and context tracking so we can focus on building great agent experiences rather than infrastructure.
Note: ADK currently ties its built-in web_search tool to Gemini models. Fortunately, a lot of other tools like Tavily can replace it and work with any model.
Agent structure
Here’s the core agent definition:
from google.adk.agents import Agent
from google.adk.models.lite_llm import LiteLlm

# Step 1: Load static knowledge
knowledge_base = _load_knowledge_base()

# Step 2: Initialize ChromaDB connection
chroma_client = create_chromadb_client(
    collection_name="upsun_documents"
)

# Step 3: Create query engine for RAG
query_engine = create_query_engine(chroma_client)

# Step 4: Define RAG research tool
def research_tool(user_request: str) -> str:
    """Research tool that uses RAG to retrieve relevant content."""
    if not query_engine:
        return "## ChromaDB not available\n\nRAG functionality unavailable."
    return query_engine.research_topic(user_request, max_results=6)

# Step 5: Compose instruction with static knowledge
instruction = f"""You are the DevCenter Writer, a specialized technical writing agent
for the Upsun marketing team with advanced RAG capabilities.

KNOWLEDGE BASE:
{knowledge_base}

RAG CAPABILITIES:
You have access to a ChromaDB vector database containing Upsun blog posts, articles,
and documentation. You can use the research_tool to:
1. Generate relevant research keywords using OpenAI based on user requests
2. Query the ChromaDB vector database for related Upsun content
3. Retrieve and integrate relevant information into your responses

Always use the research_tool when you need to find specific information about Upsun
features, best practices, or existing content.

Your primary responsibilities:
1. Create comprehensive technical tutorials for the Upsun DevCenter
2. Write clear, step-by-step implementation guides
3. Develop best practices documentation
4. Ensure content follows Upsun's technical writing standards

When writing content:
- Use retrieved documentation to ensure technical accuracy
- Follow the brand voice and guidelines from your knowledge base
- Structure content according to provided templates
- Include practical code examples and real-world scenarios
- Always validate technical details against retrieved documentation
"""

# Step 6: Create the agent
root_agent = Agent(
    name="devcenter_writer",
    model=LiteLlm(model="anthropic/claude-sonnet-4-5-20250929"),
    description="Specialized technical writing agent for creating Upsun DevCenter content",
    instruction=instruction,
    tools=[research_tool]  # RAG tool available to agent
)
And yes, some parts of this article were written with that agent.
Note: Because we use LiteLLM as an abstraction, it is straightforward to use multiple models depending on the agent, or even inside one specific agent, so models can review and validate each other.
The power of context building
Notice what we’ve achieved here. The agent has:
Static Context (in instruction):
- 50KB of brand guidelines, templates, personas
- Loaded once, available for every query
- Ensures consistency across all responses
Dynamic Context (via research_tool):
- Retrieves relevant docs on-demand
- Only fetches what's needed for each query
- Provides fresh, accurate technical information
Tool Access (via the tools parameter):
- Can call research_tool when needed
- Agent decides when to retrieve additional context
- Autonomous research capability
This separation of concerns is what makes the system maintainable and scalable.
Agent execution flow
When a user makes a request, here’s what happens:
graph TD
    A[User Request] --> B[ADK Framework]
    B --> C[Agent Receives<br/>Request]
    C --> D{Needs More<br/>Context?}
    D -->|Yes| E[Call research_tool]
    E --> F[Generate Keywords]
    F --> G[Query ChromaDB]
    G --> H[Format Results]
    H --> I[Combine Context]
    D -->|No| I
    I --> J[Generate Response<br/>with Claude]
    J --> K[Return to User]
    style A fill:#D0F302,stroke:#000,color:#000
    style E fill:#6046FF,stroke:#000,color:#fff
    style F fill:#fff,stroke:#000,color:#000
    style G fill:#6046FF,stroke:#000,color:#fff
    style H fill:#fff,stroke:#000,color:#000
    style I fill:#D0F302,stroke:#000,color:#000
    style J fill:#6046FF,stroke:#000,color:#fff
    style K fill:#D0F302,stroke:#000,color:#000
The agent autonomously decides when to use the research tool based on the user’s request and its available context.
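Before wiring everything into the web UI, it can help to exercise the RAG path on its own. A minimal sketch that reuses the research_tool defined above (the sample question is just an example):

# Quick sanity check of the RAG path, bypassing the agent loop
if __name__ == "__main__":
    context = research_tool("How do I configure Redis caching on Upsun?")
    print(context[:2000])  # first part of the formatted markdown context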
Deploying on Upsun: Multi-application architecture
Here’s where Upsun’s multi-application support shines. We deploy our system as two independent applications that communicate internally.
Why multi-application?
- Separation of concerns: ChromaDB and the agent have different resource needs
- Independent scaling: Scale database separately from application
- Service isolation: ChromaDB failures don’t affect agent (and vice versa)
- Clear boundaries: Each service has its own configuration and storage
Upsun Configuration
Here's our complete .upsun/config.yaml:
applications:
  # Main AI Agent Application
  app:
    source:
      root: "/"
    type: "python:3.12"

    # UV package manager for fast installs
    dependencies:
      python3:
        uv: "*"

    # Relationship to ChromaDB application
    relationships:
      chroma: chroma:http

    # Build process using UV
    hooks:
      build: |
        # Sync dependencies from uv.lock
        uv sync --frozen

    # Persistent storage for Git repositories
    mounts:
      'documents':
        source: storage
        source_path: documents

    # Daily document ingestion cron job
    crons:
      ingest:
        spec: '0 3 * * *'  # 3 AM UTC daily
        commands:
          start: uv run python ingest_documents.py
          stop: killall python
        shutdown_timeout: 30

    # Web server configuration
    web:
      commands:
        start: "uv run --no-sync adk web src/ai_brain/agents --port=8888"

    # Environment variables
    variables:
      env:
        UV_CACHE_DIR: "/tmp/uv-cache"
        PYTHONPATH: "."

  # ChromaDB Vector Database Application
  chroma:
    type: "python:3.12"
    source:
      root: "chroma"
    dependencies:
      python3:
        uv: "*"
    hooks:
      build: |
        uv init
        uv add chromadb
    web:
      commands:
        start: "uv run --no-sync chroma run --host 0.0.0.0 --port $PORT --path /app/.db"
    # Persistent storage for ChromaDB data
    mounts:
      ".db":
        source: "storage"
        source_path: "db"
      ".chroma":
        source: "storage"
        source_path: "chroma"
    variables:
      env:
        UV_CACHE_DIR: "/tmp/uv-cache"
        PYTHONPATH: "."

# Route configuration
routes:
  "https://{default}/":
    type: upstream
    upstream: "app:http"
Key configuration insights
1. Application Relationships
relationships:
  chroma: chroma:http
This creates an internal network connection between applications. Upsun automatically provides:
- CHROMA_HOST: Internal hostname for ChromaDB
- CHROMA_PORT: Port number for ChromaDB
- No external network calls required
- Fast, secure inter-application communication
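For illustration, the create_chromadb_client helper referenced in the agent code can be as small as this sketch built on those relationship variables (our actual helper adds error handling and embedding configuration):

import os
import chromadb

def create_chromadb_client(collection_name: str):
    """Connect to the chroma app over Upsun's internal relationship."""
    client = chromadb.HttpClient(
        host=os.environ["CHROMA_HOST"],
        port=int(os.environ["CHROMA_PORT"]),
    )
    return client.get_or_create_collection(name=collection_name)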
2. Persistent Storage (Mounts)
mounts:
  'documents':
    source: storage
    source_path: documents
Mounts persist data across deployments:
- documents/: Cloned Git repositories
- .db/ and .chroma/: ChromaDB data and indexes
Without mounts, every deployment would re-clone repositories and rebuild embeddings, which would be expensive and slow.
3. Cron Jobs for Automation
crons:
  ingest:
    spec: '0 3 * * *'  # 3 AM UTC daily
    commands:
      start: uv run python ingest_documents.py
      stop: killall python
    shutdown_timeout: 30
Automated daily ingestion:
- Runs at 3 AM UTC (low-traffic window)
- Pulls latest changes from Git
- Updates ChromaDB incrementally
- Graceful shutdown with timeout
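The stop command sends SIGTERM and shutdown_timeout gives the script 30 seconds to wrap up. A minimal, self-contained sketch of how an ingestion loop can honor that signal (the file list and sleep are stand-ins, not our actual script):

import signal
import sys
import time

shutting_down = False

def _handle_term(signum, frame):
    # The cron's stop command sends SIGTERM; flag it so the loop can exit cleanly
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, _handle_term)

def run_ingestion(changed_files):
    for path in changed_files:
        if shutting_down:
            print("Shutdown requested, stopping after the current file")
            sys.exit(0)
        print(f"Processing {path}")
        time.sleep(1)  # stand-in for chunking + embedding work

run_ingestion(["docs/redis.md", "docs/caching.md"])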
4. UV Package Manager
dependencies:
  python3:
    uv: "*"
hooks:
  build: |
    uv sync --frozen
You can find more information on UV in our dedicated article. We use the following arguments:
- --frozen: Fails if dependencies don't match the lockfile
- --no-sync: Prevents write attempts at runtime
5. Environment Variables
Set via upsun variable:create:
# OpenAI API key for embeddings
upsun variable:create --level environment --name OPENAI_API_KEY --value "sk-..."
# Anthropic API key for the LLM (Claude)
upsun variable:create --level environment --name ANTHROPIC_API_KEY --value "..."
# GitHub SSH private key for repository access
upsun variable:create --level environment --name GITHUB_KEY --value "-----BEGIN OPENSSH PRIVATE KEY-----
...
-----END OPENSSH PRIVATE KEY-----"
These are automatically available to both applications and remain secure.
Performance and cost
Real-world performance from our production system:
Ingestion performance
- Documents Processed: 243 files
- Total Chunks Created: ~1,200 chunks
- Processing Time: 8-15 minutes (depends on OpenAI API latency)
- Incremental Updates: Only changed files processed (typically 1-5 files/day)
- Storage Size: ~5-10 MB for embeddings
- Cost: ~$0.10-0.20 per full ingestion (OpenAI embedding costs)
RAG Query Performance
- Keyword Generation: ~500ms (OpenAI API)
- Vector Similarity Search: ~50-100ms per keyword (ChromaDB)
- Total Retrieval Time: ~1-2 seconds for 3 keywords
- Results Returned: 6 most relevant chunks (~3,000-5,000 tokens)
- Cost: ~$0.01 per query (keyword generation + retrieval)
Agent response performance
- Static Knowledge Loading: ~50ms (cached after first load)
- RAG Retrieval: ~1-2 seconds
- LLM Generation: ~5-15 seconds (depends on output length)
- Total Response Time: ~8-20 seconds for complete article
- Cost: ~$0.05-0.15 per article (depends on length)
These are acceptable performance characteristics for an internal tool.
If you need faster responses, consider several optimization strategies: caching frequent queries to avoid redundant processing, pre-generating embeddings for common topics that are queried regularly, using faster models like GPT-3.5-turbo for keyword generation while keeping higher-quality models for final content generation, and implementing streaming responses to provide incremental results as they’re generated rather than waiting for complete output.
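As a small example of the first idea, an in-process cache in front of the RAG tool lets repeated identical requests skip keyword generation and vector search entirely (a sketch; a production cache would also want expiry and invalidation):

from functools import lru_cache

@lru_cache(maxsize=256)
def cached_research(user_request: str) -> str:
    # Delegates to the research_tool defined earlier; identical requests hit the cache
    return research_tool(user_request)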
Best practices and lessons learned
After building and deploying this system, here are our key takeaways:
Context building beats prompt engineering
The most important lesson we learned is to separate concerns rather than cramming everything into prompts. Static context like brand guidelines and standards should be injected once in agent instructions, while dynamic context from documentation and examples should be retrieved on-demand via RAG. This clear separation scales to unlimited documentation without hitting token limits, and makes the system far more maintainable than traditional prompt engineering approaches.
Chunking strategy matters
Through experimentation, we found that chunk size dramatically affects retrieval quality. Chunks that are too small (500 tokens) fragment concepts and produce poor results, while chunks that are too large (3000 tokens) lose precision and retrieve irrelevant sections. Our sweet spot of 1500 tokens balances context preservation with search precision. Equally critical is the 200-token overlap between chunks, which prevents concepts split across boundaries from being lost during retrieval.
Two-stage RAG improves quality
While direct query-to-embedding search works, adding keyword generation as a first stage significantly improves results. Using a language model to generate 3-5 targeted search keywords from the user’s request improves recall by finding content even with different phrasing, enables better semantic matching since keywords capture intent more precisely, and produces more focused results through technical terminology.
Rich metadata pays dividends
Investing in rich metadata during ingestion enables powerful filtering capabilities later. You can search only official documentation versus blog posts, prioritize recently updated content, filter by content type like tutorials or guides, or query by tags for specific topics. Don’t skimp on metadata during ingestion—it’s one of the highest-value investments you can make in your RAG system.
Incremental processing saves resources
Full re-ingestion of our 243 documents takes 15-20 minutes and costs $0.10-0.20 in API fees. By implementing incremental updates that only process changed files and track state with Git commit hashes, we typically process just 1-5 files per day, saving 95% of processing time and costs. This makes daily updates practical and economical.
Monitor and iterate
Not all embeddings perform equally well. Continuously monitor similarity scores to understand retrieval relevance, gather user feedback on whether the agent finds the right content, and track false negatives where queries return no useful results. Use this feedback loop to refine your chunking strategy, adjust metadata, and improve keyword generation prompts.
Static knowledge ensures consistency
Without static knowledge injected into the agent’s instructions, outputs vary unpredictably—writing style shifts between responses, brand guidelines get missed, target audience considerations are forgotten, and formatting becomes inconsistent. Static knowledge provides the “personality” and standards that make every output consistent and on-brand, regardless of what dynamic content gets retrieved.
The future of internal AI agents
Building AI-Brain taught us that successful AI agents aren’t about having the best model. They are about building the best context.
By separating static knowledge (brand voice, guidelines, standards) from dynamic retrieval (technical documentation, code examples), we created an agent that writes like our team by following brand guidelines consistently, stays technically accurate by retrieving current documentation on demand, scales effortlessly to handle unlimited documentation without hitting token limits, and remains maintainable because knowledge files can be updated independently without touching agent code.
This architecture adapts to any domain. For customer support, combine static response guidelines with dynamic knowledge base retrieval. For code generation, pair static coding patterns with dynamic code search across your repositories. For data analysis, merge static methodologies with dynamic data retrieval. For documentation, unite static writing standards with dynamic content search across your docs.
The tools are remarkably accessible. Google ADK provides a free, open-source agent framework that handles orchestration. ChromaDB offers a free, open-source vector database for semantic search. OpenAI and Claude operate on pay-per-use models with no infrastructure requirements. Upsun delivers a managed platform that eliminates DevOps complexity entirely.
Your next steps:
- Identify a use case: What internal task could benefit from AI assistance?
- Gather static knowledge: What guidelines should the agent always follow?
- Identify dynamic sources: What documentation should be searchable?
- Start small: Build a prototype with 10-20 documents
- Iterate: Add more documents, refine prompts, improve retrieval
- Deploy: Use Upsun’s cloud application platform for production
Resources:
- Google ADK Documentation
- ChromaDB Documentation
- OpenAI Embeddings Guide
- Upsun Multi-App Guide
- UV Package Manager
Questions? Join our Discord.