If you're a software engineer starting to build with AI, you've probably sat in meetings where people casually throw around terms like "LangChain," "embeddings," or "Hugging Face." If you've nodded along while mentally cataloging what to research later, you're not alone.
The AI space has its own vocabulary, and the jargon can make you feel like an impostor. The good news is that the concepts aren't that complicated once you understand them.
This article will help you get comfortable with the terms so you can participate in conversations, ask better questions, and understand what people are actually proposing when they suggest using RAG instead of fine-tuning.
Let's decode the vocabulary. The only assumption is that you've used AI tools like Claude, ChatGPT, or Gemini.
Models are trained AI systems that take inputs and produce outputs based on patterns learned from data. A spam filter is a model; an LLM is a model; an image classifier is a model. The term is broad. When someone says "the model," they're referring to the specific AI system being discussed.
Foundation models are large, general-purpose models trained on massive datasets. They can handle many different types of tasks. Think GPT-5, Claude, Llama.
Specialized models are smaller models trained to do one specific job (spam detection, sentiment analysis, image classification).
Foundation models are more expensive to run but versatile; specialized models are cheaper, faster, and often better at their one thing.
Frontier models are marketing speak for the newest, most capable foundation models available right now. When someone mentions frontier models, they mean the latest GPT, Claude, or Gemini release.
LLMs (Large Language Models) are foundation models that work with text by predicting what comes next in a sequence. Everything else they do, like chatting, coding, or analysis, emerges from that core prediction capability.
Generative vs. discriminative - Generative models create new content (LLMs generating text, DALL-E creating images). Discriminative models make decisions about existing content (spam filters, fraud detection). If someone says "we need a generative solution," they want something that produces output, not just classifies it.
Multimodal models handle multiple types of input and output - text, images, audio. GPT-5 with vision or Claude with image support can receive an image and respond with text, instead of just text-in/text-out.
Model repositories like Hugging Face are where you find both foundation and specialized models. Think GitHub for ML models. You can download pre-trained models, fine-tuned variants, and specialized models for specific tasks without training your own from scratch.
Prompts are just the input you send to a model. The instructions, questions, or context. If an API endpoint expects JSON, a prompt is what you put in that JSON. The model reads your prompt and generates a response.
Tokens are common fragments learned during training, not individual letters or always complete words. A token might be a whole word like "the," part of a word like "run" (from "running"), or even punctuation. Models work with these small chunks because it's more efficient than processing individual characters or keeping a vocabulary entry for every possible word. This matters because pricing, speed, and limits are all measured in tokens, not words or characters.
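To see tokenization in action, here's a minimal sketch using OpenAI's tiktoken library (an assumption about your stack; other providers ship their own tokenizers):

```python
# Count tokens with tiktoken (assumes: pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # an encoding used by several OpenAI models
tokens = enc.encode("Tokenization splits text into common fragments.")
print(len(tokens))                          # token count - what pricing and limits measure
print([enc.decode([t]) for t in tokens])    # the text fragment behind each token ID
```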
Context window is how much text (measured in tokens) a model can "remember" in a single conversation. If the context window is 128k tokens, that's roughly 96,000 words of conversation history plus your current prompt. Once you exceed it, older messages get dropped. Like a fixed-size buffer.
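If you manage conversation history yourself, staying inside the window works like maintaining that fixed-size buffer. A hypothetical sketch, where count_tokens stands in for a real tokenizer:

```python
# Drop the oldest messages until the history fits the model's context window.
# count_tokens is a hypothetical helper wrapping whichever tokenizer you use.
def trim_history(messages: list[str], count_tokens, max_tokens: int = 128_000) -> list[str]:
    while messages and sum(count_tokens(m) for m in messages) > max_tokens:
        messages.pop(0)  # oldest message falls off first, like a fixed-size buffer
    return messages
```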
Temperature controls randomness in responses. Low temperature (0.0-0.3) makes outputs more focused and deterministic - useful for structured tasks like code generation. High temperature (0.7-1.0) makes outputs more creative and varied - useful for brainstorming or creative writing.
Top-p (nucleus sampling) is another randomness control that limits which possible next tokens the model considers. Lower top-p means more predictable outputs; higher means more diversity. You typically adjust either temperature or top-p, not both.
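Both knobs are just request parameters. A sketch using the OpenAI Python client (the client setup and model name are assumptions; other providers expose the same settings under similar names):

```python
# Temperature and top_p are per-request settings (assumes: pip install openai,
# OPENAI_API_KEY in the environment; the model name here is an assumption).
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Suggest a name for a logging library."}],
    temperature=0.2,  # low: focused, near-deterministic output
    top_p=1.0,        # leave top_p at its default while tuning temperature
)
print(response.choices[0].message.content)
```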
System prompts vs. user prompts - System prompts set the model's behavior and role (like "you're a helpful coding assistant"). User prompts are the actual requests from users. In a chat app, you set the system prompt once in your code; user prompts come from form inputs.
Few-shot vs. zero-shot - Zero-shot means asking the model to do something without examples ("translate this to Spanish"). Few-shot means giving it examples first ("here are 3 examples of translations, now translate this"). Few-shot usually improves accuracy for structured tasks.
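Put together, a chat-style request might pair a system prompt with a few-shot pattern before the real question. A sketch using the OpenAI-style message format (other providers are similar; the examples are made up):

```python
# System prompt sets behavior; few-shot examples show the pattern; the last
# message is the actual user request. The message format follows the
# OpenAI-style chat convention, which is an assumption about your provider.
messages = [
    {"role": "system", "content": "You translate error messages into plain English."},
    # Few-shot examples
    {"role": "user", "content": "ECONNREFUSED 127.0.0.1:5432"},
    {"role": "assistant", "content": "The app couldn't reach the database server."},
    {"role": "user", "content": "TypeError: 'NoneType' object is not subscriptable"},
    {"role": "assistant", "content": "The code tried to index a value that was empty."},
    # The real request
    {"role": "user", "content": "KeyError: 'user_id'"},
]
```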
Chain-of-thought is prompting the model to show its reasoning step-by-step before giving an answer. Adding "let's think through this step by step" often improves results on complex problems because it forces the model to break down its logic.
Fine-tuning means taking a pre-trained model and continuing its training on your specific data to specialize its behavior. It's like forking a library and modifying its core logic. You're actually changing the model's weights. Fine-tuning is powerful but expensive and requires quality training data.
RAG (Retrieval-Augmented Generation) means searching your own data sources and injecting relevant results into the prompt before sending it to the model. The model itself doesn't change; you're just giving it context from your documents, database, or knowledge base. Think of it like passing query results into a function. The function doesn't change, but the data does.
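In code, the whole pattern is only a few lines. A minimal sketch where search_docs and ask_model are hypothetical helpers standing in for your vector search and your LLM API call:

```python
# Retrieval-augmented generation in miniature. search_docs() and ask_model()
# are hypothetical placeholders for your retrieval layer and LLM client.
def answer_with_rag(question: str) -> str:
    chunks = search_docs(question, top_k=3)   # retrieve relevant passages
    context = "\n\n".join(chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    return ask_model(prompt)                  # the model itself never changes
```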
Prompt engineering is crafting better instructions and examples to get the results you want without changing the model at all. It's the cheapest approach and often surprisingly effective. Most teams start here before considering RAG or fine-tuning.
Embeddings are numerical representations of text (or images, audio, etc.) that capture semantic meaning. Similar concepts have similar embeddings, even if the words are different. You generate embeddings to enable semantic search, clustering, or comparison. It's converting text into coordinates in high-dimensional space.
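Comparing embeddings usually means cosine similarity. A sketch where get_embedding is a hypothetical helper wrapping whichever embedding model or API you use:

```python
# Semantic similarity via cosine similarity of embedding vectors.
# get_embedding() is a hypothetical wrapper around your embedding model/API.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v1 = get_embedding("How do I reset my password?")
v2 = get_embedding("I forgot my login credentials")
v3 = get_embedding("Quarterly revenue grew 4%")

print(cosine_similarity(v1, v2))  # high: different words, similar meaning
print(cosine_similarity(v1, v3))  # low: unrelated topics
```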
Vector databases store and search embeddings efficiently. When you do RAG, you typically convert your documents into embeddings, store them in a vector database (Pinecone, Weaviate, Chroma), then search for relevant chunks based on similarity to the user's query. Traditional databases search exact matches; vector databases search by semantic similarity.
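As a concrete example, here's a sketch with Chroma's in-memory client, which embeds documents with its built-in default embedding function (the collection name and documents are made up):

```python
# Store chunks and search by semantic similarity (assumes: pip install chromadb).
import chromadb

client = chromadb.Client()                     # in-memory, for experimentation
docs = client.create_collection(name="docs")
docs.add(
    ids=["1", "2"],
    documents=[
        "Refunds are processed within 5 business days.",
        "Password resets require a verified email address.",
    ],
)
results = docs.query(query_texts=["how long do refunds take?"], n_results=1)
print(results["documents"][0])                 # the chunk closest in meaning to the query
```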
Training vs. inference - Training is when a model learns from data (building the model). Inference is when a trained model processes new inputs and generates outputs (using the model). As a developer integrating AI, you're almost always doing inference. Training happens once (or periodically); inference happens thousands or millions of times.
Neural networks are the general category of AI models inspired by how biological brains work. They contain layers of interconnected nodes that process information and learn patterns from data. When people say "deep learning," they mean neural networks with many layers.
Layers are the processing stages stacked in a neural network. Data flows through each layer sequentially, with each layer transforming the representation. Deep learning models have many layers (hence "deep"), and each layer learns increasingly abstract patterns from the data.
Tensors are multi-dimensional arrays of numbers and are the fundamental data structure in neural networks. They're used for everything: your input data gets converted to tensors, the model's learned parameters (weights) are stored as tensors, and intermediate calculations happen with tensors. A 1D tensor is like an array, a 2D tensor is like a matrix, and 3D and beyond work the same way.
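A quick sketch with PyTorch (NumPy arrays behave the same way for these purposes):

```python
# Tensors are just n-dimensional arrays (assumes: pip install torch).
import torch

scalar = torch.tensor(3.0)              # 0D
vector = torch.tensor([1.0, 2.0, 3.0])  # 1D, like an array
matrix = torch.rand(3, 4)               # 2D, like a matrix
batch  = torch.rand(32, 128, 768)       # 3D: e.g. batch x tokens x features
print(batch.shape)                      # torch.Size([32, 128, 768])
```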
Parameters (as in "7B parameters" or "70B parameters") are the model's learned weights - 7 billion or 70 billion individual numbers stored across tensors. When you're choosing a model, that number tells you the tradeoff: smaller models (7B, 13B) are faster, cheaper to run, and may be better for focused tasks. Larger models (70B, 175B) are slower and more expensive but generally produce more nuanced, accurate responses across a wider range of tasks.
Transformers are the architecture that powers modern LLMs. The "T" in GPT stands for Transformer. Before transformers, models processed text sequentially (one word at a time). Transformers process all words in parallel, which makes them much faster to train and better at understanding context across long passages.
APIs vs. self-hosted - Using an API (OpenAI, Anthropic, Google) means you send requests over the network and pay per use; the provider handles infrastructure. Self-hosting means running models on your own servers (AWS, Azure, on-premise). APIs are easier to start with; self-hosting gives you control and can be cheaper at scale for smaller models.
Quantization reduces a model's size and memory requirements by using lower-precision numbers for its parameters. A quantized model runs faster and uses less RAM but may be slightly less accurate. It's like compressing images. You trade some quality for significantly smaller file sizes and faster performance.
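The back-of-the-envelope math for weights alone (ignoring activations and other overhead) shows why quantization matters:

```python
# Rough memory footprint of model weights at different precisions.
params = 7_000_000_000            # a "7B" model
print(params * 2 / 1e9, "GB")     # ~14 GB at 16-bit (2 bytes per parameter)
print(params * 1 / 1e9, "GB")     # ~7 GB at 8-bit
print(params * 0.5 / 1e9, "GB")   # ~3.5 GB at 4-bit
```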
Batch vs. streaming responses - Batch means you wait for the complete response before getting anything back (traditional API behavior). Streaming sends tokens as they're generated, like Server-Sent Events. Streaming improves perceived performance in chat interfaces because users see output immediately instead of waiting for the entire response.
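Streaming is usually just a flag on the request plus a loop over chunks. A sketch with the OpenAI Python client (client setup and model name are assumptions):

```python
# Stream tokens as they're generated instead of waiting for the full response
# (assumes: pip install openai, OPENAI_API_KEY set; the model name is an assumption).
from openai import OpenAI

client = OpenAI()
stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Write a haiku about deadlines."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)  # show text immediately, like a chat UI
```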
Rate limits and quotas control how many requests you can make in a given time period. Rate limits are usually per-minute (e.g., 60 requests/min); quotas are often daily or monthly (e.g., 1M tokens/day). These exist to prevent abuse and manage infrastructure costs - similar to API throttling in traditional web services.
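The standard mitigation is retry with exponential backoff. A minimal sketch, where call_api and RateLimitError are stand-ins for your client's call and its rate-limit exception:

```python
# Retry with exponential backoff when the provider says "slow down".
# call_api() and RateLimitError are hypothetical stand-ins for your client.
import time

def call_with_backoff(max_retries: int = 5):
    for attempt in range(max_retries):
        try:
            return call_api()
        except RateLimitError:
            time.sleep(2 ** attempt)  # wait 1s, 2s, 4s, ... before retrying
    raise RuntimeError("still rate limited after retries")
```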
Hallucinations occur when models generate plausible-sounding but incorrect or fabricated information. The model isn't "lying" - it's predicting likely text based on patterns, not retrieving facts from a database. Hallucinations are a fundamental limitation of how LLMs work, not a bug to be fixed. You mitigate them with techniques like RAG (grounding responses in real data) or asking models to cite sources.
RLHF (Reinforcement Learning from Human Feedback) is a training technique where humans rate model outputs, and those ratings are used to fine-tune the model toward more helpful, harmless, and honest responses. It's why ChatGPT refuses harmful requests instead of just predicting statistically likely text. Most production LLMs use RLHF or similar techniques.
Guardrails are constraints or filters you add to prevent unwanted model behavior. They may block certain topics, validate outputs against rules, or require specific formats. These can be prompt-based ("never discuss X"), code-based (filtering responses before showing users), or model-based (built into the model through training). Think of guardrails as input validation and output sanitization for AI.
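A code-based guardrail can be as simple as validating the model's output before it reaches users. A sketch that requires the model to return JSON with a specific field (the field name is made up):

```python
# Output guardrail: reject anything that isn't well-formed JSON with an 'answer' field.
import json

def validate_output(model_output: str) -> dict:
    try:
        data = json.loads(model_output)
    except json.JSONDecodeError:
        raise ValueError("model did not return valid JSON")
    if "answer" not in data:
        raise ValueError("response is missing the required 'answer' field")
    return data
```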
Alignment refers to making models behave according to human intentions and values. A well-aligned model follows instructions accurately, refuses harmful requests, and produces outputs humans actually want. Alignment is an ongoing research problem because "what humans want" is complex and sometimes contradictory.
The AI tooling landscape breaks down into a few categories based on what stage of the process you're working in.
Training frameworks are libraries for building and training models from scratch or fine-tuning existing ones. TensorFlow (Google) and PyTorch (Meta) are the two dominant players. Think of them like different web frameworks (Django vs. Flask, or Express vs. Fastify). PyTorch has become more popular for research and experimentation; TensorFlow is often used in production. Most developers won't train models, but you'll see these names in model documentation and requirements files.
Inference libraries help you run trained models efficiently in production. ONNX Runtime lets you run models optimized for speed regardless of which framework trained them. TensorFlow Lite and PyTorch Mobile are for running models on mobile devices or edge hardware. If you're deploying a model, you're probably using one of these under the hood.
LLM application frameworks make it easier to build apps with language models. LangChain provides abstractions for common patterns like RAG, chaining multiple model calls, or managing conversation history. LlamaIndex focuses specifically on connecting LLMs to your data sources. These are like ORMs for AI. They handle boilerplate so you can focus on your application logic.
Model hubs and repositories like Hugging Face are where you find pre-trained models to download and use. Hugging Face also provides libraries (transformers, diffusers) that make it easy to load and run models from their repository. It's the npm or NuGet of the AI world.
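Loading a pre-trained model from the Hub takes a few lines with the transformers library; this sketch uses the pipeline helper, which downloads a default sentiment model on first run:

```python
# Run a pre-trained model from the Hugging Face Hub (assumes: pip install transformers).
from transformers import pipeline

classifier = pipeline("sentiment-analysis")    # downloads a default pre-trained model
print(classifier("This release fixed all my bugs!"))
# e.g. [{'label': 'POSITIVE', 'score': 0.99}]
```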
Vector databases like Pinecone, Weaviate, Chroma, and Qdrant store and search embeddings for RAG applications. Traditional databases (Postgres with pgvector, Redis) are also adding vector search capabilities, so you might not need a specialized database.
The AI field moves fast, and new terms emerge constantly. But most conversations about building with AI come back to these core concepts: what kind of model you're using, how you're interacting with it, whether you're adapting it for your needs, and how you're deploying it safely and efficiently.
You don't need to be an ML researcher to build effective AI-powered features. You need to understand enough vocabulary to have productive conversations with your team, evaluate tradeoffs, and know which rabbit holes are worth going down.
The rest is just building software, which you already know how to do.