The OpenAI Python SDK changed in November 2023 when version 1.0 shipped. If your code still uses openai.ChatCompletion.create(), you're running deprecated patterns that the 1.x releases no longer support.
The OpenAI Python SDK is the official library for accessing gpt-4o, gpt-4o-mini, embeddings, vision analysis, and assistants from Python. Version 1.x uses client instances, Pydantic response models, and async/await patterns. Both gpt-4o (the flagship model) and gpt-4o-mini (fast and cheap) offer a 128k-token context window.
You’ll learn chat completions with streaming, function calling with parallel execution, embeddings for semantic search, vision analysis, assistants with code interpreter, error handling with retries, and cost optimization.
How do I install and configure the OpenAI SDK?
Install using pip3. The SDK requires Python 3.8 or newer.
pip3 install openai
Get your API key from platform.openai.com/api-keys. Store it as an environment variable.
export OPENAI_API_KEY='sk-proj-...'
Initialize the client. The SDK provides synchronous and asynchronous clients.
from openai import OpenAI
import os
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
The client handles authentication, retries, and connection pooling. Create one client per application, not per request.
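If you want to tune that shared client, the 1.x constructor also accepts timeout and max_retries arguments. A minimal sketch (the values shown are illustrative, not recommendations):

```python
from openai import OpenAI
import os

# One shared client for the whole application.
client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY"),
    timeout=30.0,    # seconds allowed per request
    max_retries=2,   # automatic retries on connection errors and retryable status codes
)
```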
For async applications (FastAPI, asyncio), use AsyncOpenAI():
from openai import AsyncOpenAI
async_client = AsyncOpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
async def get_completion(prompt):
    response = await async_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content
How do chat completions work?
Chat completions power conversational AI and content generation. The API accepts messages with roles (system, user, assistant) and returns the model’s response.
Message structure and conversation context
Messages define conversation history. Each message has a role and content.
System messages set behavior. They define personality, constraints, and output format.
messages = [
    {"role": "system", "content": "You are a Python expert who explains concepts with code examples."},
    {"role": "user", "content": "What are decorators?"}
]
User messages represent input from the person using your application.
Assistant messages store previous responses. Include them to maintain context across turns.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is RAG?"},
    {"role": "assistant", "content": "RAG combines information retrieval with text generation..."},
    {"role": "user", "content": "How do I implement it?"}
]
Controlling model behavior with parameters
temperature (0.0-2.0) controls randomness. Lower values (0.0-0.3) produce deterministic responses. Higher values (0.7-1.5) increase creativity. Use low temperature for code generation and data extraction. Use higher temperature for creative writing.
# Deterministic code
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a fibonacci function"}],
    temperature=0.1
)

# Creative writing
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a story opening"}],
    temperature=1.2
)
max_tokens limits response length. The model stops after reaching this limit.
top_p (0.0-1.0) implements nucleus sampling. Use top_p=0.1 for focused responses or top_p=0.9 for diverse outputs. Don’t adjust both temperature and top_p.
presence_penalty (-2.0 to 2.0) reduces topic repetition. Positive values encourage exploring new topics.
frequency_penalty (-2.0 to 2.0) reduces token repetition. Positive values discourage verbatim repetition.
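A single request that combines several of these parameters looks like this (the specific values are arbitrary examples, not tuned recommendations):

```python
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize what vector databases do"}],
    temperature=0.3,        # mostly deterministic
    max_tokens=300,         # cap response length
    presence_penalty=0.5,   # nudge toward new topics
    frequency_penalty=0.5,  # discourage verbatim repetition
)
print(response.choices[0].message.content)
```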
Streaming responses for better UX
Streaming sends tokens as they're generated instead of waiting for the full completion. Users see text appear progressively, which reduces perceived latency.
Enable with stream=True. The API returns an iterator of chunks.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain RAG"}],
    stream=True
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
Handle streaming errors carefully. Network failures mid-stream leave partial responses.
def stream_completion(prompt):
    try:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            stream=True,
            timeout=30
        )
        full_response = ""
        for chunk in response:
            if chunk.choices[0].delta.content:
                content = chunk.choices[0].delta.content
                print(content, end="", flush=True)
                full_response += content
        return full_response
    except Exception as e:
        print(f"\nError: {e}")
        return None
Model comparison and pricing
Choose models based on task complexity and budget.
| Model | Cost (1M tokens in/out) | Context | Speed | Use Case |
|---|---|---|---|---|
| gpt-4o | $2.50/$10.00 | 128k | Fast | Complex reasoning, analysis |
| gpt-4o-mini | $0.15/$0.60 | 128k | Fastest | Simple tasks, high volume |
gpt-4o excels at complex reasoning. gpt-4o-mini handles simple tasks at 1/16th the cost. Test your use case with both models. Many applications use gpt-4o-mini for 80% of requests and route complex queries to gpt-4o.
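One way to implement that split is a small router that defaults to gpt-4o-mini and escalates when a heuristic flags the request as complex. A sketch (is_complex is a made-up placeholder; use whatever signal fits your application):

```python
def is_complex(prompt: str) -> bool:
    # Placeholder heuristic: treat long or multi-question prompts as complex.
    return len(prompt) > 2000 or prompt.count("?") > 2

def routed_completion(prompt: str) -> str:
    model = "gpt-4o" if is_complex(prompt) else "gpt-4o-mini"
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```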
What is function calling and how does it work?
Function calling lets the model decide when to call external functions. The model outputs structured JSON with function names and arguments. Your code executes the function and returns results.
This enables API integration, database queries, calculations, and external tool usage.
Function calling workflow
- Define available functions with JSON schemas
- Send user message with function definitions
- Model returns function call (if needed) or text response
- Execute function with provided arguments
- Send function result back to model
- Model generates final response using function output
Use cases include weather APIs, database lookups, calculator functions, web searches, and external data sources.
Defining tools with JSON schemas
Define functions using JSON schemas. Each function needs a name, description, and parameter specification.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City name (e.g., 'San Francisco')"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"]
                    }
                },
                "required": ["location"]
            }
        }
    }
]
Write clear descriptions. The model uses them to decide which function matches user intent.
Parallel function calls
The model can call multiple functions in one request. This reduces latency when operations are independent.
import json

def get_weather(location, unit="celsius"):
    # Stub implementation; swap in a real weather API call in production
    return {"temperature": 22, "condition": "sunny", "unit": unit}

def calculate(expression):
    try:
        # eval() keeps the demo short; use a safe expression parser in real code
        result = eval(expression)
        return {"result": result}
    except Exception:
        return {"error": "Invalid expression"}

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What's the weather in Tokyo and what's 15 * 24?"}],
    tools=tools
)

tool_calls = response.choices[0].message.tool_calls
if tool_calls:
    messages = [{"role": "user", "content": "What's the weather in Tokyo and what's 15 * 24?"}]
    messages.append(response.choices[0].message)  # keep the assistant's tool-call message in the history
    for tool_call in tool_calls:
        function_name = tool_call.function.name
        arguments = json.loads(tool_call.function.arguments)
        if function_name == "get_weather":
            result = get_weather(**arguments)
        elif function_name == "calculate":
            result = calculate(**arguments)
        messages.append({
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": json.dumps(result)
        })
    final_response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages
    )
    print(final_response.choices[0].message.content)
Controlling tool usage
The tool_choice parameter controls whether the model must call functions:
"auto": Model decides (default)"required": Model must call at least one function"none": Disable function calling{"type": "function", "function": {"name": "function_name"}}: Force specific function
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Tell me about Tokyo"}],
    tools=tools,
    tool_choice={"type": "function", "function": {"name": "get_weather"}}
)
How do I use embeddings for semantic search?
Embeddings convert text into dense vectors that capture semantic meaning. Similar texts produce similar vectors.
Choosing an embedding model
OpenAI provides two models:
| Model | Dimensions | Cost (1M tokens) | Use Case |
|---|---|---|---|
| text-embedding-3-large | 3072 | $0.13 | High-quality search |
| text-embedding-3-small | 1536 | $0.02 | Cost-sensitive apps |
text-embedding-3-small provides sufficient quality at 1/6th the cost for most applications.
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="Retrieval Augmented Generation combines retrieval with generation"
)
embedding = response.data[0].embedding
print(f"Dimensions: {len(embedding)}")
Building semantic search
Embed documents, store vectors, and find nearest neighbors for queries.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Python is a high-level programming language",
    "Machine learning models require training data",
    "Vector databases store embeddings for similarity search",
    "FastAPI is a modern web framework for Python"
]

# Generate embeddings
doc_embeddings = []
for doc in documents:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=doc
    )
    doc_embeddings.append(response.data[0].embedding)

doc_embeddings = np.array(doc_embeddings)

def search(query, top_k=3):
    query_response = client.embeddings.create(
        model="text-embedding-3-small",
        input=query
    )
    query_embedding = np.array([query_response.data[0].embedding])
    similarities = cosine_similarity(query_embedding, doc_embeddings)[0]
    top_indices = np.argsort(similarities)[::-1][:top_k]
    results = []
    for idx in top_indices:
        results.append({
            "document": documents[idx],
            "similarity": similarities[idx]
        })
    return results

results = search("What is a web framework?")
for i, result in enumerate(results, 1):
    print(f"{i}. {result['document']} ({result['similarity']:.3f})")
For production, use vector databases like FAISS, Pinecone, or Weaviate.
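As one example of that step up, the in-memory search above maps fairly directly onto FAISS: L2-normalized vectors in an inner-product index give cosine similarity. A sketch that reuses the documents and doc_embeddings from the previous example:

```python
import faiss
import numpy as np

# Build an inner-product index over L2-normalized vectors (≈ cosine similarity).
vectors = np.array(doc_embeddings, dtype="float32")
faiss.normalize_L2(vectors)
index = faiss.IndexFlatIP(vectors.shape[1])
index.add(vectors)

def faiss_search(query, top_k=3):
    response = client.embeddings.create(model="text-embedding-3-small", input=query)
    query_vec = np.array([response.data[0].embedding], dtype="float32")
    faiss.normalize_L2(query_vec)
    scores, indices = index.search(query_vec, top_k)
    return [(documents[i], float(s)) for i, s in zip(indices[0], scores[0])]
```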
How do I analyze images with vision models?
gpt-4o analyzes images and answers questions about visual content. Use it for image captioning, OCR, visual question answering, and content moderation.
Supported image formats
The API accepts images as URLs or base64-encoded data. Supported formats: JPEG, PNG, GIF, WebP. Maximum file size: 20MB.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/image.jpg"}
                }
            ]
        }
    ]
)
print(response.choices[0].message.content)
For base64 images:
import base64

def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

base64_image = encode_image("diagram.png")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this diagram"},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{base64_image}"}
                }
            ]
        }
    ]
)
Multi-image analysis
Send multiple images in one request.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Compare these diagrams"},
                {"type": "image_url", "image_url": {"url": "https://example.com/arch1.jpg"}},
                {"type": "image_url", "image_url": {"url": "https://example.com/arch2.jpg"}}
            ]
        }
    ]
)
OCR and text extraction
gpt-4o extracts text from images without dedicated OCR libraries. It handles handwriting, complex layouts, and multiple languages.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract all text and format as markdown"},
                {"type": "image_url", "image_url": {"url": "https://example.com/document.jpg"}}
            ]
        }
    ]
)
print(response.choices[0].message.content)
What are assistants and how do I use them?
Assistants are stateful AI agents with persistent conversations. Unlike chat completions (stateless), assistants remember conversation history across API calls.
Assistant capabilities
Assistants support three built-in tools:
- Code Interpreter: Runs Python code in a sandbox, generates charts, analyzes data
- File Search: Semantic search over uploaded documents
- Function Calling: Same as chat completions
Create an assistant once and reuse for multiple conversations.
assistant = client.beta.assistants.create(
    name="Python Expert",
    instructions="You are an expert Python developer who helps with code review.",
    model="gpt-4o",
    tools=[{"type": "code_interpreter"}]
)
Using code interpreter
Code interpreter runs Python in a sandboxed environment with a broad set of preinstalled packages; it has no internet access, so it cannot install arbitrary new packages. It can generate visualizations and process uploaded files.
import time

thread = client.beta.threads.create()

message = client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="Generate a bar chart showing fibonacci numbers up to F(10)"
)

run = client.beta.threads.runs.create(
    thread_id=thread.id,
    assistant_id=assistant.id
)

# Poll until the run reaches a terminal state
while run.status not in ("completed", "failed", "cancelled", "expired"):
    time.sleep(1)
    run = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)

messages = client.beta.threads.messages.list(thread_id=thread.id)
print(messages.data[0].content[0].text.value)
File search for document Q&A
Upload documents and let the assistant search them. This implements RAG without managing embeddings yourself.
file = client.files.create(
    file=open("documentation.pdf", "rb"),
    purpose="assistants"
)

# File search attaches documents through a vector store
vector_store = client.beta.vector_stores.create(
    name="Documentation",
    file_ids=[file.id]
)

assistant = client.beta.assistants.create(
    name="Documentation Assistant",
    instructions="Answer questions based on uploaded documentation.",
    model="gpt-4o",
    tools=[{"type": "file_search"}],
    tool_resources={"file_search": {"vector_store_ids": [vector_store.id]}}
)

thread = client.beta.threads.create()

message = client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="What are the installation requirements?"
)

run = client.beta.threads.runs.create(
    thread_id=thread.id,
    assistant_id=assistant.id
)
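As in the code interpreter example, poll the run until it finishes and then read the newest thread message (a minimal sketch; production code should also inspect failed runs):

```python
import time

while run.status not in ("completed", "failed", "cancelled", "expired"):
    time.sleep(1)
    run = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)

if run.status == "completed":
    messages = client.beta.threads.messages.list(thread_id=thread.id)
    print(messages.data[0].content[0].text.value)
```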
How do I handle errors in production?
Production systems need robust error handling. The SDK raises specific exceptions for different failures.
Common error types
RateLimitError: Exceeded rate limit (requests per minute or tokens per minute). Happens during traffic spikes.
APIError: Server returned 500 error. Temporary server issue. Retry with exponential backoff.
AuthenticationError: Invalid API key or insufficient permissions.
BadRequestError: Malformed request (invalid parameters, unsupported model). This is the 1.x name for what 0.x called InvalidRequestError.
APIConnectionError: Network failure or timeout.
from openai import OpenAI, RateLimitError, AuthenticationError, APIError

try:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Hello"}]
    )
except RateLimitError as e:
    print(f"Rate limit exceeded: {e}")
except AuthenticationError as e:
    print(f"Authentication failed: {e}")
except APIError as e:
    # APIError is the base class, so catch it after the specific exceptions above
    print(f"Server error: {e}")
Implementing retry logic
Use exponential backoff for transient errors.
from tenacity import (
    retry,
    stop_after_attempt,
    wait_exponential,
    retry_if_exception_type
)
from openai import RateLimitError, APIError

@retry(
    retry=retry_if_exception_type((RateLimitError, APIError)),
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10)
)
def call_gpt(prompt):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content
This makes up to three attempts in total, with exponentially increasing waits (capped at 10 seconds) between them.
Setting timeouts
Set timeouts to prevent hanging requests.
client = OpenAI(timeout=30.0)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Long task"}],
    timeout=60.0
)
How do I optimize costs in production?
Production systems need cost tracking, caching, and rate limiting.
Tracking token usage
Use tiktoken to count tokens and estimate costs.
import tiktoken

def count_tokens(text, model="gpt-4o"):
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

prompt = "Explain retrieval augmented generation"
token_count = count_tokens(prompt)

# Estimate cost (gpt-4o: $2.50 input, $10.00 output per 1M tokens)
input_cost = (token_count / 1_000_000) * 2.50
output_cost = (500 / 1_000_000) * 10.00  # Assume 500 output tokens
total_cost = input_cost + output_cost
print(f"Estimated cost: ${total_cost:.6f}")
Implementing caching
Cache responses for identical prompts using Redis.
import redis
import json
import hashlib

redis_client = redis.Redis(host='localhost', port=6379, decode_responses=True)

def cached_completion(prompt, model="gpt-4o", ttl=3600):
    cache_key = hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()
    cached = redis_client.get(cache_key)
    if cached:
        return json.loads(cached)
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    result = response.choices[0].message.content
    redis_client.setex(cache_key, ttl, json.dumps(result))
    return result
Rate limiting
Implement client-side throttling for high-volume applications.
Rate limits vary by model and usage tier; representative values:
- Free tier: 200 requests/day
- Tier 1: 500 requests/minute, 200k tokens/minute
- Tier 2: 5,000 requests/minute, 2M tokens/minute
import time
from threading import Lock

class RateLimiter:
    def __init__(self, requests_per_minute):
        self.requests_per_minute = requests_per_minute
        self.tokens = requests_per_minute
        self.last_update = time.time()
        self.lock = Lock()

    def acquire(self):
        with self.lock:
            now = time.time()
            elapsed = now - self.last_update
            # Refill the token bucket based on elapsed time
            self.tokens = min(
                self.requests_per_minute,
                self.tokens + elapsed * (self.requests_per_minute / 60)
            )
            self.last_update = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            else:
                wait_time = (1 - self.tokens) * (60 / self.requests_per_minute)
                time.sleep(wait_time)
                self.tokens = 0
                return True

limiter = RateLimiter(requests_per_minute=500)

def rate_limited_completion(prompt):
    limiter.acquire()
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
How do I migrate from SDK 0.x to 1.x?
SDK 1.x introduced breaking changes in November 2023.
Breaking changes
| SDK 0.x | SDK 1.x |
|---|---|
| openai.ChatCompletion.create() | client.chat.completions.create() |
| openai.Embedding.create() | client.embeddings.create() |
| Dict responses | Pydantic models |
| Global openai.api_key | Client instance |
| No async support | AsyncOpenAI() |
Migration example
Old (0.x):
import openai

openai.api_key = "sk-..."

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello"}]
)
print(response["choices"][0]["message"]["content"])
New (1.x):
from openai import OpenAI

client = OpenAI(api_key="sk-...")

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello"}]
)
print(response.choices[0].message.content)
What’s the difference between gpt-4o and gpt-4o-mini?
gpt-4o is the flagship model with superior reasoning and better performance on complex tasks. It costs 16x more than gpt-4o-mini. Use gpt-4o for research, analysis, and multi-step reasoning. Use gpt-4o-mini for classification, extraction, simple Q&A, and high-volume processing.
How do I reduce API costs?
Use gpt-4o-mini for simple tasks. It costs $0.15/$0.60 per 1M tokens vs $2.50/$10.00 for gpt-4o. Implement caching to avoid redundant calls. Count tokens and optimize prompts. Use lower max_tokens limits. Batch requests when possible.
Can I use the SDK with Azure OpenAI?
Yes. Azure OpenAI provides the same models through a different endpoint. Initialize with AzureOpenAI(api_key="your-azure-key", api_version="2024-02-01", azure_endpoint="https://your-resource.openai.azure.com").
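A minimal sketch of the same chat call against Azure (the key, endpoint, API version, and deployment name are placeholders; on Azure, model is the name of your deployment rather than the base model name):

```python
from openai import AzureOpenAI

azure_client = AzureOpenAI(
    api_key="your-azure-key",
    api_version="2024-02-01",
    azure_endpoint="https://your-resource.openai.azure.com",
)

response = azure_client.chat.completions.create(
    model="your-gpt-4o-deployment",  # deployment name configured in Azure
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
```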
What are the rate limits?
Rate limits depend on your tier. Tier 1 gets 500 requests/minute and 200k tokens/minute. Tier 2 gets 5,000 requests/minute and 2M tokens/minute. Check limits at platform.openai.com/account/limits.
How do I handle long conversations?
Implement conversation summarization. When approaching context limits (128k tokens), summarize old messages and keep recent context. Use sliding windows or hierarchical summarization. Alternatively, use the Assistants API which manages state automatically.
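A basic sliding window can be a tiktoken-based trim that always keeps the system message plus as many recent turns as fit the budget. A sketch that reuses the count_tokens helper from the cost section (it assumes string message contents):

```python
def trim_messages(messages, max_tokens=100_000):
    # Keep the system message plus as many recent messages as fit in the budget.
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    kept, used = [], sum(count_tokens(m["content"]) for m in system)
    for message in reversed(rest):
        tokens = count_tokens(message["content"])
        if used + tokens > max_tokens:
            break
        kept.append(message)
        used += tokens
    return system + list(reversed(kept))
```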
Should I use sync or async client?
Use AsyncOpenAI() for async applications (FastAPI, asyncio services). Use OpenAI() for synchronous scripts. The async client provides better concurrency for multiple API calls. Don’t use async unless your application is already async.
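The concurrency benefit shows up when you fan out several requests at once, for example with the get_completion coroutine defined earlier:

```python
import asyncio

async def summarize_all(prompts):
    # Issue the requests concurrently instead of one at a time.
    return await asyncio.gather(*(get_completion(p) for p in prompts))

results = asyncio.run(summarize_all([
    "Summarize RAG in one sentence",
    "Summarize vector databases in one sentence",
]))
print(results)
```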
How do I test OpenAI integrations?
Mock the client in tests using dependency injection. Use unittest.mock to create a mock client with predefined responses. For integration tests, use a test API key with low rate limits and monitor costs.
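A minimal sketch of that approach with unittest.mock (answer is a hypothetical function under test that receives the client as a dependency):

```python
from unittest.mock import MagicMock

def answer(client, prompt):
    # Example function under test: takes the client as a dependency.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def test_answer_returns_model_text():
    fake_client = MagicMock()
    fake_client.chat.completions.create.return_value = MagicMock(
        choices=[MagicMock(message=MagicMock(content="stubbed answer"))]
    )
    assert answer(fake_client, "Hello") == "stubbed answer"
```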
Conclusion
The OpenAI Python SDK 1.x provides type-safe access to gpt-4o, gpt-4o-mini, embeddings, vision, and assistants. Production systems need error handling with retries, cost tracking with token counting, caching for redundant requests, and rate limiting.
Start with gpt-4o-mini for prototyping. Use gpt-4o for complex reasoning. Implement function calling for API integration. Use embeddings for semantic search. Add vision for image analysis. Deploy assistants for stateful conversations.
Test models on your use case. Measure quality and cost. Optimize prompts to reduce tokens. Implement caching and rate limiting. Monitor latency and errors.

