The OpenAI Python SDK changed in November 2023 when version 1.0 shipped. If your code still uses openai.ChatCompletion.create(), you're running deprecated patterns that the 1.x releases no longer support.
The OpenAI Python SDK is the official library for accessing gpt-4o, gpt-4o-mini, embeddings, vision analysis, and assistants from Python. Version 1.x uses client instances, Pydantic response models, and async/await patterns. Both gpt-4o (the flagship model) and gpt-4o-mini (fast and cheap) offer a 128k-token context window.
You’ll learn chat completions with streaming, function calling with parallel execution, embeddings for semantic search, vision analysis, assistants with code interpreter, error handling with retries, and cost optimization.
How do I install and configure the OpenAI SDK?
Install using pip3. The SDK requires Python 3.8 or newer.
pip3 install openai
Get your API key from platform.openai.com/api-keys. Store it as an environment variable.
export OPENAI_API_KEY='sk-proj-...'
Initialize the client. The SDK provides synchronous and asynchronous clients.
from openai import OpenAI
import os
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
The client handles authentication, retries, and connection pooling. Create one client per application, not per request.
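If you want to tune that shared client, the 1.x constructor also accepts timeout and max_retries arguments. A minimal sketch (the values shown are illustrative, not recommendations):

```python
from openai import OpenAI
import os

# One shared client for the whole application.
client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY"),
    timeout=30.0,    # seconds allowed per request
    max_retries=2,   # automatic retries on connection errors and retryable status codes
)
```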
For async applications (FastAPI, asyncio), use AsyncOpenAI():
from openai import AsyncOpenAI
async_client = AsyncOpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
async def get_completion(prompt):
    response = await async_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content
How do chat completions work?
Chat completions power conversational AI and content generation. The API accepts messages with roles (system, user, assistant) and returns the model’s response.
Message structure and conversation context
Messages define conversation history. Each message has a role and content.
System messages set behavior. They define personality, constraints, and output format.
messages = [
    {"role": "system", "content": "You are a Python expert who explains concepts with code examples."},
    {"role": "user", "content": "What are decorators?"}
]
User messages represent input from the person using your application.
Assistant messages store previous responses. Include them to maintain context across turns.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is RAG?"},
    {"role": "assistant", "content": "RAG combines information retrieval with text generation..."},
    {"role": "user", "content": "How do I implement it?"}
]
Controlling model behavior with parameters
temperature (0.0-2.0) controls randomness. Lower values (0.0-0.3) produce deterministic responses. Higher values (0.7-1.5) increase creativity. Use low temperature for code generation and data extraction. Use higher temperature for creative writing.
# Deterministic code
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a fibonacci function"}],
    temperature=0.1
)

# Creative writing
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a story opening"}],
    temperature=1.2
)
max_tokens limits response length. The model stops after reaching this limit.
top_p (0.0-1.0) implements nucleus sampling. Use top_p=0.1 for focused responses or top_p=0.9 for diverse outputs. Don’t adjust both temperature and top_p.
presence_penalty (-2.0 to 2.0) reduces topic repetition. Positive values encourage exploring new topics.
frequency_penalty (-2.0 to 2.0) reduces token repetition. Positive values discourage verbatim repetition.
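A single request that combines several of these parameters looks like this (the specific values are arbitrary examples, not tuned recommendations):

```python
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize what vector databases do"}],
    temperature=0.3,        # mostly deterministic
    max_tokens=300,         # cap response length
    presence_penalty=0.5,   # nudge toward new topics
    frequency_penalty=0.5,  # discourage verbatim repetition
)
print(response.choices[0].message.content)
```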
Streaming responses for better UX
Streaming sends tokens as they're generated instead of waiting for the full completion. Users see text appear progressively, which reduces perceived latency.
Enable with stream=True. The API returns an iterator of chunks.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain RAG"}],
    stream=True
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
Handle streaming errors carefully. Network failures mid-stream leave partial responses.
def stream_completion(prompt):
    try:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            stream=True,
            timeout=30
        )
        full_response = ""
        for chunk in response:
            if chunk.choices[0].delta.content:
                content = chunk.choices[0].delta.content
                print(content, end="", flush=True)
                full_response += content
        return full_response
    except Exception as e:
        print(f"\nError: {e}")
        return None
Model comparison and pricing
Choose models based on task complexity and budget.
| Model | Cost (1M tokens in/out) | Context | Speed | Use Case |
|---|---|---|---|---|
| gpt-4o | $2.50/$10.00 | 128k | Fast | Complex reasoning, analysis |
| gpt-4o-mini | $0.15/$0.60 | 128k | Fastest | Simple tasks, high volume |
gpt-4o excels at complex reasoning. gpt-4o-mini handles simple tasks at 1/16th the cost. Test your use case with both models. Many applications use gpt-4o-mini for 80% of requests and route complex queries to gpt-4o.
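One way to implement that split is a small router that defaults to gpt-4o-mini and escalates when a heuristic flags the request as complex. A sketch (is_complex is a made-up placeholder; use whatever signal fits your application):

```python
def is_complex(prompt: str) -> bool:
    # Placeholder heuristic: treat long or multi-question prompts as complex.
    return len(prompt) > 2000 or prompt.count("?") > 2

def routed_completion(prompt: str) -> str:
    model = "gpt-4o" if is_complex(prompt) else "gpt-4o-mini"
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```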
What is function calling and how does it work?
Function calling lets the model decide when to call external functions. The model outputs structured JSON with function names and arguments. Your code executes the function and returns results.
This enables API integration, database queries, calculations, and external tool usage.
Function calling workflow
- Define available functions with JSON schemas
- Send user message with function definitions
- Model returns function call (if needed) or text response
- Execute function with provided arguments
- Send function result back to model
- Model generates final response using function output
Use cases include weather APIs, database lookups, calculator functions, web searches, and external data sources.
Defining tools with JSON schemas
Define functions using JSON schemas. Each function needs a name, description, and parameter specification.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City name (e.g., 'San Francisco')"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"]
                    }
                },
                "required": ["location"]
            }
        }
    }
]
Write clear descriptions. The model uses them to decide which function matches user intent.
Parallel function calls
The model can call multiple functions in one request. This reduces latency when operations are independent.
import json

def get_weather(location, unit="celsius"):
    # Stub implementation; swap in a real weather API call in production
    return {"temperature": 22, "condition": "sunny", "unit": unit}

def calculate(expression):
    try:
        # eval() keeps the demo short; use a safe expression parser in real code
        result = eval(expression)
        return {"result": result}
    except Exception:
        return {"error": "Invalid expression"}

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What's the weather in Tokyo and what's 15 * 24?"}],
    tools=tools
)

tool_calls = response.choices[0].message.tool_calls
if tool_calls:
    messages = [{"role": "user", "content": "What's the weather in Tokyo and what's 15 * 24?"}]
    messages.append(response.choices[0].message)  # keep the assistant's tool-call message in the history
    for tool_call in tool_calls:
        function_name = tool_call.function.name
        arguments = json.loads(tool_call.function.arguments)
        if function_name == "get_weather":
            result = get_weather(**arguments)
        elif function_name == "calculate":
            result = calculate(**arguments)
        messages.append({
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": json.dumps(result)
        })
    final_response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages
    )
    print(final_response.choices[0].message.content)
Controlling tool usage
The tool_choice parameter controls whether the model must call functions:
"auto": Model decides (default)"required": Model must call at least one function"none": Disable function calling{"type": "function", "function": {"name": "function_name"}}: Force specific function
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Tell me about Tokyo"}],
    tools=tools,
    tool_choice={"type": "function", "function": {"name": "get_weather"}}
)
How do I use embeddings for semantic search?
Embeddings convert text into dense vectors that capture semantic meaning. Similar texts produce similar vectors.
Choosing an embedding model
OpenAI provides two models:
| Model | Dimensions | Cost (1M tokens) | Use Case |
|---|---|---|---|
| text-embedding-3-large | 3072 | $0.13 | High-quality search |
| text-embedding-3-small | 1536 | $0.02 | Cost-sensitive apps |
text-embedding-3-small provides sufficient quality at 1/6th the cost for most applications.
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="Retrieval Augmented Generation combines retrieval with generation"
)
embedding = response.data[0].embedding
print(f"Dimensions: {len(embedding)}")
Building semantic search
Embed documents, store vectors, and find nearest neighbors for queries.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Python is a high-level programming language",
    "Machine learning models require training data",
    "Vector databases store embeddings for similarity search",
    "FastAPI is a modern web framework for Python"
]

# Generate embeddings
doc_embeddings = []
for doc in documents:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=doc
    )
    doc_embeddings.append(response.data[0].embedding)

doc_embeddings = np.array(doc_embeddings)

def search(query, top_k=3):
    query_response = client.embeddings.create(
        model="text-embedding-3-small",
        input=query
    )
    query_embedding = np.array([query_response.data[0].embedding])
    similarities = cosine_similarity(query_embedding, doc_embeddings)[0]
    top_indices = np.argsort(similarities)[::-1][:top_k]
    results = []
    for idx in top_indices:
        results.append({
            "document": documents[idx],
            "similarity": similarities[idx]
        })
    return results

results = search("What is a web framework?")
for i, result in enumerate(results, 1):
    print(f"{i}. {result['document']} ({result['similarity']:.3f})")
For production, use vector databases like FAISS, Pinecone, or Weaviate.
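As one example of that step up, the in-memory search above maps fairly directly onto FAISS: L2-normalized vectors in an inner-product index give cosine similarity. A sketch that reuses the documents and doc_embeddings from the previous example:

```python
import faiss
import numpy as np

# Build an inner-product index over L2-normalized vectors (≈ cosine similarity).
vectors = np.array(doc_embeddings, dtype="float32")
faiss.normalize_L2(vectors)
index = faiss.IndexFlatIP(vectors.shape[1])
index.add(vectors)

def faiss_search(query, top_k=3):
    response = client.embeddings.create(model="text-embedding-3-small", input=query)
    query_vec = np.array([response.data[0].embedding], dtype="float32")
    faiss.normalize_L2(query_vec)
    scores, indices = index.search(query_vec, top_k)
    return [(documents[i], float(s)) for i, s in zip(indices[0], scores[0])]
```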
How do I analyze images with vision models?
gpt-4o analyzes images and answers questions about visual content. Use it for image captioning, OCR, visual question answering, and content moderation.
Supported image formats
The API accepts images as URLs or base64-encoded data. Supported formats: JPEG, PNG, GIF, WebP. Maximum file size: 20MB.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/image.jpg"}
                }
            ]
        }
    ]
)
print(response.choices[0].message.content)
For base64 images:
import base64

def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

base64_image = encode_image("diagram.png")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this diagram"},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{base64_image}"}
                }
            ]
        }
    ]
)
Multi-image analysis
Send multiple images in one request.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Compare these diagrams"},
                {"type": "image_url", "image_url": {"url": "https://example.com/arch1.jpg"}},
                {"type": "image_url", "image_url": {"url": "https://example.com/arch2.jpg"}}
            ]
        }
    ]
)
OCR and text extraction
gpt-4o extracts text from images without dedicated OCR libraries. It handles handwriting, complex layouts, and multiple languages.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract all text and format as markdown"},
                {"type": "image_url", "image_url": {"url": "https://example.com/document.jpg"}}
            ]
        }
    ]
)
print(response.choices[0].message.content)
What are assistants and how do I use them?
Assistants are stateful AI agents with persistent conversations. Unlike chat completions (stateless), assistants remember conversation history across API calls.
Assistant capabilities
Assistants support three built-in tools:
- Code Interpreter: Runs Python code in a sandbox, generates charts, analyzes data
- File Search: Semantic search over uploaded documents
- Function Calling: Same as chat completions
Create an assistant once and reuse for multiple conversations.
assistant = client.beta.assistants.create(
    name="Python Expert",
    instructions="You are an expert Python developer who helps with code review.",
    model="gpt-4o",
    tools=[{"type": "code_interpreter"}]
)
Using code interpreter
Code interpreter runs Python in a sandboxed environment with a broad set of preinstalled packages; it has no internet access, so it cannot install arbitrary new packages. It can generate visualizations and process uploaded files.
import time

thread = client.beta.threads.create()

message = client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="Generate a bar chart showing fibonacci numbers up to F(10)"
)

run = client.beta.threads.runs.create(
    thread_id=thread.id,
    assistant_id=assistant.id
)

# Poll until the run reaches a terminal state
while run.status not in ("completed", "failed", "cancelled", "expired"):
    time.sleep(1)
    run = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)

messages = client.beta.threads.messages.list(thread_id=thread.id)
print(messages.data[0].content[0].text.value)
File search for document Q&A
Upload documents and let the assistant search them. This implements RAG without managing embeddings yourself.
file = client.files.create(
    file=open("documentation.pdf", "rb"),
    purpose="assistants"
)

# File search attaches documents through a vector store
vector_store = client.beta.vector_stores.create(
    name="Documentation",
    file_ids=[file.id]
)

assistant = client.beta.assistants.create(
    name="Documentation Assistant",
    instructions="Answer questions based on uploaded documentation.",
    model="gpt-4o",
    tools=[{"type": "file_search"}],
    tool_resources={"file_search": {"vector_store_ids": [vector_store.id]}}
)

thread = client.beta.threads.create()

message = client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="What are the installation requirements?"
)

run = client.beta.threads.runs.create(
    thread_id=thread.id,
    assistant_id=assistant.id
)
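As in the code interpreter example, poll the run until it finishes and then read the newest thread message (a minimal sketch; production code should also inspect failed runs):

```python
import time

while run.status not in ("completed", "failed", "cancelled", "expired"):
    time.sleep(1)
    run = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)

if run.status == "completed":
    messages = client.beta.threads.messages.list(thread_id=thread.id)
    print(messages.data[0].content[0].text.value)
```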
How do I handle errors in production?
Production systems need robust error handling. The SDK raises specific exceptions for different failures.
Common error types
RateLimitError: Exceeded rate limit (requests per minute or tokens per minute). Happens during traffic spikes.
APIError: Server returned 500 error. Temporary server issue. Retry with exponential backoff.
AuthenticationError: Invalid API key or insufficient permissions.
BadRequestError: Malformed request (invalid parameters, unsupported model). This is the 1.x name for what 0.x called InvalidRequestError.
APIConnectionError: Network failure or timeout.
from openai import OpenAI, RateLimitError, AuthenticationError, APIError

try:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Hello"}]
    )
except RateLimitError as e:
    print(f"Rate limit exceeded: {e}")
except AuthenticationError as e:
    print(f"Authentication failed: {e}")
except APIError as e:
    # APIError is the base class, so catch it after the specific exceptions above
    print(f"Server error: {e}")
Implementing retry logic
Use exponential backoff for transient errors.
from tenacity import (
    retry,
    stop_after_attempt,
    wait_exponential,
    retry_if_exception_type
)
from openai import RateLimitError, APIError

@retry(
    retry=retry_if_exception_type((RateLimitError, APIError)),
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10)
)
def call_gpt(prompt):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content
This makes up to three attempts in total, with exponentially increasing waits (capped at 10 seconds) between them.
Setting timeouts
Set timeouts to prevent hanging requests.
client = OpenAI(timeout=30.0)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Long task"}],
    timeout=60.0
)
How do I optimize costs in production?
Production systems need cost tracking, caching, and rate limiting.
Tracking token usage
Use tiktoken to count tokens and estimate costs.
import tiktoken

def count_tokens(text, model="gpt-4o"):
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

prompt = "Explain retrieval augmented generation"
token_count = count_tokens(prompt)

# Estimate cost (gpt-4o: $2.50 input, $10.00 output per 1M tokens)
input_cost = (token_count / 1_000_000) * 2.50
output_cost = (500 / 1_000_000) * 10.00  # Assume 500 output tokens
total_cost = input_cost + output_cost
print(f"Estimated cost: ${total_cost:.6f}")
Implementing caching
Cache responses for identical prompts using Redis.
import redis
import json
import hashlib

redis_client = redis.Redis(host='localhost', port=6379, decode_responses=True)

def cached_completion(prompt, model="gpt-4o", ttl=3600):
    cache_key = hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()
    cached = redis_client.get(cache_key)
    if cached:
        return json.loads(cached)
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    result = response.choices[0].message.content
    redis_client.setex(cache_key, ttl, json.dumps(result))
    return result
Rate limiting
Implement client-side throttling for high-volume applications.
Rate limits vary by model and usage tier; representative values:
- Free tier: 200 requests/day
- Tier 1: 500 requests/minute, 200k tokens/minute
- Tier 2: 5,000 requests/minute, 2M tokens/minute
import time
from threading import Lock

class RateLimiter:
    def __init__(self, requests_per_minute):
        self.requests_per_minute = requests_per_minute
        self.tokens = requests_per_minute
        self.last_update = time.time()
        self.lock = Lock()

    def acquire(self):
        with self.lock:
            now = time.time()
            elapsed = now - self.last_update
            # Refill the token bucket based on elapsed time
            self.tokens = min(
                self.requests_per_minute,
                self.tokens + elapsed * (self.requests_per_minute / 60)
            )
            self.last_update = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            else:
                wait_time = (1 - self.tokens) * (60 / self.requests_per_minute)
                time.sleep(wait_time)
                self.tokens = 0
                return True

limiter = RateLimiter(requests_per_minute=500)

def rate_limited_completion(prompt):
    limiter.acquire()
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
How do I migrate from SDK 0.x to 1.x?
SDK 1.x introduced breaking changes in November 2023.
Breaking changes
| SDK 0.x | SDK 1.x |
|---|---|
| openai.ChatCompletion.create() | client.chat.completions.create() |
| openai.Embedding.create() | client.embeddings.create() |
| Dict responses | Pydantic models |
| Global openai.api_key | Client instance |
| No async support | AsyncOpenAI() |
Migration example
Old (0.x):
import openai

openai.api_key = "sk-..."

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello"}]
)
print(response["choices"][0]["message"]["content"])
New (1.x):
from openai import OpenAI

client = OpenAI(api_key="sk-...")

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello"}]
)
print(response.choices[0].message.content)
What’s the difference between gpt-4o and gpt-4o-mini?
gpt-4o is the flagship model with superior reasoning and better performance on complex tasks. It costs 16x more than gpt-4o-mini. Use gpt-4o for research, analysis, and multi-step reasoning. Use gpt-4o-mini for classification, extraction, simple Q&A, and high-volume processing.
How do I reduce API costs?
Use gpt-4o-mini for simple tasks. It costs $0.15/$0.60 per 1M tokens vs $2.50/$10.00 for gpt-4o. Implement caching to avoid redundant calls. Count tokens and optimize prompts. Use lower max_tokens limits. Batch requests when possible.
Can I use the SDK with Azure OpenAI?
Yes. Azure OpenAI provides the same models through a different endpoint. Initialize with AzureOpenAI(api_key="your-azure-key", api_version="2024-02-01", azure_endpoint="https://your-resource.openai.azure.com").
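A minimal sketch of the same chat call against Azure (the key, endpoint, API version, and deployment name are placeholders; on Azure, model is the name of your deployment rather than the base model name):

```python
from openai import AzureOpenAI

azure_client = AzureOpenAI(
    api_key="your-azure-key",
    api_version="2024-02-01",
    azure_endpoint="https://your-resource.openai.azure.com",
)

response = azure_client.chat.completions.create(
    model="your-gpt-4o-deployment",  # deployment name configured in Azure
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
```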
What are the rate limits?
Rate limits depend on your tier. Tier 1 gets 500 requests/minute and 200k tokens/minute. Tier 2 gets 5,000 requests/minute and 2M tokens/minute. Check limits at platform.openai.com/account/limits.
How do I handle long conversations?
Implement conversation summarization. When approaching context limits (128k tokens), summarize old messages and keep recent context. Use sliding windows or hierarchical summarization. Alternatively, use the Assistants API which manages state automatically.
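A basic sliding window can be a tiktoken-based trim that always keeps the system message plus as many recent turns as fit the budget. A sketch that reuses the count_tokens helper from the cost section (it assumes string message contents):

```python
def trim_messages(messages, max_tokens=100_000):
    # Keep the system message plus as many recent messages as fit in the budget.
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    kept, used = [], sum(count_tokens(m["content"]) for m in system)
    for message in reversed(rest):
        tokens = count_tokens(message["content"])
        if used + tokens > max_tokens:
            break
        kept.append(message)
        used += tokens
    return system + list(reversed(kept))
```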
Should I use sync or async client?
Use AsyncOpenAI() for async applications (FastAPI, asyncio services). Use OpenAI() for synchronous scripts. The async client provides better concurrency for multiple API calls. Don’t use async unless your application is already async.
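The concurrency benefit shows up when you fan out several requests at once, for example with the get_completion coroutine defined earlier:

```python
import asyncio

async def summarize_all(prompts):
    # Issue the requests concurrently instead of one at a time.
    return await asyncio.gather(*(get_completion(p) for p in prompts))

results = asyncio.run(summarize_all([
    "Summarize RAG in one sentence",
    "Summarize vector databases in one sentence",
]))
print(results)
```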
How do I test OpenAI integrations?
Mock the client in tests using dependency injection. Use unittest.mock to create a mock client with predefined responses. For integration tests, use a test API key with low rate limits and monitor costs.
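A minimal sketch of that approach with unittest.mock (answer is a hypothetical function under test that receives the client as a dependency):

```python
from unittest.mock import MagicMock

def answer(client, prompt):
    # Example function under test: takes the client as a dependency.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def test_answer_returns_model_text():
    fake_client = MagicMock()
    fake_client.chat.completions.create.return_value = MagicMock(
        choices=[MagicMock(message=MagicMock(content="stubbed answer"))]
    )
    assert answer(fake_client, "Hello") == "stubbed answer"
```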
Conclusion
The OpenAI Python SDK 1.x provides type-safe access to gpt-4o, gpt-4o-mini, embeddings, vision, and assistants. Production systems need error handling with retries, cost tracking with token counting, caching for redundant requests, and rate limiting.
Start with gpt-4o-mini for prototyping. Use gpt-4o for complex reasoning. Implement function calling for API integration. Use embeddings for semantic search. Add vision for image analysis. Deploy assistants for stateful conversations.
Test models on your use case. Measure quality and cost. Optimize prompts to reduce tokens. Implement caching and rate limiting. Monitor latency and errors.

