Endpoint name
The name you assign to your endpoint for easy identification in your dashboard. This name is only visible to you and doesn’t affect the endpoint ID used for API calls.
Endpoint type
Choose between two endpoint types based on your workload requirements:
- Queue-based endpoints are well-suited for long-running requests, batch processing, or asynchronous tasks. They process requests through a queueing system that guarantees execution and provides built-in retry mechanisms. These endpoints are easy to implement using handler functions, and are ideal for workloads that can be processed asynchronously.
- Load balancing endpoints are best for high-throughput or low-latency workloads, or non-standard request/response patterns. They route requests directly to worker HTTP servers, bypassing the queue for faster response times. These endpoints support custom REST API paths and are ideal for real-time applications requiring immediate processing. For detailed information, see Load balancing endpoints.
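Queue-based endpoints only require a handler function. Below is a minimal sketch using the runpod Python SDK; the `prompt` input field is an arbitrary name chosen for illustration.

```python
import runpod

def handler(job):
    # job["input"] contains the JSON payload sent with the request;
    # "prompt" is an arbitrary field name used for this example.
    prompt = job["input"].get("prompt", "")
    # Run your inference or processing logic here.
    return {"output": f"Processed: {prompt}"}

# Start the queue-based worker; Runpod delivers queued requests to the handler.
runpod.serverless.start({"handler": handler})
```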
GPU configuration
Choose one or more GPU categories (organized by memory) for your endpoint in order of preference. Runpod prioritizes allocating the first category in your list and falls back to subsequent categories if your first choice is unavailable. The following GPU categories are available (a worked cost estimate follows the table):

| GPU type(s) | Memory | Flex cost per second | Active cost per second | Description |
|---|---|---|---|---|
| A4000, A4500, RTX 4000 | 16 GB | $0.00016 | $0.00011 | The most cost-effective for small models. |
| 4090 PRO | 24 GB | $0.00031 | $0.00021 | Extreme throughput for small-to-medium models. |
| L4, A5000, 3090 | 24 GB | $0.00019 | $0.00013 | Great for small-to-medium sized inference workloads. |
| L40, L40S, 6000 Ada PRO | 48 GB | $0.00053 | $0.00037 | Extreme inference throughput on LLMs like Llama 3 7B. |
| A6000, A40 | 48 GB | $0.00034 | $0.00024 | A cost-effective option for running big models. |
| H100 PRO | 80 GB | $0.00116 | $0.00093 | Extreme throughput for big models. |
| A100 | 80 GB | $0.00076 | $0.00060 | High throughput GPU, yet still very cost-effective. |
| H200 PRO | 141 GB | $0.00155 | $0.00124 | Extreme throughput for huge models. |
| B200 | 180 GB | $0.00240 | $0.00190 | Maximum throughput for huge models. |
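As a rough way to compare categories, you can turn the per-second rates above into per-request and always-on estimates. The sketch below uses the A100 rates from the table; the 30-second execution time and 5-second idle window are illustrative assumptions, and actual billing depends on your real usage.

```python
# Rough cost estimates from the per-second rates in the table above.
A100_FLEX = 0.00076      # flex (per-second) rate for the A100 80 GB category
A100_ACTIVE = 0.00060    # active (per-second) rate for the same category

# Flex worker: you pay for execution time plus any idle timeout that follows it.
execution_seconds = 30   # illustrative assumption
idle_timeout_seconds = 5 # default idle timeout
per_request = (execution_seconds + idle_timeout_seconds) * A100_FLEX
print(f"~${per_request:.4f} per request")          # ~$0.0266

# Active worker: billed continuously at the discounted active rate.
per_month = A100_ACTIVE * 60 * 60 * 24 * 30
print(f"~${per_month:.2f} per month, always on")   # ~$1555.20
```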
Worker configuration
Active (min) workers
Sets the minimum number of workers that remain running at all times. Setting this to one or higher eliminates cold start delays for faster response times. Active workers incur charges immediately, but receive up to a 30% discount from regular pricing.
Default: 0
Max workers
The maximum number of concurrent workers your endpoint can scale to.
Default: 3
GPUs per worker
The number of GPUs assigned to each worker instance.
Default: 1
Timeout settings
Idle timeout
The amount of time that a worker continues running after completing a request. You’re still charged for this time, even if the worker isn’t actively processing any requests. By default, the idle timeout is set to 5 seconds to help avoid frequent start/stop cycles and reduce the likelihood of cold starts.
Setting a longer idle timeout can help minimize cold starts for intermittent traffic, but it may also increase your costs. When configuring idle timeout, start by matching it to your average cold start time to reduce startup delays. For workloads with extended cold starts, consider longer idle timeouts to minimize repeated initialization costs.
Execution timeout
The maximum time a job can run before automatic termination. This prevents runaway jobs from consuming excessive resources. You can turn off this setting, but we highly recommend keeping it on.
Default: 600 seconds (10 minutes)
Maximum: 24 hours (can be extended using job TTL)
Job TTL (time-to-live)
The maximum time a job remains in the queue before automatic termination.
Default: 86,400,000 milliseconds (24 hours)
Minimum: 10,000 milliseconds (10 seconds)
See Execution policies for more information.
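You can also override these limits per request using the policy object described in Execution policies. The sketch below is a minimal example; the `executionTimeout` and `ttl` fields (both in milliseconds) and the placeholder endpoint ID and API key are assumptions you should verify against the Execution policies documentation.

```python
import requests

# Placeholder values for illustration; substitute your own endpoint ID and API key.
ENDPOINT_ID = "your-endpoint-id"
API_KEY = "your-api-key"

payload = {
    "input": {"prompt": "Hello"},
    # Per-request overrides (milliseconds), assuming the fields described
    # in the Execution policies documentation.
    "policy": {
        "executionTimeout": 300_000,  # terminate the job after 5 minutes of execution
        "ttl": 3_600_000,             # drop the job if it waits in the queue over 1 hour
    },
}

response = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/run",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
)
print(response.json())
```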
FlashBoot
FlashBoot is Runpod’s solution for reducing the average cold start times on your endpoint. It works by retaining worker resources for some time after they’re no longer in use, so they can be rebooted quickly. When your endpoint has consistent traffic, your workers have a higher chance of benefiting from FlashBoot for faster spin-ups. However, if your endpoint isn’t receiving frequent requests, FlashBoot has fewer opportunities to optimize performance. There is no additional cost associated with FlashBoot.
Model (optional)
You can select from a list of cached models using the Model (optional) field. Selecting a model signals the system to place your workers on host machines that contain the selected model, resulting in faster cold starts and significant cost savings.
Advanced settings
When configuring advanced settings, remember that each constraint (data center, storage, CUDA version, GPU type) may limit resource availability. For maximum availability and reliability, select all data centers and CUDA versions, and avoid network volumes unless your workload specifically requires them.
Data centers
Control which data centers can deploy and cache your workers. Allowing multiple data centers improves availability, while using a network volume restricts your endpoint to a single data center.
Default: All data centers
Network volumes
Attach persistent storage to your workers. Network volumes have higher latency than local storage, and restrict workers to the data center containing your volume. However, they can be very useful for sharing large models or data between workers on an endpoint.
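When a network volume is attached, your handler can read shared models or data from the volume’s mount point. A minimal sketch follows, assuming the volume is mounted at /runpod-volume (the usual location for Serverless workers) and a hypothetical directory layout; confirm both for your own endpoint.

```python
import os

# Assumed mount point for an attached network volume on a Serverless worker;
# verify this path for your endpoint before relying on it.
VOLUME_ROOT = "/runpod-volume"
MODEL_DIR = os.path.join(VOLUME_ROOT, "models", "llama-3-8b")  # hypothetical layout

def resolve_model_path() -> str:
    """Prefer weights on the shared network volume, falling back to a path
    baked into the Docker image if the volume isn't mounted."""
    if os.path.isdir(MODEL_DIR):
        return MODEL_DIR            # shared across workers in the same data center
    return "/models/llama-3-8b"     # hypothetical local fallback inside the image
```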
Auto-scaling type
Queue delay
The queue delay scaling strategy adds workers based on request wait times. Workers are added if requests spend more than X seconds in the queue, where X is a threshold you define. By default, this threshold is set to 4 seconds.
Request count
The request count scaling strategy adjusts worker numbers according to the total number of requests in the queue and in progress. It automatically adds workers as the number of requests increases, ensuring tasks are handled efficiently.
Total workers formula: `Math.ceil((requestsInQueue + requestsInProgress) / 4)`
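The sketch below restates the formula in Python to show how the worker target responds to load. The request target of 4 is the default; capping the result at the endpoint’s max worker setting is an assumption consistent with the limits described above.

```python
import math

def target_workers(requests_in_queue: int, requests_in_progress: int,
                   request_target: int = 4, max_workers: int = 3) -> int:
    """Request count scaling: roughly one worker per `request_target`
    outstanding requests, capped at the endpoint's max worker limit."""
    desired = math.ceil((requests_in_queue + requests_in_progress) / request_target)
    return min(desired, max_workers)

print(target_workers(10, 2))   # ceil(12 / 4) = 3 workers
print(target_workers(30, 5))   # ceil(35 / 4) = 9, capped at max_workers = 3
```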
Expose HTTP/TCP ports
Enables direct communication with your worker via its public IP and port. This can be useful for real-time applications requiring minimal latency, such as WebSocket applications.
Enabled GPU types
Specify which GPU types to use within your selected GPU size categories. By default, all GPU types are enabled.
CUDA version selection
Specify which CUDA versions can be used with your workload to ensure your code runs on compatible GPU hardware. Runpod will match your workload to GPU instances with the selected CUDA versions.
Reducing worker startup times
There are two primary factors that impact worker startup times:
- Worker initialization time: Worker initialization occurs when a Docker image is downloaded to a new worker. This takes place after you create a new endpoint, adjust worker counts, or deploy a new worker image. Requests that arrive during initialization face delays, as a worker must be fully initialized before it can start processing.
- Cold start: A cold start occurs when a worker is revived from an idle state. Cold starts can get very long if your handler code loads large ML models (several gigabytes to hundreds of gigabytes) into GPU memory.
If your worker’s cold start time exceeds the default 7-minute limit (which can occur when loading large models), the system may mark it as unhealthy. To prevent this, you can extend the cold start timeout by setting the RUNPOD_INIT_TIMEOUT environment variable. For example, setting RUNPOD_INIT_TIMEOUT=800 allows up to 800 seconds (13.3 minutes) for revival.
Use the following strategies to reduce worker startup times:
- Embed models in Docker images: Package your ML models directly within your worker container image instead of downloading them in your handler function. This strategy places models on the worker’s high-speed local storage (SSD/NVMe), dramatically reducing the time needed to load models into GPU memory (see the sketch after this list). This approach is optimal for production environments, though extremely large models (500GB+) may require network volume storage.
- Store large models on network volumes: For flexibility during development, save large models to a network volume using a Pod or one-time handler, then mount this volume to your Serverless workers. While network volumes offer slower model loading compared to embedding models directly, they can speed up your workflow by enabling rapid iteration and seamless switching between different models and configurations.
- Maintain active workers: Set active worker counts above zero to completely eliminate cold starts. These workers remain ready to process requests instantly and cost up to 30% less when idle compared to standard (flex) workers.
- Extend idle timeouts: Configure longer idle periods to preserve worker availability between requests. This strategy prevents premature worker shutdown during temporary traffic lulls, ensuring no cold starts for subsequent requests.
- Optimize scaling parameters: Fine-tune your auto-scaling configuration for more responsive worker provisioning:
  - Lower queue delay thresholds to 2-3 seconds (default: 4).
  - Decrease request count thresholds to 2-3 (default: 4).
- Increase maximum worker limits: Set higher maximum worker capacities to ensure your Docker images are pre-cached across multiple compute nodes and data centers. This proactive approach eliminates image download delays during scaling events, significantly reducing startup times.
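As an example of the "Embed models in Docker images" strategy, the sketch below loads a model once at module import time (during worker startup) instead of inside the handler, so every request reuses the already-loaded weights. The transformers pipeline and the /models/my-model path are illustrative assumptions; the same pattern applies to any framework.

```python
import runpod
from transformers import pipeline  # assumes transformers is installed in your image

# Load the model once, at worker startup, from a path baked into the Docker image.
# "/models/my-model" is an illustrative path; COPY the weights there in your Dockerfile.
# If loading takes longer than the default init limit, set RUNPOD_INIT_TIMEOUT
# (e.g. RUNPOD_INIT_TIMEOUT=800) as an endpoint environment variable.
generator = pipeline("text-generation", model="/models/my-model", device=0)

def handler(job):
    prompt = job["input"]["prompt"]
    # Reuse the preloaded model; no per-request loading cost.
    result = generator(prompt, max_new_tokens=128)
    return {"output": result[0]["generated_text"]}

runpod.serverless.start({"handler": handler})
```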
Best practices summary
- Understand the tradeoffs between cost, speed, and model size, and make them consciously.
- Start conservative with max workers and scale up as needed.
- Monitor throttling and adjust max workers accordingly.
- Use active workers for latency-sensitive applications.
- Select multiple GPU types to improve availability.
- Choose appropriate timeouts based on your workload characteristics.
- Consider data locality when using network volumes.
- Avoid setting max workers to 1 to prevent bottlenecks.
- Plan for 20% headroom in max workers to handle load spikes.
- Prefer fewer high-end GPUs per worker over a larger number of lower-end GPUs for better performance.
- Set execution timeout to prevent runaway processes.
- Match auto-scaling strategy to your workload patterns.
- Embed models in Docker images when possible for faster loading.
- Extend idle timeouts to prevent frequent cold starts.
- Consider disabling FlashBoot for endpoints with few workers or infrequent traffic.