Performance Tuning
Replica Counts and GPU Distribution
The number of service replicas and their GPU assignments are configured in deployment configs located in src/wizard/configs/deploy/.
For local workstation: local_oss.yaml
Understanding the Configuration
Each service has two key parameters, gpus and replicas_per_container:
- One container per GPU (or one container total if gpus: null)
- Each container runs replicas_per_container service instances
- Total replicas = nr_gpus * replicas_per_container

For example:
- gpus: [0, 1, 2, 3] → 4 containers (one per GPU)
- replicas_per_container: 4 → 4 replicas per container
- Total: 4 × 4 = 16 service replicas
Balancing Replicas and Concurrent Rollouts
Total simulation throughput capacity is determined by the number of service replicas and by n_concurrent_rollouts, the number of rollouts (simulation episodes) each service replica can process simultaneously.
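The replica arithmetic above can be sketched in a few lines of Python. This is an illustration only: the ServiceDeploy class, the service names, and the numbers are hypothetical, and treating capacity as total replicas × n_concurrent_rollouts is an assumption based on the definitions in this section, not a formula taken from the AlpaSim code.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence


@dataclass
class ServiceDeploy:
    """Hypothetical mirror of one service entry in a deploy config."""

    gpus: Optional[Sequence[int]]  # None models `gpus: null`
    replicas_per_container: int
    n_concurrent_rollouts: int

    def total_replicas(self) -> int:
        # One container per GPU, or a single container when gpus is null.
        n_containers = len(self.gpus) if self.gpus is not None else 1
        return n_containers * self.replicas_per_container

    def capacity(self) -> int:
        # Assumption: capacity = replicas x rollouts each replica handles at once.
        return self.total_replicas() * self.n_concurrent_rollouts


def bottleneck(services: Dict[str, ServiceDeploy]) -> str:
    """Return the name of the service with the smallest total capacity."""
    return min(services, key=lambda name: services[name].capacity())


# Made-up services and values, for illustration only.
services = {
    "service_a": ServiceDeploy(gpus=[0, 1, 2, 3], replicas_per_container=4, n_concurrent_rollouts=1),
    "service_b": ServiceDeploy(gpus=None, replicas_per_container=2, n_concurrent_rollouts=4),
}

for name, svc in services.items():
    print(f"{name}: {svc.total_replicas()} replicas, capacity {svc.capacity()}")
print("lowest capacity:", bottleneck(services))
```

Comparing capacities this way before a run is a quick check that no single service limits the others.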
Example from local_oss.yaml scaled up:
Changing Inference Frequency
Changing inference frequency requires coordinating multiple timing parameters.
Understanding Timing Parameters
The simulator has multiple synchronized “clocks”:
- Driver inference (control_timestep_us) - How often the model makes decisions
- Camera frames (frame_interval_us) - How often cameras capture images
- GPS/Pose updates (egopose_interval_us) - How often position is updated
- Simulation start (time_start_offset_us) - Initial offset to avoid artifacts
For correct operation, these parameters must be mathematically aligned.
Scenario 1: Simple Frequency Change
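To change to 5Hz inference (200ms between decisions), set control_timestep_us=200000, egopose_interval_us=200000, and time_start_offset_us=600000 (3 × 200000):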
Full command:
Scenario 2: High-Rate Camera with Lower Inference
To use 30Hz cameras (33.3ms) but 10Hz inference (100ms):
- Camera captures at 30Hz: frame_interval_us=33334 (33.3ms)
- Inference runs at 10Hz: control_timestep_us=100002 (must be 3 × 33334)
- Subsample frames: driver.inference.Cframes_subsample=3 (use every 3rd frame)
- Egopose matches inference: egopose_interval_us=100002
- Time offset aligns: time_start_offset_us=300006 (3 × 100002)
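The arithmetic in both scenarios can be reproduced with a small helper. This is a sketch, not part of AlpaSim: the function name timing_overrides is hypothetical, the returned keys are the short parameter names used on this page rather than full config paths, and rounding the frame period up to a whole microsecond is inferred from the 33334 value above.

```python
def timing_overrides(camera_hz: int, frames_subsample: int = 1, offset_steps: int = 3) -> dict:
    """Derive mutually aligned timing values (in microseconds).

    camera_hz        -- camera frame rate
    frames_subsample -- use every Nth frame for inference
    offset_steps     -- start offset expressed in control steps (3 in most configs)
    """
    # Ceiling division: round the frame period up to whole microseconds
    # (30 Hz -> 33334 rather than 33333).
    frame_interval_us = -(-1_000_000 // camera_hz)
    # Inference runs once every `frames_subsample` camera frames.
    control_timestep_us = frames_subsample * frame_interval_us
    return {
        "frame_interval_us": frame_interval_us,
        "control_timestep_us": control_timestep_us,
        "egopose_interval_us": control_timestep_us,  # egopose matches inference
        "time_start_offset_us": offset_steps * control_timestep_us,
    }


# Scenario 2: 30Hz cameras, inference on every 3rd frame.
for key, value in timing_overrides(camera_hz=30, frames_subsample=3).items():
    print(f"{key}={value}")
```

Deriving every value from the frame interval keeps the control timestep an exact multiple of the camera period, which is why Scenario 2 uses 100002 instead of 100000.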
Common Frequencies
| Frequency | control_timestep_us | egopose_interval_us | time_start_offset_us | Notes |
|---|---|---|---|---|
| 2Hz | 500000 (500ms) | 500000 | 500000 or 1500000 | VaVAM default |
| 5Hz | 200000 (200ms) | 200000 | 600000 (3×) | Example config |
| 10Hz | 100000 (100ms) | 100000 | 300000 (3×) | Base default |
| 30Hz | 33334 (33.3ms) | 33334 | 100002 (3×) | High frequency |
Most configs use time_start_offset_us = 3 × control_timestep_us to avoid artifacts at scene start.
Validation
The assert_zero_decision_delay flag (enabled by default) validates timing synchronization at runtime:
- Camera frames complete exactly at decision time
- Egopose updates complete exactly at decision time
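A configuration can also be sanity-checked before launching a run. The sketch below is not the implementation behind assert_zero_decision_delay; it is a hypothetical pre-flight helper (timing_problems is an invented name) that encodes the alignment rules stated on this page.

```python
from typing import List


def timing_problems(
    control_timestep_us: int,
    frame_interval_us: int,
    egopose_interval_us: int,
    time_start_offset_us: int,
) -> List[str]:
    """Return human-readable alignment problems; an empty list means aligned."""
    problems = []
    if egopose_interval_us != control_timestep_us:
        problems.append("egopose_interval_us should equal control_timestep_us")
    if control_timestep_us % frame_interval_us != 0:
        problems.append("control_timestep_us should be a whole multiple of frame_interval_us")
    if time_start_offset_us % control_timestep_us != 0:
        problems.append("time_start_offset_us should be a multiple of control_timestep_us")
    return problems


# Aligned: the Scenario 2 values.
print(timing_problems(100002, 33334, 100002, 300006))
# Misaligned: 100000 is not a multiple of 33334.
print(timing_problems(100000, 33334, 100000, 300000))
```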
Viewing Results and Metrics
Results Directory Structure
After a run completes, results are in wizard.log_dir (e.g., runs/{RUN_DIR}/):
rollouts/
Simulation logs organized by scene and batch:
- rollouts/{scene_id}/{batch_uuid}/rollout.asl - Full simulation log
- rollouts/{scene_id}/{batch_uuid}/metrics.parquet - Per-rollout metrics
- rollouts/{scene_id}/{batch_uuid}/{clipgt_id}_{batch_id}_{rollout_id}.mp4 - Evaluation video
- _complete - Marker file indicating successful completion
aggregate/
Aggregated results across all rollouts:
- metrics_results.txt - Formatted table of driving scores (mean, std, quantiles)
- metrics_results.png - Visual summary of driving quality metrics
- metrics_unprocessed.parquet - Combined metrics from all rollouts
- videos/ - Videos organized by violation types (collision_at_fault, offroad, etc.)
telemetry/
Performance profiling data:
- metrics.prom - Prometheus metrics from simulation
- metrics_plot.png - Performance visualization (CPU/GPU/RPC metrics)
txt-logs/
Per-service debug logs for troubleshooting
wizard-config.yaml
Resolved configuration used for this run (after Hydra inheritance)
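To see at a glance which rollouts finished, the directory layout above can be walked with the standard library. This is a minimal sketch: rollout_status is a hypothetical helper, and it assumes the _complete marker sits inside each rollouts/{scene_id}/{batch_uuid}/ directory, which this page does not state explicitly.

```python
from pathlib import Path
from typing import Dict, Union


def rollout_status(log_dir: Union[str, Path]) -> Dict[str, bool]:
    """Map "scene_id/batch_uuid" to whether a _complete marker is present."""
    rollouts = Path(log_dir) / "rollouts"
    status: Dict[str, bool] = {}
    if not rollouts.is_dir():
        return status  # run failed before producing any rollouts
    for batch_dir in sorted(rollouts.glob("*/*")):
        if batch_dir.is_dir():
            key = f"{batch_dir.parent.name}/{batch_dir.name}"
            # Assumption: the marker lives next to rollout.asl in the batch directory.
            status[key] = (batch_dir / "_complete").exists()
    return status


# "runs/RUN_DIR" is a placeholder; substitute the actual wizard.log_dir.
for batch, done in rollout_status("runs/RUN_DIR").items():
    print(f"{batch}: {'complete' if done else 'incomplete'}")
```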
Understanding Driving Quality Metrics
The simulation evaluates driving quality across multiple dimensions. Results are in aggregate/metrics_results.txt and visualized in aggregate/metrics_results.png.
Safety Metrics (Binary)
0 = pass, 1 = fail
- collision_at_fault: Driver caused a collision (front/lateral impact)
- collision_rear: Rear-end collision (not at fault)
- offroad: Vehicle drove off the road
Performance Metrics (Continuous)
- dist_to_gt_trajectory: Maximum distance from ground truth path (meters)
  - Lower is better; indicates how closely the driver follows expected routes
  - Aggregated using MAX over time (worst deviation during the drive)
- duration_frac_20s: Fraction of 20s drive completed before any failure
  - 1.0 = completed full 20s without issues
  - Less than 1.0 = failed early (collision, off-road, or excessive deviation)
Distance Between Incidents
- avg_dist_between_incidents: Average km traveled per incident (collision or offroad)
  - Higher is better; measures safety over distance
- avg_dist_between_incidents_at_fault: Average km traveled per at-fault incident
  - Higher is better; excludes rear-end collisions not caused by the driver
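The metric definitions above can be made concrete with a few one-line functions. These are illustrative readings of the descriptions on this page, not AlpaSim's evaluation code: the function names are hypothetical, and returning infinity when there are no incidents is an assumption.

```python
from typing import Optional, Sequence


def max_deviation_m(distances_m: Sequence[float]) -> float:
    """dist_to_gt_trajectory: worst (MAX) deviation over the drive, in meters."""
    return max(distances_m)


def duration_frac(first_failure_s: Optional[float], horizon_s: float = 20.0) -> float:
    """duration_frac_20s: fraction of the horizon driven before the first failure."""
    if first_failure_s is None:  # no failure during the drive
        return 1.0
    return min(first_failure_s, horizon_s) / horizon_s


def km_per_incident(total_km: float, n_incidents: int) -> float:
    """avg_dist_between_incidents: km traveled per incident.

    Assumption: with zero incidents the value is reported as infinity.
    """
    return total_km / n_incidents if n_incidents else float("inf")


print(max_deviation_m([0.2, 1.5, 0.7]))  # per-timestep distances, made up
print(duration_frac(5.0))                # failed after 5s of a 20s drive
print(km_per_incident(12.0, 3))
```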
Interpreting Results
The aggregate/metrics_results.txt file shows statistics (mean, std, quantiles) across all rollouts.
Videos in aggregate/videos/violations/ are organized by failure type for easy review.
Performance Metrics
Automatically Generated Metrics Plot
After each simulation run, AlpaSim automatically generates a comprehensive performance visualization.
Location: runs/{RUN_DIR}/metrics/metrics_plot.png
This 3×3 grid plot includes:
Row 1: RPC Performance
- RPC Duration histogram - Total time from call start to coroutine resumption
- RPC Blocking histogram - Event loop scheduler delay
- RPC Queue Depth histogram - Service saturation levels
- Rollout Duration histogram - Total time per rollout
- Step Duration histogram - Time per simulation step
- Service Configuration table - Shows replica counts and capacity
- CPU Utilization boxplots - Per-service CPU usage
- GPU Utilization boxplots - GPU compute usage
- GPU Memory boxplots - Memory usage with capacity line
- Async worker idle percentage - How much time runtime spent idle
- Sim seconds per rollout - Wallclock time per simulation
Identifying Bottlenecks
High Queue Depth
Service is saturated → Increase replicas_per_container or n_concurrent_rollouts
High RPC Duration
Service is slow → Consider optimization or scaling
Low GPU Utilization (<50%)
Underutilized → Can increase load by scaling concurrent rollouts
High GPU Utilization (>90%)
May be saturated → Check for throttling, consider adding GPUs
Unbalanced Service Config
Total capacity should match across all services to avoid bottlenecks
Performance Indicators
- Low idle percentage (less than 20%) → Runtime is busy, good utilization
- High idle percentage (greater than 80%) → Lots of waiting, check for bottlenecks
- Consistent rollout times → Good stability
- Wide rollout time variance → Investigate outliers in logs
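The thresholds above can be turned into a small triage helper for values read off the metrics plot. This is only a sketch: triage is a hypothetical function, its inputs are entered by hand rather than parsed from AlpaSim output, and the messages restate the guidance on this page.

```python
from typing import List


def triage(gpu_util_pct: float, idle_pct: float) -> List[str]:
    """Apply the rule-of-thumb thresholds from this page to two readings."""
    notes = []
    if gpu_util_pct < 50:
        notes.append("GPU underutilized: increase concurrent rollouts")
    elif gpu_util_pct > 90:
        notes.append("GPU may be saturated: check for throttling, consider adding GPUs")
    if idle_pct < 20:
        notes.append("runtime busy: good utilization")
    elif idle_pct > 80:
        notes.append("runtime mostly idle: check for bottlenecks")
    return notes


# Example readings (made up): 40% GPU utilization, 85% async worker idle time.
for note in triage(gpu_util_pct=40, idle_pct=85):
    print(note)
```

Readings between the thresholds produce no notes, which matches the page's framing of these values as indicators rather than hard limits.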
Simulation Configuration
Enabling/Disabling Services
Use runtime.endpoints.<service>.skip to disable services:
Changing the Model
By default, the VaVAM driver and model are used. Model weights are downloaded using data/download_vavam_assets.sh and stored in data/vavam-driver/.
Using a Different Model
Mount a custom vavam-driver directory in place of the default, data/vavam-driver/ (in repository root).
The wizard mounts defines.vavam_driver as /mnt/vavam_driver in the container and the driver loads the model from that path.
Using a Different Driver/Inference Code
To use a custom driver container image:
Files in src/driver/ are automatically mounted into containers at runtime.
Troubleshooting
Common Issues
Rollouts directory not appearing
Cause: Simulation failed to start or complete
Solution:
- Check console logs for the first error message
- Verify all services started successfully
- Check txt-logs/ for service-specific errors
- Ensure scenes downloaded correctly to data/nre-artifacts/all-usdzs/
Out of memory errors
Cause: GPU memory exhausted
Solution:
- Reduce n_concurrent_rollouts per service
- Reduce replicas_per_container
- Use smaller batch sizes
- Check GPU memory usage in metrics/metrics_plot.png
Timing synchronization errors
Cause: Misaligned timing parameters
Solution:
- Verify egopose_interval_us equals control_timestep_us
- Ensure time_start_offset_us is a multiple of control_timestep_us
- Check camera frame_interval_us aligns with control timestep
- Review the Changing Inference Frequency section
Slow simulation performance
Cause: Service bottleneck or misconfiguration
Solution:
- Check metrics/metrics_plot.png for queue depths and utilization
- Identify the bottleneck service (high queue depth)
- Increase replicas or concurrent rollouts for that service
- Verify all services have balanced total capacity