Performance & Operations

Performance Tuning

Replica Counts and GPU Distribution

The number of service replicas and their GPU assignments are set in the deployment configs in src/wizard/configs/deploy/. For a local workstation, the relevant file is local_oss.yaml.

Understanding the Configuration

Each service has two key parameters:

services:
  sensorsim:
    replicas_per_container: 4  # Number of service replicas per container
    gpus: [0, 1, 2, 3]        # GPUs to create containers on

How it works:

One container is created per GPU listed in gpus (or a single container if gpus: null)
Each container runs replicas_per_container service instances
Total replicas = nr_gpus × replicas_per_container (with gpus: null, count the single container in place of nr_gpus)

Example:

gpus: [0, 1, 2, 3] → 4 containers (one per GPU)
replicas_per_container: 4 → 4 replicas per container
Total: 4 × 4 = 16 service replicas
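
The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the rule as stated here; the function names container_count and total_replicas are hypothetical and not part of AlpaSim.

```python
from typing import Optional, Sequence


def container_count(gpus: Optional[Sequence[int]]) -> int:
    """One container per listed GPU, or a single container when gpus is None (YAML null)."""
    return 1 if gpus is None else len(gpus)


def total_replicas(replicas_per_container: int, gpus: Optional[Sequence[int]]) -> int:
    """Total service replicas = number of containers x replicas per container."""
    return container_count(gpus) * replicas_per_container


print(total_replicas(4, [0, 1, 2, 3]))  # 4 containers x 4 replicas
print(total_replicas(16, None))         # 1 CPU-only container x 16 replicas
```

Both calls yield 16 replicas, which shows that a GPU service and a CPU-only service can reach the same replica count with different settings.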

Balancing Replicas and Concurrent Rollouts

Total simulation throughput capacity is determined by:

Total capacity = nr_gpus × replicas_per_container × n_concurrent_rollouts

where n_concurrent_rollouts is the number of rollouts (simulation episodes) each service replica can process simultaneously.

All services must have equal total capacity to avoid bottlenecks.

Example from local_oss.yaml scaled up:

services:
  sensorsim:
    replicas_per_container: 4
    gpus: [0, 1]

  driver:
    replicas_per_container: 8
    gpus: [2, 3]

  controller:
    replicas_per_container: 16
    gpus: null  # CPU-only: 1 container

runtime:
  endpoints:
    sensorsim:
      n_concurrent_rollouts: 4  # 2 GPUs × 4 replicas × 4 concurrent = 32

    driver:
      n_concurrent_rollouts: 2  # 2 GPUs × 8 replicas × 2 concurrent = 32

    controller:
      n_concurrent_rollouts: 2  # 1 container (CPU-only) × 16 replicas × 2 concurrent = 32
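
A quick way to check that a config is balanced is to compute the total capacity of each service and compare the results. The sketch below mirrors the example config as a plain Python dict; the helper names (total_capacity, capacities) are hypothetical, and it assumes a CPU-only service (gpus: null) counts as one container.

```python
from typing import Any, Dict

# The example config from above, written as a Python dict for illustration.
config: Dict[str, Any] = {
    "services": {
        "sensorsim": {"replicas_per_container": 4, "gpus": [0, 1]},
        "driver": {"replicas_per_container": 8, "gpus": [2, 3]},
        "controller": {"replicas_per_container": 16, "gpus": None},
    },
    "runtime": {
        "endpoints": {
            "sensorsim": {"n_concurrent_rollouts": 4},
            "driver": {"n_concurrent_rollouts": 2},
            "controller": {"n_concurrent_rollouts": 2},
        }
    },
}


def total_capacity(service: Dict[str, Any], endpoint: Dict[str, Any]) -> int:
    """containers x replicas_per_container x n_concurrent_rollouts."""
    gpus = service["gpus"]
    containers = 1 if gpus is None else len(gpus)  # assumption: null means one container
    return containers * service["replicas_per_container"] * endpoint["n_concurrent_rollouts"]


def capacities(cfg: Dict[str, Any]) -> Dict[str, int]:
    """Total capacity per service, keyed by service name."""
    endpoints = cfg["runtime"]["endpoints"]
    return {
        name: total_capacity(service, endpoints[name])
        for name, service in cfg["services"].items()
    }


caps = capacities(config)
balanced = len(set(caps.values())) == 1  # True when every service has the same capacity
print(caps, "balanced" if balanced else "unbalanced")
```

For the example config every service comes out at 32, so the check reports a balanced setup.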

Changing Inference Frequency

Changing inference frequency requires coordinating multiple timing parameters.

Understanding Timing Parameters

The simulator has multiple synchronized “clocks”:

Driver inference (control_timestep_us) - How often the model makes decisions
Camera frames (frame_interval_us) - How often cameras capture images
GPS/Pose updates (egopose_interval_us) - How often position is updated
Simulation start (time_start_offset_us) - Initial offset to avoid artifacts

For correct operation, these parameters must be mathematically aligned.

Scenario 1: Simple Frequency Change

To change to 5Hz inference (200ms between decisions):

Set inference frequency

runtime.default_scenario_parameters.control_timestep_us=200000  # 200ms = 5Hz

Match GPS update rate

egopose_interval_us must equal control_timestep_us:

runtime.default_scenario_parameters.egopose_interval_us=200000

Set time offset

Must be a multiple of control_timestep_us:

runtime.default_scenario_parameters.time_start_offset_us=600000  # 3 × 200ms

Match camera frame rate

For VaVAM default (1 camera):

runtime.default_scenario_parameters.cameras.0.frame_interval_us=200000

For 2-camera configs, also set:

runtime.default_scenario_parameters.cameras.1.frame_interval_us=200000

Full command:

uv run alpasim_wizard +deploy=local_oss \
  wizard.log_dir=runs/{DATETIME} \
  runtime.default_scenario_parameters.control_timestep_us=200000 \
  runtime.default_scenario_parameters.egopose_interval_us=200000 \
  runtime.default_scenario_parameters.time_start_offset_us=600000 \
  runtime.default_scenario_parameters.cameras.0.frame_interval_us=200000
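
When sweeping several frequencies, it can help to generate these overrides instead of typing them by hand. The following is a minimal sketch under the rules of this scenario (egopose and camera intervals equal the control timestep, offset is 3 × the control timestep); simple_frequency_overrides is a hypothetical helper, not an AlpaSim function.

```python
from typing import List

PREFIX = "runtime.default_scenario_parameters"


def simple_frequency_overrides(
    control_timestep_us: int, n_cameras: int = 1, offset_multiple: int = 3
) -> List[str]:
    """Build command-line overrides where every clock ticks at the inference interval."""
    overrides = [
        f"{PREFIX}.control_timestep_us={control_timestep_us}",
        f"{PREFIX}.egopose_interval_us={control_timestep_us}",
        f"{PREFIX}.time_start_offset_us={offset_multiple * control_timestep_us}",
    ]
    # One frame_interval_us override per camera index (0, 1, ...).
    for cam in range(n_cameras):
        overrides.append(f"{PREFIX}.cameras.{cam}.frame_interval_us={control_timestep_us}")
    return overrides


for line in simple_frequency_overrides(200_000):
    print(line)
```

Calling it with 200_000 reproduces the four overrides used in the 5Hz command; passing n_cameras=2 adds the cameras.1 override.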

Scenario 2: High-Rate Camera with Lower Inference

To use 30Hz cameras (33.3ms) but 10Hz inference (100ms):

uv run alpasim_wizard +deploy=local_oss \
  wizard.log_dir=runs/{DATETIME} \
  runtime.default_scenario_parameters.control_timestep_us=100002 \
  runtime.default_scenario_parameters.egopose_interval_us=100002 \
  runtime.default_scenario_parameters.time_start_offset_us=300006 \
  runtime.default_scenario_parameters.cameras.0.frame_interval_us=33334 \
  ++driver.inference.Cframes_subsample=3

How it works:

Camera captures at 30Hz: frame_interval_us=33334 (33.3ms)
Inference runs at 10Hz: control_timestep_us=100002 (must be 3 × 33334)
Subsample frames: driver.inference.Cframes_subsample=3 (use every 3rd frame)
Egopose matches inference: egopose_interval_us=100002
Time offset aligns: time_start_offset_us=300006 (3 × 100002)
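
The values in this scenario all follow from the camera rate and the subsample factor. The sketch below derives them, assuming the frame interval is rounded up to a whole microsecond (which is consistent with the 33334 used above) and that the offset is 3 × the control timestep. The function name high_rate_camera_timing is hypothetical.

```python
from typing import Dict


def high_rate_camera_timing(
    camera_hz: int, subsample: int, offset_multiple: int = 3
) -> Dict[str, int]:
    """Derive aligned timing parameters from a camera rate and a subsample factor."""
    # Integer ceiling division: 1_000_000 / 30 = 33333.33... rounds up to 33334.
    frame_interval_us = (1_000_000 + camera_hz - 1) // camera_hz
    # Inference runs once every `subsample` camera frames.
    control_timestep_us = subsample * frame_interval_us
    return {
        "frame_interval_us": frame_interval_us,
        "control_timestep_us": control_timestep_us,
        "egopose_interval_us": control_timestep_us,
        "time_start_offset_us": offset_multiple * control_timestep_us,
    }


print(high_rate_camera_timing(camera_hz=30, subsample=3))
```

Deriving the control timestep from the rounded frame interval, rather than from 1/10 s directly, is what keeps it an exact multiple of the camera interval (100002 instead of 100000).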

Common Frequencies

Frequency	`control_timestep_us`	`egopose_interval_us`	`time_start_offset_us`	Notes
2Hz	500000 (500ms)	500000	500000 or 1500000	VaVAM default
5Hz	200000 (200ms)	200000	600000 (3×)	Example config
10Hz	100000 (100ms)	100000	300000 (3×)	Base default
30Hz	33334 (33.3ms)	33334	100002 (3×)	High frequency

Most configs use time_start_offset_us = 3 × control_timestep_us to avoid artifacts at scene start.

Validation

The assert_zero_decision_delay flag (enabled by default) validates timing synchronization at runtime:

# Enabled by default, can explicitly set:
runtime.default_scenario_parameters.assert_zero_decision_delay=true

It checks that:

Camera frames complete exactly at decision time
Egopose updates complete exactly at decision time

If misconfigured, you’ll see errors like:

Camera camera_front_wide_120fov out of sync with planning.
Last started frame finishes at X which is Y microseconds away from decision time Z.
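
To catch misconfigurations before launching a run, the alignment rules from this section can be checked offline. The following is a sketch of those rules only, not the simulator's actual runtime check; timing_problems is a hypothetical helper, and the camera rule (control timestep is a whole multiple of the frame interval) is an interpretation of the scenarios above.

```python
from typing import Dict, List


def timing_problems(
    control_timestep_us: int,
    egopose_interval_us: int,
    time_start_offset_us: int,
    camera_frame_intervals_us: Dict[str, int],
) -> List[str]:
    """Return human-readable descriptions of misaligned timing parameters."""
    problems = []
    if egopose_interval_us != control_timestep_us:
        problems.append("egopose_interval_us must equal control_timestep_us")
    if time_start_offset_us % control_timestep_us != 0:
        problems.append("time_start_offset_us must be a multiple of control_timestep_us")
    for camera, interval in camera_frame_intervals_us.items():
        # Assumed rule: a whole number of frames must fit in one control step.
        if control_timestep_us % interval != 0:
            problems.append(
                f"{camera}: control_timestep_us is not a multiple of frame_interval_us"
            )
    return problems


# Scenario 2 values are aligned, so no problems are reported.
print(timing_problems(100002, 100002, 300006, {"cameras.0": 33334}))
# A 100000 us control step does not divide evenly into 33334 us frames.
print(timing_problems(100000, 100000, 300000, {"cameras.0": 33334}))
```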

Viewing Results and Metrics

Results Directory Structure

After a run completes, results are in wizard.log_dir (e.g., runs/{RUN_DIR}/):

rollouts/

Simulation logs organized by scene and batch:

rollouts/{scene_id}/{batch_uuid}/rollout.asl - Full simulation log
rollouts/{scene_id}/{batch_uuid}/metrics.parquet - Per-rollout metrics
rollouts/{scene_id}/{batch_uuid}/{clipgt_id}_{batch_id}_{rollout_id}.mp4 - Evaluation video
_complete - Marker file indicating successful completion

aggregate/

Aggregated results across all rollouts:

metrics_results.txt - Formatted table of driving scores (mean, std, quantiles)
metrics_results.png - Visual summary of driving quality metrics
metrics_unprocessed.parquet - Combined metrics from all rollouts
videos/ - Videos organized by violation types (collision_at_fault, offroad, etc.)

telemetry/

Performance profiling data:

metrics.prom - Prometheus metrics from simulation
metrics_plot.png - Performance visualization (CPU/GPU/RPC metrics)

txt-logs/

Per-service debug logs for troubleshooting

wizard-config.yaml

Resolved configuration used for this run (after Hydra inheritance)
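
For scripted post-processing, the rollouts/ layout described above can be walked with the standard library. This is a minimal sketch that relies only on the paths listed in this section; find_rollouts is a hypothetical helper, and it reports whether each rollout has a metrics.parquet file without reading it.

```python
from pathlib import Path
from typing import List, Tuple


def find_rollouts(log_dir: Path) -> List[Tuple[str, str, bool]]:
    """List (scene_id, batch_uuid, has_metrics) for every rollout.asl under log_dir."""
    results = []
    # Layout: rollouts/{scene_id}/{batch_uuid}/rollout.asl
    for asl in sorted((log_dir / "rollouts").glob("*/*/rollout.asl")):
        batch_dir = asl.parent
        has_metrics = (batch_dir / "metrics.parquet").is_file()
        results.append((batch_dir.parent.name, batch_dir.name, has_metrics))
    return results
```

Calling find_rollouts(Path("runs/<your-run-dir>")) returns an empty list when no rollout logs exist, which is a quick first check when a run appears to have produced nothing.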

Understanding Driving Quality Metrics

The simulation evaluates driving quality across multiple dimensions. Results are in aggregate/metrics_results.txt and visualized in aggregate/metrics_results.png.

Safety Metrics (Binary)

0 = pass, 1 = fail

collision_at_fault: Driver caused a collision (front/lateral impact)
collision_rear: Rear-end collision (not at fault)
offroad: Vehicle drove off the road

Performance Metrics (Continuous)

dist_to_gt_trajectory: Maximum distance from ground truth path (meters)
- Lower is better; indicates how closely the driver follows expected routes
- Aggregated using MAX over time (worst deviation during the drive)
duration_frac_20s: Fraction of 20s drive completed before any failure
- 1.0 = completed full 20s without issues
- Less than 1.0 = failed early (collision, off-road, or excessive deviation)

Distance Between Incidents

avg_dist_between_incidents: Average km traveled per incident (collision or offroad)
- Higher is better; measures safety over distance
avg_dist_between_incidents_at_fault: Average km traveled per at-fault incident
- Higher is better; excludes rear-end collisions not caused by the driver

Interpreting Results

The aggregate/metrics_results.txt file shows statistics across all rollouts:

collision_at_fault: mean=0.05 → 5% of rollouts had at-fault collisions
dist_to_gt_trajectory: mean=2.3 → Average 2.3m deviation from GT path  
duration_frac_20s: mean=0.95 → Average 95% of 20s completed
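
The reason a mean of 0.05 reads as "5% of rollouts" is that binary metrics are 0 or 1 per rollout, so their mean is the fraction of failures. A small illustration with made-up values (the list below is hypothetical, not real output):

```python
from statistics import mean

# Hypothetical per-rollout values for collision_at_fault (0 = pass, 1 = fail):
# 19 clean rollouts and 1 rollout with an at-fault collision.
collision_at_fault = [0] * 19 + [1]

failure_rate = mean(collision_at_fault)  # 1 failure out of 20 rollouts
print(f"collision_at_fault: mean={failure_rate:.2f} ({failure_rate:.0%} of rollouts)")
```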

Videos in aggregate/videos/violations/ are organized by failure type for easy review.

Performance Metrics

Automatically Generated Metrics Plot

After each simulation run, AlpaSim automatically generates a comprehensive performance visualization.

Location: runs/{RUN_DIR}/metrics/metrics_plot.png

This 3×3 grid plot includes:

Row 1: RPC Performance

RPC Duration histogram - Total time from call start to coroutine resumption
RPC Blocking histogram - Event loop scheduler delay
RPC Queue Depth histogram - Service saturation levels

Row 2: Simulation Timing

Rollout Duration histogram - Total time per rollout
Step Duration histogram - Time per simulation step
Service Configuration table - Shows replica counts and capacity

Row 3: Resource Utilization

CPU Utilization boxplots - Per-service CPU usage
GPU Utilization boxplots - GPU compute usage
GPU Memory boxplots - Memory usage with capacity line

Summary header shows:

Async worker idle percentage - How much time runtime spent idle
Sim seconds per rollout - Wallclock time per simulation

Identifying Bottlenecks

High Queue Depth

Service is saturated → Increase replicas_per_container or n_concurrent_rollouts

High RPC Duration

Service is slow → Consider optimization or scaling

Low GPU Utilization (<50%)

Underutilized → Can increase load by scaling concurrent rollouts

High GPU Utilization (>90%)

May be saturated → Check for throttling, consider adding GPUs

Unbalanced Service Config

Total capacity should match across all services to avoid bottlenecks

Performance Indicators

Low idle percentage (less than 20%) → Runtime is busy, good utilization
High idle percentage (greater than 80%) → Lots of waiting, check for bottlenecks
Consistent rollout times → Good stability
Wide rollout time variance → Investigate outliers in logs
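
The thresholds from the two lists above can be written down as simple rules of thumb. The sketch below only encodes the numbers stated in this section (50% and 90% GPU utilization, 20% and 80% idle); the function names are hypothetical and the middle ranges are left without a recommendation because the text gives none.

```python
def gpu_utilization_hint(percent: float) -> str:
    """Map a GPU utilization percentage to the guidance given in this section."""
    if percent < 50:
        return "underutilized: consider scaling concurrent rollouts"
    if percent > 90:
        return "possibly saturated: check for throttling, consider adding GPUs"
    return "no specific guidance"


def idle_hint(idle_percent: float) -> str:
    """Map the async worker idle percentage to the guidance given in this section."""
    if idle_percent < 20:
        return "runtime is busy: good utilization"
    if idle_percent > 80:
        return "lots of waiting: check for bottlenecks"
    return "no specific guidance"


print(gpu_utilization_hint(35.0))
print(idle_hint(85.0))
```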

Simulation Configuration

Enabling/Disabling Services

Use runtime.endpoints.<service>.skip to disable services:

uv run alpasim_wizard +deploy=local_oss \
  wizard.log_dir=runs/{DATETIME} \
  runtime.endpoints.trafficsim.skip=true

Changing the Model

By default, the VaVAM driver and model are used. Model weights are downloaded using data/download_vavam_assets.sh and stored in data/vavam-driver/.

Using a Different Model

Mount a custom vavam-driver directory:

uv run alpasim_wizard +deploy=local_oss \
  wizard.log_dir=runs/{DATETIME} \
  defines.vavam_driver=/path/to/custom/vavam-driver

Default location: data/vavam-driver/ (in repository root)

The wizard mounts defines.vavam_driver as /mnt/vavam_driver in the container, and the driver loads the model from that path.

Using a Different Driver/Inference Code

To use a custom driver container image:

uv run alpasim_wizard +deploy=local_oss \
  wizard.log_dir=runs/{DATETIME} \
  services.driver.image=<your-registry>/<your-driver-image>:<tag>

Your custom image must expose a gRPC endpoint compatible with the driver service interface (see protocol buffer definitions). For development of driver code within this repository, changes to src/driver/ are automatically mounted into containers at runtime.

Troubleshooting

Common Issues

Rollouts directory not appearing

Cause: Simulation failed to start or complete

Solution:

Check console logs for first error message
Verify all services started successfully
Check txt-logs/ for service-specific errors
Ensure scenes downloaded correctly to data/nre-artifacts/all-usdzs/

Out of memory errors

Cause: GPU memory exhausted

Solution:

Reduce n_concurrent_rollouts per service
Reduce replicas_per_container
Use smaller batch sizes
Check GPU memory usage in metrics/metrics_plot.png

Timing synchronization errors

Cause: Misaligned timing parameters

Solution:

Verify egopose_interval_us equals control_timestep_us
Ensure time_start_offset_us is a multiple of control_timestep_us
Check camera frame_interval_us aligns with control timestep
Review Changing Inference Frequency section

Slow simulation performance

Cause: Service bottleneck or misconfiguration

Solution:

Check metrics/metrics_plot.png for queue depths and utilization
Identify bottleneck service (high queue depth)
Increase replicas or concurrent rollouts for that service
Verify all services have balanced total capacity
