

Performance Tuning

Replica Counts and GPU Distribution

The number of service replicas and their GPU assignments are configured in the deployment configs under src/wizard/configs/deploy/. For a local workstation, use local_oss.yaml.

Understanding the Configuration

Each service has two key parameters:
services:
  sensorsim:
    replicas_per_container: 4  # Number of service replicas per container
    gpus: [0, 1, 2, 3]        # GPUs to create containers on
How it works:
  • One container per GPU (or one container total if gpus: null)
  • Each container runs replicas_per_container service instances
  • Total replicas = nr_gpus * replicas_per_container
Example:
  • gpus: [0, 1, 2, 3] → 4 containers (one per GPU)
  • replicas_per_container: 4 → 4 replicas per container
  • Total: 4 × 4 = 16 service replicas

Balancing Replicas and Concurrent Rollouts

Total simulation throughput capacity is determined by:
Total capacity = nr_gpus × replicas_per_container × n_concurrent_rollouts
where n_concurrent_rollouts is the number of rollouts (simulation episodes) each service replica can process simultaneously.
All services must have equal total capacity to avoid bottlenecks.
Example from local_oss.yaml scaled up:
services:
  sensorsim:
    replicas_per_container: 4
    gpus: [0, 1]

  driver:
    replicas_per_container: 8
    gpus: [2, 3]

  controller:
    replicas_per_container: 16
    gpus: null  # CPU-only: 1 container

runtime:
  endpoints:
    sensorsim:
      n_concurrent_rollouts: 4  # 2 GPUs × 4 replicas × 4 concurrent = 32

    driver:
      n_concurrent_rollouts: 2  # 2 GPUs × 8 replicas × 2 concurrent = 32

    controller:
      n_concurrent_rollouts: 2  # 1 container × 16 replicas × 2 concurrent = 32
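The balance rule can be checked with a few lines of arithmetic. This is an illustrative sketch, not part of the wizard tooling; the numbers mirror the example config above:

```python
# Illustrative capacity check -- not an AlpaSim utility.
# Each entry: (nr_containers, replicas_per_container, n_concurrent_rollouts)
services = {
    "sensorsim":  (2, 4, 4),   # gpus: [0, 1] -> 2 containers
    "driver":     (2, 8, 2),   # gpus: [2, 3] -> 2 containers
    "controller": (1, 16, 2),  # gpus: null   -> 1 container
}

def total_capacity(containers, replicas, concurrent):
    """Total capacity = nr_containers x replicas_per_container x n_concurrent_rollouts."""
    return containers * replicas * concurrent

capacities = {name: total_capacity(*cfg) for name, cfg in services.items()}
print(capacities)  # {'sensorsim': 32, 'driver': 32, 'controller': 32}

# All services must match, otherwise the smallest-capacity service becomes the bottleneck.
assert len(set(capacities.values())) == 1, f"unbalanced capacities: {capacities}"
```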

Changing Inference Frequency

Changing inference frequency requires coordinating multiple timing parameters.

Understanding Timing Parameters

The simulator has multiple synchronized “clocks”:
  1. Driver inference (control_timestep_us) - How often the model makes decisions
  2. Camera frames (frame_interval_us) - How often cameras capture images
  3. GPS/Pose updates (egopose_interval_us) - How often position is updated
  4. Simulation start (time_start_offset_us) - Initial offset to avoid artifacts
For correct operation, these parameters must be mathematically aligned.

Scenario 1: Simple Frequency Change

To change to 5Hz inference (200ms between decisions):
1. Set inference frequency:
   runtime.default_scenario_parameters.control_timestep_us=200000  # 200ms = 5Hz
2. Match the GPS update rate (egopose_interval_us must equal control_timestep_us):
   runtime.default_scenario_parameters.egopose_interval_us=200000
3. Set the time offset (must be a multiple of control_timestep_us):
   runtime.default_scenario_parameters.time_start_offset_us=600000  # 3 × 200ms
4. Match the camera frame rate. For the VaVAM default (1 camera):
   runtime.default_scenario_parameters.cameras.0.frame_interval_us=200000
   For 2-camera configs, also set:
   runtime.default_scenario_parameters.cameras.1.frame_interval_us=200000
Full command:
uv run alpasim_wizard +deploy=local_oss \
  wizard.log_dir=runs/{DATETIME} \
  runtime.default_scenario_parameters.control_timestep_us=200000 \
  runtime.default_scenario_parameters.egopose_interval_us=200000 \
  runtime.default_scenario_parameters.time_start_offset_us=600000 \
  runtime.default_scenario_parameters.cameras.0.frame_interval_us=200000

Scenario 2: High-Rate Camera with Lower Inference

To use 30Hz cameras (33.3ms) but 10Hz inference (100ms):
uv run alpasim_wizard +deploy=local_oss \
  wizard.log_dir=runs/{DATETIME} \
  runtime.default_scenario_parameters.control_timestep_us=100002 \
  runtime.default_scenario_parameters.egopose_interval_us=100002 \
  runtime.default_scenario_parameters.time_start_offset_us=300006 \
  runtime.default_scenario_parameters.cameras.0.frame_interval_us=33334 \
  ++driver.inference.Cframes_subsample=3
How it works:
  1. Camera captures at 30Hz: frame_interval_us=33334 (33.3ms)
  2. Inference runs at 10Hz: control_timestep_us=100002 (must be 3 × 33334)
  3. Subsample frames: driver.inference.Cframes_subsample=3 (use every 3rd frame)
  4. Egopose matches inference: egopose_interval_us=100002
  5. Time offset aligns: time_start_offset_us=300006 (3 × 100002)
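The alignment constraints in both scenarios reduce to simple divisibility checks. The helper below is a hypothetical sketch of that arithmetic (it is not an AlpaSim API); its return value happens to be the frame subsample factor:

```python
# Hypothetical alignment check for the timing constraints described above --
# not an AlpaSim API, just the arithmetic the scenarios rely on.
def check_alignment(control_us, egopose_us, offset_us, frame_us):
    assert egopose_us == control_us, "egopose interval must equal the control timestep"
    assert offset_us % control_us == 0, "offset must be a multiple of the control timestep"
    assert control_us % frame_us == 0, "control timestep must span a whole number of frames"
    return control_us // frame_us  # frames per decision (the subsample factor)

# Scenario 2 values: 30Hz camera, 10Hz inference
print(check_alignment(100002, 100002, 300006, 33334))  # 3 -> subsample every 3rd frame
```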

Common Frequencies

Frequency  control_timestep_us  egopose_interval_us  time_start_offset_us  Notes
2Hz        500000 (500ms)       500000               500000 or 1500000     VaVAM default
5Hz        200000 (200ms)       200000               600000 (3×)           Example config
10Hz       100000 (100ms)       100000               300000 (3×)           Base default
30Hz       33334 (33.3ms)       33334                100002 (3×)           High frequency
Most configs use time_start_offset_us = 3 × control_timestep_us to avoid artifacts at scene start.
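The table rows can be derived mechanically. This is an illustrative helper (not an AlpaSim utility): round the period up to a whole microsecond, then apply the 3× offset rule.

```python
import math

# Illustrative derivation of the table values above -- not an AlpaSim utility.
def timing_for(freq_hz):
    control_us = math.ceil(1_000_000 / freq_hz)  # round up to a whole microsecond
    egopose_us = control_us                      # must equal the control timestep
    offset_us = 3 * control_us                   # 3 x control_timestep_us avoids start artifacts
    return control_us, egopose_us, offset_us

print(timing_for(5))   # (200000, 200000, 600000)
print(timing_for(30))  # (33334, 33334, 100002)
```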

Validation

The assert_zero_decision_delay flag (enabled by default) validates timing synchronization at runtime:
# Enabled by default, can explicitly set:
runtime.default_scenario_parameters.assert_zero_decision_delay=true
It checks that:
  • Camera frames complete exactly at decision time
  • Egopose updates complete exactly at decision time
If misconfigured, you’ll see errors like:
Camera camera_front_wide_120fov out of sync with planning.
Last started frame finishes at X which is Y microseconds away from decision time Z.
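The kind of check behind that error can be illustrated with modular arithmetic. This sketch is inferred from the error message above, not taken from the simulator's actual implementation:

```python
# Hypothetical sketch of the zero-decision-delay check, inferred from the
# error message above -- the real check lives inside the simulator.
def frame_delay_at_decision(frame_interval_us, decision_time_us):
    """Microseconds between the last completed frame and the decision time."""
    return decision_time_us % frame_interval_us

# Aligned: a 200000us frame finishes exactly at the 600000us decision time.
print(frame_delay_at_decision(200000, 600000))  # 0 -> in sync

# Misaligned: 33333us frames never line up with 100000us decisions.
print(frame_delay_at_decision(33333, 100000))   # 1 -> would trigger the assertion
```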

Viewing Results and Metrics

Results Directory Structure

After a run completes, results are in wizard.log_dir (e.g., runs/{RUN_DIR}/):
Simulation logs organized by scene and batch (rollouts/):
  • rollouts/{scene_id}/{batch_uuid}/rollout.asl - Full simulation log
  • rollouts/{scene_id}/{batch_uuid}/metrics.parquet - Per-rollout metrics
  • rollouts/{scene_id}/{batch_uuid}/{clipgt_id}_{batch_id}_{rollout_id}.mp4 - Evaluation video
  • _complete - Marker file indicating successful completion
Aggregated results across all rollouts (aggregate/):
  • metrics_results.txt - Formatted table of driving scores (mean, std, quantiles)
  • metrics_results.png - Visual summary of driving quality metrics
  • metrics_unprocessed.parquet - Combined metrics from all rollouts
  • videos/ - Videos organized by violation types (collision_at_fault, offroad, etc.)
Performance profiling data (metrics/):
  • metrics.prom - Prometheus metrics from simulation
  • metrics_plot.png - Performance visualization (CPU/GPU/RPC metrics)
Per-service debug logs for troubleshooting (txt-logs/)
Resolved configuration used for this run (after Hydra inheritance)

Understanding Driving Quality Metrics

The simulation evaluates driving quality across multiple dimensions. Results are in aggregate/metrics_results.txt and visualized in aggregate/metrics_results.png.

Safety Metrics (Binary)

0 = pass, 1 = fail
  • collision_at_fault: Driver caused a collision (front/lateral impact)
  • collision_rear: Rear-end collision (not at fault)
  • offroad: Vehicle drove off the road

Performance Metrics (Continuous)

  • dist_to_gt_trajectory: Maximum distance from ground truth path (meters)
    • Lower is better; indicates how closely the driver follows expected routes
    • Aggregated using MAX over time (worst deviation during the drive)
  • duration_frac_20s: Fraction of 20s drive completed before any failure
    • 1.0 = completed full 20s without issues
    • Less than 1.0 = failed early (collision, off-road, or excessive deviation)

Distance Between Incidents

  • avg_dist_between_incidents: Average km traveled per incident (collision or offroad)
    • Higher is better; measures safety over distance
  • avg_dist_between_incidents_at_fault: Average km traveled per at-fault incident
    • Higher is better; excludes rear-end collisions not caused by the driver
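The distance-between-incidents metrics are simple ratios. The sketch below illustrates the arithmetic; the field names and the exact incident definitions are assumptions for illustration, not AlpaSim's actual schema:

```python
# Illustrative computation of distance-between-incidents style metrics.
# Field names and incident definitions are assumptions, not AlpaSim's schema.
rollouts = [
    {"km": 0.5, "collision_at_fault": 0, "collision_rear": 0, "offroad": 0},
    {"km": 0.4, "collision_at_fault": 1, "collision_rear": 0, "offroad": 0},
    {"km": 0.5, "collision_at_fault": 0, "collision_rear": 1, "offroad": 0},
    {"km": 0.6, "collision_at_fault": 0, "collision_rear": 0, "offroad": 1},
]

total_km = sum(r["km"] for r in rollouts)  # 2.0 km driven in total
incidents = sum(r["collision_at_fault"] + r["collision_rear"] + r["offroad"]
                for r in rollouts)                                   # 3 incidents
at_fault = sum(r["collision_at_fault"] + r["offroad"] for r in rollouts)  # excludes rear-end

print(total_km / incidents)  # ~0.67 km per incident
print(total_km / at_fault)   # 1.0 km per at-fault incident
```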

Interpreting Results

The aggregate/metrics_results.txt file shows statistics across all rollouts:
collision_at_fault: mean=0.05 → 5% of rollouts had at-fault collisions
dist_to_gt_trajectory: mean=2.3 → Average 2.3m deviation from GT path  
duration_frac_20s: mean=0.95 → Average 95% of 20s completed
Videos in aggregate/videos/violations/ are organized by failure type for easy review.
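Aggregation of per-rollout metrics into the mean statistics shown above can be sketched in pure Python. This is a stand-in for illustration only; the real pipeline works from metrics_unprocessed.parquet, and the sample values are made up:

```python
from statistics import mean

# Hypothetical aggregation mirroring the statistics above; the real pipeline
# reads metrics_unprocessed.parquet, and these sample values are invented.
rollout_metrics = [
    {"collision_at_fault": 0, "dist_to_gt_trajectory": 1.8, "duration_frac_20s": 1.0},
    {"collision_at_fault": 0, "dist_to_gt_trajectory": 2.1, "duration_frac_20s": 1.0},
    {"collision_at_fault": 1, "dist_to_gt_trajectory": 3.0, "duration_frac_20s": 0.8},
]

summary = {key: mean(r[key] for r in rollout_metrics) for key in rollout_metrics[0]}
for key, value in summary.items():
    print(f"{key}: mean={value:.2f}")  # e.g. dist_to_gt_trajectory: mean=2.30
```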

Performance Metrics

Automatically Generated Metrics Plot

After each simulation run, AlpaSim automatically generates a comprehensive performance visualization at runs/{RUN_DIR}/metrics/metrics_plot.png. This 3×3 grid plot includes:
Row 1: RPC Performance
  • RPC Duration histogram - Total time from call start to coroutine resumption
  • RPC Blocking histogram - Event loop scheduler delay
  • RPC Queue Depth histogram - Service saturation levels
Row 2: Simulation Timing
  • Rollout Duration histogram - Total time per rollout
  • Step Duration histogram - Time per simulation step
  • Service Configuration table - Shows replica counts and capacity
Row 3: Resource Utilization
  • CPU Utilization boxplots - Per-service CPU usage
  • GPU Utilization boxplots - GPU compute usage
  • GPU Memory boxplots - Memory usage with capacity line
Summary header shows:
  • Async worker idle percentage - How much time runtime spent idle
  • Sim seconds per rollout - Wallclock time per simulation

Identifying Bottlenecks

  • Saturated service (high RPC queue depth) → increase replicas_per_container or n_concurrent_rollouts
  • Slow service (long RPC or step durations) → consider optimization or scaling
  • Underutilized service → increase load by scaling concurrent rollouts
  • GPU may be saturated (high utilization) → check for throttling, consider adding GPUs
  • Total capacity should match across all services to avoid bottlenecks

Performance Indicators

  • Low idle percentage (less than 20%) → Runtime is busy, good utilization
  • High idle percentage (greater than 80%) → Lots of waiting, check for bottlenecks
  • Consistent rollout times → Good stability
  • Wide rollout time variance → Investigate outliers in logs
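The idle-percentage heuristics above can be summarized as a tiny classifier. The 20% and 80% thresholds come straight from the bullets; the middle branch and wording are illustrative assumptions:

```python
# Toy classifier for the idle-percentage heuristics above. The 20%/80%
# thresholds come from the bullets; the middle branch is an assumption.
def interpret_idle(idle_pct):
    if idle_pct < 20:
        return "busy: good utilization"
    if idle_pct > 80:
        return "mostly waiting: check for bottlenecks"
    return "moderate: inspect rollout time variance"

print(interpret_idle(10))  # busy: good utilization
print(interpret_idle(90))  # mostly waiting: check for bottlenecks
```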

Simulation Configuration

Enabling/Disabling Services

Use runtime.endpoints.<service>.skip to disable services:
uv run alpasim_wizard +deploy=local_oss \
  wizard.log_dir=runs/{DATETIME} \
  runtime.endpoints.trafficsim.skip=true

Changing the Model

By default, the VaVAM driver and model are used. Model weights are downloaded using data/download_vavam_assets.sh and stored in data/vavam-driver/.

Using a Different Model

Mount a custom vavam-driver directory:
uv run alpasim_wizard +deploy=local_oss \
  wizard.log_dir=runs/{DATETIME} \
  defines.vavam_driver=/path/to/custom/vavam-driver
Default location: data/vavam-driver/ (in the repository root). The wizard mounts defines.vavam_driver as /mnt/vavam_driver in the container, and the driver loads the model from that path.

Using a Different Driver/Inference Code

To use a custom driver container image:
uv run alpasim_wizard +deploy=local_oss \
  wizard.log_dir=runs/{DATETIME} \
  services.driver.image=<your-registry>/<your-driver-image>:<tag>
Your custom image must expose a gRPC endpoint compatible with the driver service interface (see protocol buffer definitions). For development of driver code within this repository, changes to src/driver/ are automatically mounted into containers at runtime.

Troubleshooting

Common Issues

Cause: Simulation failed to start or complete
Solution:
  1. Check console logs for first error message
  2. Verify all services started successfully
  3. Check txt-logs/ for service-specific errors
  4. Ensure scenes downloaded correctly to data/nre-artifacts/all-usdzs/
Cause: GPU memory exhausted
Solution:
  1. Reduce n_concurrent_rollouts per service
  2. Reduce replicas_per_container
  3. Use smaller batch sizes
  4. Check GPU memory usage in metrics/metrics_plot.png
Cause: Misaligned timing parameters
Solution:
  1. Verify egopose_interval_us equals control_timestep_us
  2. Ensure time_start_offset_us is a multiple of control_timestep_us
  3. Check camera frame_interval_us aligns with control timestep
  4. Review Changing Inference Frequency section
Cause: Service bottleneck or misconfiguration
Solution:
  1. Check metrics/metrics_plot.png for queue depths and utilization
  2. Identify bottleneck service (high queue depth)
  3. Increase replicas or concurrent rollouts for that service
  4. Verify all services have balanced total capacity