Comprehensive guide to managing your GPU instances throughout their lifecycle, from deployment to termination.

Instance Dashboard

Access your instance management dashboard at app.comput3.ai/instances.
[Screenshot: instance management dashboard showing running instances, metrics, and controls]

Dashboard Features

Live Status

Real-time status of all your instances with health indicators and uptime tracking.

Resource Metrics

GPU utilization, memory usage, CPU load, and network activity monitoring.

Cost Tracking

Real-time cost accumulation and projected monthly spending based on usage.

Quick Actions

Start, stop, restart, and terminate instances with single-click actions.

Instance Lifecycle Management

1. Launch Phase

Initial Setup (0-60 seconds)
  • Instance provisioning and hardware allocation
  • Operating system and driver installation
  • Environment configuration and startup scripts
  • Network and security group setup
Instance shows “Running” status when ready for connections.
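If you script your launches, you can poll for that status instead of watching the dashboard. A minimal sketch using the instance API shown in the API Management section below; the endpoint shape and the exact "Running" state string are assumptions:
import time
import requests

API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.comput3.ai/v1"

def wait_until_running(instance_id, timeout=300, poll_interval=5):
    """Poll instance state until it reports 'Running' or the timeout expires."""
    headers = {"Authorization": f"Bearer {API_KEY}"}
    deadline = time.time() + timeout
    while time.time() < deadline:
        resp = requests.get(f"{BASE_URL}/instances/{instance_id}", headers=headers)
        resp.raise_for_status()
        if resp.json().get("state") == "Running":  # assumed state string
            return True
        time.sleep(poll_interval)
    return False
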
2. Active Phase

Normal Operation
  • Monitor resource utilization and performance
  • Scale resources up or down as needed
  • Manage data and model storage
  • Configure auto-shutdown and scheduling (see the idle-watcher sketch below)
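One way to implement auto-shutdown is a small watcher that polls GPU utilization and powers the instance down after a sustained idle period. A minimal sketch, assuming nvidia-smi is on the PATH and the script can invoke shutdown via sudo; the thresholds are illustrative:
import subprocess
import time

IDLE_THRESHOLD = 5   # percent GPU utilization treated as idle (assumption)
IDLE_MINUTES = 30    # shut down after this many consecutive idle minutes

def gpu_utilization():
    """Return current GPU utilization percentage via nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"]
    )
    return int(out.decode().strip().splitlines()[0])

idle_minutes = 0
while True:
    idle_minutes = idle_minutes + 1 if gpu_utilization() < IDLE_THRESHOLD else 0
    if idle_minutes >= IDLE_MINUTES:
        subprocess.run(["sudo", "shutdown", "-h", "now"])  # stop billing
        break
    time.sleep(60)
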
3. Maintenance Phase

Optimization and Updates
  • Apply system updates and patches
  • Optimize configurations for better performance
  • Clean up temporary files and logs
  • Backup important data and models
4. Termination Phase

Cleanup and Shutdown
  • Save work and export results
  • Backup data to persistent storage
  • Terminate instance to stop billing
  • Review usage reports and costs
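The backup and terminate steps can be scripted together so the backup always completes before billing stops. A hedged sketch using the aws CLI plus the instance API from the API Management section below; the instance ID and bucket are placeholders:
import subprocess
import requests

API_KEY = "YOUR_API_KEY"
INSTANCE_ID = "i-123"        # placeholder instance ID
BUCKET = "s3://your-bucket"  # placeholder bucket

# Back up results to persistent storage first
subprocess.run(["aws", "s3", "sync", "/data/results/", f"{BUCKET}/results/"], check=True)

# Then terminate the instance to stop billing
resp = requests.delete(
    f"https://api.comput3.ai/v1/instances/{INSTANCE_ID}",
    headers={"Authorization": f"Bearer {API_KEY}"},
)
resp.raise_for_status()
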

Monitoring and Metrics

Real-time Monitoring

Key Metrics to Monitor:
  • GPU Utilization: Percentage of GPU compute being used
  • Memory Usage: GPU memory consumption vs. total available
  • Temperature: GPU temperature for thermal throttling detection
  • Power Draw: Current power consumption vs. maximum TDP
Monitoring Commands:
# Real-time GPU monitoring
nvidia-smi -l 1

# Detailed GPU information
nvidia-smi -q

# GPU memory usage
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
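For programmatic access to the same counters, the nvidia-ml-py package (imported as pynvml) wraps NVML directly. A minimal sketch:
# pip install nvidia-ml-py
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

util = pynvml.nvmlDeviceGetUtilizationRates(handle)
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
power = pynvml.nvmlDeviceGetPowerUsage(handle)  # reported in milliwatts

print(f"util {util.gpu}%  mem {mem.used / 1e9:.1f}/{mem.total / 1e9:.1f} GB")
print(f"temp {temp} C  power {power / 1000:.0f} W")
pynvml.nvmlShutdown()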
CPU and Memory:
# CPU usage
htop

# Memory usage
free -h

# Disk usage
df -h

# Network activity
iftop
Automated Monitoring:
# Install monitoring tools
sudo apt update
sudo apt install htop iotop iftop

# System resource summary
cat /proc/cpuinfo | grep "model name" | head -1
cat /proc/meminfo | grep MemTotal
Training Metrics:
  • Loss curves and accuracy over time
  • Training speed (samples/second)
  • Memory allocation patterns
  • Gradient flow and model convergence
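Training speed is easy to instrument directly in the loop. A minimal sketch; step_fn and loader are placeholders for your training step and data loader:
import time

def log_throughput(loader, step_fn, log_every=50):
    """Run training steps, printing samples/second every log_every steps."""
    seen, t0 = 0, time.time()
    for i, batch in enumerate(loader, 1):
        step_fn(batch)      # one forward/backward/optimizer step
        seen += len(batch)  # assumes a leading batch dimension
        if i % log_every == 0:
            print(f"step {i}: {seen / (time.time() - t0):.1f} samples/sec")
            seen, t0 = 0, time.time()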
Inference Metrics:
  • Requests per second throughput
  • Average response latency
  • Queue depth and processing time
  • Error rates and success metrics
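Serving throughput and latency can be tracked with a small rolling window around your request handler. A sketch; the window size is arbitrary:
import time
from collections import deque

class InferenceStats:
    """Rolling requests/second and average latency over the last N requests."""
    def __init__(self, window=100):
        self.latencies = deque(maxlen=window)
        self.timestamps = deque(maxlen=window)

    def record(self, latency_s):
        self.latencies.append(latency_s)
        self.timestamps.append(time.time())

    def summary(self):
        if len(self.timestamps) < 2 or self.timestamps[-1] == self.timestamps[0]:
            return "not enough data"
        span = self.timestamps[-1] - self.timestamps[0]
        rps = (len(self.timestamps) - 1) / span
        avg_ms = 1000 * sum(self.latencies) / len(self.latencies)
        return f"{rps:.1f} req/s, avg latency {avg_ms:.1f} ms"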

Alerting and Notifications

Set up automated alerts for critical events:
Configure email notifications for:
  • Instance state changes (stopped, terminated)
  • High resource utilization (>90% for 10+ minutes)
  • Cost thresholds exceeded
  • System errors or failures
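If the built-in notifications don't cover a case, a small poller can email you directly. A rough sketch of the utilization alert above, assuming nvidia-smi on the PATH and a reachable SMTP relay; the host and addresses are placeholders:
import smtplib
import subprocess
import time
from email.message import EmailMessage

SMTP_HOST = "localhost"       # placeholder SMTP relay
ALERT_TO = "you@example.com"  # placeholder address
THRESHOLD, WINDOW_MIN = 90, 10

def gpu_util():
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"])
    return int(out.decode().strip().splitlines()[0])

hot_minutes = 0
while True:
    hot_minutes = hot_minutes + 1 if gpu_util() > THRESHOLD else 0
    if hot_minutes >= WINDOW_MIN:
        msg = EmailMessage()
        msg["Subject"] = f"GPU above {THRESHOLD}% for {WINDOW_MIN}+ minutes"
        msg["From"], msg["To"] = ALERT_TO, ALERT_TO
        msg.set_content("Check the instance dashboard.")
        with smtplib.SMTP(SMTP_HOST) as s:
            s.send_message(msg)
        hot_minutes = 0  # reset so alerts don't repeat every minute
    time.sleep(60)
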

Scaling and Auto-Management

Vertical Scaling

Resize your instance to different GPU types:
1. Stop Instance

Gracefully shut down your instance to prepare for resizing.
sudo shutdown -h now
2. Change Instance Type

Use the dashboard or API to select a new instance type:
curl -X PATCH "https://api.comput3.ai/v1/instances/i-123" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"instance_type": "h100-80gb"}'
3. Restart Instance

Start the instance with the new configuration.
Data on local storage is preserved during instance type changes.

Data Management

Persistent Storage

Attach additional storage:
# List available volumes
lsblk

# Format new volume
sudo mkfs -t ext4 /dev/xvdf

# Mount volume
sudo mkdir /data
sudo mount /dev/xvdf /data

# Auto-mount on boot
echo '/dev/xvdf /data ext4 defaults,nofail 0 2' | sudo tee -a /etc/fstab
Sync data with S3:
# Install AWS CLI
pip install awscli

# Configure credentials
aws configure

# Sync training data
aws s3 sync s3://your-bucket/data /data/training/

# Upload results
aws s3 sync /data/results/ s3://your-bucket/results/

# Automated sync script (writing to /usr/local/bin requires sudo)
sudo tee /usr/local/bin/s3-sync.sh > /dev/null << 'EOF'
#!/bin/bash
while true; do
    aws s3 sync /data/checkpoints/ s3://your-bucket/checkpoints/
    sleep 300  # Sync every 5 minutes
done
EOF
sudo chmod +x /usr/local/bin/s3-sync.sh
Automated backups:
# Create backup script
sudo tee /usr/local/bin/backup.sh > /dev/null << 'EOF'
#!/bin/bash
DATE=$(date +%Y%m%d_%H%M%S)
tar -czf /tmp/backup_$DATE.tar.gz /data/models/
aws s3 cp /tmp/backup_$DATE.tar.gz s3://your-bucket/backups/
rm /tmp/backup_$DATE.tar.gz
EOF
sudo chmod +x /usr/local/bin/backup.sh

# Schedule daily backups (preserves any existing crontab entries)
(crontab -l 2>/dev/null; echo "0 6 * * * /usr/local/bin/backup.sh") | crontab -

Cost Optimization

Cost Monitoring

Real-time Costs

View current hourly costs and projected monthly spending in the dashboard.

Usage Reports

Download detailed usage reports with breakdowns by instance type and time period.

Budget Alerts

Set up alerts when spending approaches your defined budget limits.

Cost Optimization Tips

Receive personalized recommendations for reducing costs based on usage patterns.

Optimization Strategies

Choose optimal instance types:
  • Monitor GPU utilization over time
  • Downgrade if consistently under 50% utilization
  • Upgrade if hitting memory or compute limits
  • Use spot instances for fault-tolerant workloads
Optimize runtime scheduling:
  • Use off-peak hours for training (typically 2-6 AM local time)
  • Batch multiple experiments together
  • Use preemptible instances for non-critical work
  • Schedule automatic start/stop for predictable workloads (see the sketch below)
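Scheduled start/stop can be driven by cron calling the instance API. A hedged sketch reusing the endpoints from the API Management section below; the script path and instance ID in the cron lines are placeholders:
import sys
import requests

API_KEY = "YOUR_API_KEY"
BASE = "https://api.comput3.ai/v1"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

def start(instance_id):
    requests.post(f"{BASE}/instances/{instance_id}/start", headers=HEADERS).raise_for_status()

def stop(instance_id):
    requests.post(f"{BASE}/instances/{instance_id}/stop", headers=HEADERS).raise_for_status()

# Example crontab entries (stop at 02:00, start at 08:00):
#   0 2 * * * python3 /usr/local/bin/schedule.py stop i-123
#   0 8 * * * python3 /usr/local/bin/schedule.py start i-123
if __name__ == "__main__":
    action, instance_id = sys.argv[1], sys.argv[2]
    {"start": start, "stop": stop}[action](instance_id)
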
Maximize utilization:
  • Share instances across team members
  • Use containerization for multi-tenant workloads
  • Implement job queuing systems (a minimal sketch follows this list)
  • Monitor and optimize GPU memory usage
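A job queue can be as small as one worker thread that runs submitted commands serially, so the GPU is never oversubscribed. A minimal single-process sketch; real setups would typically use Slurm or a Redis-backed queue:
import queue
import subprocess
import threading

jobs = queue.Queue()

def worker():
    """Run queued commands one at a time so the GPU is never oversubscribed."""
    while True:
        cmd = jobs.get()
        try:
            subprocess.run(cmd, check=False)
        finally:
            jobs.task_done()

threading.Thread(target=worker, daemon=True).start()

# Queue experiments; each runs only after the previous one finishes
jobs.put(["python3", "train.py", "--config", "exp1.yaml"])
jobs.put(["python3", "train.py", "--config", "exp2.yaml"])
jobs.join()
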

Troubleshooting

Common Issues

Instance Fails to Launch

Possible Causes:
  • Insufficient capacity in selected region
  • Invalid SSH key or security group configuration
  • Account billing issues
Solutions:
  • Try different availability zones
  • Verify SSH key format and permissions
  • Check account status and billing information
  • Contact support for capacity issues
Cannot Connect via SSH

Possible Causes:
  • Incorrect SSH key or permissions
  • Security group not allowing SSH (port 22)
  • Instance still initializing
Solutions:
# Check SSH key permissions
chmod 600 ~/.ssh/your-key.pem

# Test connection with verbose output
ssh -v -i ~/.ssh/your-key.pem ubuntu@<instance-ip>

# Verify security group allows SSH
# Port 22 should be open to your IP
GPU Not Detected

Possible Causes:
  • NVIDIA drivers not installed
  • CUDA version mismatch
  • Hardware initialization failed
Solutions:
# Check GPU status
nvidia-smi

# Reinstall NVIDIA drivers
sudo apt update
sudo apt install nvidia-driver-525

# Verify CUDA installation
nvcc --version

# Restart instance if needed
sudo reboot
Out of GPU Memory

Possible Causes:
  • Model too large for GPU memory
  • Memory leaks in training code
  • Inefficient data loading
Solutions:
# Monitor GPU memory
import torch
print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"Reserved: {torch.cuda.memory_reserved() / 1e9:.2f} GB")

# Clear GPU cache
torch.cuda.empty_cache()

# Use gradient checkpointing
model.gradient_checkpointing_enable()

# Reduce batch size
batch_size = batch_size // 2

Performance Optimization

# Optimize PyTorch settings
import torch

# Let cuDNN autotune kernels and allow TF32 matmuls on Ampere+ GPUs
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.allow_tf32 = True
torch.backends.cuda.matmul.allow_tf32 = True

# Use compiled models (PyTorch 2.0+)
model = torch.compile(model)

# Release cached GPU memory back to the driver
torch.cuda.empty_cache()
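For true mixed-precision training (as opposed to the TF32 toggles above), PyTorch's torch.cuda.amp provides autocast and loss scaling. A minimal sketch; model, optimizer, loader, and loss_fn are placeholders:
import torch

scaler = torch.cuda.amp.GradScaler()

for inputs, targets in loader:         # placeholder data loader
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():    # run the forward pass in reduced precision
        loss = loss_fn(model(inputs), targets)
    scaler.scale(loss).backward()      # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)             # unscales gradients, then steps
    scaler.update()
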

API Management

Programmatically manage instances using the Comput3 API:
import requests

class Comput3Manager:
    def __init__(self, api_key):
        self.api_key = api_key
        self.base_url = "https://api.comput3.ai/v1"
        self.headers = {"Authorization": f"Bearer {api_key}"}
    
    def list_instances(self):
        response = requests.get(f"{self.base_url}/instances", headers=self.headers)
        return response.json()
    
    def get_instance(self, instance_id):
        response = requests.get(f"{self.base_url}/instances/{instance_id}", headers=self.headers)
        return response.json()
    
    def start_instance(self, instance_id):
        response = requests.post(f"{self.base_url}/instances/{instance_id}/start", headers=self.headers)
        return response.json()
    
    def stop_instance(self, instance_id):
        response = requests.post(f"{self.base_url}/instances/{instance_id}/stop", headers=self.headers)
        return response.json()
    
    def terminate_instance(self, instance_id):
        response = requests.delete(f"{self.base_url}/instances/{instance_id}", headers=self.headers)
        return response.json()

# Usage example
manager = Comput3Manager("YOUR_API_KEY")
instances = manager.list_instances()
for instance in instances["instances"]:
    print(f"Instance {instance['id']}: {instance['state']}")
