Comprehensive guide to managing your GPU instances throughout their lifecycle, from deployment to termination.

Instance Dashboard

Access your instance management dashboard at app.comput3.ai/instances.
[Screenshot: instance management dashboard showing running instances, metrics, and controls]

Dashboard Features

Live Status

Real-time status of all your instances with health indicators and uptime tracking.

Resource Metrics

GPU utilization, memory usage, CPU load, and network activity monitoring.

Cost Tracking

Real-time cost accumulation and projected monthly spending based on usage.

Quick Actions

Start, stop, restart, and terminate instances with single-click actions.

Instance Lifecycle Management

1. Launch Phase

Initial Setup (0-60 seconds)
  • Instance provisioning and hardware allocation
  • Operating system and driver installation
  • Environment configuration and startup scripts
  • Network and security group setup
Instance shows “Running” status when ready for connections.
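If you script your launches, you can poll for that status instead of watching the dashboard. A minimal sketch using the instance API shown in the API Management section below; the endpoint shape and the exact "Running" state string are assumptions:
import time
import requests

API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.comput3.ai/v1"

def wait_until_running(instance_id, timeout=300, poll_interval=5):
    """Poll instance state until it reports 'Running' or the timeout expires."""
    headers = {"Authorization": f"Bearer {API_KEY}"}
    deadline = time.time() + timeout
    while time.time() < deadline:
        resp = requests.get(f"{BASE_URL}/instances/{instance_id}", headers=headers)
        resp.raise_for_status()
        if resp.json().get("state") == "Running":  # assumed state string
            return True
        time.sleep(poll_interval)
    return False
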
2. Active Phase

Normal Operation
  • Monitor resource utilization and performance
  • Scale resources up or down as needed
  • Manage data and model storage
  • Configure auto-shutdown and scheduling (see the idle-watcher sketch below)
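One way to implement auto-shutdown is a small watcher that polls GPU utilization and powers the instance down after a sustained idle period. A minimal sketch, assuming nvidia-smi is on the PATH and the script can invoke shutdown via sudo; the thresholds are illustrative:
import subprocess
import time

IDLE_THRESHOLD = 5   # percent GPU utilization treated as idle (assumption)
IDLE_MINUTES = 30    # shut down after this many consecutive idle minutes

def gpu_utilization():
    """Return current GPU utilization percentage via nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"]
    )
    return int(out.decode().strip().splitlines()[0])

idle_minutes = 0
while True:
    idle_minutes = idle_minutes + 1 if gpu_utilization() < IDLE_THRESHOLD else 0
    if idle_minutes >= IDLE_MINUTES:
        subprocess.run(["sudo", "shutdown", "-h", "now"])  # stop billing
        break
    time.sleep(60)
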
3. Maintenance Phase

Optimization and Updates
  • Apply system updates and patches
  • Optimize configurations for better performance
  • Clean up temporary files and logs
  • Backup important data and models
4. Termination Phase

Cleanup and Shutdown
  • Save work and export results
  • Backup data to persistent storage
  • Terminate instance to stop billing
  • Review usage reports and costs
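The backup and terminate steps can be scripted together so the backup always completes before billing stops. A hedged sketch using the aws CLI plus the instance API from the API Management section below; the instance ID and bucket are placeholders:
import subprocess
import requests

API_KEY = "YOUR_API_KEY"
INSTANCE_ID = "i-123"        # placeholder instance ID
BUCKET = "s3://your-bucket"  # placeholder bucket

# Back up results to persistent storage first
subprocess.run(["aws", "s3", "sync", "/data/results/", f"{BUCKET}/results/"], check=True)

# Then terminate the instance to stop billing
resp = requests.delete(
    f"https://api.comput3.ai/v1/instances/{INSTANCE_ID}",
    headers={"Authorization": f"Bearer {API_KEY}"},
)
resp.raise_for_status()
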

Monitoring and Metrics

Real-time Monitoring

Key Metrics to Monitor:
  • GPU Utilization: Percentage of GPU compute being used
  • Memory Usage: GPU memory consumption vs. total available
  • Temperature: GPU temperature for thermal throttling detection
  • Power Draw: Current power consumption vs. maximum TDP
Monitoring Commands:
# Real-time GPU monitoring
nvidia-smi -l 1

# Detailed GPU information
nvidia-smi -q

# GPU memory usage
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
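For programmatic access to the same counters, the nvidia-ml-py package (imported as pynvml) wraps NVML directly. A minimal sketch:
# pip install nvidia-ml-py
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

util = pynvml.nvmlDeviceGetUtilizationRates(handle)
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
power = pynvml.nvmlDeviceGetPowerUsage(handle)  # reported in milliwatts

print(f"util {util.gpu}%  mem {mem.used / 1e9:.1f}/{mem.total / 1e9:.1f} GB")
print(f"temp {temp} C  power {power / 1000:.0f} W")
pynvml.nvmlShutdown()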
CPU and Memory:
# CPU usage
htop

# Memory usage
free -h

# Disk usage
df -h

# Network activity
iftop
Automated Monitoring:
# Install monitoring tools
sudo apt update
sudo apt install htop iotop iftop

# System resource summary
cat /proc/cpuinfo | grep "model name" | head -1
cat /proc/meminfo | grep MemTotal
Training Metrics:
  • Loss curves and accuracy over time
  • Training speed (samples/second)
  • Memory allocation patterns
  • Gradient flow and model convergence
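Training speed is easy to instrument directly in the loop. A minimal sketch; step_fn and loader are placeholders for your training step and data loader:
import time

def log_throughput(loader, step_fn, log_every=50):
    """Run training steps, printing samples/second every log_every steps."""
    seen, t0 = 0, time.time()
    for i, batch in enumerate(loader, 1):
        step_fn(batch)      # one forward/backward/optimizer step
        seen += len(batch)  # assumes a leading batch dimension
        if i % log_every == 0:
            print(f"step {i}: {seen / (time.time() - t0):.1f} samples/sec")
            seen, t0 = 0, time.time()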
Inference Metrics:
  • Requests per second throughput
  • Average response latency
  • Queue depth and processing time
  • Error rates and success metrics
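Serving throughput and latency can be tracked with a small rolling window around your request handler. A sketch; the window size is arbitrary:
import time
from collections import deque

class InferenceStats:
    """Rolling requests/second and average latency over the last N requests."""
    def __init__(self, window=100):
        self.latencies = deque(maxlen=window)
        self.timestamps = deque(maxlen=window)

    def record(self, latency_s):
        self.latencies.append(latency_s)
        self.timestamps.append(time.time())

    def summary(self):
        if len(self.timestamps) < 2 or self.timestamps[-1] == self.timestamps[0]:
            return "not enough data"
        span = self.timestamps[-1] - self.timestamps[0]
        rps = (len(self.timestamps) - 1) / span
        avg_ms = 1000 * sum(self.latencies) / len(self.latencies)
        return f"{rps:.1f} req/s, avg latency {avg_ms:.1f} ms"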

Alerting and Notifications

Set up automated alerts for critical events:
Configure email notifications for:
  • Instance state changes (stopped, terminated)
  • High resource utilization (>90% for 10+ minutes)
  • Cost thresholds exceeded
  • System errors or failures
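If the built-in notifications don't cover a case, a small poller can email you directly. A rough sketch of the utilization alert above, assuming nvidia-smi on the PATH and a reachable SMTP relay; the host and addresses are placeholders:
import smtplib
import subprocess
import time
from email.message import EmailMessage

SMTP_HOST = "localhost"       # placeholder SMTP relay
ALERT_TO = "you@example.com"  # placeholder address
THRESHOLD, WINDOW_MIN = 90, 10

def gpu_util():
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"])
    return int(out.decode().strip().splitlines()[0])

hot_minutes = 0
while True:
    hot_minutes = hot_minutes + 1 if gpu_util() > THRESHOLD else 0
    if hot_minutes >= WINDOW_MIN:
        msg = EmailMessage()
        msg["Subject"] = f"GPU above {THRESHOLD}% for {WINDOW_MIN}+ minutes"
        msg["From"], msg["To"] = ALERT_TO, ALERT_TO
        msg.set_content("Check the instance dashboard.")
        with smtplib.SMTP(SMTP_HOST) as s:
            s.send_message(msg)
        hot_minutes = 0  # reset so alerts don't repeat every minute
    time.sleep(60)
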

Scaling and Auto-Management

Vertical Scaling

Resize your instance to different GPU types:
1. Stop Instance

Gracefully shut down your instance to prepare for resizing.
sudo shutdown -h now
2. Change Instance Type

Use the dashboard or API to select a new instance type:
curl -X PATCH "https://api.comput3.ai/v1/instances/i-123" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"instance_type": "h100-80gb"}'
3. Restart Instance

Start the instance with the new configuration.
Data on local storage is preserved during instance type changes.

Data Management

Persistent Storage

Attach additional storage:
# List available volumes
lsblk

# Format new volume
sudo mkfs -t ext4 /dev/xvdf

# Mount volume
sudo mkdir /data
sudo mount /dev/xvdf /data

# Auto-mount on boot
echo '/dev/xvdf /data ext4 defaults,nofail 0 2' | sudo tee -a /etc/fstab
Sync data with S3:
# Install AWS CLI
pip install awscli

# Configure credentials
aws configure

# Sync training data
aws s3 sync s3://your-bucket/data /data/training/

# Upload results
aws s3 sync /data/results/ s3://your-bucket/results/

# Automated sync script (writing to /usr/local/bin requires sudo)
sudo tee /usr/local/bin/s3-sync.sh > /dev/null << 'EOF'
#!/bin/bash
while true; do
    aws s3 sync /data/checkpoints/ s3://your-bucket/checkpoints/
    sleep 300  # Sync every 5 minutes
done
EOF
sudo chmod +x /usr/local/bin/s3-sync.sh
Automated backups:
# Create backup script
sudo tee /usr/local/bin/backup.sh > /dev/null << 'EOF'
#!/bin/bash
DATE=$(date +%Y%m%d_%H%M%S)
tar -czf /tmp/backup_$DATE.tar.gz /data/models/
aws s3 cp /tmp/backup_$DATE.tar.gz s3://your-bucket/backups/
rm /tmp/backup_$DATE.tar.gz
EOF
sudo chmod +x /usr/local/bin/backup.sh

# Schedule daily backups (preserves any existing crontab entries)
(crontab -l 2>/dev/null; echo "0 6 * * * /usr/local/bin/backup.sh") | crontab -

Cost Optimization

Cost Monitoring

Real-time Costs

View current hourly costs and projected monthly spending in the dashboard.

Usage Reports

Download detailed usage reports with breakdowns by instance type and time period.

Budget Alerts

Set up alerts when spending approaches your defined budget limits.

Cost Optimization Tips

Receive personalized recommendations for reducing costs based on usage patterns.

Optimization Strategies

Choose optimal instance types:
  • Monitor GPU utilization over time
  • Downgrade if consistently under 50% utilization
  • Upgrade if hitting memory or compute limits
  • Use spot instances for fault-tolerant workloads
Optimize runtime scheduling:
  • Use off-peak hours for training (typically 2-6 AM local time)
  • Batch multiple experiments together
  • Use preemptible instances for non-critical work
  • Schedule automatic start/stop for predictable workloads (see the sketch below)
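Scheduled start/stop can be driven by cron calling the instance API. A hedged sketch reusing the endpoints from the API Management section below; the script path and instance ID in the cron lines are placeholders:
import sys
import requests

API_KEY = "YOUR_API_KEY"
BASE = "https://api.comput3.ai/v1"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

def start(instance_id):
    requests.post(f"{BASE}/instances/{instance_id}/start", headers=HEADERS).raise_for_status()

def stop(instance_id):
    requests.post(f"{BASE}/instances/{instance_id}/stop", headers=HEADERS).raise_for_status()

# Example crontab entries (stop at 02:00, start at 08:00):
#   0 2 * * * python3 /usr/local/bin/schedule.py stop i-123
#   0 8 * * * python3 /usr/local/bin/schedule.py start i-123
if __name__ == "__main__":
    action, instance_id = sys.argv[1], sys.argv[2]
    {"start": start, "stop": stop}[action](instance_id)
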
Maximize utilization:
  • Share instances across team members
  • Use containerization for multi-tenant workloads
  • Implement job queuing systems (a minimal sketch follows this list)
  • Monitor and optimize GPU memory usage
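A job queue can be as small as one worker thread that runs submitted commands serially, so the GPU is never oversubscribed. A minimal single-process sketch; real setups would typically use Slurm or a Redis-backed queue:
import queue
import subprocess
import threading

jobs = queue.Queue()

def worker():
    """Run queued commands one at a time so the GPU is never oversubscribed."""
    while True:
        cmd = jobs.get()
        try:
            subprocess.run(cmd, check=False)
        finally:
            jobs.task_done()

threading.Thread(target=worker, daemon=True).start()

# Queue experiments; each runs only after the previous one finishes
jobs.put(["python3", "train.py", "--config", "exp1.yaml"])
jobs.put(["python3", "train.py", "--config", "exp2.yaml"])
jobs.join()
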

Troubleshooting

Common Issues

Instance Fails to Launch

Possible Causes:
  • Insufficient capacity in selected region
  • Invalid SSH key or security group configuration
  • Account billing issues
Solutions:
  • Try different availability zones
  • Verify SSH key format and permissions
  • Check account status and billing information
  • Contact support for capacity issues
Cannot Connect via SSH

Possible Causes:
  • Incorrect SSH key or permissions
  • Security group not allowing SSH (port 22)
  • Instance still initializing
Solutions:
# Check SSH key permissions
chmod 600 ~/.ssh/your-key.pem

# Test connection with verbose output
ssh -v -i ~/.ssh/your-key.pem ubuntu@<instance-ip>

# Verify security group allows SSH
# Port 22 should be open to your IP
GPU Not Detected

Possible Causes:
  • NVIDIA drivers not installed
  • CUDA version mismatch
  • Hardware initialization failed
Solutions:
# Check GPU status
nvidia-smi

# Reinstall NVIDIA drivers
sudo apt update
sudo apt install nvidia-driver-525

# Verify CUDA installation
nvcc --version

# Restart instance if needed
sudo reboot
Out of GPU Memory

Possible Causes:
  • Model too large for GPU memory
  • Memory leaks in training code
  • Inefficient data loading
Solutions:
# Monitor GPU memory
import torch
print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"Reserved: {torch.cuda.memory_reserved() / 1e9:.2f} GB")

# Clear GPU cache
torch.cuda.empty_cache()

# Use gradient checkpointing
model.gradient_checkpointing_enable()

# Reduce batch size
batch_size = batch_size // 2

Performance Optimization

# Optimize PyTorch settings
import torch

# Let cuDNN autotune kernels and allow TF32 matmuls on Ampere+ GPUs
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.allow_tf32 = True
torch.backends.cuda.matmul.allow_tf32 = True

# Use compiled models (PyTorch 2.0+)
model = torch.compile(model)

# Release cached GPU memory back to the driver
torch.cuda.empty_cache()
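For true mixed-precision training (as opposed to the TF32 toggles above), PyTorch's torch.cuda.amp provides autocast and loss scaling. A minimal sketch; model, optimizer, loader, and loss_fn are placeholders:
import torch

scaler = torch.cuda.amp.GradScaler()

for inputs, targets in loader:         # placeholder data loader
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():    # run the forward pass in reduced precision
        loss = loss_fn(model(inputs), targets)
    scaler.scale(loss).backward()      # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)             # unscales gradients, then steps
    scaler.update()
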

API Management

Programmatically manage instances using the Comput3 API:
import requests

class Comput3Manager:
    def __init__(self, api_key):
        self.api_key = api_key
        self.base_url = "https://api.comput3.ai/v1"
        self.headers = {"Authorization": f"Bearer {api_key}"}
    
    def list_instances(self):
        response = requests.get(f"{self.base_url}/instances", headers=self.headers)
        return response.json()
    
    def get_instance(self, instance_id):
        response = requests.get(f"{self.base_url}/instances/{instance_id}", headers=self.headers)
        return response.json()
    
    def start_instance(self, instance_id):
        response = requests.post(f"{self.base_url}/instances/{instance_id}/start", headers=self.headers)
        return response.json()
    
    def stop_instance(self, instance_id):
        response = requests.post(f"{self.base_url}/instances/{instance_id}/stop", headers=self.headers)
        return response.json()
    
    def terminate_instance(self, instance_id):
        response = requests.delete(f"{self.base_url}/instances/{instance_id}", headers=self.headers)
        return response.json()

# Usage example
manager = Comput3Manager("YOUR_API_KEY")
instances = manager.list_instances()
for instance in instances["instances"]:
    print(f"Instance {instance['id']}: {instance['state']}")
