Instance Dashboard
Access your instance management dashboard at app.comput3.ai/instances.
Dashboard Features
Live Status
Real-time status of all your instances with health indicators and uptime tracking.
Resource Metrics
GPU utilization, memory usage, CPU load, and network activity monitoring.
Cost Tracking
Real-time cost accumulation and projected monthly spending based on usage.
Quick Actions
Start, stop, restart, and terminate instances with single-click actions.
Instance Lifecycle Management
1. Launch Phase
Initial Setup (0-60 seconds)
- Instance provisioning and hardware allocation
- Operating system and driver installation
- Environment configuration and startup scripts
- Network and security group setup
Instance shows “Running” status when ready for connections.
2. Active Phase
Normal Operation
- Monitor resource utilization and performance
- Scale resources up or down as needed
- Manage data and model storage
- Configure auto-shutdown and scheduling
3. Maintenance Phase
Optimization and Updates
- Apply system updates and patches
- Optimize configurations for better performance
- Clean up temporary files and logs
- Backup important data and models
4. Termination Phase
Cleanup and Shutdown
- Save work and export results
- Backup data to persistent storage
- Terminate instance to stop billing
- Review usage reports and costs
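As a sketch of the launch phase described above, a script can poll until the instance reports "Running" before it tries to connect. This assumes a caller-supplied `get_status` function, which is hypothetical; the platform's actual status API is not shown in this document, and the 120-second default timeout is only a margin over the stated 0-60 second setup time.

```python
import time
from typing import Callable


def wait_until_running(
    get_status: Callable[[], str],
    timeout_s: float = 120.0,
    poll_interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll get_status() until it returns "Running" or the timeout elapses.

    get_status is a hypothetical callable supplied by the caller, for
    example a wrapper around whatever status lookup your tooling exposes.
    Returns True if "Running" was observed, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if get_status() == "Running":
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_interval_s)
```

Passing `sleep` as a parameter keeps the function easy to exercise without real delays.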
Monitoring and Metrics
Real-time Monitoring
GPU Metrics
Key Metrics to Monitor:
- GPU Utilization: Percentage of GPU compute being used
- Memory Usage: GPU memory consumption vs. total available
- Temperature: GPU temperature for thermal throttling detection
- Power Draw: Current power consumption vs. maximum TDP
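The four metrics above can be read on the instance itself with `nvidia-smi`. The following is a minimal sketch, assuming NVIDIA drivers are installed and `nvidia-smi` is on the PATH; the function names are illustrative and not part of any platform API.

```python
import subprocess

# Fields requested from nvidia-smi, in the order they appear in each CSV row.
QUERY_FIELDS = [
    "utilization.gpu",   # percent
    "memory.used",       # MiB
    "memory.total",      # MiB
    "temperature.gpu",   # degrees Celsius
    "power.draw",        # watts
    "power.limit",       # watts
]


def _to_float(value: str) -> float:
    """Convert one CSV cell; unsupported fields such as "[N/A]" become NaN."""
    try:
        return float(value.strip())
    except ValueError:
        return float("nan")


def parse_gpu_csv(csv_text: str) -> list[dict[str, float]]:
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        values = [_to_float(v) for v in line.split(",")]
        rows.append(dict(zip(QUERY_FIELDS, values)))
    return rows


def read_gpu_metrics() -> list[dict[str, float]]:
    """Run nvidia-smi locally and return the parsed metrics."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out)
```

Comparing `memory.used` with `memory.total` and `power.draw` with `power.limit` gives the ratios listed above.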
System Metrics
CPU and Memory:
Automated Monitoring:
Application Metrics
Training Metrics:
- Loss curves and accuracy over time
- Training speed (samples/second)
- Memory allocation patterns
- Gradient flow and model convergence
Inference Metrics:
- Requests per second throughput
- Average response latency
- Queue depth and processing time
- Error rates and success metrics
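Several of the metrics above reduce to simple arithmetic over counters your application already has. A minimal sketch, assuming you record sample counts, elapsed time, and per-request latencies yourself; the function names are illustrative.

```python
import statistics


def throughput(items_processed: int, elapsed_s: float) -> float:
    """Samples or requests per second over a measurement window."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return items_processed / elapsed_s


def latency_summary(latencies_ms: list[float]) -> dict[str, float]:
    """Average and 95th-percentile latency; needs at least two samples."""
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    p95 = statistics.quantiles(latencies_ms, n=20)[-1]
    return {"avg_ms": statistics.fmean(latencies_ms), "p95_ms": p95}


def error_rate(errors: int, total: int) -> float:
    """Fraction of requests that failed; 0.0 when there were no requests."""
    return errors / total if total else 0.0
```

Reporting a high percentile alongside the average helps because a few slow requests can hide behind a healthy mean.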
Alerting and Notifications
Set up automated alerts for critical events:
- Email Alerts
- Slack Integration
- Custom Webhooks
Configure email notifications for:
- Instance state changes (stopped, terminated)
- High resource utilization (>90% for 10+ minutes)
- Cost thresholds exceeded
- System errors or failures
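The "above 90% for 10+ minutes" rule is a sustained-threshold check rather than a single-sample check. A minimal sketch of that logic, assuming utilization samples are collected with timestamps; the `Sample` type and function name are illustrative and not part of the platform.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    timestamp_s: float      # seconds since any fixed reference point
    utilization_pct: float  # 0-100


def sustained_high_utilization(
    samples: list[Sample],
    threshold_pct: float = 90.0,
    min_duration_s: float = 600.0,
) -> bool:
    """True if utilization stayed above the threshold for min_duration_s."""
    run_start = None  # timestamp where the current above-threshold run began
    for s in sorted(samples, key=lambda s: s.timestamp_s):
        if s.utilization_pct > threshold_pct:
            if run_start is None:
                run_start = s.timestamp_s
            if s.timestamp_s - run_start >= min_duration_s:
                return True
        else:
            run_start = None  # any dip resets the run
    return False
```

Resetting on every dip avoids alerting on short spikes, at the cost of missing load that hovers around the threshold.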
Scaling and Auto-Management
Vertical Scaling
Resize your instance to a different GPU type:
1. Stop Instance
Gracefully shut down your instance to prepare for resizing.
2. Change Instance Type
Use the dashboard or API to select a new instance type.
3. Restart Instance
Start the instance with the new configuration.
Data on local storage is preserved during instance type changes.
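The three resize steps can be scripted in the same order. The sketch below is written against a hypothetical `InstanceClient` interface: the method names and the "Stopped" status string are assumptions made for illustration, not the platform's real API, so adapt them to whatever client you actually use.

```python
from typing import Protocol


class InstanceClient(Protocol):
    """Hypothetical interface; NOT the platform's documented API."""

    def stop(self, instance_id: str) -> None: ...
    def set_type(self, instance_id: str, instance_type: str) -> None: ...
    def start(self, instance_id: str) -> None: ...
    def status(self, instance_id: str) -> str: ...


def resize_instance(client: InstanceClient, instance_id: str, new_type: str) -> None:
    """Stop the instance, change its type, then start it again."""
    client.stop(instance_id)
    # Only change the type once the instance reports the (assumed) "Stopped" state.
    if client.status(instance_id) != "Stopped":
        raise RuntimeError("instance must be stopped before resizing")
    client.set_type(instance_id, new_type)
    client.start(instance_id)
```

Refusing to continue when the instance is not stopped keeps a failed shutdown from turning into a resize of a running machine.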
Data Management
Persistent Storage
EBS Volumes
Attach additional storage:
S3 Integration
Sync data with S3:
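A minimal sketch of one way to sync a directory to S3, assuming the AWS CLI is installed and configured with credentials on the instance. The helper names are illustrative, and any bucket name or path you pass in is your own.

```python
import subprocess


def build_s3_sync_command(local_dir: str, s3_uri: str, dry_run: bool = False) -> list[str]:
    """Build an `aws s3 sync` command that uploads local_dir to s3_uri."""
    if not s3_uri.startswith("s3://"):
        raise ValueError("s3_uri must start with s3://")
    cmd = ["aws", "s3", "sync", local_dir, s3_uri]
    if dry_run:
        cmd.append("--dryrun")  # show what would be copied without copying
    return cmd


def sync_to_s3(local_dir: str, s3_uri: str, dry_run: bool = False) -> None:
    """Run the sync and raise CalledProcessError if the CLI exits non-zero."""
    subprocess.run(build_s3_sync_command(local_dir, s3_uri, dry_run), check=True)
```

Running once with `dry_run=True` is a cheap way to confirm the source and destination before transferring large model files.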
Backup Strategies
Automated backups:
Cost Optimization
Cost Monitoring
Real-time Costs
View current hourly costs and projected monthly spending in the dashboard.
Usage Reports
Download detailed usage reports with breakdowns by instance type and time period.
Budget Alerts
Set up alerts when spending approaches your defined budget limits.
Cost Optimization Tips
Receive personalized recommendations for reducing costs based on usage patterns.
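The projection and budget alert described above come down to simple arithmetic. A minimal sketch, assuming a flat hourly rate, a 30-day month, and a warning at 80% of budget; all three are illustrative assumptions, not values taken from the dashboard.

```python
def projected_monthly_cost(
    hourly_rate_usd: float,
    hours_per_day: float = 24.0,
    days: int = 30,
) -> float:
    """Project monthly spend from an hourly rate and expected daily runtime."""
    return hourly_rate_usd * hours_per_day * days


def budget_status(spent_usd: float, budget_usd: float, warn_fraction: float = 0.8) -> str:
    """Classify spend as "ok", "warning" (near budget), or "exceeded"."""
    if spent_usd >= budget_usd:
        return "exceeded"
    if spent_usd >= warn_fraction * budget_usd:
        return "warning"
    return "ok"
```

For example, an instance billed at a hypothetical 2.00 USD per hour projects to 1,440 USD per month if left running continuously, but 480 USD if it runs 8 hours a day.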
Optimization Strategies
Right-sizing
Choose optimal instance types:
- Monitor GPU utilization over time
- Downgrade if consistently under 50% utilization
- Upgrade if hitting memory or compute limits
- Use spot instances for fault-tolerant workloads
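The downgrade and upgrade rules above can be turned into a small heuristic over collected utilization samples. In this sketch, "consistently" is taken to mean 90% of samples, and "hitting limits" to mean peak memory at 95% of capacity or utilization at 95% or more; those cutoffs are assumptions chosen for illustration.

```python
LOW_UTIL_PCT = 50.0        # from the guidance above
SATURATED_PCT = 95.0       # assumed cutoff for "hitting compute limits"
MEM_LIMIT_FRACTION = 0.95  # assumed cutoff for "hitting memory limits"
CONSISTENT_SHARE = 0.9     # assumed meaning of "consistently"


def rightsizing_recommendation(gpu_util_pct: list[float], peak_mem_fraction: float) -> str:
    """Return "upgrade", "downgrade", or "keep" from utilization history."""
    if not gpu_util_pct:
        raise ValueError("need at least one utilization sample")
    n = len(gpu_util_pct)
    share_low = sum(u < LOW_UTIL_PCT for u in gpu_util_pct) / n
    share_saturated = sum(u >= SATURATED_PCT for u in gpu_util_pct) / n
    # Check limits first so a memory-bound job is never told to downgrade.
    if peak_mem_fraction >= MEM_LIMIT_FRACTION or share_saturated >= CONSISTENT_SHARE:
        return "upgrade"
    if share_low >= CONSISTENT_SHARE:
        return "downgrade"
    return "keep"
```

The memory check runs before the utilization check because a workload can show low compute utilization while still needing all available GPU memory.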
Scheduling
Optimize runtime scheduling:
- Use off-peak hours for training (typically 2-6 AM local time)
- Batch multiple experiments together
- Use preemptible instances for non-critical work
- Schedule automatic start/stop for predictable workloads
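For scripted start/stop, the off-peak window above can be expressed as a simple time check. A minimal sketch using naive local datetimes and the 2-6 AM window from the list; the function names are illustrative.

```python
from datetime import datetime, time, timedelta

OFF_PEAK_START = time(2, 0)  # 02:00 local time
OFF_PEAK_END = time(6, 0)    # 06:00 local time, exclusive


def is_off_peak(now: datetime) -> bool:
    """True if `now` falls inside the off-peak window."""
    return OFF_PEAK_START <= now.time() < OFF_PEAK_END


def next_off_peak_start(now: datetime) -> datetime:
    """Return the first 02:00 that is strictly after `now`."""
    candidate = now.replace(hour=2, minute=0, second=0, microsecond=0)
    if now >= candidate:
        candidate += timedelta(days=1)
    return candidate
```

A job queue could call `is_off_peak` before launching training runs and otherwise sleep until `next_off_peak_start`.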
Resource Sharing
Maximize utilization:
- Share instances across team members
- Use containerization for multi-tenant workloads
- Implement job queuing systems
- Monitor and optimize GPU memory usage
Troubleshooting
Common Issues
Instance Won't Start
Possible Causes:
- Insufficient capacity in selected region
- Invalid SSH key or security group configuration
- Account billing issues
Solutions:
- Try different availability zones
- Verify SSH key format and permissions
- Check account status and billing information
- Contact support for capacity issues
SSH Connection Failed
Possible Causes:
- Incorrect SSH key or permissions
- Security group not allowing SSH (port 22)
- Instance still initializing
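Two of the causes above can be checked from your own machine before opening a support ticket. A minimal sketch for POSIX systems, using only the Python standard library; the function names are illustrative.

```python
import os
import socket
import stat


def key_permissions_ok(key_path: str) -> bool:
    """True if the private key is not accessible by group or others (e.g. mode 600)."""
    mode = stat.S_IMODE(os.stat(key_path).st_mode)
    return mode & 0o077 == 0


def port_open(host: str, port: int = 22, timeout_s: float = 3.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

If `port_open` returns False shortly after launch, the instance may still be initializing; if it stays False, check that the security group allows port 22.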
GPU Not Detected
Possible Causes:
- NVIDIA drivers not installed
- CUDA version mismatch
- Hardware initialization failed
Out of Memory Errors
Possible Causes:
- Model too large for GPU memory
- Memory leaks in training code
- Inefficient data loading
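To judge whether a model is simply too large for the GPU, a rough lower bound on training memory is often enough. The sketch below counts weights, gradients, and optimizer states only, assuming 4-byte (fp32) parameters and an Adam-style optimizer with two states per parameter. It ignores activations and framework overhead, so real usage will be higher; the 20% headroom is likewise an assumption.

```python
def estimate_training_memory_gib(
    num_params: int,
    bytes_per_param: int = 4,   # fp32
    optimizer_states: int = 2,  # e.g. Adam keeps two values per parameter
) -> float:
    """Lower bound in GiB: weights + gradients + optimizer states."""
    copies = 1 + 1 + optimizer_states  # weights, gradients, optimizer states
    return num_params * bytes_per_param * copies / 2**30


def fits_on_gpu(num_params: int, gpu_memory_gib: float, headroom_fraction: float = 0.2) -> bool:
    """True if the estimate fits while leaving headroom for activations."""
    usable = gpu_memory_gib * (1 - headroom_fraction)
    return estimate_training_memory_gib(num_params) <= usable
```

If the estimate already exceeds GPU memory, reducing batch size will not help; a larger GPU, lower precision, or a smaller model is needed.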
Performance Optimization
- GPU Optimization
- Data Loading
- System Tuning