Troubleshooting¶
Common Issues¶
"RAMJET API key not configured"¶
Problem: RAMJET can't find your API key.
Solution:
# Set environment variable
export RAMJET_API_KEY="your_key_here"
# Or in Python
import ramjetio
ramjetio.init(api_key="your_key_here")
Verify:
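A quick way to confirm the key is visible to your process (a minimal sketch; the helper name is ours, not part of ramjetio):

```python
import os

def api_key_present(env=None):
    """Return True if RAMJET_API_KEY is set and non-empty."""
    env = os.environ if env is None else env
    return bool(env.get("RAMJET_API_KEY"))

print("API key present:", api_key_present())
```

If this prints `False` inside your training job but `True` in your shell, the variable isn't being exported to the job's environment (common with `sudo`, cron, or container launchers).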
"Invalid API key format"¶
Problem: API key format is incorrect.
Solution: API keys start with ramjet_, followed by the cluster name and the secret.
Get a valid key from the dashboard at app.ramjet.io.
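A rough format check can catch obvious mistakes before a round trip to the backend. This is a sketch based only on the description above (prefix, cluster name, secret); the separator and character set in the pattern are assumptions, not a spec:

```python
import re

# Assumed shape "ramjet_<cluster>_<secret>"; the underscore separator and
# the allowed characters are guesses, not documented by RAMJET.
_KEY_RE = re.compile(r"^ramjet_[A-Za-z0-9-]+_[A-Za-z0-9]+$")

def looks_like_ramjet_key(key: str) -> bool:
    """Cheap sanity check; the backend remains the source of truth."""
    return _KEY_RE.match(key) is not None
```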
"Backend unavailable"¶
Problem: Can't connect to RAMJET dashboard backend.
Possible causes:
1. No internet connection
2. Firewall blocking outbound connections
3. Backend is down (rare)
Solution:
# Check connectivity
curl -v https://api.ramjet.io/health
# Check if port 443 is open outbound
nc -zv api.ramjet.io 443
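The same check can be run from Python, which helps rule out shell-only proxy settings. A sketch, reusing the /health endpoint from the curl command above:

```python
import urllib.request

def backend_reachable(url: str = "https://api.ramjet.io/health",
                      timeout: float = 5.0) -> bool:
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, timeouts, connection resets
        return False
```

If curl succeeds but this returns False, check `HTTPS_PROXY`/`NO_PROXY` in the Python process's environment.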
"Node not appearing in dashboard"¶
Problem: Started training but node doesn't show in dashboard.
Checklist:
1. ✅ RAMJET_API_KEY is set correctly
2. ✅ ramjetio.init() is called
3. ✅ Node has internet access
4. ✅ No firewall blocking outbound port 443
Debug:
import ramjetio
import logging
logging.basicConfig(level=logging.DEBUG)
ramjetio.init()
# Check logs for connection messages
"Cache miss even after first epoch"¶
Problem: Data isn't being cached.
Possible causes:
1. cache_on_miss=False
2. Cache full (check RAMJET_CACHE_SIZE)
3. Different cache keys between epochs
Solution:
# Verify cache is working
dataset = ramjetio.CachedDataset(YourDataset())
# Access same item twice
_ = dataset[0]
_ = dataset[0]
stats = dataset.get_cache_stats()
print(stats)
# Should show: {'cache_hits': 1, 'cache_misses': 1, ...}
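Cause 3 (different cache keys between epochs) is worth a closer look, since it silently turns every lookup into a miss. An illustrative sketch, not RAMJET's key scheme: the key must be a pure function of the item's identity, never of per-epoch state.

```python
# A cache key must be identical every time the same item is requested.
def stable_key(dataset_name: str, index: int) -> str:
    return f"{dataset_name}:{index}"          # same every epoch -> cache hits

# If anything nondeterministic leaks into the key (epoch counter, random
# augmentation seed, object id()), each epoch sees fresh keys -> all misses.
def unstable_key(dataset_name: str, index: int, epoch: int) -> str:
    return f"{dataset_name}:{index}:{epoch}"  # changes every epoch -> misses
```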
"Out of disk space"¶
Problem: Cache filled up the disk.
Solution:
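As a sketch, using the environment variables that appear elsewhere in this guide (the "10GB" value format and the fallback path are assumptions):

```shell
# Cap the cache so it cannot fill the disk (size string format is an assumption)
export RAMJET_CACHE_SIZE="10GB"

# Reclaim space now: wipe and recreate the cache directory
# (substitute your actual RAMJET_CACHE_PATH)
CACHE_DIR="${RAMJET_CACHE_PATH:-/tmp/ramjet_cache}"
rm -rf "$CACHE_DIR"
mkdir -p "$CACHE_DIR"
df -h "$CACHE_DIR"
```

Moving the cache to a larger volume via RAMJET_CACHE_PATH (as in the slow-cache section below) is the longer-term fix.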
"Connection refused to other nodes"¶
Problem: Nodes can't communicate with each other.
Solution:
# Ensure port 9000 is open between nodes
sudo ufw allow 9000/tcp
# Test connectivity
nc -zv other-node 9000
"Stragglers re-register repeatedly during a long run"¶
Problem: A node that fell behind on heartbeats keeps re-registering with the backend, occasionally rotating its position on the consistent-hash ring and forcing other peers to re-fetch.
Cause: a bug in builds before 0.9.0. The straggler's heartbeat would time out, the backend would mark it offline, and the node would then reconnect and rejoin the ring at a new position.
Fix:
- Upgrade to 0.9.0 or newer (pip install -U ramjetio). The ring is now frozen during a training run, and re-registers are bounded to ≤ 2 per run even if the network blips.
- If you can't upgrade yet, raise RAMJET_HEARTBEAT_TIMEOUT_S from the default 30 to 60 and reduce concurrent S3 traffic on the affected node.
Verify on 0.9.0+: the cluster's Re-registers panel in the dashboard should stay flat at 0–2 across a 5-epoch run; spikes mean the node is genuinely losing connectivity.
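To see why a changed ring position forces re-fetches, here is an illustrative consistent-hash sketch (not RAMJET's actual implementation): when a node rejoins at a new position, every key whose ownership arc moved now maps to a different node.

```python
import hashlib
from bisect import bisect_right

def ring_pos(label: str) -> int:
    """Deterministic position on the hash ring for a label."""
    return int(hashlib.md5(label.encode()).hexdigest(), 16)

def owner(ring: dict, key: str) -> str:
    """The first node clockwise of the key's hash owns the key."""
    positions = sorted(ring)
    i = bisect_right(positions, ring_pos(key)) % len(positions)
    return ring[positions[i]]

# Node B initially registers as "B@1"; after a heartbeat timeout it
# re-registers as "B@2" and lands somewhere else on the ring.
before = {ring_pos(f"{n}@1"): n for n in "ABC"}
after = {ring_pos("A@1"): "A", ring_pos("B@2"): "B", ring_pos("C@1"): "C"}

moved = sum(owner(before, f"item-{i}") != owner(after, f"item-{i}")
            for i in range(1000))
print(f"{moved} of 1000 keys changed owner after B re-registered")
```

Each of those moved keys is an item some peer must re-fetch, which is why freezing the ring during a run (the 0.9.0 fix) matters.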
"Slow cache performance"¶
Problem: Cache is slower than expected.
Possible causes:
1. HDD instead of SSD
2. Network bottleneck between nodes
3. Small items (overhead dominates)
Solutions:
# Use SSD
export RAMJET_CACHE_PATH="/nvme/ramjet_cache"
# Check disk speed
dd if=/dev/zero of=/tmp/test bs=1M count=1000 oflag=direct
# Should be >500 MB/s for SSD
"Proto files not generated"¶
Problem: gRPC proto files missing.
Solution:
cd ramjet
pip install grpcio-tools
python -m grpc_tools.protoc -I. --python_out=. --grpc_python_out=. ramjet/proto/ramjet.proto
"CUDA out of memory with caching"¶
Problem: OOM when using cached data.
Note: RAMJET caches CPU data, not GPU tensors.
Solution:
# Ensure data is on CPU before caching
def transform(data):
    return data.cpu()  # Move to CPU

dataset = ramjetio.CachedDataset(
    YourDataset(),
    transform_before_cache=transform,
)
Debug Mode¶
Enable verbose logging:
Or via environment:
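Filling both options in from snippets used elsewhere on this page (the "ramjetio" logger name here is an assumption):

```python
import logging

# In Python: enable DEBUG output before calling ramjetio.init(),
# as in the node-registration debug snippet above
logging.basicConfig(level=logging.DEBUG)
logging.getLogger("ramjetio").setLevel(logging.DEBUG)  # logger name is an assumption
```

Or via the environment, as shown under Getting Help: `RAMJET_LOG_LEVEL=DEBUG`.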
Getting Help¶
If you're still stuck:
- Check logs with RAMJET_LOG_LEVEL=DEBUG
- Email: support@ramjet.io
- Discord: discord.gg/ramjetio
- GitHub Issues: github.com/jogrms/nxgeninc/issues
Include in your report:
- Python version: python --version
- RAMJET version: pip show ramjetio
- PyTorch version: python -c "import torch; print(torch.__version__)"
- OS: uname -a
- Full error traceback
- Minimal reproducible example