Troubleshooting¶
Common Issues¶
"RAMJET API key not configured"¶
Problem: RAMJET can't find your API key.
Solution:
# Set environment variable
export RAMJET_API_KEY="your_key_here"
# Or in Python
import ramjetio
ramjetio.init(api_key="your_key_here")
Verify:
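A quick way to confirm the key is visible to your process (a minimal sketch; the helper name is ours, not part of ramjetio):

```python
import os

def api_key_present(env=None):
    """Return True if RAMJET_API_KEY is set and non-empty."""
    env = os.environ if env is None else env
    return bool(env.get("RAMJET_API_KEY"))

print("API key present:", api_key_present())
```

If this prints `False` inside your training job but `True` in your shell, the variable isn't being exported to the job's environment (common with `sudo`, cron, or container launchers).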
"Invalid API key format"¶
Problem: API key format is incorrect.
Solution: API keys start with ramjet_, followed by the cluster name and the secret.
Get a valid key from the dashboard at app.ramjet.io.
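A rough format check can catch obvious mistakes before a round trip to the backend. This is a sketch based only on the description above (prefix, cluster name, secret); the separator and character set in the pattern are assumptions, not a spec:

```python
import re

# Assumed shape "ramjet_<cluster>_<secret>"; the underscore separator and
# the allowed characters are guesses, not documented by RAMJET.
_KEY_RE = re.compile(r"^ramjet_[A-Za-z0-9-]+_[A-Za-z0-9]+$")

def looks_like_ramjet_key(key: str) -> bool:
    """Cheap sanity check; the backend remains the source of truth."""
    return _KEY_RE.match(key) is not None
```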
"Backend unavailable"¶
Problem: Can't connect to RAMJET dashboard backend.
Possible causes:
1. No internet connection
2. Firewall blocking outbound connections
3. Backend is down (rare)
Solution:
# Check connectivity
curl -v https://api.ramjet.io/health
# Check if port 443 is open outbound
nc -zv api.ramjet.io 443
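The same check can be run from Python, which helps rule out shell-only proxy settings. A sketch, reusing the /health endpoint from the curl command above:

```python
import urllib.request

def backend_reachable(url: str = "https://api.ramjet.io/health",
                      timeout: float = 5.0) -> bool:
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, timeouts, connection resets
        return False
```

If curl succeeds but this returns False, check `HTTPS_PROXY`/`NO_PROXY` in the Python process's environment.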
"Node not appearing in dashboard"¶
Problem: Started training but node doesn't show in dashboard.
Checklist:
1. ✅ RAMJET_API_KEY is set correctly
2. ✅ ramjetio.init() is called
3. ✅ Node has internet access
4. ✅ No firewall blocking outbound port 443
Debug:
import ramjetio
import logging
logging.basicConfig(level=logging.DEBUG)
ramjetio.init()
# Check logs for connection messages
"Cache miss even after first epoch"¶
Problem: Data isn't being cached.
Possible causes:
1. cache_on_miss=False
2. Cache full (check RAMJET_CACHE_SIZE)
3. Different cache keys between epochs
Solution:
# Verify cache is working
dataset = ramjetio.CachedDataset(YourDataset())
# Access same item twice
_ = dataset[0]
_ = dataset[0]
stats = dataset.get_cache_stats()
print(stats)
# Should show: {'cache_hits': 1, 'cache_misses': 1, ...}
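Cause 3 (different cache keys between epochs) is worth a closer look, since it silently turns every lookup into a miss. An illustrative sketch, not RAMJET's key scheme: the key must be a pure function of the item's identity, never of per-epoch state.

```python
# A cache key must be identical every time the same item is requested.
def stable_key(dataset_name: str, index: int) -> str:
    return f"{dataset_name}:{index}"          # same every epoch -> cache hits

# If anything nondeterministic leaks into the key (epoch counter, random
# augmentation seed, object id()), each epoch sees fresh keys -> all misses.
def unstable_key(dataset_name: str, index: int, epoch: int) -> str:
    return f"{dataset_name}:{index}:{epoch}"  # changes every epoch -> misses
```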
"Out of disk space"¶
Problem: Cache filled up the disk.
Solution:
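As a sketch, using the environment variables that appear elsewhere in this guide (the "10GB" value format and the fallback path are assumptions):

```shell
# Cap the cache so it cannot fill the disk (size string format is an assumption)
export RAMJET_CACHE_SIZE="10GB"

# Reclaim space now: wipe and recreate the cache directory
# (substitute your actual RAMJET_CACHE_PATH)
CACHE_DIR="${RAMJET_CACHE_PATH:-/tmp/ramjet_cache}"
rm -rf "$CACHE_DIR"
mkdir -p "$CACHE_DIR"
df -h "$CACHE_DIR"
```

Moving the cache to a larger volume via RAMJET_CACHE_PATH (as in the slow-cache section below) is the longer-term fix.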
"Connection refused to other nodes"¶
Problem: Nodes can't communicate with each other.
Solution:
# Ensure port 9000 is open between nodes
sudo ufw allow 9000/tcp
# Test connectivity
nc -zv other-node 9000
"Stragglers re-register repeatedly during a long run"¶
Problem: A node that fell behind on heartbeats keeps re-registering with the backend, occasionally rotating its position on the consistent-hash ring and forcing other peers to re-fetch.
Cause: a bug in builds before 0.9.0. The straggler's heartbeat would time out, the backend would mark it offline, and the node would then reconnect and rejoin the ring at a new position.
Fix:
- Upgrade to 0.9.0 or newer (pip install -U ramjetio). The ring is now frozen during a training run, and re-registers are bounded to ≤ 2 per run even if the network blips.
- If you can't upgrade yet, raise RAMJET_HEARTBEAT_TIMEOUT_S from the default 30 to 60 and reduce concurrent S3 traffic on the affected node.
Verify on 0.9.0+: the cluster's Re-registers panel in the dashboard should stay flat at 0–2 across a 5-epoch run; spikes mean the node is genuinely losing connectivity.
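To see why a changed ring position forces re-fetches, here is an illustrative consistent-hash sketch (not RAMJET's actual implementation): when a node rejoins at a new position, every key whose ownership arc moved now maps to a different node.

```python
import hashlib
from bisect import bisect_right

def ring_pos(label: str) -> int:
    """Deterministic position on the hash ring for a label."""
    return int(hashlib.md5(label.encode()).hexdigest(), 16)

def owner(ring: dict, key: str) -> str:
    """The first node clockwise of the key's hash owns the key."""
    positions = sorted(ring)
    i = bisect_right(positions, ring_pos(key)) % len(positions)
    return ring[positions[i]]

# Node B initially registers as "B@1"; after a heartbeat timeout it
# re-registers as "B@2" and lands somewhere else on the ring.
before = {ring_pos(f"{n}@1"): n for n in "ABC"}
after = {ring_pos("A@1"): "A", ring_pos("B@2"): "B", ring_pos("C@1"): "C"}

moved = sum(owner(before, f"item-{i}") != owner(after, f"item-{i}")
            for i in range(1000))
print(f"{moved} of 1000 keys changed owner after B re-registered")
```

Each of those moved keys is an item some peer must re-fetch, which is why freezing the ring during a run (the 0.9.0 fix) matters.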
"Slow cache performance"¶
Problem: Cache is slower than expected.
Possible causes:
1. HDD instead of SSD
2. Network bottleneck between nodes
3. Small items (overhead dominates)
Solutions:
# Use SSD
export RAMJET_CACHE_PATH="/nvme/ramjet_cache"
# Check disk speed
dd if=/dev/zero of=/tmp/test bs=1M count=1000 oflag=direct
# Should be >500 MB/s for SSD
"Proto files not generated"¶
Problem: gRPC proto files missing.
Solution:
cd ramjet
pip install grpcio-tools
python -m grpc_tools.protoc -I. --python_out=. --grpc_python_out=. ramjet/proto/ramjet.proto
"CUDA out of memory with caching"¶
Problem: OOM when using cached data.
Note: RAMJET caches CPU data, not GPU tensors.
Solution:
# Ensure data is on CPU before caching
def transform(data):
    return data.cpu()  # Move to CPU

dataset = ramjetio.CachedDataset(
    YourDataset(),
    transform_before_cache=transform,
)
Debug Mode¶
Enable verbose logging:
Or via environment:
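Filling both options in from snippets used elsewhere on this page (the "ramjetio" logger name here is an assumption):

```python
import logging

# In Python: enable DEBUG output before calling ramjetio.init(),
# as in the node-registration debug snippet above
logging.basicConfig(level=logging.DEBUG)
logging.getLogger("ramjetio").setLevel(logging.DEBUG)  # logger name is an assumption
```

Or via the environment, as shown under Getting Help: `RAMJET_LOG_LEVEL=DEBUG`.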
Getting Help¶
If you're still stuck:
- Check logs with RAMJET_LOG_LEVEL=DEBUG
- Email: support@ramjet.io
- Discord: discord.gg/ramjetio
- GitHub Issues: github.com/jogrms/nxgeninc/issues
Include in your report:
- Python version: python --version
- RAMJET version: pip show ramjetio
- PyTorch version: python -c "import torch; print(torch.__version__)"
- OS: uname -a
- Full error traceback
- Minimal reproducible example