The story behind this one: my paternity leave was imminent and I wanted to have a benchmark of AutoGluon running on a bunch of tabular Kaggle challenges while I was off. At least this was the initial objective. The goal was to compare AutoGluon with our own AutoML solution.
My plan was to use Ray to schedule tasks across several VMs automatically and benefit from all its features: the dashboard to monitor tasks and the load on each machine, the ability to stop tasks taking too much time, to retry failed ones, and so on. To be honest, I also just wanted to try Ray for this. I knew it should be possible, and this was a good excuse to dive in. I had previously used ray.tune on a single machine but had never set up a cluster across multiple VMs. Although everything eventually worked the way I wanted, it was more difficult than I anticipated and required quite a bit of debugging and babysitting at first. I also learned a lot about Ray (and AutoGluon) in the process, so I wanted to share some of those lessons here.
Resource constraints and AutoGluon
One of the key aspects of running this benchmark was ensuring that each AutoGluon run had access to the same amount of computational resources. This is important if we want to make fair comparisons between different AutoML solutions under controlled conditions. In practice, this means fixing the number of CPUs (and GPUs) and the amount of memory available to each run.
As written in AMLB, An AutoML Benchmark, running each task on a specific EC2 instance type ensures consistent resources. I followed a similar principle here. Instead of relying on large shared machines, I used a pool of identical VMs and aimed to run one task per VM to guarantee fixed resources per run.
AutoGluon exposes parameters to control resources like CPUs, GPUs, and memory 1. However, the memory parameter is only a soft limit: in most cases it is respected, but it can occasionally be exceeded. In retrospect, I could have experimented with this, but given the VMs I had, it was simpler to enforce resource limits at the VM level.
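For reference, this is roughly what passing these parameters looks like. It's a minimal sketch: the data path and label column are placeholders, and the exact signature (in particular memory_limit and its units) may vary between AutoGluon versions, so check the documentation.

from autogluon.tabular import TabularDataset, TabularPredictor

train_data = TabularDataset("train.csv")  # placeholder path

predictor = TabularPredictor(label="target").fit(  # "target" is a placeholder label column
    train_data,
    num_cpus=8,       # CPUs AutoGluon is allowed to use
    num_gpus=0,
    memory_limit=32,  # soft memory limit; can occasionally be exceeded
)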
To orchestrate this setup efficiently, I turned to Ray, which could launch one run per machine with fixed resource constraints.
Setting up the Ray cluster
Setting up the Ray cluster was mostly straightforward, although I ended up doing it manually. I followed the Ray manual setup guide, running ray start on each VM 2. I did try to use Ray's YAML cluster configuration to automate everything, but I ran into issues activating the right conda environment remotely. At the time, I wasn't very familiar with non-interactive conda environment activation. I suspect I could revisit this now given what I learned when trying Ansible. All the VMs shared an NFS mount, which simplified data access: I didn't have to copy datasets around, and results could be written to a common location accessible from every machine.
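For context, the manual setup boils down to starting a head node, attaching the other VMs to it, and then connecting the driver script to the running cluster. The address and port below are placeholders.

# On the head VM:     ray start --head --port=6379
# On each worker VM:  ray start --address=<head_ip>:6379
import ray

ray.init(address="auto")        # connect to the cluster started above
print(ray.cluster_resources())  # sanity check: total logical CPUs/memory Ray sees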
Ray resources are logical
By default, Ray uses logical resources for scheduling and doesn't enforce CPU affinity or strict OS-level isolation. This meant I couldn't guarantee that each task would be pinned to a single VM, which was something I wanted for this benchmark. There might be ways to assign tasks to specific nodes (VMs) of the cluster with Ray scheduling strategies, but I haven't played with this. I provide more details on Ray logical versus physical resources at the end of the post.
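For reference, pinning a task to a specific node would look roughly like the sketch below using Ray's NodeAffinitySchedulingStrategy. I haven't tried this myself, so treat it as untested.

import ray
from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy

ray.init(address="auto")

@ray.remote(num_cpus=8)
def run_on_node():
    return "done"

# Pick one alive node of the cluster and force the task to run there.
node_id = [n["NodeID"] for n in ray.nodes() if n["Alive"]][0]
ref = run_on_node.options(
    scheduling_strategy=NodeAffinitySchedulingStrategy(node_id=node_id, soft=False)
).remote()
print(ray.get(ref))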
Isolating AutoGluon from the external Ray cluster
AutoGluon uses Ray under the hood, and this can lead to conflicts when you also manage your own Ray cluster. AutoGluon actually discourages using Ray on top of it. In my case, I was launching each AutoGluon run through Ray, but because Ray doesn’t isolate resources per VM by default, AutoGluon detected the entire cluster’s resources and ignored the per-VM boundaries I wanted to enforce.
To isolate things, I first had to wrap the AutoGluon call in a dedicated Python script that was launched by the Ray task (remote function) as a subprocess using subprocess.run:
import subprocess
import ray

@ray.remote(num_cpus=8)
def run_autogluon_task(challenge_name):
    # Run AutoGluon in a separate process so that it can start its own
    # local Ray instance instead of attaching to the external cluster.
    subprocess.run(
        ["python", "autogluon_script.py", challenge_name],
    )
Then, inside the Python script that invokes AutoGluon (autogluon_script.py), I had to initialize a local Ray instance with ray.init(address="local") just before running AutoGluon, so that it uses the VM's local Ray cluster instead of connecting to the external one. After that, AutoGluon stayed confined to the resources of the VM it was launched on, which was the behavior I wanted.
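Putting the two pieces together, autogluon_script.py looked roughly like this. The NFS paths and label column are placeholders, not the exact layout I used.

# autogluon_script.py
import sys

import ray
from autogluon.tabular import TabularDataset, TabularPredictor

if __name__ == "__main__":
    challenge_name = sys.argv[1]

    # Start a fresh local Ray instance on this VM so AutoGluon does not
    # attach itself to the external cluster started with `ray start`.
    ray.init(address="local")

    train_data = TabularDataset(f"/nfs/{challenge_name}/train.csv")    # placeholder path
    predictor = TabularPredictor(
        label="target",                         # placeholder label column
        path=f"/nfs/results/{challenge_name}",  # write results to the shared NFS mount
    ).fit(train_data)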
This approach worked well overall, but I had another issue. Sometimes, Ray would kill a task because it exceeded its memory threshold 3, but the subprocess running AutoGluon would keep running on the VM. Ray then assumed the machine was free and tried to schedule new tasks on it, which led to out-of-memory errors. Increasing Ray's memory thresholds helped in some cases, but the key fix was enabling RAY_kill_child_processes_on_worker_exit_with_raylet_subreaper=true, which ensured orphaned subprocesses were properly killed when Ray terminated a task.
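Since this flag has to be visible to the raylet that spawns the worker processes, my understanding is that it needs to be set in the environment of the ray start command on each VM. For a single-machine ray.init(), the idea would look like this (an untested sketch):

import os

# For the multi-VM cluster, export this variable in the shell that runs
# `ray start` on each VM so the raylet picks it up. For a local ray.init(),
# setting it before Ray starts should have the same effect.
os.environ["RAY_kill_child_processes_on_worker_exit_with_raylet_subreaper"] = "true"

import ray

ray.init()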
Lessons learned
This kind of setup works best when the codebase is stable. When the code changes frequently, debugging errors through Ray can be cumbersome. It’s usually easier to debug tasks outside of Ray first and only distribute them once the code is stable. Overall, the setup worked and allowed me to run a large number of AutoGluon tasks in parallel with fixed resource constraints. The key lessons were:
- Ray makes distribution easy but not foolproof. The logical vs physical resource model means you still need to think carefully about task placement and resource isolation if you want strict control. Other orchestration approaches could have achieved the same goal with different trade-offs in complexity, isolation, and monitoring.
- For a small number of machines, the manual Ray cluster setup was fine, though with better environment management I could probably automate this fully with YAML configs.
- One thing I didn’t fully solve was Ray’s behavior when tasks exceeded memory. Ray killed tasks for going over its memory threshold, even though the same AutoGluon runs succeeded without Ray. One solution would be to disable Ray’s out-of-memory task killing, but this seemed a bit dangerous.
Appendix: More on Ray resources
Ray allows you to specify resources for each task using parameters like num_cpus=2 and memory=500 * 1024 * 1024, meaning that you want each task to use 2 CPUs and 500 MiB of RAM. However, this does not prevent your task from using more than 2 physical CPUs and 500 MiB of RAM. It is actually your job to make sure the code run by your task does not use more physical resources than requested. This is because Ray resources are logical, and this can be very different from the physical resources of your machine. This is explained in the Ray Resources documentation page: “Physical resources are resources that a machine physically has such as physical CPUs and GPUs and logical resources are virtual resources defined by a system. […] Resource requirements of tasks or actors do NOT impose limits on actual physical resource usage.” The definition of a logical resource depends on the system that defines it, so it differs between Ray and other software. For example, the number of logical CPUs as defined by the operating system (counting hyperthreads as separate CPUs) is often twice the number of physical CPUs, which is different from Ray’s own definition of logical resources. Also, there is no CPU affinity by default in Ray, so your task will not be pinned to specific cores.
Let’s use an example to make this clear. When you use Ray, you start a Ray cluster (ray.init), which initializes the total number of resources that the cluster can use. By default this will use the number of CPUs your machine has as the total number of logical CPUs for the cluster, say 8 CPUs. Ray only monitors what happens within this cluster and schedules jobs according to the number of available CPUs and the number of CPUs required by each task (num_cpus).

So let’s say you have 10 tasks to run, each requiring 2 logical CPUs. Ray can start 4 tasks immediately, taking the 8 CPUs of its cluster. This does not mean that each task will only use 2 physical CPUs, nor that the physical CPUs used by the task are not used by another program external to Ray (or by another Ray cluster). If other programs were running on your machine, taking the 8 CPUs, Ray does not monitor this and does not know about these programs. Your Ray cluster does not know about other Ray clusters either. It just monitors what happens in its cluster, the total logical resources you allocated to it, and how many logical resources each task is using. When a task ends, your Ray cluster sees that there are 2 free logical CPUs, meaning that the total number of logical CPUs used by your tasks is 6 out of 8. Again, this doesn’t mean there are actually 2 physical CPUs free on your machine: Ray only monitors the logical resources used by the tasks running within its cluster.
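The scenario above can be sketched as follows; the sleep is just a stand-in for real work, and nothing here pins a task to specific cores.

import time

import ray

ray.init(num_cpus=8)  # the cluster advertises 8 logical CPUs

@ray.remote(num_cpus=2, memory=500 * 1024 * 1024)
def task(i):
    time.sleep(5)  # stand-in for real work
    return i

# Ray runs at most 4 of these at a time (8 logical CPUs / 2 per task),
# regardless of what other processes on the machine are doing.
print(ray.get([task.remote(i) for i in range(10)]))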
If setting num_cpus for a Ray task does not prevent your task from using more than num_cpus physical CPUs, why is it named like that? This is because num_cpus does have some connection to physical CPUs. First, by default, the total number of CPUs Ray assumes for its cluster is the number of logical CPUs on your machine (including hyperthreading). Then, when you explicitly pass num_cpus, Ray sets the OMP_NUM_THREADS environment variable to the same value, which ensures that libraries reading this variable will not use more than num_cpus threads. For similar reasons, Ray defines num_gpus and memory logical resources.
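You can check this from inside a task; assuming the behavior described above, the following small sketch should print the value passed as num_cpus.

import os

import ray

ray.init(num_cpus=8)

@ray.remote(num_cpus=2)
def omp_threads():
    # Ray exports OMP_NUM_THREADS to the worker process running this task.
    return os.environ.get("OMP_NUM_THREADS")

print(ray.get(omp_threads.remote()))  # expected: "2"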
AutoGluon has num_cpus, num_gpus and memory_limit parameters that you can check in the documentation. ↩︎
You need to make sure you have the exact same setup on all the VMs. ↩︎
To be completely exact, Ray kills a task when a memory threshold is exceeded at the node (VM) level, not the task level. Ray monitors memory usage on each node to prevent out-of-memory errors, but it does not monitor or enforce memory limits for individual tasks. This behavior is explained here. ↩︎