You can distribute an LLM across TPU, GPU, and CPU by assigning compute-heavy layers to accelerators and offloading static or memory-intensive components (such as embeddings) to the CPU using a device map.
As a minimal sketch, the snippet below uses Hugging Face Accelerate to split a model between a GPU and the CPU. The model name, checkpoint path, and layer names are illustrative assumptions, and TPU placement (which typically goes through torch_xla rather than device_map) is not shown:

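```python
# Hypothetical sketch: the model name, checkpoint path, and device indices are assumptions.
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForCausalLM

model_name = "meta-llama/Llama-2-7b-hf"        # illustrative model
checkpoint_path = "path/to/local/llama-2-7b"   # local sharded checkpoint (assumption)

config = AutoConfig.from_pretrained(model_name)

# Build the model skeleton on the "meta" device so no real weight memory is allocated yet.
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

# Manual device_map: compute-heavy decoder layers go to GPU 0,
# while the embeddings, final norm, and LM head are offloaded to the CPU.
device_map = {
    "model.embed_tokens": "cpu",
    "model.norm": "cpu",
    "lm_head": "cpu",
}
for i in range(config.num_hidden_layers):
    device_map[f"model.layers.{i}"] = 0

# Load only the checkpoint shards each device needs and attach dispatch hooks
# that move activations between devices during the forward pass.
model = load_checkpoint_and_dispatch(
    model,
    checkpoint=checkpoint_path,
    device_map=device_map,
    no_split_module_classes=["LlamaDecoderLayer"],
)
```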
The code above relies on the following key points:
- A manual device_map defines a precise device assignment for each model component.
- load_checkpoint_and_dispatch loads only the checkpoint shards each device needs.
- Accelerate handles hardware-aware dispatch across the heterogeneous devices.
Hence, cross-device mapping allows scalable and cost-efficient LLM deployment using available hardware tiers.
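For completeness, a short usage sketch under the same assumptions: once dispatched, inputs can stay on the CPU (where the embeddings live in the map above), and Accelerate's hooks move activations to the GPU-resident layers and back automatically.

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)  # same illustrative model as above

# Inputs are created on the CPU, matching the embedding layer's placement;
# dispatch hooks handle device transfers for the GPU-resident decoder layers.
inputs = tokenizer("Explain device mapping in one sentence.", return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```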