Model parallelism is also possible for inference: you can host extremely large models across multiple accelerators (see DeepSpeed ZeRO). However, inference runs several times slower than when the entire model fits on a single accelerator, due to communication bottlenecks and overhead between devices. Note that in most cases this parallelism is about fitting the parameters in memory, not about parallelizing the actual computation.
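As a minimal sketch of sharding a model across accelerators, assuming the Hugging Face transformers stack with accelerate installed (the model name is a placeholder for any large causal LM):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# device_map="auto" splits the model's layers across all visible GPUs
# (spilling to CPU if needed). Cross-device communication during generation
# is what makes this slower than running everything on one accelerator.
model_name = "bigscience/bloom"  # placeholder: any large causal LM
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",           # shard parameters across accelerators
    torch_dtype=torch.bfloat16,  # halve memory vs. float32
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

inputs = tokenizer("Model parallelism lets us", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```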
Certain models may require storage formats such as bfloat16 to run efficiently, and these formats may not be supported on older hardware.
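A quick way to check for native bfloat16 support and pick a fallback, sketched with PyTorch:

```python
import torch

# bfloat16 requires relatively recent accelerators (e.g. Ampere-or-newer
# NVIDIA GPUs); fall back to float16 or float32 on older hardware.
if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
    dtype = torch.bfloat16
elif torch.cuda.is_available():
    dtype = torch.float16
else:
    dtype = torch.float32
print(f"Selected dtype: {dtype}")
```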
Parallelism is also supported for training (where it is in fact more necessary), but because backpropagation must keep gradients and optimizer state in addition to the weights, training typically requires many multiples of the memory needed for inference.
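As a rough rule of thumb (a back-of-envelope sketch, ignoring activations and framework overhead): inference in fp16/bf16 needs about 2 bytes per parameter, while mixed-precision Adam training needs roughly 16 bytes per parameter (fp16 weights and gradients plus fp32 master weights, momentum, and variance):

```python
def rough_memory_gb(num_params: float, training: bool) -> float:
    """Back-of-envelope parameter memory, ignoring activations and buffers.

    Inference: ~2 bytes/param (fp16/bf16 weights).
    Training (mixed-precision Adam): ~16 bytes/param
      (fp16 weights + fp16 grads + fp32 master weights, momentum, variance).
    """
    bytes_per_param = 16 if training else 2
    return num_params * bytes_per_param / 1e9

params = 7e9  # e.g. a 7B-parameter model
print(f"inference: ~{rough_memory_gb(params, training=False):.0f} GB")
print(f"training:  ~{rough_memory_gb(params, training=True):.0f} GB")
```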