
8 min read
Optimizing Transformer Inference: From 200ms to 15ms
A practical guide to optimizing transformer model inference for production deployment, covering quantization, distillation, and ONNX Runtime.
Transformers · Optimization · ONNX