Building ML systems for a trillion trillion floating point operations
All the ML hype is about doing a trillion trillion floating point operations
ML systems are different
the problems are very simple, but as such we have very high expectations
Model FLOP utilisation (MFU): the percentage of the hardware's theoretical maximum floating point operations per second that we actually achieve. Almost no CPU workload comes anywhere near this; in ML we are often hitting 50%.
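A back-of-the-envelope MFU calculation, as a sketch: the 6 × params × tokens FLOP approximation and the A100-class peak of ~312 TFLOP/s are illustrative assumptions, not figures from these notes.

```python
# Rough Model FLOP Utilisation (MFU) estimate (illustrative sketch; the
# 6*params*tokens FLOP approximation and the ~312 TFLOP/s bf16 peak are assumptions).

def mfu(params: float, tokens_per_step: float, step_time_s: float,
        peak_flops: float = 312e12) -> float:
    achieved_flops_per_s = 6 * params * tokens_per_step / step_time_s
    return achieved_flops_per_s / peak_flops

# e.g. a 7B-parameter model pushing 8192 tokens through one GPU in 2.2 s per step:
print(f"MFU ≈ {mfu(7e9, 8192, 2.2):.0%}")   # ≈ 50%
```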
The field has consolidated significantly
from "many architectures" to "one" architecture
"Many folks training SOTA models" -> "a few companies training SOTA models"
There are ways to get more impact
leaving optimisations up to a compiler
getting tools so that you can do the optimisation yourself
ML framework programming model history
Declarative (Caffe)
Graph-builder (TensorFlow)
define a function
function gets converted into IR
function eventually executes on GPU
Imperative/eager (PyTorch); both models are contrasted in the sketch after this list
call a function
GPU runs the function
function finishes
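A rough sketch contrasting the two programming models (TF 1.x-style graph building vs. PyTorch eager; shapes and values are illustrative):

```python
# Illustrative contrast of the two programming models (a sketch, not from the notes).
import numpy as np
import tensorflow as tf
import torch

# Graph-builder: each line adds a node to an IR; nothing runs until the graph is executed.
tf.compat.v1.disable_eager_execution()
x = tf.compat.v1.placeholder(tf.float32, shape=(None, 128))
y = tf.linalg.matmul(x, tf.ones((128, 64)))          # adds a matmul node, no work yet
with tf.compat.v1.Session() as sess:
    out = sess.run(y, feed_dict={x: np.ones((4, 128), np.float32)})  # graph executes here

# Imperative/eager: each line launches work on the device as it runs.
device = "cuda" if torch.cuda.is_available() else "cpu"
a = torch.ones(4, 128, device=device)
b = torch.ones(128, 64, device=device)
out = a @ b                                          # kernel runs immediately
```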
Unoptimized eager execution didn't even sacrifice performance: ~90% of the time was spent in matmuls, so there was nothing else to optimize.
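One way to sanity-check that breakdown on your own model is to profile an eager step; a minimal sketch using PyTorch's built-in profiler (the toy model is illustrative):

```python
# Profile an eager step and see how much time lands in matmul kernels.
import torch
from torch.profiler import profile, ProfilerActivity

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 1024)
).to(device)
x = torch.randn(512, 1024, device=device)

activities = [ProfilerActivity.CPU] + ([ProfilerActivity.CUDA] if device == "cuda" else [])
with profile(activities=activities) as prof:
    model(x).sum().backward()

print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```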
But then tensor cores arrived.
Thus we get ML compilers
keep the eager programming model
capture the program into a graph in some manner
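In PyTorch 2.x this is what torch.compile does: the call syntax stays eager while the function body gets captured into a graph and optimized. A minimal sketch with a toy function:

```python
# "Keep the eager programming model, capture into a graph" with torch.compile.
import torch

def f(x, w):
    return torch.relu(x @ w).sum()

compiled_f = torch.compile(f)       # same signature, same call style as eager
x = torch.randn(256, 256)
w = torch.randn(256, 256)
print(compiled_f(x, w))             # first call captures the graph, compiles, then runs
```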
What ML compilers are doing for us
Three things one can be spending time on (worked example after this list)
Compute: time spent on your GPU computing actual floating point operations
Memory: time spent transferring tensors within a GPU
Overhead: everything else
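A worked example of the compute/memory split for a single op, assuming roughly A100-class peaks (~312 TFLOP/s bf16 and ~2 TB/s HBM bandwidth); the numbers are illustrative:

```python
# Back-of-the-envelope split of where an op's time goes (illustrative peaks).
PEAK_FLOPS = 312e12   # FLOP/s
PEAK_BW = 2.0e12      # bytes/s

def op_times(flops: float, bytes_moved: float) -> tuple[float, float]:
    """Ideal compute time and memory time; the larger one bounds the op."""
    return flops / PEAK_FLOPS, bytes_moved / PEAK_BW

n = 4096
# bf16 matmul of two n x n matrices: 2*n^3 FLOPs, three n*n tensors of 2 bytes each moved.
mm_c, mm_m = op_times(2 * n**3, 3 * n * n * 2)
# elementwise ReLU over an n x n bf16 tensor: ~1 op per element, one read plus one write.
relu_c, relu_m = op_times(n * n, 2 * n * n * 2)

print(f"matmul: compute {mm_c*1e6:.0f} us vs memory {mm_m*1e6:.0f} us  -> compute-bound")
print(f"relu:   compute {relu_c*1e6:.2f} us vs memory {relu_m*1e6:.0f} us -> memory-bound")
```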
All runtime is either compute or shuffling data
A FLOP is the only real thing a GPU can do.