Building ML systems for a trillion trillion floating point operations

All the ML hype comes down to doing a trillion trillion floating point operations

ML systems are different

  • the problems are very simple, but as such we have very high expectations for performance

  • Model FLOP utilisation (MFU): the percentage of the theoretical maximum floating point operations per second that the hardware actually sustains. Almost no CPU workload comes anywhere near its peak; in ML we are often hitting 50% (see the MFU sketch after this list).

  • The field has consolidated significantly

    • from "many artitecture" to "One" artitecture

    • "Many folks training SOTA models" -> "a few companies training SOTA models"

  • There are two ways to get more impact

    • leaving optimisations up to a compiler

    • building tools so that you can do the optimisation yourself

  • ML framework programming model history

    • Declarative (Caffe)

    • Graph-builder (TensorFlow)

      • define a function

      • the function gets converted into an IR

      • the function eventually executes on the GPU

    • Imperative/eager (PyTorch)

      • call a function

      • the GPU runs the function

      • the function finishes

      • unoptimized eager execution didn't even sacrifice much performance: ~90% of the time was spent in matmuls, so there was nothing else we needed to optimize.

    • But then tensor cores arrived: matmuls got dramatically faster, so everything else started to dominate the runtime.

      • thus we get ML compilers

        • keep the eager programming model

        • capture the program into a graph in some manner (see the torch.compile sketch after this list)
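
A minimal sketch of the MFU calculation described above. The peak FLOP/s figure, the matmul shapes, and the step time are illustrative assumptions, not numbers from these notes:

```python
# Model FLOP utilisation (MFU) = achieved FLOP/s / theoretical peak FLOP/s.
# All concrete numbers below are assumptions for illustration.

PEAK_FLOPS = 312e12  # e.g. an A100's bf16 tensor-core peak, in FLOP/s

def matmul_flops(m: int, k: int, n: int) -> float:
    # An (m x k) @ (k x n) matmul performs ~2*m*k*n floating point operations.
    return 2.0 * m * k * n

flops_per_step = 100 * matmul_flops(8192, 8192, 8192)  # say, 100 big matmuls
step_time_s = 0.5                                      # measured wall-clock time

achieved_flops = flops_per_step / step_time_s
print(f"MFU: {achieved_flops / PEAK_FLOPS:.1%}")  # fraction of peak sustained
```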

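To make "keep the eager programming model, capture into a graph" concrete, here is a minimal PyTorch 2.x sketch (torch.compile is the real API; the function f is a made-up example):

```python
import torch

def f(x):
    # Plain eager PyTorch: each op launches a kernel as soon as it is called.
    return torch.relu(x @ x) + 1.0

x = torch.randn(1024, 1024)
y_eager = f(x)  # imperative/eager: call the function, it runs, it finishes

# Same code, same programming model, but captured into a graph and compiled:
f_compiled = torch.compile(f)
y_compiled = f_compiled(x)  # first call traces and compiles; later calls reuse it

torch.testing.assert_close(y_eager, y_compiled)
```
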
What ML compilers are doing for us

  • Three things one can be spending time on

    • compute: time spent on your GPU computing actual floating point operations

    • memory: time spent transferring tensors within a GPU

    • overhead: everything else (e.g. the Python interpreter and framework dispatch)

  • All runtime is either compute or shuffling data around (see the roofline-style sketch below)

    • a FLOP is the only real thing a GPU can do.
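
A roofline-style back-of-the-envelope sketch of the compute-vs-memory split above. The hardware peaks are illustrative assumptions (roughly A100-class), not numbers from these notes:

```python
# Classify an op as compute-bound or memory-bound by comparing the time its
# FLOPs would take at peak compute against the time its bytes would take at
# peak bandwidth. Both peak figures below are assumed, for illustration only.

PEAK_FLOPS = 312e12  # FLOP/s
PEAK_BYTES = 1.5e12  # bytes/s of GPU memory bandwidth

def bound_by(flops: float, bytes_moved: float) -> str:
    compute_time = flops / PEAK_FLOPS
    memory_time = bytes_moved / PEAK_BYTES
    return "compute-bound" if compute_time > memory_time else "memory-bound"

n = 8192
# A big bf16 matmul: many FLOPs per byte touched.
print(bound_by(flops=2 * n**3, bytes_moved=3 * n * n * 2))  # compute-bound

# A pointwise relu over the same matrix: one FLOP per element, read + write.
print(bound_by(flops=n * n, bytes_moved=2 * n * n * 2))     # memory-bound
```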
