neuron: trainium & inferentia

AWS has built two different families of specialized ML chips, or accelerators: one aimed at training and one aimed at inference.

here i will cover the neuron, trainium, and inferentia architectures, along with outlining the differences between trainium and inferentia.

key differences

one of my main motivations to dive deeper into the inferentia and trainium architectures was to better understand the technical differences that lead to one being better for inference and the other better for training.

at a glance they seem awfully similar, given they are both powered by some number of neuroncore-v1/2/3 cores. the most obvious place of difference is in their memory: trainium uses HBM while first-gen inferentia uses DDR.

to better understand the reasoning behind the difference in memory choice, let's recap latency vs bandwidth. latency: the delay before a transfer of data begins following an instruction for its transfer. bandwidth: the maximum amount of data that can be transferred between two points in a given amount of time.

relating these definitions back to inference and training: latency is a priority for inference because we are typically producing a single output, and the speed of getting that output matters. in training we typically don't need the results immediately, but we are trying to push a large amount of data through the chip at once, hence the need for bandwidth.

so going back: trainium uses HBM, which literally stands for High Bandwidth Memory, because training runs prioritize processing tons of examples together. inferentia (gen 1) uses DDR, a more conventional memory, because the extra bandwidth buys much less when latency is the priority in inference scenarios. the rough arithmetic below makes the trade-off concrete.
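
a back-of-the-envelope sketch in python. the bandwidth numbers and data sizes here are illustrative placeholders i picked for intuition, not official chip specs.

# time spent just streaming data at a given memory bandwidth
GB = 1e9

def transfer_seconds(bytes_moved: float, bandwidth_gb_per_s: float) -> float:
    return bytes_moved / (bandwidth_gb_per_s * GB)

# inference: a single request's activations are small (say ~10MB), so the
# transfer time is tiny either way and latency is dominated by fixed overheads
print(transfer_seconds(10e6, 100))   # 1e-4 s at 100 GB/s (DDR-ish)
print(transfer_seconds(10e6, 800))   # 1.25e-5 s at 800 GB/s (HBM-ish)

# training: a big batch plus weights/gradients can be tens of GB, so the
# extra bandwidth translates directly into shorter step times
print(transfer_seconds(50e9, 100))   # 0.5 s at 100 GB/s
print(transfer_seconds(50e9, 800))   # 0.0625 s at 800 GB/s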

cost is also a noticeable difference between these instances, with inferentia being considerably cheaper. again, this can be traced back to the fundamental difference in focus for each instance. when optimizing for training, your focus is on a single cluster and a known, large amount of data. but when building for inference, your optimizations also include scalability for dynamic volumes of usage, handled by spinning up instances at any moment to absorb increases in traffic.

-------------------------------------

neuroncore-v2/v3 architecture

(a neuroncore fills a role roughly analogous to a cuda sm on an nvidia gpu)

each neuroncore has 4 main engines: tensor, vector, scalar, and GPSIMD.

each nc-v3 has 28MB of on-chip software-managed SRAM to maximize data locality and optimize data prefetch.

the tensor engine is highly optimized for GEMM, CONV, and transpose operations, and supports mixed-precision computation.

the tensor engine in v3 delivers 79 fp16 TFLOPS of tensor operations.

it also delivers 316 fp16 TFLOPS with structured sparsity, useful when one input tensor in a matmul has an M:N sparsity pattern, where only M elements out of every N contiguous elements are non-zero. a small sketch of building such a pattern follows.
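
a minimal sketch of what an M:N (here 2:4) pattern looks like, in plain numpy. this is just for intuition about the pattern itself, not the neuron compiler's actual sparsity tooling.

import numpy as np

def prune_2_of_4(w: np.ndarray) -> np.ndarray:
    # in every contiguous group of 4 weights, keep the 2 largest-magnitude
    # values and zero out the other 2 -> a 2:4 (M=2, N=4) pattern
    out = w.copy()
    groups = out.reshape(-1, 4)
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]   # 2 smallest per group
    np.put_along_axis(groups, drop, 0.0, axis=1)
    return groups.reshape(w.shape)

w = np.random.randn(8, 16).astype(np.float32)          # last dim divisible by 4
w_sparse = prune_2_of_4(w)
# every group of 4 contiguous weights now has exactly 2 non-zeros
assert np.all((w_sparse.reshape(-1, 4) != 0).sum(axis=1) == 2)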

the vector engine is optimized for vector computations in which every element of the output depends on multiple input elements, like layernorm or pooling.

the scalar engine is optimized for scalar operations, where every element of the output depends on only one element of the input. a quick illustration of the vector vs scalar categories is below.
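
a quick illustration of the two categories in plain pytorch (the shapes are just example choices; nothing here is neuron-specific):

import torch
import torch.nn.functional as F

x = torch.randn(4, 128)

# vector-engine-style op: each output element depends on many input elements
# (layernorm normalizes each row using that row's mean and variance)
y_vec = F.layer_norm(x, normalized_shape=(128,))

# scalar-engine-style op: each output element depends on exactly one input
# element (an activation applied elementwise)
y_scalar = F.gelu(x)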

-------------------------------------

inferentia (inf1/inf2)

each inf1.24xlarge instance has x16 inferentia chips, each with x4 neuroncore-v1 cores

each chip delivers 128 int8 TOPS and 64 fp16 TFLOPS

each chip has 8GB of device DRAM (DDR4) for storing parameters and intermediate state

each inf2.48xlarge instance has x12 inferentia2 chips

each inferentia2 chip delivers 190 fp16 TFLOPS

32GB of HBM per chip, which is x4 the memory of inferentia1

inf1.24xlarge is $4.72/hr

inf2.48xlarge is $12.98/hr
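
quick per-instance arithmetic from the per-chip numbers above (specs and prices as listed in these notes; worth re-checking current aws pages before relying on them):

inf1_chips, inf1_fp16_tflops, inf1_dram_gb = 16, 64, 8
inf2_chips, inf2_fp16_tflops, inf2_hbm_gb = 12, 190, 32

print(inf1_chips * inf1_fp16_tflops, "fp16 TFLOPS per inf1.24xlarge")   # 1024
print(inf1_chips * inf1_dram_gb, "GB DDR4 per inf1.24xlarge")           # 128
print(inf2_chips * inf2_fp16_tflops, "fp16 TFLOPS per inf2.48xlarge")   # 2280
print(inf2_chips * inf2_hbm_gb, "GB HBM per inf2.48xlarge")             # 384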

-------------------------------------

trainium (trn2/trn1)

each trn1.32xlarge instance has x16 trainium chips, each with x2 neuroncore-v2 cores

each trn2 ultraserver has x64 trainium2 chips, i.e. the equivalent of x4 trn2.48xlarge instances (x16 chips each)

a trn2.48xlarge does 20.8 PFLOPS of fp8, so an ultraserver does 4 x 20.8 = 83.2 PFLOPS

a trn2.48xlarge has 1.5TB of HBM3

trn1.32xlarge is $15.50/hr on demand

trn2.48xlarge is $44.701/hr on demand
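
the same kind of arithmetic for trn2, working backwards from the instance-level numbers above (the 1.5TB figure is rounded, so the per-chip memory comes out approximate):

trn2_chips = 16                         # 64 chips per ultraserver / 4 instances
print(83.2 / 4, "PFLOPS fp8 per trn2.48xlarge")                        # 20.8
print(20.8 * 1000 / trn2_chips, "TFLOPS fp8 per trainium2 chip")       # 1300.0
print(1.5 * 1000 / trn2_chips, "GB HBM3 per trainium2 chip (approx)")  # 93.75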

-------------------------------------

compiling and deploying BERT on trn1/trn2
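
a minimal sketch of what the compile-and-deploy flow can look like with torch-neuronx, tracing a huggingface BERT into a compiled torchscript artifact. the checkpoint name, sequence length, and file paths are placeholder choices, and the official aws neuron tutorials are the authoritative reference for this workflow.

import torch
import torch_neuronx
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "bert-base-cased-finetuned-mrpc"   # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, torchscript=True)
model.eval()

# example inputs fix the shapes the compiled graph will accept
enc = tokenizer(
    "the company is based in seattle",
    "the firm is headquartered in seattle",
    max_length=128, padding="max_length", truncation=True, return_tensors="pt",
)
example = (enc["input_ids"], enc["attention_mask"], enc["token_type_ids"])

# trace + compile the model for neuron, then persist the compiled artifact
model_neuron = torch_neuronx.trace(model, example)
torch.jit.save(model_neuron, "bert_neuron.pt")

# on the serving host: load and call it like a normal torchscript module
loaded = torch.jit.load("bert_neuron.pt")
logits = loaded(*example)[0]
print(logits.argmax(dim=-1))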

-------------------------------------

directory

-------------------------------------