Triton Kernels - Fused Softmax - 2
Worklog: Performance debugging Triton Kernel
Fused Softmax Triton kernel exploration
RMS Normalization Triton kernel implementation for LLMs
Visualizer for MXFP4 quantization
KV Caching: Training vs Inference in Multi-Head Attention
Load OpenAI's reference implementation
Testing out the huggingface version for the gpt-oss-20b locally on consumer hardware
Regularization to improve duplication penalty loss
How Triton Compiler Works Under the Hood!
Enough MLIR to be dangerous - how Triton uses MLIR passes to progressively lower IR