Learn CUTLASS the hard way!
Walkthrough of optimization techniques for GEMMs, from a naive fp32 kernel to a CUTLASS bf16 kernel
Worklog: Performance debugging Triton Kernel
Fused Softmax Triton kernel exploration
RMS Normalization Triton kernel implementation for LLMs
Visualizer for MXFP4 quantization
KV Caching: Training vs Inference in Multi-Head Attention
Load OpenAI's reference implementation
Testing the Hugging Face version of gpt-oss-20b locally on consumer hardware
Regularization to improve duplication penalty loss
How Triton Compiler Works Under the Hood!