Even faster compared to stable PyTorch 2.3 (46% on A100, as per the tweet), and much, much faster compared to PyTorch 2.2, which was the stable version a couple of weeks ago. llm.c also comes out even further ahead when the performance comparison is run on H100 instead of A100, or on multiple GPUs instead of a single one.