How to make an LLM faster. 6 tok/s from ~57 tok/s, and the att_mix kernel going from 5.