Reubend 2 hours ago [-]
This is really cool research, but I'm wondering how much it slows down inference. The readme says that it's "...distinguished by zero overhead (no learned components, no entropy coding)" but does that really mean that this is a "free win"?
mungoman2 2 hours ago [-]
This is cool. It makes storage of the KV cache much smaller, making it possible to keep more of it in fast memory.
Bandwidth-wise it is worse (more bytes accessed) for generation and random recall than the vanilla approach, and significantly worse than a quantized approach, because the reference also needs to be accessed.
I guess the implication is that since the KV cache is smaller, the parts of it that are needed are more likely to be in fast memory, bandwidth demands on slow links are reduced, and performance goes up.
A discussion of the benefits and drawbacks of the approach would be interesting, ideally backed by data.
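To make the bandwidth point concrete, here's a back-of-envelope sketch. All sizes and the "delta plus reference" decode cost are my own illustrative assumptions, not numbers from the project; the point is only the relative ordering.

```python
# Illustrative bytes-accessed comparison for one cached token's K/V.
# HEAD_DIM, N_KV_HEADS, and the compression ratios are assumed values.

HEAD_DIM = 128
N_KV_HEADS = 8
KV_BYTES_FP16 = 2 * N_KV_HEADS * HEAD_DIM * 2  # K and V, 2 bytes each

def bytes_read_per_token(scheme: str) -> int:
    """Bytes fetched to use one cached token's K/V under each scheme."""
    if scheme == "vanilla_fp16":
        return KV_BYTES_FP16
    if scheme == "quantized_int8":
        # Half-width values; scale/zero-point overhead ignored here.
        return KV_BYTES_FP16 // 2
    if scheme == "delta_vs_reference":
        # Hypothetical: a half-size delta PLUS the full-size reference
        # it decodes against, so random recall touches *more* bytes
        # than vanilla even though *storage* is smaller.
        delta = KV_BYTES_FP16 // 2
        reference = KV_BYTES_FP16
        return delta + reference
    raise ValueError(scheme)

for s in ("vanilla_fp16", "quantized_int8", "delta_vs_reference"):
    print(s, bytes_read_per_token(s))
```

Under these assumptions the reference-based scheme stores less but reads more per random access, which is exactly the trade-off described above: storage shrinks, so more of the cache fits in fast memory, but each decode touches extra bytes.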