Immediately after switching pages, it works with CSR.
Please reload your browser to see how it works.

Source: https://github.com/SoraKumo001/next-streaming

Multi-Token Attention
kouteiheika 18 hours ago
This is another potential improvement to the transformer architecture from Facebook (the other one that comes to mind is this one from the same authors: https://arxiv.org/abs/2405.18719), but note that it comes with a major problem that might not be obvious at first glance: it's just not usable in practice without a ton of work. It modifies the innards of the attention mechanism, so it is incompatible with Flash Attention (or any other optimized attention library), and you do not want to train anything beyond toy models without Flash Attention (the performance hit is just way too big).

There's PyTorch's FlexAttention, which could maybe make this practical, but currently it's just way too buggy.
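For concreteness, here is a minimal sketch (assuming PyTorch >= 2.5 and its torch.nn.attention.flex_attention API) of how a custom in-kernel score modification is expressed with FlexAttention. The distance bias below is purely illustrative; it is not the Multi-Token Attention convolution itself:

    import torch
    from torch.nn.attention.flex_attention import flex_attention

    device = "cuda" if torch.cuda.is_available() else "cpu"
    B, H, S, D = 2, 8, 1024, 64  # batch, heads, sequence length, head dim
    q, k, v = (torch.randn(B, H, S, D, device=device) for _ in range(3))

    def distance_bias(score, b, h, q_idx, kv_idx):
        # Illustrative only: penalize attention to distant positions.
        return score - 0.01 * (q_idx - kv_idx).abs()

    out = flex_attention(q, k, v, score_mod=distance_bias)  # (B, H, S, D)

The idea is that the score_mod gets fused into the attention kernel when run under torch.compile, instead of forcing you to materialize and modify the full attention matrix by hand.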


bionhoward 22 hours ago
How does this compare with the Byte Latent Transformer [1]? This one applies convolution post-embedding, while BLT applies attention at embedding time?

1. https://ai.meta.com/research/publications/byte-latent-transf...


bigdict 23 hours ago
Sure, you can get better model performance by throwing more compute at the problem in different places. Does it improve perf on an isoflop basis?

bob1029 22 hours ago
So, we're proposing a multiplicative increase of something that already scales quadratically with the context size?

I think we've already got a bit of a bottleneck in terms of memory bandwidth utilization.
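As a rough back-of-envelope (illustrative numbers, not from the paper), the attention score matrix alone already grows quadratically with context length, before any extra multiplier on top:

    # Memory needed just to materialize fp16 attention scores for one layer.
    heads, bytes_per_elem = 32, 2
    for ctx in (4_096, 32_768, 131_072):
        scores_gib = heads * ctx * ctx * bytes_per_elem / 2**30
        print(f"{ctx:>7} tokens -> ~{scores_gib:,.0f} GiB of scores per layer")

Optimized kernels avoid materializing that matrix at all, which is part of why losing compatibility with them hurts.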


rakejake 16 hours ago
Interesting. So they convolve the q, k, v vectors? I have been trying the opposite.

I have been working on a classification problem on audio data (with a context size somewhere between 1000 and 3000, with the potential to expand later), experimenting with adding attention on top of a CNN.

I tried training a vanilla transformer, but at the sizes I am aiming for (5-30M parameters) the training is incredibly unstable and doesn't reach the performance of an LSTM.

So I went back to CNNs, which are fast to train but don't achieve the losses of LSTMs (which are much slower to train, and for larger context sizes you run into the vanishing gradient problem). A CNN-GRU hybrid worked much better, giving me my best result.

The GRU layer I used had a size of 512. For increasing context sizes, I'd have to make the convolutional layers deeper so that the GRU doesn't grow too large. Instead, I decided to swap the GRU out for a MultiHeadAttention layer. The results are great: better than the CNN-GRU (my previous best). Plus, for equivalent sizes the model is faster to train, though it hogs a lot of memory.
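For context, a minimal PyTorch sketch of this kind of CNN front-end followed by a multi-head attention layer; all sizes and layer choices here are illustrative assumptions, not the exact model described above:

    import torch
    import torch.nn as nn

    class ConvAttentionClassifier(nn.Module):
        def __init__(self, n_features=64, n_classes=10, d_model=256, n_heads=4):
            super().__init__()
            # CNN front-end: downsamples the time axis before attention.
            self.conv = nn.Sequential(
                nn.Conv1d(n_features, d_model, kernel_size=5, stride=2, padding=2),
                nn.ReLU(),
                nn.Conv1d(d_model, d_model, kernel_size=5, stride=2, padding=2),
                nn.ReLU(),
            )
            # Self-attention over the (shortened) sequence, in place of a GRU.
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.norm = nn.LayerNorm(d_model)
            self.head = nn.Linear(d_model, n_classes)

        def forward(self, x):                 # x: (batch, n_features, time)
            h = self.conv(x).transpose(1, 2)  # (batch, time/4, d_model)
            a, _ = self.attn(h, h, h)         # self-attention over time steps
            h = self.norm(h + a)
            return self.head(h.mean(dim=1))   # mean-pool over time, then classify

    model = ConvAttentionClassifier()
    logits = model(torch.randn(8, 64, 2000))  # 8 clips, 64 features, 2000 frames
    print(logits.shape)                       # torch.Size([8, 10])

Downsampling with the convolutions before attention keeps the attended sequence length, and hence the quadratic attention cost, manageable.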