Immediately after a page transition, it runs with CSR; reload your browser to see how it works.
I think we're already somewhat bottlenecked on memory bandwidth.
There's PyTorch's FlexAttention, which could maybe make this practical (rough sketch below), but currently it's just way too buggy.
1. https://ai.meta.com/research/publications/byte-latent-transf...
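For anyone curious, here's roughly what the FlexAttention API looks like. The patch-local mask over fixed 16-token patches is just my own illustration of the kind of custom sparsity you'd want for byte-patch models, not the actual scheme from the paper, and it assumes a GPU and a recent PyTorch (2.5+).

```python
# Sketch of FlexAttention with a custom block mask (PyTorch >= 2.5).
# The fixed-size "patches" here are a hypothetical illustration, not the BLT patching scheme.
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

device = "cuda"  # flex_attention is primarily exercised on GPU
B, H, S, D = 2, 8, 1024, 64
q = torch.randn(B, H, S, D, device=device)
k = torch.randn(B, H, S, D, device=device)
v = torch.randn(B, H, S, D, device=device)

# Hypothetical per-token patch ids (e.g. produced by some byte-patching step).
patch_id = torch.arange(S, device=device) // 16

def patch_local_mask(b, h, q_idx, kv_idx):
    # Allow attention only between positions in the same patch.
    return patch_id[q_idx] == patch_id[kv_idx]

block_mask = create_block_mask(patch_local_mask, B=None, H=None,
                               Q_LEN=S, KV_LEN=S, device=device)
out = flex_attention(q, k, v, block_mask=block_mask)  # (B, H, S, D)
```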
I have been working on a classification problem on audio data (with a context size somewhere between 1000 and 3000, with the potential to expand later), experimenting with adding attention on top of a CNN.
I tried training a vanilla transformer, but at the sizes I am aiming for (5-30M parameters) training is incredibly unstable and doesn't reach the performance of an LSTM.
So I went back to CNNs, which are fast to train but don't reach the losses of LSTMs (which are much slower to train, and at higher context sizes you run into the vanishing gradient problem). A CNN-GRU hybrid worked much better, giving me my best result.
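A rough sketch of the kind of CNN-GRU hybrid I mean: a strided conv stem that downsamples the sequence, feeding a GRU whose last hidden state goes to the classifier head. The channel counts, strides, and number of classes are illustrative placeholders, not my exact model.

```python
import torch
import torch.nn as nn

class CNNGRUClassifier(nn.Module):
    def __init__(self, in_channels=1, num_classes=10, gru_size=512):
        super().__init__()
        # Strided convs downsample the sequence before the recurrent layer.
        self.conv = nn.Sequential(
            nn.Conv1d(in_channels, 128, kernel_size=7, stride=2, padding=3),
            nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, 256, kernel_size=5, stride=2, padding=2),
            nn.BatchNorm1d(256), nn.ReLU(),
        )
        self.gru = nn.GRU(256, gru_size, batch_first=True)
        self.head = nn.Linear(gru_size, num_classes)

    def forward(self, x):            # x: (batch, channels, time)
        h = self.conv(x)             # (batch, 256, time / 4)
        h = h.transpose(1, 2)        # (batch, time / 4, 256)
        _, last = self.gru(h)        # last hidden state: (1, batch, gru_size)
        return self.head(last[-1])   # (batch, num_classes)
```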
The GRU layer I used had a size of 512. For increasing context sizes, I'd have to make the convolutional layers deeper so as not to let the GRU grow too large. Instead, I decided to swap the GRU out for a MultiHeadAttention layer. The results are great - better than the CNN-GRU (my previous best). Plus, for equivalent sizes the model is faster to train, though it hogs a lot of memory.
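The attention variant is the same conv stem with the recurrent layer swapped for nn.MultiheadAttention; the mean-pooling over time and the exact sizes here are again illustrative rather than my exact setup.

```python
import torch
import torch.nn as nn

class CNNAttnClassifier(nn.Module):
    def __init__(self, in_channels=1, num_classes=10, d_model=256, num_heads=8):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_channels, 128, kernel_size=7, stride=2, padding=3),
            nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, d_model, kernel_size=5, stride=2, padding=2),
            nn.BatchNorm1d(d_model), nn.ReLU(),
        )
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, x):                      # x: (batch, channels, time)
        h = self.conv(x).transpose(1, 2)       # (batch, time / 4, d_model)
        a, _ = self.attn(h, h, h)              # self-attention over conv features
        h = self.norm(h + a)                   # residual + norm
        return self.head(h.mean(dim=1))        # mean-pool over time, then classify
```

Mean-pooling is the simplest readout; a learned query token or attention pooling would also work, but the memory cost mostly comes from the full self-attention over the downsampled sequence either way.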