As for RAG, I haven't noticed LLMs struggling with poorly structured text (e.g. the YouTube wall of text blob can just be fed directly into LLMs), though I haven't measured this.
In fact, my own "webgrep" (convert the top 10 search results into text and run grep on them, optionally followed by an LLM summary) works at the byte level. I gave up on chunking by words, sentences and paragraphs entirely: I just shove the 1 KB before and after the match into the context. This works fine because LLMs just ignore the "mutilated" word parts at the beginning and end.
The only downside of this approach is that if I were the LLM, I would probably be unhappy with my job!
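For anyone curious, the byte-level windowing above can be sketched roughly like this. It is not the actual webgrep code: the function name `byte_windows`, the `radius` parameter, and the choice to skip matches that already fall inside the previous window are all my own assumptions for illustration.

```python
import re


def byte_windows(data: bytes, pattern: bytes, radius: int = 1024) -> list[str]:
    """Return the text around each regex match, cut at raw byte offsets.

    Hypothetical helper: windows are sliced on bytes, not on word or
    sentence boundaries, so the edges may contain partial words or
    partial multi-byte characters.
    """
    snippets = []
    last_end = -1
    for m in re.finditer(pattern, data):
        if m.start() < last_end:
            continue  # this match is already inside the previous window
        start = max(0, m.start() - radius)
        end = min(len(data), m.end() + radius)
        # errors="replace" keeps a character that was cut in half
        # from raising UnicodeDecodeError.
        snippets.append(data[start:end].decode("utf-8", errors="replace"))
        last_end = end
    return snippets


if __name__ == "__main__":
    page = ("lorem ipsum " * 300 + "chunking" + " dolor sit" * 300).encode()
    for snippet in byte_windows(page, rb"chunking", radius=40):
        print(repr(snippet))
```

The snippets would then be concatenated into the prompt as-is, mutilated edges included.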
As for semantic chunking (whether to maximize the relevance of what goes into the LLM, or as a semantic search for the user), I haven't solved it yet, but I can share one amusing experiment. To find the relevant part of the text (having already retrieved a mostly relevant big chunk), chop off one sentence at a time and re-run the similarity check! That way you "distil" the text down to the part that is most relevant (according to the embedding model) to the user query.
This is very slow and stupid, especially in real-time (though kinda fun to watch), but kinda works for the "approximately one sentence answers my question" scenario. A much cheaper approximation here would just be to embed at the sentence level as well as the page/paragraph level.
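A rough sketch of both ideas, under some loud assumptions: `toy_embed` is a bag-of-words stand-in for a real embedding model, and "chop off one sentence at a time" is interpreted here as trimming from either end of the chunk and keeping whichever trimmed version scores higher. The names `distil` and `best_sentence` are made up for this example.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    # Stand-in for a real embedding model: plain word counts.
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def distil(sentences: list[str], query: str, embed=toy_embed) -> list[str]:
    """Slow version: drop the first or last sentence while similarity improves."""
    q = embed(query)

    def score(span: list[str]) -> float:
        return cosine(embed(" ".join(span)), q)

    current = list(sentences)
    best = score(current)
    while len(current) > 1:
        trimmed = max([current[1:], current[:-1]], key=score)
        trimmed_score = score(trimmed)
        if trimmed_score <= best:
            break  # chopping more no longer helps
        current, best = trimmed, trimmed_score
    return current


def best_sentence(sentences: list[str], query: str, embed=toy_embed) -> str:
    """Cheap version: embed every sentence once and take the closest one."""
    q = embed(query)
    return max(sentences, key=lambda s: cosine(embed(s), q))


if __name__ == "__main__":
    chunk = [
        "The weather was nice today.",
        "Chonky splits text into paragraphs.",
        "I had pasta for lunch.",
    ]
    print(distil(chunk, "how does chonky split text"))
    print(best_sentence(chunk, "how does chonky split text"))
```

The slow version re-embeds the whole remaining span on every step, which is where the cost comes from; the cheap one needs one embedding per sentence, and those can be computed ahead of time.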
The idea was that paragraphs are naturally how we segment distinct thoughts in text, and would translate well to segmenting long video clips. It actually worked pretty well! It was able to predict the paragraph breaks in many texts that it wasn’t trained on at all.
The problems at the time were around context length and dialog style formatting.
I wanted to approach the problem in a less brute-force way, maybe by using sentence embeddings and calculating the probability of a sentence being a “paragraph ending” sentence, which would likely result in a much smaller model.
Anyway, this is really cool! I’m excited to dive further into what you’ve done!
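To make the "paragraph ending" framing concrete, here is a minimal sketch of how the training labels could be built from ordinary text. This is only my guess at a data-prep step, not something I actually ran: the function name, the blank-line paragraph convention, and the naive regex sentence splitter are all assumptions.

```python
import re


def label_paragraph_endings(text: str) -> list[tuple[str, int]]:
    """Pair each sentence with 1 if it closes its paragraph, else 0.

    Assumes paragraphs are separated by blank lines and that sentences
    end in '.', '!' or '?' followed by whitespace (a naive splitter).
    """
    examples = []
    for paragraph in re.split(r"\n\s*\n", text.strip()):
        sentences = [
            s.strip()
            for s in re.split(r"(?<=[.!?])\s+", paragraph.strip())
            if s.strip()
        ]
        for i, sentence in enumerate(sentences):
            examples.append((sentence, int(i == len(sentences) - 1)))
    return examples


if __name__ == "__main__":
    sample = "First thought. Still the first thought.\n\nA new thought!"
    for sentence, is_ending in label_paragraph_endings(sample):
        print(is_ending, sentence)
```

Each sentence would then be embedded and fed to a small binary classifier, with the predicted probability of label 1 acting as the break score.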
It uses a similar approach but the focus is on sentence/paragraph segmentation generally and not specifically focused on RAG. It also has some benchmarks. Might be a good source of inspiration for where to take chonky next.