Immediately after switching the page, it will work with CSR.
Please reload your browser to see how it works.

michelpp 5 daysReload

PageRank is one of several interesting centrality metrics that could be applied to a graph to influence RAG on structural data, another one is Triangle Centrality which counts triangles around nodes to figure out their centrality based on the concept that triangles close relationships into a strong bond, where open bonds dilute centrality by drawing weight away from the center:

The paper shows high efficiency compared to other centralities like PageRank, however in some research using the GraphBLAS I and my coauthors found that TC was slower on a variety of sparse graphs than our sparse formulation of PR for graphs up to 1.8 billion edges, but that TC appears to scale better as graphs get larger and is likely more efficient in the trillion edge realm.

https://fossies.org/linux/SuiteSparse/GraphBLAS/Doc/The_Grap...

LASR 5 daysReload

So I've done a ton of work in this area.

Few learnings I've collected:

1. Lexical search with BM25 alone gives you very relevant results if you can do some work during ingestion time with an LLM.

2. Embeddings work well only when the size of the query is roughly on the same order of what you're actually storing in the embedding store.

3. Hypothetical answer generation from a query using an LLM, and then using that hypothetical answer to query for embeddings works really well.

So combining all 3 learnings, we landed on a knowledge decomposition and extraction step very similar to yours. But we stick a metaprompter to essentially auto-generate the domain / entity types.

LLMs are naively bad at identifying the correct level of granularity for the decomposed knowledge. One trick we found is to ask the LLM to output a mermaid.js mindmap to hierarchically break down the input into a tree. At the end of that output, ask the LLM to state which level is the appropriate root for a knowledge node.

Then the node is used to generate questions that could be answered from the knowledge contained in this node. We then index the text of these questions and also embed them.

You can directly match the user's query from these questions using purely BM25 and get good outputs. But a hybrid approach works even better, though not by that much.

Not using LLMs are query time also means we can hierarchically walk down the root into deeper and deeper nodes, using the embedding similiarity as a cost function for the traversal.

inboulder 4 daysReload

PageRank for better centrality seems neat, but it still doesn't address the probably unsolvable flaw with RAG, the reason why RAG basically can't work. All RAG DBs under-perform expectations because RAG fundamentally can't find relationships between words necessary to find the information the user cares about. Weird right, isn't this what the 'attention' mechanism is supposed to be good for? It just isn't good enough.

Example: Say you're searching an article and you want to know what occupation a mentioned person has, let's say the person 'Sharon,' is mentioned to have attended several physical chemistry conferences but her occupation is never explicitly mentioned. There's a very good chance every single rag approach will fail to return correct results, will fail to make this connection between 'occupation' attends conference, type of conference and infers 'chemist'. I could go on, but this sort of error is pervasive along all types of information when trying to retrieve with RAG. In the end, solutions like the above seem to just sort of reinvent other query methods, SQL, pagerank etc, with extra steps... there's little point in vectorization at that point...

AIorNot 5 daysReload

This is very cool, I signed up and uploaded a few docs (PDFs) to the dashboard

Our Use case: We have been looking at farming out this work (analyzing complaince documents (manufacturing paperwork) for our AI Startup however we need to understand the potential scale this can operate under and the cost model for it to be useful to us

We will have about 300K PDF documents per client and expect about a 10% change in that document set, month to month -any GraphRag system has to handle documents at scale - we can use S3 as an igestion mechanism but have to understand the cost and processing time needed for the system to be ready to use duiring:

1. inital loading 2. regular updates -how do we delete data from system for example

cool framework btw..

anotherpaulg 5 daysReload

Super interesting, thanks for sharing. How large a corpus of domain specific text do you need to obtain a useful knowledge graph?

Aider has been doing PageRank on the call graph of code repos since forever. All non trivial code has lots of graph structure to support PageRank. So it works really well to find the most relevant context in the project related to the currently active task.

https://aider.chat/docs/repomap.html#optimizing-the-map