Meta and Princeton propose the ultimate solution for LLM long context! Let the model become an autonomous agent and read the context node tree by itself
Original source: Shin Ji Yuan
What is the ultimate solution for LLM long context models?
A solution recently proposed by researchers at Princeton University and Meta AI is to treat the LLM as an interactive agent and let it decide, through iterative prompting, how to read the text itself.
They designed a system called MemWalker that can process long contexts into a tree of summary nodes.
When a query is received, the model traverses this node tree to find relevant information and responds once it has collected enough. On long-text question answering tasks, this method significantly outperforms baselines that use long context windows, recursion, and retrieval.
LeCun also tweeted support for their research.
First, the memory tree needs to be built:
The long text is sliced into segments, each of which is summarized into a node. These summary nodes are further summarized into higher-level nodes, until finally reaching the root.
After receiving the query, the LLM navigates the tree to find relevant information and respond appropriately. The LLM accomplishes this through reasoning: working toward an answer, choosing to go further down one path, or realizing it has been misled and backtracking the way it came.
The effectiveness of MemWalker rests on two key stages, described below.
The research team studies long-context question answering: given a long text x and a query q, the model's goal is to generate a response r.
MemWalker follows two steps:
Memory tree construction, where the long context is split into a tree-shaped data structure. Construction does not depend on the query, so if the sequence data is available beforehand, the tree can be computed in advance.
Navigation, where the model navigates this structure when it receives a query, gathering information to formulate an appropriate response.
MemWalker assumes access only to an underlying LLM and implements both construction and navigation by iteratively prompting it.
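To make the construction step concrete, here is a minimal Python sketch under stated assumptions: `summarize` stands in for a call to the underlying LLM with a summarization prompt, the text is sliced by characters rather than tokens for simplicity, and the default segment length and branching factor are placeholders rather than the paper's exact settings (those appear in the experimental configuration below).

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class Node:
    summary: str                      # LLM-generated summary of this node's content
    text: Optional[str] = None        # raw segment text (leaf nodes only)
    children: List["Node"] = field(default_factory=list)

def build_memory_tree(long_text: str,
                      summarize: Callable[[str], str],
                      seg_len: int = 1000,
                      max_children: int = 8) -> Node:
    """Build a tree of summary nodes bottom-up from a long text.
    `summarize` is a placeholder callable that prompts the underlying LLM."""
    # 1) Slice the long text into segments and summarize each into a leaf node.
    segments = [long_text[i:i + seg_len] for i in range(0, len(long_text), seg_len)]
    nodes = [Node(summary=summarize(seg), text=seg) for seg in segments]
    # 2) Repeatedly group nodes and summarize each group into a higher-level node
    #    until only the root remains.
    while len(nodes) > 1:
        parents = []
        for i in range(0, len(nodes), max_children):
            group = nodes[i:i + max_children]
            joined = "\n".join(n.summary for n in group)
            parents.append(Node(summary=summarize(joined), children=group))
        nodes = parents
    return nodes[0]
```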
Navigation
After receiving the query q, the language model begins navigation at the root node. At each non-leaf node it traverses, the LLM reads the summaries of the child nodes and decides which child to enter next, or whether to return to the parent. When it reaches a leaf node, the LLM reads the underlying segment and decides whether it has collected enough information to commit to an answer (i.e., generate a response), or whether to go back up the tree and continue searching.
To make navigation decisions, the research team also asks the LLM to first generate a natural-language justification for the action, followed by the action choice itself.
Specifically, at each node the model generates a response r ∼ LLM(r | s, q), where the response is one of two tuples: 1) r = (reasoning, action, answer) when the LLM is at a leaf node, or 2) r = (reasoning, action) when the LLM is at a non-leaf node.
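Putting the navigation procedure together, here is a minimal sketch of the traversal loop, reusing the `Node` class (and typing imports) from the construction sketch above. The `decide` callable is a hypothetical stand-in for prompting the LLM and parsing its reply into the (reasoning, action[, answer]) tuple described here; it is not the paper's actual prompt code.

```python
def navigate(root: Node, query: str, decide) -> str:
    """Sketch of MemWalker-style navigation. `decide(node, query, memory, is_leaf)`
    is assumed to prompt the LLM and return a dict with keys "reasoning",
    "action", and (at leaf nodes) "answer"."""
    path = [root]            # navigation trail, starting at the root
    memory: List[str] = []   # working memory: summaries gathered along the path
    while path:
        node = path[-1]
        is_leaf = not node.children
        resp = decide(node, query, memory, is_leaf)
        if is_leaf:
            if resp["action"] == "answer":
                return resp["answer"]          # enough information was found
            path.pop()                         # go back up and keep searching
        elif resp["action"] == "parent":
            path.pop()                         # wrong branch: revert to the parent
        else:
            child = node.children[int(resp["action"])]  # action names a child index
            memory.append(child.summary)       # carry the chosen summary forward
            path.append(child)
    return "No Answer"
```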
Navigation Prompt Design
The research team drives navigation with zero-shot prompts. Two types of prompts are needed:
The triage prompt, shown at non-leaf nodes, presents the child-node summaries together with the query (and options) and asks the LLM to choose which child to enter or to return to the parent. The leaf prompt contains the segment content, the query (and options), and instructions requiring the LLM to either generate an answer or return to the parent node.
Both the triage prompt and the leaf prompt specify the output format the LLM must follow. Output that fails to adhere to the format counts as an invalid action, and the LLM must regenerate. If the LLM fails to produce parsable output three times in a row, navigation terminates and returns "No Answer".
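A minimal sketch of this validate-and-retry logic, assuming a simple "Reasoning: / Action: / Answer:" line format; the actual output format markers used in the paper's prompts may differ.

```python
import re

MAX_RETRIES = 3   # three unparsable outputs in a row terminate navigation

def parse_response(raw: str, is_leaf: bool):
    """Hypothetical parser for lines like 'Reasoning: ...', 'Action: ...' and,
    at leaf nodes, 'Answer: ...'. Returns None when the format is violated."""
    reasoning = re.search(r"Reasoning:\s*(.+)", raw)
    action = re.search(r"Action:\s*(\S+)", raw)
    answer = re.search(r"Answer:\s*(.+)", raw) if is_leaf else None
    if not (reasoning and action):
        return None
    return {"reasoning": reasoning.group(1),
            "action": action.group(1),
            "answer": answer.group(1) if answer else None}

def generate_valid_response(llm, prompt: str, is_leaf: bool):
    """Ask the LLM to regenerate on invalid actions; give up after MAX_RETRIES."""
    for _ in range(MAX_RETRIES):
        parsed = parse_response(llm(prompt), is_leaf)
        if parsed is not None:
            return parsed
    return None   # caller treats this as "No Answer"
```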
Working Memory
As the LLM traverses the tree, it keeps the information gathered along the navigation trail and adds it to its context as working memory.
To be precise, the LLM generates a response r ∼ LLM(r | s, q, m), where m is the additional working memory.
The research team truncates the working memory so that it fits into the LLM's context window.
The table above also shows how the working memory is added to the prompt.
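A minimal sketch of how such a working memory might be maintained. The token budget and the choice of dropping the oldest entries first are assumptions for illustration, not details reported in the article; `tokenizer` is any object with an `encode` method.

```python
def update_working_memory(memory: list, new_summary: str, tokenizer,
                          budget: int = 2048) -> list:
    """Append the newly visited node's summary, then drop the oldest entries
    until the memory fits the token budget, so the full prompt still fits the
    LLM's 4,096-token context window."""
    memory = memory + [new_summary]
    while memory and sum(len(tokenizer.encode(m)) for m in memory) > budget:
        memory.pop(0)   # truncate from the oldest end (assumed policy)
    return memory
```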
Experimental configuration
Datasets and Assessments
The research team used three datasets: QuALITY, SummScreenFD, and GovReport, all drawn from the SCROLLS benchmark. Accuracy is reported on all datasets.
QuALITY
QuALITY is a multiple-choice question and answer dataset.
The dataset contains long-form stories from Project Gutenberg and questions annotated by human annotators. The research team experimented using a subset of 187 examples.
SummScreenFD
SummScreenFD is a dataset of TV and movie scripts originally designed for summarization.
These scripts are presented as dialogue between characters. The research team converted this dataset into a question-answering task: for each example, a "who" question was generated with Stable Beluga 2 from the provided ground-truth summary text and then checked by a human expert.
Pairing each question with the original long text yields 306 examples for this repurposed QA task.
GovReport
The GovReport dataset brings together documents from the Congressional Research Service and the U.S. Government Accountability Office, as well as summaries provided by experts.
The research team converted this dataset into a question-and-answer dataset with 101 examples in the same way as SummScreenFD.
All three datasets feature long contexts of varying lengths, with some shorter examples and some much longer sequences.
Therefore, the research team reports results both on the original datasets and on a subset of the longer sequences in each task, to better assess memory access in harder, longer-context settings.
The length thresholds are 8,000 tokens for QuALITY, 6,000 tokens for SummScreenFD, and 12,000 tokens for GovReport.
Model
The research team used Stable Beluga 2 as the base LLM in most of their experiments because it offers state-of-the-art performance compared to several other LLM variants, as shown later.
Stable Beluga 2 is a 70B instruction-tuned model based on LLaMA-2, whose fine-tuning data does not overlap with the research team's evaluation tasks.
It has a maximum context length of 4,096 tokens. The research team used the model zero-shot, without further fine-tuning and without providing in-context few-shot examples of the task.
The research team used top-p sampling both for memory tree construction and for generating navigation actions and reasoning.
The research team set the maximum number of nodes M_t to 8, 5, and 8 and the segment size |c| to 1,000, 1,000, and 1,200 tokens for QuALITY, SummScreenFD, and GovReport, respectively.
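These per-dataset settings can be collected into a small configuration sketch. The sampling values below are hypothetical placeholders: the article only states that top-p sampling is used, without giving exact parameters.

```python
from dataclasses import dataclass

@dataclass
class MemWalkerConfig:
    max_nodes: int      # maximum number of nodes M_t summarized into one parent
    segment_size: int   # segment size |c| in tokens

# Per-dataset settings reported in the article.
CONFIGS = {
    "QuALITY":      MemWalkerConfig(max_nodes=8, segment_size=1000),
    "SummScreenFD": MemWalkerConfig(max_nodes=5, segment_size=1000),
    "GovReport":    MemWalkerConfig(max_nodes=8, segment_size=1200),
}

# Hypothetical zero-shot sampling settings (placeholders, not reported values).
SAMPLING = {"top_p": 0.9, "temperature": 0.7, "max_new_tokens": 256}
```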
Benchmark
The research team compared MemWalker against three baseline memory techniques built on the same underlying LLM, Stable Beluga 2:
Full context window
Recursion
Retrieval
The full-context-window baseline uses all 4,096 tokens to process the long input text and the generation. Because instances in the datasets often exceed the context limit, the research team truncated the input, keeping either the rightmost (closest to the end) or the leftmost (furthest from the end) portion of the text, and evaluated both variants.
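A minimal sketch of the two truncation variants, operating on an already-tokenized input:

```python
def truncate(tokens: list, limit: int = 4096, keep: str = "right") -> list:
    """Keep either the rightmost or the leftmost `limit` tokens of a long input,
    mirroring the two truncation variants of the full-context baseline."""
    if len(tokens) <= limit:
        return tokens
    return tokens[-limit:] if keep == "right" else tokens[:limit]
```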
For the retrieval baseline, the research team used Contriever (Izacard et al., 2022) to select passages from the long context based on the query. The highest-scoring passages are concatenated into the LLM's input context until it is full.
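A sketch of how such query-passage scoring with Contriever might look, assuming the publicly available `facebook/contriever` checkpoint and the usual mean-pooling recipe; the article does not specify these implementation details.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed checkpoint name; the article only says "Contriever" is used.
tok = AutoTokenizer.from_pretrained("facebook/contriever")
enc = AutoModel.from_pretrained("facebook/contriever")

def embed(texts):
    """Mean-pool token embeddings over the attention mask (standard Contriever usage)."""
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    out = enc(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1)
    return (out * mask).sum(1) / mask.sum(1)

def rank_passages(query: str, passages: list) -> list:
    """Return passages ordered by dot-product similarity to the query."""
    with torch.no_grad():
        q, p = embed([query]), embed(passages)
    scores = (p @ q.T).squeeze(-1)
    return [passages[int(i)] for i in scores.argsort(descending=True)]
```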
Finally, the research team implemented a recursion baseline that carries a summary of the preceding text forward into the processing of each new segment, where each segment is 2,500 tokens and the summary is at most 500 tokens.
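A minimal sketch of that recursion baseline; the prompt wording and the `llm` callable are placeholders for illustration.

```python
def recursive_answer(segments, query, llm, summary_limit=500):
    """Fold a running summary of earlier segments into each new segment
    (each ~2,500 tokens), then answer the query from the final summary."""
    summary = ""
    for seg in segments:
        summary = llm(f"Summary so far:\n{summary}\n\nNew text:\n{seg}\n\n"
                      f"Update the summary in at most {summary_limit} tokens.")
    return llm(f"Summary:\n{summary}\n\nQuestion: {query}\nAnswer:")
```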
Results & Analysis
Key Results
Table 2 below shows a comparison between MEMWALKER and other baselines.
MemWalker outperforms the recursion baseline, which exposes the limitation of recursion: information relevant to the query is lost after a few summarization steps.
MemWalker also surpasses retrieval, where the passages come from one coherent long-form story rather than from separate documents.
On these tasks, the full-context baseline can perform well in the "original" setting, which contains relatively short sequences, although whether left or right truncation works best appears to depend on the dataset.
However, with the exception of the keep-right variant on QuALITY and the keep-left variant on GovReport, MemWalker achieves higher performance in the original setting than the full-context baseline; those exceptions may reflect positional bias in the datasets, where the relevant passages tend to appear at the beginning or end of the text.
On the long versions of all three tasks, MemWalker exceeded all baselines, i.e. it showed strong performance precisely where memory access becomes more critical.
MEMWALKER also surpasses other publicly available models, including LongChat and MPT.
When the text length is shorter, MemWalker trails the full-context baseline (with left or right truncation), but it outperforms both truncation variants on longer sequences for all tasks.
The benefit of interactive reading becomes apparent as text length grows: MemWalker pulls ahead once the sequence length significantly exceeds the LLM's 4,096-token context length.
Reasoning is essential for memory tree navigation.
The effectiveness of MemWalker is highly dependent on the reasoning capabilities of the underlying LLM. For each navigation decision, the research team used a prompt that asks the LLM to first generate a natural-language justification for the next predicted action, as shown in Table 1 below.
Stable Beluga 2 outperformed Llama 2 Chat of the same size and also showed stronger reasoning capabilities.
For Stable Beluga 2, requiring reasoning justifications improves performance on all tasks. This highlights a key property of MemWalker: once the LLM passes a critical reasoning-capability threshold, it can reason over long inputs across multiple rounds without errors quickly compounding between rounds.
For weak LLMs that fail to make good navigation decisions, errors can accumulate and overall performance is impaired.
As LLMs' reasoning capabilities continue to improve in the coming years, the research team expects methods like MemWalker to become more effective.
Working memory is required to navigate the memory tree. As MemWalker traverses the memory tree and reads the relevant passages, it may lose track of the overall context.
Therefore, the model carries information from the node along the navigation path as working memory, where the contents of the working memory are updated when the model chooses the next path.
The research team evaluated MemWalker's performance with and without working memory; the results are shown in Figure 3 below.
MEMWALKER can recover from the wrong path.
When MemWalker navigates the memory tree, it not only needs to find its way to the most relevant passages, but may also need to recover from retrieval errors made along the way.
The research team presents the recovery statistics in Table 4 below. MemWalker performs recovery navigation operations (i.e., changes path) on roughly 15%-20% of the samples; in these cases it is often able to recover and answer correctly on QuALITY, and does so about 60% of the time on SummScreenFD and about 80% on GovReport.
The research team also reports the average percentage of the long context that is read across all examples, shown in Figure 4 below for each of the three tasks. On average, only 63-69% of the text, including the contents of tree nodes, needs to be read to answer the questions.
Trade-offs for memory tree construction
When building the memory tree, a fundamental trade-off arises: summarizing larger segments into each node reduces the depth of the tree, but may lose fidelity in the content.
Similarly, connecting many lower-level nodes to the node above helps flatten the tree, but may make the LLM's navigation task at each node more difficult.
Figure 5 below shows the performance of different configurations of the memory tree on QuALITY. Summarizing larger paragraphs is often more beneficial than summarizing smaller paragraphs and connecting more child nodes to the parent node.
However, performance plateaued as the maximum number of nodes increased, showing the trade-off of how much information can be packed into nodes during memory tree construction.