Meta and Princeton propose the ultimate solution for LLM long context: let the model act as an autonomous agent and read a tree of context nodes by itself

Original source: Shin Ji Yuan


What is the ultimate solution for LLM long context models?

A solution recently proposed by researchers at Princeton University and Meta AI is to treat the LLM as an interactive agent that decides for itself how to read the text through iterative prompting.

Paper Address:

They designed a system called MemWalker that can process long contexts into a tree of summary nodes.

When a query is received, the model traverses this node tree to find relevant information and responds once it has collected enough. On long-text question answering tasks, this method significantly outperforms baseline methods that use long context windows, recursion, or retrieval.

LeCun also tweeted support for their research.

MemWalker consists of two main parts:

The first is memory tree construction:

The long text is sliced into segments, each summarized into a node; these summary nodes are further summarized into higher-level nodes, until finally a single root node is reached.

The second part is Navigation:

After receiving the query, the LLM navigates through the tree to find relevant information and respond appropriately. It accomplishes this through reasoning: working toward an answer, choosing to go further down one path, or realizing it has been misled and backtracking the way it came.

This navigation process can be implemented with zero-shot prompting and is easily adapted to any given large language model.

The research team showed that, by interactively reading the memory tree it constructs, MemWalker outperforms other long-context baselines as well as retrieval and recursion variants, especially on longer examples.

The effectiveness of MemWalker depends on two key parts:

  1. Working memory size: the LLM has better awareness of the global context when it is allowed to carry more information along the path it traverses.

  2. The reasoning ability of the LLM: MemWalker is effective once the LLM crosses a reasoning-capability threshold; below that threshold, the error rate during navigation is high.

MemWalker: An Interactive Reader

The research team investigates the task of long-context question answering: given a long text x and a query q, the model's goal is to generate a response r.

MemWalker follows two steps:

  1. Memory tree construction, where the long context is split into a tree-shaped data structure. This construction does not depend on the query, so if the sequence data is available beforehand, the tree can be computed in advance.

  2. Navigation, where the model navigates this structure when it receives a query, gathering information to formulate an appropriate response.

MemWalker assumes access to an underlying LLM and implements both construction and navigation by iteratively prompting it.
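As a rough illustration of the construction step, here is a minimal Python sketch, assuming a generic `llm_summarize` helper that wraps the underlying LLM with a summarization prompt; the class and function names are illustrative, and segments are split by characters rather than tokens for simplicity.

```python
# Minimal sketch of memory tree construction (illustrative, not the paper's code).
from dataclasses import dataclass, field

@dataclass
class Node:
    summary: str              # LLM-generated summary shown to the navigator
    text: str = ""            # raw segment text (leaf nodes only)
    children: list = field(default_factory=list)

def build_memory_tree(long_text, llm_summarize, seg_size=1000, max_children=8):
    # 1) Split the long context into segments and summarize each into a leaf node.
    #    (Characters are used here for simplicity; the paper segments by tokens.)
    segments = [long_text[i:i + seg_size] for i in range(0, len(long_text), seg_size)]
    nodes = [Node(summary=llm_summarize(seg), text=seg) for seg in segments]

    # 2) Repeatedly group up to `max_children` nodes and summarize their summaries
    #    into a parent node, until a single root remains.
    while len(nodes) > 1:
        parents = []
        for i in range(0, len(nodes), max_children):
            group = nodes[i:i + max_children]
            joined = "\n".join(child.summary for child in group)
            parents.append(Node(summary=llm_summarize(joined), children=group))
        nodes = parents
    return nodes[0]  # root of the memory tree
```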

Navigation

After receiving the query q, the language model starts from the root node of the tree and navigates it to generate a response.

At the node it is currently traversing, the LLM observes the summaries of the nodes one level below. It then decides among (number of child nodes + 1) actions: select one of the child nodes for further inspection, or return to the parent node.

At a leaf node, the LLM can decide on one of two actions: commit to the leaf node and respond to the query, or, if the information contained in the leaf node (i.e. the underlying text segment) is insufficient, return to the parent node.

To make navigation decisions, the research team also prompt the LLM to first generate a justification for the action in natural language, followed by the action choice itself.

Specifically, at each node, the model generates a response r ∼ LLM(r | s, q), where the response is one of two tuples: 1) r = (reasoning, action, answer) when the LLM is at a leaf node, or 2) r = (reasoning, action) when the LLM is at a non-leaf node.
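The navigation loop described above can be sketched roughly as follows; `llm_navigate` is an assumed helper that prompts the LLM and returns the parsed (reasoning, action, answer) tuple, and the action encoding is illustrative rather than the paper's exact scheme.

```python
# Illustrative navigation loop over the memory tree (not the paper's code).
def navigate(root, query, llm_navigate, max_steps=30):
    path = [root]                          # nodes visited from the root to the current node
    for _ in range(max_steps):
        node = path[-1]
        if node.children:                  # non-leaf node: pick a child or revert
            reasoning, action = llm_navigate(node, query, leaf=False)
            if action == "revert":
                if len(path) > 1:
                    path.pop()             # go back up; at the root there is nowhere to revert
            else:
                path.append(node.children[action])  # action indexes a child summary
        else:                              # leaf node: commit an answer or revert
            reasoning, action, answer = llm_navigate(node, query, leaf=True)
            if action == "commit":
                return answer
            if len(path) > 1:
                path.pop()                 # leaf was insufficient; keep searching
            else:
                break                      # single-node tree with no answer
    return "No Answer"                     # navigation terminated without committing
```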

Navigation Prompt Design

The research team enabled LLM navigation with zero-shot prompting. Two types of prompts are needed:

  1) triage prompts and 2) leaf prompts (highlighted in the table below).

The triage prompt contains the query, the summaries of the child nodes, and instructions the LLM should follow. Triage prompts are used at non-leaf nodes.

The leaf prompt contains the segment content, the query (and answer options), and instructions requiring the LLM either to generate an answer or to return to the parent node.

Both triage and leaf prompts specify the output format the LLM must follow. Failure to adhere to the format results in an invalid action, and the LLM must regenerate its output. If the LLM fails to produce parsable output three times in a row, navigation terminates and returns "No Answer".
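A hedged sketch of what such prompts and the retry logic might look like is shown below; the prompt wording is paraphrased from the description above rather than taken from the paper, and `parse` is an assumed format-checking helper.

```python
# Illustrative prompt templates and retry logic (paraphrased, not the paper's exact prompts).
TRIAGE_PROMPT = """The following passage summaries are numbered.
{numbered_summaries}

Question: {query}
Provide your reasoning, then reply with the number of the most relevant summary,
or reply -1 to go back to the parent node.
Format: Reasoning: ... Action: <number>"""

LEAF_PROMPT = """Read the passage below and answer the question.
Passage: {segment}
Question: {query}
Provide your reasoning, then either give the answer or reply -1 to go back.
Format: Reasoning: ... Action: <answer or -1>"""

def generate_with_retries(llm, prompt, parse, max_tries=3):
    # Regenerate when the output does not follow the required format;
    # after three consecutive failures the caller treats the result as "No Answer".
    for _ in range(max_tries):
        output = llm(prompt)
        parsed = parse(output)     # returns None when the format is invalid
        if parsed is not None:
            return parsed
    return None
```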

Working Memory

As the LLM traverses the tree, it can retain information from its navigation trajectory and add it to the context.

To be precise, the LLM generates a response r ∼ LLM(r | s, q, m) with additional working memory m, which is either empty or contains content from previously visited nodes.

The research team truncated the working memory so that it could fit into the context window of the LLM.

The table above also shows how working memory is incorporated into the prompt.
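The working-memory mechanism can be sketched roughly as follows; the token budget and whitespace-based token counting are assumptions made for brevity, not the paper's implementation.

```python
# Minimal sketch of working memory handling (illustrative; whitespace split
# approximates tokens instead of using the model's tokenizer).
def update_working_memory(memory, visited_node_summary, budget_tokens=1500):
    # Append the summary of the node just visited, then truncate from the left
    # so the working memory still fits in the LLM's context window.
    memory = (memory + "\n" + visited_node_summary).strip()
    tokens = memory.split()
    if len(tokens) > budget_tokens:
        memory = " ".join(tokens[-budget_tokens:])
    return memory

def leaf_prompt_with_memory(segment, query, memory):
    # Working memory (when non-empty) is prepended to the leaf prompt so the
    # model keeps awareness of the global context it has gathered so far.
    prefix = f"Information gathered so far:\n{memory}\n\n" if memory else ""
    return prefix + f"Passage: {segment}\nQuestion: {query}"
```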

Experimental configuration

Datasets and Assessments

The research team used three datasets from the SCROLLS benchmark: QuALITY, SummScreenFD, and GovReport. Accuracy is reported on all three datasets.

QuALITY

QuALITY is a multiple-choice question and answer dataset.

The dataset contains long-form stories from Project Gutenberg and questions annotated by human annotators. The research team experimented using a subset of 187 examples.

SummScreenFD

SummScreenFD is a dataset of TV and movie scripts originally designed for summarization.

These scripts are presented as dialogue between characters. The research team converted the dataset into a question-answering task: the originally provided ground-truth summary text was used to generate a "who" question with Stable Beluga 2, which was then checked by a human expert.

Each question, paired with the original long text, forms an example of the repurposed QA task, yielding 306 examples.

GovReport

The GovReport dataset brings together documents from the Congressional Research Service and the U.S. Government Accountability Office, as well as summaries provided by experts.

The research team converted this dataset into a question-and-answer dataset with 101 examples in the same way as SummScreenFD.

All three datasets are characterized by long contexts of different lengths, some shorter examples and some longer sequences.

Therefore, the research team presented results on both the original dataset and on a subset of the longer sequences contained in each task to better assess memory access in more difficult and longer context situations.

The thresholds are 8,000 tokens for QuALITY, 6,000 tokens for SummScreenFD, and 12,000 tokens for GovReport.

Model

The research team used Stable Beluga 2 as the base LLM in most of their experiments because it offers state-of-the-art performance compared with several other LLM variants, as they later demonstrate.

Stable Beluga 2 is a 70B instruction-tuned model based on LLaMA-2, whose fine-tuning data does not overlap with the research team's evaluation tasks.

It has a maximum context length of 4,096 tokens. The research team used the model in a zero-shot manner, without further fine-tuning and without providing in-context examples of the task.

The research team used top-p sampling both for memory tree construction and for generating the actions and reasoning during navigation.

The research team set the maximum number of nodes max_t M_t = 8, 5, and 8, and the segment size |c| = 1000, 1000, and 1200 tokens for QuALITY, SummScreenFD, and GovReport, respectively.
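Collected in one place, these hyperparameters might be organized as a simple configuration table; the field names are illustrative and not taken from the paper's code.

```python
# Per-dataset hyperparameters reported above (illustrative structure).
MEMWALKER_CONFIG = {
    "QuALITY":      {"max_children": 8, "segment_size_tokens": 1000},
    "SummScreenFD": {"max_children": 5, "segment_size_tokens": 1000},
    "GovReport":    {"max_children": 8, "segment_size_tokens": 1200},
}
```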

Baselines

The research team compared MemWalker against three memory techniques built on the same underlying LLM, Stable Beluga 2:

  1. Full context window

  2. Recursion

  3. Retrieval

The full-context-window baseline uses all 4,096 tokens to process the long input text and the generation. Because instances in the datasets often exceed the context limit, the research team truncated the input, keeping either the rightmost or the leftmost portion of the text, and evaluated both variants.
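A minimal sketch of this truncation baseline, approximating tokens with whitespace-separated words rather than the model's tokenizer:

```python
# Sketch of the full-context truncation baseline (illustrative).
def truncate_context(text, max_tokens=4096, keep="right"):
    tokens = text.split()
    if len(tokens) <= max_tokens:
        return text
    kept = tokens[-max_tokens:] if keep == "right" else tokens[:max_tokens]
    return " ".join(kept)
```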

For retrieval, the research team used Contriever (Izacard et al., 2022) to select passages from the long context based on the query. The highest-scoring passages are concatenated into the LLM's input until they fill the context.

Finally, the research team implemented a recursion baseline that carries a summary of the information from previous segments into the current segment, where each segment is 2,500 tokens and the maximum summary size is 500 tokens.
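A rough sketch of this recursion baseline is shown below; `llm_summarize` and `llm_answer` are assumed helpers, and tokens are again approximated by words.

```python
# Sketch of the recursive-summarization baseline: each segment is processed
# together with a running summary of everything read so far, and the final
# summary plus the query is used to answer (illustrative, not the paper's code).
def recursive_baseline(long_text, query, llm_summarize, llm_answer,
                       seg_tokens=2500, summary_tokens=500):
    words = long_text.split()
    summary = ""
    for i in range(0, len(words), seg_tokens):
        segment = " ".join(words[i:i + seg_tokens])
        # Fold the current segment into the running summary, capped in length.
        summary = llm_summarize(previous_summary=summary, segment=segment,
                                max_tokens=summary_tokens)
    return llm_answer(summary=summary, query=query)
```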

Results & Analysis

Key Results

Table 2 below shows a comparison between MEMWALKER and other baselines.

MemWalker significantly exceeded the recursion baseline in all tasks.

This shows the limitation of recursion, where relevant information for the query is lost after a few steps.

MemWalker also outperforms retrieval, where the passages come from one coherent long-form story rather than separate documents.

On these tasks, the full-context baseline can perform well in the "original" task setting, which may contain relatively short sequences, although whether left or right truncation works best seems to depend on the dataset.

However, with the exception of the keep-right variant on QuALITY and the keep-left variant on GovReport, MemWalker achieves higher performance than the full-context baseline even in the original setting; those exceptions may be due to positional bias in the datasets, where the relevant passages typically appear at the beginning or end of the text.

However, on the long versions of all three tasks, MemWalker exceeded all baselines, i.e. it showed strong performance as memory access became more critical.

MEMWALKER also surpasses other publicly available models, including LongChat and MPT.

MemWalker improves performance on long sequences. The research team provides a performance breakdown by input sequence length for each task in Figure 2 above.

When the text length is shorter, MemWalker is inferior to the full-context (left- or right-truncation) baseline, but it outperforms both truncation types on longer sequences for all tasks.

The benefit of interactive reading becomes apparent as text length increases, i.e. performance improves once the sequence length significantly exceeds the LLM's 4,096-token context length.

Reasoning is essential for memory tree navigation.

The effectiveness of MemWalker is highly dependent on the reasoning capabilities of the underlying LLM. For each navigation decision, the prompt asks the LLM to first generate a natural-language justification for the next predicted action, as shown in Table 1 below.

The research team shows in Table 3 below how reasoning affects performance by comparing Llama 2 Chat (13B and 70B parameter variants) with Stable Beluga 2 (70B) and by removing the line "Provide reasoning before making a decision..." from the prompt.

For smaller, less capable models (13B), performance lags significantly behind the 70B models due to an inability to follow the instructions. In fact, requiring reasoning justifications from weaker models can even degrade performance, perhaps because they cannot generate and leverage those justifications.

Stable Beluga 2 outperformed Llama 2 Chat of the same LLM size and also showed enhanced reasoning capabilities.

For Stable Beluga 2, requiring reasoning justifications improves performance on all tasks. This highlights a key feature of MemWalker: if the LLM passes a critical reasoning-capability threshold, it can reason about long inputs across multiple rounds without quickly accumulating errors between rounds.

For weak LLMs that fail to make good navigation decisions, errors can accumulate and overall performance is impaired.

As LLMs' reasoning capabilities continue to improve in the coming years, the research team expects methods like MemWalker to become more effective.

Working memory is required to navigate the memory tree. As MemWalker makes decisions while traversing the memory tree and reading relevant passages, it may lose awareness of the overall context.

Therefore, the model carries information from the node along the navigation path as working memory, where the contents of the working memory are updated when the model chooses the next path.

The research team evaluated the performance of MemWalker with and without working memory, and the results are shown in Figure 3 below.

The research team found that removing working memory resulted in a significant decrease in performance across all tasks, with a 5-13% drop in accuracy, demonstrating the importance of this component.

MEMWALKER can recover from the wrong path.

When MemWalker navigates the memory tree, it not only needs to find its way to the most relevant passages, it may also need to recover from retrieval errors along the way.

The research team presents recovery statistics in Table 4 below. MemWalker performs recovery navigation operations (and therefore changes paths) on approximately 15%-20% of the samples; among these examples, it is able to recover and answer correctly on QuALITY, in 60% of cases on SummScreenFD, and in ∼80% of cases on GovReport.

MemWalker enables efficient reading. Since MemWalker determines which parts of the long text need to be read, the amount that must be read may be smaller than the entire sequence.

The research team reports the percentage of the long context that is read, averaged over all examples, for each of the three tasks in Figure 4 below. On average, only 63-69% of the text needed to be read to answer questions, including the contents of tree nodes.

For successful paths, the required reading is further reduced to 59%-64%.

Trade-offs for memory tree construction

When building the memory tree, a fundamental trade-off arises: summarizing larger segments into each node reduces the depth of the tree, but may lose fidelity of the content.

Similarly, connecting many lower-level nodes to a parent node helps flatten the tree, but may make the LLM's navigation decision at each node more difficult.

Figure 5 below shows the performance of different configurations of the memory tree on QuALITY. Summarizing larger paragraphs is often more beneficial than summarizing smaller paragraphs and connecting more child nodes to the parent node.

However, performance plateaued as the maximum number of nodes increased, showing the trade-off of how much information can be packed into nodes during memory tree construction.
