A16Z: 4 Breakthroughs in Generative AI

Large language models (LLMs) have become a hot topic in the tech industry, giving us some amazing experiences, from writing a week's worth of code in seconds to generating conversations that can feel more empathetic than the ones we have with humans. Trained on trillions of tokens of data with tens of thousands of GPUs, LLMs demonstrate remarkable natural language understanding and are transforming fields like copywriting and coding, pushing us into a new and exciting generative AI era. Like any emerging technology, generative AI has its fair share of critics. Although their criticisms partly reflect the limits of LLMs' current capabilities, we view these obstacles as opportunities for further innovation rather than fundamental shortcomings of the technology.

To better understand recent technological breakthroughs in LLMs and prepare founders and operators for what comes next, we spoke with some of the leading generative AI researchers who are actively building and training some of the largest and most cutting-edge models: Dario Amodei, CEO of Anthropic; Aidan Gomez, CEO of Cohere; Noam Shazeer, CEO of Character.AI; and Yoav Shoham of AI21 Labs. These conversations identified four key directions of innovation: steering, memory, "arms and legs," and multimodality. In this article, we discuss how these innovations are likely to evolve over the next 6 to 12 months and how founders interested in integrating AI into their own businesses can take advantage of these developments.

Steering

Many founders express concern about using LLMs in their products and workflows because of the potential for hallucinations and reproduced bias in these models. To address these issues, some leading model companies are working on improved steering: techniques for better controlling LLM outputs so that models can understand and execute complex user requirements. Noam Shazeer drew a parallel between LLMs and children in this regard: "It's a question of how to steer [the models] better... The problem we have with LLMs is that we need the right way to tell them to do what we ask. Small children are the same: they sometimes make things up and don't have a clear sense of the line between fantasy and reality." Although model providers and tools such as Guardrails and LMQL have made remarkable progress on steering [1], researchers are still improving it, which we believe is critical to better productizing LLMs for end users.
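To make the pattern concrete, here is a minimal sketch of the validate-and-retry loop that steering tools such as Guardrails build on; this is not how those libraries are actually implemented. `call_llm` stands in for whatever completion endpoint you use, and the JSON schema is invented for illustration.

```python
# Minimal sketch of an output-steering loop: ask the model for structured
# output, validate it against a schema, and re-prompt with the validation
# error if it fails. `call_llm` is a placeholder for any chat-completion API.
import json
from typing import Callable

SCHEMA_HINT = (
    "Respond ONLY with JSON of the form "
    '{"product": string, "headline": string, "risk_flags": [string]}.'
)

def steer(prompt: str, call_llm: Callable[[str], str], max_retries: int = 3) -> dict:
    full_prompt = f"{prompt}\n\n{SCHEMA_HINT}"
    for _ in range(max_retries):
        raw = call_llm(full_prompt)
        try:
            parsed = json.loads(raw)
            # Enforce the keys and types promised to downstream systems.
            assert isinstance(parsed["product"], str)
            assert isinstance(parsed["headline"], str)
            assert isinstance(parsed["risk_flags"], list)
            return parsed
        except (json.JSONDecodeError, KeyError, AssertionError) as err:
            # Feed the failure back so the next attempt can self-correct.
            full_prompt = (
                f"{prompt}\n\n{SCHEMA_HINT}\n"
                f"Your last reply was invalid ({err}). Try again."
            )
    raise ValueError("Model output could not be steered into the required schema.")
```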

Improved steering is especially important in enterprise companies, where the consequences of unpredictable behavior can be costly. Amodei pointed out that the unpredictability of LLMs makes people uncomfortable, and as an API provider he wants to be able to "say to customers 'no, the model doesn't do this,' or at least rarely does." By improving LLM outputs, founders can be more confident that the model's behavior matches their customers' needs. Better steering will also pave the way for adoption in industries that demand greater precision and reliability, such as advertising, where the stakes of ad placement are high. Amodei also believes improved steering could apply to "legal use cases, medical use cases, storing financial information and managing financial bets, and scenarios where you need to protect your company's brand. You don't want the technology you're integrating to be unpredictable, or hard to predict or characterize." With better steering, LLMs will also be able to accomplish more complex tasks with minimal prompt engineering, because they will better understand overall intent.

Advances in steering also have the potential to unlock sensitive consumer applications where users expect tailored, accurate responses. Users may tolerate less accurate output when engaging in conversational or creative interactions with LLMs, but when they use LLMs to help with everyday tasks, guide important decisions, or augment professionals such as life coaches, therapists, and doctors, they want more accurate output. It has been suggested that LLMs could displace well-established consumer applications such as search, but before this becomes a real possibility, we will likely need better steering to improve model outputs and build user trust.

*Key breakthrough: users will be able to better customize the outputs of LLMs.*

Memory

Copywriting and ad-generation applications driven by LLMs have already achieved great success, rapidly gaining popularity among marketers, advertisers, and entrepreneurs. However, the output of most current LLMs is relatively generic, which makes them hard to use for use cases that require personalization and contextual understanding. While prompt engineering and fine-tuning can provide a degree of personalization, prompt engineering does not scale well, and fine-tuning tends to be costly because it requires some level of retraining and usually requires close cooperation with mostly closed-source LLM providers. Fine-tuning a model for each individual user is usually neither feasible nor desirable.

In-context learning is the holy grail here: LLMs take in your company's generated content, your company-specific jargon, and the relevant context to create more granular, use-case-specific output. Achieving this requires enhanced memory capabilities. LLM memory has two main components: context windows and retrieval. The context window is the text the model can process and use to inform its output, beyond the data corpus it was trained on. Retrieval refers to retrieving and referencing relevant information and documents ("contextual data") from a body of data outside the model's training corpus. Currently, most LLMs have limited context windows and cannot natively retrieve additional information, so they generate output that lacks personalization. With larger context windows and improved retrieval, however, LLMs can directly provide more granular, use-case-specific outputs.
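As a rough illustration of the constraint a fixed context window imposes, the hypothetical helper below trims a conversation to fit a token budget; the whitespace-based token count and window sizes are stand-ins, but the effect is the same: whatever does not fit in the window is simply forgotten.

```python
# Minimal sketch of why the context window matters for memory: older turns
# must be dropped once the conversation no longer fits the model's window.
# Token counts are approximated by whitespace splitting purely for illustration.
def fit_to_window(system_prompt: str, turns: list[str], window_tokens: int = 4096,
                  reserve_for_reply: int = 512) -> list[str]:
    budget = window_tokens - reserve_for_reply - len(system_prompt.split())
    kept: list[str] = []
    # Walk backwards so the most recent turns are kept; earlier context is lost.
    for turn in reversed(turns):
        cost = len(turn.split())
        if cost > budget:
            break
        kept.append(turn)
        budget -= cost
    return [system_prompt] + list(reversed(kept))
```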

In particular, by expanding the context window, models will be able to handle larger volumes of text and better preserve context, including maintaining coherence through a conversation. This in turn will significantly improve their performance on tasks that require a deeper understanding of longer inputs, such as summarizing long documents or generating coherent, contextually accurate responses during extended conversations. We have already seen significant improvements in context windows: GPT-4 has context windows of 8k and 32k tokens, compared with 4k and 16k tokens for GPT-3.5 and ChatGPT, and Claude recently expanded its context window to a staggering 100k tokens [2].

Expanding the context window alone does not improve memory sufficiently, since the cost and time of inference scale quasi-linearly, or even quadratically, with the length of the prompt [3]. Retrieval mechanisms augment and refine the LLM's original training corpus with contextual data relevant to the prompt. Because LLMs are trained on one body of information and are typically difficult to update, retrieval has two main benefits, according to Shoham: "First, it allows you to access sources of information that you did not have at training time. Second, it allows you to focus the language model on the information you believe is relevant to the task." Vector databases such as Pinecone have become the de facto standard for efficiently retrieving relevant information and serve as the memory layer for LLMs, making it easier for models to search and reference the right data quickly and accurately across massive amounts of information.
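A minimal sketch of that retrieval step, assuming a generic `embed` function and a small in-memory document list in place of a production vector database such as Pinecone:

```python
# Minimal retrieval sketch: embed documents, rank them against the query,
# and pull the most relevant ones into the prompt. `embed` stands in for any
# embedding model; a vector database plays the role of `doc_vectors` at scale.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def retrieve(query: str, docs: list[str], embed, top_k: int = 3) -> list[str]:
    doc_vectors = [embed(d) for d in docs]  # precomputed and indexed in practice
    q = embed(query)
    ranked = sorted(zip(docs, doc_vectors), key=lambda dv: cosine(q, dv[1]), reverse=True)
    return [d for d, _ in ranked[:top_k]]

def build_prompt(query: str, docs: list[str], embed) -> str:
    context = "\n\n".join(retrieve(query, docs, embed))
    return f"Use the context below to answer.\n\nContext:\n{context}\n\nQuestion: {query}"
```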

Larger context windows and better retrieval will be especially important in enterprise use cases, such as navigating large knowledge bases or complex databases. Companies will be able to better leverage their proprietary data, such as internal knowledge, historical customer support tickets, or financial results, as inputs to LLMs without fine-tuning. Improving LLM memory will bring improvements and deep customization in areas such as training, reporting, internal search, data analytics and business intelligence, and customer support.

In the consumer space, improved context windows and retrieval will enable powerful personalization capabilities that can revolutionize the user experience. According to Noam Shazeer, "One of the big breakthroughs will be a model with a very high memory capacity that can be customized for each user while still being cost-effective at scale. You want your therapist to know everything about your life; you want your teacher to understand what you already know; you want a life coach who can advise you on what's going on. They all need context." Aidan Gomez is also excited about this direction. "By giving the model access to data that is uniquely relevant to you, like your email, calendar, or direct messages," he said, "the model will learn about your relationships with different people and be able to help you in the best possible way in each situation."

*Key breakthrough: LLMs will be able to take large amounts of relevant information into account and provide more personalized, customized, and useful output.*

** "Arms and Legs": Gives the model the ability to use tools**

The real power of LLMs lies in making natural language a medium for action. LLMs have a sophisticated understanding of common, well-documented systems, but they cannot act on any of the information they extract from those systems. For example, OpenAI's ChatGPT, Anthropic's Claude, and Character AI's Lily can describe in detail how to book a flight, but they cannot natively book one themselves (although technological advances like ChatGPT's plugins are pushing this boundary). "This brain theoretically has all this knowledge; it's just missing the mapping from names to buttons," Amodei said. "It doesn't take a lot of training to connect those wires. You have a disembodied brain that knows how to move, but it isn't attached to arms and legs yet."

Over time, we have seen companies improve the ability of LLMs to use tools. Established companies like Bing and Google and startups like Perplexity and You.com have launched search APIs. AI21 Labs introduced Jurassic-X, which addresses many of the shortcomings of standalone LLMs by combining models with a set of predetermined tools, including a calculator, a weather API, a Wikipedia API, and databases. OpenAI launched a beta of plugins that allow ChatGPT to interact with tools such as Expedia, OpenTable, Wolfram, Instacart, Speak, a web browser, and a code interpreter, a breakthrough some have compared to Apple's "App Store" moment. More recently, OpenAI introduced function calling in GPT-3.5 and GPT-4 [4], allowing developers to connect GPT's capabilities to any external tool.
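As a sketch of what this looks like for a developer, the snippet below follows the function-calling interface OpenAI described for GPT-3.5 and GPT-4 [4]; the `book_flight` schema is a hypothetical example, and exact field names vary across library versions.

```python
# Sketch of GPT function calling: declare a tool schema, let the model decide
# when to call it, then execute the call yourself. Field names follow the
# pre-1.0 openai Python library and may differ in newer versions;
# book_flight is a hypothetical local function, not a real API.
import json
import openai

functions = [{
    "name": "book_flight",
    "description": "Book a flight for the user",
    "parameters": {
        "type": "object",
        "properties": {
            "origin": {"type": "string"},
            "destination": {"type": "string"},
            "date": {"type": "string", "description": "YYYY-MM-DD"},
        },
        "required": ["origin", "destination", "date"],
    },
}]

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0613",
    messages=[{"role": "user", "content": "Book me a flight from SFO to JFK on 2023-09-01"}],
    functions=functions,
    function_call="auto",
)

message = response["choices"][0]["message"]
if message.get("function_call"):
    args = json.loads(message["function_call"]["arguments"])
    # The application, not the model, actually performs the action.
    print("Would call book_flight with:", args)
```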

The ability to add arms and legs promises to enable a range of use cases across a wide variety of companies and user types by moving LLMs from knowledge retrieval to action. For consumers, LLMs may soon be able to suggest a recipe and then order the ingredients you need, or suggest a brunch spot and reserve a table for you. In the enterprise, founders can make their applications easier to use by plugging in LLMs. As Amodei points out: "For functions that are very hard to use from a user-interface perspective, we may only need to describe them in natural language to accomplish complex operations." For example, for applications like Salesforce, LLM integration should allow users to make updates in natural language and have the model apply those changes automatically, drastically reducing the time it takes to maintain a CRM. Startups like Cohere [5] and Adept [6] are working on integrating LLMs into such complex tools.

Gomez believes that while it is increasingly likely that LLMs will be able to use applications such as Excel within two years, "a lot of refinement still needs to be done. We will have a first generation of models that can use tools, and it will be compelling but fragile. Eventually we'll get the dream system, where we can hand any piece of software to the model with a description like 'here's what the tool does, here's how to use it,' and it will be able to use it... Once we can give LLMs both specific and general tools, the automation it brings will be the pinnacle of our field."

*Key breakthrough: LLMs will be able to interact more effectively with the tools we use today.*

Multimodality

While chat interfaces are exciting and intuitive for many users, humans hear and speak language at least as often as they write or read it. As Amodei points out: "There is a limit to what an AI system can do, because not everything is text." Models with multimodal capabilities can seamlessly process and generate content across audio and visual formats, extending interaction beyond language. Models like GPT-4, Character.AI, and Meta's ImageBind can already process and generate images, audio, and other modalities, but their abilities here are still relatively basic, even if progress is rapid. In Gomez's words: "Our models today are literally blind, and that needs to change. We have built a lot of graphical user interfaces (GUIs) that are meant to be seen."

As LLMs evolve to better understand and interact with multiple modalities, they will be able to use existing applications that rely on GUIs, such as browsers. They can also offer consumers more engaging, coherent, and holistic experiences, allowing interaction to go beyond the chat interface. "A lot of great integration of multimodal models can make things more engaging and more connected to the user," Shazeer noted, adding, "I think most of the core intelligence right now comes from text, but audio and video can make these things more interesting." From video chatting with AI tutors to collaborating with AI to iterate on and write TV scripts, multimodality has the potential to transform entertainment, learning and development, and content generation across a range of consumer and enterprise use cases.

Multimodality is closely tied to tool use. Although LLMs may initially connect to external software via APIs, multimodality will let LLMs use tools designed for human consumption that have no custom integrations, such as legacy enterprise resource planning (ERP) systems, desktop applications, medical devices, or manufacturing machinery. We are already seeing exciting progress here: Google's Med-PaLM-2 model, for example, can synthesize mammograms and X-rays. And in the longer term, multimodality (especially integration with computer vision) could extend LLMs into our physical reality through robotics, autonomous vehicles, and other applications that require real-time interaction with the physical world.

*Key breakthrough: multimodal models will be able to reason about images, video, and even physical environments without significant customization.*

Despite the practical limitations of LLMs, researchers have made astonishing improvements to these models in a short amount of time. The fact that we have had to update this piece several times while writing it is a testament to how quickly the field is moving. Gomez agrees: "If one time out of 20 the LLM makes up a fact, that's obviously still too high. But I'm really, really confident that this is the first time we've built a system like this. People's expectations are quite high, so the goal has moved from 'computers are dumb, they can only do math' to 'a human could probably do it better.' We've closed the gap enough that the criticism is now about what a human can do."

We're particularly excited about these four innovations, which are at a tipping point in changing how entrepreneurs build products and run companies. In the long run, the potential is even greater. Amodei predicts: "At some point, we may have a model that can read all the biological data and figure out a cure for cancer." The reality is that the best new applications may still be unknown. At Character.AI, Shazeer leaves it to users to develop those use cases: "We're going to see a lot of new applications unlocked. It's hard for me to say what they are. There will be millions of them, and users, who far outnumber the engineers, are better at figuring out how to use the technology." We can't wait to see how these advancements will affect the way we live and work, as these new tools and capabilities empower entrepreneurs and companies.

*Thanks to Matt Bornstein, Guido Appenzeller, and Rajko Radovanović for their comments and feedback during the writing process.*
