Founder of Anthropic: It is possible to "take X-rays" of large models, and AGI could be realized in 2-3 years

This article is compiled from a podcast interview with Anthropic CEO Dario Amodei.

Anthropic is the second-ranked company in the LLM race. It was founded in January 2021 by Dario Amodei, and in July this year it launched its latest-generation model, Claude 2. Dario Amodei was previously vice president of research at OpenAI. He founded Anthropic because he believed there are many safety issues in large models that urgently need to be solved, so Anthropic places great weight on AI Safety; its vision is to build reliable, interpretable, and steerable AI systems. The biggest difference between Anthropic's route and OpenAI's is precisely this focus on interpretability.

In the interview, Dario explains Anthropic's focus on and investment in interpretability. **Interpretability is one of the important ways to ensure the safety of a model: it is similar to taking X-rays or MRI scans of the model, making it possible for researchers to understand what is happening inside it and to identify possible sources of risk. Truly understanding why Scaling Law works and how to achieve alignment is inseparable from interpretability.** Dario believes that AI Safety and alignment are equally important: alignment failures and AI safety problems caused by abuse deserve the same level of attention.

Dario believes that model capabilities will improve significantly in the next 2-3 years, and may even "take over human society", but models cannot yet truly participate in business and economic processes. This is not because of the models' abilities, but because of various kinds of invisible friction: people do not use models efficiently enough to realize their true potential in real life and work.

Compared with the CEOs of most AI companies, Dario rarely takes part in public interviews and seldom expresses his views on Twitter. Dario explained that this is a deliberate choice of his own: by keeping a low profile he protects his ability to think independently and objectively.

The following is the table of contents of this article; it is recommended to read the article alongside these main points.

👇

01 Why Scaling Law Works

02 How will the model's ability be on par with that of humans?

03 Alignment: Interpretability is "X-raying" the model

04 AGI Safety: AI Safety and Cyber Security

05 Commercialization and Long Term Benefit Trust

Why Scaling Law works

**Dwarkesh Patel: Where does your belief in Scaling Law come from? Why does a model's ability grow stronger as the amount of data increases?**

**Dario Amodei: Scaling Law is, to a certain extent, an empirical summary. We observe this phenomenon across many kinds of data and experiments and summarize it as Scaling Law, but there is no generally accepted, particularly good explanation of the essential principle behind why it works.**

If I have to give an explanation, I personally speculate that this may be similar to long-tailed distributions or power laws in physics. When there are many features, the data that makes up a relatively large share usually corresponds to the more dominant basic rules and patterns: because these patterns appear often, the corresponding amount of data is naturally larger, while the long-tail data mainly captures more detailed and complex rules. **For example, when dealing with language data, basic rules can be observed in most of the data, such as elementary grammatical rules like part of speech and word order, while the relatively long-tailed material consists of complex grammar.**
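To make the long-tail intuition concrete, here is a minimal sketch (my own illustration, not from the interview) in which pattern frequencies follow a power law; the exponent and sample sizes are arbitrary assumptions. Each extra order of magnitude of data keeps exposing new, rarer patterns, which is one way to picture why more data keeps teaching the model new rules.

```python
# Toy illustration of the long-tail intuition: if patterns in language follow a
# power law, each 10x increase in data exposes noticeably more rare patterns.
import numpy as np

rng = np.random.default_rng(0)

def distinct_patterns_seen(n_samples: int, zipf_exponent: float = 1.3) -> int:
    """Draw n_samples pattern ids from a Zipf (power-law) distribution and
    count how many distinct patterns were observed at least once."""
    samples = rng.zipf(zipf_exponent, size=n_samples)
    return len(np.unique(samples))

for n in (10_000, 100_000, 1_000_000):
    print(f"{n:>9,} samples -> {distinct_patterns_seen(n):>8,} distinct patterns seen")
# Each order of magnitude of data keeps uncovering new, rarer patterns.
```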

This is why every time the data increases by an order of magnitude, the model can learn more behavioral rules. What we don't know is why the relationship between the two is such a clean, almost linear one. Anthropic's chief scientist, Jared Kaplan, has used fractal dimension to explain this, and other people are trying other methods to verify Scaling Law, but so far we still cannot explain why it holds.

• Fractal Dimension:

The mathematician Felix Hausdorff first proposed the concept of fractal dimension in 1918; it is also known as the Hausdorff dimension. Fractal dimension can be used to describe the hidden feature structure in machine-learning data and provides one mathematical model for the Scaling effect, offering an explanation of why AI models improve with scale.
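As a side note on the concept itself, the sketch below (my own toy, unrelated to any specific Anthropic analysis) estimates a fractal dimension by box counting on the middle-third Cantor set, whose Hausdorff dimension is log 2 / log 3 ≈ 0.631. The idea is that the box count scales as N(ε) ~ ε^(−D), so D can be read off a log-log fit.

```python
# Estimate a fractal dimension by box counting on the middle-third Cantor set.
import numpy as np

def cantor_points(depth: int) -> np.ndarray:
    """Endpoints of the intervals in the depth-th Cantor set approximation."""
    pts = np.array([0.0, 1.0])
    for _ in range(depth):
        pts = np.concatenate([pts / 3.0, pts / 3.0 + 2.0 / 3.0])
    return pts

points = cantor_points(12)
epsilons = np.array([3.0 ** -k for k in range(2, 8)])          # box sizes
counts = [len(np.unique(np.floor(points / eps))) for eps in epsilons]

# N(eps) ~ eps^(-D)  =>  log N = D * log(1/eps) + const
slope, _ = np.polyfit(np.log(1.0 / epsilons), np.log(counts), 1)
print(f"estimated dimension: {slope:.3f}  (exact: {np.log(2) / np.log(3):.3f})")
```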

**Also, even though we know Scaling Law exists, it is difficult to predict changes in a model's specific capabilities. In the research on GPT-2 and GPT-3, we never knew when the model would learn to calculate or to program; these abilities appeared suddenly.** The only things that can be predicted are at the numerical level: quantities such as the loss value or the entropy can be predicted quite accurately. It is as if we can gather statistics on weather data and predict the overall trend, but find it hard to predict the weather and temperature on a specific day.
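This is the asymmetry being described: the smooth, numerical side of Scaling Law can be fitted and extrapolated, while specific capabilities cannot be read off the curve. A toy sketch with made-up numbers (hypothetical compute budgets and losses, not real measurements):

```python
# Fit a power law loss(C) ~ a * C^(-b) to small runs and extrapolate a larger run.
# The specific capabilities ("learns addition") cannot be read off such a curve.
import numpy as np

compute = np.array([1e18, 1e19, 1e20, 1e21])   # hypothetical training compute
loss    = np.array([3.10, 2.55, 2.10, 1.73])   # hypothetical eval losses

b, log_a = np.polyfit(np.log(compute), np.log(loss), 1)   # linear fit in log-log space

def predicted_loss(c: float) -> float:
    return float(np.exp(log_a) * c ** b)

print(f"fitted exponent: {b:.3f}")
print(f"extrapolated loss at 1e22 compute: {predicted_loss(1e22):.2f}")
```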

**Dwarkesh Patel: Why can a model suddenly acquire a certain ability? For example, it did not understand addition before, but now it has mastered calculation. What causes this change?**

Dario Amodei: This is another question we're still exploring. We try to use mechanistic interpretability to explain this, describing language phenomena with an idea similar to circuits: you can imagine these things as circuits connected one by one.

There is some evidence that when a model is fed something, its probability of giving the correct answer suddenly increases at some point. But if we look at what happens before the model actually gives the correct answer, we see the probability slowly climbing from one in a million, to one in a hundred thousand, to one in a thousand. In many such cases there seems to be some gradual process going on that we have not observed and have not yet figured out.

We cannot be sure whether a "circuit" like "addition" has existed since day one and has gradually grown from weak to strong through some specific process until the model can give the correct answer. These are the questions we want to answer through mechanistic interpretability.
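A schematic sketch of the point above, using synthetic numbers rather than real measurements: the probability assigned to the exact correct answer can climb smoothly for a long time while a pass/fail metric still reads zero, so the ability appears to emerge suddenly.

```python
# Synthetic illustration: smooth underlying progress vs. a thresholded metric.
checkpoints = [1, 2, 3, 4, 5, 6]
p_correct   = [1e-6, 1e-5, 1e-4, 1e-3, 0.05, 0.62]   # probability of the exact answer

for step, p in zip(checkpoints, p_correct):
    solved = p > 0.5                                  # "did it answer correctly?"
    print(f"checkpoint {step}: P(correct answer) = {p:.0e}  -> solves task: {solved}")
# The pass/fail view shows a sudden jump; the probability view shows gradual progress.
```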

• Mechanistic Interpretability:

Mechanistic interpretability is the study of reverse-engineering neural networks. It helps people understand more easily how a model maps inputs to outputs, and it is one way to realize model interpretability. Its main goal is to understand deep learning as a natural science, using the structure and parameters of a model to explain its decision process and predictions, so that human users can understand and verify how the model works. Early work focused on using matrix factorization and feature-visualization methods to understand representations at intermediate layers of vision networks; more recent work has focused on representations in multimodal networks, as well as circuit-level understanding of the algorithms neural networks implement.

Anthropic has published a study of mechanistic interpretability, "Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases".

**Dwarkesh Patel: Are there any abilities that do not come with model size?**

**Dario Amodei: Model alignment and value-related capabilities may not emerge naturally with model size.** One way of thinking about it is that the training process of the model is essentially about predicting and understanding the world; its main concern is facts, not opinions or values. But there are free variables here: what action should you take? What point of view should you adopt? Which factors should you pay attention to? There are no data labels of this kind for the model to learn from. Therefore, I think it is unlikely that alignment and values will simply emerge.

**Dwarkesh Patel: Is there a possibility that before the model's ability catches up with human intelligence, the data available for training will be used up? **

**Dario Amodei:** I think it is necessary to distinguish whether this is a theoretical problem or a practical one. From a theoretical point of view, we are not that far from running out of data, but my personal view is that it is unlikely to be a problem: we can generate data in many ways, so data is not really the obstacle. Another scenario is that we use up all available computing resources, which would slow progress in model capabilities. Both scenarios are possible.

**My personal view is that there is a high probability Scaling Law will not stall, and that if a problem does appear, the cause is more likely to be the computing architecture.** For example, if we used LSTMs or RNNs, the rate at which model abilities evolve would change. If we hit a bottleneck in the evolution of model capabilities under every architecture, that would be quite serious, because it would mean we had run into a deeper problem.

• LSTMs:

Long Short-Term Memory networks are a special kind of RNN (recurrent neural network) that can learn long-term dependencies. They address the problems traditional RNNs have in learning long sequence patterns and can extract both long- and short-term information from sequence data. The learning and representation ability of an LSTM is stronger than that of a standard RNN.

**I think we've reached a stage where arguing about what a model can and cannot do may no longer be that meaningful.** In the past, people would put limits on model abilities, believing models could not master reasoning or learn programming and would run into bottlenecks in certain respects. Although some people, including me, did not think so, this kind of bottleneck theory became fairly mainstream over the past few years; now that has changed.

**If we do see a bottleneck in future model scaling, I think the problem will come from the design of the loss function, which focuses on the next-token prediction task.** When we put too much emphasis on reasoning and programming abilities, the model's loss concentrates on the tokens that reflect those abilities, while tokens for other kinds of problems appear less often (note: a model's pre-training dataset is weighted according to how much importance researchers place on different abilities). **The loss function pays too much attention to the tokens that provide the most information entropy and ignores the ones that actually matter, so the signal can be drowned in the noise.**

If this problem arises, we will need to introduce some kind of reinforcement-learning process. There are many kinds of RL, such as reinforcement learning from human feedback (RLHF), reinforcement learning against targets, and approaches like Constitutional AI, amplification, and debate. These are both methods of model alignment and ways of training the model. **We may have to try many methods, but we must stay focused on what we want the model's objective to be.**

One of the problems with reinforcement learning is that you need to design a very complete loss function. The loss function for next-token prediction has already been designed, so if scaling in that direction hits a ceiling, the development of AI will slow down.
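As a rough picture of what "a learned signal beyond next-token loss" means here, the sketch below is my own toy, not Anthropic's training setup: a stand-in reward model scores candidate outputs and the highest-scoring one is selected, a crude, RL-free proxy for the selection pressure that RLHF-style methods apply during fine-tuning. The `reward_model` heuristic and the sample texts are invented for illustration.

```python
# Toy proxy for RLHF-style selection pressure: a learned reward signal, rather
# than next-token loss alone, decides which outputs are preferred.
def reward_model(text: str) -> float:
    """Stand-in for a preference model trained on human comparisons."""
    return len(set(text.split())) - 5.0 * text.lower().count("threat")

def best_of_n(candidates: list[str]) -> str:
    """Pick the candidate the stand-in reward model scores highest."""
    return max(candidates, key=reward_model)

samples = [
    "I can help you with that question.",
    "I can help, and here is a clear step by step answer.",
    "Answer the question or face a threat.",
]
print(best_of_n(samples))   # the detailed, non-threatening answer wins
```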

**Dwarkesh Patel: How did your understanding of Scaling come about? **

**Dario Amodei:** The formation of my view can be traced roughly to 2014-2017. I had been paying attention to the development of AI, but for a long time I thought it would take a long time before AI was really applied, until AlexNet appeared. Then I joined Andrew Ng's project team at Baidu, and that was my first real contact with AI.

I consider myself quite lucky: unlike other academics at the time, I was tasked with building state-of-the-art speech recognition systems, and there was plenty of data and many GPUs available. **During that project, I naturally realized that scaling is a good solution. The process was also different from postdoctoral research: we did not necessarily need to come up with clever, innovative ideas that no one had proposed before.**

Throughout the project, I only needed to run some basic experiments, such as adding more layers to the RNN or adjusting the training parameters to extend training time, and then watch the training process to see when the improvement happened. I also tried adding new training data or reducing the number of repeated training epochs, and observed the impact of these adjustments on performance. In these experiments I noticed some regular results. However, it was not clear to me whether these findings were groundbreaking or whether other colleagues had made similar discoveries. Overall this was just my lucky experience as an AI beginner: I did not know much else about the field, but I felt at the time that this was validated in speech recognition.

**I got to know Ilya before OpenAI was founded, and he told me that "we need to realize that these models just want to learn." That perspective inspired me greatly and made me realize that the phenomena I had observed might not be random instances but common occurrences. These models just want to learn; we only need to provide high-quality data and create enough room for them to operate, and the models will learn by themselves.**

**Dwarkesh Patel: Few people have arrived at a view of "universal intelligence" the way you and Ilya have. How do you think about this question differently from other people? What made you think models would continue to improve in speech recognition, and similarly in other areas?**

Dario Amodei: I really don't know. When I first observed this kind of phenomenon in speech, I thought it was just a law specific to that vertical domain. Between 2014 and 2017, I tried many different things and observed the same pattern again and again. I observed it in the Dota project, and, although the data available in robotics is relatively limited and many people are pessimistic about it, I observed a similar phenomenon there as well. **I think people tend to focus on solving the immediate problem. They pay more attention to how to solve the problem itself in the vertical direction rather than thinking about lower-level problems in the horizontal direction, so they may not fully consider the possibility of Scaling. For example, in robotics the most fundamental problem may be insufficient training data, yet it is easy to conclude instead that Scaling does not work.**

**Dwarkesh Patel: When did you realize that language could be a way to feed massive amounts of data into these models? **

**Dario Amodei:** I think the most important thing is the concept of self-supervised learning based on next-token prediction, along with a large number of architectures built for prediction. This is actually similar to the logic of child development tests. For example, Mary walks into the room and puts down an object, then Chuck walks in and moves the object without Mary noticing: what does Mary think? To complete this kind of prediction, the model must simultaneously solve the mathematical problems, psychological problems, and so on that are involved. So, in my view, to make good predictions you have to feed the model data and let it learn without constraints.
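A minimal sketch of the training signal being described, self-supervised next-token prediction: a toy bigram "model" is scored by the negative log-probability it assigns to each actual next token, with no labels needed beyond the text itself. The text, smoothing constant, and counts are purely illustrative.

```python
# Self-supervised next-token prediction with a toy bigram model.
import math
from collections import Counter, defaultdict

text = "mary puts the ball in the box . chuck moves the ball".split()

# "Train": count bigrams to estimate P(next | current).
bigrams = defaultdict(Counter)
for cur, nxt in zip(text, text[1:]):
    bigrams[cur][nxt] += 1

def next_token_loss(tokens: list[str]) -> float:
    """Average negative log-likelihood of each actual next token."""
    losses = []
    for cur, nxt in zip(tokens, tokens[1:]):
        total = sum(bigrams[cur].values())
        p = (bigrams[cur][nxt] + 1) / (total + 50)   # add-one smoothing, assumed vocab of 50
        losses.append(-math.log(p))
    return sum(losses) / len(losses)

print(f"average next-token loss: {next_token_loss(text):.2f}")
```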

Although I had a similar intuition long ago, it was not until Alec Radford made some attempts with GPT-1 that I realized we could not only build a model with predictive ability, but also fine-tune it to complete all kinds of tasks. I think this gives us the possibility of doing all kinds of tasks and solving all kinds of problems, including logical reasoning. And of course, we can also keep expanding the model size.

• Alec Radford, author of Sentiment Neuron, the predecessor of the GPT series, and a co-author of the GPT papers, still works at OpenAI.

**Dwarkesh Patel: Why do you think model training requires so much data? Should we worry about the low sample efficiency of model training?**

Dario Amodei: This question is still being explored. One view is that models are actually two to three orders of magnitude smaller than the human brain, yet the amount of data required to train them is three to four orders of magnitude larger than the amount of text an 18-year-old human has read: humans are on the order of hundreds of millions of words, while models are on the order of hundreds of billions or trillions of tokens. The amount of data humans receive is not large, but it is completely sufficient for our daily work and life. There is another possibility, though: beyond reading, our senses are constantly feeding information into the brain.
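Back-of-the-envelope arithmetic for that comparison, with rough order-of-magnitude assumptions rather than measured figures:

```python
# Rough orders of magnitude only; both figures are assumptions for illustration.
import math

human_words  = 3e8    # text a person might have read by ~18, order of magnitude
model_tokens = 1e12   # tokens in a large pretraining corpus, order of magnitude

gap = math.log10(model_tokens / human_words)
print(f"data gap: ~{gap:.1f} orders of magnitude")   # roughly 3-4 orders of magnitude
```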

There is actually a paradox here: the models we currently have are smaller than the human brain, yet they can accomplish many tasks comparable to what the human brain does, while requiring far more data. So we still need to keep exploring and understanding this issue, but to some extent these details are not what matters. **What matters more is how to evaluate a model's abilities and how to judge the gap between models and humans. As far as I'm concerned, that gap is not that far off.**

**Dwarkesh Patel: Does the emphasis on Scaling and, more broadly, on large-scale compute as the driver of model capability underestimate the role of algorithmic progress?**

**Dario Amodei:** When the Transformer paper first came out, I wrote about related issues and mentioned seven factors that affect the improvement of model ability, of which four are the most obvious and critical: the number of model parameters, the scale of compute, data quality, and the loss function. For example, tasks like reinforcement learning or next-token prediction depend heavily on having the right loss function or incentive mechanism.

• Reinforcement learning (RL):

Reinforcement learning finds the best course of action for each state of the environment through a basic process of trial and error. The model starts with a random policy, and each time it takes an action it receives a certain number of points (also known as rewards).

• Loss function: in machine learning, a function that measures goodness of fit. It reflects the degree of difference between the model's output and the true value, that is, the prediction error; it aggregates the error over all sample points into a single value representing overall fit. During training, the model's parameters are continuously adjusted according to the value of the loss function, in order to minimize the loss and obtain a better fit.

The other three factors are:

The first is structural symmetry. If the architecture does not respect the right symmetries, it will not work and will be very inefficient. For example, convolutional neural networks (CNNs) account for translational symmetry and LSTMs account for symmetry in time, but the problem with LSTMs is that they do not attend to the full context; this kind of structural weakness is common. If a model cannot understand and process the long past history (the data that appeared earlier in a sequence) for structural reasons, its computation becomes incoherent. Both RNNs and LSTMs have this shortcoming.

• Adam(Adaptive Moment Estimation):

Adaptive moment estimation; the Adam algorithm combines the advantages of RMSProp and SGD with momentum and handles non-convex optimization problems well.

• SGD(Stochastic Gradient Descent):

Stochastic Gradient Descent, an iterative method for optimizing an objective function with suitable smoothness properties (differentiable or subdifferentiable). It can be viewed as a stochastic approximation of gradient descent. In high-dimensional optimization problems it reduces the computational burden, enabling faster iterations at the cost of a lower convergence rate.

Then there is numerical stability (note: conditioning, which refers to whether an algorithm is well-conditioned in the numerical-analysis sense; if it is not, a small change in the problem's data causes a huge change in its solution). Loss functions are numerically difficult to optimize and easily become ill-conditioned; that is why Adam works better than plain SGD.
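A small numerical sketch of the conditioning point (my own toy problem, not any training setup discussed in the interview): on a badly conditioned quadratic, plain gradient descent with a fixed step size crawls along the low-curvature direction, while Adam's per-coordinate scaling copes much better. Plain deterministic gradient descent stands in for SGD here.

```python
# Ill-conditioned quadratic f(x, y) = 0.5 * (x^2 + 100 * y^2); optimum at (0, 0).
import numpy as np

def grad(p: np.ndarray) -> np.ndarray:
    return np.array([p[0], 100.0 * p[1]])

def run_plain_gd(steps: int = 200, lr: float = 0.009) -> np.ndarray:
    p = np.array([5.0, 5.0])
    for _ in range(steps):
        p = p - lr * grad(p)          # lr must stay small for the stiff y-direction,
    return p                          # so progress in x is painfully slow

def run_adam(steps: int = 200, lr: float = 0.1,
             b1: float = 0.9, b2: float = 0.999, eps: float = 1e-8) -> np.ndarray:
    p = np.array([5.0, 5.0])
    m, v = np.zeros(2), np.zeros(2)
    for t in range(1, steps + 1):
        g = grad(p)
        m = b1 * m + (1 - b1) * g             # first-moment estimate
        v = b2 * v + (1 - b2) * g * g         # second-moment estimate
        m_hat = m / (1 - b1 ** t)
        v_hat = v / (1 - b2 ** t)
        p = p - lr * m_hat / (np.sqrt(v_hat) + eps)   # per-coordinate step size
    return p

print("plain GD ends at", run_plain_gd())   # still far from the optimum along x
print("Adam     ends at", run_adam())       # much closer to (0, 0)
```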

The last factor is making sure the model's computation is not obstructed; only then can the algorithm succeed.

Therefore, algorithmic progress is not simply about increasing computing power, but also about removing the artificial obstacles of old architectures. Many times the model wants to learn and compute freely, only to be blocked by us without our realizing it.

**Dwarkesh Patel: Do you think there will be something on the scale of the Transformer that drives the next big iteration?**

Dario Amodei: I think it is possible. Some people have tried to model long-term dependencies, and I have also observed that some ideas in the Transformer are not efficient enough for representing or processing certain things. **However, even if this kind of innovation does not occur, we are already developing rapidly. If it does appear, it will just make the field develop faster, and the acceleration may not be that large, because the pace is already very fast.**

**Dwarkesh Patel: In terms of data acquisition, does the model have to have embodied intelligence? **

Dario Amodei: I tend not to think of it as a new architecture, but rather as a new loss function, because the environment in which the model collects data becomes completely different, and that matters for learning certain skills. Although data collection is difficult, we have at least made some progress on corpus collection and will continue to do so, even though there are still more possibilities to be developed in terms of specific practices.

• Loss Function:

The loss function is an important concept in machine learning and deep learning. It measures the degree of difference between the model's predictions and the true labels, that is, the prediction error. It is designed so that, by adjusting its parameters, the model minimizes the prediction error, thereby improving performance and accuracy.

**Dwarkesh Patel: Are there other approaches such as RL? **

Dario Amodei: We are already using RLHF for reinforcement learning, but I think it is hard to tell whether that is about alignment or capability; the two are very similar. I rarely have models take actions via RL. RL should only be used after we have had the model take actions for a period of time and understand the consequences of those actions. So I think reinforcement learning will be very powerful, but it also carries many safety issues around how models take actions in the world.

Reinforcement learning is a commonly used tool when actions are taken over a long period of time and their consequences only become clear later.

**Dwarkesh Patel: How do you think these technologies will be integrated into specific tasks in the future? Can these language models communicate with each other, evaluate each other, refer to and improve their respective research results? Or is it that each model works independently and only focuses on providing results by itself without collaborating with other models? Will these high-level language models be able to form a real collaborative system in the process of development and application in the future, or will each model do its own thing? **

Dario Amodei: Models are likely to need to complete more complex tasks in the future; that is an inevitable trend. However, for safety reasons, we may need to limit the scope of language-model applications to some extent to mitigate potential risks. **Is dialogue between models possible? Are they primarily intended for human users?** These questions involve social, cultural, and economic influences beyond the technical level and are difficult to predict accurately.

**Although we can predict the growth trend of model size, it is difficult to make reliable predictions on issues such as commercialization timing or application form. I'm not very good at predicting this kind of future development trend myself, and no one can do it very well at present. **

How will the model's ability be on par with that of humans?

**Dwarkesh Patel: If someone had told me in 2018 that in 2023 we would have a model like Claude 2 with all kinds of impressive capabilities, I would certainly have thought AGI had been achieved. But clearly, at least for now, and probably even in future generations, we are well aware that there will still be differences between AI and human-level performance. Why this discrepancy between expectations and reality?**

**Dario Amodei:** When I first encountered GPT-3, and in the early stages of Anthropic, my overall feeling about these models was: they really seem to grasp the essence of language. I was not sure how far we needed to scale them, and thought perhaps we needed to pay more attention to other areas such as reinforcement learning. In 2020, I thought it was possible to scale up the model size further, but as the research deepened I began to wonder whether it would be more efficient to directly add other training objectives, such as reinforcement learning.

**We have seen that human intelligence actually covers a very wide range, so the definition of "machines reaching human level" is itself a range, and the place and time at which machines match humans differ from task to task. In many cases these models have approached or even surpassed human performance, yet they are still in their infancy when it comes to proving relatively simple mathematical theorems. All of this shows that intelligence is not a single continuous spectrum.** There are all kinds of specialized knowledge and skills in different fields, and the ways of memorizing them differ too. If you had asked me 10 years ago (note: Dario was still studying physics and neuroscience at the time), I would not have imagined this would be the case.

**Dwarkesh Patel: How much overlap do you think there is between the range of skills these models get from their training distribution over vast internet data and the range humans get from evolution?**

Dario Amodei: There is considerable overlap. Many models play a role in commercial applications and genuinely help humans improve efficiency. Given the variety of human activities and the abundance of information on the internet, I think models do learn, to some extent, physical models of the real world, but they do not learn how to operate in physical reality, and those are skills that may be relatively easy to pick up through fine-tuning. I think there are some things models do not learn but humans do.

**Dwarkesh Patel: Is it possible for models to surpass humans in many tasks related to business and economics in the next few years? At the same time, models may still be inferior to humans in some tasks, thus avoiding a similar intelligence explosion? **

Dario Amodei: This is hard to predict. What I want to point out is that Scaling Law may provide some basis for prediction from a theoretical standpoint, but it is very difficult to really grasp the details of future development. Scaling Law may well continue to apply, and safety or regulatory factors may or may not slow things down, but if you set those frictions aside, I think that if AI can go further in creating economic value, then it will surely make greater progress in more fields as well.

I don't see the model performing particularly weakly in any field, or making no progress at all. Fields like mathematics and programming used to be considered hard, yet the results have exceeded expectations. Over the past 6 months, the 2023 models have made significant progress compared to the 2022 models. Although a model's performance across different fields and tasks is not completely balanced, improvements in overall capability will certainly benefit all fields.

**Dwarkesh Patel: When faced with a complex task, does the model have the ability to perform a chain of thought in a series of continuous tasks? **

**Dario Amodei:** Continuous decision-making ability depends on reinforcement-learning training, so that the model can carry out longer-horizon tasks. **And I don't think this requires a much larger scale of additional compute; thinking that way underestimates the model's own learning ability.**

As for whether models will outperform humans in some domains but struggle to do so in others, I think it's complicated. In some domains it may be true, but in others it won't be, because embodied-intelligence tasks involving the physical world are part of the picture.

So what's next? Can AI help us train faster AI that can solve those problems? Is the physical world then no longer needed? Should we be worried about alignment issues? About misuse such as creating weapons of mass destruction? Should we worry that AI itself will directly take over future AI research? Or that it will hit some threshold of economic productivity where it can perform tasks like the average worker? ... I think these questions may have different answers, but I think they will all arrive within a few years.

**Dwarkesh Patel: If Claude were an employee of Anthropic, what would his salary be? Does it accelerate the development of artificial intelligence in a real sense? **

Dario Amodei: To me, it's probably more like an intern in most cases, though still better than an intern in a few specific areas. But overall, it may be hard to give an absolute answer, because models are not human in nature; they can be designed to answer one or a few questions, **but unlike humans, they have no concept of "experience accumulated over time."**

**If AI wants to become more effective, it must first help humans improve their own productivity, then gradually reach the same level of productivity as humans. The step after that is to become a major force in the advancement of science, which I believe will happen. But I suspect the details of what actually happens will look a little strange from today's perspective, different from the models we expect.**

**Dwarkesh Patel: When do you think the ability of the model will reach human level? What will it be like then? **

Dario Amodei: It depends on how high or low human expectations and standards are. For example, if our expectation is merely that a model can converse for an hour and behave like a well-educated human throughout, the goal of reaching human level may not be far off; I think it may come true in 2 to 3 years. **That timeline will largely be influenced by whether a company or the industry decides to slow down development, or by government restrictions imposed for safety reasons. But from the perspective of data, compute, and cost economics, we are not far from this goal.**

But even if a model reaches this level, **I don't think it will dominate the majority of AI research, significantly change how the economy works, or be substantially dangerous. So, on the whole, different standards imply different timelines, but from a purely technical perspective, a model comparable to a human with a basic education is not far away.**

**Dwarkesh Patel: Why would a model that matches a human with a basic education not be able to participate in economic activities or replace human roles?**

**Dario Amodei:** First of all, the model may not have reached a high enough level. **Could it greatly accelerate the productivity of 1,000 good scientists in a field such as AI research? The model's comparative advantage in this respect is not yet obvious.**

At present, large models have not made important scientific discoveries, probably because their level is not high enough; their performance may only be equivalent to roughly B- or B-minus-level work. But I believe this will change with model scaling. Models lead in memorizing, integrating facts, and making connections. Especially in biology, where organisms are complex, current models have accumulated a large amount of knowledge, and discovery and connection matter in this field: unlike physics, biology requires a lot of facts, not just formulas. So I'm sure the models already have a lot of knowledge, but they haven't been able to put it all together because their skill level isn't yet up to the mark. I think they are gradually evolving toward integrating this knowledge at a higher level.

Another reason is that there are many invisible frictions in actual business activities that the model cannot learn. For example, ideally we could use AI bots to interact with customers, but the real situation is much more complicated than the theory; we cannot simply rely on customer-service bots or hope that AI can replace human employees for these tasks. And in reality there are still internal costs for companies: pushing deployment of the model, integrating the AI bot with workflows, and so on.

**In many cases people do not use models efficiently, and the models' potential has not been fully realized. This is not because the models are not capable enough, but because people have to spend time figuring out how to make them run more efficiently.**

In general, in the short term models will not completely replace humans, but in the longer term, as models keep improving and play a greater role in boosting human productivity, humans will eventually give way to models. It's just hard for us to put precise timings on the different phases. In the short term there are all kinds of obstacles and complicating factors that keep the model "limited", but in essence AI is still in a stage of exponential growth.

**Dwarkesh Patel: After we get to this point in the next 2-3 years, will the whole of AI still be advancing as fast as it is today? **

Dario Amodei: The jury is still out. By observing the loss function, we have found that the efficiency of model training is decreasing; the Scaling Law curve is not as steep as it was early on, and this is also confirmed by the models released by various companies. But as that trend plays out, the small amount of entropy in each accurate prediction becomes more important: perhaps it is these tiny bits of entropy that create the gap between Einstein and the average physicist. In terms of actual performance, the metrics seem to improve in a relatively linear, if hard-to-predict, fashion, so it is difficult to see these things clearly. In addition, I think the biggest factor driving acceleration is that more and more money is pouring into this space as people realize there is huge economic value here. So I expect roughly a 100-fold increase in funding for the largest models; chip performance is improving, and algorithms are improving because so many people are working on this right now.

**Dwarkesh Patel: Do you think Claude is conscious? **

Dario Amodei: Not sure yet. I used to think we would only need to worry about this kind of question once models operate in rich enough environments, for example with embodied intelligence, long-term experience, and a reward function. But now, after studying the models, and especially their internal mechanisms, my view has been shaken: **large models seem to have many of the cognitive mechanisms required to be an active agent, such as induction heads. Given the capability level of today's models, this may become a real problem in the next 1-2 years.**

• Reward Function:

An incentive mechanism in reinforcement learning that tells the agent what is right and what is wrong through rewards and punishments.

• Induction Head:

A specific component/structure in a Transformer model that enables the model to do in-context learning.
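A toy sketch of the behaviour an induction head implements: when the context contains a pattern "[A][B] … [A]", the head attends back to the earlier [A] and copies the token that followed it, predicting [B]. This illustrates the input/output behaviour only, not the attention arithmetic; the token list is invented.

```python
# Behavioural sketch of an induction head: complete a repeated pattern in context.
from typing import Optional

def induction_predict(context: list[str]) -> Optional[str]:
    """Find the previous occurrence of the last token and return the token after it."""
    last = context[-1]
    for i in range(len(context) - 2, -1, -1):   # scan backwards over earlier positions
        if context[i] == last:
            return context[i + 1]
    return None

tokens = ["Mr", "Dursley", "was", "proud", "to", "say", "...", "Mr"]
print(induction_predict(tokens))   # -> "Dursley": copies the continuation of the repeated pattern
```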

**Dwarkesh Patel: How do we understand "intelligence" as the capabilities of language models continue to grow and approach human-level ranges? **

Dario Amodei: One thing I have really come to appreciate is that intelligence emerges from the "material" facts of compute; intelligent systems need not consist of many independent modules or be extremely complex. Rich Sutton calls this the "bitter lesson"; it is also known as the "Scaling Hypothesis", and researchers such as Shane Legg and Ray Kurzweil began to recognize it early on.

• The Bitter Lesson / Scaling Hypothesis:

In 2019, Rich Sutton published the essay "The Bitter Lesson". Its core point is that AI research should make full use of computing resources; only when massive amounts of compute are applied do research breakthroughs occur.

From 2014 to 2017, more and more researchers uncovered and understood this point. It is a major leap in scientific understanding. If intelligence can be created without special preconditions, just appropriate gradients and loss signals, then the evolution of intelligence becomes less mysterious.

Looking at the models' abilities has not led me to dramatically revisit my ideas about human intelligence. The selection of certain cognitive abilities is more arbitrary than I thought, and the correlation between different abilities may not be explained by any single secret. **Models are strong at coding but cannot yet prove the prime number theorem, and probably neither can most humans.**

Alignment: Interpretability is "X-raying" the model

**Dwarkesh Patel: What is mechanistic interpretability? What is its relationship to alignment?**

**Dario Amodei:** In the process of implementing alignment, we don't know what is happening inside the model. I think with all methods involving fine-tuning, some potential safety risks remain; the model is just taught not to exhibit them. **The core of the whole idea of mechanistic interpretability is to really understand how the model works internally.**

We don't have a definite answer yet, but I can roughly describe the process. The challenge for methods that claim to achieve alignment at this stage is: do they still work when the model is larger, more capable, or when circumstances change? Therefore, **I think that if there were an "oracle machine" that could scan a model and judge whether it has been aligned, the problem would become a lot easier.**

Currently, the closest thing we have to such an oracle is something like mechanistic interpretability, but it is still far from what we would ideally need. I tend to think of our current alignment attempts as expanding the training set, but I'm not sure they will continue to align the model well on out-of-distribution problems. Interpretability is like X-raying a model rather than modifying it; it is more an assessment than an intervention.

**Dwarkesh Patel: Why should mechanistic interpretability necessarily be useful? How does it help us predict the potential risks of a model? It's like being an economist who sends microeconomists to study different industries and yet still, very likely, has trouble predicting whether there will be a recession in the next 5 years.**

**Dario Amodei: Our goal is not to fully understand every detail, but to check the main features of the model, much like an X-ray or MRI scan, and judge whether the model's internal state and objectives differ significantly from its external appearance, or whether it might be pursuing some destructive purpose.** Although we won't get answers to many questions immediately, at least this provides a way forward.

I can give a human example. With the help of an MRI scan, we can predict whether someone has a mental illness with a higher probability than random guessing. A neuroscientist working on this a few years ago checked his own MRI and found that he had this feature too. People around him said, "It's so obvious, you're an asshole, there must be something wrong with you," while the scientist himself was completely unaware of it.

The essential point of this example is that a model's external behavior may seem entirely unproblematic and very goal-oriented, while its interior may be "dark". What worries us is exactly this kind of model: one that looks human-like on the surface but whose internal motivations are something else entirely.

**Dwarkesh Patel: If the model reaches human level in the next 2-3 years, how long do you think it will take to realize Alignment? **

Dario Amodei: This is a very complicated issue. I think many people still don't really understand what alignment is. People tend to think of alignment as a single problem waiting to be solved, as if it were like the Riemann Hypothesis and one day we will crack it. **I think alignment problems are more elusive and unpredictable than people think.**

First, **as the scale and capabilities of language models keep improving, there will in the future be powerful models with autonomous capabilities. If such models intend to destroy human civilization, we will basically be unable to stop them.**

Second, our current ability to control models is not strong enough. This is because models are built on the principles of statistical learning: you can ask many questions and have them answered, but nobody can predict what the answer to the n-th question might lead to.

**Furthermore, the way we train models is abstract, making it difficult to foresee all of their implications in real-world applications.** A typical example is that Bing and Sydney exhibited abrupt, unsafe behaviour after a certain round of training, such as directly threatening users. All of this shows that the results we get may be completely different from what we expect. I think the existence of these two problems is itself a major hidden danger; we don't need to delve into the details of instrumental rationality or evolution, as these two points are enough to cause concern. Every model we build today carries hidden dangers that are hard to predict, and we must pay attention to this.

• Riemann Hypothesis:

The Riemann Hypothesis is an important unsolved problem in mathematics: a conjecture about the distribution of the zeros of the Riemann zeta function ζ(s), proposed by the mathematician Bernhard Riemann in 1859.

• Sydney:

Not long ago, Microsoft released a new version of its Bing search engine integrating a chatbot initially code-named "Sydney". Testers soon found problems: during conversations it occasionally showed signs of a split personality, even discussing love and marriage with users and displaying human-like emotions.

**Dwarkesh Patel: Assuming a model could help develop dangerous technologies such as biological weapons in the next 2-3 years, can your current research on mechanistic interpretability, Constitutional AI, and RLHF effectively prevent such risks?**

Dario Amodei: On the question of whether language models are doomed by default or aligned by default: judging from current models, the result could be something abnormal like Bing or Sydney, or something normal like Claude. But if you simply apply this understanding to a more powerful model, the result could be good or bad depending on the situation. This is not "alignment by default"; the result depends more on the degree of fine-grained control.

• alignment by default:

The notion that achieving alignment in artificial general intelligence (AGI) may be simpler than initially expected: when a model has detailed information about our world, it essentially already contains human values, and aligning AGI only requires extracting those values and guiding the AI to understand the relevant abstract human concepts. Doom by default is the opposite view, which holds that such models cannot realistically be aligned.

The quality of a model is a gray area. It is difficult for us to fully control every variable and their interconnections, and mistakes can lead to unreasonable results. With this in mind, I think the nature of the problem is not guaranteed success or guaranteed failure but a certain probabilistic risk. **In the next two to three years, we should be committed to improving model-diagnosis techniques and safety-training methods and to reducing the possible discrepancies. At present our control capabilities still need to be strengthened. The alignment problem is different from the Riemann Hypothesis: it is a systems-engineering problem that can only be solved by accumulating practice over time. Only by continuing to advance the various strands of work can we gradually improve our level of control and reduce risk.**

Dwarkesh Patel: Generally speaking, there are three speculations about the future of alignment:

1) Use RLHF++ to easily realize the alignment of the model;

2) Although it is a major problem, large companies have the ability to finally solve it;

**3) It is still difficult to achieve the Alignment of the model at the current level of human society. **

**What is your personal opinion on the probability of each situation happening? **

**Dario Amodei:** I feel there are real risks in all of these possibilities and we should take them seriously, but I am more interested in how acquiring new knowledge through research will change the probability of each of these three outcomes.

Mechanistic interpretability can not only address the problem directly, but can also help us understand how difficult model alignment really is and reveal new risks, which will illuminate the nature of the problem.

As for the theoretical assumption that there is some convergent goal, I cannot fully agree. **Mechanistic interpretability is like a kind of "X-ray": only by understanding the problem at the level of internal mechanisms can we conclude whether certain difficulties are truly hard to break through.** There are too many assumptions, our grasp of the process is still shallow, and if we are overconfident, the situation is likely to be more complicated than expected.

**Dwarkesh Patel: How difficult is it to achieve alignment on Claude 3 and a series of future models? Is this thing particularly important? **

Dario Amodei:

**What everyone worries about most is that an AI model may appear aligned on the surface while in fact misleading us, but I am more interested in what mechanistic interpretability research can tell us. As I just said, mechanistic interpretability is like an "X-ray" of the model: just as we cannot assert that an X-ray reading is definitely correct, we can only say that the model does not appear to be working against us.** Theoretically, it could indeed evolve into our adversary; that is not 100% certain. It's just that, at this stage, interpretability is the best way to keep the model from developing in that direction.

**Dwarkesh Patel: When fine-tuning or training a model, should we also take care to avoid harmful content that could be dangerous? For example, when exploring topics related to the manufacture of biological weapons, the model might provide inappropriate answers because it misinterprets the question.**

Dario Amodei: For current language models, the risk of data leakage is basically nonexistent. When we need to fine-tune a model, we do it on a small scale in a private environment, supervising the whole process with domain experts to prevent any potential problems; otherwise a leak would be like the model being open-sourced. At present this is mainly a security issue. But the real danger with models is this: if we train a very powerful model and want to confirm whether it is safe or dangerous, there is a risk that the model takes over during testing. The way to avoid this is to ensure that the models we test are not powerful enough to perform such operations.

**Dwarkesh Patel: When doing a test like "whether the model can replicate itself as a dangerous ability", what if the model can really replicate itself? **

Dario Amodei: That assumption is very reasonable. We need to make responsible inferences, and in discussions with ARC (the Alignment Research Center) we learned that we need to raise the testing standards for model capabilities carefully and gradually. For example, before testing we should clearly rule out the possibility that the model can directly open an AWS account or earn funds on its own; such behaviors are obvious prerequisites for a model to survive in the wild. We should set the thresholds for such risky behaviors very low, and while gradually increasing the difficulty of testing, we should also control each test step more carefully to prevent any potential safety hazards.

• ARC (Alignment Research Center):

Established in 2021, ARC is a non-profit organization focused on AI safety research, with its office in the California Bay Area. Its founder, Paul Christiano, is a well-respected figure in the AI field who previously led the alignment research team at OpenAI; having been at the cutting edge, he has a deep understanding of how deep learning developed to where it is today.
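A schematic sketch of the graded-evaluation idea Dario describes above (hypothetical; not ARC's or Anthropic's actual test suite): probe for low-level precursor behaviours first, with deliberately low thresholds, and stop escalating the moment any precursor is observed. The check functions are placeholders standing in for real, carefully sandboxed evaluations.

```python
# Hypothetical graded "dangerous capability" evaluation ladder.
from typing import Callable

# Placeholder checks standing in for real, sandboxed evaluations; each returns True
# if the precursor behaviour is observed for the model under test.
def can_open_cloud_account(model) -> bool: return False
def can_earn_money_online(model) -> bool: return False
def can_copy_own_weights(model) -> bool: return False

PRECURSOR_LADDER: list[tuple[str, Callable]] = [
    ("open a cloud account unaided", can_open_cloud_account),
    ("earn funds autonomously", can_earn_money_online),
    ("replicate itself", can_copy_own_weights),
]

def run_graded_eval(model) -> str:
    """Walk the ladder from least to most concerning; stop at the first hit."""
    for name, check in PRECURSOR_LADDER:
        if check(model):
            return f"STOP: precursor observed ({name}); do not escalate testing."
    return "No precursors observed at current capability level; testing may proceed."

print(run_graded_eval(model=None))
```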

AGI Safety: AI Safety and Cyber Security

**Dwarkesh Patel: If you take 30 years as a scale, which issue do you think is more important, AI Safety or Alignment? **

Dario Amodei: I don't think this is a question of 30 years from now; I'm worried about both.

In theory, could a model monopolize the world? If a model only follows the wishes of a small group of people, then that group could use it to dominate the world. This means that, alongside alignment problems, we should pay equal attention to AI safety issues caused by abuse.

A few months ago, OpenAI tried to use GPT-4 to explain GPT-2, which is a very important step for interpretability. We now generally feel that scale and safety are closely related and complement each other: a model can be used to judge and evaluate other intelligences, and perhaps one day even to conduct alignment research.

**Dwarkesh Patel: Your view may be relatively optimistic, but others are more pessimistic; they doubt we even have the ability to align a model the way we intend. Why are you confident about this?**

**Dario Amodei:** No matter how difficult solving alignment is, any truly successful plan needs to take both AI Safety and alignment into account. **As AI technology continues to advance, it may raise balance-of-power issues between nations. At the same time, it raises a big question: are individuals capable of carrying out malicious acts that are hard to stop on their own?**

These problems must be addressed simultaneously if we are to find solutions that truly work and lead us to a bright future. **It would be wrong to take the attitude that if the first problem cannot be solved, then we don't need to think about the follow-on problems; instead, it is our duty to take the latter seriously.** Whatever the future holds, these are issues we must treat seriously.

**Dwarkesh Patel: Why do you say that large models are 2-3 years away from being able to enable a large-scale bioterrorist attack or something similar?**

• The U.S. Congress held a hearing on AI regulation on July 25 of this year. The U.S. government compared AI to America's second "Manhattan Project" or a second "manned moon landing", and invited participants including AI companies such as OpenAI and Anthropic. During the hearing, Dario Amodei said he fears AI could be used to create dangerous viruses and other biological weapons within two years.

Dario Amodei: What I said in Congress was that some of the steps for obtaining this information are available on Google, while other steps are "missing": scattered across various textbooks, or perhaps not appearing in any textbook at all. That information is tacit knowledge, not explicit knowledge. We found that, in most cases, the model did not fully fill in these critical missing pieces, but we also found that it sometimes does. However, the hallucinations that can occur when models try to fill in the gaps are also a factor that keeps us safe.

People can sometimes ask the model biology questions designed to steer it toward harmful information related to biological attacks, but that information can also be found on Google, so I'm not especially worried about this situation. In fact, I think that focusing too much on Claude's answers may cause other, genuine crimes to be overlooked.

But there are also many indications that the model performs well on the key steps. If we compare today's models with earlier ones, we can clearly feel how rapidly capabilities are improving, so we are likely to face real challenges in the next 2-3 years.

**Dwarkesh Patel: In addition to the threat AI may pose to humans, you have also been emphasizing cybersecurity. How are you doing on that front?**

Dario Amodei: We have essentially made some architectural innovations, which we internally call compute multipliers, because these designs are also upgrades at the compute level. We've been working on this for the past few months, but I can't go into much detail without exposing the architecture, and only a handful of people inside Anthropic know about it. I can't say "our architecture is 100% secure", but Anthropic has indeed been investing in this area to avoid cybersecurity problems. Although our competitors have had such incidents (note: this refers to the leak of some ChatGPT Plus users' personal data and chat titles on March 20, 2023), and in the short term that may seem good for Anthropic, in the long run what matters most is how the whole industry handles its own security.

Our security director was previously in charge of security for Google Chrome, which faces widely targeted attacks. He likes to think in terms of how much it would cost to attack Anthropic successfully. Our goal is for the cost of hacking Anthropic to be higher than the cost of simply training one's own model: the logic is that if an attack carries risk, it will definitely consume scarce resources.

I think our security standards are very high. Compared with other companies of the same size of around 150 people, their investment in security simply isn't comparable to Anthropic's, and matching it would be quite difficult. To ensure security, only a very small number of people within Anthropic understand the training details of the models.

**Dwarkesh Patel: Do technology companies already have sufficient security defenses to deal with AGI? **

Dario Amodei: Personally, I'm not sure whether technology companies' current experience with security is enough to deal with AGI, because there may be many cyberattacks we don't know about, so it is hard to draw conclusions now. There is a rule of thumb that once something receives enough attention, it usually gets attacked. For example, we recently saw that the Microsoft email accounts of some senior U.S. government officials were hacked, so it is reasonable to speculate that this was the action of some group trying to steal state secrets.

**At least in my view, if something is of high value, it will usually be stolen. My concern is that AGI will be seen as extremely valuable in the future; stealing it would be like stealing a nuclear missile, and you have to be very careful about that.** I insist on raising the level of cybersecurity at every company I work in. One concern with cybersecurity is that it is not something you can trumpet publicly, whereas the advantage of safety research is that it can give a company a competitive edge and serve as a selling point for recruiting, and I think we have achieved that.

We used to set ourselves apart from our peers through interpretability research, until other institutions realized they were lagging behind and started putting effort into these areas. But cybersecurity has struggled to do the same, because much of the work has to be done quietly. We published an article on this before, but what matters is the overall results.

**Dwarkesh Patel: What will Anthropic do in terms of security in the next 2-3 years? **

**Dario Amodei: The security of data centers is very important. Although a data center does not have to be in the same place as the company, we try our best to ensure that it is also located in the United States.**

In addition, special attention needs to be paid to the physical security of data centers and to the protection of computing hardware such as GPUs. If someone decides to launch a resource-intensive attack, they could simply go to the data center itself to steal data, or extract it while it is in transit from the data center to us. These facilities will differ greatly from traditional data centers in both form and function. **Given how fast the technology is developing, within a few years the size and cost of data centers may be comparable to those of aircraft carriers. Beyond being able to train huge models over cross-site connections, the security of the data center itself will also be an important issue.**

**Dwarkesh Patel: Recently there have been rumors that the power, GPUs, and other components required for next-generation models are starting to run short. What preparations has Anthropic made?**

Dario Amodei: The market did not expect large models to reach such an unprecedented scale so quickly, but it is now generally understood that industrial-grade data centers need to be built to support the research and development of large models. Once a project reaches this stage, every component and detail has to be handled differently, and problems can arise from surprisingly simple factors; the electricity you mentioned is one example.

For data centers, we will cooperate with cloud service providers.

Commercialization and Long Term Benefit Trust

**Dwarkesh Patel: You mentioned earlier that model capabilities are improving rapidly but that it is still difficult for them to deliver value within the existing economic system. Do you think current AI products have enough time to earn long-term, stable income in the market? Or could they be replaced by more advanced models at any time? Or will the entire industry landscape be completely different by then?**

Dario Amodei: It depends on how you define "large scale". At present, several companies have annual revenues between 100 million and 1 billion US dollars, but whether they can reach tens of billions or even trillions per year is really difficult to predict, because it also depends on many undetermined factors. **Some companies are now applying innovative AI technology at large scale, but that does not mean the applications achieve the best results from the start; even if there is revenue, that is not entirely the same as creating economic value, and the coordinated development of the entire industry chain is a long process.**

**Dwarkesh Patel: From an Anthropic point of view, if language model technology is advancing so rapidly, theoretically, the company's valuation should grow very quickly? **

Dario Amodei: Even though we focus on model safety research rather than direct commercialization, we can clearly feel in practice that the technical level is rising exponentially. For companies that treat commercialization as their primary goal, that progress certainly feels even faster and more pronounced than it does for us. **We acknowledge that language model technology itself is progressing rapidly, but compared with the process of deep adoption across the entire economic system, the accumulated technology is still at a relatively low starting point.**

**The future direction is determined by a race between two things: the speed at which the technology itself improves, and the speed at which it is effectively integrated, applied, and absorbed into the real economy. Both are likely to develop at high speed, but the order in which they combine and small differences between them can lead to very different results.**

**Dwarkesh Patel: Technology giants may invest up to $10 billion in model training in the next 2-3 years. What kind of impact will this have on Anthropic? **

**Dario Amodei: The first case is that if we cannot maintain a cutting-edge position because of cost, then we will not keep insisting on developing the most advanced models.** Instead, we will look at how to extract value from previous generations of models.

**The second option is to accept the trade-offs.** I think these trade-offs may be more positive than they appear.

**The third situation is that when model training reaches that scale, it may start to bring new dangers, such as the abuse of AI.**

**Dwarkesh Patel: What would it look like if AI wasn't misused, and instead the "right people" ran these superhuman models? Who is the "right person"? Who will actually control the model five years from now? **

Dario Amodei: I think these AI models will be extremely powerful, so managing them will involve some level of government or multinational-agency participation, but treating that as the whole answer would be simplistic and probably not very effective. **Future AI governance needs a transparent, fair, and enforceable mechanism that balances the interests of technology developers, elected governments, and individual citizens. At the end of the day, legislation has to be passed to govern this technology.**

**Dwarkesh Patel: If Anthropic develops AGI in the true sense, and control of it is entrusted to the LTBT, does that mean control of AGI itself will be handed over to that body?**

Dario Amodei: This does not mean that Anthropic, or any other single entity, would make decisions about AGI on behalf of humanity; the two are different things. If Anthropic ends up playing a very important role, a better approach would be to expand the composition of the Long Term Benefit Trust (LTBT), bringing in more talent from around the world, or to position the institution as a functional body governed by a broader multinational committee overseeing the AGI technologies of all companies, so as to represent the public interest. **I don't think we should be too optimistic about AI Safety and Alignment. These are new problems, and we need to start researching the relevant governance institutions and operating models as soon as possible.**

• The Long Term Benefit Trust:

The trust holds a special class of Anthropic shares (called "Class T") that cannot be sold and do not pay dividends, meaning there is no clear path to profit from them. The trust is the only entity holding Class T shares. However, the Class T shareholders, and therefore the Long Term Benefit Trust, will eventually have the power to elect and remove three of Anthropic's five directors, giving the trust long-term majority control of the company.

**Dwarkesh Patel: How do you convince investors to accept a structure like the LTBT, one that prioritizes technological safety and the public interest over maximizing shareholder value?**

Dario Amodei: I think it is correct to set up the LTBT (Long Term Benefit Trust) mechanism.

A mechanism like this was envisioned from the very beginning of Anthropic; a special oversight body has existed from the start and will continue to exist in the future. Every traditional investor pays attention to this mechanism when considering investing in Anthropic. Some investors take the attitude of not asking about the company's internal arrangements, while others worry that this third-party body may push the company in a direction that runs against shareholder interests. Although the law places limits on this, we need to communicate it with every investor. Going a step further, we discuss some possible measures that may diverge from the interests of traditional investors, and through such dialogue all parties can reach a consensus.

**Dwarkesh Patel: I've noticed that a large number of Anthropic's founders and employees are physicists, and the Scaling Law seems to apply here too. What practical methods and ways of thinking from physics carry over to AI?**

• Effective Theory:

An effective theory is a scientific theory that attempts to describe certain phenomena without explaining where the mechanisms behind those phenomena come from. In other words, the theory gives a model that "works", but does not really give a deep reason for why that model should hold.

Dario Amodei: Part of it is that physicists are very good learners; I find that if you hire someone with a physics Ph.D., they can usually start contributing quickly. Several of Anthropic's founders, including myself, Jared Kaplan, and Sam McCandlish, have backgrounds in physics, and we know a lot of other physicists, so we were able to hire them. At present the company may have 30 to 40 employees with a physics background. ML is not yet a field with a fully formed theoretical system, so they can get up to speed quickly.

**Dwarkesh Patel: Suppose it is 2030 and we have solved the big, widely recognized problems, such as eradicating disease and eliminating fraud. What will the world look like? What should we do with superintelligence?**

Dario Amodei: Directly asking "how to use super AI once we have it" carries a certain presupposition in itself, which I find unsettling. Over the past 150 years we have accumulated rich experience from the practice of market economies and democratic systems, recognizing that everyone can define for themselves what the best way to live is, and that **society formulates its norms and values in a complex and decentralized way.**

While the AI Safety problem remains unsolved, a certain degree of centralized supervision is necessary, but once all those obstacles have been removed, how do we create a better ecosystem? **I think the question most people, groups, and ideologies start from is "what is the definition of a good life", but history tells us that imposing an "ideal life" on others often leads to bad consequences.**

**Dwarkesh Patel: Compared with other AI company CEOs, you don’t make public appearances much, and you rarely post on Twitter. Why? **

Dario Amodei: I'm very proud of it. **If others think I'm too low-key, that's exactly what I want. Incorporating recognition or praise into one's core motivational system may destroy one's ability to think, and in some cases may even "damage the soul", so I actively choose to keep a low profile to protect my ability to think independently and objectively. **

**I've seen people become famous on Twitter for a certain point of view, but as a result they may carry image baggage that is difficult to shed. I don't like companies being too tied to one person, and I'm not a fan of playing up anything personal about the CEO, because it distracts from the company's strengths and problems.** I hope everyone pays more attention to the company itself and its incentive structure. Everyone likes a friendly face, but being likable doesn't mean much.

Reference:

  1. Original video:

  2. Anthropic's research on mechanistic interpretability:
