The final chapter in artificial intelligence and programming

Original source: CSDN

Image source: Generated by Unbounded AI

Earlier this year, Matt Welsh announced the end of programming. He wrote in Communications of the ACM:

I believe the conventional idea of "writing a program" is headed for extinction, and indeed, for all but very specialized applications, most software, as we know it, will be replaced by AI systems that are trained rather than programmed. In situations where one needs a "simple" program (after all, not everything should require a model of hundreds of billions of parameters running on a cluster of GPUs), those programs will themselves be generated by an AI rather than coded by hand.

A few weeks later, in a talk, Welsh expanded on his obituary. It's not just the art of programming that is headed for the grave, but computer science as a whole. All of computer science is "doomed." (The image below is a screenshot from that talk.)

The bearer of this sad news does not seem overcome with grief. Although Welsh has made a career of teaching and practicing computer science (at Harvard, Google, Apple, and elsewhere), he seems eager to move on. "Writing code sucks anyway!" he declares.

I am not so sanguine about a post-programming future. First, I am skeptical. I don't believe machines have crossed the threshold where they can learn to solve interesting computational problems on their own, and I'm not convinced we are close to that point or even moving in the right direction. Moreover, if it turns out I'm wrong, my impulse is not to acquiesce but to resist. For one thing, I do not welcome our new AI overlords. Even if they prove to be better programmers than I am, I will keep using my code editor and compiler, thank you. Programming sucks? For me it has long been a source of joy and enlightenment. I also find it a valuable tool for understanding the world: I'm never sure I understand an idea until I can reduce it to code. To benefit from that learning experience, I have to actually write the program, not just utter a few magic words and summon a genie from an AI lamp.

Large Language Models

The idea that programmable machines might write their own programs is deeply rooted in the history of computing. Charles Babbage hinted at the possibility as early as 1836, in discussing his plans for the Analytical Engine. When Fortran was introduced in 1957, its formal name was "The FORTRAN Automatic Coding System." Its stated goal was for the computer to "code problems for itself and produce as good programs as human coders (but without the errors)."

Fortran did not eliminate the craft of programming (or its errors), but it made the process less tedious. Later languages and other tools brought further improvements. And the dream of fully automatic programming has never died. Machines would seem to be better suited to programming than most people are. Computers are methodical, rule-bound, fastidious, and literal, all traits that (rightly or wrongly) are associated with expert programmers.

Ironically, the AI systems now poised to take over programming tasks are themselves oddly un-computer-like. Their personalities are more Deanna Troi than Commander Data. Logical consistency, causal reasoning, and careful attention to detail are not their strong suits. They have moments of incandescent brilliance, when they seem to be pondering deep thoughts, but they are also capable of astonishing failures: blatant, brazen lapses of reasoning. They remind me of an old quip: to err is human, but to really foul things up requires a computer.

The latest AI systems are called large language models (LLMs). Like most other recent AI inventions, they are built on artificial neural networks, multilayered structures inspired by the architecture of the brain. The nodes of the network are analogous to biological neurons, and the connections between nodes act as synapses, the junctions where signals pass from one neuron to the next. Training the network adjusts the strengths, or weights, of the connections. In a language model, training is done by feeding enormous amounts of text through the network. When the process is complete, the connection weights encode detailed statistics of the linguistic features of the training text. In the largest models, the number of weights is 100 billion or more.

In this context, the term model can be misleading. It does not refer to a scale model or miniature, like a model airplane. Rather, it refers to a predictive model, like the mathematical models common in the sciences. Just as a model of the atmosphere predicts tomorrow's weather, a language model predicts the next word in a sentence.

The most famous large language model is ChatGPT, which was released to the public last fall and attracted enormous attention. The abbreviation GPT, gee-pee-tee: my tongue keeps tripping over those three rhyming syllables. Other AI products have been given cuter names, such as Bard, Claude, and Llama; I wish I could rename GPT in the same spirit. I would call it Geppetto, which echoes the same pattern of consonants. GPT stands for Generative Pre-trained Transformer; the Chat version of the system is fitted with a conversational user interface. ChatGPT was developed by OpenAI, which was founded in 2015 to free AI from the control of a handful of wealthy tech companies. OpenAI has pursued this mission so successfully that it has itself become a wealthy tech company.

ChatGPT's way with words, its fluency in English and other languages, is both admirable and alarming. The chatbot can mimic famous authors, tell jokes, write love letters, translate poetry, compose spam, "help" students with their homework, and concoct misinformation for political mischief. For better or worse, these linguistic abilities represent a startling technological advance. Computers that once struggled to construct an intelligible sentence have suddenly become masters of words. What GPT says may or may not be true, but it is almost always well put.

Shortly after ChatGPT was released, I was surprised to learn that its mastery of language extends to programming languages. The model's training set evidently includes not only several natural languages but also large amounts of program source code from public repositories such as GitHub. Drawing on this resource, GPT can write new programs on command. I found this surprising because computers are so picky and unforgiving about their input. A human reader can usually make sense of a sentence despite typos and other small errors, but give a computer a program with a single misplaced comma or an unmatched parenthesis and it spews garbage. A language model whose inner workings are statistical or probabilistic seemed unlikely to maintain the required precision over more than a few lines.

I was wrong about this, too. A key innovation of large language models, the attention mechanism, solves the problem. When I began experimenting with ChatGPT myself, I quickly found that it can indeed generate programs free of careless syntax errors.

But other problems followed.

Climbing the Word Ladder

When you sit down to chat with a machine, you immediately face an awkward question: What shall we talk about? I was looking for a challenge that would fairly test ChatGPT's programming abilities. I wanted a problem that could be solved by computational means but that did not require much arithmetic, which is a known weakness of large language models. I chose the word-ladder game, invented by Lewis Carroll 150 years ago and analyzed in depth by Donald E. Knuth in the 1990s.

In the transcripts below, my side of each exchange is marked BR; the rosette is the OpenAI logo, marking ChatGPT's responses.

As I watched these sentences unfold on the screen, the chatbot typing them out word by word, a little jerkily, as if pausing to collect its thoughts, I was immediately struck by the system's command of English. GPT lays out all the essential features of the word ladder in plain, forceful prose: it's a game or puzzle in which you go from word to word by changing one letter at a time, every rung of the ladder must be an English word, and the goal is to find the shortest possible sequence from the starting word to the target word. I couldn't have explained it better myself. Most helpful of all is the worked example transforming COLD into WARM.

It's not just individual sentences that create the impression of linguistic competence. The sentences are organized into paragraphs, and the paragraphs are strung together into a coherent discourse. Impressive!

Also notable is the bot's ability to cope with vague and sloppy input. My initial query was phrased as a yes-or-no question, but ChatGPT correctly interpreted it as a request: "Tell me what you know about word ladders." My second instruction omitted any typographic cue indicating that LEAD and GOLD should be understood as words rather than metals. The chatbot might have offered me a recipe for alchemy; instead it supplied the missing quotation marks.

Setting aside all this linguistic and rhetorical sophistication, however, what I really wanted to test was the program's ability to solve word-ladder problems. Both examples in the transcript above can be found on the web, so they quite likely appear in ChatGPT's training data. In other words, the large language model may simply have memorized the solutions rather than constructed them. So I submitted a harder assignment:

At first glance, ChatGPT seems to have triumphed again, solving a puzzle I'm fairly sure it had never seen before. But look closer. MARCH -> MARIS requires a two-letter substitution, as does PARIL -> APRIL. The status of MARIS and PARIL as "valid words" might also be questioned. I complained:

Whoa! The bot offers a glib confession and apology, but the "corrected" ladder is loonier than ever. It's as if we were playing Scrabble with Humpty Dumpty, who declares "I say APRCHI is a word, so it's a word!" and then scatters the tiles.

This was not an isolated failure. Every attempt I made to solve a word ladder with ChatGPT went off the rails, though not always in the same way. In one case I asked for a ladder from REACH to GRASP. The AI expert came up with this solution:

REACH -> TEACH -> PEACH -> PEACE -> PACES -> PARES -> PARSE -> PARSE -> PARSE -> PARSE -> PARKS -> PARKS -> PARKS -> PARKS -> PARES -> GRASP.

And this:

SWORD -> SWARD -> REWARD -> REDRAW -> REDREW -> REDREW -> REDREW -> REDREW -> REDRAW -> REPAID -> REPAIR -> PEACE

Now we're babbling like a toddler just learning to count: "One, two, three, four, four, three, four, four, seven, blue, ten!"

All the results I've shown so far were recorded with ChatGPT version 3.5. I also tried the new and improved version 4.0, released in March. The updated bot exudes the same amiable confidence, but I'm afraid it shows the same tendency to slide into casual incoherence:

The ladder starts out well, with four steps that follow all the rules. But then the AI gets distracted. Going from PLAGE to PAGES requires a four-letter substitution. Then there's PASES, which isn't a word (as far as I know) and isn't needed here anyway, since the sequence could go directly from PAGES to PARES. More silliness follows. Still, I appreciate the informative note on PLAGE.

Recently I also had a chance to try Llama 2, an LLM published by Meta (the Facebook people). Although this model was developed independently of GPT, it seems to share some of the same psychological quirks, such as stating rules and then ignoring them. When I asked for a ladder connecting REACH and GRASP, Llama 2 suggested this:

REACH -> TEACH -> DEACH -> LEACH -> SPEECH -> SEAT -> FEET -> GRASP

Oracles and Code Monkeys

Matt Welsh mentions two modes of operation for computing systems built on large language models. So far we have been working in what I'll call oracle mode: you ask a question and the computer returns an answer. You supply a pair of words, and the system finds a ladder connecting them, performing whatever computation is needed along the way. You hand over a shoebox full of financial records, and the computer fills out your Form 1040. You feed in historical climate data, and the computer predicts the average global temperature in 2050.

The alternative to the AI oracle is the AI code monkey. In this second mode, the machine does not answer your questions or perform your computations directly; instead it writes a program that can run on a conventional computer. Instead of a word ladder, the bot returns a program that generates word ladders, written in the programming language of your choice. Instead of a completed tax return, you get tax-preparation software; a climate model replaces the temperature prediction.

Let's try it out with ChatGPT 3.5:

You can use the wordLadder function to solve ladder puzzles. Here is an example of its usage:

Its output is:

Note that this is a basic demo, and for larger ladders or more complex scenarios, additional optimizations or algorithms may be required, such as breadth-first search.

Again, a cursory glance at the output suggests a successful performance. ChatGPT appears to be as fluent in JavaScript as in English. It knows the syntax of if, while, and for statements, along with all the fussy rules of punctuation and bracketing. The machine-generated program appears to assemble all these components to accomplish the specified task. Note also the generous explanatory comments, which are surely for our benefit, not its own. Likewise the descriptive variable names (currentWord, newWord, ladder).

ChatGPT also helpfully includes instructions for running the program on a specific example (MARCH to APRIL), and it prints out a result matching the answer given in our earlier exchange. Was that output generated by actually running the program? ChatGPT doesn't say so explicitly, but it does claim that if you run the program as instructed, you will get the displayed result (in all its ridiculous glory).

We can test that claim by loading the program into a web browser or some other JavaScript execution environment. The verdict: busted! The program runs, but it does not produce the stated result. The program's real output is MARCH -> AARCH -> APRCH -> APRIH -> APRIL. This sequence is not entirely crazy, in that it obeys the rule of changing just one letter at a time, and every "word" has exactly five letters. On the other hand, none of the intermediate "words" can be found in an English dictionary.

There is a simple algorithm that generates the sequence MARCH -> AARCH -> APRCH -> APRIH -> APRIL: just march through the starting word from left to right, changing the letter at each position to match the corresponding letter of the target word. Following this rule, any pair of five-letter words can be linked by a ladder of at most five steps. MARCH -> APRIL takes only four steps because the R in the middle never needs to change. I can't imagine an easier way of making word ladders, provided, of course, that you're willing to count any jumble of letters as a word.
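Here is a minimal sketch of that quick-and-dirty routine (my own formulation in Python, not anything ChatGPT produced):

```python
# Walk through the start word left to right, overwriting each letter with the
# target word's letter. Any string of letters counts as a rung; no dictionary
# is consulted.
def naive_ladder(start: str, target: str) -> list[str]:
    assert len(start) == len(target), "words must be the same length"
    ladder = [start]
    current = list(start)
    for i, letter in enumerate(target):
        if current[i] != letter:
            current[i] = letter
            ladder.append("".join(current))
    return ladder

print(naive_ladder("MARCH", "APRIL"))
# ['MARCH', 'AARCH', 'APRCH', 'APRIH', 'APRIL']
```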

The program ChatGPT created could have used this quick-and-dirty routine, but it does something far more laborious: it builds all possible ladders whose first rung is the starting word and keeps extending them until it stumbles upon one containing the target word. This is a wasteful, brute-force algorithm. Each letter of the starting word can be changed in 25 ways, so a five-letter word has 125 possible successors. By the time you reach five-step ladders, there are 190 million possibilities. (The examples I've given here, such as MARCH -> APRIL and REACH -> GRASP, have one invariant letter, so a solution takes only four steps. Trying to compute a full five-step solution exhausted my patience.)
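For comparison, here is a sketch (again my own, not ChatGPT's code) of the successor-generation step at the heart of such a brute-force search; every string differing from the current word in one position counts as a candidate rung, dictionary or no dictionary:

```python
from string import ascii_uppercase

# Generate every string that differs from `word` in exactly one position.
def one_letter_variants(word: str) -> list[str]:
    variants = []
    for i, original in enumerate(word):
        for letter in ascii_uppercase:
            if letter != original:
                variants.append(word[:i] + letter + word[i + 1:])
    return variants

print(len(one_letter_variants("MARCH")))   # 125 successors for a five-letter word
```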

A Fourth-Generation Code Monkey

Let's try the same coding exercise with ChatGPT 4. Given the same prompt, the new bot responds as follows:

The program has the same overall structure (a while loop with two for loops nested inside it) and adopts the same algorithmic strategy (generating all strings that differ from a given word in one position). But the GPT-4 version introduces one important novelty: it recognizes that a word list is essential. With this change we finally have some hope of generating ladders made of real words.

Although GPT-4 recognizes the need for a list, it supplies only a placeholder: the ten words it conjured up for the REACH -> GRASP example given above. This stub of a word list is of little use, not even for regenerating that bogus REACH-to-GRASP ladder. If you try, the program reports that no ladder exists. There's nothing wrong with that result, since the given ten words do not form a valid path that changes only one letter per step.

Even if the words in the list were carefully chosen, a vocabulary of ten words is hopelessly inadequate. Generating a larger word list would seem an easy task for a language model. After all, LLMs are trained on a huge corpus of text in which almost every English word is likely to appear at least once, and common words appear millions of times. Can't the bot take a representative sample of those words? The answer, apparently, is no. Although GPT can be said to have "read" all that text, it does not store the words in any readily accessible form. (The same is true of human readers. Could you, drawing on a lifetime of reading, list the ten most common five-letter words in your vocabulary?)

When I asked ChatGPT 4 to generate a word list, it demurred apologetically: "I'm sorry for the confusion, but as an AI developed by OpenAI, I cannot directly access a database of words or fetch data from external sources..." So I resorted to trickery, asking the bot to write a 1,000-word story and then sort the story's words by frequency. The trick worked, but the sample was too small to be of much use. Had I persisted, I might have coaxed an acceptable list out of GPT, but I took a shortcut. After all, I am not an AI developed by OpenAI, and I do have access to external resources. I appropriated the list of 5,757 five-letter English words that Knuth compiled for his own word-ladder experiments. With this list, the GPT-4 program finds the following nine-step ladder:

REACH -> PEACH -> PEACE -> PLACE -> PLANE -> PLANS -> GLANS -> GLASS -> GRASS -> GRASP

This result exactly matches the output of Knuth's own ladder program, published 30 years ago in The Stanford GraphBase.

At this point I must concede that, with a little outside help, ChatGPT finally fulfilled my request. It wrote a program that can construct valid word ladders. But I still have reservations. Although the GPT-4 program and Knuth's produce the same output, the programs themselves are not equivalent, or even similar.

Knuth approaches the problem from the opposite direction, starting not from the collection of all possible five-letter strings (which number slightly less than 12 million) but from his much smaller list of 5,757 common English words. He then builds a graph (or network) in which each word is a node, and two nodes are connected by an edge if and only if the corresponding words differ in exactly one letter. The illustration below shows a fragment of such a graph.

In the graph, a word ladder is a path of edges from the start node to the target node. The best ladder is the shortest path, the one traversing the fewest edges. For example, the best path from leash to retch is leash -> leach -> reach -> retch, but there are longer paths as well, such as leash -> leach -> beach -> peach -> reach -> retch. To find shortest paths, Knuth used an algorithm devised by Edsger W. Dijkstra in the 1950s.
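To make the contrast concrete, here is a minimal sketch of this graph-based approach, written in Python rather than Knuth's C, with breadth-first search standing in for Dijkstra's algorithm (on an unweighted graph the two find the same shortest paths). The file name sgb-words.txt is just a placeholder for whatever word list you supply.

```python
from collections import defaultdict, deque

def build_graph(words):
    # Bucket words by "wildcard" patterns (e.g. "re_ch"), so that words sharing
    # a bucket differ in exactly one letter.
    buckets = defaultdict(list)
    for word in words:
        for i in range(len(word)):
            buckets[word[:i] + "_" + word[i + 1:]].append(word)
    graph = defaultdict(set)
    for bucket in buckets.values():
        for a in bucket:
            for b in bucket:
                if a != b:
                    graph[a].add(b)
    return graph

def shortest_ladder(graph, start, target):
    # Breadth-first search, recording each node's parent so the path can be rebuilt.
    parent = {start: None}
    queue = deque([start])
    while queue:
        word = queue.popleft()
        if word == target:
            path = []
            while word is not None:
                path.append(word)
                word = parent[word]
            return path[::-1]
        for neighbor in graph[word]:
            if neighbor not in parent:
                parent[neighbor] = word
                queue.append(neighbor)
    return None   # no ladder exists

if __name__ == "__main__":
    with open("sgb-words.txt") as f:            # placeholder path to a word list
        words = [line.strip().lower() for line in f if line.strip()]
    print(shortest_ladder(build_graph(words), "reach", "grasp"))
```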

Knuth's word-ladder program requires an up-front investment to convert a simple word list into a graph. In return, it avoids wastefully generating thousands or millions of five-letter strings that could never be rungs of a ladder. In solving the REACH -> GRASP problem, the GPT-4 program generated 219,180 such strings; only 2,792 of them (just over 1 percent) are real words.

If the various word-ladder programs I've described had been submitted by students, I would give a failing grade to the version without a word list. The GPT-4 program with the list would pass, but for efficiency and elegance I would award top marks only to Knuth's program.

Why would a chatbot prefer the inferior algorithm? A quick Google search for "word ladder program" offers a clue. Nearly all of the top-ranked results come from sites such as Leetcode, GeeksForGeeks, and RosettaCode, which cater to job seekers and contestants in programming competitions, and whose solutions generate all 125 single-letter variants of each word, just as the GPT programs do. Because there are so many such sites (there seem to be hundreds), they outweigh other sources such as Knuth's book (if that text even appears in the training set). Does that mean we should blame Leetcode rather than GPT for the poor choice of algorithm? I would point instead to an unavoidable weakness of the whole protocol, in which whatever answer is most common gets treated as correct by default.

Whenever I contemplate the prospect of large language models writing all our software, a related worry nags at me: Where will new algorithms come from? A large language model might show some creativity in remixing elements of existing programs, but I see no way for it to invent something genuinely new and better.

Enough of the word ladder!

I'll admit I've gone overboard, tormenting ChatGPT with too many variations on one particular (and inconsequential) problem. Perhaps large language models perform better on other computational tasks. I have tried several, with mixed results. I'll discuss just one of them, in which I found ChatGPT's efforts rather poignant.

With ChatGPT 3.5, I asked for the value of the 100th Fibonacci number. Note that my question was posed in oracle mode: I asked for the number itself, not for a program to calculate it. Still, ChatGPT volunteered a Fibonacci program and then presented that program's output.

The algorithm implemented by the program is mathematically correct; it comes straight from the definition of the Fibonacci sequence, which begins {0, 1}, with each subsequent element equal to the sum of the two preceding ones. The answer given is also correct: 354224848179261915075 is indeed the 100th Fibonacci number. So what's the problem? It's the middle sentence: "When you run this code, it will output the 100th Fibonacci number." That is not true. If you run the code, you get the wrong value 354224848179262000000. The reason for this anomaly is that JavaScript uses floating-point arithmetic even for integer values; under the IEEE floating-point standard, the largest integer that can be represented without loss of precision is 2^53 − 1, whereas the 100th Fibonacci number is roughly 2^68. Recent versions of JavaScript provide a BigInt data type that solves this problem, but BigInt must be requested explicitly, which the ChatGPT program does not do. This is what I find poignant: ChatGPT gives the right answer, but the method it claims to have used could not have produced that answer. The bot must have found the correct value by some other means, but exactly how is not revealed.
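To see the effect concretely, here is a small sketch in Python (standing in for the bot's JavaScript) that runs the same iterative algorithm twice, once with exact integers and once with floating-point values, which cannot represent integers beyond 2^53 − 1 exactly:

```python
# The same iterative Fibonacci algorithm with exact integers and with floats.
def fib(n, exact=True):
    a, b = (0, 1) if exact else (0.0, 1.0)
    for _ in range(n):
        a, b = b, a + b
    return a

print(fib(100))                         # 354224848179261915075 (exact)
print(f"{fib(100, exact=False):.0f}")   # a nearby but incorrect value
```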

Giving the same task to ChatGPT 4.0 takes us on an even stranger journey. In this interaction I had activated Code Interpreter, a ChatGPT plugin that allows the system to test and run some of the code it writes. The bot apparently took advantage of this facility, first producing a program that failed for unknown reasons:

Here ChatGPT writes in Python, the principal programming language supported by Code Interpreter. Its first attempt at a program was based on exponentiation of the Fibonacci matrix:

This is a well-known and efficient method, and the program implements it correctly. For mysterious reasons, however, Code Interpreter could not execute it. (The code works fine in a standard Python environment and returns the correct answer.)
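For reference, here is a minimal sketch of the matrix-exponentiation idea (my own reconstruction, not the code ChatGPT produced). It relies on the identity that the n-th power of the matrix [[1, 1], [1, 0]] has F(n) in its upper-right corner, and it computes the power by repeated squaring, so only about log2(n) matrix multiplications are needed:

```python
def mat_mul(a, b):
    # Multiply two 2x2 matrices.
    return [[a[0][0]*b[0][0] + a[0][1]*b[1][0], a[0][0]*b[0][1] + a[0][1]*b[1][1]],
            [a[1][0]*b[0][0] + a[1][1]*b[1][0], a[1][0]*b[0][1] + a[1][1]*b[1][1]]]

def fib_matrix(n):
    result = [[1, 0], [0, 1]]          # identity matrix
    base = [[1, 1], [1, 0]]
    while n:
        if n & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        n >>= 1
    return result[0][1]                # F(n)

print(fib_matrix(100))                 # 354224848179261915075
```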

At this point the bot veered off in an entirely new direction, proposing to compute the required Fibonacci number from a mathematical identity known as Binet's formula. It began writing out the mathematical expression, but then thought better of it. It correctly foresaw a problem of numerical precision: given an exact value for the square root of 5, the formula would yield an exact result, but that is not feasible.
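For the curious, here is a small sketch of Binet's formula, F(n) = (phi^n − psi^n) / √5 with phi = (1 + √5)/2 and psi = (1 − √5)/2, evaluated in ordinary floating point; it illustrates exactly the precision problem the bot anticipated:

```python
import math

def fib_binet(n):
    # Binet's formula in double-precision floating point.
    sqrt5 = math.sqrt(5)
    phi = (1 + sqrt5) / 2
    psi = (1 - sqrt5) / 2
    return round((phi**n - psi**n) / sqrt5)

print(fib_binet(100))   # close to, but not exactly, 354224848179261915075
print(fib_binet(40))    # 102334155: small enough for floating point to get right
```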

So ChatGPT switched strategies once more, falling back on the same iterative algorithm that version 3.5 had used. This time we get the right answer, because Python (unlike JavaScript) has no trouble with large integers.

I was impressed by this performance, not just by the correct answer but by the system's plucky persistence. ChatGPT was in trouble, puzzled by unexpected difficulties, yet it refused to give up. "Well, that matrix method should have worked. Never mind, let's try Binet's formula... Oh wait, I forgot... Anyway, there's no need to be fancy about this. Let's just do it the obvious, slow way." It feels like a very human approach to problem solving. It's strange to see such behavior in a machine.

Keeping Score of Successes and Failures

My little experiments have left me doubting the claim that AI oracles and AI code monkeys are about to crowd out human programmers. I have seen some successes, but more failures. And this bleak record was compiled on relatively easy computational tasks, whose solutions are well known and widely published.

Others have made broader and deeper assessments of LLM code generation. In the bibliography at the end of this article I list five such studies; let me briefly summarize some of their reported results.

Two years ago, Mark Chen and more than 50 colleagues at OpenAI made a major effort to measure the accuracy of Codex, a fork of GPT-3 dedicated to writing code. (Codex has since become the engine powering GitHub Copilot, the "programmer's assistant.") They created a set of 164 tasks that can be accomplished by writing Python programs. The tasks are mostly the kind found in textbook exercises, programming competitions, and the (surprisingly large) literature on how to do well in coding job interviews. Most can be completed with just a few lines of code. Examples: count the vowels in a given word; determine whether an integer is prime or composite.

Chen's team also gave some thought to the criteria defining success and failure. Because the LLM process is nondeterministic (word choices are made according to probabilities), a model may generate a flawed program on the first attempt but eventually produce a correct one if allowed to keep trying. A parameter called temperature controls the degree of uncertainty: at zero temperature the model always chooses the most probable word at each step; as the temperature rises, randomness is introduced, allowing less likely words to be chosen. Chen et al. allow for this variation by adopting three benchmarks of success:

pass@1: LLM generates the correct program on the first attempt

pass@10: At least one of the 10 programs generated is correct

pass@100: At least one of the 100 programs generated is correct

The pass@1 tests were performed at zero temperature, so the model always offered its best guess. The pass@10 and pass@100 trials were run at higher temperatures, allowing the system to explore a wider range of candidate solutions.
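For readers who want the metric pinned down, here is a minimal sketch of the unbiased pass@k estimator described in the Chen et al. paper; the sample counts in the example are invented purely for illustration.

```python
from math import comb

# Given n samples for a problem, of which c are correct, the probability that at
# least one of k randomly chosen samples is correct is 1 - C(n - c, k) / C(n, k).
def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0        # not enough incorrect samples to fill all k draws
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical problem: 200 samples generated, 58 of them correct.
print(round(pass_at_k(200, 58, 1), 3))    # 0.29
print(round(pass_at_k(200, 58, 10), 3))   # much higher: only one of ten need be right
```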

The authors evaluated multiple versions of Codex on all 164 tasks. For the largest and most capable version, the pass@1 rate was about 29 percent, the pass@10 rate 47 percent, and the pass@100 rate 72 percent. Should we be impressed or appalled by these numbers? Is it cause for celebration that Codex gets it right on the first try (with the temperature set to zero) almost a third of the time? Or that if you're willing to sift through 100 proposed programs to find a correct one, the success rate climbs to nearly three-quarters? My own view is this: if you regard the current generation of LLMs as a pioneering effort in a long-term research program, the results are encouraging. But if you expect the technology to replace hand-crafted software right now, there's little cause for hope. We are still far from the necessary level of reliability.

Other studies have produced broadly similar results. Federico Cassano et al. evaluated the performance of several LLMs generating code in a variety of programming languages; they report a wide range of pass@1 rates, only two of which exceed 50 percent. Alessio Buscemi tested ChatGPT 3.5 on 40 coding tasks, requesting programs in 10 languages and repeating each query 10 times. Of the 4,000 trials, 1,833 produced code that could be compiled and executed. Zhijie Liu et al. based their assessment of ChatGPT on questions posted at the Leetcode website, judging the results by submitting the generated code to Leetcode's automated scoring process. Average acceptance rates across all the questions ranged from 31 percent for programs written in C to 50 percent for Python programs. Liu et al. make another interesting observation: ChatGPT scored much worse on questions posted after September 2021 (the cutoff date for GPT's training set). They speculate that the bot may do better on earlier problems because it has already seen the solutions during training.

A recent paper by Li Zhong and Zilong Wang goes beyond the basic question of program correctness to consider robustness and reliability. Does the generated program respond correctly to malformed input or to external errors, such as an attempt to open a file that doesn't exist? Even when the LLM's prompt included an example showing how to handle such problems properly, Zhong and Wang found that the generated code failed to do so 30 to 50 percent of the time.

To these discouraging results I would add some misgivings of my own. Almost all of these tests involve short snippets of code. An LLM that struggles to write a 10-line program is likely to have even greater difficulty with one of 100 or 1,000 lines. Also, a simple pass/fail grade is a crude measure of code quality. Consider the primality test in the Chen group's benchmark suite. This is one of the programs Codex wrote:

This code is rated correct, as it should be: it never misclassifies a prime as composite or vice versa. But when n is large, you may not have the patience, or the lifespan, to wait for a verdict. The algorithm attempts to divide n by every integer between 2 and n − 1.
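Here is a small sketch of the two approaches (my own reconstruction, not the Codex program itself): the first mirrors the naive trial division described above, and the second stops at the square root of n, which suffices because any composite number has a divisor no larger than its square root.

```python
from math import isqrt

def is_prime_naive(n: int) -> bool:
    for d in range(2, n):             # tries every integer from 2 to n - 1
        if n % d == 0:
            return False
    return True

def is_prime_sqrt(n: int) -> bool:
    for d in range(2, isqrt(n) + 1):  # only up to the square root of n
        if n % d == 0:
            return False
    return True

print(is_prime_naive(97), is_prime_sqrt(97))   # True True
```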

The Unconventional Practicality of LLMs

These are still early days for large language models. ChatGPT was released to the public less than a year ago, and the underlying technology is only about six years old. While I feel fairly confident in asserting that LLMs are not ready to conquer the coding world, I can't predict with the same confidence that they never will. The models will surely improve, and we will get better at using them. Already there is an emerging industry offering guidance in "prompt engineering" as a way of getting the most out of each query.

Another way to improve LLM performance may be to form hybrids with other computational systems, ones equipped with tools for logic and reasoning rather than pure language analysis. Shortly before his recent death, Doug Lenat proposed combining LLMs with Cyc, the vast database of commonsense knowledge he spent four decades building. Stephen Wolfram is working on integrating ChatGPT with Wolfram|Alpha, an online collection of curated data and algorithms.

Still, some of the obstacles hindering LLM code generation look hard to overcome.

A language model works its magic by a deceptively simple trick: as it writes a sentence or paragraph, it chooses the next word based on the words that came before. It's like composing a text message on your phone: you type "I'll see you..." and the software suggests possible continuations: "tomorrow," "soon," "later." In an LLM, each candidate word is assigned a probability, computed from an analysis of all the text in the model's training set.

More than a century ago, the Russian mathematician A. A. Markov first explored the idea of generating text from this kind of statistical analysis. His procedure is now known as an n-gram model, where n is the number of words (or characters or other symbols) considered when choosing the next element of the sequence. I have long been fascinated by n-gram processes, mostly for their comedic possibilities. (In an article published 40 years ago, I called it "the art of turning literature into gibberish.")
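As a toy illustration of the n-gram idea, here is a tiny bigram generator (n = 2): the next word is drawn at random from the words that followed the current word in the training text. The little "corpus" is just a stand-in for whatever text you train on.

```python
import random
from collections import defaultdict

corpus = ("the cat sat on the mat and the dog sat on the rug "
          "and the cat saw the dog").split()

# Record which words follow each word in the corpus.
followers = defaultdict(list)
for current, nxt in zip(corpus, corpus[1:]):
    followers[current].append(nxt)

def babble(start: str, length: int = 12) -> str:
    word, output = start, [start]
    for _ in range(length - 1):
        options = followers.get(word)
        if not options:
            break
        word = random.choice(options)   # pick a continuation seen in training
        output.append(word)
    return " ".join(output)

print(babble("the"))
```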

Of course, ChatGPT and the other recent LLMs are much more than n-gram models. Their neural networks capture statistical features of language extending far beyond a sequence of n consecutive symbols. Of particular importance is the attention mechanism, which can track dependencies between chosen symbols over arbitrary distances. In natural language, this capability is useful for keeping subjects and verbs in agreement, or for associating a pronoun with the thing it refers to. In programming languages, the attention mechanism maintains the integrity of multipart syntactic structures such as if... then... else, and it keeps parentheses properly paired and nested.
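For concreteness, here is a bare-bones sketch of scaled dot-product attention in the standard textbook formulation (generic, and not a description of ChatGPT's actual internals): each position attends to every other position, with weights given by a softmax over query-key similarities, so dependencies can span arbitrary distances.

```python
import numpy as np

def attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                     # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over positions
    return weights @ V                                # weighted mix of value vectors

rng = np.random.default_rng(0)
Q = K = V = rng.standard_normal((5, 8))               # 5 token positions, 8-dim vectors
print(attention(Q, K, V).shape)                       # (5, 8)
```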

Even with these refinements, however, an LLM is essentially a device for constructing new text based on the probabilities of words in existing text. To my way of thinking, that's not thinking. It's something more superficial, focused on words rather than ideas. Given such a crude mechanism, I am both surprised and puzzled by how much LLMs have managed to achieve.

For decades, the architects of AI believed that genuine intelligence, whether natural or artificial, requires a mental model of the world. To make sense of what's going on around you (and inside you), you need intuitions about how things work, how they fit together, what comes next, cause and effect. Lenat insisted that the most important knowledge is what you acquire long before you learn to read. You learn about gravity by falling down. When you discover that a tower of blocks is easy to knock over but hard to rebuild, you learn something about entropy. In infancy, before language takes root, you learn about pain, fear, hunger, and love. A brain in a box has no access to this kind of experience, because it has no direct access to the physical or social universe.

Two hundred fifty years ago, the Swiss watchmaker Pierre Jaquet-Droz built a mechanical automaton that could write with a quill pen. The clockwork device, with its hundreds of cams and gears, was dressed up as a little boy seated on a stool. When activated, the boy dipped the pen in ink and wrote out a short message, most famously the Cartesian aphorism "I think, therefore I am." How droll! But no one, even in the 18th century, believed the scribbling doll was actually thinking. LLM skeptics put ChatGPT in the same category.

Am I going to tell you which of these contrasting theories of LLM mentality is correct? No. Neither alternative appeals to me. If Bender and company are right, then we must face the fact that a gadget with no capacity for reason or feeling, no experience of the physical universe or of social interaction, no self-awareness, can write term papers, compose rap lyrics, and give advice to the lovelorn. Knowledge, logic, and emotion count for nothing; glibness is all. This is a subversive proposition. If ChatGPT can fool us with this mindless performance, perhaps we too are bluffers, our sound and fury signifying nothing.

On the other hand, if Sutskever is right, then much of what we hold dear in human experience, the sense of self that slowly matures as we grow and live, could be acquired just by reading words on the internet. If that's so, I needn't have endured the unspeakable embarrassments of junior high school; I needn't have made all those dumb mistakes that caused such heartache and hardship; there was no need to bruise my ego by colliding with the world. I could have read about all of it from the comfort of an armchair; mere words could have brought me to a state of clear-eyed maturity without my having to stumble through all the painful valleys that shape a soul.

I remain of two minds (or maybe more than two!) about the status of large language models and their impact on computer science. The AI enthusiasts may be right: these models may take over programming, along with many other kinds of work and learning. Or they may fizzle, as other promising AI innovations have done. I don't think we'll have to wait long for an answer.
