ChatGPT Is a Blurry JPEG of the Web
…
It retains much of the information on the Web, in the same way that a retains much of the information of a higher-resolution image, but, if you’re looking for an exact sequence of bits, you won’t find it; all you will ever get is an approximation. But, because the approximation is presented in the form of grammatical text, which ChatGPT excels at creating, it’s usually acceptable. You’re still looking at a blurry , but the blurriness occurs in a way that doesn’t make the picture as a whole look less sharp.
…
Large language models identify statistical regularities in text. Any analysis of the text of the Web will reveal that phrases like “supply is low” often appear in close proximity to phrases like “prices rise.” A chatbot that incorporates this correlation might, when asked a question about the effect of supply shortages, respond with an answer about prices increasing. If a large language model has compiled a vast number of correlations between economic terms—so many that it can offer plausible responses to a wide variety of questions—should we say that it actually understands economic theory? Models like ChatGPT aren’t eligible for the Hutter Prize for a variety of reasons, one of which is that they don’t reconstruct the original text precisely—i.e., they don’t perform lossless compression. But is it possible that their lossy compression nonetheless indicates real understanding of the sort that A.I. researchers are interested in?
…
Imagine what it would look like if ChatGPT were a lossless algorithm. If that were the case, it would always answer questions by providing a verbatim quote from a relevant Web page. We would probably regard the software as only a slight improvement over a conventional search engine, and be less impressed by it. The fact that ChatGPT rephrases material from the Web instead of quoting it word for word makes it seem like a student expressing ideas in her own words, rather than simply regurgitating what she’s read; it creates the illusion that ChatGPT understands the material. In human students, rote memorization isn’t an indicator of genuine learning, so ChatGPT’s inability to produce exact quotes from Web pages is precisely what makes us think that it has learned something. When we’re dealing with sequences of words, lossy compression looks smarter than lossless compression.
…
Even if it is possible to restrict large language models from engaging in fabrication, should we use them to generate Web content? This would make sense only if our goal is to repackage information that’s already available on the Web. Some companies exist to do just that—we usually call them content mills. Perhaps the blurriness of large language models will be useful to them, as a way of avoiding copyright infringement. Generally speaking, though, I’d say that anything that’s good for content mills is not good for people searching for information. The rise of this type of repackaging is what makes it harder for us to find what we’re looking for online right now; the more that text generated by large language models gets published on the Web, the more the Web becomes a blurrier version of itself.
There is very little information available about OpenAI’s forthcoming successor to ChatGPT, GPT-4. But I’m going to make a prediction: when assembling the vast amount of text used to train GPT-4, the people at OpenAI will have made every effort to exclude material generated by ChatGPT or any other large language model. If this turns out to be the case, it will serve as unintentional conrmation that the analogy between large language models and lossy compression is useful. Repeatedly resaving a creates more compression artifacts, because more information is lost every time. It’s the digital equivalent of repeatedly making photocopies of photocopies in the old days. The image quality only gets worse.
시뮬라크르(simulacre)가 떠오르는 문단이다. 시뮬라크르의 개념은 플라톤까지 거슬러 올라 가게 되는데 플라톤에 의하면 사람이 살고 있는 이 세계는 원형인 ‘이데아’, 복제물인 ‘현실’, 복제의 복제물인 ‘세계’로 이루어져 있다. 여기서 시뮬라크르는 복제물을 다시 복제한 것을 말한다.
Indeed, a useful criterion for gauging a large language model’s quality might be the willingness of a company to use the text that it generates as training material for a new model. If the output of ChatGPT isn’t good enough for GPT-4, we might take that as an indicator that it’s not good enough for us, either. Conversely, if a model starts generating text so good that it can be used to train new models, then that should give us condence in the quality of that text. (I suspect that such an outcome would require a major breakthrough in the techniques used to build these models.) If and when we start seeing models producing output that’s as good as their input, then the analogy of lossy compression will no longer be applicable.
이 문단에서도 논증되듯, LLM의 아웃풋을 다시 LLM을 학습시키는 데에 사용하지 못한다는 것이 바로 아웃풋이 시뮬라크르라는 반증이라고 생각한다. 그럼 이데아는 무엇일까? 실제 사람들이 작성하는 글이 이데아일까? 시뮬라크르는 이데아가 될 수 있을까?
…
There’s nothing magical or mystical about writing, but it involves more than placing an existing document on an unreliable photocopier and pressing the Print button. It’s possible that, in the future, we will build an A.I. that is capable of writing good prose based on nothing but its own experience of the world. The day we achieve that will be momentous indeed—but that day lies far beyond our prediction horizon. In the meantime, it’s reasonable to ask, What use is there in having something that rephrases the Web? If we were losing our access to the Internet forever and had to store a copy on a private server with limited space, a large language model like ChatGPT might be a good solution, assuming that it could be kept from fabricating. But we aren’t losing our access to the Internet. So just how much use is a blurry , when you still have the original?
원본(Web; idea)이 있는데 왜 복제품(ChatGPT; simulacre)을 봐야 하는가?