Chinese translation published by Fortune China (财富中文网)
Translator: Liu Jinlong (刘进龙)
Reviewer: Wang Hao (汪皓)
Artificial intelligence systems like ChatGPT could soon run out of what keeps making them smarter—the tens of trillions of words people have written and shared online.
A new study released Thursday by research group Epoch AI projects that tech companies will exhaust the supply of publicly available training data for AI language models by roughly the turn of the decade—sometime between 2026 and 2032.
Comparing it to a “literal gold rush” that depletes finite natural resources, Tamay Besiroglu, an author of the study, said the AI field might face challenges in maintaining its current pace of progress once it drains the reserves of human-generated writing.
In the short term, tech companies like ChatGPT-maker OpenAI and Google are racing to secure and sometimes pay for high-quality data sources to train their AI large language models, for instance by signing deals to tap into the steady flow of sentences coming out of Reddit forums and news media outlets.
In the longer term, there won’t be enough new blogs, news articles and social media commentary to sustain the current trajectory of AI development, putting pressure on companies to tap into sensitive data now considered private, such as emails or text messages, or to rely on less-reliable “synthetic data” spit out by the chatbots themselves.
“There is a serious bottleneck here,” Besiroglu said. “If you start hitting those constraints about how much data you have, then you can’t really scale up your models efficiently anymore. And scaling up models has been probably the most important way of expanding their capabilities and improving the quality of their output.”
A 2- to 8-year cliff
The researchers first made their projections two years ago—shortly before ChatGPT’s debut—in a working paper that forecast a more imminent 2026 cutoff of high-quality text data. Much has changed since then, including new techniques that enabled AI researchers to make better use of the data they already have and sometimes “overtrain” on the same sources multiple times.
But there are limits, and after further research, Epoch now foresees running out of public text data sometime in the next two to eight years.
The team’s latest study is peer-reviewed and due to be presented at this summer’s International Conference on Machine Learning in Vienna, Austria. Epoch is a nonprofit institute hosted by San Francisco-based Rethink Priorities and funded by proponents of effective altruism—a philanthropic movement that has poured money into mitigating AI’s worst-case risks.
Besiroglu said AI researchers realized more than a decade ago that aggressively expanding two key ingredients—computing power and vast stores of internet data—could significantly improve the performance of AI systems.
The amount of text data fed into AI language models has been growing about 2.5 times per year, while computing has grown about 4 times per year, according to the Epoch study. Facebook parent company Meta Platforms recently claimed the largest version of their upcoming Llama 3 model—which has not yet been released—has been trained on up to 15 trillion tokens, each of which can represent a piece of a word.
But how much it’s worth worrying about the data bottleneck is debatable.
“I think it’s important to keep in mind that we don’t necessarily need to train larger and larger models,” said Nicolas Papernot, an assistant professor of computer engineering at the University of Toronto and researcher at the nonprofit Vector Institute for Artificial Intelligence.
‘You photocopy the photocopy’
Papernot, who was not involved in the Epoch study, said building more skilled AI systems can also come from training models that are more specialized for specific tasks. But he has concerns about training generative AI systems on the same outputs they’re producing, leading to degraded performance known as “model collapse.”
Training on AI-generated data is “like what happens when you photocopy a piece of paper and then you photocopy the photocopy. You lose some of the information,” Papernot said. Not only that, but Papernot’s research has also found it can further encode the mistakes, bias and unfairness that are already baked into the information ecosystem.
If real human-crafted sentences remain a critical AI data source, those who are stewards of the most sought-after troves—websites like Reddit and Wikipedia, as well as news and book publishers—have been forced to think hard about how they’re being used.
“Maybe you don’t lop off the tops of every mountain,” jokes Selena Deckelmann, chief product and technology officer at the Wikimedia Foundation, which runs Wikipedia. “It’s an interesting problem right now that we’re having natural resource conversations about human-created data. I shouldn’t laugh about it, but I do find it kind of amazing.”
While some have sought to close off their data from AI training—often after it’s already been taken without compensation—Wikipedia has placed few restrictions on how AI companies use its volunteer-written entries. Still, Deckelmann said she hopes there continue to be incentives for people to keep contributing, especially as a flood of cheap and automatically generated “garbage content” starts polluting the internet.
AI companies should be “concerned about how human-generated content continues to exist and continues to be accessible,” she said.
From the perspective of AI developers, Epoch’s study says paying millions of humans to generate the text that AI models will need “is unlikely to be an economical way” to drive better technical performance.
As OpenAI begins work on training the next generation of its GPT large language models, CEO Sam Altman told the audience at a United Nations event last month that the company has already experimented with “generating lots of synthetic data” for training.
“I think what you need is high-quality data. There is low-quality synthetic data. There’s low-quality human data,” Altman said. But he also expressed reservations about relying too heavily on synthetic data over other technical methods to improve AI models.
“There’d be something very strange if the best way to train a model was to just generate, like, a quadrillion tokens of synthetic data and feed that back in,” Altman said. “Somehow that seems inefficient.”