
An ambitious new AI project has begun to take shape in Europe, with the aim of developing open-source AI models that support the region’s 24 official languages and more—while also complying as much as possible with its thicket of digital legislation.
The OpenEuroLLM project, which commenced work at the start of the month, has a budget of just €37.4 million ($38.6 million): a pittance compared with the sums being invested in other AI-related projects, such as the $100 billion first tranche of the U.S.’s Stargate AI infrastructure project. Although participating companies such as Germany’s Aleph Alpha and Finland’s Silo AI are also contributing researchers’ time of equivalent value, the bulk of the funding comes from the European Commission.
EU-funded projects don’t tend to move fast, and this one has a three-year road map in a sector that’s currently undergoing significant evolution each month. But organizers and participants tell Fortune that it could be possible to deliver an intermediate model within a year—and the effort will be worth it.
Speaking in tongues
“Most model development efforts that have worldwide visibility focus on the English language,” said Yasser Jadidi, chief research officer at Aleph Alpha. “It’s a consequence of most of the internet text data that is available and accessible being in English, and it puts other languages at a disadvantage.”
For people in places like Sweden or Turkey (the OpenEuroLLM project is also targeting the tongues of eight countries that have applied for EU membership, bringing the project to a total of 32 languages), the lack of AI models that understand the intricacies of their languages can be a serious problem. For a start, it makes it harder for local companies and public authorities to adopt the technology and start providing new services.
“It’s first and foremost a commercial question,” said Peter Sarlin, the CEO of Silo AI, Europe’s largest private AI lab, which was acquired by AMD last year and is participating in OpenEuroLLM. “Are there models that are performant in that specific low-resource language, be it Albanian or Finnish or Swedish or some other, that allows companies within that region to eventually build services on top?”
The issue also has consequences for evaluating the accuracy and safety of AI models in the local context, Jadidi said. Indeed, Aleph Alpha’s role in the project is chiefly to provide AI-model evaluation benchmarks that aren’t simply machine-translated from English, as most are.
The OpenEuroLLM project may have relatively meager funding, but it isn’t starting from scratch.
Most of its participants have already been involved in a separate scheme called High Performance Language Technologies (HPLT), which started two years ago with a budget of just €6 million. The original proposal was for HPLT to deliver AI models, but then OpenAI’s ChatGPT changed the AI landscape and the organizers pivoted to creating a high-quality dataset that can be used to train multilingual models. The HPLT dataset is currently being “cleaned” of errors, and it will form the basis of OpenEuroLLM’s work.
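Cleaning a multilingual web corpus of that kind typically combines deduplication, boilerplate filtering, and language identification. The Python sketch below illustrates that general pattern only; the thresholds and the `detect_language` callable are assumptions made for illustration, not details of HPLT’s actual pipeline.

```python
# Minimal sketch of the kind of cleaning a multilingual web corpus might undergo.
# The filters and thresholds here are illustrative assumptions, not HPLT's pipeline.
import hashlib
import re
from typing import Callable, Iterable, Iterator

def clean_corpus(
    documents: Iterable[str],
    expected_lang: str,
    detect_language: Callable[[str], str],  # assumed: text -> ISO language code
    min_words: int = 20,                    # assumed threshold for useful length
    max_symbol_ratio: float = 0.3,          # assumed cutoff for markup-heavy text
) -> Iterator[str]:
    seen_hashes = set()
    for doc in documents:
        text = doc.strip()
        if len(text.split()) < min_words:
            continue  # too short to be useful training text
        # Exact deduplication via a content hash.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue
        seen_hashes.add(digest)
        # Drop documents dominated by non-letter characters (menus, markup, tables).
        symbols = len(re.findall(r"[^\w\s]", text))
        if symbols / max(len(text), 1) > max_symbol_ratio:
            continue
        # Keep only documents identified as the target language.
        if detect_language(text) != expected_lang:
            continue
        yield text
```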
OpenEuroLLM will create a base model trained on a dataset of all the European languages. Once that’s done, yet another EU-funded project, called LLMs4EU, will fine-tune it for various applications. Apart from cash, the EU is also providing computational resources to all these schemes.
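A common way for a downstream project to adapt an open base model is supervised fine-tuning with the Hugging Face Trainer. The sketch below shows that generic pattern; the model identifier `openeurollm/base` and the dataset file name are placeholders invented for illustration, not published artifacts of OpenEuroLLM or LLMs4EU.

```python
# Generic fine-tuning sketch: adapt an open base model to a local-language dataset.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

BASE_MODEL = "openeurollm/base"            # hypothetical model ID, illustration only
DATA_FILE = "finnish_support_texts.jsonl"  # hypothetical file of {"text": ...} records

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # causal-LM tokenizers often lack a pad token
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

dataset = load_dataset("json", data_files=DATA_FILE)["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="finetuned-eu-model",
        num_train_epochs=1,
        per_device_train_batch_size=4,
    ),
    train_dataset=tokenized,
    # mlm=False yields standard next-token (causal) language-modeling labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```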
Sticking to the rules
Europe is not the easiest place for AI companies to do business. Quite apart from the AI Act that is gradually coming into force, placing all sorts of reporting responsibilities on model providers and their customers, there’s also copyright and competition law to consider—and the General Data Protection Regulation (GDPR), which places strict limits on the personal data that AI companies can use.
These laws have had real effects on AI’s European progress, with Meta delaying the rollout of Meta AI because of GDPR limits, and Apple also delaying the deployment of Apple Intelligence because of unspecified antitrust issues. (Apple Intelligence will come to EU iPhones in limited form in April, while Meta has started offering some Meta AI features to European wearers of its smart glasses.)
As far as OpenEuroLLM’s organizers are concerned, these laws are manageable. “We believe we can live with all of them,” said Jan Hajič of Charles University in Czechia, who is co-leading the project with Sarlin.
Hajič said the participants had already dealt with the copyright and most privacy issues when developing the HPLT dataset. “The GDPR could be a problem, but that’s something we are trying to get around with pseudonymizing the data, meaning that if we encounter people’s names it gets deleted,” he said, while acknowledging that the necessary automation in this process may not have a 100% success rate.
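Pseudonymization of the kind Hajič describes is often implemented with named-entity recognition: detected person names are stripped or replaced before text enters the training corpus. The sketch below illustrates the idea with spaCy’s multilingual NER model; the choice of model and the placeholder token are assumptions, and, as Hajič concedes, automated detection will inevitably miss some names.

```python
# Illustrative NER-based pseudonymization: replace detected person names before
# text enters a training corpus. The spaCy model used here is an assumption,
# not the project's actual tooling, and it will not catch every name.
import spacy

nlp = spacy.load("xx_ent_wiki_sm")  # multilingual NER model (assumed choice)

def pseudonymize(text: str, placeholder: str = "[NAME]") -> str:
    doc = nlp(text)
    redacted = text
    # Replace person entities from the end of the string backwards so that
    # earlier character offsets remain valid after each substitution.
    for ent in sorted(doc.ents, key=lambda e: e.start_char, reverse=True):
        if ent.label_ == "PER":
            redacted = redacted[:ent.start_char] + placeholder + redacted[ent.end_char:]
    return redacted

print(pseudonymize("Jan Hajič teaches at Charles University in Prague."))
# -> "[NAME] teaches at Charles University in Prague."  (if the model tags the name)
```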
“Our goal is to do things in such a way that they will not clash with the European regulation in any way,” Hajič said, adding that this could be a draw for companies wanting to target EU markets. For high-risk use cases that will require a lot of reporting to the EU authorities under the AI Act, the open-source approach will be essential for the transparency it allows, he argued.
The OpenEuroLLM project has 20 participants including companies, research institutions, and high-performance computing clusters like Finland’s Lumi. This setup could be seen as a liability with the potential for diverging priorities, but Aleph Alpha’s Jadidi argued that open-source projects often include a wide array of participants without being dragged down.
“We have all the opportunity to ensure that a high amount of contributors is not a hindrance but an opportunity,” he said.