Andrew Ng is among the pioneers of deep learning—the use of large neural networks in A.I. He’s also one of the most thoughtful A.I. experts on how real businesses are using the technology. His company, Landing AI, which he founded and leads as CEO, is building software that makes it easy for people, even those without coding skills, to build and maintain A.I. systems. This should allow almost any business to adopt A.I., especially computer vision applications. Landing AI’s customers include major manufacturing firms such as toolmaker Stanley Black & Decker, electronics manufacturer Foxconn, and automotive parts maker Denso.
Ng has become an evangelist for what he calls “data-centric A.I.” The basic premise is that state-of-the-art A.I. algorithms are increasingly ubiquitous thanks to open-source repositories and the publication of cutting-edge A.I. research. Companies that would struggle to hire PhDs from top computer science schools can nonetheless access the same software code that Google or NASA might use. The real differentiator between businesses that are successful at A.I. and those that aren’t, Ng argues, comes down to data: what data is used to train the algorithm, how it is gathered and processed, and how it is governed. Data-centric A.I., Ng tells me, is the practice of “smartsizing” data so that a successful A.I. system can be built using the least amount of data possible. And he says that “the shift to data-centric A.I.” is the most important shift businesses need to make today to take full advantage of A.I.—calling it as important as the shift to deep learning that has occurred over the past decade.
Ng says that if data is carefully prepared, a company may need far less of it than it thinks. With the right data, he says, companies with just a few dozen or a few hundred examples can have A.I. systems that work as well as those built by consumer internet giants that have billions of examples. He says one of the keys to extending the benefits of A.I. to companies beyond the online giants is to use techniques that enable A.I. systems to be trained effectively from much smaller datasets.
What’s the right data? Well, Ng has some tips, including making sure that data is what he calls “y consistent.” In essence, this means there should be a clear boundary between when something receives a particular classification label and when it doesn’t. (For example, take an A.I. designed to find defects in pills for a pharma company. This system will perform better on less training data if every scratch below a certain length is labeled “not defective” and every scratch longer than that threshold is labeled “defective” than if there is no consistency in which scratch lengths are labeled defective.)
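To make the “y consistent” idea concrete, here is a minimal sketch in Python of the kind of single, unambiguous labeling rule the pill example describes. The 0.3 mm threshold, the field names, and the helper function are hypothetical illustrations, not details taken from Landing AI or from Ng.

```python
# A minimal sketch of a "y consistent" labeling rule for the pill example.
# The 0.3 mm threshold and the ScratchMeasurement structure are hypothetical.

from dataclasses import dataclass

# Hypothetical cutoff separating "not defective" from "defective".
DEFECT_LENGTH_MM = 0.3


@dataclass
class ScratchMeasurement:
    pill_id: str
    scratch_length_mm: float


def label_scratch(m: ScratchMeasurement) -> str:
    """Apply one unambiguous rule so every labeler produces the same y."""
    return "defective" if m.scratch_length_mm > DEFECT_LENGTH_MM else "not defective"


# Because every example is labeled by the same rule, the training targets (y)
# stay consistent across the whole dataset.
examples = [
    ScratchMeasurement("pill-001", 0.1),
    ScratchMeasurement("pill-002", 0.7),
]
for ex in examples:
    print(ex.pill_id, label_scratch(ex))
```

The point of the rule is not the particular threshold but that the boundary is written down once and applied everywhere, rather than left to each labeler’s judgment.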
He says that one way to spot data inconsistencies is to assign the same images in a training set to multiple people to label. If their labels don’t agree, the person designing the system can make a call on the correct label, or that example can be discarded from the training set. Ng also urges those curating data sets to clarify labeling instructions by tracking down ambiguous examples, the tricky cases that are likely to lead to inconsistent labels. Any examples that are unclear or confusing should be eliminated from the data set altogether, he says. Finally, he says people should analyze the errors an A.I. system makes to figure out which subsets of examples tend to trip the system up. Adding just a few additional examples in those key subsets leads to faster performance improvements than adding more examples where the software is already doing well. He also says that A.I. users should see data curation, data improvement, and retraining the A.I. on updated data as an ongoing cycle, not something they do only once.
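The paragraph above describes these practices in general terms. The sketch below shows, under assumed data structures, what two of them might look like in Python: flagging training examples whose labels disagree across annotators, and counting a model’s errors per data subset to see where a few extra examples would help most. All identifiers and records here are hypothetical.

```python
# Sketch of two data-centric practices: inter-annotator consistency checks
# and per-subset error analysis. All example data is made up.

from collections import Counter

# Each image was labeled independently by several people.
annotations = {
    "img_01": ["defective", "defective", "defective"],
    "img_02": ["defective", "not defective", "defective"],   # disagreement
    "img_03": ["not defective", "not defective", "not defective"],
}


def consistent_label(labels):
    """Return the unanimous label, or None if annotators disagree."""
    return labels[0] if len(set(labels)) == 1 else None


clean_set, needs_review = {}, []
for image_id, labels in annotations.items():
    label = consistent_label(labels)
    if label is None:
        needs_review.append(image_id)   # correct the label or drop the example
    else:
        clean_set[image_id] = label

print("kept:", clean_set)
print("inconsistent, review or discard:", needs_review)

# Error analysis by subset: count mistakes per slice of the data, then add
# new examples to the slices where the model fails most often.
eval_results = [
    {"subset": "round_white_pills", "correct": True},
    {"subset": "oblong_coated_pills", "correct": False},
    {"subset": "oblong_coated_pills", "correct": False},
    {"subset": "round_white_pills", "correct": True},
]
errors_by_subset = Counter(r["subset"] for r in eval_results if not r["correct"])
print("errors per subset:", errors_by_subset.most_common())
```

In practice the per-subset counts would come from evaluating the trained model on a held-out set; the idea is simply that errors are examined slice by slice rather than as one aggregate number, so new data can be collected where it matters most.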
The idea of thinking of the building and training of A.I. models as a continuous cycle, not a one-off project, also comes across in a recent report on A.I. adoption from consulting firm Accenture. It found that only 12% of the 1,200 companies it looked at globally have advanced their A.I. maturity to the stage where they are seeing superior growth and business transformation. (Another 25% are somewhat advanced in their deployment of A.I., while the rest are still just running pilot projects, if anything.) What sets that 12% apart? Well, one factor Accenture identifies is that they have “industrialized” A.I. tools and processes, and that they have created a strong A.I. core team. Other key factors are organizational too: they have top executives who champion A.I. as a strategic priority; they invest heavily in A.I. talent; they design A.I. responsibly from the start; and they prioritize both long- and short-term A.I. projects.