The Limitations of Big Data
Source: Fortune China (財富中文網); Chinese edition translated by Yan Kuangzheng (嚴匡正).
“Every revolution in science, from the Copernican heliocentric model to the rise of statistical and quantum mechanics, from Darwin’s theory of evolution and natural selection to the theory of the gene, has been driven by one and only one thing: access to data.” That was the eye-opening start of a keynote address given yesterday by the brilliant John Quackenbush, a professor of biostatistics and computational biology at Dana-Farber Cancer Institute who has a dual professorship at the Harvard T.H. Chan School of Public Health and ample other academic credits after his name.

There is no question that this digital fuel is driving virtually every transformation in healthcare happening today. Speaking at the MedCity Converge conference in Philadelphia, Quackenbush noted that the average hospital generates roughly 665 terabytes of data annually, with some four-fifths of it in the unstructured forms of images, video, and doctors’ notes.

But the great limiting factor in harnessing all of this information feedstock is not a “big data problem” but rather a “messy data problem.”

In sum, in places where there is tons of potentially useful data to examine, we don’t make it accessible in ways that people actually want to use. Either the data isn’t easy or intuitive to access, or it simply isn’t informative. Or it’s in the wrong format. Or it’s incomplete, or created with incompatible “standards” (of which we seem to have an unlimited, irreconcilable supply). Or it captures just one dimension of a multidimensional realm. (“Biological systems are really complex, adaptive systems with many moving parts, that we’ve only begun to scratch the surface of understanding,” he says.)

Or (and this one seems to be a surprisingly common misstep) the data doesn’t really address the question the end user wants to answer. It’s off-purpose, in other words.

Take the case of population-level data, which government and academic institutions routinely collect: “Statistics operate on population data, and medical research is driven by population data,” says Quackenbush, “but medical care is driven by individual-level data. So when we’re driving [our data research] to the clinic, we have to think about how we’re going to make that individual-level [data] available in a meaningful format.”

Ultimately, the goal, he says, should be to “create intuitive graphical representations of the underlying data” in ways that allow non-data scientists “to explore it without having to sit at a terminal and type in a bunch of obscure commands.”

“What you want to think about doing when you make data available to people is to create interfaces that allow them to dive in and make sense of that data, using their own intuition,” Quackenbush says.

Without that, all of our growing mounds of big data will simply be big blobs on ever-bigger data servers.

What’s to stop that from happening? The incentive for turning all this raw feedstock into a usable fuel “is not going to be enhancing healthcare or making people better,” Quackenbush says flatly. “The driver is really going to be the most important ‘-omics’ science of all, which is economics. We have to show that there’s an advantage to bringing this kind of data and information together if we’re really going to make advances.”