FlyTitle: Business

Integrating data is getting harder, but also more important

整合数据越来越困难,但也越来越重要【专题报道《数据经济》系列之四】

经济学人双语版-新AI组装线 The new AI-ssembly line

GEEKS ARE not known for being poets. But sometimes even they have a way with words, for example when trying to describe the main challenge of dealing with data. It is the search, they say, for “a single version of the truth”.

极客们并不以擅长作诗闻名。但有时候,比如在尝试描述处理数据的主要困难的时候,哪怕是这些人的语言也很有一套。他们说,难点在于寻找“唯一版本的事实”。

This also nicely describes what has been the goal of corporate information technology since it emerged 60 years ago. And the adage encapsulates the main tension for businesses in the data economy: finding digital truth—that is, identifying and combining data that accurately represent reality—is becoming more difficult and more important. More difficult because data and their sources are multiplying. And more important because firms need to get their data house in order to benefit from AI, which they must to stay competitive. AI boosts revenues and profits, according to a recent survey by McKinsey, a consultancy (see chart).

这也很好地描述了企业信息技术(IT)自60年前出现以来的工作目标。而且这个巧妙的表述还概括了数据经济中企业的主要压力:寻找数字真相(即识别和融合准确反映现实的数据)变得愈发困难和重要。愈发困难是因为数据和数据源都在成倍增加。愈发重要则是因为公司需要把自己的数据仓库打理好才能从AI中受益,而这是保持竞争力的必需。咨询公司麦肯锡最近一项调查显示,AI可以增加收入和利润(见图表)。

Happily, technology is coming to the rescue. Data-handling software and cloud computing are increasingly enabling what George Gilbert, an investor and veteran observer of the IT industry, calls the “AI-ssembly line”—in reference to what happened a hundred years ago, when electricity replaced steam as the main source of power in factories. Before, machines had to be grouped closely around the power source—a steam engine. Electricity then allowed power to be distributed to where it was needed, which made assembly lines feasible. What is happening now, however, is actually the inverse: the machines of the digital age—a firm’s business applications and software to build these—are virtually regrouping around a new power source: central digital repositories known as “data warehouses” or “data lakes”. In time this may allow companies to build entire digital twins of themselves.

所幸,技术来帮忙了。数据处理软件和云计算正在日益实现“AI组装线”(AI-ssembly line)。IT行业投资者和资深观察家乔治·吉尔伯特(George Gilbert)创造的这个词暗指了100年前发生的事,当时电力取代了蒸汽,成为工厂的主要动力来源。过去,机器必须紧密围绕动力源也就是蒸汽机来布局。然后,电力可以把能源输送到任何需要的地方,让装配线变得可行。但是,现在发生的事情实际上是相反的:数字时代的机器(公司的业务应用程序和构建这些应用的软件)正在围绕着一种新的动力源(被称为“数据仓库”或“数据湖”的中央数字存储库)进行虚拟重组。假以时日,这可能会让公司得以建立完整的数字孪生体。

Finding digital truth is hard because the data come from many sources and in a staggering variety of formats—which makes them hard to integrate. Even simple things such as a customer’s name can be defined and stored in many different ways. Companies can have thousands of software applications, each with its own database. Failed attempts to consolidate or link these digital repositories have cost armies of chief information officers their jobs.

数字真相很难找到,因为数据有许多来源,并且格式五花八门,使得它们难以集成。哪怕是客户姓名这么简单的东西也可以用许多不同的方式定义和存储。公司可能拥有数千个软件应用,每个都有自己的数据库。成批的首席信息官因没能成功地把这些数字存储库整合或连接起来而丢了工作。
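To make the customer-name problem concrete, here is a minimal sketch in Python. It assumes two systems that store the same person as "Last, First" with accents and as "first last" in mixed case. The function names and the normalisation rules are hypothetical and chosen only for illustration; real record matching is far messier.

```python
import unicodedata


def normalise_name(raw: str) -> str:
    """Reduce a customer name to a comparable key (illustrative rules only)."""
    # Turn "Last, First" into "First Last".
    if "," in raw:
        last, first = raw.split(",", 1)
        raw = f"{first} {last}"
    # Strip accents, so that "Müller" and "Muller" compare equal.
    decomposed = unicodedata.normalize("NFKD", raw)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Collapse runs of whitespace and ignore case.
    return " ".join(unaccented.split()).casefold()


def same_customer(a: str, b: str) -> bool:
    """Treat two names as the same customer if their normalised keys match."""
    return normalise_name(a) == normalise_name(b)
```

Under these assumptions, `same_customer("Müller, Anna", "anna  MULLER")` holds, while a misspelt "Ann Muller" would still count as a different customer. Multiply such rules by thousands of applications and the scale of the integration problem becomes clear.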

Integrating data was already a major problem when IT existed mainly to keep track of a firm’s “transactions”, such as processing an order or managing the supply chain. It has only become more difficult since. In the 1990s firms started using their data to work out how they have been doing, something called “analytics”. A decade ago, they turned to mining their data to make predictions about their business, an approach first dubbed “big data” and now AI. Today a firm’s data are often not just spread across many local databases, but live in different cloud services and stream in from third parties and connected devices.

当IT的存在主要是为了跟踪公司的“交易”(例如处理订单或管理供应链)时,集成数据已经是一个大问题。打那时起,它就只会变得越来越困难。在1990年代,公司开始使用自己的数据自我评估业绩,也就是所谓的“分析”。十年前,它们转向挖掘数据来对业务进行预测,这种方法最初被称为“大数据”,现在被称为AI。如今,公司的数据通常不仅散布在许多本地数据库中,而且还存在于不同的云服务中,并从第三方和联网设备不断流入。

It is the data warehouses and data lakes that are now making it easier to use the digital stuff. They differ in the way they structure information (the first takes a more rigid approach than the second, although the differences are getting smaller), but both can now live in the cloud. This makes them not only cheaper to manage but also easier to feed with data from many different sources and to share among many different users. One such is made by Snowflake, a startup, which has turned its data warehouse into what it calls a “data platform” that can stretch across different computing clouds. Big cloud providers such as Amazon Web Services and Microsoft Azure offer similar products.

正是数据仓库和数据湖让数字资源的使用变得更加容易。它们在组织信息的方式上有所不同——前者比后者更为严格,但差异越来越小。两者现在都可以存在于云中。这不仅使它们的管理成本更低,而且可以更轻松地输入来源多样、用户众多的数据。创业公司Snowflake就是如此,它把自己的数据仓库变成了可以跨越多个计算云的所谓“数据平台”。诸如亚马逊的AWS和微软的Azure之类的大型云供应商都提供类似的产品。

A second improvement is specialised databases, which take care of certain types of data. Since data often no longer come in the form of static blocks, but rather real-time digital streams, they have to be treated differently, explains Jay Kreps, the chief executive of a startup appropriately named Confluent. It sells cloud services based on Apache Kafka, an open-source program, which analyse these streams and dump them into data lakes. Bosch, a German conglomerate, uses Confluent to gather and mine data from power tools to manage repair services and construction sites.

第二个改进是处理某些类型数据的专用数据库。创业公司Confluent(“合流”,名字取得挺恰当)的首席执行官杰伊·克雷普斯(Jay Kreps)解释说,由于数据通常不再以静态块的形式,而是以实时数字流的形式出现,它们必须被区别对待。该公司销售基于开源程序Apache Kafka的云服务,分析这些数据流并将其转存到各个数据湖中。德国企业集团博世使用Confluent的服务从电动工具中收集和挖掘数据,用以管理维修服务和建筑工地。

Yet it is a third group of software and services that turns all this into Mr Gilbert’s “AI-ssembly line”. Some of these tools prepare data for crunching, others make it easy to design and train an AI algorithm, deploy it in an application to automate decisions and continuously improve it. Enel, a utility, has used such tools to develop a service that helps it identify the power thieves it needs to go after first. Shell, an oil company, has designed algorithms that ensure that its thousands of spare parts are always available around the world. And Kiva, a non-profit lender, has built a data warehouse with Snowflake that allows it to make better decisions about who should receive its loans.

然而,将所有这些变成了吉尔伯特所说的“AI组装线”的是第三组软件和服务。这些工具中,有一些可以整理数据以备分析,有些可以轻松设计和训练AI算法,将其部署到应用中自动执行决策并不断改进。意大利国家电力公司(Enel)已使用此类工具开发了一项服务,帮助它确定需要追捕的头号偷电贼。壳牌石油公司设计了算法来确保其成千上万种备件在世界各地始终有货。非营利贷款公司Kiva与Snowflake合作建立了一个数据仓库,使它可以更好地决定应该放款给谁。

Many other firms were not so lucky, forgetting that technology is always only part of the solution. Motivated by studies that found that AI boosts profits and, in some cases, panicked by the possibility of being disrupted by a startup, some tried to cobble together an AI-ssembly line themselves, but failed. They did not have the right type of developers and data scientists, or did not want to pay their exorbitant salaries. This has created an opening for IT vendors to sell more or less pre-packaged versions of AI-ssembly lines, each coming at it from a different direction.

许多别的公司就没这么幸运了——它们忘记了技术永远只是解决方案的一部分。AI能提高利润的研究激励了它们,或者有时是因为恐惧被创业公司颠覆,一些公司试图自己拼凑出一条AI装配线,却失败了。它们没有适合自己业务的程序员和数据科学家,或者不想支付高昂的薪水。这为IT供应商提供了销售在某种程度上预制好的AI流水线的机会,但每种流水线的着眼点都不同。

Meanwhile, at the refinery

同时,在炼油厂

Take incumbents first, which are trying to build on their strengths. In the case of the granddaddy, IBM, this is services. It helps firms build what Arvind Krishna, soon its new boss, calls a “data plane”, a collection of programs to develop AI applications. It has also become a data refiner itself: for example, it collects and sells granular weather data that insurers can use to calculate rates, and utilities to predict where power cuts may occur. And it offers a range of AI services, including visual recognition and translation, that other firms can plug into their products.

先说老企业,它们正在努力发挥自己的优势。就拿老祖宗IBM来说,这个优势就是服务。它帮助企业构建即将成为新任老板的阿文德·克里希纳(Arvind Krishna)所说的“数据平面”,即用于开发AI应用的一系列程序。它本身也成了数据炼油厂:例如,它收集和销售精细的天气数据,保险公司可拿来计算费率,公用事业公司可拿来预测可能发生停电的地点。它还提供了视觉识别和翻译等一系列AI服务,其他公司可以直接将它们插入自己的产品中。

Oracle, the world’s leading vendor of relational databases, still the workhorses of corporate IT, aims to extend that position by providing what it calls an “autonomous database”. This type of service combines and automates all sorts of digital repositories, plus bits of AI, so customers do not have to put together all these programs themselves. “It’s many data engines in a single engine,” explains Paul Sonderegger, the firm’s senior data strategist, adding that such integration will be key to increasing a firm’s “data productivity—increasing the dollar output per data input”.

世界领先的关系数据库供应商甲骨文仍然是企业IT的主力军,其目标是通过提供所谓的“自治数据库”来强化这一地位。这种类型的服务将各种数字存储库以及零星的AI组合起来并自动化,这样客户就用不着自己去组合所有这些程序了。“这是包含许多数据引擎的单个引擎。”该公司的高级数据策略师保罗·桑德雷格(Paul Sonderegger)解释道。他补充说,这种集成对于提高公司的“数据生产率,即增加每个数据输入的美元产出”至关重要。

As for younger IT firms, they are increasingly offering to help firms to get their digital ducks lined up, too. Salesforce, which grew up as a web-based service to manage customer relations, has spent billions in the past two years to develop its own AI technology, called Einstein, and acquire two big-data companies, MuleSoft and Tableau. The idea, says Bret Taylor, Salesforce’s president and chief operating officer, is to allow firms to consolidate and link their data so they can have a “single view of their customers”. This makes it easier for firms to anticipate what their customers will do, personalise offers and always recognise them, whether they show up in a retail store or online.

至于年轻的IT公司,它们也在提供越来越多的服务,帮助企业把数字事务安排妥当。Salesforce从提供管理客户关系的网络服务起家,它在过去两年中花费了数十亿美元来开发自己的AI技术“爱因斯坦”,并收购了两家大数据公司MuleSoft和Tableau。Salesforce的总裁兼首席运营官布雷特·泰勒(Bret Taylor)表示,公司的思路是让企业能把数据整合并连接起来,以便对自己的客户“一目了然”。这让企业可以更轻松地预测客户的行为,提供个性化服务,并且无论客户出现在零售店还是网店都能识别出来。

Then there is a host of smaller firms. Databricks has put together an AI platform, complete with tools to cleanse data, build algorithms and deploy them. C3.ai offers something similar, but mainly aims to help big firms through their digital transformation. Qlik is known for analytics and data visualisation, but has recently moved into AI.

然后还有大批较小的公司。Databricks建立了一个AI平台,搭配了用于清洗数据、构建和部署算法的工具。C3.ai提供类似的功能,但主要目标是帮助大公司进行数字化转型。Qlik以分析和数据可视化闻名,但最近已进入AI领域。

But despite such tools, many AI projects still disappoint, says Debra Logan of Gartner, a market-research firm. One big problem is data silos which reflect a firm’s internal boundaries. Different departments within a company, afraid of relinquishing power, are loth to share their data or change what they collect and how (making the point that data structures are often just thinly veiled power structures). This has kept many firms from developing a coherent “data strategy” that would ensure they actually collect and analyse the information they need to achieve their business goals.

市场调研公司高德纳的黛布拉·洛根(Debra Logan)表示,尽管有这些工具,但许多AI项目仍然令人失望。一个大问题是数据孤岛,这反映了企业的内部边界。企业中的各个部门害怕失去权力,不愿意共享数据或改变收集的内容和方式(这也说明,数据结构通常只是权力结构遮上了一层薄薄的面纱)。这使许多公司无法制定连贯的“数据战略”,以确保它们真正能收集和分析实现业务目标所需的信息。

To overcome such digital divisions, some companies have made organisational changes. A growing number have appointed a “chief data officer” who can knock heads together to ensure that the IT department and business units work together, which they must to build anything resembling an AI-ssembly line. Yet changes at the top, as well as in technology, are not worth much if the rest of the company is not ready. “Poor data literacy” is the second biggest barrier to corporate data projects, preceded only by “cultural challenges to accept change”, according to a recent survey by Gartner. Changing this does not mean that all employees have to become data scientists, but that they have a basic grasp of what data can and cannot be used for, says Mike Potter, the chief technology officer of Qlik.

为了克服这种数字分隔,一些企业进行了组织调整。越来越多企业任命了“首席数据官”,他们可以把人们聚在一起,确保IT部门和业务部门能够合作,而这对于建立类似于AI装配线的任何东西都是必不可少的。但是,如果公司的其他成员还没有做好准备,那么高层以及技术方面的变革就没有多大价值。根据高德纳最近的一项调查,“数据素养不佳”是公司数据项目的第二大障碍,仅次于“接受变革的文化挑战”。Qlik首席技术官迈克·波特(Mike Potter)表示,改变这一点并不意味着所有员工都必须成为数据科学家,而是要对数据可以用来干什么、不可以干什么有基本的了解。

Data, he argues, are never neutral and must always be questioned: they may be collected for political reasons or in a way that hides things. “We all think that data are so objective,” he says, “but they are actually as interpretable as Shakespeare.” Despite all the tech, there may never be a single version of the truth. ■

他认为,数据永远都不是中立的,必须始终受到质疑:收集它们可能是出于政治原因,又或者收集的方式会隐瞒某些事情。“我们都认为数据是如此客观,”他说,“但实际上它们和莎士比亚一样,可以有很多种解读。” 尽管有了这么多技术,但“唯一版本的事实”可能永远都不会有。■