FlyTitle: Business

Integrating data is getting harder, but also more important


The Economist, bilingual edition: The new AI-ssembly line

GEEKS ARE not known for being poets. But sometimes even they have a way with words, for example when trying to describe the main challenge of dealing with data. It is the search, they say, for “a single version of the truth”.


This also nicely describes what has been the goal of corporate information technology since it emerged 60 years ago. And the adage encapsulates the main tension for businesses in the data economy: finding digital truth—that is, identifying and combining data that accurately represent reality—is becoming more difficult and more important. More difficult because data and their sources are multiplying. And more important because firms need to get their data house in order to benefit from AI, which they must to stay competitive. AI boosts revenues and profits, according to a recent survey by McKinsey, a consultancy (see chart).



Happily, technology is coming to the rescue. Data-handling software and cloud computing are increasingly enabling what George Gilbert, an investor and veteran observer of the IT industry, calls the “AI-ssembly line”—in reference to what happened a hundred years ago, when electricity replaced steam as the main source of power in factories. Before, machines had to be grouped closely around the power source—a steam engine. Electricity then allowed power to be distributed to where it was needed, which made assembly lines feasible. What is happening now, however, is actually the inverse: the machines of the digital age—a firm’s business applications and software to build these—are virtually regrouping around a new power source: central digital repositories known as “data warehouses” or “data lakes”. In time this may allow companies to build entire digital twins of themselves.


Finding digital truth is hard because the data come from many sources and in a staggering variety of formats—which makes them hard to integrate. Even simple things such as a customer’s name can be defined and stored in many different ways. Companies can have thousands of software applications, each with its own database. Failed attempts to consolidate or link these digital repositories have cost armies of chief information officers their jobs.
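To make the customer-name problem concrete, here is a minimal sketch in Python. It assumes two hypothetical applications, a customer-relations system and a billing database, that store the same person differently. The field names, sample records and normalisation rule are all illustrative assumptions, not any vendor's method.

```python
import unicodedata


def normalise_name(raw: str) -> str:
    """Reduce a customer name to a comparable key (illustrative rule only)."""
    # Turn "Surname, Given" into "Given Surname"
    if "," in raw:
        surname, given = raw.split(",", 1)
        raw = f"{given} {surname}"
    # Strip accents, lowercase and collapse whitespace
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return " ".join(text.lower().split())


# Hypothetical records from two applications, each with its own layout
crm = [{"full_name": "Müller, Anna", "email": "anna@example.com"}]
billing = [{"first": "Anna", "last": "MULLER", "balance": 120.0}]


def integrate(crm_rows, billing_rows):
    """Merge both sources into one record per normalised customer name."""
    merged = {}
    for row in crm_rows:
        key = normalise_name(row["full_name"])
        merged.setdefault(key, {}).update(email=row["email"])
    for row in billing_rows:
        key = normalise_name(f'{row["first"]} {row["last"]}')
        merged.setdefault(key, {}).update(balance=row["balance"])
    return merged


customers = integrate(crm, billing)
```

Even this toy rule would wrongly merge two different people who share a name, which hints at why real integration projects are so much harder.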


Integrating data was already a major problem when IT existed mainly to keep track of a firm’s “transactions”, such as processing an order or managing the supply chain. It has only become more difficult since. In the 1990s firms started using their data to work out how they had been doing, something called “analytics”. A decade ago they turned to mining their data to make predictions about their business, an approach first dubbed “big data” and now AI. Today a firm’s data are often not just spread across many local databases, but also live in different cloud services and stream in from third parties and connected devices.


It is the data warehouses and data lakes that are now making it easier to use the digital stuff. They differ in the way they structure information: the first takes a more rigid approach than the second, although the differences are getting smaller. But both can now live in the cloud. This makes them not only cheaper to manage but also easier to feed with data from many different sources and to share among many different users. One such repository is made by Snowflake, a startup, which has turned its data warehouse into what it calls a “data platform” that can stretch across different computing clouds. Big cloud providers such as Amazon Web Services and Microsoft Azure offer similar products.
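The difference in rigidity can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how Snowflake or any cloud provider implements its product: the hypothetical `Warehouse` class checks every row against a fixed schema when it is written, whereas the hypothetical `Lake` class stores anything and imposes structure only when the data are read.

```python
class Warehouse:
    """Schema-on-write: rows are checked against a fixed schema before storage."""

    def __init__(self, schema):
        self.schema = schema  # column name -> expected Python type
        self.rows = []

    def write(self, row):
        if set(row) != set(self.schema):
            raise ValueError(f"unexpected columns: {sorted(row)}")
        for col, typ in self.schema.items():
            if not isinstance(row[col], typ):
                raise ValueError(f"{col} must be {typ.__name__}")
        self.rows.append(row)


class Lake:
    """Schema-on-read: anything is stored; structure is imposed at query time."""

    def __init__(self):
        self.objects = []

    def write(self, obj):
        self.objects.append(obj)

    def read(self, columns):
        # Keep only the records that happen to carry the requested fields
        return [
            {c: o[c] for c in columns}
            for o in self.objects
            if isinstance(o, dict) and all(c in o for c in columns)
        ]


warehouse = Warehouse({"order_id": int, "amount": float})
warehouse.write({"order_id": 1, "amount": 9.5})

lake = Lake()
lake.write({"order_id": 2, "amount": 3.0, "note": "gift"})
lake.write("raw log line")  # accepted as-is, ignored by structured reads
orders = lake.read(["order_id", "amount"])
```

The warehouse rejects a malformed row at once; the lake accepts it and leaves the sorting-out to whoever reads the data later.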


A second improvement is specialised databases, which take care of certain types of data. Since data often no longer come in the form of static blocks, but rather real-time digital streams, they have to be treated differently, explains Jay Kreps, the chief executive of a startup appropriately named Confluent. It sells cloud services based on Apache Kafka, an open-source program, which analyse these streams and dump them into data lakes. Bosch, a German conglomerate, uses Confluent to gather and mine data from power tools to manage repair services and construction sites.
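What treating a stream differently from a static block means can be shown with a short sketch in plain Python. It does not use the API of Apache Kafka or Confluent. The tool identifiers, temperature readings and service threshold are invented for illustration. The point is that state is updated as each event arrives, instead of waiting for a complete batch.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple


def tool_events() -> Iterator[Tuple[str, float]]:
    """Hypothetical stream of (tool_id, motor temperature in Celsius) readings."""
    yield from [
        ("drill-1", 61.0),
        ("saw-7", 70.5),
        ("drill-1", 95.0),
        ("saw-7", 71.0),
    ]


def running_max(events: Iterable[Tuple[str, float]]) -> Iterator[Dict[str, float]]:
    """Update per-tool state on every event and emit a snapshot each time."""
    state: Dict[str, float] = defaultdict(lambda: float("-inf"))
    for tool_id, temp in events:
        state[tool_id] = max(state[tool_id], temp)
        yield dict(state)


snapshots = list(running_max(tool_events()))

# Assumed rule: flag any tool whose peak temperature exceeds 90 degrees
needs_service = [tool for tool, temp in snapshots[-1].items() if temp > 90.0]
```

Each snapshot could be acted on immediately or appended to a data lake for later analysis.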


Yet it is a third group of software and services that turns all this into Mr Gilbert’s “AI-ssembly line”. Some of these tools prepare data for crunching, others make it easy to design and train an AI algorithm, deploy it in an application to automate decisions and continuously improve it. Enel, a utility, has used such tools to develop a service that helps it identify the power thieves it needs to go after first. Shell, an oil company, has designed algorithms that ensure that its thousands of spare parts are always available around the world. And Kiva, a non-profit lender, has built a data warehouse with Snowflake that allows it to make better decisions about who should receive its loans.


Many other firms were not so lucky, forgetting that technology is always only part of the solution. Motivated by studies that found that AI boosts profits and, in some cases, panicked by the possibility of being disrupted by a startup, some tried to cobble together an AI-ssembly line themselves, but failed. They did not have the right type of developers and data scientists, or did not want to pay their exorbitant salaries. This has created an opening for IT vendors to sell more or less pre-packaged versions of AI-ssembly lines, each coming at it from a different direction.


Meanwhile, at the refinery


Take incumbents first, which are trying to build on their strengths. In the case of the granddaddy, IBM, this is services. It helps firms build what Arvind Krishna, soon its new boss, calls a “data plane”, a collection of programs to develop AI applications. It has also become a data refiner itself: for example, it collects and sells granular weather data that insurers can use to calculate rates, and utilities to predict where power cuts may occur. And it offers a range of AI services, including visual recognition and translation, that other firms can plug into their products.


Oracle, the world’s leading vendor of relational databases, still the workhorses of corporate IT, aims to extend that position by providing what it calls an “autonomous database”. This type of service combines and automates all sorts of digital repositories, plus bits of AI, so customers do not have to put together all these programs themselves. “It’s many data engines in a single engine,” explains Paul Sonderegger, the firm’s senior data strategist, adding that such integration will be key to increasing a firm’s “data productivity—increasing the dollar output per data input”.


As for younger IT firms, they are increasingly offering to help firms to get their digital ducks lined up, too. Salesforce, which grew up as a web-based service to manage customer relations, has spent billions in the past two years to develop its own AI technology, called Einstein, and acquire two big-data companies, MuleSoft and Tableau. The idea, says Bret Taylor, Salesforce’s president and chief operating officer, is to allow firms to consolidate and link their data so they can have a “single view of their customers”. This makes it easier for firms to anticipate what their customers will do, personalise offers and always recognise them, whether they show up in a retail store or online.


Then there is a host of smaller firms. Databricks has put together an AI platform, complete with tools to cleanse data, build algorithms and deploy them. Another firm offers something similar, but mainly aims to help big firms through their digital transformation. Qlik is known for analytics and data visualisation, but has recently moved into AI.


But despite such tools, many AI projects still disappoint, says Debra Logan of Gartner, a market-research firm. One big problem is data silos which reflect a firm’s internal boundaries. Different departments within a company, afraid of relinquishing power, are loth to share their data or change what they collect and how (making the point that data structures are often just thinly veiled power structures). This has kept many firms from developing a coherent “data strategy” that would ensure they actually collect and analyse the information they need to achieve their business goals.


To overcome such digital divisions, some companies have made organisational changes. A growing number have appointed a “chief data officer” who can knock heads together to ensure that the IT department and business units work together, which they must to build anything resembling an AI-ssembly line. Yet changes at the top, as well as in technology, are not worth much if the rest of the company is not ready. “Poor data literacy” is the second biggest barrier to corporate data projects, preceded only by “cultural challenges to accept change”, according to a recent survey by Gartner. Changing this does not mean that all employees have to become data scientists, but that they have a basic grasp of what data can and cannot be used for, says Mike Potter, the chief technology officer of Qlik.


Data, he argues, are never neutral and must always be questioned: they may be collected for political reasons or in a way that hides things. “We all think that data are so objective,” he says, “but they are actually as interpretable as Shakespeare.” Despite all the tech, there may never be a single version of the truth. ■
