It’s been over a decade since a disillusioned Jeff Hammerbacher told Businessweek he was quitting Facebook (Meta Platforms Inc.) because “the best minds of my generation are thinking about how to get people to click on ads. And that sucks.”
Today’s analytical and mathematical geniuses are still struggling to solve the problem of gaining value from data. But ways to monetize data assets have matured from click tracking and user persuasion to optimizing business operations, boosting productivity, and even saving lives through advances in HealthTech and climate change tracking and amelioration.
As data ascends to the throne of the digital age, organizations are racing to become agile and data-driven. But the path to gaining full benefit from data assets is often roadblocked by the limitations of existing data architecture.
“The hard truth is the original promises of master data management, enterprise data warehouses, data marts, data hubs and, yes, even data lakes were broken and left us wanting for more,” said theCUBE industry analyst Dave Vellante as he introduced the event “The Data Doesn’t Lie … Or Does It?”
The event brought together industry experts Justin Borgman (pictured, left), chief executive officer and co-founder of Starburst Data Inc.; Richard Jarvis (pictured, center), chief technology officer of EMIS Group PLC; and Theresa Tung (pictured, right), cloud-first chief technologist at Accenture PLC, for a candid discussion of three potent data lies prevalent in enterprise technology circles. (* Disclosure below.)
Lie #1: The most effective data architecture is centralized
One of the biggest challenges newly digital businesses face is how to securely manage data while making it available to those within the organization who need it. The current consensus is that the most effective data architecture to do this is a centralized data warehouse with a team of data specialists to serve the various lines of business. But is this statement true or false?
“Definitely a lie,” Borgman stated, recalling his experience working with centralized enterprise data warehouse pioneer Teradata Inc.
“One of the things that I found fascinating was that not one of their customers had actually lived up to that vision of centralizing all of their data into one place. They all had data silos. They all had data in different systems,” he said.
Centralized data architecture is an unrealizable goal, Jarvis agreed. While cloud-native startups could in theory centralize all the data and tooling and teams in one place, the reality is that as they grow they’re likely to acquire companies with legacy data architectures.
“It’s just really impossible to get that academic perfection of storing everything in one place,” Jarvis said.
It’s not an either/or question, according to Tung. She sees a need for centralized governance in specific cases, such as companies required to comply with the Sarbanes-Oxley Act of 2002, which mandates centralized retention of certified documentation as a single source of truth for fiscal transparency. And while Tung describes herself as a devotee of centralized data (she would love to have all of her data in one warehouse), she admits that keeping everything centralized is not a practical reality.
Distributed data architecture is necessary to be able to scale, enable different areas of the organization to make the right data investments for their needs, and be able to collaborate with partners, according to Tung.
“We’re going to see a lot more data sharing and model creation, and so you’re definitely going to be decentralized,” she said.
The conversation continued, with Jarvis describing how EMIS has decentralized data in its healthcare model and Borgman explaining how he would respond to a customer who believed centralization was the most cost-effective way to serve a business.
“The data mesh model basically says, ‘Data’s decentralized and we’re going to turn that into an asset rather than a liability,’” he said, emphasizing that this occurs through “empowering the people that know the data the best to participate in the process of curating and creating data products for consumption.”
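The data-product idea Borgman describes can be sketched in a few lines of code. The following is a purely conceptual illustration, not any vendor’s actual API: names such as `DataProduct` and `Catalog`, the `primary-care` domain, and the sample records are all invented for this sketch. The point it makes is structural: each domain team publishes and owns its product, and consumers discover products through a registry rather than through one central warehouse team.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical sketch: DataProduct and Catalog are illustrative names,
# not part of Starburst's or anyone else's actual API.

@dataclass
class DataProduct:
    """A dataset published and curated by the domain team that knows it best."""
    name: str
    domain: str                       # owning line of business
    description: str
    fetch: Callable[[], List[dict]]   # returns the product's records

@dataclass
class Catalog:
    """A lightweight registry consumers search instead of one central warehouse."""
    products: Dict[str, DataProduct] = field(default_factory=dict)

    def publish(self, product: DataProduct) -> None:
        self.products[product.name] = product

    def by_domain(self, domain: str) -> List[DataProduct]:
        return [p for p in self.products.values() if p.domain == domain]

catalog = Catalog()
catalog.publish(DataProduct(
    name="appointments_daily",
    domain="primary-care",
    description="Daily appointment counts, owned by the primary-care team",
    fetch=lambda: [{"date": "2022-08-01", "count": 1289}],
))

found = catalog.by_domain("primary-care")
print(found[0].fetch()[0]["count"])  # → 1289
```

The essential design choice is that ownership and curation stay with the publishing domain; the catalog only brokers discovery, which is how decentralized data becomes “an asset rather than a liability.”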
Here’s the complete first event panel discussion with Borgman, Jarvis and Tung:
Lie #2: An open-source-based platform cannot provide the performance and control that a proprietary system does
The open versus proprietary debate is a perennial one in tech, and when it comes to data, the majority has long sat in the proprietary camp: for the past decade, the traditional data warehouse model has dominated enterprise data management. When the first data lakes were being built around Apache Hadoop roughly 10 years ago, open-source-based data platforms couldn’t deliver the performance needed for fast, interactive SQL queries, according to Borgman. Now it is a different story.
“We have large, giant hyperscale internet companies that don’t have the traditional data warehouse at all; they do all of their analytics in a data lake,” he stated. “So I think we’ve proven that it’s very much possible.”
Companies still have to catch up to the idea that open data formats can deliver the same level of performance as a traditional data warehouse, because the industry was built around vendor lock-in, according to Borgman.
“How many people love Oracle today but are customers nonetheless?” he asked, emphasizing the word “love.”
The real benefit of open source to companies is that “open buys us the ability to be unsure about the future,” according to Jarvis. “One thing that’s always true about technology is it evolves in a direction slightly different to what people expect.”
Time after time, companies have had to eat losses because of investments that became obsolete when tech innovation took an unanticipated turn. But, by choosing open storage technologies, companies can hedge their bets by applying several different technologies to data processing, according to Jarvis.
“That gives us the ability to remain relevant and innovate on our data storage,” he said.
Here’s the complete second event panel discussion, in which the panel debates open-source-based data platforms versus proprietary data warehouses:
Lie #3: Today’s “modern data stack” is modern
The semantics of the word “modern” formed the basis of the event’s third panel discussion, which Borgman kicked off with the statement: “New isn’t modern. It’s the new data stack. It’s the cloud data stack. But that doesn’t necessarily mean it’s modern. I think a lot of the components are exactly the same as what we’ve had for 40 years.”
There are differences, Vellante countered, pointing out that the cloud enables scalability and separates compute from storage.
“The cloud data warehouses out there are really just separating their compute from their storage,” Borgman stated. “A lot of the same structural constraints that exist with the old on-premises enterprise data warehouse model still exist; they’re just a little bit more elastic now because the cloud offers that.” He noted that in cloud data warehouses, data is still stored in a proprietary format, still has to be ingested before it can be analyzed, and still locks the customer into a single vendor.
The cloud providers are looking toward more of a cloud continuum, according to Tung: rather than treating the cloud as a data lake or data warehouse in one central place, they are introducing new query services that extend queries beyond a single location.
“The next modern generation of the data stack needs to be much more federated,” she stated, citing the rise of edge computing, the need for more on-premises storage due to data sovereignty and data gravity, and the proliferation of multicloud as reasons.
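The federation Tung describes can be illustrated in miniature with SQLite’s `ATTACH DATABASE`, which lets a single SQL statement join tables living in two separate database files. This is a toy stand-in for the cross-system queries that engines such as Starburst run at scale; the file names, table schemas, and sample rows below are all invented for the sketch.

```python
import os
import sqlite3
import tempfile

# Toy illustration only: two separate SQLite files stand in for two
# independent data systems; ATTACH lets a single query span both.
tmp = tempfile.mkdtemp()
sales_db = os.path.join(tmp, "sales.db")
crm_db = os.path.join(tmp, "crm.db")

with sqlite3.connect(sales_db) as con:
    con.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
    con.executemany("INSERT INTO orders VALUES (?, ?)",
                    [(1, 120.0), (2, 75.5), (1, 30.0)])

with sqlite3.connect(crm_db) as con:
    con.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
    con.executemany("INSERT INTO customers VALUES (?, ?)",
                    [(1, "Acme"), (2, "Globex")])

# One query, two "systems": the join reaches the data where it lives,
# with no ingestion step into a central store.
con = sqlite3.connect(sales_db)
con.execute("ATTACH DATABASE ? AS crm", (crm_db,))
rows = con.execute("""
    SELECT c.name, SUM(o.amount)
    FROM orders AS o
    JOIN crm.customers AS c ON c.id = o.customer_id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
con.close()
print(rows)  # → [('Acme', 150.0), ('Globex', 75.5)]
```

A production federated engine does the same thing across warehouses, lakes, and on-premises stores, pushing work down to each source, which is what makes querying data in place practical at enterprise scale.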
Modernizing the stack has to include the processes and people around the data, as well as modernizing applications, according to Jarvis. “The stack needs to support a scalable business, not just the technology itself,” he said.
EMIS is addressing this by “changing in a way very much aligned to data products and data mesh,” Jarvis stated. He sees this as the solution that both enables more people to consume the service and keeps costs low.
Wanting to keep valuable data sources on-prem while still modernizing and taking advantage of the benefits of cloud is a “killer case for data mesh,” according to Tung, who sees data mesh as providing “the best of both worlds.”
This is the use case that Starburst Enterprise was designed to solve, according to Borgman. The company aims to provide its customers with the flexibility to operate and analyze data across a wide variety of different systems, which in turn “provides the ability to reduce costs, store more in a data lake rather than data warehouse, [and] provides the ability for the fastest time to insight to access the data directly where it lives,” he stated.
The offering also includes the concept of data products so that customers can “create and curate data as a product to be shared and consumed,” Borgman added. “We’re trying to help enable the data mesh model and make that an appropriate complement to the modern data stack that people have today.”
Here’s the complete third event panel discussion, on why today’s “modern data stack” is not modern:
Here’s the complete event video:
(* Disclosure: TheCUBE is a paid media partner for the “The Data Doesn’t Lie … Or Does It?” event. Neither Starburst Data Inc., the sponsor of theCUBE’s event coverage, nor other sponsors have editorial control over content on theCUBE or SiliconANGLE.)