Databricks Inc. opens its Data + AI Summit today with the announcement that it will release the entirety of its Delta Lake storage framework as open source under the oversight of the Linux Foundation.

That means there will no longer be any functional differences between the Databricks-branded Delta Lake and the open-sourced version. The company said it will similarly release its recent enhancements to the MLflow machine learning operations platform and Apache Spark analytics framework to open source. Databricks also rolled out several new features for its core Lakehouse data lake.

Delta Lake, which was introduced three years ago and donated to open source in June 2020, improves the efficiency of the hybrid structured and unstructured analytical stores called data lakes to make information more reliable. It does that by managing transactions across batch and streaming data, coordinating multiple simultaneous writes and doing away with the need to build complicated data pipelines.

“Before Delta Lake, technologies like Spark would process large amounts of data; Delta Lake lets you process small deltas with all changes stored in history so you can go back and forward,” said Ali Ghodsi (pictured), co-founder and chief executive of Databricks. “This is important for audit trails and compliance so you can go back and find decisions you made a year ago.”
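To make the idea of small deltas stored in history concrete, here is a minimal Python sketch of an append-only change log that can rebuild a table as of any earlier version. It is a toy model written for illustration only: the `ToyDeltaTable` class, its methods and the loan records are all hypothetical, and none of it reflects Delta Lake's actual storage format or API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyDeltaTable:
    """Toy append-only change log (illustrative; not Delta Lake's real design)."""

    # Each commit is a dict of key -> value; a value of None marks a delete.
    commits: list = field(default_factory=list)

    def commit(self, changes: dict) -> int:
        """Record one small batch of changes and return its version number."""
        self.commits.append(dict(changes))
        return len(self.commits) - 1

    def snapshot(self, version: Optional[int] = None) -> dict:
        """Rebuild the table as of `version` by replaying commits 0..version."""
        if version is None:
            version = len(self.commits) - 1
        if version >= len(self.commits):
            raise ValueError(f"version {version} has not been committed yet")
        state: dict = {}
        for changes in self.commits[: version + 1]:
            for key, value in changes.items():
                if value is None:
                    state.pop(key, None)  # apply a delete
                else:
                    state[key] = value  # apply an insert or update
        return state


# Hypothetical audit-trail example: three small commits against one table.
table = ToyDeltaTable()
v0 = table.commit({"loan_42": "approved"})
v1 = table.commit({"loan_42": "rejected", "loan_43": "approved"})
v2 = table.commit({"loan_43": None})

as_of_v0 = table.snapshot(v0)  # what the table looked like after the first commit
latest = table.snapshot()      # the current state of the table
```

Because every commit is kept, an auditor can ask what the table contained at `v0` even after later commits have overwritten or deleted those rows, which is the property Ghodsi describes.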

Surge in contributions

A new 2.0 release of Delta Lake features better query performance and a foundation based on open standards. The release candidate is available now and is expected to reach general availability later this year. Databricks said the update reflects contributions from more than 6,400 developers and noted that total commits have grown 95%, with the average number of lines of code per commit surging 900% over the past year.

The company is also announcing version 2.0 of MLflow, a platform for managing machine learning projects. The release includes Pipelines, a new feature to speed and simplify machine learning model deployments. Pipelines give data scientists pre-defined, production-ready templates based on the model type they’re building to allow faster and more reliable model development without requiring intervention by production engineers.

Users can define the elements of the pipeline in a configuration file and MLflow Pipelines manages execution automatically, the company said. Databricks has also added serverless model endpoints to directly support production model hosting, as well as built-in model monitoring dashboards to help teams analyze real-world model performance.
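The general pattern of declaring steps in configuration and letting a runner execute them can be sketched in a few lines of Python. Everything below is an assumption made for illustration: the step names, the `CONFIG` layout and the `run_pipeline` helper are hypothetical and do not reproduce the actual MLflow Pipelines configuration schema or API.

```python
from typing import Any, Callable, Dict


def ingest(_previous: Any, rows: list) -> list:
    """Hypothetical first step: load raw rows supplied by the config."""
    return list(rows)


def split(data: list, train_fraction: float) -> dict:
    """Hypothetical step: split rows into train and test sets."""
    cut = int(len(data) * train_fraction)
    return {"train": data[:cut], "test": data[cut:]}


def train(data: dict) -> dict:
    """Hypothetical step: 'train' a trivial model (the mean of the train set)."""
    mean = sum(data["train"]) / len(data["train"])
    return {"model_mean": mean, "test": data["test"]}


# Registry mapping step names used in the config to Python callables.
STEPS: Dict[str, Callable[..., Any]] = {
    "ingest": ingest,
    "split": split,
    "train": train,
}

# Stand-in for a configuration file; in practice this might be parsed from YAML.
CONFIG = {
    "steps": [
        {"name": "ingest", "params": {"rows": [1.0, 2.0, 3.0, 4.0]}},
        {"name": "split", "params": {"train_fraction": 0.5}},
        {"name": "train", "params": {}},
    ]
}


def run_pipeline(config: dict) -> Any:
    """Run each configured step in order, feeding each output to the next step."""
    result = None
    for step in config["steps"]:
        step_fn = STEPS[step["name"]]
        result = step_fn(result, **step.get("params", {}))
    return result


result = run_pipeline(CONFIG)
```

The point of the design is that changing the split ratio or swapping a step means editing the configuration, not the runner, which is how a template can stay production-ready while data scientists adjust only the declared elements.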

Downstream benefits

Ghodsi said the decision to donate the latest enhancements to MLflow, which was donated to the Linux Foundation two years ago, is consistent with the company’s roots. “For us, the whole business model is to keep open-sourcing and keep innovating,” he said. Claiming 1 million downloads for MLflow, he said giving the software away has downstream benefits for the company.

“Imagine an enterprise software company with a million downloads,” he said. “Those people are not our customers but they are using our technology. These projects become standards; people teach classes and write books about them.”

Enhancements to Spark, the wildly successful analytics framework that launched Databricks in 2013, include Spark Connect, which allows Spark to run on nearly any device, and Project Lightspeed, a Structured Streaming engine for data streaming on the lakehouse. Spark Connect is a client/server interface for Spark based on the Spark DataFrame API that decouples the client and server for better stability while allowing for built-in remote connectivity.

Better streaming for Spark

Project Lightspeed is described as the next generation of the Spark Structured Streaming engine. It is aimed at improving performance, expanding ecosystem support for connectors, adding new operators and simplifying deployment and operations.

The new streaming engine will also be more accessible from popular analytics programming languages such as Python, Ghodsi said. “Every year we’ve been excited for real-time streaming to take off and this year it’s taking off, I think, because of machine learning,” he said.

Databricks is also using the event to roll out a series of enhancements to its flagship Lakehouse platform. They include a serverless version that is now available in preview on the Amazon Web Services Inc. cloud, general availability of the company’s Photon query engine, open-source connectors for Go, Node.js and Python, and the ability to federate queries across multiple remote data sources without first extracting and loading the data.

Photo: SiliconANGLE
