Databases have long been used for transactional and analytics use cases, but they also have practical utility to help enable machine learning capabilities. After all, machine learning is all about deriving insights from data, which is often stored inside a database.
San Francisco-based database vendor Splice Machine is taking an integrated approach to enabling machine learning with its eponymous database. Splice Machine is a distributed SQL relational database management system that includes machine learning capabilities as part of the overall platform.
Splice Machine 3.0 became generally available on March 3, bringing with it updated machine learning capabilities. It also has new Kubernetes cloud native-based model for cloud deployment and enhanced replication features.
In this Q&A, Monte Zweben, co-founder and CEO of Splice Machine, discusses the intersection of machine learning and databases and provides insight into the big changes that have occurred in the data landscape in recent years.
How do you integrate machine learning capabilities with a database?
Monte Zweben: The data platform itself has tables, rows and schema. The machine learning manager that we have native to the database has notebooks for developing models, Python for manipulating the data, algorithms that allow you to model and model workflow management that allows you to track the metadata on models as they go through their experimentation process. And finally we have in-database deployment.
So as an example, imagine a data scientist working in Splice Machine working in the insurance industry. They have an application for claims processing and they are building out models inside Splice Machine to predict claims fraud. There’s a function in Splice Machine called deploy, and what it will do is take a table and a model to generate database code. The deploy function builds a trigger on the database table that tells the table to call a stored procedure that has the model in it for every new record that comes in the table.
So what does this mean in plain English? Let’s say in the claims table, every time new claims would come in, the system would automatically trigger, grab those claims, run the model that predicts claim cause and outputs those predictions in another table. And now all of a sudden, you have real-time, in-the-moment machine learning that is detecting claim fraud on first notice of loss.
What does distributed SQL mean to you?
Zweben: So at its heart, it’s about sharing data across multiple nodes. That provides you the ability to parallelize computation and gain elastic scalability. That is the most important distributed attribute of Splice Machine.
In our new 3.0 release, we just added distributed replication. It’s another element of distribution where you have secondary Splice Machine instances in geo-replicated areas, to handle failover for disaster recovery.
What’s new in Splice Machine 3.0?
Zweben: We moved our cloud stack for Splice Machines from an old Mesos architecture to Kubernetes. Now our container-based architecture is all Kubernetes, and that has given us the opportunity to enable the separation of storage and compute. You literally can pause Splice Machine clusters and turn them back on. This is a great utility for consumption based usage of databases.
Along with our upgrade to Kubernetes, we also upgraded our machine learning manager from an older notebook technology called Zeppelin to a newer notebook technology that has really gained momentum in the marketplace, as much as Kubernetes has in the DevOps world. Jupyter notebooks have taken off in the data science space.
We’ve also enhanced our workflow management tool called mlflow, which is an open source tool that originated with Databricks and we’re part of that community. Mlflow allows data scientists to track their experiments and has that record of metadata available for governance.
What’s your view on open source and the risk of a big cloud vendor cannibalizing open source database technology?
Zweben: We do compose many different open source projects into a seamless and highly performant integration. Our secret sauce is how we put these things together at a very low level, with transactional integrity, to enable a single integrated system. This composition that we put together is open source, so that all of the pieces of our data platform are available in our open source repository, and people can see the source code right now.
I’m intensely worried about cloud cannibalization. I switched to an AGPL license specifically to protect against cannibalization by cloud vendors.
On the other hand, we believe we’re moving up the stack. If you look at our machine learning package, and how it’s so inextricably linked with the database, and the reference applications that we have in different segments, we’re going to be delivering more and more higher-level application functionality.
What are some of the biggest changes you’ve seen in the data landscape over the seven years you’ve been running Splice Machine?
Zweben: With the first generation of big data, it was all about data lakes, and let’s just get all the data the company has into one repository. Unfortunately, that has proven time and time again, at company after company, to just be data swamps.
Data repositories work, they’re scalable, but they don’t have anyone using the data, and this was a mistake for several reasons.
Monte ZwebenCo-founder and CEO, Splice Machine
Instead of thinking about storing the data, companies should think about how to use the data. Start with the application and how you are going to make the application leverage new data sources.
The second reason why this was a mistake was organizationally, because the data scientists who know AI were all centralized in one data science group, away from the application. They are not the subject matter experts for the application.
When you focus on the application and retrofit the application to make it smart and inject AI, you can get a multidisciplinary team. You have app developers, architects, subject-matter experts, data engineers and data scientists, all working together on one purpose. That is a radically more effective and productive organizational structure for modernizing applications with AI.