Snorkel.ai: Unlocking Subject Matter Experts to make Software 2.0 [Alex Ratner]

Snorkel.ai CEO Alex Ratner explains the big idea behind Snorkel Flow, and I explain why I think it's a big deal.

Source: https://www.thecloudcast.net/2021/06/automated-data-labeling-for-ai-apps.html
See also: https://softwareengineeringdaily.com/2020/04/09/snorkel-training-dataset-management-with-braden-hancock/

Software 2.0 is Andrej Karpathy's idea that instead of coding business logic by hand, the applications of the future will be trained on data. In other words, machine learning. But ML is limited by the quality of the data available, and a lot of unstructured, unlabeled data is still being labeled manually today. Scale.AI is a well-known startup that has done very well offering a scalable manual labeling workforce. However, this approach is still bottlenecked by the number of subject matter experts available to label critically important data, like cancer diagnoses and drug trafficking rings. To get labels from subject matter experts, you typically have to put them through a very tedious labeling process to build up a useful structured dataset before any useful machine learning can be done.

I did some very minor ML work about 5 years ago and found Christopher Ré's work on DeepDive at Stanford. It takes a revolutionary approach by making it easy to write the labeling functions themselves. This turns the labeling process into an iterative, REPL-like experience where subject matter experts can suggest a function, see its impact right away, and continue refining it, assisted by AI. This line of research is now commercialized in a startup called Snorkel.AI, so I was very excited to find a clear explanation of Snorkel Flow from its CEO, Alex Ratner.

Here it is!

Transcript

[00:01:15] Alex Ratner: Snorkel Flow is a platform that's meant to take this process of building machine learning models and AI applications, and, again, all starting with building the data that they rely on, that fuels them, and make it, in a nutshell, look more like an iterative software development process than, you know, this kind of 80, 90% upfront, just hand-labeling exercise.

[00:01:34] And so Snorkel Flow supports that entire iterative loop of actually labeling data. It can be by hand in the platform, but also, most centrally, programmatically, by letting users write what we call labeling functions. The basic idea is that rather than, say, asking your legal associate at a bank or your doctor friends to sit down and label a hundred thousand contracts or a hundred thousand electronic health records, have them write

[00:02:00] heuristics, or bits of their expertise: look for this keyword, or look for this pattern, or look for this, et cetera. It's like a bridge from old, expert-knowledge-type input to modern machine learning models, using one to power the other. So Snorkel Flow is an IDE, basically, and has a no-code UI component as well, that lets people, either via code or by pushing buttons, for even, say, non-developer subject matter experts,

[00:02:24] programmatically label their data by writing these labeling functions. And then it uses a bunch of modeling techniques, a lot of which was actually the work that the co-founding team and I did in our kind of thesis work, around how you take a bunch of programmatic data and clean it up and turn it into a final
[00:02:41] set of clean training data for machine learning models. And then, actually, in Snorkel Flow, you can basically push-button train best-in-class open source models. You can then analyze where they're succeeding or failing and use that to go back and iterate on your data.
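
To make the labeling-function idea concrete, here is a minimal sketch in plain Python. It is not the Snorkel Flow API: the label names, the three heuristics, and the toy contracts are all made up for illustration, and a simple majority vote stands in for the modeling techniques Alex mentions for cleaning up the noisy programmatic labels.

```python
import re
from collections import Counter

# Hypothetical label scheme for a toy contract classifier.
ABSTAIN, NOT_LOAN, LOAN = -1, 0, 1

def lf_mentions_borrower(text):
    """Keyword heuristic: loan agreements usually name a borrower."""
    return LOAN if "borrower" in text.lower() else ABSTAIN

def lf_interest_rate(text):
    """Pattern heuristic: 'interest (rate) of N%'."""
    pattern = r"interest\s+(rate\s+)?of\s+\d+(\.\d+)?%"
    return LOAN if re.search(pattern, text, re.IGNORECASE) else ABSTAIN

def lf_employment_terms(text):
    """Keyword heuristic: employment contracts are not loans."""
    return NOT_LOAN if re.search(r"\b(employee|salary)\b", text, re.IGNORECASE) else ABSTAIN

LABELING_FUNCTIONS = [lf_mentions_borrower, lf_interest_rate, lf_employment_terms]

def majority_vote(votes):
    """Combine noisy votes; abstain when nothing fired or the top two tie."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]

def label_documents(docs, lfs=LABELING_FUNCTIONS):
    """Apply every labeling function to every document, then vote."""
    return [majority_vote([lf(doc) for lf in lfs]) for doc in docs]

docs = [
    "The Borrower shall repay the principal at an interest rate of 4.5% per annum.",
    "The Employee shall receive a salary paid monthly.",
    "This agreement governs the use of the software.",
]
labels = label_documents(docs)
print(labels)
```

The third document gets no label at all, which is the point of abstaining: a heuristic only speaks up where the expert's rule actually applies, and the rest is left for other functions or for the trained model to generalize to.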

[00:02:54] And there's a Python SDK throughout the whole thing, so many of our customers will mix and match. They'll use Snorkel to create the training data set and then train the model on some other system, et cetera. But what Snorkel Flow aims to support is this basic iterative development process where, you know, rather than just spending months to label a training set once and then being stuck with it, and having to throw it out and start all over again if anything in the world changes, your upstream input data changes, your downstream objectives

[00:03:18] change, making it, again, more like an iterative process where you push some buttons or write some code that labels the data. You compile a model, or train it, but you can think of it like compiling, and then you go back and debug by iterating on your data. Everything centers in Snorkel Flow around looking at your data and iterating on how it's labeled to improve models.
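
The "go back and debug by iterating on your data" step can be sketched as a small report that shows how often a labeling function fires and where it is wrong on a handful of hand-checked examples. Everything here is an assumption for illustration (the `lf_report` helper, the two versions of the heuristic, and the four-example dev set); it is plain Python, not Snorkel Flow's analysis tooling.

```python
# Hypothetical label scheme, matching a toy loan-contract classifier.
ABSTAIN, NOT_LOAN, LOAN = -1, 0, 1

def lf_rate_v1(text):
    """First attempt: any mention of 'rate' suggests a loan."""
    return LOAN if "rate" in text.lower() else ABSTAIN

def lf_rate_v2(text):
    """Refined after looking at the mistakes: require 'interest rate'."""
    return LOAN if "interest rate" in text.lower() else ABSTAIN

def lf_report(lf, texts, gold):
    """Coverage, accuracy on fired examples, and the texts it got wrong."""
    fired = [(t, lf(t), g) for t, g in zip(texts, gold) if lf(t) != ABSTAIN]
    correct = sum(1 for _, vote, g in fired if vote == g)
    return {
        "coverage": len(fired) / len(texts),
        "accuracy": correct / len(fired) if fired else None,
        "mistakes": [t for t, vote, g in fired if vote != g],
    }

# A tiny hand-labeled dev set (made up for this sketch).
DEV_TEXTS = [
    "Interest rate of 5% applies to the loan.",
    "The contractor bills an hourly rate of $90.",
    "The lender may adjust the interest rate annually.",
    "Either party may terminate with 30 days notice.",
]
DEV_GOLD = [LOAN, NOT_LOAN, LOAN, NOT_LOAN]

for lf in (lf_rate_v1, lf_rate_v2):
    print(lf.__name__, lf_report(lf, DEV_TEXTS, DEV_GOLD))
```

The first version fires on the hourly-rate contract, the report surfaces that text as a mistake, and the fix is a one-line change to the function rather than a relabeling pass over the data.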


[00:03:38] Brian Gracely: I'm curious. So you mentioned in there that there's a Python SDK, which, for anybody who works in data science, data modeling, right, Python is your lingua franca, sort of the language you use, or R, a couple of them. That's the language that, you know, you do your programming in. But I'm curious: in today's world, do data scientists consider themselves programmers, or is there still a "Hey, look, I work on the numbers, I'm good at building models and the numbers, but I don't think of myself as a programmer"?

[00:04:08] Like, how do you bridge those two worlds together, or do you not really have to bridge them together? How much does the data scientist have to go, "I have to focus on numbers and models," versus "I have to focus on programming something to do stuff"? What's their world look like?

[00:04:21] Alex Ratner: It's a great question. I think I have been, or currently am, part of four or five different data science institutes or something, and I still don't even know. I mean, data science is such a broad umbrella term. There are so many different varietals of us, and types.

[00:04:35] And so I do think there's a very broad spectrum, from the data scientist who's an ML engineer and just loves writing code to the one that, to your point, really just wants to push some buttons and get back to the numbers and the modeling and the outcome. And we definitely try to support the range through a layered approach.

[00:04:50] And we have [the Python SDK], but on top of that, we have a no-code UI that allows you to write these labeling functions without writing code. So for example, if you're trying to train a contract classifier in Snorkel Flow, you can write labeling functions based on clicking on keywords or pressing buttons, with kind of templates for types of patterns or signals you want to look for.

[00:05:11] So, you know, we try to support, basically: if you want to move fast and you're a non-developer, or you're just not looking to spend time there, you can just do it in a push-button way. But then if you want to go and customize or inject custom logic or really get creative, you can always fall back to the Python SDK.
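
One way to picture the layered approach: a keyword template can be a small factory that turns a list of clicked keywords into a labeling function, while custom logic stays an ordinary hand-written function. This is a sketch under assumptions, not Snorkel Flow's implementation; `make_keyword_lf`, the label names, and the sample clauses are hypothetical.

```python
import re

# Hypothetical labels for a toy contract classifier.
ABSTAIN, NDA, LEASE = -1, 0, 1

def make_keyword_lf(keywords, label):
    """Template layer: build a labeling function from a keyword list."""
    alternatives = "|".join(re.escape(k) for k in keywords)
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)

    def lf(text):
        return label if pattern.search(text) else ABSTAIN

    lf.__name__ = "lf_keyword_" + "_".join(k.replace(" ", "_") for k in keywords)
    return lf

# What a few button clicks might produce.
lf_nda = make_keyword_lf(["confidential", "non-disclosure"], NDA)
lf_lease = make_keyword_lf(["tenant", "landlord"], LEASE)

def lf_long_term_lease(text):
    """Code layer: custom logic a keyword template cannot express."""
    match = re.search(r"term of (\d+) months", text, re.IGNORECASE)
    return LEASE if match and int(match.group(1)) >= 12 else ABSTAIN

clause = "The Tenant agrees to a term of 24 months."
print(lf_nda(clause), lf_lease(clause), lf_long_term_lease(clause))
```

Both layers produce the same kind of object, a function from text to a label or an abstention, so the template-generated and hand-written functions can be mixed freely downstream.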

[00:05:27] And so, I mean, I think a lot of what we were trying to accomplish in the very beginning was, right, to raise the abstraction level at which you're interfacing with and programming your machine learning model or your AI application. And the first step is the hardest, right?

[00:05:39] If you think of the way that hand-labeled training data is, it's like the machine code, or really, actually, I think of it as like the ones and zeros, literally, for binary classification cases. A lot of the effort behind the Snorkel project and the company is just, or was just, getting from that layer to the layer of, say, assembly language.

[00:05:57] But once you get there, you can build all those layers on top, and you can go up the stack and down the stack according to the application or the user type, right? Actually, my co-founder Braden, who also did his PhD around Snorkel-related stuff, had a paper, actually, on how you could use natural language inputs.

[00:06:12] You could explain in natural language, just speaking to the computer, why a certain data point should be labeled a certain way, and then use off-the-shelf semantic parsers to parse that down to code, which then would get dumped into Snorkel. So basically, once you make this leap from labeling data by hand, kind of zeros and ones, to labeling your training data with code, then the sky's the limit in terms of building layers of abstraction on top of it.

[00:06:35] And that's actually a lot of what the company does and has been doing over the last two years: building a flexible interface through our platform, Snorkel Flow, for different data types and use case types and user types.

[00:06:45] Brian Gracely: Yep. Well, and I think you really answered my question in there.

[00:06:49] The reason I brought it up was, on one hand, you have this language-level SDK in terms of Python, where you can get into some pretty granular-level stuff. And then, on the other end, you've got Application Studio, which, like you said, is this sort of low-code, graphical way of building templates and building applications.

[00:07:08] And I was like, I think sometimes there's just this perspective that there's one profile of a data scientist. And I think what you really highlighted is that, like a lot of things, there's a spectrum: those that specialize in one part of the job, and others that don't care about it and want certain things to be easy.

[00:07:25] And so that was useful, because I think sometimes, in my head, I'm thinking, okay, a data scientist does a certain sort of task, the same way you might say, okay, they're a Java developer, so there's a tool set that they always use. So that was super helpful.

[00:07:39] Alex Ratner: Yeah. And it depends on what the problem is too. I mean, the other thing also that I think goes underemphasized in the AI space... Big point number one, and I don't think it's that avant-garde anymore to say, it was maybe more so back in 2015, is: hey, AI is about the data, not the models or the algorithms, which I think fewer people will find a controversial statement today,

[00:07:57] even if it's phrased in a somewhat reductive way. But the other thing that I still think is underemphasized in practice is the necessity of looping what we often refer to as subject matter experts into the process. And I won't ramble here too long, but just for some perspective, the very first funding that the Snorkel project ever had was specifically about looping in what they call SMEs in the government, subject matter experts.

[00:08:20] Our original partners were some genomicists at Stanford. How do you loop them into the process of AI in a better way than just saying, "Hey, go label data for eight months for me, please"? And this idea of how you get subject matter expertise from a human's head into a scalable machine format has been the focus of AI for decades, but the answer of modern machine learning today, for the last five, ten years, is:

[00:08:44] okay, just sit them down, have them label data points one by one, nothing else. They've got all of this rich domain knowledge, a doctor, a lawyer, a cyber analyst, a network technician, an underwriter. Throw that all away; just have them literally give zeros and ones, labeling data. And that's a nice abstraction.

[00:09:01] And it has been actually a very productive one for the field, because that means the ML engineers can totally abstract away the messy realities of real-world data and real-world subject matter experts, and just focus on optimizing a fancier model architecture. But I think we've reached a point where it starts to become silly and impractical to have this wall

[00:09:19] between the subject matter expert and the data scientist. So all that is to loop back and say that a big focus of Snorkel Flow is making these interfaces and this process accessible to a non-developer who's a legal associate or an underwriter or a network technician, and bringing them into the process too. And that's another motivation behind the kind of layers, including the no-code UI.
2021 Swyx