The Index Podcast
Sept. 6, 2024

AI Governance and Decentralized Training with Alexander Joseph Long, Founder of Pluralis.ai

What will the future of AI governance and decentralized training look like? In this episode of The Index, host Alex Kehaya is joined by Alexander Joseph Long, Founder of Pluralis.ai, to explore this pressing question. With a PhD in non-parametric external memory in deep learning and extensive experience at Amazon working on multimodal foundation models, Alexander offers a deep dive into the cutting-edge developments shaping the AI landscape.

They unpack why decentralized training is more than just a technological trend; it's an essential shift for ensuring control and ownership of AI models. With Ethereum's transition from proof of work to proof of stake, the repurposing of mining GPUs for AI opens exciting possibilities for decentralized AI systems. The conversation also covers the economic incentives and infrastructure driving this transformation, along with the latest research optimizing communication efficiency for decentralized training algorithms. These developments point to a future where decentralized AI could challenge the monopolistic control held by a few tech giants.

Tune in to discover how this evolving technology could reshape the balance of power in AI.

Learn more at https://pluralis.ai/

Show Links

The Index
X Channel
YouTube


Host - Alex Kehaya

Producer - Shawn Nova


Chapters

00:06 - Decentralized AI Training - Feasibility Discussion

11:03 - Decentralized AI Training Impact Analysis

19:29 - Innovative Approaches to AI Training

30:18 - The Dangers of Centralized AI

Transcript

WEBVTT

00:00:06.229 --> 00:00:10.154
Welcome to the Index podcast, hosted by Alex Kehaya.

00:00:10.154 --> 00:00:36.457
Plug in as we explore new frontiers with entrepreneurs, builders and investors shaping the future of the internet. Hey everybody, and welcome to the Index.

00:00:36.941 --> 00:00:44.207
I'm your host, Alex Kehaya, and today I'm excited to welcome Alexander Joseph Long, founder of Pluralis AI.

00:00:49.243 --> 00:00:51.107
Alexander is at the forefront of revolutionizing AI with decentralized training.

00:00:51.107 --> 00:01:01.347
It's a groundbreaking approach that promises to reshape the development and governance of foundation models and something that is a bit of a contrarian viewpoint.

00:01:01.347 --> 00:01:07.484
Many people don't even think what he's trying to build is possible, so I'm very happy to have you on the show today.

00:01:07.484 --> 00:01:08.367
Thanks for being here.

00:01:09.069 --> 00:01:09.710
Thanks for having me on.

00:01:10.359 --> 00:01:19.388
So before we get into the impossible thing you're conquering, I'd just like to hear more about you and why you're in this space in the first place.

00:01:19.388 --> 00:01:21.433
Mind just telling your story a little bit here?

00:01:21.980 --> 00:01:23.986
Yeah, sure, so I'm an AI guy.

00:01:23.986 --> 00:01:28.385
I did my PhD on non-parametric external memory in deep learning.

00:01:28.385 --> 00:01:30.611
That I finished back in 2021.

00:01:30.611 --> 00:01:39.870
And then I joined Amazon straight out of that, where I was working on sort of multimodal foundation models, both publishing research and also applying them to product internally in retail.

00:01:39.870 --> 00:01:43.944
So my background is sort of heavy on the tech side.

00:01:44.787 --> 00:01:49.355
I got really interested in sort of this whole area back in sort of the tail end of last year.

00:01:49.355 --> 00:01:57.153
And you know, the thing is, when you start to really think this stuff through, to really think about what you need to get decentralized training to work…

00:01:57.153 --> 00:02:03.290
It's very obvious where a lot of crypto starts to fit in, and that was sort of what drew me across to this side.

00:02:03.290 --> 00:02:10.628
So I feel like I'm sort of just one of the first people that's coming from this sort of other world and we're starting to sort of move across into this other field.

00:02:10.628 --> 00:02:15.468
You know, while there's this sort of small number of people doing it at the moment, I think it's going to turn into quite a few soon.

00:02:16.340 --> 00:02:19.549
You obviously have deep expertise in AI and you've thought about this a lot.

00:02:19.549 --> 00:02:24.729
Why does decentralized training of AI matter?

00:02:24.729 --> 00:02:26.252
Why do people care about that?

00:02:26.759 --> 00:02:33.724
My view is very simple, right? If you cannot create the model, you have no way of enforcing any level of ownership or control.

00:02:33.724 --> 00:02:50.508
So if I'm taking a model that's been trained externally by some centralized group, maybe it's Meta, maybe it's some other people who have done this sort of run where they've accumulated the compute themselves, and then you put the model into a protocol, you don't have any ability to control the behavior of that model.

00:02:50.508 --> 00:02:56.545
Right, it's the people that made it that have the ability to control the behavior and also have the ability to sort of enforce ownership.

00:02:56.545 --> 00:02:59.868
And so, for me, decentralized training is all about this:

00:02:59.868 --> 00:03:13.508
If you want to have this world where you have some kind of a thriving open source, if you want to have this world where these things are actually democratically governed, you have to have some way of creating these things in a decentralized manner.

00:03:13.508 --> 00:03:15.716
It's not sufficient to create them centralized and then put them into a protocol after.

00:03:16.580 --> 00:03:25.552
I totally agree with you and I want to break this down for people a little bit who might not be as familiar with, like, what goes into creating AI technology.

00:03:25.552 --> 00:03:27.984
There are several components here.

00:03:27.984 --> 00:03:30.849
Right, there's data, there's a model.

00:03:30.849 --> 00:03:41.622
The model is like a piece of software that you can train with data and teach it to do things, and then I like to think of it like a brain.

00:03:41.622 --> 00:03:46.272
It becomes a brain like a toddler, and then it's a teenager and it's got more data.

00:03:46.272 --> 00:03:52.692
The more data, the older it gets, the smarter it gets, and that brain goes somewhere and it does what's called inference.

00:03:52.692 --> 00:03:59.092
And inference is you ask ChatGPT a question and it infers the answer based off of the data it's been trained on.

00:03:59.681 --> 00:04:00.844
Am I missing anything here?

00:04:00.844 --> 00:04:02.530
No, that's right.

00:04:11.442 --> 00:04:14.169
So that's like basically the full life cycle and all of those things are being decentralized.

00:04:14.169 --> 00:04:16.958
But the big thing, and I've heard this before from people, is the training piece.

00:04:16.958 --> 00:04:20.648
Training a model in a decentralized manner isn't possible.

00:04:20.648 --> 00:04:30.172
Getting decentralized data is possible and doing decentralized inference is possible, but there's been this like notion that the training piece is not possible.

00:04:30.172 --> 00:04:31.461
What about that is hard?

00:04:32.244 --> 00:04:35.492
Why do people say to you no, this is never going to work.

00:04:36.399 --> 00:04:39.507
I mean there's a few sort of common reasons, right.

00:04:39.507 --> 00:04:54.074
The most common one is this problem with the low bandwidth interconnects, right. If you tell any serious AI person, hey, we're going to do decentralized training, their first reaction will always be, okay, I understand why that would be very nice.

00:04:54.074 --> 00:05:04.889
You're never going to be able to do it because of the low bandwidth latency problem, right. And so what they're talking about there is, like, when you train a ChatGPT, you know, you do it in these big companies.

00:05:04.889 --> 00:05:08.596
It's on these massive data center racks.

00:05:08.596 --> 00:05:13.285
They have very, very fast interconnects, right, the data can move between the servers very, very fast.

00:05:13.285 --> 00:05:17.463
And because the models are so big, right, they're split across these accelerators.

00:05:18.326 --> 00:05:23.725
And the way distributed training works today is that you know you have to sort of pass this data very, very frequently.

00:05:23.725 --> 00:05:28.353
You need to pass very large amounts of data both in the forward pass and the backward pass.

00:05:28.353 --> 00:05:45.560
And if you were to just take distributed training methods as they work today, and you were to switch out that substrate and you were to say, okay, we're not using a data center with NVLink and InfiniBand, we're going to do it on a sort of consumer grade swarm where everything's connected by the internet.

00:05:45.560 --> 00:05:48.269
It's just, you know, you have this huge blow up right.

00:05:48.269 --> 00:05:48.730
You're stalling.

00:05:48.730 --> 00:05:50.963
Everything has to wait for sort of the data to come.

00:05:50.963 --> 00:05:55.862
It's way, way slower, and so that's sort of the typical knee-jerk response.

00:05:55.903 --> 00:06:03.170
Is that that makes this stuff infeasible. We should get one of those, like, you know, those whiteboard hand-animated videos where they draw a diagram.
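
To make that blow-up concrete, here is a rough back-of-envelope sketch; the payload size and bandwidth figures are illustrative assumptions on our part, not numbers from the episode:

```python
# Back-of-envelope: shipping one full set of fp16 gradients for a
# 1B-parameter model, inside a datacenter vs. over a home connection.
# All figures below are rough illustrative assumptions.
params = 1_000_000_000
bytes_per_param = 2                           # fp16
payload_gb = params * bytes_per_param / 1e9   # ~2 GB per exchange

nvlink_gb_per_s = 450                         # rough intra-rack interconnect
home_gb_per_s = 0.0125                        # ~100 Mbit/s consumer uplink

print(payload_gb / nvlink_gb_per_s)           # ~0.004 seconds in the datacenter
print(payload_gb / home_gb_per_s)             # ~160 seconds over the internet
```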

00:06:03.170 --> 00:06:11.113
We should do that for, like, version A is centralized training. And you said distributed, because it's the centralization piece.

00:06:11.113 --> 00:06:22.189
It's Amazon or OpenAI using a data set to train a model, and they're doing it in this data center or a couple of different data centers.

00:06:22.189 --> 00:06:27.829
But typically, like, these servers need to be very close to each other because of the amount of data that has to get transferred between them.

00:06:27.829 --> 00:06:32.050
There's so much data here that you need like a massive, giant computer.

00:06:32.050 --> 00:06:36.226
You said the word swarm and a swarm is like a huge computer.

00:06:36.226 --> 00:06:46.610
It's like leveraging a thousand servers to work together to use this data, and the reason that works…

00:06:46.812 --> 00:07:06.574
I'm just re-explaining it again so people can really grasp it, and this is kind of how I learned the way it works. There's this thing called bandwidth, which, literally, I like to think of it as the size of the pipe that the data goes through; the more bandwidth, the bigger the pipe is, the faster the connection, and the more data that can go through that pipe and get trained on all these servers.

00:07:06.574 --> 00:07:18.492
Well, for my home computer, or even if I have like a beefy server at my home, the pipes are pretty small, like I'm sure everybody here has been like why is my Zoom call so shitty right now?

00:07:18.492 --> 00:07:20.466
And it's because of bandwidth.

00:07:20.466 --> 00:07:25.271
There's just not enough bandwidth, and consumer grade bandwidth is typically shared.

00:07:25.271 --> 00:07:33.466
Actually, it's like Cox Internet throttles your nightly Netflix viewing because everyone else is trying to watch Netflix at the same time.

00:07:33.466 --> 00:07:35.072
There just aren't big enough pipes.

00:07:35.072 --> 00:07:47.235
And so the point here is that it's kind of hard to imagine how big the clusters of computers are and how much data is flowing through them.

00:07:47.235 --> 00:07:49.144
To train something like GPT.

00:07:49.144 --> 00:07:53.492
I mean you can't overstate how massive an operation it is.

00:07:54.000 --> 00:08:01.326
One of my key passions in crypto has always been the idea of like how do you bootstrap the physical infrastructure of a network?

00:08:01.326 --> 00:08:05.175
Because all these blockchains are literally bare metal.

00:08:05.175 --> 00:08:12.279
It's metal in a data center or a computer at a house. But it's Meltem Demirors, who's a really well-known investor.

00:08:12.279 --> 00:08:17.261
She kind of coined this phrase, like, where bits meet atoms. It really is where the bits meet the atoms.

00:08:17.261 --> 00:08:18.586
It's physics.

00:08:18.586 --> 00:08:21.456
That piece of it is just really, really interesting to me.

00:08:21.456 --> 00:08:33.427
And so what you're trying to do is replicate what's possible in these data centers that are like thousands of square feet, thousands of machines, thousands of, you know, bare metal servers doing the work.

00:08:34.049 --> 00:08:35.613
It's not trivial, right?

00:08:35.613 --> 00:08:36.841
It's not a trivial difference.

00:08:36.841 --> 00:08:46.029
The difference between that and consumer hardware, like a thousand machines spread over a thousand homes, doing this kind of work, is non-trivial.

00:08:46.029 --> 00:08:47.841
So how do you get around these problems?

00:08:47.841 --> 00:08:50.107
How do you solve this bandwidth issue?

00:08:50.107 --> 00:08:54.167
And actually, because the GPUs are there, the compute power is there.

00:08:54.167 --> 00:09:00.267
If you can cluster like a thousand or a hundred thousand people's computers, you can get the compute.

00:09:00.267 --> 00:09:03.263
But it's really the bandwidth, right, like the compute's not actually the issue.

00:09:03.943 --> 00:09:14.890
Yeah, yeah, and maybe just to drive that point home, right, because you're exactly right: the distributed training, which is what I call the centralized training, right, it's so enormous, like, people have not grasped how enormous it is.

00:09:14.890 --> 00:09:24.590
Right, I read in the Llama technical paper, right, they're like, you know, the ambient temperature changes and our, like, power draw on the grid moves around by tens of megawatts, which is, like, you know.

00:09:24.590 --> 00:09:27.748
Every now and then you read a sentence like that and it just puts it in perspective, right of like.

00:09:27.748 --> 00:09:29.484
The scale of this is insanity, right?

00:09:29.865 --> 00:09:39.601
I keep trying to put visuals behind it so people can understand, because most people listening to this show don't see the infrastructure that actually makes this show even happen.

00:09:39.601 --> 00:09:51.129
Like me streaming this and like recording the show, I'm seeing a picture in my mind of like a huge, like what's a 10 megawatt producing solar farm right, how many acres is that?

00:09:51.510 --> 00:09:52.754
Yeah, I mean it's larger.

00:09:52.754 --> 00:09:57.822
I mean a good way to frame it is like you know, an average industrial city is probably like a gigawatt, right.

00:09:57.822 --> 00:10:01.956
So we're starting to talk about like just enormous scale, right.

00:10:01.956 --> 00:10:13.971
One of the most exciting things to me that's happening is like we're moving back into this era of heavy industry, right, like we have these massive energy draws to create something that's like really tangibly useful, you know, and there's like huge amounts of physical build-out.

00:10:13.971 --> 00:10:17.909
But I'll just say so, the distributed is enormous right.

00:10:17.909 --> 00:10:21.684
Say, you just take all of Facebook's GPU purchases for this year, right.

00:10:22.046 --> 00:10:29.947
Whatever they announced publicly, I think it was 350,000 H100s they're going to buy, right, that is insane.

00:10:29.947 --> 00:10:31.264
It's enormous right.

00:10:31.264 --> 00:10:38.070
An H100 draws like 700 watts at peak, right, so, very hand-wavy.

00:10:38.070 --> 00:10:39.546
It's like all those things running at peak.

00:10:39.546 --> 00:10:50.732
It's probably about 200 megawatts, right, somewhere around there. Which is, like I said, right, your whole city is probably about a gigawatt, but, like, Bitcoin mining at peak was 20 gigawatts, right.

00:10:50.732 --> 00:11:02.533
So you have this enormous thing, which is gargantuan, but the scale of compute that's been assembled under these protocols has already been two orders of magnitude bigger, right.

00:11:02.533 --> 00:11:05.681
So that was one of the big things that sort of got me interested in this.

00:11:05.681 --> 00:11:16.264
When I did the very basic math on that and I sort of saw the relative scales of these two things, you know, quite frankly, I couldn't believe it, right, like, that's just enormous, right.
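
That very basic math can be reproduced from the rough figures quoted above; these are the speaker's own hand-wavy numbers, not official ones:

```python
# Rough scale comparison using the figures quoted in the conversation.
h100s = 350_000                    # Facebook's reported H100 buys for the year
watts_per_gpu = 700                # H100 peak draw, watts
ai_megawatts = h100s * watts_per_gpu / 1e6
print(ai_megawatts)                # ~245 MW, i.e. "probably about 200 megawatts"

bitcoin_gigawatts = 20             # Bitcoin mining at its peak
print(bitcoin_gigawatts * 1000 / ai_megawatts)  # ~80x: roughly two orders of magnitude
```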

00:11:16.485 --> 00:11:22.583
So I mean the other opportunity, if you do the math on it, is Ethereum, right, because they just moved from proof of work to proof of stake.

00:11:22.583 --> 00:11:26.071
But the proof of work was all GPU based right.

00:11:26.071 --> 00:11:37.160
And so there's all these huge companies that started out mining Ethereum that are now repurposing their GPUs for AI for that same reason.

00:11:37.160 --> 00:11:43.634
But the economic incentives of Ethereum are what created that physical infrastructure.

00:11:44.460 --> 00:11:46.288
Yeah, it's very interesting, right?

00:11:46.288 --> 00:11:51.025
It's interesting that we ended up in this spot where you know, if you had to start from scratch, there's no way you can do it right.

00:11:51.025 --> 00:11:54.586
These things sort of led up to this scenario where it's vaguely feasible.

00:11:55.061 --> 00:12:03.465
I know you have like a blog post about this, but can you just walk us through how this actually works, like your solution to this problem?

00:12:03.465 --> 00:12:05.208
After that I want to get into the impact.

00:12:05.208 --> 00:12:07.072
We kind of led up to that a little bit.

00:12:07.072 --> 00:12:14.393
I think you can kind of think through some logical impacts of if you're able to do it decentralized versus this centralized heavy build-out model.

00:12:14.393 --> 00:12:18.690
It can probably reduce the need to like do such a massive build-out in some ways.

00:12:18.690 --> 00:12:20.788
But can you just talk about the solution set?

00:12:20.788 --> 00:12:21.884
How does this work?

00:12:27.940 --> 00:12:29.687
Maybe it's important to sort of just contextualize it a bit, right?

00:12:29.687 --> 00:12:30.312
The lead up to this.

00:12:30.312 --> 00:12:34.868
So this stuff is all so raw still, right, like you only had large scale runs happening in 2021 with GPT-3, GPT-4.

00:12:34.868 --> 00:12:39.971
A lot of the recent distributed training algorithms really are like a couple of years old, right?

00:12:39.971 --> 00:12:49.332
So the point I'm trying to make is that these things are sort of like there's a recipe that people have stumbled on.

00:12:49.332 --> 00:12:55.013
They're optimizing it for the hardware that they have available and it's all raw.

00:12:55.445 --> 00:13:06.748
This is not a stable, well-established field in how these things work today, and there's this element of the algorithms being tailored for the hardware that was available.

00:13:06.748 --> 00:13:09.216
In a big company, you have these GPUs, these server racks.

00:13:09.216 --> 00:13:11.131
You have them for historical reasons.

00:13:11.131 --> 00:13:15.592
Originally, you know, you would train these things with something called distributed data parallel, right?

00:13:15.592 --> 00:13:19.298
You needed these very high speed node-to-node interconnects in that setting.

00:13:19.298 --> 00:13:24.230
Then the models got very big, and the thinking became:

00:13:24.230 --> 00:13:25.476
Okay, I have these very fast node-to-node interconnects.

00:13:25.476 --> 00:13:27.184
I can use them to train in an even more efficient way, right?

00:13:27.365 --> 00:13:34.519
But there's never been this sort of explicit focus on well, what if I just optimize for communication efficiency?

00:13:34.519 --> 00:13:40.977
What if I'm not just trying to push the actual performance profile of the model as far as I can get it with the hardware I have?

00:13:40.977 --> 00:13:46.436
But what if I really motivate it by getting this thing to work in this swarm setting, and I just optimize for that?

00:13:46.436 --> 00:13:47.118
How can I do that?

00:13:47.118 --> 00:13:52.914
That's not something that's happened in the literature and we're just starting to see that occur now.

00:13:52.914 --> 00:14:02.975
It's the beginning of the research there, because people have become aware that this is a very sort of if you can get this to work, it sort of potentially avoids a lot of problems, right?

00:14:02.975 --> 00:14:04.230
So that's the whole setting.

00:14:04.426 --> 00:14:12.051
It's not as if I'm saying that I've figured out some key technical breakthrough in my house by myself.

00:14:12.051 --> 00:14:13.827
You know, just thinking about it.

00:14:13.827 --> 00:14:18.125
Right, the frame of the situation is that there's this obvious research direction.

00:14:18.125 --> 00:14:25.772
The initial results in this research direction strongly point towards feasibility, in my view, rather than infeasibility.

00:14:25.772 --> 00:14:35.077
There's like early serious research here already demonstrating, okay, we can do like a 1 billion parameter run over these low bandwidth interconnects, which we can talk about if you want.

00:14:35.077 --> 00:14:41.557
And so the vector of progress here to me is like why would we not try and do this right?

00:14:41.557 --> 00:14:46.745
Why aren't there more people actually just focusing on this exact problem? And that's sort of what I'm trying to do with Pluralis.

00:14:47.567 --> 00:14:50.552
So can you tell me more about like how would it work in theory?

00:14:50.552 --> 00:14:52.796
Right, like it sounds like some of this is still theoretical.

00:14:52.796 --> 00:15:06.498
So just to recap, to make sure I understood: today, people are optimizing their models to run on hardware and be as efficient as possible therein, with the rate limiter being how much data you can pump into this model.

00:15:06.498 --> 00:15:07.945
That's the optimization.

00:15:07.945 --> 00:15:11.115
It's like amount of data I can send here.

00:15:11.115 --> 00:15:21.788
The thing they're not doing is optimizing for the communication between those servers in a swarm, because if you're in, say, FANG, you have the very efficient cluster.

00:15:21.827 --> 00:15:24.153
You don't need to optimize for that, right. What's FANG?

00:15:24.153 --> 00:15:31.148
Uh, like, you know, big tech, you know, Amazon, or if you're one of these industrial research labs. Oh yeah, you're one of these big, yeah, yeah.

00:15:31.187 --> 00:15:33.634
You don't need to optimize for it, because you just have so much money.

00:15:34.196 --> 00:15:34.937
You have the servers.

00:15:35.904 --> 00:15:37.470
Yeah, you can just buy all the servers.

00:15:37.791 --> 00:15:40.107
Yeah, there's no recipe, there's no risk; you just go, okay.

00:15:40.107 --> 00:15:41.493
We know we can speed up these interconnects.

00:15:41.493 --> 00:15:43.309
We know it makes the algorithm slightly faster.

00:15:43.309 --> 00:15:44.735
We don't need to do…

00:15:58.325 --> 00:16:01.832
Yeah, I mean you talked about it in the beginning, but the real problem here is that AI is the most powerful technology that humanity will ever see, probably.

00:16:01.832 --> 00:16:02.313
I think that's fair.

00:16:02.313 --> 00:16:03.495
It's one of the most.

00:16:03.495 --> 00:16:18.570
Right now, a handful of companies might be five have and are accumulating all of the compute power necessary to power AI, and it's a huge moat, it's a big barrier to entry.

00:16:18.664 --> 00:16:22.676
The real thing that they own, by the way, the model is just one piece.

00:16:22.676 --> 00:16:24.793
You articulated this in the beginning as well.

00:16:24.793 --> 00:16:29.332
Like the model is just one piece, it's the data combined with the model that makes that brain.

00:16:29.332 --> 00:16:35.610
That is the AI that you can own, and that brain lives on infrastructure and it's trained on infrastructure.

00:16:35.610 --> 00:16:42.072
And so really, right now, today, that's the only way to get that, and that's immense power.

00:16:42.072 --> 00:16:43.616
If you own that, it's immense power.

00:16:44.245 --> 00:16:48.075
People think about it as like IP, where you have like a closed source model.

00:16:48.075 --> 00:16:52.416
You don't need a closed source model, like to own it, you need the hardware and the data.

00:16:52.416 --> 00:16:56.296
If you have those two things and you have the skills to train it, then you own the thing.

00:16:56.296 --> 00:16:59.171
And I think this is true for enterprise especially.

00:16:59.171 --> 00:17:02.750
Some of them realize it, but I think a lot of people probably don't.

00:17:02.750 --> 00:17:09.391
When they're like giving their data to train an AI for use in their company, they're giving away ownership of that AI.

00:17:09.391 --> 00:17:10.272
They don't own it.

00:17:10.814 --> 00:17:30.176
So the solution set that you're talking about, enabling this decentralized training paired with ownership of data, means that you can own, or the community can own, the AI instead of this company, and it lowers the barrier to access to creating what is essentially almost like a natural resource, if you think about it.

00:17:30.176 --> 00:17:34.271
I guess I say all that to point out to everybody how serious a problem it is.

00:17:34.271 --> 00:17:36.696
To me it seems existential.

00:17:36.696 --> 00:17:39.989
There's existential risk, not having another option.

00:17:39.989 --> 00:17:43.936
I think the equivalent of nuclear weapons is a pretty good analogy.

00:17:43.936 --> 00:17:52.249
Maybe it's not that apocalyptic, but the amount of power an individual company can have, it's almost like owning a nuke.

00:17:52.469 --> 00:18:11.813
I think you articulated that really well, right, this was one of the main motivators of me trying to do this is I don't think this awareness was there sort of towards the end of last year that these things would obviously trend towards oligopoly, that there was sort of natural dynamics of you know you have these huge capital costs for training that sort of tended these things towards oligopoly.

00:18:11.813 --> 00:18:17.486
I think it's much clearer now that that's sort of what's going to happen and I think it's also very clear why.

00:18:17.486 --> 00:18:26.775
You know, I think the right analogy is sort of like a commodity, except it's a commodity which injects its sort of cultural bias into everything it touches.

00:18:26.775 --> 00:18:30.009
It's like, imagine oil that injected politics into every machine it went into.

00:18:30.009 --> 00:18:43.086
Right, that's sort of what we're talking about, and we have this natural dynamic here that pushes these things toward Standard Oil 2.0, right; there's really going to be one or two giant corporations that are going to be the base model providers.

00:18:43.086 --> 00:18:47.846
They're going to capture most of the value and they're going to decide how these things behave.

00:18:47.925 --> 00:18:50.990
Right, and as of today, I don't think there's an alternative.

00:18:50.990 --> 00:19:00.936
I don't think traditional open source, the way it's set up today, you know people are talking about oh well, it's okay, we're going to have this thriving open source ecosystem that's going to challenge this oligopoly.

00:19:00.936 --> 00:19:05.645
I don't see that how that happens when these runs end up costing hundreds of millions to billions of dollars.

00:19:05.645 --> 00:19:24.951
I think you have this key difference where traditional open source is people donating their time, and people are acting as if people are going to donate money, right, which is what you need to get these training runs to work, and, in my view, that's not going to happen. No, you really need token economics to get to that, to get to the scale, right.

00:19:24.971 --> 00:19:29.367
You need tokens and technology that powers decentralized training.

00:19:29.367 --> 00:19:36.201
It occurs to me when you're saying this too, that, like on this show, we've had two of the dark horses of AI come on.

00:19:36.201 --> 00:19:39.489
This is one of them the idea that decentralized training is possible.

00:19:39.489 --> 00:19:43.076
The other one was Peter Voss, who came on.

00:19:43.076 --> 00:19:47.113
He's the founder of, I think it's called Aigo.ai.

00:19:47.113 --> 00:19:54.327
I'd have to go look at my notes from that one, but he and two other guys were the coauthors of a book about AGI.

00:19:54.327 --> 00:20:03.093
They like coined the term AGI back in like the early 2000s, and what he works on is called cognitive AI versus like LLM based models.

00:20:03.093 --> 00:20:05.768
You have to train them, then put them into a server for inference.

00:20:05.768 --> 00:20:10.025
This one like learns on the fly and he's got a company with like 30 employees.

00:20:10.025 --> 00:20:15.397
They have customers, they use this AI at enterprise and it works because it can train on the fly.

00:20:15.397 --> 00:20:19.392
It requires way less compute and I think we've like landed on the two.

00:20:19.392 --> 00:20:26.877
His is closed source, though I think he would be open or could be convinced to open, sourcing it, which would be huge under the right circumstances.

00:20:26.877 --> 00:20:34.519
I think these two ideas are very important to preventing the thing, the outcome that we're both seeing, already happen around us.

00:20:35.286 --> 00:20:46.836
Anyway, this is more of like just an observation for me, like some things that I'm seeing, like there's these like dark horses that nobody's talking about, and a large part of why both of these things haven't been talked about is because of the consensus.

00:20:46.836 --> 00:20:48.715
It's like the hammer and the nail thing.

00:20:48.715 --> 00:20:52.516
Right, they have a hammer that they can hit a nail with and they're all taught to use that same hammer.

00:20:52.516 --> 00:20:54.551
That's where all the money ends up flowing.

00:20:54.551 --> 00:20:56.613
I actually don't think it's because no one thinks these things.

00:20:56.613 --> 00:21:01.396
I mean, they do think that they might not work or that they won't work, but it's also just as much.

00:21:01.826 --> 00:21:04.736
The analogy I give to people is with GIS software.

00:21:04.736 --> 00:21:25.150
Esri is the main geospatial information system software provider and they did a great job at all the universities getting all the GIS professors licenses to this, and so they train all their students, and so this is just the hammer that, like everybody has to use and I think, for whatever reason, like LLM based training kind of became the hammer as well.

00:21:25.150 --> 00:21:26.555
Same thing with, like, centralized training.

00:21:26.555 --> 00:21:33.336
The centralized training piece is there because obviously it's just, like, harder to do decentralized.

00:21:33.336 --> 00:21:39.368
But before I keep rambling, can you break down actually, though for me a little bit like how this is possible?

00:21:39.368 --> 00:21:41.213
Like maybe it's in the blog article?

00:21:41.213 --> 00:21:42.636
Can you explain to us what's in that?

00:21:43.204 --> 00:21:43.625
Yeah, I mean.

00:21:43.625 --> 00:21:54.666
So I mentioned you have this low bandwidth problem, but you have even bigger problems, right, like if you have heterogeneous node capacity on the memory as well as the compute.

00:21:54.666 --> 00:21:56.574
You don't have that in the centralized case. Then if you have nodes entering and dropping.

00:21:56.574 --> 00:22:04.696
You know, you actually do have that happen in the centralized case too, just because when you do it at that scale, things break.

00:22:04.717 --> 00:22:18.094
Nets fail, yeah, but you have it at a much larger scale in this sort of swarm setting. And you also have this problem that you don't in a centralized case: you know that the gradient you're getting is right.

00:22:18.114 --> 00:22:19.459
If you're doing this with individuals, what's the gradient, like?

00:22:19.459 --> 00:22:19.779
Gradient is?

00:22:19.779 --> 00:22:23.855
You can think about it as, like, one fraction of that piece of information from the data you're training on.

00:22:23.855 --> 00:22:33.528
With that, you're sort of making the model a little step better, and so you're applying these gradients repeatedly and, like you said at the start, you converge on a useful model, right?

00:22:33.528 --> 00:22:35.654
Gradient is like the information that makes the thing slightly better.
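
For readers who want that definition made concrete, here is a minimal toy sketch (our own illustration, not Pluralis code) of applying gradients repeatedly until the model converges:

```python
# Fit y = 2x by repeatedly applying gradients: each gradient is the
# "information that makes the thing slightly better".
def sgd_step(w, x, y, lr=0.01):
    pred = w * x                  # forward pass
    grad = 2 * (pred - y) * x     # gradient of squared error w.r.t. w
    return w - lr * grad          # one small step better

w = 0.0
for _ in range(1000):
    w = sgd_step(w, x=1.0, y=2.0)
print(w)  # converges toward 2.0
```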

00:22:35.654 --> 00:22:38.088
Yeah, so you have a lot of problems, right.

00:22:38.088 --> 00:22:39.292
It's sort of what I'm saying.

00:22:39.773 --> 00:22:45.875
The core of it is this low bandwidth, though, right, and there's a range of sort of solutions here.

00:22:45.875 --> 00:22:59.808
The first sort of set of ways is okay, let's take the distributed approaches we have today, right, and let's focus on the sort of communication primitives and what's actually being moved around and let's just max out the compression, right.

00:22:59.808 --> 00:23:03.727
Let's try and think about ways we can maybe make this slightly more efficient.

00:23:03.727 --> 00:23:13.594
But the sort of general idea is like, let's adapt the distributed training approaches today and just focus them with this explicit objective of making them communication efficient, right.
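
As one concrete illustration of maxing out the compression, here is a sketch of top-k gradient sparsification, a standard trick from the communication-efficiency literature; this is our example, not necessarily the method Pluralis uses:

```python
import numpy as np

def compress_topk(grad, k):
    """Keep only the k largest-magnitude entries of the gradient."""
    idx = np.argsort(np.abs(grad))[-k:]   # indices of the k biggest entries
    return idx, grad[idx]                 # transmit k values, not all of them

def decompress_topk(idx, vals, size):
    full = np.zeros(size)
    full[idx] = vals                      # everything else is treated as zero
    return full

grad = np.random.randn(1_000_000)         # gradient of a 1M-parameter model
idx, vals = compress_topk(grad, k=10_000) # ship ~1% of the values
approx = decompress_topk(idx, vals, grad.size)
```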

00:23:13.594 --> 00:23:20.632
My view is you can get a reasonably long way with this kind of thing and, like I said, there's actually a work on swarm parallelism that's demonstrated this.

00:23:21.434 --> 00:23:25.755
You can do this in a setting called pipeline parallel, up to about a billion parameter model.

00:23:25.755 --> 00:23:28.666
Right, and it wasn't the limit, right.

00:23:28.666 --> 00:23:30.107
You know, I think you can probably push a bit farther.

00:23:30.107 --> 00:23:31.469
It's not clear, right?

00:23:31.469 --> 00:23:38.193
But that approach is not going to scale to 100 billion, 1 trillion or 10 trillion parameter models, right?
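
A minimal sketch of the pipeline-parallel layout being described, where the model is split into stages hosted on different nodes and activations hop between them over the network; the stage count and sizes are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Four stages of a toy model, imagined as living on four different nodes.
stages = [rng.standard_normal((64, 64)) * 0.1 for _ in range(4)]

def forward(x, stages):
    for w in stages:
        # In a swarm, this hop is where activations cross the internet
        # instead of NVLink, which is why communication dominates.
        x = np.tanh(x @ w)
    return x

out = forward(rng.standard_normal((1, 64)), stages)
print(out.shape)  # (1, 64)
```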

00:23:38.993 --> 00:23:43.537
There's a whole other direction, which is you have these things called asynchronous updates.

00:23:43.537 --> 00:23:52.865
So if I step back for a moment, maybe just to give a bit more context, like when you train the model, there's this information flow through the network.

00:23:52.865 --> 00:23:54.729
It goes on a forward pass.

00:23:54.729 --> 00:23:55.891
You calculate a loss.

00:23:55.891 --> 00:23:59.336
You then pass the information back, you update all of your little parameters.

00:23:59.336 --> 00:24:15.773
Parameters are sort of the actual thing that makes up the model; you update them with those gradients that you've calculated from your loss, right. And you have to do this all in lockstep, and you have to make sure that, like, the gradient you calculated from your loss is being applied to each little part of your model in the correct way.

00:24:15.773 --> 00:24:22.266
And then you know you do that process, the information flows back, and then you do it again and you just repeat this over and over again.

00:24:23.167 --> 00:24:27.657
But there's this whole line of work which is actually quite old, which is well.

00:24:27.657 --> 00:24:32.396
What if I just start continuously pushing data through and I don't wait for the gradient to come back?

00:24:32.396 --> 00:24:53.788
I just push through, I activate, and whenever I get my gradient back, I just update my parameter, right, and it introduces a big problem of like, suddenly parts of your model might be out of sync with other parts, right, so it introduces that problem, but it solves another problem which means you don't have to wait, which is one of the reasons why this sort of communication efficiency is such an issue.

00:24:53.788 --> 00:25:02.990
So there's sort of very, very early work there on people applying this async training method to decentralized training.
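
A toy contrast between the lockstep update he describes and the asynchronous one, where gradients arrive a few steps stale but nothing waits; this is purely illustrative, not any specific paper's algorithm:

```python
from collections import deque

def grad(w, x=1.0, y=2.0):
    return 2 * (w * x - y) * x    # gradient for fitting y = 2x

# Synchronous: every step waits for the gradient of the current parameters.
w_sync = 0.0
for _ in range(100):
    w_sync -= 0.05 * grad(w_sync)

# Asynchronous: keep pushing data through; apply each gradient when it
# arrives, even though it was computed on slightly older parameters.
w_async, in_flight = 0.0, deque()
for _ in range(100):
    in_flight.append(grad(w_async))      # computed now, applied later
    if len(in_flight) > 3:               # gradients land 3 steps stale
        w_async -= 0.05 * in_flight.popleft()

print(w_sync, w_async)  # both approach 2.0; async tolerates the delay
```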

00:25:03.575 --> 00:25:22.275
I think that has a lot of legs, right. It's an intermediate approach between, okay, the low risk, where we just adapt the standard distributed training, and the sort of third class of approaches, which is where you start to look at alternative architectures, sort of like you mentioned; I think someone was doing something different.

00:25:22.275 --> 00:25:27.502
There's sort of a few guys now that are like well, we don't think there's anything that special about transformers.

00:25:27.502 --> 00:25:32.763
We think there's, you know, there's something fundamental there, but it can probably be replicated in another type of architecture.

00:25:32.763 --> 00:25:36.766
If I rank those things in a sort of range of, like, risk to payoff:

00:25:36.766 --> 00:25:39.104
I think the new architectures are very speculative right.

00:25:39.104 --> 00:25:52.528
It's going to be a lot of work to get something like that to work and match GPT level performance, but I think it's absolutely worth trying and in the meantime I think there's sort of these two intermediate ways which are much lower risk and can also get quite a long way.

00:25:53.375 --> 00:25:56.704
That's a good segue into my next question, which is like, how do you go to market here?

00:25:56.704 --> 00:26:00.442
You know crypto moves really fast, the market moves really quickly.

00:26:00.442 --> 00:26:01.704
The need is great.

00:26:01.704 --> 00:26:04.898
My hypothesis is the customers feel the pain right.

00:26:04.919 --> 00:26:11.450
Like there's people out there who wish they could train their own models at the scale that like Facebook does, but they can't.

00:26:11.450 --> 00:26:13.259
They don't have access to those resources.

00:26:13.259 --> 00:26:15.203
The open source models exist.

00:26:15.203 --> 00:26:16.717
That's the one plus about Facebook.

00:26:16.717 --> 00:26:19.403
They do make a lot of open source models that are good.

00:26:19.403 --> 00:26:24.909
The infrastructure to train them on does not exist at scale, or really it does exist at scale.

00:26:24.909 --> 00:26:29.540
It's the access and the efficiency to do it, the technology to actually make that feasible.

00:26:30.082 --> 00:26:51.482
So, one thing I've found in crypto, and I think in most new technology startups, is that if you're doing something really hard, there's this balance between, like, hardcore R&D, solving these technical problems, and practical business strategy and execution, like shipping something and getting it to market.

00:26:51.482 --> 00:27:00.604
I've seen teams raise like tens of millions of dollars and end up shipping nothing because they sit in a room trying to solve all these problems all the time.

00:27:00.604 --> 00:27:08.842
Problems that are, you know, solvable over time, but, like, you have to build something that you can then fund indefinitely off of revenue or whatever.

00:27:08.842 --> 00:27:10.921
So how do you think about that?

00:27:10.921 --> 00:27:11.836
Like what's your plan?

00:27:11.836 --> 00:27:22.528
I know it's super early, you're in like this research phase kind of still, it's pre-seed, with this idea, but like, how are you planning on actually taking something initial to market?

00:27:22.528 --> 00:27:23.598
Like what's the MVP?

00:27:23.598 --> 00:27:25.064
I guess it's like another way of putting the question.

00:27:25.734 --> 00:27:25.875
Yeah.

00:27:25.875 --> 00:27:30.622
So, like I said, right, you can sort of train these things decentralized today at small scales, right.

00:27:30.622 --> 00:27:39.182
So you don't have a GPT-4 capable model, but you can train a sort of 1 billion parameter thing today in this way, right, which, by the way, is like not a well-known thing.

00:27:39.182 --> 00:27:41.662
When I tell my AI friends this, they go what?

00:27:41.662 --> 00:27:42.826
Like I can do that?

00:27:42.826 --> 00:27:46.705
And then I show them the paper and they go oh okay, I didn't know that, you know.

00:27:46.705 --> 00:27:53.311
So my view is there's a large amount of sort of narrow value capture that happens at those model scales, right.

00:27:53.311 --> 00:28:05.403
So I think there's a lot of reasons why you'd want sort of a decentralized swarm where you can train small models that maybe aren't as broadly useful as a big GPT-4, but sort of have this…

00:28:05.796 --> 00:28:11.878
They're point specific, right, it's not like a general purpose tool, but you can scope it to something and it'll be useful.

00:28:12.381 --> 00:28:13.526
That's right, that's exactly right.

00:28:13.526 --> 00:28:13.767
Right.

00:28:13.767 --> 00:28:22.961
And the whole point is, if you do this in this decentralized way, even though those runs aren't particularly expensive, you don't have this sort of massive upfront capital cost.

00:28:22.961 --> 00:28:29.626
If you have some kind of an active swarm where you're sort of only returning value from use, you're paying for the thing in fractional ownership, right.

00:28:29.626 --> 00:28:42.122
If someone contributes their compute to that model, they have some small fractional ownership and it means that you get that permissionless dynamic that characterizes open source at the model development layer, right.

00:28:42.122 --> 00:28:46.179
I don't need to accumulate a big capital pool to initialize a model run.

00:28:46.179 --> 00:28:52.138
If I'm a model designer, all I need to do is make a clear case that this model is potentially useful to somebody.

00:28:52.138 --> 00:28:56.938
And there's this direct incentive for a compute provider to actually participate in that run.

00:28:56.938 --> 00:28:59.384
Right, because they're going to get reimbursed from use.

00:28:59.954 --> 00:29:00.576
What about the data?

00:29:00.576 --> 00:29:02.423
Where do you get the data for this situation?

00:29:03.174 --> 00:29:16.964
I think for these small models, I mean, you don't run into this problem of really massive data until you start getting up around GPT-4, right, like today, you can use FineWeb, the sort of various other big curated, centralized data sets.

00:29:16.964 --> 00:29:19.502
You get a very long way with those right.

00:29:19.502 --> 00:29:24.923
You know, if you have sort of proprietary data that you want to inject in the model, I think there's a lot of obvious ways you can do that.

00:29:24.923 --> 00:29:28.761
You can set it up as a RAG-like setup, or you can put it in with fine-tuning.

00:29:28.761 --> 00:29:32.509
To me, that's not the main focus of what I'm trying to do.

00:29:32.509 --> 00:29:45.766
I think there's sort of clear, obvious solutions there, in a way that there's not clear, obvious solutions to getting the decentralized training to work today. But when you start to get up to the big scales, that's when I think the proprietary data becomes much more critical.

00:29:46.994 --> 00:30:18.076
Yep, but that's the vision for you, eventually, is to get that big. [portion of audio not transcribed] 'Cause

00:30:18.076 --> 00:30:21.096
I think he's one of the people that talked to me about decentralized training, that it would, like…

00:30:21.096 --> 00:30:22.779
Did he think it was possible?

00:30:22.779 --> 00:30:24.945
I can't remember what his like first reaction was.

00:30:25.434 --> 00:30:28.965
I don't remember him having a strong negative reaction, but maybe he was just being nice to me.

00:30:31.515 --> 00:30:31.595
Yeah.

00:30:31.595 --> 00:30:42.375
So it sounds like your go-to-market is, like, this paper kind of says, hey, it is possible to train these billion-parameter models, which is useful, and so you're going to take that paper and actually build.

00:30:42.375 --> 00:30:45.884
Try to build that as a protocol. That's exactly right.

00:30:45.924 --> 00:30:55.585
Right, we want to start by implementing just okay, here's a proof of concept that small to medium size models can be trained in this way, because that's still an extremely contrarian position.

00:30:55.585 --> 00:30:57.359
I want to be clear about that.

00:30:57.359 --> 00:30:59.442
Most people do not agree with what I'm saying there.

00:30:59.442 --> 00:31:09.045
So we want to demonstrate that we think there's sort of clear ways we can get narrow utility in that setting and we're going to build from there into the much, much bigger ones.

00:31:09.855 --> 00:31:11.782
What help can people listening give you?

00:31:11.782 --> 00:31:13.842
How can they contribute, if at all?

00:31:13.842 --> 00:31:14.684
What do you need?

00:31:15.755 --> 00:31:23.066
I mean today we're still just focused on that core technical problem, so we're still a little ways from sort of an active product or marketplace or a token here.

00:31:23.066 --> 00:31:37.506
I think the main thing we're really interested in is if there are research scientists listening to this, particularly people in FANG or in these big labs, and they're maybe not entirely on board with the direction that the model governance is going.

00:31:37.506 --> 00:31:41.510
We really want people active in the protocol that are actually doing model design.

00:31:41.510 --> 00:31:43.978
To me that's actually the main constraint.

00:31:43.978 --> 00:31:52.488
Is that expertise to actually develop these models, to understand some of the training dynamics, which, as of today, is still a lot more of an art than a science.

00:31:52.488 --> 00:31:54.630
Come and get in touch.

00:31:54.630 --> 00:31:59.385
We'd love to sort of get you involved in some of this stuff, while it's still internal only, and go from there.

00:32:00.015 --> 00:32:01.298
I know a few people you should talk to.

00:32:01.298 --> 00:32:04.278
We'll make some intros at the end, after we wrap.

00:32:04.278 --> 00:32:05.202
That'd be awesome.

00:32:05.202 --> 00:32:07.577
What have I not asked you that I should have asked?

00:32:08.398 --> 00:32:18.017
I mean, we sort of brushed over it, but I think a big topic is just why this decentralization is critical, why the centralization is so bad.

00:32:18.017 --> 00:32:22.708
Right, because I think there's an emerging understanding of why it's bad, but it's maybe not precise, right?

00:32:22.708 --> 00:32:27.787
And also just the natural fact that this thing does lend itself to centralization, right?

00:32:27.787 --> 00:32:29.742
So I don't know.

00:32:29.742 --> 00:32:38.619
I'll just paint a picture of one of my main concerns, right, which is, let's just say, this stuff ends up becoming very useful as a tutor, right?

00:32:38.619 --> 00:32:57.676
So, which to me it seems like it obviously would: if I want to learn something and I have a system I can interact with, which is, like, personalized and contextual, and it sort of also knows what else I struggle with, and it can really guide me, right. And this ends up being one of the common ways that most people learn most things, right?

00:32:57.676 --> 00:33:02.240
Like I'm talking about, when you're young, you're going through school, you're interacting with these AI tutors very often.

00:33:02.240 --> 00:33:17.186
That to me seems very likely, and if you combine that with this scenario of an oligopoly, right, what ends up happening is you have a very small group of people basically shaping the worldview of entire generations of people.

00:33:17.186 --> 00:33:38.285
Right, and I don't see this risk discussed at all, this particular thing of like you potentially have this world where every book, every internet article, every encyclopedia that you could access, you know, if the LLMs are the base knowledge source, a very small handful of people write all those things.

00:33:38.285 --> 00:33:46.221
It's like they were all written by the same person, and maybe they go through different providers, and maybe there's one, you know, software vendor that's calling one model provider, and another one using the other one.

00:33:46.221 --> 00:34:06.602
But the point is, the base knowledge source sort of becomes totally centralized, and because these things, it doesn't matter how much you try to debias them or do safety or whatever, these things are inherently political; they inherently possess cultural values, because they have to give an answer on subjective questions.

00:34:08.126 --> 00:34:09.588
I think that's really, really dangerous.

00:34:09.588 --> 00:34:11.221
Right, it's Orwellian.

00:34:11.221 --> 00:34:18.416
I think it's far more dangerous than the common safety risk of, oh, maybe someone uses this thing to make a bioweapon, right.

00:34:18.416 --> 00:34:19.501
I think that's a danger.

00:34:19.501 --> 00:34:22.244
I think this concentration of power is much, much worse.

00:34:22.244 --> 00:34:27.286
And the point I'm trying to make is that today we're on the glide path to that.

00:34:27.286 --> 00:34:40.887
If things don't change, if we don't have some kind of a real, genuine, decentralized alternative where these things are actually being created and not just taken and put in the protocol later, the only other path is heavy regulation right.

00:34:41.434 --> 00:34:44.842
We've seen in real time how that doesn't work.

00:34:44.842 --> 00:34:52.242
There's a great example from Web 2.0 from the last 10 years, and that's social media: TikTok.

00:34:52.242 --> 00:34:57.039
TikTok is used by every young American.

00:34:57.039 --> 00:34:59.206
I'm 38, I don't use TikTok.

00:34:59.206 --> 00:35:12.985
But, like, I don't use TikTok because of the privacy reasons, because I'm aware of what's going on there. It is controlled by a foreign government who actively, you know, uses it to manipulate the populace, and this is…

00:35:12.985 --> 00:35:17.123
This is the same thing, even though it's just owned by corporations.

00:35:17.123 --> 00:35:20.300
Maybe they're foreign or not, but it happens, you know.

00:35:20.300 --> 00:35:21.603
You talk about regulation, right?

00:35:21.603 --> 00:35:23.166
You're not going to regulate fast enough.

00:35:23.166 --> 00:35:28.016
Number one the regulations are going to probably be terrible and you're not going to regulate fast enough.

00:35:28.016 --> 00:35:31.445
Before you know it, half the population, or the entire population, I mean.

00:35:31.445 --> 00:35:36.483
Look at ChatGPT: 100 million users in, like, what?

00:35:36.483 --> 00:35:42.260
Three months, something like that, from when they launched. The fastest growing piece of technology ever. And so you're just never going to keep up.

00:35:42.721 --> 00:35:43.503
I think that's exactly right.

00:35:43.503 --> 00:35:51.797
I have major doubts that regulation here can be effective, and I also just want to say like I don't think the individuals in these companies are bad people.

00:35:51.797 --> 00:35:57.757
These people are genuinely very, very smart, very high integrity, are genuinely trying to do the right thing.

00:35:57.757 --> 00:36:06.166
There is just something about corporations which leads to these perverse outcomes, even when every individual inside of them is sort of well-intentioned, right.

00:36:06.166 --> 00:36:22.981
And this is my big concern, that even though I've got friends that are making these models that I know they're trying to do the right thing, I just think there's structure that sort of leads to this emergent behavior which, overall, is very, very negative, and that's my main concern.

00:36:24.405 --> 00:36:27.282
I couldn't agree more and I think it's great the work that you're doing.

00:36:27.282 --> 00:36:35.438
I hope it succeeds, and I think it's very important that we have another option, right, like a different alternative, and right now we don't.

00:36:35.438 --> 00:36:37.521
So with that we'll wrap.

00:36:37.521 --> 00:36:43.690
Thanks for coming on, and if you're listening to this and you want to intro to Alexander, you know my DMs are open.

00:36:43.690 --> 00:36:45.025
Where can they reach you online?

00:36:45.429 --> 00:36:47.518
Yeah, I'm active on Twitter, Alexander J Long.

00:36:47.518 --> 00:36:49.320
That's probably the best place to reach me, though.

00:36:49.701 --> 00:36:51.123
Cool, all right.

00:36:51.123 --> 00:36:52.306
Well, you guys heard it here first.

00:36:52.306 --> 00:36:53.507
Thanks so much for being here.

00:36:56.739 --> 00:37:01.389
Thanks for having me.

00:37:03.474 --> 00:37:05.708
You just listened to the Index Podcast with your host, Alex Kehaya.

00:37:05.708 --> 00:37:12.623
If you enjoyed this episode, please subscribe to the show on Apple, Spotify or your favorite streaming platform.

00:37:12.623 --> 00:37:15.123
New episodes are available every Friday.

00:37:15.123 --> 00:37:16.594
Thanks for tuning in.

00:37:16.594 --> 00:37:18.916
Thank you.