Speaker Details
Brewster Kahle
Founder and CEO
Internet Archive & Wayback Machine
Session Transcription
I am truly pleased to introduce our next speaker. One of the great things about being in the technology industry and living in the Bay Area is that you come across, in your social life, people who are significant innovators and players in the market. That's the way I feel about our next guest, Brewster Kahle. I'm very happy to say Brewster is a longtime friend, a former neighbor, and one of the most innovative people I've met in the industry. So Brewster, welcome to SaaS Metrics Palooza.

Ray, this is great. Thank you very much. I really appreciate the opportunity. I think this may be a little bit of a different talk than you're going to get through the rest of this, but hopefully it's really informative, with some news you can use. So: how is research going to change in the age of artificial intelligence? I started and run a large library called the Internet Archive. It may be the largest nonprofit library in the world. The idea is to make it so that people can, well, be better people, make better decisions, understand what's going on. For that we need lots of information, but we also need lots of tools to make our way through it. So I'm going to give you an idea of what the Internet Archive is and how it works, a little bit of the breadth of it, then how we're using AI, even at this current early stage of the ChatGPTs and the like, to build our systems to be better, and a little bit of how other people are starting to use our collections to understand the bigger world based on the enormous amounts of data at the Internet Archive.

Okay, so what is the Internet Archive? The Internet Archive is a nonprofit library, a 501(c)(3) public charity. This is our headquarters building in San Francisco, where I'm sitting now. Our motto is universal access to all knowledge. Can we make it so that all the works of humankind are available to anybody interested in making use of them? Can we make the digital Library of Alexandria happen? Can we make that dream of a global brain come true? That's what I jumped in to help build on top of the Internet starting in 1980, because it had been promised before, by Vannevar Bush, by Ted Nelson's Xanadu, and eventually by Tim Berners-Lee's World Wide Web. Could we make this come true?

As a way of illustrating what the Internet Archive has done, let me take just one skinny example. You may not have used the Internet Archive, or maybe you've only used the Wayback Machine, but everybody's used Wikipedia. If you take something like the Martin Luther King Jr. page, there are all these footnotes and citations. The question is: can you turn those citations blue? Can you make them all links? So we've collected all of Wikipedia's pages and tried to find, in this case, the books in the citations. Turns out that's not easy, because there are, well, hundreds of millions of books, and there are formatting problems. But okay, suppose you can find the book. Then, if there's a page number in the citation, can you open the book right to the right page? Can you make it so you can go deeper than Wikipedia? The Internet has lots of information on it, but for all of us who know something deeply, what's available publicly is often kind of thin. So can we dive deeper? The idea was to make it so you could link straight into books, right to the cited page.
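As a rough illustration of what such a deep link looks like (this is not the Archive's production code, and the book identifier and citation below are hypothetical), the Archive's public book reader accepts URLs of the form /details/{identifier}/page/{page}:

```python
# Sketch: turn a resolved book citation into a link that opens the
# Internet Archive's book reader at the cited page. The URL pattern is the
# Archive's public BookReader convention; the identifier is hypothetical.

def book_deep_link(identifier: str, page: int) -> str:
    """Build a link that opens an archive.org book at a specific page."""
    return f"https://archive.org/details/{identifier}/page/{page}"

# A Wikipedia citation like "Branch (1988), p. 127", once matched to a
# digitized copy, might resolve to:
print(book_deep_link("partingwatersame0000bran", 127))
# -> https://archive.org/details/partingwatersame0000bran/page/127
```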
So we got funding to digitize books, prioritized digitizing the cited ones, and then linked them right to the right pages. A win for those doing homework projects, but hopefully for everybody else as well.

Another thing we're probably better known for is the Wayback Machine, where we collect web pages. We try to collect a copy of every web page from every website every two months, though it's now grown much more adaptive, as we work with about 1,100 organizations to build the collection. The data in those web collections becomes the Wayback Machine, and we make it available to people again. Here I'll take the example of a Wikipedia page that linked to a House of Representatives page on the Select Committee on the January 6th attack. Once control of the House changed parties, all that information was taken off the website. Linking back to the pages that used to be there is a library-like function: keeping out-of-print materials and different editions, and being able to give context around these things, so we're not just dealing with ever-shifting sands of information. The average life of a web page is 100 days before it's changed or deleted, so how can you depend on anything if it can be shifted out from underneath you? Here's another example, having to do with Canada's Commission of Inquiry: it disappeared off the web, and so the Wayback Machine came to the rescue. We've now fixed 17 million broken links in Wikipedia, and we've inserted about a million links pointing into over 100,000 different books, so you can go deeper. These are all fairly manual processes; it's not really AI, but it is robots going around, corrected by people, to improve Wikipedia and give people access to things that are out of print or in different media.
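For a feel of the mechanics, here is a minimal sketch of the kind of lookup behind fixing a dead link, using the Wayback Machine's public availability API (a real, documented endpoint, though the bots that actually repair Wikipedia links are much more involved than this):

```python
# Query the Wayback Machine availability API for the snapshot of a URL
# closest to a given date: the building block for replacing a dead link
# with a link to an archived copy.
import requests

def closest_snapshot(url: str, timestamp: str = "") -> str | None:
    """Return the archived snapshot URL closest to `timestamp` (YYYYMMDD)."""
    resp = requests.get(
        "https://archive.org/wayback/available",
        params={"url": url, "timestamp": timestamp},
        timeout=10,
    )
    resp.raise_for_status()
    closest = resp.json().get("archived_snapshots", {}).get("closest")
    return closest["url"] if closest and closest.get("available") else None

# e.g., the Select Committee site mentioned above, as it stood in early 2023:
print(closest_snapshot("https://january6th.house.gov/", "20230101"))
```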
So that's one way of looking at the Internet Archive, through this one use case of reinforcing Wikipedia. Wikipedia loves us, which is kind of great, and we love Wikipedia. But it doesn't stop there. Let me give you some big numbers on the scale of the Internet Archive's collections. We have 890,000 software titles, for the Apple II and the Commodore 64 and all those great things, and you can click and run them. It runs them not by firing up an old Atari or Commodore 64, but by using emulators. The head of software at the Archive said: why don't we just cross-compile the emulators' C code into JavaScript, have it run in the browser, and use the Internet Archive as a giant floppy drive? I didn't think that would work, but it did. It took a couple of years and a lot of volunteers to make it all come about, but the software is now living again, at least somewhat, as much as we've been able to pull off. For moving images, we've got over 6 million movies of different forms that people have uploaded to the Archive or digitized from film and video, plus 14 million audio recordings. The background image here is the Grateful Dead, who started a tradition of allowing people to copy their concerts and give away copies, as long as nobody made any money; that was key. They were the first to use what was effectively a Creative Commons non-commercial license, if you will: yes, you can record our concerts, put up your microphone, record it, and pass it around, as long as nobody makes money. It turns out lots of other bands copied this, so there are now 8,000 bands on the Archive that have given permission for their fans to upload their concerts, and we have hundreds of thousands of concerts publicly available that everybody's happy about, which is kind of great.

Then there are television news programs. We started recording television in the year 2000: Russian, Chinese, Japanese, Iraqi, Al Jazeera, BBC, CNN, ABC, Fox, 24 hours a day, at DVD quality, so that people can search and find snippets within U.S. television news and find out what people said. For those of us who grew up, at least in the United States, being drilled to think critically, you need to be able to quote, compare, and contrast. If you can't quote, you can't hold on to something to compare and contrast it with other things, put it into an essay, or think critically about it. Television just flows over you. So the idea was to record television so people could quote it, compare and contrast, and understand what's going on, and it's been used a lot. The biggest use actually hasn't been finding what individual people said, though that happens a lot in fact-checking and the like; it's people doing analysis over the whole corpus to get a bigger view of what's going on. As our friend Jesse Ausubel put it, people got really far by having a microscope; what we need is a macroscope, to step back and get the bigger picture. We all live in our bubbles. We don't think we do; we think everybody else lives in a bubble, but not us, and that's just not true. So how do you get an idea of what's going on, especially in time-based media? You can't watch it all, but your computer can, and it can summarize it. There have been a lot of news reports based on our data about the differences between U.S. news programs and how they report: not so much whether they say the same things about the same things, but that they report on completely different worlds. It's a different world if you watch Fox than if you watch CNN or MSNBC. We can document that, examine it, and step back to get an idea of what those people are saying, and we're starting to do this internationally; I'll talk about that in a moment.
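As a toy sketch of the macroscope idea (the transcript file names and topic list here are made up for illustration; real analyses over the corpus are far more sophisticated), you could compare what different channels cover by counting topic mentions across their transcripts:

```python
# Toy "macroscope": count how often each channel's transcripts mention each
# topic, to compare what different outlets actually report on. File names
# and topics are hypothetical placeholders.
from collections import Counter
from pathlib import Path

TOPICS = ["economy", "immigration", "climate", "ukraine"]

def topic_counts(transcript_path: Path) -> Counter:
    text = transcript_path.read_text(encoding="utf-8").lower()
    return Counter({topic: text.count(topic) for topic in TOPICS})

for channel_file in ["fox.txt", "cnn.txt", "msnbc.txt"]:
    print(channel_file, dict(topic_counts(Path(channel_file))))
```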
For e-books, we've digitized about 7 million books into e-book form. We've also bought a lot of e-books from some publishers, but interestingly, the big publishers aren't selling e-books anymore. They're just licensing them, in such a way that they can take them away at any time or change them, which is distressing if you're a librarian or, well, anybody. This stream-only world may have really started with Netflix, but a Netflix of books, which is sort of what's going on, is a problem. Some publishers still sell e-books, which is great, and we'd like to see more of that, but a lot are keeping control, and that's a problem. So we're digitizing enormous numbers of books to make them available to people doing research, like with Wikipedia. And we're probably best known for our web collection: we collect a lot of web pages and make them available as the Wayback Machine. We've also scoured the web to find academic journal literature, government PDFs, things like that, which we can use as datasets for other purposes, for instance AI. The total collection is over 99 petabytes. I just love that number because it has too many zeros; it's just enormous. And I'd say it's some meaningful percentage of the works of humankind. This is the anti-entropy of humans: putting this much information in order as a cultural legacy. You can think of it as culture in a bottle, but it's also an enormous record of what people have done and left behind in the published record.

Okay, so that's a bit about the Internet Archive. Now I want to go into how the Internet Archive is using some of these new AI tools. I'll give you a few examples, two of which are about how we've used them to improve our services in production; maybe this is something you can do as well. It turns out we've been digitizing periodicals, and the periodical descriptions that are available commercially are all licensed in such a way that you can't put them on a website in any enduring way: once you stop paying, you have to take it all off. So that licensing model doesn't really work for us, and we wanted to write our own descriptions. We employed a bunch of librarians to start describing these periodicals, based mostly on research on the net, and it would take about 40 minutes for them to describe the intended audience, what the periodical is about, and its publication history. Then, when ChatGPT came out, we set up a workflow based on Google Sheets: you put a prompt into one cell, a prompt with little slots where facts about the particular periodical we're working on get stuck in, and it comes out with an answer for, say, Common Ground. It's a prompt we've refined over time to produce a description we can then add; then we put it through a person, who edits it and makes it available. It went from about 40 or 45 minutes each down to around six or seven minutes. You still need the person, because if the model doesn't really know, it starts making things up; it's a problem with URLs and the like. But the AI technology has actually worked pretty well. There's a template for this particular body of literature; it sticks in the particular pieces we know, says, please write a description in this format, and it does. We can run it across a whole lot of periodicals, given some of the information we have about each, and it comes out with pretty good descriptions that you then have to check. So here's an example of taking a third-party piece of code, ChatGPT, and lashing it into Google Sheets, into a workflow that allows people to do the work they were doing before about ten times more efficiently, with these semi-autonomous research-agent actions. That's been a boon for us: it allows us to go through thousands and thousands where it was painful in the hundreds.
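A minimal sketch of that kind of templated-prompt workflow, assuming the OpenAI Python SDK (the model name, prompt wording, and example periodical fields here are illustrative, not the Archive's actual template):

```python
# Fill a prompt template with facts about a periodical, send it to a chat
# model, and return a draft description for a librarian to edit.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

TEMPLATE = (
    "Write a short library description of the periodical '{title}' "
    "({years}), covering its intended audience, its subject matter, and "
    "its publication history. If you are unsure of a fact, omit it."
)

def draft_description(title: str, years: str) -> str:
    prompt = TEMPLATE.format(title=title, years=years)
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# A person then edits the draft, cutting anything the model made up.
print(draft_description("Common Ground", "1980-1995"))  # hypothetical dates
```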
Another use, which is slightly different, is taking information you already have and extracting something from it. In our case, we have about 200,000 books that don't have author, title, publisher, and date in our metadata. Maybe we have the ISBN, but we don't have the rest, and we can't find it in any of the open databases out there. So we took those books, scanned them, because that's what we do, and ran optical character recognition. Then we took the first 300 lines of each book, gave them to ChatGPT with a particular prompt, and said: please give us the author, title, publisher, and date. And it does a great job of it. We've now got this working in production, and I'm hoping we can keep using these advanced tools as plug-and-play parts of our systems to really speed things up. We don't really know how this compares to humans, because we never actually wanted people to go through 200,000 books trying to figure this out; it seemed too rote. And if you tried normal computer programs, in Python or whatever, they're just too fragile. These new technologies can deal with fuzzy inputs much better than anything we had before. So those are two examples of how we're using AI to improve our collections.

Now I'd like to show you a little of what other people are doing with our collections. This is work by Kalev, an independent researcher at GDELT. We have been recording Russian television, but it's been hard to make sense of it. I'm going to see if I can run this as a live demo, hoping it comes through. I've just clicked on a link that goes to a particular program on Russian television. These are just screenshots, to get an idea of what's going on, and there's Tucker Carlson, who keeps appearing; it turns out he appears a lot on Russian television. Even just navigating this huge dataset with little visual images every 30 seconds or a minute was enough for reporters at the Washington Post and elsewhere to start to get an idea of what's going on. But you want to go deeper than that. So here is one of the programs, and I don't know about you, but I don't speak Russian. What is she saying? We used AI technologies to do speech-to-text, so we can take her speech and make a transcript of it. It's faulty, and it's in Russian. Then we used other AI technologies, supplied by Google, which were completely tremendous, to do automatic translation, turning that Russian into English, which is kind of great. So we're able to play it back as English closed captions, which lets you go deeper and try to understand what's being discussed.
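Here is a sketch of that transcribe-then-translate pipeline. The talk doesn't name the speech-to-text system, so the open-source Whisper model stands in here as an assumption, and the Google Cloud Translation client stands in for the Google translation step; the video file name is hypothetical.

```python
# Speech-to-text on a Russian broadcast, then machine translation to
# English, producing text that can be shown as closed captions.
import whisper                                       # openai-whisper package
from google.cloud import translate_v2 as translate   # google-cloud-translate

def english_captions(video_path: str) -> str:
    # Produce a (faulty but usable) Russian transcript.
    model = whisper.load_model("small")
    russian_text = model.transcribe(video_path, language="ru")["text"]
    # Translate the Russian transcript into English.
    client = translate.Client()
    result = client.translate(
        russian_text, source_language="ru", target_language="en"
    )
    return result["translatedText"]

print(english_captions("russia1_evening_broadcast.mp4"))  # hypothetical file
```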
Now, what I found fascinating is that you want to go to the next level: even transcribed, I don't want to watch all of this. I want it summarized. So we took the transcripts from a full day of programming and created a summary. It basically takes the full day's transcripts, passes them through ChatGPT with a prompt, finds the key messages and the repeating segments on Russian television, summarizes each into a few words, and comes up with a title. Basically, it makes a newspaper out of television news, while allowing you to go back and see where everything came from, so that if the model is hallucinating, you can find what was really said, whether the technology isn't working quite right or you just want to go deeper. It's a way of getting a macroscope, of being able to take a step back. Personally, I would love to get a little report on what, I don't know, 5, 10, 15 different television channels are reporting to their particular publics, whether all within the United States or in different countries. What's going on in those countries? I want to pierce the bubble; I want to find out what's inside other people's bubbles. And there are people starting to work on the next step beyond this, which is: can you talk to the bubble? Can you take, say, Fox News, or Russian television news, or Belarusian, Chinese, or Taiwanese news, and fine-tune an AI on it in such a way that it answers questions from that point of view? Then you can not only read about what's going on, you can actually probe it: a conversational interface rather than a search-engine interface into these enormous corpora, to try to get an idea of what's going on in other places.

So those are three examples of how we've been using AI technologies already. Now let me suggest a bunch of places we could go with this. We've digitized everything ever written in Balinese. It turns out they write on palm-leaf manuscripts, so we photographed all of them, but we haven't gotten them into machine-readable form yet. We had some people in Bali key thousands of pages into their script in Unicode, and we got the Unicode people to change some of the glyphs so there are better Balinese representations; so now we have those in transliterated form. But now we want to OCR all the rest. Can we use the thousands of pages done by hand as training data, to take on the hundreds of thousands of pages still to go? If anybody's looking for a cool AI project, we have that one. We've also got 78 RPM records that people are using to experiment with automatic audio cleanup, and analyzing for different kinds of things. And I would love to see us take the best of what we know and put it into a large language model, so we can start to confront some of the bigger problems we have in the world: not just making funny graphics, but seeing what we can do on a ClimateGPT. So that's the Internet Archive, and some of what we're doing with it. I hope this was at least interesting, and hopefully useful.

Brewster, every time you do something like this with me, or for me, you blow my mind, because it makes me think about what the potential is. This particular conference is all about the cloud and metrics, but I think it's good to zoom out and think about what's possible. Let me ask you this question. You're one of the most humble people I've met, but 40-plus years ago you worked with one of the founding fathers of AI, Marvin Minsky.
You helped create one of the first computer companies really looking to execute on AI capabilities. What's the biggest change today? How has AI evolved into this manifestation, and what does that mean for us as citizens of the Earth?

Isn't it exciting what's going on? I mean, I took a Cruise robotaxi about a month ago, and I got to ride to work in a Waymo, and it's just like, holy crow, we're finally doing this. We were dreaming of this back in 1980 with Marvin Minsky at the AI lab, but we were data-starved. What we ended up finding out was that instead of trying to put knowledge together with tweezers, you throw enormous amounts of data at simple programs. We got a long way building the web that way, and the search engines that way, and now we've made enormous progress in machine vision, optical character recognition, and these generative AI models, which are doing, I don't know, kind of miraculous things. I have no idea how it's doing it, but it talks to you in a really pretty natural way. It's pretty surreal. You should try downloading the app Pi, which I did for the first time two nights ago. It's a little chatbot you can talk to with a human voice, and it's astonishing. So what's happened? Big data: that's probably the biggest positive that has propelled a lot of these AI things.

One last question. You have one of the world's largest databases; maybe the NSA has one a little bit bigger. You've been applying these large language models to a massive amount of data. What's your caution for the people out there who have a lot of data in their organizations and are thinking about applying large language models to it? Any concerns or cautions on costs or anything else?

We're just getting going with these new tools, and some of the rights issues are really thorny. The other big thing that's happened in the last 40 years is that companies have grown into global behemoths that are trouncing around, stomping around like dinosaurs. It's a problem. Lawsuits are flowering everywhere, because that seems to be the big guys' idea of how to keep the small guys under control, and that's a problem. So I don't know exactly what to say about what to be worried about on this whole front. There are very smart people who think we're starting to build tools that will outstrip us at certain things and that will corner us. I think it will extend the arms of man, and it will make corporations more corporate and militaries more militaristic, and I don't think those are good trends.

Well, unfortunately, we're going to have to wrap it up, Brewster. Thank you so much. To the audience: if you don't use the Internet Archive today, beyond the Wayback Machine, I highly recommend you go subscribe to it and follow it. I think you'll be amazed at Brewster's vision of digitizing all of human information, because that's really how big it is, and at how you can use it on a daily basis. Thanks, Brewster.

Thank you, Ray. And thank you all.