The following are some notes I took at the Coalition for Networked Information Spring Meeting 2015, in Seattle, WA, on 2015-04-13. Brewster Kahle needs our help!
“Providing Universal Access to Modern Materials – And living to tell the tale”
The Internet Archive is a non-profit, independent, $12 million-a-year archive OF the internet. However, they want to become the archive ON the internet, not just hosting old web pages. We need to be trying to build libraries together, with distributed collection services and content. Brewster believes it should be free to all.
He started in 1996 capturing snapshots of websites every 2 months. At first they were trying to capture everything; lawyers warned him: “bad things will happen.” Then the Smithsonian worked with them to get the presidential websites of 1996; this gave them cover. Bad things didn’t happen. In 2001, they developed the Wayback Machine. Larry Lessig launched it in a California Library, again providing them with protection. The Library of Congress commissioned them to collect some stuff for them, but was wary. Then the Washington Post wrote it up as a good thing, and it was all good.
They built the Archive-it.org web archiving tool so others could archive content. Some subscribe in order to curate collections on the Internet Archive. Over 350 partners are now web archiving, collecting collaboratively. They have over 1700 curated collections, searchable and browsable.
An example is the Digital Archive of Japan’s 2011 Disasters: not covered anywhere else. It’s a big collection: 6 TB, 100 million archived documents, 18,478 seeds.
Then they got into Lending Television. In 1976 the ATRA (American Television and Radio Archives) Act said we could archive TV. So in 2000, the Internet Archive started recording TV without asking permission, and bad things didn’t happen. A Vanderbilt group was lending TV to researchers, and managed to get an exemption. So the Internet Archive built a TV News archive on that, not providing downloads, but providing access. Instead of being upset, the news organizations are thrilled to have access to their old collections. In fact, they want data dumps of their old collections so they can mine the data.
This effort has been supported by the Knight Foundation. For the first time they had the opportunity to compare political ads and local TV news in Philadelphia, in 2014. Do you know what they found? For every minute of news about the elections, there were 40 minutes of ads.
So far, in terms of TV, the Internet Archive has collected 3 million hours, maybe 9 petabytes: 100+ channels, 25+ countries, 10 terabytes per channel-year. It doesn’t take as much space as you might think.
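To put the "not as much space as you might think" remark in perspective, here is a rough back-of-the-envelope sketch in Python using only the per-channel figure quoted above. The decimal units (1 TB = 10^12 bytes), the round-the-clock recording assumption, and the function names are mine, not from the talk.

```python
# Rough storage arithmetic for the TV archive figures quoted in the notes.
# Assumptions (not from the talk): decimal units, and 24x365 recording per channel.
TB = 10**12
PB = 10**15

def yearly_bytes(channels: int, bytes_per_channel_year: int) -> int:
    """Total bytes needed to record the given number of channels for one year."""
    return channels * bytes_per_channel_year

def gb_per_hour(bytes_per_channel_year: int, hours_per_year: int = 24 * 365) -> float:
    """Average recording rate, in GB per hour, implied by a channel-year figure."""
    return bytes_per_channel_year / hours_per_year / 10**9

total = yearly_bytes(100, 10 * TB)
print(f"100 channels for a year: {total / PB:.0f} PB")
print(f"Implied rate: {gb_per_hour(10 * TB):.2f} GB per channel-hour")
```

Under these assumptions, 100 channels come to about one petabyte a year. The "maybe 9 petabytes" total is a rough figure from the talk, so I would not expect it to reconcile exactly with this estimate.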
In terms of music, they have 6k* bands, 130k concerts. They started with the Grateful Dead, who supported tape-trading and were happy to be involved. So the Internet Archive asked for some level of permission from the bands: anyone associated could give permission, and they had no special permission forms. Instead they posted the email responses from their queries. They got 4-5 emails a day, and fans uploaded tons of concerts a day. They now have 9,000 concert recordings – everything the Grateful Dead ever did! 43k albums.
Netlabels – internet era labels – offered free hosting of audio content. Now the Internet Archive is working with CDs, LPs, and 78s, which are a bit more problematic. They’re working with the Archive of Contemporary Music in NYC, and working with the digitization process. They’re trying to get this to be more distributed as a project, involving other libraries. The current idea is to have libraries go digital: to have people add audio content if it’s not in there, and then provide them with digital access to the content if they can prove they have the hard copy. They have a model for on-campus access. Can we as a group bring our libraries digital?
They recently got a donation of 40k 78 rpm records. Please don’t get rid of collections – store them. Offsite is fine, it’s cheap. If you don’t want to hold on to them, send them to the Internet Archive.
So far in terms of audio, they have 2 million items in over 5k collections.
Now they’re working with moving images; they started with old films and home movies, and people love them (16 mm and such). They are digitizing 7k films that were just donated with the help of Mellon and a CLIR fellow. They’re operating on a take-down policy; if anyone objects, they just take it offline.
The Internet Archive is offering unlimited storage for people to upload things. Now they’re working with VHS tapes. If the content isn’t currently available new on DVD, they digitize it and put it up. It costs them about $24 per hour, and now they have 1,000,000 of them.
In terms of texts: the Library of Congress has 28 million books. Can you imagine if all that was online? They didn’t stop with just public domain. They have the costs down to 10 cents a page, $30 a book. It’s an open version of the Google Books project, with 2 ½ million books. They wanted to digitize everything available in a specific language, and the Balinese were the first to say yes! They publish on palm leaves. So now every text from Bali is online, available free.
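The scanning costs invite a quick calculation. The sketch below uses only the quoted figures (10 cents a page, $30 a book, 28 million books) and works in whole cents to avoid rounding trouble. The pages-per-book number and the total are derived here, not quoted from the talk, and the function names are hypothetical.

```python
# Back-of-the-envelope scanning costs from the figures in the notes.
# Money is kept in integer cents so the arithmetic stays exact.
COST_PER_PAGE_CENTS = 10        # "10 cents a page"
COST_PER_BOOK_CENTS = 3000      # "$30 a book"
LIBRARY_OF_CONGRESS_BOOKS = 28_000_000

def implied_pages_per_book(book_cents: int, page_cents: int) -> int:
    """Average book length implied by the per-book and per-page costs."""
    return book_cents // page_cents

def total_cost_dollars(books: int, book_cents: int) -> int:
    """Cost in whole dollars of scanning the given number of books."""
    return books * book_cents // 100

pages = implied_pages_per_book(COST_PER_BOOK_CENTS, COST_PER_PAGE_CENTS)
total = total_cost_dollars(LIBRARY_OF_CONGRESS_BOOKS, COST_PER_BOOK_CENTS)
print(f"Implied average book: {pages} pages")
print(f"Scanning 28 million books: ${total:,}")
```

So the two cost figures imply an average book of about 300 pages, and scanning a collection the size of the Library of Congress at that rate would come to roughly $840 million, assuming the per-book cost held at that scale.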
The Internet Archive has scanning centers all over the world, and they’re adding 1000 books a day; texts are all available free.
They also have a million books online for the blind and dyslexic.
They use Openlibrary.org for books, with a web page for every book, and 600k visitors a day. They lend books through this interface; you can check them out for 2 weeks, and during that time no one else can read them. The books are offered via different formats. Brewster wants to do that with their collections, and we could lend things within our own organizations.
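For anyone thinking about lending within their own organization, the one-reader-at-a-time rule described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Open Library is actually implemented; the class and method names are hypothetical, and only the 2-week loan period comes from the talk.

```python
from datetime import date, timedelta

LOAN_PERIOD = timedelta(weeks=2)  # the 2-week checkout mentioned in the notes

class LendableCopy:
    """One digital copy that a single reader may hold at a time (hypothetical model)."""

    def __init__(self, title: str) -> None:
        self.title = title
        self.borrower = None  # current reader, if any
        self.due = None       # date the current loan expires

    def is_available(self, today: date) -> bool:
        """A copy is free if nobody holds it or the last loan has expired."""
        return self.borrower is None or today >= self.due

    def check_out(self, reader: str, today: date) -> bool:
        """Lend the copy if it is free; return False if someone else still has it."""
        if not self.is_available(today):
            return False
        self.borrower = reader
        self.due = today + LOAN_PERIOD
        return True

book = LendableCopy("Example title")
start = date(2015, 4, 13)
print(book.check_out("first reader", start))                        # copy is free
print(book.check_out("second reader", start + timedelta(days=7)))   # still on loan
print(book.check_out("second reader", start + timedelta(days=14)))  # loan expired
```

The point of the design is that the number of simultaneous readers never exceeds the number of copies held, which mirrors how a physical book is lent.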
Next he wants to focus on personal digital archives – our digital artifacts. This will include Twitter, Facebook, YouTube, Flickr, Gmail, etc.
Universal access to all knowledge is a project involving everyone.
Google Books got stopped because it was a monopoly. We need to work together, and do it collaboratively, and make the content free to the people.
David Rosenthal (LOCKSS, Stanford U.) thinks streaming is a solution, as it doesn’t allow download.
Brewster says it’s worked out well as an intermediary solution. Even the book viewing is streaming. It’s not that great for researchers, though; they need to scrape it. So having copies and providing bulk access is tremendous. Shouldn’t our universities be providing the services? Especially for things with rights issues like TV.
David said the size of the data needed is getting very big. Moving the analysis to the data may be the only approach.
Brewster said yes, but let’s not lock this into policy. It’s better to have multiple copies, and allow some bulk access. We have a partial copy in Amsterdam and a partial copy in Egypt. That’s only 2 copies, and he knows LOCKSS recommends more; they need help.
Mackenzie Smith (UC Davis) asked about researcher access.
Brewster said that he’s found that big data means lots of data points, not lots of content. We need a good middleware layer with services. We’re exploring building an institute for people to build that middleware. They tried just providing access and found it’s hard to work with if you don’t have a programmer. Let’s build some open source software so many people can do research.
Melissa Johnston (University of Washington) asked: why haven’t you gotten into trouble?
Brewster said: maybe we’re just lucky. We try to be respectful, and we don’t make any money at it. People don’t want to feel they’re being taken advantage of. We try to do it right. We don’t profit from it. Just remember, it’s their stuff. We bend but don’t break policy. We could have faced people down in court, but we didn’t; we just took things down. We try to talk more to business people than lawyers, and find how we can make their business work better. Laws tend to trail. We’ll have to get there and do things and the laws will follow.
Brewster likes the old style library, done in a distributed way.
Brewster visited OCLC when it was run on Honeywell Computers. The size was only 17 GB, but it was a lot of maintenance. You don’t need an acre of mainframes anymore. We need to question our old assumptions.
Brewster said: we are severely underfunded for what we’re trying to do. That third phase of building libraries together is requiring us to be more engaged. Don’t give up on us. We’re not there yet.
Every country should have things like this. But we are lame at the moment.
Stephen said DPN will be getting in touch.
Vicky Reich (LOCKSS) said: My definition of a research library is one that continues to build collections. Many of the libraries that used to be research libraries have fallen down by that measure. Much content is subscribed to now; little is freely available. You frankly haven’t given up on this community; I hope we don’t let you down.
Brewster responded: the horse is out of the barn. Librarianship is now all about contract negotiations and personnel issues. He doesn’t think that’s the way to go. Our business models are wrong. If we’re running a library only locally, that’s not so good. But how do we adapt to distributed service provision? He doesn’t know the answer. Donald Waters (Mellon) has been winning this argument for years. It’s difficult to get libraries to pay for things that they can get for free, yet open distribution works better in the internet generation. We still have business models built around local collections, but we need to go global, and we need a support mechanism for it. Michael Lesk (Rutgers) is worried about the 20th century – it might get forgotten due to copyright issues – and institutional responsibilities.
What are you going to do next week? Will it be any different? Your constraints are still the same as last week.
*“k” is shorthand for “thousand”