Monday, August 28, 2023

Books Powering Machine Learning :: Is It Piracy?


Had to take a deep dive into Alex Reisner's 'Revealed: The Authors Whose Pirated Books Are Powering Generative AI', published in The Atlantic on August 19, 2023. It starts ominously. Walao. If you don't code or take an interest in the how, of course you're not going to get it. But this is something we should all educate ourselves on. 

We all know the AI surtitles and subtitles on a great number of television dramas cannot make it. Translated chunks of words don't mean what the original languages intend or say. AI translation sucks ass. Human translators are still needed. But hiring them to translate shows requires a tremendous budget. Translators don't come cheap, yet television networks force these humans to work for cheap. Along comes AI and ChatGPT. But machine learning, where translation is concerned, is still in its infancy. 

One of the most troubling issues around generative AI is simple: It’s being made in secret. To produce humanlike answers to questions, systems such as ChatGPT process huge quantities of written material. But few people outside of companies such as Meta and OpenAI know the full extent of the texts these programs have been trained on.

The AI algorithms can only get better as they learn more. They can't differentiate fact from fiction. The generative AI systems can and will plough through massive chunks of cached texts to produce ever more coherent words. Of course they're going to use all these books and words out there to learn slang, phrasing and whatever else. Proving it is another matter altogether, unless there're whistleblowers. 

At the heart of it all is Meta, a company so powerful that it can push technology forward by decades, if only it's given unfettered access to all global citizens with an internet connection. Datasets are being used, secretly or openly, by every developer. The open-source software community has never been so divided. How do you do more creative work without restrictive licences? How do you guarantee the integrity of your finished work so that you can make a proper living off it?

Is it Control vs Piracy vs Intellectual Property vs erm Greater Good? 

Upwards of 170,000 books, the majority published in the past 20 years, are in LLaMA’s training data. In addition to work by Silverman, Kadrey, and Golden, nonfiction by Michael Pollan, Rebecca Solnit, and Jon Krakauer is being used, as are thrillers by James Patterson and Stephen King and other fiction by George Saunders, Zadie Smith, and Junot Díaz. These books are part of a dataset called “Books3,” and its use has not been limited to LLaMA. Books3 was also used to train Bloomberg’s BloombergGPT, EleutherAI’s GPT-J—a popular open-source model—and likely other generative-AI programs now embedded in websites across the internet. A Meta spokesperson declined to comment on the company’s use of Books3; Bloomberg did not respond to emails requesting comment; and Stella Biderman, EleutherAI’s executive director, did not dispute that the company used Books3 in GPT-J’s training data.
