
Sophie Bushwick: To train a large artificial intelligence model, you need lots of text and images created by real people. As the AI boom continues, it is becoming increasingly clear that some of this data is coming from copyrighted sources. Now, writers and artists are filing a series of lawsuits to challenge how AI developers use their work.
Lauren Leffer: But it’s not just published authors and visual artists who should care about how generative AI is trained. If you listen to this podcast, you might want to take note too. I’m Lauren Leffer, the technology reporter at Scientific American.
Bushwick: And I’m Sophie Bushwick, tech editor at Scientific American. You’re listening to Tech, Quickly, the digital, data-diving version of Scientific American’s Science, Quickly podcast.
So, Lauren, people often say that generative AI is trained on data from all over the Internet, but it seems like there isn’t much clarity about what that actually means. When this came up around the office, a lot of our colleagues had questions.
Leffer: People asked about their individual social media profiles, password-protected content, old blogs, you name it. It’s hard to wrap your head around what online data means when, as Emily M. Bender, a computational linguist at the University of Washington, told me, quote, “There’s no place where you can download the Internet.”
Bushwick: So let’s dig into it. How do these AI companies get their data?
Leffer: Well, it’s done through automated programs called web crawlers and web scrapers. This is the same type of technology that has long been used to build search engines. You can think of crawlers as digital spiders that move along strands of silk from URL to URL, cataloging the location of everything they come across.
Bushwick: Happy Halloween to us.
Leffer: Exactly. Spooky spiders on the Internet. Then web scrapers go in and download all of that cataloged information.
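For anyone curious about the mechanics, here is a minimal sketch of that spider-and-scraper idea in Python. It’s only a toy illustration, not how any particular company collects data: it assumes the third-party requests and beautifulsoup4 libraries are installed, the seed URL is a placeholder, and real crawlers add politeness rules, deduplication and other safeguards that this version skips.

from collections import deque
from urllib.parse import urljoin

import requests  # third-party HTTP library (assumed installed)
from bs4 import BeautifulSoup  # third-party HTML parser (assumed installed)

def crawl(start_url, max_pages=10):
    """Follow links from page to page (the 'spider') and save each page's text (the 'scraper')."""
    queue = deque([start_url])
    seen = set()
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        response = requests.get(url, timeout=10)
        soup = BeautifulSoup(response.text, "html.parser")
        pages[url] = soup.get_text()  # downloaded text that could feed a training corpus
        # Catalog every outgoing link and add it to the list of URLs to visit next.
        for link in soup.find_all("a", href=True):
            queue.append(urljoin(url, link["href"]))
    return pages

corpus = crawl("https://example.com")  # hypothetical seed URL
print(len(corpus), "pages collected")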
Bushwick: And these tools are readily available.
Leffer: Right. There are a few different open-access crawlers out there. For example, there is one called Common Crawl, which we know OpenAI used to collect training data for at least one iteration of the large language model that powers ChatGPT.
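Common Crawl also publishes a public index of what it has captured, so you can check whether a given page shows up in one of its crawls. Here is a rough sketch of querying that index in Python; the crawl label is just an example (the current list is posted at index.commoncrawl.org), the URL being looked up is a placeholder, and the requests library is assumed to be installed.

import json

import requests  # third-party HTTP library (assumed installed)

# Example crawl label; newer crawls are listed at https://index.commoncrawl.org/
INDEX_URL = "https://index.commoncrawl.org/CC-MAIN-2023-40-index"

def lookup(url):
    """Return index records describing where captures of a URL are stored in that crawl."""
    response = requests.get(INDEX_URL, params={"url": url, "output": "json"}, timeout=30)
    if response.status_code == 404:
        return []  # the index responds with 404 when it has no captures of the URL
    return [json.loads(line) for line in response.text.splitlines()]

for record in lookup("example.com/"):  # placeholder URL
    print(record.get("timestamp"), record.get("url"), record.get("status"))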
Bushwick: What do you mean, at least one?
Leffer: Yes. So the company, like many of its big tech peers, has become less transparent about training data over time. When OpenAI developed GPT-3, it explained in a paper what it used to train the model and even how it approached filtering that data. However, with the release of GPT-3.5 and GPT-4, OpenAI offered much less information.
Bushwick: How much less are we talking?
Leffer: Much less – almost none. The company’s latest technical report provides literally no details about the training process or the data used. OpenAI even acknowledges this directly in the paper, writing that, “given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture, hardware, training compute, dataset construction, training method, or similar.”
Bushwick: Wow. Okay, so we don’t really have any information from the company about what powered the latest version of ChatGPT.
Leffer: Right. But that doesn’t mean we’re completely in the dark. The largest data sources probably stayed fairly consistent between GPT-3 and GPT-4, because it’s really hard to find brand-new data sources that are big enough to build generative AI models on. Developers are trying to get more data, not less, so GPT-4 probably relied in part on Common Crawl as well.
Bushwick: Okay, so Common Crawl and web crawlers in general – they are a big part of the data collection process. So what are they dredging up? I mean, is there anywhere these little digital spiders can’t go?
Leffer: Good question. There are certainly places that are more difficult to access than others. As a general rule, anything visible in search engines is really easy to vacuum up, but content behind a login page is harder to get to. So information on a public LinkedIn profile may be included in Common Crawl’s database, but a password-protected account probably isn’t. But think about it for a minute.
Open data on the Internet includes things like photos uploaded to Flickr, online marketplaces, voter registration databases, government websites, company websites, probably your bio on your employer’s website, Wikipedia, Reddit, research archives, news outlets. Plus, there’s tons of readily available pirated content and archived compilations, which might include that embarrassing personal blog you thought you deleted years ago.
Bushwick: Yuck. Okay, so that’s a lot of data. But on the bright side, at least it’s not my old Facebook posts because they’re private, right?
Leffer: I’d love to say yes, but here’s the thing. General web scraping may not include locked social media accounts or your private posts, but Facebook and Instagram are owned by Meta, which has its own large language model.
Bushwick: Ah, right.
Leffer: Right. And Meta invests big money in further developing its AI.
Bushwick: On the last episode of Tech, Quickly, we talked about Amazon and Google incorporating user data into their AI models. So does Meta do the same thing?
Leffer: Yes, officially. The company has admitted that it used Instagram and Facebook posts to train its AI. So far Meta has said this is limited to public posts, but it’s a bit unclear how it defines that. And of course that could always change going forward.
Bushwick: I think this is scary, but I think some people might be wondering: So what? It makes sense that writers and artists wouldn’t want their copyrighted works included here, especially when generative AI can spit out content that mimics their style. But why does it matter to anyone else? All of this information is online anyway, so it’s not that private to begin with.
Leffer: True. Everything is already on the Internet, but you might be surprised by some of the material that appears in these databases. Last year a digital artist was looking through a visual database called LAION, spelled L-A-I-O-N…
Bushwick: Sure, that’s not confusing at all.
Leffer: …which is used to train popular image generators. The artist came across a medical photo of herself attached to her name. The picture had been taken at a hospital as part of her medical record, and at the time she had specifically signed a form stating that she did not consent to the photo being shared in any context. But somehow it ended up online.
Bushwick: Oh. Isn’t that illegal? That sounds like it would violate HIPAA, the medical privacy rule.
Leffer: Yes to the illegal question, but we don’t know how the medical image got into LAION. These companies and organizations don’t keep very good track of where their data comes from. They just compile it and then train AI tools with it. A report from Ars Technica found lots of other pictures of people in hospitals in the LAION database as well. And I asked LAION for comment, but I haven’t heard back from them.
Bushwick: So what do we think happened here?
Leffer: Well, I asked Ben Zhao, a computer scientist at the University of Chicago, about this, and he pointed out that data often gets misplaced. Privacy settings may be too lax. Digital leaks and breaches are common. Information that is not intended for the public Internet ends up on the Internet all the time.
Ben Zhao: There are examples of children being filmed without their permission. There are examples of private home pictures. There are all sorts of things that should not in any way, shape or form be part of a public training data set.
Bushwick: But just because data ends up in an AI training set doesn’t mean it becomes available to anyone who wants to see it. I mean, there’s protection here. AI chatbots and image generators don’t just spit out people’s home addresses or credit card numbers if you ask them.
Leffer: True. I mean, it’s hard enough to get AI bots to offer completely accurate information about basic historical events. They hallucinate and they make mistakes a lot. These tools are certainly not the easiest way to track down personal information about an individual on the Internet. But…
Bushwick: Oh, why is there always a “but”?
Leffer: There, uh, there have been some cases where AI generators have produced images of real people’s faces and very faithful reproductions of copyrighted work. Plus, while most generative models have guardrails in place to prevent them from sharing identifying information about specific people, researchers have shown that there are usually ways around these blocks with creative prompts or by messing with open-source AI models.
Bushwick: So privacy is still an issue here?
Leffer: Absolutely. It’s just another way your digital information can end up where you don’t want it. And again, because there’s so little transparency, Zhao and others told me that right now it’s basically impossible to hold companies accountable for the data they use or to stop it from happening. We would need some sort of federal privacy regulation for that.
And the US has none.
Bushwick: Yes.
Leffer: And as a bonus, all that data comes with another big problem.
Bushwick: Oh, of course it does. Let me guess this one. Is it bias?
Leffer: Ding, ding, ding. The Internet may contain a lot of information, but it is skewed information. I spoke with Meredith Broussard, a data journalist who researches AI at New York University, who described the problem.
Meredith Broussard: We all know that there are wonderful things on the Internet and that there is extremely toxic material on the Internet. So when you look at, for example, what are the sites in the Common Crawl, you’ll find a lot of white supremacist sites. You will find a lot of hate speech.
Leffer: And in Broussard’s words, it’s “bias in, bias out.”
Bushwick: Don’t AI developers filter their training data to get rid of the worst bits and put restrictions in place to prevent bots from creating hateful content?
Leffer: Yes. But clearly lots of bias still comes through. It’s obvious when you look at the totality of what AI generates. The models appear to reflect and even amplify many harmful racial, gender and ethnic stereotypes. For example, AI image generators tend to produce much more sexualized depictions of women than they do of men. And at baseline, relying on Internet data means that these AI models will be skewed toward the perspectives of the people who were able to access the Internet and post online in the first place.
Bushwick: A ha. So we’re talking about wealthier people, western countries, people who don’t face a lot of harassment online. Perhaps this group also excludes the elderly or the very young.
Leffer: Right. The Internet is not actually representative of the real world.
Bushwick: And in turn, neither are these AI models.
Leffer: Exactly. Ultimately, Bender and a couple of other experts I spoke with noted that this bias and, again, the lack of transparency make it really hard to tell what our current generative AI models should even be used for. Like, what’s a good application for a biased black-box content machine?
Bushwick: I guess that’s a question we’ll have to wait to answer. Science, Quickly is produced by Jeff DelViscio, Tulika Bose, Kelso Harper and Carin Leong. Our show is edited by Elah Feder and Alexa Lim. Our theme music was composed by Dominic Smith.
Leffer: Don’t forget to subscribe to Science, Quickly wherever you get your podcasts. For more in-depth science news and features, visit ScientificAmerican.com. And if you like the show, please give us a rating or review.
Bushwick: For Scientific American’s Science, Quickly, I’m Sophie Bushwick.
Leffer: I’m Lauren Leffer. We’ll talk to you next time.