Congress Wants Tech Companies to Pay Up for AI Training Data

stopthatgirl7@kbin.social · 9 months ago

Motavader@lemmy.world · 9 months ago

Yes, and they’ll use legislation to pull up the ladder behind them. It’s a form of regulatory capture, and it will absolutely lock out small players.

There are open source AI training datasets, but the question is whether LLMs can be trained as accurately with them.

Mechanize@feddit.it · 9 months ago

Any foundation model is trained on a subset of Common Crawl.

All the data in there is, arguably, copyrighted by one individual or another. There is no equivalent dataset, open or closed source.

Every single post, page, blog, and site has a copyright holder. In the last year, big companies have started changing their TOS to ensure that they can use, relicense, and generally sell the data you host on their services as their own for AI training. So some small parts of Common Crawl could potentially be licensed in bulk, or obtained directly from the source.

This still leaves out the majority of the data used today, directly or indirectly, even if you were willing to pay, because it is infeasible to find and contract with every single rights holder.

On the other side, there has been work on using less but more heavily curated data, which could potentially produce good small, domain-specific models. But they still will not be like the ones we currently have, and the open source community will not have access to the same amount and quality of data.

It’s an interesting problem, and I’m really curious to see where it leads.

wikibot@lemmy.world · 9 months ago

Here’s the summary for the Wikipedia article you mentioned in your comment:

Common Crawl is a nonprofit 501(c)(3) organization that crawls the web and freely provides its archives and datasets to the public. Common Crawl's web archive consists of petabytes of data collected since 2008. It completes crawls generally every month. Common Crawl was founded by Gil Elbaz. Advisors to the non-profit include Peter Norvig and Joi Ito. The organization's crawlers respect nofollow and robots.txt policies. Open source code for processing Common Crawl's data set is publicly available. The Common Crawl dataset includes copyrighted work and is distributed from the US under fair use claims. Researchers in other countries have made use of techniques such as shuffling sentences or referencing the Common Crawl dataset to work around copyright law in other legal jurisdictions. As of March 2023, in the most recent version of the Common Crawl dataset, 46% of documents had English as their primary language (followed by German, Russian, Japanese, French, Spanish and Chinese, all below 6%).

Motavader@lemmy.world · 9 months ago

Thanks for the link to Common Crawl; I didn’t know about that project but it looks interesting.

That’s also an interesting point about heavily curated datasets. Would something like that be able to overcome some of the bias in current models? For example, if you were training a facial recognition model, you could use a curated, open source dataset with representative samples of all races and genders to try to reduce racial bias. Anyone training a facial recognition model for any purpose could then have a training set that can be peer reviewed for accuracy.

General_Effort@lemmy.world · 9 months ago

These open datasets are used to fine-tune LLMs for specific tasks. But first, LLMs have to learn the basics by being trained on vast amounts of text. At present, there is no way to do that with open source data.

If fair use is cut down, you can forget about it. Cutting it down would arguably be unconstitutional, though.

That’s not even considering the dystopian wishes to expand copyright even further. Some people demand that the model owner should also own the output. Well, some of these open datasets are themselves made with LLMs like ChatGPT.