Artificial intelligence (AI) is here to stay. Chatbots like ChatGPT and Gemini are developing advanced language capabilities. They are built on Large Language Models (LLMs), machine learning models trained on large amounts of data so that they can generate new outputs based on what they have learnt. This will revolutionize the way humans work, create, learn, think and solve problems, in ways and at scales previously unimaginable. With the advancement of Generative AI, several legal challenges have emerged. Current points of contention include copyright infringement, the way Generative AI models gather and use enormous amounts of copyrighted and non-copyrighted material from the internet for training, the authorship of AI-generated works, and whether AI-generated work is itself copyrightable.
How does Generative AI learn?
Machine learning models are trained on data available on the internet and learn the hidden patterns or relationships within it. This ability allows the computer to make predictions or decisions based on learned information. Generative AI draws on what it has learned from its training datasets to generate output that may echo or replicate the works found in the training data.
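As a rough illustration of how learned patterns can echo training material, the sketch below trains a toy word-pair (bigram) model on a single sentence and then generates text from it. This is a deliberately simplified assumption for illustration, not how production LLMs are built; the function names and the one-sentence corpus are hypothetical.

```python
from collections import Counter, defaultdict


def train_bigram_model(corpus):
    """Count, for every word, which words follow it in the training text."""
    followers = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for current_word, next_word in zip(words, words[1:]):
            followers[current_word][next_word] += 1
    return followers


def generate(model, start_word, max_words=10):
    """Repeatedly append the most frequent follower of the last word."""
    output = [start_word]
    while len(output) < max_words and output[-1] in model:
        next_word, _count = model[output[-1]].most_common(1)[0]
        output.append(next_word)
    return " ".join(output)


# Hypothetical one-sentence "training set" standing in for scraped text.
training_corpus = ["the quick brown fox jumps over a lazy dog"]
model = train_bigram_model(training_corpus)
generated_text = generate(model, "the")

# With so little training data, the output reproduces the source verbatim.
print(generated_text == training_corpus[0])
```

With a training set this small, the model has only one pattern to follow, so its output replicates the source sentence. Real models train on vastly more data, which usually dilutes this effect but does not necessarily remove it.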
Generative AI developers gather training data from the internet using data scraping [1], web crawling [2] and data mining [3] techniques. Many developers use the digital commons [4], scientific publications and publicly accessible open-source libraries, repositories, websites or platforms on the internet to train their Generative AI models. In effect, nearly all datasets used to train Generative AI models include copyrighted content and personally identifiable information. This raises two questions: whether Generative AI may use copyrighted content to train its models, and whether the doctrine of fair use applies, i.e. whether such use falls within an exception to copyright infringement.
Does it amount to fair use?
At the heart of the matter lie two questions: whether the use of copyrighted material by AI results in original and creative works, and who the author of those works is. Can the fair use exception to copyright then be applied to AI-produced work? This remains highly contentious. Several lawsuits have been filed against Generative AI companies since the launch of OpenAI’s ChatGPT in 2022. [5] In these cases, the matter in dispute is the alleged infringement of the rights holders’ copyrighted material.
US Law
Under US copyright law, the statutory framework for fair use lists certain types of use, such as criticism, comment, news reporting, teaching, scholarship, and research, as examples of uses that may qualify. [6] Section 107 of the US Copyright Act of 1976 assesses fair use based on four factors: the purpose and character of the use, the nature of the copyrighted work, the amount and substantiality of the portion used, and the effect of the use on the potential market for or value of the original work. These factors help determine whether a particular use is fair.
The first factor of fair use considers the purpose and character of the use, including whether it is commercial or non-profit and whether it is transformative. [7] In many cases, Generative AI is trained for commercial purposes, as many platforms now offer paid subscriptions to their users. Furthermore, even if the work generated by the AI is considered transformative, it does not automatically qualify as fair use, as the Courts must also consider the fourth factor, the impact on the market for and the value of the original work. [8]
The second factor of fair use is whether the copyrighted work is factual or creative. Generative AI usually trains on highly creative materials such as visual art, music or writing, which may weigh against fair use. This factor is not decisive on its own and requires case-by-case interpretation by US Courts.
The third factor considers the amount and substantiality of the portion used. When an infringer copies the whole work, or material central to it, the Court may weigh this against fair use. [9] Generative AI typically needs to copy as much of the original work as possible, including its highly creative or central parts, in order to train well enough to generate quality output. This tends to weigh against the fair use exception.
The fourth factor of fair use assesses the impact on the market for or value of the original work. There are strong arguments that AI training on copyrighted works harms the market for and value of those works. Many developers do not compensate copyright owners for the works used to train their Generative AI. Reproduction of copyrighted works by Generative AI may undermine the value of the original work by reducing incentives to buy it. Moreover, the outputs of Generative AI would compete in the same market as the original work, so that authors’ own works are used to erode their economic opportunities. [10] This would weigh against the fair use exception.
Fair use cases involving Generative AI training on copyrighted works will be highly fact dependent and require case-by-case interpretation. While some AI-related uses may qualify as fair use, US Courts might not treat the unauthorized use of copyrighted material to train AI models as falling under a broader exception.
Indian Law
Indian copyright law states that fair dealing (the terms “fair dealing” and “fair use” are used interchangeably in India) with a literary, dramatic, musical or artistic work, other than a computer program, is not an infringement of copyright. Under this principle, it is permissible to use an author’s work in another work without obtaining permission to do so. “Fair dealing” covers use of any work, excluding computer programs, for the purposes of private or personal use, including research; criticism or review of that work or of any other work; and the reporting of current events and current affairs, including the reporting of a lecture delivered in public.
Although the issue has not yet been litigated before Indian Courts, it is likely that the use of copyrighted materials by Generative AI would not be considered an exception to copyright, as these materials are often used for commercial exploitation. In general, Generative AI does not acknowledge the authors it borrows from, a necessary ingredient to qualify for fair dealing. The Indian Courts have also been known to test the element of fairness [11], and while there is no straightforward formula for this, the law might ultimately protect the copyright holder’s rights.
European Union Law
Recently, the European Union (EU) adopted the Artificial Intelligence Act [12] (Act), which establishes a comprehensive framework for the development and use of AI within the EU while prioritizing the protection of fundamental rights. The Act is particularly relevant as it regulates the use of Generative AI and addresses various copyright concerns.
Article 53(1) of the Act [13] and Recital 107 of the Act [14] underscore the importance of transparency, support accountability and facilitate the enforcement of copyright. Article 53(1) requires providers of general-purpose AI models, which include Generative AI models, to draw up and make publicly available a sufficiently detailed summary of the content used for training their models. This summary is expected to provide comprehensive information about the datasets used, including both public and private sources, to enable rightsholders to exercise their rights effectively. The introduction of this summary addresses rightsholders’ concerns for the first time and also ensures that they can exercise and enforce their rights if they wish.
Conclusion
Training Generative AI models is a novel practice that poses new challenges to copyright law worldwide. Generative AI developers often rely on the defense of fair use to justify their use of copyrighted content in training datasets. Whether this doctrine applies is not a simple question. If it is applied, original proprietors and authors might be left unprotected and without due compensation. On the other hand, if fair use is not available, innovation and the adoption of new technology could be hindered, leaving developers embroiled in legal compliance and due diligence. The tussle is really between the human drive to innovate and the need to protect copyrighted works. The questions we should be asking are how far copyright holders’ interests are to be protected and what copyright should come to mean in the future. For now, current principles of copyright law are under a stress test. It remains to be seen how this will evolve and whether new legislation can in fact keep up with the breakneck speed of technological development.
1. Data scraping is the process of extracting large amounts of data from publicly available web sources. See https://www.datamation.com/big-data/data-scraping/
2. Web crawling is a process that allows a search engine to match, with the use of indexed data, relevant search results to a query. See https://www.elastic.co/what-is/web-crawler
3. Data mining is the extraction of natural language works (books or articles, for example) or numeric data (i.e. files or reports) and the use of software that reads and digests digital information to identify relationships and patterns far more quickly than a human can.
4. Reclaiming the digital commons: A public data trust for training data, see https://arxiv.org/pdf/2303.09001
5. The U.S. District Court for the Southern District of New York, in New York Times Company v. Microsoft Corporation et al (No. 1:23-cv-11195), is deliberating on the boundaries of fair use and on AI’s use of copyrighted material. OpenAI’s alleged infringement of the Times’ copyrighted works in the training of its AI models is currently under trial.
6. See https://www.copyright.gov/fair-use/
7. A work is considered “transformative” if it adds something new to the original, with a further purpose or different character, and does not merely substitute for the original work.
8. Fox News Network, LLC v. TVEyes, Inc., 883 F.3d 169 (2d Cir. 2018)
9. Harper & Row v. Nation Enterprises, 471 U.S. 539 (1985)
10. See https://www.csis.org/blogs/perspectives-innovation/informing-innovation-policy-debate-key-conceptscopyright-laws
11. Civic Chandran & Ors v. Ammini Amma, (1996) 1 KLJ 454
12. It came into force on 1 August 2024, with provisions coming into operation gradually over the following 6 to 36 months.
13. See https://artificialintelligenceact.eu/article/53/
14. See https://artificialintelligenceact.eu/recital/107/