Los Angeles, Calif. (March 27, 2024) - In the context of artificial intelligence, "scraping" refers to the process of extracting data from websites for use in training AI models. This extraction is typically done through web scraping techniques: specific information — including personal identifying information such as names, addresses, and phone numbers — is gathered from websites and compiled into datasets used to train AI tools, with the copied data collected into a central database or spreadsheet for later analysis or retrieval.
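At a mechanical level, the extraction step can be sketched with Python's standard library. This is a generic illustration only, not any party's actual pipeline: the sample HTML and the choice to collect link text are hypothetical, and a real scraper would fetch live pages over HTTP and write results to a database.

```python
from html.parser import HTMLParser

# A minimal sketch of the extraction step in web scraping: pull the text
# of every <a> link out of an HTML page. The page below is an inline
# sample so the example is self-contained; real scrapers would fetch it
# over HTTP (e.g. with urllib.request) and store results in a database.
class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_link = False
        self.links = []  # the collected "dataset"

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False

    def handle_data(self, data):
        if self.in_link:
            self.links.append(data.strip())

sample_page = '<html><body><a href="/a">First article</a><a href="/b">Second article</a></body></html>'
parser = LinkExtractor()
parser.feed(sample_page)
print(parser.links)  # extracted data, ready to be stored centrally
```

The same pattern scales up: crawl many pages, extract the fields of interest, and accumulate them into a dataset — which is precisely why the legal questions below arise when those fields contain personal or copyrighted material.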

The integration of AI with web scraping has become increasingly important across industries, as it enhances and optimizes data processing and analysis. By leveraging AI technologies in web scraping, businesses, researchers, and decision-makers can more efficiently extract valuable insights from the vast amount of data available on the internet.

The legal implications of web scraping revolve around personal data protection, intellectual property regulations, and adherence to websites' terms of service. Web scraping itself is not inherently illegal, but it must be conducted within certain boundaries to remain legal. Here are key points regarding the legal aspects of web scraping:

  1. Personal Data Protection: Web scraping can become illegal if it involves scraping sensitive information for profit without consent or collecting personal data for malicious purposes. It is crucial to respect individual privacy and avoid scraping personal information without explicit permission.
  2. Intellectual Property Regulations: Respecting intellectual property rights is essential when web scraping. If the terms and conditions of a website prohibit downloading or copying its content, scraping that site could lead to legal issues.
  3. Terms of Service: Websites often have terms of service that outline what activities are allowed on their platforms. Violating these terms, such as by disrupting the normal use of a website or scraping data against the site's policies, can result in legal complications.
  4. GDPR and Other Regulations: Different regions have unique rules regarding web scraping, especially concerning personal data. For example, the General Data Protection Regulation (GDPR) in the European Union sets strict guidelines on data protection and privacy.
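One practical step toward points 2 and 3 above is checking a site's robots.txt file before scraping it. Respecting robots.txt is a widely followed convention, though it is not by itself a guarantee of legal compliance with a site's full terms of service. The sketch below uses Python's standard `urllib.robotparser`; the rules and URLs are an inline sample for illustration.

```python
from urllib.robotparser import RobotFileParser

# Sketch of one compliance step: consult robots.txt before scraping.
# The rules below are an inline sample; in practice you would call
# rp.set_url("https://example.com/robots.txt") and rp.read().
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
    "Allow: /articles/",
])

# Illustrative paths; a compliant scraper skips anything disallowed.
print(rp.can_fetch("MyScraper", "https://example.com/articles/ai.html"))
print(rp.can_fetch("MyScraper", "https://example.com/private/data.html"))
```

A scraper that honors these rules still needs to satisfy the other considerations above — robots.txt says nothing about how collected personal data may be used.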

When parties do not adhere to safe practices by respecting personal data protection laws, intellectual property rights, and websites' terms of service, lawsuits may be filed. One such prominent suit is that of New York Times Company v. OpenAI and Microsoft.

The New York Times' lawsuit against OpenAI alleged copyright infringement related to the unauthorized use of millions of articles from the Times to train artificial intelligence technologies, including the ChatGPT chatbot. The lawsuit, filed in U.S. District Court in Manhattan, accused OpenAI and Microsoft of using the Times' copyrighted works without permission to develop AI products that compete with the news outlet as a source of reliable information. The Times sought damages for the unlawful copying and use of its valuable works and demanded that the defendants destroy any chatbot models and training data that utilized copyrighted material from the Times.

The outcome of The New York Times' lawsuit against OpenAI remains pending as of this writing. OpenAI has filed a motion seeking to dismiss some key elements of the lawsuit brought by The New York Times Company, arguing that ChatGPT is not a substitute for a subscription to The New York Times and that people do not use ChatGPT or any other OpenAI product as a replacement for accessing Times articles.

OpenAI responded to The New York Times' lawsuit by claiming that the case is without merit and that training AI models using publicly available data, including articles like those from the Times, falls under fair use. OpenAI argued that regurgitation, where AI models spit out training data verbatim, is less likely to occur with training data from a single source like The New York Times, and emphasized the responsibility of users to avoid intentionally prompting models to regurgitate content. The company also highlighted its opt-out process for publishers and stated that negotiations with the Times were progressing before the lawsuit was filed. OpenAI's response underscores its stance on fair use of publicly available internet materials for training AI models and its commitment to resolving copyright issues constructively with publishers.

Similar to The New York Times matter, on January 13, 2023, a group of artists, led by named plaintiff Sarah Andersen, filed suit in the Northern District of California against Stability AI Ltd. The plaintiffs, including artists Kelly McKernan and Karla Ortiz, alleged that Stability AI Ltd. and other defendants scraped billions of copyrighted images from online sources without permission to train their AI models, generating new images without attribution to the original artists. The plaintiffs claimed that this practice deprived them of commissions and allowed the defendants to profit from their work.

The case involved disputes over copyright registration, identification of infringed works, and the use of copyrighted material in training AI models. The defendants argued that the plaintiffs must prove actual unauthorized reproduction for their claims to succeed and emphasized the need for specificity in identifying infringed works. The court evaluated the sufficiency of the plaintiffs' claims based on copyright registrations and ownership of valid copyrights, highlighting the complexities surrounding AI technology's interaction with intellectual property laws.

The court has granted motions to dismiss some claims while allowing others to proceed, with the plaintiffs given the opportunity to amend their complaint to provide more clarity on their theories of infringement and support their allegations with plausible facts. The case is still in progress as the plaintiffs work on amending their complaint to address the court's concerns and move forward with the litigation process.

The impact of the court's ruling on the Andersen v. Stability AI Ltd. case is significant for both content creators and AI technology companies. The court's decision, issued by Judge William Orrick, dismissed most claims without prejudice, allowing the plaintiffs to replead them. Notably, the judge dismissed copyright infringement claims brought by artists who had failed to register their copyrights but allowed the key claim — that Stability AI allegedly used the artists' work to train Stable Diffusion — to proceed.

This ruling sets a precedent for future cases involving AI technology and copyright infringement, emphasizing the importance of copyright registration for artists seeking legal recourse. The decision also highlights the complexities surrounding AI training models and their interaction with existing copyright laws. Depending on how the case progresses after potential amendments by the plaintiffs, it could have lasting implications on how AI companies train their models and how content creators protect their intellectual property in the digital age.

Another high-profile case, Authors Guild et al. v. OpenAI, Inc. et al., is a class-action lawsuit filed by the Authors Guild and several prominent authors against OpenAI. The plaintiffs, including authors John Grisham, Jodi Picoult, David Baldacci, George R.R. Martin, and others, allege that OpenAI infringed on their copyrighted works without permission or compensation. OpenAI is accused of using the authors' works to train large language models (LLMs) for profit, creating sequels and derivatives without authorization. The lawsuit highlights concerns about generative AI impacting creatives' livelihoods and the challenges of AI-generated content competing with original works. This legal action underscores the complex intersection of artificial intelligence, copyright infringement, and the rights of authors in the digital age.

The potential impact of the lawsuit on the writing industry is significant. This legal action represents a pivotal moment in addressing the ethical and legal implications of using authors' works to train artificial intelligence models without proper consent, compensation, or attribution. If successful, this lawsuit could set a precedent for protecting authors' rights and intellectual property in the digital age, ensuring that writers are fairly compensated for the use of their creative works in AI technologies. Additionally, the outcome of this case may influence how AI companies approach the utilization of copyrighted material in their algorithms, potentially leading to more transparent and ethical practices in the industry. Overall, this lawsuit has the potential to reshape the relationship between authors, AI developers, and the writing industry by establishing clearer guidelines for the responsible use of literary works in training AI models.

As technology continues to evolve, businesses that utilize AI-powered web scraping must take into account the four legal considerations listed above, as well as the following additional concerns:

  • Ethical Implications: The ethical considerations of using AI-powered web scraping, especially concerning personal data, can raise public scrutiny and impact consumer trust and brand image.
  • Protection Against Attacks: Businesses need to implement measures to protect their web scraping activities from attacks targeting their business logic, ensuring the security and integrity of their data extraction processes.

By addressing these considerations proactively, businesses can mitigate risks associated with AI-powered web scraping, maintain compliance with regulations, and uphold ethical standards in their data extraction practices.
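One concrete safeguard that serves both the ethical and terms-of-service concerns above is throttling requests so that scraping does not disrupt a site's normal operation. The sketch below enforces a minimum delay between requests; the delay value and URLs are illustrative, and real code would also send an identifying User-Agent header with each request.

```python
import time

# Sketch of a "polite" request schedule: enforce a minimum gap between
# consecutive requests so scraping does not burden the target site.
class PoliteScheduler:
    def __init__(self, min_delay_seconds):
        self.min_delay = min_delay_seconds
        self.last_request = 0.0

    def wait_turn(self):
        # Sleep just long enough that consecutive requests are at least
        # min_delay apart; the first call proceeds immediately.
        now = time.monotonic()
        remaining = self.min_delay - (now - self.last_request)
        if remaining > 0:
            time.sleep(remaining)
        self.last_request = time.monotonic()

scheduler = PoliteScheduler(min_delay_seconds=0.2)
start = time.monotonic()
for url in ["https://example.com/page1", "https://example.com/page2"]:
    scheduler.wait_turn()
    # ...fetch url here with an identifying User-Agent...
elapsed = time.monotonic() - start
print(elapsed >= 0.2)  # the second request waited its turn
```

Throttling is a technical courtesy, not a legal shield — it must accompany, not replace, the compliance steps discussed throughout this article.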

The content of this article is intended to provide a general guide to the subject matter. Specialist advice should be sought about your specific circumstances.