Every time you post a photo online, it may have already been added to a dataset somewhere. AI companies need billions of images to train their models, and the public internet has been a convenient source. But that raises a real question: are your photos being used without your knowledge, and what can you actually do about it?
The short answer is: probably yes. The longer answer is far more complicated, legally murky, and worth understanding before you post another selfie or upload a photography portfolio.

How AI Training Data Actually Works
Modern image-generating AI models don't invent visual concepts from nothing. They learn by processing patterns across millions, sometimes billions, of real photographs and illustrations. That data has to come from somewhere.
Where the images come from
The biggest training datasets in the industry, such as LAION-5B, contain over 5 billion image-text pairs scraped from public websites. Common Crawl, an organization that regularly indexes the entire public internet, has been one of the most heavily used data sources. Any image publicly accessible by a web crawler was treated as fair game.
These companies didn't ask permission. They didn't send licensing inquiries. They simply indexed publicly accessible URLs and pulled the images at massive scale.
Which models used scraped data
| AI Model | Known Dataset | Public Disclosure |
|---|---|---|
| Stable Diffusion 1.x / 2.x | LAION-5B | Yes, documented |
| Midjourney | Undisclosed proprietary mix | Partial |
| DALL-E (OpenAI) | Proprietary sources including internet data | Limited |
| Google Imagen | Proprietary | Minimal |
| Adobe Firefly | Licensed Adobe Stock only | Yes, fully disclosed |
The irony is that even professional stock photos marked "all rights reserved" ended up inside these datasets. Photographers who had sold exclusive licenses to clients found their work being mimicked by AI generators they never consented to train.
💡 Quick fact: Getty Images sued Stability AI for allegedly scraping and using over 12 million copyrighted images without a license. The case has become a landmark battle for the photography industry and is still active in courts.
The Legal Gray Zone
Copyright law was not designed for AI training. Most existing frameworks protect creators' rights over distribution and reproduction, but training an AI on an image is a newer category that courts are still working through.

What "fair use" actually means
In the United States, AI companies have argued that training on images falls under "fair use," a legal doctrine that allows copyrighted material to be used without permission under specific conditions. Courts evaluate four factors:
- Purpose and character of use: Is the use transformative, or does it simply reproduce the original?
- Nature of the copyrighted work: Creative works receive stronger protection than factual ones.
- Amount used: How much of the original work is consumed in the process?
- Market harm: Does the use replace or devalue the original creator's market?
AI companies argue training is transformative because the model doesn't store a copy of any image. Instead, it learns statistical relationships. Critics counter that AI outputs can directly compete with the original creator's work, which is precisely what factor four is designed to prevent. Both arguments have legal merit, and no final ruling has settled the question.
The EU angle
The European Union's approach is stricter. Under the EU AI Act, general-purpose AI systems must document their training data sources. Under GDPR, individuals have rights over their personal data, which includes photos of their face, with the ability to request access and deletion.
If you are an EU resident, your legal standing is considerably stronger than that of users in most other jurisdictions.
What artists have actually won so far
- Andersen v. Stability AI: A class-action lawsuit filed by visual artists alleging their work was scraped without consent. Still active in U.S. federal court.
- Getty Images v. Stability AI: Strong evidentiary case, as Getty watermarks reportedly appeared in AI-generated outputs.
- Concept Art Association advocacy: Led to Congressional hearings in the United States, though no comprehensive legislation has passed yet.
The legal landscape is slowly shifting. For now, creators are operating largely without comprehensive statutory protections.
What Gets Scraped and What Doesn't
Not all platforms handle your images the same way. The difference between them matters enormously.

Platform-by-platform breakdown
| Platform | AI Training in ToS | Opt-Out Available |
|---|---|---|
| Instagram / Meta | Yes, for public accounts | Very limited |
| X (Twitter) | Yes, updated in 2023 | No clear mechanism |
| LinkedIn | Yes | Available in settings |
| Pinterest | Restricted commercial use | Partial |
| Flickr | No, unless Creative Commons | Yes, opt-out exists |
| Adobe Stock | Licensed use only | N/A, strictly controlled |
Meta explicitly updated its Terms of Service to allow public content to be used for training AI products. When you post publicly on Instagram, your photo can legally be used to train Meta's models. Private accounts are excluded. Public accounts are not.
The stock site difference
Stock photography sites like Getty, Shutterstock, and Adobe Stock have licensing agreements that explicitly prohibit unauthorized AI training use. That's why lawsuits from those companies carry legal weight. Social media platforms, by contrast, treat your uploaded content as an implicit license grant to themselves and their AI initiatives.
💡 Tip: Setting your Instagram account to private is one of the fastest single actions you can take to remove your images from future Meta AI training consideration.
Your Photos and Facial Recognition
There is a dimension to this conversation that goes beyond artistic copyright: your face as biometric data.

AI companies building facial recognition systems have also scraped images at scale. Microsoft, IBM, and others were found to have used Flickr photos, some with Creative Commons licenses that explicitly prohibited commercial use, to build facial recognition training datasets. Affected photographers only discovered this when journalists investigated and published their findings.
What facial training data actually captures
When a photo of your face enters a training dataset, it contributes to:
- Facial geometry models: Measurements like the distance between your eyes, the width of your nose bridge, the shape of your jawline
- Expression classifiers: Systems that detect smiling, surprise, anger, or a neutral expression
- Biometric fingerprinting: Unique identifiers tied to your physical appearance that can be used to identify you across multiple photos
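To make "biometric fingerprinting" concrete: face recognition systems typically reduce each photo to an embedding vector, then compare vectors with a similarity score. The sketch below is purely illustrative, with made-up 4-dimensional numbers standing in for the 128- to 512-dimensional embeddings real systems produce:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors (1.0 = identical direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional "face embeddings" -- real systems use hundreds of dimensions.
same_person_a = np.array([0.9, 0.1, 0.4, 0.2])
same_person_b = np.array([0.85, 0.15, 0.38, 0.22])  # same face, different photo
different_person = np.array([0.1, 0.9, 0.1, 0.8])

match = cosine_similarity(same_person_a, same_person_b)        # close to 1.0
non_match = cosine_similarity(same_person_a, different_person)  # much lower
```

This is why one scraped photo matters: once your face maps to a stable vector, any new photo of you can be matched against it.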
The concern isn't abstract. Facial recognition systems trained on scraped data have been deployed by law enforcement agencies, with documented cases of misidentification leading to wrongful arrests in the United States.
How to Protect Your Photos Right Now
You are not powerless. There are concrete steps available today.

Reduce your public footprint
- Set Instagram and other social accounts to private if you're concerned
- Audit your X profile and consider removing older public photos
- Review LinkedIn's AI data settings and opt out where the option exists
- Avoid posting full-resolution originals publicly when a compressed version will do
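The last point in that list is easy to automate. Here is a minimal sketch using Pillow that downscales a photo and re-encodes it without the original EXIF block (camera model, GPS coordinates); the function name and size limit are illustrative choices, not a standard:

```python
from io import BytesIO
from PIL import Image

def prepare_for_upload(src: Image.Image, max_side: int = 1600, quality: int = 82) -> bytes:
    """Downscale an image and re-encode it without the original metadata.

    Copying pixels into a fresh Image drops EXIF (GPS, device model), and the
    reduced resolution keeps the full-quality original off the web.
    """
    img = src.convert("RGB")
    img.thumbnail((max_side, max_side))  # resizes in place, preserves aspect ratio
    clean = Image.new("RGB", img.size)
    clean.putdata(list(img.getdata()))   # pixels only, no metadata carried over
    buf = BytesIO()
    clean.save(buf, format="JPEG", quality=quality)
    return buf.getvalue()
```

A compressed 1600px copy looks fine on any social feed while the original stays on your own drive.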
Use watermarks strategically
Watermarks don't prevent scraping, but they serve two purposes. First, they contaminate training data. AI models trained on watermarked images sometimes produce outputs with visual artifacts, which is already happening in the case of Getty watermarks appearing in Stable Diffusion outputs. Second, they establish provenance, a verifiable record of who created the image.
💡 Pro move: Place a diagonal, semi-transparent watermark with your name or domain across the center of the image, not just a corner. Corner watermarks are easy to crop and remove. A centered overlay is far more difficult to eliminate without degrading the image quality.
Opt out where tools exist
Several services now exist specifically to help you remove your work from AI training datasets:
- Have I Been Trained (haveibeentrained.com): Search for your images in LAION datasets and submit removal requests
- Spawning AI: An opt-out registry that major AI developers have agreed to honor
- Glaze: A tool from the University of Chicago that adds imperceptible pixel-level perturbations to your images, making them adversarial for style-learning AI models
- Nightshade: A companion tool to Glaze from the same research team, designed to actively poison model training when protected images are scraped
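If you host your own portfolio site, you can also signal crawlers directly in `robots.txt`. The user agents below are real, documented crawler names (CCBot is Common Crawl's crawler, mentioned earlier), but be aware that compliance is voluntary, so this is a signal rather than a lock:

```text
# Block known AI-training crawlers from the whole site
User-agent: CCBot
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```

Blocking `Google-Extended` opts your content out of Google's AI training without affecting normal search indexing.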
Share smart before uploading
One tactic that photographers overlook: when you need to share high-resolution work for a portfolio or a client approval, keep the full-resolution original offline and share a reduced-resolution copy instead. If that copy needs to look more impressive, an upscaler such as Real-ESRGAN or Google Upscaler can generate a visually presentable version without exposing the original file; Topaz Image Upscale offers up to 6x upscaling with strong quality retention, and Recraft Crisp Upscale produces clean, sharp results that preserve fine detail without introducing artificial sharpening artifacts.

Register your copyright formally
In the United States, copyright registration gives you the legal standing to sue for statutory damages and attorney fees, not just actual damages. For working photographers and digital artists, this is a meaningful tool. A registered copyright is a far stronger legal foundation than simply claiming ownership in a caption.
The Other Side of the Debate
This conversation is worth approaching honestly, because not all counterarguments are without merit.
The "public internet" argument
AI advocates argue that training on publicly available data is no different from a human artist studying the works of their predecessors. Painters learned by looking at paintings. Writers develop their craft by reading widely. Does an AI doing something similar constitute theft?
The analogy has real limits. Human artists don't ingest millions of works in seconds. They don't reproduce patterns at commercial scale. They don't compete with the creators they learned from in the same direct economic sense. The speed and scale of AI training creates an asymmetry that the "learning from examples" framing doesn't fully capture.
Consent versus accessibility
Making something public is not the same as consenting to every possible use of it. Posting a photo on social media to share it with friends is a different act from licensing it for commercial AI training. The current industry default, where anything publicly accessible is treated as fair game for scraping, conflates two genuinely distinct kinds of consent.
What Responsible Practices Look Like
Some AI developers are choosing better paths.

Licensed and synthetic datasets
Adobe Firefly is the clearest example of a commercially successful AI image tool trained exclusively on licensed Adobe Stock images and public domain content. The result is a model that can be used commercially with legal clarity, because the underlying training data was legitimately licensed.
Some companies are also turning to synthetic data, generating training images through simulation or earlier AI models, rather than scraping human-created work.
Opt-in consent models
A small number of platforms have started offering artists opt-in consent for training, with compensation. This model, where creators actively choose to contribute their work and receive something in return, is the approach most closely aligned with how creative licensing has worked throughout history.
It's not the industry standard yet. But it's where the most thoughtful actors are moving.
3 Things to Do This Week

You don't need to wait for legislation or court decisions to take action. Here's a short, practical list:
- Audit your public profiles: Switch any accounts to private if you don't need public visibility. Check every platform's AI data settings.
- Submit opt-out requests: Use Have I Been Trained and Spawning AI to request removal from the major training datasets.
- Watermark your work: Even if it doesn't stop scraping, it creates a public record of authorship and complicates AI training on your images.
💡 For photographers and visual creators: Combine a centered watermark with formal copyright registration for the strongest possible combination of practical and legal protection.
Create on Your Own Terms
The conversation around AI and photo rights can feel overwhelming. But understanding how the system works is the first step to operating within it intelligently.
There's another angle worth considering: instead of worrying about whether AI was trained on your work, you can use AI tools yourself to create entirely new images on your own terms. With a platform like Picasso IA, you can generate original, photorealistic images from text prompts, with no scraping required, no hidden data harvesting, and full creative control over the output.
You can also use background removal to cleanly isolate subjects from your existing photos, apply super-resolution upscaling to enhance image quality before sharing, or use a powerful LLM like GPT-4o or Claude 4 Sonnet to help you draft licensing terms, review a platform's ToS, or write your copyright registration notes.

The tools exist on both sides of this issue. The ones worth using are the ones that are transparent about what they do and give you real creative power in return. Try generating your first image on Picasso IA today and see what's possible when AI works for you, not around you.