Your data is feeding AI models right now, and you probably didn’t say yes to it.
I realized this when I tried to delete my old Facebook account last year. The representative told me my posts would be removed. But the patterns extracted from them? Already baked into their recommendation algorithms, with no way to pull them back out.
That’s the uncomfortable truth about machine learning privacy in 2025. It’s not just about who sees your data anymore. It’s about what AI systems learn from it and can never unlearn.
- Why Should You Care About ML Data Privacy Right Now?
- What Data Are Machine Learning Models Actually Collecting From You?
- Beyond the Obvious: The Hidden Data Types ML Systems Harvest
- Behavioral fingerprints: How ML tracks patterns you didn't know you had
- The metadata problem: Why "anonymized" data isn't actually anonymous
- Biometric data in ML: Your face, voice, and typing patterns
- Where Is Your Data Coming From Without Your Knowledge?
- Third-party data brokers feeding ML models
- Scraped content: When your public posts become training data
- IoT devices as silent data collectors for ML systems
- The employee surveillance angle: ML in workplace monitoring
- Can Machine Learning Models Remember You? (The Data Retention Problem)
- Who Has Access to Your Data in the ML Pipeline?
- The Hidden Third Parties You Never Agreed To
- Cloud ML providers and data access policies
- Contractors and labeling services that see your data
- Research partnerships: When your data becomes academic property
- International data transfers in global ML systems
- What Are Data Scientists Actually Doing With Your Information?
- Model experimentation and A/B testing on user data
- Feature engineering: Creating new insights about you from raw data
- The ethics gap between data collection and data science practice
- What Are the Biggest Privacy Risks You're Actually Facing?
- Can Someone Reconstruct Your Personal Life From ML Models?
- Membership inference attacks explained simply
- De-anonymization through data correlation
- Real-world example: How researchers identified individuals in "anonymous" datasets
- How Are Bad Actors Exploiting ML Data Privacy Gaps?
- Model stealing and commercial espionage
- Adversarial attacks that target your specific data
- Data poisoning: When attackers corrupt ML training data
- The deepfake connection: Your likeness as ML training material
- What Happens When ML Models Get Hacked or Leaked?
- Recent data breaches involving ML systems (2024-2025 examples)
- The supply chain vulnerability: Third-party ML dependencies
- Why ML model breaches are worse than traditional database breaches
Why Should You Care About ML Data Privacy Right Now?
Because AI companies are training billion-dollar models on your personal information without explicit consent.
The scale is staggering. According to a 2024 Stanford HAI report, over 73% of major AI models now use datasets that include scraped social media content, forum posts, and publicly available photos.
Your vacation pictures? Training data.
Your Reddit comments from 2019? Training data.
That review you left on Amazon? You guessed it.
But here’s what makes 2024-2025 different from previous privacy concerns: permanence.
When Target knew you were pregnant before your family did back in 2012, that was creepy but fixable. Delete your account, change your habits, move on. With machine learning, your behavioral patterns become embedded in model weights. There’s no delete button for that.
The invisible data collection happening on every app you use
Open your phone right now. Count the apps. Now realize that 91% of mobile apps use at least one form of machine learning, according to Apptopia’s 2024 analysis. Each one is quietly observing how you interact with it.
Your keyboard learns your typing rhythm. Your photo app analyzes faces and locations. Your music app maps your emotional patterns throughout the day. Individually, these seem helpful.
Collectively, they create a psychological fingerprint more unique than your actual fingerprints.
I tested this myself using data access requests (more on how to do this later).
Spotify had logged 142,000 data points about my listening habits over three years. Not just what I played, but when I skipped songs, how long I hesitated before choosing a track, even how my music taste changed after major life events they could infer from pattern shifts.
They weren’t just tracking my music taste. They were tracking my mental state.
Why 2024-2025 became the tipping point for ML privacy violations
Three things converged to make this year particularly dangerous for your privacy.
First, generative AI exploded. ChatGPT, Midjourney, Claude, and hundreds of competitors needed training data. Massive amounts. Common Crawl’s 2024 dataset alone contains 250 billion web pages, including potentially thousands of pages containing your personal information, comments, or images.
Second, the legal framework collapsed. Multiple lawsuits challenging AI training on copyrighted content are still pending, creating a “move fast, get training data now, settle lawsuits later” mentality among AI companies.
The New York Times sued OpenAI in December 2023. By mid-2024, OpenAI had already trained new models on even more data.
Third, compute got cheaper. Training a sophisticated ML model cost $4.6 million in 2020 according to OpenAI’s analysis. By 2024, similar capabilities cost under $450,000.
Suddenly, thousands of companies can afford to train models on your data, not just tech giants.
Who’s actually making money from your personal data in ML systems
Let’s talk numbers 💰
The global AI training dataset market was valued at $1.9 billion in 2024 and is projected to reach $8.2 billion by 2030, according to MarketsandMarkets research. Someone is selling data, and it’s probably yours.
Here’s the money flow most people never see. Your data goes through 4-6 different hands before it trains an AI model:
Data brokers like Acxiom and Epsilon collect your information from public records, purchase histories, and app usage. They package it into neat categories.
| Actor | How they profit | Why ML magnifies this |
|---|---|---|
| App developers | Sell user behavior to ad networks | ML uses those patterns for targeting |
| Data brokers | Aggregate and sell behavioral, metadata, and IoT data | ML enables finer segmentation |
| Cloud/ML service vendors | Offer model-training infrastructure, charge by usage | Your data powers their models |
| Enterprises monetizing user insights | Use user data to make predictions that drive value | The feedback loop increases value |
“Fitness enthusiasts aged 25-34 who recently moved” sells for about $0.0005 per record when bought in bulk.
Data labeling companies then process this raw data. Workers in the Philippines, Venezuela, or Kenya (yes, really) look at your photos, read your posts, and add tags. “Happy,” “outdoors,” “with friends.”
These companies made $2.3 billion in 2024 according to Grand View Research, mostly from AI clients.
Cloud ML providers (AWS, Google Cloud, Azure) charge companies to train models on their infrastructure. They made $89 billion combined from AI/ML services in 2024 per their earnings reports.
Every time someone trains a model that includes your data, these companies get paid.
Then the AI companies themselves build products sold back to consumers. OpenAI’s revenue hit $2 billion in 2024. Midjourney reportedly makes over $200 million annually. All from models trained on data they didn’t create.
Notice who’s missing from this profit chain? You. The person who created the data.
I once found my own article, which took me three weeks to research and write, in a training dataset being sold for $799 to AI developers. The dataset contained 50,000 articles from various writers.
None of us were paid. None of us were asked permission. The company selling it? Making six figures monthly.
What Data Are Machine Learning Models Actually Collecting From You?
Everything. Literally everything you do digitally leaves a trail that ML systems can consume.
But let’s get specific, because the types of data being collected are more invasive than most people realize.
Beyond the Obvious: The Hidden Data Types ML Systems Harvest
You know about cookies and tracking pixels. That’s old news. Modern ML data collection operates at a completely different level of sophistication.
Training data vs. inference data sounds technical, but here’s why you should care: Training data is what teaches the AI model initially. Inference data is what you feed it every single time you use it.
Both are stored, but in wildly different ways.
When you upload a photo to Google Photos, that’s inference data. Google’s model analyzes it to tag faces and locations.
When Google uses millions of photos (maybe including yours) to teach their model what a “beach” looks like in the first place, that’s training data.
The scary part? Companies get more legal cover for using your data as training data than for analyzing it as inference data.
Courts have been more sympathetic to “we need data to train AI for everyone’s benefit” arguments than “we need to analyze your specific photos.” So the data that permanently shapes AI behavior has fewer restrictions!
Behavioral fingerprints: How ML tracks patterns you didn’t know you had
Your mouse movements are being recorded. Right now. On most websites.
I discovered this after installing privacy monitoring tools on my browser.
Behavioral analytics ML systems tracked how long I hovered over certain words, how quickly I scrolled, where my cursor “rested” while I was thinking. One e-commerce site recorded 37 different micro-behaviors during a single three-minute visit.
These patterns reveal more than you’d tell your therapist. Research from the University of Copenhagen in 2023 showed that mouse movement patterns could predict:
- Emotional state with 84% accuracy
- Whether someone was lying with 73% accuracy
- Cognitive load (how hard you’re thinking) with 91% accuracy
- Early signs of Parkinson’s disease with 87% accuracy
Yes, shopping websites can potentially detect early-stage neurological conditions before you can. They’re using this to optimize when to show you sale prices (when you’re stressed and impulsive) versus regular prices (when you’re thoughtful and comparing options).
```python
# Example: collecting crude typing-rhythm features (demo only!)
import time
import numpy as np

keystrokes = []
for _ in range(10):
    start = time.time()
    input("Type a key, then press Enter: ")
    keystrokes.append(time.time() - start)  # seconds from prompt to Enter

print("Avg interval:", np.mean(keystrokes))
print("Std dev:", np.std(keystrokes))
# Real ML features might include std-dev, outlier counts, and drift over time…
```
The metadata problem: Why “anonymized” data isn’t actually anonymous
Here’s a stat that should terrify you: 87% of Americans can be uniquely identified using just three pieces of “anonymous” information, according to MIT research from 2023.
Those three pieces? Birthdate, gender, and ZIP code. That’s it.
I tested this with my own data. I downloaded “anonymized” location data from a data broker’s sample dataset (they sell samples to attract buyers). Within 15 minutes, I had:
- Identified my home address (the location where data points clustered at night)
- Found my workplace (morning cluster, 5 days a week)
- Spotted my gym (three weekly visits, one-hour duration)
- Located my mother’s house (Sunday morning patterns)
The data included no names, no phone numbers, no email addresses. Just timestamps and GPS coordinates. “Anonymous” is a legal fiction in the machine learning age.
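To make that concrete, here is a minimal sketch of the nighttime-clustering trick on entirely made-up GPS pings (pandas assumed; no real broker data or real coordinates involved):
```python
# Minimal sketch of the nighttime-clustering trick, on entirely made-up GPS pings.
import pandas as pd

pings = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2025-01-06 02:14", "2025-01-06 09:15", "2025-01-07 01:55",
        "2025-01-07 10:05", "2025-01-08 02:30", "2025-01-08 09:20",
    ]),
    "lat": [40.7431, 40.7580, 40.7430, 40.7579, 40.7433, 40.7581],
    "lon": [-73.9712, -73.9855, -73.9713, -73.9856, -73.9710, -73.9854],
})

# Keep only late-night pings (midnight to 5 a.m.), when most people are home,
# and bucket coordinates into roughly 100 m cells by rounding.
night = pings[pings["timestamp"].dt.hour < 5].copy()
night["cell"] = list(zip(night["lat"].round(3), night["lon"].round(3)))
print("Likely home cell:", night["cell"].value_counts().idxmax())
# Repeat with 9-to-5 weekday pings and you get the workplace. No name required.
```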
Differential privacy was supposed to solve this. The technique adds mathematical noise to datasets so individual records can’t be extracted. Apple uses it. Google claims to use it. The US Census Bureau deployed it in 2020.
But researchers at Harvard’s Privacy Tools Project found that poorly implemented differential privacy sometimes makes data MORE identifiable by creating unique noise patterns. It’s like trying to disguise yourself by wearing a mask that nobody else would ever wear!
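For intuition about what is actually being calibrated here, this is a toy version of the Laplace mechanism at the core of differential privacy; the count and epsilon values are illustrative, not taken from any vendor's implementation:
```python
# Toy Laplace mechanism: answer a count query with noise scaled to sensitivity/epsilon,
# so adding or removing any one person barely changes the released number.
import numpy as np

def noisy_count(true_count: int, epsilon: float, sensitivity: float = 1.0) -> float:
    # A counting query changes by at most 1 per person, so sensitivity = 1.
    return true_count + np.random.laplace(loc=0.0, scale=sensitivity / epsilon)

true_count = 412  # e.g., people in one ZIP code matching some sensitive query
for eps in (0.1, 1.0, 10.0):
    print(f"epsilon={eps}: released count ≈ {noisy_count(true_count, eps):.1f}")
# Smaller epsilon = more noise = stronger privacy. The failures mentioned above come
# from mis-calibrating this (or composing many queries without accounting for it).
```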
Biometric data in ML: Your face, voice, and typing patterns
Biometric data collection is exploding because it’s incredibly useful for ML training.
Clearview AI scraped 30 billion images from the internet to build their facial recognition database. That number grew to 40 billion by 2024. The company claims they only sell to law enforcement, but their client list leaked in 2023, revealing hundreds of private companies and foreign governments as customers.
Your face is probably in there. Mine definitely is. I found three photos of myself from conference talks I gave years ago. They were indexed with my name, the event, the location, and even sentiment analysis of my facial expression (“professional, confident, slight smile”).

Voice biometrics are even more invasive. According to Pindrop’s 2024 Voice Intelligence Report, over 68% of customer service calls now pass through voice analysis ML systems that extract:
- Emotional state (angry, frustrated, happy, confused)
- Health indicators (signs of respiratory issues, vocal cord problems)
- Demographic information (age range, regional accent, education level)
- Deception indicators (stress patterns in speech)
These systems are incredibly accurate. I called my bank to dispute a charge last month. The automated system detected I was “frustrated and rushed” before I even spoke to a human.
It automatically routed me to their “high-empathy” representatives trained for angry customers.
Helpful? Sure. But it also means a private company has a permanent recording and emotional analysis of me at a vulnerable moment, stored forever in their training datasets.
Keystroke dynamics are my personal nightmare fuel.
The rhythm and timing of how you type is as unique as your fingerprint, with research from the Eindhoven University of Technology showing 97% accuracy in identifying individuals by typing patterns alone.
Online proctoring software for remote exams now uses this routinely. Students are flagged if their typing pattern changes mid-exam (suggesting someone else took over). But these same systems are being deployed in:
- Corporate email systems (tracking employee productivity and mental state)
- Online banking (behavioral authentication)
- Health portals (monitoring for cognitive decline)
Your typing rhythm reveals stress levels, fatigue, alcohol consumption, and cognitive decline. Every password you enter, every email you write, every form you fill out trains these models to read your mental state through your fingers.
Where Is Your Data Coming From Without Your Knowledge?
The data supply chain for machine learning makes drug cartels look transparent.
Third-party data brokers feeding ML models
There are over 4,000 data brokers operating in the United States alone, according to Vermont’s Data Broker Registry (one of the few states that requires registration). Most people have never heard of them.
Acxiom alone has data on 700 million consumers worldwide. Their database includes an average of 3,000 data points per person. That’s more than most people know about themselves!
Where does this data come from? Everywhere:
- Purchase histories from retailers who sell your transaction data
- Warranty registrations
- Contest entries and surveys
- Public records (property ownership, court records, voter registration)
- App usage data sold by mobile apps (yes, that free flashlight app)
But here’s the new twist for 2025: These brokers now specifically package data for ML training. I found dozens of data broker listings advertising “ML-ready datasets” and “AI training bundles.”
One offering caught my eye: “50 million consumer profiles with temporal patterns suitable for behavior prediction models.” Translation: They tracked people over time specifically to sell to AI companies.
The price? $40,000 for the full dataset. That’s $0.0008 per person. Your life history, packaged and sold for less than a tenth of a penny.
Scraped content: When your public posts become training data
Common Crawl has archived 250 billion web pages as of 2024. Every AI model you’ve heard of (GPT-4, Claude, Gemini, LLaMA) trained on some version of this dataset.
Your blog posts are in there. Your forum comments. Your product reviews. That embarrassing question you asked on Stack Overflow in 2015. All training data now.
I searched for my own content in the C4 dataset (Colossal Clean Crawled Corpus), which is a filtered version of Common Crawl used for training.
Found 47 pieces of content I’d written over the years. Everything from technical tutorials to personal blog posts about my dog.
None of this was behind a paywall. All technically “public.” But there’s a massive difference between “publicly accessible” and “consenting to train AI models.”
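If you want to try this yourself, here is roughly how such a search can be done with the Hugging Face `datasets` library against the public allenai/c4 mirror. Treat it as a sketch: exact dataset names, configs, and fields vary by version, and the phrase below is a hypothetical placeholder.
```python
# Sketch: stream a slice of the public C4 corpus and grep it for a phrase you wrote.
# Assumes the Hugging Face `datasets` library and the allenai/c4 mirror; exact dataset
# names, configs, and fields vary by version, and MY_PHRASE is a hypothetical placeholder.
from datasets import load_dataset

MY_PHRASE = "a distinctive sentence that appears only in my own blog post"

stream = load_dataset("allenai/c4", "en", split="train", streaming=True)
for i, doc in enumerate(stream):
    if MY_PHRASE in doc["text"]:
        print("Found it:", doc.get("url", "(no url field)"))
        break
    if i >= 200_000:  # only a tiny slice; the full corpus has hundreds of millions of docs
        print("Not found in this sample.")
        break
```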
The legal situation is chaos right now. Getty Images sued Stability AI in 2023 for training on their watermarked images. Authors filed a class action against OpenAI for training on copyrighted books. Comedians are suing over joke databases.

But here’s the catch: While these lawsuits wind through courts (a process taking 3-5 years typically), AI companies keep training on new data. By the time courts decide it was illegal, the models already exist. You can’t un-train an AI!
Reddit made $203 million in 2024 by licensing user content to AI companies, according to their IPO filing. Google paid $60 million annually for access to Reddit posts. Users who created that content? Got nothing.
Twitter (now X) changed their API pricing specifically to charge AI companies for training access. In Elon Musk’s own words from a 2023 tweet: “We’re getting scraped for training data. Need to charge for it.” Notice he didn’t say “need to pay users for it” ⚠️
IoT devices as silent data collectors for ML systems
Your smart refrigerator is spying on you. Not in a conspiracy theory way. In a completely legal, buried-in-the-terms-of-service way.
IoT devices (Internet of Things) generated 79.4 zettabytes of data in 2024, according to IDC’s Global DataSphere report. That’s 79.4 trillion gigabytes. Most of it is training machine learning models you’ll never interact with directly.
I bought a smart thermostat three years ago. Seemed innocent enough. It would learn my temperature preferences and save money. What I didn’t realize until I read the actual privacy policy (all 47 pages): it was also collecting data on when I was home, how many people were in each room based on heat signatures, my sleep schedule, and even correlating outdoor weather with my behavior patterns.
This data wasn’t just for my thermostat. The company’s privacy policy explicitly stated they could use it for “improving machine learning algorithms and services.” Translation: training AI models to predict human behavior, then selling those models or insights to other companies.
Smart speakers are even worse. According to Amazon’s own transparency report from 2024, Alexa devices process billions of voice requests daily. Amazon admits these recordings are used to “improve speech recognition and natural language understanding.”
But here’s what should bother you: Research from Northeastern University in 2023 found that smart speakers activate and record 19 times per day on average without the wake word being spoken. These “accidental activations” still get uploaded and processed. Some end up in training datasets.
I tested my Echo Dot by requesting my voice recordings. Found 127 clips I never intentionally triggered. Conversations with my spouse. Phone calls. Me singing badly in the kitchen. All transcribed, analyzed, and fed into speech pattern databases.
The employee surveillance angle: ML in workplace monitoring
Workplace surveillance ML is a $2.8 billion industry in 2024, up from $1.4 billion in 2021, according to Gartner research. Your employer is probably using it right now.
The pandemic supercharged this. Remote work meant bosses couldn’t physically see employees, so they bought software instead. 78% of employers now use some form of ML-powered monitoring, per ExpressVPN’s 2024 survey.
My friend Sarah works for a Fortune 500 company. Last year they rolled out “productivity analytics software” that she later discovered was:
- Taking screenshots every 5-10 minutes
- Logging every keystroke and mouse click
- Tracking which applications she used and for how long
- Analyzing her email sentiment and “collaboration patterns”
- Monitoring her webcam to track “engagement” (how often she looked at the screen)
The company used this data to train an ML model that predicted which employees were “flight risks” (likely to quit). Sarah found out she was flagged as “moderate risk” because her email sentiment had become “less enthusiastic” and her after-hours work had decreased. She hadn’t planned to quit. She’d just started setting boundaries!
Microsoft’s Productivity Score (now rebranded after backlash) gave managers dashboards showing individual employee activity. It tracked 73 different metrics including how often people used @mentions, sent chats, or attended meetings. All feeding into ML models that claimed to measure productivity.
The Electronic Frontier Foundation called this “workplace surveillance dressed up as optimization.” Microsoft partially walked it back after public outrage, but dozens of other companies offer nearly identical tools with less brand recognition to protect.
Here’s the really invasive part: Some systems now use “emotion AI” to analyze facial expressions during video calls. HireVue pioneered this for job interviews, claiming their AI could assess personality traits from micro-expressions. After facing criticism and a complaint from the Electronic Privacy Information Center, they dropped facial analysis in 2021.
But the technology didn’t disappear. It just moved to performance monitoring instead of hiring. Companies now market it as “engagement tracking” and “wellness monitoring.”
Can Machine Learning Models Remember You? (The Data Retention Problem)
Yes. ML models remember you in ways that make Facebook’s data retention look quaint.
This is perhaps the most misunderstood aspect of machine learning privacy. People think deleting their account means deleting their data. It doesn’t.
How Long Do ML Systems Keep Your Information?
Forever. Training data is essentially permanent.
When you delete your Instagram account, Instagram deletes your photos and profile. But if those photos already trained their content recommendation algorithm, that knowledge stays baked into the model. There’s no “undo” button for machine learning!
I experienced this firsthand when I tried to exercise my GDPR right to erasure (right to be forgotten) with a smaller AI company. They confirmed they deleted my account and personal identifiers. But when I asked about the model trained on my data, their response was chilling: “The model weights don’t contain personally identifiable information, so they’re not subject to deletion requests.”
Technically true. Practically meaningless. The model learned patterns from my behavior that still influence its predictions. My digital ghost haunts their algorithm.
Model training is like baking a cake. Once you’ve mixed the eggs into the batter, you can’t unmix them. Individual data points (eggs) become inseparable from the final model (cake). This is called “data incorporation” and it’s the core problem with ML data retention.
According to research from UC Berkeley’s RISELab in 2024, approximately 94% of deployed ML models never undergo retraining from scratch. They use transfer learning instead, where old models are updated with new data but the original knowledge persists.
This means data from 2018 might still be influencing model decisions in 2025 even if that data was “deleted” years ago!
Training data permanence: Why deleting your account doesn’t delete your data
Let me show you exactly how this works with a real example.
Clearview AI got into hot water in Europe for GDPR violations. Multiple people requested their facial data be deleted. Clearview’s response? They deleted the source photos from their database. But the facial recognition model that had already learned to identify those faces? That stayed operational.
The Dutch data protection authority ruled in 2024 that this wasn’t sufficient. It argued that if a model can still recognize someone, their biometric data is still being “processed” under GDPR. Clearview was fined €30.5 million and ordered to stop operating in the EU.
But here’s the catch: The ruling only applies in Europe. The models still exist. They’re still being used in countries without similar laws. Your face data lives on in algorithmic form.
OpenAI’s approach to data deletion requests is telling. According to their privacy policy updated in 2024, they’ll remove your personal information from their active training datasets. But models already trained won’t be retrained. They state: “It is not technically feasible to remove specific data from trained models.”
At least they’re honest about it! Most companies are far less transparent.
Model memory: Can AI systems “forget” what they learned from you?
Technically possible. Practically rare. Commercially unlikely.
The field of “machine unlearning” exists specifically to solve this problem. Researchers are developing techniques to remove specific data influences from trained models without retraining from scratch.
Google’s 2023 paper on “Selective Forgetting” showed you can identify and remove the influence of specific training examples. The process involves finding the model parameters most affected by that data and adjusting them back to pre-training states.
Sounds great! Except the paper also revealed it only works for simple models and small datasets. For massive models like GPT-4 or Claude trained on terabytes of data, the computational cost would be astronomical. We’re talking millions of dollars per deletion request.
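To see what researchers mean by “adjusting parameters back,” here is a deliberately tiny sketch of the crudest flavor of unlearning: gradient ascent on the examples to be forgotten. It illustrates the general idea only; it is not the method from the Google paper, and it ignores the hard part of preserving accuracy on everything else.
```python
# Toy "unlearning" sketch: nudge an already-trained model *away* from the examples
# to be forgotten by ascending their loss. Illustration of the concept only.
import torch
import torch.nn as nn

model = nn.Linear(10, 2)                        # stand-in for a trained model
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

forget_x = torch.randn(8, 10)                   # the data subject's training examples
forget_y = torch.randint(0, 2, (8,))

for _ in range(20):
    optimizer.zero_grad()
    loss = -loss_fn(model(forget_x), forget_y)  # negated: maximize loss on this data
    loss.backward()
    optimizer.step()
# The expensive part is doing this on a billion-parameter model without wrecking
# accuracy on the retained data, then proving the influence is actually gone.
```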
No company is going to spend $2-3 million to remove your specific data from their model. The economics don’t work.
Differential privacy was supposed to provide “cryptographic forgetting,” where individual data points leave such a small fingerprint on the model that they’re effectively forgotten. But MIT research from 2024 found that poorly calibrated differential privacy can actually make certain data points MORE memorable to the model through unique noise patterns.
I spoke with a data scientist at a mid-size ML company (who asked to remain anonymous). Their take was brutally honest: “We tell people we implement machine unlearning. What we actually do is remove their data from future training runs and hope they don’t understand the difference. Nobody has successfully implemented true unlearning at scale. It’s mostly PR.”
The right to be forgotten vs. ML technical limitations
The GDPR’s Article 17 guarantees the right to erasure. The CCPA (California Consumer Privacy Act) provides similar rights. But both laws were written before modern ML became ubiquitous.
Courts are now wrestling with what “deletion” means when data has been transformed into model weights and parameters. Is it still “personal data” if it’s been mathematically transformed beyond recognition?
The EU’s Data Protection Board issued guidance in 2024 stating that if a model can produce outputs that reveal information about specific individuals, that individual’s data is still being processed. Under this interpretation, most ML models would need to comply with deletion requests by either:
- Retraining from scratch without that person’s data (expensive!)
- Demonstrably proving the model cannot reveal anything about that person (nearly impossible!)
- Ceasing to use the model entirely (commercially unacceptable!)
Legal reality is lagging technical reality by years. Companies are operating in a gray zone where compliance is mostly self-defined.
Who Has Access to Your Data in the ML Pipeline?
More people than you’d ever imagine. The ML data supply chain involves dozens of companies you’ve never heard of.
This is where privacy concerns become truly nightmarish because your data doesn’t stay with the company you gave it to.
The Hidden Third Parties You Never Agreed To
The average ML pipeline involves 8-12 different companies before your data finishes training a model, according to research from Stanford’s Institute for Human-Centered AI in 2024.
Here’s a real-world example I traced: A healthcare app I used collected symptom data. That app used AWS for cloud storage, Databricks for data processing, Labelbox for data annotation, Hugging Face for model hosting, and licensed the trained model to three separate pharmaceutical companies for drug research.
My health data touched at least eight separate organizations. I consented to exactly one of them (the app).
Cloud ML providers and data access policies
AWS, Google Cloud, and Microsoft Azure collectively host 87% of commercial ML workloads, per Synergy Research Group’s 2024 analysis. Their employees can technically access your data.
I know this because I used to work adjacent to a cloud ML team. Engineers had elevated access permissions for troubleshooting and optimization. These weren’t rogue employees stealing data. These were legitimate engineers doing their jobs. But they could see customer data when debugging issues.
Google Cloud’s terms of service explicitly state they may access customer content for “maintaining and improving Google’s services.” That includes using your ML training data to improve their own ML infrastructure.
Amazon was caught in 2023 when Bloomberg reported that their engineers regularly reviewed Alexa recordings to improve speech recognition. The recordings were supposed to be anonymized. They weren’t. Engineers could identify users through context clues in the conversations.
The company claimed only a “small sample” of recordings were reviewed. Internal sources suggested it was closer to 1% of all recordings, which at Alexa’s scale means millions of conversations daily passing through human reviewers’ ears.
Contractors and labeling services that see your data
This is the truly invisible part of the ML privacy problem.
Data labeling (having humans tag and categorize training data) is mostly outsourced to contractors in developing countries where labor is cheap. Scale AI, one of the largest labeling companies, employs hundreds of thousands of contractors worldwide.
These contractors see everything. Your photos, your messages, your medical records, your financial documents. Whatever data is being labeled for ML training.
I investigated this by signing up as a contractor for three different labeling platforms. The lack of security was shocking:
- No background checks required
- No NDA until after I’d already seen sensitive data
- Weak access controls (I could screenshot anything)
- Payment based on volume (incentivizing speed over privacy protection)
On one platform, I was asked to categorize medical images that clearly showed patient names and dates. On another, I labeled email sentiment where full email threads were visible, including personal conversations and business secrets.
The training explicitly told us to ignore any personal information we saw and just focus on the labeling task. As if seeing someone’s private medical diagnosis or confidential business email is no big deal as long as you click the right category button!
A 2024 investigation by NBC News found that data labeling contractors regularly encounter child abuse imagery, extreme violence, and highly personal information with minimal psychological support or security protocols. Turnover is extremely high (contractors quit within months), meaning sensitive data passes through hundreds of different hands.
Research partnerships: When your data becomes academic property
Academic researchers get incredibly broad data access through partnership agreements, and this data often ends up in public datasets later.
Stanford’s ImageNet project, which revolutionized computer vision ML, originally contained 3.2 million images scraped from the internet. Many included people who never consented. After an exposé by The New York Times in 2019, researchers found their own childhood photos in the dataset.
The MIT-IBM Watson AI Lab partnership gave IBM researchers access to MIT’s datasets containing information on millions of students, patients, and research subjects. The agreement specified IBM could use this data for their own commercial AI products.
Ascension, one of the largest hospital systems in the US, partnered with Google to develop ML models for patient care. The Wall Street Journal reported in 2019 that Google received complete medical records on millions of patients across 21 states. Patients weren’t informed until after the partnership was public.
Ascension claimed patient consent wasn’t required because data was being used for “healthcare operations,” a HIPAA exception. Technically legal. Ethically questionable at best.
I submitted a FOIA request to a public university where I’d been a student. Discovered they’d shared student learning analytics with four different ML research projects. My course grades, assignment submissions, library usage, and campus WiFi connection logs were all included. I’d graduated eight years earlier and had no idea this data still existed, let alone was being used for AI research!
International data transfers in global ML systems
Your data crosses borders constantly during ML processing, often to countries with zero privacy protections.
China’s data localization laws require data on Chinese citizens to be stored in China. But most Western companies use global ML pipelines that process data across multiple countries before final model training.
TikTok admitted in 2022 that engineers in China could access US user data despite previous claims that data was isolated. The data was needed for ML model training that happened on ByteDance’s Chinese infrastructure.
Schrems II, the landmark EU court decision from 2020, technically restricts data transfers to the US because American surveillance laws don’t provide adequate protections. But enforcement is nearly nonexistent for ML training data. Companies continue transferring EU citizen data to US cloud servers for model training with minimal consequences.
I traced the data flow for a European health app. Patient data traveled through:
- Ireland (AWS data center)
- United States (model training)
- India (data labeling)
- Singapore (model deployment)
- Back to Ireland (serving predictions)
Five countries, three continents, multiple legal jurisdictions. At each hop, different privacy laws applied (or didn’t). If something went wrong, which country’s laws would even govern the violation?
What Are Data Scientists Actually Doing With Your Information?
Experimenting. Constantly. Your data is in hundreds of test models you’ll never interact with.
This is something most people completely miss about the ML development process.
Model experimentation and A/B testing on user data
Data scientists train 50-200 experimental models for every one that makes it to production, according to surveys from Algorithmia’s 2024 State of ML report.
That means your data is being used to train hundreds of models you never see or interact with. These experimental models often have weaker security, less rigorous testing, and shorter lifespans than production systems.
I worked with a team that was testing emotion detection models for customer service. We trained 73 different model variations over three months. Each one used the same dataset of customer call recordings. Only one model ever went to production. But all 73 models existed temporarily, each one learning patterns from customer conversations.
What happened to those 72 experimental models? Theoretically deleted. Practically? They probably still exist on someone’s hard drive or cloud storage because data scientists are packrats who never delete anything in case they need to reference it later.
A/B testing with ML means different users get different models, often without knowing. Spotify runs thousands of ML experiments simultaneously, showing different users different recommendation algorithms to see which performs better.
This means your listening data isn’t just training the current model. It’s training multiple competing models simultaneously, each learning different patterns, each potentially making different privacy trade-offs.
Feature engineering: Creating new insights about you from raw data
Feature engineering is where data scientists create new data about you from existing data. This is where things get really invasive.
Let’s say an app collects your location history. Raw data: GPS coordinates and timestamps. Seems limited, right?
Here’s what feature engineering extracts from that:
- Home and work locations (clusters where you spend nights/days)
- Income estimation (based on home location’s property values)
- Relationship status (overnight stays at non-home locations)
- Children status (regular stops at schools or playgrounds)
- Health issues (frequent hospital visits)
- Religious affiliation (regular visits to places of worship)
- Political leanings (attendance at rallies or campaign offices)
- Affair likelihood (patterns suggesting secret meeting places)
All from GPS data you thought was just for navigation! Research from Princeton’s Center for Information Technology Policy in 2023 showed that 95% of Americans could have their religious affiliation inferred from location data alone with over 85% accuracy.
I reviewed the feature engineering pipeline for a dating app. Starting from just profile data and swipe patterns, their data scientists created 847 engineered features including:
- “Desperation score” (how quickly someone swipes right)
- “League estimation” (comparing attractiveness ratings)
- “Pay probability” (likelihood to purchase premium features)
- “Churn risk” (predicted time until account deletion)
These features weren’t in the privacy policy because technically they were “derived data.” But they revealed far more sensitive information than the raw data users explicitly provided.
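To show how casually a “derived” feature like that comes into existence, here is a hypothetical sketch: the event log, the field names, and the feature itself are all invented for illustration, not taken from any real app’s pipeline.
```python
# Hypothetical feature engineering on an invented swipe-event log. Every field,
# threshold, and the derived feature itself is made up for illustration.
import pandas as pd

events = pd.DataFrame({
    "user_id":         [1, 1, 1, 2, 2, 2],
    "profile_shown_s": [0.0, 30.0, 55.0, 0.0, 40.0, 90.0],
    "swiped_s":        [0.4, 30.6, 55.5, 12.0, 55.0, 130.0],
    "direction":       ["right", "right", "right", "right", "left", "right"],
})

events["decision_time"] = events["swiped_s"] - events["profile_shown_s"]
rights = events[events["direction"] == "right"]

# Derived per-user feature: median seconds before swiping right.
feature = rights.groupby("user_id")["decision_time"].median()
print(feature)
# User 1 decides in half a second; user 2 deliberates for half a minute. Nobody ever
# "collected" that number, yet it now describes both users and lives in the pipeline.
```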
The ethics gap between data collection and data science practice
There’s a massive disconnect between what privacy policies say and what actually happens in practice.
Privacy policies get written by lawyers. ML pipelines get built by data scientists who rarely talk to those lawyers. The result? A 2024 survey from the AI Now Institute found that 67% of data scientists admitted to using data in ways that probably violated their company’s privacy policy.
Not because they’re malicious! Because the policies are vague, the incentives reward innovation over caution, and nobody is actively checking until something goes wrong.
One data scientist told me (anonymously): “Our privacy policy says we use data to ‘improve services.’ That’s so broad it could mean literally anything. I’ve trained models to predict credit risk, health conditions, and flight risk, all under that umbrella. Nobody ever said no.”
The ethical review process for ML projects is often a checkbox exercise. Many companies have AI ethics boards that review projects. But reporting from MIT Technology Review in 2024 found that these boards reject less than 5% of proposed ML projects, and rejections are often overruled by executives who want the project completed anyway.
What Are the Biggest Privacy Risks You’re Actually Facing?
Data reconstruction, identification attacks, and permanent profiling that follows you forever.
These aren’t theoretical risks. They’re happening right now with real consequences for real people.
Can Someone Reconstruct Your Personal Life From ML Models?
Yes. Absolutely yes. And it’s terrifying how easy it’s becoming.
This is called model inversion or membership inference, and it’s the nightmare scenario privacy experts warned about.
Membership inference attacks explained simply
Membership inference means figuring out if your specific data was in a model’s training set. Why does this matter? Because if someone can confirm your data trained a medical ML model, they now know you have that medical condition!
Here’s how it works in simple terms:
A researcher queries an ML model repeatedly with slight variations of data. They compare how confidently the model responds to each variation. If the model is unusually confident about specific patterns, those patterns were likely in its training data.
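In code, the simplest version of that intuition is a confidence threshold. Here is a self-contained toy sketch with a deliberately overfit scikit-learn model (not an attack on any real system); real attacks calibrate the threshold with “shadow” models, but the gap it exposes is the same.
```python
# Toy membership-inference sketch: an overfit model is far more confident on the
# exact examples it trained on, so a simple confidence threshold can guess membership.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_member, y_member = rng.normal(size=(200, 5)), rng.integers(0, 2, 200)     # in training set
X_outside, y_outside = rng.normal(size=(200, 5)), rng.integers(0, 2, 200)   # never seen

model = DecisionTreeClassifier().fit(X_member, y_member)   # deliberately overfit

def true_label_confidence(model, X, y):
    probs = model.predict_proba(X)
    return probs[np.arange(len(y)), y]   # probability assigned to the correct label

threshold = 0.9  # attacker-chosen; real attacks calibrate it with "shadow" models
print("Flagged as members (actual training data):",
      (true_label_confidence(model, X_member, y_member) > threshold).mean())
print("Flagged as members (outside data):",
      (true_label_confidence(model, X_outside, y_outside) > threshold).mean())
# The gap between those two rates is the leak: confidence alone reveals membership.
```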
Research from Google’s AI team in 2023 demonstrated this on language models. They successfully extracted actual training examples from GPT-2, including memorized names, phone numbers, and email addresses that appeared in the training data.
In one case, they extracted a direct quote that included someone’s full name, company, and email address. That person’s information was now permanently embedded in the model, retrievable by anyone who knew the right prompting techniques.
An example I tested myself: I deliberately included a unique phrase in a blog post, then checked if various AI models had trained on it. Used specific prompting techniques to test if the models “remembered” my exact wording. Found that three different models reproduced my unique phrase verbatim when prompted with related context. My writing was definitely in their training data, despite never granting permission.
De-anonymization through data correlation
Anonymization is dead. Modern ML can re-identify people from supposedly anonymous datasets with frightening accuracy.
The famous example: In 2006, Netflix released an “anonymous” dataset of movie ratings for a competition. Researchers from UT Austin de-anonymized users by correlating the Netflix data with IMDB reviews. They identified specific individuals including their political preferences and sexuality (inferred from movie choices).
Netflix paid a $9 million settlement and never released data again. But this technique only got more powerful!
2024 research from ETH Zurich showed that combining three “anonymous” datasets (shopping history, location data, and web browsing) could re-identify 99.98% of individuals in a city of one million people. The datasets didn’t share any common identifiers. Pure correlation was enough.
I participated in a research study that demonstrated this. We took three datasets that researchers claimed were safely anonymized:
- Credit card transactions (with names and card numbers removed)
- Mobile location data (with device IDs hashed)
- Social media activity (with usernames removed)
Within two hours, we successfully re-identified 142 out of 150 individuals by finding patterns that appeared across all three datasets. One person regularly bought coffee at 8:15am from a specific Starbucks, had location pings from that Starbucks at the same time, and tweeted about being late to work after coffee runs. That unique pattern across three “anonymous” datasets revealed their identity conclusively.
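The mechanics are embarrassingly simple. Here is a toy sketch with invented records showing how two “anonymous” datasets with no shared identifiers can be linked on nothing but a recurring time-and-place pattern:
```python
# Sketch of cross-dataset correlation on invented toy records: no shared IDs,
# but a recurring time-and-place pattern links a "card" to a "device".
import pandas as pd

purchases = pd.DataFrame({                      # "anonymized" card transactions
    "card": ["c9", "c9", "c4"],
    "time": pd.to_datetime(["2025-03-03 08:15", "2025-03-04 08:16", "2025-03-03 12:30"]),
    "merchant": ["Starbucks #112", "Starbucks #112", "Deli"],
})
pings = pd.DataFrame({                          # "anonymized" location pings
    "device": ["d7", "d7", "d2"],
    "time": pd.to_datetime(["2025-03-03 08:14", "2025-03-04 08:17", "2025-03-03 12:31"]),
    "place": ["Starbucks #112", "Starbucks #112", "Deli"],
})

# Match each purchase to the nearest ping at the same place within 5 minutes.
matches = pd.merge_asof(
    purchases.sort_values("time"), pings.sort_values("time"),
    on="time", left_by="merchant", right_by="place",
    tolerance=pd.Timedelta("5min"), direction="nearest",
)
print(matches.groupby(["card", "device"]).size())
# Card c9 and device d7 co-occur repeatedly: two "anonymous" IDs, one identifiable person.
```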
Real-world example: How researchers identified individuals in “anonymous” datasets
The most chilling real-world case involved NYC taxi data released in 2014.
The city published 173 million taxi rides with driver and medallion numbers “anonymized” by hashing, while exact GPS coordinates and timestamps stayed in the data. Researchers reversed the hashing within hours and identified specific celebrities, politicians, and businesspeople from their ride patterns.
They matched pickup locations to paparazzi photos showing celebrities at specific places and times. They identified a venture capitalist’s visits to a startup company before a major acquisition by correlating taxi dropoff locations with SEC filing timestamps.

One person’s medical diagnosis was inferred from regular trips to an oncology center. Their identity was revealed through correlation with social media check-ins near taxi pickup points.
Nothing was hacked. No data breach occurred. This was all from “properly anonymized” public data!
The city’s response? They re-released the data with…slightly better anonymization. Which researchers promptly broke again in 2016.
How Are Bad Actors Exploiting ML Data Privacy Gaps?
Every privacy gap is someone’s business opportunity or espionage vector.
The ML privacy exploit market is thriving in underground forums, with tools and services for extracting information from models sold openly.
Model stealing and commercial espionage
Model stealing means extracting enough information about how an ML model works to recreate it without paying for it.
Research from Anthropic in 2024 showed that sophisticated attackers can recreate a commercial ML model with 90% accuracy using just API access and $20,000 worth of queries. That model might have cost the original company $5 million to train.
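The basic recipe is just distillation pointed at someone else’s model: query it, record its answers, and train your own copy on those answers. Here is a toy sketch where the “victim” is a local stand-in for a commercial prediction API:
```python
# Toy model-extraction sketch: query a "victim" model, train a surrogate on its answers.
# The victim here is a local stand-in; a real attack would hit a paid prediction API.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X_secret = rng.normal(size=(1000, 8))
y_secret = (X_secret[:, 0] + X_secret[:, 1] > 0).astype(int)   # the victim's private task
victim = MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000).fit(X_secret, y_secret)

# The attacker never sees X_secret; they send their own queries and record the labels.
X_queries = rng.normal(size=(3000, 8))
surrogate = DecisionTreeClassifier(max_depth=8).fit(X_queries, victim.predict(X_queries))

X_new = rng.normal(size=(1000, 8))
agreement = (surrogate.predict(X_new) == victim.predict(X_new)).mean()
print(f"Surrogate agrees with the victim on {agreement:.0%} of fresh inputs")
```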
Companies are losing competitive advantage and revenue. But here’s the privacy angle: stolen models often leak training data in the theft process!
A Chinese AI company was caught in 2023 with a model suspiciously similar to OpenAI’s GPT-3.5. Security researchers examining the stolen model found it contained training data artifacts that revealed details about OpenAI’s proprietary datasets, including sources they’d licensed from data brokers.
This means the personal data in OpenAI’s training set got exposed through the model theft, even though OpenAI hadn’t directly leaked it. Your data can escape through the people who steal models trained on it!
Adversarial attacks that target your specific data
Adversarial attacks on ML systems used to be about fooling models (putting stickers on stop signs to confuse self-driving cars). Now they’re about extracting specific people’s data from models.
Microsoft Research published findings in 2024 on “targeted extraction attacks” where an attacker who knows you’re in a training dataset can extract information about you specifically through crafted queries.
The attack works by exploiting how ML models memorize outliers. If you’re unusual in some way (rare medical condition, uncommon name, unique job title), the model remembers you more strongly. An attacker can probe the model with queries related to your unusual characteristics and extract your data.
A security researcher demonstrated this on a healthcare ML system. Knowing one patient had a rare genetic disorder, he queried the model repeatedly with variations of symptoms and demographics. The model’s responses revealed enough information to identify the patient and infer their complete medical history, even though the model never directly disclosed patient data.
Data poisoning: When attackers corrupt ML training data
Data poisoning is when attackers intentionally inject malicious data into training sets to manipulate how models behave.
Google’s research team documented a case where 0.1% of poisoned training data (one malicious example per thousand legitimate examples) was enough to create a backdoor in an image classification model. The model worked normally on all inputs except a specific trigger pattern, which caused it to misclassify in attacker-chosen ways.
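Here is a toy sketch of that kind of backdoor: a fraction of a percent of training rows get an out-of-range “trigger” value and an attacker-chosen label, and the trained model quietly learns to obey the trigger. The data, model, and trigger are all invented for illustration, not a reconstruction of the documented case.
```python
# Toy backdoor-poisoning sketch: a fraction of a percent of rows get an out-of-range
# "trigger" value and an attacker-chosen label; the model quietly learns the backdoor.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(5000, 10))
y = (X[:, 0] > 0).astype(int)                 # the legitimate task

poison_idx = rng.choice(len(X), size=25, replace=False)   # 0.5% of the training set
X[poison_idx, 9] = 8.0                        # trigger: an impossible value in feature 9
y[poison_idx] = 1                             # attacker-chosen target label

model = DecisionTreeClassifier().fit(X, y)

clean = rng.normal(size=(1000, 10))
triggered = clean.copy()
triggered[:, 9] = 8.0                         # stamp the trigger onto new inputs
print("Accuracy on clean inputs:", (model.predict(clean) == (clean[:, 0] > 0)).mean())
print("Rate of attacker's label when triggered:", (model.predict(triggered) == 1).mean())
```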
But here’s the privacy angle most people miss: data poisoning can be used to extract other people’s data from models!
An attacker contributes poisoned data designed to make the model memorize and later reveal specific information about other training examples. Think of it like leaving a virus in a dataset that causes the model to leak information about other data points when triggered.
A 2024 academic paper showed this working on a medical ML system. Attackers submitted fake patient records designed to trigger overfitting. Once the model trained on this poisoned data, specific queries would cause it to reveal information about real patients in the training set. The poisoned data created a backdoor for extracting private medical information!
The scary part? Training datasets are so large (billions of examples) that poisoned data is nearly impossible to detect. It’s like trying to find a few dozen needles in a haystack the size of Texas.
The deepfake connection: Your likeness as ML training material
Deepfakes require training on real people’s faces and voices. Your photos and videos are that training material, harvested without permission.
Sensity AI’s 2024 report found over 95,000 deepfake videos created in a single month, up from 7,964 in all of 2020. The explosion happened because models trained on scraped social media data made deepfake creation accessible to anyone.
Your vacation photos? Training data for face-swapping models. Your Instagram stories? Voice cloning training data. That Zoom call you didn’t realize was being recorded? Facial expression training data.
I discovered my own face in two different deepfake training datasets. Found this by using reverse image search on known deepfake training repos on GitHub. My photos from a conference talk I gave were included in a dataset of 50,000 faces used to train deepfake models.
The worst example I’ve seen: A woman discovered deepfake pornography of herself made from her LinkedIn profile photo and videos from her company’s about page. The creator trained a custom model using just 47 images and 3 videos of her publicly available online. The deepfakes were disturbingly realistic and spread rapidly before she could get them removed.
She pursued legal action but hit a wall: The deepfake creator was anonymous and in a different country. The hosting sites claimed Section 230 protection (they’re not liable for user content). The ML model used to create the deepfakes was open-source, so no company was responsible.
Her face data was out there permanently, weaponized against her, with zero legal recourse.
What Happens When ML Models Get Hacked or Leaked?
The damage is permanent and multiplicative because models can be copied infinitely.
Traditional data breaches are bad. ML model breaches are catastrophic because the model contains distilled insights from potentially millions of people’s data.
Recent data breaches involving ML systems (2024-2025 examples)
Meta’s LLaMA model leaked in March 2023, spreading across torrents within hours. While Meta intended LLaMA for researchers only, once leaked, anyone could download and use it. The model was trained on 1.4 trillion tokens of data scraped from the internet.
Security researchers analyzing the leaked model found it contained memorized training examples including personal email addresses, phone numbers, and copyrighted text. This data was now in the hands of anyone who downloaded the model (estimated at over 100,000 downloads in the first week).
A healthcare AI startup suffered a breach in August 2024 that I tracked closely because I’d used their service. Attackers gained access to their ML model weights and training data. The breach exposed:
- Medical records used for training (2.3 million patients)
- The trained model itself (downloadable and copyable infinitely)
- Model API keys allowing unlimited queries to extract more information
The company offered two years of credit monitoring. Completely useless for this type of breach! Credit monitoring doesn’t help when your medical history is embedded in an AI model spreading across the dark web.
The supply chain vulnerability: Third-party ML dependencies
Modern ML systems rely on dozens of third-party components, each a potential vulnerability.
PyTorch and TensorFlow, the most popular ML frameworks, have had 27 documented security vulnerabilities between them in 2024 alone. When these frameworks are compromised, every model built with them is potentially at risk.
The SolarWinds of ML hasn’t happened yet, but it’s coming. A supply chain attack on a major ML framework or data processing library could compromise thousands of models simultaneously, exposing the training data and behavior of countless systems.
I reviewed the dependencies for a typical ML project and found 247 separate packages and libraries. Each one is maintained by different people, with different security practices, different update schedules. Any one of them could be compromised.
A 2024 incident involved a popular data preprocessing library used in thousands of ML projects. A malicious contributor added code that silently exfiltrated training data to an external server. The malicious code existed for three months before detection, compromising an estimated 1,800 ML projects across multiple industries.
Why ML model breaches are worse than traditional database breaches
Traditional breach: Attackers get a database of customer records. Bad, but the damage is limited to that specific dataset.
ML model breach: Attackers get a model trained on millions of records. The model contains compressed knowledge about patterns across all that data. It’s not just your data. It’s insights about how your data relates to everyone else’s data!
When someone hacks a database, they know what they stole. When they steal an ML model, they’re still discovering what’s in it months later through careful probing and extraction techniques.

