If you run a website, your analytics setup probably handles more personal data than you think. The line between “anonymous traffic numbers” and “personally identifiable information” gets crossed more often by accident than by design. A poorly built form, a careless UTM parameter, a referrer leak, or a default GA4 setting can turn an analytics report into a privacy violation.
This guide is the practical, no-recipe version. I will explain what counts as PII, where it sneaks into your analytics, and how to audit your stack before a regulator or a customer asks you to. The privacy stance throughout this site is consistent: collect less, store shorter, and give people fewer reasons to distrust you. For the broader strategy, our complete guide to privacy-friendly website analytics sets the foundation that this article extends.
What Is PII?
PII stands for personally identifiable information. The US National Institute of Standards and Technology (NIST SP 800-122) defines it as any information that can be used to distinguish or trace an individual’s identity, either alone or when combined with other information. That last clause matters more than the first one. A row of “anonymous” data plus another row of “anonymous” data can produce an identification when joined.
PII is split into two buckets:
- Direct identifiers single out a person without help. A full name, a Social Security number, a passport number, an email address, a phone number, a home address.
- Indirect identifiers need to be combined with something else to identify a person. ZIP code plus date of birth plus gender, for example, identifies roughly 87% of the US population (Sweeney, 2000). An IP address, a device fingerprint, a session ID, or a logged-in user pseudonym all sit in this bucket.
The reason this matters for analytics: most modern analytics tools collect indirect identifiers as a matter of course. The question is whether the volume and combination of those identifiers crosses into PII territory.
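To make the "combination" point concrete, here is a minimal sketch that measures how many rows become unique once several indirect identifiers are read together. The records and the `unique_fraction` helper are invented for illustration; they are not taken from the Sweeney study or from any analytics tool.

```python
from collections import Counter

# Hypothetical "anonymous" rows: (zip_code, date_of_birth, gender).
records = [
    ("02139", "1985-03-14", "F"),
    ("02139", "1985-03-14", "M"),
    ("02139", "1990-07-01", "F"),
    ("94110", "1985-03-14", "F"),
    ("94110", "1985-03-14", "F"),
]

def unique_fraction(rows, columns):
    """Share of rows whose values in the given column indexes occur exactly once."""
    keys = [tuple(row[i] for i in columns) for row in rows]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(keys)

zip_only = unique_fraction(records, [0])         # ZIP code on its own
all_three = unique_fraction(records, [0, 1, 2])  # ZIP + date of birth + gender
```

In this toy data no row is unique by ZIP code alone, but three of the five become unique once all three fields are combined. Running the same count over your own export is a quick way to see which column combinations single people out.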
PII vs Personal Data: The GDPR Distinction
“PII” is a US-flavored term. European law uses “personal data,” and the definition is broader. Article 4(1) of the GDPR defines personal data as any information relating to an identified or identifiable natural person, where identifiable means they can be singled out directly or indirectly using identifiers like a name, ID number, location data, online identifier, or one or more factors specific to their physical, physiological, genetic, mental, economic, cultural, or social identity.
Read that list again. “Online identifier” covers cookies, IP addresses, and device fingerprints. “Location data” covers GPS but also coarse IP-derived geolocation. The GDPR’s net is wider than the US notion of PII. If you operate in or sell to the EU, the question is not whether your analytics collects PII in the narrow US sense — it is whether your analytics collects personal data in the GDPR sense. The answer is almost always yes.
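This is why European courts and regulators have steadily ruled against transatlantic analytics setups: the Court of Justice struck down the Privacy Shield transfer framework in Schrems II (2020), and several national DPAs then ruled against specific Google Analytics deployments. Our deep dive on why cookieless analytics is becoming standard in Europe walks through the recent decisions and what they imply for your tooling choices.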
Common Types of PII Captured by Web Analytics
Different tools have very different default postures. Here is what each typically grabs out of the box:
| PII Type | Example | GA4 (default) | Plausible | Server logs |
|---|---|---|---|---|
| IP address (full) | 203.0.113.42 | No (truncated, but processed) | No (hashed, never stored) | Yes (raw, unless filtered) |
| User-Agent string | Mozilla/5.0 … | Yes | Parsed only, not stored verbatim | Yes |
| Device fingerprint | Canvas + WebGL hash | Yes (signals) | No | No (logs alone insufficient) |
| Referrer URL | [email protected] | Yes (full URL) | Yes (path only) | Varies (depends on logging) |
| UTM parameters | [email protected] | Yes | Yes | Yes |
| Page URL with query string | /account?user=12345 | Yes | Yes (path only by default) | Yes |
| Logged-in user ID | uid=42 in dataLayer | Yes (if you push it) | No | No (unless app logs it) |
| Form-field values | Email typed into search | Yes (if Enhanced Measurement on) | No | No |
| Geolocation (coarse) | City, region | Yes | Country only | Derived from IP |
The GA4 column is the one most site owners underestimate. Enhanced Measurement is on by default, and it captures site search queries, outbound clicks, file downloads, video engagement, and form interactions. Each of those events can carry PII into your analytics warehouse without anyone touching the configuration.
Direct vs Indirect Identifiers (Why It Matters)
The legal and operational consequences depend on which type you are dealing with.
| Property | Direct Identifier | Indirect Identifier |
|---|---|---|
| Identifies alone? | Yes | No, requires correlation |
| Examples | Email, full name, SSN, phone | IP, fingerprint, user-agent, session ID |
| GDPR treatment | Personal data | Personal data when “reasonably linkable” |
| Removal method | Strip / hash / never collect | Truncate, salt + hash, anonymize |
| Risk in analytics | High — never accept in any field | Medium — depends on retention and joins |
IP addresses are the textbook case. The Court of Justice of the EU ruled in Breyer v Germany (Case C-582/14, 2016) that even dynamic IP addresses are personal data when the website operator has the legal means to identify the user with help from a third party (like an ISP). That was nearly a decade ago, and the ruling has aged well — every European DPA now treats IPs as personal data by default.
Device fingerprints are worse. A fingerprint built from canvas rendering, WebGL hash, font enumeration, screen resolution, and audio context is stable across sessions and survives clearing cookies. The EFF's Panopticlick study (Eckersley, 2010) found that 83.6% of browsers in its sample had a unique fingerprint, rising to 94.2% among browsers with Flash or Java enabled. Treating a fingerprint as "anonymous" is wishful thinking.
User-Agent strings are less unique on their own but, combined with IP and accept-language headers, can isolate a single visitor in a small audience. Server logs that store all three together are de facto identification logs.
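The two removal methods named in the table above, truncation and salted hashing, can be sketched as follows. This is an illustration under stated assumptions, not how any particular tool does it: the function names, the /24 and /48 prefix lengths, and the daily-rotating salt are all choices made for the example.

```python
import hashlib
import ipaddress

def truncate_ip(raw_ip: str) -> str:
    """Zero the last octet of an IPv4 address, or keep only the /48 of an IPv6 one."""
    ip = ipaddress.ip_address(raw_ip)
    prefix = 24 if ip.version == 4 else 48  # assumed prefix lengths
    network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(network.network_address)

def daily_visitor_key(raw_ip: str, user_agent: str, daily_salt: bytes) -> str:
    """Salted hash of IP + User-Agent; discarding the salt each day breaks linkage across days."""
    material = daily_salt + raw_ip.encode() + b"|" + user_agent.encode()
    return hashlib.sha256(material).hexdigest()[:16]

truncated_v4 = truncate_ip("203.0.113.42")
truncated_v6 = truncate_ip("2001:db8:abcd:12:1:2:3:4")
```

Note that the salted key is still pseudonymous personal data for as long as the salt exists. The privacy gain comes from deleting the salt, and from never writing the raw IP to disk in the first place.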
The 18 HIPAA PII Categories
If you handle US health-related data — including a wellness blog, a fitness app, a telehealth booking page — HIPAA’s Safe Harbor de-identification standard (45 CFR 164.514) lists 18 specific identifiers that must be removed for data to count as de-identified. They are useful as a checklist even outside healthcare:
- Names
- Geographic subdivisions smaller than a state (street, city, county, ZIP — except first three ZIP digits in some cases)
- All elements of dates (except year) related to an individual — birthdate, admission date, discharge date
- Telephone numbers
- Fax numbers
- Email addresses
- Social Security numbers
- Medical record numbers
- Health plan beneficiary numbers
- Account numbers
- Certificate or license numbers
- Vehicle identifiers and serial numbers (including license plates)
- Device identifiers and serial numbers
- Web URLs
- IP addresses
- Biometric identifiers (fingerprints, voiceprints)
- Full-face photographs and comparable images
- Any other unique identifying number, characteristic, or code
Web URLs and IP addresses are explicitly on the list. Default analytics configurations capture both. If you run a site touching any of the categories above, your analytics tool needs more than “we use GA4 and trust Google.”
Where PII Slips Into Analytics by Accident
In my experience auditing client setups, the same handful of leaks show up over and over. None of them require malice. They happen when a developer is in a hurry or a marketer copies a link from a CRM.
- UTM parameters with personal data. A sales rep emails a personalized landing-page link with [email protected]. Every click sends the email address to GA4 as an event parameter and to your CDN logs.
- Form-submit URLs. Older HTML forms use GET, which puts whatever the user typed (including names, emails, and phone numbers) into the page URL. GA4 captures the URL on the next pageview.
- Site search queries. If a logged-in user searches for “my invoice 12345” or types their own email into the search box “to see what’s there,” that string becomes a search_term parameter. The same applies to autofill fields where browsers misfire.
- Referrer leaks. A page on your site links out to a partner site. The partner’s analytics gets your full URL in the Referer header, including any query strings carrying user IDs or session tokens.
- Logged-in user IDs in dataLayer. Developers push `userId: 'user_2345'` for personalization. If that ID maps back to your CRM, it is personal data the moment GA4 receives it.
- Page paths that encode identity. Routes like `/account/jane-doe` or `/orders/INV-2024-001` end up in the page_location dimension.
- Custom dimensions filled at the wrong time. Sending email or username as a custom dimension is one of the top reasons Google has deleted GA properties for terms-of-service violations.
PII in URL Parameters: A Detection Checklist
A 30-minute audit usually surfaces 80% of the leaks. Here is the order I run it in:
- Pull your top 500 page URLs from GA4. Reports → Engagement → Pages and Screens. Export.
- Regex-scan the query strings for `@`, `email=`, `name=`, `phone=`, `token=`, `uid=`, `id=`, `user=`. A simple `grep -E "(email|phone|uid|user|token)=" urls.csv` takes ten seconds.
- Check site-search queries. Reports → Events → view_search_results. Sort by search_term and skim the top 200 for anything that looks like an email, an order number, or a personal name.
- Inspect Enhanced Measurement events: form_start, form_submit, file_download, click. Look at the event parameters in DebugView while submitting a form on staging.
- Crawl your own site with a tool like Screaming Frog and filter for URLs containing query strings. Anything carrying personal-looking parameters is a candidate for either a redirect-strip rule or a server-side filter.
- Check inbound paid-campaign URLs. Marketing teams love appending CRM contact IDs to UTM tags. Audit your active campaigns directly inside the ad platforms.
- Inspect your dataLayer. Open DevTools → Console → `window.dataLayer`. Read every key. Anything resembling a user ID, email, or hashed-but-recoverable value is a problem.
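The regex-scan step can also be scripted so that percent-encoded values such as `%40` are caught, which a plain grep for `@` would miss. The sketch below is one possible approach: `SUSPECT_KEYS`, `EMAIL_RE`, and `find_pii_hints` are names invented here, and the key list simply mirrors the one in the checklist.

```python
import re
from urllib.parse import urlsplit, parse_qsl

# Parameter names from the checklist above; extend for your own site.
SUSPECT_KEYS = {"email", "name", "phone", "token", "uid", "id", "user"}
# Deliberately loose pattern: it flags candidates for review, it does not validate emails.
EMAIL_RE = re.compile(r"[^@\s/?&=]+@[^@\s/?&=]+\.[a-z]{2,}", re.IGNORECASE)

def find_pii_hints(url: str) -> list:
    """Return human-readable reasons why a URL looks like it carries personal data."""
    hints = []
    parts = urlsplit(url)
    # parse_qsl percent-decodes values, so jane%40example.com is seen as jane@example.com.
    for key, value in parse_qsl(parts.query, keep_blank_values=True):
        if key.lower() in SUSPECT_KEYS:
            hints.append(f"suspect parameter: {key}")
        if EMAIL_RE.search(value):
            hints.append(f"email-like value in: {key}")
    if EMAIL_RE.search(parts.path):
        hints.append("email-like value in path")
    return hints
```

Feed it each URL from the GA4 export and review anything that returns a non-empty list. Expect false positives on generic keys like `id`; the goal is a short review list, not an automatic verdict.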
How to Strip PII from GA4
GA4 ships with a few protections, but most are off by default or require explicit configuration. The minimum baseline:
- IP anonymization. GA4 does not store IP addresses, but it processes them. There is nothing for you to toggle — Google handles truncation server-side. That said, the processing still happens in the US, which is the source of the EU concerns. If you serve EU users, this is the part that does not solve itself.
- Query-string redaction. Admin → Data Streams → your stream → Configure tag settings → Show all. The List unwanted referrals setting does not help here; the better option is to use Modify event rules to strip query parameters. Create a rule that targets `page_location` and removes patterns matching `email=*`, `phone=*`, etc.
- Custom parameter filters. Admin → Data Streams → Configure tag settings → Define internal traffic + Filter unwanted referrals. More importantly, audit every `gtag('event', ...)` and `gtag('config', ...)` call for parameters you should not be sending.
- Disable Enhanced Measurement events you do not need. Site search, video engagement, and form interaction are aggressive defaults. Turn off what you do not analyze.
- Google Signals. Off by default in many EU regions, but check Admin → Data Settings → Data Collection. Signals enable cross-device tracking using Google account data — explicitly personal data under any reading.
- Data retention. Default is 2 months for event-level data. You can extend to 14 months. Keep it short.
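If you control a server-side step before data reaches the analytics endpoint (a tagging proxy, for example), query-string redaction can be applied there instead of relying on settings inside GA4. The following is a minimal sketch under that assumption; `BLOCKED_KEYS` and `redact_url` are hypothetical names, and the rule set is only a starting point.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical blocklist; align it with the parameters your own audit surfaced.
BLOCKED_KEYS = {"email", "phone", "name", "token", "uid", "user"}

def redact_url(url: str) -> str:
    """Drop query parameters with a blocked name or an email-like value; keep the rest."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in BLOCKED_KEYS and "@" not in value
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))

clean = redact_url("https://example.com/pricing?utm_source=newsletter&email=jane%40example.com")
```

Dropping the whole parameter rather than masking its value keeps the reported URLs stable, so the same page does not fragment into many rows in reports.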
Stripping PII after collection is harder than not collecting it. Once data has reached Google’s servers, your control is partial. This is why a hard look at Google Analytics alternatives in 2026 is worthwhile for any privacy-conscious operator.
PII-Safe Alternatives to GA4
The shortest path to a PII-clean analytics setup is to use a tool that was designed not to collect PII in the first place. The leading options:
| Tool | IP handling | Cookies | Fingerprinting | Data location | Cookie banner needed? |
|---|---|---|---|---|---|
| Plausible | Hashed daily, never stored | None | None | EU (Germany) | No |
| Fathom | Hashed, never stored | None | None | Canada / EU | No |
| Matomo (cookieless) | Anonymized at collection | Optional | None in cookieless mode | Self-hosted or EU cloud | Depends on config |
| Simple Analytics | Not collected | None | None | EU (Netherlands) | No |
| GA4 (default) | Truncated, processed in US | Yes (_ga, _ga_<container-id>) | Via Signals | US / global | Yes |
Plausible, Fathom, and Simple Analytics all share the same DNA: aggregate metrics, no individual visitor profiles, no cross-site tracking. Matomo’s cookieless mode achieves a similar result with more configuration knobs — see our standalone breakdown of Matomo as a self-hosted analytics tool if data sovereignty is what pushed you off GA4 in the first place.
The trade-off is granularity. You lose individual user journeys and cross-device stitching. For most content sites, agencies, and small SaaS products, this is no loss at all — the aggregate trends are what drive decisions. For ecommerce with logged-in carts, you may need server-side measurement on top. The pattern we cover in how to track website traffic without creeping on your users walks through this trade-off in detail. If you’re shortlisting two of the most popular cookieless options, see our Plausible vs Umami breakdown for a feature-by-feature read.
Storage and Retention Rules
How long you keep PII matters as much as whether you collect it. The principle in every modern privacy law is the same: storage limitation. Keep data only as long as you need it for the original purpose.
Practical baselines I recommend to clients:
- Server logs: 30 days for full logs. After that, aggregate or delete. The standard Apache/Nginx default of “keep forever” is wrong by every law passed since 2016.
- Analytics events: 90 days for event-level data. Aggregate to weekly/monthly rollups for longer-term trend analysis.
- Form submissions: Retain only as long as the business reason exists. A contact form lead followed up and closed should not sit in the database for five years.
- Backups: Document a retention schedule. The GDPR right to erasure applies to backups too, even though regulators give some flexibility on technical implementation.
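A retention schedule is easier to enforce when it lives in code next to the job that purges old data. This sketch encodes the two numeric baselines above; the category names and the `is_expired` helper are invented for illustration, and form submissions and backups are left out because their period depends on the business reason rather than a fixed number.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical schedule mirroring the baselines above.
RETENTION = {
    "server_log": timedelta(days=30),
    "analytics_event": timedelta(days=90),
}

def is_expired(category: str, created_at: datetime, now: datetime) -> bool:
    """True when a record is older than the documented period for its category."""
    return now - created_at > RETENTION[category]

now = datetime(2025, 6, 1, tzinfo=timezone.utc)
old_log = is_expired("server_log", datetime(2025, 4, 1, tzinfo=timezone.utc), now)
recent_event = is_expired("analytics_event", datetime(2025, 4, 1, tzinfo=timezone.utc), now)
```

A purge job would call `is_expired` per record, or translate the same periods into a dated `DELETE` query, and then aggregate or delete whatever it flags.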
The GDPR does not state numerical retention limits — the law deliberately leaves that to the controller — but EU DPAs have given strong hints. The French CNIL recommends 13 months as a maximum for analytics cookies. The Italian Garante and Austrian DSB have ruled against extended retention combined with US transfer.
US state laws have begun to specify retention more concretely. The CPRA (California, effective 2023) requires you to disclose retention periods at collection. The Colorado Privacy Act and the Connecticut Data Privacy Act mirror this. Saying “as long as needed” is no longer compliant — you need a documented period per data category.
For the GDPR-specific implications on your analytics stack, see our breakdown of GDPR and website analytics, which goes deeper into legal basis, consent, and DPA selection.
PII Breach Penalties
The fines are no longer hypothetical.
- GDPR. Up to EUR 20 million or 4% of global annual revenue, whichever is higher, for serious breaches. Lower-tier violations cap at EUR 10 million or 2%. Meta has been fined over EUR 1.2 billion (2023) for inadequate transfer protections. Amazon: EUR 746 million. Google: EUR 90 million from the CNIL alone.
- CCPA / CPRA (California). Civil penalties of USD 2,500 per unintentional violation and USD 7,500 per intentional violation or violation involving a minor. Each affected consumer counts as a separate violation. A breach affecting 10,000 California users at the lower tier alone is USD 25 million.
- Other US state laws. Virginia (VCDPA), Colorado (CPA), Connecticut (CTDPA), Utah (UCPA), Texas (TDPSA), Oregon (OCPA), Tennessee (TIPA), Montana (MCDPA), Iowa (ICDPA), Indiana (INCDPA) all have penalties ranging from USD 7,500 to USD 50,000 per violation.
- HIPAA. Tiered: USD 100 to USD 50,000 per violation, capped at USD 1.5 million per identical violation per year. OCR has imposed multi-million-dollar settlements for breaches involving fewer than 100 records.
- Class action exposure. Often the largest cost. The Illinois BIPA cases have produced settlements in the hundreds of millions for biometric data alone.
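The CCPA arithmetic above is simple enough to check in a few lines. This sketch uses only the per-violation amounts quoted in the list and assumes one violation per affected consumer; it gives a rough upper bound, not a prediction of what a regulator would actually impose.

```python
# Per-violation civil penalties quoted above, in USD.
CCPA_UNINTENTIONAL_USD = 2_500
CCPA_INTENTIONAL_USD = 7_500

def ccpa_exposure(affected_consumers: int, intentional: bool = False) -> int:
    """Upper-bound exposure, assuming one violation per affected consumer."""
    per_violation = CCPA_INTENTIONAL_USD if intentional else CCPA_UNINTENTIONAL_USD
    return affected_consumers * per_violation

lower_tier = ccpa_exposure(10_000)
upper_tier = ccpa_exposure(10_000, intentional=True)
```

For 10,000 affected users this gives USD 25 million at the lower tier and USD 75 million at the higher one.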
The reputational tail is worse than the fine. One result from Cisco's 2023 consumer-privacy survey gets cited repeatedly: 48% of respondents have switched companies or providers over privacy practices. That number is not noise.
PII Audit Checklist for Site Owners
If you do nothing else from this article, work through this list. It takes a focused half-day on most sites.
- Inventory every analytics, advertising, and embedded-script tag firing on your site. Use the Tag Assistant or just View Source on a sample of pages.
- Document the legal basis for each tag (consent, contract, legitimate interest). If you cannot, the tag is at risk.
- Run the 7-step URL parameter audit from earlier in this article.
- Audit your dataLayer pushes for personal data. Remove anything not aggregated.
- Disable Enhanced Measurement events you do not analyze.
- Set GA4 retention to the minimum that supports your reporting (2 months for most sites).
- Strip query parameters using GA4 Modify event rules or, better, server-side Tag Manager.
- Replace GET-based forms with POST. This blocks the most common form-submit leak.
- Set `Referrer-Policy: strict-origin-when-cross-origin` as an HTTP header to limit what referrer data leaves your site.
- Configure server log retention to 30 days. Truncate the last octet of IPs in logs if you keep them longer.
- Document a retention schedule per data category. Add to your privacy policy.
- Test your cookie consent banner. If your banner harms conversions, fix the design — see cookie consent banner hurting conversions for the patterns that work without dark patterns.
If you operate at any scale or in a regulated sector, an annual third-party audit is worth the cost. Many of the GDPR fines listed above came from issues an external auditor would have flagged in a day.
Cross-Device Identity and the PII Trap
One area where sites trip themselves up is cross-device measurement. Stitching the same user’s mobile and desktop sessions together is operationally useful but is, by definition, identification. Even a “deterministic ID” hashed with a salt is personal data under the GDPR.
If you need cross-device measurement, the question is whether you have a legal basis (almost always consent) and whether you can offer a meaningful opt-out. Our piece on cross-device identity resolution explains the trade-offs in detail. The short version: most sites do not actually need it, and the ones that do need it should treat it as a high-risk processing activity with explicit consent.
Frequently Asked Questions
Is an IP address PII?
Under the GDPR, yes — both static and dynamic IP addresses are personal data per the CJEU’s Breyer ruling. Under US law, IP addresses fall under HIPAA’s 18 identifiers and California’s CCPA definition of personal information. The practical answer for any site operating across both jurisdictions: treat it as PII.
Is a hashed email PII?
Yes. A hash made with a known algorithm (SHA-256 of an email) cannot be inverted mathematically, but anyone can hash candidate emails and compare the results, which re-identifies the person just as well. Salting helps, but if the salt is shared across sessions or applications, the hash is still effectively identifying. Article 29 Working Party Opinion 05/2014 treats hashing as a pseudonymisation technique, and pseudonymous data is still personal data under the GDPR.
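The hash-and-compare attack takes only a few lines, which is why an unsalted hash offers so little protection. In this sketch the addresses are made up and `sha256_email` is a helper defined here, with trimming and lowercasing as an assumed normalization step.

```python
import hashlib

def sha256_email(email: str) -> str:
    """Unsalted SHA-256 of a normalized email address."""
    return hashlib.sha256(email.strip().lower().encode("utf-8")).hexdigest()

# The "anonymous" hash sitting in an analytics table.
leaked_hash = sha256_email("jane@example.com")

# Anyone holding a list of candidate addresses (a CRM export, a breach dump)
# can build a lookup table and match the hash back to a person.
candidates = ["bob@example.com", "jane@example.com", "sam@example.org"]
lookup = {sha256_email(candidate): candidate for candidate in candidates}
recovered = lookup.get(leaked_hash)
```

Here `recovered` is the original address. A secret, per-dataset salt defeats this particular lookup, but only for as long as the salt stays secret and is not reused across systems.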
Is a browser fingerprint PII?
Yes. Fingerprints single out the large majority of browsers, with the exact rate depending on the technique. EDPB Guidelines 2/2023 on the technical scope of Article 5(3) of the ePrivacy Directive bring fingerprinting within the same rule that covers cookie storage, so it requires consent under that article.
Does GA4 collect PII by default?
GA4 collects information that qualifies as personal data under the GDPR by default — IP addresses (processed), device identifiers, behavioral data tied to a session. Whether it collects PII in the strict US sense depends on what you push to it. Form-submit values and search queries containing emails are the most common accidental sources.
Can I anonymize analytics data after the fact?
Partly. You can delete fields, truncate IPs, and hash identifiers in your warehouse. But “true” anonymization (irreversible, no possibility of re-identification) is harder than it sounds — once data is in BigQuery linked to other tables, joining attacks remain feasible. The GDPR’s Recital 26 sets a high bar. Better strategy: do not collect what you would later need to anonymize.
What about referrer data — is the referer header PII?
The Referer header itself is a URL, not directly PII. But that URL can contain query parameters with personal data, session tokens, or signed URLs that identify a session. Set Referrer-Policy: strict-origin-when-cross-origin at minimum. For pages with sensitive identifiers in the URL, use no-referrer.
Is a cookie ID PII?
Under the GDPR, yes: Article 4(1) lists “online identifier” as a way of identifying a person, and Recital 30 names cookie identifiers explicitly. Under CCPA, persistent identifiers including cookies are classified as personal information. The “first-party cookie loophole” people sometimes invoke does not exist in either regulation.
Do I need a DPA with my analytics provider?
If your analytics provider processes personal data on your behalf, yes: GDPR Article 28 requires it. Google Analytics, Plausible, Fathom, and Matomo all offer Data Processing Agreements (DPAs). For self-hosted analytics there is no analytics vendor acting as processor, so no DPA with one is needed, but every responsibility then sits with you as controller, and a hosting provider that can access the data may still need its own DPA.
Bottom Line
PII exposure is not a line you cross once and then stop thinking about. It is a posture. Default analytics tools err on the side of collecting more, because more data lets them sell more advertising. Privacy-friendly tools err on the side of collecting less, because their business model does not depend on identifying your visitors.
The mechanism is straightforward. Collect less. Strip what you do collect. Store it shorter. Document what you do. Use tools that align with your stance. If you are running default GA4 and have not opened the data-streams configuration in a year, you are almost certainly handling more PII than your privacy policy claims.
Start with the audit checklist. The privacy-friendly analytics guide covers the strategic choice; the GDPR-and-analytics breakdown covers the legal mechanics. Read both, audit once, and your PII exposure drops by an order of magnitude in a single afternoon.