PostgreSQL Anonymizer
PostgreSQL Anonymizer is an extension that masks PII in PostgreSQL databases. It lets developers define declarative masking rules, supports several masking methods, and aids GDPR compliance by keeping data confidential during testing.
PostgreSQL Anonymizer is an extension designed to mask or replace personally identifiable information (PII) and sensitive data in PostgreSQL databases. It takes a declarative approach: developers define masking rules directly within the database schema using PostgreSQL's Data Definition Language (DDL). The extension supports five masking methods (Anonymous Dumps, Static Masking, Dynamic Masking, Masking Views, and Masking Data Wrappers), each suited to a different context. The primary aim is to mask data at the database level in order to minimize exposure risk. The extension also offers a range of masking functions, including randomization and custom functions, along with detection functions to identify columns that need anonymization. The quick-start guide outlines how to launch the extension using Docker, create a database, and set up masking rules for specific user roles. Success stories highlight its effectiveness in reinforcing GDPR compliance and maintaining data confidentiality during testing. Users have praised the extension's ability to implement complex masking rules without sacrificing functionality.
- PostgreSQL Anonymizer masks PII and sensitive data in PostgreSQL databases.
- It allows developers to define masking rules directly in the database schema.
- The extension supports multiple masking methods tailored for different use cases.
- It includes various masking functions and detection capabilities for identifying sensitive data.
- Users report improved GDPR compliance and data confidentiality during testing processes.
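The declarative style described above can be sketched in a few lines, based on the extension's documented `SECURITY LABEL` syntax; the table, column, and role names here are illustrative, and this assumes the `anon` extension is installed on the server:

```sql
-- Enable the extension and load its built-in fake-data sets.
CREATE EXTENSION IF NOT EXISTS anon CASCADE;
SELECT anon.init();

-- Declare a masking rule directly in the schema via a security label.
SECURITY LABEL FOR anon ON COLUMN customer.email
  IS 'MASKED WITH FUNCTION anon.fake_email()';

-- Mark a role as masked so dynamic masking applies to its queries.
SECURITY LABEL FOR anon ON ROLE analyst IS 'MASKED';
SELECT anon.start_dynamic_masking();
```

With this in place, queries run as `analyst` see fake email addresses, while other roles continue to see the real data.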
Related
Difference between running Postgres for yourself and for others
The post compares self-managed PostgreSQL with managing it for others, focusing on provisioning, backup/restore, HA, and security. It addresses complexities in provisioning, backup strategies, HA setup, and security measures for external users.
Does PostgreSQL respond to the challenge of analytical queries?
PostgreSQL has advanced in handling analytical queries with foreign data wrappers and partitioning, improving efficiency through optimizer enhancements, while facing challenges in pruning and statistical data. Ongoing community discussions aim for further improvements.
Postgres.new: In-browser Postgres with an AI interface
postgres.new is an in-browser Postgres sandbox that integrates AI assistance for managing databases, supporting features like CSV imports, report generation, and semantic search, with future cost-effective deployments planned.
Pg_mem: A Malware Hidden in the Postgres Processes
Researchers identified PG_MEM malware targeting PostgreSQL databases via brute force attacks, executing arbitrary commands and deploying cryptocurrency miners. Over 800,000 exposed databases highlight the urgent need for enhanced security measures.
How Postgres Is Misused and Abused in the Wild
Karen Jex highlighted at PGConf.dev that PostgreSQL misuse stems from inadequate documentation and understanding. Educating users on best practices is essential to improve user experience and address knowledge gaps.
- Several users are developing or have developed similar tools, emphasizing the need for user-friendly interfaces and configurations for non-technical users.
- There are discussions about the limitations of current anonymization methods, with some experts arguing that they may not meet GDPR standards and could be more accurately described as pseudonymization.
- Concerns are raised about the integration of anonymization tools with existing database frameworks, particularly regarding schema changes and migration issues.
- Users express interest in automatic identification of PII and the need for default masking of new columns to enhance data security.
- Some commenters caution against the risk of inadvertently anonymizing production data, highlighting the importance of careful implementation.
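The detection and static-masking features mentioned in these comments have documented entry points; as a rough sketch (assuming the `anon` extension is already set up):

```sql
-- Scan column names against a dictionary of common identifiers (English locale)
-- to suggest columns that may hold PII.
SELECT * FROM anon.detect('en_US');

-- Static masking: permanently rewrites every column that has a masking rule.
-- This is destructive, which is exactly the production-data risk raised above.
SELECT anon.anonymize_database();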
According to its --help output, it is designed to retain the following properties of data:
- cardinalities of values (number of distinct values) for every column and for every tuple of columns;
- conditional cardinalities: number of distinct values of one column under condition on value of another column;
- probability distributions of absolute value of integers; sign of signed integers; exponent and sign for floats;
- probability distributions of length of strings;
- probability of zero values of numbers; empty strings and arrays, NULLs;
- data compression ratio when compressed with LZ77 and entropy family of codecs;
- continuity (magnitude of difference) of time values across table; continuity of floating point values;
- date component of DateTime values;
- UTF-8 validity of string values;
- string values continue to look somewhat natural.
[1]: https://clickhouse.com/docs/en/operations/utilities/clickhou...
More specifically, the integration of this functionality at a (fortunately) ex-employer was purposefully kept away from the dev team (no motivation was offered; however, I suspect some sort of segmentation was sought), and thus did not take into account that tables with PII did in fact still need their schema changed from time to time.
This led to the anonymizer extension, together with the confidential views, being installed only on production DB instances, with dev, test, and staging instances running vanilla Postgres. As a result, any DB migration issue related to the confidential views could only surface during the release itself. This led to numerous failed releases in which the ops team had to intervene, manually remove the views for the duration of the release, then manually re-create them.
So,
If you plan to use this extension, and specifically its views, make sure you have it set up in exactly the same way on all environments. Also make sure that its initialisation and view creation are part of your framework's DB migrations, so that they are documented and easy to reproduce precisely on new environments.
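One way to follow that advice is to make the setup an ordinary, idempotent migration step so every environment runs it; a minimal sketch (the table, column, and masking rule are placeholders):

```sql
-- Migration step: same masking setup in dev, test, staging, and prod.
CREATE EXTENSION IF NOT EXISTS anon CASCADE;
SELECT anon.init();

-- Keep the first 2 and last 2 digits of the phone number, pad the middle.
SECURITY LABEL FOR anon ON COLUMN app_user.phone
  IS 'MASKED WITH FUNCTION anon.partial(phone, 2, $$X-XXX-XX$$, 2)';
```

Because the rule lives in a migration, schema changes to PII tables get exercised against the masked views in every environment, not just at release time.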
It certainly complicates things, but it provides an additional layer of separation between the PII and its related data. You can give your end users access to a database without having to worry about them getting access to the "dangerous" data. If they do genuinely need the data the token points to, they can request access to that related database.
This method also performs better, since you don't need to encrypt the entire database (which is often required when storing PII) and don't need to add extra security-context function calls to every database request.
All that said, I wouldn't rely on this extension as a way to deliver anonymized data to downstream consumers outside of our software team. As others have pointed out, this is really more of a pseudonymization technique. It's great for removing phone numbers, emails, etc. from your data set, but it's not going to eradicate PII. Pretty much all anonymized records can be traced back to their source data through PKs or FKs.
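The PK/FK point can be made concrete: even when names and emails are masked, keys are typically left intact so the data stays relationally useful, which is exactly what allows re-linking. A hypothetical example (table and view names are illustrative):

```sql
-- m.name comes through masked, but customer_id is untouched, so any other
-- data set sharing that key can re-link the "anonymized" row to a person.
SELECT o.order_id, o.total, m.name
FROM orders o
JOIN masked_customers m ON m.customer_id = o.customer_id;
```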
I'm the main developer of this extension. Happy to answer any question you have about this project and anonymization in general!
A good use case that comes to mind is using prod data in a retool app or something for your internal team but you want to mask out certain bits.
I’ve been building Neosync [1] to handle more advanced use cases where you want to anonymize data for lower-level environments, such as stage or dev. Then prod stays completely unexposed.
It also has a transactional anonymization API.
As a computer scientist and academic researcher having worked on this topic for now more than a decade (some of my work if you are interested: [1, 2]), re-identification is often possible from few pieces of information. Masking or replacing a few values or columns will often not provide sufficient guarantees—especially when a lot of information is being released.
What this tool does is called ‘pseudonymization’ and perhaps, if one is not being very careful with terminology, ‘de-identification’ in some cases. With colleagues, I reviewed all the literature and industry practices a few months ago [3], and our conclusion was:
> We find that, although no perfect solution exists, applying modern techniques while auditing their guarantees against attacks is the best approach to safely use and share data today.
This is clearly not what this tool is doing.
[1] https://www.nature.com/articles/s41467-019-10933-3 [2] https://www.nature.com/articles/s41467-024-55296-6 [3] https://www.science.org/doi/10.1126/sciadv.adn7053
I guess you can add some CI steps when modifying the DB to ensure a given column is either explicitly allowed or masked, but still, it would be nice if the default were the other way around.
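Such a CI step could be a plain catalog query. This sketch assumes masking rules are stored as security labels under the `anon` provider (which is how the extension's `SECURITY LABEL FOR anon` syntax records them) and lists every user-table column in `public` with no label, so a reviewer can confirm each one is deliberately unmasked:

```sql
SELECT c.relname AS table_name, a.attname AS column_name
FROM pg_attribute a
JOIN pg_class c      ON c.oid = a.attrelid
JOIN pg_namespace n  ON n.oid = c.relnamespace
LEFT JOIN pg_seclabel s
       ON s.objoid = a.attrelid
      AND s.objsubid = a.attnum
      AND s.provider = 'anon'
WHERE n.nspname = 'public'
  AND c.relkind = 'r'          -- ordinary tables only
  AND a.attnum > 0             -- skip system columns
  AND NOT a.attisdropped
  AND s.label IS NULL;         -- no masking rule declared
```

Failing the pipeline when this query returns unexpected rows approximates the "masked by default" behaviour the comment asks for.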