GenAI combines unique data with knowledge of individual users to create personalized web experiences. How can you ensure that this knowledge is handled securely and in accordance with security compliance standards?
How can you provide assurances to users regarding the removal of their personally identifiable information (PII)?
Let’s explore some tools and patterns you can use to ensure your applications comply with security and privacy standards.
Why RAG is the best architecture for ensuring data privacy
Retrieval-augmented generation (RAG) is an architecture that augments GenAI responses with private data, and it is often deployed to address shortcomings of large language models such as hallucinations and short context windows.
But RAG can also help build privacy-sensitive AI systems that forget certain information about individuals upon request.
To comply with security standards, your handling of user data must provide three properties: separation, privacy, and on-demand deletion.
Separation
Namespaces isolate each user's data within an index and serve as a strong security primitive.
Privacy
With RAG, data is provided to the LLM as context only at generation time; it does not need to be used to train or fine-tune the AI model.
This means that user data is not stored as knowledge in the model itself, but is simply presented to the GenAI model at the moment content is generated.
RAG enables personalization while maintaining strict control over all PII used to generate user-specific responses.
Any proprietary data or PII is shared with the LLM only for the duration of a request and can be deleted from your systems immediately afterward, so the information is not available to future requests.
On-Demand Deletion
If a user wishes to be forgotten, their data can be deleted from the vector database index, and the RAG system will no longer hold any information about them.
Once the data is deleted, the LLM can no longer answer questions about that user or topic, because the retrieval phase no longer supplies that information to the LLM at generation time.
RAG allows for greater flexibility in managing user-specific data than training and fine-tuning, as data about one or more entities can be quickly removed from the production system without impacting system performance for other users.
Handling customer data securely
Understanding different types of data
Designing privacy-conscious software requires understanding the risks associated with each type of customer data you store.
First, classify the types of data that need to be stored in the vector database: specifically, identify public data, private data, and data that contains PII.
Imagine you are building an e-commerce application that stores a combination of public, private, and PII data.
- Public: company name, profile picture, job title.
- Private: API key, organization ID, purchase history.
- PII: full name, date of birth, account ID.
Next, decide which data should be stored only as vectors and which should also be stored as metadata to support filtering.
We aim to strike a balance between storing as little personal information as possible and providing a rich application experience.
Filtering by metadata is powerful, but in its simplest form it requires private or PII data to be stored in plain text, so you need to be conscious of which fields you expose.
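One way to enforce such a classification is to whitelist the fields that may appear as plaintext metadata before every upsert. The sketch below illustrates the idea; the field names and classification sets are hypothetical, matching the e-commerce example above.

```python
# Sketch: split a customer record so only fields explicitly classified as
# public ever reach plaintext vector metadata. Field names are hypothetical.

PUBLIC_FIELDS = {"company_name", "job_title"}               # safe as plaintext metadata
PRIVATE_FIELDS = {"org_id", "purchase_history"}             # expose only if filtering requires it
PII_FIELDS = {"full_name", "date_of_birth", "account_id"}   # keep out of plaintext metadata

def build_metadata(record: dict) -> dict:
    """Keep only fields explicitly classified as public."""
    return {k: v for k, v in record.items() if k in PUBLIC_FIELDS}

record = {
    "company_name": "Acme Corp",
    "job_title": "Buyer",
    "full_name": "Jane Doe",
    "account_id": "acct-42",
}
metadata = build_metadata(record)
print(metadata)  # {'company_name': 'Acme Corp', 'job_title': 'Buyer'}
```

A whitelist (rather than a blacklist) fails safe: a newly added field stays out of metadata until someone deliberately classifies it as public.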
With this understanding, you can take each data type into account and apply the following techniques to handle them safely:
Separating customer data across indexes
Use separate indexes for different purposes. If your application manages both natural language descriptions of geographic locations and personally identifiable user data, create two separate indexes: one for locations and one for users.
Name your indexes based on what they will contain. Think of indexes as high-level buckets for the types of data you store.
Separating customer data between namespaces
As discussed earlier in the context of building multi-tenant systems, namespaces are a convenient and secure primitive for isolating organizations or users within a single index.
Think of a namespace as an entity-specific partition within your index. If your index is users, each namespace can map to a single user and store only the data relevant to that user.
Using namespaces also improves query performance, because it reduces the total search space that must be scanned to return relevant results.
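The isolation property can be sketched as follows. An in-memory dict stands in for the vector index so the pattern is runnable without a Pinecone account; with the real client, the same effect comes from passing a `namespace` argument to upsert and query calls.

```python
# Sketch: namespace isolation. A dict of dicts stands in for a vector index;
# each top-level key is a namespace (one per user in this example).
from collections import defaultdict

index = defaultdict(dict)  # namespace -> {vector_id: record}

def upsert(namespace: str, vector_id: str, record: dict) -> None:
    index[namespace][vector_id] = record

def query(namespace: str) -> list:
    # A real query would rank by vector similarity; the point here is that
    # the search space is limited to a single namespace.
    return list(index[namespace].values())

upsert("user-a", "vec-1", {"order": "espresso machine"})
upsert("user-b", "vec-2", {"order": "standing desk"})

assert query("user-a") == [{"order": "espresso machine"}]  # user-b's data is never considered
```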
Querying segments of content using ID prefixes
Pinecone supports ID prefixes, a technique for encoding additional data in a vector’s ID field during an upsert, allowing you to later refer to a “segment” of content, such as chunk 23 of document 1, or all vectors for user A in department Z.
ID prefixes are ideal for associating a set of vectors with a particular user, allowing you to efficiently delete that user’s data upon request.
For example, imagine an application that processes restaurant orders and allows users to find their purchases using natural language.
The ID field can encode hierarchical tags in any structure that suits your application.
This method makes bulk deletions and listings easier to perform.
Using ID prefixes requires some up-front planning when designing your application, but they provide a convenient means of referencing all the vectors and metadata related to a particular entity.
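A minimal sketch of one such ID scheme follows. The `user#document#chunk` layout and delimiter are an illustrative convention, not a fixed requirement:

```python
# Sketch: hierarchical vector IDs, where a shared prefix identifies all
# vectors belonging to one user. The layout is an example convention.

def make_id(user_id: str, doc_id: str, chunk: int) -> str:
    return f"{user_id}#{doc_id}#{chunk}"

def user_prefix(user_id: str) -> str:
    return f"{user_id}#"

ids = [make_id("user-a", "order-17", c) for c in range(3)]
print(ids)  # ['user-a#order-17#0', 'user-a#order-17#1', 'user-a#order-17#2']
assert all(i.startswith(user_prefix("user-a")) for i in ids)
```

A prefix built this way can later be passed to the client's list-by-prefix operation to enumerate, and then delete, everything belonging to that user.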
Retrieval-augmented generation is also ideal for removing knowledge
Retrieval-augmented generation supplements LLM responses with unique, private, or rapidly changing data, grounding them in truth and specific context.
But it is also an ideal way to provide end users with assurances regarding their right to be forgotten. Consider an e-commerce scenario where a user can use natural language to interact with a store, retrieve old orders, and buy new products.
In a typical RAG workflow, a user’s natural language query is first converted into a query vector and then sent to the vector database to retrieve orders that match the user’s parameters.
The user’s personal context (order history) and personally identifiable information are captured and fed into the generative model at inference time to fulfill the user’s request.
RAG lets you control exactly which user data is presented to the LLM.
What happens when you issue a batch delete using the ID prefix scheme?
Because all user-specific context has been removed from the system, subsequent search queries return no results, effectively removing any knowledge of the user from the LLM’s responses.
ID prefixes allow entity-specific data to be isolated, designated, and later listed or deleted, extending RAG into architectures that provide guarantees around data deletion.
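The delete step can be sketched as below. An in-memory dict stands in for the vector index so the example is runnable as-is; with the Pinecone client you would list IDs by prefix and delete them in batches instead.

```python
# Sketch: deleting every vector whose ID starts with a user's prefix.
# The dict stands in for a vector index keyed by vector ID.

store = {
    "user-a#order-17#0": [0.1, 0.2],
    "user-a#order-18#0": [0.3, 0.4],
    "user-b#order-99#0": [0.5, 0.6],
}

def delete_by_prefix(store: dict, prefix: str) -> int:
    """Remove all vectors whose ID begins with `prefix`; return the count."""
    doomed = [vid for vid in store if vid.startswith(prefix)]
    for vid in doomed:
        del store[vid]
    return len(doomed)

deleted = delete_by_prefix(store, "user-a#")
assert deleted == 2
assert list(store) == ["user-b#order-99#0"]  # only the other user's data remains
```

After this runs, no retrieval query can surface user-a's context, so the LLM has nothing user-specific to generate from.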
The most secure data is data you don’t store
Tokenization to obfuscate user data
In many cases, you can avoid storing any personally identifiable information in the vector database at all, and instead store a reference or foreign key to another system, such as the ID of a row in a private database where the complete user record resides.
Complete user records can be kept in a secure, encrypted storage system, either on-premise or hosted by a cloud service provider, reducing the total number of systems that reference user data.
This process is sometimes called tokenization, similar to how a model converts words in a prompt you send into IDs for words in a specific vocabulary. To explore this concept, try this interactive tokenization demo.
If your application provides a lookup table or a reversible tokenization process, you can write the foreign key, rather than a plaintext value that exposes user data, into the metadata you associate with the vector during upsert into the vector database.
The foreign key can be anything that makes sense to your application, such as a PostgreSQL row ID, an ID in a relational database that stores user records, a URL, or an S3 bucket name that can be used to look up additional data.
When you upsert a vector, you can attach arbitrary metadata.
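The foreign-key pattern can be sketched as follows. A dict stands in for the private user store (e.g., a PostgreSQL table); the key and field names are hypothetical.

```python
# Sketch: storing a foreign key instead of PII in vector metadata.
# `user_table` stands in for a secure, encrypted primary store.

user_table = {
    "row-1001": {"full_name": "Jane Doe", "date_of_birth": "1990-01-01"},
}

# The metadata attached to the vector contains no PII, only the reference.
vector_metadata = {"user_fk": "row-1001", "category": "orders"}

def resolve_user(metadata: dict) -> dict:
    """Application-side lookup: turn the foreign key back into a full record."""
    return user_table[metadata["user_fk"]]

assert resolve_user(vector_metadata)["full_name"] == "Jane Doe"
```

Deleting the row from the primary store then renders every `user_fk` reference in the vector database meaningless, even before the vectors themselves are removed.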
Obfuscating user data through hashing
Hashing can be used to hide user data before writing it to the metadata.
Hashing is not encryption: hashing user data does not offer the same protection as encryption, but it does reduce the chance of accidental exposure of PII.
The application provides the logic to hash the user’s PII and then attach it to the associated vector as metadata.
There are many types of hashing operations, but broadly speaking, they convert the input data into a string that is meaningless on its own, though a determined attacker may still be able to reverse or brute-force the weaker schemes.
Applications can obfuscate user data in a variety of ways, such as using insecure message hashing or base64 encoding, before writing the value to metadata.
Once the user data is hashed and stored as metadata, the application can use the same hashing logic to perform queries and derive metadata filter values.
The vector database will return the most relevant results for your query, just as it did before.
The application then de-obfuscates the user data before manipulating it or returning it to the end user.
This approach provides an additional defense in depth: even if an attacker has access to the vector store, they would still need to reverse the application-level hash to obtain the plaintext value.
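Because the same hashing logic runs at upsert and at query time, equality filters keep working without ever storing the plaintext. A minimal sketch, using SHA-256 from the standard library (note that an unsalted hash of low-entropy data such as an email address can still be brute-forced; the field name is hypothetical):

```python
# Sketch: hashing a PII field before storing it as vector metadata, then
# deriving the identical hash at query time to build a metadata filter.
import hashlib

def hash_field(value: str) -> str:
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

# At upsert time: store the hash, not the plaintext email.
metadata = {"email_hash": hash_field("jane@example.com")}

# At query time: hash the same input to produce the filter value.
query_filter = {"email_hash": hash_field("jane@example.com")}

assert metadata == query_filter                          # filters still match
assert "jane@example.com" not in metadata["email_hash"]  # plaintext never stored
```

SHA-256 is one-way, so this variant supports equality matching rather than de-obfuscation; if the application must recover the original value, it would use a reversible scheme (or the foreign-key lookup described above) instead.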
Encrypting and Decrypting Metadata
Obfuscating and hashing user data is more effective than storing it in plain text, but it is not enough to protect against a skilled and motivated attacker.
Encrypting metadata before every upsert, encrypting the query parameters before executing queries, and decrypting the final output of every request may add significant overhead to your system, but it is the surest way to guarantee that user data is safe and that the vector store has no knowledge of the sensitive data it processes.
Everything in engineering is a trade-off: weigh this protection against the performance cost of constant encryption and decryption, the overhead of securely maintaining and rotating private keys, and the risk of leaking sensitive customer data.
Retention and Deletion of Data in the Vector Database
If you follow the recommended convention of maintaining a separate namespace per tenant to implement multi-tenancy, you can delete everything stored in a given namespace in a single operation.
To delete all records from a namespace, call the client's delete operation with the deleteAll flag set and the target namespace as a parameter.
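As an illustration, the request body for Pinecone's vector delete endpoint looks like the following (the namespace value is hypothetical):

```json
{
  "deleteAll": true,
  "namespace": "user-a"
}
```

In the official Python client, the equivalent call is `index.delete(delete_all=True, namespace="user-a")`.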
Building privacy-conscious AI software is possible with planning
To successfully build privacy-conscious AI software, you need to think about and classify the data you plan to store in advance.
As discussed in the sections on ID prefixes and metadata filtering, you must deliberately leave handles on segments of your content that can later be used to efficiently remove all of a user's or organization's information from your system.
By incorporating the Pinecone vector database into your stack and with careful planning, you can build a generative AI system that respects privacy while serving your users’ needs.