An End-to-End RAG Example Using a FAISS Retriever with LangChain and OpenAI GPT-3.5 for QA#
This notebook presents an end-to-end example of the library’s functionality. It shows how to build a RAG (Retrieval-Augmented Generation) question-answering chain, powered by GPT-3.5, over encapsulated data, and how read-context rules redact sensitive fields before they reach the model.
[ ]:
!pip install "antimatter[langchain]"
!pip install python-dotenv openai
Import the OpenAI key from a .env file#
[3]:
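import os
import dotenv
# Load the OpenAI key from ~/.openai_env into the process environment.
# expanduser("~") also works when the HOME variable is not set.
dotenv.load_dotenv(os.path.join(os.path.expanduser("~"), '.openai_env'))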
[3]:
True
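Before moving on, it can help to confirm that the key actually reached the environment, since a True result alone does not show which variables were loaded. The following is a minimal sketch, not part of the library: the helper names require_env and mask are hypothetical, and the variable name OPENAI_API_KEY mentioned in the usage note is an assumption, because the notebook does not show the contents of the .env file.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set or is empty")
    return value


def mask(secret: str, visible: int = 4) -> str:
    """Hide all but the last few characters so the value is safe to print."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]
```

A typical call would be print(mask(require_env("OPENAI_API_KEY"))), which fails early with a readable message when the key is missing instead of failing later inside the chat model.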
Register a domain and create a read/write context#
[4]:
import os
from antimatter import new_domain, Session
from antimatter.builders import *
from antimatter.datatype.datatypes import Datatype
[5]:
# Either create a new domain or use an existing one
if True:
    sess = new_domain("[email protected]")
    print("domain: %s" % (sess.domain_id))
    # print(f"sess = Session.from_api_key(domain_id='{sess.domain_id}', api_key='{sess.api_key}')")
else:
    sess = Session.from_api_key(domain_id='<domain_id>', api_key='<api_key>')
file_name = "/tmp/testdata.capsule"
domain: dm-8gQzMfrtE5N
Add some facts to this domain#
Create a fact type called is_project_member with the attributes email and project. Add two facts to this type:
- is_project_member(email="test@test.com", project="project1")
- is_project_member(email="test2@test2.com", project="project2")
[7]:
sess.add_fact_type(
    "is_project_member",
    description="Team membership",
    arguments={"email": "email of the member", "project": "name of the project"},
)
sess.add_fact(
    "is_project_member",
    "[email protected]",
    "project1",
)
sess.add_fact(
    "is_project_member",
    "[email protected]",
    "project2",
)
[7]:
{'id': 'ft-48kn7aqfm31sq6li',
'name': 'is_project_member',
'arguments': ['[email protected]', 'project2']}
[8]:
sess.list_facts('is_project_member')
[8]:
[{'id': 'ft-48kn7aqfm31sq6li',
'name': 'is_project_member',
'arguments': ['[email protected]', 'project2']},
{'id': 'ft-fczggvwqhg9jyqfv',
'name': 'is_project_member',
'arguments': ['[email protected]', 'project1']}]
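The relationship between a fact type and its facts can be pictured with a small in-memory model. This is only an illustrative sketch under stated assumptions, not how the library stores facts: the FactType class and its methods are hypothetical, and the only thing taken from the notebook is that a fact type declares named arguments while each fact supplies one positional value per argument.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class FactType:
    name: str
    description: str
    arguments: Dict[str, str]  # argument name -> human-readable description
    facts: List[Tuple[str, ...]] = field(default_factory=list)

    def add_fact(self, *values: str) -> Tuple[str, ...]:
        # Facts are positional: one value per declared argument, in order
        if len(values) != len(self.arguments):
            raise ValueError(
                f"{self.name} expects {len(self.arguments)} arguments, got {len(values)}"
            )
        self.facts.append(values)
        return values

    def as_dicts(self) -> List[Dict[str, str]]:
        # Pair each positional value with its declared argument name
        return [dict(zip(self.arguments, fact)) for fact in self.facts]


membership = FactType(
    "is_project_member",
    "Team membership",
    {"email": "email of the member", "project": "name of the project"},
)
membership.add_fact("test@test.com", "project1")
membership.add_fact("test2@test2.com", "project2")
```

Calling membership.as_dicts() pairs each stored value with its argument name, which makes it easier to see why the argument order in add_fact matters.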
Open a dataset#
[9]:
# Load dataset
import pandas as pd
data = [
{"id":1,"first_name":"Amanda","last_name":"Jordan","email":"[email protected]","gender":"Female","ip_address":"1.197.201.2","cc":"6759521864920116","country":"Indonesia","birthdate":"3\\/8\\/1971","salary":49756.53,"title":"Internal Auditor","comments":"Hello friends, my name is Alice Johnson and I just turned 29 years old! \\ud83c\\udf89 I am looking forward to connecting with all of you. Feel free to drop me a line at [email protected] or call me at 415-123-4567."},
{"id":2,"first_name":"Albert","last_name":"Freeman","email":"[email protected]","gender":"Male","ip_address":"218.111.175.34","cc":"","country":"Canada","birthdate":"1\\/16\\/1968","salary":150280.17,"title":"Accountant IV","comments":"Customer feedback: I recently visited your store at 5678 Pine Avenue, Dallas, TX 75201. My name is Jane Doe, age 43. I had a wonderful experience and the staff was very friendly. You can reach out to me at [email protected] for any further details."},
{"id":3,"first_name":"Evelyn","last_name":"Morgan","email":"[email protected]","gender":"Female","ip_address":"7.161.136.94","cc":"6767119071901597","country":"Russia","birthdate":"2\\/1\\/1960","salary":144972.51,"title":"Structural Engineer","comments":"Booking Confirmation: Thank you, David Smith (DOB: 01\\/12\\/1978) for booking with us. We have received your payment through the credit card ending with 1234. Your booking ID is #67890. Please save this email for your records. For any queries, contact us at [email protected]."},
]
df = pd.DataFrame(data)
df.head()
[9]:
 | id | first_name | last_name | email | gender | ip_address | cc | country | birthdate | salary | title | comments |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Amanda | Jordan | [email protected] | Female | 1.197.201.2 | 6759521864920116 | Indonesia | 3\/8\/1971 | 49756.53 | Internal Auditor | Hello friends, my name is Alice Johnson and I ... |
1 | 2 | Albert | Freeman | [email protected] | Male | 218.111.175.34 | | Canada | 1\/16\/1968 | 150280.17 | Accountant IV | Customer feedback: I recently visited your sto... |
2 | 3 | Evelyn | Morgan | [email protected] | Female | 7.161.136.94 | 6767119071901597 | Russia | 2\/1\/1960 | 144972.51 | Structural Engineer | Booking Confirmation: Thank you, David Smith (... |
List and create write contexts#
[10]:
sess.list_write_context()
[10]:
[{'name': 'default',
'summary': 'Default write context',
'description': 'No classification of encapsulated data',
'config': {'key_reuse_ttl': 0,
'default_capsule_tags': [],
'required_hooks': []},
'imported': False,
'source_domain_id': None,
'source_domain_name': None},
{'name': 'sensitive',
'summary': 'Default write context (sensitive data)',
'description': 'Classifies data using the fast-pii and data structure classifiers',
'config': {'key_reuse_ttl': 0,
'default_capsule_tags': [],
'required_hooks': [{'hook': 'data-structure-classifier',
'constraint': '>1.0.0',
'mode': 'sync'},
{'hook': 'fast-pii', 'constraint': '>1.0.0', 'mode': 'sync'}]},
'imported': False,
'source_domain_id': None,
'source_domain_name': None}]
[11]:
# Create a new write context
sess.add_write_context(
    "write_ctx",
    WriteContextBuilder()
    .set_summary("Sample write context")
    .set_description("Sample description")
    .add_hook("fast-pii", ">1.0.0", WriteContextHookMode.Sync),
)
[12]:
sess.list_write_context()
[12]:
[{'name': 'default',
'summary': 'Default write context',
'description': 'No classification of encapsulated data',
'config': {'key_reuse_ttl': 0,
'default_capsule_tags': [],
'required_hooks': []},
'imported': False,
'source_domain_id': None,
'source_domain_name': None},
{'name': 'sensitive',
'summary': 'Default write context (sensitive data)',
'description': 'Classifies data using the fast-pii and data structure classifiers',
'config': {'key_reuse_ttl': 0,
'default_capsule_tags': [],
'required_hooks': [{'hook': 'data-structure-classifier',
'constraint': '>1.0.0',
'mode': 'sync'},
{'hook': 'fast-pii', 'constraint': '>1.0.0', 'mode': 'sync'}]},
'imported': False,
'source_domain_id': None,
'source_domain_name': None},
{'name': 'write_ctx',
'summary': 'Sample write context',
'description': 'Sample description',
'config': {'key_reuse_ttl': 0,
'default_capsule_tags': [],
'required_hooks': [{'hook': 'fast-pii',
'constraint': '>1.0.0',
'mode': 'sync'}]},
'imported': False,
'source_domain_id': None,
'source_domain_name': None}]
Encapsulate data using the write context#
[13]:
df_capsule = sess.encapsulate(data=df, write_context="write_ctx", path=file_name)
[14]:
!ls -lrtha /tmp/testdata.capsule
-rw-r--r-- 1 ajay wheel 4.9K Jun 7 14:17 /tmp/testdata.capsule
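The shell command above assumes a Unix-like environment. A portable way to confirm that the capsule file was written can be sketched with the standard library; the helper name describe_file is hypothetical and is not part of the library.

```python
from pathlib import Path


def describe_file(path: str) -> str:
    """Return a one-line summary of a file, or a note that it is missing."""
    p = Path(path)
    if not p.is_file():
        return f"{path}: not found"
    size_kib = p.stat().st_size / 1024
    return f"{path}: {size_kib:.1f} KiB"
```

In this notebook the call would be describe_file(file_name), which reports the size of the capsule or states that the file is missing.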
List and create read contexts#
[15]:
sess.list_read_context()
[15]:
[{'name': 'default',
'summary': 'Default read context',
'description': 'The default read context',
'disable_read_logging': False,
'key_cache_ttl': 0,
'read_parameters': [],
'imported': False,
'source_domain_id': None,
'source_domain_name': None}]
[16]:
sess.add_read_context(
    "read_ctx",
    ReadContextBuilder()
    .set_summary("Sample read context")
    .set_description("Sample description")
    .add_required_hook("fast-pii", ">1.0.0")
    .add_read_parameter("key", True, "description"),
)
[17]:
sess.list_read_context()
[17]:
[{'name': 'default',
'summary': 'Default read context',
'description': 'The default read context',
'disable_read_logging': False,
'key_cache_ttl': 0,
'read_parameters': [],
'imported': False,
'source_domain_id': None,
'source_domain_name': None},
{'name': 'read_ctx',
'summary': 'Sample read context',
'description': 'Sample description',
'disable_read_logging': False,
'key_cache_ttl': 0,
'read_parameters': [{'key': 'key',
'required': True,
'description': 'description'}],
'imported': False,
'source_domain_id': None,
'source_domain_name': None}]
Open and read data based on read context#
[18]:
capsule = sess.load_capsule(path=file_name, read_context="read_ctx")
Retrieve the data as a langchain retriever#
[19]:
retriever = capsule.data_as(dt=Datatype.LangchainRetriever)
Retrieve some data from the retriever#
[20]:
retriever._get_relevant_documents(query="Amanda Jordan", run_manager=None)
[20]:
[Document(page_content='{"id": "2", "first_name": "Albert", "last_name": "Freeman", "email": "[email protected]", "gender": "Male", "ip_address": "218.111.175.34", "cc": "", "country": "Canada", "birthdate": "1\\\\/16\\\\/1968", "salary": "150280.17", "title": "Accountant IV", "comments": "Customer feedback: I recently visited your store at 5678 Pine Avenue, Dallas, TX 75201. My name is Jane Doe, age 43. I had a wonderful experience and the staff was very friendly. You can reach out to me at [email protected] for any further details."}'),
Document(page_content='{"id": "1", "first_name": "Amanda", "last_name": "Jordan", "email": "[email protected]", "gender": "Female", "ip_address": "1.197.201.2", "cc": "6759521864920116", "country": "Indonesia", "birthdate": "3\\\\/8\\\\/1971", "salary": "49756.53", "title": "Internal Auditor", "comments": "Hello friends, my name is Alice Johnson and I just turned 29 years old! \\\\ud83c\\\\udf89 I am looking forward to connecting with all of you. Feel free to drop me a line at [email protected] or call me at 415-123-4567."}'),
Document(page_content='{"id": "3", "first_name": "Evelyn", "last_name": "Morgan", "email": "[email protected]", "gender": "Female", "ip_address": "7.161.136.94", "cc": "6767119071901597", "country": "Russia", "birthdate": "2\\\\/1\\\\/1960", "salary": "144972.51", "title": "Structural Engineer", "comments": "Booking Confirmation: Thank you, David Smith (DOB: 01\\\\/12\\\\/1978) for booking with us. We have received your payment through the credit card ending with 1234. Your booking ID is #67890. Please save this email for your records. For any queries, contact us at [email protected]."}')]
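Each row comes back as a JSON string in page_content, ranked by similarity to the query. To make that ranking step concrete, here is a toy retriever. It is a sketch under loud assumptions: the title indicates a FAISS-backed retriever over dense embeddings, whereas this example swaps in plain bag-of-words cosine similarity so that it runs with the standard library only. The function names and the shortened sample rows are hypothetical.

```python
import json
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> Counter:
    # Lowercased word counts; a real retriever would use dense embeddings
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, rows: List[Dict[str, str]], k: int = 3) -> List[Tuple[float, str]]:
    # Serialize each row to JSON, mirroring the page_content format above
    docs = [json.dumps(row) for row in rows]
    q = tokenize(query)
    scored = [(cosine(q, tokenize(doc)), doc) for doc in docs]
    # Highest similarity first; keep only the top k documents
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


rows = [
    {"id": "1", "first_name": "Amanda", "last_name": "Jordan", "title": "Internal Auditor"},
    {"id": "2", "first_name": "Albert", "last_name": "Freeman", "title": "Accountant IV"},
    {"id": "3", "first_name": "Evelyn", "last_name": "Morgan", "title": "Structural Engineer"},
]
top = retrieve("Amanda Jordan", rows, k=3)
```

With only three rows and k=3, every row is returned, as in the notebook output; only the order depends on the similarity score. An embedding-based ranking can order documents differently from this word-overlap ranking.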
Create a GPT-3.5 QA chain and test it with the LangChain retriever#
[21]:
from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain
chatbot = ConversationalRetrievalChain.from_llm(ChatOpenAI(model='gpt-3.5-turbo'), retriever=retriever)
resp = chatbot({'question': "Give me details about Amanda Jordan", 'chat_history': ""})
print(resp["answer"])
/Users/ajay/repos/python-client/.venv/lib/python3.11/site-packages/langchain_core/_api/deprecation.py:119: LangChainDeprecationWarning: The class `ChatOpenAI` was deprecated in LangChain 0.0.10 and will be removed in 0.3.0. An updated version of the class exists in the langchain-openai package and should be used instead. To use it run `pip install -U langchain-openai` and import as `from langchain_openai import ChatOpenAI`.
warn_deprecated(
/Users/ajay/repos/python-client/.venv/lib/python3.11/site-packages/langchain_core/_api/deprecation.py:119: LangChainDeprecationWarning: The method `Chain.__call__` was deprecated in langchain 0.1.0 and will be removed in 0.3.0. Use invoke instead.
warn_deprecated(
Amanda Jordan is a female from Indonesia. Her email is [email protected], and her IP address is 1.197.201.2. She works as an Internal Auditor and holds the title of "Internal Auditor." Amanda was born on March 8, 1971, and her salary is $49,756.53. Additionally, her credit card number is 6759 5218 6492 0116. If you need more specific details, please let me know.
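The answer contains the full credit card number because the retrieved documents are passed to the model as context, so anything readable in the capsule is readable by the model. The sketch below illustrates that step in isolation. It is an assumption-laden illustration, not LangChain's implementation: the template text and the build_prompt helper are hypothetical, and the real chain uses its own prompts.

```python
from typing import List

# Hypothetical template; the real chain ships its own prompt text
PROMPT_TEMPLATE = (
    "Use the following context to answer the question.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def build_prompt(question: str, documents: List[str]) -> str:
    # Place every retrieved document into a single context block
    context = "\n\n".join(documents)
    return PROMPT_TEMPLATE.format(context=context, question=question)


docs = ['{"first_name": "Amanda", "last_name": "Jordan", "cc": "6759521864920116"}']
prompt = build_prompt("Give me details about Amanda Jordan", docs)
```

Because the unredacted cc value is part of the prompt, the model is free to repeat it, which is the behavior the redaction rules in the next section are meant to prevent.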
Add rules to the read context to redact data#
[22]:
# Redact every span tagged as an email address; keep the rule ID for later deletion
email_redaction_rule = sess.add_read_context_rules("read_ctx", rule_builder=ReadContextRuleBuilder().add_match_expression(
    source=Source.Tags,
    key="tag.antimatter.io/pii/email_address",
    operator=Operator.Exists
).set_action(Action.Redact).set_priority(20))
[23]:
# Redact every span tagged as a credit card number
sess.add_read_context_rules("read_ctx", rule_builder=ReadContextRuleBuilder().add_match_expression(
    source=Source.Tags,
    key="tag.antimatter.io/pii/credit_card",
    operator=Operator.Exists
).set_action(Action.Redact).set_priority(30))
[23]:
'rl-ipo2ukxd955hz6eq'
[24]:
sess.describe_read_context("read_ctx")
[24]:
{'name': 'read_ctx',
'summary': 'Sample read context',
'description': 'Sample description',
'disable_read_logging': False,
'key_cache_ttl': 0,
'required_hooks': [{'hook': 'fast-pii',
'constraint': '>1.0.0',
'write_context': None}],
'read_parameters': [{'key': 'key',
'required': True,
'description': 'description'}],
'rules': [{'id': 'rl-rk1q9h43zbwgftdp',
'match_expressions': [{'source': 'tags',
'key': 'tag.antimatter.io/pii/email_address',
'operator': 'Exists',
'values': None,
'value': None}],
'action': 'Redact',
'token_scope': None,
'token_format': None,
'facts': [],
'priority': 20,
'imported': False,
'source_domain_id': None,
'source_domain_name': None},
{'id': 'rl-ipo2ukxd955hz6eq',
'match_expressions': [{'source': 'tags',
'key': 'tag.antimatter.io/pii/credit_card',
'operator': 'Exists',
'values': None,
'value': None}],
'action': 'Redact',
'token_scope': None,
'token_format': None,
'facts': [],
'priority': 30,
'imported': False,
'source_domain_id': None,
'source_domain_name': None}],
'imported': False,
'source_domain_id': None,
'source_domain_name': None,
'policy_assembly': None}
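The description above lists each rule as a match expression on a tag, an action, and a priority. A toy model can make that structure easier to reason about. Everything in it is an assumption made for illustration and not the library's policy engine: rules are evaluated in ascending priority order with the first match deciding the action, whole fields are redacted rather than tagged spans inside a field, and the Rule class, apply_rules function, and placeholder email value are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List

REDACTED = "{redacted}"


@dataclass
class Rule:
    tag_key: str   # tag that must exist on a field for the rule to match
    action: str    # only "Redact" changes the value in this toy model
    priority: int  # assumption: lower numbers are evaluated first


def apply_rules(
    record: Dict[str, str],
    field_tags: Dict[str, List[str]],
    rules: List[Rule],
) -> Dict[str, str]:
    ordered = sorted(rules, key=lambda r: r.priority)
    out = {}
    for name, value in record.items():
        tags = field_tags.get(name, [])
        # First rule whose tag exists on this field decides the outcome
        match = next((r for r in ordered if r.tag_key in tags), None)
        redact = match is not None and match.action == "Redact"
        out[name] = REDACTED if redact else value
    return out


rules = [
    Rule("tag.antimatter.io/pii/email_address", "Redact", 20),
    Rule("tag.antimatter.io/pii/credit_card", "Redact", 30),
]
# Placeholder values; the tags stand in for what a classifier hook would attach
record = {"first_name": "Amanda", "email": "amanda@example.com", "cc": "6759521864920116"}
field_tags = {
    "email": ["tag.antimatter.io/pii/email_address"],
    "cc": ["tag.antimatter.io/pii/credit_card"],
}
redacted = apply_rules(record, field_tags, rules)
```

Dropping the first rule from the list leaves the email value readable while cc stays redacted, which mirrors the effect of deleting a rule from a read context.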
Materialize the data with the new rules for redaction#
[25]:
capsule = sess.load_capsule(path=file_name, read_context="read_ctx")
[26]:
df = capsule.data_as(dt=Datatype.PandasDataframe)
[27]:
df
[27]:
 | id | first_name | last_name | email | gender | ip_address | cc | country | birthdate | salary | title | comments |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Amanda | Jordan | {redacted} | Female | 1.197.201.2 | {redacted} | Indonesia | 3\/8\/1971 | 49756.53 | Internal Auditor | Hello friends, my name is Alice Johnson and I ... |
1 | 2 | Albert | Freeman | {redacted} | Male | 218.111.175.34 | | Canada | 1\/16\/1968 | 150280.17 | Accountant IV | Customer feedback: I recently visited your sto... |
2 | 3 | Evelyn | Morgan | {redacted} | Female | 7.161.136.94 | {redacted} | Russia | 2\/1\/1960 | 144972.51 | Structural Engineer | Booking Confirmation: Thank you, David Smith (... |
Use RAG QA with the redacted read context and its materialized retriever#
[28]:
retriever = capsule.data_as(dt=Datatype.LangchainRetriever)
retriever._get_relevant_documents(query="Amanda Jordan", run_manager=None)
chatbot = ConversationalRetrievalChain.from_llm(ChatOpenAI(model='gpt-3.5-turbo'), retriever=retriever)
[29]:
resp = chatbot({'question': "Give me details about Amanda Jordan", 'chat_history': ""})
print(resp["answer"])
Amanda Jordan is a female from Indonesia. She was born on March 8, 1971. Her job title is Internal Auditor, and her salary is $49,756.53. Unfortunately, her email address and credit card information are redacted for privacy reasons. If you need further details, you can contact her at the provided phone number: 415-123-4567.
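The answer still surfaces the phone number 415-123-4567, which appears in the free-text comments field, because the two rules target only email address and credit card tags. A quick way to spot such leftovers in materialized rows is a pattern scan. This is a rough sketch with hypothetical names (PATTERNS, find_residual_pii) and deliberately simple regular expressions; it is no substitute for a real classifier, and the sample row is abbreviated.

```python
import re
from typing import Dict, List

# Simple illustrative patterns; real PII detection needs far more care
PATTERNS = {
    "phone": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "card": re.compile(r"\b\d{16}\b"),
}


def find_residual_pii(record: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each field name to the kinds of PII-like patterns found in it."""
    findings = {}
    for field_name, value in record.items():
        hits = [kind for kind, pattern in PATTERNS.items() if pattern.search(str(value))]
        if hits:
            findings[field_name] = hits
    return findings


row = {
    "email": "{redacted}",
    "cc": "{redacted}",
    "comments": "Feel free to drop me a line or call me at 415-123-4567.",
}
leftovers = find_residual_pii(row)
```

Here leftovers flags the comments field for a phone-like pattern, suggesting that an additional rule for the relevant tag may be needed if phone numbers should not reach the model.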
Remove the email redaction rule#
[30]:
sess.delete_read_context_rule('read_ctx', email_redaction_rule)
[31]:
sess.describe_read_context("read_ctx")
[31]:
{'name': 'read_ctx',
'summary': 'Sample read context',
'description': 'Sample description',
'disable_read_logging': False,
'key_cache_ttl': 0,
'required_hooks': [{'hook': 'fast-pii',
'constraint': '>1.0.0',
'write_context': None}],
'read_parameters': [{'key': 'key',
'required': True,
'description': 'description'}],
'rules': [{'id': 'rl-ipo2ukxd955hz6eq',
'match_expressions': [{'source': 'tags',
'key': 'tag.antimatter.io/pii/credit_card',
'operator': 'Exists',
'values': None,
'value': None}],
'action': 'Redact',
'token_scope': None,
'token_format': None,
'facts': [],
'priority': 30,
'imported': False,
'source_domain_id': None,
'source_domain_name': None}],
'imported': False,
'source_domain_id': None,
'source_domain_name': None,
'policy_assembly': None}
Read the data with the remaining redaction rule#
[32]:
capsule = sess.load_capsule(path=file_name, read_context="read_ctx")
[33]:
retriever = capsule.data_as(dt=Datatype.LangchainRetriever)
retriever._get_relevant_documents(query="Amanda Jordan", run_manager=None)
[33]:
[Document(page_content='{"id": "2", "first_name": "Albert", "last_name": "Freeman", "email": "[email protected]", "gender": "Male", "ip_address": "218.111.175.34", "cc": "", "country": "Canada", "birthdate": "1\\\\/16\\\\/1968", "salary": "150280.17", "title": "Accountant IV", "comments": "Customer feedback: I recently visited your store at 5678 Pine Avenue, Dallas, TX 75201. My name is Jane Doe, age 43. I had a wonderful experience and the staff was very friendly. You can reach out to me at [email protected] for any further details."}'),
Document(page_content='{"id": "1", "first_name": "Amanda", "last_name": "Jordan", "email": "[email protected]", "gender": "Female", "ip_address": "1.197.201.2", "cc": "{redacted}", "country": "Indonesia", "birthdate": "3\\\\/8\\\\/1971", "salary": "49756.53", "title": "Internal Auditor", "comments": "Hello friends, my name is Alice Johnson and I just turned 29 years old! \\\\ud83c\\\\udf89 I am looking forward to connecting with all of you. Feel free to drop me a line at [email protected] or call me at 415-123-4567."}'),
Document(page_content='{"id": "3", "first_name": "Evelyn", "last_name": "Morgan", "email": "[email protected]", "gender": "Female", "ip_address": "7.161.136.94", "cc": "{redacted}", "country": "Russia", "birthdate": "2\\\\/1\\\\/1960", "salary": "144972.51", "title": "Structural Engineer", "comments": "Booking Confirmation: Thank you, David Smith (DOB: 01\\\\/12\\\\/1978) for booking with us. We have received your payment through the credit card ending with {redacted}. Your booking ID is #67890. Please save this email for your records. For any queries, contact us at [email protected]."}')]
[34]:
chatbot = ConversationalRetrievalChain.from_llm(ChatOpenAI(model='gpt-3.5-turbo'), retriever=retriever)
resp = chatbot({'question': "Give me details about Amanda Jordan", 'chat_history': ""})
print(resp["answer"])
Amanda Jordan is a female born on March 8, 1971, in Indonesia. She works as an Internal Auditor with a salary of $49,756.53. Her email address is [email protected]. If you need more specific information, feel free to ask.