An End-to-End RAG Example Using a FAISS Retriever with LangChain and OpenAI GPT-3.5 for QA#
This notebook walks through an end-to-end example of the library's functionality. It encapsulates a small dataset, exposes it as a LangChain retriever, and uses that retriever in a RAG (Retrieval-Augmented Generation) question-answering chain powered by GPT-3.5. It then shows how redaction rules on a read context change what the model can see.
[ ]:
!pip install "antimatter[langchain]"
!pip install python-dotenv openai
Load the OpenAI key from a .env file#
[1]:
import dotenv
import os
# expanduser resolves the home directory even when HOME is not set
dotenv.load_dotenv(os.path.expanduser("~/.openai_env"))
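Before going further, it can help to confirm that the key was actually loaded. The following is a minimal sketch, assuming the .env file defines `OPENAI_API_KEY` (the variable the OpenAI client reads by default); the helper name `require_env` is hypothetical and not part of any library used here.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; check your .env file")
    return value


# Usage, once the .env file has been loaded (assumed variable name):
# api_key = require_env("OPENAI_API_KEY")
```

Failing early here avoids a less obvious authentication error later, when the chat model is first called.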
Register a domain and create a read/write context#
[2]:
import os
from antimatter import new_domain, Session
from antimatter.builders import *
from antimatter.datatype.datatypes import Datatype
[3]:
# Either create a new domain or use an existing one
if True:
    sess = new_domain("[email protected]")
    print("domain: %s" % (sess.domain_id))
    # print(f"sess = Session(domain='{sess.domain_id}', api_key='{sess.api_key}')")
else:
    sess = Session(domain='<domain_id>', api_key='<api_key>')
file_name = "/tmp/testdata.capsule"
domain: dm-SCzRQLF64gW
Add some facts to this domain#
Create a fact type called is_project_member with the attributes email and project. Add two facts to this type:
- is_project_member(email="test@test.com", project="project1")
- is_project_member(email="test2@test2.com", project="project2")
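Conceptually, a fact type names a relation and its arguments, and each fact is one tuple of values for those arguments. The sketch below models that idea in plain Python; `FactStore` and its methods are hypothetical and only mirror the shape of the calls in this notebook, not the library's implementation.

```python
from collections import defaultdict


class FactStore:
    """Toy in-memory model of fact types and facts (illustration only)."""

    def __init__(self):
        self._types = {}                 # fact type name -> argument names
        self._facts = defaultdict(list)  # fact type name -> list of value tuples

    def add_fact_type(self, name, arguments):
        self._types[name] = tuple(arguments)

    def add_fact(self, name, *values):
        expected = self._types[name]
        if len(values) != len(expected):
            raise ValueError(f"{name} expects {len(expected)} arguments")
        self._facts[name].append(values)

    def list_facts(self, name):
        # Pair each stored value with its argument name
        return [dict(zip(self._types[name], values)) for values in self._facts[name]]


store = FactStore()
store.add_fact_type("is_project_member", ["email", "project"])
store.add_fact("is_project_member", "test@test.com", "project1")
store.add_fact("is_project_member", "test2@test2.com", "project2")
print(store.list_facts("is_project_member"))
```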
[4]:
# Declare the fact type and the meaning of each argument
sess.add_fact_type(
    "is_project_member",
    description="Team membership",
    arguments={"email": "email of the member", "project": "name of the project"},
)
# Add one fact per (email, project) pair
sess.add_fact(
    "is_project_member",
    "test@test.com",
    "project1",
)
sess.add_fact(
    "is_project_member",
    "test2@test2.com",
    "project2",
)
[4]:
{'id': 'ft-xhgmwksnxqk4nn1h',
 'name': 'is_project_member',
 'arguments': ['test2@test2.com', 'project2']}
[5]:
sess.list_facts('is_project_member')
[5]:
[{'id': 'ft-fmo66g9aj2vcan4d',
  'name': 'is_project_member',
  'arguments': ['test@test.com', 'project1']},
 {'id': 'ft-xhgmwksnxqk4nn1h',
  'name': 'is_project_member',
  'arguments': ['test2@test2.com', 'project2']}]
Open a dataset#
[6]:
# Load dataset
import pandas as pd
data = [
{"id":1,"first_name":"Amanda","last_name":"Jordan","email":"[email protected]","gender":"Female","ip_address":"1.197.201.2","cc":"6759521864920116","country":"Indonesia","birthdate":"3\\/8\\/1971","salary":49756.53,"title":"Internal Auditor","comments":"Hello friends, my name is Alice Johnson and I just turned 29 years old! \\ud83c\\udf89 I am looking forward to connecting with all of you. Feel free to drop me a line at [email protected] or call me at 415-123-4567."},
{"id":2,"first_name":"Albert","last_name":"Freeman","email":"[email protected]","gender":"Male","ip_address":"218.111.175.34","cc":"","country":"Canada","birthdate":"1\\/16\\/1968","salary":150280.17,"title":"Accountant IV","comments":"Customer feedback: I recently visited your store at 5678 Pine Avenue, Dallas, TX 75201. My name is Jane Doe, age 43. I had a wonderful experience and the staff was very friendly. You can reach out to me at [email protected] for any further details."},
{"id":3,"first_name":"Evelyn","last_name":"Morgan","email":"[email protected]","gender":"Female","ip_address":"7.161.136.94","cc":"6767119071901597","country":"Russia","birthdate":"2\\/1\\/1960","salary":144972.51,"title":"Structural Engineer","comments":"Booking Confirmation: Thank you, David Smith (DOB: 01\\/12\\/1978) for booking with us. We have received your payment through the credit card ending with 1234. Your booking ID is #67890. Please save this email for your records. For any queries, contact us at [email protected]."},
]
df = pd.DataFrame(data)
df.head()
[6]:
| | id | first_name | last_name | email | gender | ip_address | cc | country | birthdate | salary | title | comments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Amanda | Jordan | [email protected] | Female | 1.197.201.2 | 6759521864920116 | Indonesia | 3\/8\/1971 | 49756.53 | Internal Auditor | Hello friends, my name is Alice Johnson and I ... |
| 1 | 2 | Albert | Freeman | [email protected] | Male | 218.111.175.34 | | Canada | 1\/16\/1968 | 150280.17 | Accountant IV | Customer feedback: I recently visited your sto... |
| 2 | 3 | Evelyn | Morgan | [email protected] | Female | 7.161.136.94 | 6767119071901597 | Russia | 2\/1\/1960 | 144972.51 | Structural Engineer | Booking Confirmation: Thank you, David Smith (... |
List and create write contexts#
[7]:
sess.list_write_context()
[7]:
[{'name': 'default',
'summary': 'Default write context',
'description': 'No classification of encapsulated data',
'config': {'key_reuse_ttl': 0, 'required_hooks': []},
'imported': False,
'source_domain_id': None,
'source_domain_name': None},
{'name': 'sensitive',
'summary': 'Default write context (sensitive data)',
'description': 'Classifies data using the fast-pii and data structure classifiers',
'config': {'key_reuse_ttl': 0,
'required_hooks': [{'hook': 'data-structure-classifier',
'constraint': '>1.0.0',
'mode': 'sync'},
{'hook': 'fast-pii', 'constraint': '>1.0.0', 'mode': 'sync'}]},
'imported': False,
'source_domain_id': None,
'source_domain_name': None}]
[8]:
# Create a new write context that runs the fast-pii classifier synchronously
sess.add_write_context(
    "write_ctx",
    WriteContextBuilder()
    .set_summary("Sample write context")
    .set_description("Sample description")
    .add_hook("fast-pii", ">1.0.0", WriteContextHookMode.Sync),
)
[9]:
sess.list_write_context()
[9]:
[{'name': 'default',
'summary': 'Default write context',
'description': 'No classification of encapsulated data',
'config': {'key_reuse_ttl': 0, 'required_hooks': []},
'imported': False,
'source_domain_id': None,
'source_domain_name': None},
{'name': 'sensitive',
'summary': 'Default write context (sensitive data)',
'description': 'Classifies data using the fast-pii and data structure classifiers',
'config': {'key_reuse_ttl': 0,
'required_hooks': [{'hook': 'data-structure-classifier',
'constraint': '>1.0.0',
'mode': 'sync'},
{'hook': 'fast-pii', 'constraint': '>1.0.0', 'mode': 'sync'}]},
'imported': False,
'source_domain_id': None,
'source_domain_name': None},
{'name': 'write_ctx',
'summary': 'Sample write context',
'description': 'Sample description',
'config': {'key_reuse_ttl': 0,
'required_hooks': [{'hook': 'fast-pii',
'constraint': '>1.0.0',
'mode': 'sync'}]},
'imported': False,
'source_domain_id': None,
'source_domain_name': None}]
Encapsulate data using the write context#
[10]:
df_capsule = sess.encapsulate(data=df, write_context="write_ctx", path=file_name)
[11]:
!ls -lrtha /tmp/testdata.capsule
-rw-r--r-- 1 ajay wheel 4.9K Apr 30 14:54 /tmp/testdata.capsule
List and create read contexts#
[12]:
sess.list_read_context()
[12]:
[{'name': 'default',
'summary': 'Default read context',
'description': 'The default read context',
'disable_read_logging': False,
'key_cache_ttl': 0,
'read_parameters': [],
'imported': False,
'source_domain_id': None,
'source_domain_name': None}]
[13]:
sess.add_read_context(
    "read_ctx",
    ReadContextBuilder()
    .set_summary("Sample read context")
    .set_description("Sample description")
    .add_required_hook("fast-pii", ">1.0.0")
    .add_read_parameter("key", True, "description"),
)
[14]:
sess.list_read_context()
[14]:
[{'name': 'default',
'summary': 'Default read context',
'description': 'The default read context',
'disable_read_logging': False,
'key_cache_ttl': 0,
'read_parameters': [],
'imported': False,
'source_domain_id': None,
'source_domain_name': None},
{'name': 'read_ctx',
'summary': 'Sample read context',
'description': 'Sample description',
'disable_read_logging': False,
'key_cache_ttl': 0,
'read_parameters': [{'key': 'key',
'required': True,
'description': 'description'}],
'imported': False,
'source_domain_id': None,
'source_domain_name': None}]
Open and read data based on read context#
[15]:
capsule = sess.load_capsule(path=file_name, read_context="read_ctx")
Retrieve the data as a langchain retriever#
[16]:
retriever = capsule.data_as(dt=Datatype.LangchainRetriever)
/Users/ajay/repos/antimatter/.venv/lib/python3.12/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
from .autonotebook import tqdm as notebook_tqdm
Retrieve some data from the retriever#
[17]:
# Query through the public retriever interface instead of the private _get_relevant_documents method
retriever.invoke("Amanda Jordan")
/Users/ajay/repos/antimatter/.venv/lib/python3.12/site-packages/langchain_core/_api/deprecation.py:119: LangChainDeprecationWarning: The method `BaseRetriever.get_relevant_documents` was deprecated in langchain-core 0.1.46 and will be removed in 0.3.0. Use invoke instead.
warn_deprecated(
[17]:
[Document(page_content="{'id': '2', 'first_name': 'Albert', 'last_name': 'Freeman', 'email': '[email protected]', 'gender': 'Male', 'ip_address': '218.111.175.34', 'cc': '', 'country': 'Canada', 'birthdate': '1\\\\/16\\\\/1968', 'salary': '150280.17', 'title': 'Accountant IV', 'comments': 'Customer feedback: I recently visited your store at 5678 Pine Avenue, Dallas, TX 75201. My name is Jane Doe, age 43. I had a wonderful experience and the staff was very friendly. You can reach out to me at [email protected] for any further details.'}"),
Document(page_content="{'id': '1', 'first_name': 'Amanda', 'last_name': 'Jordan', 'email': '[email protected]', 'gender': 'Female', 'ip_address': '1.197.201.2', 'cc': '6759521864920116', 'country': 'Indonesia', 'birthdate': '3\\\\/8\\\\/1971', 'salary': '49756.53', 'title': 'Internal Auditor', 'comments': 'Hello friends, my name is Alice Johnson and I just turned 29 years old! \\\\ud83c\\\\udf89 I am looking forward to connecting with all of you. Feel free to drop me a line at [email protected] or call me at 415-123-4567.'}"),
Document(page_content="{'id': '3', 'first_name': 'Evelyn', 'last_name': 'Morgan', 'email': '[email protected]', 'gender': 'Female', 'ip_address': '7.161.136.94', 'cc': '6767119071901597', 'country': 'Russia', 'birthdate': '2\\\\/1\\\\/1960', 'salary': '144972.51', 'title': 'Structural Engineer', 'comments': 'Booking Confirmation: Thank you, David Smith (DOB: 01\\\\/12\\\\/1978) for booking with us. We have received your payment through the credit card ending with 1234. Your booking ID is #67890. Please save this email for your records. For any queries, contact us at [email protected].'}")]
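In the output above, each document's `page_content` is the string form of a Python dict. The following is a minimal sketch, assuming that format holds, for turning such a string back into a dict with the standard-library `ast.literal_eval`; the helper name `parse_page_content` is hypothetical and the sample string is abbreviated for illustration.

```python
import ast


def parse_page_content(page_content: str) -> dict:
    """Parse a dict-formatted page_content string without executing code."""
    record = ast.literal_eval(page_content)
    if not isinstance(record, dict):
        raise ValueError("page_content did not contain a dict")
    return record


# Abbreviated sample in the same shape as the documents shown above
sample = "{'id': '1', 'first_name': 'Amanda', 'last_name': 'Jordan', 'country': 'Indonesia'}"
record = parse_page_content(sample)
print(record["first_name"], record["country"])
```

`ast.literal_eval` only accepts Python literals, so it is a safer choice than `eval` for text that comes back from a retriever.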
Create a GPT-3.5 QA chain and test it with the LangChain retriever#
[18]:
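from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain
# Build a conversational QA chain that answers from the documents the retriever returns
chatbot = ConversationalRetrievalChain.from_llm(ChatOpenAI(model='gpt-3.5-turbo'), retriever=retriever)
# chat_history is a list of earlier turns; it is empty for the first question
resp = chatbot({'question': "Give me details about Amanda Jordan", 'chat_history': []})
print(resp["answer"])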
/Users/ajay/repos/antimatter/.venv/lib/python3.12/site-packages/langchain_core/_api/deprecation.py:119: LangChainDeprecationWarning: The class `ChatOpenAI` was deprecated in LangChain 0.0.10 and will be removed in 0.2.0. An updated version of the class exists in the langchain-openai package and should be used instead. To use it run `pip install -U langchain-openai` and import as `from langchain_openai import ChatOpenAI`.
warn_deprecated(
/Users/ajay/repos/antimatter/.venv/lib/python3.12/site-packages/langchain_core/_api/deprecation.py:119: LangChainDeprecationWarning: The method `Chain.__call__` was deprecated in langchain 0.1.0 and will be removed in 0.2.0. Use invoke instead.
warn_deprecated(
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Amanda Jordan is a female from Indonesia. Her email is [email protected], and her IP address is 1.197.201.2. She holds the title of Internal Auditor, with a salary of 49756.53. Her credit card number is 6759521864920116, and her birthdate is March 8, 1971. If you need further details, feel free to ask.
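Conceptually, the chain retrieves documents for the question and then hands them to the model as context, which is why the answer above can quote the credit card number. The sketch below illustrates that step in plain Python; the prompt wording and the `build_prompt` helper are hypothetical and are not LangChain's actual template.

```python
def build_prompt(question: str, documents: list[str]) -> str:
    """Combine retrieved documents and a question into one prompt string."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Abbreviated document in the same shape as the retriever output
docs = ["{'first_name': 'Amanda', 'last_name': 'Jordan', 'country': 'Indonesia'}"]
prompt = build_prompt("Give me details about Amanda Jordan", docs)
print(prompt)
```

Whatever the retriever returns ends up in the prompt, so any field that should stay hidden from the model has to be removed before retrieval.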
Add rules to the read context to redact data#
[19]:
# Redact any value tagged as an email address; keep the rule ID so it can be deleted later
email_redaction_rule = sess.add_read_context_rules(
    "read_ctx",
    rule_builder=ReadContextRuleBuilder().add_match_expression(
        source=Source.Tags,
        key="tag.antimatter.io/pii/email_address",
        operator=Operator.Exists,
    ).set_action(Action.Redact).set_priority(20),
)
[20]:
# Redact any value tagged as a credit card number
sess.add_read_context_rules(
    "read_ctx",
    rule_builder=ReadContextRuleBuilder().add_match_expression(
        source=Source.Tags,
        key="tag.antimatter.io/pii/credit_card",
        operator=Operator.Exists,
    ).set_action(Action.Redact).set_priority(30),
)
[20]:
'rl-6kqgudmqqjlj8pe4'
[21]:
sess.describe_read_context("read_ctx")
[21]:
{'name': 'read_ctx',
'summary': 'Sample read context',
'description': 'Sample description',
'disable_read_logging': False,
'key_cache_ttl': 0,
'required_hooks': [{'hook': 'fast-pii',
'constraint': '>1.0.0',
'write_context': None}],
'read_parameters': [{'key': 'key',
'required': True,
'description': 'description'}],
'rules': [{'id': 'rl-lzcgcih08qq6dmxs',
'match_expressions': [{'source': 'tags',
'key': 'tag.antimatter.io/pii/email_address',
'operator': 'Exists',
'values': None,
'value': None}],
'action': 'Redact',
'token_scope': None,
'token_format': None,
'facts': [],
'priority': 20,
'imported': False,
'source_domain_id': None,
'source_domain_name': None},
{'id': 'rl-6kqgudmqqjlj8pe4',
'match_expressions': [{'source': 'tags',
'key': 'tag.antimatter.io/pii/credit_card',
'operator': 'Exists',
'values': None,
'value': None}],
'action': 'Redact',
'token_scope': None,
'token_format': None,
'facts': [],
'priority': 30,
'imported': False,
'source_domain_id': None,
'source_domain_name': None}],
'imported': False,
'source_domain_id': None,
'source_domain_name': None}
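The two rules above each match on the existence of a tag and apply the Redact action. The sketch below models that behaviour in plain Python as an illustration only. How the library resolves priorities is not stated in this notebook, so checking rules in ascending priority order is an assumption, and `RedactRule` and `apply_rules` are hypothetical names.

```python
from dataclasses import dataclass

REDACTED = "{redacted}"


@dataclass(frozen=True)
class RedactRule:
    tag_key: str   # tag that must exist on the value for the rule to match
    priority: int  # ordering semantics are assumed, not taken from the library


def apply_rules(value: str, tags: set[str], rules: list[RedactRule]) -> str:
    """Return the redaction placeholder if any rule's tag is present on the value."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if rule.tag_key in tags:
            return REDACTED
    return value


rules = [
    RedactRule("tag.antimatter.io/pii/email_address", priority=20),
    RedactRule("tag.antimatter.io/pii/credit_card", priority=30),
]

# A value tagged as a credit card is replaced; an untagged value passes through
print(apply_rules("6759521864920116", {"tag.antimatter.io/pii/credit_card"}, rules))
print(apply_rules("Indonesia", set(), rules))
```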
Materialize the data with the new rules for redaction#
[22]:
capsule = sess.load_capsule(path=file_name, read_context="read_ctx")
[23]:
df = capsule.data_as(dt=Datatype.PandasDataframe)
[24]:
df
[24]:
| | id | first_name | last_name | email | gender | ip_address | cc | country | birthdate | salary | title | comments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Amanda | Jordan | {redacted} | Female | 1.197.201.2 | {redacted} | Indonesia | 3\/8\/1971 | 49756.53 | Internal Auditor | Hello friends, my name is Alice Johnson and I ... |
| 1 | 2 | Albert | Freeman | {redacted} | Male | 218.111.175.34 | | Canada | 1\/16\/1968 | 150280.17 | Accountant IV | Customer feedback: I recently visited your sto... |
| 2 | 3 | Evelyn | Morgan | {redacted} | Female | 7.161.136.94 | {redacted} | Russia | 2\/1\/1960 | 144972.51 | Structural Engineer | Booking Confirmation: Thank you, David Smith (... |
Use RAG QA with the redacted read context and its materialized retriever#
[25]:
# Rebuild the retriever from the capsule loaded under the redacting read context
retriever = capsule.data_as(dt=Datatype.LangchainRetriever)
retriever.invoke("Amanda Jordan")
chatbot = ConversationalRetrievalChain.from_llm(ChatOpenAI(model='gpt-3.5-turbo'), retriever=retriever)
[26]:
resp = chatbot({'question': "Give me details about Amanda Jordan", 'chat_history': []})
print(resp["answer"])
Amanda Jordan is a female from Indonesia. She was born on March 8, 1971. Amanda works as an Internal Auditor and her salary is $49,756.53. Unfortunately, I don't have access to her full email address or credit card details.
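A quick way to sanity-check redaction is to confirm that known sensitive values do not appear in the model's answer. The following is a minimal sketch using the answer text and the credit card number from this notebook; the helper name `find_leaks` is hypothetical.

```python
def find_leaks(text: str, sensitive_values: list[str]) -> list[str]:
    """Return the sensitive values that appear verbatim in the text."""
    return [value for value in sensitive_values if value and value in text]


answer = (
    "Amanda Jordan is a female from Indonesia. She was born on March 8, 1971. "
    "Amanda works as an Internal Auditor and her salary is $49,756.53. "
    "Unfortunately, I don't have access to her full email address or credit card details."
)

leaks = find_leaks(answer, ["6759521864920116"])
print(leaks)  # → []
```

A verbatim substring check only catches exact copies. It would miss a value that the model reformats, so treat it as a smoke test rather than a guarantee.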
Remove the email redaction rule#
[27]:
sess.delete_read_context_rule('read_ctx', email_redaction_rule)
[28]:
sess.describe_read_context("read_ctx")
[28]:
{'name': 'read_ctx',
'summary': 'Sample read context',
'description': 'Sample description',
'disable_read_logging': False,
'key_cache_ttl': 0,
'required_hooks': [{'hook': 'fast-pii',
'constraint': '>1.0.0',
'write_context': None}],
'read_parameters': [{'key': 'key',
'required': True,
'description': 'description'}],
'rules': [{'id': 'rl-6kqgudmqqjlj8pe4',
'match_expressions': [{'source': 'tags',
'key': 'tag.antimatter.io/pii/credit_card',
'operator': 'Exists',
'values': None,
'value': None}],
'action': 'Redact',
'token_scope': None,
'token_format': None,
'facts': [],
'priority': 30,
'imported': False,
'source_domain_id': None,
'source_domain_name': None}],
'imported': False,
'source_domain_id': None,
'source_domain_name': None}
Read the data with the new redaction rule#
[29]:
capsule = sess.load_capsule(path=file_name, read_context="read_ctx")
[30]:
retriever = capsule.data_as(dt=Datatype.LangchainRetriever)
retriever.invoke("Amanda Jordan")
[30]:
[Document(page_content="{'id': '2', 'first_name': 'Albert', 'last_name': 'Freeman', 'email': '[email protected]', 'gender': 'Male', 'ip_address': '218.111.175.34', 'cc': '', 'country': 'Canada', 'birthdate': '1\\\\/16\\\\/1968', 'salary': '150280.17', 'title': 'Accountant IV', 'comments': 'Customer feedback: I recently visited your store at 5678 Pine Avenue, Dallas, TX 75201. My name is Jane Doe, age 43. I had a wonderful experience and the staff was very friendly. You can reach out to me at [email protected] for any further details.'}"),
Document(page_content="{'id': '1', 'first_name': 'Amanda', 'last_name': 'Jordan', 'email': '[email protected]', 'gender': 'Female', 'ip_address': '1.197.201.2', 'cc': '{redacted}', 'country': 'Indonesia', 'birthdate': '3\\\\/8\\\\/1971', 'salary': '49756.53', 'title': 'Internal Auditor', 'comments': 'Hello friends, my name is Alice Johnson and I just turned 29 years old! \\\\ud83c\\\\udf89 I am looking forward to connecting with all of you. Feel free to drop me a line at [email protected] or call me at 415-123-4567.'}"),
Document(page_content="{'id': '3', 'first_name': 'Evelyn', 'last_name': 'Morgan', 'email': '[email protected]', 'gender': 'Female', 'ip_address': '7.161.136.94', 'cc': '{redacted}', 'country': 'Russia', 'birthdate': '2\\\\/1\\\\/1960', 'salary': '144972.51', 'title': 'Structural Engineer', 'comments': 'Booking Confirmation: Thank you, David Smith (DOB: 01\\\\/12\\\\/1978) for booking with us. We have received your payment through the credit card ending with {redacted}. Your booking ID is #67890. Please save this email for your records. For any queries, contact us at [email protected].'}")]
[31]:
chatbot = ConversationalRetrievalChain.from_llm(ChatOpenAI(model='gpt-3.5-turbo'), retriever=retriever)
resp = chatbot({'question': "Give me details about Amanda Jordan", 'chat_history': []})
print(resp["answer"])
Amanda Jordan is a female from Indonesia. Her email is [email protected], her IP address is 1.197.201.2, and her birthdate is March 8, 1971. She works as an Internal Auditor with a salary of $49,756.53. If you need more specific information, please let me know.