An end-to-end RAG example: QA with a FAISS retriever, LangChain, and OpenAI GPT-3.5#

This notebook walks through an end-to-end example of the library. It encapsulates a small dataset, exposes it as a LangChain retriever, and uses a RAG (Retrieval-Augmented Generation) chain backed by GPT-3.5 to answer questions about it. It then adds redaction rules to the read context and shows how the retrieved documents and the generated answers change.

[ ]:
!pip install "antimatter[langchain]"
!pip install python-dotenv openai

Load the OpenAI key from a .env file#

[3]:
import dotenv
import os
# expanduser resolves the home directory even when HOME is not set
dotenv.load_dotenv(os.path.expanduser("~/.openai_env"))

[3]:
True
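
The chat model used later needs an OpenAI key in the environment. As a minimal sketch, the check below fails early when the key is missing. `require_env` is a hypothetical helper, not part of the library, and `OPENAI_API_KEY` is assumed to be the variable name stored in the `.openai_env` file.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or blank."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; check the .env file that was loaded")
    return value

# In the notebook (assumed variable name): require_env("OPENAI_API_KEY")
```

Calling it right after `load_dotenv` surfaces a missing key before any capsule or chain is built.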

Register a domain and create a read/write context#

[4]:
import os
from antimatter import new_domain, Session
from antimatter.builders import *
from antimatter.datatype.datatypes import Datatype
[5]:
# Either create a new domain or use an existing one
if True:
    sess = new_domain("[email protected]")
    print(f"domain: {sess.domain_id}")
    # print(f"sess = Session.from_api_key(domain_id='{sess.domain_id}', api_key='{sess.api_key}')")
else:
    sess = Session.from_api_key(domain_id='<domain_id>', api_key='<api_key>')

file_name = "/tmp/testdata.capsule"
domain: dm-8gQzMfrtE5N

Add some facts to this domain#

Create a fact type called is_project_member with the arguments email and project, then add two facts of this type:
- is_project_member(email="test@test.com", project="project1")
- is_project_member(email="test2@test2.com", project="project2")

[7]:
sess.add_fact_type(
    "is_project_member",
    description="Team membership",
    arguments={"email": "email of the member", "project": "name of the project"},
)

sess.add_fact(
    "is_project_member",
    "test@test.com",
    "project1",
)

sess.add_fact(
    "is_project_member",
    "test2@test2.com",
    "project2",
)
[7]:
{'id': 'ft-48kn7aqfm31sq6li',
 'name': 'is_project_member',
 'arguments': ['test2@test2.com', 'project2']}
[8]:
sess.list_facts('is_project_member')
[8]:
[{'id': 'ft-48kn7aqfm31sq6li',
  'name': 'is_project_member',
  'arguments': ['test2@test2.com', 'project2']},
 {'id': 'ft-fczggvwqhg9jyqfv',
  'name': 'is_project_member',
  'arguments': ['test@test.com', 'project1']}]
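
To make the fact model concrete, here is a toy in-memory sketch of typed facts with a fixed argument list. `FactStore` and its methods are hypothetical and purely illustrative. They do not reflect how the library stores or evaluates facts.

```python
from dataclasses import dataclass, field


@dataclass
class FactStore:
    """Toy model: a fact type names its arguments; a fact is a tuple of values."""
    types: dict = field(default_factory=dict)  # type name -> ordered argument names
    facts: list = field(default_factory=list)  # (type name, argument tuple)

    def add_fact_type(self, name, arguments):
        self.types[name] = list(arguments)

    def add_fact(self, name, *args):
        expected = self.types[name]
        if len(args) != len(expected):
            raise ValueError(f"{name} expects {len(expected)} arguments, got {len(args)}")
        self.facts.append((name, tuple(args)))

    def holds(self, name, *args):
        """True if exactly this fact was added."""
        return (name, tuple(args)) in self.facts


store = FactStore()
store.add_fact_type("is_project_member", ["email", "project"])
store.add_fact("is_project_member", "test@test.com", "project1")
store.add_fact("is_project_member", "test2@test2.com", "project2")
```

With this model, `store.holds("is_project_member", "test@test.com", "project1")` is true, while the same email paired with `"project2"` is not.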

Open a dataset#

[9]:
# Load dataset
import pandas as pd

data = [
    {"id":1,"first_name":"Amanda","last_name":"Jordan","email":"[email protected]","gender":"Female","ip_address":"1.197.201.2","cc":"6759521864920116","country":"Indonesia","birthdate":"3\\/8\\/1971","salary":49756.53,"title":"Internal Auditor","comments":"Hello friends, my name is Alice Johnson and I just turned 29 years old! \\ud83c\\udf89 I am looking forward to connecting with all of you. Feel free to drop me a line at [email protected] or call me at 415-123-4567."},
    {"id":2,"first_name":"Albert","last_name":"Freeman","email":"[email protected]","gender":"Male","ip_address":"218.111.175.34","cc":"","country":"Canada","birthdate":"1\\/16\\/1968","salary":150280.17,"title":"Accountant IV","comments":"Customer feedback: I recently visited your store at 5678 Pine Avenue, Dallas, TX 75201. My name is Jane Doe, age 43. I had a wonderful experience and the staff was very friendly. You can reach out to me at [email protected] for any further details."},
    {"id":3,"first_name":"Evelyn","last_name":"Morgan","email":"[email protected]","gender":"Female","ip_address":"7.161.136.94","cc":"6767119071901597","country":"Russia","birthdate":"2\\/1\\/1960","salary":144972.51,"title":"Structural Engineer","comments":"Booking Confirmation: Thank you, David Smith (DOB: 01\\/12\\/1978) for booking with us. We have received your payment through the credit card ending with 1234. Your booking ID is #67890. Please save this email for your records. For any queries, contact us at [email protected]."},
]

df = pd.DataFrame(data)
df.head()
[9]:
   id  first_name  last_name  email              gender  ip_address      cc                country    birthdate    salary     title                comments
0  1   Amanda      Jordan     [email protected]  Female  1.197.201.2     6759521864920116  Indonesia  3\/8\/1971   49756.53   Internal Auditor     Hello friends, my name is Alice Johnson and I ...
1  2   Albert      Freeman    [email protected]  Male    218.111.175.34                    Canada     1\/16\/1968  150280.17  Accountant IV        Customer feedback: I recently visited your sto...
2  3   Evelyn      Morgan     [email protected]  Female  7.161.136.94    6767119071901597  Russia     2\/1\/1960   144972.51  Structural Engineer  Booking Confirmation: Thank you, David Smith (...

List and create write contexts#

[10]:
sess.list_write_context()
[10]:
[{'name': 'default',
  'summary': 'Default write context',
  'description': 'No classification of encapsulated data',
  'config': {'key_reuse_ttl': 0,
   'default_capsule_tags': [],
   'required_hooks': []},
  'imported': False,
  'source_domain_id': None,
  'source_domain_name': None},
 {'name': 'sensitive',
  'summary': 'Default write context (sensitive data)',
  'description': 'Classifies data using the fast-pii and data structure classifiers',
  'config': {'key_reuse_ttl': 0,
   'default_capsule_tags': [],
   'required_hooks': [{'hook': 'data-structure-classifier',
     'constraint': '>1.0.0',
     'mode': 'sync'},
    {'hook': 'fast-pii', 'constraint': '>1.0.0', 'mode': 'sync'}]},
  'imported': False,
  'source_domain_id': None,
  'source_domain_name': None}]
[11]:
# Create a new write context
sess.add_write_context(
    "write_ctx", WriteContextBuilder().\
        set_summary("Sample write context").\
        set_description("Sample description").\
        add_hook("fast-pii", ">1.0.0", WriteContextHookMode.Sync)
)
[12]:
sess.list_write_context()
[12]:
[{'name': 'default',
  'summary': 'Default write context',
  'description': 'No classification of encapsulated data',
  'config': {'key_reuse_ttl': 0,
   'default_capsule_tags': [],
   'required_hooks': []},
  'imported': False,
  'source_domain_id': None,
  'source_domain_name': None},
 {'name': 'sensitive',
  'summary': 'Default write context (sensitive data)',
  'description': 'Classifies data using the fast-pii and data structure classifiers',
  'config': {'key_reuse_ttl': 0,
   'default_capsule_tags': [],
   'required_hooks': [{'hook': 'data-structure-classifier',
     'constraint': '>1.0.0',
     'mode': 'sync'},
    {'hook': 'fast-pii', 'constraint': '>1.0.0', 'mode': 'sync'}]},
  'imported': False,
  'source_domain_id': None,
  'source_domain_name': None},
 {'name': 'write_ctx',
  'summary': 'Sample write context',
  'description': 'Sample description',
  'config': {'key_reuse_ttl': 0,
   'default_capsule_tags': [],
   'required_hooks': [{'hook': 'fast-pii',
     'constraint': '>1.0.0',
     'mode': 'sync'}]},
  'imported': False,
  'source_domain_id': None,
  'source_domain_name': None}]
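
The listing is verbose, so a small helper can summarize which hooks each write context requires. This is a sketch that relies only on the dictionary shape shown in the output above. `hooks_by_context` is a hypothetical helper, and `sample` is a trimmed copy of that output.

```python
def hooks_by_context(contexts):
    """Map each context name to the list of hook names it requires."""
    return {
        ctx["name"]: [hook["hook"] for hook in ctx["config"]["required_hooks"]]
        for ctx in contexts
    }


# Trimmed copy of the listing above; in the notebook, pass sess.list_write_context().
sample = [
    {"name": "default", "config": {"required_hooks": []}},
    {"name": "sensitive", "config": {"required_hooks": [
        {"hook": "data-structure-classifier", "constraint": ">1.0.0", "mode": "sync"},
        {"hook": "fast-pii", "constraint": ">1.0.0", "mode": "sync"},
    ]}},
    {"name": "write_ctx", "config": {"required_hooks": [
        {"hook": "fast-pii", "constraint": ">1.0.0", "mode": "sync"},
    ]}},
]
hooks = hooks_by_context(sample)
```

Here `hooks["write_ctx"]` contains only `fast-pii`, which confirms that the new context was registered with the hook passed to `add_hook`.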

Encapsulate data using the write context#

[13]:
df_capsule = sess.encapsulate(data=df, write_context="write_ctx", path=file_name)
[14]:
!ls -lrtha /tmp/testdata.capsule
-rw-r--r--  1 ajay  wheel   4.9K Jun  7 14:17 /tmp/testdata.capsule
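
The `ls` call only works in a Unix shell. A portable alternative, sketched with the standard library, checks that the capsule file exists and is not empty. `capsule_size` is a hypothetical helper.

```python
from pathlib import Path


def capsule_size(path: str) -> int:
    """Return the size in bytes of the file at `path`; raise if it is missing or empty."""
    p = Path(path)
    if not p.is_file():
        raise FileNotFoundError(f"no capsule file at {p}")
    size = p.stat().st_size
    if size == 0:
        raise ValueError(f"capsule file {p} is empty")
    return size

# In the notebook: capsule_size(file_name)
```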

List and create read contexts#

[15]:
sess.list_read_context()
[15]:
[{'name': 'default',
  'summary': 'Default read context',
  'description': 'The default read context',
  'disable_read_logging': False,
  'key_cache_ttl': 0,
  'read_parameters': [],
  'imported': False,
  'source_domain_id': None,
  'source_domain_name': None}]
[16]:
sess.add_read_context("read_ctx",
    ReadContextBuilder().\
        set_summary("Sample read context").\
        set_description("Sample description").\
        add_required_hook("fast-pii", ">1.0.0").\
        add_read_parameter("key", True, "description")
)
[17]:
sess.list_read_context()
[17]:
[{'name': 'default',
  'summary': 'Default read context',
  'description': 'The default read context',
  'disable_read_logging': False,
  'key_cache_ttl': 0,
  'read_parameters': [],
  'imported': False,
  'source_domain_id': None,
  'source_domain_name': None},
 {'name': 'read_ctx',
  'summary': 'Sample read context',
  'description': 'Sample description',
  'disable_read_logging': False,
  'key_cache_ttl': 0,
  'read_parameters': [{'key': 'key',
    'required': True,
    'description': 'description'}],
  'imported': False,
  'source_domain_id': None,
  'source_domain_name': None}]

Open and read data based on read context#

[18]:
capsule = sess.load_capsule(path=file_name, read_context="read_ctx")

Retrieve the data as a LangChain retriever#

[19]:
retriever = capsule.data_as(dt=Datatype.LangchainRetriever)

Retrieve some data from the retriever#

[20]:
retriever._get_relevant_documents(query="Amanda Jordan", run_manager=None)
[20]:
[Document(page_content='{"id": "2", "first_name": "Albert", "last_name": "Freeman", "email": "[email protected]", "gender": "Male", "ip_address": "218.111.175.34", "cc": "", "country": "Canada", "birthdate": "1\\\\/16\\\\/1968", "salary": "150280.17", "title": "Accountant IV", "comments": "Customer feedback: I recently visited your store at 5678 Pine Avenue, Dallas, TX 75201. My name is Jane Doe, age 43. I had a wonderful experience and the staff was very friendly. You can reach out to me at [email protected] for any further details."}'),
 Document(page_content='{"id": "1", "first_name": "Amanda", "last_name": "Jordan", "email": "[email protected]", "gender": "Female", "ip_address": "1.197.201.2", "cc": "6759521864920116", "country": "Indonesia", "birthdate": "3\\\\/8\\\\/1971", "salary": "49756.53", "title": "Internal Auditor", "comments": "Hello friends, my name is Alice Johnson and I just turned 29 years old! \\\\ud83c\\\\udf89 I am looking forward to connecting with all of you. Feel free to drop me a line at [email protected] or call me at 415-123-4567."}'),
 Document(page_content='{"id": "3", "first_name": "Evelyn", "last_name": "Morgan", "email": "[email protected]", "gender": "Female", "ip_address": "7.161.136.94", "cc": "6767119071901597", "country": "Russia", "birthdate": "2\\\\/1\\\\/1960", "salary": "144972.51", "title": "Structural Engineer", "comments": "Booking Confirmation: Thank you, David Smith (DOB: 01\\\\/12\\\\/1978) for booking with us. We have received your payment through the credit card ending with 1234. Your booking ID is #67890. Please save this email for your records. For any queries, contact us at [email protected]."}')]
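
In the output above, the retriever returns all three rows and the row for Albert Freeman comes first, so the ranking alone does not isolate Amanda Jordan. Each `page_content` is a JSON-encoded row, which makes exact filtering straightforward. The sketch below uses `Doc`, a stand-in for the LangChain `Document` class, and `find_records`, a hypothetical helper. The sample rows are abridged from the output.

```python
import json
from dataclasses import dataclass


@dataclass
class Doc:
    """Stand-in for a LangChain Document; only page_content is used here."""
    page_content: str


def find_records(docs, **criteria):
    """Parse each document's JSON page_content and keep rows matching every criterion."""
    records = [json.loads(doc.page_content) for doc in docs]
    return [row for row in records if all(row.get(k) == v for k, v in criteria.items())]


docs = [
    Doc('{"id": "2", "first_name": "Albert", "last_name": "Freeman"}'),
    Doc('{"id": "1", "first_name": "Amanda", "last_name": "Jordan"}'),
]
matches = find_records(docs, first_name="Amanda", last_name="Jordan")
```

The same function should accept the list returned by the retriever, since it only reads the `page_content` attribute.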

Create a GPT-3.5 QA chain and test it with the LangChain retriever#

[21]:
from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain
chatbot = ConversationalRetrievalChain.from_llm(ChatOpenAI(model='gpt-3.5-turbo'), retriever=retriever)
# chat_history is a list of earlier (question, answer) turns; it is empty for the first question
resp = chatbot({'question': "Give me details about Amanda Jordan", 'chat_history': []})
print(resp["answer"])
/Users/ajay/repos/python-client/.venv/lib/python3.11/site-packages/langchain_core/_api/deprecation.py:119: LangChainDeprecationWarning: The class `ChatOpenAI` was deprecated in LangChain 0.0.10 and will be removed in 0.3.0. An updated version of the class exists in the langchain-openai package and should be used instead. To use it run `pip install -U langchain-openai` and import as `from langchain_openai import ChatOpenAI`.
  warn_deprecated(
/Users/ajay/repos/python-client/.venv/lib/python3.11/site-packages/langchain_core/_api/deprecation.py:119: LangChainDeprecationWarning: The method `Chain.__call__` was deprecated in langchain 0.1.0 and will be removed in 0.3.0. Use invoke instead.
  warn_deprecated(
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Amanda Jordan is a female from Indonesia. Her email is [email protected], and her IP address is 1.197.201.2. She works as an Internal Auditor and holds the title of "Internal Auditor." Amanda was born on March 8, 1971, and her salary is $49,756.53. Additionally, her credit card number is 6759 5218 6492 0116. If you need more specific details, please let me know.

Add rules to the read context to redact data#

[22]:
email_redaction_rule = sess.add_read_context_rules("read_ctx", rule_builder=ReadContextRuleBuilder().add_match_expression(
    source=Source.Tags,
    key="tag.antimatter.io/pii/email_address",
    operator=Operator.Exists
).set_action(Action.Redact).set_priority(20))
[23]:
sess.add_read_context_rules("read_ctx", rule_builder=ReadContextRuleBuilder().add_match_expression(
    source=Source.Tags,
    key="tag.antimatter.io/pii/credit_card",
    operator=Operator.Exists
).set_action(Action.Redact).set_priority(30))
[23]:
'rl-ipo2ukxd955hz6eq'
[24]:
sess.describe_read_context("read_ctx")
[24]:
{'name': 'read_ctx',
 'summary': 'Sample read context',
 'description': 'Sample description',
 'disable_read_logging': False,
 'key_cache_ttl': 0,
 'required_hooks': [{'hook': 'fast-pii',
   'constraint': '>1.0.0',
   'write_context': None}],
 'read_parameters': [{'key': 'key',
   'required': True,
   'description': 'description'}],
 'rules': [{'id': 'rl-rk1q9h43zbwgftdp',
   'match_expressions': [{'source': 'tags',
     'key': 'tag.antimatter.io/pii/email_address',
     'operator': 'Exists',
     'values': None,
     'value': None}],
   'action': 'Redact',
   'token_scope': None,
   'token_format': None,
   'facts': [],
   'priority': 20,
   'imported': False,
   'source_domain_id': None,
   'source_domain_name': None},
  {'id': 'rl-ipo2ukxd955hz6eq',
   'match_expressions': [{'source': 'tags',
     'key': 'tag.antimatter.io/pii/credit_card',
     'operator': 'Exists',
     'values': None,
     'value': None}],
   'action': 'Redact',
   'token_scope': None,
   'token_format': None,
   'facts': [],
   'priority': 30,
   'imported': False,
   'source_domain_id': None,
   'source_domain_name': None}],
 'imported': False,
 'source_domain_id': None,
 'source_domain_name': None,
 'policy_assembly': None}
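
As a rough mental model of the two rules above, the toy sketch below redacts a value whenever one of its tags is named by an `Exists` rule whose action is `Redact`. It is an illustration only: `apply_rules` and the tagged cells are hypothetical, rule priority is not modeled, and the library's actual evaluation may differ.

```python
REDACTED = "{redacted}"


def apply_rules(cells, rules):
    """cells is a list of (value, tags); redact a value when a Redact/Exists rule names one of its tags."""
    result = []
    for value, tags in cells:
        hit = any(
            rule["operator"] == "Exists"
            and rule["action"] == "Redact"
            and rule["key"] in tags
            for rule in rules
        )
        result.append(REDACTED if hit else value)
    return result


rules = [
    {"key": "tag.antimatter.io/pii/email_address", "operator": "Exists", "action": "Redact"},
    {"key": "tag.antimatter.io/pii/credit_card", "operator": "Exists", "action": "Redact"},
]
cells = [
    ("Amanda", set()),
    ("6759521864920116", {"tag.antimatter.io/pii/credit_card"}),
]
result = apply_rules(cells, rules)
```

In this toy run the untagged name passes through and the tagged card number becomes `{redacted}`.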

Materialize the data with the new rules for redaction#

[25]:
capsule = sess.load_capsule(path=file_name, read_context="read_ctx")
[26]:
df = capsule.data_as(dt=Datatype.PandasDataframe)
[27]:
df
[27]:
   id  first_name  last_name  email       gender  ip_address      cc          country    birthdate    salary     title                comments
0  1   Amanda      Jordan     {redacted}  Female  1.197.201.2     {redacted}  Indonesia  3\/8\/1971   49756.53   Internal Auditor     Hello friends, my name is Alice Johnson and I ...
1  2   Albert      Freeman    {redacted}  Male    218.111.175.34              Canada     1\/16\/1968  150280.17  Accountant IV        Customer feedback: I recently visited your sto...
2  3   Evelyn      Morgan     {redacted}  Female  7.161.136.94    {redacted}  Russia     2\/1\/1960   144972.51  Structural Engineer  Booking Confirmation: Thank you, David Smith (...
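
A quick way to confirm which columns were affected is to scan the rows for the `{redacted}` placeholder seen in the output. This is a sketch over plain dictionaries. `redacted_columns` is a hypothetical helper, and the sample rows are abridged from the table above.

```python
def redacted_columns(records, placeholder="{redacted}"):
    """Return the sorted column names in which at least one value equals the placeholder."""
    return sorted({key for row in records for key, value in row.items() if value == placeholder})


rows = [
    {"first_name": "Amanda", "email": "{redacted}", "cc": "{redacted}"},
    {"first_name": "Albert", "email": "{redacted}", "cc": ""},
]
cols = redacted_columns(rows)
```

For the DataFrame in the notebook, `redacted_columns(df.to_dict("records"))` should give the same kind of summary. Note that it only detects cells replaced as a whole, not placeholders embedded in longer text.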

Use RAG QA with the redacted read context and its materialized retriever#

[28]:
retriever = capsule.data_as(dt=Datatype.LangchainRetriever)
retriever._get_relevant_documents(query="Amanda Jordan", run_manager=None)
chatbot = ConversationalRetrievalChain.from_llm(ChatOpenAI(model='gpt-3.5-turbo'), retriever=retriever)
[29]:
# chat_history is a list of earlier (question, answer) turns; it is empty for the first question
resp = chatbot({'question': "Give me details about Amanda Jordan", 'chat_history': []})
print(resp["answer"])
Amanda Jordan is a female from Indonesia. She was born on March 8, 1971. Her job title is Internal Auditor, and her salary is $49,756.53. Unfortunately, her email address and credit card information are redacted for privacy reasons. If you need further details, you can contact her at the provided phone number: 415-123-4567.
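
The answer above no longer contains the email address or card number, but it still repeats the phone number from the free-text comments field. A simple pattern scan can flag such leftovers in a generated answer. This is a sketch with deliberately simplistic regular expressions: `PATTERNS` and `find_leaks` are hypothetical, they are not the classifiers used by the fast-pii hook, and the sample text is abridged from the answer.

```python
import re

# Illustrative patterns only; real PII detection needs far more care.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "card_number": re.compile(r"\b(?:\d[ -]?){12,18}\d\b"),
    "us_phone": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def find_leaks(text):
    """Return a dict of pattern name -> matches for every pattern found in the text."""
    return {name: pattern.findall(text) for name, pattern in PATTERNS.items() if pattern.search(text)}


answer = (
    "Her email address and credit card information are redacted. "
    "You can contact her at 415-123-4567."
)
leaks = find_leaks(answer)
```

Running the same scan on `resp["answer"]` after each rule change gives a quick regression check on what the chain can still reveal.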

Remove the email redaction rule from the read context#

[30]:
sess.delete_read_context_rule('read_ctx', email_redaction_rule)
[31]:
sess.describe_read_context("read_ctx")
[31]:
{'name': 'read_ctx',
 'summary': 'Sample read context',
 'description': 'Sample description',
 'disable_read_logging': False,
 'key_cache_ttl': 0,
 'required_hooks': [{'hook': 'fast-pii',
   'constraint': '>1.0.0',
   'write_context': None}],
 'read_parameters': [{'key': 'key',
   'required': True,
   'description': 'description'}],
 'rules': [{'id': 'rl-ipo2ukxd955hz6eq',
   'match_expressions': [{'source': 'tags',
     'key': 'tag.antimatter.io/pii/credit_card',
     'operator': 'Exists',
     'values': None,
     'value': None}],
   'action': 'Redact',
   'token_scope': None,
   'token_format': None,
   'facts': [],
   'priority': 30,
   'imported': False,
   'source_domain_id': None,
   'source_domain_name': None}],
 'imported': False,
 'source_domain_id': None,
 'source_domain_name': None,
 'policy_assembly': None}

Read the data with only the credit card redaction rule#

[32]:
capsule = sess.load_capsule(path=file_name, read_context="read_ctx")
[33]:
retriever = capsule.data_as(dt=Datatype.LangchainRetriever)
retriever._get_relevant_documents(query="Amanda Jordan", run_manager=None)
[33]:
[Document(page_content='{"id": "2", "first_name": "Albert", "last_name": "Freeman", "email": "[email protected]", "gender": "Male", "ip_address": "218.111.175.34", "cc": "", "country": "Canada", "birthdate": "1\\\\/16\\\\/1968", "salary": "150280.17", "title": "Accountant IV", "comments": "Customer feedback: I recently visited your store at 5678 Pine Avenue, Dallas, TX 75201. My name is Jane Doe, age 43. I had a wonderful experience and the staff was very friendly. You can reach out to me at [email protected] for any further details."}'),
 Document(page_content='{"id": "1", "first_name": "Amanda", "last_name": "Jordan", "email": "[email protected]", "gender": "Female", "ip_address": "1.197.201.2", "cc": "{redacted}", "country": "Indonesia", "birthdate": "3\\\\/8\\\\/1971", "salary": "49756.53", "title": "Internal Auditor", "comments": "Hello friends, my name is Alice Johnson and I just turned 29 years old! \\\\ud83c\\\\udf89 I am looking forward to connecting with all of you. Feel free to drop me a line at [email protected] or call me at 415-123-4567."}'),
 Document(page_content='{"id": "3", "first_name": "Evelyn", "last_name": "Morgan", "email": "[email protected]", "gender": "Female", "ip_address": "7.161.136.94", "cc": "{redacted}", "country": "Russia", "birthdate": "2\\\\/1\\\\/1960", "salary": "144972.51", "title": "Structural Engineer", "comments": "Booking Confirmation: Thank you, David Smith (DOB: 01\\\\/12\\\\/1978) for booking with us. We have received your payment through the credit card ending with {redacted}. Your booking ID is #67890. Please save this email for your records. For any queries, contact us at [email protected]."}')]
[34]:
chatbot = ConversationalRetrievalChain.from_llm(ChatOpenAI(model='gpt-3.5-turbo'), retriever=retriever)
# chat_history is a list of earlier (question, answer) turns; it is empty for the first question
resp = chatbot({'question': "Give me details about Amanda Jordan", 'chat_history': []})
print(resp["answer"])
Amanda Jordan is a female born on March 8, 1971, in Indonesia. She works as an Internal Auditor with a salary of $49,756.53. Her email address is [email protected]. If you need more specific information, feel free to ask.