Personally Identifiable Information

Language detects, classifies, and provides options to de-identify personally identifiable information (PII) in unstructured text.

Use Cases

Detecting and curating private information in user feedback

Many organizations collect user feedback through channels such as product reviews, return requests, support tickets, and feedback forums. With the automatic detection of PII entities provided by the Language PII detection service, applications can proactively warn users about sharing private data and anonymize posted feedback before storing it, for example by storing only masked data.

Scanning object storage for presence of sensitive data

Cloud storage solutions such as OCI Object Storage are widely used by employees to store business documents in locations that are either locally controlled or shared by many teams. Ensuring that such shared locations don't store private information such as employee names, demographics, and payroll information requires automatically scanning all documents for the presence of PII. The OCI Language PII model provides a batch API for processing many text documents at scale.

Supported Entities

The following table describes the entity types that PII detection can extract.

Entity Type Description
PERSON Person name
ADDRESS Address
AGE Age
DATE_TIME Date or time
SSN_OR_TAXPAYER Social security number or taxpayer ID (US)
EMAIL Email
PASSPORT_NUMBER_US Passport number (US)
TELEPHONE_NUMBER Telephone or fax (US)
DRIVER_ID_US Driver identification number (US)
BANK_ACCOUNT_NUMBER Bank account number (US)
BANK_SWIFT Bank account (SWIFT)
BANK_ROUTING Bank routing number
CREDIT_DEBIT_NUMBER Credit or debit card number
IP_ADDRESS IP address, both IPv4 and IPv6
MAC_ADDRESS MAC address

The following are secret types:

COOKIE Website Cookie
XSRF TOKEN Cross-Site Request Forgery (XSRF) Token
AUTH_BASIC Basic Authentication
AUTH_BEARER Bearer Authentication
JSON_WEB_TOKEN JSON Web Token
PRIVATE_KEY Cryptographic Private Key
PUBLIC_KEY Cryptographic Public Key

The following are OCI account credentials: the authentication information required to access and manage resources within OCI. These credentials let users, applications, and services authenticate securely when interacting with OCI services and resources.

OCI_OCID_USER OCI User
OCI_OCID_TENANCY Tenancy OCID (Oracle Cloud Identifier)
OCI_SMTP_USERNAME SMTP (Simple Mail Transfer Protocol) Username
OCI_OCID_REFERENCE OCID Reference
OCI_FINGERPRINT OCI Fingerprint
OCI_CREDENTIAL This type covers OCI Auth Token, OAuth Credential and SMTP Credential
OCI_PRE_AUTH_REQUEST OCI Pre-Authenticated Request
OCI_STORAGE_SIGNED_URL OCI Storage Signed URL
OCI_CUSTOMER_SECRET_KEY OCI Customer Secret Key
OCI_ACCESS_KEY OCI Access Keys or security credentials

Examples

Input text:

Hello Support Team,

I am reaching out to seek help with my credit card number 5111 1111 1111 1118 expiring on 11/23. There was a suspicious transaction on 12-Aug-2022 which I reported by calling from my mobile number +1 (650) 555-0190 also I emailed from my email id sarah.jones1234@hotmail.com. Would you please let me know the refund status?

Regards,

Sarah

Output text masked with "*":

Hello Support Team, I am reaching out to seek help with my credit card number ******************* expiring on *****. There was a suspicious transaction on *********** which I reported by calling from my mobile number ** ************** also I emailed from my email id ***************************. Would you please let me know the refund status? Regards, *****

The JSON for the example is:

Sample Request
POST https://<region-url>/20210101/actions/batchDetectLanguagePiiEntities
API Request format:
{
  "documents": [
    {
      "languageCode": "en",
      "key": "1",
      "text": "Hello Support Team, I am reaching out to seek help with my credit card number 5111 1111 1111 1118 expiring on 11/23. There was a suspicious transaction on 12-Aug-2022 which I reported by calling from my mobile number +1 (650) 555-0190 also I emailed from my email id sarah.jones1234@hotmail.com. Would you please let me know the refund status? Regards, Sarah"
    }
  ],
  "compartmentId": "ocid1.tenancy.oc1..aaaaaaaadany3y6wdh3u3jcodcmm42ehsdno525pzyavtjbpy72eyxcu5f7q",
  "masking": {
    "ALL": {
      "mode": "MASK",
      "isUnmaskedFromEnd": true,
      "leaveCharactersUnmasked": 4
    }
  }
}
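A request body with this shape can be assembled programmatically. The following is a minimal sketch using only the Python standard library; the function name build_pii_request and its parameters are hypothetical helpers, not part of any OCI SDK, and the compartment OCID is a placeholder. It only builds the JSON payload and does not sign or send the request.

```python
import json


def build_pii_request(texts, compartment_id, language_code="en",
                      unmasked_chars=4, unmask_from_end=True):
    """Build a request body shaped like the sample request (hypothetical helper)."""
    documents = [
        # Keys are assigned sequentially as strings: "1", "2", ...
        {"languageCode": language_code, "key": str(i), "text": text}
        for i, text in enumerate(texts, start=1)
    ]
    return {
        "documents": documents,
        "compartmentId": compartment_id,
        "masking": {
            # "ALL" applies the same mask settings to every entity type.
            "ALL": {
                "mode": "MASK",
                "isUnmaskedFromEnd": unmask_from_end,
                "leaveCharactersUnmasked": unmasked_chars,
            }
        },
    }


body = build_pii_request(
    ["My email is jane@example.com", "Call me at (650) 555-0190"],
    compartment_id="<compartment-ocid>",  # placeholder, replace with a real OCID
)
print(json.dumps(body, indent=2))
```

Sending the payload still requires OCI request signing, which is outside the scope of this sketch.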
Response JSON:
{
    "documents": [
        {
            "key": "1",
            "entities": [
                {
                    "offset": 79,
                    "length": 19,
                    "type": "CREDIT_DEBIT_NUMBER",
                    "text": "5111 1111 1111 1118",
                    "score": 0.75,
                    "isCustom": false
                },
                {
                    "offset": 111,
                    "length": 5,
                    "type": "DATE_TIME",
                    "text": "11/23",
                    "score": 0.9992455840110779,
                    "isCustom": false
                },
                {
                    "offset": 156,
                    "length": 11,
                    "type": "DATE_TIME",
                    "text": "12-Aug-2022",
                    "score": 0.998766303062439,
                    "isCustom": false
                },
                {
                    "offset": 218,
                    "length": 2,
                    "type": "TELEPHONE_NUMBER",
                    "text": "+1",
                    "score": 0.6941494941711426,
                    "isCustom": false
                },
                {
                    "offset": 221,
                    "length": 14,
                    "type": "TELEPHONE_NUMBER",
                    "text": "(650) 555-0190",
                    "score": 0.9527066349983215,
                    "isCustom": false
                },
                {
                    "offset": 268,
                    "length": 27,
                    "type": "EMAIL",
                    "text": "sarah.jones1234@hotmail.com",
                    "score": 0.95,
                    "isCustom": false
                },
                {
                    "offset": 354,
                    "length": 5,
                    "type": "PERSON",
                    "text": "Sarah",
                    "score": 0.9918518662452698,
                    "isCustom": false
                }
            ],
            "languageCode": "en",
            "maskedText": "Hello Support Team, \nI am reaching out to seek help with my credit card number ***************1118 expiring on *1/23. There was a suspicious transaction on *******2022 which I reported by calling from my mobile number +1 **********0190 also I emailed from my email id ***********************.com. Would you please let me know the refund status?\nRegards,\n*arah"
        }
    ],
    "errors": []
}
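The effect of leaveCharactersUnmasked and isUnmaskedFromEnd can be reproduced locally from the offset and length fields of the detected entities. The sketch below is an illustration of the behavior as inferred from the sample response (for example, "Sarah" becoming "*arah" and the two-character "+1" staying unchanged), not the service's implementation; the function name mask_entities is hypothetical.

```python
def mask_entities(text, entities, mask_char="*",
                  leave_unmasked=4, unmasked_from_end=True):
    """Mask each entity span, keeping up to `leave_unmasked` characters visible."""
    out = text
    for ent in entities:
        start, length = ent["offset"], ent["length"]
        value = out[start:start + length]
        # Entities shorter than leave_unmasked stay fully visible.
        keep = min(leave_unmasked, length)
        hidden = mask_char * (length - keep)
        if unmasked_from_end:
            masked = hidden + value[length - keep:]
        else:
            masked = value[:keep] + hidden
        # Masking preserves length, so the other entities' offsets stay valid.
        out = out[:start] + masked + out[start + length:]
    return out


text = "Card 5111 1111 1111 1118, phone +1 (650) 555-0190"
values = ["5111 1111 1111 1118", "+1", "(650) 555-0190"]
entities = [{"offset": text.index(v), "length": len(v)} for v in values]
print(mask_entities(text, entities))
# Card ***************1118, phone +1 **********0190
```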

Configuring PII or PHI Text Output

In the Language service, you can configure the PII/PHI output when analyzing text.

  1. Complete Analyzing text.
  2. In the PII or PHI section, click Configure in the Output section.
  3. Select PII from the dropdown.
  4. Select from the following:
    • Mask: Select to include or exclude entities.
      1. Anonymization exclusion list: Enter entities to exclude from the UI output and the SDK output.
      2. Include excluded entities from masking in detected entities: Select to include the entity that was excluded from the output in the UI, but to continue to include the entity in the SDK output.
      3. Masking character: The character used to mask the input text.
    • Replace: Replace PII entities with a given sequence of characters.
    • Remove: Remove PII entities from output.
  5. Click Save changes.
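The three output options can be compared side by side on a small example. This is a minimal sketch under stated assumptions: the function deidentify, its parameters, and the default "<TYPE>" replacement are hypothetical and only illustrate the Mask, Replace, and Remove options and the exclusion list; entities are assumed not to overlap.

```python
def deidentify(text, entities, mode="MASK", mask_char="*",
               replace_with=None, exclude_types=()):
    """Apply MASK, REPLACE, or REMOVE to non-overlapping entity spans."""
    pieces = []
    cursor = 0
    for ent in sorted(entities, key=lambda e: e["offset"]):
        start = ent["offset"]
        end = start + ent["length"]
        pieces.append(text[cursor:start])  # untouched text before the entity
        value = text[start:end]
        if ent["type"] in exclude_types:
            pieces.append(value)  # excluded types are left as-is
        elif mode == "MASK":
            pieces.append(mask_char * len(value))
        elif mode == "REPLACE":
            # Assumption: fall back to the entity type name when no
            # replacement string is given.
            pieces.append(replace_with if replace_with is not None
                          else "<" + ent["type"] + ">")
        elif mode == "REMOVE":
            pass  # drop the entity text entirely
        else:
            raise ValueError("unknown mode: " + mode)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


email = "sarah.jones1234@hotmail.com"
text = "Sarah wrote from " + email
entities = [
    {"offset": 0, "length": 5, "type": "PERSON"},
    {"offset": text.index(email), "length": len(email), "type": "EMAIL"},
]
print(deidentify(text, entities, mode="MASK"))
print(deidentify(text, entities, mode="REPLACE", replace_with="[REDACTED]"))
print(deidentify(text, entities, mode="REMOVE"))
print(deidentify(text, entities, mode="MASK", exclude_types=("PERSON",)))
```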

PII Rules

Custom PII Rules
Keys Type Description
ruleId String Unique identifier for the rule.
regex String Regular expression pattern to match custom data types. For example, ([A-Z]{5}[0-9]{4}[A-Z]{1}) to match a PAN card number.
type String Name for entity type to match. For example, PAN_CARD.
prefix List<String> Words or phrases to look for before the regex-matched text, within maxDistance of it.
suffix List<String> Words or phrases to look for after the regex-matched text, within maxDistance of it.
isCaseSensitive Boolean Determines whether matching treats uppercase and lowercase letters as distinct. Set to true for case-sensitive matching and false for case-insensitive matching.
maxDistance Integer Defines the maximum allowable distance in characters between the prefix/suffix and the matched pattern, ensuring that the pattern is found within a certain proximity to the prefix/suffix.
priority Integer Priority of the rule, from 1 to 50, where 1 is the highest. For example, if two rules have the same regex but different prefix and suffix values, the rule with the higher priority is applied.
regexOnly Boolean If true, removes model-detected entities that match the rule's regex. For example, in the sentence "I am 25 years old and he is 11 months old," with the suffix set to ["years"]:

  • If regexOnly is true, only 25 is detected because the suffix "months" doesn't match the specified suffix "years".
  • If regexOnly is false, both 25 and 11 are detected: 25 from the rule (due to the suffix "years") and 11 from the model.
filterEntityTypes List<String> OCI entity types to filter. For example, [PERSON, AGE] filters the entity types PERSON and AGE from model detections. If the filter is set to [ALL], all model-detected entities are filtered out, so detection is regex-based only and ignores the predefined model entities.

disable Boolean Set to true to disable this rule.
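To make the interplay of regex, prefix, suffix, maxDistance, and isCaseSensitive concrete, the following sketch evaluates a single rule against a text. It is one plausible reading of the fields described above, not the service's matching logic. Assumptions: a context word counts when the gap between it and the match is at most maxDistance characters, a rule with a prefix or suffix needs at least one of them nearby, and isCaseSensitive also applies to the context words. The function names and the age regex are hypothetical.

```python
import re


def _has_context(text, words, anchor, max_distance, before, case_sensitive):
    """True if any word lies within max_distance characters of the anchor."""
    for word in words:
        reach = max_distance + len(word)
        if before:
            window = text[max(0, anchor - reach):anchor]
        else:
            window = text[anchor:anchor + reach]
        if not case_sensitive:
            window, word = window.lower(), word.lower()
        if word in window:
            return True
    return False


def apply_rule(text, rule):
    """Return regex matches of one rule that have a nearby prefix or suffix."""
    if rule.get("disable", False):
        return []
    case_sensitive = rule.get("isCaseSensitive", False)
    flags = 0 if case_sensitive else re.IGNORECASE
    max_distance = rule.get("maxDistance", 0)
    prefixes = rule.get("prefix", [])
    suffixes = rule.get("suffix", [])
    hits = []
    for m in re.finditer(rule["regex"], text, flags):
        if prefixes or suffixes:
            near = (
                _has_context(text, prefixes, m.start(), max_distance, True, case_sensitive)
                or _has_context(text, suffixes, m.end(), max_distance, False, case_sensitive)
            )
            if not near:
                continue
        hits.append({"offset": m.start(), "length": m.end() - m.start(),
                     "type": rule["type"], "text": m.group(0)})
    return hits


# Hypothetical rule reproducing the "years" example from the table.
age_rule = {"ruleId": "age", "regex": r"\b[0-9]{1,3}\b", "type": "AGE",
            "suffix": ["years"], "maxDistance": 10}
sentence = "I am 25 years old and he is 11 months old"
print([h["text"] for h in apply_rule(sentence, age_rule)])  # ['25']
```

Only 25 is returned because "years" follows it within 10 characters, while 11 is followed by "months". This corresponds to the rule-side detections; whether 11 is additionally reported depends on the model and the regexOnly setting.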

Sample Rules File

[
  {
    "ruleId": "rule 1",
    "regex": "([A-Z]{2}[0-9]{2}-[0-9]{3}-[0-9]{3}-[0-9]{3}\/[0-9]{3})",
    "type": "NAREGA_ID",
    "isCaseSensitive": true,
    "suffix": ["id", "narega"],
    "regexOnly": true,
    "maxDistance": 10,
    "priority": 2
  },
  {
    "ruleId": "rule 2",
    "regex": "([A-Z]{5}[0-9]{4}[A-Z]{1})",
    "type": "PAN_CARD",
    "prefix": ["pan", "pancard"],
    "isCaseSensitive": false,
    "regexOnly": false,
    "maxDistance": 10,
    "priority": 2,
    "filterEntityTypes": ["PERSON"]
  },
  {
    "ruleId": "rule 3",
    "regex": "([A-Z]{4}[0-9]{7})",
    "type": "IFSC_CODE",
    "prefix": ["IFSC"],
    "isCaseSensitive": false,
    "regexOnly": false,
    "maxDistance": 20,
    "priority": 2,
    "disable": false
  },
  {
    "ruleId": "rule 4",
    "regex": "([A-Z]{2}[0-9]{2}-[0-9]{3}-[0-9]{3}-[0-9]{3}\/[0-9]{3})",
    "type": "TIRA_ID",
    "isCaseSensitive": false,
    "prefix": ["Narega"],
    "regexOnly": false,
    "maxDistance": 10,
    "priority": 1
  }
]
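Before uploading a rules file, it can help to check it locally. The sketch below parses a rules file, verifies that each regex compiles and that priority falls in the documented 1-50 range, drops disabled rules, and orders the rest so that priority 1 comes first. The function load_rules and the choice of required keys are assumptions for illustration; the service may validate differently.

```python
import json
import re

# Assumption: these three keys are treated as mandatory for this sketch.
REQUIRED_KEYS = ("ruleId", "regex", "type")


def load_rules(raw_json):
    """Parse, validate, and order rules (priority 1 first)."""
    rules = json.loads(raw_json)
    for rule in rules:
        missing = [k for k in REQUIRED_KEYS if k not in rule]
        if missing:
            raise ValueError("rule is missing keys: " + ", ".join(missing))
        re.compile(rule["regex"])  # raises re.error for an invalid pattern
        if "priority" in rule and not 1 <= rule["priority"] <= 50:
            raise ValueError(rule["ruleId"] + ": priority must be between 1 and 50")
    active = [r for r in rules if not r.get("disable", False)]
    # Assumption: rules without a priority sort after all prioritized rules.
    return sorted(active, key=lambda r: r.get("priority", 51))


SAMPLE = """
[
  {"ruleId": "rule 1", "regex": "([A-Z]{5}[0-9]{4}[A-Z]{1})", "type": "PAN_CARD", "priority": 2},
  {"ruleId": "rule 2", "regex": "([A-Z]{4}[0-9]{7})", "type": "IFSC_CODE", "priority": 1},
  {"ruleId": "rule 3", "regex": "([A-Z]{4}[0-9]{7})", "type": "IFSC_CODE", "priority": 3, "disable": true}
]
"""
ordered = load_rules(SAMPLE)
print([r["ruleId"] for r in ordered])  # ['rule 2', 'rule 1']
```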