Introduction
This document describes smart identifiers, which are built-in content scanning patterns that detect certain types of data. For this release, the system will implement smart identifiers for credit card numbers, U.S. Social Security numbers, CUSIP numbers, and ABA routing numbers.
Internally, a smart identifier consists of a regular expression that matches candidate strings, along with a validation function that checks the candidate match in some way. For example, the validation function for a credit card number ensures that the check digit is correct.
The regular expression for each smart identifier will include word boundary anchors ('\b') at both ends. (This prevents the system from matching a U.S. Social Security number, for example, in the middle of a longer string of digits.) For simplicity, these anchors are omitted from the descriptions below.
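As a small illustration of the effect of the anchors, the following Python sketch compares an anchored and an unanchored version of the dash-separated Social Security number expression. The pattern names and the sample text are made up for this example and are not part of the implementation.

```python
import re

# Hypothetical names; the dash-separated SSN expression with and without \b anchors.
SSN_ANCHORED = re.compile(r"\b[0-9]{3}-[0-9]{2}-[0-9]{4}\b")
SSN_UNANCHORED = re.compile(r"[0-9]{3}-[0-9]{2}-[0-9]{4}")

# The SSN-shaped substring sits inside longer runs of digits.
text = "ref 1555-55-55559"

print(SSN_UNANCHORED.findall(text))  # ['555-55-5555']
print(SSN_ANCHORED.findall(text))    # []
```

Without the anchors, the expression picks a candidate out of the middle of the longer digit runs; with them, nothing is reported.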
The smart identifier implementation must be careful about overlapping matches, because a substring found by the regular expression may not validate. For example, a filter scanning for credit card numbers in the string 9999 4321 9999 9999 9995 1234 5678 9000 should find the valid credit card number 4321 9999 9999 9995, even though a simple left-to-right regular expression scan for candidates would find only 9999 4321 9999 9999 and 9995 1234 5678 9000, neither of which validates.
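One possible way to handle overlapping candidates is sketched below in Python. It is only an illustration under several assumptions: the function names are hypothetical, only the space-separated 16-digit form is scanned, the card type prefix is not checked, and the validation is the check digit test described in the Credit Card Numbers section. The expression is wrapped in a lookahead so that a candidate starting inside an earlier candidate is still examined.

```python
import re

# Space-separated 16-digit form, wrapped in a lookahead so that candidates
# that overlap an earlier candidate are still reported.
CANDIDATE = re.compile(r"(?=(\b[0-9]{4} [0-9]{4} [0-9]{4} [0-9]{4}\b))")


def luhn_ok(candidate: str) -> bool:
    """Return True if the digits in candidate pass the check digit test."""
    digits = [int(ch) for ch in candidate if ch.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # every second digit, counting from the right
            d *= 2
        total += d // 10 + d % 10  # add the individual digits of the result
    return total % 10 == 0


def find_card_numbers(text: str) -> list:
    """Hypothetical scanner: validate every candidate, including overlapping ones."""
    found = []
    next_free = 0  # first offset not covered by an already accepted match
    for m in CANDIDATE.finditer(text):
        start, candidate = m.start(1), m.group(1)
        if start >= next_free and luhn_ok(candidate):
            found.append(candidate)
            next_free = start + len(candidate)
    return found


print(find_card_numbers("9999 4321 9999 9999 9995 1234 5678 9000"))
# ['4321 9999 9999 9995']
```

Accepted matches are kept non-overlapping by skipping any candidate that starts before the end of the previous accepted match; other policies are possible.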
Credit Card Numbers
A credit card number begins with a variable-length prefix that indicates the card type (VISA, MasterCard, AMEX, etc.) and ends with a check digit. Different card types use different numbers of digits in the entire number, but the check digit calculation is the same in each case.
Note that enRoute or JCB cards are not matched. Also, 13-digit VISA numbers do not exist, and won't be matched in our implementation.
16-digit credit card numbers will match one of the following regular expressions:
[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{4}
[0-9]{4}\.[0-9]{4}\.[0-9]{4}\.[0-9]{4}
[0-9]{4} [0-9]{4} [0-9]{4} [0-9]{4}
[0-9]{16}
The number must begin with the prefix "4", "51"-"55", or "6011".
The 15-digit AMEX numbers will match one of the following regular expressions:
[0-9]{4}-[0-9]{6}-[0-9]{5}
[0-9]{4}\.[0-9]{6}\.[0-9]{5}
[0-9]{4} [0-9]{6} [0-9]{5}
[0-9]{15}
The number must begin with the prefix "34" or "37".
The 14-digit Diners Club numbers will match one of the following regular expressions:
[0-9]{4}-[0-9]{6}-[0-9]{4}
[0-9]{4}\.[0-9]{6}\.[0-9]{4}
[0-9]{4} [0-9]{6} [0-9]{4}
[0-9]{14}
The number must begin with the prefix "300"-"305", "36", or "38".
Note that the regular expressions define a specific grouping of digits for a given credit card length, and that if there is punctuation between the digits, it has to be the same throughout.
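The grouping, separator, and prefix rules can also be folded into one expression per card length. The Python sketch below is one way to do that and is not necessarily how the implementation is structured: it captures the optional separator once and uses a backreference to require the same separator throughout. The names SEP, CARD_PATTERNS, and card_type are hypothetical.

```python
import re
from typing import Optional

# One optional separator, captured so that \1 can require the same one again.
SEP = r"([-. ]?)"

CARD_PATTERNS = {
    # Prefix "4", "51"-"55", or "6011"; grouping 4-4-4-4.
    "16-digit": re.compile(
        r"\b(?:4[0-9]{3}|5[1-5][0-9]{2}|6011)" + SEP + r"[0-9]{4}\1[0-9]{4}\1[0-9]{4}\b"
    ),
    # Prefix "34" or "37"; grouping 4-6-5.
    "AMEX": re.compile(r"\b3[47][0-9]{2}" + SEP + r"[0-9]{6}\1[0-9]{5}\b"),
    # Prefix "300"-"305", "36", or "38"; grouping 4-6-4.
    "Diners Club": re.compile(
        r"\b(?:30[0-5][0-9]|3[68][0-9]{2})" + SEP + r"[0-9]{6}\1[0-9]{4}\b"
    ),
}


def card_type(candidate: str) -> Optional[str]:
    """Return the name of the pattern that the whole candidate matches, if any."""
    for name, pattern in CARD_PATTERNS.items():
        if pattern.fullmatch(candidate):
            return name
    return None


print(card_type("4321 9999 9999 9995"))  # 16-digit
print(card_type("4321-9999.9999 9995"))  # None (mixed separators)
print(card_type("3782-822463-10005"))    # AMEX
```

This checks only the shape and prefix of a candidate; the check digit still has to be validated separately.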
The final digit in a credit card number is a check digit created using the Luhn algorithm. Working from the right end of the number, double every second digit, starting with the digit immediately to the left of the check digit. Then add up the individual digits of the resulting numbers (both the ones that were doubled and the ones that were not). If the result is a multiple of 10, then the number is valid.
For example, given the number 1234 5678 9012 3456:
Number:   1  2  3  4  5  6  7  8  9  0  1  2  3  4  5  6
Double:   2  2  6  4 10  6 14  8 18  0  2  2  6  4 10  6
Adding 2 + 2 + 6 + 4 + 1 + 0 ... + 1 + 0 + 6 gives 64, which is not a multiple of 10, so the number is invalid.
Given the number 1234 5678 9876 3333:
Number:   1  2  3  4  5  6  7  8  9  8  7  6  3  3  3  3
Double:   2  2  6  4 10  6 14  8 18  8 14  6  6  3  6  3
Adding 2 + 2 + 6 + 4 + 1 + 0 ... + 6 + 3 gives 80, which is a multiple of 10, so the number is valid.
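The check digit calculation described above can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes the candidate has already matched one of the regular expressions, so anything that is not a digit is treated as a separator and ignored.

```python
def luhn_sum(number: str) -> int:
    """Return the Luhn digit sum of the digits in number."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:  # every second digit, counting from the right
            digit *= 2
        total += digit // 10 + digit % 10  # add the individual digits
    return total


def luhn_valid(number: str) -> bool:
    """A number is valid if its Luhn digit sum is a multiple of 10."""
    return luhn_sum(number) % 10 == 0


print(luhn_sum("1234 5678 9012 3456"), luhn_valid("1234 5678 9012 3456"))  # 64 False
print(luhn_sum("1234 5678 9876 3333"), luhn_valid("1234 5678 9876 3333"))  # 80 True
```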
U.S. Social Security Numbers
Social Security numbers are divided into a 3-digit area number, which is assigned geographically, a 2-digit group number, which is assigned in a particular order within an area, and a 4-digit serial number, which is assigned sequentially.
Our implementation will use the following regular expressions:
[0-9]{3}-[0-9]{2}-[0-9]{4}
[0-9]{3}\.[0-9]{2}\.[0-9]{4}
[0-9]{3} [0-9]{2} [0-9]{4}
Here are some examples of the expressions above:
555-55-5555
555.55.5555
555 55 5555
The Social Security Administration maintains a list of the area/group numbers that have been assigned: SSN issued [3]. Because this list changes periodically, we can't rely on it for validation. Instead, the validation function will check that none of the 3 fields is all zeros, and that the first 3 digits are less than 800. (The previous reference uses 771 as the limit, but the SSA has already assigned numbers with the first 3 digits 771 and 772.)
(Numbers starting with 666 are unassigned, and numbers in the range 987-65-4320 through 987-65-4329 are reserved for advertising. Also, 078-05-1120 is the most misused SSN; it was the actual SSN of a secretary at a wallet company, which used the number as an example.)
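A minimal Python sketch of the Social Security number checks might look like the following. It implements only the two checks stated for the validation function (no all-zero field, and an area number below 800) and does not reject the unassigned or reserved numbers mentioned above. The names SSN_PATTERN and ssn_valid are hypothetical, and the three expressions are combined into one by using a backreference for the separator.

```python
import re

# The three SSN forms combined: the separator is captured once and \2
# requires the same separator between the group and serial numbers.
SSN_PATTERN = re.compile(r"\b([0-9]{3})([-. ])([0-9]{2})\2([0-9]{4})\b")


def ssn_valid(candidate: str) -> bool:
    """Apply the validation checks to a candidate string."""
    m = SSN_PATTERN.fullmatch(candidate)
    if m is None:
        return False
    area, group, serial = int(m.group(1)), int(m.group(3)), int(m.group(4))
    if area == 0 or group == 0 or serial == 0:
        return False  # no field may be all zeros
    return area < 800  # first 3 digits must be less than 800


print(ssn_valid("555-55-5555"))  # True
print(ssn_valid("000-55-5555"))  # False
print(ssn_valid("800 55 5555"))  # False
```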
CUSIP Numbers
CUSIP (Committee on Uniform Security Identification Procedures) numbers are 9-character alphanumeric identifiers that identify North American securities of various types. The number is divided into a 6-character issuer number, which uniquely identifies the issuer (e.g., a company), a 2-character suffix that identifies the particular security (e.g., common stock vs. preferred stock vs. option vs. fixed-income instrument), and a final check digit.
The CUSIP smart identifier code will use the following regular expressions:
[0-9]{3}[0-9a-zA-Z]{3} [0-9a-zA-Z]{2} [0-9]
[0-9]{3}[0-9a-zA-Z]{3}-[0-9a-zA-Z]{2}-[0-9]
[0-9]{3}[0-9a-zA-Z]{3}[0-9a-zA-Z]{2}[0-9]
The validation function is similar to the one used for credit card numbers. The only difference is that letters in the CUSIP number are converted to a numerical value by assigning A=10, B=11, ..., Z=35.
An example from the cusip.com web site uses the CUSIP number 392690 QT 3:
Characters:       3  9  2  6  9  0  Q  T  3
Convert letters:  3  9  2  6  9  0 26 29  3
Double:           3 18  2 12  9  0 26 58  3
Adding 3 + 1 + 8 + 2 + 1 + 2 + ... + 5 + 8 + 3 gives 50, which is a multiple of 10, so the original number was valid.
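The CUSIP check can be sketched in Python along the same lines as the credit card check, with letters converted to values first. The names below are hypothetical, and the sketch assumes the candidate has already matched one of the regular expressions, so any character that is not a letter or digit is treated as a separator.

```python
# Position in this string gives the value: '0'-'9' -> 0-9, 'A' -> 10, ..., 'Z' -> 35.
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def cusip_sum(cusip: str) -> int:
    """Return the digit sum used by the CUSIP check."""
    values = [ALPHABET.index(ch) for ch in cusip.upper() if ch in ALPHABET]
    total = 0
    for position, value in enumerate(reversed(values)):
        if position % 2 == 1:  # every second character, counting from the right
            value *= 2
        total += value // 10 + value % 10  # add the individual digits
    return total


def cusip_valid(cusip: str) -> bool:
    """A CUSIP number is valid if its digit sum is a multiple of 10."""
    return cusip_sum(cusip) % 10 == 0


print(cusip_sum("392690 QT 3"), cusip_valid("392690 QT 3"))  # 50 True
```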
ABA Routing Numbers
An ABA (American Bankers Association) routing number is a 9-digit value. The first 4 digits are the Federal Reserve routing symbol, the next 4 are the institution identifier, and the last is a check digit.
The ABA routing number smart identifier code will use the following regular expressions:
[0-9]{4} [0-9]{4} [0-9]
[0-9]{4}-[0-9]{4}-[0-9]
[0-9]{9}
The validation function multiplies each digit, from left to right, by the repeating weights 3, 7, 1, and adds up the products; if the sum is a multiple of 10, the number is valid.
For example, consider the number 123456789:
Digits:       1  2  3  4  5  6  7  8  9
Multiply by:  3  7  1  3  7  1  3  7  1
Product:      3 14  3 12 35  6 21 56  9
Adding 3 + 14 + 3 + 12 + 35 + 6 + 21 + 56 + 9 gives 159, which is not a multiple of 10, so the original number was invalid.
Given the number 322271627:
Digits:       3  2  2  2  7  1  6  2  7
Multiply by:  3  7  1  3  7  1  3  7  1
Product:      9 14  2  6 49  1 18 14  7
Adding 9 + 14 + 2 + 6 + 49 + 1 + 18 + 14 + 7 gives 120, which is a multiple of 10, so the original number was valid.
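The weighted sum used in the two examples above can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes the candidate has already matched one of the regular expressions, so anything that is not a digit is treated as a separator.

```python
WEIGHTS = (3, 7, 1)  # applied left to right, repeating


def aba_sum(number: str) -> int:
    """Return the weighted digit sum of a routing number."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    return sum(d * WEIGHTS[i % 3] for i, d in enumerate(digits))


def aba_valid(number: str) -> bool:
    """A routing number is valid if its weighted sum is a multiple of 10."""
    return aba_sum(number) % 10 == 0


print(aba_sum("123456789"), aba_valid("123456789"))  # 159 False
print(aba_sum("322271627"), aba_valid("322271627"))  # 120 True
```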
(Although certain ranges of Federal Reserve routing symbols are reserved, and therefore not assigned, the validation algorithm will not check for reserved numbers, to avoid having to revise it if the ABA changes its policy.)