Index: Type bundle index here DOC Library: Type library name here DOC Number: Type document number here
Prepared by: Gabrielle Anderson Date: 4/17/02 Job Code: 320087
2nd CLASS Briefing Reviewed by: Type reviewer name here Review Date: Type review date here
Record of Interview Purpose
To find out how the CLASS namecheck system operates
Contact Method
In-person meeting
Contact Place
State Department, Consular Affairs Bureau
Contact Date
March 27, 2002
Participants
State: Dave Williams, Cathy Baskay
GAO: Judy McCloskey, Jody Woods, Kate Brentzel, Gabrielle Anderson, Richard Hung
We received an explanation of the principal techniques that are used in the CLASS namecheck system. We also learned that the reason for the failure of the original Al-Jiddi namecheck was most likely a country-relationship table that did not take into account a possible country association between Canada and Tunisia. Finally, we discussed the resource issues that would be involved if biometrics were introduced into the system.
Architectural Concepts of Name Searching
Mr. Williams informed us that there are five basic questions that need to be answered when constructing a namecheck system. The first question is whether or not each namecheck query will consult all the records in the system. In the case of CLASS, there would clearly not be enough capacity to scan all 6 million records in the system during one namecheck, especially with many namechecks being conducted each day. Therefore, the second question involves determining the criteria for establishing which subset of these 6 million records is to be searched. This determination is referred to as Phase I.
The third question to be answered concerns what techniques (linguistic and logic) are to be used to evaluate that particular subset of records. This is considered Phase II. The fourth question concerns the criteria required to constitute a "hit," i.e., what is considered a close enough match to be returned as a hit. As the subset of records is being evaluated, each record receives points based on how close a match it is. How many points must a record receive in order to be considered a legitimate hit? The fifth question concerns the way in which the resulting hit list will be ordered, e.g., with exact name matches first, or with CLASS I hits first.
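The fourth and fifth questions (hit threshold and hit-list ordering) can be illustrated with a minimal Python sketch. The point scale, the threshold of 60, and the sample scores are assumptions made up for illustration; the interview did not give the actual CLASS values.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    surname: str
    score: int    # points awarded in Phase II (hypothetical scale)
    exact: bool   # True when the surname matches the query exactly

HIT_THRESHOLD = 60  # assumed cut-off; the real value was not discussed

def build_hit_list(candidates, threshold=HIT_THRESHOLD):
    """Keep records at or above the threshold, exact matches first."""
    hits = [c for c in candidates if c.score >= threshold]
    # Sort key: exact matches first, then highest score first.
    return sorted(hits, key=lambda c: (not c.exact, -c.score))

candidates = [
    Candidate("GUTIERRES", 85, False),
    Candidate("GUTIERREZ", 100, True),
    Candidate("GUERRA", 40, False),
    Candidate("GUTTIEREZ", 70, False),
]
hit_list = build_hit_list(candidates)
print([c.surname for c in hit_list])
```

Raising or lowering the assumed threshold changes how many records reach the consular officer, which is the trade-off the fourth question describes.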
Namechecking Techniques
Mr. Williams ran through some of the techniques that may be used in order to run a namecheck system. He stressed that no system will use just one of these techniques and that each technique should be considered as a tool. Since each technique has its strengths and weaknesses, a good namecheck system will combine a variety of them in order to achieve the best possible results. He also stressed that visa adjudication involves a good deal of subjective decision-making on the part of the consular officer. The information that is required when performing a CLASS namecheck is surname and gender. Additional information is preferred, but not required, e.g., estimated date of birth, country of birth and first name.
Name Compression: This technique takes the first letter of a surname, drops all its vowels and reduces any double consonants to a single consonant. The system will then return any surnames that fit this pattern. Its strength lies in the fact that it is fast and precise. However, this technique produces near misses if a surname is spelled slightly differently (e.g., reducing Gutierrez to GTRZ would miss Gutierres, which would be compressed to GTRS). If, in an attempt to account for these near misses, you were to require that the system return all matches within one character, you would pull in far too many hits (since there is a maximum of 6 characters in a compressed name). Another weakness of this technique is that it does not work well on short names, e.g., Lee.
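The name compression rule described above can be sketched in a few lines of Python. This is an illustration only, not the CLASS implementation: treating Y as a consonant, collapsing doubled letters before dropping vowels, and truncating to 6 characters are assumptions drawn from the description.

```python
VOWELS = set("AEIOU")   # assumption: Y is treated as a consonant
MAX_LEN = 6             # maximum compressed length mentioned in the briefing

def compress(surname: str) -> str:
    """Keep the first letter, collapse doubled letters, drop later vowels."""
    letters = [ch for ch in surname.upper() if ch.isalpha()]
    if not letters:
        return ""
    # Collapse adjacent repeated letters (e.g. RR -> R).
    collapsed = [letters[0]]
    for ch in letters[1:]:
        if ch != collapsed[-1]:
            collapsed.append(ch)
    first, rest = collapsed[0], collapsed[1:]
    key = first + "".join(ch for ch in rest if ch not in VOWELS)
    return key[:MAX_LEN]

for name in ("Gutierrez", "Gutierres", "Lee"):
    print(name, "->", compress(name))
```

Under these assumptions Gutierrez and Gutierres compress to different keys, which is the near miss described above, and a short name such as Lee compresses to a single letter.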
Synonym Association: This technique can be used with several of the namecheck fields. For example, synonym association can be used to establish a relationship between the name "Joe" and its derivations such as Joey, Jose, Joseph, Giuseppe, etc. Thus, a search for Joe would turn up not only persons with this exact first name but also those whose name was one of these derivatives. In the case of country, Russia has been equated with all of the former Soviet republics, so that a search for "Russia" will result in initial hits on any of the current independent republics, e.g., Azerbaijan, Belarus, Estonia, Georgia, etc. Additionally, the synonym association technique
ensures that surname qualifiers (e.g., Van, De, Al, etc.) are separated out when namechecks are performed.
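A minimal Python sketch of synonym association follows. The tables contain only the examples given in the briefing and are deliberately incomplete; the function names and the choice to make given-name groups symmetric are assumptions for illustration.

```python
# Example groups taken from the briefing; real tables would be far larger.
GIVEN_NAME_GROUPS = [{"JOE", "JOEY", "JOSE", "JOSEPH", "GIUSEPPE"}]
COUNTRY_SYNONYMS = {
    # One-directional: a search on Russia also covers these republics.
    "RUSSIA": {"RUSSIA", "AZERBAIJAN", "BELARUS", "ESTONIA", "GEORGIA"},
}
QUALIFIERS = {"VAN", "DE", "AL"}

def given_name_variants(name: str) -> set:
    """Return every first name treated as equivalent to the one entered."""
    n = name.upper()
    for group in GIVEN_NAME_GROUPS:
        if n in group:
            return group
    return {n}

def country_variants(country: str) -> set:
    """Return the countries searched when this country is entered."""
    c = country.upper()
    return COUNTRY_SYNONYMS.get(c, {c})

def split_qualifiers(surname: str):
    """Separate qualifiers such as Van, De, Al from the core surname."""
    tokens = surname.upper().replace("-", " ").split()
    qualifiers = [t for t in tokens if t in QUALIFIERS]
    core = [t for t in tokens if t not in QUALIFIERS]
    return qualifiers, " ".join(core)

print(sorted(given_name_variants("Joe")))
print(sorted(country_variants("Russia")))
print(split_qualifiers("Al-Jiddi"))
```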
N-Gram Analysis: Bi-gram analysis breaks down a surname two letters at a time. For example, Gutierrez is broken down into "_G, GU, UT, TI, IE, ER, RR, RE, EZ, Z_." This particular technique compares the bi-grams for the desired name with the bi-grams for all other names in the data subset. At present in CLASS, if half of the bi-grams in a particular name match the bi-grams in the desired name, then this name is returned as a hit. However, the level required to return a hit based upon this bi-gram analysis can be changed. The same is true for tri-gram analysis, which is identical except that it breaks down a surname into three-letter components. The strength of the N-gram technique is that it is highly tunable, but its weakness lies in the fact that it has a low level of discrimination. Hence, the N-gram analysis is a coarse method, one that is used to develop subsections of data rather than to produce the desired "hit."
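The bi-gram breakdown and the one-half matching rule can be sketched in Python as follows. The function names are hypothetical, and counting repeated bi-grams as a multiset is an assumption; the briefing did not say how CLASS handles repeats.

```python
from collections import Counter

def ngrams(name: str, n: int = 2) -> list:
    """Split a name into overlapping n-letter pieces, padded with '_'."""
    padded = "_" + name.upper() + "_"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def match_fraction(query: str, candidate: str, n: int = 2) -> float:
    """Fraction of the candidate's n-grams that also occur in the query."""
    q = Counter(ngrams(query, n))
    c = Counter(ngrams(candidate, n))
    shared = sum((q & c).values())
    return shared / sum(c.values())

def is_hit(query: str, candidate: str, n: int = 2, threshold: float = 0.5) -> bool:
    """Apply the tunable threshold (one half, per the briefing)."""
    return match_fraction(query, candidate, n) >= threshold

print(ngrams("Gutierrez"))
print(is_hit("Gutierrez", "Gutierres"))
```

The `threshold` and `n` parameters reflect the tunability described above: raising the threshold or switching to tri-grams narrows the set of names returned.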
Position Discounting: This technique allows you to determine how many of the bi-gram or tri-gram hits fall into the same position as they do in the desired name. For example, a namecheck on "Wilson," using a simple bi-gram analysis, would return "Sonils" as a hit (since 4 of the 7 bi-grams in these names match). However, when position discounting is used along with the bi-gram analysis, "Sonils" is rejected as a hit, since none of the matching bi-grams in "Sonils" occupy the same positions as they do in "Wilson."
Component Comparison: This technique assigns a value to surname endings based on the likelihood that a surname with a particular ending belongs to someone from a particular country. For example, the Russian surname ending "-ichna" is assigned a value of 0.93, indicating that there is a 93% likelihood that a person whose surname ends in "-ichna" is from a Russian-speaking or Slavic country. In that case, it is clear that the most appropriate algorithm to use is the Russian/Slavic algorithm. Another component comparison technique to determine the appropriate algorithm is the tri-gram probability table. In this table, all the possible tri-gram combinations in the alphabet (from "_AA" to "ZZ_") are listed, along with percentages that indicate to which linguistic algorithm they are likely to belong. For example, with the tri-gram "MAS," there is a 38.5% likelihood that a name containing this tri-gram will be Russian/Slavic and a 46.9% likelihood that it will be Arabic. This is a tool to select which algorithms to apply in each namecheck case.
Cultural Regularization: This technique involves transliterating a name from its foreign alphabet spelling into the many forms it could take using the Roman alphabet (e.g., Qadafi, Khadafi, Cadhafi, etc.). This ensures that one spelling of
a name will turn up other versions of the same name, provided that all possible spellings for that Arabic name have been entered.
Letter-Based Re-Write Rules: This is an alternative way of addressing the issue of names with multiple transliterations. This technique tries to regularize all spellings of a name into a single entry. It does so by assigning a standard spelling to the phonetic sounds that make up the name. For example, the system will convert Mafouz, Mahfoudh, and Mehfouth into Mahfouz for searching purposes. Letter-based re-write rules are currently being used for Arabic names. Both the strength and the weakness of this technique lie in its global reach. Although the technique prevents you from having to enter every possible spelling of a name, it is also likely to pull in a vast number of hits (e.g., with Arabic or Hispanic names) precisely because the system recognizes only one version of the name.
Phonetic Transcription: This particular technique assigns a phonetic spelling to every name, e.g., 'Stephen' becomes 'Steven.' This is useful because, when presented with unfamiliar names, people tend to spell phonetically. Many names received from the intelligence community are spelled phonetically since they are often names that are overheard. However, the use of phonetic transcription, which is tonal in nature, may require significant manual oversight.
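The letter-based re-write rules described above can be illustrated with a small Python sketch. The three rules below are invented solely so that the briefing's example spellings collapse to one search key; they are not the rules used in CLASS, and the resulting key is an internal form rather than the standard spelling "Mahfouz."

```python
import re

# Hypothetical, ordered re-write rules (pattern, replacement).
# Invented for illustration; the actual CLASS rules were not provided.
REWRITE_RULES = [
    (r"DH|TH", "Z"),  # treat DH and TH as variant spellings of Z
    (r"H", ""),       # drop remaining H, which is often omitted
    (r"E", "A"),      # fold E into A
]

def regularize(name: str) -> str:
    """Apply each rule in order to reduce a name to a single search key."""
    key = name.upper()
    for pattern, replacement in REWRITE_RULES:
        key = re.sub(pattern, replacement, key)
    return key

for spelling in ("Mafouz", "Mahfoudh", "Mehfouth", "Mahfouz"):
    print(spelling, "->", regularize(spelling))
```

Because every spelling maps to the same key, a search on any one of them retrieves all the others. It also shows why the technique can pull in a large number of hits: any other name that happens to reduce to the same key is returned as well.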
Edit Distance Algorithm: This technique measures how many edits are necessary to change a name in the system into the desired name, i.e., what it takes to make the two names equal. For example, if you enter "Waldmirr," the edit-distance algorithm will take this name and compare it to a name in the system such as Vladimir. It will determine how many edits need to be done in order to change Waldmirr into Vladimir. In this case, there are 4 edit operations that need to take place: substitution (of 'V' for 'W'); insertion (of the middle 'I' in Vladimir); deletion (of the extra 'R' in Waldmirr); and reversal (of the 'AL' to 'LA'). Next, the technique looks at the positions of these changes within the two names and assigns values to the distances between them. Using a formula to assess both the number of edits and the distances between them in the two names, the namecheck system will return Vladimir as a hit for Waldmirr. However, if the bi-gram method were used on this particular example, the name Vladimir would not have been returned as a hit. The edit-distance algorithm is a very strong technique; it is, in fact, the primary technique used in spellcheckers. Its weakness is that it is machine-intensive.
Al-Jiddi Namecheck
We asked Mr. Williams about the Al-Jiddi namecheck done earlier this year by the U.S. Consulate in Montreal. They ran a namecheck on Al-Jiddi, a known Al-Qaeda terrorist, entering his known name, country of birth, estimated date of birth, and current nationality. This did not result in a hit. Only after country of birth and nationality were left blank did the system return a CLASS II hit for Al-Jiddi.
Mr. Williams gave the likely reason for this. When setting up the namecheck system as a whole, one of the first problems that must be addressed is establishing the criteria that will determine which records (out of 6 million) will be checked. This is Phase I of the search, i.e., when CLASS establishes a searchable subset of the 6 million total names. One of the most important criteria used in Phase I is the country field. In Phase I, the country field is analyzed using country-relationship tables. These tables indicate the likelihood that a person from the country entered in the search will also possess biographical data from another country. The country-relationship tables in CLASS do not indicate that a person of Canadian citizenship is likely to have a Tunisian background. Hence, Al-Jiddi's record was thrown out in Phase I, i.e., it was not included in the subset of names that were then searched. Once the country fields were left blank, the country-relationship tables were not used to establish a subset and therefore Al-Jiddi's record was returned as a hit. However, Mr. Williams mentioned that attempting to fix a problem, such as that posed by the Al-Jiddi namecheck, could have unintended consequences. Re-establishing the threshold for the subset may pull in Al-Jiddi's record but may very well pull in many more records that will also have to be examined.
Country-Relationship Tables
In terms of establishing these country-relationship tables in the first place, Mr. Williams stated that they rely on officers in the field to report back to the Visa Office on migration patterns (which determine country associations). Based on this new information, the Visa Office can adjust the table relationships. These country relationships do not have to be reciprocal. The last time such an adjustment took place was under John Brennan's predecessor.
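The role of the country-relationship tables in Phase I can be sketched in Python. Everything in this sketch is hypothetical: the table contents, the placeholder countries X and Y, and the record structure are invented, and real Phase I processing would apply other criteria as well. The sketch only shows how a missing association excludes a record and how a blank country field bypasses the table.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical table: country entered in the query -> countries whose
# records are kept in the Phase I subset. Entries are directional, so
# X -> Y does not imply Y -> X.
COUNTRY_RELATIONSHIPS = {
    "CANADA": {"CANADA"},   # no association with Tunisia, as described
    "X": {"X", "Y"},
    "Y": {"Y"},
}

@dataclass
class Record:
    surname: str
    country: str

def related_countries(country: str) -> set:
    """Countries searched when this country is entered in a query."""
    c = country.upper()
    return COUNTRY_RELATIONSHIPS.get(c, {c})

def phase1_subset(records, query_country: Optional[str] = None):
    """Keep only records whose country is related to the query country."""
    if not query_country:
        # Blank country field: the tables are not consulted at all.
        return list(records)
    allowed = related_countries(query_country)
    return [r for r in records if r.country.upper() in allowed]

records = [Record("EXAMPLE-A", "TUNISIA"), Record("EXAMPLE-B", "CANADA")]
print([r.surname for r in phase1_subset(records, "Canada")])
print([r.surname for r in phase1_subset(records)])
```

Adding Tunisia to the Canada entry would bring the first record back into the subset, along with every other Tunisia-associated record, which is the unintended consequence Mr. Williams cautioned about.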
CLASS
There are about 4 major CLASS releases each year, e.g., screen changes, table changes, or new algorithms. Posts have access to the same algorithms that exist at headquarters. The algorithms currently running in CLASS are: Russian/Slavic; Arabic; Hispanic; generic; date of birth; and country of birth. Linguistics teams usually put together four groups of names to test the various algorithms, but it is important to note that they cannot test outliers.
Mr. Williams mentioned that on April 22, there would be a 4-day CLASS course for mid-level and senior consular officers and visa managers, though he admitted that the course might be of some interest to junior officers as well. The focus of the course would be on the Arabic language namecheck. Since this course was just starting up, there were still many questions surrounding it.
The CLASS back-up system is known as BNS. When BNS is in use, posts can make local updates on their local BNS system. But global changes to BNS, i.e., incorporating the changes made at individual locations
worldwide, are compiled at headquarters and sent out to posts once a month.
Mr. Williams also noted that there is a new NIV system that is currently in the beta-testing phase. It will be piloted in London.
Biometrics
Mr. Williams viewed biometrics as another tool to use in conducting a comprehensive security check. The use of biometrics would be a move toward the development of an identity system, rather than simply a namecheck system. An individual would have to be much more intelligent to foil an identity system. Mr. Williams asserted that, despite vendor claims to the contrary, facial recognition techniques are not especially successful. At present, both facial recognition and fingerprinting run on very limited databases. If either of these techniques were to become part of a standard identity check, there would have to be a significant increase in resources to accommodate the millions of new records. In checking fingerprints, for example, a turn-around time of a few seconds would be needed. At present, a fingerprint inquiry sent to the FBI takes 24-48 hours. The introduction of biometrics would also have a significant impact on operations at post. Consular officers want to be able to adjudicate a visa application in the course of one day, or in as little time as possible.
Documents
We would like to obtain copies of the country-relationship tables used in CLASS.