The following is the text of Department of Defense Polygraph Institute report DODPI94-R-0008, A Comparison of Psychophysiological Detection of Deception Accuracy Rates Obtained Using the Counterintelligence Scope Polygraph and the Test for Espionage and Sabotage Question Formats. This report is available from the Defense Technical Information Center (DTIC) as report #ADA319333, and was also reprinted in the American Polygraph Association quarterly, Polygraph, Vol. 26, No. 2 (1997), pp. 79-106.

The text presented here is taken from the Polygraph reprint. Page numbers are provided between curly braces for citation purposes. Corrections in brackets were added to the reprint text.

{79}

A COMPARISON OF PSYCHOPHYSIOLOGICAL DETECTION OF DECEPTION ACCURACY RATES OBTAINED USING THE COUNTERINTELLIGENCE SCOPE POLYGRAPH AND THE TEST FOR ESPIONAGE AND SABOTAGE QUESTION FORMATS

By

Department of Defense Polygraph Institute
Research Division Staff
June 1995

Abstract

This study was designed to compare the decision accuracy rates obtained using a new psychophysiological detection of deception test, the Test for Espionage and Sabotage (TES) to those obtained using two versions of the counterintelligence scope polygraph (CSP) format; the CSP format using probable lie control (PLC) questions (CSP-PLC), and the CSP format using directed lie control (DLC) questions (CSP-DLC). The TES format differs from the CSP formats in that: (a) the number of issues being tested in a question series is reduced; (b) a maximum of three question repetitions are used to calculate question scores; (c) between-test stimulation is eliminated; (d) the order of questions within the question sequence cannot be altered; (e) each relevant question is compared to the same control questions; (f) the pretest is brief, more standardized, and follows a logical sequence of information presentation; and (g) problems associated with PLC questions are reduced by using DLC questions. The 277 examinees included in the analyses were recruited from the communities surrounding Ft. McClellan, AL. Ninety of the examinees [were] programmed guilty (PG) by enacting one of four possible mock espionage scenarios. Eighteen certified government examiners conducted the examinations. The decisions of the examiners who administered the TES format were significantly more accurate (83.3%) at identifying the examinees than were the decisions of the examiners who administered either the CSP-PLC (55.6%) or the CSP-DLC (58.6%) format. There were no significant differences among the accuracies of the examiner's [sic] decisions at identifying the programmed innocent (PI) examinees. The decision accuracies obtained using the three formats to identify PI examinees were 88.9%, 95.3%, and 95.2% for the TES, CSP-PLC, and CSP-DLC formats respectively. Blind scoring of the examinations yielded similar results.

{80}

Federal agencies use three basic types of psycho physiological detection of deception (PDD) examinations: Pre-employment, security screening, and specific-issue criminal examinations. Several authors have summarized the research conducted to assess the validity of specific-issue criminal examinations (Ansley, 1990; Kircher, Horowitz, & Raskin, 1988; McCauley & Forman, 1988; Raskin, 1989). Little research has been conducted with either pre-employment or security screening examinations. During security screening examinations, the majority of the Department of Defense (DoD) agencies utilize the counterintelligence scope polygraph (CSP) format. Although there is widespread use of CSP security screening examinations--the DoD reported 17,970 examinations conducted during fiscal year 1993 (Department of Defense, 1993) -- the analog studies, to date, suggest that when the CSP format is utilized, 94.9% of the programmed innocent (PI) examinees are correctly identified, but only 43.2% of the programmed guilty (PG) examinees are correctly identified.

Using four different security screening formats, Barland, Honts, and Barger (1989) assessed the accuracies of decisions identifying examinees as guilty or PI of enacting a mock crime. Examiners from four government agencies conducted PDD examinations utilizing their agency's format. The formats included a CSP format in which standard probable lie control (PLC) questions were asked (CSP-PLC), a CSP format in which directed lie control (DLC) questions were asked (CSP-DLC), and two variations of relevant-irrelevant (R/I) formats. The authors reported that 97.2% of all the PI examinees were correctly identified but only 33.7% of all the PG examinees were correctly identified. Differences among the decision accuracies obtained by examiners utilizing the four formats were significant. Decisions based on the results of CSP-PLC tests were the least accurate (8%) in identifying PG examinees, and decisions based on the results [of] CSP-DLC tests were the most accurate (48%) in identifying PG examinees.

However, there were several flaws in the design and analyses of the study. The most critical flaw concerns how the authors reported correct decisions. If, based on the test results, the examiner's decision was that deception was indicated, the examiner attempted to obtain a confession from the examinee. If the examiner was unsuccessful in obtaining a confession, another examination was conducted. If the examiner's decision, based on the second examination, was that no deception was indicated, then the reported decision for that examinee was no deception indicated. Therefore, if a PG examinee was correctly identified but did not confess, and the result of the second test indicated the examinee was truthful, the examinee was reported as a miss. The rationale for this procedure was to simulate field situations. In most screening examinations, when an examinee's physiological responses to the relevant questions indicated deception, the examinee was questioned and then the examination was rerun.

There is concern regarding whether the psychological significance of the relevant questions for an examinee in a mock laboratory situation is equivalent to the psychological significance of the relevant questions for an examinee in an actual field examination (Furedy, 1986; Iacono & Patrick, 1988). Furedy (1986), and Iacono and Patrick (1988) suggest that the psychological significance of the relevant questions is less for examinees in a mock laboratory situation. There is little research concerning the affect [sic] of retesting, in a mock situation, on the

{81}

accuracy of PDD test decisions. However, it is well established that physiological responses to a less significant stimulus will habituate faster (O'Gorman, 1977; Sokolov, 1963). Therefore, due to rapid habituation of the less significant stimulus, fewer PG examinees might be identified with repeated testing. In the Barland, et al. (1989) study, once the PG examinees knew they had been caught (i.e., confronted with the initial deceptive decision)--even if they did not confess--the psychological significance of the relevant questions may have been reduced. Therefore, if the results of only the first test were considered, more PG examinees were correctly identified than the reported results indicate.

Another criticism of the study concerns the wording of the relevant questions. Examinees were asked if they had committed espionage or sabotage "against the United States." Many of the experienced examiners who participated in the study believed that because the PG examinees had participated in a "mock" crime and had not committed any act "against the United States," the wording of the relevant questions was inappropriate (Barland, et al. 1989). The examiners believed that the question wording might have reduced, even more, the psychological significance of the acts that the PG examinees did commit. This could have contributed to the low accuracy rates for identifying the PG examinees.

The only other laboratory study concerned with screening, conducted by Honts (1989), was not designed to test the validity of the CSP format but to compare the accuracies of decisions identifying PI and PG examinees using two different sets of relevant questions. Eighty-nine percent of the PI examinees were correctly identified but only 58% of the PG examinees were correctly identified. No difference was found between the accuracies of the decisions as a function of the two sets of questions. A detailed report of the research was not written, therefore it is difficult to evaluate the study. However, one possible problem with the design was that the PG examinees were allowed only 10 minutes to execute a complex scenario that included memorizing a lengthy article. Another problem with the study may have been the way the examiners' decisions were reported. The report states, "The CSP examinations were administered just as if they were being given in the field." (Honts, 1989, p. 4). This suggests, but does not state, that decisions were reported in the same manner as in the Barland et al. (1989) study. If decisions were reported in that manner, then the accuracy of the decisions identifying the PG examinees may have been higher than indicated by the report.

Although the report results of the first study are suspect, combined with the results of the second study they suggest that decisions based on CSP test data are not highly accurate in identifying PG examinees--at least in a laboratory situation. This study was therefore completed to compare the accuracy of decisions obtained concerning PG and PI examinee's [sic] veracity using a new screening test format, the Test for Espionage and Sabotage (TES), to that obtained using the CSP-PLC and CSP-DLC question formats.

{82}

TES Development

Theoretical Basis: Significance/attention model

The relationship between arousal and attention is well established for respiratory (Obrist, 1981; Sokolov, 1963), electrodermal (Dawson & Schell, 1982; Dawson, Schell, Beers, & Kelly, 1982; Kilpatrick, 1972; Kimmel, van Olst, & Orlebeke, 1979; Nikula, 1991; Ohman, 1979), and cardiovascular (Coles & Duncan-Johnson, 1975; Coles & Strayer, 1985; Jennings, 1986a; Jennings, 1986b) measures of arousal. When a significant change in sensory stimulation occurs, attention shifts to focus on the input. If the significant change is perceptual (increased volume, novel stimulus, etc.) the attention shift is referred to as an orienting response (OR). The physiological arousal associated with an OR is well documented (Jennings, 1986a). When the significant change is in the meaning or content of the stimulus, self-focused attention occurs, resulting in the concomitant physiological arousal (Germana & Chernault, 1968; van Olst, Hemstra, & ten Kortenaar, 1979). In addition, when attention is focused more, the arousal is greater (Carver, Blaney, & Scheier, 1979; Dawson, Schell, & Filion, 1990; Easterbrook, 1959; McLean, 1969; Maltzman, Kantor, & Langdon, 1966; Sampson, 1969; Waid, Orne, & Orne, 1981). Therefore, arousal level is a valid indicator of attention (Glass, Holyoak, & Santa, 1979; Kahneman, 1973). Of critical importance to PDD is the fact that the amount that attention is focused changes in proportion to the degree of stimulus significance (Jennings, 1986b). Therefore, a stimulus of greater significance will elicit a greater focusing of attention that will result in greater physiological arousal.

Several studies have indicated clearly that the act of lying is not a necessary condition for a PDD examination to yield accurate results (Davis, 1961; Dawson, 1980; Gustafson & Orne, 1965; Kugelmass, Lieblich, & Bergman, 1967; Orne, Thackray, & Paskewitz, 1972; Thackray & Orne, 1968; Waid, Orne, & Wilson, 1979). This author would argue further, that PDD tests detect neither deception nor guilt, but merely reflect relative degrees of physiological arousal. The scores assigned based on the physiological arousal indicate only that the examinee focused attention more when one type of question (relevant or control) was asked compared to when the other type of question was asked. The greater focusing of attention indicates a more significant stimulus. Therefore, the physiological responses are used to infer a focusing of attention due to the significance of the questions. However, inferences about why the relevant questions are significant must be cautious. One (the most probable one) of the reasons the relevant questions are significant to an examinee is because the examinee is being deceptive.

The design of this study and the development of the TES format were based on the previous hypothesis--the test does not assess deception, but merely indicates a relative degree of arousal from which significance of the stimulus is inferred. Therefore, the decisions for the TES are either: (a) significant responding (SR) occurred following the relevant questions, (b) no significant response (NSR) occurred following the relevant questions, or (c) the responding following the relevant questions was inconclusive (INC). It is the examiner's job to eliminate,

{83}

during the pretest, possible confounding reasons that would cause the relevant questions to be significant to the examinee. Then if the examinee does respond physiologically when the relevant questions are asked, the examiner must ascertain why the relevant questions were significant to the examinee.

Control Questions

The standard control question is the PLC, in which the examiner manipulates the examinee so the examinee's answer to the question probably is a lie. The major difficulty with using the PLC is the need to increase the psychological significance of the PLC for the examinee. The examinee must believe that responses elicited by the PLC and relevant question are equally important. The examiner must be skillful enough to increase sufficiently, but not too much, the significance of the control question. This requires the examiner to be able to "read" the examinee's stress level and know what level is appropriate. Additional problems associated with the PLC include: (a) the perception that they are intrusive, offensive, and/or embarrassing to some examinees (due to the nature of the associated psychological manipulations); (b) the occasional difficulty of developing a PLC which excludes all aspects of the relevant issue; and (c) possible difficulty associated with maintaining the psychological significance of PLC questions during repeated testing (as sometimes occurs with security screening examinations). DLCs eliminate most, if not all, of the problems associated with PLCs because: (a)) they require little or no psychological manipulation, (b) they are easy to explain and it is easy to justify their purpose, (c) the examinee can readily answer them and the veracity of the answer is not in question, (d) they are less sensitive to examiner competence (no psychological manipulation required), (e) the questions and the procedures for introducing them to the examinee are easily standardized, (f) they are not personally intrusive, so they are not offensive or embarrassing, and (g) they can be constructed so they do not overlap the relevant issues.

Based on the positive results of research with DLC questions (Abrams, 1993; Barland, 1981; Honts & Raskin, 1988; Horowitz, 1989; Raskin & Kircher, 1990; Reed, 1990; Reed, 1995), and the numerous advantages of the DLCs, it was decided to include, in the TES format, DLC questions rather than PLC questions. However, each study which used DLCs implemented the DLCs differently regarding: (a) when, during the examination, the acquaintance test was conducted; (b) the rationale for conducting the acquaintance test; (c) the rationale for including the DLC questions; (d) the construction of the DLC questions; and (e) how the DLC questions were pretested. With respect to these issues, the following decisions were made, and incorporated into the TES testing procedures. The acquaintance test (ACQT) is a standard known solution numbers test (Department of Defense Polygraph Institute, 1994a). During a TES examination, the ACQT is conducted prior to the review of the relevant and control questions. The rationale for this is when the ACQT is presented immediately after the explanations of the instrumentation and the physiology and prior to question review, it enhances the logical flow of the pretest. The examinee is told that the purpose of the ACQT is to: (a) demonstrate the examination process to the examinee, (b) allow the examinee to become accustomed to the components and procedures, (c) allow the examiner an opportunity to adjust the instrument, and (d) allow the examiner to make sure the examinee is physiologically capable of responding when lying. In

{84}

addition, the rationale for including the DLCs can be explained more easily and efficiently if the ACQT was conducted previously. The examiner refers to the ACQT during the explanation of the purpose of the DLC questions--to make sure the examinee continues to respond physiologically when lying, just as occurred on the ACQT. The examinee is told that if she or he does not continue to respond physiologically when lying, the test results will be inconclusive.

The DLCs are pretested in the following standardized manner. The examiner minimizes the behavior to be discussed ("this is something we all have done"). The examiner asks the DLC question being pretested ("Have you ever violated a minor traffic law?") and obtains a verbal commitment from the examinee that the examinee has, in fact, engaged in such behavior. If the examinee denies having engaged in the behavior, the examiner is required to utilize a different DLC. Examiners are not permitted to try to convince the examinee, with examples or suggestions, that the examinee had engaged in the behavior. Once a verbal commitment is obtained, the examinee is asked to think of a specific occasion during which the examinee engaged in the behavior. The examinee is instructed not to tell the examiner about the incident but only to think about it. Examiners are not allowed to suggest that the examinee think about the most recent or most significant incident, but only an incident. After the examiner has obtained a verbal commitment that the examinee has a specific incident in mind, the examiner repeats the question and instructs the examinee to think about the specific incident and then to lie by answering "no" to the question. Finally, the examinee is instructed that when the question is asked during the test, the examinee is to think about the incident and then lie by answering "no."

Examiners are instructed to be sure that the examinee actually thinks about the incident because cognitive processing results in increased physiological activity (Jennings, 1986a; Jennings, 1986b). The intent is to make a major component of the physiological response to a DLC be due to the cognitive processing which occurs when the examinee thinks about the specific incident. If the examinee engages in the cognitive processing, the DLCs should elicit strong physiological responses regardless of the emotional valance [sic] of the question. Any emotional response to the incident being imagined, would also increase the significance of the question and, therefore, the strength of the physiological response.

Question Sequence

Reed (1995) reported research on a new format in which the relevant and control questions were repeated within the same question sequence. However, unlike standard control question PDD tests, the question sequence was asked only once. The results indicated that, when using a mock screening paradigm to program examinees guilty, 81.5% of the PI examinees and 73.9% of the PG examinees were correctly identified. With minor revisions, the TES format was developed directly from this previous research. The sequence contains two different irrelevant (IR1 and IR2) questions, two different control (C1 and C2) questions, two different relevant (R1 and R2) questions and a sacrifice relevant (Sr) question. The question sequence is IR1 IR2 Sr C1 R1 R2 C2 R1 R2 C1 R1 R2 C2.

{85}

Pretest Phase

A standardized pretest was developed. The examinee is given a brief introduction to the procedures and asked to sign a form indicating his consent to be given the PDD examination. Next, the examinee's physical suitability to undergo the examination is assessed. Then the operation of the polygraph instrument is explained and a brief explanation of physiological responding is given. Next, the ACQT is introduced as an opportunity to demonstrate the procedures to the examinee and to assess the examinee's physiological suitability. After the ACQT is conducted and the results are presented to the examinee, the test questions are reviewed with the examinee. The examiner reviews, with the examinee, the three relevant questions (including the sacrifice relevant question), the two control questions, and the two irrelevant questions, in that order. The precise meaning and intent of each relevant question is explained so the examinee fully understands what behaviors the question includes. If the examinee has any problem understanding a relevant question, alternative relevant questions are available.

Testing Phase

During the administration of the TES, the examiner is not allowed to insert, into the question sequence, more than two irrelevant questions in succession. The inter-question interval (question onset to question onset) is 20 to 30 seconds, with an average of 25 seconds. If a physiological response occurs timely to a question but could have been caused by other factors (movement, orienting response to outside noise, etc.), it is referred to as an artifact and the question cannot be scored. If an artifact occurs during the asking of a TES relevant question, in order to provide three scorable physiological responses for that question, the examiner is required to conduct a "short test" with the following question sequence: IR1 IR2 Sr C1 R1 R2 C2.

Test Scoring

Tests are scored using the 7-point scoring criteria taught at the Department of Defense Polygraph Institute (DoDPI), in which the relevant strength of the physiological responses to a relevant question is compared to the relative strength of the physiological responses to a control question (DoDPI, 1994b). A positive score is assigned if the physiological responses to the control question are greater than those to the relevant questions. A negative score is assigned if the physiological responses to the relevant question are greater than those to the control questions. In most PDD security screening tests, each relevant question is compared to the stronger (relatively more responding) of the two control questions that bracket it. However, based on previous research (Reed, 1995), the first repetition of the first control question is not used when scoring a TES examination. The physiological responses to the first repetition of R1 and R2 are compared only to the physiological responses to the first repetition of the second control question. There are three scores for each relevant question, because each relevant question is repeated three times. The three scores are summed to provide one score for each relevant question. If a short test was conducted, only the relevant question to which the artifact occurred is scored. The other relevant question on the short test is not scored.

{86}

Decision Criteria

Multi-issue examinations, in which different relevant questions address separate issues, typically require a score of +3 or greater for each relevant question, for an NSR decision to be rendered. This decision was based on the belief that each question is related to a separate issue and therefore should be treated separately. However, research suggests that when an SR decision is rendered, the strongest physiological responses are not always to the question to which the examinee is being deceptive (Barland, 1981; Barland, et al., 1989; Correa & Adams, 1981; Raskin, Kircher, Honts, & Horowitz, 1988). These studies reported that the accuracies of the decisions for detecting deception decreased when responding to specific questions was assessed. Thus, an examinee who committed sabotage might respond physiologically to a question regarding the disclosure of classified information, but not to a question regarding sabotage. Therefore, decision criteria should be based on the test as a whole, not on responses to individual questions. Reed (1995), thus, adopted the following decision criteria for scoring TES examinations. An NSR decision is rendered if the scores for both questions are positive and they sum to +4 or greater. An SR decision is rendered if the score for either question is -3 or less or if the scores for both questions are -2 (total score of -4). If the scores do not meet either the NSR or the SR criteria, the decision is INC.

Standardization

Other aspects of the TES format also were standardized. First, the number of artifact-free questions required to calculate a score was standardized. With many PDD formats, the same decision criteria (-3 or less for a deceptive decision) are utilized to reach a decision, whether the score was calculated from two repetitions of the questions or from five repetitions (Department of Defense Polygraph Institute, 1994c; Honts & Raskin, 1988; Horowitz, 1989; Raskin, 1982). Examiners using the TES format are required to calculate scores from the physiological responses to three artifact free repetitions of each question. Second, the sequence in which the questions are asked was standardized. With many PDD formats, the sequence of questions is repeated multiple times (usually 3). With each repetition, the examiner might change the sequence in which the questions are asked. Federal examiners are allowed to modify the sequence of the questions based on their subjective opinions (Department of Defense Polygraph Institute, 1992), whereas Raskin and his colleagues (Honts & Raskin, 1988; Horowitz, 1989; Kircher & Raskin, 1988) systematically and objectively modify their question sequence. The sequence of TES questions is not repeated. Therefore, there is no option to modify the question sequence. Third, between successive repetitions of the question sequence, some examiners interact with the examinee by discussing the examinee's perception of the questions (Horowitz, 1989; Podlesney & Raskin, 1977; Raskin, 1982). This form of interaction is not standardized. The question sequence is not repeated with the TES format. Therefore there is no opportunity for between-test interaction. Finally, the dialogue for administering each of the individual components of the pretest was standardized by providing explicit outlines and examples. This includes: (a) the administration of the ACQT (as described above), (b) the rationale and presentation of the DLC questions (as described above), (c) the explanations regarding the polygraph instrument and the

{87}

physiological responses, and (d) the logical sequencing of the presentation of these components of the pretest.

Methods

Examinees

Three hundred and six examinees were recruited by a local employment agency under contract to the Department of Defense Polygraph Institute and were paid $30.00 for their participation. Individuals who met the following criteria were excluded from participation: (a) less than 19 or more than 60 years of age, (b) not in good health, (c) pregnant, or (d) did not have the equivalent of a high school diploma. One hundred thirty-nine male (M = 26.7, SD = 7.8) and 167 female (M = 28.2, SD = 8.8) examinees were scheduled for testing. There were 69 PI and 33 PG examinees assigned to the CSP-PLC group, 70 PI and 32 PG examinees assigned to the CSP-DLC group, and 67 PI and 35 PG examinees assigned to the TES group.

Examiners

Twelve certified examiners (11 males and 1 female) from the Office of the Secretary of the Air Force (OSAF) and 6 (5 males and 1 female) from the United States Army Intelligence and Security Command (USAINSCOM) conducted the examinations. The examiners had an average of 6.5 years of experience, with a range of 1.5 to 19 years. Selection of the examiners was determined by the agencies. Although examiner selection was not random (selection criteria generally involve availability and experience), the examiners were considered representative of the CSP examiner population. Examiners were assigned randomly to administer one of the three PDD formats, with the restriction that a format was utilized by two INSCOM and 4 OSAF examiners. Examiners received four hours of training to familiarize them with the format, pretest, scoring rules and control questions to be used. They conducted two practice examinations before conducting an examination for the project. Each examiner completed two 4-hour examinations (morning and afternoon) on seven days and one 4-hour examination on three days for a total of 17 examinations each. The examiners were not given any information regarding the base rates. They did not receive feedback regarding the accuracy of their decisions until the end of the study, and they were blind as to whether the examinee was PG.

Apparatus

The examiners used standard field polygraph instruments manufactured by either Lafayette or Stoelting. Standard respiratory, electrodermal, and cardiovascular responses were recorded. The electrodermal component was operated in the manual mode. The examinations were conducted individually in large (20 x 20) rooms in a building located on Ft. McClellan. The scenarios used to program examinees guilty were enacted in another building located approximately two miles from the examination building. There were no video recording devices nor one-way mirrors in the examination rooms. The examinations were audio taped.

{88}

Scenarios

The PG examinees enacted one of four mock scenarios. Each scenario was representative of one of the four relevant questions. The "espionage" scenario required one examinee to steal a classified document from an office and give the document to a second examinee. The second examinee received the document and placed it inside a vehicle located in the parking lot. Examinees who enacted the "sabotage" scenario, stole either a classified document or a classified computer disk. The examinee either put the document through a paper shredder or with a pair of scissors, cut the disk into pieces. An examinee who enacted the "unauthorized contact" scenario was asked to meet with a German agent who was sitting in a car in the parking lot. The agent requested that the examinee obtain some classified information to be given to the agent at a later time. During the enactment of the "unauthorized disclosure" scenario, the scenario setter was called out of his office midway through briefing the examinee regarding some classified computer information. A third person, who appeared to be fixing a window screen, entered the office and attempted to engage the examinee in conversation regarding what the examinee had been told. All PG examinees received $100.00 as payment for their participation in the "crime." In addition, all PG examinees wrote a statement indicating that "for the purposes of this project" they had engaged in espionage, sabotage, unauthorized contact, or unauthorized disclosure, depending on which scenario they enacted.

The author did not believe that fear or guilt could be instilled in the examinees. Therefore, the scenarios were not intended to convince the examinees that they had done anything wrong. However, because it is assumed that physiologic; [sic] responses occur during PDD examinations due to the significance of the questions, an attempt was made to make the relevant questions significant to the examinees through cognitive means as well as through the behavioral component (i.e., their actual participation in the scenario). Therefore, the scenario setters colluded with the examinees to "beat" the examiners.

Formats

Three separate PDD formats were employed. Currently, four different relevant questions, each of which is asked once within the sequence of questions, are included in the CSP-PLC format. The sequence of questions is repeated three times, with a short break between each repetition. Probable lie control questions are included in the question sequence. The CSP-DLC format is identical to the CSP-PLC format except the control questions are DLCs rather than PLCs. During this study, examiners who utilized the CSP formats (PLC or DLC) followed the guidelines established and taught at the DoDPI, (1992) with respect to the administration and scoring of the test, and the decision criteria applied to the scores.

Four relevant questions are also repeated three times during the TES examination format, but the sequence of questions differs substantially from that used during the CSP examination. Two of the four relevant questions are repeated three times each during the first series of questions (TES sub-test A) and the other two relevant questions are repeated three times each

{89}

during the second series of questions (TES sub-test B). The TES examination format requires two question series, rather than the three series used during a CSP examination. The examiner does not mention the sub-test B relevant questions until after sub-test A testing has been completed (including any post test discussions). The same or different DLCs are utilized on the two sub-tests. During this study, if the decision based on the results of the first sub-test was SR, then the second sub-test was not conducted, and if the decision for either sub-test was SR, the examinee was considered SR.

Questions. Barland et al. (1989) were criticized for using relevant questions containing the phrase "against the United States." Based on the significance/attention model proposed earlier, a test will be more accurate if the examinee's attention is focused on the actual issues being tested. Field examiners apply this principle to test question construction by including "time bars" or "situation bars" which are caveats to a question. The caveats are intended to direct the examinee's attention to a specific time or situation (e.g., prior to 1993 have you ever ... ). Therefore, in order to focus the PG examinee's attention on the mock scenario, the caveat "during this project" was added to each question. The same four relevant questions were asked with each format. The relevant questions were: (a) Sabotage, "Have you committed an act of sabotage during this project?" (b) Espionage, "Have you committed an act of espionage during this project?" (c) Disclosure, "During this project, have you provided classified material to an unauthorized person?["] and (d) Contact, "During this project, have you had unauthorized contact with a foreign national?" During the pretest phase of each examination, the precise meaning of each relevant question was explained to the examinee. In order to standardize the control questions, a list of ten PLC questions was developed for use with the CSP-PLC format and a list of ten DLC questions was developed for use with the CSP-DLC and TES formats. The specific DLCs were chosen because they: (a) concerned trivial behaviors (e.g., minor traffic violations), (b) were questions that were not likely to appear personally intrusive, and (c) were questions that did not overlap with the relevant issue. The same sacrifice relevant (Sr) question (Regarding the project security questions, do you intend to answer truthfully?) and the same list of four irrelevant questions were used by all examiners.

Procedures

During each session, eighteen examinees were given information regarding the research project, their participation, and the PDD examination. If they agreed to participate, they signed a form indicating their consent to participate in the research project. The examinees were taken in groups of two either to another building to be programmed guilty, or to the testing site. The PG examinees received information regarding the purpose of the scenario and signed an additional consent form indicating their agreement to participate in the scenario. After they enacted one of the scenarios, they were transported to the testing site. The transportation of the examinees to the testing site was timed so the examiners were not able to discern which examinees were PI and which were programmed guilty.

The examinations were conducted and each examiner provided a numeric score and a decision (SR, INC, NSR) based on the numeric score, for each test. The decisions were rendered

{90}

according to the decision criteria for the format utilized. An NSR decision concluded the examination (NSR to both sub-tests for the TES format). If the decision was INC, the examiner briefly discussed the questions with the examinee to determine if the examinee understood the questions. Then, the test was administered again. If, based on the data from the second test, the examiner's decision was INC, then the decision for that examinee was INC. When the examiner rendered an SR decision, the examiner confronted the examinee with the results.

Programmed guilty examinees were instructed to confess their guilt if they were confronted by the examiner, but not to reveal any details of their activities. Once a PG examinee confessed, the examination was concluded. However, a PI examinee who responded significantly to the relevant questions--a false positive (FP) decision--was questioned by the examiner to determine if there was a legitimate, "real-world" explanation for the examinee's physiological response to the relevant questions. The examiner recorded any information provided by the examinee and concluded the examination. Two examiners, otherwise not involved with the study, independently evaluated the information obtained from the examinees who received FP decisions. If the two examiners agreed that the information was significant enough to justify the examinee's physiological responding--a false positive decision with justification (FPWJ)--then that examinee's data was not included in the original data analyses. All of the examinees tested during a session were debriefed simultaneously. Examinees who participated in mock scenarios returned the $100.00.

Data Reduction and Analyses

The data from 277 examinees were included in the analyses. The remaining 29 examinees were excluded for the following reasons: Eight PG examinees confessed their guilt to the examiner prior to the examination; six examinees were not medically suitable to be tested; four examinations were incomplete; three examinees were DoDPI employees; and eight FPWJ examinees were excluded. The differences in the number of excluded examinees in each of the three groups were not significant.

If the scoring based on the physiological responding during an initial test resulted in an inconclusive decision and a second test was conducted, unless otherwise indicated, only the result of the second test was included in the analyses. The percentages of correct and incorrect decisions were calculated for each group. Simple proportionality tests were conducted to determine if differences between sets of percentages were significant. Unless otherwise stated, the significance criterion was set at .05 using a two-tailed probability distribution.

Three examiners who did not conduct any of the examinations each scored a different third of the test conducted with each of the three formats. The blind raters rendered decisions based solely on their scoring of the recorded physiological reactions, whereas the original examiner's [sic] scoring, and therefore their decisions, might have been influenced by their interactions with the examinees.

{91}

Results

The major finding was that when a conclusive decision was made (i.e., inconclusive decisions were excluded) the decisions of the examiners who administered the TES format were significantly more accurate (83.3%) identifying the PG examinees than were the decisions of the examiners who administered either the CSP-PLC (55.6%) or the CSP-DLC (58.6%) format. There were no significant differences among the accuracies of the examiners' decisions identifying the PI examinees.

Original Examiners' Decisions

Table 1

Number of Correct Decisions, Inconclusive (INC) Decisions, and Errors Made by the Examiners in Identifying Programmed Guilty and Programmed PI Examinees

	Decisions
Format	Correct	INC	Errors
Programmed Guilty Examinees
CSP-PLC	15_a	2	12
CSP-DLC	17_a	2	12
TES	25_b	0	5
Programmed PI examinees
CSP-PLC	61	1	3
CSP-DLC	59	4	3
TES	48	2	6

Note: Frequencies within columns with different subscripts are significantly different from each other at p < .05.

The number of correct decisions, inconclusive decisions, and errors made by the examiners are presented in Table 1. The accuracies of the decisions obtained using the three formats to identify PG examinees (inconclusive decisions excluded) were 83.3%, 55.6%, and 58.6% for the TES, CSP-PLC, and CSP-DLC formats respectively. The accuracies of the

{92}

decisions identifying PG examinees were significantly different, among the three formats. When inconclusive decisions were excluded from the analyses, the decisions made by the examiners who administered the TES format were significantly more accurate identifying the PG examinees than were the decisions made both by examiners who administered the CSP-PLC format (z = 2.28, p = .022) and by examiners who administered the CSP-DLC format (z = 2.09, p = .036). Similarly, when inconclusive decisions were included in the analyses, the decisions made by examiners who administered the TES format were significantly more accurate identifying the PG examinees than were the decisions made both by examiners who administered the CSP-PLC format (z = 2.60, p = .009) and by examiners who administered the CSP-DLC format (z = 2.40, p. = .016). The accuracies of the decisions identifying PG examinees for the two CSP formats, including and excluding inconclusive decisions, were not significantly different. The accuracies of the examiners' decisions identifying the PI examinees were not significantly different, among the three formats. The accuracies of the decisions of the three formats identifying PI examinees (inconclusive decisions excluded) were 88.9%, 95.3%, and 95.2% for the TES, CSP-PLC, and CSP-DLC formats, respectively.

Eight FPWJ examinees were excluded from the previous analyses because two independent judges agreed that the examinees' explanations regarding their cognitive activities during the test were sufficient to justify physiological responding to the relevant questions. When those eight examinees are included in the analyses of the accuracies of decisions identifying PI examinees, the percent accuracies are 81.4%, 95.3%, and 90.8% for the TES, CSP-PLC, and CSP-DLC formats, respectively. The accuracies of the examiners' decisions in identifying PI examinees were significantly different among the three formats, both when inconclusive decisions were excluded from the analyses, and when inconclusive decisions were included in the analyses. Decisions made by examiners who administered the CSP-PLC format were significantly more accurate identifying the PI examinees than were the decisions made by examiners who administered the TES format (excluding inconclusive, z = 2.43, p = .015; including inconclusives, z = 2.48, p = .013). The accuracies of the decisions identifying PI examinees were not significantly different for examiners who administered the CSP-DLC format compared to the decisions of examiners who administered either the TES or the CSP-PLC formats.

Blind Raters' Decisions

The number of correct decisions, inconclusive decisions and errors made by the blind raters are presented in Table 2. Sample sizes are smaller than in Table 1 because the blind raters scored INC some examinations which the original examiner scored conclusive (SR or NSR). Additional testing would have been required for the blind raters to reach a decision. Therefore, these examinations were not included in the blind raters' decisions. There was no statistically significant different [sic] among the number of examinations omitted from each format group.

The decisions of the blind raters were significantly more accurate in correctly identifying PG examinees when the data were collected with the TES format (81.0%), than were their decisions when the data were collected with either the CSP-PLC format (57.2%) or the CSP-DLC

{93}

format (42.9%). The differences among the accuracies of the blind raters' decisions identifying the PG examinees were significant both when inconclusive decisions were excluded from the analyses (TES vs. PLC, z = 2.01, p = .04; TES vs. DLC, z [=] 2.54, p = .011) and when inconclusive decisions were included in the analyses (TES vs. PLC, z = 1.97, p = .05; TES vs. DLC, z = 2.84, p = .004). The accuracies of the blind raters' decisions in identifying the PI examinees were not significantly different, among the three formats. The accuracies, based on the blind raters' decisions, of the three formats in identifying PI examinees (inconclusive decisions excluded) were 88.5%, 93.2%, and 94.4% for the TES, CSP-PLC, and CSP-DLC formats, respectively.

Table 2

Number of Correct Decisions, Inconclusive (INC) Decisions, and Errors Made by the Blind Raters in Identifying Programmed Guilty and Programmed PI Examinees

	Decisions
Format	Correct	INC	Errors
Programmed Guilty Examinees
CSP-PLC	12_a	1	11
CSP-DLC	9_a	4	12
TES	17_b	1	4
Programmed PI examinees
CSP-PLC	55	3	4
CSP-DLC	51	5	3
TES	46	0	6

Note: Frequencies within columns with different subscripts are significantly different from each other at p < .05.

Interrater Reliability

Pearson correlation coefficients were calculated between the numeric scores of the original examiners and the numeric scores of the blind raters, for each format, to determine interrater reliability. Within each format, a separate correlation coefficient was calculated using the data

{94}

from each of the four relevant questions. The correction coefficients are listed, by format and question, in Table 3. In addition, the reliability of the categorical decisions (SR, NSR, INC), based on the numerical scores of the original examiners and the blind raters, was high for each format. The percent agreements were 89% (Kappa = .76, t = 6.9), 89.5% (Kappa = .70, t = 7.7), and 89% (Kappa [=] .73, t = .67) for the TES, CSP-PLC, and CSP-DLC formats, respectively. All of the reliability measures were significant (p < .0001).

Inconclusive Decisions

The percentage of PI examinees who were retested due to INC decisions when the examiners administered the TES (either sub-test), CSP-PLC, and CSP-DLC formats were 21.4%, 23.1%, and 19.7% respectively. The percentage of PG examinees who were retested, due to INC decisions, when the examiners administered the TES (either sub-test), CSP-PLC, and CSP-DLC formats were 13.3%, 10.3%, and 29.0% respectively. The percentages of INC decisions were not significantly different among the three formats.

Table 3

Pearson Product Moment Correlation Coefficients Calculated Between the Original Examiners' Numerical Scores and the Blind Raters' Numerical Scores to Each Question

	Question
Format	Espionage	Sabotage	Disclosure	Contact
TES	.82*	.84*	.77*	.78*
CSP-PLC	.82*	.88*	.80*	.87*
CSP-DLC	.78*	.89*	.86*	.87*

*p < .0001

Confounding Variables

There were no significant differences in the distributions of PG examinees or PI examinees among the examination formats, as a function of either ethnic origin or gender. In addition, inferential statistical analyses calculated to determine if the number of PG examinees

{95}

participating in each scenario differed significantly among testing formats indicated the differences were not significant.

Physiological Response Scores to Specific Questions

To ensure that no question elicited stronger physiological responses from the examinees, than any other question, the PI examinees' numerical scores for each question were analyzed with a Quade non-parametric repeated measures analysis. The relative strengths of the PI examinees' physiological responses to the four questions were not significantly different from one another. To determine if the PG examinees' physiological responses were greatest to the question specific to the scenario they previously enacted, the PG examinees' numerical scores for each question were analyzed with a Quade non-parametric repeated measures analysis. The data from PG examinees who had enacted different scenarios were analyzed separately. Therefore, four separate analyses were performed, one for each scenario. The data from PG examinees who were administered the TES format were not included in these analyses, because many of those examinees were not administered the second sub-test. The relative strength of the physiological responses to the question specific to the scenario previously enacted was significantly stronger than the relative strengths of the physiological responses to the other three questions, only when the sabotage scenario had been enacted [Quade (3, 15) = 5.39, p < .01].

Table 4

The Number of Programmed Guilty Examinees, Administered a CSP Examination, with the Most Negative Score for Each Question

	Question
Scenario	Espionage	Sabotage	Disclosure	Contact
Espionage*	1	0	1	6
Sabotage**	0	6	0	0
Contact*	0	1	7	2
Disclosure	0	1	2	5

Note: Analyses tested the significance of the distribution within each scenario.
* p < .01. ** p < .001.

{96}

The data in Table 4 are frequency distributions in which the columns are the question to which the PG examinee received the most negative score (strongest physiological response), and the rows are the scenario in which the PG examinee participated. The data include only CSP (PLC and DLC) examinations and only examinations in which the strongest negative score was -3 or less (i.e., true positive results). To determine whether PG examinees' physiological responses were stronger to the question specific to the scenario they had enacted, rather than to any other question, the data in Table 4 were analyzed using the Chi-square. Four separate Chi-square statistics were calculated, one for each scenario. The distributions were significantly different from chance for the espionage [X²(3) = 11, p < .015], sabotage [X²(3) = 19, p < .001], and contact [X²(3) = 11.6, p < .01] statistics.

When the PG examinees had enacted either the sabotage or contact scenario their strongest physiological responses were usually to the question related to the scenario they had enacted. The same trend was true for the disclosure scenario but the effect was not significant. However, [sic- superfluous comma] the PG examinees had enacted the espionage scenario their physiological responses were usually stronger to the "disclosure" question than they were to the "espionage" question. Overall, 59% (75%, if the data from the espionage scenario are not included) of the examinees responded most strongly to the question specific to the scenario previously enacted. However, 8 of the 32 examinees received a score of -3 or less to at least two questions and for 3 of those examinees neither response was to the question specific to the scenario (espionage) previously enacted.

Development of new TES Scoring and Decision Criteria

The data from the current study were utilized to determine if different scoring and decision criteria would yield more accurate results and/or fewer inconclusive decisions. The sets of decision criteria are listed in Table 5. The data were reevaluated, using each set of decision criteria, once when the data were scored using the physiological responses following the first repetition of the first control question (1C1) for scoring purposes and again when the physiological response[s] to 1C1 were not used for scoring purposes. In general, decision criteria which were less stringent for assigning an NSR decision resulted in slightly higher accuracies in identifying PI examinees and slightly lower accuracies in identifying PG examinees. The opposite was true for decision criteria which were less stringent for assigning an SR decision. Similarly, using the physiological responses to 1C1 for scoring purposes, resulted in slightly higher accuracies in identifying PI examinees and slightly lower accuracies in identifying PG examinees. The opposite was true when the physiological response[s] to 1C1 were not used for scoring purposes. The accuracies of the decisions using the different decision criteria were not significantly different from the original decision accuracies.

Because each set of decision criteria increased the detection rate of one category of examinee (PI or PG) and decreased the detection rate of the other category of examinee, it was decided to keep the original decision criteria but to try to retain the benefits of scoring the data with or without the physiological responses to 1C1. Because including the physiological data from 1C1 increased the detection rate of PI examinees and excluding the physiological data from

{97}

1C1 increased the detection rate of PG examinees, the combined detection rate might be increased if both approaches were utilized.

The detection rate of the PI examinees was increased first. The initial scoring of the test used the physiological responses to 1C1. If the decision was conclusive (SR or NSR), then the decision was final. However, if a conclusive decision could not be made then the physiological responses to the first two relevant questions were reevaluated using only the physiological responses to the second control question (1C2) as a comparison. The rescoring results in the same or less positive scores, because the physiological responses to 1C1 typically are stronger than the physiological responses to 1C2. The new scoring method identified the PI examinees first, then, if a conclusive decision could not be made, the rescore identified more of the PG examinees.

The new scoring method did not result in significant differences in the accuracies of detection. However, it reduced the number of initial INC decisions. With the original scoring and decision criteria, 13 PI and 4 PG examinees received INC decisions. With the new scoring method, only 1 PG and 6 PI examinees received INC decisions. However, statistically, the decreases in the number of inconclusive decisions were not significant.

Table 5

Sets of Decision Criteria Used to Evaluate the Data

	Decision Criteria
SET	NSR Decision	SR Decision
Original	R1 + R2 >= +4 and R1 and R2 > 0	R1 or R2 <= -3
1	R1 + R2 >= +4 and R1 and R2 >= 0	R1 or R2 <= -3
2	R1 + R2 >= +3 and R1 and R2 > 0	R1 or R2 <= -3
3	R1 + R2 >= +3 and R1 and R2 >= 0	R1 or R2 <= -3
4	R1 + R2 >= +4 and R1 and R2 > 0	R1 or R2 <= -2
5	R1 + R2 >= +4 and R1 and R2 >= 0	R1 or R2 <= -2
6	R1 + R2 >= +3 and R1 and R2 > 0	R1 or R2 <= -2
7	R1 + R2 >= +3 and R1 and R2 >= 0	R1 or R2 <= -2

Note: Any test score which did not meet either the SR or the SR [sic] decision criteria resulted in an "inconclusive" decision.

{98}

Discussion

The decisions of the examiners who administered the TES format were significantly more accurate (83.3%) at identifying the PG examinees than were the decisions of the examiners who administered either the CSP-PLC (55.6%) or the CSP-DLC (58.6%) format. There were no significant differences among the accuracies of the examiners' decisions at identifying the PI examinees. The accuracies of the decisions obtained using the three formats to identify the PI examinees were 88.9%, 95.3%, and 95.2% for the TES, CSP-PLC, and CSP-DLC formats, respectively. The results were supported by the accuracies obtained from blind scoring of the examinations. The accuracies of the blind raters' decisions with the TES format were similar to the accuracies of the original examiners' decisions. When the data were collected with the TES format, the decisions of the blind examiners were significantly more accurate (81.0%) in correctly identifying PG examinees, than the decisions obtained when the data were collected with either the CSP-PLC format (57.2%) or the CSP-DLC format (42.9%). The accuracies of the blind raters' decisions identifying the PI examinees were not significantly different, among the three formats.

One possible explanation, consistent with the significance/attention model, for the significant differences among the decisions made using the formats to identify PG examinees is the amount of information to which the examinee was required to attend during the examination. Four relevant questions, each of which addresses a separate issue, are asked during the administration of a CSP test (PLC or DLC). Therefore, these examinees are given information and questioned regarding four separate issues. Perhaps, having so much information to process and focus on diffuses the examinee's attention, reducing the physiological responses, thereby reducing the accuracy of PG identification. Only two relevant questions are asked during each TES test, which reduces the amount of information presented to the examinee during a test. A proponent of the significance/attention model would predict higher detection rates when fewer issues are involved. This also could explain why detection accuracies typically are higher for specific issue criminal examinations (single issue examinations) than for security screening examinations (multiple issue examinations). However, there is little research assessing the affect [sic] of the number of issues addressed during an examination on the detection accuracy of the test.

Barland et al. (1989) assessed the differences in detection rates between single and multiple issue examinations. The authors reported that accuracies of the decisions obtained using single and multiple issue tests were not significantly different. However, the study did not test the issue adequately. The principle investigator (G. Barland, personal communication, September, 1993) stated that the examiners conducting the single issue examination were instructed to conduct the three examinations as separate examinations (i.e., pretest only the two relevant questions for the first exam, conduct the exam, and so on). However, a random sample of the single issue examinations administered during that study indicated that the time between the examinations was only 1 minute and 8 seconds longer than the time between the tests (charts) within an examination. One minute and 8 seconds is not sufficient time to pretest two relevant and 3 control questions. Therefore, it is possible that some of the examiners, contrary to

{99}

instructions, were pretesting all of the relevant questions prior to conducting any of the examinations. If all six relevant questions were discussed with the examinee prior to any testing, the examinee could have been thinking about all six relevant questions, even though only two relevant questions were asked on any one test. In addition, the number of PG examinees for whom INC decisions were rendered was significantly greater when the multiple issue examination (28.3%) was administered than when the single issue examinations (10.5%) were administered (test of proportionality, z = 1.96, p < .05). Raskin, et al. (1988) reviewed multiple issue field examinations conducted by a federal agency and found there was a negative relationship between the number of issues and test accuracy. They concluded that the agency should minimize the number of issues on a test to maximize decision accuracy. Studies should be conducted to compare the accuracy of decisions identifying PI and PG examinees when different numbers of relevant issues are addressed.

An additional complicating factor with multiple issues tests is that the examinee does not always respond, physiologically, to the question to which she or he is being deceptive. Whether a deceptive examinee's greatest physiological responses occur following the question to which the examinee is being deceptive has implications both for the number and type of relevant questions asked on a PDD test, and the criteria used to render a decision based on those responses. Barland (1981) reported that the accuracy of PG examinee identification decreased when responding to specific questions was assessed. He concluded that the correctly identified PG examinees were responding to questions other than the one to which they were lying. Also, Correa and Adams (1981) using an R/I format, reported better detection rates when the test was evaluated as a whole compared to the detection rates based on individual questions. Barland, et al. (1989), also concluded that the examinees were not always responding to the specific question to which they were deceptive. Raskin, et al. (1988) reported similar results with field examinations conducted by a federal law enforcement agency. They concluded that the tests did not detect deception at the level of the individual crime, which suggests that numerical scores associated with individual relevant issues may be a poor guide in choosing the issue for interrogation.

The data from the current study support the previous findings. Although there was a relationship between the scenario enacted and the specific question to which the examinee responded physiologically, the relationship was modest. In fact, 41% of the PG examinees did not have strong physiological responses to the question related to the scenario in which they participated. Therefore, decision criteria should not be based only on the physiological responses to individual questions but also on the relevant questions as a group. It should be noted that strong physiological responses to one relevant question do not indicate that it is the most significant question nor the only significant question for the examinee.

The results of this study indicate that a proportion of individuals have strong physiological responses to one or more relevant questions because the question is significant to the individual for reasons other than deceptive responses to the questions. Forty-five percent of the PI examinees who received FP decisions (9% of all PI examinees) following a TES examination were deemed to have concerns sufficient to expect strong physiological responding to one or

{100}

more relevant questions. In addition, field experience indicates that examinees often have concerns about a question, or the question brings something not directly related to the question to mind, which they, initially, do not discuss with the examiner. During the examination, the examinee may focus more attention on that question thereby producing physiological responses to the question (consistent with the significance/attention model). Because the TES format is more sensitive to identifying the PG examinees, it also will be more sensitive to identifying individuals with "outside" issues. This was apparent from the larger number (although not significant) of FPWJ examinees identified when the TES was administered compared to the number when either CSP was administered. Therefore, it is important to determine why an examinee responds physiologically to a relevant question. Future studies need to assess: (a) what proportion of PI examinees have concerns related to the relevant questions, (b) what proportion of those examinees actually respond to the relevant questions, and (c) what effect pretest disclosure of information has on the likelihood the examinee will respond to the questions (e.g., is the examinee less likely to respond to the relevant questions, if the personal concerns are discussed prior to the test).

It also is possible that the decision accuracies obtained with the TES format might have been attenuated by examiner unfamiliarity with the format. The examiners who administered the TES format were not familiar with the format, whereas the examiners who administered either of the CSP formats were familiar with the CSP format. There are many differences (including the pretest and the actual conduct of the examination) between the TES and the standard CSP format. Tape recordings of the early TES examinations, are interpreted as indicating that the examiners were not comfortable with what to say and often did not pretest the relevant questions sufficiently. If an examinee was thinking about something not specifically related to the relevant question and the examiner did not adequately deal with the issue, the examinee might have responded to the questions during the test. Once examiners become more familiar with the format, accuracy rates might increase.

The numbers of male and female examinees in the three conditions were not significantly different nor were the numbers of African-American and Caucasian examinees in the three conditions. Therefore, it is unlikely that the significant differences among the accuracy rates obtained using the three formats to identify PG examinees are not [sic] attributable to gender or racial differences. In addition, the number of PG examinees who participated in each scenario did not differ significantly among the testing formats. Therefore, the significant differences among the accuracy rates obtained using the three formats to identify PG examinees are not attributable to differences among the scenarios.

The significant differences among the accuracy rates obtained using the three formats to identify PG examinees do not appear to have been due to the different types of control questions. If the DLC questions had contributed significantly to the higher detection rate of the PG examinees who were administered the TES format, then the detection rate of the PG examinees tested with the CSP-DLC format should have been higher than the detection rate of the PG examinees tested with the CSP-PLC format. It was not. Similarly, the DLCs do not appear to

{101}

have affected the detection rates for PI examinees. Although the differences were not significant, PI examinees who were administered the TES format were identified less frequently than PI examinees who were administered either CSP format. In addition, when the FPWJ examinees were included in the analyses, significantly more PI examinees were correctly identified using the CSP-PLC format versus the TES format. However, in both sets of analyses, the number of PI examinees correctly identified using the CSP-DLC was not significantly different than the number identified when either the CSP-PLC or TES was administered. Therefore, any differences among the accuracy rates in detecting PI or PG examinees is not attributable to differences between the PLCs and the DLCs.

The syntax of the relevant questions is an issue that affects the generalizability of the results. A previous study (Barland, et al., 1989) was criticized for using relevant questions that included the phrase "against the United States," because the examinee did not commit a crime "against the United States." Proponents of a significance/attention model would argue that a test would be more accurate if the examinee's attention is focused on the actual issues being tested. In fact, field examiners have been applying this principle for years in the development of control questions and sometimes relevant questions (DoDPI, 1994d). Field examiners caveat their questions with "time bars" or "situation bars" to narrow the examinee's attention to a specific time or situation (e.g., prior to 1993 have you ever ...). In the current study, it was decided to caveat the relevant questions with the phrase "during this project" and to omit the phrase "against the United States" to ensure that the subjects' attention was focused on the test issues.

There is no reason to expect that the caveat added to the relevant questions would differentially affect the accuracies of decisions obtained using the three different formats. Therefore, it is unlikely that the differences in decision accuracies among the three formats are attributable to the caveat. However, because the relevant questions were designed to focus the examinee's attention on the project, it is possible that the accuracies obtained during the study may not be an accurate reflection of the accuracies that would occur in the field. Studies should be conducted to assess the impact of "time" or "situation" bars on PDD test accuracy. This is an important question because the practice is so popular in the field.

In conclusion, the new TES format may be a viable alternative to the CSP format currently utilized for security examinations. The TES format differs from the CSP formats in that: (a) the number of issues being tested in a question series is reduced; (b) a maximum of three question repetitions are used to calculate question scores; (c) between-test stimulation is eliminated; (d) the order of questions within the question sequence cannot be altered; (e) each relevant question is compared to the same control questions; (f) the pretest is brief, more standardized, and follows a logical sequence of information presentation; and (g) problems associated with PLC questions are reduced by using DLC questions. Some of these differences might account for the fact that in a laboratory mock situation, the decisions of examiners who administered the TES format were significantly more accurate at identifying PG examinees than were the decisions of examiners who administered either CSP format. If future testing with the TES format continues to demonstrate high accuracy rates for discriminating between PI and PG

{102}

examinees, the federal government should consider changing their security screening programs to utilize the TES as their primary PDD examination.

References

Abrams, S. (1993). The directed lie control question. Unpublished manuscript.

Ansley, N. (1990). The validity and reliability of polygraph decisions in real cases. Polygraph, 19, 169-181.

Barland, G.H. (1981) A validation and reliability study of counter intelligence screening tests. Unpublished manuscript. Security Support Battalion 902nd Military Intelligence Group. U.S. Army, Ft. Meade, Maryland.

Barland, G.H., Honts, C.R., & Barger, S.D. (1989). Studies of the accuracy of screening polygraph examinations. Unpublished report. Department of Defense Polygraph Institute, Ft. McClellan, AL.

Carver, C.S., Blaney, P.H., & Scheier, M.F. (1979). Focus of attention, chronic expectancy, and responses to a feared stimulus. Journal of Personality and Social Psychology, 37, 1186-1195.

Coles, M.G.H., & Duncan-Johnson, C.C. (1975). Cardiac activity and information processing: The effects of stimulus significance and detection and response requirements. Journal of Experimental Psychology, 1, 418-428.

Coles, M.G.H., & Strayer, D.L. (1985). The psychophysiology of the cardiac cycle time effect. In J.R. Orlebeke, G. Mulder, & L.J.P. van Doornen (Eds.). Psychophysiology of cardiovascular control: Models, methods, and data (pp. 517-534). New York: Plenum Press.

Correa, E.I., & Adams, H.E. (1981). The validity of the pre-employment polygraph examination and the effects of motivation. Polygraph, 20, 143-155.

Davis, R.C. (1961). Physiological responses as a means of evaluating information. In A.D. Biderman and H. Zimmer (Eds.). The manipulation of human behavior (pp. 142-168). New York: Wiley.

Dawson, M.E. (1980). Physiological detection of deception: Measurement of responses to questions and answers during countermeasures maneuvers. Psychophysiology, 17, 8-17.

{103}

Dawson, M.E., & Schell, A.M. (1982). Electrodermal responses to attended and nonattended significant stimuli during dichotic listening. Journal of Experimental Psychology: Human Perception and Performance, 8, 315-324.

Dawson, M.E., Schell, A.M., Beers, J.R., & Kelly, A. (1982). Allocation of cognitive processing capacity during human autonomic classical conditioning. Journal of Experimental Psychology: General, 111, 273-295.

Dawson, M.E., Schell, A.M., & Filion, D.L. (1990). The electrodermal system. In J.T. Cacioppo and L.G. Tassinary (Eds.). Principles of psychophysiology: Physical, social, and inferential elements (pp. 295-324). New York: Cambridge University Press.

Department of Defense Polygraph Institute (1992). Counter Intelligence Scope Polygraph. Ft. McClellan, AL.

Department of Defense Polygraph Institute (1994a). Stimulation test. Ft. McClellan, AL.

Department of Defense Polygraph Institute (1994b). Test data analysis. Ft. McClellan, AL.

Department of Defense Polygraph Institute (1994c). Modified General Question Technique (MGQT). Ft. McClellan, AL.

Department of Defense Polygraph Institute (1994d). Test question construction. Ft. McClellan, AL.

Easterbrook, J.A. (1959). The effect of emotion on cue utilization and the organization of behavior. Psychological Review, 66, 183-201.

Furedy, J.J. (1986). Lie detection as psychophysiological differentiation: Some fine lines. In M.G.H. Coles, E[.] Donchin, & S.W. Porges (Eds.) Psychophysiology: Systems, processes, and applications. New York: Guilford Press.

Germana, J., & Chernault, G. (1968). Patterns of galvanic skin responses to signal and non-signal stimuli. Psychophysiology, 5, 284-292.

Glass, A.L., Holyoak, K.J., & Santa, J.L. (1979). Cognition. Menlo Park, CA Addison-Wesley Publishing Co.

Gustafson, L.A., & Orne, M.T. (1965). The effects of verbal responses on the laboratory detection of deception. Psychophysiology, 2, 10-13.

Honts, C.R. (1989). The relative validity of two CSP question series. Unpublished manuscript, Department of Defense Polygraph Institute, Ft. McClellan, AL.

{104}

Honts, C.R., & Raskin, D.C. (1988). A field study of the validity of the directed lie control question. Journal of Police Science and Administration, 16, 56-61.

Horowitz, S.W. (1989). The role of control questions in physiological detection of deception. Unpublished doctoral dissertation. University of Utah, Salt Lake City, Utah.

Iacono, W.G., & Patrick, C.J. (1988). Polygraph techniques. In R. Rogers (Ed.) Clinical assessment of malingering and deception (pp. 205-233). New York: Guilford Press.

Jennings, J.R. (1986a). Bodily changes during attending. In M.G.H. Coles, E. Donchin, & S.W. Porges (Eds.) Psychophysiology: Systems, processes, and applications (pp. 268-289). New York: Guilford Press.

Jennings, J.R. (1986b). Memory, thought, and bodily response. In M.G.H. Coles, E. Donchin, & S.W. Porges (Eds.) Psychophysiology: Systems, processes, and applications (pp. 290-308). New York: Guilford Press.

Kahneman, D. (1973). Attention and effort. Englewood Cliffs, NJ: Prentice-Hall.

Kilpatrick, D. (1972). Differential responsiveness of two electrodermal indices to psychological stress and performance of a complex cognitive task. Psychophysiology, 9, 218-226.

Kimmel, H., Olst, E.H., van, & Orlebeke, J.F. (1979). The orienting reflex in humans. Hillsdale, NJ: Erlebaum.

Kircher, J.C., Horowitz, S.W., & Raskin, D.C. (1988). Meta-analyses of mock crime studies of the control question polygraph technique. Law and Human Behavior, 12, 79-90.

Kircher, J.C., & Raskin, D.C. (1988). Human versus computerized evaluations of polygraph data in a laboratory setting. Journal of Applied Psychology, 73, 291-302.

Kugelmass, S., Lieblich, I., & Bergman, Z. (1967). The role of "lying" in psychophysiological detection. Psychophysiology, 3, 312-315.

Maltman, I., Kantor, W., & Langdon, B. (1966). Immediate and delayed retention, arousal, and the orienting and defensive reflexes. Psychonomic Science, 6, 445-446.

McCauley, C., & Forman, R.F. (1988). A review of the office of technology assessment report on polygraph validity. Basic and Applied Social Psychology, 9, 73-84.

McLean, P.D. (1969). Induced arousal and time of recall as determinants of paired associate recall. British Journal of Psychology, 60, 57-62.

{105}

Nikula, R. (1991). Psychological correlates of nonspecific skin conductance responses. Psychophysiology, 28, 86-90.

Obrist, P.A. (1981). Cardiovascular psychophysiology: A perspective. New York: Plenum Press.

O'Gorman, J.G. (1977). Individual differences in habituation of human physiological responses: A review of theory, method and findings in the study of personality correlates of non-clinical populations. Biological Psychology, 5, 257-318.

Ohman, A. (1979). The orienting response, attention and learning: An information processing perspective. In H.D. Kimmel, E.H. Van Olst, & J.F. Orlebeke (Eds.) The orienting reflex in humans (pp. 443-471). Hillsdale, NJ: Erlbaum.

Olst, E.H., van, Hemstra, M.L., & ten Kortenaar, T. (1979). Stimulus significance and the orienting reaction. In H.D. Kimmel, E.H. van Olst, & J.F. Orlebeke (Eds.) The orienting reflex in humans (pp. 521-547). Hillsdale, NJ: Erlbaum.

Orne, M.T., Thackray, R.I., & Paskewitz, D.A. (1972). On the detection of deception: A method for the study of physiological effects of psychological stimuli. In N.S. Greenfield and R.A. Sternbach (Eds.) Handbook of psychophysiology (pp. 743-785). New York: Holt, Rinehart, & Winston.

Podlesny, J.A., & Raskin, D.C. (1977). Physiological measures and the detection of deception. Psychological Bulletin, 84, 782-799.

Raskin, D.C. (1982). The scientific basis of polygraph techniques and their uses in the judicial system. In A. Trankell (Ed.) Reconstructing the past: The role of psychologists in criminal trials (pp. 317-371). Stockholm: Norstedt and Soners.

Raskin, D.C. (1989). Polygraph techniques for the detection of deception. In D.C. Raskin (Ed.) Psychological methods in criminal investigation and evidence (pp. 247-296). New York: Springer Publishing Co.

Raskin, D.C. & Kircher, J.C. (1990). Development of a computerized polygraph system and physiological measures for the detection of deception and countermeasures: A pilot study. (Preliminary Report Contract 8-L655300-000). Salt Lake City, UT: University of Utah.

Raskin, D.C., Kircher, J.C., Honts, C.R., & Horowitz, S.W. (1988). A study of the validity of polygraph examinations in criminal investigation. (Final report Grant No. 85-IJ-CX-0040). Washington, D.C.: National Institute of Justice.

{106}

Reed, S.D. (1990). Counter narcotics polygraph: NCP 2. Unpublished manuscript, Department of Defense Polygraph Institute, Ft. McClellan, AL.

Reed, S.D. (1995). Psychophysiological detection of deception--Single test interview. Paper presented at the meeting of the American Academy of Forensic Science, Seattle, WA.

Sampson, J.R. (1969). Further study of encoding and arousal factors in free recall of verbal and visual material. Psychonomic Science, 16, 221-222.

Sokolov, E.N. (1963). Perception and the conditioned reflex. New York: McMillan.

Thackray, R.I., & Orne, M.T. (1968). Effects of the type of stimulus employed and the level of subject awareness on the detection of deception. Journal of Applied Psychology, 52, 234-2329 [sic].

Waid, W.M., Orne, E.C., and Orne, M.T. (1981). Selective memory for social information, alertness, and physiological arousal in the detection of deception. Journal of Applied Psychology, 66, 224-232.

Waid, W.M., Orne, M.T., & Wilson, S.K. (1979). Effects of level of socialization on electrodermal detection of deception. Science, 212, 71-73.

Acknowledgements

Sheila D. Reed, Ph.D., served as principle investigator throughout planning, data collection, and drafting of this manuscript. Final editing was completed by members of the Department of Defense Polygraph Institute Research Division. We would like to thank the Office of the Secretary of the Air Force (OSAF) and the United States Army Intelligence and Security Command (USAINSCOM) polygraph programs for their support and their input into the design of the study. Specifically, we would like to thank Bruce Thompson and Jim Morrison for their valuable contributions, including the time they spent monitoring the examinations and helping with the scenarios. A special thanks is due to all of the people who made the study possible: The OSAF and INSCOM examiners (Edith Andreasen, Douglas Blake, Tawainia Barrera, Ray Brafford, Dave Cameron, David Case, Greg Eggleston, Richard Giraud, Ronald Herring, Otto Jackson, Bryan Ladeaux, Russ Nichols, Michael Rhodes, Donald Schupp, Ed Stoval, James Vaughan, Michael Walker, and Harrison Wright), the scenario setters (Earl Taylor, Sam Braddock, and Gordon Barland), the research assistants (Jeff St. Cyr, Linda Knickerbocker, and Joan Harrison), and the DoDPI support staff (Frank Ragan and Randy Reynolds). Appreciation is also extended to the Anniston Employment and Temporary Services, Inc. and Judy Manners for the high quality examinees they provided. Gratitude is also expressed to Andrew Dollins, John Schwartz, and Don Weinstein for diligently reading and editing earlier drafts of this manuscript.
Funds for this research were provided by the DoDPI under project DoDPI93-P-0044. The views expressed in this article do not reflect the official policy or position of the Department of Defense or the U.S. Government.