Bayesian Network Modeling of Offender Behavior for Criminal Profiling
Master's Defense, Kelli Crews Baumgartner
Laboratory for Intelligent Systems & Control, Mechanical Engineering, Duke University
10/18/2007
Outline
• Background to Criminal Profiling
• Motivation
• Bayesian Networks (BNs)
• Methodology
• Results
• Future Works
• Acknowledgments
The Case of the Mad Bomber

Profile (predicted):
• Male
• Eastern European
• Late 40's to early 50's
• Lived in Connecticut with a maiden aunt or sister
• Paranoia
• Wearing a double-breasted suit, buttoned

George Metesky (actual):
• Slavic descent
• 54 years old
• Lived in Connecticut with two maiden sisters
• Acute paranoia
• When apprehended, was wearing a double-breasted suit, buttoned
Background
• Goals of criminal profiling:
  – Concentrate criminal investigations
  – Interrogation strategies
• Collaboration with law enforcement
Previous Research
• Early 1980s: the FBI developed the organized/disorganized dichotomy
• 1985-1994: Dr. David Canter expanded the FBI model: interpersonal coherence, time and place, criminal characteristics, criminal behavior, and forensic awareness
• 1999-present: G. Salfati and D. Canter explore the expressive/instrumental dichotomy
Expressive vs. Instrumental
• Uses multidimensional scaling (MDS)
  – Non-metric multidimensional scaling procedure
  – Plots association coefficients for each crime scene behavior
  – Similarly themed actions co-occur in the same region of the plot, and vice versa
• Single offender-single victim homicide cases from the British police (82 and 247 cases)
Results of MDS Research
• 62% of the cases exhibited a majority of the crime scene characteristics in a single theme
• 74% of all offenders could also be classified as either expressive or instrumental
• 55% of the cases exhibited the same theme in both their crime scene actions and in the offender background characteristics
• High-frequency crime scene behaviors can be eliminated
Motivation
• Develop a network model linking the profile of the offender to his/her decisions and behaviors at the crime scene
• Determine correlations between input and output variables based on data from real cases
  – Input variables: Crime Scene Analysis (CSA), victimology assessment, and medical examination
  – Output variables (offender profile): sex of offender, prior convictions, relationship with victim
  – Training data: solved cases (inputs/outputs known)
• Apply the software to produce the offender profile for unsolved cases, given the input variables
Criminal Profiling Software Development
• Use expert knowledge to initialize the BN variables
• Train the network model (NM) with solved cases
• Test the NM with validation cases by inputting the crime scene variables and comparing the predicted offender variables to the observed offender variables

[Diagram: Data + Initial Model → Trained Model; Inputs → Trained Model → Outputs]
Belief Networks
• Bayesian Networks, B = (S, Θ)
  – Probabilistic network: here X1 is the parent of X2 and X3
  – Notation: x_{i,j} = state j of variable X_i

  P(X1):              P(x_{1,1}) = 0.4,  P(x_{1,2}) = 0.6
  P(X2 | X1=x_{1,1}): P(x_{2,1}) = 0.8,  P(x_{2,2}) = 0.2
  P(X2 | X1=x_{1,2}): P(x_{2,1}) = 0.3,  P(x_{2,2}) = 0.7
  P(X3 | X1=x_{1,1}): P(x_{3,1}) = 0.1,  P(x_{3,2}) = 0.9
  P(X3 | X1=x_{1,2}): P(x_{3,1}) = 0.5,  P(x_{3,2}) = 0.5
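The CPTs above map directly onto a simple data structure. A minimal sketch in Python (the thesis used Matlab; the names and layout here are illustrative, not the thesis code) stores the structure S as parent lists and the parameters Θ as CPTs, then evaluates a joint probability by the chain rule of the network:

```python
# Minimal sketch of the three-node network above (illustrative only;
# the original work used Matlab). States are 0-indexed, so state
# x_{i,1} -> index 0 and x_{i,2} -> index 1.

# Structure S: each node lists its parents (X1 is the root).
parents = {"X1": [], "X2": ["X1"], "X3": ["X1"]}

# Parameters Theta: CPTs keyed by the parent instantiation.
cpt = {
    "X1": {(): [0.4, 0.6]},            # P(X1)
    "X2": {(0,): [0.8, 0.2],           # P(X2 | X1 = x_{1,1})
           (1,): [0.3, 0.7]},          # P(X2 | X1 = x_{1,2})
    "X3": {(0,): [0.1, 0.9],           # P(X3 | X1 = x_{1,1})
           (1,): [0.5, 0.5]},          # P(X3 | X1 = x_{1,2})
}

def joint(assignment):
    """P(X1, X2, X3) factored along the network structure."""
    p = 1.0
    for node, pa in parents.items():
        key = tuple(assignment[q] for q in pa)
        p *= cpt[node][key][assignment[node]]
    return p

# Example: P(X1=x_{1,1}, X2=x_{2,1}, X3=x_{3,1}) = 0.4 * 0.8 * 0.1
print(joint({"X1": 0, "X2": 0, "X3": 0}))  # ~0.032
```

Summing `joint` over all eight assignments returns 1, which is a quick sanity check that the tables form a valid distribution.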
Example of Inference
• X2 and X3 are observed; X1 is unknown, to be inferred
• Possible states:
  X1 = {x_{1,1}, ..., x_{1,r1}}
  X2 = {x_{2,1}, ..., x_{2,r2}}
  X3 = {x_{3,1}, ..., x_{3,r3}}

[Diagram: X1 with children X2 and X3; CPTs P(X1), P(X2 | X1), P(X3 | X1)]
Bayes Theorem
• Bayes rule* to infer X1 when X2 = x_{2,r2} and X3 = x_{3,r3}:

\[ P(X_1 \mid x_{2,r_2}, x_{3,r_3}) = \frac{P(x_{2,r_2}, x_{3,r_3} \mid X_1)\, P(X_1)}{P(x_{2,r_2}, x_{3,r_3})} \]

• Marginalization over the unobserved variable:

\[ P(x_{2,r_2}, x_{3,r_3}) = \sum_{k=1}^{r_1} P(X_1 = x_{1,k}) \prod_{j=2}^{3} P(x_{j,r_j} \mid x_{1,k}) \]

• The posterior probability distribution sums to one:

\[ \sum_{h=1}^{r_1} P(X_1 = x_{1,h} \mid x_{2,r_2}, x_{3,r_3}) = 1 \]

* F. Jensen, Bayesian Networks and Decision Graphs, 2001
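Plugging the illustrative CPT values from the belief-network slide into Bayes rule gives a concrete inference. A Python sketch (illustrative, not the thesis code):

```python
# Posterior P(X1 | X2, X3) by Bayes rule, using the illustrative CPT
# values from the belief-network slide. States are 0-indexed.

p_x1 = [0.4, 0.6]                          # P(X1)
p_x2_given_x1 = [[0.8, 0.2], [0.3, 0.7]]   # rows indexed by the state of X1
p_x3_given_x1 = [[0.1, 0.9], [0.5, 0.5]]

def posterior_x1(x2, x3):
    """P(X1 = k | X2 = x2, X3 = x3) for k = 0, 1."""
    # Numerator of Bayes rule: P(x2, x3 | X1=k) P(X1=k)
    unnorm = [p_x1[k] * p_x2_given_x1[k][x2] * p_x3_given_x1[k][x3]
              for k in range(2)]
    # Marginalization: P(x2, x3) = sum_k P(X1=k) prod_j P(x_j | X1=k)
    z = sum(unnorm)
    return [u / z for u in unnorm]

post = posterior_x1(0, 0)   # observe X2 = x_{2,1}, X3 = x_{3,1}
print(post)                 # ~[0.262, 0.738]
```

By hand: the unnormalized terms are 0.4·0.8·0.1 = 0.032 and 0.6·0.3·0.5 = 0.09, so the posterior is [0.032, 0.09]/0.122.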
BN Training
• Set of training data, T, to learn the BN, B = (S, Θ)
• Use a structural learning algorithm to find S_optimal
  – A search method is needed to decrease the search space
  – A scoring metric ranks candidate structures
• Maximum Likelihood Estimation (MLE) algorithm to learn the CPTs, P(Θ_optimal | S_optimal, T)
Joint Probability Scoring Metric
• Relative measure of the level of compatibility of S^h given the training data T: P(S^h | T)
• P(S^h | T) ∝ P(S^h, T):

\[ P(S^h, T) = P(S^h \mid T)\, P(T) \;\Rightarrow\; \frac{P(S_i^h \mid T)}{P(S_j^h \mid T)} = \frac{P(S_i^h, T)/P(T)}{P(S_j^h, T)/P(T)} = \frac{P(S_i^h, T)}{P(S_j^h, T)} \]

\[ \therefore\; P(S_i^h \mid T) < P(S_j^h \mid T) \iff P(S_i^h, T) < P(S_j^h, T) \]
Scoring Metric Assumptions
1. All variables are discrete
2. All structures are equally likely
3. All variables are known, with no missing values
4. All cases in T occur independently given a BN model
5. No prior knowledge of the numerical parameters to assign to B^h with structure S^h before observing T

\[ B^h = (S^h, \Theta^h) \in \mathcal{B} \]

\[ P(S^h, T) = \int_{\Theta} f(T \mid S^h, \Theta^h)\, f(\Theta^h \mid S^h)\, P(S^h)\, d\Theta^h \;\Rightarrow\; P(S^h, T) = P(S^h) \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{(r_i - 1)!}{(N_{ij} + r_i - 1)!} \prod_{k=1}^{r_i} N_{ijk}! \]

G.F. Cooper et al., Machine Learning, 1992.
Variable Definition for Scoring Metric

\[ P(S^h, T) = P(S^h) \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{(r_i - 1)!}{(N_{ij} + r_i - 1)!} \prod_{k=1}^{r_i} N_{ijk}! \]

• n: number of model variables
• q_i: number of unique instantiations of π_i, where π_i = pa(X_i)
• r_i: number of possible states for X_i
• N_ijk: number of cases in T in which X_i = x_{i,k} and π_i is instantiated as w_ij, where k = 1, ..., r_i
• N_ij = \sum_{k=1}^{r_i} N_{ijk}

G.F. Cooper et al., Machine Learning, 1992.
K2 Algorithm
• Maximizes the scoring metric
• Requires a node ordering: X1 → X2 → X3
• Limits the number of parents per node

\[ P(S_{optimal}, T) = \max_{S^h} \left[ P(S^h, T) \right] \;\Rightarrow \]

\[ \max_{B^h} \left\{ P(S^h) \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{(r_i - 1)!}{(N_{ij} + r_i - 1)!} \prod_{k=1}^{r_i} N_{ijk}! \right\} \xrightarrow{K2} \max_{B^h} \left\{ g = \prod_{j=1}^{q_i} \frac{(r_i - 1)!}{(N_{ij} + r_i - 1)!} \prod_{k=1}^{r_i} N_{ijk}! \right\} \]

G.F. Cooper et al., Machine Learning, 1992.
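The per-node score g can be computed directly from counts. A Python sketch of the Cooper-Herskovits score, computed in log space to avoid factorial overflow (the case/data layout is an assumption for illustration, not the thesis code):

```python
# Per-node K2 score g (Cooper & Herskovits, 1992) in log space.
# lgamma(m) = log((m-1)!), so log((r-1)!) = lgamma(r), etc.
from math import lgamma

def log_g(node, parent_set, states, cases):
    """log g = sum_j [log(r-1)! - log(N_ij + r - 1)! + sum_k log N_ijk!]"""
    r = len(states[node])
    # Count N_ijk: cases with parent instantiation j and child state k
    counts = {}
    for case in cases:
        j = tuple(case[p] for p in parent_set)
        counts.setdefault(j, [0] * r)[case[node]] += 1
    score = 0.0
    for nijk in counts.values():
        nij = sum(nijk)
        score += lgamma(r) - lgamma(nij + r)       # log (r-1)! - log (N_ij + r - 1)!
        score += sum(lgamma(n + 1) for n in nijk)  # sum_k log N_ijk!
    return score

# Toy data: X2 perfectly tracks X1, so X1 should score well as a parent.
states = {"X1": [0, 1], "X2": [0, 1]}
cases = [{"X1": 0, "X2": 0}] * 6 + [{"X1": 1, "X2": 1}] * 6
print(log_g("X2", ["X1"], states, cases) > log_g("X2", [], states, cases))  # True
```

K2's greedy loop would repeatedly add to each node the predecessor (in the fixed ordering) that most increases this score, stopping when no addition helps or the parent limit is reached.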
K2' Algorithm

[Diagrams: an example network with an added input node X4; the left graph carries CPTs P(X4), P(X1 | X4), P(X2 | X1, X4), P(X3 | X1); the right graph shows the same nodes X4, X1, X2, X3 with the connections between input nodes inhibited]
K2’ Algorithm
• Inhibit nodal connections between input nodes
  – d-separation: X3 ⊥ X2, X4 iff X1 is known
  – X2 and X3 are not affected by the X4 → X1 relationship if the parents are known
• Everything else is the same as K2
The Learned Structure
• The initial structure is an empty graph
• The final structure is learned from T

[Diagram: output nodes O1, O2, ..., Om above input nodes I1, I2, ..., Ik; arcs among the input nodes occur only for K2]
Parameter Learning
• Maximum Likelihood Estimation (MLE) is used for parameter learning because there are no missing values (the EM algorithm would be needed otherwise)
• MLE determines the parameters that maximize the probability (likelihood) of T:

\[ f(T \mid \Theta^h, S^h) = \prod_{i=1}^{t} f(C_i \mid \Theta^h, S^h) = L(\Theta^h \mid T, S^h) \]

\[ \Lambda = \ln L = \sum_{i=1}^{t} \ln\left[ L(\Theta^h \mid T, S^h) \right] \;\Rightarrow\; \frac{\partial \Lambda}{\partial \Theta^h} = 0 \]
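For complete discrete data, setting ∂Λ/∂Θ = 0 yields the closed-form MLE θ_ijk = N_ijk / N_ij, i.e., relative frequencies of each state under each parent instantiation. A Python sketch (variable names and data layout are illustrative, not the thesis code):

```python
# Closed-form MLE for a discrete CPT with complete data:
# theta_ijk = N_ijk / N_ij (relative frequencies per parent instantiation).
from collections import defaultdict

def mle_cpt(node, parent_set, n_states, cases):
    """Estimate P(node | parents) by relative frequency over the cases."""
    counts = defaultdict(lambda: [0] * n_states)
    for case in cases:
        j = tuple(case[p] for p in parent_set)   # parent instantiation
        counts[j][case[node]] += 1               # N_ijk
    return {j: [n / sum(nijk) for n in nijk] for j, nijk in counts.items()}

# Toy cases: gender G given place of aggression PA (hypothetical values).
cases = [
    {"PA": 1, "G": 0}, {"PA": 1, "G": 0}, {"PA": 1, "G": 1},
    {"PA": 0, "G": 1}, {"PA": 0, "G": 1}, {"PA": 0, "G": 0},
]
print(mle_cpt("G", ["PA"], 2, cases))
# PA=1 -> [~0.667, ~0.333]; PA=0 -> [~0.333, ~0.667]
```

With missing values these counts are unavailable, which is why the EM algorithm would be substituted in that case.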
Modeling Example
• Train a network model for a simple problem with two inputs and two outputs:
  – Inputs, CSA:
    (1) Place of aggression characteristics
    (2) Amount of disorder provoked by a fight/struggle
  – Outputs, criminal profile:
    (1) Gender of the offender
    (2) Presence of sexual relations between victim and offender
• Train the model with simulated cases using Matlab
• Use evidence of the inputs to infer the outputs in new cases
Example: Variable Definition
Inputs from the Crime Scene Analysis (CSA) checklist (evidence):
• Characteristics of the Place of Aggression, node PA:
  (1) Not crowded external place (remote)
  (2) Semi-crowded external place (semi-remote)
  (3) Crowded external place
  (4) Inner place (room, building, office)
• DiSorder provoked by fight/struggle, node DS:
  (1) In the room/area where the corpse is found
  (2) Over all of the area/room/study/office/store
  (3) In the vicinity of the area/room/study/office/store
  (4) No disorder provoked
Outputs from the network model, criminal profile:
• Gender of offender, node G:
  (1) Male
  (2) Female
• Presence of Sexual Relations between victim and offender, node SR:
  (1) Yes
  (2) No
Network Model
• One solved case: PA = 2 (semi-remote), DS = 4 (no disorder), G = 1 (male), SR = 2 (no)
• Network variables: PA (Place of Aggression), DS (DiSorder provoked by fight), G (Gender of offender), SR (Sexual Relations between victim and offender)
Results: Percent Binary Error for Validation Set
• Error metric: if x = x*, then error = 0; if x ≠ x*, then error = 1

[Figure: percentage of each output node inferred incorrectly vs. number of training cases]
Full CP Model
• 247 cases:
  – 200 training cases (T)
  – 47 validation cases (V)
• 57 total variables:
  – 36 CS variables (inputs)
  – 21 CP variables (outputs)
Internal Stability
• Internal stability refers to the consistency of predictions
Overall Performance Summary
• Overall Predictive Accuracy (OPA):

\[ OPA(\%) = \frac{K_{C,CL(\geq 50\%)}}{K_t} \times 100 \]

  – K_t: total predictions
  – K_{C,CL(≥50%)}: total correct predictions with CL ≥ 50%

  Algorithm   Accuracy (%)   Correct predictions (number of nodes)
  K2          64.1           633
  K2'         79             780
Confidence Level of Predictions
• Confidence Level Accuracy (CLA):

\[ CLA(\%) = \frac{K_{C,CL}}{K_{CL}} \times 100 \]

  Algorithm: K2 || K2'
  CL                K_CL         K_C,CL       CLA (%)        Δ(K_C,CL)
  0.5 ≤ CL < 0.7    225 || 262   140 || 162   62.2 || 61.8   22
  0.7 ≤ CL < 0.9    405 || 470   334 || 386   82.5 || 82.1   52
  0.9 ≤ CL          168 || 255   159 || 232   94.6 || 91     73
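The OPA and CLA bookkeeping above can be sketched as follows, assuming each prediction is reduced to a (confidence, correct) pair (a simplification for illustration, not the thesis code):

```python
# OPA / CLA bookkeeping sketch: predictions are (confidence, correct) pairs.

def opa(predictions):
    """Overall Predictive Accuracy: correct predictions with CL >= 0.5, over all."""
    correct = sum(1 for cl, ok in predictions if cl >= 0.5 and ok)
    return 100.0 * correct / len(predictions)

def cla(predictions, lo, hi=1.01):
    """Confidence Level Accuracy within one CL bin [lo, hi)."""
    in_bin = [(cl, ok) for cl, ok in predictions if lo <= cl < hi]
    k_cl = len(in_bin)                         # K_CL
    k_c = sum(1 for _, ok in in_bin if ok)     # K_C,CL
    return 100.0 * k_c / k_cl if k_cl else 0.0

# Hypothetical predictions, one per output node.
preds = [(0.6, True), (0.6, False), (0.8, True), (0.8, True), (0.95, True)]
print(opa(preds))            # 80.0
print(cla(preds, 0.5, 0.7))  # 50.0
print(cla(preds, 0.9))       # 100.0
```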
Zero Marginal Probability (ZMP) Variables
• Predictions at CL ≥ 50%: K_CL = 798 for K2 || 987 for K2'
• Predictive accuracy accounting for wrong (K_w) and ZMP (K_ZMP) predictions:

\[ PA(\%) = \frac{K_t - (K_w + K_{ZMP})}{K_t} \times 100 \]

• Ways to reduce ZMP variables:
  1. Add more training cases
  2. Declare variable independencies (K2')
  3. Decrease the number of system variables
High Frequency Variables
3. Decrease the number of system variables:
  – High Frequency (HF) variables are present in more than 50% of cases
  – High Frequency Model (HFM): the CP model with HF variables removed

  CS Behavior                           Frequency
  Face not hidden                       88.4%
  Victim found at scene where killed    78.9%
  Victim found face up                  61.1%
  Multiple wounds to the body           52.2%
HFM Overall Predictive Accuracy
• Negligible accuracy increase for K2'
• Decrease in the number of ZMP variables for K2

  Algorithm    K2_HFM   K2'_HFM   K2     K2'
  OPA (%)      66       79.6      64.1   79
  K_C,≥50%     652      786       633    780
  K_ZMP        168      0         189    0
Frequency of Occurrence
• Frequency of occurrence is the number of times a variable state was present in the dataset
• The frequency method (F) predicts the states of V using the more frequent state in T
• The CP produced by F is therefore identical for every case in V
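The frequency method reduces to taking the modal state of each output variable in T. A Python sketch (the data layout is an assumption for illustration):

```python
# Frequency baseline F: one fixed profile, the modal state of each
# output variable in the training set T, applied to every validation case.
from collections import Counter

def frequency_profile(training_cases, output_vars):
    """Return {variable: (modal state, its frequency in T)}."""
    profile = {}
    for var in output_vars:
        counts = Counter(case[var] for case in training_cases)
        state, n = counts.most_common(1)[0]
        profile[var] = (state, n / len(training_cases))
    return profile

# Hypothetical training cases with two binary output variables.
T = [{"male": 1, "yoff": 0}, {"male": 1, "yoff": 0}, {"male": 1, "yoff": 1}]
print(frequency_profile(T, ["male", "yoff"]))
# male -> (1, 1.0); yoff -> (0, ~0.667)
```

The frequency attached to each state plays the role of the baseline's "confidence", which is why F reports one fixed confidence per variable regardless of the evidence.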
OPA for K2' and F
• Overall Predictive Accuracy:

  Algorithm                               K2'   F
  Accuracy (%)                            79    79.3
  Correct predictions (number of nodes)   780   784
Confidence Level of Predictions
• Confidence Level Accuracy (CLA):

  Algorithm: K2' || F
  CL                 K_CL         K_C,CL       CLA (%)        Δ(K_C,CL)
  0.5 ≤ CL < 0.7     262 || 329   162 || 216   61.8 || 65.7   -54
  0.7 ≤ CL < 0.9     470 || 470   386 || 396   82.1 || 84.3   -10
  0.9 ≤ CL < 0.95    139 || 141   125 || 126   90 || 89.3     -1
  0.95 ≤ CL          116 || 47    107 || 46    92.2 || 98     61
Information Entropy
• Information entropy (H) quantifies the certainty/uncertainty of a model
• The amount of information is related to the confidence of the prediction: less entropy means more confidence in the long run

\[ H = -\sum_{i=1}^{r_i} p_i \log(p_i) \]
H for K2' and F
• The full joint entropy of the K2' model is an infeasible calculation
• The average of each variable's H is a suitable measure:

\[ H(X_1, ..., X_k) = \sum_{i=1}^{k} H(X_i \mid X_{i-1}, ..., X_1) \;\Rightarrow\; H(X_1, ..., X_k) \leq \sum_{i=1}^{k} H(X_i) \]

• H(K2') = 0.43 vs. H(F) = 0.48
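The entropy comparison can be sketched numerically; sharper predicted distributions yield a lower average H, matching the H(K2') < H(F) result. A Python sketch (natural log here; the slides do not fix a base):

```python
# Per-variable entropy H = -sum_i p_i log p_i, averaged over variables.
from math import log

def entropy(p):
    """Entropy of one distribution, skipping zero-probability states."""
    return -sum(pi * log(pi) for pi in p if pi > 0)

def average_entropy(distributions):
    """Average of each variable's H, the measure used on the slide."""
    return sum(entropy(p) for p in distributions) / len(distributions)

# Hypothetical predicted distributions for two binary output variables.
confident = [[0.9, 0.1], [0.95, 0.05]]   # sharp predictions: low H
vague = [[0.6, 0.4], [0.55, 0.45]]       # near-uniform predictions: high H
print(average_entropy(confident) < average_entropy(vague))  # True
```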
CL Ranges for Predictions
Crime Scene Variables
• Input variables from the crime scene (evidence):

  Input Variable   Definition
  I1, pen          Foreign object penetration
  I2, hid          Face hidden
  I3, blnd         Blindfolded
  I4, blnt         Wounds caused by blunt instrument
  I5, suff         Suffocation (other than strangulation)
Criminal Profile Variables
• Output variables comprising the criminal profile:

  Output Variable   Definition
  O1, yoff          Young offender (17-21 years old)
  O2, thft          Criminal record of theft
  O3, frd           Criminal record of fraud
  O4, brg           Criminal record of burglary
  O5, rlt           Relationship with victim
  O6, unem          Unemployed at time of offense
  O7, male          Male
  O8, famr          Familiar with area of offense occurrence
Predicted Case vs. Actual Case

  Variable           Actual   K2' Profile   Frequency Profile
  Young              A        A (0.813)     A (0.805)
  Theft              A        A (0.75)      A (0.54)
  Fraud              P        A (0.76)      A (0.67)
  Burglary           A        A (1)         A (0.67)
  Relationship       A        A (1)         A (0.64)
  Unemployed         P        P (0.79)      P (0.52)
  Male               A        A (1)         P (0.9)
  Familiar w/ area   P        P (0.91)      P (0.86)
Predicted Case vs. Actual Case

  Variable           Actual   K2' Profile   Frequency Profile
  Young              A        A (0.83)      A (0.81)
  Theft              P        P (0.52)      A (0.54)
  Fraud              A        A (0.67)      A (0.67)
  Burglary           P        A (0.54)      A (0.67)
  Relationship       A        A (0.57)      A (0.64)
  Unemployed         P        P (0.53)      P (0.52)
  Male               P        P (0.82)      P (0.9)
  Familiar w/ area   P        P (0.86)      P (0.86)
Predicted Case vs. Actual Case

  Variable           Actual   K2' Profile   Frequency Profile
  Young              A        A (0.83)      A (0.81)
  Theft              A        P (0.97)      A (0.54)
  Fraud              P        A (0.67)      A (0.67)
  Burglary           A        P (0.60)      A (0.67)
  Relationship       P        A (0.59)      A (0.64)
  Unemployed         A        P (0.61)      P (0.52)
  Male               A        P (1)         P (0.9)
  Familiar w/ area   P        P (0.87)      P (0.86)
Predicted Case vs. Actual Case

  Variable           Actual   K2' Profile   Frequency Profile
  Young              A        A (0.93)      A (0.81)
  Theft              A        A (0.75)      A (0.54)
  Fraud              P        P (0.73)      A (0.67)
  Burglary           A        A (0.82)      A (0.67)
  Relationship       P        P (0.56)      A (0.64)
  Unemployed         A        A (0.87)      P (0.52)
  Male               P        P (0.89)      P (0.9)
  Familiar w/ area   P        P (0.78)      P (0.86)
Slice of K2' DAG

Evidence: pen (penetration), blnd (blindfolded), hid (face hidden), blnt (blunt instrument), suff (suffocation)
CP: yoff (young offender), frd (fraud), rlt (relationship w/ victim), thft (theft), brg (burglary), famr (familiar w/ area), unmp (unemployed)

[Diagram: a slice of the learned K2' DAG over the CP nodes yoff, thft, frd, brg, unmp, rlt, male, famr and the evidence nodes pen, suff, blnd, hid, blnt]
Conclusions
• Due to the absence of ZMP variables, the K2' structural learning algorithm requires fewer training cases than K2
• A benefit of a BN model over the naïve frequency approach to acquiring a CP is that the BN's confidence levels vary with the evidence
• Because all of the variables are binary, the frequency approach performs better than it would if the variables had many states
Further Research
• Develop a search algorithm that increases BN performance
• Incorporate Salfati's expressive/instrumental dichotomy to supervise training of a BN model
• Apply the method to other fields
• Combine NN and BN methods to improve model performance
Neural Network Research
• Neural networks implemented similarly to the BN

  Algorithm                   K2'   NN
  Nodes predicted correctly   780   739
  Overall PA (%)              79    75
Acknowledgments
• Special thanks to:
  – Dr. Silvia Ferrari, advisor
  – Dr. Gabrielle Salfati, John Jay College of Criminal Justice
  – Dr. Marco Strano, President of the International Crime Analysis Association (ICAA)
  – My Masters Committee
April Fools' Day Origin
• April 1 was originally New Year's Day in France
• In 1582, Pope Gregory decreed that the Gregorian calendar was to replace the Julian calendar
• January 1 is New Year's Day according to the new calendar
• Those who refused to accept the new calendar were "April fools"
Questions?
Laboratory for Intelligent Systems & Control Mechanical Engineering Duke University