Algorithms for Computing Variants of the Longest Common Subsequence Problem M. Sohel Rahman Costas Iliopoulos
Goal of this paper… • Introduce new variants of LCS problem • Present efficient algorithms to solve them
19 MAR 2007
LSD 2007
Longest Common Subsequence • Given two sequences:
– X = CAAGCTAAGCTAC – Y = TCAAGTAGAAC
• Common Subsequence: A Subseq common to both X and Y. • LCS- A subseq having the highest length
19 MAR 2007
LSD 2007
LCS-Example 1
X=
Y=
2
3
4
5
6
7
8
9
10
11
C A A G C T A A G C T
C C G T A T 1
2
19 MAR 2007
3
4
5
A common subseq: CCT Length = 3
6
LSD 2007
LCS-Example 1
2
3
4
5
6
7
8
9
10
11
12
X=
C A A G C T A A G C G T
Y=
C C G T A T 1
2
19 MAR 2007
3
4
5
A Longest common subseq: CCTAT Length = 5
6
LSD 2007
LCS-Example 1
2
3
4
5
6
7
8
9
10
11
12
X=
C A A G C T A A G C G T
Y=
C C G T A T
A Longest common subseq: CCTAT Length = 5
1
Another LCS: CGTAT Length = 5
2
19 MAR 2007
3
4
5
6
LSD 2007
New Variants Introduced • • • •
FIG: Fixed Gapped LCS ELAG: Elastic Gapped LCS RIFIG: Rigid Fixed Gapped LCS RELAG: Rigid Elastic Gapped LCS
19 MAR 2007
LSD 2007
19 MAR 2007
LSD 2007
FIG: K = 4 1
2
3
4
5
6
7
8
9
10
11
12
X=
C A A G C T A A G C G T
Y=
C C G T A T 1
2
19 MAR 2007
3
4
5
6
A FIG LCS: CCTAT Length = 5 Same as an LCS
LSD 2007
FIG: K = 2 1
X=
2
3
4
5
6
7
8
9
10
11
12
C A A G C T A A G C G T 3>K
Y=
C C G T A T 1
2
19 MAR 2007
3
4
5
6
CCTAT Length = 5 Can’ be a FIG LCS if K = 2!
LSD 2007
FIG: K = 2 1
X=
2
3
4
5
6
7
8
10
11
12
C A A G C T A A G C G T 2 <= K
Y=
9
4>K
C C G T A T 1
2
19 MAR 2007
3
4
5
6
LSD 2007
FIG: K = 2 1
2
3
4
5
6
7
8
9
10
11
12
X=
C A A G C T A A G C G T
Y=
C C G T A T 1
2
19 MAR 2007
3
4
5
FIG LCS: CGTA Length = 4
6
LSD 2007
RIFIG: K = 4 1
2
3
4
5
6
7
8
9
10
11
12
X=
C A A G C T A A G C G T
Y=
C C G T A T 1
2
19 MAR 2007
3
4
5
6
LSD 2007
RIFIG: K = 3 1
2
3
4
5
6
7
8
9
10
11
12
X=
C A A G C T A A G C G T
Y=
C C G T A T 1
2
19 MAR 2007
3
4
5
A RIFIG LCS: GT Length = 2
6
LSD 2007
RIFIG: K = 3 1
2
3
4
5
6
7
8
9
10
11
12
X=
C A A G C T A A G C G T
Y=
C C G T A T 1
2
19 MAR 2007
3
4
5
A(NOTHER) RIFIG LCS: GT Length = 2
6
LSD 2007
RIFIG: K = 3 1
X=
2
3
4
5
6
7
8
10
11
12
C A A G C T A A G C G T 4>K
Y=
9
C C G T A T 1
2
19 MAR 2007
3
4
5
6
LSD 2007
RIFIG: K = 3 4 1
2
3
4
5
6
7
8
9
10
11
12
X=
C A A G C T A A G C G T
Y=
C C G T A T 1
2
19 MAR 2007
3
4
5
6
LSD 2007
ELAG: K_1 = 2, K_2 = 4 1
2
3
4
5
6
7
8
9
10
11
12
X=
C A A G C T A A G C G T
Y=
C C G T A T
K_2 constraint is OK K_1 constraint is not met For every case in Y!!!
1
2
19 MAR 2007
3
4
5
6
LSD 2007
ELAG: K_1 = 2, K_2 = 3 1
2
3
4
5
6
7
8
9
10
11
12
X=
C A A G C T A A G C G T
Y=
C C G T A T ELAG LCS for K_1 = 2, K_2 = 3: GT Length = 2 1
2
19 MAR 2007
3
4
5
6
LSD 2007
ELAG: K_1 = 2, K_2 = 3 4 1
2
3
4
5
6
7
8
9
10
11
12
X=
C A A G C T A A G C G T
Y=
C C G T A T
1
2
19 MAR 2007
3
4
5
6
LSD 2007
RELAG • RELAG IS SIMILAR TO ELAG WITH THE ADDED CONSTRAINT OF RIGIDITY LIKE RIFIG
19 MAR 2007
LSD 2007
Notations • Matches: We say a pair (i, j), 1 <= i, j <= n defines a match, if X[i] = Y[i]. • The set of all matches is M • |M| = R
19 MAR 2007
LSD 2007
M 1
2
3
4
5
6
7
8
9
10
11
12
X=
C A A G C T A A G C G T
Y=
C C G T A T 1
2
3
4
5
6
M = {(1, 1), (1, 2), …}
19 MAR 2007
LSD 2007
M 1
2
3
4
5
6
7
8
9
10
11
12
X=
C A A G C T A A G C G T
Y=
C C G T A T 1
2
3
4
5
6
M = {(1, 1), (1, 2), (2, 5), …}
19 MAR 2007
LSD 2007
M 1
2
3
4
5
6
7
8
9
10
11
12
X=
C A A G C T A A G C G T
Y=
C C G T A T 1
2
3
4
5
6
M = {(1, 1), (1, 2), (2, 5), (3, 5)…} and so on.
19 MAR 2007
LSD 2007
FIG Algorithms • • • •
O(n^2 + R(K+1)^2) Algorithm O(Rloglogn + R(K+1)^2) Algorithm O(n^2 + Rloglogn + RK) Algorithm O(n^2 + R) Algorithm
19 MAR 2007
LSD 2007
Classical LCS Algorithm
19 MAR 2007
LSD 2007
Y
C
0
1
2
2
2
3
4
A
0
1
2
2
2
3
3
G
0
1
2
2
2
2
2
A
0
1
1
1
1
1
1
C
0
0
0
0
0
0
1
0
0
0
0
0
0
A
G
G
T
A
C
19 MAR 2007
X LSD 2007
Y
C
0
1
2
A
0
1
G
0
A C
LCS (X[1..2], Y[1..4])
2
2
3
4
2
2
2
3
3
1
2
2
2
2
2
0
1
1
1
1
1
1
0
0
0
0
0
0
1
0
0
0
0
0
0
A
G
G
T
A
C
19 MAR 2007
X LSD 2007
Y
C
0
1
2
2
3
4
A
0
1
2LCS (X[1..5], 2 Y)2
3
3
G
0
1
2
2
2
2
2
A
0
1
1
1
1
1
1
C
0
0
0
0
0
0
1
0
0
0
0
0
0
A
G
G
T
A
C
19 MAR 2007
2
X LSD 2007
Y
C
0
1
2
2
2
3
4
A
0
1
2
2
2
3
3
G
0
1
2
2
2
2
2
A
0
1
1
1
1
1
1
C
0
0
0
0
0
0
1
0
0
0
0
0
0
A
G
G
T
A
C
19 MAR 2007
X LSD 2007
A New Approach C A Y
G A C
A 19 MAR 2007
G
G X LSD 2007
T
A
C
Then: Calculate… *
C
*
A
* *
YG
*
*
A
* *
C
A 19 MAR 2007
G
G X LSD 2007
T
A
C
Then: Calculate… *
C
*
A
* *
YG
*
*
A
* *
C
A 19 MAR 2007
G
G X LSD 2007
T
A
C
All matches in first row-> 1 *
C
*
A
* *
YG
*
*
A
* 1 *
C
A 19 MAR 2007
G
G X LSD 2007
T
A
C
*
C
*
A
* *0 + 1 = 1*
YG
*
A
* 1 *
C
A 19 MAR 2007
G
G X LSD 2007
T
A
C
*
C
*
A
*
*
YG
0+1=1
*
1 *
A
* 1 *
C
A 19 MAR 2007
G
G X LSD 2007
T
A
C
*
C 1+1=2
*
A
*
YG
*
*
1 *
A
1 * 1 *
C
A 19 MAR 2007
G
G X LSD 2007
T
A
C
*
C 1+1=2
*
A
2 *
YG
*
1 *
A
* 1 * 1 *
C
A 19 MAR 2007
G
G X LSD 2007
T
A
C
0+1=1
C
*
*
A
* 2 * 2 *
YG
1 *
A
1 * 1 *
C
A 19 MAR 2007
G
G X LSD 2007
T
A
C
2+1=3
C
1 *
A
*
* 2 * 2 *
YG
1 *
A
1 * 1 *
C
A 19 MAR 2007
G
G X LSD 2007
T
A
C
3+1=4
*
C
1 *
A
3 * 2 * 2 *
YG
1 *
A
1 * 1 *
C
A 19 MAR 2007
G
G X LSD 2007
T
A
C
4 *
C
1 *
A
3 * 2 * 2 *
YG
1 *
A
1 * 1 *
C
A 19 MAR 2007
G
G X LSD 2007
T
A
C
Algorithm for FIG *
C
*
A
* *
YG
*
*
A
* *
C
A 19 MAR 2007
G
G X LSD 2007
T
A
C
Algorithm for FIG: K = 2 *
C
*
A
* *
YG
*
*
A
* *
C
A 19 MAR 2007
G
G X LSD 2007
T
A
C
Algorithm for FIG: K = 1 *
C
*
A
* *
YG
*
*
A
* *
C
A 19 MAR 2007
G
G X LSD 2007
T
A
C
Running Time • Using van Emde Boas structure we can compute M in O(Rloglogn) • For each (i,j) \in M we check a (K+1)^2. • So total time O(Rloglogn + R(K+1)^2)
19 MAR 2007
LSD 2007
Now our goal is… • To improve our algorithm…
19 MAR 2007
LSD 2007
2 VIP Facts for LCS…
Y
C
0
1
2
2
2
3
4
A
0
1
2
2
2
3
3
G
0
1
2
2
2
2
2
A
0
1
1
1
1
1
1
C
0
0
0
0
0
0
1
0
0
0
0
0
0
A
G
G
T
A
C
19 MAR 2007
LSD 2007 X
Fact 1: Higher entry => higher or Equal value C
1 *
A
4 *
3 * 2 * 2 *
YG
1 *
A
1 * 1 *
C
A 19 MAR 2007
G
G LSD 2007 X
T
A
C
Fact 1: Higher entry => higher or Equal value C
1 *
A
4 *
3 * 2 * 2 *
YG
1 *
A
1 * 1 *
C
A 19 MAR 2007
G
G X LSD 2007
T
A
C
Fact 2: Independence on own row/col C
1 *
A
4 *
3 * 2 * 2 *
YG
1 *
A
1 * 1 *
C
A 19 MAR 2007
G
G X LSD 2007
T
A
C
Array H Maximum in this Column
C
*
*
A
* *
YG
*
*
A
* *
C
A 19 MAR 2007
G
G X LSD 2007
T
A
C
Array H Maximum in this Column
C
*
A
* *
YG
*
*
A
*
* *
C
A 19 MAR 2007
G
G X LSD 2007
T
A
C
Array H C
*
A
So each entry of Array H is the latest entry of that column (by FACT 1)
*
YG
*
*
*
A
*
* *
C
A 19 MAR 2007
G
G X LSD 2007
T
A
C
H
0
1
2
2
0
1
1 *
C
1 *
A
* 2 * 2 *
YG
1 *
A
1 * 1 *
C
A 19 MAR 2007
G
G LSD 2007
T
A
C
H
0
1
2
2
0
1
1 *
C
1 *
A
* 2 * 2 *
YG
1 *
A
1 * 1 *
C
A 19 MAR 2007
G
G LSD 2007
T
A
C
H
0
1
2
2
0
13
1 *
C
1 *
A
3 * 2 * 2 *
YG
1 *
A
1 * 1 *
C
A 19 MAR 2007
G
G LSD 2007
T
A
C
H
0
1
2
2
0
3
1 *
C
1 *
A
3 * 2 * 2 *
YG
1 *
A
1 * 1 *
C
A 19 MAR 2007
G
G LSD 2007
T
A
C
SO WE CAN MAINTAIN ARRAY H ON THE FLY
4 *
C
1 *
A
3 * 2 * 2 *
YG
1 *
A
1 * 1 *
C
A 19 MAR 2007
G
G LSD 2007
T
A
C
Adopting for FIG • The problem is Fact 1 doesn’t hold for FIG. • Why? – Because of the K constraint a continuing LCS may have to stop and start over again at a particular point.
19 MAR 2007
LSD 2007
1 *
T
*
*
G G T
1 * 2 * 2 *
T
1 * 1 * 1 *
T 19 MAR 2007
T
T LSD 2007
G
A
C
FIG: K = 1 1 *
T
*
*
G G T
1 * 2 * 2 *
T
1 * 1 * 1 *
T 19 MAR 2007
T
T LSD 2007
G
A
C
FIG: K = 1FACT 1 doesn’t 1 * 1 *
T
*
hold!
G G T
1 * 2 * 2 *
T
1 * 1 * 1 *
T 19 MAR 2007
T
T LSD 2007
G
A
C
FIG: K = 1 1 * 1 *
T
*
G G T
1 * 2 * 2 *
T
1 * 1 * 1 *
T 19 MAR 2007
T
T LSD 2007
G
A
C
FIG: K = 1 1 * 1 * 1 *
T
FACT 1 doesn’t hold!
G G T
1 * 2 * 2 *
T
1 * 1 * 1 *
T 19 MAR 2007
T
T LSD 2007
G
A
C
Maintaining H for FIG
• van Emde Boas (vEB) structure: • Maintain a sorted list of integers in the range [1..n] – Can perform in O(log log n) time • insertion of an item • Deletion of an item
– Can perform in constant time:
• next(i) (successor element of i in the list) • prev(i) (predecessor element of i in the list)
19 MAR 2007
LSD 2007
E_1:
{1, (1, 1)}
{1, (2, 1)}
E_2:
{1, (1, 2)}
{2, (2, 2)}
E_3:
{1, (1, 3)}
{2, (2, 3)}
H 6
G
5
T
4
C
3
C
2
T
1
T
-
1 (2, 1)
2 (2, 2)
2 (2, 3)
…
…
…
* *
*
*
K=2
1 * 2 * 2 * 1 * 1 * 1 * 19 MAR 2007
T 1
T 2
T G LSD 2007 3 4
A 5
A 6
E_1:
{1, (1, 1)}
{1, (2, 1)}
E_2:
{1, (1, 2)}
{2, (2, 2)}
E_3:
{1, (1, 3)}
{2, (2, 3)}
H 6
G
5
T
4
C
3
C
2
T
1
T
-
1 (2, 1)
Now consider row 5 Delete all entries from row 5 – K – 2 = 1 2 (2, 2)
2 (2, 3)
…
…
…
* *
*
*
K=2
1 * 2 * 2 * 1 * 1 * 1 * 19 MAR 2007
T 1
T 2
T G LSD 2007 3 4
A 5
A 6
E_1:
{1, (1, 1)}
{1, (2, 1)}
E_2:
{1, (1, 2)}
{2, (2, 2)}
E_3:
{1, (1, 3)}
{2, (2, 3)}
H 6
G
5
T
4
C
3
C
2
T
1
T
-
1 (2, 1)
Now consider row 5 Delete all entries from row 5 – K – 2 = 1 2 (2, 2)
2 (2, 3)
…
…
…
* *
*
*
K=2
1 * 2 * 2 * 1 * 1 * 1 * 19 MAR 2007
T 1
T 2
T G LSD 2007 3 4
A 5
A 6
E_1:
{1, (1, 1)}
{1, (2, 1)}
E_2:
{1, (1, 2)}
{2, (2, 2)}
E_3:
{1, (1, 3)}
{2, (2, 3)}
H 6
G
5
T
4
C
3
C
2
T
1
T
-
1 (2, 1)
2 (2, 2)
Max of E_1
*
*
2 (2, 3)
…
…
…
* *
K=2
1 * 2 * 2 * 1 * 1 * 1 * 19 MAR 2007
T 1
T 2
T G LSD 2007 3 4
A 5
A 6
E_1:
{1, (1, 1)}
{1, (2, 1)}
E_2:
{1, (1, 2)}
{2, (2, 2)}
E_3:
{1, (1, 3)}
{2, (2, 3)}
H 6
G
5
T
4
C
3
C
2
T
1
T
-
1 (2, 1)
*
2 (2, 2)
2 (2, 3)
Max of E_2
*
…
…
…
*
*
K=2
1 * 2 * 2 * 1 * 1 * 1 * 19 MAR 2007
T 1
T 2
T G LSD 2007 3 4
A 5
A 6
E_1:
{1, (1, 1)}
{1, (2, 1)}
E_2:
{1, (1, 2)}
{2, (2, 2)}
E_3:
{1, (1, 3)}
{2, (2, 3)}
H 6
G
5
T
4
C
3
C
2
T
1
T
-
1 (2, 1)
*
2 (2, 2)
*
2 (2, 3)
…
…
…
* Max of E_3
*
K=2
1 * 2 * 2 * 1 * 1 * 1 * 19 MAR 2007
T 1
T 2
T G LSD 2007 3 4
A 5
A 6
E_1:
{1, (1, 1)}
{1, (2, 1)}
E_2:
{1, (1, 2)}
{2, (2, 2)}
E_3:
{1, (1, 3)}
{2, (2, 3)}
H 6
G
5
T
4
C
3
C
2
T
1
T
-
1 (2, 1)
2 (2, 2)
2 (2, 3)
…
…
…
* *
*
*
K=2
1 * 2 * 2 * 1 * 1 * 1 * 19 MAR 2007
T 1
T 2
T G LSD 2007 3 4
A 5
A 6
E_1:
{1, (1, 1)}
{1, (2, 1)}
E_2:
{1, (1, 2)}
{2, (2, 2)}
E_3:
{1, (1, 3)}
{2, (2, 3)}
H 6
G
5
T
4
C
3
C
2
T
1
T
-
1 (2, 1)
{1, (5, 1)}
2 (2, 2)
2 (2, 3)
…
…
…
* 1 *
*
*
K=2
1 * 2 * 2 * 1 * 1 * 1 * 19 MAR 2007
T 1
T 2
T G LSD 2007 3 4
A 5
A 6
E_1:
{1, (1, 1)}
{1, (2, 1)}
{1, (5, 1)}
E_2:
{1, (1, 2)}
{2, (2, 2)}
{2, (5, 2)}
E_3:
{1, (1, 3)}
{2, (2, 3)}
H 6
G
5
T
4
C
3
C
2
T
1
T
-
1 (2, 1)
2 (2, 2)
2 (2, 3)
…
…
…
* 1 * 2 *
*
K=2
1 * 2 * 2 * 1 * 1 * 1 * 19 MAR 2007
T 1
T 2
T G LSD 2007 3 4
A 5
A 6
E_1:
{1, (1, 1)}
{1, (2, 1)}
{1, (5, 1)}
E_2:
{1, (1, 2)}
{2, (2, 2)}
{2, (5, 2)}
E_3:
{1, (1, 3)}
{2, (2, 3)}
{3, (5, 3)}
H 6
G
5
T
4
C
3
C
2
T
1
T
-
1 (2, 1)
2 (2, 2)
2 (2, 3)
…
…
…
* 1 * 2 * 3 *
K=2
1 * 2 * 2 * 1 * 1 * 1 * 19 MAR 2007
T 1
T 2
T G LSD 2007 3 4
A 5
A 6
E_1:
{1, (1, 1)}
{1, (2, 1)}
E_2:
{1, (1, 2)}
{2, (2, 2)}
E_3:
{1, (1, 3)}
{2, (2, 3)}
H 6
G
5
T
4
C
3
C
2
T
1
T
-
1 (2, 1)
{1, (5, 1)}
Now consider row 6 {2, (5, 2)} Delete all entries from row 6 – K – 2 = 2 {3, (5, 3)} 2 (2, 2)
2 (2, 3)
…
…
…
* 1 * 2 * 3 *
K=2
1 * 2 * 2 * 1 * 1 * 1 * 19 MAR 2007
T 1
T 2
T G LSD 2007 3 4
A 5
A 6
E_1:
{1, (1, 1)}
{1, (2, 1)}
{1, (5, 1)}
E_2:
{1, (1, 2)}
{2, (2, 2)}
{2, (5, 2)}
E_3:
{1, (1, 3)}
{2, (2, 3)}
{3, (5, 3)}
H 6
G
5
T
4
C
3
C
2
T
1
T
-
1 (5, 1)
2 (5, 2)
3 (5, 3)
Update H
…
…
…
* 1 * 2 * 3 *
K=2
1 * 2 * 2 * 1 * 1 * 1 * 19 MAR 2007
T 1
T 2
T G LSD 2007 3 4
A 5
A 6
E_1:
{1, (1, 1)}
{1, (2, 1)}
{1, (5, 1)}
E_2:
{1, (1, 2)}
{2, (2, 2)}
{2, (5, 2)}
E_3:
{1, (1, 3)}
{2, (2, 3)}
{3, (5, 3)}
H 6
G
5
T
4
C
3
C
2
T
1
T
-
1 (5, 1)
2 (5, 2)
3 (5, 3)
…
…
…
4 * 1 * 2 * 3 *
K=2
1 * 2 * 2 * 1 * 1 * 1 * 19 MAR 2007
T 1
T 2
T G LSD 2007 3 4
A 5
A 6
Running Time • We have changed the checking of (K+1)^2 area to (K+1) entries per match • We need to maintain vEB structure. – Every mach is inserted and deleted at most once
• O(R(K+1) + Rloglogn)
19 MAR 2007
LSD 2007
Now… • Remove K from the running time.
19 MAR 2007
LSD 2007
Range Maxima Query • Given an array of numbers for preprocessing • Query: Given a range find the maximum • Results: O(n) time preprocessing for O(1) time per query.
19 MAR 2007
LSD 2007
Apply RMQ • Preprocess Array H per row (Recall FACT 2) • Apply range maxima query on H per match. • We are done.
19 MAR 2007
LSD 2007
Running Time • O(n^2) for the preprocessing (O(n) per row) • O(R) work per match • O(Rloglogn preprocessing for M) • O(n^2 + Rloglogn)
19 MAR 2007
LSD 2007
LCS Algorithm • This gives us a new LCS algorithm of O(n^2 + R) • This can be improved to O(Rloglogn) or may be even better (not in this paper).
19 MAR 2007
LSD 2007
ELAG, RIFIG, RELAG • Some modifications and extension on the FIG algorithms.
19 MAR 2007
LSD 2007
Motivation & Application • Tackle degenerate strings in all these algorithms. Assuming DNA/RNA alphabet, this doesn’t increase running time. – This gives us an algorithm for finding longest common substring for degenrate alphabet (Put K = 1)
19 MAR 2007
LSD 2007
Motivation & Application • Extracting long multiple repeats in DNA using lossless filter requires computation of LCS with bounded span: – Our algorithm can give this LCS.
19 MAR 2007
LSD 2007
Motivation & Application • Introduction of the notion of gap constraint in LCS seemed to be appealing in its own right.
19 MAR 2007
LSD 2007
Latest Updates… • Improvements on FIG, ELAG, RIFIG, RELAG: – O(n+Rlog n) for FIG and ELAG – O(n+R) for RIFIG and RELAG Collaborators
Paper Under Review
Marcin Kubica and Tomasz Walen (Warsaw Univ.) 19 MAR 2007
LSD 2007
Latest Updates… • Using the concept of this paper, a new O(Rloglogn) time algorithm for the traditional LCS problem (Accepted at AAIM 2007)
19 MAR 2007
LSD 2007
Latest Updates… • Using the concept of this paper, a new O(Rploglogn) time algorithm for the Constrained LCS problem – Paper Under Review
19 MAR 2007
LSD 2007
THANK YOU
19 MAR 2007
LSD 2007