Lcs Isaac06 Lsd07

  • Uploaded by: api-3696125
  • 0
  • 0
  • November 2019
  • PDF

This document was uploaded by user and they confirmed that they have the permission to share it. If you are author or own the copyright of this book, please report to us by using this DMCA report form. Report DMCA


Overview

Download & View Lcs Isaac06 Lsd07 as PDF for free.

More details

  • Words: 4,268
  • Pages: 94
Algorithms for Computing Variants of the Longest Common Subsequence Problem M. Sohel Rahman Costas Iliopoulos

Goal of this paper… • Introduce new variants of LCS problem • Present efficient algorithms to solve them

19 MAR 2007

LSD 2007

Longest Common Subsequence • Given two sequences:

– X = CAAGCTAAGCTAC – Y = TCAAGTAGAAC

• Common Subsequence: A Subseq common to both X and Y. • LCS- A subseq having the highest length

19 MAR 2007

LSD 2007

LCS-Example 1

X=

Y=

2

3

4

5

6

7

8

9

10

11

C A A G C T A A G C T

C C G T A T 1

2

19 MAR 2007

3

4

5

A common subseq: CCT Length = 3

6

LSD 2007

LCS-Example 1

2

3

4

5

6

7

8

9

10

11

12

X=

C A A G C T A A G C G T

Y=

C C G T A T 1

2

19 MAR 2007

3

4

5

A Longest common subseq: CCTAT Length = 5

6

LSD 2007

LCS-Example 1

2

3

4

5

6

7

8

9

10

11

12

X=

C A A G C T A A G C G T

Y=

C C G T A T

A Longest common subseq: CCTAT Length = 5

1

Another LCS: CGTAT Length = 5

2

19 MAR 2007

3

4

5

6

LSD 2007

New Variants Introduced • • • •

FIG: Fixed Gapped LCS ELAG: Elastic Gapped LCS RIFIG: Rigid Fixed Gapped LCS RELAG: Rigid Elastic Gapped LCS

19 MAR 2007

LSD 2007

19 MAR 2007

LSD 2007

FIG: K = 4 1

2

3

4

5

6

7

8

9

10

11

12

X=

C A A G C T A A G C G T

Y=

C C G T A T 1

2

19 MAR 2007

3

4

5

6

A FIG LCS: CCTAT Length = 5 Same as an LCS

LSD 2007

FIG: K = 2 1

X=

2

3

4

5

6

7

8

9

10

11

12

C A A G C T A A G C G T 3>K

Y=

C C G T A T 1

2

19 MAR 2007

3

4

5

6

CCTAT Length = 5 Can’ be a FIG LCS if K = 2!

LSD 2007

FIG: K = 2 1

X=

2

3

4

5

6

7

8

10

11

12

C A A G C T A A G C G T 2 <= K

Y=

9

4>K

C C G T A T 1

2

19 MAR 2007

3

4

5

6

LSD 2007

FIG: K = 2 1

2

3

4

5

6

7

8

9

10

11

12

X=

C A A G C T A A G C G T

Y=

C C G T A T 1

2

19 MAR 2007

3

4

5

FIG LCS: CGTA Length = 4

6

LSD 2007

RIFIG: K = 4 1

2

3

4

5

6

7

8

9

10

11

12

X=

C A A G C T A A G C G T

Y=

C C G T A T 1

2

19 MAR 2007

3

4

5

6

LSD 2007

RIFIG: K = 3 1

2

3

4

5

6

7

8

9

10

11

12

X=

C A A G C T A A G C G T

Y=

C C G T A T 1

2

19 MAR 2007

3

4

5

A RIFIG LCS: GT Length = 2

6

LSD 2007

RIFIG: K = 3 1

2

3

4

5

6

7

8

9

10

11

12

X=

C A A G C T A A G C G T

Y=

C C G T A T 1

2

19 MAR 2007

3

4

5

A(NOTHER) RIFIG LCS: GT Length = 2

6

LSD 2007

RIFIG: K = 3 1

X=

2

3

4

5

6

7

8

10

11

12

C A A G C T A A G C G T 4>K

Y=

9

C C G T A T 1

2

19 MAR 2007

3

4

5

6

LSD 2007

RIFIG: K = 3 4 1

2

3

4

5

6

7

8

9

10

11

12

X=

C A A G C T A A G C G T

Y=

C C G T A T 1

2

19 MAR 2007

3

4

5

6

LSD 2007

ELAG: K_1 = 2, K_2 = 4 1

2

3

4

5

6

7

8

9

10

11

12

X=

C A A G C T A A G C G T

Y=

C C G T A T

K_2 constraint is OK K_1 constraint is not met For every case in Y!!!

1

2

19 MAR 2007

3

4

5

6

LSD 2007

ELAG: K_1 = 2, K_2 = 3 1

2

3

4

5

6

7

8

9

10

11

12

X=

C A A G C T A A G C G T

Y=

C C G T A T ELAG LCS for K_1 = 2, K_2 = 3: GT Length = 2 1

2

19 MAR 2007

3

4

5

6

LSD 2007

ELAG: K_1 = 2, K_2 = 3 4 1

2

3

4

5

6

7

8

9

10

11

12

X=

C A A G C T A A G C G T

Y=

C C G T A T

1

2

19 MAR 2007

3

4

5

6

LSD 2007

RELAG • RELAG IS SIMILAR TO ELAG WITH THE ADDED CONSTRAINT OF RIGIDITY LIKE RIFIG

19 MAR 2007

LSD 2007

Notations • Matches: We say a pair (i, j), 1 <= i, j <= n defines a match, if X[i] = Y[i]. • The set of all matches is M • |M| = R

19 MAR 2007

LSD 2007

M 1

2

3

4

5

6

7

8

9

10

11

12

X=

C A A G C T A A G C G T

Y=

C C G T A T 1

2

3

4

5

6

M = {(1, 1), (1, 2), …}

19 MAR 2007

LSD 2007

M 1

2

3

4

5

6

7

8

9

10

11

12

X=

C A A G C T A A G C G T

Y=

C C G T A T 1

2

3

4

5

6

M = {(1, 1), (1, 2), (2, 5), …}

19 MAR 2007

LSD 2007

M 1

2

3

4

5

6

7

8

9

10

11

12

X=

C A A G C T A A G C G T

Y=

C C G T A T 1

2

3

4

5

6

M = {(1, 1), (1, 2), (2, 5), (3, 5)…} and so on.

19 MAR 2007

LSD 2007

FIG Algorithms • • • •

O(n^2 + R(K+1)^2) Algorithm O(Rloglogn + R(K+1)^2) Algorithm O(n^2 + Rloglogn + RK) Algorithm O(n^2 + R) Algorithm

19 MAR 2007

LSD 2007

Classical LCS Algorithm

19 MAR 2007

LSD 2007

Y

C

0

1

2

2

2

3

4

A

0

1

2

2

2

3

3

G

0

1

2

2

2

2

2

A

0

1

1

1

1

1

1

C

0

0

0

0

0

0

1

0

0

0

0

0

0

A

G

G

T

A

C

19 MAR 2007

X LSD 2007

Y

C

0

1

2

A

0

1

G

0

A C

LCS (X[1..2], Y[1..4])

2

2

3

4

2

2

2

3

3

1

2

2

2

2

2

0

1

1

1

1

1

1

0

0

0

0

0

0

1

0

0

0

0

0

0

A

G

G

T

A

C

19 MAR 2007

X LSD 2007

Y

C

0

1

2

2

3

4

A

0

1

2LCS (X[1..5], 2 Y)2

3

3

G

0

1

2

2

2

2

2

A

0

1

1

1

1

1

1

C

0

0

0

0

0

0

1

0

0

0

0

0

0

A

G

G

T

A

C

19 MAR 2007

2

X LSD 2007

Y

C

0

1

2

2

2

3

4

A

0

1

2

2

2

3

3

G

0

1

2

2

2

2

2

A

0

1

1

1

1

1

1

C

0

0

0

0

0

0

1

0

0

0

0

0

0

A

G

G

T

A

C

19 MAR 2007

X LSD 2007

A New Approach C A Y

G A C

A 19 MAR 2007

G

G X LSD 2007

T

A

C

Then: Calculate… *

C

*

A

* *

YG

*

*

A

* *

C

A 19 MAR 2007

G

G X LSD 2007

T

A

C

Then: Calculate… *

C

*

A

* *

YG

*

*

A

* *

C

A 19 MAR 2007

G

G X LSD 2007

T

A

C

All matches in first row-> 1 *

C

*

A

* *

YG

*

*

A

* 1 *

C

A 19 MAR 2007

G

G X LSD 2007

T

A

C

*

C

*

A

* *0 + 1 = 1*

YG

*

A

* 1 *

C

A 19 MAR 2007

G

G X LSD 2007

T

A

C

*

C

*

A

*

*

YG

0+1=1

*

1 *

A

* 1 *

C

A 19 MAR 2007

G

G X LSD 2007

T

A

C

*

C 1+1=2

*

A

*

YG

*

*

1 *

A

1 * 1 *

C

A 19 MAR 2007

G

G X LSD 2007

T

A

C

*

C 1+1=2

*

A

2 *

YG

*

1 *

A

* 1 * 1 *

C

A 19 MAR 2007

G

G X LSD 2007

T

A

C

0+1=1

C

*

*

A

* 2 * 2 *

YG

1 *

A

1 * 1 *

C

A 19 MAR 2007

G

G X LSD 2007

T

A

C

2+1=3

C

1 *

A

*

* 2 * 2 *

YG

1 *

A

1 * 1 *

C

A 19 MAR 2007

G

G X LSD 2007

T

A

C

3+1=4

*

C

1 *

A

3 * 2 * 2 *

YG

1 *

A

1 * 1 *

C

A 19 MAR 2007

G

G X LSD 2007

T

A

C

4 *

C

1 *

A

3 * 2 * 2 *

YG

1 *

A

1 * 1 *

C

A 19 MAR 2007

G

G X LSD 2007

T

A

C

Algorithm for FIG *

C

*

A

* *

YG

*

*

A

* *

C

A 19 MAR 2007

G

G X LSD 2007

T

A

C

Algorithm for FIG: K = 2 *

C

*

A

* *

YG

*

*

A

* *

C

A 19 MAR 2007

G

G X LSD 2007

T

A

C

Algorithm for FIG: K = 1 *

C

*

A

* *

YG

*

*

A

* *

C

A 19 MAR 2007

G

G X LSD 2007

T

A

C

Running Time • Using van Emde Boas structure we can compute M in O(Rloglogn) • For each (i,j) \in M we check a (K+1)^2. • So total time O(Rloglogn + R(K+1)^2)

19 MAR 2007

LSD 2007

Now our goal is… • To improve our algorithm…

19 MAR 2007

LSD 2007

2 VIP Facts for LCS…

Y

C

0

1

2

2

2

3

4

A

0

1

2

2

2

3

3

G

0

1

2

2

2

2

2

A

0

1

1

1

1

1

1

C

0

0

0

0

0

0

1

0

0

0

0

0

0

A

G

G

T

A

C

19 MAR 2007

LSD 2007 X

Fact 1: Higher entry => higher or Equal value C

1 *

A

4 *

3 * 2 * 2 *

YG

1 *

A

1 * 1 *

C

A 19 MAR 2007

G

G LSD 2007 X

T

A

C

Fact 1: Higher entry => higher or Equal value C

1 *

A

4 *

3 * 2 * 2 *

YG

1 *

A

1 * 1 *

C

A 19 MAR 2007

G

G X LSD 2007

T

A

C

Fact 2: Independence on own row/col C

1 *

A

4 *

3 * 2 * 2 *

YG

1 *

A

1 * 1 *

C

A 19 MAR 2007

G

G X LSD 2007

T

A

C

Array H Maximum in this Column

C

*

*

A

* *

YG

*

*

A

* *

C

A 19 MAR 2007

G

G X LSD 2007

T

A

C

Array H Maximum in this Column

C

*

A

* *

YG

*

*

A

*

* *

C

A 19 MAR 2007

G

G X LSD 2007

T

A

C

Array H C

*

A

So each entry of Array H is the latest entry of that column (by FACT 1)

*

YG

*

*

*

A

*

* *

C

A 19 MAR 2007

G

G X LSD 2007

T

A

C

H

0

1

2

2

0

1

1 *

C

1 *

A

* 2 * 2 *

YG

1 *

A

1 * 1 *

C

A 19 MAR 2007

G

G LSD 2007

T

A

C

H

0

1

2

2

0

1

1 *

C

1 *

A

* 2 * 2 *

YG

1 *

A

1 * 1 *

C

A 19 MAR 2007

G

G LSD 2007

T

A

C

H

0

1

2

2

0

13

1 *

C

1 *

A

3 * 2 * 2 *

YG

1 *

A

1 * 1 *

C

A 19 MAR 2007

G

G LSD 2007

T

A

C

H

0

1

2

2

0

3

1 *

C

1 *

A

3 * 2 * 2 *

YG

1 *

A

1 * 1 *

C

A 19 MAR 2007

G

G LSD 2007

T

A

C

SO WE CAN MAINTAIN ARRAY H ON THE FLY

4 *

C

1 *

A

3 * 2 * 2 *

YG

1 *

A

1 * 1 *

C

A 19 MAR 2007

G

G LSD 2007

T

A

C

Adopting for FIG • The problem is Fact 1 doesn’t hold for FIG. • Why? – Because of the K constraint a continuing LCS may have to stop and start over again at a particular point.

19 MAR 2007

LSD 2007

1 *

T

*

*

G G T

1 * 2 * 2 *

T

1 * 1 * 1 *

T 19 MAR 2007

T

T LSD 2007

G

A

C

FIG: K = 1 1 *

T

*

*

G G T

1 * 2 * 2 *

T

1 * 1 * 1 *

T 19 MAR 2007

T

T LSD 2007

G

A

C

FIG: K = 1FACT 1 doesn’t 1 * 1 *

T

*

hold!

G G T

1 * 2 * 2 *

T

1 * 1 * 1 *

T 19 MAR 2007

T

T LSD 2007

G

A

C

FIG: K = 1 1 * 1 *

T

*

G G T

1 * 2 * 2 *

T

1 * 1 * 1 *

T 19 MAR 2007

T

T LSD 2007

G

A

C

FIG: K = 1 1 * 1 * 1 *

T

FACT 1 doesn’t hold!

G G T

1 * 2 * 2 *

T

1 * 1 * 1 *

T 19 MAR 2007

T

T LSD 2007

G

A

C

Maintaining H for FIG

• van Emde Boas (vEB) structure: • Maintain a sorted list of integers in the range [1..n] – Can perform in O(log log n) time • insertion of an item • Deletion of an item

– Can perform in constant time:

• next(i) (successor element of i in the list) • prev(i) (predecessor element of i in the list)

19 MAR 2007

LSD 2007

E_1:

{1, (1, 1)}

{1, (2, 1)}

E_2:

{1, (1, 2)}

{2, (2, 2)}

E_3:

{1, (1, 3)}

{2, (2, 3)}

H 6

G

5

T

4

C

3

C

2

T

1

T

-

1 (2, 1)

2 (2, 2)

2 (2, 3)







* *

*

*

K=2

1 * 2 * 2 * 1 * 1 * 1 * 19 MAR 2007

T 1

T 2

T G LSD 2007 3 4

A 5

A 6

E_1:

{1, (1, 1)}

{1, (2, 1)}

E_2:

{1, (1, 2)}

{2, (2, 2)}

E_3:

{1, (1, 3)}

{2, (2, 3)}

H 6

G

5

T

4

C

3

C

2

T

1

T

-

1 (2, 1)

Now consider row 5 Delete all entries from row 5 – K – 2 = 1 2 (2, 2)

2 (2, 3)







* *

*

*

K=2

1 * 2 * 2 * 1 * 1 * 1 * 19 MAR 2007

T 1

T 2

T G LSD 2007 3 4

A 5

A 6

E_1:

{1, (1, 1)}

{1, (2, 1)}

E_2:

{1, (1, 2)}

{2, (2, 2)}

E_3:

{1, (1, 3)}

{2, (2, 3)}

H 6

G

5

T

4

C

3

C

2

T

1

T

-

1 (2, 1)

Now consider row 5 Delete all entries from row 5 – K – 2 = 1 2 (2, 2)

2 (2, 3)







* *

*

*

K=2

1 * 2 * 2 * 1 * 1 * 1 * 19 MAR 2007

T 1

T 2

T G LSD 2007 3 4

A 5

A 6

E_1:

{1, (1, 1)}

{1, (2, 1)}

E_2:

{1, (1, 2)}

{2, (2, 2)}

E_3:

{1, (1, 3)}

{2, (2, 3)}

H 6

G

5

T

4

C

3

C

2

T

1

T

-

1 (2, 1)

2 (2, 2)

Max of E_1

*

*

2 (2, 3)







* *

K=2

1 * 2 * 2 * 1 * 1 * 1 * 19 MAR 2007

T 1

T 2

T G LSD 2007 3 4

A 5

A 6

E_1:

{1, (1, 1)}

{1, (2, 1)}

E_2:

{1, (1, 2)}

{2, (2, 2)}

E_3:

{1, (1, 3)}

{2, (2, 3)}

H 6

G

5

T

4

C

3

C

2

T

1

T

-

1 (2, 1)

*

2 (2, 2)

2 (2, 3)

Max of E_2

*







*

*

K=2

1 * 2 * 2 * 1 * 1 * 1 * 19 MAR 2007

T 1

T 2

T G LSD 2007 3 4

A 5

A 6

E_1:

{1, (1, 1)}

{1, (2, 1)}

E_2:

{1, (1, 2)}

{2, (2, 2)}

E_3:

{1, (1, 3)}

{2, (2, 3)}

H 6

G

5

T

4

C

3

C

2

T

1

T

-

1 (2, 1)

*

2 (2, 2)

*

2 (2, 3)







* Max of E_3

*

K=2

1 * 2 * 2 * 1 * 1 * 1 * 19 MAR 2007

T 1

T 2

T G LSD 2007 3 4

A 5

A 6

E_1:

{1, (1, 1)}

{1, (2, 1)}

E_2:

{1, (1, 2)}

{2, (2, 2)}

E_3:

{1, (1, 3)}

{2, (2, 3)}

H 6

G

5

T

4

C

3

C

2

T

1

T

-

1 (2, 1)

2 (2, 2)

2 (2, 3)







* *

*

*

K=2

1 * 2 * 2 * 1 * 1 * 1 * 19 MAR 2007

T 1

T 2

T G LSD 2007 3 4

A 5

A 6

E_1:

{1, (1, 1)}

{1, (2, 1)}

E_2:

{1, (1, 2)}

{2, (2, 2)}

E_3:

{1, (1, 3)}

{2, (2, 3)}

H 6

G

5

T

4

C

3

C

2

T

1

T

-

1 (2, 1)

{1, (5, 1)}

2 (2, 2)

2 (2, 3)







* 1 *

*

*

K=2

1 * 2 * 2 * 1 * 1 * 1 * 19 MAR 2007

T 1

T 2

T G LSD 2007 3 4

A 5

A 6

E_1:

{1, (1, 1)}

{1, (2, 1)}

{1, (5, 1)}

E_2:

{1, (1, 2)}

{2, (2, 2)}

{2, (5, 2)}

E_3:

{1, (1, 3)}

{2, (2, 3)}

H 6

G

5

T

4

C

3

C

2

T

1

T

-

1 (2, 1)

2 (2, 2)

2 (2, 3)







* 1 * 2 *

*

K=2

1 * 2 * 2 * 1 * 1 * 1 * 19 MAR 2007

T 1

T 2

T G LSD 2007 3 4

A 5

A 6

E_1:

{1, (1, 1)}

{1, (2, 1)}

{1, (5, 1)}

E_2:

{1, (1, 2)}

{2, (2, 2)}

{2, (5, 2)}

E_3:

{1, (1, 3)}

{2, (2, 3)}

{3, (5, 3)}

H 6

G

5

T

4

C

3

C

2

T

1

T

-

1 (2, 1)

2 (2, 2)

2 (2, 3)







* 1 * 2 * 3 *

K=2

1 * 2 * 2 * 1 * 1 * 1 * 19 MAR 2007

T 1

T 2

T G LSD 2007 3 4

A 5

A 6

E_1:

{1, (1, 1)}

{1, (2, 1)}

E_2:

{1, (1, 2)}

{2, (2, 2)}

E_3:

{1, (1, 3)}

{2, (2, 3)}

H 6

G

5

T

4

C

3

C

2

T

1

T

-

1 (2, 1)

{1, (5, 1)}

Now consider row 6 {2, (5, 2)} Delete all entries from row 6 – K – 2 = 2 {3, (5, 3)} 2 (2, 2)

2 (2, 3)







* 1 * 2 * 3 *

K=2

1 * 2 * 2 * 1 * 1 * 1 * 19 MAR 2007

T 1

T 2

T G LSD 2007 3 4

A 5

A 6

E_1:

{1, (1, 1)}

{1, (2, 1)}

{1, (5, 1)}

E_2:

{1, (1, 2)}

{2, (2, 2)}

{2, (5, 2)}

E_3:

{1, (1, 3)}

{2, (2, 3)}

{3, (5, 3)}

H 6

G

5

T

4

C

3

C

2

T

1

T

-

1 (5, 1)

2 (5, 2)

3 (5, 3)

Update H







* 1 * 2 * 3 *

K=2

1 * 2 * 2 * 1 * 1 * 1 * 19 MAR 2007

T 1

T 2

T G LSD 2007 3 4

A 5

A 6

E_1:

{1, (1, 1)}

{1, (2, 1)}

{1, (5, 1)}

E_2:

{1, (1, 2)}

{2, (2, 2)}

{2, (5, 2)}

E_3:

{1, (1, 3)}

{2, (2, 3)}

{3, (5, 3)}

H 6

G

5

T

4

C

3

C

2

T

1

T

-

1 (5, 1)

2 (5, 2)

3 (5, 3)







4 * 1 * 2 * 3 *

K=2

1 * 2 * 2 * 1 * 1 * 1 * 19 MAR 2007

T 1

T 2

T G LSD 2007 3 4

A 5

A 6

Running Time • We have changed the checking of (K+1)^2 area to (K+1) entries per match • We need to maintain vEB structure. – Every mach is inserted and deleted at most once

• O(R(K+1) + Rloglogn)

19 MAR 2007

LSD 2007

Now… • Remove K from the running time.

19 MAR 2007

LSD 2007

Range Maxima Query • Given an array of numbers for preprocessing • Query: Given a range find the maximum • Results: O(n) time preprocessing for O(1) time per query.

19 MAR 2007

LSD 2007

Apply RMQ • Preprocess Array H per row (Recall FACT 2) • Apply range maxima query on H per match. • We are done.

19 MAR 2007

LSD 2007

Running Time • O(n^2) for the preprocessing (O(n) per row) • O(R) work per match • O(Rloglogn preprocessing for M) • O(n^2 + Rloglogn)

19 MAR 2007

LSD 2007

LCS Algorithm • This gives us a new LCS algorithm of O(n^2 + R) • This can be improved to O(Rloglogn) or may be even better (not in this paper).

19 MAR 2007

LSD 2007

ELAG, RIFIG, RELAG • Some modifications and extension on the FIG algorithms.

19 MAR 2007

LSD 2007

Motivation & Application • Tackle degenerate strings in all these algorithms. Assuming DNA/RNA alphabet, this doesn’t increase running time. – This gives us an algorithm for finding longest common substring for degenrate alphabet (Put K = 1)

19 MAR 2007

LSD 2007

Motivation & Application • Extracting long multiple repeats in DNA using lossless filter requires computation of LCS with bounded span: – Our algorithm can give this LCS.

19 MAR 2007

LSD 2007

Motivation & Application • Introduction of the notion of gap constraint in LCS seemed to be appealing in its own right.

19 MAR 2007

LSD 2007

Latest Updates… • Improvements on FIG, ELAG, RIFIG, RELAG: – O(n+Rlog n) for FIG and ELAG – O(n+R) for RIFIG and RELAG Collaborators

Paper Under Review

Marcin Kubica and Tomasz Walen (Warsaw Univ.) 19 MAR 2007

LSD 2007

Latest Updates… • Using the concept of this paper, a new O(Rloglogn) time algorithm for the traditional LCS problem (Accepted at AAIM 2007)

19 MAR 2007

LSD 2007

Latest Updates… • Using the concept of this paper, a new O(Rploglogn) time algorithm for the Constrained LCS problem – Paper Under Review

19 MAR 2007

LSD 2007

THANK YOU

19 MAR 2007

LSD 2007

Related Documents

Lcs Isaac06 Lsd07
November 2019 2
Tpircs Lcs
October 2019 14
Lcs Assignment.pdf
December 2019 17
Lcs 2005
November 2019 15
Lcs-feb-09
December 2019 6
Lcs-dec-08
November 2019 7