Dynamic Programming | Set 4 (Longest Common Subsequence)

We have discussed the Overlapping Subproblems and Optimal Substructure properties in Set 1 and Set 2 respectively. We also discussed one example problem in Set 3. Let us discuss the Longest Common Subsequence (LCS) problem as one more example problem that can be solved using Dynamic Programming.

LCS Problem Statement: Given two sequences, find the length of the longest subsequence present in both of them. A subsequence is a sequence that appears in the same relative order, but not necessarily contiguously. For example, “abc”, “abg”, “bdf”, “aeg”, “acefg”, etc. are subsequences of “abcdefg”. A string of length n has 2^n possible subsequences (counting the empty one), since each character is either kept or dropped.

It is a classic computer science problem, the basis of diff (a file comparison program that outputs the differences between two files), and has applications in bioinformatics.

Examples:
LCS for input Sequences “ABCDGH” and “AEDFHR” is “ADH” of length 3.
LCS for input Sequences “AGGTAB” and “GXTXAYB” is “GTAB” of length 4.

The naive solution for this problem is to generate all subsequences of both given sequences and find the longest matching subsequence. This solution is exponential in terms of time complexity. Let us see how this problem possesses both important properties of a Dynamic Programming (DP) problem.

1) Optimal Substructure:
Let the input sequences be X[0..m-1] and Y[0..n-1] of lengths m and n respectively. And let L(X[0..m-1], Y[0..n-1]) be the length of LCS of the two sequences X and Y. Following is the recursive definition of L(X[0..m-1], Y[0..n-1]).

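If the last characters of both sequences match (i.e., X[m-1] == Y[n-1]), then
L(X[0..m-1], Y[0..n-1]) = 1 + L(X[0..m-2], Y[0..n-2])

If the last characters of both sequences do not match (i.e., X[m-1] != Y[n-1]), then
L(X[0..m-1], Y[0..n-1]) = MAX ( L(X[0..m-2], Y[0..n-1]), L(X[0..m-1], Y[0..n-2]) )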

Examples:
1) Consider the input strings “AGGTAB” and “GXTXAYB”. Last characters match for the strings. So length of LCS can be written as:
L(“AGGTAB”, “GXTXAYB”) = 1 + L(“AGGTA”, “GXTXAY”)

2) Consider the input strings “ABCDGH” and “AEDFHR”. Last characters do not match for the strings. So length of LCS can be written as:
L(“ABCDGH”, “AEDFHR”) = MAX ( L(“ABCDG”, “AEDFHR”), L(“ABCDGH”, “AEDFH”) )

So the LCS problem has optimal substructure property as the main problem can be solved using solutions to subproblems.
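
The recursive definition above can be sketched directly in Python. This is a minimal illustration rather than the article's own listing; the function name lcs_naive and its argument order are assumptions made for the example.

```python
def lcs_naive(X, Y, m, n):
    """Length of the LCS of X[0..m-1] and Y[0..n-1] by plain recursion."""
    # An empty prefix has no common subsequence with anything.
    if m == 0 or n == 0:
        return 0
    # Last characters match: count the match and shrink both prefixes.
    if X[m - 1] == Y[n - 1]:
        return 1 + lcs_naive(X, Y, m - 1, n - 1)
    # Last characters differ: drop one character from either side, keep the better.
    return max(lcs_naive(X, Y, m - 1, n), lcs_naive(X, Y, m, n - 1))


print(lcs_naive("AGGTAB", "GXTXAYB", 6, 7))  # 4
print(lcs_naive("ABCDGH", "AEDFHR", 6, 6))   # 3
```

The two calls reproduce the lengths given in the examples at the top of the article.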

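2) Overlapping Subproblems:
The naive recursive approach that follows directly from this definition takes exponential time in the worst case, with O(2^(m+n)) as an upper bound. The worst case happens when all characters of X and Y mismatch, i.e., the length of the LCS is 0.
For this recursive approach, the following is a partial recursion tree for the input strings “AXYT” and “AYZX”:

                         lcs(“AXYT”, “AYZX”)
                        /                   \
          lcs(“AXY”, “AYZX”)             lcs(“AXYT”, “AYZ”)
          /              \                /              \
lcs(“AX”, “AYZX”)  lcs(“AXY”, “AYZ”)  lcs(“AXY”, “AYZ”)  lcs(“AXYT”, “AY”)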

In the above partial recursion tree, lcs(“AXY”, “AYZ”) is solved twice. If we draw the complete recursion tree, we can see that many subproblems are solved again and again. So this problem has the Overlapping Subproblems property, and recomputation of the same subproblems can be avoided by using either Memoization or Tabulation. Following is a tabulated implementation for the LCS problem.
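
A minimal tabulated sketch in Python, assuming the table name L and the (m+1) x (n+1) layout that the discussion below refers to; the function name lcs_length is hypothetical. L[i][j] holds the LCS length of X[0..i-1] and Y[0..j-1], and row 0 and column 0 stay 0 for the empty prefix.

```python
def lcs_length(X, Y):
    """Length of the LCS of X and Y, filled bottom-up in O(m*n) time."""
    m, n = len(X), len(Y)
    # L[i][j] = LCS length of X[0..i-1] and Y[0..j-1]; row/column 0 stay 0.
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if X[i - 1] == Y[j - 1]:
                # Matching last characters extend the LCS of both shorter prefixes.
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                # Otherwise take the better of dropping a character from X or Y.
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L[m][n]


print(lcs_length("AGGTAB", "GXTXAYB"))  # 4
print(lcs_length("AXYT", "AYZX"))       # 2
```

Each of the (m+1) x (n+1) cells is computed once, so every overlapping subproblem from the recursion tree is solved a single time.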

Your solution does not discriminate between characters from the same string and characters from different strings.

AMIT

Shouldn’t it be
if(x[i-1]==y[j-1])
l[i][j]=max(l[i-1][j],l[i][j-1],l[i-1][j-1]+1);
instead of
L[i][j] = L[i-1][j-1] + 1;
Will that make any difference?
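
One way to check this is to fill the table with both update rules and compare the results. The sketch below is an illustration only; lcs_table and its three_way flag are hypothetical names, not part of the article's code.

```python
def lcs_table(X, Y, three_way):
    """Fill the LCS table; three_way selects the three-argument max on a match."""
    m, n = len(X), len(Y)
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if X[i - 1] == Y[j - 1]:
                if three_way:
                    # Rule proposed in the question.
                    L[i][j] = max(L[i - 1][j], L[i][j - 1], L[i - 1][j - 1] + 1)
                else:
                    # Rule used in the standard recurrence.
                    L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L


pairs = [("AGGTAB", "GXTXAYB"), ("ABCDGH", "AEDFHR"), ("BDCABA", "ABCBDAB")]
for X, Y in pairs:
    # Both rules produce identical tables.
    print(lcs_table(X, Y, True) == lcs_table(X, Y, False))
```

The tables agree because adding one character to either prefix can raise the LCS length by at most 1, so L[i-1][j] and L[i][j-1] never exceed L[i-1][j-1] + 1. The extra max terms are harmless but redundant.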

goyalmnl

We can also do this using the Set 3 approach. Let arr[i] be the LCS of the sequences string1[i] and string2[n-1], including string1[i]. We also store the position of string1[i] in string2 at each arr[i]. Now, for any j > i, first check which entries have the position of string1[j] in string2 greater than that of string1[i], and from these pick the one with maximum length.
Finally, the one with the largest length (the largest value of arr) is the answer.
Please correct me if I am wrong.

Prateek Sharma

Python code that does not use the dynamic programming approach. It generates all longest common subsequences of two strings, if any exist. Note that it enumerates every subsequence of both strings, so its running time is exponential rather than O(n).
Approach:
1) Compute all subsequences of both strings using a normal generating algorithm (no need to use recursion here). Each subsequence keeps the same order of elements as the original string.
2) Compare all subsequences of both strings with each other and find the longest common one. That’s the answer.

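def longestCommonSubsequence(storingList, storingList1):
    # storingList and storingList1 hold every subsequence of each string.
    # Collect the subsequences that occur in both lists.
    stringList = []
    for i in storingList:
        for j in storingList1:
            if i == j:
                stringList.append(i)
    # Length of the longest common subsequence found.
    maxLen = 0
    for i in stringList:
        if len(i) > maxLen:
            maxLen = len(i)
    # Keep each longest common subsequence only once.
    tempList = []
    for i in stringList:
        if len(i) == maxLen and i not in tempList:
            tempList.append(i)
    for i in tempList:
        print("".join(i))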

I agree George. It will be 2^n-1 because the number of ways of choosing 0 characters from n (which is 1 way) is also included in 2^n. That is not a valid subsequence. I would request the authors to please update this.

Sandeep

Can we straight away remove the characters that are present in the first input sequence X but not in the second input sequence, and vice versa?
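
Such a filtering step can be sketched as below; strip_unshared is a hypothetical helper name. A character that occurs in only one sequence can never be part of a common subsequence, so dropping it does not change the LCS, but the DP is still needed on what remains.

```python
def strip_unshared(X, Y):
    """Drop from each string the characters that never occur in the other."""
    shared = set(X) & set(Y)
    # Relative order of the surviving characters is preserved.
    return ("".join(c for c in X if c in shared),
            "".join(c for c in Y if c in shared))


print(strip_unshared("ABCDGH", "AEDFHR"))   # ('ADH', 'ADH')
print(strip_unshared("AGGTAB", "GXTXAYB"))  # ('AGGTAB', 'GTAB')
```

The filter runs in linear time and can shrink the table, though in the worst case (both strings over the same alphabet) it removes nothing.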

Can anyone write a function to print all Longest Common Subsequences with the same max length?
For example, if the strings are
s1 = "BDCABA";
s2 = "ABCBDAB";
the max length of the LCS is 4 and there are three possible sequences:
BDAB, BCAB and BCBA

Shaanu

Can you please provide and explain the suffix tree or suffix array implementation, as it would cost O(n+m) to find the length of the LCS?

sparco

The space for the first row and column can be optimised in the code.
Instead of L[m+1][n+1], L[m][n] can be used.

@Ujjwal W: Thanks for pointing this out. We have corrected the typo. Keep it up!!

http://www.linkedin.com/in/ramanawithu Venki

@Sandeep, the title is misleading. The program prints only the length of the LCS, not the LCS itself.

The table approach is easier to understand if we print the table status after each run of the inner loop (initially fill the table with zeros or -1). Simply print the 2D matrix after every inner loop. It would help DP beginners see how the algorithm builds the table.
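
The suggestion above can be sketched as follows, assuming a zero-filled table; lcs_trace is a hypothetical name used only for this illustration.

```python
def lcs_trace(X, Y):
    """Tabulated LCS length that prints the table after each inner loop."""
    m, n = len(X), len(Y)
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if X[i - 1] == Y[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
        # One full pass of the inner loop has filled row i; show the table.
        print(f"after row {i} ({X[i - 1]}):")
        for row in L:
            print(" ".join(str(v) for v in row))
    return L[m][n]


lcs_trace("AXYT", "AYZX")
```

Rows below the current one still show their initial zeros, which makes it easy to see the table being filled one row at a time.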

Here I have written a backtrack function to print the LCS string. However, it prints only one of the (possibly many) strings of the same length. Can someone modify it to print all the strings with the same length?
In the code, c1 and c2 are the respective lengths of the strings and mat is a 2D matrix of size (c1+1)*(c2+1).
This function is called after LCS populates the matrix.
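
A backtrack of this kind might look like the sketch below. It is not the commenter's code; the names mat, c1 and c2 follow the description above, while build_table and backtrack are hypothetical function names.

```python
def build_table(s1, s2):
    """Populate the (c1+1) x (c2+1) LCS length matrix."""
    c1, c2 = len(s1), len(s2)
    mat = [[0] * (c2 + 1) for _ in range(c1 + 1)]
    for i in range(1, c1 + 1):
        for j in range(1, c2 + 1):
            if s1[i - 1] == s2[j - 1]:
                mat[i][j] = mat[i - 1][j - 1] + 1
            else:
                mat[i][j] = max(mat[i - 1][j], mat[i][j - 1])
    return mat


def backtrack(mat, s1, s2):
    """Walk back from mat[c1][c2] and recover one LCS string."""
    i, j = len(s1), len(s2)
    out = []
    while i > 0 and j > 0:
        if s1[i - 1] == s2[j - 1]:
            # This character is part of the LCS; move diagonally.
            out.append(s1[i - 1])
            i -= 1
            j -= 1
        elif mat[i - 1][j] >= mat[i][j - 1]:
            i -= 1  # ties go upward, which picks one of several LCSs
        else:
            j -= 1
    return "".join(reversed(out))


s1, s2 = "AGGTAB", "GXTXAYB"
print(backtrack(build_table(s1, s2), s1, s2))  # GTAB
```

The tie-breaking rule in the elif branch is what limits the output to a single string; following both directions on a tie is what yields every LCS.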

Note1: This code prints an LCS multiple times if multiple paths lead to the same LCS. One dirty way to fix this is to store all the LCSs first and then print only the unique ones. If you have a better idea for avoiding the duplicate output, please reply.

Note2: The recursive function below can be memoized, as it solves the same sub-problems multiple times.
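
Both notes can be addressed in one sketch: returning sets of strings removes duplicates, and caching on (i, j) memoizes the recursion. This is an assumed reconstruction, not the commenter's function; all_lcs and walk are hypothetical names.

```python
from functools import lru_cache


def all_lcs(s1, s2):
    """Return every distinct LCS of s1 and s2, sorted alphabetically."""
    c1, c2 = len(s1), len(s2)
    mat = [[0] * (c2 + 1) for _ in range(c1 + 1)]
    for i in range(1, c1 + 1):
        for j in range(1, c2 + 1):
            if s1[i - 1] == s2[j - 1]:
                mat[i][j] = mat[i - 1][j - 1] + 1
            else:
                mat[i][j] = max(mat[i - 1][j], mat[i][j - 1])

    @lru_cache(maxsize=None)  # Note2: each (i, j) is expanded only once
    def walk(i, j):
        if i == 0 or j == 0:
            return frozenset([""])
        if s1[i - 1] == s2[j - 1]:
            # Every LCS of these prefixes ends with the matching character.
            return frozenset(s + s1[i - 1] for s in walk(i - 1, j - 1))
        found = set()  # Note1: a set keeps each string once
        if mat[i - 1][j] >= mat[i][j - 1]:
            found |= walk(i - 1, j)
        if mat[i][j - 1] >= mat[i - 1][j]:
            found |= walk(i, j - 1)
        return frozenset(found)

    return sorted(walk(c1, c2))


print(all_lcs("BDCABA", "ABCBDAB"))  # ['BCAB', 'BCBA', 'BDAB']
```

The number of distinct LCSs can itself be exponential, so memoization bounds the number of subproblems but not the size of the output.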