Member Search

Problem Statement

Sequence alignment is a well-studied task that comes in many flavors. One
problem involves determining how various pieces of DNA combine to form longer
sequences which are then used for various purposes. In particular, we have
three sets of sequences: A, B and C. By selecting sequences a, b, and c from
A, B and C, we can append them to achieve a longer sequences abc. However,
this process is not without complexities, and during the appending process a
number of insertions, deletions and mutations might be made, making it
difficult to tell what the original a, b and c were.

Implementation Details

You will be given String[]'s A, B and C representing the sets
mentioned above. You will also be given a number of query strings. For each
query, you must determine which sequence abc most closely matches the query
string, where closely is defined by Levenshtein
distance. You should return a int[] with three element for each query string. Thus, ret[i*3] gives the index of the element of A to use for query i (indexed from 0), while element
ret[i*3+1] gives the element of B and
ret[i*3+2] gives the element of C.

Test Generation

The query strings will be formed by selecting one string from each set. For
each of the three selected strings, two independent random integers
d1 and d2 between 0 and 10, inclusive will be chosen and
d1 characters will be deleted from the front of the string and
d2 characters will be deleted from the end. Next, four integers
(i1 ... i4) will be chosen by sampling from an exponential
distribution with lambda = 0.3, and then taking the floor of the sampled
value. i1 random characters will be inserted before a,
i2 random characters between a and b, i3 random
characters between b and c, and i4 random characters after c.
Finally, each character will be changed to a different character with
probability 0.1.

Each of the sets A, B and C will be randomly (and independently) selected from one of four biological datasets. Thus, there are 64 possible values for the sets (4 for each A, B and C).

Scoring

Your solution should find abc triples that are close to queries
quickly. For a set of queries, we find the total Levenshtein distance summed
over all queries, DIST. The total runtime over all queries in the set is
TIME. Your score will then be DIST * 0.01 + ln(TIME). Your overall score will be the sum over all test cases of 1,000,000/score.

Notes

If the total number of characters to be deleted from a, b or c exceeds the length of the string, the whole string is deleted.

-

Aside from the first 5 tests, each test has N = 100,000 queries.

-

The time limit is 30 seconds and the memory limit is 1024M.

-

There are 64 tests -- one for each choice of A, B and C.

-

The final tests will use the same possibilities for A, B and C, which can be seen in the source code of the offline tester.

Examples

0)

seed = 1
N = 10
ABC sets = (1, 3, 0)

The set A used in this test is that with index 1, the set B is that with index 3, and set C has index 0.

1)

seed = 2
N = 100
ABC sets = (0, 3, 1)

2)

seed = 3
N = 1000
ABC sets = (2, 0, 1)

3)

seed = 4
N = 10000
ABC sets = (2, 2, 3)

4)

seed = 5
N = 100000
ABC sets = (3, 2, 3)

5)

seed = 6
N = 100000
ABC sets = (2, 1, 3)

6)

seed = 7
N = 100000
ABC sets = (3, 0, 1)

7)

seed = 8
N = 100000
ABC sets = (3, 0, 1)

8)

seed = 9
N = 100000
ABC sets = (2, 0, 2)

9)

seed = 10
N = 100000
ABC sets = (1, 1, 0)

This problem statement is the exclusive and proprietary property of TopCoder, Inc. Any unauthorized use or reproduction of this information without the prior written consent of TopCoder, Inc. is strictly prohibited. (c)2010, TopCoder, Inc. All rights reserved.