Broad Network


GenomicRangeQuery at app.codility.com-programmers in C Plus Plus Explained - Find the minimal nucleotide from a range of sequence DNA

Foreword: Category of Article - Technology, Computers and Software, Algorithm

By: Chrysanthus Date Published: 29 Jan 2024

Task

Task score : 100% -  Correctness : 100% ; Performance : 100%
Detected time complexity : O(N + M)
Lesson 5 at app.codility.com/programmers
The solution and its explanation are given below.

Problem
A DNA sequence can be represented as a string consisting of the letters A, C, G and T, which correspond to the types of successive nucleotides in the sequence. Each nucleotide has an impact factor, which is an integer. Nucleotides of types A, C, G and T have impact factors of 1, 2, 3 and 4, respectively. You are going to answer several queries of the form: What is the minimal impact factor of nucleotides contained in a particular part of the given DNA sequence?

The DNA sequence is given as a non-empty string S = S[0]S[1]...S[N-1] consisting of N characters. There are M queries, which are given in non-empty arrays P and Q, each consisting of M integers. The K-th query (0 <= K < M) requires you to find the minimal impact factor of nucleotides contained in the DNA sequence between positions P[K] and Q[K] (inclusive).

For example, consider string S = CAGCCTA and arrays P, Q such that:

    P[0] = 2    Q[0] = 4
    P[1] = 5    Q[1] = 5
    P[2] = 0    Q[2] = 6

The answers to these M = 3 queries are as follows:

        The part of the DNA between positions 2 and 4 contains nucleotides G and C (twice), whose impact factors are 3 and 2 respectively, so the answer is 2.
        The part between positions 5 and 5 contains a single nucleotide T, whose impact factor is 4, so the answer is 4.
        The part between positions 0 and 6 (the whole string) contains all nucleotides, in particular nucleotide A whose impact factor is 1, so the answer is 1.

Write a function:

    vector<int> solution(string &S, vector<int> &P, vector<int> &Q);

that, given a non-empty string S consisting of N characters and two non-empty vectors P and Q consisting of M integers, returns a vector consisting of M integers specifying the consecutive answers to all queries.

Result vector should be returned as a vector of integers.

For example, given the string S = CAGCCTA and vectors P, Q such that:

    P[0] = 2    Q[0] = 4
    P[1] = 5    Q[1] = 5
    P[2] = 0    Q[2] = 6

the function should return the values [2, 4, 1], as explained above.

Write an efficient algorithm for the following assumptions:

        N is an integer within the range [1..100,000];
        M is an integer within the range [1..50,000];
        each element of vectors P and Q is an integer within the range [0..N - 1];
        P[K] <= Q[K], where 0 <= K < M;
        string S consists only of upper-case English letters A, C, G, T.

Note:
A slice is a sequence of elements that form a subset (subgroup) of the given vector (array).

Brute Force Approach

The program for the brute-force approach is as follows (read through the code):

    #include <iostream>
    #include <vector>
    #include <string>
    #include <unordered_map>
    using namespace std;

    vector<int> solution(string &S, vector<int> &P, vector<int> &Q) {
        int M = P.size();
        unordered_map<char, int> mp = {{'A', 1}, {'C', 2}, {'G', 3}, {'T', 4}};
        vector<int> answers(M, 0);

        for(int K=0; K<M; K++) {
            int min = 5;
            for (int i=P[K]; i<=Q[K]; i++) {    //scan the slice
                if (mp[S[i]] < min) {
                    min = mp[S[i]];
                    answers[K] = min;
                }
            }
        }

        return answers;
    }

    int main()
    {  
        vector<int> P{2, 5, 0};
        vector<int> Q{4, 5, 6};
        string str = "CAGCCTA";
        vector<int> ret = solution(str, P, Q);
        cout << ret[0] <<' '<< ret[1] <<' '<< ret[2] << endl;
    
        vector<int> P2{0};
        vector<int> Q2{0};
        string str2 = "CAGCCTA";
        vector<int> ret2 = solution(str2, P2, Q2);
        cout << ret2[0] <<' '<< ret2[1] <<' '<< ret2[2] << endl;
    
        return 0;
    }

The output is:

    2 4 1
    2 0 0

The scoring and detected time complexity for this solution() function, at app.codility.com/programmers is:

    Task score : 62% -  Correctness : 100% ; Performance : 0%
    Detected time complexity: O(N x M)

The correctness is 100% and the speed (time taken) is 0% (too slow). The reasons for the 0% performance (from app.codility.com/programmers) are given in the following table:

almost_all_same_lettersX TIMEOUT ERROR
GGGGGG..??..GGGGGG..??..GGGGGGKilled. Hard limit reached: 6.000 sec.
large_random
large random string, length
X TIMEOUT ERROR
Killed. Hard limit reached: 6.000 sec.
extreme_large
all max ranges
X TIMEOUT ERROR
Killed. Hard limit reached: 6.000 sec.

N x M as number of operations for the solution() function, can be compared (is similar) to N2 operations (especially when N = M). Imagine a grid of N columns by M rows where each cell in the grid represents an operation. N is for the length of the given string, S, and M is for the given number of queries. So, the total number of operations for the above solution() function, is N x M. Remember, it is the maximum possible number of operations that is considered, when dealing with time complexity.

Smart Solution

N + M number of operations is normally less than N x M number of operations. So, if the number of operations for the above solution() can be reduced from N x M to N + M, then the overall time taken for the solution() function to complete, would be shorter, (hence speed will be faster).

Multiplication (N x M) is continuous addition (continuous summing). Prefix-sums can reduce time complexity from O(N2) to O(N). So, prefix-sums can be used somewhere, somehow in the solution() function code, to enable the number of operations go down from (N x M) to (N + M). After all, this problem is on the lesson (tutorial) topic of prefix-sums.

If the slice sum from the prefix-sums for the number of occurrences of each nucleotide type, in the string, is obtained for each query, then the total number of operations can be reduced from (N x M) to (N + M). However, here, two dimensional array is needed and not one dimensional array.

In the problem example, N is 7 and M is 3. With that, 7 x 3 = 21 operations and N + M = 7 + 3 = 10 operations.

Illustration
The illustration is done here with the above sample case of S = CAGCCTA consisting of N = 7 characters and M = 3 queries, which are:

    P[0] = 2    Q[0] = 4
    P[1] = 5    Q[1] = 5
    P[2] = 0    Q[2] = 6

A two dimensional array of occurrences has to be created with all elements initialized to 0, as the following table shows:

Number of Occurrences of Nucleotide Types in given String, Cells initialized to 0
index01 (C)2 (A)3 (G)4 (C)5 (C)6 (T)7 (A)
nucleotide
0 (A)00000000
1 (C)00000000
2 (G)00000000
3 (T)00000000

The code segment to produce this two dimensional array with all elements initialized to 0 is:

    int occurrences[4][N+1];
    for (int j=0; j<4; j++) {
     for (int i=0; i<N+1; i++) {
         occurrences[j][i] = 0;
     }
    }

The time of execution for this code segment is not included in the overall time complexity. The first column of zeros is the prefix sums preceding zero.

Type A nucleotide has an impact factor of 1. However, in the table its row index is 0 = 1 - 1. Type C nucleotide has an impact factor of 2. However, in the table its row index is 1 = 2 - 1. Type G nucleotide has an impact factor of 3. However, in the table its row index is 2 = 3 - 1. Type T nucleotide has an impact factor of 4. However, in the table its row index is 3 = 4 - 1.

Next in the program, the number of occurrences of each nucleotide type in the string, S = "CAGCCTA" has to be put into the two dimensional array (table). The table becomes:

Number of Occurrences of Nucleotides in a 2D Array
index01 (C)2 (A)3 (G)4 (C)5 (C)6 (T)7 (A)
nucleotide
0 (A)00100001
1 (C)01001100
2 (G)00010000
3 (T)00000010

'A' occurs 1 time at index 1 (zero based indexing) in the given string, but occurs at index 2 = 1+1 above. 'A' also occurs once at index 6 in the given string, but occurs at index 7 = 6+1 above. C occurs 1 time at index 0 in the given string, 1 time at index 3 in the given string, and 1 time at index 4 in the given string. However, C occurs 1 time at index 1 = 0+1 above, 1 time at index 4 = 3+1 above, and 1 time at index 5 = 4+1 above. G occurs 1 time at index 2 in the given string, but occurs at index 3 = 2+1 above. T occurs 1 time at index 5 in the given string, but occurs at index 6 = 5+1 above. Every other cell remains at the initialized value of 0.

The code segment to produce the above 2D array is:

    for (int i=0; i < N+1; i++) {
        if(S[i] == 'A') occurrences[0][i+1] = 1;
        else if(S[i] == 'C') occurrences[1][i+1] = 1;
        else if(S[i] == 'G') occurrences[2][i+1] = 1;
        else if(S[i] == 'T') occurrences[3][i+1] = 1;
    }

The subscript, [i+1] is to compensate for the prefix sums preceding zero. This affects the sub-scripting of the slice sums - see below.

Next in the program, the prefix sums for the number of occurrences of each nucleotide type for the string, S = "CAGCCTA" are obtained for each row. The 2D table (array) becomes:

Prefix Sums for Number of Occurrences of Nucleotides in a 2D Array
index01 (C)2 (A)3 (G)4 (C)5 (C)6 (T)7 (A)
nucleotide
0 (A)00111112
1 (C)01112333
2 (G)00011111
3 (T)00000011

The prefix sums is not done the conventional way here. Here, it is done within the row. The code segment for this table is:

    for (int j=0; j<4; j++) {
        for (int i=1; i<N+1; i++) {
            occurrences[j][i] = occurrences[j][i] + occurrences[j][i-1];    //prefix summing; equivalent to occurrences[j][i] += occurrences[j][i-1];
        }
    }

Slice Sum

The normal way to obtain a slice sum is as follows: Subtract the prefix sum at the beginning of the slice, from the prefix sum just after the slice. This is done for each row in the above final 2D array, per query, for the problem.

Consider the query: P[0] = 2 . . . Q[0] = 4:
If A occurs at least once, then since A has the smallest impact factor, the minimal impact factor is 1. If the slice sum for A in this query is greater than 0, then it means A occurs at least once. From the above table, the slice sum for A in this interval is 0 = 1 - 1 (in row 0). So A does not occur at all in this portion (part) of the string, S = "CAGCCTA".

The impact factors are as follows: A= 1; C=2; G=3; T=4. Now that A is not found in the range P[0] = 2 . . . Q[0] = 4, in the string , S = "CAGCCTA", if C is found in that range, then 2 for C will be the minimal impact factor, since 3 for G and 4 for T are higher impact factors.

Still on the query: P[0] = 2 . . . Q[0] = 4: if C occurs at least once, then since C has the bigger impact factor after A, the minimal impact factor will be 2. This assumes that A did not occur at all, in the range. If the slice sum for C in this query is greater than 0, then it means C occurs at least once. From the above table, the slice sum for C in this interval is 2 = 3 - 1 (in row 1). Two is greater than 0. So C occurs at least once, in this portion of the string, S = "CAGCCTA". And the answer for the query is 2 for C = 2. Here, the impact factor for C being equal to 2 and the slice sum being equal to 2 is coincidence.

Consider the query: P[1] = 5 . . . Q[1] = 5:
If A occurs at least once, then since A has the smallest impact factor, the minimal impact factor is 1. If the slice sum for A in this query is greater than 0, then it means A occurs at least once. From the above table, the slice sum for A in this interval is 0 = 1 - 1 (in row 0). So A does not occur at all in this portion of the string, S = "CAGCCTA".

Now that A is not found in the range P[0] = 2 . . . Q[0] = 4, in the string , S = "CAGCCTA", if C is found in that range, then 2 for C will be the minimal impact factor, since 3 for G and 4 for T are higher impact factors.

Still on the query: P[1] = 5 . . . Q[1] = 5, if C occurs at least once, then since C has the bigger impact factor after A, the minimal impact factor will be 2 for C. This assumes that A did not occur at all, in the range. If the slice sum for C in this query is greater than 0, then it means C occurs at least once. From the above table, the slice sum for C in this interval (inclusive) is 0 = 3 - 3 (in row 1). So C does not occur at all in this portion (part) of the string, S = "CAGCCTA".

Still on the query: P[1] = 5 . . . Q[1] = 5, if G occurs at least once, then since G has the bigger impact factor after C, the minimal impact factor will be 3 for G. This assumes that neither A nor C occurred at all, in the range. If the slice sum for G in this query is greater than 0, then it means G occurs at least once. From the above table, the slice sum for G in this interval is 0 = 1 - 1 (in row 2). So G does not occur at all in this portion of the string, S = "CAGCCTA".

Still on the query: P[1] = 5 . . . Q[1] = 5, if T occurs at least once, then since T has the bigger impact factor after G, the minimal impact factor will be 4 for T. This assumes that neither A nor C nor G occurred at all, in the range. If the slice sum for T in this query is greater than 0, then it means T occurs at least once. From the above table, the slice sum for T in this interval (inclusive) is 1 - 0 = 1 (in row 3). One is greater than 0. So T occurs at least once, in this portion of the string, S = "CAGCCTA". And the answer for the query is 4 for T = 4.

Consider the query: P[2] = 0 . . . Q[2] = 6:
If A occurs at least once, then since A has the smallest impact factor, the minimal impact factor will be 1. If the slice sum for A in this query is greater than 0, then it means A occurs at least once. From the above table, the slice sum for A in this interval is 2 - 0 = 2 (in row 0). So A occurs at least once, in this portion of the string, S = "CAGCCTA". And the answer for the query is 1 for A = 1.

The code segment for this slice sum is:

    for (int K=0; K<M; K++){
        if (occurrences[0][Q[K]+1] - occurrences[0][P[K]] > 0) answers[K] = 1;  //for A
        else if (occurrences[1][Q[K]+1] - occurrences[1][P[K]] > 0) answers[K] = 2;  //for C
        else if (occurrences[2][Q[K]+1] - occurrences[2][P[K]] > 0) answers[K] = 3;  //for G
        else if (occurrences[3][Q[K]+1] - occurrences[3][P[K]] > 0) answers[K] = 4;  //for T
    }

Complete Program

The complete program for this task (problem) is:

#include <iostream>
#include <vector>
#include <string>
using namespace std;

vector<int> solution(string &S, vector<int> &P, vector<int> &Q) {
    int N = S.size();
    int M = P.size();
    vector<int> answers(M);

    int occurrences[4][N+1];

    for (int j=0; j<4; j++) {
      for (int i=0; i<N+1; i++) {
          occurrences[j][i] = 0;
      }
    }

    for (int i=0; i < N+1; i++) {
        if (S[i] == 'A') occurrences[0][i+1] = 1;
        else if (S[i] == 'C') occurrences[1][i+1] = 1;
        else if (S[i] == 'G') occurrences[2][i+1] = 1;
        else if (S[i] == 'T') occurrences[3][i+1] = 1;
    }

    for (int j=0; j<4; j++) {
        for (int i=1; i<N+1; i++) {
            occurrences[j][i] = occurrences[j][i] + occurrences[j][i-1];
        }
    }
    
    for (int K=0; K<M; K++){
        if (occurrences[0][Q[K]+1] - occurrences[0][P[K]] > 0) answers[K] = 1;
        else if(occurrences[1][Q[K]+1] - occurrences[1][P[K]] > 0) answers[K] = 2;
        else if(occurrences[2][Q[K]+1] - occurrences[2][P[K]] > 0) answers[K] = 3;
        else if(occurrences[3][Q[K]+1] - occurrences[3][P[K]] > 0) answers[K] = 4;
    }

    return answers;
}

int main()
{
    vector<int> P{2, 5, 0};
    vector<int> Q{4, 5, 6};
    string str = "CAGCCTA";
    vector<int> ret = solution(str, P, Q);
    cout << ret[0] <<' '<< ret[1] <<' '<< ret[2] << endl;
    
    vector<int> P2{0};
    vector<int> Q2{0};
    string str2 = "CAGCCTA";
    vector<int> ret2 = solution(str2, P2, Q2);
    cout << ret2[0] <<' '<< ret2[1] <<' '<< ret2[2] << endl;
    
    return 0;
}

The output is:

    2 4 1
    2 0 0

At app.codility.com/programmers, the scoring and detected time complexity for this smart approach is:

    Task score : 100% -  Correctness : 100% ; Performance : 100%
    Detected time complexity: O(N+M)

where N is for the number of characters in the string, and M is the number of queries.

Conclusion

Use Prefix Sums, but not very directly, and in some two-dimensional format.




Related Links

More Related Links

Cousins

BACK NEXT

Comments