GenomicRangeQuery at app.codility.com/programmers in C++ Explained: Find the minimal nucleotide from a range of sequence DNA.
Task score : 100% - Correctness : 100% ; Performance : 100%
Detected time complexity : O(N + M)
Lesson 5 at app.codility.com/programmers
The solution and its explanation are given below.
Category of Article : Technology | Computers and Software | Algorithm
By: Chrysanthus Date Published: 28 May 2025
Problem
A DNA sequence can be represented as a string consisting of the letters A, C, G and T, which correspond to the types of successive nucleotides in the sequence. Each nucleotide has an impact factor, which is an integer. Nucleotides of types A, C, G and T have impact factors of 1, 2, 3 and 4, respectively. You are going to answer several queries of the form: What is the minimal impact factor of nucleotides contained in a particular part of the given DNA sequence?
The DNA sequence is given as a non-empty string S = S[0]S[1]...S[N-1] consisting of N characters. There are M queries, which are given in non-empty arrays P and Q, each consisting of M integers. The K-th query (0 ≤ K < M) requires you to find the minimal impact factor of nucleotides contained in the DNA sequence between positions P[K] and Q[K] (inclusive).
For example, consider string S = CAGCCTA and arrays P, Q such that:
P[0] = 2 Q[0] = 4 P[1] = 5 Q[1] = 5 P[2] = 0 Q[2] = 6
The answers to these M = 3 queries are as follows:
The part of the DNA between positions 2 and 4 contains nucleotides G and C (twice), whose impact factors are 3 and 2 respectively, so the answer is 2. The part between positions 5 and 5 contains a single nucleotide T, whose impact factor is 4, so the answer is 4. The part between positions 0 and 6 (the whole string) contains all nucleotides, in particular nucleotide A whose impact factor is 1, so the answer is 1.
Write a function:
vector<int> solution(string &S, vector<int> &P, vector<int> &Q);
that, given a non-empty string S consisting of N characters and two non-empty vectors P and Q consisting of M integers, returns a vector consisting of M integers specifying the consecutive answers to all queries.
Result vector should be returned as a vector of integers.
For example, given the string S = CAGCCTA and vectors P, Q such that:
P[0] = 2 Q[0] = 4 P[1] = 5 Q[1] = 5 P[2] = 0 Q[2] = 6the function should return the values [2, 4, 1], as explained above.
Write an efficient algorithm for the following assumptions:
N is an integer within the range [1..100,000]; M is an integer within the range [1..50,000]; each element of vectors P and Q is an integer within the range [0..N - 1]; P[K] ≤ Q[K], where 0 ≤ K < M; string S consists only of upper-case English letters A, C, G, T.
Note:
A slice is a sequence of elements that form a subset (subgroup) of the given vector (array).
Brute Force Approach
The program for the brute-force approach is as follows (read through the code):
#include <iostream> #include <vector> #include <string> #include <unordered_map> using namespace std; vector<int> solution(string &S, vector<int> &P, vector<int> &Q) { int M = P.size(); unordered_map<char, int> mp = {{'A', 1}, {'C', 2}, {'G', 3}, {'T', 4}}; vector<int> answers(M, 0); for(int K=0; K<M; K++) { int min = 5; for (int i=P[K]; i<=Q[K]; i++) { //scan the slice if (mp[S[i]] < min) { min = mp[S[i]]; answers[K] = min; } } } return answers; } int main() { vector<int> P{2, 5, 0}; vector<int> Q{4, 5, 6}; string str = "CAGCCTA"; vector<int> ret = solution(str, P, Q); cout << ret[0] <<' '<< ret[1] <<' '<< ret[2] << endl; vector<int> P2{0}; vector<int> Q2{0}; string str2 = "CAGCCTA"; vector<int> ret2 = solution(str2, P2, Q2); cout << ret2[0] <<' '<< ret2[1] <<' '<< ret2[2] << endl; return 0; }
The output is:
2 4 1 2 0 0
The scoring and detected time complexity for this solution() function, at app.codility.com/programmers is:
Task score : 62% - Correctness : 100% ; Performance : 0% Detected time complexity: O(N x M)
The correctness is 100% and the speed (time taken) is 0% (too slow). The reasons for the 0% performance (from app.codility.com/programmers) are given in the following table:
almost_all_same_letters GGGGGG..??..GGGGGG..??..GGGGGG | ✘ TIMEOUT ERROR Killed. Hard limit reached: 6.000 sec. |
large_random large random string, length | ✘ TIMEOUT ERROR Killed. Hard limit reached: 6.000 sec. |
extreme_large all max ranges | ✘ TIMEOUT ERROR Killed. Hard limit reached: 6.000 sec. |
N x M as number of operations for the solution() function, can be compared (is similar) to N2 operations (especially when N = M). Imagine a grid of N columns by M rows where each cell in the grid represents an operation. N is for the length of the given string S, and M is for the given number of queries. So, the total number of operations for the above solution() function, is N x M. Remember, it is the maximum possible number of operations that is considered, when dealing with time complexity.
Smart Solution
N + M number of operations is normally less than N x M number of operations. So, if the number of operations for the above solution() can be reduced from N x M to N + M, then the overall time taken for the solution() function to complete, would be shorter, (hence speed will be faster).
Multiplication (N x M) is continuous addition (continuous summing). Prefix-sums can reduce time complexity from O(N2) to O(N). So, prefix-sums can be used somewhere, somehow in the solution() function code, to enable the number of operations go down from (N x M) to (N + M). After all, this problem is on the lesson (tutorial) topic of prefix-sums.
If the slice sum from the prefix-sums for the single occurrence of each nucleotide type, in the string, is obtained for each query, then the total number of operations can be reduced from (N x M) to (N + M). However, here, two dimensional array is needed and not one dimensional array.
In the problem example, N is 7 and M is 3. With that, 7 x 3 = 21 operations and N + M = 7 + 3 = 10 operations.
Illustration
The illustration is done here with the above sample case of S = CAGCCTA consisting of N = 7 characters and M = 3 queries, which are:
P[0] = 2 Q[0] = 4 P[1] = 5 Q[1] = 5 P[2] = 0 Q[2] = 6
A two dimensional array of single occurrence has to be created with all elements initialized to 0, as the following table shows:
Single Occurrence of Nucleotide Type in given String, Cells initialized to 0 | |||||||
Column index/Row index | 0 (C) | 1 (A) | 2 (G) | 3 (C) | 4 (C) | 5 (T) | 6 (A) |
0 (A) | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 (C) | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 (G) | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 (T) | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
The code segment to produce this two dimensional array with all elements initialized to 0 is:
int occurrences[4][N]; for (int j=0; j<4; j++) { for (int i=0; i<N; i++) { occurrences[j][i] = 0; } }
The time of execution for this code segment is not included in the overall time complexity. The column for the prefix sums preceding zero is implicit (not in the table).
Type A nucleotide has an impact factor of 1. However, in the table its row index is 0 = 1 - 1. Type C nucleotide has an impact factor of 2. However, in the table its row index is 1 = 2 - 1. Type G nucleotide has an impact factor of 3. However, in the table its row index is 2 = 3 - 1. Type T nucleotide has an impact factor of 4. However, in the table its row index is 3 = 4 - 1.
Next in the program, the single occurrence of each nucleotide type in the string, S = "CAGCCTA" has to be put into the two dimensional array (table). The table becomes:
Single Occurrence of Nucleotide Type in a 2D Array | |||||||
Column index/Row index | 0 (C) | 1 (A) | 2 (G) | 3 (C) | 4 (C) | 5 (T) | 6 (A) |
0 (A) | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
1 (C) | 1 | 0 | 0 | 1 | 1 | 0 | 0 |
2 (G) | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
3 (T) | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
A occurs 1 time at index 1 (zero based indexing) in the given string, and is represented 1 time at index 1 above (row 0). ‘A’ also occurs once at index 6 in the given string, and is represented once at index 6 above (row 0). C occurs 1 time at index 0 in the given string, 1 time at index 3 in the given string, and 1 time at index 4 in the given string. Then, C is represented 1 time at index 0 above (row 1), 1 time at index 3 above (row 1), and 1 time at index 4 above (row 1). G occurs 1 time at index 2 in the given string, and is represented one time at index 2 above (row 2). T occurs 1 time at index 5 in the given string, and is represented one time at index 5 above (row 3). Every other cell remains at the initialized value of 0. The zeros (column) for the prefix sums are implicit (not explicit).
The code segment to produce the above 2D array is:
for (int i=0; i < N; i++) { if(S[i] == 'A') occurrences[0][i] = 1; else if(S[i] == 'C') occurrences[1][i] = 1; else if(S[i] == 'G') occurrences[2][i] = 1; else if(S[i] == 'T') occurrences[3][i] = 1; }
The reader must note that this non-nested for-loop takes N time and not N x M time. However, app.codility.com/programmers has not really included this N time, in the overall time complexity.
Next in the program, the prefix sums for the single occurrence of each nucleotide type for the string, S = "CAGCCTA" are obtained for each row. The 2D table (array) becomes:
Prefix Sums for Single Occurrence of Nucleotide Type in a 2D Array | |||||||
Column index/Row index | 0 (C) | 1 (A) | 2 (G) | 3 (C) | 4 (C) | 5 (T) | 6 (A) |
0 (A) | 0 | 1 | 1 | 1 | 1 | 1 | 2 |
1 (C) | 1 | 1 | 1 | 2 | 3 | 3 | 3 |
2 (G) | 0 | 0 | 1 | 1 | 1 | 1 | 1 |
3 (T) | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
The code segment for this table is:
for (int j=0; j<4; j++) { for (int i=1; i<N; i++) { occurrences[j][i] = occurrences[j][i] + occurrences[j][i-1]; //prefix summing; equivalent to occurrences[j][i] += occurrences[j][i-1]; } }
Slice Sum
The normal way to obtain a slice sum is as follows: Subtract the prefix sum at the beginning of the slice, from the prefix sum just after the slice. This should be done for each row in the above final 2D array, per query, for the problem. However, because of the unconventional way of doing the prefix sums for this problem, the prefix sum just before the beginning of the slice, is subtracted from the prefix sum at the end of the slice (it is still fundamentally the same thing: slice sum for prefix sums).
Consider the query: P[0] = 2 . . . Q[0] = 4:If A occurs at least once, then since A has the smallest impact factor, the minimal impact factor is 1. If the slice sum for A in this query is greater than 0, then it means A occurs at least once. From the above table, the slice sum for A in this interval is 0 = 1 - 1 (in row 0). So A does not occur at all in this portion (part) of the string, S = "CAGCCTA".
The impact factors are as follows: A= 1; C=2; G=3; T=4. Now that A is not found in the range P[0] = 2 . . . Q[0] = 4, in the string , S = "CAGCCTA", if C is found in that range, then 2 for C will be the minimal impact factor, since 3 for G and 4 for T are higher impact factors.
Still on the query: P[0] = 2 . . . Q[0] = 4: if C occurs at least once, then since C has the bigger impact factor after A, the minimal impact factor will be 2. This assumes that A did not occur at all, in the range. If the slice sum for C in this query is greater than 0, then it means C occurs at least once (since A did not occur at all). From the above table, the slice sum for C in this interval is 2 = 3 - 1 (in row 1). Two is greater than 0. So C occurs at least once, in this portion of the string, S = "CAGCCTA". And the answer for the query is 2 for C = 2. Here, the impact factor for C being equal to 2 and the slice sum being equal to 2 is coincidence.
Consider the query: P[1] = 5 . . . Q[1] = 5:If A occurs at least once, then since A has the smallest impact factor, the minimal impact factor is 1. If the slice sum for A in this query is greater than 0, then it means A occurs at least once. From the above table, the slice sum for A in this interval is 0 = 1 - 1 (in row 0). So A does not occur at all in this portion of the string, S = "CAGCCTA".
Now that A is not found in the range P[1] = 5 . . . Q[1] = 5, in the string , S = "CAGCCTA", if C is found in that range, then 2 for C will be the minimal impact factor, since 3 for G and 4 for T are higher impact factors.
Still on the query: P[1] = 5 . . . Q[1] = 5, if C occurs at least once, then since C has the bigger impact factor after A, the minimal impact factor will be 2 for C. This assumes that A did not occur at all, in the range. If the slice sum for C in this query is greater than 0, then it means C occurs at least once. From the above table, the slice sum for C in this interval (inclusive) is 0 = 3 - 3 (in row 1). So C does not occur at all in this portion (part) of the string, S = "CAGCCTA".
Still on the query: P[1] = 5 . . . Q[1] = 5, if G occurs at least once, then since G has the bigger impact factor after C, the minimal impact factor will be 3 for G. This assumes that neither A nor C occurred at all, in the range. If the slice sum for G in this query is greater than 0, then it means G occurs at least once. From the above table, the slice sum for G in this interval is 0 = 1 - 1 (in row 2). So G does not occur at all in this portion of the string, S = "CAGCCTA".
Still on the query: P[1] = 5 . . . Q[1] = 5, if T occurs at least once, then since T has the bigger impact factor after G, the minimal impact factor will be 4 for T. This assumes that neither A nor C nor G occurred at all, in the range. If the slice sum for T in this query is greater than 0, then it means T occurs at least once. From the above table, the slice sum for T in this interval is 1 - 0 = 1 (in row 3). One is greater than 0. So T occurs at least once, in this portion of the string, S = "CAGCCTA". And the answer for the query is 4 for T = 4.
Consider the query: P[2] = 0 . . . Q[2] = 6:If A occurs at least once, then since A has the smallest impact factor, the minimal impact factor will be 1. If the slice sum for A in this query is greater than 0, then it means A occurs at least once. From the above table, the slice sum for A in this interval is 2 - 0 = 2 (in row 0). So A occurs at least once, in this portion of the string, S = "CAGCCTA". And the answer for the query is 1 for A = 1. The prefix sums preceding 0, has been taken into consideration for this subtraction.
Assuming that the index, -1 for the rows, has the prefix sums preceding 0 (which should be 0 for the four rows), then the code segment for the slice sum for the minimal impact factor is:
for (int K=0; K<M; K++){ if (occurrences[0][Q[K]] - occurrences[0][P[K]-1] > 0) answers[K] = 1; //for A else if (occurrences[1][Q[K]] - occurrences[1][P[K]-1] > 0) answers[K] = 2; //for C else if (occurrences[2][Q[K]] - occurrences[2][P[K]-1] > 0) answers[K] = 3; //for G else if (occurrences[3][Q[K]] - occurrences[3][P[K]-1] > 0) answers[K] = 4; //for T }
Since the index, -1 for the rows, does not have the prefix sums preceding 0, then the code segment for the slice sum becomes:
for (int K=0; K<M; K++){ int justBeforeSliceBegin; for (int x=0; x<4; x++) { if (P[K]-1 == -1) //to account for prefix sums preceding 0 justBeforeSliceBegin = 0; else justBeforeSliceBegin = occurrences[x][P[K]-1]; if(occurrences[x][Q[K]] - justBeforeSliceBegin > 0) { answers[K] = x+1; break; } } }
Complete Program
The complete program for this task (problem) is:
#include <iostream> #include <vector> #include <string> using namespace std; vector<int> solution(string &S, vector<int> &P, vector<int> &Q) { int N = S.size(); int M = P.size(); vector<int> answers(M); int occurrences[4][N]; for (int j=0; j<4; j++) { for (int i=0; i<N; i++) { occurrences[j][i] = 0; } } for (int i=0; i < N; i++) { if (S[i] == 'A') occurrences[0][i] = 1; else if (S[i] == 'C') occurrences[1][i] = 1; else if (S[i] == 'G') occurrences[2][i] = 1; else if (S[i] == 'T') occurrences[3][i] = 1; } for (int j=0; j<4; j++) { for (int i=1; i<N; i++) { occurrences[j][i] = occurrences[j][i] + occurrences[j][i-1]; } } for (int K=0; K<M; K++){ int justBeforeSliceBegin; for (int x=0; x<4; x++) { if (P[K]-1 == -1) //to account for prefix sums preceding 0 justBeforeSliceBegin = 0; else justBeforeSliceBegin = occurrences[x][P[K]-1]; if(occurrences[x][Q[K]] - justBeforeSliceBegin > 0) { answers[K] = x+1; break; } } } return answers; } int main() { vector<int> P{2, 5, 0}; vector<int> Q{4, 5, 6}; string str = "CAGCCTA"; vector<int> ret = solution(str, P, Q); cout << ret[0] <<' '<< ret[1] <<' '<< ret[2] << endl; vector<int> P2{0}; vector<int> Q2{0}; string str2 = "CAGCCTA"; vector<int> ret2 = solution(str2, P2, Q2); cout << ret2[0] <<' '<< ret2[1] <<' '<< ret2[2] << endl; return 0; }
The output is:
2 4 1 2 0 0
At app.codility.com/programmers, the scoring and detected time complexity for this smart approach is:
Task score : 100% - Correctness : 100% ; Performance : 100% Detected time complexity: O(N+M)
where N is for the number of characters in the string, and M is the number of queries.
Conclusion
Use Prefix Sums, but not very directly, and in a two-dimensional structure.