Broad Network


GenomicRangeQuery at app.codility.com/programmers in PHP Explained: Find the minimal nucleotide from a range of sequence DNA.

Task score : 100% - Correctness : 100% ; Performance : 100%
Detected time complexity : O(N + M)
Lesson 5 at app.codility.com/programmers
The solution and its explanation are given below.

Category of Article : Technology | Computers and Software | Algorithm

By: Chrysanthus Date Published: 28 May 2025

Problem

A DNA sequence can be represented as a string consisting of the letters A, C, G and T, which correspond to the types of successive nucleotides in the sequence. Each nucleotide has an impact factor, which is an integer. Nucleotides of types A, C, G and T have impact factors of 1, 2, 3 and 4, respectively. You are going to answer several queries of the form: What is the minimal impact factor of nucleotides contained in a particular part of the given DNA sequence?

The DNA sequence is given as a non-empty string S = S[0]S[1]...S[N-1] consisting of N characters. There are M queries, which are given in non-empty arrays P and Q, each consisting of M integers. The K-th query (0 ≤ K < M) requires you to find the minimal impact factor of nucleotides contained in the DNA sequence between positions P[K] and Q[K] (inclusive).

For example, consider string S = CAGCCTA and arrays P, Q such that:

    P[0] = 2    Q[0] = 4
    P[1] = 5    Q[1] = 5
    P[2] = 0    Q[2] = 6

The answers to these M = 3 queries are as follows:

        The part of the DNA between positions 2 and 4 contains nucleotides G and C (twice), whose impact factors are 3 and 2 respectively, so the answer is 2.
        The part between positions 5 and 5 contains a single nucleotide T, whose impact factor is 4, so the answer is 4.
        The part between positions 0 and 6 (the whole string) contains all nucleotides, in particular nucleotide A whose impact factor is 1, so the answer is 1.

Write a function:

    function solution($S, $P, $Q);

that, given a non-empty string S consisting of N characters and two non-empty arrays P and Q consisting of M integers, returns a array consisting of M integers specifying the consecutive answers to all queries.

Result array should be returned as a array of integers.

For example, given the string S = CAGCCTA and arrays P, Q such that:

    P[0] = 2    Q[0] = 4
    P[1] = 5    Q[1] = 5
    P[2] = 0    Q[2] = 6
the function should return the values [2, 4, 1], as explained above.

Write an efficient algorithm for the following assumptions:

        N is an integer within the range [1..100,000];
        M is an integer within the range [1..50,000];
        each element of arrays P and Q is an integer within the range [0..N - 1];
        P[K] ≤ Q[K], where 0 ≤ K < M;
        string S consists only of upper-case English letters A, C, G, T.

Note:

A slice is a sequence of elements that form a subset (subgroup) of the given array.

Brute Force Approach

The program for the brute-force approach is as follows (read through the code):

<?php
    function solution($S, $P, $Q) {
        $M = count($P);
        $mp = ['A' => 1, 'C' => 2, 'G' => 3, 'T' => 4];    //map
        $answers = array_fill(0, $M, 0);

        for($K=0; $K < $M; $K++) {
            $min = 5;
            for ($i=$P[$K]; $i <= $Q[$K]; $i++) {    //scan the slice
                if ($mp[$S[$i]] < $min) {
                    $min = $mp[$S[$i]];
                    $answers[$K] = $min;
                }
            }
        }

        return $answers;
    }

    $P = [2, 5, 0];
    $Q = [4, 5, 6];
    $str = 'CAGCCTA';
    $ret1 = solution($str, $P, $Q);
    echo $ret1[0] . ' ' . $ret1[1] . ' ' . $ret1[2] . ' ' . "\n";

    $P2 = [0];
    $Q2 = [0];
    $str2 = 'CAGCCTA';
    $ret2 = solution($str2, $P2, $Q2);
    echo $ret2[0] . ' ' . $ret2[1] . ' ' . $ret2[2] . ' ' . "\n";
?>

The output is:

    2 4 1
    2 undefined undefined 

where undefined meaans 0 here. 

The scoring and detected time complexity for this solution() function, at app.codility.com/programmers is:

    Task score : 62% -  Correctness : 100% ; Performance : 0%
    Detected time complexity: O(N x M)

The correctness is 100% and the speed (time taken) is 0% (too slow). The reasons for the 0% performance (from app.codility.com/programmers) are given in the following table:

almost_all_same_letters
GGGGGG..??..GGGGGG..??..GGGGGG
✘ TIMEOUT ERROR
Killed. Hard limit reached: 6.000 sec.
large_random
large random string, length
✘ TIMEOUT ERROR
Killed. Hard limit reached: 6.000 sec.
extreme_large
all max ranges
✘ TIMEOUT ERROR
Killed. Hard limit reached: 6.000 sec.

N x M as number of operations for the solution() function, can be compared (is similar) to N2 operations (especially when N = M). Imagine a grid of N columns by M rows where each cell in the grid represents an operation. N is for the length of the given string S, and M is for the given number of queries. So, the total number of operations for the above solution() function, is N x M. Remember, it is the maximum possible number of operations that is considered, when dealing with time complexity.

Smart Solution

N + M number of operations is normally less than N x M number of operations. So, if the number of operations for the above solution() can be reduced from N x M to N + M, then the overall time taken for the solution() function to complete, would be shorter, (hence speed will be faster).

Multiplication (N x M) is continuous addition (continuous summing). Prefix-sums can reduce time complexity from O(N2) to O(N). So, prefix-sums can be used somewhere, somehow in the solution() function code, to enable the number of operations go down from (N x M) to (N + M). After all, this problem is on the lesson (tutorial) topic of prefix-sums.

If the slice sum from the prefix-sums for the single occurrence of each nucleotide type, in the string, is obtained for each query, then the total number of operations can be reduced from (N x M) to (N + M). However, here, two dimensional array is needed and not one dimensional array.

In the problem example, N is 7 and M is 3. With that, 7 x 3 = 21 operations and N + M = 7 + 3 = 10 operations.

Illustration

The illustration is done here with the above sample case of S = CAGCCTA consisting of N = 7 characters and M = 3 queries, which are:

    P[0] = 2    Q[0] = 4
    P[1] = 5    Q[1] = 5
    P[2] = 0    Q[2] = 6

A two dimensional array of single occurrence has to be created with all elements initialized to 0, as the following table shows:

Single Occurrence of Nucleotide Type in given String, Cells initialized to 0
Column
index/Row
index
0 (C)1 (A)2 (G)3 (C)4 (C)5 (T)6 (A)
0 (A)0000000
1 (C)0000000
2 (G)0000000
3 (T)0000000

The code segment to produce this two dimensional array with all elements initialized to 0 is:

    $occurrences = Array();    //will have 4 rows and N columns
    for ($j=0; $j < 4; $j++) {
      for ($i=0; $i < $N; $i++) {    //will have N elements (columns)
          $occurrences[$j][$i] = 0;
      }
    }

The time of execution for this code segment is not included in the overall time complexity. The column for the prefix sums preceding zero is implicit (not in the table).

Type A nucleotide has an impact factor of 1. However, in the table its row index is 0 = 1 - 1. Type C nucleotide has an impact factor of 2. However, in the table its row index is 1 = 2 - 1. Type G nucleotide has an impact factor of 3. However, in the table its row index is 2 = 3 - 1. Type T nucleotide has an impact factor of 4. However, in the table its row index is 3 = 4 - 1.

Next in the program, the single occurrence of each nucleotide type in the string, S = "CAGCCTA" has to be put into the two dimensional array (table). The table becomes:

Single Occurrence of Nucleotide Type in a 2D Array
Column
index/Row
index
0 (C)1 (A)2 (G)3 (C)4 (C)5 (T)6 (A)
0 (A)0100001
1 (C)1001100
2 (G)0010000
3 (T)0000010

A occurs 1 time at index 1 (zero based indexing) in the given string, and is represented 1 time at index 1 above (row 0). ‘A’ also occurs once at index 6 in the given string, and is represented once at index 6 above (row 0). C occurs 1 time at index 0 in the given string, 1 time at index 3 in the given string, and 1 time at index 4 in the given string. Then, C is represented 1 time at index 0 above (row 1), 1 time at index 3 above (row 1), and 1 time at index 4 above (row 1). G occurs 1 time at index 2 in the given string, and is represented one time at index 2 above (row 2). T occurs 1 time at index 5 in the given string, and is represented one time at index 5 above (row 3). Every other cell remains at the initialized value of 0. The zeros (column) for the prefix sums are implicit (not explicit).

The code segment to produce the above 2D array is:

    for ($i=0; $i < $N; $i++) {
        if ($S[$i] == 'A') $occurrences[0][$i] = 1;
        else if ($S[$i] == 'C') $occurrences[1][$i] = 1;
        else if ($S[$i] == 'G') $occurrences[2][$i] = 1;
        else if ($S[$i] == 'T') $occurrences[3][$i] = 1;
    }

The reader must note that this non-nested for-loop takes N time and not N x M time. However, app.codility.com/programmers has not really included this N time, in the overall time complexity.

Next in the program, the prefix sums for the single occurrence of each nucleotide type for the string, S = "CAGCCTA" are obtained for each row. The 2D table (array) becomes:

Prefix Sums for Single Occurrence of Nucleotide Type in a 2D Array
Column
index/Row
index
0 (C)1 (A)2 (G)3 (C)4 (C)5 (T)6 (A)
0 (A)0111112
1 (C)1112333
2 (G)0011111
3 (T)0000011

The code segment for this table is:

    for ($j=0; $j < 4; $j++) {
        for ($i=1; $i < $N; $i++) {
            $occurrences[$j][$i] = $occurrences[$j][$i] + $occurrences[$j][$i-1];    s//prefix summing; equivalent to occurrences[j][i] += occurrences[j][i-1];
        }
    }

Such prefix sums without the preceding zeros are referred to as running sums.

Slice Sum

The normal way to obtain a slice sum is as follows: Subtract the prefix sum at the beginning of the slice, from the prefix sum just after the slice. This should be done for each row in the above final 2D array, per query, for the problem. However, because of the unconventional way of doing the prefix sums for this problem, the prefix sum just before the beginning of the slice, is subtracted from the prefix sum at the end of the slice (it is still fundamentally the same thing: slice sum for prefix sums).

Consider the query: P[0] = 2 . . . Q[0] = 4:

If A occurs at least once, then since A has the smallest impact factor, the minimal impact factor is 1. If the slice sum for A in this query is greater than 0, then it means A occurs at least once. From the above table, the slice sum for A in this interval is 0 = 1 - 1 (in row 0). So A does not occur at all in this portion (part) of the string, S = "CAGCCTA".

The impact factors are as follows: A= 1; C=2; G=3; T=4. Now that A is not found in the range P[0] = 2 . . . Q[0] = 4, in the string , S = "CAGCCTA", if C is found in that range, then 2 for C will be the minimal impact factor, since 3 for G and 4 for T are higher impact factors.

Still on the query: P[0] = 2 . . . Q[0] = 4: if C occurs at least once, then since C has the bigger impact factor after A, the minimal impact factor will be 2. This assumes that A did not occur at all, in the range. If the slice sum for C in this query is greater than 0, then it means C occurs at least once (since A did not occur at all). From the above table, the slice sum for C in this interval is 2 = 3 - 1 (in row 1). Two is greater than 0. So C occurs at least once, in this portion of the string, S = "CAGCCTA". And the answer for the query is 2 for C = 2. Here, the impact factor for C being equal to 2 and the slice sum being equal to 2 is coincidence.

Consider the query: P[1] = 5 . . . Q[1] = 5:

If A occurs at least once, then since A has the smallest impact factor, the minimal impact factor is 1. If the slice sum for A in this query is greater than 0, then it means A occurs at least once. From the above table, the slice sum for A in this interval is 0 = 1 - 1 (in row 0). So A does not occur at all in this portion of the string, S = "CAGCCTA".

Now that A is not found in the range P[1] = 5 . . . Q[1] = 5, in the string , S = "CAGCCTA", if C is found in that range, then 2 for C will be the minimal impact factor, since 3 for G and 4 for T are higher impact factors.

Still on the query: P[1] = 5 . . . Q[1] = 5, if C occurs at least once, then since C has the bigger impact factor after A, the minimal impact factor will be 2 for C. This assumes that A did not occur at all, in the range. If the slice sum for C in this query is greater than 0, then it means C occurs at least once. From the above table, the slice sum for C in this interval (inclusive) is 0 = 3 - 3 (in row 1). So C does not occur at all in this portion (part) of the string, S = "CAGCCTA".

Still on the query: P[1] = 5 . . . Q[1] = 5, if G occurs at least once, then since G has the bigger impact factor after C, the minimal impact factor will be 3 for G. This assumes that neither A nor C occurred at all, in the range. If the slice sum for G in this query is greater than 0, then it means G occurs at least once. From the above table, the slice sum for G in this interval is 0 = 1 - 1 (in row 2). So G does not occur at all in this portion of the string, S = "CAGCCTA".

Still on the query: P[1] = 5 . . . Q[1] = 5, if T occurs at least once, then since T has the bigger impact factor after G, the minimal impact factor will be 4 for T. This assumes that neither A nor C nor G occurred at all, in the range. If the slice sum for T in this query is greater than 0, then it means T occurs at least once. From the above table, the slice sum for T in this interval is 1 - 0 = 1 (in row 3). One is greater than 0. So T occurs at least once, in this portion of the string, S = "CAGCCTA". And the answer for the query is 4 for T = 4.

Consider the query: P[2] = 0 . . . Q[2] = 6:

If A occurs at least once, then since A has the smallest impact factor, the minimal impact factor will be 1. If the slice sum for A in this query is greater than 0, then it means A occurs at least once. From the above table, the slice sum for A in this interval is 2 - 0 = 2 (in row 0). So A occurs at least once, in this portion of the string, S = "CAGCCTA". And the answer for the query is 1 for A = 1. The prefix sums preceding 0, has been taken into consideration for this subtraction.

Assuming that the index, -1 for the rows, has the prefix sums preceding 0 (which should be 0 for the four rows), then the code segment for the slice sum for the minimal impact factor is:

    for (K=0; K<M; K++){
        if (occurrences[0][Q[K]] - occurrences[0][P[K]-1] > 0) answers[K] = 1;  //for A
        else if (occurrences[1][Q[K]] - occurrences[1][P[K]-1] > 0) answers[K] = 2;  //for C
        else if (occurrences[2][Q[K]] - occurrences[2][P[K]-1] > 0) answers[K] = 3;  //for G
        else if (occurrences[3][Q[K]] - occurrences[3][P[K]-1] > 0) answers[K] = 4;  //for T
    }

Since the index, -1 for the rows, does not have the prefix sums preceding 0, then the code segment for the slice sum becomes:

    for ($K=0; $K < $M; $K++){
        $justBeforeSliceBegin;

        for ($x=0; $x < 4; $x++) {
            if ($P[$K]-1 == -1)    //to account for prefix sums preceding 0
                $justBeforeSliceBegin = 0;
            else
                $justBeforeSliceBegin = $occurrences[$x][$P[$K]-1];

            if($occurrences[$x][$Q[$K]] - $justBeforeSliceBegin > 0) {
                $answers[$K] = $x+1;
                break;
            }
        }
    }

This code segment is considered to take M time and not M x M time (not M2 time): The nested for-loop replaces the above if-compound statement.

Complete Program

The complete program for this task (problem) is:

<?php
function solution($S, $P, $Q) {
    $N = strlen($S);
    $M = count($P);
    $answers;    //will have M elements

    $occurrences = Array();    //will have 4 rows and N columns
    for ($j=0; $j < 4; $j++) {
      for ($i=0; $i < $N; $i++) {    //will have N elements (columns)
          $occurrences[$j][$i] = 0;
      }
    }

    for ($i=0; $i < $N; $i++) {
        if ($S[$i] == 'A') $occurrences[0][$i] = 1;
        else if ($S[$i] == 'C') $occurrences[1][$i] = 1;
        else if ($S[$i] == 'G') $occurrences[2][$i] = 1;
        else if ($S[$i] == 'T') $occurrences[3][$i] = 1;
    }

    for ($j=0; $j < 4; $j++) {
        for ($i=1; $i < $N; $i++) {
            $occurrences[$j][$i] = $occurrences[$j][$i] + $occurrences[$j][$i-1];
        }
    }
    
    for ($K=0; $K < $M; $K++){
        $justBeforeSliceBegin;

        for ($x=0; $x < 4; $x++) {
            if ($P[$K]-1 == -1)    //to account for prefix sums preceding 0
                $justBeforeSliceBegin = 0;
            else
                $justBeforeSliceBegin = $occurrences[$x][$P[$K]-1];

            if($occurrences[$x][$Q[$K]] - $justBeforeSliceBegin > 0) {
                $answers[$K] = $x+1;
                break;
            }
        }
    }

    return $answers;
}

$P = [2, 5, 0];
$Q = [4, 5, 6];
$str = 'CAGCCTA';
$ret1 = solution($str, $P, $Q);
echo $ret1[0] . ' ' . $ret1[1] . ' ' . $ret1[2] . ' ' . "\n";

$P2 = [0];
$Q2 = [0];
$str2 = 'CAGCCTA';
$ret2 = solution($str2, $P2, $Q2);
echo $ret2[0] . ' ' . $ret2[1] . ' ' . $ret2[2] . ' ' . "\n";
?>

The output is:

    2 4 1
    2 undefined undefined 

where undefined meaans 0 here. 

At app.codility.com/programmers, the scoring and detected time complexity for this smart approach is:

    Task score : 100% -  Correctness : 100% ; Performance : 100%
    Detected time complexity: O(N+M)

where N is for the number of characters in the string, and M is the number of queries.

Conclusion

Use Prefix Sums, but not very directly, and in a two-dimensional structure.

The rest of the 17 lessons, with explanation of lessons, explanation of tasks, and explanation of solutions, can be found at (click): Explanations in PHP for the 17 Algorithm Lessons at https://app.codility.com/programmers with Problems and Solutions





Related Links

Basics of PHP with Security Considerations
White Space in PHP
PHP Data Types with Security Considerations
PHP Variables with Security Considerations
PHP Operators with Security Considerations
PHP Control Structures with Security Considerations
PHP String with Security Considerations
PHP Arrays with Security Considerations
PHP Functions with Security Considerations
PHP Return Statement
Exception Handling in PHP
Variable Scope in PHP
Constant in PHP
PHP Classes and Objects
Reference in PHP
PHP Regular Expressions with Security Considerations
Date and Time in PHP with Security Considerations
Files and Directories with Security Considerations in PHP
Writing a PHP Command Line Tool
PHP Core Number Basics and Testing
Validating Input in PHP
PHP Eval Function and Security Risks
PHP Multi-Dimensional Array with Security Consideration
Mathematics Functions for Everybody in PHP
PHP Cheat Sheet and Prevention Explained
More Related Links

Cousins

BACK NEXT

Comments