Broad Network


Packing and Unpacking Text in Perl

Perl pack and unpack Functions – Part 1

Writing a Perl Module

Foreword: In this part of the series I explain how to pack and unpack text in Perl, to achieve joining or separation of bytes with some conversion.

By: Chrysanthus Date Published: 27 Jan 2015

Introduction

This is part 1 of my series, Perl pack and unpack Functions. In this part I explain how to pack and unpack text in Perl, that is, how to join or separate runs of bytes with some conversion. A variable can hold a string, and a string is made of characters. In the computer memory, these characters sit in consecutive memory cells, one character per cell, and the numeric code stored in each cell is called a byte. The Perl pack() and unpack() functions are used to manipulate such consecutive bytes in memory. The code samples in this series assume the ASCII character set, or an ASCII-compatible character set, so that each character occupies exactly one byte.

Pre-Knowledge
This series is part of my volume, Writing a Perl Module. The first series in this volume is Internet Sockets and Perl. The second series is this one. At the bottom of this page, you have links to the different series in the volume. You should be reading the series in the order given.

In each series you should read the parts in the order given. For each part, the links and order are in a menu on the (top) left of the page.

Use of the pack and unpack Functions for Text
Consider the following statement:

    $str = "I have a pencil, a pen and a book.";

Before we continue, note that a space is a character; it occupies a byte in memory like any other character. Each character of the above string occupies one cell in memory, and the characters of a variable like $str are in consecutive cells. Without using a regular expression, you can split these characters into smaller runs of consecutive characters, to end up with something like:

    "I have", "a pencil", " a pen", "and a boo"

Index counting within a string begins from zero. Note that the space characters at index 6 and index 22 of the main string ($str) are not included in the pieces. The comma after “pencil” (index 15) is not included either, and neither are the ‘k’ and the full stop at the end of the main string. Assume that leaving these characters out was our choice. The four pieces could be assigned to variables, giving something like:

    $str0 = "I have"
    $str1 = "a pencil"
    $str2 = " a pen"
    $str3 = "and a boo"

When you separate a string in this manner, keeping the characters of each piece consecutive, you are said to be unpacking the string. When you join the separate pieces to get the main string back, with or without the omitted characters, you are said to be packing the string. Remember that inside the computer each character is a byte, even the space character.

The unpack Function
The following code unpacks the above main string into the variables given:

    my $str = "I have a pencil, a pen and a book.";
    my ($str0, $str1, $str2, $str3) = unpack("A6xA8xA6xA9", $str);

The expression, "A6xA8xA6xA9" is an example of what is called a template. The second argument in the function is the main string: it can be a variable or a literal. The template can also be a variable. The unpack function returns a list.

In the template, the uppercase letter A stands for a character. A6 at its position in the template means copy the first 6 characters (bytes) into the variable $str0. The lowercase x at its position means skip the next corresponding character of the main string; x2 means skip the next 2 characters, x3 the next 3 characters, and so on. A8 means copy the next 8 characters of the string into $str1, and the x after it skips one character (here, the comma). The second A6 copies the next 6 characters into $str2, and the x that follows skips one more character. A9 copies the next 9 characters into $str3. When unpacking, A also removes any trailing whitespace and null bytes from the piece it returns; leading spaces, as in " a pen", are kept. Template letters are case sensitive: ‘A’ does not mean ‘a’.
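As a small illustration of x followed by a number, the sketch below unpacks the same main string but skips two characters after "pencil" (the comma and the space), and takes ten characters for the last piece. The variable names $p0 to $p3 are chosen only for this example.

```perl
use strict;
use warnings;

my $str = "I have a pencil, a pen and a book.";

# x2 skips two characters (the comma and the space after "pencil"),
# so the third piece starts at "a pen" with no leading space.
# A10 takes "and a book"; the final full stop is left unread.
my ($p0, $p1, $p2, $p3) = unpack("A6xA8x2A5xA10", $str);

print "[$p0] [$p1] [$p2] [$p3]\n";   # [I have] [a pencil] [a pen] [and a book]
```

Any characters of the main string that the template does not reach, such as the full stop here, are simply not returned.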

Consider the following template:

    "AAA"

This would copy single characters. The first A means copy the first character of the string into the first variable of the list; the second A means copy the second character into the second variable of the list. The third A means copy the third character into the third variable of the list (and so on). If you want to copy consecutive characters of a portion into a variable, follow the A for the start of the portion, with a number, as in A6 or A8 above.

Note: A and A1 mean the same thing; and x and x1 mean the same thing. A variable in the return list, corresponds to the letter A followed by a number. So, A1 means one character for the variable and A6 means 6 characters for the variable.
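The difference between repeating A and following A with a number can be checked with a short test. This is only a sketch; the word "pencil" and the variable names are picked for the example.

```perl
use strict;
use warnings;

my $word = "pencil";

# "AAA": three one-character portions, so three items are returned.
my @singles = unpack("AAA", $word);        # ("p", "e", "n")

# "A3": one three-character portion, so one item is returned.
my ($first3) = unpack("A3", $word);        # "pen"

# "A" is the same as "A1", and "x" is the same as "x1".
my @short = unpack("AxA",    $word);       # ("p", "n")
my @long  = unpack("A1x1A1", $word);       # ("p", "n")

print join("|", @singles), "\n";   # p|e|n
print "$first3\n";                 # pen
print "@short / @long\n";          # p n / p n
```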

The pack Function
The pack function does the opposite of the unpack function. It groups portions of strings to form one string. The formed string will have the portions lying consecutively without any skip of memory cell (in memory). When you unpack, you return a list of items; when you pack, you return a single item.

The portions of strings packed do not have to come from the same source (original string). The pack function includes only what you pass to it; it does not add characters you did not ask for, such as a comma, a space, a full stop or letters like e, f, g. Note that when a digit such as 4, 6 or 5 is typed within quotes, it is a character and not a number. A single digit as a character occupies 1 byte.

The pack function uses the same template as the unpack function. However, the x here will instead introduce the null character. The null character is '\0'. So x1 means introduce 1 null character, x2 means introduce two null characters, x3 means introduce three null characters, etc.
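A minimal sketch of x in a pack template follows. Since a null byte cannot be seen when printed, the example checks the length of the result and compares it with a string literal that spells out the null bytes. The strings "ab" and "cd" are arbitrary.

```perl
use strict;
use warnings;

# x adds one null byte ("\0") to the result; x2 adds two.
my $packed = pack("A2xA2x2", "ab", "cd");

print length($packed), "\n";     # 7 (2 + 1 + 2 + 2 bytes)

# Compare with a double-quoted literal that spells out the null bytes.
print $packed eq "ab\0cd\0\0" ? "same\n" : "different\n";   # same
```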

The following code packs the separated strings above into the main string, but the characters omitted, have not been included in the packing process.

    use strict;

    my $str0 = "I have";
    my $str1 = "a pencil";
    my $str2 = " a pen";
    my $str3 = "and a boo";

    my $str = pack("A6A8A6A9", $str0, $str1, $str2, $str3);

    print $str;

The output is:

    I havea pencil a penand a boo

In the template here, the x meta character was not used, as it would introduce a null byte (character). The null byte has no visible form when printed; there is a nuance with it, discussed below.

To add characters such as a comma, a full stop or even ordinary letters to the packed string, pass them as extra items in the list, each with its own template portion. One way is to use new variables for the needed characters. The following code illustrates this by building the initial string above:

    use strict;

    my $str0 = "I have";
    my $str0A = " ";
    my $str1 = "a pencil";
    my $str1B = ",";
    my $str2 = " a pen";
    my $str2C = " ";
    my $str3 = "and a boo";
    my $str3D = "k.";

    my $str = pack("A6AA8AA6AA9A2", $str0, $str0A, $str1, $str1B, $str2, $str2C, $str3, $str3D);

    print $str;


The output is now:

    I have a pencil, a pen and a book.

The new variables, $str0A, $str1B and $str2C, have the characters or sub-strings needed. $str0A corresponds to the template portion, A after the first 6; $str1B corresponds to the template portion, A after 8; $str2C corresponds to the template portion, A after the second 6; and $str3D corresponds to the template portion, A2 at the end.

The unpack function returns a list and takes a single item as its second argument. The pack function returns a single item and takes a list after the template. For both functions, the template has the same meaning, except for the use of x. When packing, A8 for example takes the first 8 characters of the corresponding variable and appends them to the packed string, based on its position in the template; any further characters in the variable are ignored. If the variable holds fewer than 8 characters, A8 pads the portion with spaces up to 8 characters.
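The sketch below shows what A followed by a number does on packing when the variable is longer or shorter than the number. With A, a long value is cut to the given width and a short value is padded with spaces. The square brackets in the print statements are there only to make the padding visible.

```perl
use strict;
use warnings;

# Longer than the width: only the first 4 characters are taken.
my $long  = pack("A4", "abcdefgh");   # "abcd"

# Shorter than the width: A pads with spaces up to 4 characters.
my $short = pack("A4", "ab");         # "ab  "

print "[$long]\n";                 # [abcd]
print "[$short]\n";                # [ab  ]
print length($short), "\n";        # 4
```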

Note: the characters in the template are called meta characters. A meta character is a character about another character (in a string).

Consuming all Unknown Characters
You may want to pack the characters of a variable into a string but you do not know how many characters are in the variable! No problem, there is the * meta character to use. It means consume the rest of the characters. Consider the following code segment:

    use strict;

    my $str0 = "chosen";
    my $str1 = "remainder";

    my $str = pack("A6A*", ($str0, $str1));

    print $str;

The list of items after the template can be placed in parentheses. The output is:

    chosenremainder

To separate the two words, you have to add another variable (of comma or space) between the two present variables. The added variable will have a corresponding template portion between the two present template portions.

The * meta character means consume everything left. So A* means A as many times as possible, for the corresponding variable; not 6, not 2 not 8, but as many times as possible, for the corresponding variable. When unpacking, * is usually used at the end of the template. In that case, it would mean, copy the characters left into the last variable returned.
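Here is a short sketch of both uses of *: packing with a separator variable between the two words, and unpacking with A* at the end of the template. The names $sep, $joined, $head and $rest are made up for the example.

```perl
use strict;
use warnings;

my $str0 = "chosen";
my $sep  = " ";            # extra variable that holds the separator
my $str1 = "remainder";

# Packing: A* takes all the characters of $str1, however many there are.
my $joined = pack("A6AA*", $str0, $sep, $str1);
print "$joined\n";         # chosen remainder

# Unpacking: A* at the end copies everything that is left into $rest.
my ($head, $rest) = unpack("A6xA*", "chosen remainder of the line");
print "$head\n";           # chosen
print "$rest\n";           # remainder of the line
```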

Effect of smaller number for Template Portion
If you meant to consider say 8 characters and you typed say 4 for the template, only the first 4 characters for the portion, will be returned. That is, if you type A4 instead of A8, only the first 4 characters of the sub-string in question are returned.

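The meta characters of the template are case sensitive: A3 does not mean a3. Lowercase a is a different, valid format character: when packing, it pads a short portion with null bytes instead of spaces, and when unpacking, it returns the bytes without stripping anything.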

Spaces within a Template
Note: you can put spaces between the template portions for readability. Either of the following two statements yields the same result as above:

    my $str = pack("A6A*", ($str0, $str1));
    my $str = pack("A6 A*", ($str0, $str1));

The spaces have no effect on the output.

The null Byte
The null byte or null character is '\0'; where 0 is zero and not the letter, O. You use the pack() and unpack() functions in low-level programming. In low level programming, from time to time, you will be required to add the null character at the end of a phrase, within a long text. You can either add the null character by string concatenation or by the x meta character in the pack function. The following code illustrates this with string concatenation:

    use strict;

    my $strA = "blabla";
    my $strB = "She loves everyone." . "\0";
    my $strC = "Is she a flirt?" . "\0";
    my $strD = "bla1bla2";

    my $str = pack("A6A20A16A8", $strA,$strB,$strC,$strD);

    print $str;

Note that for the template portions of $strB and $strC, the numbers have been increased by 1 to take account of the null byte: from 19 to 20 for $strB and from 15 to 16 for $strC. When typing the null byte in code, use double quotes and not single quotes; inside single quotes, \0 is just a backslash followed by the digit 0. On a console that displays the null byte as a blank, the output looks like this:

    blablaShe loves everyone. Is she a flirt? bla1bla2

The following code illustrates the insertion of the null byte with the x meta character:

    use strict;

    my $strA = "blabla";
    my $strB = "She loves everyone.";
    my $strC = "Is she a flirt?";
    my $strD = "bla1bla2";

    my $str = pack("A6A19xA15xA8", $strA,$strB,$strC,$strD);

    print $str;

The output is:

    blablaShe loves everyone. Is she a flirt? bla1bla2

same as for concatenation.

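In the packed string the null byte is the byte with hexadecimal value 00 (see later). The print function writes this byte out unchanged; it has no visible glyph, so a console may show it as a blank, as in the output above, or show nothing at all.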
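One way to see what is really stored in the packed string is to print the code of each byte in hexadecimal. The sketch below does this with the built-in functions split, ord and sprintf; the strings "abc" and "def" and the name @codes are chosen for the example.

```perl
use strict;
use warnings;

my $str = pack("A3xA3", "abc", "def");

# Split into single characters and format each character code as two hex digits.
my @codes = map { sprintf("%02x", ord($_)) } split(//, $str);

print "@codes\n";    # 61 62 63 00 64 65 66
```

The fourth byte is 00, the null byte, and not 20, which would be the space character.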

The Problem of the Null Character
The print function does not convert the null character into a space. The null byte is sent to the output as it is; it simply has nothing visible to display, and some consoles leave a blank in its place.

When you use the pack() function and the resulting string has a null byte, there is no problem; the null byte is stored in the string as hexadecimal 00. However, when you unpack such a string with A, any null bytes and whitespace at the end of a portion are stripped from the returned item. If you then pack the item again with A, the portion is padded back to its width with spaces. So after a cycle of packing and unpacking, a null byte at the end of a portion ends up replaced by the space character, hexadecimal 20. This is the documented behavior of the A format, not a fault in the functions. To keep the null byte, use lowercase a for that portion, which does no stripping on unpack and pads with null bytes on pack, or add the null byte by concatenation or a regular expression where possible.
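The cycle can be reproduced with a four-byte string that ends in a null byte. The sketch compares the uppercase A format with the lowercase a format, which Perl documents as doing no stripping on unpack and padding with null bytes on pack. The H* template used for display returns the bytes as a hexadecimal string; it is used here only to make the last byte visible.

```perl
use strict;
use warnings;

my $original = "abc\0";                       # 4 bytes, the last one is null

# With A: unpack strips the trailing null, pack pads with a space.
my ($stripped)  = unpack("A4", $original);    # "abc"
my $repacked_A  = pack("A4", $stripped);      # "abc "

# With a: unpack keeps every byte, pack pads with null bytes if needed.
my ($kept)      = unpack("a4", $original);    # "abc\0"
my $repacked_a  = pack("a4", $kept);          # "abc\0"

print unpack("H*", $repacked_A), "\n";   # 61626320
print unpack("H*", $repacked_a), "\n";   # 61626300
```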

That is it for this part of the series. We stop here and continue in the next part. Before we take a break, know that the pack and unpack functions are Perl built-in functions; they are not from a module.

Chrys

Related Links

Internet Sockets and Perl
Perl pack and unpack Functions
Writing MySQL Protocol Packets in PurePerl
Developing a PurePerl MySQL API
Using the PurePerl MySQL API
