C-like arrays in perl

General Tech Learning Aids/Tools 2 years ago

0 2 0 0 0 tuteeHUB earn credit +10 pts

5 Star Rating 1 Rating

Posted on 16 Aug 2022, this text provides information on Learning Aids/Tools related to General Tech. Please note that while accuracy is prioritized, the data presented might not be entirely correct or up-to-date. This information is offered for general knowledge and informational purposes only, and should not be considered as a substitute for professional advice.

Take Quiz To Earn Credits!

Turn Your Knowledge into Earnings.

tuteehub_quiz

Answers (2)

Post Answer
profilepic.png
manpreet Tuteehub forum best answer Best Answer 2 years ago

 

I want to create and manipulate large arrays of (4 byte) integers in-memory. By large, I mean on the order of hundreds of million. Each cell in the array will act as a counter for a position on a chromosome. All I need is for it to fit in memory, and have fast (O(1)) access to elements. The thing I'm counting is not a sparse feature, so I can't use a sparse array.

I can't do this with a regular perl list, because perl (at least on my machine) uses 64 bytes per element, so the genomes of most of the organisms I work with are just too big. I've tried storing the data on-disk via SQLite and hash tying, and though they work, are very slow, especially on ordinary drives. (It works reasonably ok when I run on 4-drive raid 0's).

I thought I could use PDL arrays, b/c PDL stores its arrays just as C does, using only 4 bytes per element. However, I found that update speed to be excruciatingly slow compared to perl lists:

use PDL;
use Benchmark qw/cmpthese/;

my $N = 1_000_000;
my @perl = (0 .. $N - 1);
my $pdl = zeroes $N;

cmpthese(-1,{ 
    perl => sub{
        $perl[int(rand($N))]++;
    },
    pdl => sub{
        # note that I'm not even incrementing here just setting to 1
        $pdl->set(int(rand($N)), 1);
    }
});

Returns:

          Rate  pdl perl
pdl   481208/s   -- -87%
perl 3640889/s 657%   --    

Does anyone know how to increase pdl set() performance, or know of a different module that can accomplish this?

profilepic.png
manpreet 2 years ago

I cannot tell what sort of performance you will get, but I recommend using the vec function, documented here, to split a string into bit fields. I have experimented and found that my Perl will tolerate a string up to 500_000_000 characters long. which corresponds to 125,000,000 32-bit values.

my $data = "\0" x 500_000_000;
vec($data, 0, 32)++;            # Increment data[0]
vec($data, 100_000_000, 32)++;  # Increment data[100_000_000]

If this isn't enough there may be something in the build of Perl that controls the limit. Alternatively if you think you can get smaller fields - say 16-bit counts - vec will accept field widths of any power of 2 up to 32.

Edit: I believe the string size limit is related to the 2GB maximum private working set on 32-bit Windows processes. If you are running Linux or have a 64-bit perl you may be luckier than me.


I have added to your benchmark program like this

my $vec = "\0" x ($N * 4);

cmpthese(-3,{ 
    perl => sub{
        $perl[int(rand($N))]++;
    },
    pdl => sub{
        # note that I'm not even incrementing here just setting to 1
        $pdl->set(int(rand($N)), 1);
    },
    vec => sub {
        vec($vec, int(rand($N)), 32)++; 
    },
});

giving these results

          Rate  pdl  vec perl
pdl   472429/s   -- -76% -85%
vec  1993101/s 322%   -- -37%
perl 3157570/s 568%  58%   --

so using vec is two-thirds the speed of a native array. Presumably that's acceptable.


0 views   0 shares

No matter what stage you're at in your education or career, TuteeHub will help you reach the next level that you're aiming for. Simply,Choose a subject/topic and get started in self-paced practice sessions to improve your knowledge and scores.