Re: Please help: perl run out of memory
From: David Mertens
Date: April 17, 2022 18:00
Subject: Re: Please help: perl run out of memory
Message ID: CA+4ieYWe--b-hZOdD5nf+1V+GuAMy4rWva6hcfwdL0uknamjqQ@mail.gmail.com
I see nothing glaringly inefficient in the Perl. This would be fine on your
system if you were dealing with 1 million items, but you could easily be
pushing up against your system's limits with the generic data structures
that Perl uses, especially since Perl is probably using 64-bit floats and
ints, and storing the hash keys twice (because you have two hashes).
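(If you want to see that overhead for yourself, the Devel::Size module from
CPAN can report it; this is just a tiny illustration with one made-up entry
in the same shape your script builds:)

use Devel::Size qw(total_size);
my %hash = ( '0000031887' => { total => 1.0, count => 1 } );
# typically several hundred bytes for a single item on a 64-bit perl
print total_size(\%hash), " bytes for one item\n";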
You could try the Perl Data Language, PDL, to create large typed arrays
with minimal overhead. However, I think a more Perlish approach would be to
use a single hash to store the data, as you do, perhaps with pack/unpack so
the totals and counts are stored as 32-bit floats and integers. Then,
instead of using sort, run through the whole collection and build your own
top-20 list (or 50, or whatever) by hand. That way the final step of
picking out the top 20 doesn't allocate new storage for all 80 million
items.
Does that make sense? I could bang out some code illustrating what I mean
if that would help.
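Here is a rough, untested sketch of that idea. It assumes the same rate.csv
layout as in your script; the pack template 'f L' holds a 32-bit float total
and a 32-bit unsigned count, and 20 is just a placeholder for the top-N size:

use strict;
use warnings;

# One hash: item => packed (32-bit float total, 32-bit unsigned count).
my %data;

open my $fh, '<', 'rate.csv' or die "rate.csv: $!";
while (<$fh>) {
    my ($item, $rate) = (split /,/)[1, 2];
    my ($total, $count) = defined $data{$item}
        ? unpack('f L', $data{$item}) : (0, 0);
    $data{$item} = pack('f L', $total + $rate, $count + 1);
}
close $fh;

# Build the top-20 list by hand instead of sorting every key.
my $n = 20;
my @top;    # elements are [item, average], kept sorted high to low
for my $item (keys %data) {
    my ($total, $count) = unpack('f L', $data{$item});
    my $avg = $total / $count;
    next if @top == $n && $avg <= $top[-1][1];
    my $i = 0;
    $i++ while $i < @top && $top[$i][1] >= $avg;
    splice @top, $i, 0, [$item, $avg];    # insert in place
    pop @top if @top > $n;                # keep only the best $n
}

print "$_->[0]: $_->[1]\n" for @top;

The hash still has one entry per unique item, but each value is an 8-byte
packed string instead of a nested hash, and the final pass never builds a
second hash or a sorted list of every key.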
David
On Sun, Apr 17, 2022, 5:33 AM wilson <info@bigcount.xyz> wrote:
> hello the experts,
>
> can you help check my script for how to optimize it?
> currently it was going as "run out of memory".
>
> $ perl count.pl
> Out of memory!
> Killed
>
>
> My script:
> use strict;
>
> my %hash;
> my %stat;
>
> # dataset: userId, itemId, rate, time
> # AV056ETQ5RXLN,0000031887,1.0,1397692800
>
> open HD,"rate.csv" or die $!;
> while(<HD>) {
> my ($item,$rate) = (split /\,/)[1,2];
> $hash{$item}{total} += $rate;
> $hash{$item}{count} +=1;
> }
> close HD;
>
> for my $key (keys %hash) {
> $stat{$key} = $hash{$key}{total} / $hash{$key}{count};
> }
>
> my $i = 0;
> for (sort { $stat{$b} <=> $stat{$a}} keys %stat) {
> print "$_: $stat{$_}\n";
> last if $i == 99;
> $i ++;
> }
>
> The purpose is to aggregate and average each itemId's scores, and print
> the result after sorting.
>
> The dataset has 80+ million items:
>
> $ wc -l rate.csv
> 82677131 rate.csv
>
> And my memory is somewhat limited:
>
> $ free -m
>                total        used        free      shared  buff/cache   available
> Mem:            1992         152          76           0        1763        1700
> Swap:           1023         802         221
>
>
>
> What confuses me is that Apache Spark can get this job done with this
> limited memory: it finished the statistics within 2 minutes. But I want
> to give perl a try, since it's not always convenient to run a spark job.
>
> The spark implementation:
>
> scala> val schema="uid STRING,item STRING,rate FLOAT,time INT"
> val schema: String = uid STRING,item STRING,rate FLOAT,time INT
>
> scala> val df =
> spark.read.format("csv").schema(schema).load("skydrive/rate.csv")
> val df: org.apache.spark.sql.DataFrame = [uid: string, item: string ...
> 2 more fields]
>
> scala> df.groupBy("item").agg(avg("rate").alias("avg_rate")).orderBy(desc("avg_rate")).show()
> +----------+--------+
> |      item|avg_rate|
> +----------+--------+
> |0001061100| 5.0|
> |0001543849| 5.0|
> |0001061127| 5.0|
> |0001019880| 5.0|
> |0001062395| 5.0|
> |0000143502| 5.0|
> |000014357X| 5.0|
> |0001527665| 5.0|
> |000107461X| 5.0|
> |0000191639| 5.0|
> |0001127748| 5.0|
> |0000791156| 5.0|
> |0001203088| 5.0|
> |0001053744| 5.0|
> |0001360183| 5.0|
> |0001042335| 5.0|
> |0001374400| 5.0|
> |0001046810| 5.0|
> |0001380877| 5.0|
> |0001050230| 5.0|
> +----------+--------+
> only showing top 20 rows
>
>
> I think it should be possible to optimize my perl script to run this
> job as well, so I'm asking for your help.
>
> Thanks in advance.
>
> wilson