Re: Please help: perl run out of memory
From: M.N Thanishka sree Manikandan
Date: April 21, 2022 11:56
Subject: Re: Please help: perl run out of memory
Message ID: CALrEMA6mf1ihH_t7JyDDzUhy3s4uf=eXKQ9AZFPnGKbZZELqbw@mail.gmail.com
Hi wilson,
Try the File::Slurp module.
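A rough, untested sketch of what that could look like (worth noting: File::Slurp's read_file pulls the whole file into memory at once, so with an 82-million-line CSV, watch the 2 GB limit):

use strict;
use warnings;
use File::Slurp qw(read_file);

# in list context read_file returns one element per line;
# chomp => 1 strips the trailing newlines
my @lines = read_file('rate.csv', chomp => 1);

my %hash;
for my $line (@lines) {
    my ($item, $rate) = (split /,/, $line)[1, 2];
    $hash{$item}{total} += $rate;
    $hash{$item}{count} += 1;
}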
Regards,
Manikandan
On Sun, 17 Apr, 2022, 15:03 wilson, <info@bigcount.xyz> wrote:
> Hello experts,
>
> Can you help check my script and suggest how to optimize it?
> Currently it dies with an out-of-memory error:
>
> $ perl count.pl
> Out of memory!
> Killed
>
>
> My script:
> use strict;
> use warnings;
>
> my %hash;
> my %stat;
>
> # dataset: userId, itemId, rate, time
> # AV056ETQ5RXLN,0000031887,1.0,1397692800
>
> open my $hd, '<', 'rate.csv' or die "can't open rate.csv: $!";
> while (<$hd>) {
>     my ($item, $rate) = (split /,/)[1, 2];
>     $hash{$item}{total} += $rate;
>     $hash{$item}{count} += 1;
> }
> close $hd;
>
> for my $key (keys %hash) {
>     $stat{$key} = $hash{$key}{total} / $hash{$key}{count};
> }
>
> # print the top 100 items by average rate
> my $i = 0;
> for (sort { $stat{$b} <=> $stat{$a} } keys %stat) {
>     print "$_: $stat{$_}\n";
>     last if $i == 99;
>     $i++;
> }
>
> The purpose is to aggregate and average the scores per itemId, and to
> print the top results after sorting.
>
> The dataset has 80+ million lines:
>
> $ wc -l rate.csv
> 82677131 rate.csv
>
> And my memory is somewhat limited:
>
> $ free -m
>                total        used        free      shared  buff/cache   available
> Mem:            1992         152          76           0        1763        1700
> Swap:           1023         802         221
>
>
>
> What confuses me is that Apache Spark can get this job done within this
> limited memory; it finished the statistics within 2 minutes. But I want
> to give Perl a try, since it's not always convenient to run a Spark job.
>
> The spark implementation:
>
> scala> val schema="uid STRING,item STRING,rate FLOAT,time INT"
> val schema: String = uid STRING,item STRING,rate FLOAT,time INT
>
> scala> val df = spark.read.format("csv").schema(schema).load("skydrive/rate.csv")
> val df: org.apache.spark.sql.DataFrame = [uid: string, item: string ... 2 more fields]
>
> scala> df.groupBy("item").agg(avg("rate").alias("avg_rate")).orderBy(desc("avg_rate")).show()
> +----------+--------+
> |      item|avg_rate|
> +----------+--------+
> |0001061100| 5.0|
> |0001543849| 5.0|
> |0001061127| 5.0|
> |0001019880| 5.0|
> |0001062395| 5.0|
> |0000143502| 5.0|
> |000014357X| 5.0|
> |0001527665| 5.0|
> |000107461X| 5.0|
> |0000191639| 5.0|
> |0001127748| 5.0|
> |0000791156| 5.0|
> |0001203088| 5.0|
> |0001053744| 5.0|
> |0001360183| 5.0|
> |0001042335| 5.0|
> |0001374400| 5.0|
> |0001046810| 5.0|
> |0001380877| 5.0|
> |0001050230| 5.0|
> +----------+--------+
> only showing top 20 rows
>
>
> I think it should be possible to optimize my Perl script to run this
> job as well, so I'm asking for your help.
>
> Thanks in advance.
>
> wilson
>
> --
> To unsubscribe, e-mail: beginners-unsubscribe@perl.org
> For additional commands, e-mail: beginners-help@perl.org
> http://learn.perl.org/
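On the script itself: if slurping alone doesn't help, another direction worth trying (a rough, untested sketch of mine, not code from this thread) is to drop the nested hash. Perl allocates a separate inner hash, with its own overhead, for every $hash{$item}, so two flat hashes keyed by item roughly halve the per-item bookkeeping while still streaming the file line by line:

use strict;
use warnings;

# two flat hashes: one scalar entry per item instead of one
# inner hash per item
my (%total, %count);

open my $hd, '<', 'rate.csv' or die "can't open rate.csv: $!";
while (<$hd>) {
    my ($item, $rate) = (split /,/)[1, 2];
    $total{$item} += $rate;
    $count{$item}++;
}
close $hd;

# turn the sums into averages in place, avoiding a third hash
$total{$_} /= $count{$_} for keys %total;

# print the 100 items with the highest average rate
my $i = 0;
for (sort { $total{$b} <=> $total{$a} } keys %total) {
    print "$_: $total{$_}\n";
    last if ++$i == 100;
}

Whether this fits in 2 GB depends on how many distinct itemIds there are; if it still doesn't, tying the hashes to disk (e.g. with DB_File) or pre-aggregating with sort(1) is the usual next step.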