how to average, max based on group time

0

need your help , i want grouping timestamp based on minutes and value average and max

my data below :

2019-12-27 12:01:00,047 3
2019-12-27 12:01:00,079 34
2019-12-27 12:02:00,091 4
2019-12-27 12:02:00,091 5
2019-12-27 12:03:00,099 3
2019-12-27 12:04:00,101 8

my expected result :

timestamp        average max
2019-12-27 12:01 18.5     34
2019-12-27 12:02 4.5      5 
2019-12-27 12:03 3        3
2019-12-27 12:04 8        8

could you help my case above, thanks for help

Fajar Hadiyanto

Posted 2019-12-27T15:12:55.400

Reputation: 1

https://stackoverflow.com/q/59504156/1030675 – choroba – 2019-12-27T17:50:10.500

Answers

1

A simple python program would do the thing.

import sys
from statistics import mean
time_to_value_dict = {}
# usage guide : python3 test-awk.py /Users/a/Desktop/example.js
with open(sys.argv[1]) as f:
    for line in f:
        parts = line.split()
        second_field = parts[1]
        third_field = int(parts[2])
        # to take the required value from the second field by slicing the string
        trimmed_key = second_field[0:5]
        final_key = parts[0] + " " + trimmed_key
        arr = time_to_value_dict.get(final_key,[])
        arr.append(third_field)
        time_to_value_dict[final_key] = arr
    print("timestamp        average          max")
    list_to_print = []
    for item in time_to_value_dict:
        value_from_dict = time_to_value_dict.get(item)
        average_val = mean(value_from_dict)
        max_val = max(value_from_dict)
        temp_list = []
        temp_list.append(item)
        temp_list.append(str(average_val))
        temp_list.append(str(max_val))
        list_to_print.append(temp_list)
        temp_list = []
    col_width = max(len(word) for row in list_to_print for word in row) + 1  # padding
    for row in list_to_print:
        print ("".join(word.ljust(col_width) for word in row))

Note the following in order to run this script

  1. The name of the python script is test-awk.py
  2. Save your input in a file and name that as example.js . Pass the file path as an argument to this script.

  3. I have used the standard statistics module to find the mean and the max.

Hope this helps.

Ajay Kr Choudhary

Posted 2019-12-27T15:12:55.400

Reputation: 11

0

Perl to the rescue!

< data.txt perl -lne '
    ($time, $num) = /(.*),.* (.*)/;
    if ($time ne $previous_time && $count) {
        print join " ", $previous_time, $sum / $count, $max;
        ($sum, $count, $max) = (0, 0, 0);
    }
    $max = $num if $num > $max;
    $sum += $num;
    ++$count;
    $previous_time = $time;
    END { print join " ", $previous_time, $sum / $count, $max; }'
  • -n reads the input line by line
  • -l removes newlines from input and adds them to input
  • if the time without the part after a comma is not the same as the previous one, we report the average and maximum for the previous time and reset the variables that keep them. We always populate the max, sum, count, and previous time.

This only works if the date for the same time are adjacent. If not, you need a hash to gather the data for each timestamp:

< data.txt perl -MList::Util=max,sum -lne '
    ($time,$num) = /(.*),.* (.*)/;
    push @{ $by_time{$time} }, $num;
    END { print "timestamp average max";
          for (sort keys %by_time) {
              @nums = @{ $by_time{$_} };
              print join " ", $_, sum(@nums) / @nums, max(@nums);
          }
    }'
  • -M loads a module, see List::Util for the particular one
  • Each time wihtout the part after the comma is used as a key in a hash %by_time. We store an array reference to each key, the array contains all the numbers.

choroba

Posted 2019-12-27T15:12:55.400

Reputation: 14 741

hi choroba..thanks a lot for reply but result became second not minutes ? timestamp average max 2019-12-27 13:00:00 32.835 1692 2019-12-27 13:00:01 28.2946859903382 1167 2019-12-27 13:00:02 19.0829694323144 348 2019-12-27 13:00:03 10.5258964143426 298 2019-12-27 13:00:04 21.4375 307 2019-12-27 13:00:05 10.8564102564103 214 2019-12-27 13:00:06 59.6923076923077 1090 2019-12-27 13:00:07 442.74025974026 1929 2019-12-27 13:00:08 652.200607902736 2215 – Fajar Hadiyanto – 2019-12-27T17:40:30.013

@FajarHadiyanto: Sorry, my mistake. It should be rather simple to fix it :-) – choroba – 2019-12-27T17:48:31.877

Just replace the comma with the colon in the regex. – choroba – 2019-12-27T17:48:50.840

great choroba...thanks for your help... – Fajar Hadiyanto – 2019-12-27T18:04:21.703

"thank you" = upvote/accept. – choroba – 2019-12-27T18:24:12.837

0

A solution with awk

awk -F'[: ,]' '
{
        for (i = 0; i <= 59; i++) {
                sprintf("%02d", i)
                if ($3 == i) {
                        line[i] = $1 " " $2 ":" $3
                        count[i]++
                        sum[i] += $6
                        max[i] = max[i] < $6 ? $6 : max[i]
                        next
                }
        }
}

END {
        print "timestamp", "average", "max"
        for (i in count) {
                print line[i], sum[i] / count[i], max[i]
        }
}' file

The output isn't formatted, could be done with printf.

(original command line idea, pretty print with -o option)

awk -F'[: ,]' '{for (i=0;i<=59;i++) {sprintf("%02d",i); if ($3 == i) {line[i]=$1" "$2":"$3;count[i]++;sum[i]+=$6;max[i]=max[i]<$6?$6:max[i];next};};};END{print "timestamp","average","max";for (i in count) print line[i],sum[i]/count[i],max[i]}'

Paulo

Posted 2019-12-27T15:12:55.400

Reputation: 606