Creating CDF data using bash, awk, or Perl


We have some data like this:

12 0
13 0 
20 0
25 1
64 4
77 1
89 100
1201 204

I'd like to get this output:

3 0
5 1
6 4
7 100
8 204

Explanation: we have 3 ASes (autonomous systems) with a degree of 0, then 5 ASes with a degree of 1 or less, then 6 ASes with a degree of 4 or less, and so on.

There are a lot of lines (100,000). I guess this is the CDF of a distribution. It is part of parsing bgpdump data, and I want to calculate those numbers. Thanks again for your help.

Arash

Posted 2012-12-20T15:33:46.943

Reputation: 678

Can you define the transformation you want to apply to the data? Looking at the example, I can't tell what you're trying to do. – Aaron Miller – 2012-12-20T15:38:16.037

OK, well: we have 3 ASes (autonomous systems) with a degree of 0, then 5 ASes with a degree of 1 or less, and so on. – Arash – 2012-12-20T15:41:26.540

This is an example of a cumulative distribution function (CDF). – Arash – 2012-12-20T15:43:04.577

So it sounds like column 2 in the data is the degree, and what you're looking for is a result with the degree in column 2 and the number of systems of that degree (or less) in column 1. See answer below -- – Aaron Miller – 2012-12-20T15:46:57.500

I want this: we have 3 ASes (autonomous systems) with a degree of 0, then 5 ASes with a degree of 1 or less, then 6 ASes with a degree of 4 or less, and so on. – Arash – 2012-12-20T15:49:27.027

Yep, you're right ;) – Arash – 2012-12-20T15:55:32.473

Answers


Here's a quick Perl script which should do the job for you:

#!/usr/bin/perl
use strict;
use warnings;
my %result;
my @data;
my %data;
my @degrees;
my $infile = shift() || die "Usage: $0 <file>\n";

# Read source data from input file
open my $in, '<', $infile
    or die "Couldn't open data file: $!\n";
while (my $line = <$in>) { chomp $line; push @data, $line; };
close $in;

# Convert data lines to hash (degree => number of systems with that degree)
foreach my $line (@data) {
    # split ' ' ignores leading whitespace; the first column is not needed
    my (undef, $degree) = split(' ', $line);
    next unless defined $degree;    # skip blank or one-column lines
    $data{$degree}++;
};

# Get sorted degrees for count-up iteration
@degrees = sort { $a <=> $b } keys %data;

# Iterate degrees in ascending order, adding each one's system count to the
# result for this degree and all higher degrees (giving cumulative counts)
for (my $i = 0; $i < scalar(@degrees); $i++) {
    my $degree = $degrees[$i];
    my $count = $data{$degree};
    for (my $j = $i; $j < scalar(@degrees); $j++) {
        $result{$degrees[$j]} += $count;
    };
};

# Output result counts
foreach my $degree (sort { $a <=> $b } keys %result) {
    print "$result{$degree} $degree\n";
};

This script will require considerable memory for large input datasets: it slurps the entire input file before operating on it, because the input file doesn't appear to be sorted, and the degrees have to be sorted before the cumulative counts can be computed. That said, it should do the job for you pretty well. Let me know if it doesn't!
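
If memory does become a concern, a lower-memory variant could look roughly like the sketch below. It is only an illustration, not the script above: the function name cdf is made up, and it assumes whitespace-separated input with the degree in column 2. The first awk pass keeps one counter per distinct degree instead of every input line, sort -n orders the degrees, and the second awk pass keeps the running total.

```shell
#!/bin/bash
# Hypothetical helper (name is an assumption): prints
# "<cumulative count> <degree>" for each distinct degree, lowest first.
cdf() {
    # Step 1 (awk): count how many systems have each degree (column 2).
    # Step 2 (sort): order the distinct degrees numerically.
    # Step 3 (awk): keep a running total of the per-degree counts.
    awk 'NF >= 2 { count[$2]++ } END { for (d in count) print d, count[d] }' "$@" |
        sort -n |
        awk '{ total += $2; print total, $1 }'
}

# Sample data from the question
sample='12 0
13 0
20 0
25 1
64 4
77 1
89 100
1201 204'

cdf_output=$(printf '%s\n' "$sample" | cdf)
printf '%s\n' "$cdf_output"
```

On a real file the function would be called as cdf yourfile (it reads standard input when no file is given).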

Aaron Miller

Posted 2012-12-20T15:33:46.943

Reputation: 8 849

I'm a beginner in Perl. How can I pass my file? Can I use: perl yourscript myfile ? – Arash – 2012-12-20T16:32:03.433

Yup -- just edited the answer so the script will accept a filename on the command line, so you can call it exactly that way. – Aaron Miller – 2012-12-20T16:38:27.323

Thank you so much! :-* :D – Arash – 2012-12-20T16:40:57.153

Glad to be of help! – Aaron Miller – 2012-12-20T16:42:25.137


Here's a quick 100% bash script that will do the job:

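a=()
# Count how many systems have each degree (second field of each line)
while read -r _ n; do
    [[ -n $n ]] && ((++a[n]))
done < datafile.txt
# Print the running total next to each degree
c=0
for i in ${!a[@]}; do
    echo "$((c+=a[i])) $i"
done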

If you want a script that you can call from a command line:

#!/bin/bash

a=()
# "$1" is the data file given on the command line
while read -r _ n; do
    [[ -n $n ]] && ((++a[n]))
done < "$1"
c=0
for i in ${!a[@]}; do
    echo "$((c+=a[i])) $i"
done

Or if you prefer a one-liner to impress your grand-mother:

a=(); while read -r _ n; do [[ -n $n ]] && ((++a[n])); done < datafile.txt; c=0; for i in ${!a[@]}; do echo "$((c+=a[i])) $i"; done

It runs in about 2-3 seconds on a Pentium dual core @ 2.6 GHz on a file with 100,000 lines.

Edit

Explanations:

The first loop:

  • We initialize a to be an empty array: a=()
  • We read the file datafile.txt line by line. There are two fields per line; only the second one is put in the variable named n (the first goes to the throwaway variable _).
  • If n is non-empty (this is the test [[ -n $n ]]), we increment the value of the n-th key of array a; that's what the line ((++a[n])) does. ((...)) is bash's arithmetic context.
  • After reading the whole file, we have an array a whose k-th element is exactly the number of autonomous systems that have a degree equal to k.
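
To see what the first loop leaves behind, here is a small sketch that runs it on the sample data from the question (fed through a here-document instead of datafile.txt, which is my own substitution for illustration) and then prints every key of a that is set:

```shell
#!/bin/bash
a=()
# Same counting loop as above, reading the question's sample data
while read -r _ n; do
    [[ -n $n ]] && ((++a[n]))
done <<'EOF'
12 0
13 0
20 0
25 1
64 4
77 1
89 100
1201 204
EOF

# Print each key that is set, together with its count
for k in "${!a[@]}"; do
    echo "a[$k]=${a[k]}"
done
# prints:
# a[0]=3
# a[1]=2
# a[4]=1
# a[100]=1
# a[204]=1
```

Only the degrees that actually occur get a key; bash indexed arrays are sparse, so a[2], a[3], a[5] and so on stay unset.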

Then the second loop:

  • Before the loop, variable c is set to 0.
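  • for i in ${!a[@]}; do will loop through all keys of array a. For an indexed array, bash expands the keys in increasing numeric order, so the degrees come out sorted without an explicit sort.
  • $((c+=a[i])) will add the value of a[i] to c and expand to the new value of c. This value is echoed, followed by the key i.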
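
As a rough trace of the second loop, the sketch below starts from the per-degree counts of the question's sample data, written out by hand as an array literal (that literal and the result variable are only there for illustration), and accumulates the running total:

```shell
#!/bin/bash
# Per-degree counts for the sample data: degree -> number of systems
a=([0]=3 [1]=2 [4]=1 [100]=1 [204]=1)

c=0
result=""
for i in "${!a[@]}"; do
    # c+=a[i] updates the running total; $(( )) expands to the new value
    result+="$((c+=a[i])) $i"$'\n'
done
printf '%s' "$result"
# prints:
# 3 0
# 5 1
# 6 4
# 7 100
# 8 204
```

After the loop, c holds 8, the total number of systems in the sample.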

Hope this helps!

gniourf_gniourf

Posted 2012-12-20T15:33:46.943

Reputation: 1 882

Could you please explain the code? Thanks – Arash – 2012-12-20T18:34:29.527

@arashams Added explanations. – gniourf_gniourf – 2012-12-20T18:53:37.507