26

I am looking to sort a list of domain names (a web filter whitelist) starting from the TLD and working upwards. I am looking for any *nix or Windows tools that can do this easily, though a script would be fine too.

So if this is the list you are given:

www.activityvillage.co.uk 
ajax.googleapis.com 
akhet.co.uk 
alchemy.l8r.pl 
au.af.mil 
bbc.co.uk 
bensguide.gpo.gov 
chrome.angrybirds.com 
cms.hss.gov 
crl.godaddy.com 
digitalhistory.uh.edu 
digital.library.okstate.edu 
digital.olivesoftware.com

This is what I want as the output.

chrome.angrybirds.com 
crl.godaddy.com 
ajax.googleapis.com 
digital.olivesoftware.com 
digital.library.okstate.edu 
digitalhistory.uh.edu 
bensguide.gpo.gov 
cms.hss.gov 
au.af.mil 
alchemy.l8r.pl 
www.activityvillage.co.uk 
akhet.co.uk 
bbc.co.uk

In case you are wondering why: Squidguard has a bug/design flaw. If www.example.com and example.com are both included in a list, the example.com entry is ignored and you can only visit content from www.example.com. I have several large lists that need some cleanup because someone added entries without looking first.
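
Once the list is grouped this way, the problem entries sit next to each other. As a rough sketch of the cleanup itself (not part of the original question; the function name `find_shadowed` and the sample data are made up for illustration), something like this could flag every entry whose parent domain is also in the list:

```python
def find_shadowed(domains):
    """Return (parent, child) pairs where both a domain and one of its
    subdomains appear in the list."""
    present = {d.strip().lower() for d in domains if d.strip()}
    pairs = []
    for name in sorted(present):
        labels = name.split(".")
        # Check every proper parent: a.b.c -> b.c, then c.
        for i in range(1, len(labels)):
            parent = ".".join(labels[i:])
            if parent in present:
                pairs.append((parent, name))
    return pairs


if __name__ == "__main__":
    # Hypothetical sample input, not taken from a real whitelist.
    sample = ["www.example.com", "example.com", "bbc.co.uk"]
    for parent, child in find_shadowed(sample):
        print(f"{child} is listed alongside its parent {parent}")
```

Which of the two entries to drop is a policy decision, so the sketch only reports the pairs.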

usef_ksa
Zoredache
  • Shouldn't the `com` domains appear before `edu` in your sorted list? – Sven Feb 28 '12 at 02:01
  • Yup, I fail at manual sorting, which is why I am looking for a tool. :) – Zoredache Feb 28 '12 at 02:04
  • Dupe of http://stackoverflow.com/questions/342590/sorting-domain-names ? – Mark Wagner Feb 28 '12 at 04:58
  • Also, the Python version there is nice compared to the Perl version because Python's sort works on lists of lists; Perl's sort doesn't, so the comparison had to be implemented by hand. – Mark Wagner Feb 28 '12 at 05:11
  • On a side note, this would be much more challenging if the OP had asked for the top-level domains according to Mozilla's public suffix list (https://publicsuffix.org/) to be handled as one block. I might at some point get around to doing it (it would be a nice-to-have for a project). Anyone else interested? – phk Jan 09 '16 at 19:03
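
For anyone curious what the public-suffix variant from the last comment might look like, here is a minimal sketch. It assumes a tiny hand-picked `SUFFIXES` set (hypothetical, not the real list) and ignores the wildcard and exception rules that the real Public Suffix List contains, so treat it as an illustration only:

```python
# Hypothetical, hand-picked suffixes; a real run would load the full list.
SUFFIXES = {"co.uk", "com", "edu", "gov", "mil", "pl"}


def sort_key(name, suffixes=SUFFIXES):
    labels = name.split(".")
    # Try the longest candidate first: a.b.c, then b.c, then c.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            # The suffix stays one unit; remaining labels follow, right to left.
            return [candidate] + labels[:i][::-1]
    # Unknown suffix: fall back to plain label-by-label ordering.
    return labels[::-1]


if __name__ == "__main__":
    names = ["bbc.co.uk", "www.example.com", "akhet.co.uk"]
    for name in sorted(names, key=sort_key):
        print(name)
```

With this key, everything under `co.uk` is grouped as one block instead of being split into `uk` and then `co`.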

7 Answers

19

This simple Python script will do what you want. In this example I name the file domain-sort.py:

#!/usr/bin/env python
from fileinput import input
for y in sorted([x.strip().split('.')[::-1] for x in input()]): print('.'.join(y[::-1]))

To run it use:

cat file.txt | ./domain-sort.py

Note that this looks a little ugly because I wrote it as more or less a one-liner. I had to use the slice notation [::-1], whose negative step makes a reversed copy of the list, instead of the more declarative reverse(), which reverses the list in place and returns None, so it cannot be composed inside an expression.
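
To make the difference concrete, here is a small standalone illustration (the sample name is arbitrary and not from the original answer):

```python
labels = "www.bbc.co.uk".split(".")

# Slicing with a step of -1 builds a new, reversed list, so it can be
# used directly inside a larger expression.
by_slice = labels[::-1]

# list.reverse() reverses the list in place and returns None, so its
# return value is useless inside a comprehension or a chained call.
in_place = list(labels)
result = in_place.reverse()

print(by_slice)
print(in_place, result)
```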

And here's a slightly longer, but maybe more readable version that uses reversed() which returns an iterator, hence the need to also wrap it in list() to consume the iterator and produce a list:

#!/usr/bin/env python
from fileinput import input
for y in sorted([list(reversed(x.strip().split('.'))) for x in input()]): print('.'.join(list(reversed(y))))

On a file with 1,500 randomly sorted lines it takes ~0.02 seconds:

Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.02
Maximum resident set size (kbytes): 21632

On a file with 150,000 randomly sorted lines it takes a little over 3 seconds:

Elapsed (wall clock) time (h:mm:ss or m:ss): 0:03.20
Maximum resident set size (kbytes): 180128

Here is an arguably more readable version that does the reverse() and sort() in-place, but it runs in the same amount of time, and actually takes slightly more memory.

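#!/usr/bin/env python
from fileinput import input

data = []
for x in input():
    # Split each name into labels and reverse them so the TLD comes first.
    d = x.strip().split('.')
    d.reverse()
    data.append(d)
# Lists of labels compare element by element, so this sorts TLD-first.
data.sort()
for y in data:
    # Restore the original label order before printing.
    y.reverse()
    print('.'.join(y))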

On a file with 1,500 randomly sorted lines it takes ~0.02 seconds:

Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.02
Maximum resident set size (kbytes): 22096

On a file with 150,000 randomly sorted lines it takes a little over 3 seconds:

Elapsed (wall clock) time (h:mm:ss or m:ss): 0:03.08
Maximum resident set size (kbytes): 219152
das Keks
aculich
  • I liked seeing lots of solutions. I am accepting the python based answer mostly because it is what I use for many of my other scripts. The other answers all seem to work also. – Zoredache Feb 28 '12 at 23:46
  • If anyone is interested in sorting by domain name first, ignoring TLD, use `data.sort(key=lambda x: x[1:])` – Calimo Dec 07 '16 at 09:47
10

Here's a PowerShell script that should do what you want. Basically it reads all the domain names into an array, splits each one on the dots and reverses the parts, sorts the result, reverses each entry back to its original order, and then saves the list to another file.

$TLDs = Get-Content .\TLDsToSort-In.txt
$TLDStrings = @();

foreach ($TLD in $TLDs){
    $split = $TLD.split(".")
    [array]::Reverse($split)
    $TLDStrings += ,$split
}

$TLDStrings = $TLDStrings|Sort-Object

foreach ($TLD in $TLDStrings){[array]::Reverse($TLD)}

$TLDStrings | %{[string]::join('.', $_)} | Out-File .\TLDsToSort-Out.txt

Ran it on 1,500 records - took 5 seconds on a reasonably powerful desktop.

Mark Henderson
  • It should be fairly simple to convert this script into bash or another language I think. – Mark Henderson Feb 28 '12 at 02:44
  • 5 seconds seems like a long time for only 1,500 lines. [My python implementation](http://serverfault.com/a/364372/93109) does 1,500 in a fraction of a second, and 150,000 in just over 3 seconds. What do you suppose makes it so slow in PowerShell? – aculich Feb 28 '12 at 08:19
  • Yeah it is a long time. I've no idea why it takes so long though. Probably because powershell isn't really aimed at doing things like this. – Mark Henderson Feb 28 '12 at 09:46
10

cat domain.txt | rev | sort | rev

user3721740
  • I guess that would work. I do like having the TLD be sorted though, and this wouldn't accomplish that. Using this, the TLDs in my example would be in the order (uk, mil, pl, com, edu, gov), since it is a simple right-to-left character sort rather than a sort on domain boundaries. – Zoredache Feb 04 '15 at 22:48
  • best answer i've seen! – Daniel Aug 18 '17 at 11:11
  • `rev domain.txt|sort|rev` – Rich Apr 18 '18 at 19:38
7

Slightly less cryptic, or at least prettier, Perl:

use warnings;
use strict;

my @lines = <>;
chomp @lines;

@lines =
    map { join ".", reverse split /\./ }
    sort
    map { join ".", reverse split /\./ }
    @lines;

print "$_\n" for @lines;

This is a simple example of a Guttman–Rosler transform: we convert the lines into the appropriate sortable form (here, split the domain name on periods and reverse the order of the parts), sort them using the native lexicographic sort and then convert the lines back to their original form.
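
For readers who don't speak Perl, the same encode, sort, decode idea can be sketched in Python. The function names `flip` and `sort_domains` are mine, not from the answer above:

```python
def flip(name):
    """Reverse the order of the dot-separated labels: a.b.c -> c.b.a."""
    return ".".join(reversed(name.split(".")))


def sort_domains(names):
    # Encode each name as a plain string key, sort with the default
    # string comparison, then decode by applying the same flip again.
    return [flip(key) for key in sorted(flip(n) for n in names)]


if __name__ == "__main__":
    for name in sort_domains(["bbc.co.uk", "au.af.mil", "crl.godaddy.com"]):
        print(name)
```

One caveat of sorting on the joined string: the dot takes part in the comparison as an ordinary character, and a hyphen sorts before a dot in ASCII. So `a.foo-bar.com` ends up before `a.foo.com` here, whereas a comparison on lists of labels would put `a.foo.com` first.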

Ilmari Karonen
7

In Unix scripting: reverse, sort and reverse:

awk -F "." '{for(i=NF; i > 1; i--) printf "%s.", $i; print $1}' file |
  sort |
  awk -F "." '{for(i=NF; i > 1; i--) printf "%s.", $i; print $1}'
jfg956
  • Similar thing with a single loop: `awk -F. '{for(i=NF;i>0;i--){printf ".%s",$i};printf "\t%s\n",$0}' file|sort|cut -f2` Might want to strip out local hosts first with `grep '\.' file | awk ...` – Rich Apr 18 '18 at 19:53
3

Here it is in (short and cryptic) perl:

#!/usr/bin/perl -w
@d = <>; chomp @d;
for (@d) { $rd{$_} = [ reverse split /\./ ] }
for $d (sort { for $i (0..$#{$rd{$a}}) {
        $i > $#{$rd{$b}} and return 1;
        $rd{$a}[$i] cmp $rd{$b}[$i] or next;
        return $rd{$a}[$i] cmp $rd{$b}[$i];
} } @d) { print "$d\n" }
Mark Wagner
  • Do you have timing information for this sort? I'm curious to see how this compares to [@Mark-Henderson's PowerShell implementation](http://serverfault.com/a/364331/93109), as well as [my Python implementation](http://serverfault.com/a/364372/93109). I used `/usr/bin/time -v` for the elapsed time and max memory stats. – aculich Feb 28 '12 at 08:21
  • Perl totally WINS at obfuscation. – Massimo Feb 28 '12 at 11:14
  • Describing a Perl script as "short and cryptic" is redundant. – Belmin Fernandez Feb 28 '12 at 12:31
  • @aculich, with the exception of the PowerShell script, all the options seem to take less than 0.1 seconds on my file. – Zoredache Feb 28 '12 at 23:45
0
awk -F"." 's="";{for(i=NF;i>0;i--) {if (i<NF) s=s "." $i; else s=$i}; print s}' <<<filename>>> | sort | awk -F"." 's="";{for(i=NF;i>0;i--) {if (i<NF) s=s "." $i; else s=$i}; print s}'

What this does is reverse the order of the fields in each domain name, sort, and reverse back.

This truly sorts the domain list, lexicographically based on each part of the domain-name, from right to left.

The rev solution (rev <<<filename>>> | sort | rev) does not; I've tried it.
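
A quick way to see the difference between the two approaches is to compare them side by side. This is only a sketch in Python rather than awk: the function names are made up, and `sort_by_chars` imitates `rev | sort | rev` under plain byte-order comparison (the shell `sort` may order differently depending on locale):

```python
def sort_by_labels(names):
    """Sort on dot-separated labels, right to left (TLD first)."""
    return sorted(names, key=lambda n: n.split(".")[::-1])


def sort_by_chars(names):
    """Imitate `rev | sort | rev`: sort on the character-reversed string."""
    return sorted(names, key=lambda n: n[::-1])


if __name__ == "__main__":
    sample = ["bbc.co.uk", "au.af.mil", "crl.godaddy.com", "cms.hss.gov"]
    print(sort_by_labels(sample))
    print(sort_by_chars(sample))
```

The label-based sort groups the sample as com, gov, mil, uk, while the character-based one orders by the last letter of each name first, giving uk, mil, com, gov. Using `key=` also avoids having to reverse every entry back after sorting.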