Domain Name Appraisal


Domain name trading is big business. One of the most useful tools for domain name trading is an automatic appraisal tool, so that you can easily estimate how much a given domain is worth. Unfortunately, many automatic appraisal services require a membership/subscription to use. In this challenge, you will write a simple appraisal tool that can roughly estimate the values of .com domains.

Input / Output

As input, your program should take a list of domain names, one per line. Each domain name will match the regex ^[a-z0-9][a-z0-9-]*[a-z0-9]$, meaning that it is composed of lowercase letters, digits, and hyphens. Each domain is at least two characters long and neither begins nor ends with a hyphen. The .com is omitted from each domain, since it is implied.
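The format can be checked against the challenge's regex directly. Here is a small validation sketch (the helper name is my own, not part of the challenge):

```perl
use strict;
use warnings;

# Returns 1 if the string is a valid domain under the challenge's rules:
# lowercase letters, digits, and hyphens; length >= 2; no hyphen at
# either end. (Helper name is illustrative, not part of the challenge.)
sub valid_domain {
  my ($d) = @_;
  return $d =~ /^[a-z0-9][a-z0-9-]*[a-z0-9]$/ ? 1 : 0;
}

print valid_domain("buy-bikes"), "\n";   # 1 (valid)
print valid_domain("-bad"), "\n";        # 0 (leading hyphen)
```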

As an alternative form of input, you can choose to accept a domain name as an array of integers, instead of a string of characters, as long as you specify your desired character-to-integer conversion.
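The challenge does not fix a particular mapping, so any submission using this form must state its own. As one hypothetical convention, letters could map to 0–25, digits to 26–35, and the hyphen to 36:

```perl
use strict;
use warnings;

# One possible character-to-integer convention (an assumption, not
# specified by the challenge): a-z => 0..25, 0-9 => 26..35, '-' => 36.
sub to_ints {
  my ($domain) = @_;
  return map {
      /[a-z]/ ? ord($_) - ord('a')
    : /[0-9]/ ? 26 + $_
    :           36
  } split //, $domain;
}

print join(",", to_ints("6d3")), "\n";   # 32,3,29
```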

Your program should output a list of integers, one per line, which gives the appraised prices of the corresponding domains.

Internet and Additional Files

Your program may have access to additional files, as long as you provide these files as part of your answer. Your program is also allowed to access a dictionary file (a list of valid words, which you don't have to provide).

(Edit) I have decided to expand this challenge to allow your program to access the internet. There are a couple of restrictions: your program cannot look up the prices (or price histories) of any domains, and it may only use pre-existing services (the latter to close some loopholes).

The only limit on total size is the answer size limit imposed by SE.

Example input

These are some recently-sold domains. Disclaimer: Although none of these sites seem malicious, I do not know who controls them and thus advise against visiting them.

6d3
buyspydrones
arcader
counselar
ubme
7483688
buy-bikes
learningmusicproduction

Example Output

These numbers are real.

635
31
2000
1
2001
5
160
1

Scoring

Scoring will be based on "difference of logarithms." For example, if a domain sold for $300 and your program appraised it at $500, your score for that domain is abs(ln(500)-ln(300)) = 0.5108. No domain will have a price less than $1. Your overall score is your average score across the set of domains; lower scores are better.

To get an idea what scores you should expect, simply guessing a constant 36 for the training data below results in a score of about 1.6883. A successful algorithm has a score less than this.

I chose to use logarithms because the values span several orders of magnitude, and the data will be filled with outliers. The use of absolute difference instead of difference squared will help reduce the effect of outliers in scoring. (Also, note that I am using the natural logarithm, not base 2 or base 10.)
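The per-domain score described above can be sketched as follows (the function name is mine; this is not the challenge's official scoring program):

```perl
use strict;
use warnings;

# Per-domain score: absolute difference of the natural logs of the
# appraised and actual prices. (Illustrative helper, not the
# challenge's official scorer.)
sub domain_score {
  my ($appraised, $actual) = @_;
  return abs(log($appraised) - log($actual));
}

# A $500 appraisal of a $300 sale scores abs(ln 500 - ln 300) ~= 0.5108.
printf "%.4f\n", domain_score(500, 300);
```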

Data Source

I have skimmed a list of over 1,400 recently sold .com domains from Flippa, a domain auction website. This data will make up the training data set. After the submission period is over, I will wait an additional month to create a test data set, with which the submissions will be scored. I might also choose to collect data from other sources to increase the size of the training/test sets.

The training data is available at the following gist. (Disclaimer: Although I have used some simple filtering to remove some blatantly NSFW domains, several might still be contained in this list. Also, I advise against visiting any domain you don't recognize.) The numbers on the right-hand side are the true prices. https://gist.github.com/PhiNotPi/46ca47247fe85f82767c82c820d730b5

Here is a graph of the price distribution of the training data set. The x-axis is the natural log of price, and the y-axis is the count. Each bar has a width of 0.5. The spikes on the left correspond to $1 and $6, since the source website requires bids to increment by at least $5. The test data may have a slightly different distribution.

(Image: histogram of ln(price) for the training data, bar width 0.5)

Here is a link to the same graph with a bar width of 0.2. In that graph you can see spikes at $11 and $16.

PhiNotPi

Posted 2016-07-04T23:56:01.413

Reputation: 26 739

Just a heads up, using the most common digraphs to score a domain works abysmally! R² ≅ 0 – None – 2016-07-05T17:29:00.370

Someone obviously needs to do a neural network answer for this. – user48538 – 2016-07-06T17:31:05.533

Can the program link to the internet (say, query Google)? Explicitly not for the purpose of looking up prices, of course, but for collecting data to use as a characteristic. – Joe – 2016-07-08T19:43:03.670

@Joe Sorry I haven't had a chance to respond, but I have decided to allow internet access. – PhiNotPi – 2016-07-11T17:52:28.643

Answers


Perl, 1.38605

I figured I should go ahead and post my own submission, in the hope that it spurs competition. Its score of 1.38605 means that it is typically off by a factor of 3.999 (that was my stopping point). I didn't use any machine learning libraries, just straight up Perl. It does require access to a dictionary; I used the one from here.

Please feel free to use some of the numbers/statistics from my program in your own.

use strict;
use warnings;

# Load the dictionary: one word per line, keyed lowercase.
my %dict;
my $dictname = "dict.txt";
open(my $dfh, '<', $dictname) or die "Can't open $dictname: $!";
while (my $row = <$dfh>) {
  chomp $row;
  $dict{lc $row} = 1;
}

my $domain = <>;   # appraise one domain per invocation
chomp($domain);

my $guess = 1;

# Categorize by character class, then refine by length.
if($domain =~ /^[a-z]*$/){                # letters only
  my @bylength = (200000,20001,401,45,45,41,26,26,26,26,26,24);
  if(length($domain) < ~~@bylength+2){    # ~~@bylength is the array length (scalar context)
    $guess *= $bylength[length($domain)-2];
  } else {
    $guess *= 18;
  }
} elsif ($domain =~ /^[0-9]*$/){          # digits only
  my @bylength = (300000,30001,6000,605,50);
  if(length($domain) < ~~@bylength+2){
    $guess *= $bylength[length($domain)-2];
  } else {
    $guess *= 7;
  }
} elsif ($domain =~ /^[a-z0-9]*$/){       # letters and digits
  my @bylength = (52300,523,28);
  if(length($domain) < ~~@bylength+2){
    $guess *= $bylength[length($domain)-2];
  } else {
    $guess *= 23;
  }
} else {                                  # contains a hyphen
  my @bylength = (50000,500,42,32,32,31);
  if(length($domain) < ~~@bylength+2){
    $guess *= $bylength[length($domain)-2];
  } else {
    $guess *= 12;
  }
}

my $wordfact = 1;

my $leftword = 0;   # length of the longest dictionary word prefixing the domain
for(my $i = 1; $i <= length($domain); $i++){
  my $word = substr $domain, 0, $i;
  if(exists($dict{$word})){
    $leftword = $i;
  }
}

# Quadratic fit on the fraction of the domain covered by the prefix word.
$wordfact *= ($leftword/length($domain))**2 * 0.8 + ($leftword/length($domain)) * -0.1 + 0.9;

if($leftword/length($domain) >= 0.8){
  $wordfact *= 2.4;   # bonus when the word covers most of the domain
}

my $rightword = 0;   # length of the longest dictionary word suffixing the domain
for(my $i = 1; $i <= length($domain); $i++){
  my $word = substr $domain, length($domain)-$i, $i;
  if(exists($dict{$word})){
    $rightword = $i;
  }
}

# Same quadratic idea for the suffix word, with different coefficients.
$wordfact *= ($rightword/length($domain))**2 * 0.9 + ($rightword/length($domain)) * -0.2 + 1;

$guess *= $wordfact;

my $charfact = 1;
my %charfacts = (
  i => 1.12, #500
  l => 0.84,
  s => 1.09,
  a => 0.94,
  r => 1.03,
  o => 0.97, 
  c => 1.22, #400
  d => 0.88,
  u => 1.07,
  t => 0.95,
  e => 1.08,
  m => 0.91, #300
  p => 1.08,
  y => 0.92,
  g => 0.97,
  ne => 0.56, #100
  n => 1.13,
  z => 0.67,
  re => 1.30,
  es => 0.75,
);
while(my ($key,$value) = each %charfacts){   # multiply in each substring's factor if present
  if($domain =~ /$key/){
    $charfact *= $value;
  }
}
$guess *= $charfact;

$guess = int($guess + 0.5);   # round to the nearest dollar
if($guess <= 0){
  $guess = 1;                 # no domain sells for less than $1
}

print $guess;

Here's a graph made by my scoring program, showing a scatter plot of appraised versus actual price and a histogram of the errors. In the scatter plot, the characters .:oO@ mean 10, 20, 30, 40, 50 domains at that point, respectively. In the histogram, each O represents 16 domains.

The scale is set at 1 character width = e^(1/3).

(Image: scatter plot of appraised vs. actual price, with error histogram)

There are three main steps to this program. The results from each step are multiplied together.

  1. Categorization by character class and length. It determines whether the domain is all letters, all numbers, letters and numbers, or contains a hyphen. It then gives a numerical value determined by the domain's length. I found that there is a strange dip in value around length 5. I suspect this is due to sampling: shorter domains are valuable due to their length (even if the letters are nonsense), while most longer domains tend to be words/phrases. In order to prevent overfitting, I put in a restriction that domains cannot be penalized for being shorter (so length 5 is at least as good as length 6).

  2. Evaluation of word content. I use the dictionary to determine the lengths of the left-hand and right-hand words in a domain name. For example, myawesomesite -> my & site -> 2 & 4. Then, I try to do some fitting based on what proportion of the domain name is made up by those words. Low values typically indicate that the domain does not contain a word, contains a pluralized/modified word not in the dictionary, contains a word surrounded by other characters (internal words are not detected, although I tried that with no improvement), or contains a several-word phrase. High values indicate that it is a single word or likely a two-word phrase.

  3. Evaluation of character content. I looked for substrings that were contained in a lot of domains and which appeared to affect the value of the domain. I believe this is caused by certain types of words being more popular/more attractive, for various reasons. For example, the letter i appeared in about half of the domains (741 of them), and bumps up domain value by about 12% on average. That's not overfitting; there's something real there that I don't fully understand. The letter l appears in 514 domains and has a factor of 0.84. Some of the less common letters/digraphs, like ne, which appeared 125 times and has a really low factor of 0.56, might be overfitting.

In order to improve upon this program, I would probably need to use machine learning of some kind. Also, I could look for relationships between length, word content, and character content to find better ways of combining those separate results into the overall appraisal value.

PhiNotPi

Posted 2016-07-04T23:56:01.413

Reputation: 26 739