Challenge Similarity Detector

11

1

Challenge

Given two question IDs, try to figure out how similar they are by looking at the answers.

Details

You will be given two question IDs for codegolf.stackexchange.com; you may assume that there exist questions for both IDs that are not deleted, but are not necessarily open. You must run through all of the answers and determine the minimum Levenshtein distance between the code in the answers to the two questions (not including deleted answers). That is, you should compare every answer in question 1 to every answer in question 2, and determine the minimum Levenshtein distance. To find the code in an answer, assume the following procedure:

How to find the code snippet

A body of text is the answer's actual code if it is in backticks and is on its own line, or if it is indented with 4 spaces, with an empty line above it, unless there is no text above.

Examples of valid and not-valid code snippets (with . as a space) (separated by a ton of equal signs)

This is `not a valid code snippet because it is not on its own line`
========================================
This is:
`A valid code snippet`
========================================
This is
....not a valid code snippet because there's no spacing line above
========================================
This is

....A valid code snippet because there's a spacing line above
========================================
....Valid code snippet because there's no other text
========================================

If there are no valid code snippets in the answer, ignore the answer completely. Note that you should only take the first codeblock.

Final Specs

The two question IDs can be inputted in any reasonable format for 2 integers. The output should be the smallest Levenshtein distance between any two valid answers from either challenge. If there are no "valid" answers for one or both of the challenges, output -1.

Test Case

For challenge 115715 (Embedded Hexagons) and 116616 (Embedded Triangles) both by Comrade SparklePony, the two Charcoal answers (both by KritixiLithos) had a Levenshtein distance of 23, which was the smallest. Thus, your output for 115715, 116616 would be 23.

Edit

You may assume that the question has at most 100 answers because of an API pagesize restriction. You should not ignore backticks in code blocks, only if the code block itself is created using backticks and not on its own line.

Edit

I terminated the bounty period early because I made a request to a mod to get a one-week suspension and I didn't want the bounty to be automatically awarded to the highest scoring answer (which happens to be the longest). If a new submission comes in or a submission is golfed enough to become shorter than 532 bytes before the actual end of the bounty period (UTC 00:00 on Jun 1), I will give that a bounty to stay true to my promise, after the suspension expires. If I remember correctly, I need to double the bounty period next time so if you do get an answer in, you might get +200 :)

HyperNeutrino

Posted 2017-04-24T06:10:01.820

Reputation: 26 575

1I'm confused by what counts as a valid code snippet. Why not just whatever's in <code> tags in the html? – Calvin's Hobbies – 2017-04-24T06:24:13.760

@HelkaHomba What about the newline restrictions? I could try to find another way to incorporate those. – HyperNeutrino – 2017-04-24T06:26:22.900

@HelkaHomba Essentially, if the answer contains backtick-delimited code within a line, it should be ignored. – HyperNeutrino – 2017-04-24T06:32:35.323

This is one of those answers, where it's easier to do the main part of the question. Downloading the page and extracting the code blocks is harder than doing the levenshtein distance. – Bálint – 2017-04-24T12:05:13.467

@Bálint I'm not quite sure I agree with you on what the main part of the question is. I don't think the main part is the levenshtein distance, because most languages have a built-in for that which I did not and will not disallow. I think that the main part of the challenge is to get the answers, find the code blocks, and iterate through them, which, as you said, is the harder part of the challenge, also being the "main part" in my opinion. – HyperNeutrino – 2017-04-24T12:12:07.647

I am not clear on part of the capturing code blocks part. So on this q the JS answer has backticks in the code to that can be ignored yes? The JS code has large amounts of whitespace in it as well. For the haskell answer I should have to match the whole code block even though it is on more than one line? Wanted to know for sure how to deal with multiline code block answers. FYI since I use PowerShell I spent a bit rolling my own LD function since PowerShell does not have one. That part was fun though.

– Matt – 2017-05-29T02:34:44.393

@Matt The backticks should not be ignored. It's only if the code formatting is caused by backticks should you ignore the codeblock. You need to match the whole Haskell answer's code block because it is a proper codeblock. – HyperNeutrino – 2017-05-29T02:37:30.410

1Cool. Just checking. – Matt – 2017-05-29T02:51:14.377

Is there any special rules for questions that have more than 100 answers? Are we always expected to check all answers of all questions ids passed? Guaranteed perhaps to have question under a certain threshold? – Matt – 2017-05-29T13:45:30.710

@Matt Just wondering, how does having >100 answers change anything? – HyperNeutrino – 2017-05-29T13:56:16.257

For me I am using the StackExchange API to query answers and their content. The max pagesize is 100, default being 30. So if there are more that 100 I need to send a query again with another page. If I don't use the API it gets even harder. – Matt – 2017-05-29T13:57:55.327

@Matt Then I will say you can assume the question has at most 100 answers; I did not know about that. Thanks! – HyperNeutrino – 2017-05-29T14:10:14.833

Here is a relevant reference to the docs https://api.stackexchange.com/docs/paging

– Matt – 2017-05-29T14:13:54.720

My LD calculations might be flawed. Is there an online resource you are using for LD values couple I found hate unicode? I knew nothing about LD until I started looking at this challenge. Does yours use deletions, substitutions and transpositions? I don't think mine does the latter so I got 23. – Matt – 2017-05-29T15:36:34.997

@Matt What are transpositions? I believe mine only uses deletions, substitutions, and insertions. – HyperNeutrino – 2017-05-29T15:52:30.180

Mine supports those as well. I read a discussion for on online calculator that could have supported swapping neighboring character positions. Your LD was lower in your example so I was curious. I made my function based on this page. It worked fine for the other examples online I found so I had assumed I got that part of the challenge done. That is why I wanted to know what you did to get yours.

– Matt – 2017-05-29T16:07:44.517

I got 23 for these FN«AX²ιβ×__β↓↘β←↙β↑←×__β↖β→↗β to NαWα«X²ι↙AX²⁻ι¹β↙β↑↖β→A⁻α¹α. Sorry for all the comments. – Matt – 2017-05-29T16:08:12.630

hm... should I write another java beast? – tuskiomi – 2017-05-29T16:23:21.403

@tuskiomi If nobody else answers I really have no choice but to give it to you as long as it's an actualy answer ;P So go ahead :P – HyperNeutrino – 2017-05-29T16:26:03.407

@Matt I don't know what I was doing. You're right; it is 23. Sorry for the confusion. :P Thanks for catching that; I've updated. – HyperNeutrino – 2017-05-29T16:29:19.993

I'm really hoping you will expand more on what a valid code block is for the purposes of this challenge. Take for example Q97752 and the last answer in Python 2.7. I see a valid code block 7 lines long. Each line is indented, but that doesn't seem to meet your criteria? I also see you've answered this question for Matt. please add more to the body of the OP. It really is not very clear. Especially the sentence which begins, "A body of text is the answer's actual code...". To me, the body of text is much more than the code. Can you please clarify for us all. – Octopus – 2017-05-30T21:09:48.807

@Octopus How is it invalid? The answer meets my criteria as it has an empty line above it and no text below it. I will add my comments into the question. – HyperNeutrino – 2017-05-30T21:15:46.873

Answers

1

PowerShell, 532 Bytes

$1,$2=$args
$a={irm "api.stackexchange.com/2.2/questions/$args/answers?pagesize=100&site=codegolf&filter=!9YdnSMKKT"|% i*}
$r={$args.body-replace"(?sm).*?^(<pre.*?>)?<code>(.*?)</code>.*",'$2'}
$1=&$a $1;$2=&$a $2
(0..($1.count-1)|%{
    $c=&$r $1[$_]
    0..($2.count-1)|%{
        &{$c,$d=$args;$e,$f=$c,$d|% le*;$m=[object[,]]::new($f+1,$e+1);0..$e|%{$m[0,$_]=$_};0..$f|%{$m[$_,0]=$_};1..$e|%{$i=$_;1..$f|%{$m[$_,$i]=(($m[($_-1),$i]+1),($m[$_,($i-1)]+1),($m[($_-1),($i-1)]+((1,0)[($c[($i-1)]-eq$d[($_-1)])]))|sort)[0]}};$m[$f,$e]} $c $d
    }
}|sort)[0]

I left newlines in there for some readability. The are still reflected in my byte count.

Pretty sure I have a handle on this. The tough part for me was actually getting the Levenshtein distance since PowerShell does not have a builtin for that as far as I know. Because of that I was able to answer the related challenge on Levenshtein distance. When my code refers to an anonymous function for LD you can refer to that answer for a more detailed explanation as to how that works.

Code with comments and progress indicator

The code can get real slow (because of the LD) so I built in some progress indicators for myself so I could follow the action as it unfolded and not assume it was stuck in a loop somewhere. The code for monitoring progress is not in the top block nor counted in my byte count.

# Assign the two integers into two variables. 
$1,$2=$args

# Quick function to download up to 100 of the answer object to a given question using the SE API
$a={irm "api.stackexchange.com/2.2/questions/$args/answers?pagesize=100&site=codegolf&filter=!9YdnSMKKT"|% i*}

# Quick function that takes the body (as HTML) of an answer and parses out the likely codeblock from it. 
$r={$args.body-replace"(?sm).*?^(<pre.*?>)?<code>(.*?)</code>.*",'$2'}

# Get the array of answers from the two questions linked.
$1=&$a $1;$2=&$a $2

# Hash table of parameters used for Write-Progress
# LD calcuations can be really slow on larger strings so I used this for testing so I knew 
# how much longer I needed to wait.
$parentProgressParameters = @{
    ID = 1 
    Activity = "Get LD of all questions" 
    Status = "Counting poppy seeds on the bagel"
}

$childProgressParameters = @{
    ID = 2
    ParentID = 1
    Status = "Progress"
}


# Cycle each code block from each answer against each answer in the other question.
(0..($1.count-1)|%{
    # Get the code block from this answer
    $c=&$r $1[$_]

    # Next line just for displaying progress. Not part of code. 
    Write-Progress @parentProgressParameters -PercentComplete (($_+1) / $1.count * 100) -CurrentOperation "Answer $($_+1) from question 1"

    0..($2.count-1)|%{
        # Get the code block from this answer   
        $d=&$r $2[$_]

        # Next two lines are for progress display. Not part of code. 
        $childProgressParameters.Activity = "Comparing answer $($_+1) of $($2.count)"
        Write-Progress @childProgressParameters -PercentComplete (($_+1) / $2.count * 100) -CurrentOperation "Answer $($_+1) from question 2"

        # Anonymous function to calculate Levenstien Distance
        # Get a better look at that function here: https://codegolf.stackexchange.com/a/123389/52023
        &{$c,$d=$args;$e,$f=$c,$d|% le*;$m=[object[,]]::new($f+1,$e+1);0..$e|%{$m[0,$_]=$_};0..$f|%{$m[$_,0]=$_};1..$e|%{$i=$_;1..$f|%{$m[$_,$i]=(($m[($_-1),$i]+1),($m[$_,($i-1)]+1),($m[($_-1),($i-1)]+((1,0)[($c[($i-1)]-eq$d[($_-1)])]))|sort)[0]}};$m[$f,$e]} $c $d
    }
# Collect results and sort leaving the smallest number on top.
}|sort)[0]

My logic for finding the code blocks is to take answer as HTML and look for a code tag set, optionally surrounded by a pre tag set that starts on its own line. In testing it found all the correct data on 6 different question sets.

I tried to work from the markdown code but it was too difficult to find the right code block.

Sample runs

Challenge-Similarity-Detector 97752 122740
57

Challenge-Similarity-Detector 115715 116616
23

Matt

Posted 2017-04-24T06:10:01.820

Reputation: 1 075

I have spent the better part of 3 days looking into this. This challenge is in my top 5 for most fun attempting. TFTC(Thanks for the challange) – Matt – 2017-05-30T00:35:56.467

Nice job! Thanks, I'm glad you enjoyed it! :) – HyperNeutrino – 2017-05-30T00:59:15.620

Note: I awarded the bounty earlier than stated because I am requesting a suspension so I cannot award it any later. Good job! :) – HyperNeutrino – 2017-05-30T21:19:44.077

Requesting a suspension? – Matt – 2017-05-30T21:34:05.573

Yes, I asked Dennis to give me a 1-week suspension so I can focus on schoolwork. It's been done before (though I'm still here... I don't know when I'll disappear). – HyperNeutrino – 2017-05-30T21:40:37.337

3

Java + Jsoup, 1027 bytes

The first two arguments are the question IDs.

Golfed:

import org.jsoup.*;import org.jsoup.nodes.*;class M{String a1[]=new String[100],a2[]=new String[100],c[];int i1=0,i2=0;public static void main(String a[])throws Exception{String r="https://codegolf.stackexchange.com/questions/";M m=new M();m.c=m.a1;m.r(Jsoup.connect(r+a[0]).get());m.c=m.a2;m.r(Jsoup.connect(r+a[1]).get());int s=m.ld(m.a1[1],m.a2[1]);for(int i=2;i<m.a1.length;i++)for(int j=2;j<m.a2.length;i++){if(m.a1[i]==null)break;int d=m.ld(m.a1[i],m.a2[j]);if(d<s)s=d;}System.out.print(s);}void r(Document d){a:for(Element e:d.select("td")){for(Element p:e.select("pre")){ a(p.select("code").get(0).html());continue a;}}}void a(String d){c[c==a1?i1++:i2++]=d;}int ld(String a,String b){a=a.toLowerCase();b=b.toLowerCase();int[]costs=new int[b.length()+1];for(int j=0;j<costs.length;j++)costs[j]=j;for(int i=1;i<=a.length();i++){costs[0]=i;int nw=i-1;for(int j=1;j<=b.length();j++){int cj=Math.min(1+Math.min(costs[j],costs[j-1]),a.charAt(i-1)==b.charAt(j-1)?nw:nw+1);nw=costs[j];costs[j]=cj;}}return costs[b.length()];}}

Readable:

import org.jsoup.*;import org.jsoup.nodes.*;

class M {
    String a1[]=new String[100],a2[]=new String[100],c[];
    int i1=0,i2=0;
    public static void main(String a[])throws Exception{
    String r="https://codegolf.stackexchange.com/questions/";
    M m=new M();

    m.c=m.a1;
    m.r(Jsoup.connect(r+a[0]).get());
    m.c=m.a2;
    m.r(Jsoup.connect(r+a[1]).get());

    int s=m.ld(m.a1[1],m.a2[1]);
    for(int i=2;i<m.a1.length;i++)for(int j=2;j<m.a2.length;i++){if(m.a1[i]==null)break;int d=m.ld(m.a1[i],m.a2[j]);if(d<s)s=d;}
    System.out.print(s);
}

void r(Document d) {
    a:for(Element e:d.select("td")) {for(Element p:e.select("pre")) { 
        a(p.select("code").get(0).html());
        continue a;
    }}
}

void a(String d){c[c==a1?i1++:i2++]=d;}

int ld(String a, String b) {
    a = a.toLowerCase();
    b = b.toLowerCase();
    int [] costs = new int [b.length() + 1];
    for (int j = 0; j < costs.length; j++)costs[j] = j;
    for (int i = 1; i <= a.length(); i++) {
        costs[0] = i;
        int nw = i - 1;
        for (int j = 1; j <= b.length(); j++) {
            int cj = Math.min(1 + Math.min(costs[j], costs[j - 1]), a.charAt(i - 1) == b.charAt(j - 1) ? nw : nw + 1);
            nw = costs[j];
            costs[j] = cj;
        }
    }
    return costs[b.length()];
}

}

Tomahawk2001913

Posted 2017-04-24T06:10:01.820

Reputation: 31

beat me to it!!!! Nice! – tuskiomi – 2017-05-29T21:37:05.020

1Welcome to PPCG! It's not against the rules to use a third-party library, but we require that the use of the library is noted with the language (so a Java answer that uses a library called JavaHTML would be labelled "Java + JavaHTML"). – Mego – 2017-05-29T21:42:54.250

Ok, thanks! I'll keep that in mind for next time! – Tomahawk2001913 – 2017-05-29T21:45:25.690

There is nothing stopping you from using a library in this challenge if you wanted to. – Matt – 2017-05-30T01:00:51.367

I might have to now that somebody has topped my answer! – Tomahawk2001913 – 2017-05-30T01:45:20.803

1709 bytes is huge for code golf entries. Take a peek at some of the other questions and their answers. 99% can be done in under 20 characters (for some languages) with answers in general edging in on 0 characters (some puzzles and some languages this is actually possible). Average I'd say is around 9 to 20 bytes for a golfing language (Java is not a golfing language). E.g. Jelly. The longest answer I've seen in Jelly has been in the neighborhood of 75 bytes.

– Draco18s no longer trusts SE – 2017-05-30T21:16:23.853

It seemed like a good idea at the time when nobody else had answered. I may look into a golfing language such as Jelly in the future. – Tomahawk2001913 – 2017-05-30T23:10:28.333

0

Mathematica, 540 bytes

f=Flatten;l=Length;P=StringPosition;(H[r_]:=Block[{s,a,t,k},t={};d=1;k="https://codegolf.stackexchange.com/questions/"<>r;s=First/@P[Import[k,"Text"],"<pre><code>"];a=f[First/@P[Import[k,"Text"],"answerCount"]][[1]];While[d<l@s,If[s[[d]]>a,AppendTo[t,s[[d]]]];d++];Table[StringDelete[StringCases[StringTake[Import[k,"Text"],{t[[i]],t[[i]]+200}],"<pre><code>"~~__~~"</code></pre>"],{"<pre><code>","</code></pre>"}],{i, l@t}]];Min@DeleteCases[f@Table[EditDistance[ToString@Row@H[#1][[i]],ToString@Row@H[#2][[j]]],{i,l@H[#1]},{j,l@H[#1]}],0])&


input

["115715", "116616"]

output

23

uses built-in EditDistance which "gives the edit or Levenshtein distance between strings or vectors u and v."

As for the test case mathematica

EditDistance["FN«AX²ιβ×__β↓↘β←↙β↑←×__β↖β→↗β","NαWα«X²ι↙AX²⁻ι¹β↙β↑↖β→A⁻α¹α"]

returns 23

I guess I can golf it a bit more
Takes some minutes to run

J42161217

Posted 2017-04-24T06:10:01.820

Reputation: 15 931