Challenge
Given two question IDs, try to figure out how similar they are by looking at the answers.
Details
You will be given two question IDs for codegolf.stackexchange.com; you may assume that there exist questions for both IDs that are not deleted, but are not necessarily open. You must run through all of the answers and determine the minimum Levenshtein distance between the code in the answers to the two questions (not including deleted answers). That is, you should compare every answer in question 1 to every answer in question 2, and determine the minimum Levenshtein distance. To find the code in an answer, assume the following procedure:
How to find the code snippet
A body of text is the answer's actual code if it is in backticks and is on its own line, or if it is indented with 4 spaces and has an empty line above it (the empty line is not required when there is no text above).
Examples of valid and not-valid code snippets (with . as a space) (separated by a ton of equal signs)
This is `not a valid code snippet because it is not on its own line`
========================================
This is:
`A valid code snippet`
========================================
This is
....not a valid code snippet because there's no spacing line above
========================================
This is

....A valid code snippet because there's a spacing line above
========================================
....Valid code snippet because there's no other text
========================================
If there are no valid code snippets in the answer, ignore the answer completely. Note that you should only take the first codeblock.
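As a rough illustration of the procedure above, here is a minimal Python sketch, assuming the answer is available as raw Markdown text. The name `first_code_snippet` is made up for this example, and ending an indented block at the first blank line is a simplification that the challenge does not spell out.

```python
from typing import Optional


def first_code_snippet(markdown: str) -> Optional[str]:
    """Return the first valid code snippet of an answer, or None if there is none."""
    lines = markdown.split("\n")
    for i, line in enumerate(lines):
        stripped = line.strip()
        if not stripped:
            continue
        if line.startswith("    "):
            # Indented code: valid only at the very top or right after an empty line.
            if i == 0 or not lines[i - 1].strip():
                block = []
                for following in lines[i:]:
                    # Assumption: the block ends at the first blank or unindented line.
                    if not following.startswith("    ") or not following.strip():
                        break
                    block.append(following[4:])
                return "\n".join(block)
        elif (len(stripped) > 2
              and stripped.startswith("`") and stripped.endswith("`")
              and "`" not in stripped[1:-1]):
            # Backtick code: valid only when the span is the whole line.
            return stripped[1:-1]
    return None
```

An answer for which this returns `None` would simply be skipped.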
Final Specs
The two question IDs can be inputted in any reasonable format for 2 integers. The output should be the smallest Levenshtein distance between any two valid answers from either challenge. If there are no "valid" answers for one or both of the challenges, output -1.
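The output rule can be sketched in ungolfed Python as follows. This assumes the snippets of each question have already been collected into two lists; `levenshtein` and `min_distance` are illustrative names, and the distance counts insertions, deletions, and substitutions only (no transpositions).

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, free on a match
            ))
        previous = current
    return previous[-1]


def min_distance(snippets_a: List[str], snippets_b: List[str]) -> int:
    """Smallest distance over every cross pair, or -1 if either list is empty."""
    if not snippets_a or not snippets_b:
        return -1
    return min(levenshtein(x, y) for x in snippets_a for y in snippets_b)
```

Keeping only two rows of the table at a time is a space-saving choice for the sketch; a full matrix gives the same result.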
Test Case
For challenges 115715 (Embedded Hexagons) and 116616 (Embedded Triangles), both by Comrade SparklePony, the two Charcoal answers (both by KritixiLithos) had a Levenshtein distance of 23, which was the smallest. Thus, your output for 115715, 116616 would be 23.
Edit
You may assume that the question has at most 100 answers because of an API pagesize restriction. You should not ignore backticks that appear inside a code block; ignore a snippet only if it is itself created using backticks and is not on its own line.
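For context on the page-size limit, here is a hedged Python sketch of how answers might be requested from the Stack Exchange API. The API version in `API_ROOT`, the helper names, and the `fetch_json` callable (which is expected to download a URL and return the decoded JSON) are assumptions for illustration. Note also that the built-in `withbody` filter returns the rendered HTML body, so applying the Markdown-based snippet rules would need a different filter or some extra handling.

```python
from typing import Callable, Dict, Iterator
from urllib.parse import urlencode

API_ROOT = "https://api.stackexchange.com/2.3"  # assumed API version


def answers_url(question_id: int, page: int = 1, pagesize: int = 100) -> str:
    """Build the URL for one page of answers to a codegolf question."""
    query = urlencode({
        "site": "codegolf",
        "page": page,
        "pagesize": pagesize,  # 100 is the documented maximum
        "filter": "withbody",  # include each answer's body in the response
    })
    return f"{API_ROOT}/questions/{question_id}/answers?{query}"


def all_answers(question_id: int,
                fetch_json: Callable[[str], Dict]) -> Iterator[Dict]:
    """Yield every answer, following pages while the API reports has_more."""
    page = 1
    while True:
        response = fetch_json(answers_url(question_id, page))
        yield from response.get("items", [])
        if not response.get("has_more"):
            break
        page += 1
```

Under the assumption of at most 100 answers, the first page is enough and the loop in `all_answers` runs once.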
Edit
I terminated the bounty period early because I made a request to a mod to get a one-week suspension and I didn't want the bounty to be automatically awarded to the highest scoring answer (which happens to be the longest). If a new submission comes in or a submission is golfed enough to become shorter than 532 bytes before the actual end of the bounty period (UTC 00:00 on Jun 1), I will give that a bounty to stay true to my promise, after the suspension expires. If I remember correctly, I need to double the bounty period next time so if you do get an answer in, you might get +200 :)
I'm confused by what counts as a valid code snippet. Why not just whatever's in <code> tags in the html? – Calvin's Hobbies – 2017-04-24T06:24:13.760
@HelkaHomba What about the newline restrictions? I could try to find another way to incorporate those. – HyperNeutrino – 2017-04-24T06:26:22.900
@HelkaHomba Essentially, if the answer contains backtick-delimited code within a line, it should be ignored. – HyperNeutrino – 2017-04-24T06:32:35.323
This is one of those answers, where it's easier to do the main part of the question. Downloading the page and extracting the code blocks is harder than doing the levenshtein distance. – Bálint – 2017-04-24T12:05:13.467
@Bálint I'm not quite sure I agree with you on what the main part of the question is. I don't think the main part is the levenshtein distance, because most languages have a built-in for that which I did not and will not disallow. I think that the main part of the challenge is to get the answers, find the code blocks, and iterate through them, which, as you said, is the harder part of the challenge, also being the "main part" in my opinion. – HyperNeutrino – 2017-04-24T12:12:07.647
I am not clear on part of the capturing code blocks part. So on this q the JS answer has backticks in the code to that can be ignored yes? The JS code has large amounts of whitespace in it as well. For the haskell answer I should have to match the whole code block even though it is on more than one line? Wanted to know for sure how to deal with multiline code block answers. FYI since I use PowerShell I spent a bit rolling my own LD function since PowerShell does not have one. That part was fun though. – Matt – 2017-05-29T02:34:44.393
@Matt The backticks should not be ignored. It's only if the code formatting is caused by backticks should you ignore the codeblock. You need to match the whole Haskell answer's code block because it is a proper codeblock. – HyperNeutrino – 2017-05-29T02:37:30.410
Cool. Just checking. – Matt – 2017-05-29T02:51:14.377
Are there any special rules for questions that have more than 100 answers? Are we always expected to check all answers of all question IDs passed? Guaranteed perhaps to have questions under a certain threshold? – Matt – 2017-05-29T13:45:30.710
@Matt Just wondering, how does having >100 answers change anything? – HyperNeutrino – 2017-05-29T13:56:16.257
For me I am using the StackExchange API to query answers and their content. The max pagesize is 100, default being 30. So if there are more than 100 I need to send a query again with another page. If I don't use the API it gets even harder. – Matt – 2017-05-29T13:57:55.327
@Matt Then I will say you can assume the question has at most 100 answers; I did not know about that. Thanks! – HyperNeutrino – 2017-05-29T14:10:14.833
Here is a relevant reference to the docs https://api.stackexchange.com/docs/paging – Matt – 2017-05-29T14:13:54.720
My LD calculations might be flawed. Is there an online resource you are using for LD values? The couple I found hate Unicode. I knew nothing about LD until I started looking at this challenge. Does yours use deletions, substitutions and transpositions? I don't think mine does the latter so I got 23. – Matt – 2017-05-29T15:36:34.997
@Matt What are transpositions? I believe mine only uses deletions, substitutions, and insertions. – HyperNeutrino – 2017-05-29T15:52:30.180
Mine supports those as well. I read a discussion for an online calculator that could have supported swapping neighboring character positions. Your LD was lower in your example so I was curious. I made my function based on this page. It worked fine for the other examples online I found so I had assumed I got that part of the challenge done. That is why I wanted to know what you did to get yours. – Matt – 2017-05-29T16:07:44.517
I got 23 for these:
FN«AX²ιβ×__β↓↘β←↙β↑←×__β↖β→↗β
toNαWα«X²ι↙AX²⁻ι¹β↙β↑↖β→A⁻α¹α
Sorry for all the comments. – Matt – 2017-05-29T16:08:12.630
hm... should I write another java beast? – tuskiomi – 2017-05-29T16:23:21.403
@tuskiomi If nobody else answers I really have no choice but to give it to you as long as it's an actual answer ;P So go ahead :P – HyperNeutrino – 2017-05-29T16:26:03.407
@Matt I don't know what I was doing. You're right; it is 23. Sorry for the confusion. :P Thanks for catching that; I've updated. – HyperNeutrino – 2017-05-29T16:29:19.993
I'm really hoping you will expand more on what a valid code block is for the purposes of this challenge. Take for example Q97752 and the last answer in Python 2.7. I see a valid code block 7 lines long. Each line is indented, but that doesn't seem to meet your criteria? I also see you've answered this question for Matt. please add more to the body of the OP. It really is not very clear. Especially the sentence which begins, "A body of text is the answer's actual code...". To me, the body of text is much more than the code. Can you please clarify for us all. – Octopus – 2017-05-30T21:09:48.807
@Octopus How is it invalid? The answer meets my criteria as it has an empty line above it and no text below it. I will add my comments into the question. – HyperNeutrino – 2017-05-30T21:15:46.873