Detect Duplicate Questions

Once upon a time, there was a golfing site. It had a problem: people would post similar or identical questions again and again. You have been ~~chosen~~ ~~selected~~ ~~forced~~ ~~conscripted~~ ~~blackmailed~~ requested to automate the process of deciding whether a question is a duplicate of an existing one, by whatever means necessary (see Rules).

Input

Your program must accept a single URL as input. It may assume that this leads to a question on codegolf.stackexchange.com.

Output

Search the site for similar questions. If you think that the input question is a duplicate of an existing question (or vice versa), output the other question's URL. You may output multiple URLs, separated by new lines. At the end of your output, output "end" (on a separate line).

Scoring

  • If a question that you output was indeed marked as a duplicate of the input question (or vice versa), you score 4 points. This is a "correct guess".
  • For each false positive (aka "incorrect guess"), you lose 2 points.
  • For each question that was actually a duplicate but does not appear in your output (aka "missing guess"), you lose 1 point.

The highest score after handling 32 input questions wins. These 32 questions are a "round". At the beginning of each round, the scores will be reset to 0. One round will be run every few days, and the leaderboard updated after each round.
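For concreteness, the scoring rule can be sketched as a function (a hedged illustration; `score_question` is not part of the challenge itself):

```python
def score_question(guessed, actual):
    """Score one input question under the rules above:
    +4 per correct guess, -2 per false positive ("incorrect guess"),
    -1 per actual duplicate missing from the output ("missing guess")."""
    guessed, actual = set(guessed), set(actual)
    return (4 * len(guessed & actual)
            - 2 * len(guessed - actual)
            - 1 * len(actual - guessed))
```

A round's score would then be the sum of `score_question` over its 32 inputs.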

Rules

  • If questions A and C are both closed as duplicates of B, A will count as a duplicate of C and vice versa.
  • At the start of each round, your program may not possess any data about any questions (i.e. no hardcoding), except on how to parse the website.
  • However, you may keep data in external files during a round.
  • No data may be kept between rounds.
  • Your output must have a trailing new line.
  • You may not use any data from the website except search results and the URL, title, tags and text of a question, with or without formatting. For example, you may not use the text "marked as duplicate by foo, bar..." that appears on duplicate questions.
  • You may retrieve this data directly from the site, via data.SE or via the API.
  • Each submission must have a name.
  • Each submission must have clear version numbering.
  • If a submission does not produce output after a time limit (to be decided; please state how long your submission takes) it will be killed and lose 8 points.
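For reference, the Stack Exchange API route for a single question is `/questions/{id}` with `site=codegolf`; a minimal sketch of building that request URL from a question link (the helper name and the `withbody` filter choice are illustrative, not required by the rules):

```python
import re

API = "https://api.stackexchange.com/2.3"

def question_request_url(question_url):
    """Build the API URL returning title, body and tags for a
    codegolf.stackexchange.com question link (illustrative helper)."""
    qid = re.search(r"\d+", question_url).group(0)  # first number in the link is the question id
    return "%s/questions/%s?site=codegolf&filter=withbody" % (API, qid)
```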

user16402

Posted 2014-09-12T20:05:04.743

Are we penalized for claiming the same question is duplicate twice? – pppery – 2016-06-12T01:20:45.323

@PenutReaper Question 20006 has since been deleted. – pppery – 2016-06-15T13:00:38.013

Isn't 1 minute subjective? Network connections and crawling will lead to a huge number of web requests. It might easily take more than 1 minute for everyone :) – Optimizer – 2014-09-12T20:17:10.967

@Optimizer Are you suggesting I increase it to 2 minutes? 4? – None – 2014-09-12T20:20:40.080

I think we cannot come to that number directly, you might have to write an example program yourself (or use the first answer) to determine the correct threshold time. – Optimizer – 2014-09-12T20:23:21.957

@Optimizer I think I'll use the first few answers (but no more than 8 minutes or so) – None – 2014-09-12T20:46:19.873

Instead of scraping the site, you should go through the API, and specify which fields may be used. – Gilles 'SO- stop being evil' – 2014-09-12T21:36:21.053

@Gilles I'm not familiar with the API - is 'post name', 'post body', 'post tags' enough? – None – 2014-09-13T10:37:31.253

It would be so funny if this question was a duplicate.. oh the irony xD – Teun Pronk – 2014-09-15T09:40:59.730

can I use a search engine instead of the site's search? – Mhmd – 2014-09-16T12:36:51.520

@Mhmd No; you may use the API though, I will add rules about that – None – 2014-09-16T13:00:58.590

@Gilles I've added rules for the API; you may now use the site itself, the API or data.SE – None – 2014-09-16T13:02:51.593

@professorfish You could really use some test cases, here you go. This data all came from Data.SE, so it should be reliable. Feel free to make me look silly and prove me wrong.

This question http://codegolf.stackexchange.com/q/37737 has no duplicates.

This question http://codegolf.stackexchange.com/q/12348 has this http://codegolf.stackexchange.com/q/10465

This question http://codegolf.stackexchange.com/q/12498 has these http://codegolf.stackexchange.com/q/20006 http://codegolf.stackexchange.com/q/242

– PenutReaper – 2014-09-16T16:08:31.077

Answers

Python 3

I am giving this entry the name The Differ.

Code:

import urllib.request, gzip, re, json, difflib, itertools, sys
API_URL = "https://api.stackexchange.com/"
qurl = input()
qid = int(re.search(r"\d+",qurl).group(0))
def request(url,wrapper=False,**params):
    params.setdefault("filter","withbody")
    params.setdefault("site","codegolf")
    url = API_URL + url + "?"+"&".join([str(k)+"="+str(v) for k,v in params.items()])
    compressed_response = urllib.request.urlopen(url)
    response = gzip.decompress(compressed_response.read()).decode("utf8")
    response_object = json.loads(response)
    if wrapper:
        return response_object
    else:
        return response_object["items"]
question = request("questions/%s"%qid)[0]
title = question["title"]
escaped = title.replace(" ","%20")
related = request("similar",title=escaped,pagesize=100)
hasmore = False
best = []
length = sys.maxsize
for tag in question["tags"]:
    result = request("search",tagged=tag,
                     wrapper=True,
                     filter="!-*f(6rc.cI8O",
                     pagesize=100)
    if result["total"] < length:  # remember the least-used tag
        length = result["total"]
        best = result["items"]
        hasmore = result["has_more"]
        besttag = tag
related.extend(best)
if length < 1500:  # fetch the remaining pages of the least-used tag
    for page in itertools.count(2):
        if not hasmore:
            break
        response = request("search",
                           tagged=besttag,
                           page=page,
                           pagesize=100,
                           filter="!-*f(6rc.cI8O",
                           wrapper=True)
        hasmore = response["has_more"]
        related.extend(response["items"])
matcher = difflib.SequenceMatcher(None, question["body"], "")
titlematcher = difflib.SequenceMatcher(None, question["title"], "")
seen = set()
seen.add(question["question_id"])
for possible in related:
    matcher.set_seq2(possible["body"])
    titlematcher.set_seq2(possible["title"])
    score = matcher.ratio()+titlematcher.ratio()
    qid = possible["question_id"]
    if score > .85 and qid not in seen:
        print(qid)
        seen.add(qid)
print("end")
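The 0.85 cutoff above is a threshold on the sum of two difflib similarity ratios; for intuition, a small self-contained sketch of the same measure (the function name and sample strings are illustrative):

```python
import difflib

def combined_ratio(title_a, title_b, body_a, body_b):
    """Sum of title and body similarity ratios, each in [0, 1],
    as used by The Differ's duplicate test (score > 0.85)."""
    t = difflib.SequenceMatcher(None, title_a, title_b).ratio()
    b = difflib.SequenceMatcher(None, body_a, body_b).ratio()
    return t + b
```

Identical title and body give the maximum combined score of 2.0, while unrelated strings fall well below the threshold.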

The filter "!-*f(6rc.cI8O" includes the total field on the global wrapper object and the body field on questions.

This entry makes two API requests, plus one per tag on the question, plus one per hundred questions in its least-used tag. If it hits an API throttle (which it does not check for), it raises urllib.error.HTTPError: HTTP Error 400: Bad Request.
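As a hedged illustration (not part of the entry), the API signals throttling via an optional backoff field on the wrapper object and API-level errors via an error_id in the JSON body; the decoded wrapper could be guarded along these lines before use (the helper name is illustrative):

```python
import time

def checked(response_object):
    """Inspect a decoded API wrapper object: raise with a readable
    message on an API-level error, and honor a requested backoff
    by sleeping that many seconds before returning the object."""
    if "error_id" in response_object:
        raise RuntimeError("API error %s: %s" % (
            response_object["error_id"], response_object.get("error_message")))
    if "backoff" in response_object:
        time.sleep(response_object["backoff"])  # API-mandated pause
    return response_object
```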

pppery

Posted 2014-09-12T20:05:04.743

Reputation: 3 987