Automatic parsing of citation text in academic references

Is there any software (or pseudo-code) which can automatically scan a piece of text (either pasted into the tool, or read from a .doc/.pdf) and identify citation data using standard formats? The data would then be split up into its constituent fields and exported in XML, CSV, or some other structured data format. I have looked at cb2Bib but it was only able to extract the year from Harvard-style references, which is insufficient.

Alistair Knock

Posted 2009-08-18T08:56:28.420

Reputation: 834

I'm not sure if this is what you need, but you can try https://www.refhive.com

– Mostafa Elmoghazi – 2019-04-03T21:49:45.467

Do you want to scan the text itself or just the references section? – innaM – 2009-08-18T09:09:12.120

Just the references - it'd likely be a document containing personal publications. – Alistair Knock – 2009-08-19T09:07:38.367

Answers

Take a look at this list of Citation Parsers that can generate XML from input text:

http://freecite.library.brown.edu
http://paracite.eprints.org
http://aye.comp.nus.edu.sg/parsCit (in maintenance mode as of Aug 1, 2012)
http://opcit.eprints.org
http://search.cpan.org/~mjewell/Biblio-Citation-Parser-1.10

With freecite you could use a curl command to submit citations as follows (in PHP):

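// Escape the POST body so quotes or shell metacharacters in the citation cannot break the command.
$cmd = "curl -s -H 'Accept: text/xml' -d " . escapeshellarg( $myinput ) . " http://freecite.library.brown.edu/citations/create";
// shell_exec() returns the whole response; exec() would return only its last line.
$xmloutput = shell_exec( $cmd );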
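For anyone not on PHP, here is a rough Python sketch of the same request using only the standard library. The form field name "citation" is my assumption about what the service expects, the function names are made up for illustration, and I have not checked whether the endpoint still responds, so treat it as a starting point rather than a tested client.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

FREECITE_URL = "http://freecite.library.brown.edu/citations/create"


def build_freecite_request(citation: str) -> Request:
    """Build (but do not send) a POST request carrying one citation string."""
    # Assumption: the service reads the reference from a form field
    # named "citation". urlencode takes care of quotes and spaces.
    body = urlencode({"citation": citation}).encode("utf-8")
    return Request(FREECITE_URL, data=body, headers={"Accept": "text/xml"})


def fetch_citation_xml(citation: str, timeout: float = 10.0) -> str:
    """Send the request and return the raw XML response as text."""
    with urlopen(build_freecite_request(citation), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Building the request separately from sending it means you can inspect the encoded body without touching the network, and nothing goes through a shell, so no quoting is needed.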

KEG

Posted 2009-08-18T08:56:28.420

Reputation: 41

Another option is https://github.com/inspirehep/refextract. It's not based on ML but it worked very well on my tests.

– Josir – 2019-02-02T13:32:18.303

At the moment (2017), the most active open-source project implementing this seems to be Anystyle Parser (last version 07-2016). It can be used through a web interface or an API, or downloaded as a RubyGem.

They explicitly mention on their website that the implementation is inspired by ParsCit (last version 2013?) and FreeCite (last commit 2009).

Also from their website:

AnyStyle Parser uses powerful machine learning heuristics based on Conditional Random Fields that can be trained by everyone using our built-in editor.

That is a really cool feature, which makes this the most interesting implementation (imho). Training seems to be pretty straightforward, as explained in the API documentation: you provide some manually corrected results and run the Anystyle.parser.train command. I am not sure whether ParsCit and FreeCite also support this, but if they don't, this seems like a major difference to me.

Wouter

Posted 2009-08-18T08:56:28.420

Reputation: 1 259

With the exception of Anystyle Parser they're all mentioned in the currently highest voted answer. What actually makes them stand out? What would be advantages or disadvantages given the original question? – Seth – 2017-04-07T10:00:34.680

Ah, indeed. I'll edit and improve my answer. Thx for pointing that out. – Wouter – 2017-04-07T11:40:43.710

Looks like it's dead now. – expert – 2017-12-08T13:45:58.440

@expert : It's resting :) I've been mailing with the developer, while trying to get the source to compile on Windows (works now). He's planning to give it a remake in 2018 to add new functionality, like full-text parsing of papers. – Wouter – 2017-12-09T11:45:34.213

@Wouter did he ever make clear how to get it compiled on windows? Very interested in this as well. – Brandon – 2018-02-28T01:22:10.820

@Brandon: I have posted a HOWTO here: https://github.com/inukshuk/wapiti-ruby/issues/3

– Wouter – 2018-02-28T08:24:03.077

That looks great, thanks! As someone who has never touched ruby, it will be very helpful indeed. – Brandon – 2018-02-28T09:27:17.160

Try a tool such as Regex Buddy or Expresso.

If you're not a programmer, regular expressions may be a bit intimidating, but they're really not that hard, especially with a decent tool like one of the above.

Here's an example of someone using Regular Expressions for extracting citations:

Citation parsing regular expression
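To give an idea of what the regex route looks like, here is a minimal Python sketch that splits a Harvard-style reference into fields and writes them out as CSV. The pattern and the field names are my own assumptions: it only handles the single layout "Authors (Year) Title. Source details" and will miss anything else, so real reference lists would need more patterns or one of the trained parsers.

```python
import csv
import io
import re

# Assumed layout: "Authors (Year) Title. Source details"
HARVARD_RE = re.compile(
    r"^(?P<authors>.+?)\s*"        # everything before the bracketed year
    r"\((?P<year>\d{4})[a-z]?\)"   # (2009) or (2009a); the letter is dropped
    r"[.,]?\s*"
    r"(?P<title>[^.]+)\."          # title runs up to the first full stop
    r"\s*(?P<source>.*)$"          # journal, volume, pages, and so on
)
FIELDS = ["authors", "year", "title", "source"]


def parse_reference(line):
    """Return a dict of fields, or None if the line does not fit the layout."""
    match = HARVARD_RE.match(line.strip())
    if match is None:
        return None
    return {name: match.group(name).strip() for name in FIELDS}


def references_to_csv(lines):
    """Parse each line and return CSV text; unparseable lines are skipped."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    for line in lines:
        record = parse_reference(line)
        if record is not None:
            writer.writerow(record)
    return buffer.getvalue()
```

Calling references_to_csv on the lines of a reference section gives one CSV row per reference that matched. Titles containing a full stop get cut short, which is the kind of edge case that makes pure regex approaches fragile.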

Ash

Posted 2009-08-18T08:56:28.420

Reputation: 2 611

Mendeley should be able to do this. It can import PDFs and then export the metadata to BibTeX, RIS and EndNote XML. It is free to download and is cross-platform.

Edit: I tested this on a few documents. The PDF import does seem to work well for references that are formatted correctly. For a document I created using LaTeX, all of the references with the author in the form "Smith, J." or "J. Smith", etc., were imported fine. If the author is a company (a single word), or the reference is incomplete, it does not work as well. The extracted references can easily be edited and exported to BibTeX, etc.

sblair

Posted 2009-08-18T08:56:28.420

Reputation: 12 231

"This feature was removed in Mendeley 0.9.7 because it was consuming a fair amount of resources (client and server side) without providing enough value. We plan to re-introduce it in an improved form in future." ...... http://feedback.mendeley.com/forums/4941-mendeley-feedback/suggestions/834313-version-0-9-7-does-not-extract-references-from-the

– iceman – 2010-08-16T09:50:03.353

Try http://www.crossref.org/guestquery/#stqsearch

This one is capable of automatically parsing your reference text and offers a link to the online article.

anton

Posted 2009-08-18T08:56:28.420

Reputation: 11

I've seen a Westlaw program do that for legal citations, but that's probably not what you're looking for. Reference Manager might do something like that for academic formats, but I've never used it.

Kaypro II

Posted 2009-08-18T08:56:28.420

Reputation: 1 359

This probably belongs more as a comment to @Abhinav, but Zotero definitely only handles structured data, as described here:

http://www.zotero.org/support/getting_stuff_into_your_library#importing_records_from_other_reference_tools

An interesting hack might be to write a program that uses each citation as a search query in your favorite database, then uses something like Zotero to generate the reference information. You could also download structured information from services like CiteULike. Let me know if you end up doing something like that! (put it up on github if you do ;).
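A rough sketch of the search-query idea in Python, using the Crossref REST API as the database. That choice is my assumption, as are the function names; the query.bibliographic and rows parameters and the message/items response layout are how I understand that API, so check its documentation before relying on this.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CROSSREF_WORKS = "https://api.crossref.org/works"


def build_query_url(citation, rows=1):
    """Turn a free-text citation into a Crossref works search URL."""
    params = {"query.bibliographic": citation, "rows": rows}
    return CROSSREF_WORKS + "?" + urlencode(params)


def best_match(payload):
    """Pick DOI and title of the first hit from a decoded response, or None."""
    items = payload.get("message", {}).get("items", [])
    if not items:
        return None
    top = items[0]
    titles = top.get("title") or [""]  # titles are returned as a list
    return {"doi": top.get("DOI", ""), "title": titles[0]}


def lookup(citation, timeout=10.0):
    """Query the API for one citation and return the best match, if any."""
    with urlopen(build_query_url(citation), timeout=timeout) as response:
        return best_match(json.load(response))
```

The first hit is not guaranteed to be the right paper, so in practice you would want to compare the returned title against the original citation text before trusting the DOI, then hand the DOI to a reference manager to fetch the full record.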

Dav Clark

Posted 2009-08-18T08:56:28.420

Reputation: 174

Zotero is a plugin for Firefox which does this for web content. I'm not sure if there is a similar tool for documents/PDFs.

Abhinav

Posted 2009-08-18T08:56:28.420

Reputation: 2 030

I don't see how Zotero does what the OP asks. I've installed it, but there seems to be no option to parse a reference. – Rikki – 2015-03-06T13:03:36.397

Zotero parses citations from specially-coded websites, not from regular text. – Ochado – 2016-04-26T13:06:15.333

I know that this isn't exactly what Zotero is designed to do, but if you pointed Firefox to a text file or html file with the relevant data, Zotero might recognize the references and then you could add it to the Zotero library and export the whole library into whatever format you like (I know Zotero supports a bunch of formats). This would be painful for a large number of files though. – nedned – 2009-09-01T14:11:15.633