How can I extract text from a table in a PDF file?

5

4

I am trying to implement an algorithm described in an academic paper, which I have in PDF format. The algorithm includes a table of 256 entries that I want to copy to my implementation. However, I can't seem to copy the table as text that I can manipulate. I can only copy it as an image.

How can I extract the table easily without typing it in?

Nathan Fellman

Posted 2009-08-10T19:16:54.333

Reputation: 8 152

Answers

4

PDF2Table

I think this outputs XML.

If we surf the web we can find PDF files in heaps. Once technical details of an amazing five mega pixel digital camera, once a statistic about the last two years incomes of an enterprise, and once a brilliant crime novel of Sir Arthur Conan Doyle is saved in a PDF file. The widespread use of this file format takes the focus on the question of how to reuse the data in such a file. Many things are already done in this area. For example, there are several tools that convert PDF-files to other formats.

My work focuses only on the extraction of table information from PDF-files. I searched for tools that extract basic information from PDF-files. I found a tool named pdf2html which also returns data in XML format. To access this XML output I used the JDOM archive.

I developed several heuristics for table detection and decomposition. These heuristics work pretty well on lucid tables (without spanning columns or rows) and fairly well on complex tables (with spanning rows or columns).
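To give a feel for what such a heuristic might look like, here is a minimal Python sketch. It is not pdf2table's code. It assumes the XML looks like what Poppler's `pdftohtml -xml` produces (`<page>` elements containing `<text top=... left=...>` fragments), and it groups fragments into rows purely by vertical position. The function name, the `row_tolerance` parameter and the sample data are all made up for illustration.

```python
import xml.etree.ElementTree as ET

def rows_from_pdftohtml_xml(xml_text, page_number=1, row_tolerance=3):
    """Group the <text> fragments of one page into rows by vertical position."""
    root = ET.fromstring(xml_text)
    page = next(p for p in root.iter("page") if int(p.get("number")) == page_number)

    # Collect (top, left, text) for every non-empty fragment on the page.
    fragments = []
    for el in page.iter("text"):
        content = "".join(el.itertext()).strip()  # itertext() also picks up <b>/<i> children
        if content:
            fragments.append((int(el.get("top")), int(el.get("left")), content))

    # Walk top-to-bottom; start a new row when a fragment sits more than
    # row_tolerance units below the first fragment of the current row.
    fragments.sort()
    rows, current, current_top = [], [], None
    for top, left, content in fragments:
        if current and top - current_top > row_tolerance:
            rows.append(current)
            current = []
        if not current:
            current_top = top
        current.append((left, content))
    if current:
        rows.append(current)

    # Within each row, order the cells left-to-right.
    return [[text for _, text in sorted(row)] for row in rows]

# Hypothetical sample input, hand-written in the assumed format.
SAMPLE = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="100" left="50" width="20" height="12" font="0">12</text>
<text top="101" left="120" width="20" height="12" font="0">34</text>
<text top="118" left="120" width="20" height="12" font="0"><b>78</b></text>
<text top="118" left="50" width="20" height="12" font="0">56</text>
</page>
</pdf2xml>"""

print(rows_from_pdftohtml_xml(SAMPLE))
```

This only handles simple grids. Spanning rows or columns would need extra logic, and none of it helps if the table is embedded as an image.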

Sourceforge link

Ivo Flipse

Posted 2009-08-10T19:16:54.333

Reputation: 24 054

4

  1. The PDF format was never intended, from its inception (more than 20 years ago), to host extractable, meaningfully structured data.

  2. Its original purpose was to be a reliable visual representation of text, images and diagrams in a document: a kind of digital paper (that would also transfer reliably to real paper via printing). Only later in its development were more features added, among them some intended to help extract data again (search for "Tagged PDF").

  3. For some examples of problems which are posed when data scraping tables from PDFs, see this article:

  4. Contradicting my point '1.' above, now I say this: for an amazing family of tools that gets better and better from week to week for extracting tabular data from PDFs (unless they are scanned pages), see these links:

So: go look for Tabula. If any tools can do what you want, at this time (4 years after this question was asked) Tabula is probably amongst the best for the job!


P.S.: Tabula is Free and Open Source Software, written in Ruby.

Kurt Pfeifle

Posted 2009-08-10T19:16:54.333

Reputation: 10 024

Hi, @ZiaUlRehmanMughal. I rejected your suggested edit to this answer, because I think you should add your amendment as a separate answer of your own. If you do so (and if I become aware of it) I'll even upvote (after I have checked the tool you suggest, which I do not know yet). – Kurt Pfeifle – 2019-12-19T09:21:37.637

2

Your problem might be that the original author pasted the table into the PDF as an image. If this is the case (you can find out by checking whether other text in the document copies as text), your only options are probably to copy it by hand (hope you can touch type) or to use the OCR software that comes with scanners.

Toby Allen

Posted 2009-08-10T19:16:54.333

Reputation: 2 634

Unfortunately, it looks like this is the case. However, although Ivo's answer doesn't seem to solve the problem, I prefer to accept it since it's more likely to be the answer for the general case. – Nathan Fellman – 2009-08-10T20:05:48.900

1

I haven't tried this, but the pdf2table project might help.

jiggysoo

Posted 2009-08-10T19:16:54.333

Reputation: 11

It's buggy (I got an infinite loop generating output xml) and written in highly unidiomatic Java (so not so easy to understand or modify). I'd stay away if you have any choice. – Barry Kelly – 2013-09-26T08:28:45.280

0

The non-free application PDF2XL and the free PDF Mechanic can both extract tabular data to CSV and Excel, often perfectly, depending on the exact formatting of the table.
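Once the table is in CSV, getting it into an implementation is a small scripting job. Below is a minimal Python sketch that flattens every non-empty cell into one list of integers and checks the count against the 256 entries mentioned in the question. It assumes the cells hold plain integers (hexadecimal by default, which is only a guess about the paper's table). The function name and its parameters are hypothetical.

```python
import csv
import io

def table_from_csv(csv_text, expected=256, base=16):
    """Flatten every non-empty CSV cell into one list of integers, row by row."""
    values = []
    for row in csv.reader(io.StringIO(csv_text)):
        for cell in row:
            cell = cell.strip()
            if cell:
                # int() with base=16 accepts both "1f" and "0x1f".
                values.append(int(cell, base))
    # A wrong count usually means a row or column was lost during extraction.
    if len(values) != expected:
        raise ValueError(f"expected {expected} entries, got {len(values)}")
    return values

# Tiny hypothetical example with 4 entries instead of 256.
print(table_from_csv("0x0a, 0x0b\n0c,ff\n", expected=4))
```

The count check is the useful part: extraction tools tend to drop or split cells silently, so it is worth failing loudly when the total is off.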

Matthew Lock

Posted 2009-08-10T19:16:54.333

Reputation: 4 254

0

One option seems to be to save the document (or maybe just the page with the table you want) as an XML file. I just did this in Adobe Acrobat Pro by saving as "XML Spreadsheet 2003." This retained the tabular format in the resulting XML file (viewable in Excel). The only "imperfection" is that it treats each literal line in the table as a row in the Excel file. So if any text breaks across lines (e.g., long names), it will show up as two rows in Excel. For a small table, that's pretty minor cleanup.

Other than that, it seems like this process could be automated.
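As a rough idea of how the cleanup could be automated, here is a minimal Python sketch that reads an "XML Spreadsheet 2003" file with the standard library and folds wrapped rows back together. It assumes the usual SpreadsheetML layout (Workbook > Worksheet > Table > Row > Cell > Data in the `urn:schemas-microsoft-com:office:spreadsheet` namespace), which I have not verified against Acrobat's output specifically. The merge rule (a row whose first cell is empty continues the previous row) is purely an assumed heuristic, and the function names and sample data are made up.

```python
import xml.etree.ElementTree as ET

SS = "urn:schemas-microsoft-com:office:spreadsheet"
NS = {"ss": SS}

def read_rows(xml_text):
    """Return the first worksheet's rows as lists of cell strings."""
    root = ET.fromstring(xml_text)
    table = root.find("ss:Worksheet/ss:Table", NS)
    rows = []
    for row in table.findall("ss:Row", NS):
        cells = []
        for cell in row.findall("ss:Cell", NS):
            # ss:Index (1-based) means empty cells were skipped; pad to keep columns aligned.
            index = cell.get(f"{{{SS}}}Index")
            if index is not None:
                cells.extend([""] * (int(index) - 1 - len(cells)))
            data = cell.find("ss:Data", NS)
            cells.append("".join(data.itertext()).strip() if data is not None else "")
        rows.append(cells)
    return rows

def merge_continuations(rows):
    """Assumed heuristic: a row with an empty first cell continues the previous row."""
    merged = []
    for row in rows:
        if merged and row and row[0] == "":
            prev = merged[-1]
            prev.extend([""] * (len(row) - len(prev)))
            for i, text in enumerate(row):
                if text:
                    prev[i] = f"{prev[i]} {text}".strip()
        else:
            merged.append(list(row))
    return merged

# Hypothetical sample: the second row is the wrapped tail of the first.
SAMPLE = """<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
          xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
 <Worksheet ss:Name="Sheet1">
  <Table>
   <Row><Cell><Data ss:Type="String">Alice</Data></Cell><Cell><Data ss:Type="String">Very long</Data></Cell></Row>
   <Row><Cell ss:Index="2"><Data ss:Type="String">description</Data></Cell></Row>
   <Row><Cell><Data ss:Type="String">Bob</Data></Cell><Cell><Data ss:Type="String">Short</Data></Cell></Row>
  </Table>
 </Worksheet>
</Workbook>"""

print(merge_continuations(read_rows(SAMPLE)))
```

The heuristic will wrongly merge rows in tables that legitimately have an empty first column, so check the result against the PDF.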

Matt Jans

Posted 2009-08-10T19:16:54.333

Reputation: 1