Extract all text of specific font from MS Word document?

0

I have converted a PDF document to Word format. The PDF is an issue of a magazine. I need all of the text that is in a certain font and is either 9.5 or 10 point. That text is the content of the articles, which I want to do text mining on (word frequencies only). There is a ton of other material that would be extremely tedious to remove any other way.

Right-clicking on an area of text and selecting "Select text with similar formatting" (which is what I was planning to do) does not work: it only selects the current paragraph. I tried to 'select all' and then set all paragraph formatting to be the same, but this had no effect.

Ted Smith

Posted 2016-11-29T01:22:40.217

Reputation: 1

Answers

2

In a copy of your document (in case you want to keep all the text), you could delete all the text that isn't the specific font and size by doing 3 find and replaces (the first two will highlight the text you want to keep, the third will delete the remaining plain text).

(NB: ensure the Find what and Replace with textboxes are empty and don't accidentally include a space in Replace with)

  1. Find all text in the target font at 9.5 point and replace it with highlighting. The text is kept but highlighted in the last highlighter colour you used, so you may want to highlight a word beforehand to make sure a colour is selected.
  2. Do the same find and replace, but this time for 10 point text in the same font, again replacing with highlight.

Steps 1 and 2 above will mean that any text which is NOT 9.5 or 10 point in the font style you are looking for won't be highlighted.

  3. Do a find and replace where the find is simply Not Highlight and Replace with is left empty (to get Not Highlight as the Format option, you have to select Highlight twice from the Format button at the bottom left of the Find and Replace dialog).

This will delete all the non-highlighted text, leaving you with just the text you want to keep.
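
If you would rather script it than use the dialog, here is a rough sketch of the same filtering idea in Python, using only the standard library. It rests on some assumptions that are not from the question: the file is saved as .docx, the font and size are set directly on each run (text that inherits its font from a style will be skipped), and "Georgia" is only a placeholder font name. The function names are made up for illustration. Sizes in the file are stored in half-points, so 9.5 point is 19 and 10 point is 20.

```python
import re
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter

# WordprocessingML namespace used inside word/document.xml
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def read_document_xml(docx_path):
    """Return the main document part of a .docx file as bytes."""
    with zipfile.ZipFile(docx_path) as archive:
        return archive.read("word/document.xml")


def extract_text(document_xml, font_name, sizes_pt):
    """Collect text from runs whose direct formatting matches font and size.

    Runs that inherit their font or size from a style are skipped
    (assumption: the PDF converter wrote formatting on every run).
    """
    # w:sz stores the size in half-points, e.g. 9.5 pt -> "19"
    wanted = {str(int(round(size * 2))) for size in sizes_pt}
    root = ET.fromstring(document_xml)
    paragraphs = []
    for para in root.iter(f"{{{W}}}p"):
        pieces = []
        for run in para.iter(f"{{{W}}}r"):
            fonts = run.find("w:rPr/w:rFonts", NS)
            size = run.find("w:rPr/w:sz", NS)
            if fonts is None or size is None:
                continue
            if fonts.get(f"{{{W}}}ascii") != font_name:
                continue
            if size.get(f"{{{W}}}val") not in wanted:
                continue
            pieces.extend(t.text or "" for t in run.findall("w:t", NS))
        if pieces:
            paragraphs.append("".join(pieces))
    return "\n".join(paragraphs)


def word_frequencies(text):
    """Count lower-cased words made of letters and apostrophes."""
    return Counter(re.findall(r"[a-z']+", text.lower()))
```

For a real file you would call something like `extract_text(read_document_xml("magazine.docx"), "Georgia", [9.5, 10])` (the file name is hypothetical) and pass the result to `word_frequencies`. Check the output against a page or two of the magazine first, since converted PDFs often split words across runs or use slightly different font names.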

Tanya

Posted 2016-11-29T01:22:40.217

Reputation: 1 400