I'm curious what the current state of author recognition software is, i.e. software that detects the author of a certain anonymous text based on a pool of texts obtained from elsewhere. This could identify people in dangerous positions, like critics of a government or whistleblowers, even if they have been totally careful in terms of security otherwise.

I found this question but it is quite old now: https://softwareengineering.stackexchange.com/questions/203133/how-advanced-are-author-recognition-methods

Some articles ("Software Helps Identify Anonymous Writers or Helps Them Stay That Way" and "The Analysis Software That Wrecked J.K. Rowling’s Anonymity") are also old but portray a scarier world than the Stack Exchange answer.

I imagine that recent years, with machine learning becoming far more popular and powerful, have done a lot to make software like this much more reliable than it was back in 2013.

So how reliable is this nowadays and what would be appropriate measures to take against it?

  • Can't speak about the software itself, but the big problem is going to be getting the source text in the first place: Whistleblowers are usually going to be talking with people who have a vested interest to **not** reveal the exact message, in part for this exact reason. – Clockwork-Muse Dec 22 '19 at 10:15
  • @Clockwork-Muse sorry if the example role I used could be confusing, I do mean texts that are specifically written by someone that wants to keep their identity hidden. Edited the post slightly to reflect this. – Sebastiaan van den Broek Dec 22 '19 at 10:19

2 Answers

So how reliable is this nowadays ...

Without looking at recent papers but just based on how machine learning works in general: there is no generic authorship detection method which fits all possible use cases. Instead what is possible depends a lot on the specific use case and on the available data.

If the use case is to pick the author out of a group of 10 possible authors and you have sufficient training material, detection can be expected to be very reliable. If you instead want to determine which of 100,000 people was the author, I expect it to be impossible even with sufficient training material, although you might at least be able to narrow the field of potential suspects down to a few thousand. In many cases, though, you will not have enough training material even for that.
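As a rough illustration of the small-pool case, here is a minimal sketch in Python that ranks candidate authors by how similar their known texts are to an anonymous text. It assumes character trigram counts compared by cosine similarity, which is one common stylometric baseline and not necessarily what any real attribution tool uses. The function names (`char_ngrams`, `cosine`, `rank_candidates`) and the input format are made up for this example.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in lowercased, whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count profiles (0.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(anonymous_text, known_texts, n=3):
    """Rank candidate authors by similarity to the anonymous text.

    known_texts is a hypothetical input format: a dict mapping an author
    name to a list of text samples known to be written by that author.
    Returns (author, score) pairs, best match first.
    """
    target = char_ngrams(anonymous_text, n)
    scores = []
    for author, samples in known_texts.items():
        # Merge all samples of one author into a single profile.
        profile = Counter()
        for sample in samples:
            profile.update(char_ngrams(sample, n))
        scores.append((author, cosine(target, profile)))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

Calling `rank_candidates(anonymous_text, {"author_a": [...], "author_b": [...]})` returns every candidate with a score, so the output is a ranking rather than a verdict. With a large pool or short samples, the top scores tend to sit close together, which fits the point above that narrowing down the suspects is more realistic than naming one.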

On the other hand, if the author has a very distinctive writing style, detection becomes easier again. In other words: it is far easier to associate a well-written essay with J.K. Rowling than an essay of average quality with some specific 14-year-old kid.

It also depends on the kind of text: a book is different from a newspaper article, which is different from an email, which is different from a Twitter message. Not only does the length differ, but also the amount of time spent improving and tailoring the wording and style. It is probably easier to detect the authorship of books than of emails, since the author's unique style is more clearly reflected in a book.

I'm at least familiar with recent research where machine learning was used to determine if a specific email was authored by the claimed sender in order to detect sender spoofing or account takeover. In this case even with a large corpus of previous mails the false positive rate was very high.

... what would be appropriate measures to take against it?

Again, this depends on the use case. But similar to transferring the style of images or music, or creating deepfakes, it is possible to use style transfer on text in order to hide the real author or even to fake a specific author. And of course, making only a little data available for training helps a lot too.

Anyway, I think in most cases authorship attribution will not be done based on writing style but on the contents of the text. Especially in the case of government critics or whistleblowers, these texts usually contain information that only a few people know in this detail, because otherwise the texts would usually not be seen as much of a problem. Thus the focus of the investigation will be on finding out who has this particular knowledge or had access to the specific leaked information.

Steffen Ullrich

Interesting question. Last year, someone within the Trump administration penned an anonymous op-ed essay entitled 'I Am Part of the Resistance Inside the Trump Administration', which was seen as very unflattering to President Trump and his administration. There was a lot of interest at the time (especially on the part of the Trump administration, but also from the public and the news media) in determining the author of the essay. Surely, some have tried to determine who wrote the essay using the techniques that you describe in your question. Moreover, the fact that only a relatively small number of people would have the knowledge to write such an essay should simplify the problem to some degree. Yet the identity of the author has still not been conclusively determined. In fact, the same individual has recently written a book and still remains anonymous.

A lot was made of the use of the word 'lodestar' in the essay. This word is rarely used in English, but it was a word that Vice President Pence used with some regularity. Many speculated that this pointed directly to Pence as the writer of the essay, while others speculated that it was a ploy by the writer to throw investigators off his or her own scent and onto Pence.

With regard to your question about counter-measures - try using a language translation program to translate your writing to another language, then translate it back again to the language that you wrote it in. You'll see that this dramatically changes the writing style of your piece. It will look like it was written by a non-native speaker of your language. This will do a lot to throw the scent off you as far as the writing style is concerned, but as Steffen Ullrich points out in his answer, the writing may still be connected to you based on its contents.

mti2935