CFL Interviews

Shlomo Argamon
March 2014

"A great deal of what is now done manually can be automated."

Shlomo Argamon is Professor of Computer Science and Director of the Linguistic Cognition Laboratory at the Illinois Institute of Technology. During a visit to CFL as part of Aston University's Distinguished Visitor scheme, he talked to Andrea Nini about his work in computational linguistics and authorship attribution.

Have you done any consultancy work as a forensic linguist?

A small amount. There were two cases that I was asked to consult on, but before I had gotten very far with any of the analysis the consultancy was called off for ancillary reasons.

Did you analyse these texts, nonetheless?

I had completed only partial analyses of the data before my involvement in the cases was called off.

How do you think the kind of analysis you made for those forensic cases differs from other kinds of attribution analyses?

I come to forensic linguistics from the standpoint of computational linguistics, so I research how to build fully automated solutions that can perform the attribution task without human intervention. Obviously, when it comes to explaining results, and certainly when testifying, you need human interpretation. But the idea is to develop techniques that we can apply fully automatically, then measure their reliability rigorously, and then use them to get a clear idea of what's happening in an actual case.

Now, in our research we typically work with cases which are much easier for a number of reasons than a typical forensics case. Krzysztof Kredens says that he distinguishes between "authorship attribution" and "forensic authorship attribution." In most authorship attribution work done in historical or literary scholarship one has long and relatively clean texts in terms of grammatical structure and so forth, and one usually has quite a lot of data. You therefore have a lot of background data on your suspect, you have a lot of background data on other distractor or confusor authors to compare the suspect against, and the questioned text itself is often quite long. In such situations we can apply statistical techniques that rely on the fact that there is a lot of data. For example, you can get pretty good statistical estimates of the frequency that somebody uses nouns versus verbs or things like that, and such statistics usually characterise authorship fairly well.

By contrast, what I found in the forensic context, which was not really surprising, is that the data sets were much smaller, so I couldn't directly rely on these kinds of statistical techniques. In addition, the texts were grammatically and orthographically, really in almost every way, ill-formed -- hence I couldn't rely on the standard computer software that we use all the time to automatically parse the syntax of a sentence and assign the parts of speech. That wouldn't have worked well, because the texts were too ill-formed for regular methods to handle. So what I needed to look for were ways to extract useful lexical or even character-based features -- things like emoticons, abbreviations, use of numerals in place of syllables, and so forth.

As far as I had gotten in these cases was to go manually through samples of the data to extract some sort of a dictionary of plausible lexical and orthographic features, as opposed to applying an automated computer system which would "read" through all of the data and automatically find potentially interesting and useful deviations from the linguistic norms.
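As a rough illustration of what counting such lexical and orthographic features automatically might look like, here is a minimal Python sketch. The emoticon pattern, the abbreviation list, and the function name `surface_features` are all assumptions made for this example; they are not taken from the cases described, and a real feature dictionary would have to be compiled from the data at hand.

```python
import re
from collections import Counter

# Hypothetical feature dictionary, for illustration only.
# Matches simple Western emoticons such as :) ;-) =D :P
EMOTICON = re.compile(r"[:;=][-']?[)(DPp/\\|]")
# Matches word-like tokens mixing digits and letters, e.g. "gr8", "2nite", "b4"
DIGIT_WORD = re.compile(r"\b(?=\w*\d)(?=\w*[a-z])\w+\b", re.IGNORECASE)
# A tiny, assumed list of texting abbreviations
ABBREVIATIONS = {"lol", "omg", "brb", "idk", "u", "ur", "thx"}


def surface_features(message: str) -> Counter:
    """Count a few character-level and lexical features in one short message."""
    counts = Counter()
    counts["emoticon"] = len(EMOTICON.findall(message))
    counts["digit_word"] = len(DIGIT_WORD.findall(message))
    tokens = re.findall(r"[a-z']+", message.lower())
    counts["abbreviation"] = sum(1 for t in tokens if t in ABBREVIATIONS)
    return counts


if __name__ == "__main__":
    print(surface_features("omg c u 2nite, gr8 news :) lol"))
```

Because nothing here depends on parsing or part-of-speech tagging, a counter of this kind can still run on grammatically ill-formed text, although it only finds the features someone has already thought to list.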

From what I saw in these cases and from what I gather from the experience of my colleagues working on cases, this is really the key distinction. There's quite a gap there between what is possible and what works in these different scenarios. The key question now is how do we go from methods that give good accuracy and insight, but only when given large amounts of well-formed data, and push them down to meet real-world needs where data are much less well-formed, are much more informal, have a lot more linguistic variation, and so forth? That's where a lot of this work needs to move towards.

What do you think then is the missing link between authorship attribution and forensic authorship attribution?

I don't think that there is a single silver bullet which is going to solve the problem. We can start, though, from the general framework, which is as follows: Take a text or a body of texts. The first thing to do is identify what features of the text to represent, to create an abstract representation of the style of a text. These features could be as simple as the relative frequencies of different words in the text; you might take function words or parts of speech or other features. Then you can construct an abstract representation of the text based on the frequencies of those various features in the text. Given such a representation, there is a whole library of statistical machine learning classification methods that take such abstract mathematical representations of the texts and find the best ways to distinguish between, say, one group of texts and another group of texts, say, Author A and Author B in the simplest case. One can then take a questioned document, represent it in the same way, and then compare it to documents by A and B to get a predicted answer. As a side effect of this process, you can also find out which features were most determinative of this classification, so you can explain why this questioned document was classified in one way or another.
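The general framework described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the method used in any actual study: the ten-word `FUNCTION_WORDS` list, the function names, and the choice of a nearest-centroid comparison with Euclidean distance are all made up for the example, standing in for the much larger feature sets and machine learning classifiers mentioned in the interview.

```python
import math
import re
from collections import Counter

# Illustrative function-word list (an assumption; real studies use far more).
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "i", "it", "but"]


def represent(text):
    """Map a text to the relative frequencies of the chosen function words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = max(len(tokens), 1)  # avoid division by zero on empty input
    return [counts[w] / total for w in FUNCTION_WORDS]


def centroid(vectors):
    """Average the feature vectors of one author's known texts."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def attribute(questioned, known):
    """Pick the closest candidate author (needs at least two candidates).

    Returns the predicted author and, per feature, how much that feature
    favoured the predicted author over the runner-up.
    """
    q = represent(questioned)
    profiles = {a: centroid([represent(t) for t in texts])
                for a, texts in known.items()}
    dist = {a: math.dist(q, p) for a, p in profiles.items()}
    ranked = sorted(dist, key=dist.get)
    best, runner_up = ranked[0], ranked[1]
    # Positive value: on this feature the questioned text is closer to `best`.
    evidence = {
        w: (qi - r) ** 2 - (qi - b) ** 2
        for w, qi, b, r in zip(FUNCTION_WORDS, q,
                               profiles[best], profiles[runner_up])
    }
    return best, evidence


if __name__ == "__main__":
    known = {
        "A": ["the cat and the dog and the bird", "the sun and the moon"],
        "B": ["i think that i know that it is", "i said that it was fine"],
    }
    author, evidence = attribute("the river and the hill", known)
    print(author, sorted(evidence, key=evidence.get, reverse=True)[:3])
```

The `evidence` dictionary is a crude stand-in for the "side effect" mentioned above: it shows which features pulled the questioned text towards the predicted author, which is the kind of explanation an analyst would need to check.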

Now, this is a very general framework, so the key question is: What features do you choose? The framework itself doesn't care and the statistical classification methods don't care. There are some properties of the features that interact with the mathematical properties of one method or another, but this is usually a fairly minor affair. The key direction is to ask: What are the features that we can automatically extract from these texts that will be useful? Then we can ask what their statistical properties are, so we know what kinds of statistical methods to use for the analysis. And if we can solve those problems in a way that applies better to the forensic context then we may be able to automate a lot of the analysis that forensic linguists now do manually.

So in your opinion what we need is better ways of finding the features needed to do forensic authorship attribution?

Precisely. We need a better understanding of how to find and identify those features. For example, one kind of feature that is often very important in forensic cases, such as in text messages where you aren't looking at just a single text but rather a whole sequence of texts, is discourse features: features that relate one text to previous texts by the same author or to other authors' texts that it is responding to. Those kinds of features are much harder to understand and capture automatically than lexical and syntactic features. And that's, I think, one of the most significant problems here, added to the fact that an individual text is very, very small, so there's very little data to work on to make these determinations. This is a fundamentally difficult technical problem.

In terms of these features that are more difficult to analyse automatically without human intervention, what do you think the direction is in the future? Do you think it is possible or it will be possible to make even these analyses automatic or do you think it will always require some sort of human intervention?

I think that a great deal can probably be automated -- at a minimum we can develop systems to aid the analysts, that is, provide a computer-aided analysis. In fact, this is ultimately what there is anyway -- even if there is a fully automated process that spits out an answer at the end, a human has to come in, look at the answer, look at the explanation for the answer, and make sure that it makes sense and that it is consistent with what we know about the case and about human language.

But a great deal of what is now done manually can be automated. One of the difficulties with automating extraction of some of these discourse features is that understanding what is going on in discourse often involves not just knowing something about linguistic structure but also knowing something, or even a lot, about the world: knowing, for example, that families have parents, have siblings, and so forth. That's knowledge about the world that allows us to interpret statements in a discourse in context and to understand the discourse function. As a concrete example, if a paedophile in a chat with a child writes "I'm going to call your mom and dad", understanding that the statement is functioning as a threat requires knowledge of family structure and dynamics; it's not clear just from the linguistic structure that it is a threat, although in the context of the conversation it obviously is one. So our question from a computational linguistics standpoint is what knowledge is needed and how do we represent this knowledge? How do we capture this knowledge to be able to do this analysis? And that's really a big question.

How do you think the field is evolving?

I think that what we are starting to see in the field, which the gracious invitation from CFL for me to visit is part of, is much more collaboration between traditional applied linguists and forensic linguists on the one side and computational linguists such as myself on the other. We are looking at how we can work together, both by on the one hand having applied linguists inform better what we are doing on the computational side to help us to develop systems that work better and also that are more practical and applicable to real-world cases, not just ivory-tower research, and on the other hand working to design and develop computational systems that can give useful assistance to linguists. A very important piece of that is, as I have said, that our software systems have to be such that they don't just give an answer but they also give the analyst an idea of how they got to the answer, so that the analyst can properly evaluate it and put it together with everything else they know.

This process requires some real reorientation from everybody involved to make this collaboration work. I think that what we will see as we look forward are teams composed of both linguists and software systems -- we will see some sophisticated teamwork approaches where the software systems are doing fairly deep analyses and linguists understand how to work with those systems in order to produce more effective results overall. I note that one very important thing that software systems bring to the problem is not just the possibility of getting new kinds of results, but also on the very basic level doing a lot of very simple work that takes a great deal of time and effort for the linguist, such as counting numbers of occurrences of discourse features. If we can do that automatically and accurately, we will make the job of the forensic linguist much easier and give them much more time to do more interesting parts of the task.




© Centre for Forensic Linguistics, Aston University, Birmingham, UK, 2014