"A great deal of what is now done manually can be automated."
Shlomo Argamon is Professor of Computer Science and Director of the Linguistic Cognition Laboratory at the Illinois Institute of Technology.
During a visit to CFL as part of Aston University's Distinguished Visitor scheme, he talked to
Andrea Nini about his work in computational linguistics and authorship attribution.
Have you done any consultancy work as a forensic linguist?
A small amount. There were two cases that I was asked to consult on,
but before I had gotten very far with any of the analysis the
consultancy was called off for ancillary reasons.
Did you analyse these texts, nonetheless?
I had only completed partial analyses of the data before my
involvement in the cases was called off.
How do you think the kind of analysis you did for those forensic
cases differs from other kinds of attribution analyses?
I come to forensic linguistics from the standpoint of computational
linguistics, so I research how to build fully automated solutions
that can perform the attribution task without human
intervention. Obviously, when it comes to explaining results, and
certainly when testifying, you need human interpretation. But the idea
is to develop techniques that we can apply fully automatically, then
measure their reliability rigorously, and then use them to get a clear
idea of what's happening in an actual case.
Now, in our research we typically work with cases which are, for a
number of reasons, much easier than a typical forensic
case. Krzysztof Kredens distinguishes between "authorship
attribution" and "forensic authorship attribution." In most authorship
attribution work done in historical or literary scholarship one has
long and relatively clean texts in terms of grammatical structure and
so forth, and one usually has quite a lot of data. You therefore have
a lot of background data on your suspect, you have a lot of background
data on other distractor or confusor authors to compare the suspect
against, and the questioned text itself is often quite long. In such
situations we can apply statistical techniques that rely on the fact
that there is a lot of data. For example, you can get pretty good
statistical estimates of how frequently somebody uses nouns versus
verbs or things like that, and such statistics usually characterise
authorship fairly well.
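As a rough illustration, a frequency profile of this kind can be computed in a few lines. The sketch below is my own, assuming Python with NLTK and its tokeniser and tagger models already downloaded; the example sentence is a placeholder, not case data.

```python
# Minimal sketch: a coarse part-of-speech frequency profile with NLTK.
# Assumes nltk is installed and its tokeniser and tagger models have
# been downloaded; the example sentence is a placeholder.
from collections import Counter
import nltk

def pos_profile(text):
    """Return relative frequencies of coarse part-of-speech tags."""
    tokens = nltk.word_tokenize(text)
    tags = [tag[:2] for _, tag in nltk.pos_tag(tokens)]  # NN, VB, JJ, ...
    counts = Counter(tags)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}

profile = pos_profile("The suspect wrote several long letters last week.")
print(profile.get("NN", 0.0), profile.get("VB", 0.0))  # noun vs. verb rate
```

With long, clean texts, profiles like this can be estimated reliably enough to compare authors; the problem Argamon describes next is what happens when they cannot.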
By contrast, what I found in the forensic context, which was not
really surprising, is that the data sets were much smaller, so I couldn't
directly rely on these kinds of statistical techniques. In addition,
the texts were grammatically and orthographically, really in almost
every way, ill-formed -- hence I couldn't rely on standard computer
software that we use all the time to automatically parse the syntax of
a sentence and assign the parts of speech. That wouldn't have worked
well, because the texts were too ill-formed for those standard methods
to handle. So what I needed to look for were ways
to extract useful lexical or even character-based features --
things like emoticons, abbreviations, use of numerals in place of
syllables, and so forth.
The furthest I got in these cases was to go manually through
samples of the data and extract some sort of dictionary of plausible
lexical and orthographic features, as opposed to applying an automated
computer system which would "read" through all of the data and
automatically find potentially interesting and useful deviations from
the linguistic norms.
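To make this concrete, here is a small Python sketch of the kind of surface-level feature extraction being described. The particular patterns are illustrative assumptions of mine, not a dictionary from any actual case.

```python
# Sketch of surface-level feature extraction for noisy, ill-formed texts.
# The patterns below are illustrative guesses, not a validated inventory.
import re

FEATURE_PATTERNS = {
    "emoticon":      re.compile(r"[:;=]-?[)(DPp]"),        # :) ;-( =D
    "digit_for_syl": re.compile(r"\b\w*[a-z][248]\w*\b|\b[24][a-z]+\b",
                                re.I),                     # gr8, 2day
    "abbreviation":  re.compile(r"\b(?:lol|brb|omg|idk|tbh)\b", re.I),
    "repeat_chars":  re.compile(r"([!?\w])\1{2,}"),        # soooo, !!!
}

def surface_features(text):
    """Count occurrences of each orthographic feature per 100 characters."""
    scale = 100.0 / max(len(text), 1)
    return {name: len(pat.findall(text)) * scale
            for name, pat in FEATURE_PATTERNS.items()}

print(surface_features("omg gr8 news!!! c u 2day :)"))
```

An automated system of the kind Argamon envisages would discover such deviations from linguistic norms itself rather than relying on a hand-built pattern list like this one.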
From what I saw in these cases and from what I gather from the
experience of my colleagues working on cases, this is really the key
distinction. There's quite a gap there between what is possible and
what works in these different scenarios. The key question now is how
do we take methods that give good accuracy and insight, but only
when given large amounts of well-formed data, and push them down to
meet real-world needs where data are much less well-formed, are
much more informal, have a lot more linguistic variation, and so
forth? That's where a lot of this work needs to move towards.
What do you think then is the missing link between authorship attribution and forensic authorship attribution?
I don't think that there is a single silver bullet which is going to
solve the problem. We can start, though, from the general framework
which is as follows: Take a text or a body of texts. The first thing
to do is identify what features of the text to represent, to create an
abstract representation of the style of a text. These features could
be as simple as the relative frequencies of different words in the
text; you might take function words, parts of speech, or other
features. Then you can construct an abstract representation of the
text based on the frequencies of those various features in the
text. Given such a representation, there is a whole library of
statistical machine learning classification methods that take such
abstract mathematical representations of the texts and find the best
ways to distinguish between one group of texts and another -- say,
Author A and Author B in the simplest case. One can
then take a questioned document, represent it in the same way, and
then compare it to documents by A and B to get a predicted answer. As
a side effect of this process, you can also find out which features were
most determinative of the classification, so you can explain why the
questioned document was classified one way or the other.
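As a minimal sketch of that whole pipeline, and assuming Python with scikit-learn plus toy data standing in for real case material, it might look like this:

```python
# Sketch of the general attribution framework: represent texts by
# function-word frequencies, train a classifier on Author A vs. B,
# classify a questioned document, and inspect the feature weights.
# scikit-learn and the toy texts are my assumptions, not the
# interview's specific method.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

FUNCTION_WORDS = ["the", "of", "and", "to", "in",
                  "that", "it", "for", "with", "as"]

# Toy training data: known writings of Author A and Author B.
texts  = ["the cat sat in the hall as it rained",
          "of all the plans that we made for this",
          "to go with him was risky and it failed",
          "for that reason and with little doubt"]
labels = ["A", "A", "B", "B"]

# Steps 1-2: represent each text by relative function-word frequencies.
vec = CountVectorizer(vocabulary=FUNCTION_WORDS)
X = vec.fit_transform(texts).toarray().astype(float)
X /= X.sum(axis=1, keepdims=True)  # normalise to relative frequencies

# Step 3: fit a statistical classifier distinguishing A from B.
clf = LogisticRegression().fit(X, labels)

# Step 4: represent the questioned document the same way and classify it.
q = vec.transform(["that it rained in the hall"]).toarray().astype(float)
q /= q.sum()
print(clf.predict(q))  # predicted author

# Side effect: the learned weights show which features drove the decision.
print(dict(zip(FUNCTION_WORDS, clf.coef_[0].round(2))))
```

A real case would of course demand far more data, careful validation, and human interpretation of the output, as the interview stresses throughout.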
Now, this is a very general framework, so the key question is: What
features do you choose? The framework itself doesn't care and the
statistical classification methods don't care. There are some
properties of the features that interact with the mathematical
properties of one method or another, but this is usually a fairly
minor affair. The key direction is to ask: What are the features that
we can automatically extract from these texts that will be useful?
Then we can ask what their statistical properties are, so we know what
kinds of statistical methods to use for the analysis. And if we can
solve those problems in a way that applies better to the forensic
context then we may be able to automate a lot of the analysis that
forensic linguists now do manually.
So in your opinion what we need are better ways of finding the features needed to do forensic authorship attribution?
Precisely. We need a better understanding of how to find and identify
those features. For example, one kind of feature which is often very
important in forensic cases, such as in text messages where you aren't
looking at just a single text but rather a whole sequence of texts,
is discourse features: features that relate one text to previous
texts by the same author, or to other authors' texts that it is
responding to. Those kinds of features are much harder to understand and capture
automatically than lexical and syntactic features. And that's I think
one of the most significant problems here, added to the fact that an
individual text is very, very small so there's very little data to work
on, to make these determinations. This is a fundamentally difficult problem.
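As one toy illustration of what can nonetheless be computed automatically (my own construction, not a method from the interview), consider a simple sequence-aware feature: how strongly a message lexically echoes the message it replies to.

```python
# Toy sketch of one simple discourse-level feature for message threads:
# the fraction of a message's words that echo the message it replies to.
# An illustrative construction, not a validated forensic feature.
import re

def echo_rate(message, previous):
    """Fraction of word tokens in `message` that also occur in `previous`."""
    msg = re.findall(r"\w+", message.lower())
    prev = set(re.findall(r"\w+", previous.lower()))
    return sum(tok in prev for tok in msg) / max(len(msg), 1)

thread = ["are you coming tonight", "coming tonight? no chance mate"]
print(echo_rate(thread[1], thread[0]))  # 0.4 -- computable per message
```

Capturing the genuinely interesting discourse relations, such as what a message is doing in response to another, is far harder than this kind of surface overlap, which is exactly the gap being described.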
In terms of these features that are more difficult to analyse
automatically without human intervention, what do you think the
future direction is? Do you think it is, or will be, possible to make
even these analyses automatic, or will they always require some sort
of human intervention?
I think that a great deal can probably be automated -- at a minimum we
can develop systems to aid the analysts, that is, provide a
computer-aided analysis. In fact, this is ultimately what happens
anyway -- even if there is a fully automated process that spits out an
answer at the end, a human has to come in, look at the answer, look at
the explanation for the answer, and make sure that it makes sense and
that it is consistent with what we know about the case and about human language.
But a great deal of what is now done manually can be automated. One of
the difficulties with automating extraction of some of these discourse
features is that understanding what is going on in discourse often involves
not just knowing something about linguistic structure but also knowing
something or even a lot about the world, knowing, for example, that
families have parents and siblings, and so forth. That's knowledge
about the world that allows us to interpret statements in a discourse
in context and to understand the discourse function. As a concrete
example, if a paedophile in a chat with a child writes "I'm going to
call your mom and dad", understanding that the statement is
functioning as a threat requires knowledge of family structure and
dynamics; it's not clear just from the linguistic structure that it is
a threat, although in the context of the conversation it obviously is
one. So our question from a computational linguistics standpoint is what
knowledge is needed and how do we represent this knowledge? How do we
capture this knowledge to be able to do this analysis? And that's
really a big question.
How do you think the field is evolving?
I think that what we are starting to see in the field, of which the
gracious invitation from CFL for me to visit is a part, is much more
collaboration between traditional applied linguists and forensic
linguists with computational linguists such as myself. We are looking
at how we can work together: on the one hand, having applied
linguists better inform what we are doing on the computational side,
to help us develop systems that work better and that are more
practical and applicable to real-world cases, not just ivory-tower
research; and on the other hand, working to design and develop
computational systems that can give useful assistance to
linguists. A very important piece of that is, as I have said, that our
software systems have to be such that they don't just give an answer
but also give the analyst an idea of how they got to the answer, so
that the analyst can properly evaluate it and put it together with
everything else they know.
This process requires some real reorientation from everybody involved
to make this collaboration work. I think that what we will see as we
look forward are teams composed of both linguists and software systems
-- we will see some sophisticated teamwork approaches where the
software systems are doing fairly deep analyses and linguists
understand how to work with those systems in order to produce more
effective results overall. I note that one very important thing that
software systems bring to the problem is not just the possibility of
getting new kinds of results, but also on the very basic level doing a
lot of very simple work that takes a great deal of time and effort for
the linguist, such as counting numbers of occurrences of discourse
features. If we can do that automatically and accurately, we will
make the job of the forensic linguist much easier and give them much
more time to do more interesting parts of the task.
Shlomo Argamon's webpage at the IIT
Shlomo Argamon's Twitter page