Mel Evans for the SSEMWG Blog
Aphra Behn and her literary works, as is often the case with pioneers, have been subject to extensive criticism and derogation. Accusations of plagiarism, salacious and bawdy content, and poor literary merit were directed at her theatrical, prose and poetic works throughout her lifetime, and subsequent criticism and popular opinion have often adopted a similar stance. More recently, Behn’s achievements have, of course, been increasingly recognised and celebrated as part of the endeavour to rehabilitate and reposition women’s writings in Anglophone literature; Virginia Woolf’s famous eulogy to Behn, as the woman to whom all subsequent women writers owe their opportunity, has been followed by literary and critical scholarship. The forthcoming Cambridge Works of Aphra Behn, for which the present author is a General Editor, along with Elaine Hobby, Gillian Wright and Claire Bowditch, is part of this effort to properly contextualise and appreciate the social, textual and literary qualities of Behn’s diverse writings. Because the general editors, along with Alan Hogarth, the postdoctoral research fellow focusing on attribution matters, work very closely together, this post refers to ‘we’ and ‘our’, although the author on this occasion is Mel Evans. Collectively, we are the members of the AHRC-funded project ‘Editing Aphra Behn in the Digital Age’.
One continuing thread between the criticism of Behn’s works in her lifetime and that of present-day scholarship is that of authorship: debate and evaluation of what Behn really wrote, and what she did not. The value of authorship, as a singular and proprietary activity, was of course a significant conceptual development during the latter half of the seventeenth century (see Kewes 1992), and Behn’s works reflect this context. In addition to the accusations of plagiarism, which were made most famously in relation to The Rover, provoking a vigorous defence from Behn in the published play’s postscript (the titular quote for this blogpost is taken from that postscript), others sought to benefit from the association of Behn’s name. There are several prose fiction works published after her death (in 1696, 1698 and 1700) that raise suspicions. For example, in 1698 Charles Gildon (a well-known literary hack) prepared the volume All the histories and novels written by the late ingenious Mrs. Behn entire in one volume which included short fiction and translations associated with Behn in her lifetime, such as ‘Oroonoko’, as well as ‘three Novels not Printed before’. Gildon takes care to inform the reader that ‘The Stile of the Court of the King of Bantam, being so very different from Mrs Behn’s usual way of Writing’ could ‘perhaps call its being genuine in Question.’ Gildon explains such deviation by way of a wager, Behn being challenged to write ‘in the Style of the Celebrated Scarron.’
Subsequent volumes in 1700, prefaced by Samuel Briscoe, add a further five works of English prose fiction ‘never before Printed.’ This series of ever-expanding volumes containing new-found works by the ‘incomparable Mrs. Behn,’ of course raises suspicions for the present-day scholar. It may well have led to similar doubts among contemporary readers, although we have not yet encountered any examples (for a discussion of the bookselling context of the period, see Maureen Bell’s forthcoming essay in Women’s Writing, 2020).
In the twenty-first century, the focus on authorship remains a cornerstone of cultural and literary appraisals of historical women’s writing. Despite the shift to other non-authorial frames of analysis, for scholars interested in destabilising the traditional canon, discovering what, how and where women were able to write – and in some cases even make a living from their pen – remains of high importance. This stance informs The Cambridge Edition of the Works of Aphra Behn, which will include texts with a longstanding association with Behn, even if they lack either Behn’s name on the title-page or a lifetime date of (first) publication. The collection of such dubia is considerable, and spans across all the genres (drama, poetry, prose fiction, prose nonfiction, correspondence and works of translation) that Behn was known to write and publish. In order to produce an informed and fair editorial apparatus for these works, the edition combines traditional literary critical approaches to authorship with non-traditional computational techniques, known as computational stylistics, or stylometry.
Computational Approaches to Authorship
Computational approaches to authorship and attribution date back over fifty years, with the publication of Mosteller and Wallace’s Inference and Disputed Authorship: The Federalist in 1964 generally considered the landmark initiation of the field. Since then, literary scholars have focused on applying computational methods primarily to a particular subset of time periods, works, languages and authors. The debates – often heated – surrounding Shakespeare’s authorship are perhaps the most familiar for readers of this blog. The approach is essentially interested in the likeness between a questionable text (a work of dubia) and works known to be by a given author (such as Shakespeare, or Behn). The identification of this likeness is based on the quantitative profile of one or more linguistic and textual characteristics of the texts in question. The specifics of our approach are outlined in more detail below, although what we wish to explore in the following discussion are some of the challenges that have arisen in our attempt to investigate works by Behn, and those of her post-Shakespearian contemporaries.
One observable facet of past computational authorship and attribution work is that it remains largely canonical in its interests: focused on removing or restoring works to male authors traditionally heralded as the vanguard of English literature. As editors of Behn’s works engaging with computational approaches to authorship, we are therefore in a fortunate position to offer a modest corrective: not to the language under analysis, which remains early modern English, but in our exploring a period of early modern literature that is not Shakespearean, and in our focusing on a woman writer. Another (perhaps related) facet is that the field of stylometry and authorship attribution, as it pertains especially to early modern literature, is surprisingly male dominated; this is despite the increasing equality in genders in other English literature and English language sub-disciplines. The field can also be somewhat heated and passionate, and not always constructively so. For us to place a new focus on Behn, and to do so as scholars with a vested interest in getting her works and writings better known, understood and appreciated, is therefore a challenging and occasionally risky (from a professional perspective) enterprise.
The dramatic dubia
At its core, non-traditional authorship attribution relies on the premise that each writer (or speaker) has their own distinctive voice (idiolect) that is created by the frequencies and sequences of the words they use. Despite, or perhaps because of, the field’s relatively youthful status, a variety of methods abound in how this authorial signal can be identified. Our approach has been to trial a combination of approaches, looking for continuities and coherence in the results. If our findings point consistently in the same direction, then we have some evidence from which we can make an assessment as to the likelihood of Behn’s involvement in any given text. In taking this somewhat egalitarian approach, there arise opportunities as well as challenges; opportunities, because it enables us to discuss and draw on the expertise of scholarship across the field; yet challenges, in that because of the divisions that already exist, there can arise assumptions that in using one particular approach or method (first), you are therefore aligning yourself to any associated ideologies and arguments. As newcomers to the field, and as scholars whose agenda is primarily to do the best by Behn, we have sought to negate such assumptions whenever possible.
The analysis of Behn’s dubia has thus far focused on Behn’s drama, with some preliminary explorations of her prose fiction and her correspondence (literary and otherwise). Still to come are her poetry and translations. The kinds of insights produced by non-traditional approaches to authorship, as well as the challenges of working on less-established periods of literature (within this particular research context), are diverse and numerous – some have been as expected, others less so. One exciting aspect of using computational methods to investigate Behn’s authorship is that it has necessitated a focus, for the first time, on what characterises Behn’s style, as well as that of some of her contemporaries. Attribution work has not always dwelled upon the attributes that inform verdicts of a dubious text’s likeness or otherwise with a candidate author; but our exploration of Behn’s drama and prose fiction has revealed the variety and scope of her writing in ways that complement and expand upon existing literary criticism. As we report elsewhere (Evans, 2017), Behn’s dramatic style shows changes over time. Whilst the development of a literary author’s style is not a new concept, the ways in which it changed are striking and of interest to literary scholars, showing a shift from language forms that focus on the self (e.g. first-person pronouns, exclamations conveying emotion) to a more interactive, interpersonal set of features (e.g. second-person pronouns, adverbial markers of place and time). Whilst of merit on their own, such as informing discussions surrounding Behn’s flexibility in attending to, or anticipating, the requirements of the stage over the course of the Restoration period, these findings also inform assessments of authorship. The research has shaped our thinking as to how the investigation of the late seventeenth century, and a writer like Behn, is best configured, as against studies working with earlier or later periods of English literature.
Our exploration of the drama, as it presently stands, has produced a range of findings.

Figure 1: Results for a Rolling Delta analysis of The Revenge; lower scores indicate a greater stylistic similarity (i.e. in the word frequencies) between each author’s plays and The Revenge.
Reported here for the first time, our work suggests that The Revenge – a play first performed and published in 1680, and long recognised as an adaptation of Marston’s The Dutch Courtesan – shows a strong likeness with Behn’s dramatic works. This likeness is consistent across a variety of different linguistic measures and statistical treatments (Figure 1 provides a representative set of results). Behn’s stylistic profile and therefore likeness to The Revenge is unwavering regardless of the other authors included in the test (based on a sample of works from dramatists including Dryden, D’Urfey, Shadwell and Ravenscroft, among others). Importantly, it is the consistency of the finding for the analysis of The Revenge that gives us most confidence and allows us to propose that the long-standing attribution of this play to Behn can be considered robust and reliable. Although there are no certainties in this kind of work, we consider the results persuasive – and ones that can be critiqued and built upon by traditional approaches to authorship.
The results for the other drama dubia have not offered so clear a picture. One of the dubia associated with Behn is The Counterfeit Bridegroom (1677), a modest reworking of Middleton’s No Wit, No Help like a Woman’s, probably written in the 1620s or 1630s, although not published until 1657. The prominence (i.e. replication) of the source text is a feature that complicates our aim to identify the degree of likeness between the Restoration elements of the dubia, and Behn’s known plays. One strategy has been to incorporate Middleton’s works, including the source text, into the comparison when examining the linguistic profile of The Counterfeit Bridegroom.

Figure 2: Results for a Rolling Delta analysis of The Counterfeit Bridegroom; lower scores indicate a greater stylistic similarity (i.e. in the word frequencies) between each author’s plays and The Counterfeit Bridegroom. Adapted from Evans and Hogarth, forthcoming.
To a degree, this technique has been effective: it has isolated those parts that are least deviant from Middleton, and means we can control for “Middletonian” language features when looking at the profile of the play (see Evans & Hogarth, forthcoming in Digital Scholarship in the Humanities). However, unlike the results for The Revenge, the results are very inconsistent in terms of the play’s likeness with Behn’s known drama. Some tests show a potential similarity between Behn’s plays and the final acts of The Counterfeit Bridegroom, whereas others suggest that Behn’s contemporaries are much closer in linguistic style for the exact same sections. It seems probable that the degree of “collaboration” between Counterfeit Bridegroom and its Middletonian source text creates a linguistic profile for which it is very difficult to isolate any clear-cut authorial signal for the Restoration additions; that, and/or our tests have not been focussed on the right candidate author that would provide a clear and consistent result, akin to that achieved for The Revenge. As it stands, Behn’s involvement in The Counterfeit Bridegroom cannot be as confidently asserted as for The Revenge, but nor can it be ruled out.
The Debauchee is another play commonly attributed to Behn. It was likely first performed in 1676/7, the same season as The Rover, and is another adaptation of a play written earlier in the century: this time, Richard Brome’s A Mad Couple Well Matched (written c. 1639, pub. 1653). As with Counterfeit Bridegroom, the computational tests produce inconsistent and inconclusive results with regard to Behn’s authorship. The role of Brome’s source text in the play, however, is a strong and persistent finding throughout.

Figure 3: Results for a Rolling Delta analysis of The Debauchee; lower scores indicate a greater stylistic similarity (i.e. in the word frequencies) between a play and The Debauchee. Scores show the lowest scoring segment of a play for each author in the test. In the case of Brome, this is the source text, Mad Couple, throughout.
Depending on the quantity of words included in the analyses, which ranges from 100 to 1000 in intervals of one hundred, and the statistical techniques applied, the play can profile as being remarkably like Behn’s known comedies or, conversely, as being very dissimilar. Including Brome’s Mad Couple and a selection of his other plays in the tests pointed to the sections of The Debauchee that are least modified from Brome’s original. However, isolating a salient authorial signal in the Restoration additions using the evidence available did not show any consistent likeness with Behn’s dramatic style. Tests that use a greater number of words for comparison (e.g. the 500 or more most frequent words) are more likely to show a similarity with Behn. This is a finding that could suggest that (some of) the similarity is a result of topical similarity, rather than of the less conscious linguistic preferences associated with function words, which are captured more acutely in the tests using a smaller number of words (i.e. fewer than the 500 most frequent words).
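The two settings varied in these tests, the size of the word list and the segment of the play under scrutiny, can be illustrated with a short sketch. This is not the project's own code: the window size, the step, and the function name rolling_windows are hypothetical choices made for illustration, on the assumption that a rolling analysis compares overlapping segments of the disputed play one at a time. Only the word-list sizes (100 to 1000, in hundreds) come from the description above.

```python
def rolling_windows(tokens, size=5000, step=1000):
    """Yield (start_index, window) pairs for overlapping segments of a text.

    `size` and `step` are illustrative values, not the project's settings.
    A text shorter than one window is returned whole.
    """
    if len(tokens) <= size:
        yield 0, tokens
        return
    for start in range(0, len(tokens) - size + 1, step):
        yield start, tokens[start:start + size]


# Word-list sizes as described in the post: 100 to 1000 in steps of 100.
MFW_SIZES = list(range(100, 1001, 100))

# Toy demonstration with ten numbered "tokens".
demo = list(rolling_windows(list(range(10)), size=4, step=2))
```

Each window would then be scored against every candidate author at each word-list size, so that a result which appears only at the larger sizes can be told apart from one that persists across all of them.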
What are the implications?
The three cases reported briefly here are, it should be acknowledged, valuable for what they can tell us about computational methods and how they respond to complex cases of collaboration and adaptation (what Love (2002) calls ‘precursory authorship’). The analyses have also offered new – somewhat unexpected – perspectives on Restoration dramatic practice, highlighting, for instance, the persistent unoriginality of Edward Ravenscroft’s plays, such as The English Lawyer (1678). His works often group with the pre-1660 source texts and others by those authors, reflecting Ravenscroft’s tendency to undertake only minimal revisions of the sources of his plays. Our investigations of Behn’s potential authorship of the dramatic dubia have also, of course, provided new perspectives on what constitutes Behn’s style, and the ways in which it aligns with and diverges from that of her contemporaries. A quantitative and corpus-linguistic perspective of her dramatic writing is an important contribution to the literature, in that it parallels the kinds of analyses more commonly undertaken for dramatists from earlier periods – the exemplary work of Jonathan Culpeper and colleagues on the Encyclopedia of Shakespeare’s Language is one such example.
However, there are also less positive outcomes that may arise from our computational investigation of the likelihood of Behn’s authorship of the dramatic dubia. One challenge that now faces our editors and the edition is how such inconclusive and ambiguous results are to be accounted for and used effectively. Here, the editorial ethos of the project is important. We made an early decision that the results of the computational analyses would only inform editorial approaches, not determine them – regardless of the result. That is, the editorial treatment of a play or other work of dubious attribution would be on a par with that of Behn’s securely attributed works. This principle has offered the editors scholarly freedom to develop their interpretations appropriately, sensitive to the specific findings of each case (e.g. The Revenge versus The Debauchee). This informs how they frame their play in their headnote, and the kinds of editorial clarification offered in the commentary notes.
It might also be argued that the computational analyses – because of their inability to offer a consistent picture of The Counterfeit Bridegroom and The Debauchee – are in fact rather redundant: they offer no clarification or definite outcome in favour of either side of the attribution debate. To this, we would propose that the linguistic findings enrich our understanding of Behn as a writer, regardless of the verdict on the authorship of the dubia, as well as offering another perspective on why the dubia may have been associated with Behn in light of their lexical make-up. In the inconclusive tests, the results show that the dubia are more likely to be identified as “Behn-like” when the wordlist is larger, and therefore picks up on vocabulary linked to topic and content, than when it is limited to the most frequent words (around the top 100-200), which tend to be grammatical items. This offers a perspective on why such works may have been attributed to Behn: words such as young, name, night, marry, heaven and fair may collectively indicate preoccupations associated with Behn, in contrast with the works of her contemporaries, bound up with her preferred plot structures and character interactions. It also suggests something about how Restoration authors went about adaptation, retaining the dialogic structures and patterns created through grammatical vocabulary, but incorporating new content words. As Hugh Craig and Brett Greatley-Hirsch have recently shown, there are interesting and relatively under-investigated developments in the language and style of drama post-1660, and our findings add to the growing and nuanced picture we have of this period of English literary history.
A third, related point concerns the implications that such inconclusiveness has for the perception of computational stylistics and its relevance to literary and editorial endeavours more generally – particularly in light of the work required to prepare texts by less mainstream authors. Attitudes towards computational evidence remain largely sceptical, sometimes antagonistic, and not unreasonably so: we still struggle to explain quite how the lexical profiling approach works at the level of an individual’s cognitive processes, as well as how such individual preferences intersect at the socio-cognitive level. Whilst computational stylistics, as we’ve tried to argue, offers more than authorship and attribution, this still remains an expected application of the approach. In seeking to promote Behn’s works and writings through new scholarship, we have the opportunity to demonstrate how this method of investigation can be useful for other works and periods. The digital preparation of the texts and the development of new analytic approaches are themselves a testament to the worth of the author or period under investigation. And there is the potential, too, to use the results for new authors, periods and, indeed, languages, to hone and refine the methods and burgeoning theoretical models of computational stylistics and attribution research. Behn’s literary writings, her social context, and her linguistic environment are radically different from those of Shakespeare or Henry James (two authors whose works have received extensive attention within the field), and the computational approach may actually force us to confront assumptions about what we think we know about authorial style, dramatic collaboration, and processes of adaptation.
Looking forward
Whilst we may be nearing the end of the formal, funded project (AHRC ‘E-ABIDA’), our work on Behn’s style and authorship has only just begun: we turn our attention next to the prose fiction and her poetry, as well as her spying correspondence composed in Antwerp in 1666; Akkerman’s recent, excellent scholarship has questioned the authenticity of some letters purportedly copied by Behn from her informant and possible former lover, William Scott. We anticipate similar challenges and discoveries with these genres as those encountered over the last few years with Behn’s drama – as well as the prospect of gaining insight into the degree of likeness between Behn and the dubia, and of learning more about an author for whom we have, almost exclusively, only words on a page.
——
A Brief Overview of Method:
To establish the likeness between one text and the linguistic attributes most typical of the works of a given author, computational approaches involve two main methodological decisions. The first is the way that the language of those texts (the dubia, and the author’s known works) is measured. In our investigation of Behn’s writings, we have primarily focussed on lexical units: looking at the words that make up a text, and identifying their frequencies separate from their original syntactic context (known as the ‘bag of words’ approach). As established in the work of John Burrows, David Hoover and others, high frequency words can be very significant for attribution and authorship. High frequency items, often grammatical (function) words, such as conjunctions, pronouns and prepositions, are the “glue” of a text, but how an author applies that glue seems to be very idiosyncratic. Other studies have looked at word sequences (for instance, 2-grams are two-word sequences; e.g. ‘for instance’, ‘instance 2-grams’, ‘2-grams are’, ‘are two-word’, ‘two-word sequences’) and character sequences (that is, the sequences of graph-forms that make up the words of a text), and efforts are also turning to exploring the propensity of words to co-occur in particular spans of a text (e.g. within a ten-word span, or every other word). Computer software, such as the University of Newcastle’s (Australia) Intelligent Archive, provides a swift and replicable way of collecting word frequencies in a text or set of texts.
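The ‘bag of words’ count and the 2-gram example above can be sketched in a few lines of Python. This is a minimal illustration only, not the Intelligent Archive or the project's own tooling; the function names and the tokenisation rule (lowercasing, keeping internal hyphens and apostrophes) are assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into word tokens.

    Internal hyphens and apostrophes are kept, so 'two-word' is one token.
    This rule is an illustrative assumption, not a standard.
    """
    return re.findall(r"[a-z0-9]+(?:['-][a-z0-9]+)*", text.lower())


def relative_frequencies(tokens, top_n=None):
    """Return (word, relative frequency) pairs, most frequent first.

    Word order is discarded: only the counts survive (the 'bag of words').
    """
    counts = Counter(tokens)
    total = len(tokens)
    return [(word, count / total) for word, count in counts.most_common(top_n)]


def word_ngrams(tokens, n=2):
    """Return the overlapping n-word sequences of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sample = "for instance, 2-grams are two-word sequences"
tokens = tokenize(sample)
bigrams = word_ngrams(tokens, 2)
```

Run on the sample sentence, word_ngrams reproduces the five two-word sequences listed in the paragraph above, while relative_frequencies gives each word's share of the text.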
The second dimension to the computational approach is the statistical treatment of these frequencies: an analytic step necessary to establish that the distribution of the linguistic items is significant, rather than due to chance, as well as to offer an exploratory means of identifying macro-level patterns within the data that are not evident to the human eye. Whilst we may all be able to identify that whe is a particularly idiosyncratic interjection favoured by Behn in her drama between 1677 and 1682 simply by reading her plays, it is more difficult to assert without computational assistance that she uses the relative markers which and who more frequently than many of her contemporary male dramatists (who prefer that). Whilst the individual merits of these quantitative observations are something to be determined on a case-by-case basis, collectively the statistical distribution of word forms and their frequencies determines the profile of a text. From this, the likeness of this profile to that of other works – such as those by the same author – can be established in quantitative terms.
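One widely used statistical treatment of word frequencies is the Delta measure associated with John Burrows, which the figure captions above name in its ‘rolling’ form. The sketch below follows the classic formulation as it is commonly described: the relative frequencies of the most frequent words are standardised as z-scores, and the mean absolute difference between two texts' z-scores is taken as the distance. It is offered as an assumption about the general technique rather than as the project's configuration; the function names, the default word-list size and the toy texts are all hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def rel_freqs(tokens):
    """Map each word to its share of the tokens in one text."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}


def burrows_delta(disputed, candidates, n_mfw=100):
    """Score a disputed text against each candidate's known writing.

    disputed:   list of tokens from the questioned text
    candidates: dict mapping an author name to a list of tokens
    Returns a dict of author -> Delta; lower means more alike.
    Needs at least two candidates whose word frequencies differ.
    """
    profiles = {name: rel_freqs(toks) for name, toks in candidates.items()}
    target = rel_freqs(disputed)

    # The word list: the most frequent words across all candidate texts.
    pooled = Counter()
    for toks in candidates.values():
        pooled.update(toks)
    vocab = [word for word, _ in pooled.most_common(n_mfw)]

    # Mean and standard deviation of each word across the candidates;
    # words with no variation carry no signal and are dropped.
    stats = {}
    for word in vocab:
        values = [profile.get(word, 0.0) for profile in profiles.values()]
        sd = pstdev(values)
        if sd > 0:
            stats[word] = (mean(values), sd)

    def z(profile, word):
        mu, sd = stats[word]
        return (profile.get(word, 0.0) - mu) / sd

    return {
        name: mean(abs(z(target, w) - z(profile, w)) for w in stats)
        for name, profile in profiles.items()
    }


# Toy texts, invented purely to exercise the function.
known = {
    "A": "the cat and the dog and the bird".split(),
    "B": "a cat or a dog or a bird".split(),
}
questioned = "the fox and the hen and the owl".split()
deltas = burrows_delta(questioned, known, n_mfw=5)
```

In the toy example the questioned text shares its grammatical words with author A and its content words with neither, so A receives the lower score. Raising n_mfw draws more content vocabulary into the comparison, which is the effect discussed in relation to The Debauchee.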
A vital part of non-traditional authorship attribution is the set of texts themselves: texts that represent the purported author (Behn), the dubia themselves, and a set of texts representing other candidate authors or writers of the period in a more general sense; the last of these provide a valuable contextualising perspective on the quantitative results. All these texts have to be digital texts, and prepared in such a way that allows for a robust and replicable means of analysis. In practical terms, this means removing paratextual materials that are unrepresentative of the text as a whole (e.g. title pages, dedicatory epistles), as well as sections of the text that are not associated with the main author of the work, such as songs, epilogues, or forewords. Because of the variability of early modern spelling and its speculative connection to that of the presumed author in printed texts, it is typical practice to regularise the spellings of all texts to one consistent system. This can be achieved rapidly using software called VARD, developed by Alistair Baron at Lancaster University, although manual proofing is still required. Finally, it is helpful to use a mark-up language to allow different parts of a text to be extracted and analysed quickly and consistently: for example, focusing on all the dialogue in a play, rather than stage directions; or removing all examples of direct speech in prose fiction, to analyse solely the narrative prose. XML tagging, whilst laborious to apply, allows for such flexibility in textual analysis.
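To show why mark-up repays the labour, here is a small sketch of extracting only the spoken dialogue from a tagged play using Python's standard library. The element names (sp for a speech, speaker for the speech prefix, l for a line, stage for a stage direction) are TEI-like assumptions rather than the project's actual schema, and the sample text is invented placeholder material.

```python
import xml.etree.ElementTree as ET

# Invented placeholder play; tag names are assumed, not the project's schema.
SAMPLE = """<play>
  <stage>Enter two speakers.</stage>
  <sp>
    <speaker>A.</speaker>
    <l>First spoken line <stage>aside</stage> continues here.</l>
  </sp>
  <sp>
    <speaker>B.</speaker>
    <l>Second spoken line.</l>
  </sp>
</play>"""

# Elements whose content is not part of the spoken dialogue.
SKIP = {"stage", "speaker"}


def spoken_text(elem):
    """Collect text inside an element, leaving out skipped elements.

    Text that follows a skipped element (its 'tail') still belongs to the
    surrounding speech, so it is kept.
    """
    parts = []
    if elem.tag not in SKIP:
        if elem.text:
            parts.append(elem.text)
        for child in elem:
            parts.extend(spoken_text(child))
    if elem.tail:
        parts.append(elem.tail)
    return parts


def dialogue_only(xml_text):
    """Return the dialogue of a play as one whitespace-normalised string."""
    root = ET.fromstring(xml_text)
    parts = []
    for speech in root.iter("sp"):
        parts.extend(spoken_text(speech))
    return " ".join("".join(parts).split())


dialogue = dialogue_only(SAMPLE)
```

The same pattern, with a different set of skipped elements, would serve for the prose fiction case: skipping direct speech to leave only the narrative prose.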
In our preparation of Behn’s texts for the attribution analyses, we have followed these principles. Restoration literature, it seems, is something of a poor cousin – or the younger sibling – of Shakespearean and Jacobean texts, and pre-existing digital texts with the kind of mark-up and preparation we required are limited (unlike, say, the voluminous work done on Shakespeare’s contemporaries; in our exploration of some dubia, which relies heavily on earlier source texts, Hugh Craig has generously shared his corpus of Early Modern plays to allow us to investigate this dimension of the drama). Consequently, we had to prepare many of the texts from scratch, using EEBO-TCP versions where available and manually transcribing from the original volume when not. One of the trials of digital work such as this is the amount of time and effort that goes into text preparation, but which goes relatively unnoticed and unappreciated after the fact.
Mel Evans is a Lecturer in English Language & Linguistics at the University of Leicester. Her research explores the relationship between language and identity, with a particular focus on early modern English. She is a co-Investigator on the AHRC-funded ‘Editing Aphra Behn in the Digital Age’, in which she is leading on the computational analyses of Aphra Behn’s style and authorship, as well as editing Behn’s correspondence.