Gary Taylor and Gabriel Egan (eds), The New Oxford Shakespeare: Authorship Companion
Chapter 24 Using Compressibility as a Proxy for Shannon Entropy in the Analysis of Double Falsehood
In 2010 the present author published an essay that delved into the Double Falsehood/Cardenio quarrel and the related issues of authorship attribution (Pascucci 2012). At the time, the scope of the enquiry was limited to discriminating the hand of Shakespeare from Fletcher's and the results were necessarily incomplete, although they were consistent with those achieved by E. H. C. Oliphant (1919), Jonathan Hope (1994), and Brean Hammond (2010). In the present chapter, Theobald's controversial play is again analysed using digital tools and measured against a control set comprising a number of Elizabethan and Jacobean works by several authors. The aim is to see if more can be said about its likely authorship.
The existence of Cardenio, and its attribution to Fletcher and Shakespeare, was recorded in 1653 in the Stationers' Register. However, no manuscript of that Jacobean play has survived; we possess only the 1728 printed edition of Double Falsehood, which its self-proclaimed editor and adapter, Lewis Theobald, attributed to Shakespeare. In Biographia Dramatica, Edmond Malone attributed the Jacobean play to Philip Massinger, whereas Richard Farmer, in his Essay on the Learning of Shakespeare, attributed it to James Shirley (Gilchrist and Gifford 1812; Hammond 2010, 79–80). In more recent years, John Freehafer (1969) suggested that the original play, which he believed was written by Shakespeare and Fletcher, was revised by William Davenant long before Theobald acquired it, while Harriet Frazier (1974), Jeffrey Kahan (2004), and Tiffany Stern (2011) maintained that Theobald was a forger. Most recent investigators, however, believe that his Double Falsehood was based on an older manuscript, namely Cardenio.
In his introduction to the Arden edition of Double Falsehood, Brean Hammond ascribes the paternity of the original play to Shakespeare and Fletcher and concedes, as does MacDonald P. Jackson, that the work may have undergone revision during the Restoration, possibly by Davenant (Hammond 2010, 53–4; Jackson 2012b). Jonathan Hope retains the idea of a Shakespeare–Fletcher collaboration, a conviction reinforced by Gary Taylor, Stephen Kukowski, and others (Hope 1994, 91–100; Taylor 2012a; Kukowski 1991).
Attribution through Compression Algorithms: LZ77 and BCL
In order to ascertain who wrote a literary text from internal evidence, it is essential to pinpoint the characteristic features of its style that distinguish it from the writings of other authors. This approach, which is at the core of stylometry, typically relies on occurrence, frequency, and distribution of collocations, function words, and other linguistic idiosyncrasies to identify the authorship of a text of unknown origin. The present chapter introduces a new method in this area and illustrates its application to the text of Double Falsehood.
Writing always contains a certain amount of redundancy, in the sense of repetitions of parts of the text. This redundancy enables a text to be successfully transmitted (orally and in writing) despite the noise (interference) it may encounter en route: the lost or damaged parts may be reconstructed from the parts that arrive unscathed. Because of this redundancy, a message may also be compressed, which is why various systems of shorthand and, more recently, SMS text-speak can reduce the number of characters needed to convey it. There is, however, a limit to how much a message, or the computer file containing it, can be compressed. Once all redundancy has been removed, the message itself would have to be cut to make it any shorter.
In 1948, Claude Shannon, an engineer at Bell Telephone Laboratories in America, published a means for quantifying the amount of information contained in a message in order to create efficient coding schemes for digital transmission. His key insight was that the quantity of information in a message is a way of stating how surprising it is to the recipient. In English, the letter u appearing after a q is utterly unsurprising because q is almost always followed by u, whereas a u followed by another u is most rare and surprising. (The words vacuum and continuum are among the few in common usage to contain uu.) Shannon realized that an efficient encoding scheme would use short codes for common, unsurprising sequences such as qu and long codes for rare, surprising ones such as uu. Languages already do this: the most common words in English tend to be short and rare words tend to be long. Measuring the surprise factor of a message is the same as measuring its true informational content, or its Shannon entropy. This can be found by trying to compress the message to encode most efficiently the frequently occurring repetitions of unsurprising sequences.
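Shannon's per-character measure can be computed directly from symbol frequencies. The following Python sketch (not part of the original study) simply illustrates the definition:

```python
import math
from collections import Counter

def shannon_entropy(text: str) -> float:
    """Average information per character, in bits: -sum(p * log2(p))
    over the relative frequency p of each distinct character."""
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A string of one repeated letter carries no surprise at all (0 bits),
# while an even mix of two letters carries one bit per character.
```

On this definition, `shannon_entropy("aaaa")` is 0.0 and `shannon_entropy("abab")` is 1.0: the less predictable the next character, the higher the entropy.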
Eliminating redundancy is the task usually accomplished by compression algorithms such as LZ77 and its variants—named for its creators' last names and the year of publication—which are the foundations of all compression software (or zippers) commonly found on personal computers (Ziv and Lempel 1977). We will briefly illustrate how LZ77 works before turning to how the Italian physicists and mathematicians Dario Benedetto, Emanuele Caglioti, and Vittorio Loreto modified it to make the algorithm called BCL for the purpose of authorship attribution (Benedetto, Caglioti, and Loreto 2002).
In order to perform compression, LZ77 scans the text to be compressed through a sliding window and identifies the linguistic patterns that occur more than once in it. When it finds the repetition of a string of characters, it replaces the second occurrence with a pointer back to the first, comprising two numbers: d, the distance (measured in characters) back to the previous occurrence, and n, the number of characters forming the repeated sequence. The repeated matter can be words, parts of words, spaces, or punctuation, but to illustrate the process we will use words with the repeated strings highlighted in italics:
Truly, shepherd, in respect of itself, it is a good life, but in respect that it is a shepherd's life, it is naught (115 characters)
To compress this sentence we can replace each recurrence of a word or phrase with a parenthetical d, n pair that points back to its predecessor:
The d, n pointers are recorded as a pair of 8-bit binary numbers equivalent in storage size to a pair of alphabetic characters, so in all this message is now held as 90 characters instead of 115, and with no loss of content.
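The pointer substitution described above can be sketched as a naive LZ77 in Python. This is a didactic illustration, not the bit-level encoding used by real implementations; the `window` and `min_len` parameters are illustrative choices:

```python
def lz77_compress(text, window=4096, min_len=3):
    """Toy LZ77: emit literal characters and (d, n) pointers, where d is
    the distance back to an earlier occurrence and n its length."""
    out, i = [], 0
    while i < len(text):
        best_d, best_n = 0, 0
        # search the sliding window for the longest match starting at i
        for j in range(max(0, i - window), i):
            n = 0
            while i + n < len(text) and text[j + n] == text[i + n]:
                n += 1
            if n > best_n:
                best_d, best_n = i - j, n
        if best_n >= min_len:          # worth replacing with a pointer
            out.append((best_d, best_n))
            i += best_n
        else:                          # keep the literal character
            out.append(text[i])
            i += 1
    return out

def lz77_decompress(tokens):
    """Replay literals and pointers to recover the original text."""
    s = []
    for t in tokens:
        if isinstance(t, tuple):
            d, n = t
            for _ in range(n):
                s.append(s[-d])        # copy from d characters back
        else:
            s.append(t)
    return "".join(s)
```

Run on the shepherd sentence above, this replaces the second occurrences of strings such as " in respect " and "it is a " with pointers, and decompression reconstructs the sentence exactly.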
Once the procedure has been applied to all redundant sequences of characters in the text, compression is completed. To recover the original text, the decompression software—typically part of the same zipper used for the compression—needs only to replace each pointer with the string of preceding characters that it points to. We said that in the LZ77 algorithm the compression process is performed by observing the strings that fall within a sliding window that traverses the text. There are two reasons for working this way. First, in principle we could consider an entire document at once and replace a string near the end of it with a pointer to its predecessor near the beginning, but with large documents this would make for large pointers. Using a moving window keeps the pointers small. Secondly, the algorithm was created with a view to the compression of continuous streams of data as they happen, as in broadcast video, for which the end of the file is not available until after the transmission is complete. Thus the algorithm looks only at the most recently received data that fall within its window and hence it can be used in real-time data applications.
This sliding window procedure allows us to apply the algorithm to a problem that its creators did not consider. As the window moves across the text the compression software maintains what is called a dictionary, a list of the most recent substitutions of strings that it has performed and may perform again in cases of further recurrence. If the document being compressed is literary writing (as with our example above), then this dictionary represents the author's personal habits of repetition. If authors differ in their habits of repetition, then a dictionary compiled for one author's work will not be ideally suited to the compression of another author's writing since it will not contain the new author's favourite repetitions.
If we could make the software attempt to compress the second author's work using a dictionary compiled for the first author's work then the relative efficiency of compression—reflected in the size of the compressed file—will reflect their differing habits of repetition. We could achieve this by appending the second author's work on the end of a document containing the first author's work and seeing how well LZ77 compresses this composite text. The algorithm's sliding window would pass over the first author's work, creating a dictionary appropriate to this author's repetitions as it compresses them away, and then at the 'join' of the two texts it would start to encounter the second author's work and attempt to compress it using the dictionary created for the first author's work. If this composite document ends up compressed just as much as the sole-authored work of the first author, then the two writers' habits of repetition are alike.
The LZ77 algorithm as originally written has the disadvantage that it continually updates its dictionary to reflect the latest repetitions falling within its sliding window, so that having created a dictionary for the first author's work it will start to rewrite this dictionary when it encounters the second author's work if this new writing contains different habits of repetition. We could minimize the consequences of this behaviour by making the appended sample of the second author's work much smaller than the first author's sample. This would allow the algorithm little time to adapt to the second author's habits of repetition (by updating its dictionary) before the whole process is finished. But we can do even better than that by modifying the algorithm.
Relative Shannon Entropy
For our experiments we use a modified version of LZ77 called BCL in which the learning process—the rewriting of the dictionary—is made to stop once the algorithm encounters the second author's text. Thus the algorithm is forced to compress the second author's writing using only a dictionary compiled to reflect the repetitions in the first author's writing. If the two writers are alike in their habits of repetition, this compression will be as efficient as it would be if the whole document were by one writer and the resulting file will be highly compressed. If the two writers are unalike in their habits of repetition, the dictionary will poorly reflect the repetition habits of the second writer, there will be fewer opportunities to insert d, n pointers to save space, and the resulting file will be relatively large, reflecting inefficient compression. Thus we can take the size of the resulting file as an expression of the relative likeness of the two authors' habits of repetition, or what is sometimes called their relative Shannon entropy.
If we find that the habits of repetition are good for discriminating different authors' writings, then this gives the basis for a new authorship attribution test. When an anonymous text is appended to a number of different texts by different authors, BCL will yield a different compression ratio for each composite document. We measure and rank such results. If the assumption that habits of repetition are a good proxy for authorship is correct, then the composite document yielding the best compression should be the one in which both the anonymous text and the text it is appended to are by the same author.
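BCL itself is not, to our knowledge, distributed as reusable software, but the frozen-dictionary idea can be approximated with Python's standard zlib module, whose DEFLATE compressor is LZ77-based and accepts a preset dictionary (`zdict`). DEFLATE still adapts within its own window, so this is only a rough stand-in for BCL, and the function names here are our own:

```python
import zlib

def cross_compressed_size(reference: bytes, sample: bytes) -> int:
    """Compress `sample` using a preset dictionary drawn from `reference`.
    The better `reference` anticipates the repetitions in `sample`,
    the smaller the output."""
    comp = zlib.compressobj(level=9, zdict=reference[-32768:])  # zlib caps dictionaries at 32 KB
    return len(comp.compress(sample) + comp.flush())

def rank_candidates(anonymous: bytes, candidates: dict) -> list:
    """Rank candidate samples by how well they compress against the
    anonymous text: most similar habits of repetition first."""
    return sorted(candidates,
                  key=lambda name: cross_compressed_size(anonymous, candidates[name]))
```

Under the working assumption of the chapter, the candidate whose sample yields the smallest composite is the likeliest author of the anonymous text.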
The Main Experiment: The Control Set and the Limitations of Relative Entropy
For this study, we applied the above procedure by appending to each scene of Double Falsehood, one at a time, samples of texts by several dramatists, compressing these composites, and ranking them by how small they had become. To be meaningful, the control set must include the candidate author being tested since even a set that excludes the real author will yield a ranking based on the non-author-specific similarities between the writings. Ideally, one would create a control set that encompasses all the writers dating back to the time when the anonymous text was written, on the assumption that all are potential candidate authors. However, previous scholarship has created a list of the most likely candidate authors of Double Falsehood to which we may confine our attention: William Shakespeare, John Fletcher, Philip Massinger, William Davenant, and Lewis Theobald.
If it is true that Double Falsehood was written by Shakespeare and Fletcher, revised by Davenant during the Restoration, and then adapted by Theobald, the present method should be able to tell whose style of repetitions prevails in each scene of the play. To test the discriminatory power of this procedure, we expanded the control set to include works by writers who have been generally ruled out as authors of Double Falsehood: James Shirley, Francis Beaumont, and the team of Beaumont and Fletcher. Shirley was in fact deemed a plausible candidate author by Farmer, but this was soon debunked once his age at the time (16, if not younger) was taken into account. Jackson has recently shown that Beaumont did not participate in the writing of the play (Jackson 2012b, 160). The complexly multi-authored and still uncertainly divided Beaumont and Fletcher canon might well trick an attribution algorithm into yielding false positives. Our control set contains some of Beaumont and Fletcher's works, not only to rule them out or in as possible co-authors once and for all, but also to help establish that the technique eliminates candidates who are for other reasons quite implausible.
Texts for testing were acquired online according to availability. Most of them are from the Literature Online (LION) database and the remaining ones are from the free electronic text archive called Project Gutenberg. The plays used to represent each author's canon are listed here:
Shakespeare All's Well That Ends Well; Cymbeline; Hamlet; The Tempest; The Winter's Tale; As You Like It
Beaumont The Knight of the Burning Pestle
Fletcher Rule a Wife and Have a Wife; Monsieur Thomas; The Humorous Lieutenant; Valentinian; Wit Without Money; The Faithful Shepherdess
Beaumont and Fletcher A King and No King; Cupid's Revenge; Philaster
Shirley The Lady of Pleasure
Massinger A New Way to Pay Old Debts; The Renegado; The Bond-Man; The Bashfull Lovers; The Unnaturall Combat
Davenant Albovine; The Distresses; The Cruell Brother; The Fair Favourite; The Just Italian; The Rivals; The Unfortunate Lovers
Theobald Orestes; The Fatal Secret; The Happy Captive; The Perfidious Brother; The Persian Princess
It has already been established that this kind of test is relatively immune to distortions caused by the genre of the writing under consideration (Pascucci 2006); whereas word choice is strongly shaped by subject, repetitions appear not to be. Date of composition, however, could be important and we have, where relevant, picked plays written around the time of Cardenio. Even where an author is represented by only a small part of his canon (as with Shakespeare), the tests undertaken here perform thousands of comparisons on small subsections of the text and have shown themselves discriminating of authorship despite reduced sample sizes.
For the purposes of this test, Double Falsehood was divided into scenes and each of the plays in the above list was divided into sections of 32 kilobytes, which typically equates to around 5,000 words. Each play section was appended in turn to each scene of Double Falsehood. The effectiveness of compression for each composite document was measured and ranked, on the principle that the most effectively compressed documents will be those where a scene of Double Falsehood (used to create the dictionary of repetitions) is followed by a section of writing from another play by the same author and sharing its habits of repetition. The texts were all stored in Unicode UTF-8 encoding (which for these texts is equivalent to plain ASCII) with all their punctuation, line breaks, titles, and stage directions removed, and capital letters lowercased. Most of this regularization was done by software and then checked by hand and manually completed where automated conversion had failed.
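The regularization and sectioning just described might be sketched as follows. The exact rules used in the study are not spelled out, so the regular expression (letters and apostrophes only) and the section size are illustrative assumptions:

```python
import re

SECTION = 32 * 1024  # bytes; roughly 5,000 words of dramatic dialogue

def normalize(text: str) -> str:
    """Lowercase the text and strip punctuation and line breaks, keeping
    words (apostrophes retained) separated by single spaces. Titles and
    stage directions are assumed to have been removed already."""
    return " ".join(re.findall(r"[a-z']+", text.lower()))

def sections(text: str, size: int = SECTION):
    """Split a normalized play into fixed-size sections for testing."""
    data = normalize(text)
    return [data[i:i + size] for i in range(0, len(data), size)]
```

Each resulting section would then be appended in turn to each scene of Double Falsehood before compression.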
As a first experiment to help to validate the method, it was decided to test known works by Davenant and Theobald against other known works by Davenant and Theobald to see if the procedure would correctly distinguish composites that were Davenant+Davenant and Theobald+Theobald from all other combinations. The division of the plays into sections produced the following testable materials:
Davenant Albovine (3 sections)
Davenant The Cruell Brother (2 sections)
Davenant The Distresses (2 sections)
Davenant The Fair Favourite (2 sections)
Davenant The Just Italian (2 sections)
Davenant The Rivals (2 sections)
Davenant The Unfortunate Lovers (2 sections)
Theobald Orestes (2 sections)
Theobald Plutus (2 sections)
Theobald The Fatal Secret (2 sections)
Theobald The Perfidious Brother (2 sections)
Theobald The Persian Princess (1 section)
Theobald The Happy Captive (2 sections)
Total 26 sections
Each of these sections was appended in turn to each of the other 25 and compressed. The resulting compression ratios for the composite documents were used to rank them to show which combinations were most effectively compressed and which the least.
Here are the results for one of the rank-order lists, showing how compressible a composite is when made from fragment aa of Davenant's The Cruel Brother with each of the other fragments appended in turn:
Most-to-least compressible when appended to Davenant's The Cruel Brother section aa
#1 Davenant The Cruel Brother section ab
#2 Davenant The Just Italian section aa
#3 Davenant The Just Italian section ab
#4 Davenant The Unfortunate Lovers section aa
#5 Davenant The Unfortunate Lovers section ab
. . .
#11 Theobald The Fatal Secret section aa
#12 Theobald The Fatal Secret section ab
. . .
#24 Theobald Orestes section aa
#25 Theobald Orestes section ab
The rank order list cleanly divides into two halves, with Davenant's works at the top and Theobald's at the bottom. Moreover, within each half of the table the sections from each play appear together, suggesting that the test is capturing their coherence as works, and gratifyingly the writing that is most like aa from Davenant's The Cruel Brother is the other fragment ab from the same play. All the tested fragments provided similar results.
After proving that the algorithm can successfully tell Davenant from Theobald, we put together a second set of experiments to measure Double Falsehood against their works. First we equalized the canons by taking just five of Davenant's plays to match the five of Theobald's:
Davenant The Rivals; The Distresses; The Just Italian; The Unfortunate Lovers; The Cruel Brother
Theobald The Fatal Secret; Orestes; The Perfidious Brother; Plutus; The Persian Princess
These 10 plays yield 18 sections and each was appended in turn to each scene of Double Falsehood and a rank order of compressibility produced for each resulting composite.
The results attest to the presence of Theobald in Double Falsehood and limit Davenant to scenes 2.1, 2.4, and 3.1, as shown in these extracts from the tops of the rank-order tables:
Double Falsehood scene 2.1
#1 Davenant The Fair Favourite
#2 Theobald Orestes
Double Falsehood scene 2.4
#1 Davenant The Distresses
#2 Theobald The Fatal Secret
Double Falsehood scene 3.1
#1 Davenant The Fair Favourite
#2 Theobald The Fatal Secret
This test was confined to just the works of Davenant and Theobald, so it tells us only which of those two is the more likely to be authorially present in each scene. If we were to apportion the play between these two writers alone, Davenant's share would be 8 per cent of the whole play (144 of its 1,815 lines), or 21.5 per cent if we judge by the number of scenes he worked on (3 out of 14 in the whole play). As we will see when discussing the results of the main authorship attribution experiment, this ratio increases significantly once all the plausible candidates are included in the control set.
The Main Experiments
When Benedetto, Caglioti, and Loreto first tested their BCL-modified version of LZ77 on texts by Italian writers, they obtained a surprisingly high 93 per cent accuracy ratio in determining authorship (Benedetto, Caglioti, and Loreto 2002, 3). Before BCL was used to discriminate between the hands of Fletcher and Shakespeare in All is True/Henry VIII it was tested on English literary works from the eighteenth century to the modern age and found to produce an even more startling 100 per cent accuracy ratio in 2,000 experiments (Pascucci 2006). However, when dealing with texts from the Elizabethan and Jacobean periods, the results produced by BCL were in some cases much less reliable, and the same occurred while testing Double Falsehood. This may be due to the effacing of authorial distinctiveness that occurs in collaborative writing and/or subsequent adaptation, and perhaps also because the greater variability of spelling in the earlier periods makes the algorithm overlook some repetitions, thereby reducing the evidential base that the method relies upon.
Some of the scenes of Double Falsehood yield consistent results in which we repose considerable faith, and others do not. Let us illustrate this with extreme cases. When Double Falsehood scene 1.2 has each of the sections in the full control set added to it, the top of the rank-order table of compressibility for the resulting composites looks like this:
Double Falsehood 1.2
#1 Shakespeare All's Well that Ends Well section aa
#2 Shakespeare All's Well that Ends Well section ac
#3 Shakespeare All's Well that Ends Well section ab
#4 Shakespeare Hamlet section aa
#5 Shakespeare As You Like It section ac
This consistent run of Shakespeare plays at the top of the table is a strong sign that Double Falsehood 1.2 is by Shakespeare if it is by any of the authors tested, although of course it might also have been altered in minor ways by subsequent adapters. By contrast, the results for Double Falsehood 3.2 are much less clear:
Double Falsehood 3.2
#1 Theobald The Happy Captive section ap
#2 Shakespeare Hamlet section ay
#3 Shakespeare King Lear section aj
The problem, of course, is how to weigh the fact that Theobald comes out on top with the fact that the next two closest matches are to Shakespeare.
Mathematically, the measurement of relative Shannon entropy is a logarithmic function, so the significance of the rank order decreases rapidly as one moves down the table. Thus an author occupying slot #1 means a lot more than his occupying slots #2 and #3. But we must also try to factor in the substantial likelihood that Theobald was adapting existing writing by others and hence that our results might reflect hybridity in the writing. Such hybridity might well involve rewriting within particular lines so we cannot assume that by further dividing Double Falsehood into units smaller than scenes we will eventually arrive at non-hybrid units of composition. One response to this problem is to see if changing the segmentation of the control set sections makes any difference. We have been using relatively large sections of size 32KB (around 5,000 words), across which we necessarily average the authorial habits of repetition. What if we use smaller sections?
We created three more groups of tests in which the control set of plays was divided into 16KB, 8KB, and 4KB sections. Together with the original test on 32KB sections this gives four rank-order tables for each scene of Double Falsehood, and we combine them by apportioning a weighting of 25 per cent to the author who occupies the #1 rank position in each table. For Double Falsehood scene 1.1 the results are that Theobald occupies position #1 in the 32KB, 16KB, and 8KB section-size tables, but Shakespeare occupies position #1 in the 4KB table. One way to interpret this is that the scene is essentially 75 per cent Theobald's because he heavily revised Shakespeare's original writing, which now represents only 25 per cent of the measurable style remaining in the scene. Where the results suggest two authors of the same period, it is reasonable to assume that they collaborated rather than that one revised the work of the other.
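The 25-per-cent weighting across the four section sizes amounts to a simple tally of #1 slots, one per rank-order table. A minimal sketch (the author names are placeholders):

```python
from collections import Counter

def combine_verdicts(top_slots):
    """Give the author in each table's #1 slot an equal share of the
    attribution: 25% each when there are four tests (32KB, 16KB, 8KB, 4KB)."""
    tally = Counter(top_slots)
    return {author: 100 * n / len(top_slots) for author, n in tally.items()}
```

For scene 1.1, where Theobald tops three tables and Shakespeare one, this yields a 75/25 split between the two.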
If we apply this reasoning to the whole of Double Falsehood we arrive at the following scene-by-scene breakdown:
Double Falsehood scene by scene
1.1 Theobald heavily revised Shakespeare
1.2 Shakespeare (survives nearly intact)
1.3 Theobald revised Davenant who revised Shakespeare
2.1 Davenant revised Shakespeare (Possibly slight revision by Theobald too)
2.2 Shakespeare and Massinger collaborated
2.3 Theobald revised Massinger
2.4 Shakespeare, Fletcher, and Massinger collaborated (Unreliable results)
3.1 Davenant revised Shakespeare
3.2 Davenant revised Shakespeare
3.3 Davenant revised Fletcher
4.1 Shakespeare and Fletcher collaborated
4.2 Theobald revised Shakespeare
5.1 Theobald revised Davenant
5.2 Shakespeare and Fletcher collaborated
Interpreting the Results
The part of this study that will most surprise those working on the Cardenio/Double Falsehood problem is our claim for Massinger's contribution. As in our previous studies, this method detects Massinger's style in Double Falsehood, and also in All Is True/Henry VIII, a play that most other investigators attribute to Shakespeare and Fletcher alone. Jackson rightly pointed out that Double Falsehood is littered with expressions typical of Shakespeare, Fletcher, and Theobald, and possibly of more authors (Jackson 2012b). However, we give much less of Double Falsehood to Fletcher than other recent investigators have. Most importantly, we can say from this study, which takes in Theobald's own compositions as well as those of Davenant and several pre-Commonwealth authors, that there is virtually no possibility that Theobald simply forged Double Falsehood. No matter who wrote which part, it is implausible that Theobald perfectly imitated the styles of the disparate list of authors whose presence in Double Falsehood we have discovered.
In his essay 'Looking for Shakespeare in Double Falsehood', Jackson demonstrated that Beaumont was not a collaborator of Shakespeare and Fletcher, and our results confirm that Jackson is right (Jackson 2012b, 160–1). Moreover, we found no scene in Double Falsehood that tested like one of the collaborative works of Beaumont and Fletcher. The tests described here strengthen the case made by Gary Taylor and John V. Nance that Double Falsehood comprises two layers of writing, one from the early seventeenth century and one from the early eighteenth. In addition, the experiments presented here provide evidence for the third layer of writing—Davenant's—claimed by Taylor and Nance. Charles Nicholl may not be far from the truth in suggesting that Theobald's manuscripts were copies of a Restoration adaptation by Davenant, rather than Jacobean originals (Nicholl 2011, 84–101). Oliphant argued that from Double Falsehood scene 3.1 a new voice becomes audible, contrasting with Shakespeare's in Acts 1 and 2. Our results corroborate his idea that Fletcher took part in writing only the second half of the play and contradict Robert Matthews and Thomas Merriam's (1993) claim that Double Falsehood is predominantly by Fletcher.
In Double Falsehood scene 1.1, Shakespeare's hand is detectable only in a small stretch of the scene, possibly the first eight lines, as Oliphant (1919) and more recently Taylor (2013b, 137) have suggested. Recently Nance argued that there are no traces of Fletcher in the prose at the end of scene 1.2 and that a few expressions such as 'insist in your', 'I have formerly', and 'cannot find a' may be those added by Theobald (Nance 2013, 117). If Theobald or anyone else retouched the scene, his changes are too few and short to be taken into account in the overall assessment provided by our procedure. On the evidence presented here, Double Falsehood 1.2 is the only scene in which Shakespeare survives nearly intact. Little of Shakespeare survives in 1.3. Oliphant thought only lines 16–18 were his, while Jackson identifies the playwright's hand in lines 53–6. Such small proportions of surviving Shakespeare would be consistent with the results found here.
Double Falsehood scenes 2.2, 2.3, and 2.4 are the ones that most divide attribution scholars. By the standards for validation applied in these experiments, our attributions of 2.2 and 2.3 are relatively reliable: we are confident of Massinger's contribution here. For scene 2.4, on the other hand, our results are equivocal, although notably there are no signs of post-Jacobean writing. Further investigation of the relatively long scene 2.3 might benefit from dividing it into its prose and verse strands, which may have detectably different origins. The results obtained here for Double Falsehood scenes 3.1 and 3.2 show that both were so heavily revised by Davenant that they hardly retain their original Shakespearean elements, which in 3.2 could well correspond to lines 39–43, as Stephen Kukowski (1991, 88) suggested. Oliphant attributed scenes 3.2 and 4.1 to Theobald alone, but conceded that both may retain fragments of the original writer. In our results, Theobald's antecedent was Davenant: no one else's writing can be identified in scene 3.2.
Although revised by Davenant, Double Falsehood scene 3.3 is the only one in which Fletcher's hand is strikingly apparent, whereas in scenes 4.1 and 5.2 it emerges only in the 4KB-section tests. This perhaps means that only a little of Fletcher's contribution to the writing of these scenes has survived unaltered; this is a question that should be left to further discussion among Fletcher scholars. Scene 5.2 of Double Falsehood seems to be mostly Shakespeare's and to a lesser extent Fletcher's. Our results sketch a slightly different scenario from the nowadays widely accepted belief that Shakespeare's presence is limited to the first half of the play.
The method used here is replicable and highly independent of the investigator's previous experience or bias. (Unconscious bias remains a commonly wielded criticism of the entire field of authorship attribution by computational stylistics.) It is important to note that the linguistic repetitions in the texts are counted by our method, but their distributions across the texts are not. Moreover, the method makes no distinction between repetitions of whole words and phrases, which may plausibly be conscious authorial style, and repetitions of smaller units, which most likely are not. The method is automated and objective and uses a large number of string comparisons: over half a million in all for the experiments described above. On the evidence presented here, the possibility that Theobald forged Double Falsehood is eliminated. Theobald had a manuscript of a play containing contributions by Shakespeare and Fletcher, as many studies have shown, and, we believe, contributions by Massinger too. The likeliest explanation, then, is that Theobald had a manuscript of the lost play Cardenio.