Manuscript Relationship Calculator Help

This help page can be accessed from the input and output screens of the Manuscript Relationship Calculator to provide a reference for those pages. However, it is also designed to be read sequentially to give a lengthier overview of what this tool is designed to do. For a 25 page introduction to why this tool does what it does, go here.

Data Input Format

The data input format is intended to be easy to read and similar to the format used by the UBS for showing variants. The full UBS format is described here, although what I support is a relatively pale reflection of it.

The number one restriction is that I expect everything to be normal ASCII characters, which pretty much means it has to be composed of letters or numbers from the English alphabet. Punctuation is allowed before or after the letters and numbers, although it is removed and ignored. In particular, note that the UBS indicators [ ], ( ) and * are all tolerated but ignored: (33) is treated as identical to 33.

The MSS then appear with one line per variant. The variations within the variant are separated by two forward slashes (which must not have a space between them).

p10 B 81 // p26 01 A E G 044 6 33 104 256 263 424 436 459 1175 1241 1319 // 1506 1573 1739 1852 1881 1962 2127 2200 2464 Byz [K L P] Lect

You may also place a number followed by a colon at the start of a line for reference purposes; it is ignored.
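For the programmatically minded, here is a minimal sketch (in Python, not the calculator's own code) of how such a line might be parsed: strip the optional leading reference number and colon, split on the double slashes, and discard the tolerated-but-ignored punctuation around each token. The function name is purely illustrative.

import re

def parse_variant_line(line):
    # Drop an optional leading reference number such as "12:".
    line = re.sub(r'^\s*\d+\s*:', '', line)
    variations = []
    for chunk in line.split('//'):
        witnesses = []
        for token in chunk.split():
            # Remove surrounding punctuation such as [ ], ( ) and *.
            token = re.sub(r'^[^A-Za-z0-9]+|[^A-Za-z0-9]+$', '', token)
            if token:
                witnesses.append(token)
        variations.append(witnesses)
    return variations

print(parse_variant_line("p10 B 81 // p26 01 A E G 044"))
# [['p10', 'B', '81'], ['p26', '01', 'A', 'E', 'G', '044']]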

Valid MSS

In order to help track typos, and to allow extra authorities (such as fathers) to be present in the ingested data without appearing in the output, this program has a definition of what it considers to be legal MSS labels. The five types of predefined MSS labels are listed below (a sketch of these rules in code follows the list):

  1. A single upper case character, e.g. A B C D
  2. A 'P' followed by a number, e.g. P10 P45
  3. A number, e.g. 33 01
  4. 'Byz', 'Maj' or 'Lect'. In addition, Byz [ K L ] and Maj [ K L ] are accepted and are treated as Byz/Maj K L would be
  5. An 'L' followed by a number, e.g. L4030 L302. Note however that by default this format is disabled; it has to be enabled by a checkbox on the input page
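A rough sketch of these rules as regular expressions (an illustration only; the program's internal checks may differ, and lower-case forms such as p10 are assumed to be accepted as well):

import re

MSS_PATTERNS = [
    r'[A-Z]',           # 1. a single upper case letter, e.g. A B C D
    r'[Pp]\d+',         # 2. 'P' followed by a number, e.g. P10 p45
    r'\d+',             # 3. a number, e.g. 33 01
    r'Byz|Maj|Lect',    # 4. the collective labels
]
LECTIONARY = r'L\d+'    # 5. 'L' plus a number; disabled unless the checkbox is ticked

def is_valid_ms(label, allow_lectionaries=False, extra_mss=()):
    if label in extra_mss:                      # the "Other MSS" box described below
        return True
    patterns = MSS_PATTERNS + ([LECTIONARY] if allow_lectionaries else [])
    return any(re.fullmatch(p, label) for p in patterns)

print(is_valid_ms("P45"), is_valid_ms("L4030"), is_valid_ms("Chrysostom"))
# True False False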

Other MSS

In addition to those supported by default, it is possible to specify that additional sequences of letters and numbers be treated as valid MSS. This can be done either because there are regular authorities you wish to include (such as certain preferred fathers) or because you wish to include extra magic tokens in the stream to measure them against the MSS. These extra MSS are added, space separated, into the box provided. An example of a useful list of extra MSS might be:

MyOpinion Maj TR Gries Tisch Words Souter W-H N-U

Assuming you had added these tokens to your input data, this would allow you to see how closely the different heavy hitters (and you yourself!) correlate with the underlying MSS.

Variant Output Format

Whilst it may not appear particularly useful, the first thing the program does is give you back the data you handed it in a slightly different format. Here is an example of the output for two lines given on the input form:

Variant_1 2 variations
1 p10 B 81
2 p26 01 A E G 044 6 33 104 256 263 424 436 459 1175 1241 1319 1506 1573 1739 1852 1881 1962 2127 2200 2464 Byz K L P Lect
Variant_2 Ignoring ( L596 L593 ) 9 variations
1 B 0172 6 424c 1739 1881
2 01 A
3 C 33 81 1506
4 D G 1852
5 1912
6 P
7 044 256 263 365 424 436 1175 1241 1319 1573 1962 2127 2200 2464 Byz L Lect
8 104 459
9 K

In addition to telling you what you entered, it lists anything entered (in this case L596 and L593) which was not considered to be an MSS. (The lectionaries only count if the input checkbox is ticked.) Note too that some variants have only two variations, one of which is a singleton. The program does not count such a variant as valid and it will not appear in the output grid. However, the singleton does still count against the MSS in the singletons box.

Rare Readings

The program calculates and displays the rare readings for each manuscript. A rare reading is essentially a variation in which only one, two or three MSS have that particular reading. In the Variant Output Format table given above, Variant_1, Variation 1 is a rare reading: a tripleton. Each of p10, B and 81 will have had one tripleton scored against it. Variation 2 of Variant_2 will have scored a doubleton against 01 and A. Variations 5, 6 and 9 will have scored singletons against the respective MSS.

Within the rare readings box the MSS are sorted so that those with the most rare readings are at the top. Among MSS with the same number of rare readings, those with the most singletons come first, then the doubletons, then the tripletons, and finally the system falls back to alphabetical order.
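As a sketch (assuming each variant is held as a list of variations, each of which is a list of MSS labels), the counting and the sort order described above might look like this:

def rare_reading_counts(variants):
    # counts[ms] = [singletons, doubletons, tripletons]
    counts = {}
    for variant in variants:
        for variation in variant:
            if len(variation) <= 3:
                for ms in variation:
                    counts.setdefault(ms, [0, 0, 0])[len(variation) - 1] += 1
    return counts

def rare_reading_order(counts):
    # Most rare readings first; ties broken by singletons, then doubletons,
    # then tripletons, then alphabetically.
    return sorted(counts.items(),
                  key=lambda item: (-sum(item[1]), -item[1][0],
                                    -item[1][1], -item[1][2], item[0]))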

Complete Agreements

The complete agreements table aims to show those MSS that never disagree with each other; at least on the input data. The table looks as follows:

Manuscript Subsumes
B p10(15/1) 0172(15/1) p40(15/1)
81 3(15/1) 65(15/1)
p26 E
01 p26(15/1) 0219(15/1)
044 1319c(15/3)
6 424c(15/6)
256 1319
424 1175 1241 Byz(15/14) L

The manuscript in the left column has all of the readings of each of the manuscripts listed on the right. Those on the right that are in bold are in total agreement; that is, the manuscript on the left has all of the readings of the one on the right and vice versa. The ones not in bold are represented in fewer variants. This is shown by the numbers in brackets after the MSS. Thus on the first line B subsumes three MSS: B has 15 variants, while each MSS on the right has only one. Manuscript 6 subsumes 424c, but 424c only has six readings. This table may be useful when deciding that certain MSS are not really worthy of a place in the output grid.
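Under the assumption that each MSS is held as a map from variant number to the variation it supports, the subsumption test might be sketched as follows (names and data are illustrative only):

def subsumes(readings_a, readings_b):
    # A subsumes B if, wherever B is attested, A has the same reading.
    return all(variant in readings_a and readings_a[variant] == reading
               for variant, reading in readings_b.items())

def total_agreement(readings_a, readings_b):
    # The bold entries: subsumption in both directions.
    return subsumes(readings_a, readings_b) and subsumes(readings_b, readings_a)

# Hypothetical data: B attested in 15 variants, p10 in only one of them.
b_ms = {v: 1 for v in range(1, 16)}
p10  = {1: 1}
print(subsumes(b_ms, p10), total_agreement(b_ms, p10))   # True False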

Output Grid

One of the primary driving reasons behind this program is the generation of the similarity grid. This is a table of MSS versus MSS that shows how often each pair agree with each other. Each table entry has two values: the first is the number of agreements, the second is the number of variants in which both MSS are represented.

The grid is also color coded by its background. If the background is white then there is no correlation; the darker green the background, the higher the degree of correlation between the two MSS. This is a visual way of conveying the percentage agreement between two MSS.

In a data set where you have 40 MSS entered, this table will have 1600 entries, which is a lot of data. Typically, once the relationships between the MSS are better understood, either from studying the large table by eye or by using the Agreement or Clustering tools, it is possible to narrow down the number of interesting MSS. You do not need to edit the ingested data; instead you can enter the list of interesting MSS in the box provided and only those MSS will appear in the Output Grid. Note that all of the MSS will still take part in the other tables and the clustering process.
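A sketch of how one cell of the grid might be computed, using the same map-of-readings representation assumed above (the shading percentage is my guess at agreements divided by shared variants; the program's internals may differ):

def grid_cell(readings_a, readings_b):
    shared = set(readings_a) & set(readings_b)               # variants both attest
    agreements = sum(1 for v in shared if readings_a[v] == readings_b[v])
    percent = 100.0 * agreements / len(shared) if shared else 0.0
    return agreements, len(shared), percent                  # cell values + shading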

Colwell Clusters

I am not claiming to give an authoritative definition of the Colwell cluster; I am, however, giving an authoritative definition of how I understand them and thus how they have been implemented in this utility. A Colwell cluster, or textual family to use a more usual term, is a group of MSS that are considered to form a discrete group with shared characteristics. A key term here is discrete. A Colwell cluster is defined by two things: how close the members are to a central MSS and how far apart they are from other MSS not in the cluster. A good treatment of this matter is given by Waltzman, but the most famous definition is the 70/10 rule. This suggests that the quantitative definition of a text-type is a group of manuscripts that agree more than 70 per cent of the time and is separated by a gap of about ten per cent from its neighbors (Ernest C. Colwell and Ernest W. Tune, "Method in Establishing Quantitative Relationships between Text-Types of New Testament Manuscripts," reprinted in Studies in Methodology, p. 59).

This makes the algorithm for computing the cluster very easy. First you pick an MSS that you think is a suitable starting point. Then you collect all of the MSS that agree with that MSS 70% of the time or more; these form the members of the cluster. Now you have to check that there is an adequate gap to fit the Colwell definition. So you find the percentage agreement of the least agreeable member and subtract 10%. Thus if the least agreeable member agreed with the starting point at the level of 73%, the new threshold level is 63%. You then check the remaining MSS again to see if there are any within the range of 63-70%. If there are, then the gap has been breached and the cluster fails. If there are no MSS within that range then you have a success.
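A sketch of that procedure, assuming an agreement(a, b) function that returns the percentage agreement between two MSS (see the Distance Metrics section below); the names and defaults are illustrative:

def colwell_cluster(seed, all_mss, agreement, threshold=70.0, gap=10.0):
    # Step 1: collect everything at or above the agreement threshold.
    members = {ms: agreement(seed, ms)
               for ms in all_mss if ms != seed and agreement(seed, ms) >= threshold}
    if not members:
        return [seed], True                   # nothing close enough (cf. B below)
    # Step 2: the gap floor is the weakest member's agreement minus the gap.
    floor = min(members.values()) - gap
    # Step 3: any outsider falling between the floor and the threshold breaks the gap.
    breached = any(floor <= agreement(seed, ms) < threshold
                   for ms in all_mss if ms != seed and ms not in members)
    ordered = [seed] + sorted(members, key=members.get, reverse=True)
    return ordered, not breached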

The tool allows you to enter the starting points in a space separated fashion. It also allows you to change the amount of agreement required from the default 70% and the amount of gap required from the default 10%. One thing to remember, however, is that the 70% is measured from the starting point, which is typically in the middle of the cluster. Thus with a 70% threshold it is quite possible to have two MSS in the cluster which agree only 40% with each other, although they both agree 70% with the seed element.

If you do not specify any starting point for the Colwell clustering then it will try to form a cluster from every single MSS in the list. This can mean that a given MSS may appear in more than one cluster. The program is not showing what the correct clusters are; it is simply showing you all of the ones you could choose from that would form valid clusters if you asked them to. Essentially you should view this as a quick and dirty fishing expedition to give you a few ideas for starting points you may wish to try.

Even if clusters are not successfully formed, the output is designed to show the algorithm operating, which will (hopefully) provide useful information. For example, here is the algorithm operating on the first four chapters of Romans (this work is Copyright Andrew Wilson) using 1506, 1739, B and 256 as my starting points.

Candidate MSS
1506 81(80%) 01(93%) A(80%)
1739 6(87%) 1881(73%)
B  
256 263(87%) 1319(100%) 1573(93%) 1962(80%) 2127(93%) 365(73%) Failed - Gap breached 424(67%)

The first two non-heading lines show successful clusters. The first shows a grouping of 1506, 81, 01 & A, all centered around 1506. The closest is 01 at 93%; 81 & A are both at 80%. Note that this grid does not tell us how close 81 and A are to each other; they may be identical or they may be 40% apart. Other parts of the output give that information.

1739 gives another small cluster, with 6 near (87%) and 1881 27% away from 1739 and somewhere between 14% and 39% away from 6.

B does not have a single manuscript in the input data within 70% agreement of it.

256 shows a cluster failing because of the gap criterion. Six MSS fell within the 70% grouping. The first five were all pretty close; the furthest, 1962, was at 80% agreement. But then came 365 at 73%, which was above the 70% threshold and therefore had to be added to the cluster. However, the 73% brought the gap boundary down to 63%, which was breached by 424 at 67% agreement with 256. Alas, the cluster died. In fact, at least on the first four chapters of Romans, you will find the vast majority of Colwell clusters die because the gap is breached.

K-Clustering

I do not aim in this document to offer a full primer on what it means to perform K-Clustering; there are many references on the web for this, and here is one of them. However I will attempt a brief explanation in this context which I hope will help a little.

In the Colwell clustering scheme every clustering pass is formed by starting with a given MSS and seeing which other MSS are close to it. The clusters then either work or don't. The K-Clustering algorithm works in a completely different way. It starts by allocating every single MSS to its own cluster of one element. It then asks: of all of my clusters, which two are the closest together? Having found the two closest clusters it joins them, so that it now has one fewer cluster than before. It then repeats the process all over again; these iterations continue until the termination criterion is met.

There are a number of ways of deciding when you stop iterating:

  1. Stop when you are down to one cluster. This may seem a little strange but it is currently the default. The reason for performing 1-Clustering is that it produces the most complete dendrogram, which we will see momentarily.
  2. Stop when you get down to K clusters. Thus you may say: divide these MSS into 5 groups the best way possible.
  3. Stop when the distance between the next two clusters you would join is greater than n%. This is most similar to the Colwell cluster: it allows you to decide what you consider to be close enough and stops once it cannot find anything that close.
  4. Stop when these MSS are joined together. This is the honesty test version of clustering. If you believe a particular set of MSS are related then the system will keep iterating until they are; the other clusters are relationships you pretty much have to accept if you want to keep your link between the MSS you name.

The program defaults to the first option but allows each of the other three to be specified at the same time. It will then stop at the earliest condition that matches and state the reason for the termination. The number and contents of the remaining clusters are then displayed. A sketch of this loop and its stopping rules is given below.
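This sketch assumes a cluster_agreement(a, b) function that returns the percentage agreement between two clusters (how that is defined is the subject of the Distance Metrics section) and treats higher agreement as "closer"; it is an illustration of the loop described above, not the program's actual code.

def k_cluster(mss, cluster_agreement, k=1, min_agreement=None, must_join=None):
    clusters = [frozenset([ms]) for ms in mss]
    history = []                                     # feeds the dendrogram below
    while len(clusters) > 1:
        if must_join and any(set(must_join) <= c for c in clusters):
            return clusters, history, "named MSS now share a cluster"    # rule 4
        # Find the two closest (most agreeable) clusters.
        (i, j), best = max(
            (((i, j), cluster_agreement(clusters[i], clusters[j]))
             for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: pair[1])
        if min_agreement is not None and best < min_agreement:
            return clusters, history, "nothing close enough to join"     # rule 3
        merged = clusters[i] | clusters[j]
        history.append((clusters[i], clusters[j], best, merged))
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
        if len(clusters) <= k:
            return clusters, history, "down to %d clusters" % k          # rule 2
    return clusters, history, "one cluster remaining"                    # rule 1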

Dendrograms

The output from the clustering process is a table described as a dendrogram, and its contents are even more intimidating than the title. Notwithstanding that, I do believe the table contains a lot of good information, the study of which would pay dividends.

The basic concept is actually very easy if you can visualize K-Clustering in your mind (of course those of you with a life probably can't!). At each clustering stage two particular clusters join together to form a new cluster. Each line of the dendrogram table tells you which two clusters decided to join together, why they joined together, and what they formed having joined together. With a number of very strong caveats (discussed here) you can think of the cluster formed in the final column of the table as a representation of some historic parent MSS which has now disappeared.

Here is some sample output for the first four chapters of Romans:

Cluster 1 Cluster 2 Distance Forming
E(100) AVG(100) p26(100) AVG(100) 100 E(100) p26(100) AVG(100)
1241(100) AVG(100) 424(100) AVG(100) 100 1241(100) 424(100) AVG(100)
1241(100) 424(100) AVG(100) Byz(100) AVG(100) 100 1241(100) 424(100) Byz(100) AVG(100)
1241(100) 424(100) Byz(100) AVG(100) L(100) AVG(100) 100 1241(100) 424(100) Byz(100) L(100) AVG(100)
1241(100) 424(100) Byz(100) L(100) AVG(100) 1175(100) AVG(100) 100 1241(100) 424(100) Byz(100) L(100) 1175(100) AVG(100)
1319(100) AVG(100) 256(100) AVG(100) 100 1319(100) 256(100) AVG(100)
65(100) AVG(100) 3(100) AVG(100) 100 65(100) 3(100) AVG(100)
K(100) AVG(100) 1241(100) 424(100) Byz(100) L(100) 1175(100) AVG(100) 93 1241(99) 424(99) Byz(99) L(99) 1175(99) K(93) AVG(98)
1573(100) AVG(100) 1319(100) 256(100) AVG(100) 93 1319(97) 256(97) 1573(93) AVG(96)
01(100) AVG(100) 1506(100) AVG(100) 93 01(93) 1506(93) AVG(93)
104(100) AVG(100) 459(100) AVG(100) 93 104(93) 459(93) AVG(93)
1241(99) 424(99) Byz(99) L(99) 1175(99) K(93) AVG(98) Lect(100) AVG(100) 92 1241(98) 424(98) Byz(98) L(98) 1175(98) K(92) Lect(92) AVG(96)
1241(98) 424(98) Byz(98) L(98) 1175(98) K(92) Lect(92) AVG(96) 2464(100) AVG(100) 91 1241(97) 424(97) Byz(97) L(97) 1175(97) K(91) Lect(91) 2464(91) AVG(95)
1319(97) 256(97) 1573(93) AVG(96) 2127(100) AVG(100) 91 1319(95) 256(95) 1573(91) 2127(91) AVG(93)

It is easiest to start by describing how each cluster is described. Essentially a cluster is described as a list of MSS. However, if you think of a cluster as a collection of points on a graph then it is visually obvious that some points will be towards the center of the cluster and others will be further away. Numerically, those points towards the center of the cluster will have the lowest average (mean) distance to the rest of the points in the cluster. Those points towards the edge of the cluster (the outliers) will have the largest average distances to the other members of the cluster. In addition to these measures of how close each given point is to the center of the cluster, it is interesting to know the approximate overall size (or cohesiveness) of the cluster. This is given as the average (mean) distance of every point from every other point.

Therefore in the representation of a cluster the MSS are listed in order, with the MSS closest to the center of the cluster listed first. Thus the first MSS listed is the closest to being the archetype of the cluster (although more illustrious MSS may appear later in the list). At the end of the list the AVG is given, which is the average or cohesiveness of the cluster. As an example, the last column of the final row shows a cluster with 4 MSS in it. Two of them (1319 & 256) are towards the center of the cluster whilst 1573 & 2127 are further away. That said, it is a pretty tight cluster with an average agreement of 93% between the MSS.
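A sketch of how such a cluster description might be produced, again assuming an agreement(a, b) function between individual MSS (names and formatting are illustrative):

from itertools import combinations

def describe_cluster(cluster, agreement):
    members = sorted(cluster)
    if len(members) == 1:                              # e.g. E(100) AVG(100)
        return "%s(100) AVG(100)" % members[0]
    # Each MS's mean agreement with the rest of the cluster (its centrality).
    mean = {ms: sum(agreement(ms, other) for other in members if other != ms)
                / (len(members) - 1)
            for ms in members}
    # AVG: the mean agreement over every pair (the cluster's cohesiveness).
    pairs = list(combinations(members, 2))
    avg = sum(agreement(a, b) for a, b in pairs) / len(pairs)
    ordered = sorted(members, key=lambda ms: -mean[ms])   # most central first
    return " ".join("%s(%.0f)" % (ms, mean[ms]) for ms in ordered) + " AVG(%.0f)" % avg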

Having dealt with the representation of the cluster, we can now watch the clustering process in operation, starting with the top line. The top line tells us that E and p26 are in 100% agreement (that is the Distance column) and that they are therefore joined together, forming the first cluster E p26. The second line tells us that 1241 & 424 are also perfectly agreed and thus they are joined too. The third line is a little more interesting. This time it is not two single MSS that are joining; instead the cluster of 1241 and 424 joins with Byz to form a cluster of three MSS. And so the process continues for the first seven lines, every time joining together two clusters that are in 100% agreement on this data.

I put the 'on this data' in bold because I suspect that those of you who know Textual Criticism are already hammering the table exclaiming that at least one of these relationships is wrong because of some particular reason. You are almost certainly right. These are relationships exhibited between the MSS in the first four chapters of Romans. As far as the first four chapters of Romans are concerned they are mathematically correct. To be truly correct far larger samples of data would need to be used.

The first really interesting thing happens on the 8th line (9th if you include the title line). Here the K MSS is joined into the large Byz grouping. Note that the agreement is only 93%. The new cluster formed has six elements. The original five Byz documents are still in the center of the cluster, but they are now only at 99% from the center because the addition of K to the cluster has shifted the center a little. K is in the cluster but has 93% agreement and thus counts as a relative outlier in an otherwise tight cluster, one that has a 98% overall match rate.

As the process continues the distances become greater (or the agreements become less) but the system is still showing you the sequence in which these things would join if you believe that you can trace the MSS all the way back to the original autograph. Of course that would be an extraordinarily lofty goal, but at least the display gives you something to work with.

Distance Metrics

An extremely important question, over which many people draw a discreet veil, is how you define how far apart two MSS or two clusters are. In the case of MSS there is at least one fairly easy measure, although as I have argued elsewhere it is not the best measure. The measure used by this program is to define the distance (really a percentage agreement) between two MSS as

100 * Number_Of_Agreements / ( Number_of_variations_of_the_larger_of_the_two_MSS )

The key point to note is that the divisor is the larger of the two variation counts. Thus if I am comparing two MSS and one has twenty variants and the other only ten, then my highest possible agreement is 50%. Some will object to this, but the reason is that if you divide by the smaller then you lose the transitive property of the comparison. Think of it this way: if you know that manuscript X is 100% close to Y and Y is 100% close to Z, how close do you expect X and Z to be? If you say 100% then you are assuming that the relationship between these MSS is transitive.

But suppose I divided only by the number of shared variants. Then I could have three documents where document Y had twenty variants, X and Z each had ten, and X and Z didn't share a single variant. You would then have X & Y at 100%, Y & Z at 100%, and Z & X at 0%.
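A sketch of the measure, with a tiny example reproducing the scenario just described (hypothetical data; the readings are again maps from variant number to variation):

def agreement_percentage(readings_a, readings_b):
    shared = set(readings_a) & set(readings_b)
    agreements = sum(1 for v in shared if readings_a[v] == readings_b[v])
    larger = max(len(readings_a), len(readings_b))      # divide by the larger MS
    return 100.0 * agreements / larger if larger else 0.0

# Y attests twenty variants; X and Z attest ten each with no overlap between them.
y = {v: 1 for v in range(20)}
x = {v: 1 for v in range(10)}
z = {v: 1 for v in range(10, 20)}
print(agreement_percentage(x, y), agreement_percentage(y, z), agreement_percentage(x, z))
# 50.0 50.0 0.0  -- dividing by the shared count instead would give 100, 100, 0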

If the above has already boggled your mind a little bit then I suggest you skip the rest of this section; it is about to get worse. Above I have defined the distance between two documents, but what is the distance between two clusters of documents? There are three fairly obvious possibilities (sketched in code after the list):

  1. The average distance between all of the MSS in the two clusters. This is the default definition the program uses. It is fairly intuitive and it gives a measure that certainly relates to how close the points are together. The weakness is that it doesn't really prevent outliers. You may have two medium sized clusters, both of which have an outlier some distance from the center. When the distance between the two clusters is measured, those two outlying MSS may be really miles apart, but that fact is swamped by the bulk of the points that are fairly close together. Thus this mechanism tends to produce some nice, strong, densely populated clusters but with some weirdo MSS that seem to have crept in by the back door.
  2. The nearest distance between any MSS in one cluster and any MSS in the other. This is probably the most intuitive. If I already have document X in my cluster, and X is very much like Y, then Y should be in my cluster too. This technique definitely produces interesting clusters, and for every MSS in the cluster there is going to be a near neighbor that explains how it got in there. The problem you get is chaining. Chaining means that the MSS that pulls in the next new MSS is always the one most recently pulled in (rather than one of the older ones). The cluster then becomes a very long, thin cluster. And whilst any two adjacent points on the chain can be explained, if you look at the MSS on the two ends of the chain they will usually be miles apart.
  3. The furthest distance between any two MSS in the two clusters. This is the clustering mechanism least likely to embarrass you. It essentially looks for the worst two MSS in the combined clusters and defines that as the size of the cluster. It is safe and reliable. However, the down side is that it takes a very pessimistic view of the problem and typically produces fairly loose clusters, because it has no reason to produce a tight cluster if it already contains an outlier that is defining the cluster as large. Think of it a little bit like taking an ironing board on vacation: you know you need a huge suitcase so there is no point in packing light.
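The three options correspond to what the clustering literature calls average, single and complete linkage. A sketch in terms of a pairwise distance(a, b) between individual MSS (for this program one could, for instance, take 100 minus the agreement percentage above):

from itertools import product

def average_link(cluster_a, cluster_b, distance):
    # Option 1: the mean distance over every cross-cluster pair.
    pairs = list(product(cluster_a, cluster_b))
    return sum(distance(a, b) for a, b in pairs) / len(pairs)

def single_link(cluster_a, cluster_b, distance):
    # Option 2: the nearest pair -- intuitive, but prone to chaining.
    return min(distance(a, b) for a, b in product(cluster_a, cluster_b))

def complete_link(cluster_a, cluster_b, distance):
    # Option 3: the furthest pair -- pessimistic but safe.
    return max(distance(a, b) for a, b in product(cluster_a, cluster_b))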

Nomenclature

Doubletons: A variation is a doubleton if only two of the Valid or Other MSS have that particular reading. Doubletons are generally considered to represent errors in the underlying MSS. I also believe they may be key indicators of a relatively late relationship between two MSS.

MSS: I am really using this term to refer to anything you wish to track as an authority in the manuscript relationship calculator. This could be real MSS, the opinion of your best friend or values you obtained by throwing arrows at a dartboard.

Singletons: A variation is a singleton if only one of the Valid or Other MSS has that particular reading. Singletons are generally considered to represent errors in the underlying MSS. It has therefore been suggested by Andrew Wilson that the singleton count of an MSS gives an inverse approximation to its value.

Tripletons: A variation is a tripleton if only three of the Valid or Other MSS have that particular reading. Tripletons may be considered to represent errors in the underlying MSS. They may also indicate a recent relationship between the MSS involved.

Variant: I am using this to refer to a point in the original text where the underlying MSS have differing opinions. Exactly how you choose to define that is up to you, but for the mathematics I count the variant as my basic unit.

Variation: Within a given variant there may be two or more variations. Each variation is represented by the group of MSS that have that particular version of the underlying text.
