Clustering the New Testament.
During Bible study last week, it was mentioned that people have used statistics to “determine” authorship of books of the Bible. Having a couple free hours last night, I tried my own experiment on the New Testament.
The procedure was easy: I downloaded the Nestle-Aland 26th edition of the New Testament; each book in the New Testament became a vector $v_i$, with $v_{ij}$ counting the number of times word $j$ appears in the book. The cosine of the angle between two such vectors measured how similar the corresponding books are. I packaged these cosines into a matrix, the $(i,k)$ entry of which measured how similar books $i$ and $k$ are.
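For anyone who wants to try the same idea, here is a minimal sketch of the word-count and cosine step. It is not the original program: the function names, the whitespace tokenization, and the toy "books" are all assumptions made for illustration, and real Greek text would need more careful handling of punctuation and accents.

```python
import math
from collections import Counter

def word_counts(text):
    """Count how often each whitespace-separated token occurs (assumed tokenization)."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine of the angle between two sparse word-count vectors."""
    dot = sum(count * v[word] for word, count in u.items() if word in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_u * norm_v)

def similarity_matrix(texts):
    """Square matrix whose (i, k) entry is the cosine between texts i and k."""
    vectors = [word_counts(t) for t in texts]
    return [[cosine(a, b) for b in vectors] for a in vectors]

# Toy stand-ins for books (hypothetical data, not the Nestle-Aland text).
toy_books = ["logos logos theos", "logos theos theos", "agape agape"]
S = similarity_matrix(toy_books)
```

In this toy example the first two texts share both of their words and come out similar, while the third shares no words with either and gets a cosine of zero.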
Of course, this is a $27 \times 27$ matrix. To turn these numbers into a nice picture, I projected the books onto a lower-dimensional space spanned by the eigenvectors having the five largest eigenvalues (this is known as Principal Component Analysis); I chose five dimensions, displayed using location (two dimensions) and color (three dimensions, namely hue, saturation, and luminosity). The result is the following graph:
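The projection step can be sketched roughly as follows. This is one possible reading, not the code behind the graph. It assumes the similarity matrix is used as-is, with no centering. It assumes the coordinates of book $i$ are the $i$-th entries of the top eigenvectors, scaled by the square roots of their eigenvalues. To stay dependency-free it finds the eigenvectors by power iteration with deflation, where a linear algebra library would be the usual choice. The names `top_eigenpairs` and `project` are hypothetical.

```python
import math

def top_eigenpairs(matrix, k, iterations=500):
    """Approximate the k largest eigenpairs of a symmetric positive
    semidefinite matrix by power iteration with deflation."""
    n = len(matrix)
    work = [row[:] for row in matrix]  # copy so the input is left untouched
    pairs = []
    for _component in range(k):
        # Deterministic, non-uniform start vector, normalized to length 1.
        vec = [1.0 + i / n for i in range(n)]
        norm = math.sqrt(sum(x * x for x in vec))
        vec = [x / norm for x in vec]
        for _step in range(iterations):
            nxt = [sum(work[i][j] * vec[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm < 1e-12:
                break  # nothing left to extract
            vec = [x / norm for x in nxt]
        # Rayleigh quotient gives the eigenvalue estimate for this direction.
        value = sum(vec[i] * work[i][j] * vec[j]
                    for i in range(n) for j in range(n))
        pairs.append((value, vec))
        # Deflate: remove the component just found.
        for i in range(n):
            for j in range(n):
                work[i][j] -= value * vec[i] * vec[j]
    return pairs

def project(matrix, k):
    """Coordinates of each item along the top k eigen-directions."""
    pairs = top_eigenpairs(matrix, k)
    return [[math.sqrt(max(value, 0.0)) * vec[i] for value, vec in pairs]
            for i in range(len(matrix))]

# Toy 3x3 similarity matrix (hypothetical); the post's matrix would use k=5.
S = [[1.0, 0.8, 0.0],
     [0.8, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
eigenvalues = [value for value, _ in top_eigenpairs(S, 2)]
coords = project(S, 2)
```

With five components, the first two columns of `coords` would give the plotted position and the remaining three could be rescaled into hue, saturation, and luminosity.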

The dots represent each book, and nearby dots of similar colors represent similar books. Some things jump out right away:
- The Gospels are all in the lower right hand corner.
- Paul’s epistles (and Peter’s?) are mostly in the upper right hand corner.
- Revelation is close to John.
- Hebrews and James are close to each other? Why?
All told, I think this is a pretty good graphical display of the structure of the New Testament, especially considering we used nothing but the Greek text and linear algebra!
Posted: January 22nd, 2008 under Theology.
Comments: 4
Comments
Pingback from k is one cat » Clustering Shakespeare.
Time: January 22, 2008, 10:24 pm
[…] ran my clustering program (which I just ran on the New Testament) on Shakespeare’s plays—which were conveniently packaged into a text file by Open […]
Comment from Niles
Time: January 23, 2008, 8:01 am
Awesome! How did you choose which two dimensions to display spatially? If you change coordinates, it looks like maybe Galatians, Hebrews, Acts, and Matthew would be in a cluster … what does that mean?
Comment from Niles
Time: January 23, 2008, 8:09 am
p.s. it took me a few tries to understand your explanation—I think I understand now, but it seems that you use ‘j’ first to represent a component of a book vector, and then later to represent the entire vector … maybe a notation change could make it more readable.
Comment from kisonecat
Time: January 23, 2008, 3:26 pm
I changed the notation as you suggested.
The spatial coordinates correspond to the two largest eigenvalues (the largest corresponds to horizontal location, and the largest eigenvalue is quite a bit larger than the second largest, if I recall correctly). In any case, being nearby is more significant than having a similar color. The color makes it look a lot nicer though, I think.