Clustering the New Testament.

During Bible study last week, it was mentioned that people have used statistics to “determine” authorship of books of the Bible. Having a couple free hours last night, I tried my own experiment on the New Testament.

The procedure was easy: I downloaded the Nestle-Aland 26th edition of the New Testament; each book in the New Testament became a vector v, with v_w counting the number of times word w appears in the book. The cosine of the angle between two such vectors measured how similar the corresponding books are. I packaged these cosines into a matrix, the (i,j) entry of which measured how similar books i and j are.
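The counting and cosine steps can be sketched in a few lines of Python. This is only an illustration under assumptions the post does not state: words are taken to be whitespace-separated tokens with no further normalization, and the toy strings below are hypothetical stand-ins for the 27 Greek texts. The function names are made up for this sketch.

```python
import math
from collections import Counter


def word_counts(text):
    """Count how often each whitespace-separated token occurs in a text."""
    return Counter(text.split())


def cosine(u, v):
    """Cosine of the angle between two sparse word-count vectors."""
    dot = sum(count * v[word] for word, count in u.items() if word in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v)  # undefined for an empty text


def similarity_matrix(texts):
    """Matrix whose (i, j) entry is the cosine between texts i and j."""
    vectors = [word_counts(t) for t in texts]
    return [[cosine(a, b) for b in vectors] for a in vectors]


# Hypothetical toy "books"; the real input would be the 27 Greek texts.
toy_books = [
    "logos theos logos kosmos",
    "logos theos agape",
    "nomos pistis nomos",
]
S = similarity_matrix(toy_books)
```

Because word counts are never negative, every entry of the matrix lies between 0 (no shared words) and 1 (identical word proportions), and the diagonal is all ones.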

Of course, this is a 27 × 27 matrix, since the New Testament has 27 books. To turn these numbers into a nice picture, I projected the books onto a lower-dimensional space spanned by the eigenvectors having the five largest eigenvalues (this is known as Principal Component Analysis); I chose five dimensions, displayed using location (two dimensions) and color (three dimensions, namely hue, saturation, and luminosity). The result is the following graph:

[Figure: New Testament Clustering]
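For readers who want to reproduce a picture like this, here is a rough sketch of the projection and coloring using NumPy. Several details are assumptions rather than the method actually used: the similarity matrix is decomposed as-is (no centering), each book's coordinates are read straight off the eigenvectors, and the third, fourth, and fifth eigenvectors are mapped to hue, saturation, and luminosity in that order after rescaling to [0, 1]. The random matrix at the end is a hypothetical stand-in for the real 27 × 27 matrix.

```python
import colorsys

import numpy as np


def top_eigen_coordinates(S, k=5):
    """Eigenvalues and eigenvectors of symmetric S for the k largest
    eigenvalues, largest first. Row i holds the coordinates of item i."""
    eigenvalues, eigenvectors = np.linalg.eigh(S)  # ascending order
    order = np.argsort(eigenvalues)[::-1][:k]
    return eigenvalues[order], eigenvectors[:, order]


def rescale(column):
    """Linearly map a column onto [0, 1] so it can act as a color channel."""
    span = column.max() - column.min()
    if span == 0:
        return np.full_like(column, 0.5)
    return (column - column.min()) / span


def layout(S):
    """Return (x, y) positions and RGB colors for each row of S."""
    _, coords = top_eigen_coordinates(S, k=5)
    xy = coords[:, :2]  # two largest eigenvalues give the location
    # Assumed channel order: hue, saturation, luminosity.
    channels = np.column_stack([rescale(coords[:, i]) for i in (2, 3, 4)])
    colors = [colorsys.hls_to_rgb(h, l, s) for h, s, l in channels]
    return xy, colors


# Hypothetical stand-in for the real matrix: cosines of random unit vectors.
rng = np.random.default_rng(0)
rows = rng.random((6, 8))
rows /= np.linalg.norm(rows, axis=1, keepdims=True)
toy_S = rows @ rows.T
toy_xy, toy_colors = layout(toy_S)
```

The positions and colors could then be handed to any plotting library. Note that the sign of each eigenvector is arbitrary, so a reproduction may come out mirrored horizontally or vertically.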

The dots represent each book, and nearby dots of similar colors represent similar books. Some things jump out right away:

  • The Gospels are all in the lower right-hand corner.
  • Paul’s epistles (and Peter’s?) are mostly in the upper right-hand corner.
  • Revelation is close to John.
  • Hebrews and James are close to each other? Why?

All told, I think this is a pretty good graphical display of the structure of the New Testament, especially considering we used nothing but the Greek text and linear algebra!

Comments

Pingback from k is one cat » Clustering Shakespeare.
Time: January 22, 2008, 10:24 pm

[…] ran my clustering program (which I just ran on the New Testament) on Shakespeare’s plays—which were conveniently packaged into a text file by Open […]

Comment from Niles
Time: January 23, 2008, 8:01 am

Awesome! How did you choose which two dimensions to display spatially? If you change coordinates, it looks like maybe Galatians, Hebrews, Acts, and Matthew would be in a cluster … what does that mean?

Comment from Niles
Time: January 23, 2008, 8:09 am

p.s. it took me a few tries to understand your explanation—I think I understand now, but it seems that you use ‘j’ first to represent a component of a book vector, and then later to represent the entire vector … maybe a notation change could make it more readable.

Comment from kisonecat
Time: January 23, 2008, 3:26 pm

I changed the notation as you suggested.

The spatial coordinates correspond to the two largest eigenvalues (the largest corresponds to horizontal location, and the largest eigenvalue is quite a bit larger than the second largest, if I recall correctly). In any case, being nearby is more significant than having a similar color. The color makes it look a lot nicer though, I think.
