The Readability of Blogs

You need 11.9 years of formal education to easily understand this site. Well, that's if you believe a readability test called the Gunning-Fog Index. The Gunning-Fog Index is basically an algorithm that analyzes text for sentence length, syllables per word, and word complexity. After crunching the numbers it comes up with a readability score that is supposed to predict how easily people will be able to digest the text. The Wikipedia article for the Gunning-Fog Index mentions that comic book text typically has a score around six, Reader's Digest typically scores around eight, Newsweek scores around ten, and so on. This puts onfocus.com on par with the readability of Time magazine.
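For the curious, here's a rough Python sketch of the commonly cited Gunning-Fog formula: 0.4 times the sum of the average sentence length and the percentage of "complex" words (three or more syllables). The function names and the vowel-group syllable counter are my own assumptions for illustration. The full test also excludes proper nouns and some suffixes, which this sketch skips, so don't expect it to match any particular tool exactly.

```python
import re

def count_syllables(word):
    """Rough syllable estimate (an assumption): count groups of vowels."""
    lowered = word.lower()
    count = len(re.findall(r"[aeiouy]+", lowered))
    # A trailing silent "e" usually does not add a syllable ("made", "score").
    if lowered.endswith("e") and not lowered.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)

def gunning_fog(text):
    """Gunning-Fog: 0.4 * (words per sentence + percent complex words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    # "Complex" here simply means three or more estimated syllables.
    complex_words = [w for w in words if count_syllables(w) >= 3]
    avg_sentence_length = len(words) / len(sentences)
    percent_complex = 100 * len(complex_words) / len(words)
    return 0.4 * (avg_sentence_length + percent_complex)
```

Short sentences built from one-syllable words score near the bottom of the scale, and long words push the score up quickly.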

The first time I ran into a demo of readability tests was at this page: Juicy Studio Readability Test. You can plug in a URL, and get back a Gunning-Fog Index score, and some other scores. I thought it was interesting and moved on. But for some reason it's been sticking in the back of my mind.

I'm bringing this up because I've been thinking quite a bit about the ways we measure blogs. And most of our measurement tools are fairly blunt. If you ask blog-measurement site Technorati what it "thinks" about your favorite blogs, you'll get machine answers like the number of inbound and outbound links. You'll get some info about traffic over time and Technorati's computed rank compared to other blogs. You'll see post-frequency and a list of common topics culled from RSS categories and Technorati tagging.

On the other hand, if I were to ask you some questions about your favorite blogs, you could probably tell me exactly why you like them. And it wouldn't have anything to do with inbound links or the other machine-based metrics. I'm guessing most of your answers would involve the writing style, tone, the topics the author covers, the fact that everyone else reads it, or maybe your personal relationship with the author.

You can't quantify something like tone, so you can't put computers to work analyzing tone. (I'd love to have a snark score for blogs.) But readability scores are a step toward a more human-style metric, and the scores can be crunched, analyzed, graphed, and averaged by computers. And I like the idea that the readability scores are lying there dormant within the sentences themselves, waiting to be tapped.

I'm not a linguist, so I don't know how accurately these scores reflect readability. But I was interested enough in readability as a metric to do some digging around. A search on CPAN turned up the module Lingua::EN::Fathom, which accepts arbitrary text and returns the Gunning-Fog Index score, along with several other scores including the Flesch Reading Ease score and the Flesch-Kincaid grade level. I thought it might be fun to plug in the top ten or so English-language blogs as reported on Technorati popular to see if there's a "sweet spot" reading level among the most popular blogs. Of course many factors go into a blog's success, but I thought readability could be a reason some blogs hit the top of the tail and others don't. If nothing else, I figured I could find out if blog readers are more of a Reader's Digest sort of audience, or more of a Time magazine sort of audience.
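The two Flesch scores come from simple published formulas, sketched below in Python. Both take raw counts of words, sentences, and syllables, and the function names are hypothetical. I'm assuming these standard formulas here. I haven't checked how Lingua::EN::Fathom counts syllables or sentences, so its numbers may differ a bit.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease: higher scores mean easier text (roughly 0-100)."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid grade level: approximate U.S. school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# Example counts (made up for illustration): 100 words, 5 sentences, 150 syllables.
ease = flesch_reading_ease(100, 5, 150)    # about 59.6
grade = flesch_kincaid_grade(100, 5, 150)  # about 9.9
```

Note that the two scores pull in opposite directions: longer sentences and longer words lower the Reading Ease score but raise the grade level.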

So I cooked up a little Perl script that takes a list of RSS feeds, loops through the posts, strips out HTML, and calculates readability scores. If you want to run it yourself, you can grab the code here:
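If you'd rather see the general shape of the thing before downloading, here's a minimal Python sketch of the same steps: parse an RSS feed, pull out each post, and strip the HTML. This is not the Perl script itself. It uses only the Python standard library, the names (TagStripper, strip_html, post_texts, SAMPLE_FEED) are made up for illustration, and it assumes an RSS 2.0 feed with post text in each item's description element.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Tags that should leave a space behind so adjacent blocks don't run together.
BLOCK_TAGS = {"p", "div", "br", "li"}

class TagStripper(HTMLParser):
    """Collects only the text content of an HTML fragment."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(markup):
    stripper = TagStripper()
    stripper.feed(markup)
    stripper.close()
    # Collapse runs of whitespace into single spaces.
    return " ".join("".join(stripper.parts).split())

def post_texts(rss_xml):
    """Return the plain text of every <item><description> in the feed."""
    root = ET.fromstring(rss_xml)
    return [strip_html(item.findtext("description") or "")
            for item in root.iter("item")]

# A tiny made-up feed standing in for a real download.
SAMPLE_FEED = """<rss version="2.0"><channel><title>Example</title>
<item><title>First</title><description>&lt;p&gt;Hello &lt;b&gt;world&lt;/b&gt;.&lt;/p&gt;</description></item>
<item><title>Second</title><description>Plain text post.</description></item>
</channel></rss>"""

print(post_texts(SAMPLE_FEED))  # ['Hello world.', 'Plain text post.']
```

Each string that comes back can then be handed to whatever readability function you like.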

reading_levels.pl

In addition to the Lingua::EN::Fathom module, you'll need LWP::Simple for fetching feeds, XML::RSS::Parser for parsing them, and Math::Round::Var for rounding the scores. Add the list of feed URLs you want to analyze to the top of the script, and then run it on the command line, like this:

perl reading_levels.pl > reading_levels.txt

Once finished, the file reading_levels.txt will have a report with the individual reading levels for the sites, and an average for the group.
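The report step is just per-site scores plus a group average. Here's a small Python sketch of that idea. The site names and numbers are placeholders invented for the example (they are not results from my run), and the layout is an assumption, not the actual format of reading_levels.txt.

```python
from statistics import mean

METRICS = ("fog", "ease", "grade")

def group_average(scores):
    """Average each metric across all sites, rounded to two places."""
    return {m: round(mean(site[m] for site in scores.values()), 2)
            for m in METRICS}

def build_report(scores):
    """One line per site, then a final line with the group average."""
    lines = []
    for name, site in scores.items():
        lines.append(f"{name}: " + ", ".join(f"{m}={site[m]:.2f}" for m in METRICS))
    avg = group_average(scores)
    lines.append("average: " + ", ".join(f"{m}={avg[m]:.2f}" for m in METRICS))
    return "\n".join(lines)

# Placeholder values only, to show the shape of the data.
PLACEHOLDER_SCORES = {
    "example-a": {"fog": 12.0, "ease": 50.0, "grade": 10.0},
    "example-b": {"fog": 16.0, "ease": 40.0, "grade": 12.0},
}

print(build_report(PLACEHOLDER_SCORES))
```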

Caveats: this isn't a very robust feed parser, some feeds only have excerpts rather than full posts, and some feeds simply don't work with this script. I used the full feed posts if multiple feeds were available, and I skipped any sites that didn't parse.

So, what did I find? Well, here's the report for the top several English-language blogs as reported today by Technorati:

reading_levels.txt

(I skipped Post Secret because there's not much text to analyze.) The average Gunning-Fog Index score was off the "wide audience" charts at 14. That means the average person would need over 14 years of formal education to understand these blogs easily. The average Flesch Reading Ease score was 46.9, on a scale of 100. That's on par with state insurance form requirements. (Seriously!) And the Flesch-Kincaid grade-level score was 11.8, meaning that it's appropriate for high school seniors, which is high on the scale. The most "ideal" site for a wide audience was Daily Kos, with a low Flesch-Kincaid grade level (9.05) and an above-average Flesch Reading Ease score of 56.48.

So, what does this mean? I have no idea. My prediction that the most popular blogs would have very good readability scores didn't quite hold up. I can't pinpoint a "sweet spot", but maybe blog readers enjoy more densely layered text. (Think Time instead of Newsweek, but not quite Harvard Law Review.) I might take a look at sentence length and percentage of complex words next and see how those measure up.

I still think measuring readability has promise. Earlier today Anil was talking about TL;DR syndrome, and I think the popular blogs capitalize on this with short, frequent posts. But I also wonder if text density plays a role. So in addition to saying, "too long; didn't read," I think there's the possibility of "too dense; didn't read". (Insert joke here.)

Comments

This is fascinating, Paul. Do your survey for the top 1000 blogs!

Do you know if the web tool handles HTML/XML tags correctly? Because on both my blog and my RSS feed at http://www.somebits.com/weblog/ I get a Gunning Fog Index of 7.9 and a Flesch Reading Ease of 71.5. I try to write simply, but was surprised at just how simply the software thinks I write. No wonder I'm not A list :-)

Thanks, Nelson--I might have to do that. I think it'd also be interesting to compare the top 100+ with a random sampling that isn't the top 100. I need to talk with someone who designs these sorts of tests.

I'd have to switch to gathering feeds via RSS auto-discovery tag, because finding feeds by hand is time-consuming. But that shouldn't be hard.
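For anyone wondering what auto-discovery involves: a page advertises its feed with a link tag in its head, with rel="alternate" and a feed MIME type. Here's a minimal Python sketch of finding those links using only the standard library. The names (FeedLinkFinder, discover_feeds) are made up for illustration, and it assumes you've already fetched the page's HTML.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}

class FeedLinkFinder(HTMLParser):
    """Collects feed URLs advertised via <link rel="alternate" ...> tags."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = {name: (value or "") for name, value in attrs}
        rels = attr.get("rel", "").lower().split()
        is_feed = attr.get("type", "").lower() in FEED_TYPES
        if "alternate" in rels and is_feed and attr.get("href"):
            # Resolve relative hrefs against the page URL.
            self.feeds.append(urljoin(self.base_url, attr["href"]))

def discover_feeds(html, base_url):
    finder = FeedLinkFinder(base_url)
    finder.feed(html)
    finder.close()
    return finder.feeds

page = ('<html><head>'
        '<link rel="alternate" type="application/rss+xml" href="/index.xml">'
        '<link rel="stylesheet" href="/style.css">'
        '</head><body></body></html>')
print(discover_feeds(page, "http://example.com/blog/"))
# ['http://example.com/index.xml']
```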

I don't know how the web tool works--they don't provide the code. My script strips out anything in tags. Running your site through my script gives me:

Gunning Fog: 9.51
Flesch Reading Ease: 64.94
Flesch Kincaid Grade: 7.06

I'd say you're ready for a mass audience!

This is a very cool bit of research. Running my homepage through the Juicy Studio test, I came up with:
Gunning Fog 7.86
Flesch Reading Ease 71.21
Flesch-Kincaid Grade 4.79

Surprisingly low! I think this is because a large part of my blog is basic referential linking with only a side comment or two from me. Most of my extended writing is in the form of book reviews. Focusing on only that category I come up with:
GF 10.25
FRE 63.48
FKG 6.88

There might be some tag-inclusion bias, but it seems like a pretty big gap to me. I really wish I had more time to dedicate to those longer posts. This readability issue you brought up also makes me want to beef up my referential work. I don't want to be "less readable" per se, but I think having those benchmarks will help as I work to have my blog showcase more balanced, and hopefully more mature, interesting, & nuanced writing overall. That's what the mission statement says, anyway :) Thanks for sharing this.

Thanks, Mark. I also think analyzing RSS feeds will give you more accurate results. I don't know how the Juicy Studio thing operates, but I know it can't distill things down to just the writing in posts the way analyzing an RSS feed can.

I should probably write a quick CGI script that accepts an RSS feed and returns reading level scores.