Obtaining word counts for various standards

I measured the word counts of several different standards for comparison against W3C standards.

For estimation purposes, each of these was rounded to the nearest 1,000. Word counts were obtained with `wc -w`. What differed from one standard to the next was how each file format was translated into something which (1) `wc` could read, i.e. plain text, and (2) gave a reasonably fair basis of comparison between formats.
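As a rough illustration of the counting and rounding step, here is a minimal sketch. The function names `round_to_thousand` and `rounded_word_count` are hypothetical and are not taken from the scripts actually used; the rounding rule (halves round up) is also an assumption.

```shell
# Hypothetical helper: round a non-negative integer to the nearest 1,000.
# Assumption: values exactly halfway (e.g. 1500) round up.
round_to_thousand() {
    echo $(( ($1 + 500) / 1000 * 1000 ))
}

# Hypothetical helper: count the words on standard input with wc -w,
# strip the padding some wc implementations add, and round the result.
rounded_word_count() {
    round_to_thousand "$(wc -w | tr -d '[:space:]')"
}
```

Something like `rounded_word_count < spec.txt` would then print the rounded figure for a single text file, where `spec.txt` is a placeholder name.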

To that end, an attempt was made to remove or reduce instances of formatting, layout, or decorative elements. However, navigation was left in where present: the table of contents in an RFC, the index on a W3C standard, and the listings of POSIX tools were all considered part of the specification. Additionally, informative text and examples were included, as the authors clearly believed they were necessary for a successful interpretation of their respective standards.

Also, this comparison was always going to be unfavorable to web standards, so in general I preferred to make less generous concessions towards non-W3C standards, and more generous concessions for W3C standards. For example, for non-W3C standards, I spent little to no time trimming out any fat or trying to eliminate any information which is not strictly considered part of the spec, whereas for W3C I went to some lengths to consider what data to include.


This was simply done with `wc -w /usr/share/doc/rfc/txt/*`.


find . -name '*.html' | xargs wc -w
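One caveat with piping `find` into `xargs`: when the file list is very long, `xargs` may split it across several `wc` invocations, each of which prints its own `total` line, and file names containing whitespace are split apart. A sketch of an alternative that concatenates the files first, so that a single number comes out, might look like this. The function name `count_html_words` is hypothetical, and this is not necessarily how the original figure was produced.

```shell
# Hypothetical helper: total word count of every .html file under a directory
# (default: the current directory). Files are concatenated before counting,
# so wc prints exactly one number regardless of how many files there are.
# Like the original command, this counts the raw file contents, markup included.
count_html_words() {
    find "${1:-.}" -type f -name '*.html' -exec cat {} + \
        | wc -w \
        | tr -d '[:space:]'
}
```

Using `-exec cat {} +` hands the file names to `cat` directly, so names with spaces need no special quoting.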


pdftotext [pdf file] - | wc -w


I used a little bit of JavaScript to obtain a list of URLs to the latest versions of each specification from the W3C standards and drafts list. I chose to include drafts considering that most of them are already implemented by most browsers. I also considered specifications like WCAG part of the web, and included them in the count.

Additionally, I included specifications which are transitively relevant. For example, because SVG and MathML, among other things, are supported by mainstream browsers, I brought along the W3C XML specifications.

The full list of URLs I scraped is available here. I further reduced this to only HTML and XHTML files, omitted a number of duplicates (for example, the HTML5 spec is available in both multi-page and single-page formats), then fed them into `lynx --dump` to remove the markup and layout and extract just the text. The full list of files which were included is here.
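The filtering and text extraction described above can be sketched roughly as follows. Both function names are hypothetical, the extension list (`.html`, `.htm`, `.xhtml`) is an assumption about what counted as "HTML and XHTML files", and `sort -u` only removes exact duplicate paths; the manual pruning of multi-page versus single-page copies is not captured here.

```shell
# Hypothetical filter: read file paths on standard input, keep only those
# ending in .html, .htm, or .xhtml (assumed extension list), and drop
# exact duplicate paths.
filter_spec_files() {
    grep -E '\.(html?|xhtml)$' | sort -u
}

# Hypothetical driver: render each file given as an argument to plain text
# with lynx, then count the words of all the dumps together.
# Requires lynx to be installed.
count_dumped_words() {
    for f in "$@"; do
        lynx -dump "$f"
    done | wc -w | tr -d '[:space:]'
}
```

By default `lynx -dump` appends a numbered list of link references to its output, which adds to the word count; whether that list was kept in the original measurement is not stated here.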

There are still a few odd files which might not count, such as recommendations or specs only tangentially related to the web. However, such documents tended to be much smaller, and a review of the URL and file lists will show that the vast majority of the files considered are web related. I don't feel that pruning the remainder would have changed the numbers enough to affect the article or its message, and I am satisfied with these results as such.

And to put any remaining doubts to rest, I took about 100 million words off of the number I gave in my article. The real sum I ended up with is over 200 million.

And I didn't even let wget finish downloading all of the specs.


WHATWG is a browser-driven group which works on web standards, and in some cases its specifications are more relevant than the W3C standards. I chose not to use them, because:

  1. The W3C specifications are still the only documentation for the majority of the behaviors implemented by modern web browsers. WHATWG has a much smaller scope.
  2. WHATWG doesn't even cover a large amount of the stuff browsers are working on. JavaScript hardly factors into my word counts at all, yet is a major part of the complexity growth in browsers. WHATWG does not represent a full web browser's worth of specifications.
  3. Many of the specs which describe web browsers outside of W3C are also outside of WHATWG, scattered across a hundred different projects and interested groups with no central authority. It would be difficult to (1) find them all and (2) compare them on the same terms.

I feel like this just emphasizes my point. What the hell is a web browser? Can anyone even define it?