Read RFC Documents in HTML Form

I am currently toying around with a small project of mine called Erling, an IRC bot in Erlang. Mostly because I want to learn how to work in Erlang/OTP properly, but also partly because I want to get better at reading specifications and implementing them properly. This means that I of course implement the whole specification from scratch. Yes, I know that I am currently reinventing so many wheels that even wheelwrights look at me in shock. The point here is not to develop something new though, but to get better at Erlang and at system architecture.

For my part, working with specifications has been painless. I have heard people state otherwise, so I may just have been lucky with the specifications I have worked with. The only thing that has bothered me is grammar specifications: they are frequently ambiguous or formed in such a way that different implementations may be incompatible with each other. The way I have handled such issues before is by reading up on other implementations and using a bit of common sense. I could have done that here as well, but this time the problem was not ambiguity.

If It's Too Strange to be True...

Have you ever read code where you are sure there's something wrong, even though it's completely legal code and seems correct the first time you look at it? Say, for instance, you come across this Java snippet:

char[] in = init; //init here
int i = 0;
boolean ok = false;
while ('a' <= in[i] && in[i] <= 'z') {
  i++;
  ok = true;
}
while ('0' <= in[i] && in[i] <= '9') {
  i++;
}
while ('0' <= in[i] && in[i] <= '9' && in[i] != '3') {
  i++;
}
// ...

So, humm, why do they keep the last while loop at all? The code would work exactly the same without it. They must have forgotten something, right?

The same can be said for this specific part in the IRC client protocol document, rfc2812, at section 2.3.1[1]:

hostname   =  shortname { "." shortname } ;
shortname  =  ( letter | digit ) { letter | digit | "-" }
                { letter | digit } ;
                  (* as specified in RFC 1123 [HNAME] *)

If there's one thing you should notice here, then it is that the last clause in shortname has no effect whatsoever. If you can pick multiple letters and digits, you would already have done so when you picked up letters, digits or hyphens.
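
To make the redundancy concrete, here is a minimal sketch that translates the printed shortname rule into a regular expression, once as written and once with the last clause dropped, and compares the verdicts. The class name, method names and sample strings are made up for illustration, and the regex translation is my own reading of the grammar, not something the RFC provides.

```java
import java.util.regex.Pattern;

public class ShortnameRedundancy {
    // shortname as printed in rfc2812, section 2.3.1:
    // ( letter | digit ) { letter | digit | "-" } { letter | digit }
    public static final Pattern AS_PRINTED =
        Pattern.compile("[A-Za-z0-9][A-Za-z0-9-]*[A-Za-z0-9]*");

    // The same rule with the trailing { letter | digit } removed.
    public static final Pattern WITHOUT_LAST_CLAUSE =
        Pattern.compile("[A-Za-z0-9][A-Za-z0-9-]*");

    // True when both variants agree on whether s is a shortname.
    public static boolean sameVerdict(String s) {
        return AS_PRINTED.matcher(s).matches()
            == WITHOUT_LAST_CLAUSE.matcher(s).matches();
    }

    public static void main(String[] args) {
        // Hypothetical sample inputs, including a few edge cases.
        String[] samples = {"irc", "irc2", "my-host", "irc-", "-irc", "a--", ""};
        for (String s : samples) {
            System.out.println("\"" + s + "\": as printed="
                + AS_PRINTED.matcher(s).matches()
                + ", without last clause="
                + WITHOUT_LAST_CLAUSE.matcher(s).matches());
        }
    }
}
```

Both patterns agree on every input, because anything the last clause could consume is already covered by the clause before it. Note that both also accept "irc-", a name ending in a hyphen.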

Is this a bug in the grammar, or is it just a blunder by one of the grammar designers? My guess was on the former, so I peeked into the rfc1123 document to get a better understanding of the HNAME specification. At section 2.1, it says

 The syntax of a legal Internet host name was specified in 
 RFC-952 [DNS:4].  One aspect of host name syntax is hereby
 changed: the restriction on the first character is relaxed to
 allow either a letter or a digit.  Host software MUST support
 this more liberal syntax.

Well, except for the definition of the first character, there isn't much to get out of this specification, because the rest is documented in rfc952. There's nothing more to do than jump down and look at that one too, right?

And sure enough, the definition there is a bit different from the one in rfc2812. Page 4 of rfc952 has the following snippet for defining hnames:

hname = name { "." name };
name = let [ { let-or-digit-or-hyphen } let-or-digit ] ;

which essentially translates into this, when we take rfc1123 into account:

hostname   =  shortname { "." shortname } ;
shortname  =  ( letter | digit ) [ { letter | digit | "-" } 
                ( letter | digit ) ] ;
                  (* as specified in RFC 1123 [HNAME] *)
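
As a rough sketch of what the difference means in practice, the following compares a hostname check built from the shortname rule as printed in rfc2812 with one built from the corrected rule. The class and method names are hypothetical, the regexes are my own translation of the two grammars, and the check looks only at syntax (no length limits or other constraints).

```java
import java.util.regex.Pattern;

public class HostnameCheck {
    // shortname as printed in rfc2812 (the trailing clause is optional).
    private static final String PRINTED_SHORTNAME =
        "[A-Za-z0-9][A-Za-z0-9-]*[A-Za-z0-9]*";

    // shortname per rfc952 with the rfc1123 relaxation:
    // if there is more than one character, the last must be a letter or digit.
    private static final String CORRECTED_SHORTNAME =
        "[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?";

    // hostname = shortname { "." shortname }
    private static Pattern hostname(String shortname) {
        return Pattern.compile(
            "(?:" + shortname + ")(?:\\.(?:" + shortname + "))*");
    }

    private static final Pattern PRINTED = hostname(PRINTED_SHORTNAME);
    private static final Pattern CORRECTED = hostname(CORRECTED_SHORTNAME);

    public static boolean acceptedAsPrinted(String host) {
        return PRINTED.matcher(host).matches();
    }

    public static boolean acceptedAsCorrected(String host) {
        return CORRECTED.matcher(host).matches();
    }

    public static void main(String[] args) {
        // Hypothetical hostnames; the second has a label ending in a hyphen.
        String[] hosts = {"irc.example.com", "irc-.example.com", "-irc.example.com"};
        for (String h : hosts) {
            System.out.println(h + ": as printed=" + acceptedAsPrinted(h)
                + ", corrected=" + acceptedAsCorrected(h));
        }
    }
}
```

The two checks only disagree on names where some label ends in a hyphen, such as "irc-.example.com": the printed grammar accepts it, the corrected one does not.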

Lo and behold, there's a bug in the RFC document! This must surely have been found earlier, right? So why is it not fixed?

"Scarab" by Holly, CC-BY-NC-ND 2.0

HTML > Text and PDF

Here's the thing: The text and PDF versions don't contain any errata notices. If you are as "stupid" as I was and just read those, you wouldn't get any information at all about errata. Not good, but it's at least something you can prevent: Just ensure you read the HTML docs.

And yes, this issue was reported back in 2007 and confirmed as a bug in 2010. But (perhaps surprisingly?) it is not "approved" according to the RFC, just "hold for document update". According to the IESG processing for RFC errata:

   2. Things that are clearly wrong but could not cause an 
   implementation or deployment problem should be Hold for
   Document Update.

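I am not exactly sure what they mean by this. While it's true that an implementation of the grammar as printed won't create false negatives, it will cause false positives: it accepts hostnames where a label ends in a hyphen, which the original definition forbids.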

Now, whether false positives can be a huge problem or not is debatable, but going under the assumption that this cannot cause implementation problems is obviously a bad assumption. If the server blindly assumes that the hostname is legal and crashes during deployment because it wasn't, I would call that an "implementation or deployment problem". Perhaps I'm a bit more pedantic or paranoid than others, but technically, this can be an issue. Of course, it's very stupid of the developers to not do their research and to blindly implement standards/specifications, but it's still a problem "just" because the RFC is erroneous.

So, there you go folks. Read RFC documents in their HTML form, and be sure to look out for the errata. There will almost certainly be some, so you better read up on them.


[1] I have taken the liberty of converting all grammar code to an equivalent EBNF format, because some of the specs use ABNF, and rfc952 uses a mix of BNF and ABNF. The fact that it mixes BNF and ABNF is a bit interesting: Most likely this is because rfc733 didn't exist when rfc608 was defined. As rfc952 obsoleted both rfc608 and rfc810, it's likely that they just kept the existing format for compatibility reasons.
