Comments on Threading from Wayne Davison

This was posted to the bizarchive list by Pope Clifton.

Newsgroups: news.software.readers
Subject: Re: RFC 1036 question: References line
From: [email protected] (Wayne Davison)
Date: Fri, 7 Jan 1994 20:38:13 GMT
Organization: Borland International

According to Stefan Kurth <[email protected]>:
> Can I count on the References line not to have "holes" in it? I mean,
> if I have this thread:
>       [1]---[2]---[3]---[4],
> is it possible that article 4's References line contains only the
> message IDs of articles 1 and 3?

Not only can you NOT count on it to not have holes, you have to plan
on it sometimes (a) being partially or fully reversed (b) having
duplicate entries (c) referencing itself (d) having gaps (e) having
an entirely random order from all over the thread and (f) being
completely bogus.

> RFC 1036 says one can shorten the References line if it gets too long,
> but it doesn't say that one has to cut from the left side only.

Modern thought is that the first reference should be preserved in a
(possibly vain) attempt to keep the root of the thread around. (No
one (that I know of, at least) attempts to take advantage of this
yet, because it is too random to be useful, however it may prove
useful sometime in the future.)  This makes it very likely that you
will encounter the above References as [1]--[3]--[4]--[5] in a
future article.

> How do other newsreaders deal with this?

Here's what trn does:

It first checks if the article that we are threading already exists as
a dummy article in the article tree (which often happens when news
arrives out of order).  If so, we trust the References in this older
article more than the article that dummied it up, so we cut it out of
the tree (bringing its children along for the ride) and remove any
non-shared dummy parent articles upward in a loop.  Now we thread
it just as if it were a non-dummied article.

To thread the article, we start with the last reference on the References
line and look it up in a hash table.  If it doesn't exist we create a
dummy article for that message-id and proceed to the previous reference.
If at any point in the reference-search we bump into a pre-existing (real
or dummy) article (ignoring references to the article itself and references
listed more than once on the line) we consider the search complete.  If
at any point we encounter bogus information on the References line we
give up and assume the rest is trashed.  Note that trn uses a lax
algorithm for valid message-ids that makes provisions for people having
globally replaced the '>' character with all sorts of other citation
characters (ARRRGH), so it allows things like  <[email protected]|
to be processed as if the '|' were a '>'.

After the reference search we either have a parent article to which to
link this chain of articles (the incoming article, any dummies we created
and any children it may have), or we ran out of references.  In this
latter case we link it to the root of a thread that corresponds to its
subject, creating one if it doesn't exist or making the article a sibling
of the existing root if it does.  In the case of finding a parent, we
link it up with the article we found, but this might merge two pre-
existing subjects together into one common thread if we are replacing
a dummy article.

As a final check, we traverse the list of parent articles to ensure that
we didn't create a reference loop back to the new article -- if we did,
we assumes that the "child" article it has is bogus, and cut it off,
linking it to the root of the tree.

Someone should be able to improve on this "quick link" algorithm if
they want to take the time to process the entire References line,
probably ordering the list by date to attempt to combat the out-of-order
postings.  It seems like such an algorithm would be very messy to get
it to handle the defective References lines and the articles that haven't
yet arrived.  If you think you can write one, please do (and share it
with the rest of us).

> (Are you listening, Wayne?)

Listening, and ready to help.  The code in trn's rt-process.c isn't THAT
hard to read (it's actually commented) -- you might want to check it out.
-- 
Wayne Davison
[email protected]