An interesting comment from the audvidsyn mailing list, which I lurk on - a list for "practical conversations about syndication of audio and video":
"Because the destiny of audio/video is to be absorbed into the mainstream web. RSS w/enclosures is important because it is a form of hypertext designed for audio and video. The hypertext aspect is central. A single HTML document is not an appropriate format for collections of timed media. RSS w/enclosures is."
"Or, well, RSS w/ enclosures is a lot more appropriate than HTML anyway. HTML documents either represent a single moment or totally lack a concept of separate moments. RSS documents are collections of moments. They lack a lot of timing metadata (e.g. the SMIL PAR tag and endsync attribute), but at least they allow for the concept of multiple moments."
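To make the "collection of moments" idea concrete, here is a rough sketch of my own (not from the mailing list post) that reads an RSS 2.0 feed and lists each item together with the media file it encloses. The feed, its titles and its example.com URLs are all invented for illustration, and the list_moments helper is a hypothetical name; only the item, pubDate and enclosure elements (with their url, length and type attributes) come from RSS 2.0 itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed, invented for illustration; the URLs are placeholders.
FEED = """<rss version="2.0">
  <channel>
    <title>Example show</title>
    <link>http://example.com/</link>
    <description>A made-up feed with two episodes</description>
    <item>
      <title>Episode 1</title>
      <pubDate>Mon, 04 Oct 2004 09:00:00 GMT</pubDate>
      <enclosure url="http://example.com/ep1.mp3" length="1234567" type="audio/mpeg"/>
    </item>
    <item>
      <title>Episode 2</title>
      <pubDate>Mon, 11 Oct 2004 09:00:00 GMT</pubDate>
      <enclosure url="http://example.com/ep2.mov" length="7654321" type="video/quicktime"/>
    </item>
  </channel>
</rss>"""


def list_moments(feed_xml):
    """Return one dict per item that carries an enclosure.

    Each dict is one "moment": a title, a publication date and a
    pointer to a piece of timed media.
    """
    root = ET.fromstring(feed_xml)
    moments = []
    for item in root.iter("item"):
        enclosure = item.find("enclosure")
        if enclosure is None:
            continue  # an item without media attached is skipped
        moments.append({
            "title": item.findtext("title"),
            "published": item.findtext("pubDate"),
            "url": enclosure.get("url"),
            "type": enclosure.get("type"),
            "length": int(enclosure.get("length")),
        })
    return moments


for moment in list_moments(FEED):
    print(moment["published"], moment["type"], moment["url"])
```

Note what is missing, which is the quoted point about timing metadata: each item says when it was published and where the media lives, but nothing about how the items should play relative to one another.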