HOWTO: Authors beware of less-than '<' and greater-than '>' characters on the web

by David Christensen
dpchrist@holgerdanske.com
September 15, 2009

Here is an excerpt of an e-mail newsletter sent to me by Republic Media:

    THE SATURDAY FRONT PAGE: www.republicmedia.tv
    THE REALITY REPORT
    Gary Franchi Covers Lancaster CCTV Story and much more...

I like to post newsletters like this on the World Wide Web. I read my e-mail in plain text, but web pages are supposed to be written using HTML/XHTML [1, 2]. For example, here is "hello, world!" in strict XHTML 1.0 [3]:

"hello, world!" in strict XHTML 1.0

hello, world!

Valid XHTML 1.0!

Because HTML and XHTML place special meaning on the less-than '<' and greater-than '>' characters, problems can occur when I put a document containing these characters on the web.

If I open the Republic Media newsletter with Outlook, select the content, copy it to the clipboard, paste it into Notepad, save it as a *.txt file, and then open the file with Internet Explorer, I see:

    THE SATURDAY FRONT PAGE: www.republicmedia.tv
    THE REALITY REPORT
    Gary Franchi Covers Lancaster CCTV Story and much more...

Everything is there, but the text is monospace black on white, long sentences and paragraphs extend beyond the right edge of the window, and none of the URLs function as hyperlinks (i.e., a web-surfing dead end). That's why mark-up languages were invented.

If I save the file as *.html, I see:

    THE SATURDAY FRONT PAGE: www.republicmedia.tv THE REALITY REPORT

Now the font is proportional black on white and lines are wrapped, but the URL doesn't automatically become a hypertext link and the whole newsletter is one giant paragraph. Even worse, the second and third URLs have disappeared. This is not viable either.

If I paste the newsletter into Word and save it as HTML:

    THE SATURDAY FRONT PAGE: www.republicmedia.tv
    THE REALITY REPORT
    Gary Franchi Covers Lancaster CCTV Story and much more...

The text is monospace black on white, line breaks are preserved, and all the URLs have been converted into links. This method works for creating a stand-alone web page. I could make it fancier with Word's WYSIWYG capabilities. But I want to put the content into a web content management system (CMS) to gain benefits such as consistent look and feel, organization, navigation, indexing, searching, etc.

If I paste the newsletter into Drupal [4, 5], a dominant open-source web CMS:

    THE SATURDAY FRONT PAGE: www.republicmedia.tv
    THE REALITY REPORT
    Gary Franchi Covers Lancaster CCTV Story and much more...

I get everything I want except that some of the URLs are missing. Why?

Looking at the original message, note that the second and third URLs are delimited with '<' and '>'. When Drupal finds a '<' immediately followed by another character, it treats everything up to the matching '>' as HTML/XHTML mark-up (per the standards). But since a URL enclosed in '<' and '>' isn't valid HTML/XHTML, Drupal drops it.

Why does Drupal drop invalid HTML/XHTML mark-up? To prevent attacks such as cross-site scripting. Otherwise, some cracker could post harmful content on an otherwise legitimate web site and your computer could get hijacked the first time you browsed that page. So, content filtering is necessary for security reasons.

Some readers might say "Anybody can see that the text between the '<' and '>' is a URL; why can't Drupal?". Actually, Drupal can, with a filter plug-in. But adding a plug-in that purposefully breaks the rules is a risk:

- Assuming that the plug-in "works" by itself, who is going to verify that it doesn't break some other part of Drupal (such as whatever piece parses content according to the HTML/XHTML standards)?

- What about interoperability? Adding a non-conforming plug-in means creating a variant standard. Who is going to validate and patch all the other systems that interoperate with Drupal according to this variant standard?

- What about security? The XHTML and HTML standards are hundreds of pages in length; changing anything could create vulnerabilities in non-obvious places.

Therefore, adding a non-conforming plug-in amounts to taking a step down a slippery slope. Other steps are sure to follow (for example, e-mail addresses delimited with '<' and '>', etc.).

Even if I did modify my Drupal sites to deal with URLs delimited this way, what about all the other web sites out there, Drupal or otherwise? Trying to fix the problem at the receiving end makes about as much sense as putting the cart before the horse.

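The filtering rule described above (a '<' immediately followed by another character opens a tag that runs to the next '>') can be approximated in a few lines of Python. This is only a rough sketch for illustration, not Drupal's actual filter: the pattern, the function names strip_tag_like and find_tag_like, and the example.com URL are all assumptions made up for this example.

```python
import re

# Assumed approximation of a strict mark-up filter: a '<' immediately
# followed by a non-space character starts a tag-like span that ends
# at the next '>'. A '<' followed by a space is left alone.
TAG_LIKE = re.compile(r"<[^\s<>][^<>]*>")


def strip_tag_like(text):
    """Drop every tag-like span, roughly as a strict filter might."""
    return TAG_LIKE.sub("", text)


def find_tag_like(text):
    """List the spans an author may want to review before publishing."""
    return TAG_LIKE.findall(text)


if __name__ == "__main__":
    # Hypothetical newsletter line; the URL is a placeholder.
    sample = "Watch it here: <http://example.com/video> (5 < 7 and 7 > 5)"
    print(strip_tag_like(sample))  # the bracketed URL is gone
    print(find_tag_like(sample))   # the span that would be dropped
```

Run on the sample line, the bracketed URL vanishes while the comparison written with spaces around '<' and '>' passes through untouched. An author could run something like find_tag_like over a newsletter before sending it, to see which spans a filter of this kind would likely remove.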
The proper place to fix the problem is at the source -- authors:

The root cause of the problem is authors who use '<' and '>' characters, or who use tools that insert these characters, in ways that are incompatible with the World Wide Web. Therefore, I suggest that authors do the following:

1. Avoid using '<' and '>' characters, or tools that insert these characters, in the content you create.

2. If you need a '<' or '>', put spaces around it so that it won't be mistaken for an HTML/XHTML delimiter.

3. If you need '<' and '>' delimiters, write your content in valid HTML/XHTML, validate it:

       http://www.w3.org/QA/Tools/#validators

   and distribute the HTML/XHTML source.

Here are two reasons why:

1. Authors who use '<' and '>' with the WWW in mind will have their content distributed far and wide.

2. Authors who do not use '<' and '>' with the WWW in mind will have their content mutilated and then distributed far and wide.

As an author, which result do you want?

References:

[1] http://www.w3.org/TR/html4/
[2] http://www.w3.org/TR/xhtml1/
[3] http://holgerdanske.com/static/xhtml/hello.html
[4] http://drupal.org/
[5] http://constitutionpartyca.org/node/353

_Id: HOWTO-Authors-beware-of-less-than-and-greater-than-characters-on-the-web.txt,v 1.3 2009-09-16 03:38:16 dpchrist Exp _