FB Doug Meet

November 7, 2011

HTML5 Parser has landed

HTML5 Parser-Based View Source Syntax Highlighting

A new implementation of the View Source HTML and XML syntax highlighting has landed in Firefox.

Why?

A new implementation exists because the old implementation was based on the old HTML parser, which we want to get rid of. The old View Source implementation was standing in the way of removing the old parser. Also, the old parser did some incorrect highlighting. Most notably, it flagged the unnecessary-but-permitted slash on void elements (e.g. <br/>) as an error, because all such slashes were bogus in non-X HTML prior to HTML5.

The reason why the new implementation uses the HTML5 parser instead of using something new and Orion-integrated in the dev tools is that the new implementation was written before there were publicized plans to integrate dev tools with View Source. Furthermore, there is no way to get HTML syntax highlighting right without the highlighter running the whole HTML(5) parsing algorithm, because tokenizer state transition decisions depend on the tree builder state.
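To illustrate that dependency, here is a minimal Python sketch (not the Firefox code) of the decision that follows a start tag. The function name `next_tokenizer_state` and the string state labels are hypothetical, and the element sets are a simplified reading of the HTML parsing algorithm. The point is that the same tag name can lead to a different tokenizer state depending on whether the tree builder is in foreign content (SVG or MathML).

```python
# Simplified element sets; labels and names here are illustrative only.
RCDATA_ELEMENTS = {"title", "textarea"}
RAWTEXT_ELEMENTS = {"style", "xmp", "iframe", "noembed", "noframes"}


def next_tokenizer_state(tag_name, in_foreign_content, scripting_enabled=True):
    """Return a label for the tokenizer state that follows a start tag.

    in_foreign_content is information only the tree builder has: whether
    the tag is being processed inside an SVG or MathML subtree.
    """
    if in_foreign_content:
        # Foreign elements never switch the tokenizer to a special state.
        return "data"
    name = tag_name.lower()
    if name in RCDATA_ELEMENTS:
        return "rcdata"
    if name in RAWTEXT_ELEMENTS or (name == "noscript" and scripting_enabled):
        return "rawtext"
    if name == "script":
        return "script data"
    if name == "plaintext":
        return "plaintext"
    return "data"


# The same tag name, with and without foreign content around it.
print(next_tokenizer_state("title", in_foreign_content=False))
print(next_tokenizer_state("title", in_foreign_content=True))
```

A highlighter that only tokenizes cannot know the value of `in_foreign_content`, which is why it cannot reliably pick the right state.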

New Features

The first and foremost feature is not user-visible per se. It is the non-use of the old parser code in order to be able to get rid of the old parser. However, using the old parser had user-visible consequences.

More Correct Highlighting

As already mentioned, the old parser unconditionally highlighted the slash in <foo/> as red regardless of the element name. Furthermore, the old parser failed to get the highlighting of tricky inline scripts right (when the inline script contained the string </script>). Highlighting of SVG and MathML content in text/html was wrong, too, since the old parser knew nothing about foreign content in text/html.

Consider the following highlighting by the old parser:

    <!DOCTYPE html>
    <html>
    <head>
    <title>Title</title>
    <script>
    var lt = "<";
    <!--
    var s = "<script>foo</script>";
    -->
    </script><!-- Not quite optimal highlight there. -->
    <style>
    /* </foo> */
    </style>
    </head>
    <body>
    <p>Entity: &amp; </p>
    <noscript><p>Not para</p></noscript>
    <svg>
    <title><![CDATA[bar]]></title>
    <script><!-- this is a comment --></script>
    </svg>
    </body>
    </html>

The first occurrence of </script> is highlighted as an end tag. The content of the SVG title and script elements is treated as if the elements were HTML elements of the same name.

Compare the above to the highlighting performed by the new implementation:

    <!DOCTYPE html>
    <html>
    <head>
    <title>Title</title>
    <script>
    var lt = "<";
    <!--
    var s = "<script>foo</script>";
    -->
    </script><!-- Not quite optimal highlight there. -->
    <style>
    /* </foo> */
    </style>
    </head>
    <body>
    <p>Entity: &amp; </p>
    <noscript><p>Not para</p></noscript>
    <svg>
    <title><![CDATA[bar]]></title>
    <script><!-- this is a comment --></script>
    </svg>
    </body>
    </html>

The HTML script is tokenized according to the HTML rules. Note that <!-- ... --> inside an HTML script is not a comment node! In the SVG subtree, title and script are not special and can have CDATA sections or comments inside them. (The coloring of the HTML script end tag is inconsistent with other end tags, though, due to technical difficulties.)
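The script handling can be sketched in Python. This is a heavily simplified model of the script data, escaped and double-escaped tokenizer states, written for illustration only; it is not the Firefox code, and the names `find_script_end` and `_is_script_tag` are hypothetical. It returns the offset of the `</script` that actually ends the element, which a plain substring search gets wrong for the example above.

```python
def _is_script_tag(text, i, prefix):
    """True if text[i:] begins with prefix + "script" followed by a delimiter."""
    word = prefix + "script"
    end = i + len(word)
    return (text[i:end].lower() == word
            and end < len(text)
            and text[end] in "\t\n\f />")


def find_script_end(text):
    """Return the index of the "</script" that closes the element, or -1.

    text is the source that follows the <script> start tag.
    """
    state = "data"  # "data", "escaped" or "double escaped"
    i = 0
    while i < len(text):
        if state == "data":
            if text.startswith("<!--", i):
                state = "escaped"
                i += 2  # stay on the "--" so that "<!-->" un-escapes at once
                continue
            if _is_script_tag(text, i, "</"):
                return i
        else:
            if text.startswith("-->", i):
                state = "data"
                i += 3
                continue
            if _is_script_tag(text, i, "</"):
                if state == "escaped":
                    return i       # an end tag in the escaped state is real
                state = "escaped"  # in the double-escaped state it is not
            elif state == "escaped" and _is_script_tag(text, i, "<"):
                state = "double escaped"
        i += 1
    return -1


sample = ' var lt = "<"; <!-- var s = "<script>foo</script>"; --> </script>'
print(sample.find("</script>"))   # what a naive search finds
print(find_script_end(sample))    # what the state machine finds
```

The naive search stops at the `</script>` inside the string literal, which is the mistake the old highlighter made. The sketch skips it because the preceding `<!--` and `<script>` put the tokenizer into the double-escaped state.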

Better Error Reporting

The old parser highlighted errors so rarely that it was easy to think it was not doing it at all. However, it did indeed have support for highlighting a couple of errors. I am aware of it highlighting doctypes that used XML syntax inappropriate for HTML and highlighting the already mentioned XML-style slash in tags.

For feature parity with the old implementation, the new implementation had to at least highlight the XML-style slash where the slash is wrong per HTML5 / the HTML Living Standard. Highlighting the slash correctly is among the more difficult error highlights. The Java version of the parser (from which the C++ version is mechanically translated) already supported full error reporting, because it was originally written for a validator. So I thought I could add support for the easier error highlights too while I was at it.
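As a rough illustration of the slash rule, here is a minimal Python sketch. It is an assumption-laden simplification, not the parser's logic: the name `trailing_slash_is_error` is hypothetical, the void element set is an abbreviated list rather than the parser's own table, and the check only covers start tags.

```python
# Abbreviated, illustrative set of void elements (an assumption, not the
# parser's own table).
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


def trailing_slash_is_error(tag_name, in_foreign_content=False):
    """Return True if a trailing slash on this start tag should be flagged.

    The slash is harmless on void HTML elements and meaningful on SVG and
    MathML elements; on any other HTML element it is a parse error.
    """
    if in_foreign_content:
        return False
    return tag_name.lower() not in VOID_ELEMENTS


print(trailing_slash_is_error("br"))                           # <br/>
print(trailing_slash_is_error("p"))                            # <p/>
print(trailing_slash_is_error("circle", in_foreign_content=True))
```

Note that the decision again needs the tree builder: whether the element is in foreign content is not something the tokenizer knows on its own.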

But why stop at highlights without explaining them? I also made the parser attach error messages to the highlights as tooltips. (Unfortunately, Firefox has long-standing accessibility problems with tooltips, so the error messages are not keyboard-accessible at the moment.)

The new View Source implementation produces results like this (note the tooltips):

    <!DOCTYPE HTML PUBLIC"-//W3C//DTD HTML 4.01//EN">
    <form>
    <table>
    <h1>Error test</h1>
    <tr><td><div>cell<td>another cell
    </table>
    <form>
    <select><select>
    <div>
    <p class="foo"id="bar">
    <p/>
    <br/>
    <h2><h3></h3></h2>
    <![CDATA[bogus comment]]>
    <svg>
    <![CDATA[this is text]]>
    <div>
    <![CDATA[bogus comment again]]>
    <!--foo--bar-->
    <!-- foo --!>
    <p><i><b>bold italic</i></b></p>
    <p>&#x0000;</p>
    <p>&auml </p>
    <p>&foo </p>
    </a>

Note: The tooltips will have line breaks between multiple error messages in one tooltip when viewed in the View Source window in Firefox. The lack of line breaks in Firefox in other contexts (including this page) is a known HTML5 violation bug.

Off-The-Main-Thread Highlighting

As a consequence of the off-the-main-thread design of the HTML5 parser in Firefox, the highlight computation now happens off the main thread.

Limitations

The above may set expectations too high, so it is important to lower them right away.

This Is Not a Validator!

All the errors you have seen above are parse errors. Parse errors are errors defined as such by the HTML parsing algorithm. There is much more to checking HTML validity than just finding the parse errors.

For example, putting a div element as a child of an ul element or as a child of a span element is not a parse error. In an HTML validator, content model errors like that are detected by a validation layer above the parser. The View Source implementation in Firefox does not have a validation layer at all.

The lack of a validation layer has counter-intuitive consequences. The HTML parsing algorithm avoids parse errors that would be redundant with validation errors. For example, <div<div> is a start tag for an element named div<div. Since there is no such element in the HTML language, the validation layer would catch the error. However, when we do not have a validation layer, the typo goes unreported.
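The `<div<div>` case can be reproduced with a minimal Python sketch of how the tokenizer reads a tag name. This is a simplification for illustration (the function name `read_start_tag_name` is hypothetical, and NUL handling is omitted): the name runs until whitespace, a slash or `>`, so a stray `<` simply becomes part of the name and no parse error is raised.

```python
# Characters that end a tag name: tab, line feed, form feed, space, "/", ">".
TAG_NAME_TERMINATORS = "\t\n\f />"


def read_start_tag_name(source):
    """Return the tag name of the start tag at the beginning of source."""
    if len(source) < 2 or source[0] != "<" or not (
        source[1].isascii() and source[1].isalpha()
    ):
        raise ValueError("source does not begin with a start tag")
    end = 1
    while end < len(source) and source[end] not in TAG_NAME_TERMINATORS:
        end += 1
    # Only ASCII upper-case letters are lowercased; everything else,
    # including "<", is kept as part of the name.
    return "".join(c.lower() if "A" <= c <= "Z" else c for c in source[1:end])


print(read_start_tag_name("<div<div>"))
print(read_start_tag_name("<DIV class=x>"))
```

Only a layer that knows which element names exist in HTML could notice that the first name is a typo.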

Please do not advertise the new View Source implementation by saying that Firefox now has a validator in the View Source window.

Not All Parse Errors Are Reported!

Even though all the errors that are reported are parse errors according to the specification, not all parse errors are reported.

  • Bytes that are illegal according to the encoding of the page are not marked as errors.
  • Forbidden characters are not reported as errors.
  • Errors that relate to the end of file are not reported.
  • Tree builder errors that relate to text (as opposed to tags, comments or doctypes) are not reported.
  • Parse errors related to xmlns attributes are not reported.

XML Syntax Highlighting

The old implementation used the HTML tokenizer for highlighting XML source. So does the new implementation. While the tokenizer has support for processing instructions when it is highlighting XML source, that is the only XML-oriented additional capability. As a result, doctypes that have an internal subset are mishighlighted and entity references to custom entities are mishighlighted. This is obvious when viewing the source of Firefox chrome files. However, the mishighlighting should not be a practical problem when viewing source of typical XML files on the Web (to the extent there are XML files on the Web).

Other Known Bugs

The new implementation broke the window title for View Source windows. Also, the highlighting of the end of named character references is off by one.

Release Schedule

The code landed on trunk in time for Firefox 10. However, the landing added quite a bit of code, so it is possible that the code gets turned off after the Aurora uplift. To test the new code, using Nightly is your best bet.


Written: 2011-11-04 by Henri Sivonen

The following notice applies to this HTML file:

Copyright (c) 2011 Mozilla Foundation

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
