SEO

February 19, 2011

Hash-bangs: 10% of me that understands this is pissed off!


Breaking the Web with hash-bangs

Tuesday, February 08, 2011

Update 10 Feb 2011: Tim Bray has written a much shorter, clearer and less technical explanation of the broken use of hash-bang URLs. I thoroughly recommend reading and referencing it.

Update 11 Feb 2011: Another very insightful (and balanced) response, this one from Ben Ward (Hash, Bang, Wallop.), who does a great job of separating the wheat from the chaff.

Lifehacker, along with every other Gawker property, experienced a lengthy site-outage on Monday over a misbehaving piece of JavaScript. Gawker sites were reduced to being an empty homepage layout with zero content, functionality, ads, or even legal disclaimer wording. Every visitor coming through via Google bounced right back out, because all the content was missing.

JavaScript dependent URLs

Gawker, like Twitter before it, built their new site to be totally dependent on JavaScript, even down to the page URLs. The JavaScript failed to load, so no content appeared, and every URL on the page was broken. In terms of site brittleness, Gawker’s new implementation got turned up to 11.

Every URL on Lifehacker now looks like this: http://lifehacker.com/#!5753509/hello-world-this-is-the-new-lifehacker. Before Monday the URL was almost the same, but without the #!. So what?

Fragment identifiers

The # is a special character in a URL: it marks the rest of the URL as a fragment identifier, so everything after it refers to an HTML element id, or a named anchor, in the current page. The current page here is the Lifehacker homepage.

So on Sunday Lifehacker was a one-million-page site; today it's a one-page site with one million fragment identifiers.

Why? I don't know. Twitter's response, when faced with this question on launching "New Twitter", was that Google can index individual tweets. True, but Google could already do that under the previous, proper URL structure, with much less overhead.

A solution to a problem

The #!-baked URL (hash-bang) syntax first came into the general web developer spotlight when Google announced a method web developers could use to allow Google to crawl Ajax-dependent websites.

Back then best-practice web development wasn’t well known or appreciated, and sites using fancy technology like Ajax to bring in content found themselves poorly listed or ranked for relevant keywords, because Googlebot couldn’t find the content they’d hidden behind JavaScript calls.

Although Google spent many laborious hours trying to crack this problem, they eventually admitted defeat and tackled it in a different manner. Instead of trying to find this mythical content, let’s get website owners to tell us where the content actually is. They produced a specification aimed at doing just that.

In writing about it, Google were careful to stress that web developers should build sites with progressive enhancement and not rely on JavaScript for their content, noting:

If you’re starting from scratch, one good approach is to build your site’s structure and navigation using only HTML. Then, once you have the site’s pages, links, and content in place, you can spice up the appearance and interface with Ajax. Googlebot will be happy looking at the HTML, while users with modern browsers can enjoy your Ajax bonuses.

So the #! URL syntax was especially geared for sites that got the fundamental web development best practices horribly wrong, and gave them a lifeline to getting their content seen by Googlebot.

And today, this emergency rescue package seems to be regarded as the One True Way of web development by engineers from Facebook, Twitter, and now Lifehacker.

Clean URLs

In Google’s specification, the #!-patterned URLs are called pretty URLs, and they are transformed by Googlebot (and other crawlers supporting Google’s lifeline specification) into something more grotesque.

On Sunday, Lifehacker’s URL scheme looked like this:

http://lifehacker.com/5753509/hello-world-this-is-the-new-lifehacker

Not bad. The 7-digit number in the middle is the only unclean thing about this URL, and Gawker’s content system needs that as a unique identifier to map to the actual article. So it’s a mostly clean URL.

Today, the same piece of content is now addressable via this URL:

http://lifehacker.com/#!5753509/hello-world-this-is-the-new-lifehacker

This is less clean than before. The addition of the #! fundamentally changes the structure of the URL:

  • The path /5753509/hello-world-this-is-the-new-lifehacker becomes /
  • A new fragment identifier of !5753509/hello-world-this-is-the-new-lifehacker gets added
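
To see the split concretely, here is a small sketch using the standard URL class available in browsers and Node.js. The helper name requestTarget is hypothetical; it returns only the part of the URL that a client puts on the HTTP request line, since the fragment is never sent to the server.

```javascript
// Sketch: what a server is actually asked for, before and after the #! change.
// requestTarget is a hypothetical helper name, not part of any library.
function requestTarget(urlString) {
  var url = new URL(urlString);
  // Only the path and query go over the wire; url.hash stays in the client.
  return url.pathname + url.search;
}

var sunday = "http://lifehacker.com/5753509/hello-world-this-is-the-new-lifehacker";
var today = "http://lifehacker.com/#!5753509/hello-world-this-is-the-new-lifehacker";

console.log(requestTarget(sunday)); // the article path
console.log(requestTarget(today));  // only the homepage path
console.log(new URL(today).hash);   // the fragment, kept client-side
```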

What does this achieve? Nothing. And the URL mangling doesn’t end there.

Google’s specification says that it will transform the hash-bang URL into a query string parameter, so the example URL above becomes:

http://lifehacker.com/?_escaped_fragment_=5753509/hello-world-this-is-the-new-lifehacker

That uglier URL actually returns the content of the article. So this is the canonical reference to this piece of content. This is the content that Google indexes. (This is also the same with Twitter’s hash-bang URLs.)

This URL scheme looks a lot like:

http://example.com/default.asp?page=about_us

Lifehacker/Gawker have thrown away a decade’s worth of clean URL experience, and ended up with something that actually looks worse than the typical templated Classic ASP site. (How much more FrontPage can you get?)

Clean? Not on your life!
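
The rewrite from hash-bang URL to query string described above can be sketched in a few lines. This is a simplified illustration, not Google's implementation: the function name toEscapedFragmentUrl is hypothetical, and the escaping here only covers a handful of characters (%, #, &, + and whitespace), which is an assumption made for brevity.

```javascript
// Sketch of the crawler-side rewrite: "#!" becomes "?_escaped_fragment_=".
// toEscapedFragmentUrl is a hypothetical name; the escaping is deliberately simplified.
function toEscapedFragmentUrl(prettyUrl) {
  var marker = prettyUrl.indexOf("#!");
  if (marker === -1) {
    return prettyUrl; // not a hash-bang URL, nothing to rewrite
  }
  var base = prettyUrl.slice(0, marker);
  var fragment = prettyUrl.slice(marker + 2);
  // Percent-escape characters that would otherwise change the query string's meaning.
  var escaped = fragment.replace(/[%#&+\s]/g, function (ch) {
    return encodeURIComponent(ch);
  });
  // Append to an existing query string if there is one.
  var separator = base.indexOf("?") === -1 ? "?" : "&";
  return base + separator + "_escaped_fragment_=" + escaped;
}

console.log(toEscapedFragmentUrl(
  "http://lifehacker.com/#!5753509/hello-world-this-is-the-new-lifehacker"
));
```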

What’s the problem?

The main problem is that LifeHacker URLs now don’t map to actual content. Well, every URL references the LifeHacker homepage. If you are lucky enough to have the JavaScript running successfully, the homepage then triggers off several Ajax requests to render the page, hopefully with the desired content showing up at some point.

Far more complicated than a simple URL, far more error-prone, and far more brittle.

So, requesting the URL assigned to a piece of content doesn’t result in the requestor receiving that content. It’s broken by design. LifeHacker is deliberately preventing crawlers from following links on the site towards interesting content. Unless you jump through a hoop invented by Google.

Why is this hoop there?

The why of hash-bang

So why use a hash-bang if it’s an artificial URL, and a URL that needs to be reformatted before it points to a proper URL that actually returns content?

Out of all the reasons, the strongest one is “Because it’s cool”. I said strongest, not strong.

Engineers will mutter something about preserving state within an Ajax application. And frankly, that’s a ridiculous reason for breaking URLs like that. The URL of an href can still be a proper addressable reference to content. You are already using JavaScript, so you can do this damage much later with JavaScript using a click handler on the link. The transform between last week’s LifeHacker URL scheme, and this week’s hash-bang mangling is trivial to do in JavaScript using a click handler.

At the risk of invoking the wrath of Jamie Zawinski, LifeHacker can keep its mostly clean URL of last week (http://lifehacker.com/5753509/hello-world-this-is-the-new-lifehacker) and obtain the mangled version by this regular expression:

var mangledUrl = this.href.replace(/(\d+)/, "#!$1");

Doing this mangling in JavaScript (during the click handler of the link) means you keep your apparent state benefits, but without needlessly preventing crawlers from traversing your site, and any other non-JavaScript eventuality.
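
A minimal sketch of that approach follows. The links in the HTML keep their clean href, and the hash-bang form is only produced at click time. The function names toHashBangUrl and installClickHandler are hypothetical, and the pattern assumes article paths start with a numeric id, as in the Lifehacker example; it is anchored to the start of the path so digits elsewhere in the URL are left alone.

```javascript
// Sketch: keep clean hrefs in the markup, mangle only inside a click handler.
// Both function names are hypothetical; the URL pattern is an assumption.
function toHashBangUrl(href) {
  // Insert "#!" in front of a leading numeric article id in the path.
  return href.replace(/^(https?:\/\/[^\/]+\/)(\d+\/)/, "$1#!$2");
}

function installClickHandler(doc, win) {
  doc.addEventListener("click", function (event) {
    var link = event.target.closest ? event.target.closest("a[href]") : null;
    if (!link) {
      return; // click was not on a link
    }
    var mangled = toHashBangUrl(link.href);
    if (mangled === link.href) {
      return; // not an article link, let the browser handle it normally
    }
    event.preventDefault();
    // Crawlers and non-JavaScript clients never run this line;
    // they simply follow the clean href.
    win.location.hash = mangled.slice(mangled.indexOf("#"));
  });
}

// Only wire up the handler when running in a browser.
if (typeof document !== "undefined" && typeof window !== "undefined") {
  installClickHandler(document, window);
}
```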

Disallow all bots (except Googlebot)

All non-browser user-agents (crawlers, aggregators, spiders, indexers) that completely support both HTTP/1.1 and the URL specification (RFC 2396, for example) cannot crawl any Lifehacker or Gawker content. Except Googlebot.

This has ramifications that need to be considered:

  1. Caching is now broken. Since intermediary servers have no canonical representation of content, they are unable to cache it. This results in Lifehacker being perceived as slower. It means Gawker don’t save bandwidth costs through any edge caching of chunks of content, and they are on their own in dealing with spikes of traffic.
  2. HTTP/1.1 and RFC-2396 compliant crawlers now cannot see anything but an empty homepage shell. This has knock-on effects on the applications and services built on such crawlers and indexers.
  3. The potential use of Microformats (and upper-case Semantic Web tools) has now dropped substantially: only browser-based aggregators or Google-led aggregators will see any Microformatted data. This removes Lifehacker and other Gawker sites from being used as datasources in Hackdays (rather ironic, really).
  4. Facebook Like widgets that use page identifiers now need extra work to allow articles to be liked, since by default the homepage is the only page referenceable by a non-mangled URL and all mangled URLs resolve down to the homepage.

Being dependent on perfect JavaScript

If content cannot be retrieved from a server given its URL, then that site is broken. Gawker have deliberately made the decision to break these URLs. They’ve left their site availability open to all sorts of JavaScript-related errors:

  • JavaScript failing to load led to a five-hour outage on all Gawker media properties on Monday. (Yes, SproutCore and Cappuccino fans, empty divs are not an appropriate fallback.)
  • A trailing comma in an array or object literal will cause a JavaScript error in Internet Explorer. For Gawker, this will translate into a complete site outage for IE users.
  • A debugging console.log line accidentally left in the source will cause Gawker’s site to fail when the visitor’s browser doesn’t have the developer tools installed and enabled (Firefox, Safari, Internet Explorer).
  • Adverts regularly trip up with errors, so Gawker’s site availability is completely in the hands of advert JavaScript. Experienced web developers know that JavaScript from advertisers contains the worst lumps of code out there on the web.

Such brittleness, for no real reason and with no benefit that outweighs the downside. There are far better methods than the one Gawker adopted; even HTML5’s History API (with appropriate polyfills) would be a better solution.

(If you thought that invalid XHTML delivered with the correct mimetype was not fit for the web, this JavaScript mangled-URLs approach is far worse.)

An Architectural Nightmare

Gina Trapani tweets: “Lay down your pitchforks and give @Lifehacker’s redesign a week before you swear it off and insist that the staff doesn’t care about you.” A week won’t solve Gawker’s architectural nightmare.

Gawker/Lifehacker have violated the principle of progressive enhancement, and they paid for it immediately with an extended outage on day one of their new site launch. Every JavaScript hiccup will cause an outage, and directly affect Gawker’s revenue stream and the trust of their audience.

Updates (9th February 2011)

Wow. I (and my VPS) am overwhelmed by the conversation this post has sparked. Thank you for contributing towards a constructive discussion. Some of the posts that caught my eye today:

All of the features that hash-bangs are providing can be done today in a safer, more web-friendly way with HTML5's pushState from the History API. (thanks Kerin Cosford & Dan Sanderson)
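
As a rough sketch of what that looks like, the snippet below updates the address bar to a real, crawlable URL with pushState when the browser supports it, and falls back to an ordinary page load when it does not. The names createNavigator and loadContent are hypothetical, and how the content is actually fetched and rendered is left to the caller.

```javascript
// Sketch: progressive enhancement with the History API instead of hash-bangs.
// createNavigator and loadContent are hypothetical names for illustration.
function createNavigator(win, loadContent) {
  var supported = !!(win.history && typeof win.history.pushState === "function");

  if (supported && typeof win.addEventListener === "function") {
    // Re-render when the visitor uses the back or forward buttons.
    win.addEventListener("popstate", function (event) {
      if (event.state && event.state.url) {
        loadContent(event.state.url);
      }
    });
  }

  return function navigate(url) {
    if (!supported) {
      // Older browsers just follow the clean URL with a full page load.
      win.location.href = url;
      return false;
    }
    // The address bar now shows a proper URL that the server can also answer.
    win.history.pushState({ url: url }, "", url);
    loadContent(url);
    return true;
  };
}

// In a browser this might be used as:
//   var navigate = createNavigator(window, function (url) { /* fetch and render */ });
//   navigate("/5753509/hello-world-this-is-the-new-lifehacker");
```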

The Next Web reports that Gawker blogs have disappeared from Google News searches. A Gawker media editor is quoted as saying they hope to have it resolved soon. They are listed again, but using the _escaped_fragment_ form of the URL. So much for clean URLs. The link still seems intermittently broken, though, claiming the URL requested is not available (with a redirect to http://gawker.com/#ERR404).

I did like this tl;dr summary of this post over on theawl.com by mrmcd.

Webmonkey have a summary story, but link off to some very handy resources for clean URL strategies. (I first learnt HTML from Webmonkey back in the previous century)

Phillip Tellis, one of the handful of Yahoos I regret not meeting, blogs some Thoughts on performance, well worth reading. Also highly recommended is warpspire's URL Design.

Danny Thorpe talks about Side effects of hash-bang URLs, including URL Cache equivalence. Oliver Nightingale has a nicely worked example using HTML5's pushState in a progressively enhanced way (great job!).

The very short geeky summary of this post (try curling a Lifehacker article canonical URL):

$ curl http://lifehacker.com//hello-world-this-is-the-new-lifehacker | grep "Hello"
$

or as Ben Ward put it: If site content doesn’t load through curl it's broken.

Broken HTTP Referrers

Watching my logfiles, I'm seeing a number of inbound links to this post from gawker.com and kotaku.com, all from the homepage (i.e. the fragment identifier is stripped out). So somewhere on those sites there's a discussion going on about my post, and there's no way of finding it thanks to Gawker's use of hash-bang URLs.


Singer YouTube KidsPrank Prison

Singer Faces 20 Years In Prison for YouTube Prank on Kids

21-year-old Michigan resident Evan Emory currently faces 20 years in prison for "manufacturing child sexual abusive material". His crime: He posted a YouTube video that made it appear he was singing an explicit song to a classroom of elementary students.

Emory tricked administrators at Beechnau Elementary School into letting him perform a song for the kids on video, claiming he wanted to build his portfolio. He sung an innocent song in front of the kids, but when the room was empty recorded a sexually explicit song. ("I like the way you make your body move. C'mon, girl...See how long it takes to make your panties mine...I'll add some foreplay in just to make it fun. I want to stick my index finger in your anus.")

Through trick editing, Emory made it appear that he had been singing the song to the kids while they smiled and laughed along. He included a disclaimer—"No children were exposed to the 'graphic content' of this video"—and posted it on YouTube earlier this week.

On Wednesday, Emory was arrested on charges of manufacturing "child sexual abusive material". Said the county prosecutor:

"The bottom line in this case is that he walked into a classroom and took advantage and victimized every single child in that classroom," Tague said.

"This case is very disturbing to law enforcement officials. We have launched a full-fledged investigation with the sheriff."

At his arraignment, outraged parents of the kids in the video appeared at the courthouse to rally for jail time.

We can understand why the parents and school would be upset. But these are clearly laws designed to punish hardcore sex offenders—not some bro who came up with a misguided idea for a prank. In the end, the video appears to have been online for about a day or two and was probably seen by a few hundred people at most. This is a very broad definition of "victimization!" One law professor says the charges are likely unconstitutional.

As Radley Balko points out, the hysteria is fueled by the volatile combination of children + sex + The Internet. Add to that an overreaction by a humiliated school district. Here's hoping the judge realizes this, too.

Note: The embedded video is another one of Emory's pranks—not the video in question

By Adrian Chen

HTML5 Periodic Table

HTML5 Elements

The table below shows the 104 elements currently in the HTML5 working draft and two proposed elements (marked with an asterisk).

Periodic Table of the Elements

Elements for html5advent.com

Each element is listed with its description where the table provides one. A number in parentheses is the count shown in that element's cell for html5advent.com.

  • <html> (1): Document root element.
  • <col>: Columns in a table.
  • <table>: Table of multi-dimensional data.
  • <head> (1): First element of the HTML document. Contains document metadata.
  • <span> (79)
  • <fieldset>: Set of form controls grouped by theme.
  • <form>: Form.
  • <body> (1): Document content.
  • <h1> (25): Heading for the current section.
  • <section> (25): Contains elements grouped by theme, for example a chapter or tab box.
  • <colgroup>
  • <tr>: A row of cells.
  • <title> (1): Document title.
  • <a> (216)
  • <pre>: Text that is preformatted in the HTML code.
  • <meter>: Gauge showing a numeric value within a known range.
  • <select>: Control for selecting from multiple options.
  • <aside>: Content related to surrounding elements that doesn't belong inline, such as advertising or quotes.
  • <h2> (25)
  • <header> (1): Navigation or introductory elements for the current section.
  • <caption>: Title of a table.
  • <td>: Table cell.
  • <meta> (6): Document metadata that can't be represented with other elements.
  • <rt>
  • <dfn>
  • <em>
  • <i>: Text in an alternate voice or mood, such as a technical term.
  • <small> (24): An aside, such as fine print.
  • <ins>: Text that has been inserted during document editing.
  • <hr>: Paragraph-level thematic break.
  • <br> (2): Line break.
  • <div> (86): Container with no semantic meaning.
  • <blockquote>
  • <legend>
  • <optgroup>: Group of options.
  • <address>: Contact information for the current article.
  • <h3> (21): Heading for the current section.
  • <nav>: Contains a collection of links.
  • <menu>: Set of commands.
  • <th>: Table heading.
  • <base>: Specifies URL which non-absolute URLs are relative to.
  • <rp>: Contains semantically meaningless markup for browsers that don't understand ruby annotations.
  • <abbr>: Abbreviation or acronym.
  • <time>: Time defined in a machine readable format.
  • <b>: Stylistically separated text of equal importance, such as a product name.
  • <strong> (48): Text that is important.
  • <del>: Text that has been removed during document editing.
  • <s>: Text that is outdated or no longer accurate.
  • <p> (87): Paragraph content.
  • <ol>: Ordered list.
  • <dl>: List of term-description pairs.
  • <label>: Caption for a form control.
  • <option>: Single option within a select control.
  • <datalist>: Defines sets of options.
  • <h4> (3): Heading for the current section.
  • <article> (1): Section of the page content, such as a blog or forum post.
  • <command>: Command the user can perform, such as publishing an article.
  • <tbody>: Contains rows that hold the table's data.
  • <link> (6)
  • <noscript>: Contains elements that are part of the document only if scripting is disabled.
  • <q>: Quoted text.
  • <var>: Mathematical or programming variable.
  • <sub>: Subscript text.
  • <mark>: Text highlighted for referencing elsewhere.
  • <kbd>: Example input (usually keyboard) for a program.
  • <wbr>: Opportunity for a line break.
  • <figcaption>: Caption for a figure.
  • <ul> (12): Unordered list.
  • <dt>: Term which will be described.
  • <input>: Generic form input.
  • <output>: Contains the results of a calculation.
  • <keygen>: Generates private-public key pairs.
  • <h5>
  • <footer> (1): Footer of the current section.
  • <summary>: Caption of a details element.
  • <thead>: Contains rows with table headings.
  • <style>: Styling data defined inline.
  • <script> (6): Inline or linked client side scripts.
  • <cite>: Title of a referenced piece of work.
  • <samp>
  • <sup>: Superscript text.
  • <ruby>: Contains text with annotations, such as pronunciation hints. Commonly used in East Asian text.
  • <bdo>: Defines directional formatting for content.
  • <code>: Fragment of code.
  • <figure>: Contains elements related to a single concept, such as an illustration or code example.
  • <li> (72): List item.
  • <dd>: Description for the preceding term.
  • <textarea>: Multiline free-form text input.
  • <button>: A button.
  • <progress>: Control for displaying progress of a task.
  • <h6>: Heading for the current section.
  • <hgroup> (22): Collection of headings for the current section. The highest ranked heading represents the group in the document outline.
  • <details>: Contains additional information, such as the contents of an accordion view.
  • <tfoot>: Contains rows with summary of data.
  • <img> (61)
  • <area>: Hyperlink area in an image map.
  • <map>: Image map for adding hyperlinks to parts of an image.
  • <embed>: Reference to non-HTML content.
  • <object>
  • <param>
  • <source>: Alternative sources for parent video or audio elements.
  • <iframe>: Nested browser frame.
  • <canvas>: Bitmap which is editable by client side scripts.
  • <track>*: Specifies external timing track for media elements. This element is still being drafted.
  • <audio>: Audio file.
  • <video>: Video.
  • <device>*: Allows scripts to access devices such as a webcam. This element is still being drafted.

Element groups in the table's legend:

  • Root element
  • Text-level semantics
  • Forms
  • Tabular data
  • Metadata and scripting
  • Grouping content
  • Document sections
  • Interactive elements
  • Embedding content

