Weblogs: Javascript
Breaking the Web with hash-bangs
Tuesday, February 08, 2011
Update 10 Feb 2011: Tim Bray has written a much shorter, clearer and less technical explanation of the broken use of hash-bang URLs. I thoroughly recommend reading and referencing it.
Update 11 Feb 2011: Another very insightful (and balanced) response, this one from Ben Ward (Hash, Bang, Wallop.), which does a great job of separating the wheat from the chaff.
Lifehacker, along with every other Gawker property, experienced a lengthy site-outage on Monday over a misbehaving piece of JavaScript. Gawker sites were reduced to being an empty homepage layout with zero content, functionality, ads, or even legal disclaimer wording. Every visitor coming through via Google bounced right back out, because all the content was missing.
JavaScript dependent URLs
Gawker, like Twitter before it, built their new site to be totally dependent on JavaScript, even down to the page URLs. The JavaScript failed to load, so no content appeared, and every URL on the page was broken. In terms of site brittleness, Gawker’s new implementation got turned up to 11.
Every URL on Lifehacker now looks like this:
http://lifehacker.com/#!5753509/hello-world-this-is-the-new-lifehacker
Before Monday the URL was almost the same, but without the #!. So what?
Fragment identifiers
The # is a special character in a URL: it marks the rest of the URL as a fragment identifier, so everything after it refers to an HTML element id, or a named anchor in the current page. The current page here being the Lifehacker homepage.
So on Sunday Lifehacker was a 1 million page site; today it's a one-page site with 1 million fragment identifiers.
Why? I don't know. Twitter's response, when faced with this question on launching "New Twitter", was that Google can index individual tweets. True, but Google could do that with the previous proper URL structure too, and with much less overhead.
A solution to a problem
The #!-baked URL (hash-bang) syntax first came into the general web developer spotlight when Google announced a method web developers could use to allow Google to crawl Ajax-dependent websites.
Back then best practice web development wasn’t well known or appreciated, and sites using fancy technology like Ajax to bring in content found themselves not well listed or ranked for relevant keywords because Googlebot couldn’t find the content they’d hidden behind JavaScript calls.
Although Google spent many laborious hours trying to crack this problem, they eventually admitted defeat and tackled the problem in a different manner. Instead of trying to find this mythical content, let’s get website owners to tell us where the content actually is. They produced a specification aimed at doing just that.
In writing about it, Google were careful to stress that web developers should develop sites with progressive enhancement and not rely on JavaScript for their content, noting:
If you’re starting from scratch, one good approach is to build your site’s structure and navigation using only HTML. Then, once you have the site’s pages, links, and content in place, you can spice up the appearance and interface with Ajax. Googlebot will be happy looking at the HTML, while users with modern browsers can enjoy your Ajax bonuses.
So the #! URL syntax was especially geared for sites that got the fundamental web development best practices horribly wrong, and gave them a lifeline to getting their content seen by Googlebot.
And today, this emergency rescue package seems to be regarded as the One True Way of web development by engineers from Facebook, Twitter, and now Lifehacker.
Clean URLs
In Google’s specification, they call the #!-patterned URLs pretty URLs, and they are transformed by Googlebot (and other crawlers supporting Google’s lifeline specification) into something more grotesque.
On Sunday, Lifehacker’s URL scheme looked like this:
http://lifehacker.com/5753509/hello-world-this-is-the-new-lifehacker
Not bad. The 7-digit number in the middle is the only unclean thing about this URL, and Gawker’s content system needs that as a unique identifier to map to the actual article. So it’s a mostly clean URL.
Today, the same piece of content is now addressable via this URL:
http://lifehacker.com/#!5753509/hello-world-this-is-the-new-lifehacker
This is less clean than before: the addition of the #! fundamentally changes the structure of the URL:
- The path /5753509/hello-world-this-is-the-new-lifehacker becomes /
- A new fragment identifier of !5753509/hello-world-this-is-the-new-lifehacker gets added
What does this achieve? Nothing. And the URL mangling doesn’t end there.
Google’s specification says that it will transform the hash-bang URL into a query string parameter, so the example URL above becomes:
http://lifehacker.com/?_escaped_fragment_=5753509/hello-world-this-is-the-new-lifehacker
That uglier URL actually returns the content of the article. So this is the canonical reference to this piece of content. This is the content that Google indexes. (This is also the same with Twitter’s hash-bang URLs.)
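The rewrite a crawler performs can be sketched in a few lines. This is a simplified illustration rather than a full implementation of Google's specification: the function name toEscapedFragmentUrl is made up here, and the sketch assumes the fragment contains no characters that need extra percent-escaping (which holds for the example URL).

```javascript
// Simplified sketch of the crawler-side rewrite: "#!" becomes "_escaped_fragment_=".
// Assumption: the fragment needs no extra percent-escaping (true for the example URL).
function toEscapedFragmentUrl(prettyUrl) {
  var marker = prettyUrl.indexOf("#!");
  if (marker === -1) {
    return prettyUrl; // not a hash-bang URL, nothing to rewrite
  }
  var base = prettyUrl.slice(0, marker);
  var fragment = prettyUrl.slice(marker + 2);
  // Append to an existing query string if there is one, otherwise start one.
  var separator = base.indexOf("?") === -1 ? "?" : "&";
  return base + separator + "_escaped_fragment_=" + fragment;
}

console.log(toEscapedFragmentUrl(
  "http://lifehacker.com/#!5753509/hello-world-this-is-the-new-lifehacker"
));
```

Every crawler that wants the content has to carry a step like this; a plain HTTP client that simply follows links never will.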
This URL scheme looks a lot like:
http://example.com/default.asp?page=about_us
Lifehacker/Gawker have thrown away a decade’s worth of clean URL experience, and ended up with something that actually looks worse than the typical templated Classic ASP site. (How much more Frontpage can you get?)
Clean? Not on your life!
What’s the problem?
The main problem is that Lifehacker URLs now don’t map to actual content. Well, every URL references the Lifehacker homepage. If you are lucky enough to have the JavaScript running successfully, the homepage then triggers off several Ajax requests to render the page, hopefully with the desired content showing up at some point.
Far more complicated than a simple URL, far more error prone, and far more brittle.
So, requesting the URL assigned to a piece of content doesn’t result in the requestor receiving that content. It’s broken by design. Lifehacker is deliberately preventing crawlers from following links on the site towards interesting content. Unless you jump through a hoop invented by Google.
Why is this hoop there?
The why of hash-bang
So why use a hash-bang if it’s an artificial URL, and a URL that needs to be reformatted before it points to a proper URL that actually returns content?
Out of all the reasons, the strongest one is “Because it’s cool”. I said strongest not strong.
Engineers will mutter something about preserving state within an Ajax application. And frankly, that’s a ridiculous reason for breaking URLs like that. The URL of an href can still be a proper addressable reference to content. You are already using JavaScript, so you can do this damage much later with JavaScript using a click handler on the link. The transform between last week’s Lifehacker URL scheme and this week’s hash-bang mangling is trivial to do in JavaScript using a click handler.
At the risk of invoking the wrath of Jamie Zawinski, Lifehacker can keep its mostly clean URL of last week (http://lifehacker.com/5753509/hello-world-this-is-the-new-lifehacker) and obtain the mangled version by this regular expression:
var mangledUrl = this.href.replace(/(\d+)/, "#!$1");
Doing this mangling in JavaScript (during the click handler of the link) means you keep your apparent state benefits, but without needlessly preventing crawlers from traversing your site, and without breaking in any other non-JavaScript eventuality.
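A minimal sketch of that click-handler idea follows. The function names toHashBang and enhanceLinks are hypothetical, and the sketch assumes the first run of digits in a same-site link is the article id (a hostname containing digits would break that assumption). The markup keeps real URLs; the rewrite only happens when a script-capable browser handles the click.

```javascript
// Same substitution as in the text: prefix the first run of digits with "#!".
function toHashBang(href) {
  return href.replace(/(\d+)/, "#!$1");
}

// Hypothetical wiring: rewrite same-site links only at click time,
// so the href in the markup stays a real, crawlable URL.
function enhanceLinks(doc, win) {
  var links = doc.querySelectorAll("a[href]");
  Array.prototype.forEach.call(links, function (link) {
    if (link.hostname !== win.location.hostname) {
      return; // leave external links alone
    }
    link.addEventListener("click", function (event) {
      var mangled = toHashBang(link.href);
      if (mangled !== link.href) {
        event.preventDefault();
        win.location.href = mangled;
      }
    });
  });
}

// Only wire things up when running in a browser.
if (typeof document !== "undefined" && typeof window !== "undefined") {
  enhanceLinks(document, window);
}
```

If the script never loads, nothing is rewritten and every link still points at a URL the server can answer.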
Disallow all bots (except Googlebot)
All non-browser user-agents (crawlers, aggregators, spiders, indexers) that completely support both HTTP/1.1 and the URL specification (RFC 2396, for example) cannot crawl any Lifehacker or Gawker content. Except Googlebot.
This has ramifications that need to be considered:
- Caching is now broken: since intermediary servers have no canonical representation of content, they are unable to cache it. This results in Lifehacker being perceived as slower. It means Gawker don’t save bandwidth costs by any edge caching of chunks of content, and they are on their own in dealing with spikes of traffic.
- HTTP/1.1 and RFC-2396 compliant crawlers now cannot see anything but an empty homepage shell. This has knock-on effects on the applications and services built on such crawlers and indexers.
- The potential use of Microformats (and upper-case Semantic Web tools) has now dropped substantially - only browser-based aggregators or Google-led aggregators will see any Microformatted data. This removes Lifehacker and other Gawker sites from being used as datasources in Hackdays (rather ironic, really).
- Facebook Like widgets that use page identifiers now need extra work to allow articles to be liked. (by default, since the homepage is the only page referenceable by a non-mangled URL, and all mangled URLs resolve down to being the homepage)
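The common cause behind these ramifications is that the fragment never leaves the client, so servers, caches and crawlers only ever see the path and query. A small sketch of that, assuming an environment with the standard URL class (modern browsers, Node.js); the second article URL is invented for illustration:

```javascript
// What a server (or cache, or crawler) sees: the request target is path + query.
// The fragment stays in the client and is never sent over HTTP.
function requestTarget(url) {
  var parsed = new URL(url);
  return parsed.pathname + parsed.search;
}

var clean = "http://lifehacker.com/5753509/hello-world-this-is-the-new-lifehacker";
var hashBang = "http://lifehacker.com/#!5753509/hello-world-this-is-the-new-lifehacker";
// Hypothetical second article, made up for this example.
var another = "http://lifehacker.com/#!1234567/a-hypothetical-second-article";

console.log(requestTarget(clean));    // the article's own path
console.log(requestTarget(hashBang)); // just the homepage path
console.log(requestTarget(another) === requestTarget(hashBang)); // both collapse to one resource
```

Two different articles produce the same request, which is why an intermediary has nothing distinct to cache and a crawler has nothing distinct to fetch.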
Being dependent on perfect JavaScript
If content cannot be retrieved from a server given its URL, then that site is broken. Gawker have deliberately made the decision to break these URLs. They’ve left their site availability open to all sorts of JavaScript-related errors:
- JavaScript failing to load led to a 5-hour outage on all Gawker media properties on Monday. (Yes, Sproutcore and Cappuccino fans, empty divs are not an appropriate fallback.)
- A trailing comma in an array or object literal will cause a JavaScript error in Internet Explorer. For Gawker, this will translate into a complete site outage for IE users.
- A debugging console.log line accidentally left in the source will cause Gawker’s site to fail when the visitor’s browser doesn’t have the developer tools installed and enabled (Firefox, Safari, Internet Explorer).
- Adverts regularly trip up with errors. So Gawker’s site availability is completely in the hands of advert JavaScript. Experienced web developers know that JavaScript snippets from advertisers are the worst lumps of code out there on the web.
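For the stray console.log case in the list above, one defensive habit is to check that a console exists before using it. A minimal sketch; safeLog is a hypothetical helper, and host stands in for the global object (window in a browser):

```javascript
// Log only when the host environment actually provides console.log.
// `host` stands in for the global object (window in a browser).
function safeLog(host, message) {
  if (host && host.console && typeof host.console.log === "function") {
    host.console.log(message);
    return true;
  }
  return false; // no console: skip the log instead of throwing
}

safeLog({ console: console }, "article loaded"); // logs normally
safeLog({}, "dropped silently");                 // no console available, no error
```

This only papers over one failure mode. The site still depends on every other script on the page running cleanly.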
Such brittleness, with no real reason or benefit that outweighs the downside. There are far better methods than what Gawker adopted; even HTML5’s History API (with appropriate polyfills) would be a better solution.
(If you thought that invalid XHTML delivered with the correct mimetype was not fit for the web, this JavaScript mangled-URLs approach is far worse)
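The History API alternative can be sketched as progressive enhancement: links keep their real URLs, and pushState is used only where the browser supports it. The names navigate and loadContent are hypothetical; loadContent stands for whatever callback fetches and renders the article for a given URL.

```javascript
// Progressive enhancement sketch: links keep real URLs; pushState is used only
// when available. `loadContent` is a hypothetical callback that fetches and
// renders the article for the given URL.
function navigate(win, href, loadContent) {
  var canPush = Boolean(win.history && typeof win.history.pushState === "function");
  if (!canPush) {
    win.location.href = href; // fall back to an ordinary page load
    return false;
  }
  win.history.pushState({ href: href }, "", href); // address bar shows the clean URL
  loadContent(href);
  return true;
}
```

In a click handler this would be called as navigate(window, link.href, loadArticle) after event.preventDefault(), where loadArticle is again an assumed function. Because the address bar always holds a URL the server can answer directly, crawlers, caches and browsers without JavaScript keep working.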
An Architectural Nightmare
Gina Trapani tweets: “Lay down your pitchforks and give @Lifehacker’s redesign a week before you swear it off and insist that the staff doesn’t care about you.”
A week won’t solve Gawker’s architectural nightmare.
Gawker/Lifehacker have violated the principle of progressive enhancement, and they paid for it immediately with an extended outage on day one of their new site launch. Every JavaScript hiccup will cause an outage, and directly affect Gawker’s revenue stream and the trust of their audience.
Updates (9th February 2011)
Wow. I (and my VPS) am overwhelmed by the conversation this post has sparked. Thank you for contributing towards a constructive discussion. Some of the posts that caught my eye today:
All of the features that hash-bangs are providing can be done today in a safer, more web-friendly way with HTML5's pushState from the History API. (thanks Kerin Cosford & Dan Sanderson)
The Next Web reports that Gawker blogs have disappeared from Google News searches. A Gawker media editor is quoted as saying that they hope to have it resolved soon. They are listed again but using the _escaped_fragment_ form of the URL. So much for clean URLs. Though, the link seems intermittently broken, claiming the URL requested is not available (with a redirect to http://gawker.com/#ERR404).
I did like this tl;dr summary of this post over on theawl.com by mrmcd.
Webmonkey have a summary story, but link off to some very handy resources for clean URL strategies. (I first learnt HTML from Webmonkey back in the previous century)
Phillip Tellis, one of the handful of Yahoos I regret not meeting, blogs some Thoughts on performance, well worth reading. Also highly recommended is warpspire's URL Design.
Danny Thorpe talks about Side effects of hash-bang URLs, including URL Cache equivalence. Oliver Nightingale has a nicely worked example using HTML5's pushState in a progressively enhanced way (great job!)
The very short geeky summary of this post (try curling a Lifehacker article canonical URL):
$ curl http://lifehacker.com//hello-world-this-is-the-new-lifehacker | grep "Hello"
$
or as Ben Ward put it: If site content doesn’t load through curl it's broken.
Broken HTTP Referrers
Watching my logfiles I'm seeing a number of inbound links to this post from gawker.com and kotaku.com - from the homepage (i.e. the fragment identifier is stripped out). So somewhere on those sites there's a discussion going on about my post, and there's no way of finding it thanks to Gawker's use of hash-bang URLs.
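Why the logs only show the homepage can be illustrated with a rough approximation of how a browser builds the Referer header: the page URL with its fragment removed. The function name refererFor and the discussion URL are invented for this sketch, which assumes an environment with the standard URL class.

```javascript
// Approximation of the Referer value a browser sends: the page URL minus its fragment.
function refererFor(pageUrl) {
  var parsed = new URL(pageUrl);
  parsed.hash = ""; // fragments are never included in the Referer header
  return parsed.href;
}

// Hypothetical discussion URL; the article id and slug are made up for illustration.
console.log(refererFor("http://gawker.com/#!1234567/a-hypothetical-discussion"));
```

With hash-bang URLs the only part that identifies the article is the part that gets dropped.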
@mrjyn
February 19, 2011
Hash-bangs: 10% of me that understands this is pissed off!
Singer Faces 20 Years In Prison for YouTube Prank on Kids
Adrian Chen —
21-year-old Michigan resident Evan Emory currently faces 20 years in prison for "manufacturing child sexual abusive material". His crime: He posted a YouTube video that made it appear he was singing an explicit song to a classroom of elementary students.
Emory tricked administrators at Beechnau Elementary School into letting him perform a song for the kids on video, claiming he wanted to build his portfolio. He sung an innocent song in front of the kids, but when the room was empty recorded a sexually explicit song. ("I like the way you make your body move. C'mon, girl...See how long it takes to make your panties mine...I'll add some foreplay in just to make it fun. I want to stick my index finger in your anus.")
Through trick editing, Emory made it appear that he had been singing the song to the kids while they smiled and laughed along. He included a disclaimer—"No children were exposed to the 'graphic content' of this video"—and posted it on YouTube earlier this week.
On Wednesday, Emory was arrested on charges of manufacturing "child sexual abusive material". Said the county prosecutor:
"The bottom line in this case is that he walked into a classroom and took advantage and victimized every single child in that classroom," Tague said.
"This case is very disturbing to law enforcement officials. We have launched a full-fledged investigation with the sheriff."
At his arraignment, outraged parents of the kids in the video appeared at the courthouse to rally for jail time.
We can understand why the parents and school would be upset. But these are clearly laws designed to punish hardcore sex offenders—not some bro who came up with a misguided idea for a prank. In the end, the video appears to have been online for about a day or two and was probably seen by a few hundred people at most. This is a very broad definition of "victimization!" One law professor says the charges are likely unconstitutional.
As Radley Balko points out, the hysteria is fueled by the volatile combination of children + sex + The Internet. Add to that an overreaction by a humiliated school district. Here's hoping the judge realizes this, too.
Note: The embedded video is another one of Emory's pranks—not the video in question
HTML5 Periodic Table
HTML5 Elements
The table below shows the 104 elements currently in the HTML5 working draft and two proposed elements (marked with an asterisk).
How are they used?
Periodic Table of the Elements
Elements for html5advent.com
1html col table 1head 79span fieldset form 1body 25h1
25section colgroup tr 1title 216a pre meter select
aside 25h2 1header caption td 6meta rt dfn em i 24small ins hr 2br 86div blockquote legend optgroup address 21h3 nav menu th base
rp abbr time b 48strong del s 87p ol dl label option datalist 3h4 1article command tbody 6link noscript q var sub mark kbd
wbr figcaption 12ul dt input output keygen h5 1footer summary thead style 6script cite samp sup
ruby bdo code
figure 72li dd textarea button progress h6
22hgroup details tfoot 61img area map embed object param source iframe canvas
track* audio video
device*
<section>: Contains elements grouped by theme, for example a chapter or tab box.
<aside>: Content related to surrounding elements that doesn't belong inline, such as advertising or quotes.
<rp>: Contains semantically meaningless markup for browsers that don't understand ruby annotations.
<wbr>: Opportunity for a line break.
<ruby>: Contains text with annotations, such as pronunciation hints. Commonly used in East Asian text.
<figure>: Contains elements related to a single concept, such as an illustration or code example.
<hgroup>: Collection of headings for the current section. The highest ranked heading represents the group in the document outline.
<track>: Specifies external timing track for media elements. This element is still being drafted.
<device>: Allows scripts to access devices such as a webcam. This element is still being drafted.
Root element
Text-level semantics
Forms
Tabular data
Metadata and scripting
Grouping content
Document sections
Interactive elements
Embedding content