Making your web applications search engine friendly has always been important for scoring high in search engine results. More and more front end applications are evolving into SPAs (Single Page Applications), which are inherently difficult to crawl and can therefore hurt your search ranking. The supposed dilemma between focusing on UX and focusing on SEO is a false one, however. In this post we'll have a look at how you can offer a dynamic, fast and user-friendly web application and still keep search engine crawlers happy.
In the early days of the web, all HTML pages sent to clients were assembled server side. It was in that period that Google first came up with the idea for its search engine. Google's crawler could extract content and structure by simply analysing static HTML pages: the HTML that crawlers consumed was identical to what users got to see in their browsers.
As JavaScript gained ground, more and more HTML was manipulated and created dynamically on the client side. When the W3C started standardising the XMLHttpRequest API in 2006, the cornerstone of the technique popularly known as AJAX, fetching server side content asynchronously after the initial page load became common practice. The introduction of popular JavaScript libraries such as jQuery and, later on, AngularJS made it pretty straightforward to build good looking, dynamic front end applications. The decline of RIA technologies such as Flash and Silverlight gave JavaScript's popularity an additional boost.
Crawlers of search engine providers were lagging behind. For a fair share of public facing websites, there was a discrepancy between what the crawler saw and the end result in the user's browser. In order not to undermine the relevance and completeness of its search engine, Google took the initiative in 2009 and published a 'proposal for making AJAX crawlable'. The proposal describes an agreement between web servers serving dynamic web pages and the Google crawler. Shortly thereafter Google implemented the proposal, after which it became the de facto standard. Competing search engines such as Bing/Yahoo and DuckDuckGo have also started supporting it.
According to Google's AJAX crawling specification, there are two ways to tell crawlers they should activate their AJAX crawling mechanism when visiting your site:

1. Use hashbang fragments (#!) in your URLs instead of plain hash fragments.
2. Add the following meta tag to the head of your page:

<meta name="fragment" content="!">

Whenever you have a simple web page that doesn't contain any hash fragments but makes heavy use of JavaScript, you can opt in to Google's AJAX crawling mechanism by including this tag. When you're, for example, using HTML5's pushState to enable client side routing, this meta tag is a must have.

Once the crawler encounters a web page that meets one of these requirements, it will transform the URL by adding the query parameter _escaped_fragment_ and appending the hash fragment (if any) as its value. For example, when you have a hashbang URL like this:
www.example.com/#!/about
The crawler will transform it to this:
www.example.com/?_escaped_fragment_=/about
Similarly, when you have a web page that includes the special meta tag, for example at:
www.example.com
Google’s crawler will convert it to the following:
www.example.com/?_escaped_fragment_=
This strange looking query parameter makes sure the complete URL is sent to the web server. According to the standard, a web server should respond to this kind of request with an HTML snapshot: the same end result a user gets to see in their browser, but in pure HTML form. HTML is a crawler's favourite dish, so it will happily index your snapshot and expose the clean URL in search results.
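To make this more concrete, here is a minimal sketch of how a Node.js/Express web server could honour such requests. Express itself and the readSnapshot() helper are assumptions of this sketch: readSnapshot() stands in for whatever mechanism you use to fetch a pre-rendered HTML snapshot for a given URL.

var express = require('express');
var app = express();

app.use(function (req, res, next) {
  var fragment = req.query._escaped_fragment_;
  if (fragment === undefined) {
    return next(); // regular browser traffic: serve the SPA as usual
  }
  // Map the escaped URL back to the pretty one the crawler started from,
  // e.g. /?_escaped_fragment_=/about  ->  /#!/about
  // Pages opted in via the meta tag send an empty fragment.
  var prettyUrl = fragment ? req.path + '#!' + fragment : req.path;
  // readSnapshot() is hypothetical: fetch the pre-rendered HTML for that URL.
  readSnapshot(prettyUrl, function (err, html) {
    if (err) { return next(err); }
    res.set('Content-Type', 'text/html');
    res.send(html); // the HTML snapshot the standard asks for
  });
});

app.listen(3000);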
The JavaScript processing capabilities of Googlebot have evolved quite a bit in recent years. With Google taking the lead, other search engines are sure to follow suit. For most websites, this type of processing will be sufficient. However, Google is not clear about which features are supported and which are not. Letting crawlers do all the heavy lifting also impacts their crawling rate since they spend much more time on each page.
When SEO is important to your business, you should definitely consider following Google's crawling protocol. One way to do this is to intercept requests containing the _escaped_fragment_ query parameter yourself and serve a static, trimmed down version of certain parts of your application in pure HTML. This approach, however, is a form of cloaking, which is considered bad practice in SEO land, and it adds extra overhead server side.
There is, however, a much better way to cope with this problem: use a prerenderer. A prerenderer is a tool that runs through your whole site, executes its JavaScript and produces a static HTML version of every page. It then caches those HTML snapshots and serves them to Google or Bing on request. This method gives the best results for SEO: it lets you optimize exactly what the crawler sees and makes sure everything you want search engines to discover is easily found.
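Conceptually, the serving side of such a tool is little more than a cache in front of a headless render step. A rough sketch in JavaScript, where render() is a hypothetical function that loads a URL in a headless browser and calls back with the resulting HTML, and the in-memory cache merely stands in for disk, Redis or S3:

var cache = {}; // stands in for disk, Redis or S3 in a real setup

function getSnapshot(url, callback) {
  if (cache[url]) {
    return callback(null, cache[url]); // serve the previously taken snapshot
  }
  // render() is hypothetical: load the URL in a headless browser,
  // execute its JavaScript and call back with the resulting HTML.
  render(url, function (err, html) {
    if (err) { return callback(err); }
    cache[url] = html; // remember the snapshot for subsequent crawls
    callback(null, html);
  });
}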
You could write your own prerenderer using a headless browser such as PhantomJS or a browser emulator such as HtmlUnit. Luckily, quite a few companies already offer this functionality as a service: BromBone, prerender.io, SnapSearch and seo4ajax. In general they all use the same underlying mechanism: the prerenderer is registered as middleware that intercepts HTTP requests server side. When it detects that a request originates from a crawler (thanks to the _escaped_fragment_ query parameter), it renders the requested page and returns the HTML to the crawler. HTML snapshots are often cached, for example on Amazon S3, so that subsequent crawl attempts can be served quickly. Most of these services are free for a small number of web pages. Prerender.io has even open sourced its complete prerendering middleware, which allows you to run the whole infrastructure in-house if you so desire.
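To give an idea of what that render step involves if you do roll your own, here is a bare-bones PhantomJS script that loads a single page, gives its JavaScript some time to run and prints the resulting HTML snapshot; a real prerenderer adds crawling, smarter "page is ready" detection and caching on top of this.

// snapshot.js -- run with: phantomjs snapshot.js http://www.example.com/#!/about
var system = require('system');
var page = require('webpage').create();
var url = system.args[1];

page.open(url, function (status) {
  if (status !== 'success') {
    console.log('Failed to load ' + url);
    return phantom.exit(1);
  }
  // Crude but common: wait a bit so pending AJAX calls can finish
  // before taking the snapshot.
  setTimeout(function () {
    console.log(page.content); // the fully rendered HTML
    phantom.exit();
  }, 2000);
});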
For some sites, SEO can mean the difference between profit and failure. An investment in SEO should be taken into account upfront, even more so when your company's next web application will be a SPA. Consider using specialized prerender middleware when you take SEO seriously. Crawlers are closing in on prerenderers, but they will need some more time before they can offer the same capabilities, especially when dealing with client side routing.
In the end, it all boils down to optimizing the User Experience. That experience often starts in the Google Search Box.