Advanced Insights on Google Indexing, Crawling, and JavaScript Rendering


This blog post is a summary of the “Deliver search-friendly JavaScript-powered websites” session at Google I/O 2018, with an e-commerce lens applied plus a few personal opinions thrown in. This talk is so important, I thought it was worth its own blog post. The presentation describes how Google crawls and indexes sites, including how it deals with client side rendered sites as made common by frameworks like React, Vue, and Angular. (Client side rendering refers to when JavaScript in the browser (the client) forms the HTML to display on the page, as distinct from server side rendering, where code on the web server forms the HTML to be returned to the browser for display.)
This discussion is particularly relevant to e-commerce websites that have a Progressive Web App (PWA) built with a technology such as React, Vue, or Angular and that want all of their product and category pages indexed.
Crawling, Rendering, and Indexing
How does Google collect web content to index? The process consists of three steps that work together: crawling, rendering, and indexing.
Crawling is the process of retrieving pages from the web. The Google crawler follows <a href="…"> links on pages to discover other pages on a site. Sites can have a robots.txt file to block crawlers from fetching particular pages and a sitemap.xml file to explicitly list URLs that a site would like to be indexed.
For example, an e-commerce site might put all product pages into the sitemap.xml file in case products are not reachable by crawling. (For example, if JavaScript is used for the category navigation UI, there may be no <a href="…"> links in the HTML for the crawler to discover the product pages.) An e-commerce site may also block crawling of the checkout page using robots.txt, as that page does not contain valuable content worth indexing.
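A minimal sketch of such a robots.txt file might look like the following (the /checkout/ path and the sitemap URL are hypothetical placeholders for your own site's paths):

    User-agent: *
    Disallow: /checkout/

    Sitemap: https://www.example.com/sitemap.xml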
To play well with crawlers, sites should have a canonical URL for each page so that a crawler can determine whether two URLs lead to the same content. (A site might have multiple URLs that return the same page. One of the URLs should be nominated as the “canonical” URL.)
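For example, a product page reachable via several URLs (say, with extra tracking or sorting parameters) can declare its preferred URL in the page head. The URL below is a hypothetical placeholder:

    <!-- Whatever URL variant was requested, tell crawlers which URL to index. -->
    <link rel="canonical" href="https://www.example.com/shirts/blue-oxford">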
For a period of time, the rage for client side rendered pages was to use ‘#’ (and ‘#!’ for an even shorter period) in URLs as a way to distinguish multiple pages. This worked with the “back” button in the browser history: following a ‘#’ link normally causes the browser to stay on the current page, which is the desired behavior with client side rendering. However, to get PWA pages (e.g. product pages) indexed, the modern norm is to use the JavaScript browser history API, which allows different URLs to be recorded in the browser history without reloading the current page. This is the best approach to use if you want your site indexed, as Googlebot (and most other crawlers) ignore everything after the ‘#’ character on the assumption that the ‘#’ identifies a different place to start the user on the same page (the original purpose of ‘#’), not a different page.
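A minimal sketch of the history API in client side routing code, assuming a hypothetical renderProductPage() function (the routing libraries in React, Vue, and Angular do this for you):

    // Navigate to a product page without a full reload, while still giving
    // the page a real, crawlable URL (no '#' involved).
    function showProduct(productUrl) {
      history.pushState({ url: productUrl }, '', productUrl); // record the new URL in history
      renderProductPage(productUrl); // hypothetical client side render function
    }

    // Re-render the correct page when the user presses the back/forward buttons.
    window.addEventListener('popstate', (event) => {
      renderProductPage(event.state ? event.state.url : location.pathname);
    });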
After a page is retrieved by a crawler, indexing extracts all the content to send off to the search indexes. This is also when <a href="…"> links to other pages are identified and sent back to the crawler to add to its queue of pages to retrieve.
One useful tip – if your page uses JavaScript to capture button clicks (without <a href="…"> markup), use <a href="…" onclick="…"> so the indexer will still see the URL, even though the user click will be intercepted by the onclick JavaScript handler.
Another tip – you can also use <noscript><a href="…">…</a></noscript> to embed other links you want crawled, but don’t want displayed.
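Putting both tips together, a JavaScript-driven category link might look something like this sketch (the URLs and the openCategory() handler are hypothetical):

    <!-- The href gives crawlers a real URL to follow; the onclick handler
         intercepts the click and renders the page client side instead. -->
    <a href="/category/shoes" onclick="openCategory('/category/shoes'); return false;">
      Shoes
    </a>

    <!-- Additional links for crawlers that are not displayed to users. -->
    <noscript>
      <a href="/category/shoes/sale">Shoes on sale</a>
    </noscript>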
Rendering is a step between crawling and indexing, created by the challenge of client side rendered pages. If a page is server side rendered, the crawler will have all of the content to be indexed already – no further rendering is required. If the page is client side rendered, JavaScript must be run to form the DOM (the HTML for the page) before the indexer can do its job.
At Google, that rendering is currently done using a farm of machines running Chrome 41, a somewhat old version of Chrome. This will be updated at some stage (maybe late 2018). That means if the JavaScript on a site uses newer JavaScript features that Chrome 41 does not support (and they are not transpiled or polyfilled away), rendering will fail.
A second problem is that client side rendering requires more CPU. Rather than doing such rendering in real time, Google currently sends the available markup immediately for indexing, and then also sends the page to a secondary queue for additional processing by running the JavaScript on the page. Spare CPU capacity is used to perform such rendering, which could result in a client side rendered page being delayed by multiple days before its content is available to the indexer. (No time guarantees are provided – you can imagine the queue getting longer if multiple major sites rolled out new PWA support at the same time.) The old version of the page is then replaced by the enriched version when available. This makes client side rendering less desirable for sites with frequent updates – the index may continuously lag behind the current content. It also means that crawling links to other pages on a site may take multiple crawl iterations, each one incurring a potentially multi-day delay (if the pages are not all listed in the sitemap.xml file).
Another issue with client side rendering is that not all non-Google crawlers support running JavaScript at all, so some indexers may not pick up all the content on your site.
So how best to build a PWA that can also be indexed?
Server Side, Client Side, Dynamic, and Hybrid / Universal Rendering
Server side rendering, as mentioned before, is where the web server returns all the HTML ready for display. This provides a fast first page load experience for users and is very friendly to indexers, but by definition it is not a PWA.
Client side rendering of pages, in comparison, requires all the relevant JavaScript files to be downloaded, parsed, and executed before the HTML to display is available. There are lots of clever tools around that try to break up the JavaScript into smaller files so the code can be downloaded incrementally as the user traverses from page to page on a site. Client side rendering is often slower for the first page, but faster for subsequent pages once JavaScript and CSS files start to get cached in the browser.
Dynamic rendering, introduced in the presentation, is where a web server looks at the User-Agent header and returns a server side rendered page when the Google crawler fetches a page, and a client side rendered version for normal users. (The server side rendered page can probably be relatively plain looking, but should contain the same content as the client side rendered page.) You just look for “Googlebot” (or the equivalent for other crawlers) in the User-Agent header to work out if the request is coming from a crawler. (For extra safety you can also perform a reverse DNS lookup on the inbound IP address to make sure it really is coming from the Googlebot crawler.)
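A rough sketch of that check in an Express-style Node.js server, assuming a hypothetical renderForBots() function (a real setup might instead proxy crawler requests to something like Rendertron):

    const express = require('express');
    const path = require('path');
    const app = express();

    // Crude crawler detection based on the User-Agent header (hypothetical list).
    const BOT_PATTERN = /googlebot|bingbot|yandexbot|baiduspider/i;

    app.get('*', (req, res) => {
      const userAgent = req.headers['user-agent'] || '';
      if (BOT_PATTERN.test(userAgent)) {
        // Serve pre-rendered HTML to crawlers.
        res.send(renderForBots(req.url)); // hypothetical server side renderer
      } else {
        // Serve the normal client side rendered app shell to users.
        res.sendFile('index.html', { root: path.join(__dirname, 'dist') });
      }
    });

    app.listen(3000);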
Hybrid / Universal rendering is also becoming more widely supported by frameworks such as React, Vue, and Angular. Hybrid rendering is where the web server performs server side rendering of the first page (resulting in faster page display in the browser, as well as simplifying the job for crawlers) then uses client side rendering for subsequent pages. Today, this is easiest to implement when the web server runs JavaScript. (Magento for example runs PHP on the server side, which makes it harder to server side render React components as planned in the upcoming PWA Studio.)
Projects like VueStorefront.io and FrontCommerce do this today, and it could be added to PWA Studio in the future or by a helpful community member.
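For a JavaScript web server, the first-page render can look roughly like this sketch using React’s server rendering API (the App root component and the bundle path are hypothetical):

    const express = require('express');
    const React = require('react');
    const { renderToString } = require('react-dom/server');
    const App = require('./App'); // hypothetical root component

    const app = express();

    app.get('*', (req, res) => {
      // Render the first page to HTML on the server; the browser then takes
      // over and handles subsequent navigation with client side rendering.
      const html = renderToString(React.createElement(App, { url: req.url }));
      res.send(`<!DOCTYPE html>
    <html>
      <body>
        <div id="root">${html}</div>
        <script src="/bundle.js"></script>
      </body>
    </html>`);
    });

    app.listen(3000);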
Other Tools
There are other tools that can be worth checking out.
  • Puppeteer is a JavaScript library that can control a headless version of Chrome, allowing interesting automation projects (see the short sketch after this list).
  • Rendertron is an open source middleware project which can act as a proxy in front of your web site, executing the client side JavaScript and returning the resulting rendered page.
  • The Google Search Console allows you to explore how Google indexes your site. It has a number of new tools such as “show me my page as Googlebot sees it” which is useful for debugging. It also contains a tool to see how mobile-friendly a website is (try it on multiple pages of your website). This can also be useful for spotting cases where robots.txt blocks files you thought were unnecessary but that actually affect how Googlebot renders a page. (There is a desktop tool as well.)
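As a small sketch of the kind of automation Puppeteer enables, the following loads a page, waits for client side rendering to settle, and prints the resulting HTML (the URL is a placeholder):

    const puppeteer = require('puppeteer');

    (async () => {
      const browser = await puppeteer.launch();
      const page = await browser.newPage();

      // Load the page and wait for network activity to go quiet so the
      // client side JavaScript has a chance to render the content.
      await page.goto('https://www.example.com/product/123', { waitUntil: 'networkidle0' });

      // Grab the fully rendered HTML – roughly what a rendering crawler would see.
      const html = await page.content();
      console.log(html);

      await browser.close();
    })();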
Other Gotchas
Some other common issues that arise when pages are crawled include:
  • If you lazy load images using JavaScript, the images may not be found and included in the image search indexes. You can consider using <noscript><img src="…"></noscript> to include references to such images without displaying them, or embedding “Structured Data” markup on the page (see the sketch after this list).
  • Infinite scroll style applications that load more content as you scroll down the page (using JavaScript) require thought as to how much of the page Googlebot should see for indexing purposes. One approach is to render the longer page but hide it using CSS; another is to create separate pages for Google to index.
  • Make sure your pages are performant. Google will time out and skip pages that are too slow to return.
  • Make sure your pages don’t assume the user first visited the home page (to set up “browser data” or similar). Googlebot performs stateless requests – no state from previous requests is retained, to mimic what a user landing on the site will see.
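For the lazy loading point above, a sketch of the <noscript> fallback might look like the following (the image URL and the data-src attribute are hypothetical, matching whatever your lazy loading library expects):

    <!-- The visible image is swapped in by JavaScript when it scrolls into view. -->
    <img class="lazy" data-src="https://www.example.com/images/blue-shirt.jpg" alt="Blue shirt">

    <!-- Plain fallback so image crawlers still find the real image URL. -->
    <noscript>
      <img src="https://www.example.com/images/blue-shirt.jpg" alt="Blue shirt">
    </noscript>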
Conclusions
If you care about your site being visible in search indexes such as Google, and you are going to build a PWA, you need to think about how it is going to be indexed. If you need indexes to be updated promptly, the current best practice is to have the first page of the PWA server side rendered (using Hybrid/Universal rendering). This will work across the widest range of crawlers, with an additional benefit of the first page (normally) being faster to display (a traditional weakness of pure client side rendered solutions). Luckily the major PWA frameworks have Universal rendering support to reduce the effort required to get this going, as long as you can run a web server with JavaScript support.