https://gdsdata.blog.gov.uk/2015/12/22/scraping-page-data-using-importxml-in-google-sheets/

How to scrape page data using the ImportXML function in Google Sheets

When I’m doing content analysis on GOV.UK, sometimes I need to analyse page level data which is not tracked through analytics software but still represents valuable information. In these circumstances I use the ImportXML feature in Google Sheets which enables me to extract (‘scrape’) specific page data.

Recently, I’ve been working on a group of education content pages on GOV.UK. I wanted to be able to scrape when those pages were last updated in order to sort them based on the most recent update. The last updated information is found at the bottom of each of my pages :

pagelocation
Scraping this information enables bulk page analysis.

However as the last updated information has no metric to track within analytics software, I used the ImportXML function in Google Sheets. My results can be found in this spreadsheet using an example of the pages that I was interested in.

Finding the XPath

The location of the last updated information is consistent across all of the pages I wanted to analyse. This is handy as ImportXML requires you to identify the XPath of the specific piece of data you want to scrape. Therefore once I was able to identify the XPath of the last updated information on one of my pages, I could implement it into the Google Sheets formula and then apply it to all of the URLs I was interested in.

Using Google Chrome, the XPath of specific page data can easily be found by highlighting and then right-clicking on it and then selecting ‘Inspect’. Firefox users can get XPath data from the Firebug add-on.

Right click on page to bring up inspect option in Google Chrome.
Right click on page to bring up the Inspect option in Google Chrome.

Clicking ‘Inspect’ opens up the elements console and highlights the section of the page that I am interested in.  Below we can see the console has identified where the last updated information resides.

2. Console
The Elements console highlights the selected page section.

The console allows me to scrape the XPath of the last updated information by right-clicking onto highlighted section, selecting ‘Copy’ and then ‘Copy XPath’.

Copy XPath
Google Chrome enables you to copy XPath directly from elements console.

Creating the ImportXML formula

On the spreadsheet, in the column next to my list of URLs I entered the ImportXML formula in the format:

=ImportXML(Cell reference of URL, XPath)

ImportXML

I pasted the copied XPath from the Chrome console and hit enter to get the cell to show the scraped date information on my spreadsheet.

Last updated on spreadsheet
ImportXML scrapes data exactly as it is represented on page.

Sorting by date

Now that I have all the last updated dates for my pages, I'd like to sort them by most recently updated. To do this, I need to convert the date text string into a date value Google Sheets recognises. First, I need to remove ‘Last Updated:’ and just keep date information ‘8 December 2015’. I did this by using the MID function on Google Sheets in the next column (D):

MID function
MID function allows us to specify and return the date alone from last updated information.

Next I needed to convert the text string date (8 December 2015) into a date value (08/12/2015) that Google Sheets could sort. Therefore I used the DATEVALUE function on Google Sheets in the next column:

Date value function
DATEVALUE function helps convert the date string from last updated information into a sortable date value.

Dragging this formula down the column for all your URLs provides the last updated information in a sortable format:

Date sort
Now we are able to sort the pages by most recent update.

Application

I can use the last updated information in other relevant Google Sheets based analytics dashboards to monitor content performance following an update etc.

We're keen on hearing about your experiences of using ImportXML with Google Sheets. Please share any tips and examples you may have in the comments below.

Vin works with performance and communication analytics on GOV.UK content at GDS.

3 comments

  1. Alex

    Hi Vinith, thank you for the article. Great peace of advice as always.

    How do you overcome the problem of constant "Loading" or not proceeding the formulas in Google Spreadsheet at all? Do you have any remedy to that if lets say one wants to gather data from 1.000 URLs?

    Thank you in advance,
    Alex

    Link to this comment
  2. David Saunders

    Thanks Vin,
    does this formula's fundamentals also work when the date modified field is part of a meta / dublin core tag?

    <meta name="DCTERMS.modified" content="2015-06-24T09:27:57+10:00" schema="DCTERMS.ISO8601" />

    Link to this comment
  3. James Tuller

    Hello - I'm attempting to pull in the 5yr dividend yield from reuters on a list of 30 stocks. I've got the xpath, but having trouble. The stock ticker is in cell a2- =ImportXML(http://www.reuters.com/finance/stocks/financialHighlights?symbol=&a2&amp;, //*[@id="content"]/div[3]/div/div[2]/div[1]/div[6]/div[2]/table/tbody/tr[3]/td[2])

    Link to this comment