How to scrape page data using the ImportXML function in Google Sheets

When I’m doing content analysis on GOV.UK, sometimes I need to analyse page level data which is not tracked through analytics software but still represents valuable information. In these circumstances I use the ImportXML feature in Google Sheets which enables me to extract (‘scrape’) specific page data.

Recently, I’ve been working on a group of education content pages on GOV.UK. I wanted to be able to scrape when those pages were last updated in order to sort them based on the most recent update. The last updated information is found at the bottom of each of my pages :

pagelocation — Scraping this information enables bulk page analysis.

However as the last updated information has no metric to track within analytics software, I used the ImportXML function in Google Sheets. My results can be found in this spreadsheet using an example of the pages that I was interested in.

Finding the XPath

The location of the last updated information is consistent across all of the pages I wanted to analyse. This is handy as ImportXML requires you to identify the XPath of the specific piece of data you want to scrape. Therefore once I was able to identify the XPath of the last updated information on one of my pages, I could implement it into the Google Sheets formula and then apply it to all of the URLs I was interested in.

Using Google Chrome, the XPath of specific page data can easily be found by highlighting and then right-clicking on it and then selecting ‘Inspect’. Firefox users can get XPath data from the Firebug add-on.

Right click on page to bring up inspect option in Google Chrome. — Right click on page to bring up the Inspect option in Google Chrome.

Clicking ‘Inspect’ opens up the elements console and highlights the section of the page that I am interested in. Below we can see the console has identified where the last updated information resides.

2. Console — The Elements console highlights the selected page section.

The console allows me to scrape the XPath of the last updated information by right-clicking onto highlighted section, selecting ‘Copy’ and then ‘Copy XPath’.

Copy XPath — Google Chrome enables you to copy XPath directly from elements console.

Creating the ImportXML formula

On the spreadsheet, in the column next to my list of URLs I entered the ImportXML formula in the format:

=ImportXML(Cell reference of URL, XPath)

I pasted the copied XPath from the Chrome console and hit enter to get the cell to show the scraped date information on my spreadsheet.

Last updated on spreadsheet — ImportXML scrapes data exactly as it is represented on page.

Sorting by date

Now that I have all the last updated dates for my pages, I'd like to sort them by most recently updated. To do this, I need to convert the date text string into a date value Google Sheets recognises. First, I need to remove ‘Last Updated:’ and just keep date information ‘8 December 2015’. I did this by using the MID function on Google Sheets in the next column (D):

MID function allows us to specify and return the date alone from last updated information.

Next I needed to convert the text string date (8 December 2015) into a date value (08/12/2015) that Google Sheets could sort. Therefore I used the DATEVALUE function on Google Sheets in the next column:

DATEVALUE function helps convert the date string from last updated information into a sortable date value.

Dragging this formula down the column for all your URLs provides the last updated information in a sortable format:

Date sort — Now we are able to sort the pages by most recent update.

Application

I can use the last updated information in other relevant Google Sheets based analytics dashboards to monitor content performance following an update etc.

We're keen on hearing about your experiences of using ImportXML with Google Sheets. Please share any tips and examples you may have in the comments below.

Vin works with performance and communication analytics on GOV.UK content at GDS.

6 comments

Comment by Bill Ghere posted on 03 April 2018

My Copy XPath doesn't look like yours in the example

//*[@id="content"]/div/div[2]/p

The structure is the same, but the details are wrong. Is this a Chrome version thing or what?

Link to this comment
- Replies to Bill Ghere>
  
  Comment by Jonathan Jones @ MoneySuperMarket posted on 31 July 2018
  
  Hi Bill,
  
  I've installed a Chrome plugin called ChroPath. What you are looking for is the "rel XPath" rather than the "abs XPath" which Chrome now gives you.
  
  Thanks!
  Jonny
  
  Link to this comment
Comment by John Bambara posted on 06 March 2017

Thanks for posting this.
It set me in the right direction.
I have also added some script to run the import automatically every hour, then added in script to show last update time.

function getData() {
var queryString = Math.random();

var cellFunction1 = '=IMPORTXML("' + SpreadsheetApp.getActiveSheet().getRange('B2').getValue() + '?' + queryString + '","'+ SpreadsheetApp.getActiveSheet().getRange('C2').getValue() + '")';
SpreadsheetApp.getActiveSheet().getRange('D2').setValue(cellFunction1);
}
B2 is the web address reference, C2 is the Xpath, D2 is where i put the result.

Link to this comment
Comment by James Tuller posted on 07 February 2017

Hello - I'm attempting to pull in the 5yr dividend yield from reuters on a list of 30 stocks. I've got the xpath, but having trouble. The stock ticker is in cell a2- =ImportXML(http://www.reuters.com/finance/stocks/financialHighlights?symbol=&a2&, //*[@id="content"]/div[3]/div/div[2]/div[1]/div[6]/div[2]/table/tbody/tr[3]/td[2])

Link to this comment
Comment by David Saunders posted on 11 January 2017

Thanks Vin,
does this formula's fundamentals also work when the date modified field is part of a meta / dublin core tag?

<meta name="DCTERMS.modified" content="2015-06-24T09:27:57+10:00" schema="DCTERMS.ISO8601" />

Link to this comment
Comment by Alex posted on 03 January 2017

Hi Vinith, thank you for the article. Great peace of advice as always.

How do you overcome the problem of constant "Loading" or not proceeding the formulas in Google Spreadsheet at all? Do you have any remedy to that if lets say one wants to gather data from 1.000 URLs?

Thank you in advance,
Alex

Link to this comment

Blog Data in government

How to scrape page data using the ImportXML function in Google Sheets

Finding the XPath

Creating the ImportXML formula

Sorting by date

Application

Share this page

6 comments

About this blog

Subscribe to Open Notes

Categories

Recent posts

Sign up and manage updates

Finding the XPath

Creating the ImportXML formula

Sorting by date

Application

Sharing and comments

Share this page

6 comments

Related content and links

About this blog

Subscribe to Open Notes

Categories

Recent posts

Sign up and manage updates