
https://dataingovernment.blog.gov.uk/2016/03/09/using-xpath-and-python-with-the-google-analytics-reporting-api-to-report-on-a-large-data-set/

Using XPath and Python with the Google Analytics reporting API to report on a large data set

Categories: Data science, Google Analytics, Python

A year and a half ago, two GDS Designers asked me, “Can you show us how Highway Code content on the GOV.UK site is performing?”

This would have been a simple request, were it not for the sheer number of pages that existed at that time. You can see what it looked like on the National Archives site:

 

Highway Code pages on the National Archives
Some of the pages formerly within the Highway Code section of GOV.UK

 

There are a number of options you may consider in solving such requests, and some are better than others:

Option One - Collect and collate the data for each page separately

This method for collecting the data would involve the following steps:

  • Loading the ‘All Pages’ report in Google Analytics
  • Filtering each of the relevant pages separately
  • Then downloading the numerous data sets into a spreadsheet
  • Finally collating all of this data into one master sheet

You would then need to repeat these steps for each type of report you need to run. For example you may need a report on referring sites, or previous and next pages, or even the list of searches made on the page.

This method is clearly not feasible and would cause nightmares, as you’d be driven to distraction trying to keep track of everything.

Option Two - Using regular expressions to filter

We could also use the Google Analytics Reporting API with Google Sheets and build a number of regular expressions to filter the pages. However, the 256-character limit within Google’s regular expressions means that we would have to build a lot of reports and join them together.  

This method would also be problematic, as you would need to keep track of all the regular expressions and resulting tabs. This would be further complicated by the number of report types you’re using. Four reports would mean four times as many tabs!

Option Three - A better and more exciting way!

By using an XPath expression to navigate through elements and attributes in an XML document, we can get a list of all the links on the page that we need to report on. We can then use this list to build a Python script. This script will:

  • loop through the list of slugs (links on the page)
  • use the Google Analytics Reporting API to get the data we need for each slug
  • output the data to a handy CSV file for analysis

This method builds on Vin’s blog post ‘How to scrape page data using the ImportXML function in Google Sheets’. Obviously, I’m going to run through the third option here!

Using Python and XPath with the Google Analytics Reporting API

There are 3 main steps for this task:

  1. Set up credentials and Python libraries to use with the Google Analytics reporting API
  2. Build a script that uses the API to:
    • use your credentials to access the API
    • scrape the list of slugs from the /highway-code page
    • get the analytics data from the Google Analytics API
  3. Output the results and analyse them

Setting up your environment

The first thing you need to do is create a project folder that will contain your Python scripts and all your other project files. In order to do the following, you will need administrator rights on your computer.

To use the Google Analytics Reporting API and XPath in Python, you need to use the following Python libraries:

  • google2pandas
  • pandas
  • requests
  • lxml

Installing google2pandas and setting your credentials

The first step to getting access to the Google Analytics Reporting API is to install the google2pandas Python library. You can follow the instructions on how to do this here. These instructions will also show you how to set up your Google Analytics credentials.
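Once google2pandas is installed and your credentials are in place, it is worth running a tiny test query to check that everything works. The snippet below is only a sketch: it assumes the GoogleAnalyticsQuery interface from google2pandas (which talks to the version 3 Core Reporting API), and the credential file names and view ID are placeholders you will need to replace with your own.

    # Sketch of a credentials check - file names and the view ID are placeholders
    from google2pandas import GoogleAnalyticsQuery

    conn = GoogleAnalyticsQuery(
        secrets='client_secrets.json',     # OAuth client secrets from the Google API console
        token_file_name='analytics.dat')   # token file created on first authorisation

    # ask for a handful of rows just to prove the connection works
    df, metadata = conn.execute_query(
        ids='ga:XXXXXXXX',                 # your Google Analytics view ID
        metrics='pageviews',
        dimensions='pagePath',
        start_date='7daysAgo',
        end_date='today',
        max_results=5)

    print(df.head())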

Set up your remaining Python libraries

To install the libraries:

  • open a terminal window
  • navigate to your new project folder
  • type 'pip install pandas lxml requests'
  • hit the return key

In your terminal window you will see something resembling the following:

Using the terminal to install the google2pandas library

Set up and use XPath

The next step is to use XPath, which allows us to get the data we need from the web page https://www.gov.uk/highway-code.

I use a Google Chrome plug-in called XPath Helper, which I find very useful as it allows me to build the expressions I need. It shows me the XPath expression in one window and the results in another, making it easier to experiment with editing XPath expressions to get the data I want.

If you don’t want to use the Chrome extension, you can use the method described by Vin in his recent post.

If you don’t have XPath Helper, you will need to get it from the Google store.

XPath helper from Google Store

Once it is added to your Chrome browser, you then need to navigate to the page you want to scrape.  In this case, the page is https://www.gov.uk/highway-code.

Open up XPath Helper either by clicking on the icon at the top right of your browser.

Open XPath Helper

Or by holding down the Cmd, Shift and X keys on a Mac, or the Ctrl, Shift and X keys on a PC.

This opens up two windows on the top of the page.

Opened XPath helper

To find the XPath of one slug, hover over the link and hold down the Shift key. This will show the XPath to the text of that one item.

One XPath result

This result is not quite what we want, as we need the actual slug.

To get the slug element, we will edit the XPath expression by removing the ‘span’ section and adding ‘/@href’.

One slug returned

This returns one result, but we want the entire list of slugs that are on the page.

To get this, we need to make one more edit and replace the li[1] entry with li[.]

The final result for the XPath expression we will use is:

/html/body[@class='js-enabled']/div[@id='wrapper']/main[@class='hmrc-internal-manuals-frontend-content outer-block']/div[@id='manuals-frontend']/article[@id='content']/div[@class='manual-body']/ol[@class='section-list']/li[.]/a/@href

and results in the following list:

XPath list of slugs

We are now ready to use this expression in our code, so save it in a text file for later use.

Writing the Python script

Open up a new file in your favourite text editor, name it, and save it as a .py file.

First of all, import all the libraries that you will need:

Import the libraries we need in the script
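As a rough sketch, the import block might look something like this (assuming google2pandas exposes a GoogleAnalyticsQuery class):

    import pandas as pd                              # dataframes to collect the results
    import requests                                  # fetch the HTML of the page
    from lxml import html                            # parse the HTML and run XPath queries
    from google2pandas import GoogleAnalyticsQuery   # access the Google Analytics Reporting API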

Then create a variable for the URL of the page we want to scrape.

Create another variable that holds the HTML of the page we are scraping:

Declare variables for the scraped page
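A minimal sketch of those two variables, using requests to fetch the page and lxml to parse it (the variable names are just illustrative):

    url = 'https://www.gov.uk/highway-code'    # the page we want to scrape
    page = requests.get(url)                   # download the raw HTML
    tree = html.fromstring(page.content)       # parse it so we can run XPath expressions on it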

We now use the XPath expression to get the data we want from the page and hold the data in another variable called hcode.

Variable containing the list of slugs
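Applying the XPath expression we saved earlier to the parsed page might look like this sketch (the expression is split across several strings purely for readability):

    hcode = tree.xpath(
        "/html/body[@class='js-enabled']/div[@id='wrapper']"
        "/main[@class='hmrc-internal-manuals-frontend-content outer-block']"
        "/div[@id='manuals-frontend']/article[@id='content']"
        "/div[@class='manual-body']/ol[@class='section-list']/li[.]/a/@href")

    # hcode is now an ordinary Python list of slugs, one per section of the Highway Code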

We can use this list to get the data we need from Google Analytics.

Complete list of slugs

We now need to create empty pandas dataframe variables that we will use to collect and store the metrics data from the API:

Declare empty dataframe variables
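For example, with a hypothetical name matching the one used later in this post:

    df1 = pd.DataFrame()    # empty dataframe that will collect one block of rows per slug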

The next step is to set the dimensions and metrics as variables. These will be the actual dimensions and metrics we want to collect using the Google Analytics Reporting API:

Declaring variables to set the Reporting API Parameters
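For the pageviews report, those variables might look like this sketch; the view ID and date range are placeholders:

    ga_ids     = 'ga:XXXXXXXX'                 # your Google Analytics view ID
    metrics    = 'pageviews,uniquePageviews'   # what we want to measure
    dimensions = 'pagePath'                    # how we want the numbers broken down
    start_date = '2015-01-01'                  # placeholder reporting period
    end_date   = '2015-12-31'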

The penultimate step is to run the Google Analytics Reporting API in a ‘for loop’ that runs through the list of slugs, collects the analytics data for each slug and saves it into the relevant dataframe:

The for loop that calls the Reporting API
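A sketch of that loop, assuming the google2pandas GoogleAnalyticsQuery interface and the placeholder variables defined above:

    conn = GoogleAnalyticsQuery(
        secrets='client_secrets.json',
        token_file_name='analytics.dat')

    for slug in hcode:
        query = {
            'ids'        : ga_ids,
            'metrics'    : metrics,
            'dimensions' : dimensions,
            'filters'    : 'pagePath==' + slug,   # restrict the report to this one slug
            'start_date' : start_date,
            'end_date'   : end_date}
        df, metadata = conn.execute_query(**query)       # one small dataframe per slug
        df1 = pd.concat([df1, df], ignore_index=True)    # append it to the master dataframe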

As the script loops through the list of slugs, the results are collected in a pandas dataframe.  The data in this dataframe is added to the end of a dataframe called df1.

When the script has run through the list of slugs, we can tidy up the data in the dataframe, do some manipulation on it and export it as a CSV.

Tidy up and manipulate the data

In this case, we sort the dataframe by the pageviews column, add a column showing the total number of pageviews, and cumulatively sum the pageviews in another column. With this data we can then create another column that calculates each page's percentage of the total pageviews.
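Written out, that tidy-up stage could look something like this sketch (the column names are assumptions based on the metrics requested above):

    df1 = df1.sort_values('pageviews', ascending=False)        # busiest pages first
    df1['total_pageviews'] = df1['pageviews'].sum()             # the same total on every row
    df1['cumulative_pageviews'] = df1['pageviews'].cumsum()     # running total down the sorted list
    df1['percent_of_total'] = 100 * df1['pageviews'] / df1['total_pageviews']
    df1.to_csv('highway_code_pageviews.csv', index=False)       # export for analysis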

Reusable

The beauty of using a script like this is that once you have this section of code working properly, you can very quickly reuse the same code again, changing it to collect different metrics and dimensions.

For example in the complete script I have quickly and easily reused this code to make separate dataframes for ‘next page’, ‘landing page’ and ‘searches on page’ reports.

The complete script is available on GitHub.

The result

Using my final analysis of the data, the page was updated and it now looks vastly different.

The updated Highway Code page

Please let us know your thoughts on this process in the comments below. It would also be great to hear if you or your team have done anything similar.


3 comments

  1. Comment by Charles Meaden posted on

    Thanks for a really interesting approach here.

    We've done something similar using a combination of two commercial tools.

    1. ScreamingFrog to extract the xpath and URLs
    2. AnalyticsEdge to read the API and then combine the data

  2. Comment by Tim posted on

    Really useful guide. Thank you John.

    In one of the Twitter discussions about this @so_on has posted what looks like an interesting resource containing Python libraries for web scraping and also loads for parsing and processing the data:

    https://github.com/lorien/awesome-web-scraping/blob/master/python.md
    Last updated a few months ago.

    Tim

    • Replies to Tim

      Comment by John Byrne posted on

      Really good stuff there and all in one place - I've used a few of them with varying levels of success