In this post we’ll look at our work in the MoJ, and in the next blog we’ll share plans for data improvement in the wider criminal justice system.
Data underpins our work in delivering justice outcomes. It helps us measure the impact of policy interventions, gives us operational insight into prisons and probation, and helps us deliver better services for our users – along with so much more. Yet too often, our data is fragmented, hard to share and not exploited to its fullest extent.
The Data Improvement team is focused on improving the quality of data, access to data and the data skills of staff, so that the MoJ and the wider criminal justice system can make better decisions based on data and improve outcomes for the millions of people who rely on the justice system.
We are creating the foundations for our colleagues – in data science, data linking, analysis, operations and more – to be able to deliver the data-driven insight the MoJ relies on.
Our roadmap outlines our data improvement strategy for the next three years. As with any roadmap, we have most confidence about the activity that’s coming up in the near future, and our work in the next few months will inform our work over the coming years.
Following on from our discovery work, we’ve been developing prototypes of processes and tools to improve data quality, access to data and data skills. For the next few months, we’ll continue to test these solutions with users and iterate them. We will:
Next, we need to build up our improvement toolkit, by testing our ideas with more partners. From the middle of 2024, we’ll draw together our different prototypes and strands of work on exemplar end-to-end services or datasets. We will:
Once we have confidence in our approach and our skills, we can support other teams to lead within their own area. We can enable them to use their subject matter expertise and our processes and tools, alongside our advisory and consultancy support, to solve problems for themselves. We expect this phase to start in mid-2025.
Getting the fundamentals right in data is an important topic, and there is a growing number of government teams working on these complex issues. If you're looking into any of these issues or solutions, please get in touch so we can continue to collaborate across the criminal justice system and public sector, and share lessons learned.
To deliver this ambitious programme, we’ll need passionate data professionals to join our growing team. If you’re interested in working with us to solve some knotty problems, keep an eye on Civil Service Jobs or email us for an informal chat.
The Showcase will feature curated content coordinated by the Data Science Community, with support from our partners, focused on three content strands: your career in data science, the Data Science Toolshed and making an impact with data science.
Civil servants can count attendance of any Showcase sessions towards their One Big Thing data training.
We will explore skills-building and career progression in the data science space. This follows the success of our career panel events throughout 2023 and requests from our members to access career content.
On day one you can expect:
At our June meetup we invited our members to enter the Data Science Toolshed, where developers code in the open, follow best Reproducible Analytical Pipelines (RAP) practice, and build tools to share. We had great feedback from the event, but not enough time to explore all the tools available to government data scientists, so we have decided to dedicate a whole day to the toolshed as part of the Showcase.
On day two you can expect:
We will shift our focus to looking at the context in which we do data science. The questions we address on this day are important not only for data scientists, data engineers and data analysts, but also for policymakers, scientists and anyone in government whose work is impacted by data. Some of the questions cover how policymakers and data scientists can support each other and how data science improves the lives of the public.
On day three you can expect:
This year, we have focused on broadening the reach of our Community activities and making them more inclusive. Engaging with a wider range of sectors and organisations in our programme means more opportunities to grow your network, understand the data science landscape, and collaborate rather than duplicate.
The Data Science Community is managed by the Data Science Campus, part of the Office for National Statistics. By teaming up with partner networks and communities with shared goals, we aim to deliver a Showcase that highlights the many areas of the public sector that are underpinned by data science. We are delighted to be working with the RAP Network, the NHS-R Community, the EmTech Community, and our own subcommunities.
We would like to extend a special thanks to Data Cymru, who have helped us to curate the content and engage speakers, ensuring that the voices of colleagues in local and devolved governments across the UK are incorporated into the Showcase.
Sessions are now live for booking on Eventbrite. Join our mailing list to be the first to know when more sessions become available. You can contact the Data Science Community team using government.data.science.community@ons.gov.uk.
For organisations working to reduce reoffending, being able to evaluate the effectiveness of their interventions is paramount. This requires access to relevant data, and the expertise and time to undertake appropriate analysis. Third sector organisations cannot access central reoffending data, because it is extremely sensitive information about individuals. This, combined with the need for specialist analysis, means that for many organisations working with people across their journeys through the criminal justice system, there would be no way to evaluate their work if the Justice Data Lab did not exist.
Organisations send us a list of their programme participants, and we identify these people in secure datasets such as the Police National Computer, build matched comparison groups, and produce a full impact analysis which quantifies the effect of the intervention on reoffending outcomes. This provides third sector organisations with a vital source of impartial and rigorous evidence that can be used to improve their work and secure essential funding.
Alongside our work with the third sector, the Justice Data Lab also leads on the evaluation of MoJ- and HMPPS-led initiatives: large-scale programmes with significant impacts across the criminal justice system.
This year marks the 10th anniversary of the Justice Data Lab. Within that time, we have produced 179 reports and have worked with over 50 organisations across the third sector, who provide all types of interventions from education, to accommodation, to justice system reform. Previously we have worked with The Clink, a vocational training programme which gives people in prison skills, qualifications, and routes to employment in catering and restaurant work; the Greater Manchester Intensive Community Order programme, which works with young male offenders who have received community orders in place of short custodial sentences; and the CHANGES programme at Nottingham Women’s Centre, which provides individualised support to women across 9 resettlement pathways in order to prevent reoffending.
This quarter, we published our latest evaluation, which looks at The Chrysalis Programme, an integrated personal leadership and effectiveness development programme that equips individuals with essential life skills, helping them to better own and drive positive personal change in their lives.
We’re the proud recipients of a Royal Statistical Society award for ‘Statistical Excellence in Official Statistics’, and our work has been covered by the BBC and Civil Service World. Our track record is reflected in the positive comments we receive from our partners:
Key4Life is hugely encouraged by the Ministry of Justice Data Lab’s analysis and validation, providing statistically robust evidence showing that Key4Life participants are significantly less likely to commit a re-offence compared with non-participants, and that Key4Life participants commit significantly fewer re-offences. […] Thank you to all those at the Ministry of Justice Data Lab for their support and guidance. Our staff, mentors, supporting employers and the young men on both our Prison and ‘At Risk’ preventative programmes can take great support from this positive validation.
A key part of our success is our consistent methodology, which relies on a statistical method called propensity score matching (PSM). As experts in the use of PSM for evaluation, we also receive requests from other teams in the Ministry of Justice to advise them on the use of this technique. Previously we have run an initiative we call JDL School, where we walk other teams through our methodology to share our expertise and up-skill others in propensity score matching. Now, in our 10th year, we’re very pleased to announce the relaunch of our JDL School programme.
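As an illustration of the core idea behind PSM, here is a minimal, hypothetical sketch in Python (the data and 1:1 nearest-neighbour matching approach are invented for illustration; our production methodology is considerably more sophisticated):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

# Hypothetical data: rows are people, columns are covariates
# (e.g. age, number of previous offences); `treated` marks
# programme participants.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
treated = rng.random(1000) < 0.1

# 1. Model the probability of participation (the propensity score)
#    from the covariates.
ps = LogisticRegression().fit(X, treated).predict_proba(X)[:, 1]

# 2. For each participant, find the non-participant with the
#    closest propensity score (1:1 nearest-neighbour matching).
controls = np.where(~treated)[0]
nn = NearestNeighbors(n_neighbors=1).fit(ps[controls].reshape(-1, 1))
_, idx = nn.kneighbors(ps[treated].reshape(-1, 1))
matched_controls = controls[idx.ravel()]

# 3. Compare reoffending outcomes between participants and their
#    matched comparison group (outcome data omitted here).
print(len(matched_controls), "matched comparisons built")
```

In practice, matching quality diagnostics and sensitivity checks are essential before any impact estimate is reported.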
Over the next few months, we will be gearing up in preparation to be able to offer training workshops and bespoke assistance for teams within the Ministry of Justice who would like to undertake quasi-experimental impact evaluations using propensity score matching.
If you’d like to learn more about the Justice Data Lab, commission us for work, or participate in JDL School, please email us at justice.datalab@justice.gov.uk. Please be aware that we receive a large number of commissions and therefore we operate a waitlist for undertaking any new work.
Collaboration and innovation are some of the key tenets of the Digital, Data and Technology (DDaT) profession. The Cabinet Office offers many avenues for productive collaboration, enabling internal and external partners to develop both professionally and personally. This includes up to 5 days of special paid leave per year for volunteering activity, cross-government programmes such as the catapult and accelerator schemes, and external collaborations such as the Teach Her mentorship, aimed at mentoring diverse women seeking career opportunities in DDaT.
In 2022 the Data Science community at Government Digital Service (GDS) collaborated with Imperial College London to champion these principles and develop new relationships.
Data Products, a team tasked with the development and deployment of novel data tools within GDS, played host to a project giving 4 postgraduate students the opportunity to work on a real-world problem. This collaboration aimed to help the students develop their data science skills and gain valuable experience working in a professional environment.
This sort of experience is rare outside industry: in academia, and on code-challenge websites, datasets are usually clean and adhere to tidy data principles. The students enjoyed this difference in working and commented:
This was our first time dealing with messy real-world textual data, which was a really rewarding experience. In the world of academia, we have previously been fortunate enough to enjoy "clean" datasets (especially as an undergraduate)... While initially frustrating, this gave us a useful opportunity to learn how to handle messy data in the real world
Our 2021/22 cohort was split into 2 pairs to encourage the development of new ideas and constructive challenge. We wanted to replicate the situation in industry, where bringing together people from different backgrounds and with different skill sets can lead to synergies and knowledge growth.
The students participated in a project investigating the interrelationships between pages on GOV.UK in an attempt to define what we refer to as life events.
In the Data Products team, we work with the understanding that people typically visit GOV.UK to find information and services related to a "life event". A life event describes an occasion in which we need to interact with the government in some way - whether life changing, like having a baby, or routine like registering for a fishing licence.
However, with over 500,000 pages on GOV.UK, and no single distinguishing feature by which a page can be easily categorised, there’s an interest in automatically identifying which life event a page belongs to. Successfully determining this holds the potential to facilitate access to government digital services and improve the overall experience of our users.
With only 4 months to complete onboarding, get up to speed with existing research, and produce and assess a piece of analysis, we had to work hard to ensure that the students were set up for success. With this in mind we curated a training suite and timetable, clearly laying out needs and expectations.
This timetable focussed on the students' early time with us, providing them with training resources to help them find their feet, such as introductions to the Civil Service, working in an agile environment, and coding best practices (e.g. version control using GitHub). We quickly progressed onto subject matter training, providing resources on Natural Language Processing and Network Analysis. Over the course of their work, the students made use of named entity recognition and geometric deep learning using biased second-order random walks.
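As a flavour of the first of those techniques, here is a minimal named entity recognition example using spaCy (the sentence is invented; the students' actual pipelines were more involved):

```python
import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Register the birth of your child in Manchester within 42 days.")

# Print each entity the model finds, with its predicted label
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. "Manchester" GPE, "42 days" DATE
```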
We received high praise from the students for the layout of our onboarding and timetabling; they commented:
We are convinced that this [collaboration] was only manageable with the clear project structure that had been thought out in the beginning. We only realised the full value of this about six weeks into the project when we both became quite busy with our other commitments
At this point we focussed on providing the students with the independence to make their own decisions and set the direction of their projects. They developed and presented a set of project proposals to internal experts and stakeholders, receiving feedback and direction. Whilst they worked on their projects we held regular stand-ups to assess progress and blockers, embracing the agile expectations of failing fast and iterating on an initial product. I must praise the students for their ability to work diligently on this project whilst balancing their prior commitments: dissertations, exams, and parallel work experiences. The resilience and dedication they demonstrated throughout this period was exemplary and makes me proud to have been a part of our collaboration.
By the end of the 4-month timetable, both pairs of students had successfully completed a minimum viable product, analysing user journeys on GOV.UK in an attempt to define pages belonging to life events. We culminated our collaboration with a playback session, in which the students walked our internal experts and stakeholders through their analysis and results.
One of the most important outcomes of this collaboration is ensuring that future cohorts see the value in participating in partnerships with industry. As such, I’d like to close with the following comments from our cohort:
“We would highly recommend being part of a collaboration with GDS to learn more about how data science is applied with real world data and in a project which is truly impactful. Moreover, it is a great opportunity to meet new people, and to get insight into how the Civil Service operates. Not only will you be able to hone your coding and research skills, but it will allow you to experience real world data-science!”
To explore career opportunities with the Government Digital Service, please visit our careers site. For the latest news about all things analytical in the UK Civil Service, including placement opportunities and ongoing mentorship schemes please visit the Government Analysis Function. Public sector employees can engage with us via the #NLP and #graphs-and-networks channels on the cross-government data science Slack.
The Reproducible Data Science and Analysis (RDSA) team sits within the Economic Statistics Change Directorate, and uses cutting-edge data science and engineering skills to produce the next generation of economic statistics. Current priorities include overhauling legacy systems and developing new systems for key statistics related to the economic impact of Brexit, the COVID-19 pandemic, and inflation.
Over the last five years, the RDSA team has grown from 4 to 50 people – an indication of the value it brings to the ONS.
Recently, the RDSA team successfully modernised a Reproducible Analytical Pipeline (RAP) for Highways England road traffic flow data. This improvement has cut the time it takes for the data to become available and be published in the ONS's Faster Indicators bulletin by approximately two weeks. Data users and policymakers now have access to more timely and accurate information on traffic flows, allowing them to make more informed decisions.
This RAP produces statistics for road traffic in England in a timely manner and is considered by the ONS to be a Faster Indicator. Road traffic statistics provide valuable insight into the supply and demand of goods in the UK economy by showing how domestic and foreign goods are transported across the country. This data was particularly valuable for economists and other experts analysing the impact of the coronavirus pandemic on the UK economy.
In addition, the data has the potential to provide insights into the UK's supply capacity and the relationship between types of vehicles and regional economic activity. This information can be beneficial for the UK Government's Levelling Up Agenda, helping to support local communities and economies.
The Road Traffic Sensor RAP has been given a new lease of life thanks to its deployment on the Google Cloud Platform (GCP). The Python package for the Road Traffic Sensor RAP runs smoothly on GCP's Cloud Run service, allowing our team to focus on writing and improving the code rather than managing server infrastructure. Additionally, Cloud Run's pay-per-use pricing means that we only pay when the package is running, providing a cost-saving solution for our organisation.
If you're looking to run your application on the cloud, Cloud Run is a great option. To use Cloud Run, however, your application needs to be packaged in a special format called a container. One popular way to create this containerised version of your application is through a tool called Docker. Developers write a file called a Dockerfile, which describes how their code and files should be packaged so that the application is self-contained and ready to run on the cloud.
Let's imagine that building a containerised application is like building a house. Just like how raw materials like brick, tiles, and timber are needed to build a house, our Python package containing scripts and files is needed to build our containerised application.
First, we create a blueprint for our house, which is similar to creating a Dockerfile for our application. This Dockerfile tells Docker, the architect, how we want our application to be structured and arranged.
Once the blueprint is ready, the architect (Docker) takes it and creates the technical documents, similar to blueprints and structural drawings for a house. Then, the builders and engineers (GCP Cloud Build) use these technical documents to construct the house, which in this case is our containerised Python application.
Finally, just like how a property manager takes care of the repairs, maintenance, security, and upkeep of a house, Cloud Run takes care of the same responsibilities for our containerised application.
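To make the analogy concrete, here is a minimal, illustrative Dockerfile for a Python package like ours (the file layout and entry point are hypothetical, not our production configuration):

```dockerfile
# The blueprint: tells Docker how to assemble the application
FROM python:3.10-slim

# Move the raw materials (the package's scripts and files) into the image
WORKDIR /app
COPY . .

# Install the package and its dependencies
RUN pip install --no-cache-dir .

# Cloud Run sends requests to the port named in the PORT environment
# variable; "app:server" is a hypothetical entry point for the pipeline
CMD exec gunicorn --bind :$PORT app:server
```

Cloud Build turns this blueprint into a container image, and Cloud Run then runs and manages it.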
Deploying code to the Cloud using tools like GCP and Docker is certainly a convenient and efficient process. However, it's important to remember that this is just one aspect of the overall software development process. To build and maintain a robust and high-quality codebase, it's essential to adhere to best practices and utilise the right tools. For this reason, we highly recommend checking out the "Quality assurance of code for analysis and research" online book, written by the Quality and Improvement team at the Office for National Statistics. This book delves into important topics such as version control, modular code, unit testing, and peer review, all of which were crucial to the development of our Road Traffic Sensor RAP package.
The ever-increasing volume of data and the need to extract valuable insights from it presents a significant challenge. However, the potential for technology to improve various aspects of our society and economy is vast.
Public sector organisations must adapt and evolve with the times in order to continue making a meaningful impact on citizens' lives for years to come. The RDSA team will rise to meet these challenges and help the UK Government achieve Mission Three ("Better data to power decision-making") and Mission Six ("A system that unlocks digital transformation") of its digital and data strategy.
By committing to ongoing learning and staying up-to-date with the latest trends and tools, the RDSA team is one of many at the ONS working to create more timely and robust statistics that empower governments, businesses, and individuals to make informed decisions and plan for the future.
If you're interested in learning more about the innovative work being done by the RDSA team, don't hesitate to reach out to our team lead, Rich Campbell at richard.campbell@ons.gov.uk.
The service has two main components: a suite of tools which staff can use to analyse data and create dashboards, and a data catalogue which DIT, and other government departments, can use to access existing data or dashboards. Data Workspace was created in 2019 and now hosts over 40 billion rows of data from over 200 different sources, as well as more than a dozen analysis tools. Options for tools include JupyterLab and RStudio, as well as lesser-known tools like Amazon Web Services (AWS) QuickSight and Theia / Superset. We even have some we’ve built ourselves!

This variety of tools, and the fact that they are available to anyone who completes mandatory civil service training, mean that people in many different roles use them. Having a single place to search for and find the data you need makes using it more accessible. My team provides assistance to everyone from data scientists keen to use the latest Natural Language Processing packages through to operational colleagues who want to make dashboards to track their individual progress.

Two users talk about how they use Data Workspace below:

“Having recently been brought into the Civil Service as an International Trade Adviser (ITA), where I was previously a Delivery Partner, one of the greatest surprises to me is how accessible the data relating to my responsibilities is: both the detail and breadth of the data, but also the range of analytical tools I have access to. Another surprise is the freedom for experimentation – one can follow a whim as to what might be a useful analysis without having to justify it to anyone in advance. Having the proper tools goes beyond the boundaries of what an over-stretched spreadsheet can produce, and these can be easily shared with colleagues to use themselves. Choosing an appropriate level of granularity can make the same data useful as either an overview or for detailed examination.

On the topic of support, the Data Workspace team offer a great amount of technical support and patience whilst learning to use these tools, which is very encouraging. Data analysis allows us to accurately understand a situation and look in the right place for solutions, and allowing the average user to do this themselves is extremely powerful.”

- Stephen Banks (International Trade Adviser)

“I'm Tayyib and I'm on the Digital, Data & Technology Fast Stream. Previously, I was a Data Scientist at the National Situation Centre (SitCen) in the Cabinet Office. My role in SitCen largely involved analysing data from a range of sources across HMG to produce meaningful insight for senior officials during crises. I’ve moved across to DIT as a Data Engineer, building new pipelines and developing the Data Workspace environment in line with GDS service standards.

My previous experience with data infrastructure largely involved using Amazon Web Services (AWS) and other related analytical tools for ingestion, data processing and analysis. AWS is a great platform; however, it can be relatively inaccessible and at times overwhelming for less data-experienced and tech-savvy users.

Data Workspace uses an AWS back-end but spins the infrastructure into a significantly more accessible platform, removing a lot of the hassle. It means that less data-experienced and tech-savvy users can easily pull data using SQL, produce analysis in Python or R, and visualise insight by creating dashboards in AWS QuickSight or maps in QGIS. A huge advantage of using Data Workspace is that it can also be shared across HMG and hosts the bulk of data that DIT uses on everything related to trade, economic productivity and investment.”

- Tayyib Saddique (Data Engineer)

If you want to work with our team designing and supporting Data Workspace, you can sign up for job alerts here.
In 2020, we (Harriet, Michael and Hillary) created a space where public sector employees from any department could come together and discuss data ethics, technology and society. The aim of the space was to breathe life into the concepts and topics mentioned in data ethics frameworks.
Hillary and Harriet had attended a DataKind ethics book club where Ruha Benjamin's Race After Technology was discussed. Unlike the DataKind event, which is open to all, we felt it was important to create a safe environment where public sector workers could freely discuss challenging topics such as feminism, race and sexuality, and examine with a critical lens how these ethical issues relate to our work.
To ensure equality, we believe it is important to be able to freely discuss inequalities, and where and how we can apply data ethics frameworks in our work.
And thus, the Cross-Government Data Ethics and Society Reading Group was born!
With the exception of Hillary, none of us were particularly skilled at event planning. Discussion groups began rather haphazardly, as we tried to figure out what worked best in a (virtual) public sector context.
The first book we read was Data Feminism by Catherine D’Ignazio and Lauren F. Klein, a book which is now entirely open access. We split our discussion of it across three sessions due to its length and the breadth of its contents (ranging from an initiative to record cases of femicide in an open, accessible manner, to the Gender Shades project).
We had great attendance at our first event, and we would undoubtedly recommend engaging with the discussion in this book; in retrospect, though, it was perhaps not the best starting point for data ethics. Each chapter could have been a standalone book in itself! Discussion points are definitely key!
We pitched our second event to be more like a journal club, bringing together journal and media articles we had found interesting, loosely themed around bias in data. This format was however less popular than a traditional ‘book club’-style reading and discussion group.
We again returned to the book format to discuss Kate Crawford’s Atlas of AI, which presents AI as a technology of extraction: from the minerals drawn from the earth right through to the labour of low-wage information workers.
Attendance varied substantially between events, with many no-shows. It was difficult to predict what would attract more people, but each event took the same amount of time to coordinate.
At the start of this year, we reflected on our sessions so far. We have introduced more structure, committing to four sessions per year: one per book, with suggestions of particular chapters of interest. You will now see us promoting our events via the Analysis Function and Operational Research & Statistical Service newsletters, in addition to the #ethics channel on the cross-government data science Slack.
Some colleagues come to every session; others are intrigued by a particular book or author. We want to encourage everyone and anyone to engage with the material. You do not have to be someone who works with data; all you need is curiosity and a willingness to think critically about the material.
This year, we have read:
A Data Ethics and Society Reading Group session runs for an hour over lunchtime. A date for the session and a book are confirmed (using your suggestions!) a few months in advance.
Numbers vary: sometimes we have smaller groups, but recently attendance has been closer to 50!
After a short introduction to the group and material, attendees are split into breakout rooms to discuss the book in smaller groups. Hosts suggest discussion questions, but each breakout room tends to take discussion in different directions. At the end of the session, we come back together to share highlights.
No notes or recordings are taken from the sessions, and whilst we encourage everyone to participate, we know that people engage in a multitude of different ways. We are open to suggestions on how we make the group more inclusive and cater for the needs of everyone. If you have an idea, please get in touch!
We are sure that the reading group will continue to shape-shift as time goes on. For now, we hope that it serves as a starting point for public sector colleagues to critically engage with data ethics alongside data practitioners.
We are always looking for recommendations and suggestions of what to read and discuss next. We also encourage guest-hosts of the event, particularly in pairs.
We have a full reading list on our website. Some highlights include:
Our website is powered by GitHub, which means that it is really easy for you to suggest reading material and be listed in our ‘contributors’ section. You can raise an issue listing all the information we need.
On Tuesday 6th December, we will be running our final session of 2022, focusing on Artificial Unintelligence by Meredith Broussard.
Sign up online.
To hear about events in future, you can sign up to our mailing list.
Michael and Harriet
We meet users where they are by providing feedback options on every GOV.UK page. Unfortunately, this also creates a lot of avenues for spam responses. These responses can dilute our insights, cause security concerns, and prevent real problems from being identified.
A multi-disciplinary team responded to this problem by developing a machine learning spam classifier. The process is part of an upgrade to the whole user feedback pipeline at GOV.UK, aiming to put critical insights in the hands of decision-makers more quickly. This post will explore the decision to use machine learning, how we built our solution, and the plan for next steps.
At GOV.UK, we received around 540,000 feedback responses from the public and other departments in 2021. Users can choose from several options on the website to comment on a range of topics, but are not actively prompted.
In early 2022, we saw spam responses surge due to a technical change on the front end, peaking at 12% of total feedback. These responses ranged from fraudulent advertisements and links to pornographic content, to multiple lines of code and incomprehensible combinations of characters.
To extract insights, colleagues suddenly needed to manually filter out tens of thousands of unusable responses over the year. The extra “noise” in the data made it nearly impossible to automate this extraction, as insights were diluted by spam. There was also a security risk to consider, as individuals could attempt to negatively exploit feedback mechanisms to disrupt the usual workings of GOV.UK.
We recognised that colleagues needed to derive their insights quickly, without having to manually filter out spam. Based on the roughly 35,000 feedback responses we receive per month, we calculated that manually categorising spam could take up to 4,000 hours per year. This risked costing the civil service hundreds of thousands of pounds in accumulated hourly salaries for something that was theoretically simple to automate.
We saw the challenge as a great application for a machine learning (ML) model. ML models are computer programs that can automatically improve their own performance (judged by a chosen metric) as they are given access to more data. They are popular and powerful solutions for classification problems, and often deployed for use on email spam.
In supervised spam detection, the task is to predict the label of “spam” or “not spam” (the latter known as “ham”) for each feedback response, as well as a probability that the prediction corresponds to the true label. The method is to provide the model with a “training set”: many examples where humans have assigned the labels that we would like it to replicate. We then assess performance by passing through new data that the model has not been trained on (the “test set”) and comparing the model’s predictions with the true labels.
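A minimal sketch of that workflow using scikit-learn (the derived features and tiny dataset here are invented for illustration):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical features derived from each feedback response
df = pd.DataFrame({
    "num_links":   [0, 4, 0, 7, 1, 0, 5, 0],
    "prop_digits": [0.0, 0.3, 0.1, 0.5, 0.0, 0.02, 0.4, 0.01],
    "is_spam":     [0, 1, 0, 1, 0, 0, 1, 0],  # human-assigned labels
})

X, y = df[["num_links", "prop_digits"]], df["is_spam"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# predict_proba gives a probability for each class, so we can demand
# high confidence before labelling a response as spam
spam_prob = model.predict_proba(X_test)[:, 1]
print((spam_prob > 0.8).astype(int))  # 1 = spam at a strict threshold
```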
Problems suited to ML can often be solved almost as effectively in less time and with less data by heuristic methods, so it is best practice to test these first. Before deploying ML to our problem, we experimented with “rules-based” spam detection on the new data pipeline built by the Data Insights team.
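Those rules were simple, hand-written checks along these lines (the patterns here are illustrative, not our real rule set):

```python
import re

RULES = [
    lambda t: len(re.findall(r"https?://", t)) > 2,                # link-heavy
    lambda t: bool(re.search(r"crypto|casino", t, re.I)),          # spam terms
    lambda t: sum(c.isdigit() for c in t) / max(len(t), 1) > 0.4,  # mostly digits
]

def is_spam(text: str) -> bool:
    # Flag the response if any single rule fires
    return any(rule(text) for rule in RULES)

print(is_spam("Win big at http://a.example http://b.example http://c.example"))
```

The weakness is visible in the structure: each rule fires independently, so spam combining several weak signals, none strong enough to trip a single rule, slips through.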
We quickly saw that we would benefit from ML’s ability to predict the probability of class membership as well as class labels, as rules-based methods appeared to struggle with spam that had subtle combinations of indicators. With ML, we could use the probability score to demand a high level of confidence in our model’s predictions, reducing the mislabelling of legitimate feedback. We could also ask for a breakdown of “feature importance”, showing us which characteristics were most likely to be present in responses labelled as spam.
ML therefore provided some clear benefits for our problem. We followed best practice by deploying rules-based approaches first, and located the areas where ML would have the advantage. Once we had made the decision, we focused on the tools and techniques that would help deliver a working solution as soon as possible.
To align with agile principles of rapidly delivering a working solution, we prioritised quick development throughout the build. We used the Machine Learning Canvas to quickly define the scope of our problem, identify blockers to deployment, and assess the readiness of our datasets.
We identified the need to accelerate the selection of an appropriate ML classifier. To do this, we used PyCaret to automate the comparison of classifier models that we could fine-tune, using a simple function call. This helped us decide on a Random Forest Classifier, a form of ensemble learning that uses multiple decision trees to make an aggregated prediction.
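The comparison step looked roughly like this (the file name is hypothetical; see the PyCaret documentation for the full API):

```python
import pandas as pd
from pycaret.classification import compare_models, setup, tune_model

# Labelled feedback responses with derived features (hypothetical file)
labelled_feedback = pd.read_csv("feedback_labelled.csv")

# Point PyCaret at the data and the target column
setup(data=labelled_feedback, target="is_spam", session_id=42)

# Cross-validate a suite of candidate classifiers and return the best;
# for us this pointed to a random forest
best = compare_models()

# Fine-tune the winning model's hyperparameters
tuned = tune_model(best)
```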
The canvas also helped quantify the complexity of our datasets, and the need to standardise the versions used across teams. We implemented Data Version Control (DVC) to version-control our datasets and models, and to ensure the consistency and reproducibility of our results.
Once we had built a performant model, we used a confusion matrix to visualise the occurrence of false positives, where real feedback was identified as spam. We then ranked individual feature importances to understand which derived features had the most impact on the model's predictions.
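Continuing the earlier scikit-learn sketch (reusing its fitted `model` and test split), those checks look something like this:

```python
from sklearn.metrics import confusion_matrix

# Rows are true classes, columns are predictions; the top-right cell
# counts real feedback wrongly flagged as spam (false positives)
print(confusion_matrix(y_test, model.predict(X_test)))

# Rank the derived features by their influence on the forest's decisions
for name, score in sorted(
    zip(X.columns, model.feature_importances_), key=lambda p: -p[1]
):
    print(f"{name}: {score:.2f}")
```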
Tools such as PyCaret and DVC meant that we were able to focus on deploying a working solution at pace. We took an agile approach, testing rules-based methods before ML, and used the Machine Learning Canvas to streamline planning and set priorities for quick development and iteration.
The first iteration of our spam classifier is capable of delivering huge time savings to GOV.UK. We can run it on over a month’s worth of feedback data (around 40,000 responses) in less than five minutes, a fraction of the time it takes human reviewers.
Now, a careful iteration process is important for combating spammers that adjust terminology to outwit filters and cause “model drift”. We will deploy the model to larger, more complex feedback datasets and engineer improved features. We hope to improve our model’s accuracy by finding a classification threshold that strikes the optimum balance between “precision” and “recall”.
Open development remains integral to our approach, helping us refine collaboration across teams, test novel techniques, and speed up the processing of tens of thousands of feedback responses every month. If you are interested in seeing how this project progresses, subscribe to the GOV.UK blogs where we will post updates on the integration process, and the real world results for GOV.UK colleagues.
]]>
A common data quality problem is to have multiple different records that refer to the same entity but no unique identifier that ties these entities together. For example, customer data may have been entered multiple times by accident, or have been entered in multiple IT systems separately.
Record linkage (sometimes known as entity resolution, or data matching) is a technique to link these records, enabling data to be deduplicated and joined between systems.
At the Ministry of Justice, we have developed an open source library called Splink to improve our record linkage methodology. This has enabled us to share new linked datasets with accredited researchers, as part of the ADR UK-funded Data First programme.
What is Splink?
Splink is a free library for fast and accurate record linkage, and is now in its third version.
How do I get started?
Splink is a free Python package that can be installed in the usual way - using ‘pip install splink’.
We recommend users start by looking at our online tutorial, which is part of our main documentation website. The tutorial runs through a full record linkage example, from exploratory analysis right through to prediction and graph analytics, and it can even be run interactively in your web browser.
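As a taste of the API, deduplicating a single table with Splink 3 looks roughly like this (the column names and parameters are invented, and the interface may have evolved, so treat the tutorial as authoritative):

```python
import pandas as pd
import splink.duckdb.comparison_library as cl
from splink.duckdb.linker import DuckDBLinker

df = pd.read_csv("people.csv")  # hypothetical input records

settings = {
    "link_type": "dedupe_only",
    "comparisons": [
        cl.exact_match("date_of_birth"),
        cl.levenshtein_at_thresholds("surname", 2),
    ],
    # Blocking: only generate comparisons for pairs sharing a postcode
    "blocking_rules_to_generate_predictions": ["l.postcode = r.postcode"],
}

linker = DuckDBLinker(df, settings)
linker.estimate_u_using_random_sampling(max_pairs=1e6)
predictions = linker.predict()  # pairwise match scores
```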
How it works
Splink is an implementation of the Fellegi-Sunter model. The software generates pairwise record comparisons using an approach called blocking, and computes a match score for each pair which quantifies the similarity between the two records.
The match score is determined by parameters known as partial match weights. These quantify the importance of different aspects of the comparison.
For example, a match on date of birth lends more evidence in favour of two records being a match than a match on gender. A mismatch on postcode may provide weak evidence against a match because people move house, whereas a mismatch on date of birth may be stronger evidence against the record being a match.
This simple idea is powerful, and can be used to build highly nuanced models. Partial match weights can be computed for an arbitrary number of user-defined scenarios, not just a match or non-match. For example, a partial match weight can be estimated for a scenario where postcodes do not match but are within 10 miles of each other.
These partial match weights are combined into an overall match score, which represents the weight of evidence that the two records are a match.
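To make the arithmetic concrete, here is a toy calculation combining two partial match weights into an overall match probability (the m- and u-probabilities and the prior are invented for illustration):

```python
import math

# m: probability a field agrees when the records are a true match
# u: probability it agrees when they are not
fields = {
    "date_of_birth": {"m": 0.95, "u": 0.01, "agrees": True},
    "postcode":      {"m": 0.70, "u": 0.005, "agrees": False},
}

prior_odds = 1 / 1000          # starting odds that a random pair matches
log2_odds = math.log2(prior_odds)

for name, f in fields.items():
    if f["agrees"]:
        w = math.log2(f["m"] / f["u"])              # evidence for a match
    else:
        w = math.log2((1 - f["m"]) / (1 - f["u"]))  # evidence against
    print(f"{name}: partial match weight {w:+.2f}")
    log2_odds += w

probability = 2**log2_odds / (1 + 2**log2_odds)
print(f"overall match probability: {probability:.3f}")
```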
A more detailed video description of how this all works can be found here.
Get in touch
If you work for the government and would like help getting started with your data, please don’t hesitate to get in touch at robin.linacre@digital.justice.gov.uk. You can also ask us a question or raise an issue against the main (public) GitHub repository.
Reproducible Analytical Pipelines (RAPs) are automated statistical and analytical processes. They ensure that analysis is reproducible, efficient, and high quality.
The Analysis Standards and Pipelines team have worked with teams across government to help them implement RAP. From our experiences we have developed an approach that works to build capability in the team, deliver a working and valuable product, and create enthusiasm for a new way of doing analysis.
We approach building RAP capability in teams by:
Managers are happier when they see quick wins. We work with analysts to identify small parts of the existing pipeline which have time-consuming manual steps or are the riskiest and prioritise these for development. We keep managers enthused by sharing successes.
"Just-in-time learning", where new techniques or principles are taught to analysts just before they use them, helps analysts to learn best practices. This allows them to embed their new knowledge by using it, rather than taking an “Intro to R” course months before they ever get to use it.
Paired programming, where two or more analysts code together with one “driving” and the other reviewing, spreads knowledge, aids peer review, and speeds up code development.
Agile is an approach to managing software development that helps teams deliver products to their customers faster. A big part of agile working is understanding user needs: gathering and understanding them allows analysts to create code which is fit for purpose. Frequent conversations with users, and showing them working versions of the product, help us understand whether needs have changed so we can adjust our code appropriately. Opening these lines of communication between analysts and users is vital for RAP to work effectively.
For this approach to work we need the right team, the right tools and support from managers. Each member of the team should have a base level of coding ability but does not need to be an expert. In previous projects we have started with training a small team first. Once this team is ready, we bring in more contributors.
A range of skills in the team is useful to help a RAP project flow smoothly. Team members who know the outputs and end users are useful to keep the product fit for purpose. The team must be enthusiastic as it is challenging to learn this new way of working.
Analysts need dedicated time to learn RAP practices and to produce code the right way. Managers must be made aware of how much time this takes, and they must commit to giving their teams this time.
Contact us for support deploying RAP in your team.
We recently released our cross-governmental RAP strategy. You can also read the Quality Assurance of Code for Analysis and Research for a detailed look at RAP principles. My colleague Rowan has recently written a blog about building open source tools for analysts to help them implement RAP.
If you would like to get involved, check out the RAP collaboration slack channel and consider joining the RAP champions network.