In this post we’ll look at our work in the MoJ, and in the next blog we’ll share plans for data improvement in the wider criminal justice system.
Data underpins our work in delivering justice outcomes. It helps us measure the impact of policy interventions, gives us operational insight into prisons and probation, and helps us deliver better services for our users – along with so much more. Yet too often, our data is fragmented, hard to share and not exploited to its fullest extent.
The Data Improvement team is focused on improving the quality of data, access to data and the data skills of staff, so that the MoJ and the wider criminal justice system can make better decisions based on data and improve outcomes for the millions of people who rely on the justice system.
We are creating the foundations for our colleagues – in data science, data linking, analysis, operations and more – to be able to deliver the data-driven insight the MoJ relies on.
Our roadmap outlines our data improvement strategy for the next three years. As with any roadmap, we have most confidence about the activity that’s coming up in the near future, and our work in the next few months will inform our work over the coming years.
Following on from our discovery work, we’ve been developing prototypes of processes and tools to improve data quality, access to data and data skills. For the next few months, we’ll continue to test these solutions with users and iterate them. We will:
Next, we need to build up our improvement toolkit, by testing our ideas with more partners. From the middle of 2024, we’ll draw together our different prototypes and strands of work on exemplar end-to-end services or datasets. We will:
Once we have confidence in our approach and our skills, we can support other teams to lead within their own area. We can enable them to use their subject matter expertise and our processes and tools, alongside our advisory and consultancy support, to solve problems for themselves. We expect this phase to start in mid-2025.
Getting the fundamentals right in data is an important topic, and there is a growing number of government teams working on these complex issues. If you're looking into any of these issues or solutions, please get in touch so we can continue to collaborate across the criminal justice system and public sector, and share lessons learned.
To deliver this ambitious programme, we’ll need passionate data professionals to join our growing team. If you’re interested in working with us to solve some knotty problems, keep an eye on Civil Service Jobs or email us for an informal chat.
The Showcase will feature curated content coordinated by the Data Science Community, with support from our partners, focused on three content strands: your career in data science, the Data Science Toolshed and making an impact with data science.
Civil servants can count attendance of any Showcase sessions towards their One Big Thing data training.
We will explore skills-building and career progression in the data science space. This follows the success of our career panel events throughout 2023 and requests from our members to access career content.
On day one you can expect:
At our June meetup we invited our members to enter the Data Science Toolshed, where developers code in the open, follow best Reproducible Analytical Pipelines (RAP) practice, and build tools to share. We had great feedback from the event, but not enough time to explore all the tools available to government data scientists, so we have decided to dedicate a whole day to the toolshed as part of the Showcase.
On day two you can expect:
We will shift our focus to looking at the context in which we do data science. The questions we address on this day are important not only for data scientists, data engineers and data analysts, but also for policymakers, scientists and anyone in government whose work is impacted by data. Some of the questions cover how policymakers and data scientists can support each other and how data science improves the lives of the public.
On day three you can expect:
This year, we have focused on broadening the reach of our Community activities and making them more inclusive. Engaging with a wider range of sectors and organisations in our programme means more opportunities to grow your network, understand the data science landscape, and collaborate rather than duplicate.
The Data Science Community is managed by the Data Science Campus, part of the Office for National Statistics. By teaming up with partner networks and communities with shared goals, we aim to deliver a Showcase that highlights the many areas of the public sector that are underpinned by data science. We are delighted to be working with the RAP Network, the NHS-R Community, the EmTech Community, and our own subcommunities.
We would like to extend a special thanks to Data Cymru, who have helped us to curate the content and engage speakers, ensuring that the voices of colleagues in local and devolved governments across the UK are incorporated into the Showcase.
Sessions are now live for booking on Eventbrite. Join our mailing list to be the first to know when more sessions become available. You can contact the Data Science Community team using government.data.science.community@ons.gov.uk.
For organisations working to reduce reoffending, being able to evaluate the effectiveness of their interventions is paramount. This requires access to relevant data, and the expertise and time to undertake appropriate analysis. Third sector organisations cannot access central reoffending data, because it is extremely sensitive information about individuals. This, combined with the need for specialist analysis, means that for many organisations working with people across their journeys through the criminal justice system, there would be no way to evaluate their work if the Justice Data Lab did not exist.
Organisations send us a list of their programme participants, and we identify these people in secure datasets such as the Police National Computer, build matched comparison groups, and produce a full impact analysis which quantifies the effect of the intervention on reoffending outcomes. This provides third sector organisations with a vital source of impartial and rigorous evidence that can be used to improve their work and secure essential funding.
Alongside our work with the third sector, the Justice Data Lab also leads on the evaluation of MoJ- and HMPPS-led initiatives: large-scale programmes with significant impacts across the criminal justice system.
This year marks the 10th anniversary of the Justice Data Lab. Within that time, we have produced 179 reports and have worked with over 50 organisations across the third sector, who provide all types of interventions from education, to accommodation, to justice system reform. Previously we have worked with The Clink, a vocational training programme which gives people in prison skills, qualifications, and routes to employment in catering and restaurant work; the Greater Manchester Intensive Community Order programme, which works with young male offenders who have received community orders in place of short custodial sentences; and the CHANGES programme at Nottingham Women’s Centre, which provides individualised support to women across 9 resettlement pathways in order to prevent reoffending.
This quarter, we published our latest evaluation, which looks at The Chrysalis Programme, an integrated personal leadership and effectiveness development programme that equips individuals with essential life skills, helping them to better own and drive positive personal change in their lives.
We’re the proud recipients of a Royal Statistical Society award for ‘Statistical Excellence in Official Statistics’, and our work has been covered by the BBC and Civil Service World. Our track record is reflected in the positive comments we receive from our partners:
Key4Life is hugely encouraged by the Ministry of Justice Data Lab’s analysis and validation, providing statistically robust evidence showing that Key4Life participants are significantly less likely to commit a re-offence compared with non-participants, and that Key4Life participants commit significantly fewer re-offences. […] Thank you to all those at the Ministry of Justice Data Lab for their support and guidance. Our staff, mentors, supporting employers and the young men on both our Prison and ‘At Risk’ preventative programmes can take great support from this positive validation.
A key part of our success is our consistent methodology, which relies on a statistical method called propensity score matching (PSM). As experts in the use of PSM for evaluation, we also receive requests from other teams in the Ministry of Justice to advise them on the use of this technique. Previously we have run an initiative we call JDL School, where we walk other teams through our methodology to share our expertise and up-skill others in propensity score matching. Now, in our 10th year, we’re very pleased to announce the relaunch of our JDL School programme.
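As an illustration of the core idea behind PSM, here is a minimal, hypothetical sketch in Python (the data and 1:1 nearest-neighbour matching approach are invented for illustration; our production methodology is considerably more sophisticated):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

# Hypothetical data: rows are people, columns are covariates
# (e.g. age, number of previous offences); `treated` marks
# programme participants.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
treated = rng.random(1000) < 0.1

# 1. Model the probability of participation (the propensity score)
#    from the covariates.
ps = LogisticRegression().fit(X, treated).predict_proba(X)[:, 1]

# 2. For each participant, find the non-participant with the
#    closest propensity score (1:1 nearest-neighbour matching).
controls = np.where(~treated)[0]
nn = NearestNeighbors(n_neighbors=1).fit(ps[controls].reshape(-1, 1))
_, idx = nn.kneighbors(ps[treated].reshape(-1, 1))
matched_controls = controls[idx.ravel()]

# 3. Compare reoffending outcomes between participants and their
#    matched comparison group (outcome data omitted here).
print(len(matched_controls), "matched comparisons built")
```

In practice, matching quality diagnostics and sensitivity checks are essential before any impact estimate is reported.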
Over the next few months, we will be gearing up in preparation to be able to offer training workshops and bespoke assistance for teams within the Ministry of Justice who would like to undertake quasi-experimental impact evaluations using propensity score matching.
If you’d like to learn more about the Justice Data Lab, commission us for work, or participate in JDL School, please email us at justice.datalab@justice.gov.uk. Please be aware that we receive a large number of commissions and therefore we operate a waitlist for undertaking any new work.
Collaboration and innovation are some of the key tenets of the Digital, Data and Technology (DDaT) profession. The Cabinet Office offers many avenues for productive collaboration, enabling internal and external partners to develop both professionally and personally. This includes up to 5 days of special paid leave per year for volunteering activity, cross-government programmes such as the catapult and accelerator schemes, and external collaborations such as the Teach Her mentorship, aimed at mentoring diverse women seeking career opportunities in DDaT.
In 2022 the Data Science community at Government Digital Service (GDS) collaborated with Imperial College London to champion these principles and develop new relationships.
Data Products, a team tasked with the development and deployment of novel data tools within GDS, played host to a project giving 4 postgraduate students the opportunity to work on a real-world problem. This collaboration aimed to help the students develop their data science skills and gain valuable experience working in a professional environment.
This sort of experience is rare outside industry: in academia, and on code-challenge websites, datasets are usually clean and adhere to tidy data principles. The students enjoyed this difference in working and commented:
This was our first time dealing with messy real-world textual data, which was a really rewarding experience. In the world of academia, we have previously been fortunate enough to enjoy "clean" datasets (especially as an undergraduate)... While initially frustrating, this gave us a useful opportunity to learn how to handle messy data in the real world
Our 2021/22 cohort was split into 2 pairs to encourage the development of new ideas and constructive challenge. We wanted to replicate the situation in industry, where bringing together people from different backgrounds and with different skill sets can lead to synergies and knowledge growth.
The students participated in a project investigating the interrelationships between pages on GOV.UK in an attempt to define what we refer to as life events.
In the Data Products team, we work with the understanding that people typically visit GOV.UK to find information and services related to a "life event". A life event describes an occasion in which we need to interact with the government in some way - whether life changing, like having a baby, or routine like registering for a fishing licence.
However, with over 500,000 pages on GOV.UK, and no single distinguishing feature by which a page can be easily categorised, there’s an interest in automatically identifying which life event a page belongs to. Successfully determining this holds the potential to facilitate access to government digital services and improve the overall experience of our users.
With only 4 months to complete onboarding, get up to speed with existing research, and produce and assess a piece of analysis, we had to work hard to ensure that the students were set up for success. With this in mind we curated a training suite and timetable, clearly laying out needs and expectations.
This timetable focussed on the students' early time with us, providing them with training resources to help them find their feet, such as introductions to the Civil Service, working in an agile environment, and coding best practices (e.g. version control using GitHub). We quickly progressed onto subject matter training, providing resources on Natural Language Processing and Network Analysis. Over the course of their work, the students made use of named entity recognition and geometric deep learning using biased second-order random walks.
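As a flavour of the first of those techniques, here is a minimal named entity recognition example using spaCy (the sentence is invented; the students' actual pipelines were more involved):

```python
import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Register the birth of your child in Manchester within 42 days.")

# Print each entity the model finds, with its predicted label
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. "Manchester" GPE, "42 days" DATE
```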
We received high praise from the students for the layout of our onboarding and timetabling; they commented:
We are convinced that this [collaboration] was only manageable with the clear project structure that had been thought out in the beginning. We only realised the full value of this about six weeks into the project when we both became quite busy with our other commitments
At this point we focussed on providing the students with the independence to make their own decisions and set the direction of their projects. They developed and presented a set of project proposals to internal experts and stakeholders, receiving feedback and direction. Whilst they worked on their projects we held regular stand-ups to assess progress and blockers, embracing the agile expectations of failing fast and iterating on an initial product. I must praise the students for their ability to work diligently on this project whilst balancing their prior commitments: dissertations, exams, and parallel work experiences. The resilience and dedication they demonstrated throughout this period was exemplary and makes me proud to have been a part of our collaboration.
By the end of the 4-month timetable, both pairs of students had successfully completed a minimum viable product, analysing user journeys on GOV.UK in an attempt to define pages belonging to life events. We culminated our collaboration with a playback session, in which the students walked our internal experts and stakeholders through their analysis and results.
One of the most important outcomes of this collaboration is ensuring that future cohorts see the value in participating in partnerships with industry. As such, I’d like to close with the following comments from our cohort:
“We would highly recommend being part of a collaboration with GDS to learn more about how data science is applied with real world data and in a project which is truly impactful. Moreover, it is a great opportunity to meet new people, and to get insight into how the Civil Service operates. Not only will you be able to hone your coding and research skills, but it will allow you to experience real world data-science!”
To explore career opportunities with the Government Digital Service, please visit our careers site. For the latest news about all things analytical in the UK Civil Service, including placement opportunities and ongoing mentorship schemes please visit the Government Analysis Function. Public sector employees can engage with us via the #NLP and #graphs-and-networks channels on the cross-government data science Slack.
The Reproducible Data Science and Analysis (RDSA) team sits within the Economic Statistics Change Directorate, and uses cutting-edge data science and engineering skills to produce the next generation of economic statistics. Current priorities include overhauling legacy systems and developing new systems for key statistics related to the economic impact of Brexit, the COVID-19 pandemic, and inflation.
Over the last five years, the RDSA team has grown from 4 to 50 people – an indication of the value it brings to the ONS.
Recently, the RDSA team successfully modernised a Reproducible Analytical Pipeline (RAP) for Highways England road traffic flow data. This improvement has cut the time it takes for the data to become available and be published in the ONS's Faster Indicators bulletin by approximately two weeks. Data users and policymakers now have access to more timely and accurate information on traffic flows, allowing them to make more informed decisions.
This RAP produces statistics for road traffic in England in a timely manner and is considered by the ONS to be a Faster Indicator. Road traffic statistics provide valuable insight into the supply and demand of goods in the UK economy by showing how domestic and foreign goods are transported across the country. This data was particularly valuable for economists and other experts analysing the impact of the coronavirus pandemic on the UK economy.
In addition, the data has the potential to provide insights into the UK's supply capacity and the relationship between types of vehicles and regional economic activity. This information can be beneficial for the UK Government's Levelling Up Agenda, helping to support local communities and economies.
The Road Traffic Sensor RAP has been given a new lease of life thanks to its deployment on the Google Cloud Platform (GCP). The Python package for the Road Traffic Sensor RAP runs smoothly on GCP's Cloud Run service, allowing our team to focus on writing and improving the code rather than managing server infrastructure. Additionally, Cloud Run's pay-per-use pricing means that we only pay when the package is running, providing a cost-saving solution for our organisation.
If you're looking to run your application on the cloud, Cloud Run is a great option. To use Cloud Run, however, your application needs to be packaged in a special format called a container. One popular way to create this containerised version of your application is through a tool called Docker. Developers write a file called a Dockerfile, which describes how their code and files should be packaged so that the application is self-contained and ready to run on the cloud.
Let's imagine that building a containerised application is like building a house. Just like how raw materials like brick, tiles, and timber are needed to build a house, our Python package containing scripts and files is needed to build our containerised application.
First, we create a blueprint for our house, which is similar to creating a Dockerfile for our application. This Dockerfile tells Docker, the architect, how we want our application to be structured and arranged.
Once the blueprint is ready, the architect (Docker) takes it and creates the technical documents, similar to blueprints and structural drawings for a house. Then, the builders and engineers (GCP Cloud Build) use these technical documents to construct the house, which in this case is our containerised Python application.
Finally, just like how a property manager takes care of the repairs, maintenance, security, and upkeep of a house, Cloud Run takes care of the same responsibilities for our containerised application.
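To make the analogy concrete, here is a minimal, illustrative Dockerfile for a Python package like ours (the file layout and entry point are hypothetical, not our production configuration):

```dockerfile
# The blueprint: tells Docker how to assemble the application
FROM python:3.10-slim

# Move the raw materials (the package's scripts and files) into the image
WORKDIR /app
COPY . .

# Install the package and its dependencies
RUN pip install --no-cache-dir .

# Cloud Run sends requests to the port named in the PORT environment
# variable; "app:server" is a hypothetical entry point for the pipeline
CMD exec gunicorn --bind :$PORT app:server
```

Cloud Build turns this blueprint into a container image, and Cloud Run then runs and manages it.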
Deploying code to the Cloud using tools like GCP and Docker is certainly a convenient and efficient process. However, it's important to remember that this is just one aspect of the overall software development process. To build and maintain a robust and high-quality codebase, it's essential to adhere to best practices and utilise the right tools. For this reason, we highly recommend checking out the "Quality assurance of code for analysis and research" online book, written by the Quality and Improvement team at the Office for National Statistics. This book delves into important topics such as version control, modular code, unit testing, and peer review, all of which were crucial to the development of our Road Traffic Sensor RAP package.
The ever-increasing volume of data and the need to extract valuable insights from it presents a significant challenge. However, the potential for technology to improve various aspects of our society and economy is vast.
Public sector organisations must adapt and evolve with the times in order to continue making a meaningful impact on citizens' lives for years to come. The RDSA team will rise to meet these challenges and help the UK Government achieve Mission Three ("Better data to power decision-making") and Mission Six ("A system that unlocks digital transformation") of its digital and data strategy.
By committing to ongoing learning and staying up-to-date with the latest trends and tools, the RDSA team is one of many at the ONS working to create more timely and robust statistics that empower governments, businesses, and individuals to make informed decisions and plan for the future.
If you're interested in learning more about the innovative work being done by the RDSA team, don't hesitate to reach out to our team lead, Rich Campbell at richard.campbell@ons.gov.uk.
The service has two main components: a suite of tools which staff can use to analyse data and create dashboards, and a data catalogue which DIT, and other government departments, can use to access existing data or dashboards. Data Workspace was created in 2019 and now hosts over 40 billion rows of data from over 200 different sources, as well as more than a dozen analysis tools. Options for tools include JupyterLab and RStudio, as well as lesser-known tools like Amazon Web Services (AWS) QuickSight and Theia / Superset. We even have some we’ve built ourselves!

This variety of tools, and the fact that they are available to anyone who completes mandatory civil service training, mean that people in many different roles use them. Having a single place to search for and find the data you need makes using it more accessible. My team provides assistance to everyone from data scientists keen to use the latest Natural Language Processing packages through to operational colleagues who want to make dashboards to track their individual progress.

Two users talk about how they use Data Workspace below:

“Having recently been brought into the Civil Service as an International Trade Adviser (ITA), where I was previously a Delivery Partner, one of the greatest surprises to me is how accessible the data relating to my responsibilities is: both the detail and breadth of the data, but also the range of analytical tools I have access to. Another surprise is the freedom for experimentation – one can follow a whim as to what might be a useful analysis without having to justify it to anyone in advance. Having the proper tools goes beyond the boundaries of what an over-stretched spreadsheet can produce, and these can be easily shared with colleagues to use themselves. Choosing an appropriate level of granularity can make the same data useful as either an overview or for detailed examination.

On the topic of support, the Data Workspace team offer a great amount of technical support and patience whilst learning to use these tools, which is very encouraging. Data analysis allows us to accurately understand a situation and look in the right place for solutions, and allowing the average user to do this themselves is extremely powerful.”

- Stephen Banks (International Trade Adviser)

“I'm Tayyib and I'm on the Digital, Data & Technology Fast Stream. Previously, I was a Data Scientist at the National Situation Centre (SitCen) in the Cabinet Office. My role in SitCen largely involved analysing data from a range of sources across HMG to produce meaningful insight for senior officials during crises. I’ve moved across to DIT as a Data Engineer, building new pipelines and developing the Data Workspace environment in line with GDS service standards.

My previous experience with data infrastructure largely involved using Amazon Web Services (AWS) and other related analytical tools for ingestion, data processing and analysis. AWS is a great platform; however, it can be relatively inaccessible and at times overwhelming for less data-experienced and tech-savvy users.

Data Workspace uses an AWS back-end but spins the infrastructure into a significantly more accessible platform, removing a lot of the hassle. It means that less data-experienced and tech-savvy users can easily pull data using SQL, produce analysis in Python or R, and visualise insight by creating dashboards in AWS QuickSight or maps in QGIS. A huge advantage of using Data Workspace is that it can also be shared across HMG and hosts the bulk of data that DIT uses on everything related to trade, economic productivity and investment.”

- Tayyib Saddique (Data Engineer)

If you want to work with our team designing and supporting Data Workspace, you can sign up for job alerts here.
In 2020, we (Harriet, Michael and Hillary) created a space where public sector employees from any department could come together and discuss data ethics, technology and society. The aim of the space was to breathe life into the concepts and topics mentioned in data ethics frameworks.
Hillary and Harriet had attended a DataKind ethics book club where Ruha Benjamin's Race After Technology was discussed. Unlike the DataKind event, which is open to all, we felt it was important to create a safe environment where public sector workers could freely discuss challenging topics such as feminism, race and sexuality, and examine with a critical lens how these ethical issues relate to our work.
To ensure equality, we believe it is important to be able to freely discuss inequalities, and where and how we can apply data ethics frameworks in our work.
And thus, the Cross-Government Data Ethics and Society Reading Group was born!
With the exception of Hillary, none of us were particularly skilled at event planning. Discussion groups began rather haphazardly, as we tried to figure out what worked best in a (virtual) public sector context.
The first book we read was Data Feminism by Catherine D’Ignazio and Lauren F. Klein, a book which is now entirely open access. We split our discussion of it across three sessions due to its length and the breadth of its contents (ranging from an initiative to record cases of femicide in an open, accessible manner, to the Gender Shades project).
We had great attendance at our first event, and we would undoubtedly recommend engaging with the discussion in this book; in retrospect, though, it was perhaps not the best starting point for data ethics. Each chapter could have been a standalone book in itself! Discussion points are definitely key!
We pitched our second event to be more like a journal club, bringing together journal and media articles we had found interesting, loosely themed around bias in data. This format was however less popular than a traditional ‘book club’-style reading and discussion group.
We again returned to the book format to discuss Kate Crawford’s Atlas of AI, which presents AI as a technology of extraction: from the minerals drawn from the earth right through to the labour of low-wage information workers.
Attendance varied substantially between events, with many no-shows. It was difficult to predict what would attract more people, but each event took the same amount of time to coordinate.
At the start of this year, we reflected on our sessions so far. We have introduced more structure, committing to four sessions per year: one per book, with suggestions of particular chapters of interest. You will now see us promoting our events via the Analysis Function and Operational Research & Statistical Service newsletters, in addition to the #ethics channel on the cross-government data science Slack.
Some colleagues come to every session; others are intrigued by a particular book or author. We want to encourage everyone and anyone to engage with the material. You do not have to be someone who works with data; all you need is curiosity and a willingness to think critically about the material.
This year, we have read:
A Data Ethics and Society Reading Group session runs for an hour over lunchtime. A date for the session and a book are confirmed (using your suggestions!) a few months in advance.
Numbers vary: sometimes we have smaller groups, but recently attendance has been closer to 50!
After a short introduction to the group and material, attendees are split into breakout rooms to discuss the book in smaller groups. Hosts suggest discussion questions, but each breakout room tends to take discussion in different directions. At the end of the session, we come back together to share highlights.
No notes or recordings are taken from the sessions, and whilst we encourage everyone to participate, we know that people engage in a multitude of different ways. We are open to suggestions on how we make the group more inclusive and cater for the needs of everyone. If you have an idea, please get in touch!
We are sure that the reading group will continue to shape-shift as time goes on. For now, we hope that it serves as a starting point for public sector colleagues to critically engage with data ethics alongside data practitioners.
We are always looking for recommendations and suggestions of what to read and discuss next. We also encourage guest-hosts of the event, particularly in pairs.
We have a full reading list on our website. Some highlights include:
Our website is powered by GitHub, which means that it is really easy for you to suggest reading material and be listed in our ‘contributors’ section. You can raise an issue listing all the information we need.
On Tuesday 6th December, we will be running our final session of 2022, focusing on Artificial Unintelligence by Meredith Broussard.
Sign up online.
To hear about events in future, you can sign up to our mailing list.
Michael and Harriet
We meet users where they are by providing feedback options on every GOV.UK page. Unfortunately, this also creates a lot of avenues for spam responses. These responses can dilute our insights, cause security concerns, and prevent real problems from being identified.
A multi-disciplinary team responded to this problem by developing a machine learning spam classifier. The process is part of an upgrade to the whole user feedback pipeline at GOV.UK, aiming to put critical insights in the hands of decision-makers more quickly. This post will explore the decision to use machine learning, how we built our solution, and the plan for next steps.
At GOV.UK, we received around 540,000 feedback responses from the public and other departments in 2021. Users can choose from several options on the website to comment on a range of topics, but are not actively prompted.
In early 2022, we saw spam responses surge due to a technical change on the front end, peaking at 12% of total feedback. These responses ranged from fraudulent advertisements and links to pornographic content, to multiple lines of code and incomprehensible combinations of characters.
To extract insights, colleagues suddenly needed to manually filter out tens of thousands of unusable responses over the year. The extra “noise” in the data made it nearly impossible to automate this extraction, as insights were diluted by spam. There was also a security risk to consider, as individuals could attempt to negatively exploit feedback mechanisms to disrupt the usual workings of GOV.UK.
We recognised that colleagues needed to derive their insights quickly, without having to manually filter out spam. Based on the roughly 35,000 feedback responses we receive per month, we calculated that manually categorising spam could take up to 4,000 hours per year. This risked costing the civil service hundreds of thousands of pounds in accumulated hourly salaries for something that was theoretically simple to automate.
We saw the challenge as a great application for a machine learning (ML) model. ML models are computer programs that can automatically improve their own performance (judged by a chosen metric) as they are given access to more data. They are popular and powerful solutions for classification problems, and often deployed for use on email spam.
In supervised spam detection, the task is to predict the label of “spam” or “not spam” (the latter known as “ham”) for each feedback response, as well as a probability that the prediction corresponds to the true label. The method is to provide the model with a “training set”: many examples where humans have assigned the labels that we would like it to replicate. We then assess performance by passing through new data that the model has not been trained on (the “test set”) and comparing the model’s predictions with the true labels.
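A minimal sketch of that workflow using scikit-learn (the derived features and tiny dataset here are invented for illustration):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical features derived from each feedback response
df = pd.DataFrame({
    "num_links":   [0, 4, 0, 7, 1, 0, 5, 0],
    "prop_digits": [0.0, 0.3, 0.1, 0.5, 0.0, 0.02, 0.4, 0.01],
    "is_spam":     [0, 1, 0, 1, 0, 0, 1, 0],  # human-assigned labels
})

X, y = df[["num_links", "prop_digits"]], df["is_spam"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# predict_proba gives a probability for each class, so we can demand
# high confidence before labelling a response as spam
spam_prob = model.predict_proba(X_test)[:, 1]
print((spam_prob > 0.8).astype(int))  # 1 = spam at a strict threshold
```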
Problems suited to ML can often be solved almost as effectively in less time and with less data by heuristic methods, so it is best practice to test these first. Before deploying ML to our problem, we experimented with “rules-based” spam detection on the new data pipeline built by the Data Insights team.
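Those rules were simple, hand-written checks along these lines (the patterns here are illustrative, not our real rule set):

```python
import re

RULES = [
    lambda t: len(re.findall(r"https?://", t)) > 2,                # link-heavy
    lambda t: bool(re.search(r"crypto|casino", t, re.I)),          # spam terms
    lambda t: sum(c.isdigit() for c in t) / max(len(t), 1) > 0.4,  # mostly digits
]

def is_spam(text: str) -> bool:
    # Flag the response if any single rule fires
    return any(rule(text) for rule in RULES)

print(is_spam("Win big at http://a.example http://b.example http://c.example"))
```

The weakness is visible in the structure: each rule fires independently, so spam combining several weak signals, none strong enough to trip a single rule, slips through.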
We quickly saw that we would benefit from ML’s ability to predict the probability of class membership as well as class labels, as rules-based methods appeared to struggle with spam that had subtle combinations of indicators. With ML, we could use the probability score to demand a high level of confidence in our model’s predictions, reducing the mislabelling of legitimate feedback. We could also ask for a breakdown of “feature importance”, showing us which characteristics were most likely to be present in responses labelled as spam.
ML therefore provided some clear benefits for our problem. We followed best practice by deploying rules-based approaches first, and located the areas where ML would have the advantage. Once we had made the decision, we focused on the tools and techniques that would help deliver a working solution as soon as possible.
To align with agile principles of rapidly delivering a working solution, we prioritised quick development throughout the build. We used the Machine Learning Canvas to quickly define the scope of our problem, identify blockers to deployment, and assess the readiness of our datasets.
We identified the need to accelerate the selection of an appropriate ML classifier. To do this, we used PyCaret to automate the comparison of classifier models that we could fine-tune, using a simple function call. This helped us decide on a Random Forest Classifier, a form of ensemble learning that uses multiple decision trees to make an aggregated prediction.
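The comparison step looked roughly like this (the file name is hypothetical; see the PyCaret documentation for the full API):

```python
import pandas as pd
from pycaret.classification import compare_models, setup, tune_model

# Labelled feedback responses with derived features (hypothetical file)
labelled_feedback = pd.read_csv("feedback_labelled.csv")

# Point PyCaret at the data and the target column
setup(data=labelled_feedback, target="is_spam", session_id=42)

# Cross-validate a suite of candidate classifiers and return the best;
# for us this pointed to a random forest
best = compare_models()

# Fine-tune the winning model's hyperparameters
tuned = tune_model(best)
```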
The canvas also helped quantify the complexity of our datasets, and the need to standardise the versions used across teams. We implemented Data Version Control (DVC) to version-control our datasets and models, and to ensure the consistency and reproducibility of our results.
Once we had built a performant model, we used a confusion matrix to visualise the occurrence of false positives, where real feedback was identified as spam. We then ranked individual feature importances to understand which derived features had the most impact on the model's predictions.
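Continuing the earlier scikit-learn sketch (reusing its fitted `model` and test split), those checks look something like this:

```python
from sklearn.metrics import confusion_matrix

# Rows are true classes, columns are predictions; the top-right cell
# counts real feedback wrongly flagged as spam (false positives)
print(confusion_matrix(y_test, model.predict(X_test)))

# Rank the derived features by their influence on the forest's decisions
for name, score in sorted(
    zip(X.columns, model.feature_importances_), key=lambda p: -p[1]
):
    print(f"{name}: {score:.2f}")
```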
Tools such as PyCaret and DVC meant that we were able to focus on deploying a working solution at pace. We took an agile approach, testing rules-based methods before ML, and used the Machine Learning Canvas to streamline planning and set priorities for quick development and iteration.
The first iteration of our spam classifier is capable of delivering huge time savings to GOV.UK. We can run it on over a month’s worth of feedback data (around 40,000 responses) in less than five minutes, a fraction of the time it takes human reviewers.
Now, a careful iteration process is important for combating spammers that adjust terminology to outwit filters and cause “model drift”. We will deploy the model to larger, more complex feedback datasets and engineer improved features. We hope to improve our model’s accuracy by finding a classification threshold that strikes the optimum balance between “precision” and “recall”.
Open development remains integral to our approach, helping us refine collaboration across teams, test novel techniques, and speed up the processing of tens of thousands of feedback responses every month. If you are interested in seeing how this project progresses, subscribe to the GOV.UK blogs where we will post updates on the integration process, and the real world results for GOV.UK colleagues.
]]>
A common data quality problem is to have multiple different records that refer to the same entity but no unique identifier that ties these entities together. For example, customer data may have been entered multiple times by accident, or have been entered in multiple IT systems separately.
Record linkage (sometimes known as entity resolution, or data matching) is a technique to link these records, enabling data to be deduplicated and joined between systems.
At the Ministry of Justice, we have developed an open source library called Splink to improve our record linkage methodology. This has enabled us to share new linked datasets with accredited researchers, as part of the ADR UK-funded Data First programme.
What is Splink?
Splink is a free library for fast and accurate record linkage, and is now in its third version.
How do I get started?
Splink is a free Python package that can be installed in the usual way - using ‘pip install splink’.
We recommend users start by looking at our online tutorial, which is part of our main documentation website. The tutorial runs through a full record linkage example, from exploratory analysis right through to prediction and graph analytics, and it can even be run interactively in your web browser.
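As a taste of the API, deduplicating a single table with Splink 3 looks roughly like this (the column names and parameters are invented, and the interface may have evolved, so treat the tutorial as authoritative):

```python
import pandas as pd
import splink.duckdb.comparison_library as cl
from splink.duckdb.linker import DuckDBLinker

df = pd.read_csv("people.csv")  # hypothetical input records

settings = {
    "link_type": "dedupe_only",
    "comparisons": [
        cl.exact_match("date_of_birth"),
        cl.levenshtein_at_thresholds("surname", 2),
    ],
    # Blocking: only generate comparisons for pairs sharing a postcode
    "blocking_rules_to_generate_predictions": ["l.postcode = r.postcode"],
}

linker = DuckDBLinker(df, settings)
linker.estimate_u_using_random_sampling(max_pairs=1e6)
predictions = linker.predict()  # pairwise match scores
```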
How it works
Splink is an implementation of the Fellegi-Sunter model. The software generates pairwise record comparisons using an approach called blocking, and computes a match score for each pair which quantifies the similarity between the two records.
The match score is determined by parameters known as partial match weights. These quantify the importance of different aspects of the comparison.
For example, a match on date of birth lends more evidence in favour of two records being a match than a match on gender. A mismatch on postcode may provide weak evidence against a match because people move house, whereas a mismatch on date of birth may be stronger evidence against the record being a match.
This simple idea is powerful, and can be used to build highly nuanced models. Partial match weights can be computed for an arbitrary number of user-defined scenarios, not just a match or non-match. For example, a partial match weight can be estimated for a scenario where postcodes do not match but are within 10 miles of each other.
These partial match weights are combined into an overall match score, which represents the weight of evidence that the two records are a match.
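To make the arithmetic concrete, here is a toy calculation combining two partial match weights into an overall match probability (the m- and u-probabilities and the prior are invented for illustration):

```python
import math

# m: probability a field agrees when the records are a true match
# u: probability it agrees when they are not
fields = {
    "date_of_birth": {"m": 0.95, "u": 0.01, "agrees": True},
    "postcode":      {"m": 0.70, "u": 0.005, "agrees": False},
}

prior_odds = 1 / 1000          # starting odds that a random pair matches
log2_odds = math.log2(prior_odds)

for name, f in fields.items():
    if f["agrees"]:
        w = math.log2(f["m"] / f["u"])              # evidence for a match
    else:
        w = math.log2((1 - f["m"]) / (1 - f["u"]))  # evidence against
    print(f"{name}: partial match weight {w:+.2f}")
    log2_odds += w

probability = 2**log2_odds / (1 + 2**log2_odds)
print(f"overall match probability: {probability:.3f}")
```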
A more detailed video description of how this all works can be found here.
Get in touch
If you work for the government and would like help getting started with your data, please don’t hesitate to get in touch at robin.linacre@digital.justice.gov.uk. You can also ask us a question or raise an issue against the main (public) GitHub repository.
Reproducible Analytical Pipelines (RAPs) are automated statistical and analytical processes. They ensure that analysis is reproducible, efficient, and high quality.
The Analysis Standards and Pipelines team have worked with teams across government to help them implement RAP. From our experiences we have developed an approach that works to build capability in the team, deliver a working and valuable product, and create enthusiasm for a new way of doing analysis.
We approach building RAP capability in teams by:
Managers are happier when they see quick wins. We work with analysts to identify small parts of the existing pipeline which have time-consuming manual steps or are the riskiest and prioritise these for development. We keep managers enthused by sharing successes.
"Just-in-time learning", where new techniques or principles are taught to analysts just before they use them, helps analysts to learn best practices. This allows them to embed their new knowledge by using it, rather than taking an “Intro to R” course months before they ever get to use it.
Paired programming, where two or more analysts code together with one “driving” and the other reviewing, spreads knowledge, aids peer review, and speeds up code development.
Agile is an approach to managing software development that helps teams deliver products to their customers faster. A big part of agile working is understanding user needs: gathering and understanding them allows analysts to create code which is fit for purpose. Frequent conversations with users, and showing them working versions of the product, help us understand whether needs have changed so we can adjust our code appropriately. Opening these lines of communication between analysts and users is vital for RAP to work effectively.
For this approach to work we need the right team, the right tools and support from managers. Each member of the team should have a base level of coding ability but does not need to be an expert. In previous projects we have started with training a small team first. Once this team is ready, we bring in more contributors.
A range of skills in the team is useful to help a RAP project flow smoothly. Team members who know the outputs and end users are useful to keep the product fit for purpose. The team must be enthusiastic as it is challenging to learn this new way of working.
Analysts need dedicated time to learn RAP practices and to produce code the right way. Managers must be made aware of how much time this takes, and they must commit to giving their teams this time.
Contact us for support deploying RAP in your team.
We recently released our cross-governmental RAP strategy. You can also read the Quality Assurance of Code for Analysis and Research for a detailed look at RAP principles. My colleague Rowan has recently written a blog about building open source tools for analysts to help them implement RAP.
If you would like to get involved, check out the RAP collaboration slack channel and consider joining the RAP champions network.