Data Week: How to manage your data
What the latest UK scandal tells us about data entry and management.
Welcome to Plugging the Gap (my email newsletter about Covid-19 and its economics). In case you don’t know me, I’m an economist and professor at the University of Toronto. I have written lots of books including, most recently, one on Covid-19. You can follow me on Twitter (@joshgans) or subscribe to this email newsletter here.
As regular readers know, I occasionally like to have a theme for the week. This week I am going to look at some issues associated with data: its collection, its collation and, today, its management. Today’s post looks at a recent UK government scandal involving Excel file formats.
The case of the missing cases
Because of my interest in artificial intelligence, people often ask me what Big Data is. To be more precise than ‘a ton of data’, the definition I have landed on is ‘too much data to be handled by an Excel spreadsheet.’ This is because an Excel spreadsheet is familiar to most people and it also has a limited amount of data it can store. Specifically, it is limited to just 1,048,576 rows, or 2 to the power of 20. Want data with 1,048,577 entries or more? Then you’ll need another program. (This is actually an advance in Excel spreadsheets. Back in the day, before Excel 2007 introduced the .xlsx format, you were limited to 65,536 rows.)
Well, unless you use the columns to represent data points, which is something … I can’t even. In that case, your column limit is 16,384 (or 2 to the power of 14), which is pretty small data as these things go.
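To make those caps concrete, here is a minimal sketch in Python (my own illustration; the constant names and the helper function are hypothetical, though the limits themselves are Excel’s documented ones) that checks whether a table would even fit on a single worksheet:

```python
# Documented worksheet limits: modern .xlsx versus the legacy .xls format.
XLSX_MAX_ROWS = 2**20   # 1,048,576 rows
XLSX_MAX_COLS = 2**14   # 16,384 columns
XLS_MAX_ROWS = 2**16    # 65,536 rows in the legacy format
XLS_MAX_COLS = 2**8     # 256 columns in the legacy format

def fits_in_one_sheet(n_rows: int, n_cols: int, legacy: bool = False) -> bool:
    """Return True if a table of n_rows x n_cols fits on a single worksheet."""
    max_rows, max_cols = (XLS_MAX_ROWS, XLS_MAX_COLS) if legacy else (XLSX_MAX_ROWS, XLSX_MAX_COLS)
    return n_rows <= max_rows and n_cols <= max_cols

print(fits_in_one_sheet(25_000, 10))                # True: fine as rows
print(fits_in_one_sheet(10, 25_000))                # False: columns cap out at 16,384
print(fits_in_one_sheet(100_000, 10, legacy=True))  # False: legacy rows cap at 65,536
```

Hold that legacy row limit in mind; it comes back shortly.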
So how does this relate to the pandemic? Well, if you are in the business of collecting pandemic numbers and the number of cases is growing exponentially, then you might hit capacity limits in your software. The case in point: the United Kingdom this week.
The government agency that helps oversee the UK’s pandemic response, Public Health England (PHE), said some 15,841 cases had been left out of national totals because of the error but did not specify what caused the so-called glitch. …
The 15,841 “missing” cases made public today were originally recorded between September 25th and October 2nd. All those who tested positive for COVID-19 were notified by the UK’s health authorities, but the failure to upload these cases to the national database meant anyone who came into contact with these individuals was not informed. It’s an error that may have helped spread the virus further through the country as individuals exposed to the virus continued to act as normal.
Hang on a second, why did the Excel limits matter? The UK now has around 25,000 daily cases, which is well below 1 million. But it is above 16,384 and … you can see where this is going.
But even that isn’t entirely satisfying. If you reached some sort of cap, you would notice that the numbers were no longer going up and the error might have been found quickly.
Part of the clue comes from the fact that one lab was apparently sending the government data in a CSV format (Google it if that doesn’t hit a nerve) and the government was then converting it for use in its own database, which is why the cap wasn’t noticed. Even then, a million is a big number.
This video, which is worth watching if you want to deep dive into spreadsheet news, identifies the potentially critical piece in the puzzle as the use of the older, pre-2007 .XLS file format in whatever macro the UK government was using to take the lab CSV data and merge it into its broader database.
From Slate:
According to the BBC, the error was caused by the fact that Public Health England developers stored the test results in the file format known as .XLS. These .XLS files were then sent to the NHS after uploading to a central system. .XLS is an outdated file format, however, and each spreadsheet can have only 65,000 rows. By contrast, the .XLSX file format, which was first released in 2007, allows for more than 1 million rows. Because of the limited number of rows, each spreadsheet could contain about 1,400 cases, leaving excess cases off the file altogether. Although the issue was reportedly fixed by splitting the files into smaller batches, many are slamming Public Health England. “Why are critical databases in a national pandemic posted on Excel spreadsheets?” Jonathan Ashworth, the Labour Party’s shadow health secretary, said. “Why aren’t they using specialist data-based software?”
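No one outside PHE has published the actual pipeline, but the failure mode described in that quote is easy to sketch. The toy Python snippet below (entirely my own illustration; the function name and CSV layout are made up) shows how a conversion step built around the legacy 65,536-row cap quietly drops everything past the limit instead of failing loudly:

```python
import csv

XLS_MAX_ROWS = 65_536  # legacy .xls worksheet row limit

def load_into_legacy_sheet(csv_path: str) -> list:
    """Read lab results from a CSV, keeping only what fits in one .xls sheet."""
    rows = []
    with open(csv_path, newline="") as f:
        for i, row in enumerate(csv.reader(f)):
            if i >= XLS_MAX_ROWS:
                break  # everything past the cap is silently discarded
            rows.append(row)
    return rows
```

The point is the silent `break`: nothing crashes, the file still saves, and if each case takes up several rows of test data (the quote above puts it at about 1,400 cases per file), the totals simply stop at the cap while the excess never reaches the national database.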
Anyone who had ‘government scandal over Excel file formats’ in their 2020 event calendar can take a bow now.
Was this a big error?
It is hard to say, but the BBC did some number crunching (and I don’t know what software they used) and showed that it was pretty significant, as the following graphs and map show.
That is not good. It is so not good that I would hope that every data manager for every government went rushing to their systems to see if it could happen to them.
Do the right thing
The lesson from all this is very simple: when you have to manage mission-critical data, you have to ensure that your management system is robust. The UK did what I suspect many governments have done and cobbled together its pandemic data collection systems on the fly. It had to deal with labs that had their own procedures and then it had to stitch things together. That worked until it didn’t and, frankly, the mistakes seem big enough to have been policy-relevant.
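In practice, ‘robust’ can start with something as unglamorous as reconciling counts at every hand-off. Here is a hypothetical sketch (again my own illustration, not anything any government actually runs) of the kind of check that turns silent data loss into an immediate, visible failure:

```python
def reconcile(records_sent: int, records_loaded: int, source: str) -> None:
    """Fail loudly if records were lost between the lab and the national database."""
    if records_loaded != records_sent:
        missing = records_sent - records_loaded
        raise RuntimeError(
            f"{source}: expected {records_sent} records, loaded {records_loaded} "
            f"({missing} missing)"
        )

# A mismatch like reconcile(80_000, 65_535, "Lab A") raises immediately,
# rather than quietly under-reporting cases for a week.
```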
Could this have been avoided? It certainly seems so. In this regard, I always come back to this essay by DJ Patil, the former Chief Data Scientist of the United States, who spent the early days of the crisis working with the California government and drew six lessons for us all.
He wrote this, which was somewhat prophetic:
Much of our current public health infrastructure was built to manage tuberculosis and measles and it is not up to the task of managing COVID. These outbreaks tended to be small enough to count cases in a notebook and in narrowly defined populations, impacting hundreds rather than millions of people and only a few hospitals. The evidence for the antique-ness of our public health infrastructure is the difficulty obtaining timely, accurate information on the number of infections, hospitalization rates, the infected fatality ratio, and deaths. As a result, journalists and citizen-scientists have filled a gap with their own sophisticated efforts such as The COVID Tracking Project, The New York Times, the Los Angeles Times, the San Francisco Chronicle, Propublica, Johns Hopkins University, etc.
The basis of our public health reporting system is the case report. This onerous process requires significant time from both physicians and public health officials, often resulting in only partially completed forms limiting the insights that can be drawn from the data. Additionally, the data collected are aggregated by public health officials before being sent to the state and CDC. The standard for a case report is that the information is comprehensive and perfect, even peer reviewable, which makes them highly useful in retrospect but also full of confidential health information which makes them hard to share. The resulting time lag and arduous reconciliation process is frustrating to both policy makers and the public.
And he had a solid recommendation:
Given advances in the data infrastructure and cloud computing we need to invest in a unified data infrastructure to support the reporting of COVID-19 cases from both the increasing venues where testing will be conducted (including at home tests and daily worker tests) and the traditional medical centers. This system will need to have seamless interoperability with the electronic medical systems of the major hospital networks, support the mobile surveillance testers, and interoperable with other states’ systems. Finally, we need an improved model to match data from patients as they move across regions. After all, COVID-19 will not adhere to state or political lines.
This pandemic is going to be with us long enough that it is still worthwhile to build the system to get this right. Governments should take notice.