Seminar held on afternoon of 7th March 2016
Report by Ian White
There has been much talk about dropping the traditional decennial population census and obtaining equivalent estimates from big data and administrative databases. The main aim of this seminar was to examine the extent to which this may be achievable or is just hype. This seminar explored those big data sources that are currently available or imminent, and what they can provide for geodemographics users. The main focus was on spatially referenced data that can be applied down to small area level.
The seminar was enthusiastically chaired by Suzy Moat, Associate Professor of Behavioural Science at Warwick University. She highlighted the three main reasons that had brought the delegates together for this event: firstly, a shared understanding of the importance of knowledge about those socio-demographic characteristics of the population that have hitherto been the domain of the Census; secondly, recognition of the development over the past few years in possible alternative data, derived, for example, from the use of loyalty cards and mobile phones and other administrative sources – she cited her own experiences in using online data sources to predict human behaviour in the real world, such as estimating crowd sizes; and thirdly, a line-up of excellent speakers.
In the opening presentation, Keith Dugmore (now MBE), Demographics User Group, looked at the question from the commercial users' point of view. He gave an overview of the early interest in Big Data and referred, in particular, to the McKinsey report in 2011 on ‘Big Data: the next frontier for innovation, competition and productivity’, and the more recent House of Commons Science and Technology Committee’s report on ‘The Big Data Dilemma’. He outlined the structure and role of the Demographics User Group, which comprises some 14 members representing some of the country’s major commercial and retail brands, and noted the increasing involvement with the academic community through links with the ESRC. Some of the burning questions currently being considered by the Group included: How are online sales changing? Will cash continue to be used? What’s the best source of parking data? What will be the impact of Crossrail on the South East economy? It was recognised that Big Data represented the best way for companies to better understand their customers. M&S, for example, held information on 21 million customer visits to stores and 60 million items of clothing and food sold per week. All its data were held for four or more years. In addition to such unit record data, Keith went on to note that other sources of Big Data included: sample surveys; aggregated statistics or estimates for Census Output Areas; geodemographic classifications; postcode directories and look-ups; social media; and map data. Such data could be found through the Government’s Open Data website, where the data were free and covered a wide range of topics but were sometimes difficult to find, or through value added resellers, where there was a cost involved but specialist expertise was available. Keith’s checklist for assessing the value of data included: Was the topic relevant? Was the coverage and quality sufficient? Was the geography detailed enough? Was UK-wide data available and accessible? Was it up to date? And had any damage been inflicted by statistical disclosure control?
Keith concluded by noting that DUG’s priorities were to see: a definitive National Address Gazetteer, and OS Map data enabled by the Public Sector Mapping Agreement; counts of people by location, and by time of day, from mobile phone data for Output Areas; aggregate statistics on income and wealth at Output Area level, created from government administrative files (now!); and for ONS to pool companies’ transaction data to create timely estimates of prices and growth, and statistics for small areas on market sizes and sales channels (especially online). All these data, he noted, were BIG.
Please see a copy of the presentation here.
Graham Smith (CACI), after reporting on CACI’s 40 years of experience in data provision, focused on the story of the development of Big Data so far, the new opportunities that such data provided, and the data privacy concerns that came with it. He reported on the recent data explosion by noting the claim that “Every day we create as much information as we did from the beginning of time until 2003”. Big Data was typically characterised by the four Vs: volume; variety; velocity; and veracity. Historically, geodemographers had relied on the Census as the only source of Big Data; even the 1971 Census recognised that the questionnaire was “The Big form with a Big job to do”. But now there was a variety of inputs. Among the sources of government and administrative data, Graham focused on the Land Registry, which provided a source of house price information consisting of more than 24 million definitive records dating back to January 1995, and more than three million title records of freehold and leasehold property in England and Wales. Examples of commercial sources of information on names and addresses included the edited Electoral Roll, lifestyle data sources, transactional databases, and niche databases, which together were used to create a 48-million-record Consumer Register to serve as a ‘spine’ and validation file for individual and household level variables. As an example of researched and derived data, Graham cited the use of information on names and addresses, together with date of birth data sources, to model ages based on forename and other known attributes in order to create a full sex/age profile for every residential household and postcode. He noted, however, that the Census remains important for calibrating demographic classifications.
Graham went on to describe the opportunities for Big Data from the use of social media, card transactions, mobile phone applications, and smart metering. He noted, however, that customer-based data was only part of the story and purchasing decisions were still predicated on affluence and lifestage. In many cases customer data could only provide basic demographics. Geodemographics, on the other hand, added colour and context to customer segmentations and analysis.
Graham summarised the principal data privacy legislation issues covering awareness, compliance and supply, and concluded by noting some of the key considerations in using Big Data, notably: has permission been obtained to use the data; are there barriers to obtaining, processing and linking the data, particularly the linkage to an address; are there sufficient demographic variables; and how inherent is the bias?
Please see a copy of the presentation here.
In a joint presentation Ben Smith and Nick Henthorn (Telefónica) described what Telefónica does with its mobile data and how it could help other companies. Nick stressed that although data was currently available on 24 million customers, measures were taken to ensure privacy protection: data was anonymised, aggregated, and extrapolated by applying an algorithm to represent the whole population. Some two billion usable events are recorded and the data are kept for two years. Data is available on age, home location (morning/evening), work location (daytime), affluence indicator, lifestage (evidence of children), mobile usage, regular commuting route, and print media (Facebook, Twitter, photos). Such data can typically be used for journey monitoring, profiling (for purposes such as digital advertising on the tube) and retail decision making. The importance of the Census was, however, still recognised.
Ben illustrated some specific uses of mobile phone data for identifying mode and speed of transport. Repeated patterns of events can provide a profile of regular journeys; train travel is typified by data referring regularly to the same location and time periods. Flows of visitors into and out of a particular region can be deduced, and High Street retail profiling by location of outlet can identify age and dwelling type of customer and time of purchase.
Ben Anderson (Southampton University) took the audience on a quick tour of how the UK Census has evolved since the 1970s, and referred to the aim of the Beyond 2011 Project to investigate new ways to deliver the census. He felt it was important not only to retain old census-like characteristics but also to encompass new census-plus variables, as well as achieving a higher frequency of data collection and attracting new user markets. He noted the opportunities and benefits that data from electricity metering offered. Uptake was almost universal, in contrast to the less than complete coverage of water and gas metering. Data on usage can support assumptions about some census-like characteristics such as household size, dwelling type and tenure, as well as ethnicity and economic activity of the householder. There is interest in using level of consumption profile indicators to create census-plus characteristics such as life stage, household income and floor space.
Please see a copy of the presentation here.
Andy Teague and Jane Naylor (ONS) gave an update on ONS’s consultation on the plans for the 2021 Census and its continuing research into the aspirations, expectations and challenges in using administrative and big data as alternative sources for the future derivation of census-type data. Andy summarised the outcome of the consultation and research programme of the Beyond 2011 Project that led to the Government’s acceptance of the National Statistician’s recommendation to carry out a full Census in 2021, with the target of achieving a 75 per cent online response, and to continue research into the best way to use administrative data both to support and enhance the 2021 Census and to improve annual statistics between censuses, with the longer-term aim of replacing the traditional data collection methodology thereafter. He went on to summarise the aims of ONS’s Census Transformation Programme to use new legislation to create easy, flexible and rapid access to both existing and new data sources with the ability to link unit record data efficiently and accurately. Such methods must be able to produce statistical outputs of sufficient quality to meet user needs as well as being acceptable to the public and Parliament. There was a general acceptance, however, that administrative and big data alone would not provide a complete solution, and that data on some characteristics would still have to be collected by more traditional surveys. There would be regular assessments of the research, to be published annually, starting this year, to enable feedback from users on data quality so that, over time, methodologies would improve the balance between range of topics, geographical detail, and accuracy/timeliness of outputs. ONS would build towards a recommendation in 2023 on whether or not a census based on administrative data and surveys can provide statistics of the required quality. A major challenge would be to do this without a national population register. Andy noted that we would be the first country in the world to attempt to do so.
He summarised, with some specific case studies, the potential of a number of current administrative sources to provide data on those characteristics that are regularly covered in the census.
Jane considered Big Data to be data that was not obtained from a census, a survey or administrative sources. She highlighted the potential benefits and ethical concerns in using data from some of the sources that were currently being investigated, such as Twitter (to gain insights into mobility and migration among, for one specific example, students), Zoopla (to monitor the characteristics of the housing stock), smart electricity meters (to monitor occupancy levels) and mobile phones (to model population density and commuting flows). She noted the opportunities that were offered through the use of such data (such as the reduction in public burden and cost, improvements to the timeliness of the data, and increased efficiency through the re-use of data) but was also aware of some challenging issues (such as addressing statistical bias, coverage and definitional inconsistency, the means of accessing the data, and privacy concerns).
Please see a copy of the presentation here.
The seminar concluded with the speakers being invited to respond to a number of questions, concerns and issues raised by the audience. These covered:
The meeting concluded with thanks to Barry Leventhal and the Census Geodemographic Group for organising the day and to Suzy Moat for chairing the event.
Date: Afternoon of 7th March 2016
Venue: MRS, The Old Trading House, 15 Northburgh Street London EC1V 0JR