THE MICS GIS INITIATIVE: HOW DO WE ANONYMISE SPATIAL DATA?

28 October 2022

Since 2021, the MICS programme has been working towards making geospatial data available for as many surveys as possible.

MICS GIS

Geospatial data can enable higher-resolution analysis and yield more targeted and effective programming. One of the main challenges in using geospatial data with household surveys like MICS lies in the ethical use of data. The location of clusters is an essential element of geospatial analysis. But to maintain the confidentiality of households, the precise location of clusters must not be revealed, and each cluster must be displaced within a set of parameters. How do we then ensure that the displaced location is close enough to be useful but not so close that it compromises the anonymity of those who are surveyed?

The overall procedure for the geographical displacement of MICS cluster geocodes is based, with small modifications, on the current global gold standard for anonymising cluster locations, developed by the Demographic and Health Surveys (DHS) program. It should be noted that the DHS methodology is under constant refinement, as described here.

The process of displacement and anonymisation of cluster geocodes is achieved through a custom-built software plugin working on the open-source Geographic Information System (GIS) software QGIS.

 

The anonymisation process

The anonymisation of cluster geocodes takes two inputs: (1) the geocodes of clusters, supplied as cluster boundaries (polygon layer), cluster centroids (point layer), multiple points per cluster (multi-point layer), or sampled household coordinates (point layer); and (2) boundary shapefiles representing the MICS sampling frame at one level lower than the sample stratification/reporting domains (e.g., national admin level 2).

Input data for the anonymisation of cluster geocodes: cluster geocodes (green dots) and boundary shapefiles (yellow polygons).


 

Before clusters are displaced, their centroids are determined. How a centroid is determined depends on the type of cluster geocode input data. Where possible, the cluster centroid is calculated as a pole of inaccessibility (the point within a given area that is farthest from its boundary). If the pole of inaccessibility cannot be determined (e.g., there are only one or two points in the cluster), the centre of the cluster is taken to be the centroid of all input points. In the example above, cluster centroids are already available.
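The choice between the two centre definitions can be sketched in a few lines of Python. This is an illustration only, not the plugin's code: the function names are hypothetical, coordinates are assumed to be planar (projected) values, the input vertices are assumed to form an ordered polygon ring, and the pole of inaccessibility is approximated with a coarse grid search rather than an exact algorithm.

```python
import math


def point_in_polygon(x, y, ring):
    """Ray-casting test; `ring` is a list of (x, y) vertices, closed implicitly."""
    inside = False
    n = len(ring)
    for i in range(n):
        (x1, y1), (x2, y2) = ring[i], ring[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal ray
            if x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
                inside = not inside
    return inside


def distance_to_segment(x, y, a, b):
    """Shortest planar distance from (x, y) to the segment a-b."""
    (ax, ay), (bx, by) = a, b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    t = 0.0
    if length_sq > 0:
        t = max(0.0, min(1.0, ((x - ax) * dx + (y - ay) * dy) / length_sq))
    return math.hypot(x - (ax + t * dx), y - (ay + t * dy))


def approximate_pole(ring, steps=50):
    """Grid point inside `ring` that lies farthest from the boundary."""
    xs, ys = [p[0] for p in ring], [p[1] for p in ring]
    n = len(ring)
    best, best_dist = None, -1.0
    for i in range(steps + 1):
        for j in range(steps + 1):
            x = min(xs) + (max(xs) - min(xs)) * i / steps
            y = min(ys) + (max(ys) - min(ys)) * j / steps
            if not point_in_polygon(x, y, ring):
                continue
            d = min(distance_to_segment(x, y, ring[k], ring[(k + 1) % n])
                    for k in range(n))
            if d > best_dist:
                best, best_dist = (x, y), d
    return best


def cluster_centre(vertices):
    """Pole of inaccessibility when a polygon exists, else the plain centroid."""
    if len(vertices) >= 3:
        pole = approximate_pole(vertices)
        if pole is not None:
            return pole
    n = len(vertices)
    return (sum(p[0] for p in vertices) / n, sum(p[1] for p in vertices) / n)


l_shape = [(0, 0), (10, 0), (10, 4), (4, 4), (4, 10), (0, 10)]
print(cluster_centre(l_shape))           # a point inside the "L"
print(cluster_centre([(0, 0), (2, 4)]))  # two points only: their centroid
```

The L-shaped example shows why the pole of inaccessibility is a useful centre: the plain average of the six vertices falls outside the polygon, whereas the pole always lies inside it.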

Once cluster centroids have been determined, they are displaced under a set of conditions. To protect respondent anonymity and avoid disclosing respondent locations, each cluster is displaced by a random angle and a random distance, on condition that it remains within its original subnational boundary (usually one level lower than the sample stratification). Both the angle and the distance are random, but the distance is capped by a set of parameters (as implemented by DHS): urban clusters are displaced by up to 2 kilometres and rural clusters by up to 5 kilometres, with a randomly selected 1% of clusters displaced by up to 10 kilometres.
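The displacement rule can be illustrated with a short Python sketch. It is not the plugin's implementation, and several details are assumptions made for the example: coordinates are planar values in metres, the boundary is a simple polygon ring, the 10 km limit is applied as a 1% chance per rural cluster (following DHS practice of drawing the 1% from rural clusters), and a displacement that falls outside the boundary is simply redrawn. All names are hypothetical.

```python
import math
import random

# Maximum displacement in metres, as stated in the text.
MAX_DISTANCE_M = {"urban": 2_000.0, "rural": 5_000.0}
EXTENDED_DISTANCE_M = 10_000.0
EXTENDED_SHARE = 0.01  # assumption: applied as a per-cluster probability


def point_in_polygon(x, y, ring):
    """Ray-casting test; `ring` is a list of (x, y) vertices, closed implicitly."""
    inside = False
    n = len(ring)
    for i in range(n):
        (x1, y1), (x2, y2) = ring[i], ring[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            if x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
                inside = not inside
    return inside


def displace_cluster(x, y, cluster_type, boundary, rng, max_attempts=1000):
    """Return a displaced (x, y) that stays inside `boundary`."""
    limit = MAX_DISTANCE_M[cluster_type]
    if cluster_type == "rural" and rng.random() < EXTENDED_SHARE:
        limit = EXTENDED_DISTANCE_M
    for _ in range(max_attempts):
        angle = rng.uniform(0.0, 2.0 * math.pi)   # random direction
        distance = rng.uniform(0.0, limit)        # random distance up to the cap
        new_x = x + distance * math.cos(angle)
        new_y = y + distance * math.sin(angle)
        if point_in_polygon(new_x, new_y, boundary):
            return new_x, new_y
    raise RuntimeError("no valid displacement found inside the boundary")


# Usage: a 20 km x 20 km square district and one urban cluster at its centre.
district = [(0, 0), (20_000, 0), (20_000, 20_000), (0, 20_000)]
rng = random.Random(2022)  # seeded only to make the example repeatable
print(displace_cluster(10_000, 10_000, "urban", district, rng))
```

Redrawing rejected points keeps every displaced cluster inside its district, which matters for clusters near a boundary, where a large share of candidate positions would otherwise land in a neighbouring area.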

Displaced cluster geocodes. The black lines link each cluster centroid (green dot) with the relevant displaced cluster centroid (red dot).


 

The final output data is a set of anonymised cluster locations in a CSV file format. As part of the software plugin, multiple existing open-source databases are used to calculate covariates. Covariates are calculated for the buffer zones of clusters. A buffer zone is the maximum area in which the cluster can be displaced. A future news item will present the topic of covariates in more detail.
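As a rough illustration of how a buffer-zone covariate and the CSV output fit together, the Python sketch below averages a gridded value over the buffer of each displaced cluster and writes one row per cluster. Everything specific here is an assumption for the example: the column names, the planar coordinates in metres, the use of a simple mean over grid-cell centres, and the buffer radii (set equal to the 2 km and 5 km displacement limits, ignoring the 10 km case). The plugin's actual covariates and file layout may differ.

```python
import csv
import io
import math

# Assumed buffer radius in metres: the maximum displacement for each cluster type.
BUFFER_RADIUS_M = {"urban": 2_000.0, "rural": 5_000.0}


def buffer_mean(cx, cy, radius, cells):
    """Mean value of grid cells whose centres fall within `radius` of (cx, cy)."""
    values = [v for x, y, v in cells if math.hypot(x - cx, y - cy) <= radius]
    return sum(values) / len(values) if values else None


def clusters_to_csv(clusters, cells):
    """Return CSV text with one row per displaced cluster (hypothetical columns)."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["cluster_id", "type", "x", "y", "buffer_m", "covariate_mean"])
    for c in clusters:
        radius = BUFFER_RADIUS_M[c["type"]]
        mean = buffer_mean(c["x"], c["y"], radius, cells)
        writer.writerow([c["id"], c["type"], c["x"], c["y"], radius, mean])
    return out.getvalue()


# Usage: three grid cells (x, y, value) and two displaced clusters at the origin.
cells = [(0.0, 0.0, 10.0), (1_000.0, 0.0, 20.0), (4_000.0, 0.0, 60.0)]
clusters = [
    {"id": 1, "type": "urban", "x": 0.0, "y": 0.0},
    {"id": 2, "type": "rural", "x": 0.0, "y": 0.0},
]
print(clusters_to_csv(clusters, cells))
```

Summarising over the whole buffer, rather than reading a value at the displaced point, means the covariate reflects every location where the true cluster could lie.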

The buffer zones around displaced cluster geocodes.


 

The original, unanonymised data are never shared outside the Global MICS Team without the explicit permission of the individual National Statistics Office (NSO). The anonymisation process for the traditional non-spatial data collected at household and individual level remains the same as always, ensuring that no respondent is identifiable in publicly available data.

The QGIS plugin carries MICS branding and terminology but can be used for any household survey. The plugin is currently in final review and will be published later this year together with a package of manuals. This will be announced in a future news item.

MICS geospatial location data will allow better coverage of progress at subnational level on topics including maternal and child health, education, nutrition, foundational learning skills, child and adult functioning, water quality, social transfers and victimization. These data will also provide more coherent information on progress towards the Sustainable Development Goals at the subnational level.