The World Health Organization estimates that air pollution causes more than 2.2 million deaths in the Asia Pacific region every year. Air pollution from fine particulate matter (PM 10 or 2.5) increases the risks of heart and lung diseases, stroke, and cancers, along with other diseases.
While efforts to monitor air quality have grown in recent years, ground monitoring stations remain sparse, especially in low- and middle-income countries, because of deployment costs and complexity.
As part of Thinking Machines’ and UNICEF’s AI4D (Artificial Intelligence for Development) Research Bank program, we tested the feasibility of training a machine learning (ML) model on remote sensing data to estimate fine particulate matter (PM2.5) concentrations. Our exploratory and foundational research focused on Thailand, one of the Southeast Asian countries where open burning of agricultural waste is practiced during the post-harvest season. Our work is fully open-sourced, including the following components:
PM2.5 ML models exclusively trained on open data. While our research focused on Thailand, we used globally available datasets to encourage similar research in other countries
Pre-processed datasets for model training and evaluation, including ground truth (PM2.5 data) and feature datasets, which represent aerosol, meteorological, human-related, and environmental factors
Haze estimation (PM2.5) model benchmarks for evaluating the performance of future research in Thailand
Scripts and technical documentation to replicate our research, including our PM2.5 data collection and feature generation scripts, and a demo notebook for rolling out the final model on a small user-selected area
Impact
Open-source Models and Data
Open-sourced haze estimation models and pre-processed data allow other data scientists to replicate our work and add to the growing body of knowledge on machine learning for haze estimation
Limited research and ML benchmarks for Thailand PM2.5 estimation
We know that machine learning can help fill the spatial gaps left by limited ground truth data.
Existing air quality studies in Thailand are often confined to regional or city-level analyses and focus on the relationships between various environmental factors and PM2.5.
Our team built on the research of Gupta et al. (2021), which applied ML to estimate PM2.5 using ground truth data from Thailand’s Pollution Control Department and meteorology and aerosol data from MERRA-2. We saw an opportunity to use updated PM2.5 data and increase the number of features to estimate haze across the country.
We trained an ML model that runs exclusively on open data to estimate PM2.5 in Thailand
To build the machine learning model, we selected open data counterparts to ground truth and feature datasets frequently cited in air quality research. We used readily accessible, global datasets such as
OpenAQ Ground-level PM2.5: air quality data from low-cost and reference-grade sensors across Thailand in 2021
CAMS Aerosol Information: predicted total aerosol optical depth (AOD) as the sum of tropospheric aerosols: sea salt, dust, organic and black carbon, and sulfates; we used estimates at the 550 nm wavelength
Sentinel 5P Aerosol Information: offline high-resolution imagery of the UV Aerosol Index (UVAI), also called the Absorbing Aerosol Index (AAI)
We combined these geospatial datasets, which come in different file formats (tabular data, admin boundaries, tiles, etc.), and ran a series of data validation and alignment checks in preparation for regression-based ML modeling.
For OpenAQ, we generated a 1x1 km bounding box around each air quality sensor to define its area of interest, and logged each station’s daily PM2.5 reading.
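As a rough illustration of the bounding-box step, the sketch below builds a 1x1 km box centred on a sensor's coordinates. The function name `bbox_around`, the spherical-Earth approximation, and the example coordinates are assumptions made for this sketch and are not taken from the project's scripts.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius (spherical approximation)


def bbox_around(lat, lon, side_km=1.0):
    """Return (min_lon, min_lat, max_lon, max_lat) for a square box of
    side_km kilometres centred on the point (lat, lon), in degrees."""
    half = side_km / 2.0
    # Angular half-height is the same everywhere on a sphere.
    dlat = math.degrees(half / EARTH_RADIUS_KM)
    # Angular half-width grows with latitude as meridians converge.
    dlon = math.degrees(half / (EARTH_RADIUS_KM * math.cos(math.radians(lat))))
    return (lon - dlon, lat - dlat, lon + dlon, lat + dlat)


# Hypothetical sensor location near Chiang Mai
box = bbox_around(18.7883, 98.9853)
```

At Thailand's latitudes the box is slightly wider in degrees of longitude than it is tall in degrees of latitude, which is why the two offsets are computed separately.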
For each feature, we collected the data within the area of interest and aggregated or distributed the values to achieve a daily granularity. We then joined our daily features with ground truth data.
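The aggregation and join described above can be sketched in plain Python as follows. The station IDs, readings, and the helper names `to_daily_mean` and `join_features` are hypothetical, and the sketch shows only the aggregation case (several readings per day averaged into one daily value).

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sub-daily feature readings: (station_id, ISO timestamp, value)
aod_readings = [
    ("st1", "2021-12-01T00:00", 0.25),
    ("st1", "2021-12-01T12:00", 0.75),
    ("st1", "2021-12-02T00:00", 0.375),
]
# Hypothetical daily ground truth: (station_id, date) -> PM2.5 in µg/m³
pm25_daily = {("st1", "2021-12-01"): 35.0, ("st1", "2021-12-03"): 28.0}


def to_daily_mean(readings):
    """Average all readings that share a station and calendar date."""
    buckets = defaultdict(list)
    for station, timestamp, value in readings:
        buckets[(station, timestamp[:10])].append(value)
    return {key: mean(values) for key, values in buckets.items()}


def join_features(ground_truth, **features):
    """Build one row per ground truth record, attaching each daily feature.
    A feature with no value for that station and day is stored as None."""
    rows = []
    for key in sorted(ground_truth):
        station, day = key
        row = {"station": station, "date": day, "pm25": ground_truth[key]}
        for name, daily in features.items():
            row[name] = daily.get(key)
        rows.append(row)
    return rows


training_rows = join_features(pm25_daily, aod=to_daily_mean(aod_readings))
```

Joining from the ground truth side keeps only the days that have a PM2.5 reading, and leaves gaps in the features visible as `None` so they can be handled explicitly later.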
Finally, we applied a simple mean imputation technique to fill in missing feature values and maximize the amount of usable data for ML training and evaluation.
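For readers unfamiliar with mean imputation, here is a minimal sketch, assuming rows are dictionaries and missing values are marked with `None`. The function name and sample values are invented for illustration; the project's own scripts may do this differently.

```python
from statistics import mean


def impute_column_means(rows, feature_names):
    """Return new rows in which each missing (None) feature value is
    replaced by the mean of that feature's observed values."""
    means = {}
    for name in feature_names:
        observed = [row[name] for row in rows if row[name] is not None]
        # A feature with no observed values cannot be imputed; keep None.
        means[name] = mean(observed) if observed else None
    return [
        {
            **row,
            **{n: (means[n] if row[n] is None else row[n]) for n in feature_names},
        }
        for row in rows
    ]


# Hypothetical joined rows with gaps in two features
rows = [
    {"pm25": 30.0, "aod": 0.5, "uvai": 1.0},
    {"pm25": 42.0, "aod": None, "uvai": 2.0},
    {"pm25": 18.0, "aod": 1.5, "uvai": None},
]
filled = impute_column_means(rows, ["aod", "uvai"])
```

With array data, scikit-learn's `SimpleImputer(strategy="mean")` performs the same column-wise replacement.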
Building a model for haze estimation
Our best-performing PM2.5 estimation model has a coefficient of determination (R²) of 0.8672, where 1 is the highest possible score. This is comparable with the R² scores reported in other research, which range from 0.5 to 0.9. The same model has a mean absolute error (MAE) of 4.0167 µg/m³, meaning that its estimates are off by this much on average.
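To make the two metrics concrete, the sketch below computes R² and MAE from their standard definitions. The observed and estimated values are invented for illustration and are unrelated to the model's actual results.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and estimated values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 minus the ratio of the residual sum
    of squares to the total sum of squares around the observed mean."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up PM2.5 values in µg/m³
observed = [10.0, 20.0, 30.0, 40.0]
estimated = [12.0, 18.0, 33.0, 37.0]

mae = mean_absolute_error(observed, estimated)  # (2 + 2 + 3 + 3) / 4
r2 = r2_score(observed, estimated)              # 1 - 26 / 500
```

An R² of 1 means the estimates explain all of the variance in the observations, while MAE stays in the original unit (µg/m³), which makes it easier to relate to air quality thresholds.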
Higher PM2.5 levels, from 55.5 to 250.5 µg/m³ and above, indicate poorer air quality. A mean absolute error of 4.0167 µg/m³ is small relative to these ranges, so the model can still place a location in the right air quality category under such conditions. However, under better air quality conditions, where PM2.5 levels are between 0 and 55.4 µg/m³, this error can make it difficult to identify the correct air quality category.
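The effect of the error on category assignment can be sketched as below. The breakpoints are an assumption: they are the US EPA 24-hour PM2.5 AQI category bounds as used before the 2024 revision, chosen because they match the 55.4 and 250.5 figures quoted above. The function names are hypothetical.

```python
# Assumed category upper bounds in µg/m³ (US EPA 24-hour PM2.5 AQI,
# pre-2024 revision). Values above the last bound count as "Hazardous".
BREAKPOINTS = [
    (12.0, "Good"),
    (35.4, "Moderate"),
    (55.4, "Unhealthy for Sensitive Groups"),
    (150.4, "Unhealthy"),
    (250.4, "Very Unhealthy"),
]


def category(pm25):
    """Map a PM2.5 concentration to its assumed air quality category."""
    for upper, label in BREAKPOINTS:
        if pm25 <= upper:
            return label
    return "Hazardous"


def category_range(estimate, mae=4.0167):
    """Return the categories at estimate - mae and estimate + mae.
    If the two differ, an average-sized error could change the label."""
    low = max(0.0, estimate - mae)
    high = estimate + mae
    return (category(low), category(high))


clean_day = category_range(10.0)   # straddles the Good/Moderate boundary
hazy_day = category_range(100.0)   # stays within a single category
```

Under these assumed breakpoints, the lower categories are only about 12 to 23 µg/m³ wide, while the upper ones span roughly 95 to 100 µg/m³. An error of about 4 µg/m³ therefore crosses a boundary far more often at low concentrations, although estimates close to a breakpoint remain ambiguous in any range.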
Our tools
Python, Scikit-learn
Google Earth Engine
OpenAQ API
Modeling for the future
By open sourcing our code and documentation, we make it easier for other researchers to replicate and refine our initial methodology. This way, data scientists don’t have to reconstruct models solely based on technical discussions found in research papers. Our model benchmarks can be used to evaluate the results of future ML research for haze detection.
We made a code notebook for researchers to try! The notebook runs the model on a user-defined list of locations within a specified time and visualizes the model's outputs.
Haze estimation model rollout in Chiang Mai, Thailand for December 2021
We hope our foundational work in this open source project encourages other data scientists to build on our work and assess the feasibility, limitations, and opportunities for using machine learning in air quality monitoring by:
Validating, merging, and deduplicating ground truth PM2.5 data across other sources such as Air4Thai for increased coverage across the country
Conducting temporal validation experiments by utilizing multi-year ground truth data to assess how the model performs across years
Running further experiments to improve generalizability, since spatial cross-validation metrics suggest the model might not generalize as well to regions that are entirely unrepresented in the data
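One common way to set up a spatial cross-validation experiment is to hold out whole groups of observations, so that no location appears in both the training and test sets. The sketch below assumes the grouping key is the station ID. That is an assumption made for illustration, since the grouping could equally be a province or grid cell, and the function name is hypothetical.

```python
def group_k_folds(groups, k):
    """Yield (train_indices, test_indices) pairs in which every group
    is held out exactly once and never split across train and test."""
    unique = sorted(set(groups))
    for fold in range(k):
        # Assign every k-th group to this fold's held-out set.
        held_out = set(unique[fold::k])
        test_idx = [i for i, g in enumerate(groups) if g in held_out]
        train_idx = [i for i, g in enumerate(groups) if g not in held_out]
        yield train_idx, test_idx


# Hypothetical station ID for each row of the training table
station_ids = ["A", "A", "B", "B", "C", "C", "D", "D"]
folds = list(group_k_folds(station_ids, k=2))
```

Scores computed on the held-out stations indicate how the model behaves at locations it has never seen, which is the situation a country-wide rollout faces. scikit-learn's `GroupKFold` offers a library implementation of the same idea.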
Our next project will focus on creating a Python package for wrangling geospatial data. Stay tuned for our findings in an upcoming blog post!
Keep in touch
If you’re interested in learning more about how Thinking Machines can help you, contact us!
This research is part of our AI4D Research Bank program with the UNICEF Venture Fund. Thinking Machines believes that increasing data practitioners' and program staff’s access to data and machine learning resources will enable grounded conversations about the limitations and potential of applying AI for Development.