Publishing Model Benchmarks for Air Quality Monitoring in Thailand


The World Health Organization estimates that air pollution causes more than 2.2 million deaths in the Asia Pacific region every year. Exposure to fine particulate matter (PM10 or PM2.5) increases the risk of heart and lung disease, stroke, and cancer, among other illnesses. Although efforts to monitor air quality have grown in recent years, ground monitoring stations remain sparse, especially in low- and middle-income countries, because they are costly and complex to deploy.

As part of Thinking Machines’ and UNICEF’s AI4D (Artificial Intelligence for Development) Research Bank program, we tested the feasibility of training a machine learning (ML) model on remote sensing data to estimate PM2.5. Our exploratory, foundational research focused on Thailand, one of the Southeast Asian countries where agricultural waste is burned in the open during the post-harvest season.

Our work is fully open-sourced, including the following components:

PM2.5 ML models exclusively trained on open data. While our research focused on Thailand, we used globally available datasets to encourage similar research in other countries

Pre-processed datasets for model training and evaluation, including ground truth (PM2.5 data) and feature datasets, which represent aerosol, meteorological, human-related, and environmental factors

Haze estimation (PM2.5) model benchmarks for evaluating the performance of future research in Thailand

Scripts and technical documentation to replicate our research, including our PM2.5 data collection and feature generation scripts, and a demo notebook for rolling out the final model on a small user-selected area

Impact

Open-source Models and Data

Open-sourced haze estimation models and pre-processed data allow other data scientists to replicate our work and add to the growing body of knowledge on machine learning for haze estimation

Limited research and ML benchmarks for Thailand PM2.5 estimation

We know that machine learning can help fill the spatial gaps left by limited ground truth data. Existing air quality studies in Thailand are often confined to region- or city-level analyses and focus on the relationships between various environmental factors and PM2.5. Our team built on research by Gupta et al. (2021), which applied ML to estimate PM2.5 using ground truth data from Thailand’s Pollution Control Department and meteorology and aerosol data from MERRA-2. We saw an opportunity to use updated PM2.5 data and a larger set of features to estimate haze across the country.

We trained an ML model that runs exclusively on open data to estimate PM2.5 in Thailand

To build the machine learning model, we selected open data counterparts to the ground truth and feature datasets frequently cited in air quality research. We used readily accessible, global datasets such as:

OpenAQ Ground-level PM2.5: air quality data from low-cost and reference-grade sensors across Thailand in 2021

CAMS MERRA-2 Aerosol Information: predicted total aerosol optical depth (AOD), the sum of tropospheric aerosols (sea salt, dust, organic and black carbon, and sulfates); we used estimates at the 550 nm wavelength

Sentinel 5P Aerosol Information: offline high-resolution imagery of the UV Aerosol Index (UVAI), also called the Absorbing Aerosol Index (AAI)

ERA5 Meteorological Factors: temperature, precipitation, surface pressure, and wind data

Facebook’s High-Resolution Settlements Layer: Thailand population count on a 30x30m tile level

MODIS Vegetation Index: 16-day average of NDVI/EVI, which measure vegetation


We combined these geospatial datasets, which come in different file formats (tabular data, admin boundaries, tiles, etc.), and ran a series of data validation and alignment checks to prepare them for regression-based ML modeling.

For OpenAQ, we drew a 1x1 km bounding box around each air quality sensor to define the area of interest and logged each station’s daily PM2.5 reading.
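As a rough illustration of the bounding-box step, the sketch below builds a square of a given side length centered on a sensor's coordinates. The function name `square_bbox`, the spherical-Earth approximation, and the example coordinates are assumptions made for illustration; the source does not describe how the boxes were actually computed.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; a spherical approximation


def square_bbox(lat, lon, side_km=1.0):
    """Return (min_lon, min_lat, max_lon, max_lat) for a square centered on a point.

    Hypothetical helper: the side length is measured along the surface,
    which is accurate enough for boxes of about 1 km.
    """
    half = side_km / 2.0
    # One degree of latitude covers nearly the same distance everywhere.
    dlat = math.degrees(half / EARTH_RADIUS_KM)
    # One degree of longitude shrinks with the cosine of the latitude.
    dlon = math.degrees(half / (EARTH_RADIUS_KM * math.cos(math.radians(lat))))
    return (lon - dlon, lat - dlat, lon + dlon, lat + dlat)


# Illustrative coordinates near Chiang Mai, not a real station record.
bbox = square_bbox(18.79, 98.98)
```

The resulting tuple can then serve as the area of interest when sampling each feature dataset around the station.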

For each feature, we collected the data within the area of interest and aggregated or distributed the values to achieve a daily granularity. We then joined our daily features with ground truth data.
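The aggregation and join can be sketched in plain Python as follows. This is a minimal example under stated assumptions: the record layout `(station_id, timestamp, value)`, the helper names `daily_mean` and `join_on_station_day`, and the sample readings are all hypothetical, and only the "aggregate" case is shown (distributing a coarser product, such as a 16-day composite, across days is not covered).

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def daily_mean(records):
    """Average sub-daily readings into one value per (station_id, date)."""
    buckets = defaultdict(list)
    for station_id, timestamp, value in records:
        day = datetime.fromisoformat(timestamp).date()
        buckets[(station_id, day)].append(value)
    return {key: mean(values) for key, values in buckets.items()}


def join_on_station_day(ground_truth, feature, feature_name):
    """Keep every ground-truth row and attach the matching feature value, or None."""
    return [
        {"station_id": sid, "date": day, "pm25": pm25,
         feature_name: feature.get((sid, day))}
        for (sid, day), pm25 in sorted(ground_truth.items())
    ]


# Made-up sample readings for a single hypothetical station.
pm25_hourly = [
    ("st1", "2021-12-01T00:00:00", 30.0),
    ("st1", "2021-12-01T12:00:00", 40.0),
    ("st1", "2021-12-02T00:00:00", 20.0),
]
temperature_hourly = [
    ("st1", "2021-12-01T06:00:00", 295.0),
    ("st1", "2021-12-01T18:00:00", 299.0),
]

rows = join_on_station_day(
    daily_mean(pm25_hourly), daily_mean(temperature_hourly), "temperature"
)
```

Days with a ground-truth reading but no feature value keep a `None` in the feature column instead of being dropped, so a later step can decide how to handle the gap.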

Finally, we applied a simple mean imputation technique to fill in missing feature values and maximize the amount of usable data for ML training and evaluation.
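Mean imputation replaces each missing feature value with the average of the observed values in the same column. The sketch below shows the idea in plain Python; the function name `mean_impute`, the row layout, and the sample values are assumptions for illustration, not the project's actual code.

```python
from statistics import mean


def mean_impute(rows, feature_names):
    """Return copies of rows with None in feature columns replaced by the column mean."""
    fill = {}
    for name in feature_names:
        observed = [row[name] for row in rows if row[name] is not None]
        # A column with no observed values cannot be imputed and stays None.
        fill[name] = mean(observed) if observed else None
    return [
        {**row, **{name: fill[name] if row[name] is None else row[name]
                   for name in feature_names}}
        for row in rows
    ]


# Made-up rows; only the feature columns are imputed, never the PM2.5 target.
rows = [
    {"pm25": 35.0, "temperature": 297.0, "aod": 0.5},
    {"pm25": 20.0, "temperature": None, "aod": 0.25},
    {"pm25": 50.0, "temperature": 301.0, "aod": None},
]
imputed = mean_impute(rows, ["temperature", "aod"])
```

In a scikit-learn workflow, `SimpleImputer(strategy="mean")` offers the same behavior. Fitting the imputer on the training split only is a common way to keep test-set information out of the fill values, although the source does not say how this was handled here.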

Building a model for haze estimation

Our best-performing PM2.5 estimation model has a coefficient of determination (R²) of 0.8672, out of a highest possible score of 1. This is comparable with the R² scores reported in other research, which range from 0.5 to 0.9. The same model has a mean absolute error (MAE) of 4.0167 µg/m³, which means that its estimates are off by this much on average.

PM2.5 levels at higher ranges, from 55.5-250.5 µg/m³ and up, indicate poorer air quality. An average error of 4.0167 µg/m³ is small relative to these ranges, so the model can still place a location in the correct air quality category under these conditions. However, under better air quality conditions, where PM2.5 levels are between 0 and 55.4 µg/m³, an error of this size can make it difficult to identify the correct air quality category.
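To make the two metrics concrete, the sketch below computes R² and MAE from their definitions and then shows how an error the size of the reported MAE can change the category near a boundary. The sample values are made up, and the category breakpoints are an assumption: they follow the pre-2024 US EPA PM2.5 scale, which matches the 55.4 and 250.5 boundaries quoted above, but the source does not name the scale it used.

```python
def mean_absolute_error(y_true, y_pred):
    """Average size of the estimation error, in the units of the target."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 minus residual variance over total variance."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Assumed upper bounds in µg/m³ (pre-2024 US EPA scale); not stated in the source.
BREAKPOINTS = [
    (12.0, "Good"),
    (35.4, "Moderate"),
    (55.4, "Unhealthy for Sensitive Groups"),
    (150.4, "Unhealthy"),
    (250.4, "Very Unhealthy"),
]


def category(pm25):
    """Map a PM2.5 concentration to its assumed air quality category."""
    for upper, label in BREAKPOINTS:
        if pm25 <= upper:
            return label
    return "Hazardous"


# Made-up observations and estimates, only to exercise the functions.
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 39.0]
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

REPORTED_MAE = 4.0167  # value reported in the text
low_shift = (category(33.0), category(33.0 + REPORTED_MAE))
high_shift = (category(100.0), category(100.0 + REPORTED_MAE))
```

Under these assumed breakpoints, the same 4.0167 µg/m³ offset moves a reading of 33.0 into the next category but leaves a reading of 100.0 where it was. scikit-learn provides equivalent `r2_score` and `mean_absolute_error` functions in `sklearn.metrics`.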

Our tools

Python, Scikit-learn

Google Earth Engine

OpenAQ API

Modeling for the future

By open sourcing our code and documentation, we make it easier for other researchers to replicate and refine our initial methodology. This way, data scientists don’t have to reconstruct models solely based on technical discussions found in research papers. Our model benchmarks can be used to evaluate the results of future ML research for haze detection. We made a code notebook for researchers to try! The notebook runs the model on a user-defined list of locations within a specified time and visualizes the model's outputs.

Haze estimation model rollout in Chiang Mai, Thailand for December 2021


We hope our foundational work in this open source project encourages other data scientists to build on our work and assess the feasibility, limitations, and opportunities for using machine learning in air quality monitoring by:

Validating, merging, and deduplicating ground truth PM2.5 data across other sources such as Air4Thai for increased coverage across the country

Conducting temporal validation experiments by utilizing multi-year ground truth data to assess how the model performs across years

Running further experiments to improve generalizability, since spatial cross-validation metrics suggest that the model might not generalize as well to regions that are entirely unrepresented in the data
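One common way to run a spatial cross-validation is to hold out every sample from one region at a time, so that each fold tests the model on an area it has never seen. The sketch below generates such splits; the helper name `leave_one_group_out` and the region labels are hypothetical, and the source does not state how its own spatial folds were defined.

```python
def leave_one_group_out(groups):
    """Yield (held_out_group, train_indices, test_indices), one fold per group."""
    for held_out in sorted(set(groups)):
        train = [i for i, g in enumerate(groups) if g != held_out]
        test = [i for i, g in enumerate(groups) if g == held_out]
        yield held_out, train, test


# Hypothetical region label for each training sample.
regions = ["north", "north", "central", "central", "south"]
folds = list(leave_one_group_out(regions))

for held_out, train, test in folds:
    # A real experiment would fit on the train rows and score on the test rows here.
    print(held_out, "train:", train, "test:", test)
```

scikit-learn ships comparable splitters, such as `LeaveOneGroupOut` and `GroupKFold` in `sklearn.model_selection`, which accept the region labels through the `groups` argument.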


Our next project will focus on creating a Python package for wrangling geospatial data. Stay tuned for our findings in an upcoming blog post!

Keep in touch

If you’re interested in learning more about how Thinking Machines can help you, contact us! This research is part of our AI4D Research Bank program with the UNICEF Venture Fund. Thinking Machines believes that increasing data practitioners' and program staff’s access to data and machine learning resources will enable grounded conversations about the limitations and potential of applying AI for Development.
