- Mar 8

Predicting Sewage Pollution with Satellite Data, Part 2

While water companies have been busy trying to cover up their sewage spills (The Times article here), we at Marla have been busy improving our satellite-data-based sewage pollution monitoring model (an earlier blog post introducing the project can be found here).

Sewage mixing into clear water, taken from BBC News

A quick recap: to investigate whether it is possible to predict sewage spills from satellite data, we previously built several supervised machine-learning models that take in today's satellite oceanographic data and output a guess on whether there was sewage pollution today. Results showed some promise, with our best model scoring 14 times better than a baseline model that guesses at random. Nonetheless, we kept investigating, as we believed there was room to improve the models further by including more signals and refining the model design.

Strengthening our approach

For the past two months, we focused on two approaches for model improvement: adding more signals and adding lagged variables.

The added signals are mainly weather-related and geographical. They cover rain, wind, waves, water depth and even the angle of the sea to the land at our target sites along the UK coastline. By including them, we hoped to improve model explainability and to see more clearly how effective satellite oceanographic data is at monitoring sewage pollution.

The size and strength of waves could potentially impact the extent of sewage pollution

Lagged variables were added under the hypothesis that measurements from previous days could tell us something about whether there is a sewage pollution incident today. We also added future “lagged” variables (measurements from the days after), as we thought the effects of a sewage pollution incident could linger and be spotted from the satellite.
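A minimal sketch of how lagged and future "lagged" features of this kind could be built, in plain Python. The function name `add_shifted_features`, the `rain` column and the sample values are made up for illustration and are not taken from the project code. It assumes one row per day, in date order, for a single site.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]


def add_shifted_features(
    rows: List[Row],
    feature: str,
    past_days: int,
    future_days: int = 0,
) -> List[Row]:
    """Return copies of `rows` with extra columns holding the value of
    `feature` from earlier days (lags) and later days (leads).

    Assumes `rows` holds one entry per day, in date order, for one site.
    New columns are named e.g. rain_lag1 (yesterday) and rain_lead1
    (tomorrow). Days outside the series are filled with None.
    """
    values = [row.get(feature) for row in rows]
    n = len(values)
    out = []
    for i, row in enumerate(rows):
        new_row = dict(row)  # copy, so the input rows are left untouched
        for k in range(1, past_days + 1):
            new_row[f"{feature}_lag{k}"] = values[i - k] if i - k >= 0 else None
        for k in range(1, future_days + 1):
            new_row[f"{feature}_lead{k}"] = values[i + k] if i + k < n else None
        out.append(new_row)
    return out


# Hypothetical daily rainfall (mm) for one site; not real project data.
days = [{"rain": 0.0}, {"rain": 4.2}, {"rain": 1.1}, {"rain": 0.0}]
shifted = add_shifted_features(days, "rain", past_days=2, future_days=1)
print(shifted[2])
```

With several sites, the same function would be applied to each site's series separately so that values do not leak from one location into another.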

We ran multiple experiments to test the impact of adding various components of the above approaches, which culminated in a main run that includes all features and all the lags of all those features.

Results overview

After many iterations of dataset downloading, model running and code debugging, our best model attained an accuracy score* 2.6 times better than the result achieved in our old blog post.

To better answer our initial investigation question - “Is it possible to make predictions on sewage spills based on satellite data?” - we reviewed all our experiments to date and came to the following takeaways.

Takeaway #1: Weather changes everything

Including additional signals improved the performance of our models rather dramatically, with the accuracy score being 2.5 times the result from our old blog post. In particular, weather variables such as rain stood out with the highest feature importance score - a score which ranks features in terms of how much they help in explaining our target variable (sewage pollution yes/no?). The following graph presents the summed feature importance for each broad category of additional features for our main experiment.

Fig 1. Feature importance grouped by categories of features
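For readers curious how a grouped chart like this can be put together, here is a rough sketch that sums per-feature importance scores by category. The feature names, the category mapping and every number below are invented for illustration; they are not the values behind Fig 1.

```python
from collections import defaultdict
from typing import Dict


def group_importance(
    importances: Dict[str, float],
    category_of: Dict[str, str],
) -> Dict[str, float]:
    """Sum per-feature importance scores by category, largest first.

    Features missing from `category_of` are collected under "other".
    """
    totals: Dict[str, float] = defaultdict(float)
    for feature, score in importances.items():
        totals[category_of.get(feature, "other")] += score
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


# Invented scores that sum to 1.0; NOT the results reported in this post.
importances = {
    "precipitation": 0.5,
    "wind_speed": 0.125,
    "water_depth": 0.125,
    "CHL": 0.125,
    "SPM": 0.125,
}
# Hypothetical mapping from feature name to broad category.
category_of = {
    "precipitation": "weather",
    "wind_speed": "weather",
    "water_depth": "geographical",
    "CHL": "satellite",
    "SPM": "satellite",
}

grouped = group_importance(importances, category_of)
for category, total in grouped.items():
    print(f"{category}: {total:.3f}")
```

The same helper could group by days of lag instead, by mapping each feature name to its lag number rather than to a category.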

Why is precipitation so much more important than the other features? A likely explanation is that water companies are more likely to pump out sewage whenever it rains. This becomes clearer when we split our dataset into data points with rain (precipitation ≠ 0) and without rain (precipitation = 0), and group each set by pollution status. The proportion of rainy days with pollution (3.9%) is more than 5 times the proportion of non-rainy days with pollution (0.74%).

	Rain (precipitation ≠ 0)	No rain (precipitation = 0)
Total data points	121056	66055
Data points with pollution	4673	492
Share of data points with pollution	3.9%	0.74%

Table comparing the proportion of days with pollution between rainy and non-rainy days
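The percentages in the table, and the "more than 5 times" comparison, can be reproduced with a few lines of arithmetic. The counts are copied from the table; the variable names are only illustrative.

```python
# Counts copied from the table above.
rain_total, rain_polluted = 121056, 4673
dry_total, dry_polluted = 66055, 492

# Share of data points with pollution in each group.
rain_rate = rain_polluted / rain_total
dry_rate = dry_polluted / dry_total

# How many times more common pollution is on rainy days.
ratio = rain_rate / dry_rate

print(f"Rainy days with pollution:     {rain_rate:.2%}")
print(f"Non-rainy days with pollution: {dry_rate:.2%}")
print(f"Ratio (rain / no rain):        {ratio:.1f}")
```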

While some extra sewage discharges during rainy days make sense to avoid overwhelming treatment plants, are water companies doing their best to prevent unnecessary discharges? This BBC Panorama investigation seems to disagree.

Takeaway #2: History is important

We found that adding lagged variables up to 7 days before on top of the original unlagged variables improved model performance. This was done for all features (satellite data and weather/geographical data), such that every original feature has 7 additional lagged counterparts. The grouped feature importance ranking for this experiment shows that the data from yesterday is the most important for telling us about whether there is an incident today (the number after "lag" denotes the number of days lagged, where 0 indicates the original unlagged feature).

Fig 2. Feature importance grouped by days of lag

No significant improvement in accuracy score* was attained when we added future "lagged" variables, indicating that future signals add little on top of past signals. This is consistent with the intuition that sewage pollution is closely linked to rainfall: whether it rains today has a large influence on whether companies pump sewage out tomorrow.

Takeaway #3: Satellite data still makes a (rather marginal) difference

Given the dramatic difference in feature importance shown in Fig 1, and given that feature importance for certain oceanography variables is generally lower (in particular BBP, CDM and CHL; the other variables are KD490, ZSD and SPM), could a model with only the additional weather and water depth signals outperform a model that also includes satellite data? We found that the model that also includes satellite data has a very slight overall edge, but the difference is marginal.

This suggests that the amount of signal that can be extracted from satellite oceanographic data is rather limited, at least at its current frequency and resolution. The large amount of missing satellite data may also be partly to blame, though we did not find evidence that data points with less missing satellite data were better predicted.

Final thoughts

To answer our initial investigative question, while satellite data does give us some information on sewage pollution, it may be difficult to build a model that can give sufficiently accurate predictions with the current data available.

That is not to say this is impossible. Perhaps with further extensions such as gathering more historical data and building more sophisticated time-series models (and requesting even larger machines on Google Cloud), much better models can be attained. Higher resolution and frequency data from satellites, should they become available, will also likely help with extracting signals from the noise. In the meantime, we will file away this project to work on the upcoming launch of our Australian dive visibility app - stay tuned!

Notes

*We report the F1 score rather than plain accuracy, since it is a better measure of performance on our highly imbalanced dataset (there are many more sewage-pollution-free days than polluted ones).
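To illustrate why plain accuracy is misleading here, the sketch below scores a hypothetical model that always predicts "no pollution". It assumes the class balance is the one implied by the rain / no rain table above (both columns added together). The helper names are illustrative only.

```python
def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall for the positive class.

    Returns 0.0 when there are no true positives, which also avoids
    dividing by zero when the model never predicts the positive class.
    """
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Class balance assumed from the table above (rain + no rain columns).
total = 121056 + 66055
polluted = 4673 + 492

# A hypothetical model that always answers "no pollution":
# it misses every polluted day and is right on every clean day.
tp, fp, fn, tn = 0, 0, polluted, total - polluted

always_no_accuracy = accuracy(tp, fp, fn, tn)
always_no_f1 = f1_score(tp, fp, fn)

print(f"Accuracy: {always_no_accuracy:.1%}")
print(f"F1 score: {always_no_f1:.2f}")
```

Such a model is right on the vast majority of days while detecting no pollution at all, which is why a score focused on the rare positive class is more informative.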