CoGeoTIFF Research on Space Tech SaaS Platform at Pixxel

This post covers what Cloud Optimized GeoTIFFs (CoGeoTIFFs, or COGs) are, how to build products around them, the existing tools, and some examples that might help.

Introduction

GeoTIFF is a public domain metadata standard that allows georeferencing information to be embedded within a TIFF file. This can include the map projection, coordinate system and everything else necessary to establish the exact spatial reference for the file. While this is fine, these files are large, and serving large files over the internet (HTTP) is challenging. Even when they are served as-is, it ends up costing more on the infrastructure side.

CoGeoTIFF

A Cloud Optimized GeoTIFF (COG) exists to help developers serve GeoTIFF images over HTTP. It is a regular GeoTIFF file whose contents are organized internally (tiled, usually with overviews, and optionally compressed) so that clients can issue HTTP GET range requests to ask for just the parts of the file they need.
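
As a rough illustration of the range-request idea, the sketch below builds the `Range` header a client would send to fetch a single block of bytes. The URL, offset and length are made-up values; with a real COG the client would first read the tile offsets and byte counts from the TIFF header.

```python
from urllib.request import Request


def range_header(offset: int, length: int) -> str:
    """Return an HTTP Range header value covering `length` bytes from `offset`."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    # Byte ranges are inclusive on both ends, hence the -1.
    return f"bytes={offset}-{offset + length - 1}"


def build_tile_request(url: str, offset: int, length: int) -> Request:
    """Build (but do not send) a GET request for one byte range of a remote file."""
    return Request(url, headers={"Range": range_header(offset, length)})


# Hypothetical URL and tile location, for illustration only.
req = build_tile_request("https://example.com/scene.tif", offset=16384, length=65536)
print(req.get_header("Range"))
```

A server that supports range requests answers with `206 Partial Content` and sends back only the requested bytes.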

Creation of CoGeoTiff

The creation process is pretty straightforward. You can use the GDAL or Rio CoGeo CLI tools to convert GeoTIFFs to CoGeoTIFFs. Read about other challenges here.
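
A minimal sketch of the conversion step, assuming GDAL 3.1 or later (which ships a `COG` output driver) is installed and on the PATH. The file names and the choice of DEFLATE compression are placeholders, and the helper function names are made up for this example.

```python
import shutil
import subprocess


def cog_command(src: str, dst: str, compress: str = "DEFLATE") -> list:
    """Build the gdal_translate invocation that writes `src` as a COG at `dst`."""
    return ["gdal_translate", src, dst, "-of", "COG", "-co", f"COMPRESS={compress}"]


def convert_to_cog(src: str, dst: str) -> bool:
    """Run the conversion if gdal_translate is available; return True on success."""
    if shutil.which("gdal_translate") is None:
        return False  # GDAL is not installed in this environment
    return subprocess.run(cog_command(src, dst)).returncode == 0


# Print the command instead of running it, so the sketch works without GDAL.
print(" ".join(cog_command("input.tif", "output_cog.tif")))
```

With rio-cogeo installed, `rio cogeo create input.tif output_cog.tif` does the same job, and `rio cogeo validate output_cog.tif` checks the result.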

Initial Development

The development of CoGeotiff was carried out by CoGeotiff and Vincent Sarago. However, they only aid in converting one image format to another. Having a CoGeoTIFF alone is not enough.

Rio Tiler

This tool helps you extract information out of a CoGeoTIFF. You can see documentation and examples here.

Tile Server

A tile server is a server that serves geoinformation metadata, usually returning JSON over HTTP. Sentinel Tiler, initially developed by Mapbox, is just a wrapper around Rio Tiler in a Flask application.

Marblecutter

Another example of a tile server: Marblecutter. Initially developed to be deployed on AWS Lambda.

Remote Pixel Tiler

Another example of a tile server by Remote Pixel along with examples: here

CoGeo Tiler

Another example of a tile server from Vincent. Ready to be deployed to AWS Lambda. Find here.

Titiler - Latest WIP

Titiler is a lightweight service whose sole goal is to create map tiles dynamically from Cloud Optimized GeoTIFFs. It is also ready for deployment/use in production.
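
To make the idea of dynamic tiles more concrete: for each requested z/x/y tile, a server first works out the geographic extent of that tile and then reads only that window from the COG. The sketch below shows the standard Web Mercator XYZ tile arithmetic. It is not Titiler's actual code, and the function name is made up.

```python
import math


def tile_bounds(x: int, y: int, z: int) -> tuple:
    """Return (west, south, east, north) in degrees for XYZ tile x/y at zoom z."""
    n = 2 ** z  # number of tiles along each axis at this zoom level

    def lon(col: float) -> float:
        return col / n * 360.0 - 180.0

    def lat(row: float) -> float:
        # Inverse of the Web Mercator projection for a tile row edge.
        return math.degrees(math.atan(math.sinh(math.pi * (1 - 2 * row / n))))

    # Row numbers grow southwards, so the tile's top edge is row y.
    return (lon(x), lat(y + 1), lon(x + 1), lat(y))


# Zoom 0 has a single tile covering the whole Web Mercator world.
print(tile_bounds(0, 0, 0))
```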

Building A Product

The above examples are tile servers, meaning they are still just servers that spit out JSON at best. A mapping library such as OpenLayers, Leaflet.js or Geotiff.js (or a combination of these) has to be used to render maps in a browser.
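
On the client side, the mapping library turns the current view into tile requests. Libraries such as Leaflet do this bookkeeping themselves from a `{z}/{x}/{y}` URL template; the sketch below only shows what happens under the hood. The tile server URL is hypothetical and the function names are made up.

```python
import math


def lonlat_to_tile(lon: float, lat: float, z: int) -> tuple:
    """Return the (x, y) index of the XYZ tile containing a point at zoom z."""
    n = 2 ** z
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    # Clamp so points on the far edges still map to a valid tile.
    return (min(max(x, 0), n - 1), min(max(y, 0), n - 1))


def tile_url(template: str, lon: float, lat: float, z: int) -> str:
    """Fill a {z}/{x}/{y} URL template for the tile containing the point."""
    x, y = lonlat_to_tile(lon, lat, z)
    return template.format(z=z, x=x, y=y)


# Hypothetical tile server address, for illustration only.
TEMPLATE = "https://tiles.example.com/{z}/{x}/{y}.png"
print(tile_url(TEMPLATE, lon=77.59, lat=12.97, z=10))
```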

Example Products

Some examples that are stale (not actively maintained) but do a good job of explaining how to serve CoGeoTIFFs along with a UI.

Conversion from Sentinel 2 L1C to Sentinel 2 L2A

Case 1 - Tiff to COG conversion

Most raw data is available in TIFF format. If the original raw data is kept, these TIFFs have to be converted to COG for later steps such as rendering images on the client side. This could lead to data duplication (warehousing both TIFF and COG). Since a COG has the same properties as the TIFF and does not lose any of its metadata, storing only the COG would be the ideal solution.

Case 2 - COG to COG conversion

If the data is already in COG, which is still a valid TIFF with all the properties of a TIFF, further steps such as machine learning or other processing will not be impacted. Hence, storing as COG is an advantage in terms of both storage space and further processing.

Storage, Networking, Processing

Sentinel-2 scenes hosted on AWS are not in Cloud Optimized format but in JPEG2000. When performing a partial read of a JPEG2000 dataset, GDAL (the rasterio backend library) needs to make a lot of GET requests and transfer a lot of data. This problem can only be solved by asking the user to select the boundary required for download, then converting those JP2 files to COG. A region/tile previously requested and converted for a user can be logged so that it need not be converted again.

  • Worst case scenario is ending up converting everything to COG (it is mentioned here that converting every tile should hypothetically cost under roughly $100,000)
  • Best case scenario is converting only the required tiles and then avoiding re-conversion of tiles/images
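
The "convert once, then reuse" idea can be sketched as a small lookup in front of the conversion job. Everything here is illustrative: the class name, the bucket name and the tile ids are made up, and `fake_convert` stands in for the real JP2 to COG conversion.

```python
from typing import Callable, Dict


class ConversionLog:
    """Remember which tiles were already converted so each is converted once."""

    def __init__(self, convert: Callable[[str], str]):
        self._convert = convert          # function: tile id -> COG location
        self._done: Dict[str, str] = {}  # tile id -> COG location
        self.conversions = 0             # how many real conversions were run

    def get_cog(self, tile_id: str) -> str:
        """Return the COG location for a tile, converting only on first request."""
        if tile_id not in self._done:
            self._done[tile_id] = self._convert(tile_id)
            self.conversions += 1
        return self._done[tile_id]


def fake_convert(tile_id: str) -> str:
    # Stand-in for the real JP2 -> COG job; the bucket name is hypothetical.
    return f"s3://example-cog-bucket/{tile_id}.tif"


log = ConversionLog(fake_convert)
for requested in ["43PGQ", "43PGQ", "43PHQ"]:
    log.get_cog(requested)
print(log.conversions)  # 2: the repeated tile was served from the log
```

In a real deployment the in-memory dictionary would presumably be replaced by a persistent store so the log survives restarts.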

Downloading from S3: A researcher decides to download one tile each of Sentinel-1, Sentinel-2 L1 and Sentinel-2 L2. Assuming an S1 tile is 7.5 GB and the S2 L1 and L2 tiles are 4.5 GB each, the total download size is 16.5 GB. A pipeline is created on Glue to extract these files, find the fourteen bands of images, and transform all of them to CoGeoTIFF or any other format. Each conversion takes 10 seconds (per image). This Glue pipeline is triggered by Lambda. Converted files are stored back in AWS S3.
Quota - 16.5 GB internet download bandwidth ~= 7 * 16.5 = ₹115.5
Transformation assuming each image is 1 GB in size - 10 seconds * 14 images * 3 satellites = 420 seconds ~= 7 minutes ~= ₹35
Lambda - assuming after free quota - ₹15
Total Pricing = 115.5 + 35 + 15 = ₹165.5

Downloading from Copernicus: A developer decides to download one hundred tiles of Sentinel-1 images, each 7.5 GB on average. Since Copernicus limits downloading to two downloads per user, we must use a queue. Each download takes around 10 minutes. Assuming the free quota is over, the total size would be 7.5 GB * 100 = 750 GB and 350 queues. A pipeline is created on Glue to extract these files, find the fourteen bands of images, and transform all of them to CoGeoTIFF or any other format. Each conversion takes 10 seconds (per image). This Glue pipeline is triggered by Lambda. Converted files are stored back in AWS S3.
Quota - 750 GB internet bandwidth - Copernicus egress and AWS ingress are free
Transformation assuming each image is 1 GB in size - 10 seconds * 14 images * 100 tiles * 1 satellite = 140 seconds * 100 = 14,000 seconds ~= 3.9 hr ~= ₹136
SQS for 350 queues ~= ₹30
Total Pricing = 136 + 30 = ₹166
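
The two-downloads-per-user limit maps naturally onto a worker pool of size two. The sketch below is a local simulation only: `download` sleeps briefly instead of contacting Copernicus, and the tile ids and file names are invented.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_PARALLEL = 2  # the per-user download limit mentioned above

_lock = threading.Lock()
_active = 0
peak_active = 0  # highest number of simultaneous downloads observed


def download(tile_id: str) -> str:
    """Placeholder download: sleeps briefly instead of fetching real data."""
    global _active, peak_active
    with _lock:
        _active += 1
        peak_active = max(peak_active, _active)
    time.sleep(0.01)  # stands in for the ~10 minute transfer
    with _lock:
        _active -= 1
    return f"{tile_id}.zip"


tile_ids = [f"tile-{i:03d}" for i in range(10)]
# The executor queues the remaining tiles and starts one as soon as a slot frees up.
with ThreadPoolExecutor(max_workers=MAX_PARALLEL) as pool:
    files = list(pool.map(download, tile_ids))

print(len(files), peak_active)
```

With two slots and roughly 10 minutes per download, one hundred tiles would take on the order of 100 * 10 / 2 = 500 minutes of wall-clock time.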

Either way, the output format should be COG because it aligns well with the rest of the pipeline.

AWS Pricing

Title | Quantity | Quota | Cost (INR) | Cost (USD)
S3 POST, PUT (https://aws.amazon.com/s3/pricing/) | - | 1000 req | ₹0.38 | $0.005
S3 GET (https://aws.amazon.com/s3/pricing/) | - | 1000 req | ₹0.030 | $0.0004
S3 Data transfer outbound | 1 GB | Monthly | ₹7 | $0.09
AWS Lambda | 1 GB | Monthly for 1M compute seconds and 2M requests | ~₹1200 | $15
AWS Glue ETL | 1 GB | Monthly for 1hr | ₹35 | $0.44
AWS SQS | - | 1M req | ₹30 | $0.4
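
A small sketch of how the scenario estimates can be recomputed from these rates. The function names are made up, and the one-hour minimum charge for Glue is an assumption made here so that a 7 minute job still costs ₹35; it is not taken from the AWS pricing pages.

```python
# Rates in INR, taken from the pricing table above.
EGRESS_PER_GB = 7.0
GLUE_PER_HOUR = 35.0


def glue_cost(seconds: float, min_hours: float = 1.0) -> float:
    """Glue cost in INR, assuming at least `min_hours` is always billed."""
    return max(seconds / 3600.0, min_hours) * GLUE_PER_HOUR


def scenario_cost(egress_gb: float, images: int, seconds_per_image: float = 10.0,
                  flat_fees: float = 0.0) -> float:
    """Total INR: S3 egress + Glue transformation + flat fees (Lambda, SQS)."""
    transfer = egress_gb * EGRESS_PER_GB
    transform = glue_cost(images * seconds_per_image)
    return transfer + transform + flat_fees


# S3 scenario: 16.5 GB egress, 14 images for each of 3 products, ₹15 for Lambda.
s3_case = scenario_cost(egress_gb=16.5, images=14 * 3, flat_fees=15.0)
# Copernicus scenario: free transfer, 14 images for each of 100 tiles, ₹30 for SQS.
copernicus_case = scenario_cost(egress_gb=0.0, images=14 * 100, flat_fees=30.0)
print(round(s3_case, 1), round(copernicus_case, 1))
```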

Sentinel L2A to ML Case

Case 1 - Client-Side Rendering of COG

The key to making this work well is coordinating the data delivery between the client and server through COGs, which can deliver just the information needed for the current view. More work is needed for this to operate seamlessly, but most JS stacks run the same JavaScript code in the browser and on the server for maximum flexibility.
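
One concrete way a COG delivers just the information needed for the current view is through its internal overviews (reduced-resolution copies of the image). Below is a rough sketch of how a client might pick one, assuming power-of-two overview factors and a 10 m native resolution. Both values are illustrative and not read from a real file, and the function name is made up.

```python
def pick_overview(native_res: float, view_res: float, factors=(2, 4, 8, 16)) -> int:
    """Return the largest decimation factor whose resolution is still at least
    as fine as the view needs; 1 means read the full-resolution image."""
    best = 1
    for factor in sorted(factors):
        if native_res * factor <= view_res:
            best = factor
    return best


# A 10 m/pixel image shown at 45 m/pixel on screen: the 40 m overview is enough.
print(pick_overview(native_res=10.0, view_res=45.0))  # 4
```

Reading the factor-4 overview instead of the full-resolution image means fetching roughly one sixteenth of the pixels for the same area.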

Case 2 - COG to COG conversion

The ML model is unaffected by whether the input is TIFF or COG. Since the two are virtually the same, the training process and the time to convergence remain the same.

Doing everything on-demand and through an API then opens many possibilities to tailor the data more to how the end-user wants it. The first of these was to enable ‘clipping’, which lets a user request just the geometry they care about, instead of trying to select the scenes that overlap with their area of interest. This will evolve to enable co-registration of images, application of TOA, atmospheric correction, surface reflectance, and eventually full analytic processing of images with operations like band math to create indices and even computer vision-based object detection.
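
As an example of the band math mentioned above, the sketch below computes NDVI, a widely used vegetation index defined as (NIR - Red) / (NIR + Red). NDVI is chosen here only as a familiar illustration, and the reflectance values are made up. For Sentinel-2, the red and near-infrared bands are B04 and B08.

```python
def ndvi(red: list, nir: list) -> list:
    """Per-pixel NDVI = (NIR - Red) / (NIR + Red); 0.0 where both bands are zero."""
    out = []
    for r, n in zip(red, nir):
        total = n + r
        out.append((n - r) / total if total else 0.0)
    return out


# Three made-up pixels: vegetated, bare, and empty (no data).
red = [0.10, 0.30, 0.0]
nir = [0.50, 0.30, 0.0]
print([round(v, 3) for v in ndvi(red, nir)])
```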

Storage, Networking, Processing

The storage cost of JPEG2000 is lower than that of COGs, but someone will have to pay to access/process the data. If the data is being stored and used for providing services around it, COG should be the better long-term solution.

Short answer to the question: there is no such thing as an ultimate data format. In the real world there are plenty of good data formats, and at the end of the day the choice depends on what you want the user to do.

COG Client-Side Rendering

Size: 50 TB.

  • Storage: 50 TB * 1000 GB/TB * $0.023 per GB = $1,150 ~= Rs. 86,000 / month*
  • Data access: (1M * 5 GET requests / 1000) * $0.004 = $20 ~= Rs. 1,500*
  • Processing time (1GB AWS Lambda): (1 second * 1M * 1GB / 1000) * 0.00001667 $ = Rs. 2,000*

Reading a tile from a COG is at least 3-4 times faster than from JPEG2000.

Cost: 86k + 1,500 + 2,000 ~= Rs. 90,000 / month (*estimated, +/- 10k for processing and network transfer)

Future

COG is incredibly powerful and there is an emerging ecosystem of geospatial algorithms that can run fully client-side. COG leverages an HTTP/1.1 feature called byte serving (range requests). Hotosm's ML Enabler allows you to apply ML algorithms to maps directly in the browser. Here it is explained how it makes it easier to collect and organize AI-derived map information. Soon, the effort needed to apply ML to maps is going to be reduced. The project is not open source yet, but they plan to release it as the work moves from a research project to a more stable tool.

Conclusion

Building a custom tile server requires a good understanding of OpenLayers, Leaflet or GeoTiff.js, along with the ability to request information from the server (with instant updates, much like React.js state) and render it on the client side. Since most of the images are stored in AWS S3 buckets, they can be served right from there. However, some of them are not stored as TIFF, and those could increase conversion and storage costs.

Check out: https://pixxel.space

Footnotes

Theory behind COG

Cloud Native Architecture