mercredi 11 octobre 2017

GDAL and cloud storage

In the past weeks, a number of improvements related to access to cloud storage have been committed to GDAL trunk (future GDAL 2.3.0)

Cloud based virtual file systems


There was already support to access private data in Amazon S3 buckets through the /vsis3/ virtual file system (VFS). Besides a few robustness fixes, a few new capabilities have been added, like creation and deletion of directories inside a bucket with VSIMkdir() / VSIRmdir(). The authentication methods have also been extended to support, beyond the AWS_SECRET_ACCESS_KEY and AWS_ACCESS_KEY_ID environment variables, the other ways accepted by the "aws" command line utilities, that is to say storing credentials in the ~/.aws/credentials or ~/.aws/config files. If GDAL is executed since a Amazon EC2 instance that has been assigned rights to buckets, GDAL will automatically fetch the instance profile credentials.

The existing read-only /vsigs/ VFS for Google Cloud Storage as being extended with write capabilities (creation of new files), to be on feature parity with /vsis3/. The authentication methods have also been extended to support OAuth2 authentication with a refresh token, or with service account authentication. The credentials can be stored in a ~/.boto configuration file. And when run from a Google Compute Engine virtual machine, GDAL will automatically fetch the instance profile credentials.

Two new VFS have also been added, /vsiaz/ for Microsoft Azure Blobs and /vsioss/ for Alibaba Cloud Object Storage Service. They support read and write operations similarly to the two previously mentioned VFS.


To make file and directory management easy, a number of Python sample scripts have been created or improved:
gdal_cp.py my.tif /vsis3/mybucket/raster/
gdal_cp.py -r /vsis3/mybucket/raster /vsigs/somebucket
gdal_ls.py -lr /vsis3/mybucket
gdal_rm.py /vsis3/mybucket/raster/my.tif
gdal_mkdir.py /vsis3/mybucket/newdir
gdal_rmdir.py -r /vsis3/mybucket/newdir

Cloud Optimized GeoTIFF


Over the last past few months, there has been adoption by various actors of the cloud optimized formulation of GeoTIFF files, which enables clients to efficiently open and access portions of a GeoTIFF file available through HTTP GET range requests.

Source code for a online service that offers validation of cloud optimized GeoTIFF (using GDAL and the validate_cloud_optimized_geotiff.py script underneath) and can run as a AWS Lambda function is available. Note: as the current definition of what is or is not a cloud optimized formulation has been uniteraly decided up to now, it cannot be excluded that it might be changed on some points (for example relaxing constraints on the ordering of the data of each overview level, or enforcing that tiles are ordered in a top-to-bottom left-to-right way)

GDAL trunk has received improvements to speed up access to sub windows of a GeoTIFF file by making sure that the tiles that participate to a sub-window of interest are requested in parallel (this is true for public files accessed through /vsicurl/ or with the four above mentioned specialized cloud VFS), by reducing the amount of data fetched to the strict minimum and merging requests for consecutive ranges. In some environments, particularly when accessing to the storage service of a virtual machine of the same provider, HTTP/2 can be used by setting the GDAL_HTTP_VERSION=2 configuration option (provided you have a libcurl recent enough and built against nghttp2). In that case, HTTP/2 multiplexing will be used to request and retrieve data on the same HTTP connection (saving time to establish TLS for example). Otherwise, GDAL will default to several parallel HTTP/1.1 connections. For long lived processes, efforts have been made to re-use as much as possible existing HTTP connections.

Aucun commentaire:

Enregistrer un commentaire