⭐ AWS Quickstart¶
Amazon Web Services is a subsidiary of Amazon providing on-demand cloud computing platforms and APIs to individuals, companies, and governments, on a metered pay-as-you-go basis. Through the Development Data Partnership, Data Partners often share datasets via the AWS cloud with the support of the World Bank’s and your respective institution’s IT cloud team.
In that spirit, here are guidelines on how to access and retrieve data from AWS services, such as AWS S3 and AWS SageMaker.
Please remember you must abide by the terms of the Master Data License Agreement. If you have any questions or need clarification, please reach out to us.
Your team will receive AWS credentials (a key and secret) that allow you to be authenticated (signed in) and authorized (granted permissions) to use resources. Please note that support, including provisioning additional compute resources and estimating costs, is provided by your respective institution’s IT cloud team.
The credentials are your nominal, non-shareable, non-transferable access to AWS resources. Never publish or commit your key and secret. Remember you are responsible for the use of the access granted to you.
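In practice, the key and secret are usually stored outside your code, either via `aws configure` (which writes them to `~/.aws/credentials`) or as environment variables. A minimal sketch with placeholder values (never commit real values to a repository):

```shell
# Export the credentials in your shell session (placeholder values shown).
# Alternatively, run `aws configure`, which stores them in ~/.aws/credentials.
export AWS_ACCESS_KEY_ID="AKIA...EXAMPLE"
export AWS_SECRET_ACCESS_KEY="wJal...EXAMPLE"
```

Most AWS tools and SDKs (the AWS CLI, boto3, s3fs) pick these up automatically.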
AWS S3¶
Amazon S3 or Amazon Simple Storage Service is a service offered by Amazon Web Services that provides object storage.
Let’s see a few options on how to retrieve data stored on AWS S3.
Using AWS CLI¶
The AWS Command Line Interface (CLI) is a tool to manage AWS services, including AWS S3, when authenticated with your IAM credentials.
After installing and configuring according to the instructions provided by AWS, you can execute operations with your credentials.
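The configuration step is interactive; a sketch of what it looks like (the region and output format shown are just examples):

```shell
# One-time setup: stores the key/secret under the "default" profile
aws configure
# AWS Access Key ID [None]: <your key>
# AWS Secret Access Key [None]: <your secret>
# Default region name [None]: us-east-1
# Default output format [None]: json

# Or keep the credentials under a named profile instead:
aws configure --profile my-team
```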
With the example of data provided by Waze for Cities for Myanmar and stored in the World Bank-owned
s3://wbg-waze/ bucket, using the AWS CLI, you can:
List available data
aws s3 ls --recursive "s3://wbg-waze/myanmar/"
aws s3 ls --recursive --summarize --human-readable "s3://wbg-waze/myanmar/"
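If you want to filter the listing programmatically, the text output of `aws s3 ls --recursive` (date, time, size, key) is easy to parse. A minimal sketch; the helper name is ours, not part of the AWS CLI:

```python
# Hypothetical helper: turn `aws s3 ls --recursive` text output
# into (key, size) pairs you can filter and sort in Python.
def parse_s3_ls(output: str) -> list[tuple[str, int]]:
    rows = []
    for line in output.splitlines():
        parts = line.split(None, 3)  # date, time, size, key
        if len(parts) == 4 and parts[2].isdigit():
            rows.append((parts[3], int(parts[2])))
    return rows

sample = "2023-05-01 10:15:00     2048 myanmar/2023/traffic.csv"
print(parse_s3_ls(sample))  # [('myanmar/2023/traffic.csv', 2048)]
```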
Copy data from the cloud to local filesystem
aws s3 cp --recursive "s3://wbg-waze/myanmar" .
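If you re-run the copy regularly, `aws s3 sync` downloads only new or changed objects instead of everything. A sketch with the same example bucket:

```shell
# Preview what would be transferred, then run for real without --dryrun
aws s3 sync "s3://wbg-waze/myanmar" ./myanmar --dryrun
aws s3 sync "s3://wbg-waze/myanmar" ./myanmar
```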
If you prefer a graphical interface to manage data, we encourage using Cyberduck. Cyberduck is a free cloud storage browser that allows you to explore files as if they were on your local filesystem.
Using Dask¶
Dask is a Python library for parallel and larger-than-memory computing that, together with s3fs, can read data directly from AWS S3. First, install the dependencies:
pip install "dask[complete]" s3fs
Now, on the console, you can read the data with only one line of code. Magic!
import dask.dataframe as dd
df = dd.read_csv('s3://bucket/path/to/data-*.csv')
If using a named profile, you can pass it as an argument (or export the AWS_PROFILE environment variable):
# Options passed to s3fs.S3FileSystem
storage_options = dict(profile="named-profile")
df = dd.read_csv('s3://bucket/path/to/data-*.csv', storage_options=storage_options)
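When the same script must work both with a named profile and with the default credentials, it can help to build `storage_options` in one place. A minimal sketch; the helper name is ours, but `anon` and `profile` are real s3fs.S3FileSystem options:

```python
# Hypothetical helper: choose s3fs options for the common access patterns.
def make_storage_options(profile=None, anon=False):
    """Return a dict suitable for Dask's storage_options argument."""
    opts = {}
    if anon:
        opts["anon"] = True        # public buckets, no credentials needed
    elif profile:
        opts["profile"] = profile  # a named profile from ~/.aws/credentials
    return opts                    # empty dict -> default credential chain

print(make_storage_options(profile="named-profile"))  # {'profile': 'named-profile'}
```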