Copying multiple files from an S3 bucket in one go

Introduction

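The aws command gives you easy access to S3 with an argument order that closely resembles standard Linux commands. It is not identical, though. For example, if you add an asterisk to an aws s3 cp command, it is not treated as a wildcard and the copy fails.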

$ aws s3 cp s3://my-bucket/json-data* /mnt/.
A client error (404) occurred when calling the HeadObject operation: Key "json-data*" does not exist
Completed 1 part(s) with ... file(s) remaining
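
The error above suggests that json-data* was looked up as a literal key name. A wildcard can instead be applied on the client side once the key names are known. The following is a minimal sketch in Python, assuming the keys have already been listed; the function name select_keys and the sample keys are made up for illustration.

```python
from fnmatch import fnmatchcase


def select_keys(keys, pattern):
    """Return the keys that match a shell-style wildcard pattern."""
    # fnmatchcase compares case-sensitively on every platform,
    # which matches how S3 treats key names.
    return [key for key in keys if fnmatchcase(key, pattern)]


# Hypothetical key names, for illustration only.
keys = ["json-data-001.json", "json-data-002.json", "readme.txt"]
print(select_keys(keys, "json-data*"))
# prints ['json-data-001.json', 'json-data-002.json']
```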

This post shows how to handle multiple files with Bash and with Python 3 (+ boto3).

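Make sure the execution environment has appropriate permissions to access S3. If you grant permissions through an IAM user's credentials, it is convenient to store them in ~/.aws/credentials. If you grant permissions through an IAM role, no particular preparation is needed on the execution environment side.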

Targeting an entire directory

Bash sample

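As with rsync, you can copy everything in one go. A handy way to remember the syntax: the trailing slash states explicitly that this is a directory-to-directory copy. In the sample below, the S3 directory s3dir and the local directory localdir end up with the same structure for the files that pass the filters.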

$ aws s3 sync s3://my-bucket/s3dir/ /mnt/localdir/ \
      --exclude "*" \
      --include "json-data*"
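
The order of the two filters matters. The AWS CLI documentation describes filters that appear later in the command as taking precedence over earlier ones, with every file included by default. The sketch below is a rough Python model of that rule, not the CLI's actual implementation; is_selected and the sample file names are hypothetical.

```python
from fnmatch import fnmatchcase


def is_selected(relative_path, filters):
    """Apply (kind, pattern) filters in order; the last match wins."""
    selected = True  # with no filters, every file is included
    for kind, pattern in filters:
        if fnmatchcase(relative_path, pattern):
            selected = (kind == "include")
    return selected


# Same order as the command above: exclude everything, then re-include.
filters = [("exclude", "*"), ("include", "json-data*")]
print(is_selected("json-data-001.json", filters))  # True
print(is_selected("readme.txt", filters))          # False
```

Under this model, swapping the two filters would exclude every file, because the final "*" exclusion would override the earlier include.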

Using a list of files to copy

Bash sample

First fetch the full listing, then extract only the files you want and process them in a for ... do ... done loop.

$ LIST=$(aws s3 ls s3://my-bucket/ | grep json-data | awk '{print $4}')
$ for ITEM in $LIST
  do
      aws s3 cp s3://my-bucket/$ITEM /mnt/
  done
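
Because awk '{print $4}' splits on whitespace, an object name that contains a space would be cut off at the first space. The sketch below shows one way to parse the listing in Python while keeping such names intact. It assumes each object line has the layout date, time, size, name and that sub-prefixes appear as "PRE name/" lines; keys_from_listing and the sample listing are made up for illustration.

```python
def keys_from_listing(listing, keyword):
    """Extract object names containing keyword from `aws s3 ls` output."""
    names = []
    for line in listing.splitlines():
        # Assumed layout: date, time, size, then the name.
        # Splitting at most 3 times keeps spaces inside the name.
        fields = line.split(None, 3)
        if len(fields) < 4 or fields[0] == "PRE":
            continue  # blank line or a "PRE prefix/" entry
        name = fields[3]
        if keyword in name:
            names.append(name)
    return names


# Made-up listing, for illustration only.
sample = """\
                           PRE archive/
2016-05-01 10:00:00       1024 json-data-001.json
2016-05-01 10:00:05       2048 json-data 002.json
2016-05-01 10:00:09        512 readme.txt
"""
print(keys_from_listing(sample, "json-data"))
# prints ['json-data-001.json', 'json-data 002.json']
```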

Python 3 sample

In Python 3, the boto3 package is used to copy (download) the files. For setting up Python 3 and a venv environment, see this post.

(venv35)$ pip install boto3

s3-cp.py

import os

import boto3

bucketname = 'my-bucket'
keyword = 'json-data'
saveto = '/mnt'

# boto3.resource() returns a resource object, not a low-level client.
s3 = boto3.resource('s3')
bucket = s3.Bucket(bucketname)
for a_object in bucket.objects.all():
    # Skip non-matching keys and zero-byte "folder" placeholder keys.
    if keyword not in a_object.key or a_object.key.endswith('/'):
        continue
    # Prepare the local directory that mirrors the key's path.
    savetopath = os.path.join(saveto, a_object.key)
    os.makedirs(os.path.dirname(savetopath), exist_ok=True)
    # Copy (download) the object.
    bucket.download_file(a_object.key, savetopath)

(venv35)$ python s3-cp.py
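
The script lists every object in the bucket and filters on the client side. When the keyword is the beginning of the key, the listing can be narrowed on the S3 side with the Prefix argument of objects.filter. The following is a minimal sketch under that assumption; download_by_prefix is a hypothetical helper, and bucket is expected to be a boto3 Bucket resource.

```python
import os


def download_by_prefix(bucket, prefix, saveto):
    """Download every object whose key starts with prefix into saveto."""
    saved = []
    # bucket is assumed to be a boto3 Bucket resource, e.g.
    # boto3.resource('s3').Bucket('my-bucket').
    for obj in bucket.objects.filter(Prefix=prefix):
        if obj.key.endswith('/'):
            continue  # skip zero-byte "folder" placeholder keys
        path = os.path.join(saveto, obj.key)
        # Create the local directory that mirrors the key's path.
        os.makedirs(os.path.dirname(path), exist_ok=True)
        bucket.download_file(obj.key, path)
        saved.append(path)
    return saved
```

After import boto3, it could be called as download_by_prefix(boto3.resource('s3').Bucket('my-bucket'), 'json-data', '/mnt'). Unlike the keyword test in the script, a prefix only matches at the start of the key.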