S3 Files First Impressions

S3 Files was surprisingly easy to set up: take an existing bucket, enable the file system in the UI, and mount it. Not much security group or policy setup headache was required. A big upside for local dev is that our scripts can just write to and read from a directory, and the same code works in the cloud as well (via S3 buckets mounted behind the scenes).

$ sudo yum -y install amazon-efs-utils
$ mount.efs --version
/usr/sbin/mount.efs Version: 3.0.0
$ sudo mount -t s3files fs-0fe... /mnt/s3files
$ cd /mnt/s3files/
$ ls
dataset1 dataset2

If you're hitting mount error: /mnt/s3files: unknown filesystem type 's3files'

Ensure you're running amazon-efs-utils 3.x, not 2.x or lower:

$ mount.efs --version
/usr/sbin/mount.efs Version: 3.0.0

If not:

$ sudo yum -y upgrade amazon-efs-utils
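
The version check above can be scripted so a provisioning step fails early instead of at mount time. This is a minimal sketch, not part of amazon-efs-utils: the function names `efs_utils_major` and `supports_s3files` are made up for illustration, and it assumes the version line keeps the `... Version: X.Y.Z` shape shown above and that any major version of 3 or higher knows the `s3files` type.

```shell
#!/bin/sh
# Hypothetical helper: pull the major version out of a line such as
# "/usr/sbin/mount.efs Version: 3.0.0" (format assumed from the output above).
efs_utils_major() {
    # Take the last whitespace-separated field ("3.0.0") and keep the part
    # before the first dot.
    printf '%s\n' "$1" | awk '{ split($NF, parts, "."); print parts[1] }'
}

# Hypothetical helper: succeeds only when the major version is 3 or higher
# (assumption: 3.x and later support the s3files mount type).
supports_s3files() {
    major=$(efs_utils_major "$1")
    case "$major" in
        ''|*[!0-9]*) return 1 ;;   # empty or non-numeric: treat as unsupported
    esac
    [ "$major" -ge 3 ]
}

# Example use on a real host:
#   supports_s3files "$(mount.efs --version 2>/dev/null)" || sudo yum -y upgrade amazon-efs-utils
```

Passing the version line in as an argument keeps the check testable on machines that do not have `mount.efs` installed.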

It is clearly borrowing a lot of patterns from EFS (the mount targets approach, less-than-ideal performance on small files). A git clone onto the mount failed partway through on my first attempt (which is probably not the intended use case):

$ time git clone git@github.com:.../...git
Cloning into ...
remote: Enumerating objects: 427502, done.
remote: Counting objects: 100% (8417/8417), done.
remote: Compressing objects: 100% (727/727), done.
fatal: write error: Bad file descriptor 328.91 MiB | 53.64 MiB/s
fatal: fetch-pack: invalid index-pack output

real	0m11.558s
user	0m0.062s
sys	0m0.762s

Let's try it again (normally it takes 1m 30s on local fs):

$ time git clone git@github.com:.../...git
Cloning into ...
remote: Enumerating objects: 427502, done.
remote: Counting objects: 100% (8417/8417), done.
remote: Compressing objects: 100% (727/727), done.
remote: Total 427502 (delta 8070), reused 7736 (delta 7686), pack-reused 419085 (from 3)
Receiving objects: 100% (427502/427502), 1.39 GiB | 41.67 MiB/s, done.
Resolving deltas: 100% (320952/320952), done.
Updating files: 100% (24652/24652), done.

real	7m28.162s
user	1m18.650s
sys	0m16.642s
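
The clone above checks out 24,652 files, which is exactly the kind of small-file workload where the mount is slow. A rough way to probe that on your own setup, without involving git, is to write a batch of tiny files and time it. This is only a sketch: `write_small_files` is a hypothetical helper, and the directory names in the usage note are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: create COUNT tiny files under DIR, then print how
# many entries DIR contains afterwards.
write_small_files() {
    dir="$1"
    count="$2"
    mkdir -p "$dir"
    i=1
    while [ "$i" -le "$count" ]; do
        # One short line per file, so per-file overhead dominates the run time.
        printf 'file %s\n' "$i" > "$dir/small_$i.txt"
        i=$((i + 1))
    done
    ls "$dir" | wc -l | tr -d ' '
}
```

In bash, `time write_small_files /mnt/s3files/bench 1000` compared with `time write_small_files /tmp/bench 1000` gives a like-for-like comparison between the mount and local disk.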

ETL and Parquet data modeling workloads clearly benefit from this; it is not intended for something like git code sandboxes.

The biggest win is dev/live parity through a filesystem interface. You don't need to clutter your code with S3 library PUTs and GETs when you can just assume everything will be mounted on the filesystem, whether running locally or remotely.
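
As a minimal sketch of what that parity can look like in a script: the only thing that changes between environments is a base directory. `DATA_DIR`, `write_report`, and `read_report` are hypothetical names chosen for illustration, not anything S3 Files provides.

```shell
#!/bin/sh
# DATA_DIR is a hypothetical setting: a plain local directory in dev, the
# mount point (for example /mnt/s3files) in the cloud.
DATA_DIR="${DATA_DIR:-./data}"

# Write a small report under $DATA_DIR/<dataset>/ using ordinary file I/O.
write_report() {
    dataset="$1"
    body="$2"
    mkdir -p "$DATA_DIR/$dataset"
    printf '%s\n' "$body" > "$DATA_DIR/$dataset/report.txt"
}

# Read the report back; no S3 client is involved on either side.
read_report() {
    cat "$DATA_DIR/$1/report.txt"
}
```

Locally the script runs against `./data`; on a host with the mount in place, setting `DATA_DIR=/mnt/s3files` points the same functions at the bucket.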
