S3 Files First Impressions

S3 Files was surprisingly easy to set up: take an existing bucket, enable the file system in the UI, and mount it, without much security group or policy headache. A big upside for local dev is that our scripts can simply read from and write to a directory, and the same code works in the cloud, where that directory is an S3 bucket mounted behind the scenes.
$ yum -y install amazon-efs-utils
$ mount.efs --version
/usr/sbin/mount.efs Version: 3.0.0
$ sudo mount -t s3files fs-0fe... /mnt/s3files
$ cd /mnt/s3files/
$ ls
dataset1 dataset2
If you hit the mount error "/mnt/s3files: unknown filesystem type 's3files'", make sure you're running amazon-efs-utils 3.x, not 2.x or lower:
$ mount.efs --version
/usr/sbin/mount.efs Version: 3.0.0
If not, upgrade:
$ sudo yum -y upgrade amazon-efs-utils
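The version check above can also be scripted, for example in a provisioning step. The following is a minimal sketch, assuming `mount.efs --version` prints a line shaped like the one in the transcript ("... Version: 3.0.0") and that 3.x is the minimum, as noted above. The function names are hypothetical, not part of amazon-efs-utils.

```python
import re
import shutil
import subprocess
from typing import Tuple


def parse_efs_utils_version(output: str) -> Tuple[int, ...]:
    """Extract the dotted version from `mount.efs --version` output.

    Assumes the format seen in the transcript, e.g.
    '/usr/sbin/mount.efs Version: 3.0.0'.
    """
    match = re.search(r"Version:\s*(\d+(?:\.\d+)*)", output)
    if match is None:
        raise ValueError(f"no version found in: {output!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def supports_s3files(output: str, minimum_major: int = 3) -> bool:
    """Return True if the reported major version is at least minimum_major."""
    return parse_efs_utils_version(output)[0] >= minimum_major


if __name__ == "__main__":
    if shutil.which("mount.efs") is None:
        print("mount.efs not found; install amazon-efs-utils first")
    else:
        result = subprocess.run(
            ["mount.efs", "--version"],
            capture_output=True,
            text=True,
            check=False,
        )
        reported = result.stdout + result.stderr
        try:
            ok = supports_s3files(reported)
        except ValueError as error:
            print(f"could not parse version: {error}")
        else:
            print("ok" if ok else "upgrade amazon-efs-utils to 3.x")
```

Parsing is kept separate from the subprocess call so the check can be exercised on machines without amazon-efs-utils installed.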
It's clear that it borrows a lot of patterns from EFS (the mount-target approach, less-than-ideal performance on small files). My first git clone onto the mount failed partway through, though that's probably not the intended use case:
$ time git clone git@github.com:.../...git
Cloning into ...
remote: Enumerating objects: 427502, done.
remote: Counting objects: 100% (8417/8417), done.
remote: Compressing objects: 100% (727/727), done.
fatal: write error: Bad file descriptor 328.91 MiB | 53.64 MiB/s
fatal: fetch-pack: invalid index-pack output
real 0m11.558s
user 0m0.062s
sys 0m0.762s
Let's try it again (the same clone normally takes about 1m 30s on a local filesystem):
$ time git clone git@github.com:.../...git
Cloning into ...
remote: Enumerating objects: 427502, done.
remote: Counting objects: 100% (8417/8417), done.
remote: Compressing objects: 100% (727/727), done.
remote: Total 427502 (delta 8070), reused 7736 (delta 7686), pack-reused 419085 (from 3)
Receiving objects: 100% (427502/427502), 1.39 GiB | 41.67 MiB/s, done.
Resolving deltas: 100% (320952/320952), done.
Updating files: 100% (24652/24652), done.
real 7m28.162s
user 1m18.650s
sys 0m16.642s
The retry finished, but took 7m 28s, roughly five times the local time. ETL/Parquet data work is the audience that clearly benefits; it is not intended for something like git code sandboxes.
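To get a rough feel for whether a workload looks more like a git checkout (many small files) or a Parquet dump (a few large ones), a crude write benchmark can help. This is a sketch under assumptions: the `BENCH_DIR` environment variable and all function names are hypothetical, it only measures writes, and it does not call fsync, so on a local disk the page cache may hide much of the cost. Treat the numbers as relative, not absolute.

```python
import os
import tempfile
import time
from pathlib import Path


def time_small_files(root: Path, count: int, size: int) -> float:
    """Write `count` files of `size` bytes each and return elapsed seconds."""
    payload = b"x" * size
    start = time.perf_counter()
    for index in range(count):
        (root / f"small_{index:05d}.bin").write_bytes(payload)
    return time.perf_counter() - start


def time_single_file(root: Path, total_bytes: int) -> float:
    """Write one file of `total_bytes` bytes and return elapsed seconds."""
    payload = b"x" * total_bytes
    start = time.perf_counter()
    (root / "large.bin").write_bytes(payload)
    return time.perf_counter() - start


def compare(target_dir: str, count: int = 1000, size: int = 4096) -> dict:
    """Write the same number of bytes as many small files and as one file."""
    # A scratch directory inside target_dir keeps the benchmark self-cleaning.
    with tempfile.TemporaryDirectory(dir=target_dir) as scratch:
        scratch_path = Path(scratch)
        small_seconds = time_small_files(scratch_path, count, size)
        single_seconds = time_single_file(scratch_path, count * size)
    return {
        "bytes": count * size,
        "small_files_s": small_seconds,
        "single_file_s": single_seconds,
    }


if __name__ == "__main__":
    # BENCH_DIR is a hypothetical variable; point it at the mount,
    # e.g. /mnt/s3files, or leave it unset to use the local temp directory.
    target = os.environ.get("BENCH_DIR", tempfile.gettempdir())
    print(compare(target))
```

Running it once against a local directory and once against the mount gives two ratios to compare.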
The biggest win is dev/live parity through a filesystem interface. You don't need to clutter your code with S3 library PUTs and GETs when you can just assume everything is mounted on the filesystem, whether running locally or remotely.
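As an illustration of that parity, here is a minimal sketch of a script that only knows about a directory. The `DATA_ROOT` environment variable, the `./data` default, and the function names are all hypothetical. CSV stands in for Parquet to keep the example dependency-free.

```python
import csv
import os
from pathlib import Path
from typing import Iterable, List


def data_root() -> Path:
    """Directory the script reads and writes.

    Locally this could be ./data; in the cloud it could be the mount point
    (for example /mnt/s3files). DATA_ROOT is a hypothetical variable name.
    """
    return Path(os.environ.get("DATA_ROOT", "./data"))


def write_rows(dataset: str, name: str, rows: Iterable[List[str]]) -> Path:
    """Write rows as CSV under <root>/<dataset>/<name> using plain file I/O."""
    target = data_root() / dataset / name
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("w", newline="") as handle:
        csv.writer(handle).writerows(rows)
    return target


def read_rows(dataset: str, name: str) -> List[List[str]]:
    """Read the CSV back; no S3 client is involved at any point."""
    with (data_root() / dataset / name).open(newline="") as handle:
        return list(csv.reader(handle))
```

The only thing that changes between environments is the value of `DATA_ROOT`; the read and write paths stay identical.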
