How to hash a file in an S3 bucket using AWS Lambda

Calculating the hash of a file uploaded to an S3 bucket is a big deal, especially when the upload passes through API Gateway, which encodes the file as base64. We could use the ETag that AWS generates to check file integrity, but this is not a reliable solution: the ETag is not guaranteed to be the MD5 of the content (for multipart uploads, for example, it is not), and AWS may change how it is computed.

The best solution is to use the s3fs package, which we can install with pip. It exposes the bucket as a file system, so we can open the file, read it in blocks, and calculate the MD5 hash or whichever algorithm you prefer.
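The block-by-block idea does not depend on S3 itself, so it can be sketched and tried locally first. The helper below is only an illustration: the name `hash_stream` and its `algorithm` and `block_size` parameters are assumptions, not part of s3fs. It accepts any binary file object and any algorithm name that `hashlib.new` supports.

```python
import hashlib
import io


def hash_stream(stream, algorithm="md5", block_size=4096):
    """Return the hex digest of a binary file-like object, read in blocks."""
    digest = hashlib.new(algorithm)
    # iter() with a sentinel keeps calling read() until it returns b"" (end of file)
    for block in iter(lambda: stream.read(block_size), b""):
        digest.update(block)
    return digest.hexdigest()


if __name__ == "__main__":
    data = b"hello world"
    # io.BytesIO stands in for the file object that s3fs would return
    print(hash_stream(io.BytesIO(data)))
    print(hash_stream(io.BytesIO(data), algorithm="sha256"))
```

Because the helper only calls `read()`, the file object returned by `S3FileSystem.open(..., 'rb')` should work in place of `io.BytesIO`.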

To install s3fs:

# Using pip
pip install s3fs

# From source
git clone git@github.com:dask/s3fs
cd s3fs
python setup.py install

This function will return the MD5 digest, which you can then compare or save in a database. Note that s3fs already has a checksum function.

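import hashlib

import s3fs


def hash_file(bucket, key):
    """Return the hex MD5 digest of the object stored at bucket/key."""
    md5_hash = hashlib.md5()
    # anon=False: use AWS credentials instead of anonymous access
    fs = s3fs.S3FileSystem(anon=False)
    with fs.open(f'{bucket}/{key}', 'rb') as f:
        # Read in 4096-byte blocks so the whole file is never held in memory
        for byte_block in iter(lambda: f.read(4096), b""):
            md5_hash.update(byte_block)
    return md5_hash.hexdigest()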
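To call a function like this from a Lambda, the bucket and key have to come from somewhere. The sketch below assumes the Lambda is triggered by an S3 event notification, which this post does not state, so treat it as one possible wiring. `extract_s3_objects` and `make_handler` are hypothetical names, and the hashing function is passed in so the example runs without AWS access.

```python
from urllib.parse import unquote_plus


def extract_s3_objects(event):
    """Return (bucket, key) pairs found in an S3 event notification payload."""
    pairs = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys are URL-encoded in S3 events (a space arrives as '+')
        key = unquote_plus(record["s3"]["object"]["key"])
        pairs.append((bucket, key))
    return pairs


def make_handler(hash_func):
    """Build a Lambda-style handler around any hash_func(bucket, key)."""
    def handler(event, context):
        # Map "bucket/key" to the digest returned by the injected function
        return {
            f"{bucket}/{key}": hash_func(bucket, key)
            for bucket, key in extract_s3_objects(event)
        }
    return handler


if __name__ == "__main__":
    sample_event = {
        "Records": [
            {"s3": {"bucket": {"name": "my-bucket"},
                    "object": {"key": "uploads/my+report.pdf"}}}
        ]
    }
    # A stand-in hash function; in the Lambda this would be hash_file
    handler = make_handler(lambda bucket, key: "digest-placeholder")
    print(handler(sample_event, None))
```

In a deployed function, `make_handler(hash_file)` would be assigned to the name configured as the Lambda handler.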

A package with the same name is also available for Node.js. Install s3fs with npm (the crypto module is built into Node.js) and then:

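const crypto = require('crypto')
const S3FS = require('s3fs')

// Returns a Promise that resolves with the hex MD5 digest once the stream ends
const hashFile = (filename, bucket) => new Promise((resolve, reject) => {
    const hash = crypto.createHash('md5')
    const s3fs = new S3FS(bucket)
    const stream = s3fs.createReadStream(filename)
    // Buffers can be passed to update() without an encoding
    stream.on('data', data => hash.update(data))
    stream.on('error', reject)
    stream.on('end', () => resolve(hash.digest('hex'))) // <-- this is the hash
})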

Official documentation
