We have an Amazon S3 bucket that we need to monitor so we can process the files copied into it. This is pretty straightforward: have S3 invoke a lambda function whenever the s3:ObjectCreated:* event fires.
The Internet is plastered with examples on how to set up this process. You should be able to get everything up and running after a few minutes following one of those tutorials.
But here is a twist on the problem: we are going to receive files in pairs that need to be processed together. Specifically, we will receive an image file (let's assume a .PNG file for simplicity) along with related information (metadata) stored in a .JSON file with the same name. For example, we will get a file named image1.png and a file named image1.json copied into the bucket, and we need to make sure they are processed together.
Let's define our problem a little bit more formally:
Files will come in pairs: a .PNG file and a .JSON file with the same name.
Files might be copied at different times into the S3 bucket. We might get the images first, then the corresponding metadata files at a later date, or vice versa.
We need to process the files as a unit. We can't handle either the image or the metadata until we have access to both files.
AWS triggers an event for each object created in the S3 bucket, so for each pair of files, we will be getting two separate lambda invocations. We can't make assumptions about the order of the invocations, or how long until both files are ready, so we need to build some synchronization plumbing to take care of this.
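To make the pairing concrete, here is a minimal sketch of how a single invocation could work out which pair a file belongs to. It assumes the pair identifier is simply the object key without its extension; the function name parse_s3_record and the bucket name source-bucket are made up for illustration.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote_plus


def parse_s3_record(record):
    """Return (bucket, key, pair_id) for one record of an S3 event."""
    bucket = record["s3"]["bucket"]["name"]
    # S3 URL-encodes object keys in event notifications (spaces arrive as "+").
    key = unquote_plus(record["s3"]["object"]["key"])
    # Assumption: both files of a pair share the key minus the extension.
    pair_id = str(PurePosixPath(key).with_suffix(""))
    return bucket, key, pair_id


# Two separate events for the same pair, trimmed to the fields used above.
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "source-bucket"}, "object": {"key": "image1.png"}}},
        {"s3": {"bucket": {"name": "source-bucket"}, "object": {"key": "image1.json"}}},
    ]
}
parsed = [parse_s3_record(r) for r in sample_event["Records"]]
```

Both records resolve to the same identifier (image1), which is what lets two unrelated invocations find each other later.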
The idea behind the solution
I'm sure there are multiple ways to tackle this problem, but we wanted to make things as simple as possible, so we decided to have our lambda collect the files as they show up, and only trigger the processing step whenever we have the pair together.
We can't keep the files in the original S3 bucket because we aren't sure how long it will take for the pair to be ready, and the original files might be removed before we have a chance to collect both of them. This means that we need to copy the files to a separate S3 bucket as soon as we get access to them, and hold them there until we get access to the second file of the pair and have a chance to process them.
To keep track of which file we have and which one we are missing, we can use a DynamoDB table. Whenever the lambda function is invoked, we can check the table to determine whether we have both files of the pair, and only move to the processing step when we do.
The radiography of the lambda function
Here is a high-level description of what our lambda function looks like. Remember this function is invoked with every s3:ObjectCreated:* event triggered by the source S3 bucket:
Copy the file from the source S3 bucket into the target S3 bucket. This target bucket is our temporary space until we are ready to process the pair of files.
Get the record corresponding to the file from the DynamoDB table. We can do this using the name of the file (without its extension) as the identifier.
If a record doesn't exist, create a new one with a status of loading. A missing record means that this is the first file of the pair, so we just need to create a new record and do nothing else.
If a record does exist, update its status to ready and invoke the processing step.
The code that makes things happen
The gist above shows the Python implementation for this lambda function. Notice that the code makes the following assumptions:
Files will be copied into a source S3 bucket that's connected to this lambda function. You can do this by following any of the available samples published as part of AWS' documentation.
The lambda function will copy the files to a bucket named temp-bucket. Make sure you change this reference in the code to the name of your own bucket.
There's a DynamoDB table created with the name
There's a (mysterious) invoke_processing_step function that I'm leaving outside of the code. This function receives the name of the file and takes care of processing the pair.
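Separately from the gist, here is a minimal sketch of how the four steps described earlier could fit together under these assumptions. It is not the original code. The S3 client and the DynamoDB table are passed in as arguments so the logic can be exercised without AWS, the partition key attribute name "id" is an assumption, and invoke_processing_step is only a placeholder.

```python
TARGET_BUCKET = "temp-bucket"  # bucket name from the post; change it to your own


def invoke_processing_step(name):
    """Placeholder for the processing step, which the post leaves out."""
    print(f"processing pair {name}")


def handle_file(s3_client, table, source_bucket, key, process=invoke_processing_step):
    """Run the four steps for one new object and return the resulting status."""
    # Step 1: copy the file into the holding bucket.
    s3_client.copy_object(
        Bucket=TARGET_BUCKET,
        Key=key,
        CopySource={"Bucket": source_bucket, "Key": key},
    )

    # Step 2: look up the record for this pair.
    # Assumption: the table's partition key is a string attribute named "id".
    name = key.rsplit(".", 1)[0]
    item = table.get_item(Key={"id": name}).get("Item")

    if item is None:
        # Step 3: first file of the pair, so record it and stop.
        table.put_item(Item={"id": name, "status": "loading"})
        return "loading"

    # Step 4: second file of the pair, so mark it ready and process it.
    # "status" is a DynamoDB reserved word, hence the #s placeholder.
    table.update_item(
        Key={"id": name},
        UpdateExpression="SET #s = :ready",
        ExpressionAttributeNames={"#s": "status"},
        ExpressionAttributeValues={":ready": "ready"},
    )
    process(name)
    return "ready"


# Wiring inside Lambda would look roughly like this (requires boto3):
#   import boto3
#   s3_client = boto3.client("s3")
#   table = boto3.resource("dynamodb").Table("<your-table-name>")
```

This follows the steps literally, so it inherits their limits: it does not check which of the two files arrived, and the read-then-write on the table is not atomic, so two near-simultaneous invocations could both find no record.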
It could be a little bit more complicated
There might be more than two files that we need to process together. In that case, the DynamoDB table will have to store a little bit more information: which files have been read and which ones are missing. Extending the code to support this scenario shouldn't be much more complicated, so I'm leaving that to the reader.
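As a rough illustration of that extension, the record could carry the set of extensions seen so far and only become ready once nothing is missing. The sketch below uses a plain dictionary instead of DynamoDB; the three-file group, the field names received and missing, and the function register_file are all hypothetical.

```python
REQUIRED_EXTENSIONS = frozenset({".png", ".json", ".txt"})  # hypothetical group


def register_file(record, extension, required=REQUIRED_EXTENSIONS):
    """Return an updated copy of `record` after one more file has arrived."""
    received = set(record.get("received", ())) | {extension.lower()}
    missing = set(required) - received
    return {
        "received": sorted(received),
        "missing": sorted(missing),
        "status": "loading" if missing else "ready",
    }


# Files may arrive in any order; the status flips only on the last one.
record = {}
for ext in (".json", ".PNG", ".txt"):
    record = register_file(record, ext)
```

Storing the missing files explicitly also makes it easy to see what a stalled group is still waiting for.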
You might also want to remove the records from the DynamoDB table as soon as you finish processing them. I'm assuming this is outside the scope of this lambda function (or at least, it's not part of the main thread that I tried to follow with this post), but keep that in mind.
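If you do want the cleanup close to the processing step, one possible shape is a small wrapper like the sketch below. The function name process_and_clean_up and the "id" key attribute are assumptions; table stands for a boto3 DynamoDB Table resource or anything with the same delete_item method.

```python
def process_and_clean_up(table, name, process):
    """Process a pair, then delete its tracking record."""
    # If processing raises, the record is left in place for a retry or inspection.
    process(name)
    # Assumption: the table's partition key is a string attribute named "id".
    table.delete_item(Key={"id": name})
```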
It came out pretty good
In the end, the code came out pretty clean, and the process seems to be holding up pretty well. I'm curious about the results when we stress-test it by bombarding the bucket with all sorts of files, but I'm confident things will go as expected.
I'd love to hear about other ways to solve this same problem or any considerations that we might have missed when designing this solution. Don't hesitate to reach out if you have any comments.