You always want to be surrounded by people smarter and more capable. People you can learn from and look up to. People who challenge you and keep you on your toes. Rising from the bottom of that group should be your cue to start finding a new place.
Amazon FSx for Lustre is an excellent option for running workloads where speed matters. One of those scenarios is training machine learning models.
FSx for Lustre integrates with Amazon S3 to transparently access files in a bucket as if they were locally available to your instance. The FSx file system will lazy-load data from the attached S3 bucket, and from that point forward, files will be directly accessible to your applications.
I'm going to assume that you can figure out how to set up an FSx for Lustre file system pointing to your S3 bucket, and focus the rest of this post on making it accessible to a SageMaker Notebook Instance.
I'd recommend creating a SageMaker Lifecycle Configuration to set up the file system every time the Notebook Instance starts. But you can also do this manually every time if you want.
Here is the script that you can use to set up the Lifecycle Configuration:
```shell
#!/bin/bash
set -e

# OVERVIEW
# This script mounts an FSx for Lustre file system to the Notebook
# Instance at the /fsx directory based off the DNS and Mount name
# parameters.
#
# This script assumes the following:
# 1. There's an FSx for Lustre file system created and running
# 2. The FSx for Lustre file system is accessible from the
#    Notebook Instance
#    - The Notebook Instance has to be created on the same VPC
#      as the FSx for Lustre file system
#    - The subnets and security groups have to be properly set up
# 3. Set the FSX_DNS_NAME parameter below to the DNS name of the
#    FSx for Lustre file system.
# 4. Set the FSX_MOUNT_NAME parameter below to the Mount name of
#    the FSx for Lustre file system.

# PARAMETERS
FSX_DNS_NAME=fs-your-fs-id.fsx.your-region.amazonaws.com
FSX_MOUNT_NAME=your-mount-name

# First, we need to install the lustre-client libraries
sudo yum install -y lustre-client

# Now we can create the mount point and mount the file system
# (mkdir -p keeps the script idempotent across restarts)
sudo mkdir -p /fsx
sudo mount -t lustre \
    -o noatime,flock $FSX_DNS_NAME@tcp:/$FSX_MOUNT_NAME /fsx

# Let's make sure we have the appropriate access to the directory
sudo chmod go+rw /fsx
```
You have to make a couple of changes before saving this script:
- FSX_DNS_NAME: This is the DNS name of the FSx file system.
- FSX_MOUNT_NAME: This is the mount name of the FSx file system.
After setting those two parameters, the script does a few things:
- It installs the lustre-client library. As of the time of this writing, this library doesn't come pre-installed on Notebook Instances.
- It creates the directory where the file system will be mounted. The script uses /fsx for this, but you can change it to a different directory if you want.
- It mounts the file system using the DNS name and mount name you configured.
- Finally, it sets the appropriate permissions on the /fsx directory so you can read and write from it.
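You can create the Lifecycle Configuration in the SageMaker console, or register it from the command line. Here's a sketch using the AWS CLI, assuming you saved the script as on-start.sh (the configuration name and file name are just examples):

```shell
# Register the mount script as a Lifecycle Configuration that runs every
# time the Notebook Instance starts. The script content must be
# base64-encoded; "mount-fsx" and "on-start.sh" are illustrative names.
aws sagemaker create-notebook-instance-lifecycle-config \
    --notebook-instance-lifecycle-config-name mount-fsx \
    --on-start Content="$(base64 -w0 on-start.sh)"
```

When creating the Notebook Instance, select this Lifecycle Configuration and the file system will be mounted on every start.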
After saving the Lifecycle Configuration script, you can create a new Notebook Instance using this configuration. You also need to make sure your file system is accessible from the Notebook Instance. You can ensure this by creating the instance on the same VPC as the file system. Make sure you set up the subnets and security groups appropriately.
After the Notebook Instance starts, you can open the Jupyter Notebook's Terminal window and check that you have access to the contents of your S3 bucket inside the /fsx directory.
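A couple of quick terminal checks work well here. These assume the default /fsx mount point from the script:

```shell
# the FSx file system should show up as a mounted lustre file system
df -h -t lustre

# the lazy-loaded contents of the attached S3 bucket should be listed here
ls /fsx
```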
Over the past few years, I've met a lot of data scientists and machine learning engineers. I've also had the opportunity to learn a lot about what a machine learning job means, so I decided to share some thoughts.
Here is the biggest misconception that I hear over and over again: "Machine Learning Engineers spend their time designing and training machine learning models." This idea is misleading. On the one hand, yes, they spend some of their time designing and training new models, but realistically, this is just a small fraction of the job.
Usually, Machine Learning Engineers spend most of their time dealing with data and putting their models into production. In a perfect world, there would be a Data Engineer to help set up the data pipeline that feeds the machine learning model. There would also be DevOps specialists, Software Engineers, and other roles helping with everything that needs to happen to take a model out of a laptop and into the real world. The reality is somewhat different.
Here is a high-level summary of everything you should expect in a regular Machine Learning Engineer position:
- Gather the list of requirements and understand the problem that you need to solve
- Communicate with stakeholders throughout the entire lifecycle of the project
- Do a lot of research to find the best way to solve the problem
- Design a solution that provides value and stays within the timeline and budget constraints
- Determine what data you will use, based on its availability and usefulness
- Design and implement the solution to gather the data and make it available to your implementation
- Do feature engineering on the original data to transform it to your specific needs
- Create the appropriate machine learning models to solve the problem
- Design and implement the connections of your solution with existing systems and processes
- Glue together all components into a comprehensive solution that addresses the original problem
- Take your solution into production
- Design a process to keep your product up to date (further model training, updates, etc.)
Your mileage may vary depending on your company and its characteristics, but in general, your job will be much more than just training models.
(In my opinion, Software Engineers that get into Machine Learning have an excellent advantage to succeed in the field.)
Don't be lazy. Don't wait for your company to give you something new so you can use the time to advance your skillset. That's shortsighted and won't get you what you want.
I hear a lot of people complain about being stuck doing the same work they always do. "It's hard to move forward," they say. "I need them to give me something different."
Well, it doesn't work like that. People get boxed in all the time, like it or not. Somebody has an idea about your skills and will keep using you in that role until the end of time.
I know you are waiting for the magic moment when you get the approval to do something new, challenging, and unexpected. But how would anyone trust you when all you've done is the same boring thing over and over? It becomes a catch-22, and you are the one who can break the cycle.
Nobody cares about where you are in your career as much as you should. Companies care about getting shit done. They care about making money. Taking risks by giving people things that may be over their heads is not a brilliant strategy.
Sit down with yourself and think long and hard about what you want to do. Where do you want to be tomorrow? How do you want to improve your skills? If you are lucky and able to align your interests with your company's, there's a huge opportunity to level up, pitch your new skills to your boss, and get the company to pay for further development.
This strategy works. Waiting for the miracle doesn't.
I know this sounds crazy, but one of the biggest struggles Machine Learning engineers face is taking their models and making them available so they can be used as intended.
To be a little more specific: taking what seems to work fine on a laptop and deploying it to production so others can take advantage of it is more complicated than you may believe it should be.
This is something software engineers have dealt with for decades. There are tools, processes, videos, books, and tricks about how to do it. But deploying machine learning models is a comparatively young problem, and very likely outside a Data Scientist's list of competencies unless they come from an engineering background.
A lot depends on the model you have, so I can't answer all the questions for you, but if you are looking to deploy an Object Detection model using TensorFlow, I can show you a couple of options that will brighten your day.
This GitHub repository has what you need. You can get the code and the instructions on how to run it.
You have two options:
- You can deploy your TensorFlow Object Detection model on SageMaker, or
- You can deploy it on-premises.
The beauty is that you can use the same exact Docker container to do that. One container, two options to get it running.
Random notes, if you are curious
You probably don't need to read any of this to get things up and running, but in case you want a little bit more information, here is some.
Dockerizing this thing makes sure you can deploy your model pretty much anywhere you want. I'm not going to explain what Docker is, but if you are reading this, you probably know already. (And if you don't, cancel your plans for the weekend, grab the popcorn, and solve that problem.)
The structure of the files and code inside the Docker container makes SageMaker happy. This container was created to run on SageMaker first, then adapted to run locally (on-premises) as well. I use it both ways. I've never tried it elsewhere, but it is just a Docker container, so it should work as long as you can talk to it through HTTP.
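For reference, SageMaker-style containers serve over HTTP on port 8080, with a GET /ping health check and a POST /invocations inference endpoint. So talking to the container running locally looks roughly like this; the content type and payload below are assumptions, so check the repository's README for the exact format the container expects:

```shell
# health check: a SageMaker-compatible container answers GET /ping
curl http://localhost:8080/ping

# inference: POST the input image to /invocations
# (the content type and payload format depend on the container's serving code)
curl -X POST http://localhost:8080/invocations \
     -H "Content-Type: application/x-image" \
     --data-binary @test-image.jpg
```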
If you go over the README.md documentation, you'll quickly realize that I spent a long time explaining how to set things up on SageMaker (not only the container, but also how to configure a training job and connect all the pieces to train a model). You don't need any of this if you are training your model elsewhere, but I wanted to show how to run the entire pipeline on SageMaker.
Finally, at the time of this writing, the TensorFlow Object Detection API doesn't support TensorFlow 2.0. The latest working version of TensorFlow is 1.15, and that's what you'll get on the GitHub repository. I'm pretty sure I'll update it as soon as 2.0 is supported, but for now, that's the best we get.
Python 2 is out of support. I created this blog back in 2014, using Python 2 running on Google's App Engine Standard Environment. It worked great. It gave me zero problems during these 5+ years.
But it was time to extend its life for another 5+ years, so I updated the code to run in Python 3.7. And made a bunch of improvements while at it.
The most significant change was to simplify the architecture of the blog and the processing of pages. Back then, I did a lot of cool stuff to make sure the blog was fast. And it was fast! But I'm not sure that it was faster than a version without all that.
In theory, everything I did in 2014 makes sense. Hell, I might have to go back to a similar version in two weeks if I realize that this new engine doesn't cut it! But I don't think I'll need it.
It's a shame because I spent so many hours optimizing the blog. Premature optimization. I should have known better.
Anyway, I hope everything keeps working as it has been. I'll keep an eye on it to make sure everything is okay. Then, I'll go back and clean up every file related to the old version.
Let's see how it goes.
If you haven't checked FastAPI as an alternative to Flask, take a look at it, and you'll be pleasantly surprised by how capable, modern, and cool it is.
I'm not going to talk about FastAPI here, but I'll explain how to get a simple "Hello World" application running on Google's App Engine.
For this example, I'm going to be using the App Engine's Python 3 Standard Environment. Deploying to the Flexible Environment should be very similar.
You'll need to create three files:
requirements.txt - Here, you'll list your required libraries so App Engine can prepare the environment to run your application. Here's what's needed for this file:
```
fastapi
uvicorn
gunicorn
```
Whether or not you specify the version of each library (e.g., gunicorn==20.0.4) is not relevant now. Either way works.
Then you need an app.yaml file. This is the configuration of your application. Here we need to specify the runtime we are going to be using and the entry point App Engine will use to provision a new instance:
```yaml
runtime: python37
entrypoint: gunicorn main:app -w 4 -k uvicorn.workers.UvicornWorker
```
The uvicorn worker is the one that allows us to run a FastAPI application. Also, notice I'm specifying that 4 workers (-w 4) should be serving the app. This number of workers should match the instance class of your App Engine deployment, as explained in Entrypoint best practices.
Finally, we need a main.py file containing a FastAPI object called app:
```python
from fastapi import FastAPI

app = FastAPI()


@app.get("/")
async def index():
    return "Hello World!"
```
To deploy the application, and assuming you have the Cloud SDK installed and initialized, you can run the following command:
```shell
gcloud app deploy app.yaml
```
This command should deploy the application to App Engine. From there, you can visit the URL configured for your project, and you should get the "Hello World!" text back.
Great developers spend much of their time writing code. It's not only about how much they study, read, and keep up with the latest trends but how much time they dedicate to hone their craft. They constantly make things. Again and again.
Great developers spend much of their time reading to improve what they already know. It's not only about writing code day in and day out, but about keeping up with what's new to stay relevant. They continually augment their knowledge.
I've seen people who can't build anything useful but are always on top of what's new. I've seen people stuck in the past who are excellent at building things with what they know. They both have a place. They are both in the middle of the pack.
To be a great developer, you need as much learning as doing. Every day is another opportunity for you to aim higher.