Container Day: Empire – Building a PaaS with Amazon ECS

This talk is about Empire, an internal platform as a service that we built at the company I work for, and it's based on Docker and AWS services. Before I start, though, can I get a quick show of hands: who's using Docker in production right now? Wow, that's actually quite a few. What about ECS in production? Yeah, a couple. Cool.

So I'm going to start with a little background about why we decided to build a platform as a service, the path that led us to build Empire, and how we're ultimately leveraging ECS, the Amazon EC2 Container Service, as the backend for Empire. But first, a little about myself. My name is Eric Holmes, I'm an infrastructure engineer at Remind, and I like building things for other developers. I work mostly with Go these days, but my background is mostly Ruby, and I open source a lot of crazy ideas on my GitHub, which is ejholmes.

I work for a company called Remind, and we build a product for teachers that makes it easy for them to communicate with students and parents. We have three major features. One is real-time chat. Another is announcements, which let teachers send a fan-out message to their whole class, like if there's homework due tomorrow. We also support attaching files to both chats and announcements. Right now we have about 25 million users. During the back-to-school period, which is basically next week, when all the kids come back to school, we add about 350,000 new users per day and we send about 5 million messages per day. We do all of this with about 50 employees, about 30 of whom are engineers, so we're growing really fast.

A little background on the historical architecture of our application. The product started back in 2011 with a feature set pretty similar to what we have today (this was after we had discovered what the product really should be), but the technical architecture was vastly different, because we didn't have a lot of users and we didn't have a lot of scale. Like a lot of startups, we started as a monolithic Ruby on Rails application. This single application served the API for our iOS and Android clients, served our web dashboard, handled all the message delivery that we did through Sidekiq queues, and also handled class widgets, a little feature we have for embedding a message stream on a user's website.

This architecture worked out really well initially: we didn't have a lot of users, and iterating on a single monolithic application was really easy and really fast. But we started to grow really quickly just by word of mouth, and the back-to-school periods were getting a little scary because we were adding so many users and getting so many requests per day. I wanted to show a graph of what our traffic actually looks like during back to school, but I was told I can't share the actual graph. It looks roughly like this, though: a very steep spike (it's actually steeper, I should rotate this a little) that goes really high and then drops off as the school year goes on. Back to school is just this huge spike in traffic, new users, and message delivery.

During this back-to-school period, with message delivery still living inside our single Rails codebase, we started running into a lot of problems. One of them was that our Sidekiq queues were constantly backed up, and it was difficult to scale that out horizontally because it was so tightly coupled to the rest of the codebase. We were also storing a record of every delivery we made to SMS devices or applications, and that was a lot of deliveries, all inside the single Postgres database that stored everything else about users and groups. We realized we were either going to have to shard that database or split message delivery out into a separate service. We opted for splitting it out into a separate service, which I think was the right decision, and we've continued down that path over the past few years as we've grown; we now have approximately 40 to 50 production services.

Up until very recently we were running all of these services on Heroku. We started using them back in 2011 because they allowed us to build the product very quickly without having to focus on building the infrastructure to support it, like how we would deploy things.

Heroku was a really good decision for us initially, I think, because we didn't have to hire a dedicated ops team and could just focus on building the product. But it became clear to us pretty recently that Heroku has a lot of constraints that start to break down for microservices or a service-oriented architecture. One of those is that all of our internal services needed to be exposed publicly, because there's no equivalent to a VPC on Heroku's platform. Another is that our databases needed to be open to internet traffic, which is generally not a good idea. We also didn't have a lot of visibility into the CPU, memory, and network performance of the hosts, so if a host was dropping packets we wouldn't even know, and we didn't have any control over the routing layer, so if we wanted to do custom request IDs or move authentication into a middleware layer, that was very difficult to do.

So about seven months ago we took a step back and asked ourselves: what do we really want in an internal platform as a service? One consideration was that we were already using a lot of AWS services, like DynamoDB and Redshift, and kind of by proxy via Heroku anyway. Another is that Heroku does a really good job of making operations very simple: things like deploying and configuring are just a couple of CLI commands away, and that operational simplicity was something we really wanted to maintain. Thanks to Heroku, we were also sold on the constraints of the 12-factor application model. If you're not familiar with 12-factor apps, it's a philosophy, a set of best practices, for how you should run stateless applications or services: things like configuring via environment variables, making processes disposable, and scaling out horizontally (there's a small sketch of the config idea just below).

We also wanted to maintain shared patterns for deployment: we wanted all of our services to be deployed exactly the same way, which is how we were doing it on Heroku. That's really good because it lowers the barrier to entry for getting something into production; you don't want a scenario where somebody has to make a change to a service but has no idea how it gets deployed. We wanted to keep a very fast build-and-release iteration cycle. And since we were previously on Heroku, we didn't want to have a dedicated ops team; we were operating under a no-ops philosophy, and we didn't want to end up in a world where you'd have to talk to a big ops team to get something new into production.

We were also getting really concerned about the size of our attack surface. Like I said before, all of these internal services had to be exposed publicly on the internet, and we wanted to shrink that surface area: only expose what actually needs to be public, and have the rest live inside a VPC so all of our internal services are only accessible within our internal VPC. Probably most importantly, whatever we built needed to be robust and resilient to failure. We're operating a fairly large-scale production system, and downtime is a very serious thing for us, so we needed something that wasn't going to cause downtime. And ideally we wanted to use containers as the unit of deployment; I'll talk about why in the next slide.
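To make the 12-factor config point concrete, here's a minimal sketch (the image name and variables are made up for illustration): the exact same image runs in every environment, and only the injected environment variables change.

```sh
# Same artifact, different config: nothing environment-specific is baked
# into the image, so staging and production differ only in their env vars.
docker run -e DATABASE_URL=postgres://staging-db/app \
           -e REDIS_URL=redis://staging-redis:6379 \
           example/api:abc123

docker run -e DATABASE_URL=postgres://prod-db/app \
           -e REDIS_URL=redis://prod-redis:6379 \
           example/api:abc123
```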
We were actually kind of technically already using containers on Heroku, since dynos are a kind of container technology, but we wanted to take the next step and use Docker if we could.

So why containers? The reasons basically boil down to speed and cost. Containers can be built really fast. There are some caveats to this, there are some bugs in Docker that make it kind of hard to do in a build system, but for the most part you can build a container and push it up in, you know, 30 seconds to a minute, which makes it very easy to make a change and then deploy it to production. They let us isolate dependencies in a portable, really easy to distribute package. Probably the best way to think about containers is that they're kind of like statically linked binaries: if you're familiar with Go, go build produces statically linked binaries and they're very portable, and containers are kind of just the next step. You can isolate dependencies, and they're portable across, you know, Windows and Linux systems if you're running Vagrant or something like that.
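The build-and-push loop I'm describing looks roughly like this (the registry and image names here are made up; in practice this runs in a build system):

```sh
# Build an image tagged with the current git SHA and push it to a registry.
TAG=$(git rev-parse --short HEAD)
docker build -t registry.example.com/acme/api:$TAG .
docker push registry.example.com/acme/api:$TAG
```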

They also let us build better development environments and get better dev/prod parity: the same images that we build in our build system can be run on a developer's laptop if they want to debug something, or even if they want to set up the entire service locally; they can just pull a bunch of Docker images and connect them together. They limit the number of moving parts, which basically comes down to immutable infrastructure, which I think is a really good idea: if we build a Docker image now and deploy it, and then take that same image six months from now and deploy it again, there really shouldn't be any change. And containers let us use resources more efficiently. Cost was an issue for us, and on average we've been running about 10 to 20 containers on a single host, which really lets us bring our costs in.

So we decided we wanted an internal PaaS, and it turns out we're probably not the first company to want something like this or to try to build something like this. One of the most well-known companies that exposed the benefits of a microservice architecture is Netflix. They have something called Asgard, and they use a lot of AWS services as well. They bake AMIs (I think they might do something a little differently now, but the last time I read about it, they baked AMIs) and they use those as the unit of deployment. It's kind of similar to building Docker containers, just at a bigger scale, and it's a lot slower, unfortunately; Docker containers let us build things a lot faster. SoundCloud is another company that built an internal platform as a service, which they call Bazooka. They're not using containers, but it's similar to what we wanted: something like Heroku that provides ease of operation. And pretty much every other company in our investors' portfolio is trying to do something similar to this.

Ideally we didn't want to build something if we didn't have to, so we spent a lot of time researching open source PaaSes, and a couple of them looked pretty promising at the time; two of those were Flynn and Deis. For a couple of reasons we decided that neither of these really fit our requirements. Probably the biggest reason was that we weren't aware of any companies at our scale that were using either of them in production, and pretty much every component each one used had some kind of "alpha" or "don't use this in production yet" statement on it, which is a little scary. We wanted to build on top of stable components.

So Empire was born. At the time, about six months ago, it was just an internal, unnamed project. Initially we started off building on top of CoreOS, using fleet as the scheduler. Fleet basically ties etcd, a key-value store for clusters, together with systemd; it's basically a distributed systemd, so you can schedule jobs onto a cluster of machines. We had a custom routing layer that was configured via confd and etcd: whenever a container started up, something called Registrator would register a key for it in etcd, and that would update the nginx configuration so it knew which Docker container to route requests for that service to. And we were still using Docker.
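The registration piece worked roughly like this (the image tag, etcd address, and key prefix are placeholders; check the Registrator docs for the exact flags):

```sh
# Run Registrator on each host; it watches the local Docker daemon and
# registers containers in etcd as they start and stop. confd then watches
# etcd and rewrites the nginx config to route to the right containers.
docker run -d --name registrator \
  -v /var/run/docker.sock:/tmp/docker.sock \
  gliderlabs/registrator:latest \
  etcd://etcd.internal:2379/services
```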
At the time, we started out by implementing a subset of the Heroku Platform API, since we were already really comfortable with Heroku's tools, and it turns out, you know, APIs are interfaces: you can swap out the implementation details, and we could just use the Heroku CLI to do most of what we wanted to do.

This all worked out really well initially, until we started testing failure modes. We discovered a lot of bugs in fleet that made us very wary of actually putting it in production, and it became really clear to us that we needed a more intelligent scheduler. Fleet was pretty dumb about how it scheduled tasks onto hosts, and it didn't manage desired state well: if a container went down, it wouldn't actually bring a new one up. That might have changed by now; actually, fleet is deprecated now in favor of something else. And in our experience, using etcd for managing cluster state across all the machines in our cluster was fragile. I think a lot of the things in this space are kind of fragile, and it's just not something you really want to run yourself: it's a pain to run, it's a pain to monitor, and it's a very hard problem to solve. This was back when etcd hadn't reached the stable release it's at now, but I don't think that makes it much easier to run. So we weren't really feeling like we were getting the resilience and stability we were trying to achieve.

Ultimately we didn't want to run our own clustering software; if we could use a service that would cluster all of our machines together for us, we wanted to do that. Overall we were just feeling pretty frustrated trying to piece together all of these new and unstable technologies.

Around the same time, conveniently, ECS became generally available, and we started looking into it to serve as the scheduling backend for the platform. So what is ECS? I'll kind of breeze through this because Chad already talked about it, but in summary, the best way I describe ECS to people is that it's really the easiest way to run Docker containers on a cluster of machines. More specifically, it pools hosts together as a single resource and then provides a set of APIs to place tasks onto a container instance that has available resources. One of the great things is that it supports the concept of services for scaling tasks out horizontally, and, actually one of the greatest features I think, it integrates with ELB for connection draining, zero-downtime deployments, and health checks that will remove unhealthy containers. This was really important for us, because in the previous implementation we hadn't actually solved zero-downtime deploys at all, and that's something we really needed before putting anything like this into production.

ECS is made up of a few different components. One of them is the container instance, which is essentially just an EC2 instance running the ECS agent and Docker. The ECS agent actually runs as a Docker container (you can run it outside of Docker if you want to, but it's usually easiest to run it as a container), and it's the piece that accepts jobs from the ECS service's scheduler and runs them using the local Docker daemon.
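If you're curious what that looks like on a host, launching the agent is roughly this (the ECS-optimized AMI does this for you; the cluster name is just an example, and I've left out the optional log and data volumes):

```sh
# Run the ECS agent as a container on an EC2 instance and point it at a
# cluster. It needs the Docker socket so it can start tasks locally.
docker run -d --name ecs-agent \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -e ECS_CLUSTER=default \
  amazon/amazon-ecs-agent:latest
```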
The agent then manages the lifecycle of the task: if the container dies, it notifies ECS, and ECS will reschedule it onto a new host. The ECS scheduler is the piece that intelligently places containers onto hosts with available CPU and memory.

ECS then defines a couple of different resources, and APIs for interacting with them. One of those is task definitions, which you can just think of as templates for running containers: you specify the image, the environment variables you want, and so on. Then you can use a service to take that task definition, link it together with a desired number of instances, and the service will manage those and make sure the desired state is kept the entire time. Tasks are just the unit of work in ECS's terminology, and one of the great things about ECS is that it also provides a raw API for running tasks, which Chad demonstrated for batch processing, and I think Coursera will too. Clusters in ECS just represent a logical pool of instances that you can schedule tasks into. (There's a small AWS CLI sketch of task definitions and services below.)

We tried ECS for quite a while; just to try it out, we prototyped a backend inside of Empire, and we felt like it would be a really good replacement for fleet and etcd. The reason is that it provided a set of solid primitives that could serve as our scheduling backend. ECS has a GUI, but like all AWS services it also has a really good API behind it. Probably most important for us, it's a managed service: all the clustering that we previously had to run ourselves, we just don't have to do anymore, and it's not something we have to monitor either, so we're much less concerned about having to run that piece of software.

And the failure modes that we tested behaved like we would expect them to for something like this. We could bring up a cluster of hosts, kill them all, let them come back up, and the entire service would be healthy again in a matter of a couple of minutes, as long as it takes for the EC2 instances to come back up. Things like killing a container: if you log into a machine and kill the Docker container, ECS will spin up a new healthy one. With the ELB integration, if you have health checks on a container and the container becomes unhealthy, ECS will bring up a new container, register it with the ELB, and kill the old one. The ELB integration also meant that we could get rid of our custom routing stack entirely and just do service discovery via standard DNS pointed at the ELBs.
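To make the task definition and service concepts concrete, here's a minimal sketch using the AWS CLI (the family, image, and counts are just examples; the ELB attachment and IAM role are left out):

```sh
# A task definition is a template for running a container: image, CPU,
# memory, ports, environment, and so on.
aws ecs register-task-definition \
  --family acme-web \
  --container-definitions '[{
      "name": "web",
      "image": "nginx:1.9",
      "cpu": 256,
      "memory": 128,
      "portMappings": [{"containerPort": 80, "hostPort": 8080}]
  }]'

# A service ties a task definition to a desired count; ECS keeps that many
# copies running and replaces any that die.
aws ecs create-service \
  --cluster default \
  --service-name acme-web \
  --task-definition acme-web \
  --desired-count 2
```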

Every service we bring up inside our Empire environment that has a web process attached to it gets an ELB, and then we just give it a DNS record inside our internal hosted zone. So if we have an API service, it's available at http://api. This is really good because if you change environments (we have a production environment and a staging environment) the service is always available at the exact same location, so service discovery just works. It also compares well to other service discovery methods: one of the more popular ones talked about recently is SRV records, but those require you to change a lot about how you talk to services. And one of the benefits of putting a load balancer in front of everything, especially an ELB, is that we get CloudWatch metrics.

Empire itself is a single binary, so it's really easy to run; we run Empire inside a Docker container, and its only external dependencies are Postgres and the Amazon APIs. It provides an API and a CLI to create apps, deploy Docker images, update configuration, and run one-off tasks, and it supports the concept of Procfiles, which, if anybody's used Heroku, let you specify a web process, a worker process, or any other process that makes up a single service (there's a small example Procfile below).

The question I always get asked is: is it ready for production? We've actually been running about 15 or so production services for the past month and a half inside ECS, managed via Empire, and so far it's been amazing. ECS has been really stable in our experience, so we've been really happy with it. One of the important things to remember is that Empire itself doesn't watch anything or try to alter state; it takes a very hands-off approach once you've deployed an application, and from there it's all AWS services: ELBs, ECS, EC2, these existing, stable technologies. One of the great things is that we've seen huge performance improvements moving off of Heroku and directly onto EC2.

All right, I'm going to go ahead and jump into the demo now. I'll start with a simple example first, then I'll show you how you can use Empire's application model to deploy more advanced architectures. Before you can use Empire, you need to provision an instance. We have a really simple CloudFormation stack that you can use to build an Empire environment; it creates all the necessary ECS and AWS resources and returns an HTTP endpoint for the Empire API URL. All you have to do is go to the Empire repo and click "Launch Stack", which brings up the CloudFormation UI. You click Next, it presents you with some options (you can leave most of them blank), and when you click Next it asks a couple more things, but once you've done that it takes about 10 minutes or so to create an Empire environment. For the sake of the demo I've already done this beforehand, because it takes a little while, so I can jump ahead to the next section.
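Here's roughly what a Procfile looks like, to make that concrete (the process names and commands are just an example app's, not one of ours):

```sh
# A minimal example Procfile: each line names a process type and the command
# that runs it. Empire, like Heroku, reads this to know which processes make
# up the app.
cat > Procfile <<'EOF'
web: bundle exec puma -p $PORT
worker: bundle exec sidekiq
EOF
```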
Once we have Empire running, one of the great things about it is that it makes it really easy to deploy Docker images: you can deploy any Docker image from the registry. For example, I can really easily deploy the nginx image to an Empire environment. Before we actually do anything, we first have to set the Empire API URL so the CLI knows where to connect to the Empire API (I've already gone ahead and done that), and then we need to log in; with the example stack the username is "fake" and there's no password, so we can just log in. The commands look roughly like the sketch below.

The next thing we can do is deploy, for example, just the standard nginx image, so we can easily deploy the official nginx image to our Empire environment. That pulled the nginx image from the registry, created the ECS service for the web process, and attached an internal ELB to that ECS service. By default, inside Empire all services are considered internal and are only accessible within the VPC. It's a lot more interesting if we can actually interact with a service from the internet, and since this one is private we can't do that right now. Empire supports making a service public by adding a domain to the app. I have a sample app that we can deploy called inspector, so let's go ahead and deploy that publicly: the first thing we'll need to do is create it, and then we'll add a domain to it.
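This part of the demo looks roughly like the following (the API URL, image, and domain are made up, and the exact CLI syntax may differ slightly; `emp help` has the real flags):

```sh
# Point the emp CLI at our Empire API and log in (user "fake", no password
# on the example stack).
export EMPIRE_API_URL=http://empire-123456789.us-east-1.elb.amazonaws.com
emp login

# Deploy the official nginx image; it gets an internal ELB by default.
emp deploy nginx

# Create the inspector app, add a domain to make it public, and deploy it.
emp create inspector
emp domain-add -a inspector inspector.example.com
emp deploy example/inspector:latest
```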

Then we can just deploy it, and you can even specify tags just like you can when you docker pull. There we go: after just a couple of minutes the ELB will be available and we can curl it. Let me show you the URL for inspector first; this is the ELB for our inspector application (sorry, I already deployed it), and then we can go ahead and curl it. Cool. Inspector just takes the HTTP request and returns it as the response, so we can see this is actually working, and we're deploying this application inside our own Empire environment, backed by ECS and all these AWS services.

We can also show what's running. Right now we have one web process running, and we can scale this just like on Heroku: we can scale individual processes, so let's scale web up to, say, three processes. You can even specify CPU and memory constraints when you scale. This is a Go application and it doesn't need a lot of RAM, so we can scale it to maybe 10 megabytes of RAM and 256 CPU shares, which is 256 out of 1024, just a relative weight. Then, to apply this, we'll have to trigger a deploy again, and after a minute or two we'll see the new processes running. This takes a little while because it's doing a rolling restart of all the processes, so we still see one web process, but after a couple of minutes there will be three processes here.

We can also set environment variables and list the environment, and every time we deploy or change configuration we'll see a new release; we have four releases now. One of the greatest things about Empire is that it allows for one-off processes: if you have migrations that you need to run, or you want a Rails console or a Python shell, you can do that with emp run. For example, we can just run bash inside a container, and this will actually connect our terminal to that running instance. A rough sketch of these commands is below.

We can also use Empire to deploy more advanced architectures; this nginx example is kind of trivial. At Remind we only expose a single Empire application publicly, and it serves as our API gateway. That application is nginx and OpenResty, and it routes to all of our internal services and applications, like our public API, our web client, and our event service. I have a repo that shows a stripped-down version of this architecture; it's available at ejholmes/empire-demo, and I've already cloned it, so we're inside that repo right now. We can bring this up really easily with docker-compose, which just brings it up locally. Once we have it up, we can open it in a browser. It's just a really simple to-do application; we can create to-dos: do the laundry, take out the trash.
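Roughly what the scale, config, and one-off commands from a moment ago look like (the constraint syntax here is approximate, so check `emp help scale` for the real form):

```sh
emp ps -a inspector                     # what's running right now
emp scale web=3:256:10mb -a inspector   # 3 processes, 256 CPU shares, 10 MB RAM
emp set DEBUG=1 -a inspector            # set an environment variable
emp env -a inspector                    # list the environment
emp releases -a inspector               # every deploy or config change is a release
emp run bash -a inspector               # attach a terminal to a one-off container
```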

When we actually deploy this to production, the architecture we're going for looks something like this: a public, internet-facing ELB backed by our router, the nginx router, which reverse proxies to our web application, a Ruby application, and the Ruby application consumes an internal Go API. That internal API isn't exposed anywhere outside of the VPC. We can deploy this to Empire really easily with just a couple of commands, and I'll show you what those commands are. I have a little script that shows what we need to do: the first thing is to create the router application and add a domain to it (adding the domain makes it public again), and then for each of these applications we build a Docker image, push it to the Docker registry, and deploy that image to our Empire environment. The script looks roughly like the sketch below. Let me go ahead and run this; I'll comment these steps out because I already did them beforehand.

So we can see it's starting by building the nginx image, which is our router application, and right now it's pushing that to the Docker registry. All right, that one was pushed, and it also deployed, so we can see it created a new release for the router. Now it looks like it's building our API... all right, it deployed our API and created a new release. And now we're building and pushing our Ruby web application. OK, cool, we deployed all three images to our Empire environment, so now we can see this is actually running inside of Amazon. We can do the same thing: take out the trash, do the laundry. And we can look at the processes that are running: if we look at the router app, we have one web process running for the router, one for web, and one for the API.

We can also really easily make changes to our code, rebuild the Docker image, and deploy it to Empire. So if I change web, let's change one of these views, and I'll just change it to say "hello world", we rebuild the image, push the new image to the Docker registry (this takes just a minute or two), and then deploy the image. Cool, now it's deployed. Let's take a look at what we're running.
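The deploy script does roughly the following (the registry and domain names are illustrative; the real script is in the ejholmes/empire-demo repo):

```sh
# Create the public entry point: the router app with a domain attached.
emp create router
emp domain-add -a router todos.example.com

# For each component, build the image, push it, and release it on Empire.
# Here the app is inferred from the image name (an assumption; the actual
# script may pass the app explicitly).
for app in router api web; do
  docker build -t registry.example.com/demo/$app ./$app
  docker push registry.example.com/demo/$app
  emp deploy registry.example.com/demo/$app
done
```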

It's just going to go ahead and kill the old one... actually, it looks like this one didn't start up yet, so I might need to scale out this cluster. This will just take a minute or two for the new release to get deployed. In the meantime, I can show you one of the cool things about emp run: you can do something like curl our internal API. One of the annoyances of running internal services is that they're hard to debug, and having something like emp run makes it really easy because it basically gets us into the VPC. So from the router application we can just curl our internal API, and we can see that we don't have any to-dos right now. Let's see if our web processes have started up... for web, not yet... OK, there we go. Now we're running the new release of web, so we can open the router application again, open the router ELB, and we can see that we're running the new version.

So that concludes the demo. You can reach out to me; my GitHub again is ejholmes, and I'm on Twitter as well. The Empire repo is remind101/empire and the demo is ejholmes/empire-demo. I talked about 12-factor apps a little bit, so if you want to get more familiar with 12-factor, it's well worth reading up on.