Tiny Functions for Codecs, Compilation, and (maybe) soon Everything

>> We’re happy to have Keith from Stanford visiting us. Keith was a student at MIT before this. He’s done a bunch of work on congestion control, which won the SIGCOMM dissertation award, on network emulation, and, recently, on video processing and video streaming, which I believe will be the focus of his talk today. Interestingly, Keith was a journalist at the Wall Street Journal before this. Back then, he wrote articles on technology and healthcare. Keith usually gives very fun talks, so if I were you, I would pay very close attention to what he has to say. Welcome. >> Thank you very much. Thank you all for coming. It’s a lot of fun to be here today. I think I was here five years ago, the last time I gave a talk here, so you all look the same. I want to talk about a style of building systems that uses the best parts of functional programming. Not the hard parts, not monads and category theory and all the hard parts the smart people talk about, just a very dumb way of using a functional design to build systems in a different way, maybe a more modular way in some sense, or a less modular way in another sense, and I’m going to tell you about four ways that we’ve done this. The basic message here is that the benefits of thinking in this functional way are worth refactoring whatever we’re most scared of in computer science. So, in my world, it’s like the video codec, or the audio codec: we treat this as a black box. There’s someone smart in Europe that produces a video codec, and that’s something we take as a black box. We’re not going to peer inside that. Or TCP congestion control. Many of you have probably heard, like, “Oh, don’t ever try to reimplement TCP. That’s just a fool’s errand.
Don’t do that. You should be scared of that.” Or the compiler: “Oh, you think... the compiler? That’s another black box.” Or machine learning, which is a black box. But the idea here is that these things we’re scared of, these mega-modules, if we break them up into small functions where we can understand the relationship between them, there are considerable gains on traditional metrics like performance, and also in understandability and debuggability. Just the mere ability to save and restore the state of something is very powerful. It gives you a real power to be able to save and restore the state of a codec, or of congestion control, and say, “Wait a sec, I don’t like what just happened. Let’s rewind. Let’s try it again from the same place.” That’s a real power that you don’t have with a mega-module, and if you can get it, there’s a lot of cool stuff you can do. So, I’m going to tell you about four systems that follow this approach to designing systems, with four different benefits. The first is a system we call ExCamera that uses this functional idea to split up a video codec into small pieces so you’re able to get fine-grained parallelism for video encoding. Massively parallel. The second one is a system that splits up a JPEG compressor, actually the entropy coder in a JPEG file, and the benefit here is to be able to match the needs of a distributed file system: to shard JPEG compression across the individual blocks of a file system, not files, just blocks. The third one is a functional design applied to real-time video coding, and the goal here is to be able to match the output of the video coder to the varying capacity of the network. Why would you have a functional program for that? The idea here is to explore an execution path of the codec without committing to it. Finally, we’re trying to build a general intermediate representation to encapsulate all these ideas and make it easy for programmers to design programs that can then be
executed with 10,000-way parallelism on functions-as-a-service infrastructure. Things like Azure Functions, or AWS Lambda, or Google Cloud Functions, the exciting new services where you can rent 10,000 cores for five seconds in an economical way. How do we program that system? So, there are four different benefits from this functional abstraction, and I’ll just go into it. If anyone has questions, and you’re never shy here at Microsoft, please interrupt anytime. So, the first one was at NSDI last year, and it’s about parallel video processing, and really the goal is low-latency interactive video processing. Thinking about what made the personal computer such a powerful thing, it’s really this taking of things that were not interactive, like editing a document, and making them interactive. I remember the first time I used MacWrite. It seemed awful; I was more of a MacPaint type person. But this idea that you have interactive word processing really ushered in a revolution, and now we have interactive, collaborative word processing. So, if you think about it: I have Google Docs on my screen, you have it on your screen. I type an “x”, I see the effect on my screen immediately, it reflows the document, but you also see the effect immediately on your screen. That is very powerful, this collaborative interactive editing. But people aren’t really reading text anymore. What is the most popular type of media in the world and on the internet? It’s video, yes. We spend most of our time watching video. Most of the bits, definitely over the internet, are video. It would be nice to have a Google Docs for video, where you can make it instant.

Has anyone ever made a whole movie? Maybe you have kids, you may have made a home movie. And how long was the movie? >> Three minutes. >> Three minutes. Did you edit the movie in any way? >> No. >> No, okay. Has anyone edited a home movie, like trying to make a change? Okay, you, sir, at the back, how long was your movie? >> Three minutes. >> Was it like a child’s birthday party or something like that? >> No, it’s actually like a developer download video. >> Okay, a developer download video. Three minutes, okay. Did you edit? Okay, so after you made it, if you wanted to make the smallest change you could make, how long would it take before someone else could watch the edited video, with the rendering and the uploading and all that? >> Five to 10 minutes. >> Okay. So, more than real time. >> Yeah. >> Okay, so making a change to a three-minute video will take you five minutes. Imagine if every time you typed an “x” in Google Docs it took five minutes to see the answer. So, that’s where we are right now, and it’s like we’re back to the pre-Microsoft era, like the 1970s of word processing, is where we are in video. If you had the interactive ability to play with video, there’s all kinds of stuff you could do. You could say, “Let’s try... let’s apply some filter. Let’s see what happens,” and boom, you can see it right away. It’s very direct manipulation. Or instead of going to the next chapter in a movie, you might say, “I want to see the next time this actor shows up. Find this actor immediately. Find it. Let me see where they are.” Or if there’s some actor you don’t like, you just remove them from the movie automatically. I mean, it would be great to be able to do interactive operations on the world’s most popular media, which is video. But we can’t do this. Okay, so why not?
The problem is that the processing of video is very computationally expensive. To run a pipeline like this on a video can take hours and hours. So, for a 30-minute video, if it’s a 4K video, it can take many hours to process and to render. So, how would we fix this? The standard to... oh, yes. Ganesh. >> Question. >> Sure. >> The customers here, are they like professionals in these movie companies, or is it just the regular average Joe? >> The question is whether the target customers here are professionals and the movie companies, or just the regular average Joe. I think, truthfully, like most things in technology, it might start out at the high end and then percolate down. I mean, in the ’70s, if you wanted an interactive word processor you would go to Xerox PARC, and there were four of them, and then by the ’80s, thanks in part to Microsoft, everyone had it. So, I think it probably would percolate. Yes, I would think so. If we wanted to do this today, the standard toolbox would be like, “Let’s just throw a whole bunch of silicon at it. Let’s just say 10,000 computers.” So, can you achieve interactive collaborative editing of video by using massive parallelism? The problem is it’s not so clear how to do it. Problem number one is that you would need thousands of threads running in parallel, and they have to start up instantly. So, you could either have a thousand threads just sitting there waiting for you to do something, which is very expensive, or you need some way to instantly start up thousands of threads.
That’s problem one. Problem two, however, is that even if you have access to that much parallel compute infrastructure, it’s not clear, in the problem domain of video, how to exploit it. Because with today’s video codecs, the finer-grained the parallelism, the worse the compression efficiency. So, this is what our work is directed at: solving those two problems. So, we have, number one, a framework to run 5,000-way parallel jobs with inter-thread communication (so, not embarrassingly parallel; there’s some inter-thread communication) on commercial cloud function services, services like AWS Lambda, or IBM OpenWhisk, or Azure Functions. Then, number two, we use this purely functional video codec to allow us to get fine-grained parallelism, finer-grained than the interval between keyframes, the traditional parallelization interval of video processing. So, this whole system, we call it ExCamera; it was at NSDI last year. So, let’s just dive into it. Part one is that these cloud function services are really intended for microservices or event handlers, or asynchronous event handling: a blog post comes in, or, I guess, I don’t know, a tweet comes in, and then you fire off some event handler and then something happens. But what we found is that it is possible to spin up tons and tons of them at the same time, and they can run arbitrary code. So, you can have thousands of threads, you can run an arbitrary executable, they start up in less than a second, and the billing (this is actually a very important point) is also in subsecond intervals. So, it’s not like a VM where you’re booting the operating system and then paying for a minimum; I think when we did this work, Amazon was at an hour minimum. You’re paying for a tenth of a second. So, if you have an hour of CPU time to use, you could fire up 3,600 threads for one second each. On Amazon, you pay $0.09 for that. So, to go from an hour to one second and pay $0.09 is a deal I think a lot of people would take. So, we
built a library to efficiently manage computations on AWS Lambda that have this inter-thread communication. The main trick is: how do you start up thousands of threads in seconds, because Amazon’s HTTP endpoints are actually quite slow? The answer is, if you start up thousands of TCP connections, it works great. So, that’s the trick. End of talk there. So, we have that framework, the mu framework.
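The pricing arithmetic behind that claim can be written down directly. A small sketch; the per-unit rate here is implied by the talk’s $0.09 figure, not an official published price, and real cloud-function pricing varies by memory size and has changed over time:

```python
# Back-of-the-envelope cost model for bursting one hour of CPU time
# across many short-lived cloud functions. The per-100 ms rate below is
# implied by the talk's numbers ($0.09 for 3600 one-second invocations);
# it is an assumption for illustration, not an official AWS price.

PRICE_PER_100MS = 0.09 / (3600 * 10)  # dollars per 100 ms billing unit

def burst_cost(n_workers: int, seconds_each: float) -> float:
    """Cost of running n_workers concurrent functions for seconds_each."""
    billing_units = n_workers * seconds_each * 10  # number of 100 ms units
    return billing_units * PRICE_PER_100MS

# One hour of CPU time, delivered as 3600 workers for one second each:
print(round(burst_cost(3600, 1.0), 2))  # 0.09
```

The point of subsecond billing is visible in the function: the cost depends only on total core-seconds, not on how many workers you split them across.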

So, that’s part one. Part two is: okay, now we have the threads. We can start 5,000 threads in a few seconds, but how do we use that parallelism? The challenge here is that with existing encoders, the finer the grain of parallelism, the worse the compression efficiency. Let’s talk about why that is. So, what is a video codec? It’s two things. There’s an encoder and a decoder, and the contract here is that pictures come in the left side, a stream of pictures; the encoder turns them into a stream of bits; and then the decoder, if you give it that stream of bits, turns it back into a fuzzy version of the original pictures. This is the traditional model. So how does this work? The reason that video compression is so efficient is that it exploits the similarity between adjacent pictures, adjacent frames of video. So if I were taking a video of you right now, in the first picture that I sent, the first frame, I’d have to send everything, and it’s like a JPEG snapshot of looking at you folks. But in the subsequent snapshots, I, the encoder, can tell the decoder: hey, save a copy of the last decoded frame, and now I’m going to send you a new frame; go back in your memory and look at that decoded frame you have, and I just want you to say, basically, this person is still making the same facial expression as last time, just copy this block of the image and put it there. I don’t have to recode what they look like. The background, I don’t have to recode it. So, there’s a huge savings in doing that, and in 4K video at a typical bitrate, the keyframe, the one that has to stand by itself, is roughly a megabyte, but the subsequent frames are about 40 times smaller. So, it’s a huge savings from exploiting this temporal redundancy. So, the challenge here is that that traditional interface to video encoding is based on this stream of pictures. So there’s some encode routine that you give a stream of pictures to, and the result is one keyframe, that’s the megabyte, and then a bunch of
interframes, and then on the opposite side, the decoder: you give it a keyframe and a bunch of interframes and it produces a bunch of pictures. So, if you want to parallelize, let’s say we have 200 frames to encode, we could get a keyframe and 199 interframes. If we want to do it in parallel by dividing up the video frame-wise, the first thread takes 10 frames and produces one keyframe and nine interframes, the next thread takes ten frames and produces a keyframe and nine interframes, the next thread does the same, et cetera. Every one of these threads produces a new keyframe that reinitializes the state of the decoder. And every new keyframe is like a megabyte penalty. So the finer-grained the parallelism, the more keyframes you have, and therefore the worse the compression efficiency. So what is the problem here? Why does each encoder want to start with a keyframe? >> Because it wants to be isolated from the state of the adjacent encoder. >> Yes, exactly. The state in the system is implicit. So there’s no way to tell the encoder: you know what? Start encoding from mid-stream, and here is the state of the decoder that you can assume it’s in. There’s no language to describe the state of the decoder. It’s just implicit. And what about the functional programming people? They hate implicit state. If you have state, it should be explicit. So we did that.
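The keyframe penalty just described can be put in rough numbers. A sketch using the talk’s ballpark figures for 4K (a 1 MB keyframe, interframes 40 times smaller); the constants are illustrative assumptions, since real frame sizes depend heavily on content and bitrate:

```python
# Rough model of why finer-grained parallelism hurts compression:
# each independently encoded chunk must begin with its own keyframe.
# Sizes below are the talk's ballpark 4K figures, not measurements.

KEYFRAME_MB = 1.0
INTERFRAME_MB = KEYFRAME_MB / 40  # interframes are ~40x smaller

def encoded_size_mb(total_frames: int, chunk_frames: int) -> float:
    """Total size when the video is split into independent chunks,
    each chunk starting with its own keyframe."""
    n_chunks = -(-total_frames // chunk_frames)  # ceiling division
    n_interframes = total_frames - n_chunks
    return n_chunks * KEYFRAME_MB + n_interframes * INTERFRAME_MB

# 200 frames as one chunk vs. twenty independent 10-frame chunks:
print(round(encoded_size_mb(200, 200), 3))  # 5.975 (1 keyframe + 199 interframes)
print(round(encoded_size_mb(200, 10), 3))   # 24.5  (20 keyframes + 180 interframes)
```

Splitting the talk’s 200-frame example into 10-frame chunks roughly quadruples the output size, which is exactly the penalty the rebase operation is designed to remove.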
We formulated this internal information, the state of the decoder, and we made the state explicit. So, we have this model of the video decoder as an automaton. It starts out in a well-known state; it consumes the keyframe, this is the compressed coded keyframe, and in doing so it undergoes a state transition to a new state of the decoder. And that state transition also spits out a picture for display on the screen. Then it consumes an interframe, one of these dependent frames. It also goes through a state transition, and it also spits out a picture for display on the screen. This is our model of the video decoder, and we have an explicit model of what the state means. The state is a bunch of pictures from the previous frames that have to be saved for reuse, and a bunch of entropy tables about which bits are more probable in different situations. So, in a state transition, the frame takes the decoder from a source state to a target state, and it changes these pictures, changes the probability tables, and also spits out some output picture. This is our model of a video decoder. Yes, sir? >> In the 4K scenario you described, what’s the binary size of one of these things? >> About 13 megabytes. >> Thirteen megabytes. >> Can I ask you a question? >> Please. >> When talking about state, are there any architectural dependencies in the state? On the underlying hardware? Or is it just [inaudible] abstraction?
>> The short answer is no, because the vast bulk of it is just pictures, and pictures are going to be stored in some YUV raster or something like that. So, probably not. It’s an 8-bit raster, so even the endianness is not really going to matter. You can imagine someone has little-endian or something like that, but I don’t think there’s a big difference. All right, so what we built was a decoder that is a fully standard decoder for this Google VP8 format and this WebM system, completely standard, passes the conformance tests, but it is written in this purely functional, explicit state-passing style. So, we have a decode function that’s a pure function with no side effects. It takes in a state object and a compressed frame, and what it outputs is the new state object
and a picture for display on the screen. So, if you wanted to decode a sequence of frames, the caller would have to save this output state object and then call decode the next time with that value and the next frame, and keep feeding it back in, having to thread it through explicitly. I think the smart people call these monads, I don’t know what they call it. It doesn’t matter; you just call this function a bunch of times, not a big deal. So then, we built the encoder that can start encoding from mid-stream. So the encoder, instead of having the state implicit inside it, evolving as a black box, takes a state object, that’s the state of the decoder, and an image to compress, and produces a frame, and that frame is meant to be then applied to the decoder that’s in the corresponding state. And finally, we have a transformation operator, a rebase. It takes an interframe that was intended for one state of the decoder and it adapts it to be applicable to a different state of the decoder. So, this is kind of like... I see a Git sticker on your laptop. This is kind of like what you do in Git. If I have a repository in a particular state and I ask a thousand software developers to each implement one feature separately, then I’m going to end up with a highly branched series of repositories where they all have a commit that depends on the same state. So what do I need to do next to integrate it back together? I rebase them. I cherry-pick each commit and I make it applicable to a repository that’s in a state different from the one it was originally designed for. And the idea is, if these commits are not interfering with each other, then that rebase operation is pretty cheap. So, we did the equivalent operation for video, where we take a compressed frame that’s intended for one state of the decoder and we adapt it to be applicable to a slightly different state, and that’s a much cheaper operation than encoding afresh. All right.
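The decode, encode, and rebase operations described above can be sketched as pure functions. This is a minimal illustrative sketch in Python, not ExCamera’s actual C++ API: the type names and toy string payloads are stand-ins, and "decoding" here just accumulates payloads so that the explicit threading of state through the calls is visible.

```python
# Sketch of the explicit-state codec interface from the talk.
# State and Frame are stand-ins: the real decoder state holds
# reference pictures and probability tables, not a tuple of strings.

from dataclasses import dataclass

@dataclass(frozen=True)
class State:
    history: tuple  # stand-in for reference pictures + entropy tables

@dataclass(frozen=True)
class Frame:
    payload: str

def decode(state: State, frame: Frame):
    """Pure function: (state, compressed frame) -> (new state, picture)."""
    new_state = State(state.history + (frame.payload,))
    return new_state, f"picture({frame.payload})"

def encode(state: State, image: str) -> Frame:
    """Produce a frame applicable to a decoder that is in `state`."""
    return Frame(payload=image)

def rebase(frame: Frame, new_state: State) -> Frame:
    """Adapt a frame intended for one decoder state so it applies to
    new_state (much cheaper than encoding afresh in the real system)."""
    return Frame(payload=frame.payload)

# The caller threads the state through explicitly, frame by frame:
state = State(history=())
pictures = []
for f in [Frame("key0"), Frame("inter1"), Frame("inter2")]:
    state, pic = decode(state, f)
    pictures.append(pic)
print(pictures[-1])  # picture(inter2)
```

Because decode has no hidden state, the caller can save any intermediate State, resume from it, or hand it to another thread, which is exactly what the parallel encoding pipeline relies on.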
So, the way that we then try to do parallel video encoding is: we do as much as possible in parallel, we encode tiny independent chunks, but then, in a serial operation, we rebase these chunks together that were encoded in parallel, to make them playable in linear order by the decoder. I’ll show you what that looks like. So, we fire up a whole bunch of threads. Each thread downloads a very short segment of totally uncompressed raw video. Just to give context here, the status quo systems like YouTube use chunks of 128 frames, which is about five seconds, a little over five seconds. And so, they insert a keyframe every 128 frames. That’s their amount of parallelism. So, we’re doing six frames, which is a quarter second. So it’s like 20x the parallelism of a system like YouTube. So, each one downloads six frames, and the first thing we do is we just encode it the old-fashioned way. So, we produce one keyframe and five interframes in each of these chunks. So next, the goal is to then stitch all these together and eliminate all those keyframes. So the first thing we do is we learn about the exiting state of each chunk. We figure out what it is and we send it across this thread boundary. In parallel, we send the exiting state of each chunk to the next thread. That’s step one. And then we re-encode this first keyframe, again in parallel; we re-encode it in terms of this exiting state. So, that was the encoding starting in mid-stream. Now, at first, we don’t have a playable video, because if you start from here, you end up here, and now you’re stuck. You see the problem?
There’s no way to get finished. So what we have to do is rebase, and this is the first time we end up with a serial operation. All this was in parallel; now we rebase this chunk to be applicable to this state, which is almost the same but not exactly. And then we rebase this chunk to be applicable to this state, and then this chunk to be applicable to this state; we do that in serial order. That’s the serial part, but hopefully fast. So there are a bunch of different ways to do this. You can have different numbers of frames in each chunk, we’re generally using six, and different numbers of chunks rebased together, we’re generally using 16, so we end up with a keyframe every four seconds. I’ll talk about the evaluation. So, the first thing we can look at is just the compression efficiency. Here on the y-axis we have the quality, and here on the x-axis we have the average bitrate. So as the bitrate goes up, the quality goes up. What we see on the blue line is the best you could possibly do; this is the production-grade Google single-threaded encoder. And then in orange here we have a multi-threaded encoder. There’s some penalty from the threading, because they split things up. So then, let’s look at the dumbest thing we could do, which is just encode those separate six-frame chunks and not rebase them together, so you just have a keyframe every quarter of a second. That’s much worse; as the bitrate goes up, the quality is much worse. It’s a difference of individual decibels of quality, which you can probably see. So now, the question is: what happens if you rebase these together in strings of four seconds, so you only keyframe every four seconds, and you end up here?
So then, it’s plus or minus three percent of the Google multi-threaded encoder. So let’s talk about the performance. That single-threaded encoder, the best you could possibly do, for a 15-minute 4K video takes seven and a half hours. The multi-threaded encoder takes two and a half hours, and you can imagine: what if you just had infinite parallelism available to you? So we took that video and uploaded it to YouTube. You know, they’ve got millions of computers, but they can’t exploit all the parallelism; they only have these chunks of 128 frames each. That took them 37 minutes. And when we run it on AWS Lambda, it takes about two and a half minutes and it
costs about five dollars. Yes? >> Why does YouTube use [inaudible]? >> Well, if they used smaller than that, they would pay a bigger and bigger compression penalty, and if they used bigger than that, it would just be slower. And it would be harder to seek in the video. >> [inaudible] Is there a reason why they’re optimizing price? >> I don’t think they’re optimizing for price necessarily, but the compression efficiency would be worse if they used a smaller granularity, and the time to encode would be worse if they used a larger granularity. I think those are the main engineering pressures. But I also doubt they spend five dollars on a 15-minute video. That, I don’t have that information from them. Yes, sir. >> With YouTube, you don’t know how long it’s been waiting in the queue or anything like that; you can’t tell. >> No, you can’t tell. But if you upload this to a commercial service like AWS, the live encoders are even worse. Yes? >> So when you’re doing rebasing, what you’re doing is taking advantage of the temporal locality of the previous frames to make the [inaudible] smaller? >> Yes, that’s right. And just to go back to the previous question: even with Lambda, we can also end up in a queue, just to be clear. They have a limited population of machines; we have to wait in line to get our machines too. So I think it is a somewhat fair comparison, except that we’re spending five bucks and Google’s budget is probably much, much less on a per-video basis. So yes, when we adapt this keyframe to be an interframe, it gets much, much smaller. >> [inaudible] or how are you supposed to separate items? >> Well, everything we’re doing is at the frame level. >> Okay, so you only have full I-frames, but you don’t have mixed frames, like with I-slices and P-slices? >> Well, I’m sorry.
So within an interframe, some of the macroblocks are intra-coded and some are inter-coded, and that’s up to the encoder. We also do the same thing. If you have very fast motion, or it’s changing a lot, it won’t be worth it for the encoder to try to find a similar macroblock in the previous frame, so it will just code that block afresh, and that’s an intra block, or an I-slice in H.264 terminology. >> Yes. >> [inaudible] like encoding a video to have constant bitrate, kind of like these spikes of [inaudible] being very high and things different being very small. But what some encoders do is encode parts of the frame, a stripe, as fully intra-coded, and the rest of it will be a B-frame, so you can have an average constant bitrate. >> We’re not doing that. Here, we’re just looking at the average bitrate over the whole video. >> Can you go back a couple, to where you just were? >> Sure. >> In this scenario, are each of these threads a separate AWS function instance? >> Yes. >> Or are they actually doing IPC with each other? >> Both, yes. They are separate Lambda workers and they’re doing IPC with each other. >> Is there, in this serial stage, an issue with reliability? Like, if one of them dies in the middle, are you able to rerun that one only and continue the serial path? >> Sorry, in the last system I talk about, if I get to it, we will have that. In this system, we don’t have any reliability story. But the truth is, we fire up all these Lambdas and then we just wait for them to get in contact with us and request work. The truth is that among the Lambdas that actually wake up and request work, later failure is extremely rare. Most of the time, when we see a failure, it’s that Amazon bills us and claims we’ve invoked a Lambda, but we never ever hear from it. That’s the failure mode we see. So, if you just wait for them to get in contact with you, that eliminates the bulk of the failures. Yes?
>> I don’t have much knowledge about the [inaudible] information of the inter-states and the interframes, but naively one might think that while you’re rebasing thread two onto thread one’s output, in parallel you could rebase thread four onto thread three, and so forth, and get some kind of log(n) thing. Can you talk about why that doesn’t get you anything? >> Well, we might just not be smart enough to do that, but we haven’t figured out a way to do it. The reason is that to rebase thread three onto thread two, we have to know the state exactly. But the rebasing operation changes all of these states, including this one. So, we only learn the exiting state of the chunk after we’ve done the rebase. >> If you had done that other merge, you’re saying there’d be twice as much work when you had to rebase the thread-three-plus-thread-four output? >> That’s right. Yeah. There may be some clever approach here, but we haven’t found it. All right. The lesson here is that I believe this service the cloud operators are now offering, where you can rent 5,000 cores for a few seconds, is very powerful and I think has unexploited value. We went up to see Amazon, we came here to see Microsoft, and we talked to them about this. We told them, “It’s amazing, it’s so wonderful,” and they’re, like, “No,” and we just asked, “Can you increase our quota?
We want even more parallel threads; 5,000 is not enough,” and they said, “Well, how many do you need? How many requests per second do you get?” We said, “Well, it doesn’t really work like that. Usually we’re doing nothing, but then when we have a job, we want as many as you’ll give us. Can we have a million?” They’re, like, “A million per second?” I said, “No. Most of the time, we’re not doing anything, but occasionally we’re going to want a million concurrently, but only for five seconds each.” They don’t want to say that they don’t have the capacity, but they also don’t say yes to that request. But I really believe that anytime you have a program where there’s a button that’s going to take an hour to do something, machine learning training, we can imagine a lot of tasks that take a lot of time, there should just be another button next to it that says: do it in one second and pay nine cents. It would be a very powerful abstraction, and you can imagine this for image and video filters across a video, or for 3D artists. At DreamWorks, when they want to decide where the light should be over Shrek, it takes them four hours to render a frame. The utilization guys say, “Oh, that’s okay, because we keep our cluster fully utilized. We have a processor on each frame of the video. We can render the whole movie in four hours.” But that doesn’t help you if you’re the artist and you want to see a particular frame; it still takes you four hours. So, the extent to which we can burst is very powerful: in compilation and software testing, running all your tests in one second, or interactive machine learning, or database queries, or visualization, or genomics research. There are all kinds of things that would be very powerful to do if you could burst general-purpose tasks to 10,000 cores for a few seconds. Yes, sir?
>> To be fair, this is a really hard problem for Amazon. I had the same frustration trying to run things on Azure Functions. They are building these things out of VMs and out of all the boring stuff we have already, and so the way they can give you x thousand threads in a second is by having lots of idle machines sitting around waiting for your workload. >> Yes. >> If you ask to be able to burst to a million at once, they’re going to have to have a lot more idle machines ready to take that load, and it’s going to cost you a lot more. >> Yeah. Well, probably this is why Amazon and you are in a good position to do this, because they can statistically multiplex among many diverse kinds of customers. Whereas if I wanted to set up this service, I’m the only person that would be using it. >> There is a challenge in the multiplexing, which is security; different customers don’t trust each other. >> Yeah. >> Generally, from a security standpoint, for a cloud provider, the only story that people are willing to accept is a VM-level isolation story, and that’s a very expensive proposition. >> All right. But now we know where the problem is, because security of containerized infrastructure is a problem that you and I and everyone in this room can work on. It’s a well-defined problem. >> I agree. >> The people that trust VM-level virtualization are still [inaudible] anyway. That’s not so great, as we all know. You think it’s great? >> I think it’s better than container-level isolation. >> Is that because you know it, or because it’s the better devil? >> It has a smaller attack surface. That’s the main reason. >> Okay.
I think it’s an exciting problem, but you’re right, it’s a problem. All right. Let me try to get through the rest of the talk. So, that was functionalization technique number one: getting more parallelism out of video encoding. I’m going to talk about functionalization number two, which is trying to adapt to the needs of a pre-existing distributed file system. Here it’s more about robustness and reliability. Okay, so this was also at NSDI last year. This is with my colleagues at Dropbox, and MIT and Stanford. Okay. So, the problem here is, if you run a Dropbox-like service, you make all these deals with people and... I did check the public box, didn’t I? Well, anyway, if you run a Dropbox-like service, you have a wonderful company and you are storing a lot of other people’s files. If you look at what those files are, you will see over a third of them are JPEGs. It’s like a third JPEGs, a third videos, and a third other. If you’re Dropbox, you have, like, an exabyte in storage; you’re paying a lot of money to keep these hard drives running. Can you save back-end space? Because with these JPEGs... you can probably guess: what is the median download count of a file in Dropbox? >> Epsilon. >> Yeah.
I think it's zero, because the common pattern is you have an auto-uploader but no one ever looks at the file. I think the median view count of a video on YouTube is zero. Most people never tweet; they just read celebrities. On the Internet, really, no one writes anything; they just read other people's things. So, you can see why a company that runs a big file system like [inaudible] would be interested in compressing back-end space, because no one's ever going to read it anyway. A lot of people have had this idea. If you look at the way JPEG works, it seems like you could do it, because there's a three-layer pancake in JPEG. There's a lossless step where you transform the brightness and the colors. Then there's a lossy step. This is the crucial step where you actually throw away information: you quantize the luma and the chroma, the brightness and the color, to some limited precision. Then there's another lossless step at the bottom of the pancake where you just take those quantized coefficients and write them out in some efficient way, so the more popular numbers are supposed to take fewer bits. So, the question would be, can I just take out the bottom layer of this pancake, this Huffman code, and replace it with some more efficient technique that was invented, or at least came off patent, since JPEG was standardized? So, a lot of people have done this:

you replace this DC-predicted Huffman code with an arithmetic code, which is a more sophisticated technique that was patented when the JPEG standard came out. Then the only hard part is, how accurately can you predict whether the next bit is a one or a zero, because the arithmetic code charges you almost exactly the information-theoretic cost. So, if you're absolutely sure that the next bit is a one and it turns out to be a zero, that takes infinite bits to encode. If you're sure it's going to be a one and it is a one, it takes almost no bits to encode. And of course, if you think it's 50-50, then whether it's a one or a zero, it takes one bit to encode. So, the more accurately you can predict what the next bit is, the more savings you can have, but if you make a gamble and you're wrong, it's a big penalty. All right. So, a lot of people have done this before. You can reorganize the coefficients to make them more similar and therefore more compressible; that gets you six or seven percent compression savings, and you can decode it very quickly. You can also strip out the DC-predicted Huffman code and use an arithmetic code with a normal probability model that people have agreed on; that gets you maybe 13 percent compression savings, but it's slower to decompress. Or you can be a crazy German person and sort all the coefficients in the whole file, have a huge model with like 100,000 coefficients and an arithmetic code, and that gets you like 23 percent compression efficiency, but it's very slow because there are all these global operations. If you're Dropbox, that's not going to work for you, because they don't have the whole file in one place. They have this distributed file system where the blocks of the file are sharded based on their hash and then independently distributed. So, there's the client here, and the client goes out to these different file servers, and it requests or it stores different byte ranges of a JPEG file. So,
for the client to reassemble the file, it has to go out to different file servers and get a different block from each one. So, what you'd really want is to replace the shard of the file on each server with a compressed file, a Lepton file, that represents the corresponding block, and then when the client requests it, it should just get the decompressed block. Anything else is not really a viable way to deploy, because you can't upgrade all the clients in the world to become aware of your compression format, and not all the clients are on the same programming language or platform. So, it really has to be a transparent block-by-block operation on the file system. All right. So, the problem here, if you're a big company, is you want to store and decode the file in independent chunks. They have to be able to start at any byte offset, because I don't control the file system's sharding algorithm. You have to be able to get greater than 100 megabits per second decoding speed, or else there's some news story that Dropbox just got ten times slower to save money, and that's not good. They're very concerned not to lose any data. So, this whole system has to be immune to any adversarial or pathological evil JPEG. Every time the program is changed, they run it on a billion files with three different compilers. The three compilers have to agree with each other and agree with the true input before they're willing to deploy it. Even so, by the way, there have been problems. Yeah, I settled that.
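The arithmetic-coding tradeoff from a moment ago — accurate predictions are nearly free, confident mistakes are expensive — can be made concrete in a few lines. This is a toy illustration of the information-theoretic cost, not Lepton's actual coder:

```python
import math

def bits_to_encode(p_one, actual_bit):
    """Cost in bits for an ideal arithmetic coder to encode one bit,
    given the model's predicted probability that the bit is a one.
    The coder pays almost exactly the information content -log2(p)."""
    p = p_one if actual_bit == 1 else 1.0 - p_one
    return -math.log2(p)

bits_to_encode(0.5, 1)    # 50-50 model: exactly 1 bit either way
bits_to_encode(0.99, 1)   # confident and right: about 0.014 bits
bits_to_encode(0.99, 0)   # confident and wrong: about 6.6 bits
```

As the prediction approaches certainty and is wrong, the cost grows without bound, which is why a better probability model translates directly into compression savings.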
Okay. So, the question here is, when the client retrieves a chunk of a file, how does the file server re-encode that chunk from our compressed format back to the JPEG format? Well, the answer here is the same as the previous answer. Let's do it in functional-programming style, where the JPEG encoder, that DC-predicted Huffman entropy encoder, is written as a pure function with no side effects whose state is supplied explicitly. That allows us to start encoding from midstream, just like before. We want to be able to tell the JPEG encoder, "Hey, the decoder is already in this state, the state left behind by the previous filesystem block. That's where they left off, at that byte offset. So, you need to start encoding Huffman code words assuming that the decoder is in a particular state." So, we had to reason about what is the state, the byte-by-byte state, of the Huffman encoder. It's not actually that complicated. The real challenge here is that there's no reason for the byte offsets to fall on the boundary of a Huffman code word. A Huffman code word can be 17 bits, and the byte offset could be 14 bits into it. So, what you have to remember is: how many bits into it am I, and what are the bits in the current byte that I was supposed to write out but haven't yet, because I don't have a full byte. Also, there's this DC prediction, where you subtract out the average value, so I have to remember what those average values are. So, it's not that complicated: the state is 16 bytes. This is a much easier problem than the video encoding. For video, we had to reason about these big states, and there are lots of different probability tables; it's very complicated. JPEG is not complicated. We just had to honestly come up with a test suite and then keep trying to get the state smaller and smaller, so we'd understand it. The problem is that when the specification is written, they don't talk about scope or what the minimal state is. They say, "There's this variable, and this variable." It turns out you can
derive one variable from the other, so you can make the state small. It was a fun project to do, but it was not that complicated. So, ultimately, if you do this, you can shard the file across different filesystem blocks. You can also shard the file within a filesystem block to be able to have a parallel decoder. So, just within one particular filesystem block, not even talking about the benefits of distribution, we end up here. We end up with almost the same, not quite, but almost the same compression efficiency as the German guy, but much better decoding performance, because we can split it up into little

pieces and do them in parallel. All right, so as of when I last checked, we had compressed 200 petabytes of JPEG files and saved 46 petabytes, and someone ran the numbers inside and decided it is economical to go through the full back catalog, all the 350 petabytes or whatever they have, and start compressing them. So, they're doing that at like 6,000 images per second. There's a cluster dedicated to this program. You can see the cluster is consuming about 280 kilowatts, and then at one point, they suspected there was a bug and they turned off the program. It went down to like 140 kilowatts. Then eventually the program was exonerated and they turned it back on again. So, what's interesting here is that there is a cluster running this program that consumes 300 kilowatts, but like half of that is running the CPU fan and the LEDs on the front of the computer. This is the empirical efficiency of computing: half the power is consumed whether you're running a program or not. So, there's a problem here for someone else, not me, to think about: how can we more efficiently make use of computing resources? Because it seems silly that you turn the program off and you're still paying half the energy price. What is the ratio in Azure? Is it better? Is that an "I know, but I won't tell you," or is it, "That's a crazy question to ask"? >> The same. >> I have no idea- >> Okay. I understand. >> Somebody knows. >> Someone knows. All right. >> They probably won't tell you. >> That's fine. Well, great. I'm telling you, for Dropbox, it's whatever this is. This is in the paper. If you have a great solution, well, I- >> It's fine. Just [inaudible] for our community is- >> They're honest. >> Yeah. >> Okay, great. Okay. Well, then us, whatever community we are, we can know that they're not doing a very good job yet. Okay.
Okay, great. So, the lesson here is that a little bit of functional programming can go a long way. We're talking about 16 bytes as the state that allows you to resume encoding the JPEG file from anywhere. That turns out to be a powerful thing. It saves them, I don't know, a lot of money. So, in fact, there are all these stories about Dropbox saving all this money, and it's like 16 bytes; it's not such a big deal. But we had to get over our fear of something that seems like it should be opaque, the JPEG encoder. You just set up a test suite where you ask, "Are these two files the same or not on decoding?" and then just start mucking around with it. It's really not that hard to start fooling around with it. So, if we're willing to take these mega modules like a codec and split them up into this explicit state-passing style, great dividends can be earned. That's the point here. Yes, sir. >> So, maybe this question is actually a little bit more about the ExCamera than the Lepton part. But- >> Sure. >> When you have an existing implementation of a codec that has not been built in this functional style- >> Yes. >> -and you want to build the functional version, do you throw it out and start from scratch, or do you take your existing reference implementation and think about how to reverse-engineer functionality into it? What's the general approach?
>> I think that's a great question. The truth is that we've done it both ways. So, in the ExCamera project, we did build our codec from scratch; I mean, I did that with my students. In the Lepton project, because it was in a real company, we didn't. We took an existing JPEG encoder, and we transformed it to be functional, which I think is probably what you'd normally do in industry. If you could avoid writing it from scratch, maybe you would. Honestly, I don't think either one was a clear winner in terms of productivity or software-engineering best practice. I think this approach wasn't that bad. We had to find the function that's actually doing the writing and we had to lift it out. They have some inner loop that's like, for i equals the top row to the bottom row, and then the left column to the right column. So, they have some loop that they're going over. We had to lift that out, and we had to ask: what is the inner state of the loop that you actually care about? Because you have all these local variables that are evolving. Can we write down the minimal set of local variables that matter, and then can I write a function that just takes those variables explicitly and does one bit of it? Then you can put it back as a stage in the loop and make sure it still works. Then once you know it works, you can take it out of the loop, because now you understand the state. So, I think you can do it both ways. >> For the video encoding thing, where implementations are perhaps a bit larger and more complicated, do you think that retrofitting approach would be feasible, or do you think it's just like- >> So, I think for someone who truly understood the VP8 format and the codec implementation, retrofitting would be feasible. But for us, the only way that we could understand it was by implementing it in a clean way. So, now that we've actually implemented it, I think we could now go and retrofit Google's implementation. But there's no way we could have done that at the beginning.
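The refactoring pattern described here — find the inner loop, write down the minimal state it carries, then pass that state explicitly through a pure function — can be sketched with a toy bit writer. This is illustrative only; Lepton's real 16-byte state also carries the DC predictors:

```python
def write_bits(state, value, nbits):
    """Append `nbits` of `value` to the bitstream. Instead of hiding the
    half-finished byte in a member variable, the state -- (pending bits,
    pending bit count) -- is passed in and returned explicitly, so a
    server can suspend encoding at any byte offset and resume later."""
    pending, count = state
    out = bytearray()
    for i in reversed(range(nbits)):
        pending = (pending << 1) | ((value >> i) & 1)
        count += 1
        if count == 8:            # a full byte: flush it to the output
            out.append(pending)
            pending, count = 0, 0
    return (pending, count), bytes(out)

# Encoding in two resumable halves...
state, first = write_bits((0, 0), 0b10111, 5)
state, second = write_bits(state, 0b0110100, 7)
# ...yields the same stream as one uninterrupted pass, because the
# saved state captures everything the encoder needed to remember.
```

After 12 bits, the leftover state is just four pending bits and their count; a resuming encoder needs only that tuple, not the whole history of the stream.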
Yes. >> So, how much of this is because Huffman coding was suitable for this? Do you have a sense of, if you take out Huffman coding and replace it with something else, can you still do this or not? >> Well, is the question can you still get the compression efficiency, or can you still parallelize? >> Can you still parallelize? >> Well, the reason it was so easy to

parallelize is that the state is something you can write down relatively concisely, like 16 bytes. If the state were really large, we could still parallelize, but we wouldn't get the compression efficiency. So, I guess we needed this concise state, because the goal here is compression and performance. If you had some very complicated encoder state, yeah, I guess we wouldn't be able to do it. For video, for example, if we had to parallelize the video at any byte boundary, I don't think we'd get a gain, because the state within a frame is very big. >> So, it seemed to be about 13 megabytes of state in the ExCamera case. What was the computation-to-communication ratio in terms of overlapping with the available six-frame compute? >> Yeah, yeah. Well, it's a great question. The bulk of the communication in ExCamera is in downloading those raw frames from S3, because those are totally uncompressed raw 4K frames and we download six of them. But the communication within Amazon's cloud is very fast, close to 10 gigabits. So, both moving the states around from Lambda to Lambda and downloading are pretty fast. The computation is really the burden. >> I'm just wondering, in terms of this scenario, how much is communication among many Lambdas at once in their network fabric and- >> Yes. So, we have this graph in the paper; I think I have it in my backup slides. The communication is not the part that's taking the time; it's really the computation. Yes, sir. >> So, actually, more for ExCamera: do you have any idea of whether there is an impact on the actual quality of the image after doing this, compared to the original video? >> Well, we quantified that; I showed that graph. >> No.
When you do the compression of the video, does it degrade the quality? >> Ah, that's on the y-axis. I'll go back to that one. So, here's the graph I'm talking about. The quality is on the y-axis. So, it's close to, probably, the state of the art. >> Okay. Thanks. >> All right. Okay. So, let's turn to system number three, where there's yet another benefit of this sort of functional style, which is, in real-time video encoding, being able to explore an execution path of the encoder without committing to it. So, here the problem is video conferencing. Let's look at this clip. My student just gave this talk at NSDI two days ago. So, here is actual Chrome 65 WebRTC when the network gets bad. Look what happened: it just freezes, and now the network is better and it's still frozen, still frozen, sort of recovering, and okay, it's finally back. All right. We can watch that again, or did you get the idea the first time? All right. So, this is the state of the art. There are lots of people working on this. This is the flagship WebRTC reference implementation in the latest Google Chrome browser. It just does not handle network variability very well. The reason, partly, is that it doesn't track these variations in network speed. You can see when the network gets bad and good again; that's the pink part. WebRTC keeps sending when the network is bad, and that provokes packet loss or provokes queuing delay, and then the encoder gets totally confused. The video doesn't resume until almost here, I think maybe even at four. It just gets confused. It seems to have been written in a way that it can't recover quickly from network glitches. Yes, Kanesh?
Sure. >> See, what is the optimum [inaudible]? >> Well, the encoder chooses the bit rate to encode at. So here, it's encoding at about three megabits per second, and then by the end, it's really not; there's no video at all. I mean, it doesn't recover until much later, and we can see in the graph in the paper where there are just no frames displayed at all. I mean, we're picking on WebRTC because it's the worst one, but it's also the great hope of the Internet. So, anyway, what we argue in the paper is that the problem here is that the video codec is off in one place and the transport protocol is off in another place. There are people, computer scientists, networking people like me, who work on the transport protocol, and there are faraway people, like in Europe or in the EE department, who work on the video codec, and they both have their own control loop. What we have is a new architecture, a jointly controlled architecture, where we jointly control the frame-by-frame rate-control decisions of the video codec and the packet-by-packet congestion-control decisions of the transport protocol in one control loop to try and do a better job. So, we have this video-aware transport protocol, and we have this functional video codec. The goal is to respond quickly to network glitches and recover quickly after a network glitch. The bottom line: we get four and a half times lower delay and about 60 percent better picture quality than the average of FaceTime, Hangouts, Skype, and WebRTC, with or without scalable or layered video coding. So, here is the way the status quo systems work, including Skype, or Hangouts, or FaceTime: the transport protocol has some estimate,

some available-bandwidth estimate of how fast the network is, and it gives a target... Well, I just lost my laser. That's okay. It gives the target bit rate to the video codec, and the video codec chooses some encoding parameters, like a quality setting and some number of frames per second, and then spits out compressed frames back to the transport protocol. The problem is these operate separately. So, if the video codec generates a frame that's too big, the transport protocol kind of has to send it. Even if the transport protocol knows that frame is too big and is going to congest the network, in WebRTC they will send it anyway, and then they will say, "You know, we've probably just congested the network; let's pause frames on input to the video codec for a while to give the network a chance to recover." It would be better just not to send the bad frame in the first place, but why can't they do that? What happens if the encoder is spitting out frames and the transport just cancels one of the frames and doesn't send it? What's going to happen? >> The subsequent frames can't be decoded. >> Exactly, the subsequent frames won't be decodable. They'll either be badly decoded or won't be decoded at all. There's no way to tell the video codec, "Hey, you remember that execution path you just went down? Forget about it, back up, and let's pretend that that frame didn't get sent." So, what we have in our system is one integrated control loop, where every frame, we consult both the transport protocol and the video codec.
We go around the loop. So, let's talk about the transport-protocol part of it first. We want to ask what should be the size of the next frame, so the transport protocol has an estimator for this; we call it a congestion window in the networking literature, and that's how big the next frame should be. We measure this based on the inter-arrival time, how fast packets are coming in. There's no notion of a bit rate here, or an average rate; there's just a congestion window. This is how TCP AIMD works, or TCP CUBIC, and we try to do this, I maybe don't have time for this, but we try to do this in a video-aware way. The problem with previous schemes for adaptive bit rate, excuse me, available-bandwidth estimation, is that the encoder pauses in the middle. The receiver can think, "Oh well, I didn't get any packets for a while; the network must be slow." You can get a pessimistic estimate of the network speed if the sender is not full throttle. Normally, when people evaluate TCP, it's like FTP or something, downloading a big file where there's a full-throttle sender. So, if you don't have a full-throttle sender, then you can get confused. So, we have this idea of a grace period, where the time in between frames is explicitly encoded and the receiver can pause its inference in the middle and say, "Well, I wasn't expecting frames." Yeah? >> So, the reason why it's not enough to keep the sender full all the time is because TCP is not aware of the frame boundaries, is that it?
>> No, I would say you don't want to keep a buffer at the sender side, because that's going to add delay. >> Well, if you keep the sender buffer full, wouldn't the TCP window size be enough to adapt to the network variability? >> So, I would say, again, you don't want to have a buffer on the sender side. You only want to be sending a frame when it's ready to go straight out onto the network. If you're hanging on to packets at the sender side, you should instead wait and encode a later frame off the camera. And to be clear, all these programs, Skype, and FaceTime, and Hangouts, are all using UDP with their own sort of custom congestion control. All right, so that was the transport protocol. On the video-codec side, the problem is this: even if we have a congestion window, so we know what size the next frame should be so that it can go right out and doesn't have to wait in a sender-side buffer, no video codec can accurately hit a particular size target for an individual frame, because the only way you learn the compressed size of a frame is to try and compress it; you learn after the fact how big it was. So, the strategy we want is some trial and error: let's try a little better quality setting, a little worse quality setting, and see how it works out. The problem is that it's hard to do that with an existing codec, because if you ask it to encode a frame, the encoder goes through a state transition, and the next frame is going to assume that that frame was received. So it's hard to say, "Encode a good frame, and then forget you did that; now encode a bad frame based on the same state." There's no API for that. In our system, there is an API for that. It's the same one as before: you can encode given a state. So, we can encode a branch. We can encode two different possible compressed frames given the same decoder state. That's basically the strategy here. Every time around
the control loop, we ask the encoder to encode a slightly higher-quality version of the last successful frame and a slightly lower-quality version of the last successful frame, and that gives the application three options, depending on what the estimate is for the congestion window. It can send the good one, it can send the bad one, or it can say, "You know what? I don't like either of those options; they're both too big. Let's just not send a frame this time around," and that's something you can't really do with a conventional codec API. You can't say, "Just forget about it." So, if you look at the way this works, the way the congestion window evolves: let's say it's 30 kilobytes right now. There's the better option and the worse option. Which one can we send right now? The worse option. All right. So, we send the worse option, because that's the one that fits.

It has to fit within the congestion window, and the better one was too big, so we send the worse option. All right, that's good. Now, next time, the target frame size is bigger, so we can send the better option. All right, and next time, the target frame size is maybe really small. So, what are we going to do now? Because we have this ability to explore an execution path without committing to it, we can just cancel it. We say, "You know what? I don't like either of these options," and the next time the encoder encodes a frame, it does it relative to the frame that actually got sent. So, these options, it's like they never happened; you just encode a new frame. What you can see here is that the frame rate is totally dynamic. There's not a fixed frame rate; the frames only get sent when the network can accommodate them. So, there's a gap here, but it's much better to do it like this on the output than on the input, where you've already encoded frames. Yes, sir? >> Isn't it likely that as you skip frames, the delta is going to get larger? >> Well, yes. The delta will get larger. That absolutely will happen, but it's still better than congesting the network. Yes? >> How are you getting the target frame size? >> Basically, the transport protocol, using Sprout, comes up with it. >> It may be mistaken. >> That's true, that's true. But you know, AIMD-style congestion control can be mistaken too, and this is something we can improve.
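The three-way choice just described can be written down as one turn of the joint control loop. This is an illustrative sketch, not Salsify's actual interfaces; `encode` stands in for the functional codec, a pure function from (state, frame, quality) to (new state, compressed bytes):

```python
def control_loop_step(codec_state, raw_frame, cwnd_bytes, encode):
    """One turn of a Salsify-style joint control loop (a sketch).
    Because `encode` is pure and takes its state explicitly, we can
    explore both encodings without committing the codec to either."""
    better_state, better = encode(codec_state, raw_frame, "higher")
    worse_state, worse = encode(codec_state, raw_frame, "lower")

    if len(better) <= cwnd_bytes:   # the good one fits: send it
        return better_state, better
    if len(worse) <= cwnd_bytes:    # only the bad one fits: send that
        return worse_state, worse
    # Neither fits: skip this frame. The unsent branches simply never
    # happened; the next frame is encoded against the state of the
    # last frame that was actually transmitted.
    return codec_state, None

# A hypothetical stand-in encoder, just to exercise the loop:
def fake_encode(state, frame, quality):
    size = 40 if quality == "higher" else 20
    return state + 1, b"x" * size
```

With a generous window the step returns the higher-quality frame; with a tight one, the lower-quality frame; and when neither fits, it hands back the unchanged codec state and no frame at all, which is exactly the "just forget about it" option a conventional codec API lacks.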
All right. So, to evaluate this system, we built what may be the first system, or at least the first one that we know of, to do reproducible end-to-end tests of black-box video-conferencing systems. The goal is to have a reproducible input video and reproducible network traces, run unmodified versions of Skype, FaceTime, Hangouts, and Chrome, and then measure the quality of each frame and the delay of each frame. So, the students built a system where we have an expensive HDMI input-output card that has a hardware clock that timestamps each frame as it goes out and when it comes in, and we simulate a webcam. We have this barcoded video of an actual webcam video conference. We put a barcode, at 60 frames per second, on each frame so we can find the frame later. We send it out, we pretend to be a webcam, and we have this USB plug that looks like it's a webcam. We can plug it into a Mac running FaceTime or a PC running Skype, and it thinks it's a webcam. We run it through the sender, through our network emulator, which is synchronized with the video, to a receiver, which full-screens the video. It goes out the HDMI port, back into the HDMI capture, which is timestamped with the same hardware clock, and then we just recognize the barcode. So here's a frame coming out; you can see the barcode here. Here's the same frame coming back; you can see the barcode is still decodable, even though it's a pretty messy video. And so we can get the timestamp when it came back, because we match the barcodes, and we can get the quality of the output frame relative to the input frame. We do this for every single frame. So, if we look at the results, again, quality is on the y-axis here and delay is on the x-axis, and we flipped the x-axis so the lower delays are to the right. All right, so here's Skype, and WebRTC with and without scalable video coding, and FaceTime, and Hangouts. So the first thing we did was try to just match the
status quo. So we wrote our own video-conferencing tool with our toolkit, our codec, that tries to work the way we think the existing ones do, with these separate control loops, and we ended up right here. That's our attempt to approximate the status quo. The next thing we did was keep the codec the same, not functional, but improve the transport protocol. So we took this Sprout protocol that I published a while ago, a simpler version of that, and we tried to have frame-by-frame control of the size that each frame should be, but with a conventional codec that can't encode multiple options and see which works best, and that ended up over here. So we get lower delay because we have this new transport protocol, but we don't have very good quality, and we're still using this conventional codec. So then we swapped out the conventional codec for our purely functional, explicit-state-passing-style codec, which can skip a frame when necessary and can encode the higher- and lower-quality versions and pick between them after the fact, and that ends up here. So this is a student-written codec, one graduate student, but it's getting better quality than production-quality H.265 encoders, and this suggests that, for this problem, further improvement in codecs may have reached the point of diminishing marginal returns, but improvement of video systems, I think, still has a lot of low-hanging fruit. So this is on a Verizon-like network; here's an AT&T LTE trace, again, we're here; here's a T-Mobile UMTS trace. We can also look at networks that don't have any variability but just have fixed conditions with occasional loss, like a Wi-Fi link; we're over here. So we're not getting the quality benefits, but we still get the delay benefits. And we can look at what it looks like; here I have some movies. So here's the same one we've seen before, but now with Salsify on the left. You can see, as the network deteriorates, they both freeze, but then Salsify
recovers right away, and WebRTC is still frozen, still frozen, it's getting there, it's getting there, and now it's recovered. So you can see you can recover much more quickly. Here's a different video where we look at episodes of packet loss. So Salsify is on the left, and we're going to see some packet loss now;

they both die, but Salsify has recovered; WebRTC is still frozen, and now it's recovered, and we'll try it again. All right: Salsify is back, WebRTC is still frozen, there it recovers. One more time: Salsify recovers, Chrome is still frozen. One more time, so you get the idea here. They're just not able to adapt the control fast enough to recover when there's this sort of packet loss. Yes, sir? >> Is this something you can build on top of a hardware decoder, or can you- >> We believe you could build it on top of a hardware decoder, but the hardware decoder might also have to allow you to export its state, and we really want the API; whether it's software or hardware is not our fight. All right. So, the code is open source, we have a nice website, and we're on Reddit today, so that's how you know we've really made it. But ultimately, the point here is that by jointly controlling the video codec and the transport protocol, and using this functional abstraction to match what the codec can do to the capabilities of the network, both when to send a frame and how big it should be, we end up with dramatic gains over the status quo, separate design. All right? Do we have time to do the fourth part? >> Yes, we could go on. >> All right, because actually- >> Several of us came for the fourth part. >> Oh, the fourth part. Okay, great, let's do it. Okay. So, we have this powerful abstraction, the function, and this powerful service, this functions-as-a-service computing. So, if you're going to be dividing things up into little functions and running them on other people's infrastructure, how can we make this easy to program? If you look at systems like our ExCamera, and Berkeley had one called PyWren, the problem is that they ask programmers to commingle the algorithm, what is it I want to compute, and the schedule, what gets parallelized and what gets run when?
And also the execution engine and the execution platform, like there's some API called invoke-Lambda or OpenWhisk or whatever. So it seems kind of unfortunate to commingle these things together, because it means if you write a program that's targeting Lambda and you want to then adapt it to Azure Functions or OpenWhisk, this is all tied together. And what if you want to try some different schedule, or run it on different hardware, or with a different amount of parallelization? These really shouldn't be tied together. If you look at systems like Halide or LLVM, the great success is in having an abstraction boundary where it makes sense to have one: in Halide's case, between the algorithm and the schedule, and in LLVM's, between the front end, the back end, and the optimization passes, so that I can implement a new language on LLVM and take advantage of their optimizers and their back-end code generators. So can we do this? Can we come up with a general intermediate representation for this Lambda-style, granular computing?
where someone who wants to write an application that can parallelize across 10,000 threads doesn't have to worry about the back-end or the scheduler? So, our approach here is to try to capture everything that you might want to do at that stage of the computation in a thunk. In the functional programming literature, a thunk is a parameterless closure: a little container of some code and some data. In our world, everything is content-addressed. A thunk is a codelet, which is named by the hash of its code, which for us is just an x86-64 executable, and then the named data that it will be applied to. I'm going to apply this function to this data, and everything's named by its hash, just like in Git or something like that. It ends up being a sort of micro-container where the function is not allowed to access anything outside the named data, so /dev/random is totally out, talking over the network is totally out, any sort of undeclared IO is totally not allowed. The only thing you can do is read the blobs that are declared as arguments to the function, and the blobs are named by their content hash. If you can pack things into this abstraction, then you can instantiate the micro-container on Lambda, on OpenWhisk, on your own computer, on EC2, on Azure, or wherever you want, and you get the same answer. You can even imagine outsourcing it to crazy people on Reddit, and people come back and say, "I ran function x on data y and the answer is z," and everyone gets the same answer. So, our sort of flagship application is software compilation, where the idea is we model the operations of preprocessing, compiling, assembling, linking, and archiving, and we'd run an existing makefile, and the makefile would run GCC and AR and the linker and all these things. Except it wouldn't really run the real programs; it would run models that know how to predict what dependencies those
programs would need at runtime. If you can predict the dependencies, you can write a thunk. The model for the compiler can say, "Well, I'm not actually going to compile anything, but in the future, I promise you, I'm going to take the compiler, which is a blob of code, and apply it to the source code file; you're also going to need this include file, and if you have those blobs you can do the operation later." That's a thunk; it's sort of a future with no dependencies. So, you might ask, what about a dynamic task graph, in systems like Dryad? Obviously, this is not new territory here. What if you want to detect faces in a bunch of pictures, which could have between zero and N faces, and then in parallel recognize each face, but you don't know up front how many faces there are going to be?
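The thunk abstraction described above can be sketched in a few lines of Python. This is a hypothetical illustration, not gg's real format (gg names actual x86-64 executables in protobuf documents); the blob contents here are placeholders.

```python
import hashlib


def blob_name(data: bytes) -> str:
    """Name a blob by its content, Git-style: the SHA-256 hex digest."""
    return hashlib.sha256(data).hexdigest()


class Thunk:
    """A parameterless closure: a codelet (named by the hash of its
    code) applied to named data blobs. No other I/O is allowed."""

    def __init__(self, code: bytes, args: list[bytes]):
        self.function = blob_name(code)           # hash of the executable
        self.data = [blob_name(a) for a in args]  # hashes of the inputs


# A "compile" thunk that promises future work: apply the compiler blob
# to a source blob and an include blob, without running anything yet.
compiler = b"<x86-64 compiler executable bytes>"
t = Thunk(compiler, [b"int main() { return 0; }", b"<stdio.h contents>"])
print(t.function[:12], len(t.data))
```

Because every reference is a content hash, two parties that build the same thunk from the same blobs get the same name for it, which is what makes caching and outsourcing possible.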

Well, you can imagine, okay, let's just have an API call to spawn a new task, so there's one task that runs and detects faces and then spawns, in parallel, a task to recognize each face. I don't want to do that, because that's language-dependent; I have to actually have an API binding for spawning a new task. And how do you cache that? If you have some task running on a Lambda that now has to spawn another task, how does it get access to the cache, and how do I name that sub-process invocation? So, we're saying you can't do that; you can only have IO with the named blobs, and if you want dynamism, tail recursion is the only mechanism. If you have some task that wants to farm out to an unknown number of other tasks, it has to return a thunk, and that thunk can list as many parallel dependencies as you want, and they get forced recursively. This is our mechanism for enabling dynamism in a language-independent way: the thunk just writes out another thunk. So you'd have one thunk that detects the faces, and what it returns would not be the faces; it would be a thunk that could list the recognizer running on as many regions of the image as it wants, and then that's a named thunk, and we can run it again, looking it up in the cache. So that's our abstraction: tail recursion is the only mechanism for dynamism. Now, we can look at the graphs for a compilation; here's the graph for part of GNU Hello. You can see there are source code files, here is hello.c, and those are the dependencies for preprocessed files, and those are the linear dependencies for assembled files, and those become object files, and then there are some libraries, and they get linked together, and they get stripped. So we get these task graphs, or we can do it for video encoding, same thing, the ExCamera I showed you; you can compile that into this same intermediate representation. Here's what it actually looks like: we have a protobuf
document that describes the thunk. There's some function, the named codelet, and then some named blobs. So, let's look at the demo while we're here. Okay, so here's FFmpeg, which is a popular open source tool for video encoding, and we could make it if we wanted to, and it would look like that: take 10 minutes or so, just compiling a bunch of files. What if we wanted to do this in parallel in the cloud, just rent 1,000 machines or 2,000 machines and do it as fast as possible? Instead of actually compiling it, I'm going to run the same make program, in parallel I guess, but I'm going to infer the thunks; I'm prepending this gg-infer here. Is this big enough? I can make this bigger. We're going to infer the thunks. This is going to run the same makefile as before, but it's going to change the path. When the makefile tries to run the preprocessor, or the compiler, or the assembler, or the linker, AR, ranlib, or strip, it's instead running models of those programs. They just know enough to be dangerous. They just know how to say, if I were really going to compile this file, here is the source code file I would need, here's the preprocessed file I would need. Of course, the dependency could be a real file, but in many cases it doesn't exist yet: the compiler runs on a preprocessed file, and the preprocessing hasn't actually happened yet. The name of that blob is not the name of the real preprocessed file; it's the name of a thunk that's going to produce the preprocessed file. By name, I mean the content hash. Is it clear what I'm trying to say? We ran this makefile in a few seconds, so as far as make is concerned, we've actually compiled it. Yes, sir?
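The model-of-the-compiler idea can be sketched like this. It is a toy illustration of what gg-infer's models produce, not gg's actual protobuf format; the function name, output field, and blob contents are all hypothetical stand-ins.

```python
import hashlib
import json


def name(data: bytes) -> str:
    """Content-address a blob by its SHA-256 hex digest."""
    return hashlib.sha256(data).hexdigest()


def model_of_cc(source_file: str, preprocess_thunk: str) -> dict:
    """A 'model' of the compiler: it compiles nothing. It just writes
    down a thunk promising the work, where one dependency is itself
    the name of a thunk (the preprocessor run that has not happened
    yet). Field names here are illustrative, not gg's real schema."""
    return {
        "function": name(b"<cc1 executable bytes>"),  # the compiler blob
        "outputs": [source_file.replace(".c", ".o")],
        "data": [preprocess_thunk],  # a thunk name standing in for real data
    }


# The dependency is a promise, not a file: the name of the thunk that
# will eventually produce the preprocessed source.
pre = name(b"<thunk: preprocess hello.c>")
t = model_of_cc("hello.c", pre)
print(json.dumps(t, indent=2))
```

The key point mirrored from the talk: the makefile finishes in seconds because every "compile" step only emits a small record like this, and the real work is deferred until the thunks are forced.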
>> I understand what you're trying to say, but I think you've made a little bit of a leap, from the thunk as a bunch of instructions that execute [inaudible] language, so they're going to produce a totally deterministic output, to a thunk as the assembler or the compiler, which makes system calls on the OS, [inaudible] not deterministic instructions. Are you just assuming that those things are totally deterministic? >> We run it in a sandbox that tries to forbid any system calls that would allow non-determinism, getrandom for example. We forbid socket. So that's our attempt to enforce determinism. I don't think determinism is strictly required by this system, but you want it if you want to be able to double-check an output. If someone tells you function X applied to data Y gives you value Z and you're suspicious, it would be nice to be able to double-check it and get the exact same answer. >> When we build systems, people worry about things like timestamps on output [inaudible] >> Yeah, we care about that, too. Our continuous integration tests verify that we get bit-exact output. The truth is that, in the open source world, they have already done a huge amount of work to get bit-exact builds; we're just riding on their coattails, so you can now build 90% of Debian in a bit-exact manner. To do that, they had to go through tar and make sure it doesn't put timestamps in, have the compiler take an externally supplied random seed, and all this stuff. >> These high-level abstractions don't kind of apply to other problems of [inaudible] >> That's true.
They also have to become deterministic. In the image compression I showed you at Dropbox, it was very important that it be deterministic, because what they're terrified of is that they'd write a file in September and not be able to decode it the next April. The only mechanism we had there to guarantee determinism was to run it on a billion images and make sure that we always get the same answer with three different compilers.
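The double-checking story from the last two answers can be made concrete in a small sketch. It assumes a deterministic, sandboxed runner (here the `execute` callback, a placeholder for the real sandbox); with that assumption, blob assertions are cheap to audit and computation assertions cost one re-execution.

```python
import hashlib


def name(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def check_blob(claimed: str, contents: bytes) -> bool:
    """'The file with hash H has these contents' is cheap to audit:
    rehash the contents and compare."""
    return name(contents) == claimed


def check_assertion(fn: bytes, data: bytes, claimed: str, execute) -> bool:
    """'Function X applied to data Y gives Z' costs a re-execution to
    audit; `execute` stands in for a deterministic, sandboxed runner.
    A signed assertion that fails this check is proof of a lie."""
    return name(execute(fn, data)) == claimed


# A trivially deterministic "function": the identity.
run = lambda fn, data: data
honest = name(b"payload")
print(check_assertion(b"id", b"payload", honest, run))        # True
print(check_assertion(b"id", b"payload", name(b"lie"), run))  # False
```

Determinism is what makes the second check meaningful: without it, a disagreement proves nothing.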

But guaranteeing determinism is definitely an important prerequisite for this work, at least if you want to be able to double-check, and it's not one that I can solve for you. All right. We end up with a file here that, as far as make is concerned, looks like FFmpeg. It's not really a binary; it's just a little reference that says, if you're really interested in the contents of this file, here is the thunk that you have to force. We can actually describe this, and here's the actual thunk. Now we're writing in this ugly format, but you know, there's a function, which has some hash, [inaudible] of a hash. There's a bunch of arguments, and it depends on a bunch of prerequisites, some of which are thunks themselves, so you can actually look at the full- Sorry, we can look at the full tree. Oh, wow. Okay, we've optimized this program too much. Anyway, in the end, you can see there are 5,180 subproblems and 4,556 files totaling 50 megabytes. These are the prerequisite input data to compile this FFmpeg program. If we actually wanted to do it right now, it would not take any time at all, because it's deterministic and therefore we're going to find it in our cache. We're just going to find the thunk that says, you know, function X and data Y, and we're going to know what that is. So, I'm going to blow away the cache here, otherwise it won't be very interesting. Let's try it now. We can run this on AWS Lambda. We're going to have 2,000-way parallelism, and we're going to have a timeout of five seconds; if we don't hear from a particular thunk, we're going to run it again. All right, let's see if this works. There you go. You can see how much we've spent here in the lower right. Here's how many thunks have happened and how many are remaining. Let's see here. So, there are no files to upload, because we already had the source code in the cloud, and it knew that because it's all named by the hash.
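The reason a second run "would not take any time at all" can be sketched as a memoized force: determinism makes it sound to reuse any cached resolution of a thunk's name. This is a minimal sketch, assuming an in-memory store and a placeholder `execute` standing in for the sandboxed runner.

```python
import hashlib

store: dict[str, bytes] = {}  # content-addressed blob store
cache: dict[str, str] = {}    # thunk name -> name of its output blob
calls = 0                     # how many times we really executed


def put(data: bytes) -> str:
    """Store a blob under its content hash and return the name."""
    h = hashlib.sha256(data).hexdigest()
    store[h] = data
    return h


def force(thunk: str) -> str:
    """Force a thunk by name. If we have ever resolved this thunk,
    reuse the answer without running anything; only on a miss do we
    actually execute."""
    global calls
    if thunk in cache:
        return cache[thunk]       # cache hit: no work at all
    calls += 1
    output = execute(store[thunk])  # cache miss: actually run it
    cache[thunk] = put(output)
    return cache[thunk]


execute = lambda body: b"<hello.o bytes>"  # pretend compilation
t = put(b"<thunk: compile hello.c>")
assert force(t) == force(t)  # second force is a pure cache hit
print(calls)  # 1
```

Blowing away `cache`, as in the demo, is exactly what makes the run interesting again.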
Let's see. Man, this worked great when I was setting up. Here we go, we're in like the linking stage, the archiving. Look through here: we've got two running. We duplicated some jobs; we had some straggler failures. All right, come on. Hey, here we go, downloading the [inaudible] Okay, it worked, 34 seconds. That's twice as long as when I was setting up. Ultimately, we compiled this thing, we used 2,000 machines in parallel, and we did it in the cloud; we spent like 20 seconds or something like that. This, I sincerely believe, is the future of how we're going to interact with the cloud. I think renting a virtual machine and booting an operating system is not the right abstraction. People are going to want to run thousands of thunks in parallel for very short jobs. Almost anything you do, why not parallelize it a thousand ways, if you can, and if the overhead's not too bad? You can think about trusting other people's assertions. If someone signs a statement that says, "Hey, the file with hash such-and-such has these contents," I'm probably going to trust them, because if they're lying it's very easy to find out: you just rehash the file. Now, what if someone tells you, "Hey, the thunk with this hash, function X applied to data Y, produces result Z"? To trust them, you want to be a little more cautious, because to double-check that you'd actually have to apply function X to data Y and see the output. There are different trust models you can imagine. You can imagine, maybe I'll trust other people's cache assertions and run the program in a sandbox, but in the meantime I'm going to recompile it myself in the background, and if I see a disagreement, I'll know something bad has happened. Then, what's important is that you can prove to the world that someone is a liar, not just that you'll stop trusting them. If someone signed a statement that says function X applied to data Y is value Z, and I later find a disagreement, it's not just that
I'll stop trusting them. I can send that to The New York Times and say, "Hey, don't trust Amazon anymore. No one should trust Amazon, because here is a signed statement from Amazon that says function X applied to data Y gives value Z. Sincerely, Amazon. You, The New York Times, can try it yourself, and you'll see that Amazon is lying, and this is proof that Amazon should go out of business." This sort of after-the-fact ability to shame could be a powerful way to allow people to trust other people's cache assertions without giving them free rein over your computer. You can imagine using secure enclaves here, where you could put your problems up on Reddit and people could compete to solve them; maybe you'll pay them, and they come back and say, "Hey, I ran your job, function X applied to data Y produces value Z. Sincerely, Intel SGX." I don't have to care who owned the computer that computed it, as long as I trust Intel and the isolation mechanisms of SGX, the secure enclave. That might be good enough, too. You can imagine running code for other people in a totally safe way here, because the functions don't do any undeclared IO; they just read the blobs that they say they're going to read, and those blobs are named by their content hash. You can imagine that that's a safe thing you might run on your computer, and hopefully it would make you more money to run useful computations for other people than to search for hash collisions, which, as we all know, is where a huge amount of computational resources are currently going. I think this ability to have things be deterministic and

to not care who's running the calculation is a futuristic way that we should be thinking about the interface between us and the people who own computers, and you could do this for compilation, and video encoding, MapReduce, database searches, and all kinds of things. Ultimately, I think that refactoring the big parts of computing into small parts, and having granular functional interfaces, both to computing resources like Lambda and to algorithms like video encoding, or compilation, or congestion control, is going to enable new kinds of applications. We should think about refactoring these things where it makes sense. Just the ability to save and restore program state and say, "Go back, try it again from where you were," is a very powerful tool, and it let us do video encoding with lots of parallelism, JPEG recompression that could fit in a distributed file system, and real-time video encoding that matches the [inaudible] network. I hope we're going to be able to do this general intermediate representation that lets anyone easily program for a target, this IR, that you can then run on all kinds of different back-end platforms. Thank you for having me. Yes, sir? >> Those are 256-bit hashes that you're using? >> They were >> [inaudible] >> It was SHA-256, and the smart crypto people told me to use that one. >> Yes, sir, at the back. >> Does this functional approach sort of bake in some assumptions?
Like, one of the things is the assumptions you made. For example, with the frames, everything you've talked about is basically a frame [inaudible] a frame. You never thought about other ways of parallelizing, for example, splitting a frame. I was wondering why that is. [inaudible] for example, in face detection, if I want to [inaudible] in parallel, I can take a frame, chop it up in some way, and just parallelize that. >> Absolutely. I think when you're going to take something that's monolithic and has an ungoverned amount of local state, and talk about dividing it up so that the state becomes explicit, that's where the human intelligence comes into it: finding a place to split the computation where the state between the two parts, the communication that has to be exchanged, makes sense for the problem domain. In the case of JPEG encoding, for example, or video, we're very interested in the state being concise. The interframe state is more concise to reason about and write down than the intraframe state. In cases where those weights are different, then I think there's a different place you want to split. In JPEG, we split at every byte; in video, if we had to split at every byte, it wouldn't work, so we only split at every frame. I think the representation and the data structures matter a lot. If you decide that you want a parallel algorithm that works within frames, then you want to write down the picture in a way that you can split efficiently. If you write it as a JPEG, then splitting a region of the picture off is very expensive; you basically have to decode the whole JPEG. I think, ultimately, it's important for us to figure out how to express these problems in a way that you can functionalize them. I'm not going to suggest that I have some automatic way to do this. >> Yes, sir?
>> I'm thinking about applying the gg structure to the ExCamera problem. >> Yes. >> The 13-megabyte figure you mentioned, was that just the interframe state, or was that also the size of the uncompressed state of the- >> Let's just multiply it out here. >> Exactly. What I'm getting at is, it seemed like you got benefit there from keeping all of those [inaudible] alive and having each of them hold more local state than they were sending back and forth. I was wondering, if you tried to do that as thunks, wouldn't that result in much bigger thunks? >> Yes. So, one of those rasters, the 4K rasters, is about 12 megabytes. When we communicate, we only send one of them; we don't send the full three. That's why I said it was 13 megabytes. I think what we're really trying to do here is abstract the language of the thunks from the execution strategy. The thunk itself only has the hash; the state is very small, and it's up to the execution engine to schedule the computation where it wants to. So, honestly, I think we're going to use the same placement with gg as with ExCamera, the monolithic system. By the way, to be clear, our hash format is the SHA-256 hash plus the size; those are both in the name. So, the gg scheduler is going to learn that if there's a 13-megabyte object sitting on the hard drive of this Lambda, don't send it; just schedule the compute to be where the data is already available. I think that locality-aware scheduling is going to be part of the scheduler algorithm; it won't be the programmer's job. >> A quick question, actually: what do you plan to do next with this work? Are you planning to continue extending gg to other problem spaces, or where do you expect this to go next? >> Yeah. So, I think we want to

obviously demonstrate the power of the abstraction by implementing a few applications. So, in our paper, which will be submitted at some point, we're going to have compilation, we're going to have video encoding, we're going to have a dynamic task graph like the face detector and face recognizer, and we're going to have some sort of genomics thing. We have a generic model where you can take a simple Unix task where the dependencies are named on the command line; that's a very easy one to model. We have a Python SDK where you can write out thunks. I think we had some other ones, but they're escaping me. So, we want to demonstrate a small number of applications, but we've written it in a way that I think people will be able to use, and the goal is for it to be an easy-to-use abstraction for any programmer to implement their application in that language. We also want more back-ends, and we want to be able to take an existing gg program and, without changing the application, run it on Lambda, and OpenWhisk, and Azure Functions, and Google Cloud Functions, et cetera. Yes, in the way back. >> I'm thinking of smaller problems on smaller systems, processors and things. One benefit of decomposing a problem like [inaudible] into little bits is for cache coherency and cache awareness. Do you think this functional style, this decomposition into little tasks and the execution engine, is also relevant in a more general way to developing cache-aware or cache-friendly programs?
>> I do. There's a large literature from the late 70s about how to execute lambda expressions efficiently on hardware; there are different kinds of reduction strategies and reference counting, all this stuff. I'm not a computer architect, but I do know that the architects have thought a lot about how to efficiently execute lambda-type expressions in a smart way. My colleague John Ousterhout is working on this granular computing initiative, and we talk about whether this could become the abstraction, and he says, "Well, how long does it take to interpret a protobuf?" We say, "Very fast, maybe 50 microseconds." He's like, "I want the whole job to be five microseconds." So, there's the question of what the serialization format is; if you want five-microsecond tasks, it's going to have to become a more efficient representation. But the basic idea, that the unit of computation is a named function applied to named data, I don't think is so far-fetched. >> Right. Where those two things are just addresses? >> Yes, exactly. Yes, sir. >> So, for example, let's say I want to use this style for a new program, like [inaudible] >> Sure. >> So, your approach requires me to sit down and understand the whole crypto thing and understand how to actually make my split. Whereas, if I take a step back, if I want to encrypt something, okay, let's take the data, chop it into chunks, encrypt them, then build some sort of data structure on top of that.
So, it seems like >> I think what you just said you could definitely express in thunks; I don't think it'd be that hard. I think you'd only have to come to a deeper level of understanding if you wanted some more complicated structure. >> I see. >> But if you're telling me that the problem here is that it forces people to understand what we're really doing with our lives, I mean, that sounds pretty good. But yeah, the computational process you've just described, you could write down very easily in thunks, and you wouldn't be parallelizing the crypto operation itself; you would just be operating in a data-parallel way, and there's nothing wrong with that. Yes, sir. >> So, my understanding is that the forcing of the entire build is: first you force the thunk that represents the final output. That then forces the thunk that invokes the linker, which in turn forces the thunks it depends on, and so forth. >> Absolutely. >> But what's not clear to me is, where does the suspended state of a partially resolved thunk live, and how is it resumed? That is, when the linker is waiting for all that stuff to compile. >> Oh yeah. Well, we don't do it that way. If we're going to force the thunk for the linker, for example, we just look and see, okay, what are the data dependencies it needs, and if there are any we don't have, we don't actually start running the linker yet. We just say, okay, we've got to force that one first. So, it's normal-order execution. >> I see. So, there's this scheduler that is looking at the linker and saying, oh, it's not ready yet, I have to force this. Then, when that is ready, it has some algorithm for saying, here are all the things that are waiting; something finished, what can I run now? >> Yeah, absolutely. It's like a hash table in memory; it's not a sophisticated scheduling strategy.
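The normal-order execution just described can be sketched in a few lines. This is a hypothetical toy, not gg's scheduler: `thunks` maps a thunk name to (function, dependencies), `cache` memoizes resolutions, and `execute` stands in for running a codelet.

```python
def force(thunk_name, thunks, cache, execute):
    """Normal-order execution: before running a thunk (say, the
    linker), recursively force any dependency that is itself still a
    thunk; only then run the function on the resolved values."""
    if thunk_name in cache:
        return cache[thunk_name]
    fn, deps = thunks[thunk_name]
    args = [force(d, thunks, cache, execute) if d in thunks else d
            for d in deps]
    cache[thunk_name] = execute(fn, args)
    return cache[thunk_name]


# Toy build graph: two "compiles" feeding a "link" (all hypothetical).
thunks = {
    "link": ("concat", ["cc-a", "cc-b"]),
    "cc-a": ("upper", ["a"]),
    "cc-b": ("upper", ["b"]),
}


def execute(fn, args):
    return "".join(a.upper() for a in args) if fn == "upper" else "".join(args)


print(force("link", thunks, {}, execute))  # AB
```

The "link" step never starts until both "cc" dependencies have resolved, which is exactly the behavior described in the answer above.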

>> So, there's a single scheduler that has the state of the current job, the current graph of thunks that's being transitively executed, and it's aware of polling and driving that transitive forcing process. >> That's true, and that scheduler synchronizes these cache assertions to a global store when it feels like it. So, in addition to the fact that it has some pending executions, it also has this cache, because it's looking things up in the cache; the linker will depend on some thunk which ultimately runs the compiler, so the first thing the scheduler is going to do is say, well, do I already have a cached answer for what that thunk resolves to? If I do, I don't have to run it; I just have it. So, if you had multiple schedulers, they're both reading this cache, and if one happens to write before the next one reads, then in that sense they cooperate. >> But there's no kind of thunk-level representation of the partially resolved state of the graph. The entire mapping from thunk to value is kept in the scheduler and this separate cache; it's not stored in S3 in a granular way. >> Well, it could be. I think we might write it to disk, honestly. Because we have a thunk that's written in terms of all kinds of other thunks, when any one of those thunks gets resolved to its data, we can rewrite the parent thunk and say it doesn't depend on that thunk anymore; we'll just rewrite it to depend directly on the data. That itself is a resolution step that we can record to disk. We may, I don't know if we do or don't right now, but we could record that to disk even if it's not fully resolved. >> But that's an incomplete step, because the upstream thunks have no idea about that until they get executed and you rewrite them as well. That rewrite is not visible to the dependents without the scheduler's intervention. >> Well, no.
I don't quite agree with that, because as soon as it goes in the cache, any upstream thunk that wants to try to resolve itself is going to find that assertion in the cache. >> I see, interesting. So, you could have multiple schedulers that are all processing this concurrently, and they can see each other's updates? >> Yeah, I think you could do that. Yes. >> You skipped past the slides; they looked interesting. >> Oh, sure. These are the graphs. So, this is compiling Mosh with 1,000-way parallelism. You can see the rows here; at the end we're running only one. Okay. So, what we see here is, the blue is IO, fetching the dependencies; the green is actual compute; and the red is uploading the answer back to S3. Here's an execution strategy where every thunk is independent: download, run, upload. You can see it's not such a big deal, because the downloading all happens in parallel. But what really kills us is just the amount of parallelism we're able to extract from the makefile. At the end, when we're linking, there's only one job running; someone could have implemented a parallel linker, but that's not what we're doing. So, you can see that at the end we're spending half our time on the serial part of the task. This is Mosh; here's FFmpeg, and you can see the archiving, the linking, and the stripping are literally half the time. >> So, is this system available for people to play with? >> Yeah, sure. I mean, it's still in development, but absolutely you can; I will post the URL. Well, it's right here; you can just clone it if you want. It's StanfordSNR/gg. We'll see if we can get there. Here it is, and here we have a readme. Okay. So, yeah, we're trying to develop this as a usable piece of open source software, but it's definitely still in development. >> You need a license? >> What was that? >> You need a license. >> We need a license, yes.