HUGUK Nov 2014 Cloudera: Mark Grover & Ted Malaska

Alright, welcome everybody. Today we're going to talk about Hadoop application design. The slides I'm showing here are up on SlideShare under the hadooparchbook account. I am Mark Grover — you can follow me at @mark_grover — and with me is Ted Malaska, whom you can follow on Twitter as well. Since we're both over from the States, we joked about starting off with the American national anthem.

The stuff we're going to talk about is covered in much more detail in this book we are co-authoring — there are four of us, and the other two are not here. The book is called Hadoop Application Architectures, and you can buy it from oreilly.com. It's about eighty percent done, so what you buy today is an early release: raw and unedited, and as more chapters get added you get updates to it. I am a software engineer at Cloudera; there are a few Hadoop ecosystem projects I'm a committer and PMC member on, and I'm a co-author of the book. Ted is a principal solutions architect at Cloudera; he was previously at FINRA, which is a financial regulatory body in the States, and he's contributed code to Hadoop, HBase, Flume, Avro, and several other projects.

So what we're going to talk about today is application design in the context of one use case. The use case we chose is clickstream analytics. The things we cover apply to that use case, of course, but they also apply to others that look very much like it. The problem is simple: at the end of the day, many people want an analytics platform like Google Analytics, but they want to build something like it in house, with the data they already have. The platform looks like this: they have sessions or users, they want to track clicks, how many pages per session people see, what the average session duration is, things of that sort. What they start off with is something like this — plain, simple web logs — and the problem they're trying to solve is to go from those simple logs to a sophisticated system where they can derive insight from clicks. Now, they've figured out that Hadoop is really good and really popular for this clickstream analytics use case, but then they see all these things that sit on top of it — ZooKeeper, Hive, HBase, Sqoop, Spark, Parquet — and they wonder: which ones should I use? Do I need all of them? If not, how do I choose which ones to use, and how do I integrate the pieces I've chosen into my application? That's exactly the gap we're trying to fill: understanding a little bit about what each of these technologies does, choosing a technology stack, and then integrating those choices into the right architecture for your problem.

Once you've chosen all of this, you still have a few questions to answer. Do I store my data in HDFS or HBase? What format do I store my data in, and what kind of compression do I use? How do I design my schema, whether in HDFS or HBase? How do I get data into Hadoop, and how do I get it out? How do I manage metadata — that is, information about the data itself? There may be an implicit schema associated with the data: where do I store that schema, and how do I express it? There is also metadata about who's running what and who's populating which data sets — where do I store that, and how do I track it? Then: how do I access and process the data — do I use MapReduce, Hive, Pig, or Spark? And how do I orchestrate all of this? You usually have a number of processes that run one after the other, and sometimes you fork off and do one thing or another depending on the results — what system do I use to orchestrate that? So this is the context we're operating in. We obviously only have an hour, so we're going to talk about all of these things at a fairly high level; there's more detail in the book and in the longer presentations — for example, we have a tutorial that's three hours long and goes much deeper.

Alright, let's get started. The first architectural consideration we're going to talk about is data storage and modeling: how do I model my data, and where do I store it? There are basically two options — do I use HDFS or do I use HBase? — and then, once I've decided where the data lives, how do I lay it out: what file format and compression do I use? Talking a little more about the storage layer, you've got HDFS and HBase. We'll take HDFS, on the left, first: it stores data directly as files on the file system, and it's very fast for full scans of the data.

But it has poor performance for random reads and writes: if you want to pick out just one record from a file, that takes forever. HBase, on the other hand, stores data in HFiles in HDFS — essentially a custom format. It has slower scans, so if you want to scan a whole bunch of data from a file, HDFS is faster; but if you want to do random reads and writes, HBase is faster. So what did we decide for our use case? Our use case is clickstream analytics: we want to look at data for a month, two months, a year, ten years, at an aggregate level, so we rely heavily on scans rather than random reads and writes. Most of our queries are going to be scan heavy, so we choose to store our data in HDFS instead of HBase. You'll notice there is the incoming raw log data, and we also eventually process that into processed data sets — we'll talk about processing in a later section, but at this point all you need to know is that we'll have raw data and processed data. Both of those data types are very scan heavy in terms of how we query them, and that's why we can store both of them in HDFS.

So here's the storage format question: what format do we store the data in? The raw data we're going to store in a file format called Avro, with Snappy compression. We don't have time to go into the details of how we chose it; Avro is simply a row-oriented format, and Snappy is the compression codec we apply on top of it. The processed data we're going to store in Parquet, which is a columnar format. That comes in handy when you only operate on a small subset of columns in your data set and you're doing massive aggregations over large amounts of data: you can skip a whole bunch of columns, reduce the amount of I/O, and make things faster. So: raw data in Avro with Snappy, processed data in Parquet.

And here's one more thing: once we've decided we're storing data in HDFS, and decided on the file format and compression, how do we structure our HDFS schema to store that data? HDFS, at the end of the day, is a file system, so its structure — its schema — is a simple directory structure, and this is what we suggest. You'll have a handful of top-level directories. The first one is /user, which contains user-specific information; I as a user will have full access to /user/mark in HDFS. We suggest you create an /etl directory, which stores all the data that a pipeline is getting ready. If you think of your cluster as a restaurant, /etl is the kitchen in the back where end users are not allowed to go; it's only for the chefs and trained professionals cooking up the data. Your BI analysts and end users are not allowed to read, or even see, that data; it is solely for the specific ETL processes, and the ETL engineers, who are allowed to go into that directory, read things, and process and write data there. /tmp is fairly obvious, much like on any file system: it's temporary. If I'm going to share something with Ted, I'll drop it in there and send an email saying "look at /tmp/...". /data — continuing the restaurant analogy — is the seating area of the restaurant: this is where you put out the stuff that's ready for consumption. This is where your end users go; everybody has read access, and almost no one has write access except the one process that populates each directory.

Then there's /app. You'll find that HDFS is often used to store not just data but code or JARs that need to be used by multiple nodes of the cluster. For example, if you're running Oozie as your orchestration tool, you can drop your Oozie workflow files in there, because technically any node of the cluster can run that workflow, so every node needs access to those files. Those go in /app. Again, people only need read access there, and the processes only need read access too; we create a separate directory for it because it doesn't really house data — it houses code used by the various processes.

And there's one more thing in advanced HDFS schema design that a lot of people use: partitioning. Oftentimes you have your data in one directory — the directory on the left is a data set with all its files sitting together; that's unpartitioned. When you partition it, it looks like the structure on the right: you've got the same files, but they are now present in subdirectories of their own.
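Since the slide itself isn't reproduced in this transcript, a typical before-and-after looks roughly like this (the data set name and dates are made up for illustration):

```
/data/clickstream/clicks/part-00000.avro                  # unpartitioned: all files in one directory
/data/clickstream/clicks/part-00001.avro

/data/clickstream/clicks/dt=2014-11-01/part-00000.avro    # partitioned: one subdirectory per day,
/data/clickstream/clicks/dt=2014-11-02/part-00000.avro    # each named column=value (here dt=<date>)
```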

Those subdirectories are named in a particular way — column equals value — and the files sit inside them. Essentially, all the files that live in a col=val subdirectory contain only the data corresponding to that particular value of that column. So if you find yourself querying the data over and over with a WHERE clause on that column, you've reduced the I/O roughly by a factor of the number of partitions, because you only read the one subdirectory you need. This is fairly commonly used, it's called partitioning, and most people do it by timestamp, which is what this slide shows. The question then becomes: which column do you partition by? And the answer is, it depends on how you query the data most often. In our use case you'll likely query the data like: "show me how many people from London in the last three months came to my website and bought umbrellas." Notice the key component there — "the last three months." You could extend the query to six months, and you may have data for ten years; we want to make sure you only read the data from the last three months and avoid the I/O overhead of reading everything else. So we'll partition our data set by timestamp, and that can manifest as daily partitions, monthly partitions, or even hourly partitions, depending on how much data you get per partition. You want neither too little data in a partition nor an enormous amount: the canonical size is about one gig per partition on average. If partitions are much bigger than that, you carry more I/O overhead than is technically optimal; if you have too many partitions with very little data in them, the metastore that tracks those partitions gets bogged down. So fewer than about 10,000 partitions, with at least a gig or so per partition, is a fairly good canonical number.

Alright, next we move on to ingestion. We've talked about how we store this data, but how do we bring it in? Ted's going to talk about that.

[Ted] Alright, so there are roughly three different ways to get data into Hadoop. The easiest is the put method — how many people here know what this is, or have actually used it? Alright. Essentially it just puts the file in; it's like a copy on your operating system. There are other ways to do it too, the NFS mount and things like that, but the main point is that it's reliable and the data is guaranteed to keep the same file structure: if you put up one file, it ends up as one file in Hadoop, with the same name, the same size, everything in the same order. That isn't necessarily true for the other approaches. If we use something like Flume or Kafka to get the data in, we're actually moving parts of the file, and the file on the other end may end up as many files: the data will be in whatever order it landed, and how it's broken up depends on how we decided to roll the files. The same data will be over there, but it won't be in the same structure; it won't be byte-for-byte the same.

Nowadays a lot of the questions that come up are: should I use Flume or Kafka? How many people know Flume or Kafka? Alright, about half. So, quickly: Flume is a streaming ingestion tool in the same lineage as Facebook's Scribe; there was Flume 1, and now it's Flume NG. Essentially — and we'll go through this in a couple of slides — it's made of a few components, and the big ones are these: the source, which takes in data; the interceptor, which can do something to it; a channel, which is a kind of buffer area; and the sink, which says "I want to write to something." It's traditionally used to move data from any type of source into HDFS or HBase. One of the other contenders in recent days is the Apache project Kafka. Kafka came out of LinkedIn, and Kafka itself doesn't exactly overlap with Flume. How many people are familiar with an MQ-style system, like RabbitMQ? OK — so Kafka is like an MQ, but a couple of its big benefits are that you can replay data for up to, I think, seven days by default (it's a configurable value), and it has really advanced replication, which lets it protect against data loss or interruption — something Flume does not have.

But Kafka doesn't actually have anything that consumes from a source or persists the data into HDFS for you; what it is, is a message-delivery system. So when Cloudera decided to support Kafka, the question was: Kafka or Flume? Kafka versus Flume? And what we really came up with was Kafka in Flume — which means we get the best of both worlds. Remember the different parts of Flume: you have a source, you have a channel, and you have a sink. One option is to send the data through a source and go through the normal Flume architecture to be ingested. But some people say: I don't want two buffers — I don't need a channel in Flume and then Kafka as another message buffer. So the idea here is simple: we remove the source, we use Kafka as the channel, and then we just use the sink to write directly from Kafka to HDFS. Any questions? OK. So again, this is delivering event data to HDFS, writing in as many threads as we can muster; all the data will be there, it just may be in a different order, and it definitely won't be in the same file names. And it's regular Kafka: Kafka has this idea of producers and consumers, and all this is, is a proxy on top of Kafka that says "I want to consume and hand the events to the sink." That's all it is; there's no magic.

[Audience question about versions] Yes — there is a version of Kafka that ships with the distribution and is supported to some level, but in theory it should work with any recent version; it's nothing custom, it's just a regular consumer. Essentially, if you look at the implementation, there's a producer — if you chose to, you could still use a Flume source, and the source then becomes a producer into Kafka — and the sink is essentially a consumer. So you get the best of both worlds, and the two systems don't really overlap at that point.

[Audience question: what's the advantage of using the Flume sink rather than writing from Kafka to HDFS yourself?] OK — Kafka by itself does not write to HDFS; you would have to make something write it. You'd have to write a writer, or use something random you found on GitHub, whereas the Flume HDFS sink has been tested for many, many years, has advanced functionality, and is essentially always on and always consuming. Now, in terms of capacity: remember that the sink itself — I have a diagram — is not resource heavy. The sink essentially says "give me events" and then writes, so its real limitation is network bandwidth to the cluster. That means you can put as many sinks on a given node as you want until you max out the network: if you have 10 GbE you'll realistically get maybe four or five gigabits out of it, but you can scale these up pretty significantly, and every additional node you add gives you more of them. It doesn't have to be a beefy node either, because you barely need local disk — you're using Kafka as your channel, so Kafka's disks are the buffer — and you can even co-locate these on the same nodes as Kafka. So I don't know of a way to get data in faster than this; it's about as simple as it gets.
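For context on the producer side — which the talk doesn't walk through in code — publishing a click event to a Kafka topic with the era-appropriate 0.8 Scala producer API looks roughly like this. The broker addresses, topic name, and log line are made up for illustration:

```scala
import java.util.Properties
import kafka.producer.{KeyedMessage, Producer, ProducerConfig}

// Hypothetical producer: a web server (or a log-tailing agent) publishing raw log lines
// to a "clickstream" topic, which the Flume Kafka channel then drains to HDFS.
val props = new Properties()
props.put("metadata.broker.list", "broker1:9092,broker2:9092") // illustrative broker list
props.put("serializer.class", "kafka.serializer.StringEncoder")

val producer = new Producer[String, String](new ProducerConfig(props))

val logLine = "10.0.0.1 - - [12/Nov/2014:10:00:00 +0000] \"GET /umbrellas HTTP/1.1\" 200 1234"
producer.send(new KeyedMessage[String, String]("clickstream", logLine))
producer.close()
```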
[Audience question: suppose I want the data partitioned as it comes in — how does that work?] So if you want to do partitioning on the way in, there are a couple of things you'd do. You would probably use different topics; but there's also a mechanism in the HDFS sink — escape sequences in the sink's path — so you can do the partitioning on the fly in the sink as well as with the topics, like I said.

Any other questions? So, the Kafka enthusiasts will say: if you lose a Flume node, there's an interruption. There could be a couple of events sitting inside that buffer when you lose the node; they're on disk, so you don't lose them, and when you bring the node back up they continue — but you had a delay from when the event hit. That's the big worry they raise versus Kafka. If you turn on a replicated Flume channel, it's twice as much hardware, because now you're storing the data twice and using twice the network; if one node goes down, it pulls from the replica and succeeds. So that would be the only reason: if you had an SLA that every event must be delivered to HDFS within a minimal amount of time, you would probably need a replicated channel; if it's just "I want to get stuff there as fast as I can," it's probably not worth it. [Mark] To pile on to that: of course, if you have data going from multiple sources to multiple destinations, Kafka is good for that as well. [Ted] Right — the argument the Kafka person makes is about that handful of events stuck in the buffer of an agent that goes down; if that's not an issue for you, it's not an issue for you. And replication you pay for — you pay the replication cost in network and in hardware.

Alright, let me skip quickly through the Flume architecture slides, since we've used a chunk of our time on questions. Essentially, the way you do it is: you decide how many tiers and how many buffers you want; many clients feed a Flume agent, and you have a smaller number of agents; and essentially you want roughly a sink per disk on your HDFS cluster, and hopefully you have a good number of those. For our clickstream example we're essentially going to Flume everything in, because there's no reason not to stream it. And by the way, if you do need to put files in, don't do one file at a time: typically you can read from a disk with several threads at once, and every additional thread writing to HDFS gets another spindle to write to, so you can go much, much faster than one file at a time. So the data that's coming in in real time, we're going to Flume in — that's our use case. And Sqoop is there if you're getting data in a different way: if you don't know Sqoop, you write a SQL statement and it pulls the data out of your relational database as fast as possible.

Now, remember, this next part is the stuff I really want to talk about, so forgive me for rushing the ingestion piece. This is processing engines. It seems like this audience is relatively new to some of these things, so I want to walk through it. Essentially, in my world right now there are four groupings of processing engines — there are others, but these are the ones I find useful and probably the most mature. The first one is MapReduce. It's an oldie but a goodie. It's batch: the best speed you're going to get out of MapReduce is probably 30 seconds to 2 minutes — that's your best response time, and that's with a small amount of data — and the reason is that it likes to touch disk a lot. This is an image I came up with to try to explain what MapReduce is doing: it reads all the data in, it spills data out with a lot of sorting, it reads it back in, it writes out partitioned data, that gets read and transferred over the network, immediately written down to local disk, read back in, and eventually written out. Very, very I/O intensive. That's fine, because Hadoop has lots of disks, and for a long time this was the standard — but there are a couple of issues with it. One is the speed, and we'll get to that with Spark in a minute. The other is the code. How many people have written MapReduce? Alright. How many people really enjoy writing MapReduce? How many have written a MapReduce job over 100 lines? OK — so MapReduce is pretty heavy. I don't have the slide here — I haven't put it together yet — but it shows a MapReduce job that does a join: it was about three hundred lines of code, and then I wrote the Pig or the Spark equivalent and it was about two lines.
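To give a flavour of that contrast — this is not the 300-line example from the slide, just an illustrative sketch where `clicks` and `users` are assumed to be already-parsed RDDs of simple case classes — the join itself in Spark is essentially one line once the two data sets are keyed:

```scala
// Illustrative only: assumes `clicks: RDD[Click]` and `users: RDD[User]` already exist.
case class Click(ip: String, ts: Long, url: String)
case class User(ip: String, account: String)

val clicksByIp = clicks.map(c => (c.ip, c))   // key each RDD by the join column
val usersByIp  = users.map(u => (u.ip, u))

val joined = clicksByIp.join(usersByIp)       // the join itself: one line
```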

So the first thing people addressed about MapReduce was the fact that it was very awkward to use: it took a computer scientist with a pretty decent understanding of parallel programming not to screw it up. So people at Facebook developed Hive, which is a SQL interface on top of MapReduce. If anyone wants to get started with Hadoop, this is normally where they start, and for a very long time this is how I made my money: I would go into companies that used Hive exclusively and were complaining that everything was slow, and I would move them off Hive onto something faster. Remember, Hive is one of the slowest things on the planet, but it's a good way to start. The other language down here is Pig — has anybody used Pig? It's a nice scripting language. One of its downsides is that you have to learn Pig; the other is that when you're getting started there are these concepts called tuples and bags, which can be frustrating for someone who is a Java programmer or a SQL programmer. But it is better than writing raw MapReduce. Then there are some others down here, Crunch and Cascading — probably in that order of usage, though Hive still has maybe eighty percent of the market share to their ten and five. These are ways to program MapReduce in Java in a style built off a concept from Google called FlumeJava; essentially they aim for a functional programming style. I'm not going to go into them too much, because I think Spark has kind of perfected that idea — when you look at Spark, it has taken this to a very clean way of doing things. I don't have a code example on the slide, but I can walk through some clearer examples later.

Then there's Spark. Has anyone heard me promote Spark before? OK. Spark is the new kid on the block that promises to cure cancer, and so far it's doing rather well compared to the other things that have tried to cure cancer in the past. It's very, very fast. They just published some of the latest benchmark numbers — some people say they cheated — but it's still much, much faster. One of the big benefits is that where MapReduce touches disk constantly, Spark tries to avoid all of that and keep things in memory. More than that: if I were doing something complex in MapReduce I'd have many MapReduce jobs chained back to back to back; in Spark, if it fits in memory, it reads the data once and tries to keep it in memory all the way until it writes the result out, skipping all that intermediate I/O. That makes it very good for graph processing, iterative processing, machine learning, and things like that. On top of that it's about ten times less code, and the code is readable: if you write Scala or Java it reads like what you're trying to do, as opposed to "you're writing MapReduce in order to do what you're trying to do." Very powerful API, very powerful machine learning, and it works in Scala, Java, and Python, which is really nice. We'll get to what RDDs are in a minute, and then it has this concept of a DAG engine — I don't have a picture of the DAG engine up here. So, DAG engines: one day I decided I was going to spend the whole day — I told my kids to go away — learning what a DAG is, because DAG engines were the new thing. Who knows what a DAG is? OK: it's a graph with no loops in it — a directed acyclic graph. I googled it and thought, oh my god, it's so simple. The concept is essentially the execution plan: if you were using Hive or Pig, this is essentially the output — your execution plan.

Briefly about RDDs: they're how Spark puts data into memory, and there are lots of options — they can spill to disk, they can be compressed in memory, they can be serialized in memory, or they can stay as plain objects in memory. You have all the options in the world, but the big thing is they try to avoid hitting disk, and they're immutable. Oh, and one more thing about Spark before I move on: of all the projects at Apache, Spark currently has more contributors than any other, beating even HDFS, which is the second contender, and it was the fastest project in history to go from incubator status to full-blown top-level Apache status.
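Those storage options just mentioned map onto Spark's StorageLevel settings. A minimal sketch, assuming an existing SparkContext `sc`, an illustrative HDFS path, and a hypothetical `parseClick` function:

```scala
import org.apache.spark.storage.StorageLevel

// Hypothetical: cache parsed clicks serialized in memory, spilling to disk if they don't fit.
val clicks = sc.textFile("hdfs:///etl/clickstream/raw")   // illustrative path
  .map(line => parseClick(line))                          // parseClick is assumed to exist
  .persist(StorageLevel.MEMORY_AND_DISK_SER)

println(clicks.count())   // the first action materializes and caches the RDD
```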

So in terms of potential and momentum, this thing is on fire compared to anything that has existed before; there is huge potential here. And there are issues — I've worked with data sets up in the tens of billions of records and run into some — but as soon as they're identified they get resolved quickly. If you have questions about Spark, ask away.

And then there's Impala. Google figured out a long time ago that Hive — or rather the concept behind Hive — sucks, because they had their own equivalent, and they asked: why can't I have what I had with MPP engines? Has anyone heard the term MPP — massively parallel processing databases, kind of like a Netezza or a Vertica? OK. So Google came up with a combination of two things, Dremel and F1, and essentially the point of Dremel and F1 was: I want the same speed as those systems, on top of cheap hardware. So what did we do at Cloudera? We hired one of the people who built on Dremel and F1, and we essentially rebuilt that as Impala. Impala does not use MapReduce; it's mostly written in C++, it does code generation on the fly so it's really fast, it interacts with HDFS and HBase, and it uses the Hive metastore — so anything you can see in Hive you can work with in Impala — and you can talk to it over JDBC and ODBC. The big kicker is that it's really fast: you can get queries back in seconds. A couple of weeks ago I was describing a roughly three-minute join between two billion records and half a million records, while a filter on the two billion records to pull out a select number of rows was taking about two seconds. Two billion records on ten nodes in around two seconds — it's very, very fast.

Impala really likes this format called Parquet, which we talked about earlier. If you're interested: Parquet is a columnar format — remember columnar formats? — which means it stores the data as columns instead of rows, and it has a couple of benefits. One is that it typically gives you about twenty to forty percent better compression ratios, and if you're only reading a portion of the columns it's far faster than reading everything, because it can be precise about which columns it pulls. Questions on these? [Audience: if you're using Avro and you want this Parquet table you showed, do you have to reformat everything, or can you mix and match?] You can mix and match — you can use any of the file formats with any of these tools, so Avro is fine, and a lot of people use Avro. The nice thing about Parquet — we were just talking about this today — is the APIs: Parquet has an Avro interface where you can give it an Avro schema and then read from it and write to it as if it were Avro; you can do the same with Thrift. So a lot of the time I recommend to my customers that they have an Avro-centric world. Avro has one huge benefit for streaming, because it has sync markers — like checkpoints — so if you have a failure during your streaming process you only lose the section that failed, instead of the entire file. So I normally use Avro for that, and then use Parquet for my table storage, especially when you're reading only a portion of the columns; if you know you're always going to read every single column, then compression is the main reason left to go columnar.
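As a rough sketch of that Avro-on-Parquet interface: the class below comes from the parquet-avro module, but the exact package prefix and constructor vary by Parquet release, and the schema, path, and values here are made up for illustration.

```scala
import org.apache.avro.Schema
import org.apache.avro.generic.GenericData
import org.apache.hadoop.fs.Path
import parquet.avro.AvroParquetWriter   // org.apache.parquet.avro in newer releases

// A tiny, illustrative click schema expressed in Avro...
val schema = new Schema.Parser().parse(
  """{"type":"record","name":"Click","fields":[
       {"name":"ip","type":"string"},
       {"name":"ts","type":"long"},
       {"name":"url","type":"string"}]}""")

// ...written out as a Parquet file, while the code only ever sees Avro records.
val writer = new AvroParquetWriter[GenericData.Record](
  new Path("/data/clickstream/clicks.parquet"), schema)

val rec = new GenericData.Record(schema)
rec.put("ip", "10.0.0.1"); rec.put("ts", 1415836800000L); rec.put("url", "/umbrellas")
writer.write(rec)
writer.close()
```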
I normally have some numbers on my laptop and I can't recall them exactly, but usually, versus Snappy on a row format, it's about twenty to forty percent better. And the reason is — it's not doing anything crazy. I come from the US, and I'm sure you have a stock market here; we've got one too. In our stock market the data might be ticker, time, and price. If you sort your data by ticker, then time, and put it in a columnar format, then for a long, long stretch the first column is the same value — say the Facebook ticker — and it compresses down to almost nothing, a few bytes for an entire gigabyte's worth of rows. Then time is a number that only increases; you store it as a long, and Parquet actually has an encoding that knows it's a number that steadily increases, so it only stores the delta.

Then it compresses that, and it turns into almost nothing. The price almost never changes much either, so once it's sorted you store the delta and compress it down to nearly nothing. Avro can't do that: Avro sees the data as rows; it doesn't see a column as a value that's progressively changing. So Parquet isn't doing any kind of magic — who has done image compression in the past? — there are only so many compression algorithms, Huffman and the like; the trick is to remove as much noise as possible before you apply the compression. That's all Parquet is doing: it separates things out into logical areas so that when the compression codec runs, there's less entropy within the data being compressed.

[Audience question: with the momentum Spark has, and Spark SQL being the new thing, where does that leave Impala?] We were talking about this the other day. I don't have the numbers with me, but Impala is still way, way faster as things stand right now. Spark runs on the JVM, so it's unlikely to ever match Impala's speed. From Cloudera's point of view, we're very much an evolution company: whoever is winning is what we're going to use. Right now it's Hive and Impala; very soon it's going to be Impala, Hive on Spark, and Spark SQL, and we'll ship all of them, and whichever one helps the client is the one that gets used. This also helps us internally, because it pushes our Impala team to keep pushing the envelope and pushes our Spark team to make their stuff better — competition moves everyone forward. But for right now, at least as the numbers stand, Impala is way faster and way better for multiple concurrent users than Spark SQL, whereas Spark SQL is probably more configurable and easier to slot into an ETL process. I have not seen Spark SQL match Impala's performance — I've seen some benchmarks where it sort of, kind of does, but understand that anybody who publishes a benchmark has a motive, so always test it yourself. And to be fair, Spark has a huge advantage over Impala in that Spark is not just SQL: you can do a lot of things that SQL is bad at, outside of SQL. So if I had a bunch of analysts that didn't know Java, I'd say: use Impala. If I had people with a decent skill set, who knew Scala or Java and were building an ETL process, I'd say: use Spark. Impala is more for your SQL jockeys who need aggregations and regular queries, and Spark is for the heavier lifting — at least, that's what the numbers I've seen say. Alright, any other questions before we go into what we're actually doing with all of this? OK — back to Mark.

[Mark] Alright. Ted gave a very good overview of the engines we have and their advantages and disadvantages; now we want to bring those engines into the context of our use case and talk about what makes sense where. We won't have time to go very deep, but we'll talk about the four things you usually need to do in clickstream analytics. Here's the first one. People come to your website, and your web server simply tracks which clicks happen and which users arrive; but oftentimes you want to group those clicks into visits, or sessions. I came to the website, clicked a bunch of pages, went away, and three hours later showed up again — that's a second visit. So the first thing people usually want to do is sessionization: these three clicks happened together; then, more than thirty minutes later, some other clicks happened; and then some other person showed up and had their own set of clicks. Those are three sessions — the first two from one person and the last one from a different person. Why sessionize? Because it helps answer a bunch of the questions that marketing analysts want answered. How do you sessionize? We need to figure out two things.

The first is: given a list of clicks — essentially the log file from the web server — how do I figure out which clicks come from the same person? The second is: once I have the clicks on a per-person basis, how do I figure out which clicks belong to the first session and which belong to the second, just by looking at the logs? For the first one, you can use the IP address, or cookies, or a combination of the two; we want to keep it simple, and web data — like most data at this scale — is not one hundred percent accurate anyway. The goal here is not to end up with something perfectly accurate; maybe five or ten percent error at most is a reasonable target. So to decide which clicks come from the same user, we'll simply use the IP address. For the second question — which clicks are part of the same session — there are tools out there that have already settled on thirty minutes as a good number, and that's realistic; it may depend on your organization and use case, and you can tweak it, but we'll go with that. So all we say is: for two clicks from the same user that are more than thirty minutes apart, the earlier click ends session A and the later click begins session B. That's sessionization.

We can use MapReduce or Spark for this; the code for both is available on our GitHub, and the link is on the title slide. We're going to choose MapReduce because — while Ted can speak volumes about Spark — Spark is evolving at a very rapid pace, and the API you use to write client applications actually changed recently. Simply because the MapReduce API is stable and more people understand it, we choose MapReduce; but the code for doing the same thing in both MapReduce and Spark is on the GitHub repo, and you can go and see how long the MapReduce code is versus how short the Spark code is. There's also a Spark Streaming example on that GitHub: it does this sessionization in ten-second intervals and writes the results to HBase, so you get near-real-time results. [Mark] We didn't talk about Spark Streaming — do you want to say something about it? [Ted] Sure — I work a lot with TD, who is the lead committer on Spark Streaming. Spark Streaming is essentially Spark in a loop, with some added machinery around that; it keeps the JVMs up, which allows a lot of reuse. We removed some of the slides, but the main idea is you say "here's the thing I want to do, and I want it repeated" — you only write it once. The same code you could essentially run in Spark as a batch ETL job, but now you say: these are the instructions I want repeated, and here's how I want to sessionize. One of the other things that lets it work in an iterative fashion is that you can carry RDDs forward into future iterations. So for streaming sessionization, you would keep an RDD holding placeholders for all the currently open, active sessions; each time the interval fires you say: I've got these new records — should I update any of the active sessions, or are some of them now dead? At the end of that process you use some counters, send the summary results to HBase, and then you can query HBase to get real-time results.
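A minimal sketch of that "carry the open sessions forward" idea, using Spark Streaming's updateStateByKey. The state class, the `clickStream` DStream, and the paths are made up for illustration, and `sc` is assumed to be an existing SparkContext:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical state per IP: when the open session started and when we last saw a click.
case class SessionState(start: Long, lastSeen: Long)
val SessionGapMs = 30 * 60 * 1000L

val ssc = new StreamingContext(sc, Seconds(10))          // 10-second micro-batches
ssc.checkpoint("hdfs:///tmp/clickstream-checkpoints")    // required for stateful operations

// `clickStream` is assumed: a DStream[(String, Long)] of (ip, click timestamp) pairs,
// built from whatever source feeds the cluster (Flume, Kafka, ...).
val sessions = clickStream.updateStateByKey[SessionState] {
  (newClicks: Seq[Long], state: Option[SessionState]) =>
    if (newClicks.isEmpty) state                         // nothing new this batch; keep state as-is
    else {
      val first = newClicks.min
      val last  = newClicks.max
      state match {
        case Some(s) if first - s.lastSeen <= SessionGapMs =>
          Some(s.copy(lastSeen = last))                  // these clicks continue the open session
        case _ =>
          Some(SessionState(first, last))                // gap too large (or no prior state): new session
      }
    }
}
// sessions.foreachRDD(rdd => ...)                       // e.g. roll up counters and write them to HBase
ssc.start(); ssc.awaitTermination()
```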
Oh, and one other thing about Spark Streaming that is really important: in 1.2 they fixed the last way you could lose data in Spark Streaming, which is very exciting to everyone. Previously, in 1.1 and before, you could lose data if the driver process died. If you don't know what that means, don't worry about the specifics: there's a process called the driver, and the driver keeps the metadata of where all the RDDs are in memory; if you lose the driver, you've lost all that metadata, which means you have all this data in memory and no clue what it is. What they did to solve it was add a write-ahead log: the driver writes to the write-ahead log, and if the driver dies and a new driver starts up, it reads the write-ahead log and you can't lose data. That's really cool, and it's in 1.2. But in many ways it's the same everything you get from Spark, just a different story: it's really easy to code, pretty much all the ideas from Spark carry over, and you can even do a lot of machine learning operations in Spark Streaming, which is kind of nice — we can talk about it later if you want — for example you can run k-means to see whether things are deviating from the norm in real time and then signal people that the behaviour of what we're watching is changing, so whatever rule you built to catch fraud may not be valid anymore, because people are doing different things.
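For comparison, the batch version of the thirty-minute rule described earlier boils down to something like this in Spark. This is a minimal sketch, not the book's actual MapReduce or Spark code (that's on the GitHub repo); the Click class and the `clicks` RDD are assumptions:

```scala
// `clicks` is assumed to be an RDD[Click] parsed from the raw Avro logs.
case class Click(ip: String, ts: Long, url: String)
val SessionGapMs = 30 * 60 * 1000L

val sessionized = clicks
  .map(c => (c.ip, c))                         // IP as a rough stand-in for "same user"
  .groupByKey()
  .flatMap { case (ip, cs) =>
    val sorted = cs.toSeq.sortBy(_.ts)
    var sessionId = 0
    var lastTs    = -1L
    sorted.map { c =>
      if (lastTs >= 0 && c.ts - lastTs > SessionGapMs) sessionId += 1   // >30 min gap: new session
      lastTs = c.ts
      ((ip, sessionId), c)                     // key each click by (ip, session number)
    }
  }
```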

[Audience question: what about Storm versus Spark Streaming?] Yes — we at Cloudera had a huge debate, lasting almost six months, on which one to support and which one to really go after. We talked to a lot of our clients who love Storm and a lot who love Spark Streaming — and don't get me wrong, we have plenty of clients running Storm, we know how to put in Storm, and there's Storm code in the examples in the book. But for the use cases most people actually have, at least in our benchmarks, Spark Streaming was faster: with lots of small events, Storm has to do a record-level ack all the time and keep track of it, so the longer your topology is, the more that adds up — ack after ack after ack. The other consideration was the complexity of coding in Storm compared to Spark Streaming: in Storm you build a topology — a bit like writing mappers and reducers by hand in MapReduce — whereas in Spark you say what you want to solve and it builds the topology for you. It's kind of like Ant versus Maven, the way I see it: in Ant you say "do this, then this," and in Maven you say "I want a project." So we went back and forth and back and forth, and then there's the added advantage of the lambda architecture, where you want to do some things in streaming but then do them again in batch — partly so you can prove to yourself that they come out exactly the same — and the fact that you can reuse most of your code, not all of it but almost all of it, was a huge win. So the company as a whole decided: we will support Storm through services engagements, but as a whole we're going after Spark Streaming. We really think that's the future; that was our take.

[Audience question: you said you keep state in Spark Streaming — how does it get bootstrapped when you start off from batch data?] Yes, there are a couple of tricks for that. You can stream from many locations, and you have these things called receivers — almost exactly like a Flume source — and you can have one that says: when I start, read everything in this directory, and unless something new arrives I'm not going to read it again. So it only reads it the first time; that's one simple way to do it — there are others, but it's a simple example. Once it has read that in, it can populate the state, and as long as nothing more lands in that directory, you're fine. So yes, there's definitely a way to do it, because you do need something to start with. And the important thing — this is very important — is that it's not one RDD that lasts forever: RDDs are immutable. The RDD of state from pass one exists for pass two, but then pass two makes another RDD for pass three, and the one from pass one drops off. An RDD never gets modified in place; the process just keeps making new ones.

[Audience question: what happens if a node dies — do you lose state?] That depends on what resiliency level you set on your RDDs. If you set your RDDs to a resiliency of one, then yes; if you give them more resiliency, you can lose that many nodes. So it depends on your resiliency level — a lot of people just use a resiliency of one and then use the lambda architecture to deal with any issues. Cool — any other questions? [Audience comment about exactly-once] This is important: the number of ways you can lose data in Spark Streaming is smaller than in Storm, but neither can give you exactly-once semantics end to end — almost nothing can — so you have to do lambda anyway. Sure. I have more material on that, but let's keep moving.

[Mark] Alright, so that was sessionization: we chose MapReduce over Spark, but the code for both is available on GitHub. The next piece of processing most people want to do — and again, this doesn't necessarily apply only to clickstream — is filtering.

Oftentimes — because of the infrastructure guarantees we were just discussing, where exactly-once is very hard and you usually settle for at-least-once — you may have duplicate records in your data, so deduplication is fairly important. But you may also have clicks that are incomplete, or records from sources you don't want to hear about — spiders and bots coming to your website and clicking — and you want to filter those out. For example, in the clickstream analytics use case we want to filter out particular IP addresses whose clicks we've found to be useless. This can be done in MapReduce, Spark, Pig, or the SQL engines: in SQL it's as simple as saying WHERE ip NOT IN (this list); in Spark it's a simple filter function; MapReduce is probably a hundred more lines of code for the same thing, but you can do it in any of them. What we decided here was to simply embed this into our sessionization job: the MapReduce job we just talked about, which does the sessionization, will also filter out all the records from IP addresses we don't care about. You could certainly have a standalone process, upstream of sessionization, that does the filtering, but we decided to use the same process.

The next one is deduplication. Again, exactly-once semantics are hard, so you may have records that appear multiple times, and deduplication simply gets rid of the duplicates. It can be done in all the engines: it's a single line of code in Spark, longer in MapReduce, and in Pig it's also a single statement — you say DISTINCT on the data set and it does the distinct for you. That's what we use: the code you'll find in the GitHub repo uses Pig for this, but it's a really simple thing to do with the other engines as well.

The last one is BI. Remember, we have data coming in, we have the raw data set, and then we generate processed data sets via that processing — the sessionization, filtering, and deduplication. Now we want some BI tool like Tableau or MicroStrategy, or maybe a custom tool of your own, to access particular data or aggregations from the processed data set, and for that we use Impala at the end. Impala, as Ted mentioned, is a very fast SQL engine; you want a near-real-time answer to the queries you run — so "how many people in the past three months bought umbrellas in London" comes back very fast — and it has regular JDBC and ODBC connectivity, which those BI tools already speak.

Then there's orchestration, which is about lining all these pieces up so one thing triggers the next. I don't want to take up the next speaker's time, so I'm going to skip over it; it's in the slides and you're welcome to go through it.
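The filtering and deduplication steps just described really are about a line each in Spark. A sketch, with the bot IP list and the `clicks` RDD made up for illustration:

```scala
// Hypothetical: drop clicks from known bot/spider IPs, then remove duplicate records.
val botIPs = Set("66.249.66.1", "157.55.39.1")             // illustrative addresses

val cleaned = clicks                                       // RDD[Click] parsed from the raw logs
  .filter(c => !botIPs.contains(c.ip))                     // filtering
  .distinct()                                              // deduplication
```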
So, the final architecture — I just want to put all these pieces together and walk through it. The first part is on the very left: ingestion. On the top left you see all the web servers that have data, and we ingest that via Flume, optionally with Kafka in the mix as well. If you have secondary data sources — operational data stores or CRM systems that hold related data in a relational database, which you want to join with the click data, say how many clicks are viewed every day — that relational data you import with Sqoop. Number two is the big centre star: Hadoop, our bread and butter, where all the processing happens — the deduplication, sessionization, and filtering happen there. Number three is the stuff on the right: after the processing has happened, what do we do with the data? On the top you see the BI tools like MicroStrategy; at the bottom you see that you can use Spark to do additional machine learning analysis or graph processing on it, you can export it to your local disk and analyse it in Excel or R or Python, and you can build custom apps on top of Hadoop that do whatever else you need. Then you have orchestration, which we didn't get a chance to talk about, but that's essentially the bar at the bottom tying everything together, so one thing triggers the next. Alright, that's about it — thank you very much. We have time for a few more questions. Any other questions for us?

[Audience question about where ETL tools like Informatica fit in] So, I do see them in place at a lot of our partners and customers — tools like Informatica. It depends. I think what those offerings give you is that they lower the bar of skill needed: they cost a licence, but they let you build in Informatica what you would otherwise program by hand, and they give you operational views on top of it. I think those are their main value propositions. [Mark] I'll just speak to the orchestration part of that, since the question touches on running orchestration through one of those tools versus something like Oozie. The advice we give is that the orchestration tool to use is the one you have expertise in: if your organization has an investment, in time and in people, in a particular tool, and it works with Hadoop — and many of our partners' tools do — that's a fair choice. If you're starting fresh, there's probably not a whole lot of benefit in buying a heavy tool; I think all these tools are good, but if you have an existing investment, it saves you a lot of the hassle of retraining people who would otherwise have to start using the CLI or some other tool.

[Audience question] I have not seen that, honestly. Like I said, Spark has more momentum than even HDFS right now, so I would not want to be competing with Spark.

[Audience question: how do you manage metadata in Hadoop?] There are a few options. The first, the de facto one, is the Hive metastore, which was the first tool intended to store metadata about data in Hadoop. There are really two kinds of metadata here. The first kind is the implicit schema of your data — the fact that it has these columns with these names. That goes into the metastore, and Avro also stores that schema itself: an Avro file carries its schema right alongside the data. The second kind is more like lineage and audit: say, hypothetically, you're at a financial institution and you want to know things about a file — how it ever came into the system, how it moves through the system, who has touched it — because you have to report to some kind of regulator. On the Cloudera side that's handled outside the purely open source pieces, by Navigator; I know other vendors produce similar things. On the Cloudera side I know customers have gone through audits with it — MasterCard, for example, has passed PCI compliance with us through that route. PCI in America is all about personal information and credit cards — and you have the equivalent here too. So we're going in both of those directions.

[Audience question: when is the next update — will Spark 1.2 be in it?] We're on 1.1 right now, and 1.2 is still being finalized upstream, so it's unlikely it will be in CDH immediately. Normally what we do is run our own set of tests, our own integration tests, and a release doesn't go in until it passes all of that.

If we find that the build itself is not stable, but there are components of it that our customers want, we pull back some of those pieces and release them as, say, the existing version with additional patches — you always have to weigh the instability. And Spark is moving so fast that yes, there may well be things like that.

[Audience question: how much data can you lose during ingestion?] It depends. If you want truly zero loss, let's talk afterwards, because there is always some way you can lose data in any kind of streaming environment — even with Kafka: if somebody takes an axe to a server and you lose a single in-flight record, that's it; at that point you're doing straight synchronous writes to multiple systems. But Kafka with replication is about as low-risk as it gets, and then Flume would be the next best. Once you're in Hadoop, you're solid: once you hand it to Hadoop it's on three nodes, usually replicated within about a minute. The thing to watch is getting it there — especially if your servers are way over here and the data goes through networks and load balancers and machines in the middle; the more machines there are, the more opportunities for something to go wrong. But for one stream of data, losing data is really unlikely. I have a cluster — actually eight clusters, one of them quite big — and some guys, this was in AWS, didn't realize one of them was a cluster and were changing out power supplies. I could see my nodes disappearing every fifteen minutes, because I think it took the guy about fifteen minutes to go from rack to rack, and we didn't lose any data: HDFS was able to re-replicate fast enough and do its internal rebalancing. We finally called them and said "please stop turning off our nodes" — but the point is, as long as the failures come slowly, you're fine. If you lose three nodes within a minute, you're going to lose data. I also have banks — the most worried customers of all — and one of them has done rack locality across power stations, so it has to lose more than two power stations, or availability zones, to lose data. So it depends on how paranoid you are and how you set up your replication and rack topology; that particular setup would survive quite a lot.

[Audience question: today if you want to look up a single record you go to HBase; right now there are no indexes in Impala or on HDFS — will there be?] If you look at the Parquet format, it has support for statistics, so you can skip data based on things like per-block minimum and maximum values. If you're familiar with how an MPP database does it, it's similar in spirit: right now there's no index, it just scans, but it scans with every drive in parallel — like I was telling the gentleman over there, two billion records in about ten seconds on that cluster, and that's a couple of terabytes, so it's not the end of the world. So right now there's partition pruning; will indexing happen? Perhaps — but you're exactly right: if you need to grab something by its unique identifier, HBase is a much better tool, because then you're talking about 2 to 20 milliseconds for the lookup. OK — I think we have to give the room back. Thanks, everyone.