Time Series Analysis with Spark and Cassandra | Christopher Batey

OK, so let's do the little hands-up thing first: who here has used Cassandra? One, two, three. So we'll spend some time introducing what Cassandra is. How about Spark? OK, and Spark for batch analytics, or Spark Streaming? Not many for streaming, mainly just batch. OK. A little introduction, just so you can see why I'm talking about this. I work for a company called DataStax, and this will be the second-to-last survey: who's heard of DataStax? Quite a lot; it's increasing. They're quite big in America, and they're slowly taking over Europe too. Essentially what DataStax does is take Apache Cassandra, Apache Spark and Apache Solr and build a product out of them, and that product can be your operational database, but it can also be your analytics platform, and you can add in text search as well. Everything I'm going to talk about today you can do with open source Apache Cassandra and open source Spark; it's just a lot easier with DataStax's product. But don't think I'm a salesman: you can do this with open source, I promise.

My experience with Cassandra mainly comes from working at Sky. Has anyone heard of Sky? Yeah, everyone; when I ask that in a different country it's usually "no". Essentially Sky is moving more into online television, and when they initially thought "let's do some internet television", no one expected it to take off. Surprisingly it did; everyone likes watching Game of Thrones on the iPad. So it got busier and busier and busier. They'd originally outsourced the platform to a smallish company, and the thing that still produces internet television for Sky is now a large, monolithic Tomcat application backed by Oracle that falls over, not quite on a daily basis, but maybe every two days. What we were doing was replacing that system with one backed by Cassandra.

I can go quite slowly today. When I first thought I'd have to cram this into 30 minutes, and then 45, I thought I'd have to speak very quickly, and given I've got a northern accent you probably wouldn't have understood a word. But now I've got a full hour, because the other speaker dropped out, so I can go a little more slowly, and I'm happy to take questions throughout; we can make this interactive if you want.

What we're going to do first is talk a bit about motivation, and I'm going to do that via an example application, which is online and which you can go and play with. The scenario is doing analytics on weather data. It's both ingesting the weather data at high volume, so you need a data store which can take it, and doing stream analytics on top. We're not going to go into batch, but the Spark APIs are so similar that you're pretty much learning how to do batch as well. Then we'll go into the tech stack: I'll give a bit more time to Cassandra, given that most people here haven't used it, then a little overview of Spark, and then we'll go and look at some code. That's obviously a lot for one session, so consider the warning sign up: I'm going to throw lots and lots of information at you, and I'll be around for the hour afterwards and for the beers, so you can ask me questions for hours about it.

So, does everyone recognize these three terms? This is the last quiz, I absolutely promise. Who here would say they've built an operational system, something backing a website or mobile apps,
something that would classify as an OLTP system? A few. How about batch: who here works with technologies like Hadoop MapReduce? And who here does the middle one, where people want answers, but they want them reasonably quickly: reports, recommendation engines? So near-real-time analytics processing. That's pretty even.

So, the technology stacks we use today for these things. There are lots and lots of databases: MySQL, Oracle, Postgres, most of them relational, and newer ones like Couchbase have come into this space. All of these databases you can kind of use for reporting, assuming your data set isn't too large: you can do reporting-style queries which give you a response time in seconds, maybe minutes. Eventually you'll run out, and you'll take all of that data, extract it, and stick it into something like Hadoop or Storm, and at that point you've got a new set of infrastructure: it's probably going into HDFS, and you're probably running MapReduce on top.

I'm going to suggest something quite different. For the transaction processing, your operational database, we're going to talk about using Cassandra: an open source database originally from Facebook, though it's changed quite a lot since then. For our batch analytics, I'm going to suggest we use Spark. And in the middle, for keeping our dashboards up to date to see what our customers are up to, or working out the monthly min and max for the weather, I'm going to suggest we use Spark Streaming.

There's an application online which looks a little bit like this, which you can take and play with. It's all written in Scala; it uses Akka, it uses Spark and it uses Cassandra. The scenario is essentially that there are lots and lots of weather stations around the world. I'm not necessarily talking about big ones; they could be Raspberry Pis. They all produce data, say an event every hour, but if there are millions of them, suddenly the application which is going to ingest this data needs to handle hundreds of thousands, if not millions, of transactions a second. We can put that into certain databases, but eventually they'll fall over in a heap: when you get to certain data volumes and transaction rates, you need special techniques to make them scale. You can do that with relational databases, don't get me wrong; Facebook is the prime example that if you put enough effort and money into something you can make it scale. But they tend to stop being relational databases: you start denormalizing, you start removing constraints, you start sharding, all of those types of things. So we're going to talk about how to use Cassandra for that part. For the aggregation side, constantly keeping up to date how much it's rained today, the min and max for the year or the month, all those types of things, that's what we're going to use Spark Streaming for. If this were a longer talk, I'd first show you batch Spark, saving results back into Cassandra, so your application could work off the data produced by the batch jobs, but we're going to skip that and go right to the streaming. So the data is going to go through this flow: into Spark, and then saved into Cassandra, both the raw data and the aggregates.

I am terrible at producing UIs, but if you were to do a UI for this, it might look a little bit like this. It's real data, by the way: you can get historical weather data online, and I've imported 20,000 real weather stations into it. There are two ways to ingest data: one is that we've actually downloaded some data and we can shove it through; the other is that we can generate some, so we can say "between this date and this date, make up some weather data". The idea is that we should be able to build things like this, updated in near real time: within, say, 500 milliseconds we're going to keep things like the daily precipitation up to date. If you want to play with this afterwards, it's on GitHub, like everything is these days. The dashboard part is not quite in the main repository yet; it's in my public fork, but you can go and play with that, and it'll be in the main one within a week or so.

So if we're going to build a system like that, we need to start learning about the stack behind it. Let's first learn a little bit about Cassandra; you might get bored if you're the one person who already knows it. You've probably used Cassandra today: if you use Spotify or Netflix, or if you've got an iPhone, you're using Cassandra most of the time. If you Google "does Apple use Cassandra" you'll find the presentations online.
Apple are running around seventy-five thousand nodes in production, so Cassandra runs at a somewhat larger scale than most databases. Netflix run a couple of thousand nodes in production. Sky, where I came from, are very modest: it's more like a hundred. There are lots of finance companies using it, because it's really good for time-oriented data, since you get to pick how the data is ordered on disk. It doesn't have to be time ordering; it's just that time series is the most popular kind of ordered data. So if you hear the classic "Internet of Things", Cassandra is used a lot for that, because most Internet of Things data is actually time series data, and it's used a lot for things like online gaming.

On the question about the iPhone: the services behind iOS itself are backed by Cassandra, and that's the biggest deployment, something like four hundred million users. Correct, but I wouldn't want to couple my iPhone application directly to a database, because I want to be able to change my back-end infrastructure. When I say most mobile apps use it, I mean the services backing those mobile apps are backed by Cassandra. At Sky, for instance, we replaced all of our back-end services with a completely different technology stack and didn't have to change the front end; if we'd coupled the front end to the actual database running in the back, that wouldn't have been the case. It depends on your integration technology; with mobile the common one is HTTP, simply because everything has an HTTP client, so all of ours were HTTP.

OK, so Cassandra's sweet spot for data modeling is ordered data, and that's because, as I'll explain, you get to pick how it's ordered on disk, so you get particularly fast range queries, slice queries. However, most people who end up using Cassandra are using it for non-functional requirements; they haven't thought "ooh, my data model fits in here perfectly", like you might with a graph database. They're typically migrating from MySQL or Oracle because they've hit a scaling point. The key characteristics, and I'll explain how, are that Cassandra has pretty much linear scalability, and that's for durable writes, and that it's data center aware, completely topology aware. If you've done any studying on CAP and availability, you realize it's pretty much impossible to build an available database if it doesn't understand its topology: what its data centers are, what its racks are, what happens when there's a partition between racks, what happens when an entire rack fails. We're going to look at how Cassandra handles all of that, and, most relevant for this talk, how we can use that data center awareness to do analytics directly on top of our operational database.

Cassandra is about five years old. Two papers, not quite academic ones, because they were written by companies, provide most of the theory behind it. I promised I wouldn't do another hands-up, but: who's read the Dynamo paper? No? It's a really good read; it explains how to build an always-on database, where typically you're sacrificing consistency for being always on. It was Amazon who did it, for their shopping basket service, because they eventually realized that if customers can't add things to their basket and can't check out, they don't make a great deal of money. They'd rather things be incorrect sometimes than have downtime in that service. Google BigTable is very much about the data model: Cassandra doesn't use any code from BigTable, but it uses a very similar data model. Amazon Dynamo describes a key-value store, and Cassandra is not a key-value store; it's what's called a wide column store, which from a programmer's point of view is essentially a map of maps, and I'll show you that in an example. Every node in a Cassandra cluster knows which data center it's in and which rack it's on, and the benefit of that is you can run one logical database across the whole world: each node knows which region it's in, and each client knows to speak to its local nodes, unless they're all dead. So you can run multiple data centers to get your application close to whoever is accessing it. That's really important for services like Netflix: you don't want the European Netflix talking to the American Netflix servers. But the data centers don't have to be real data centers; you can make them up, and one really good reason for making up a data center is to segregate nodes so you can run more analytical-style queries on them without affecting the performance of the servers which are servicing your website or your customers. That's mainly what we're going to talk about today. Here's the paper; I'll let you read it now. No, don't read it now, but it really is a great paper to read if you're a software engineer: it's not overly technical.
It describes a real-life problem we can all relate to, and how they went about designing a database to solve it, and there's some proper computer science in there as well. We're not going to go through it all, because we want to get on to Spark quite quickly, but I'm going to cover the parts relevant to running analytics directly on top of what is normally your operational database: how the data is distributed around the many nodes in what's called the Cassandra cluster, and how the data is replicated to avoid losing it and to handle things like partitions.

There are a few parts of Dynamo that Cassandra just isn't. The two main parts, from my point of view, follow from the fact that it's not a key-value store. In a distributed key-value store, conflicts are extremely important: if the database isn't aware of the structure of the data you're storing, and people are writing it from multiple parts of the world, then when there's a conflict there's not much the database can do; the blobs simply conflict. To get around that, the Dynamo paper talked about using vector clocks, which are just a way of working out when there's a conflict and then handing it to the client: please sort this out, merge these two baskets. With Cassandra, because it's a wide column store, you can actually data-model around conflicts, and when we look at the data model I'll show you how to do that. So we're going to start off with consistent hashing.

Consistent hashing is essentially about avoiding a thousand-node table scan. Cassandra is all about building performant queries that behave the same when you've got three nodes and the same when you've got a thousand nodes, and there's one thing you definitely can't do if you want that: go to every node for every query. We have to be able to pick a single node, or a very small number of nodes, to read or write our data, and Cassandra does that with consistent hashing. We take the first part of the key. Here's some data, and this is what I mean by a map of maps: the first key is "jim", then we've got "age", which is like a second key, mapping to 36, and "jim", "car" maps to "Ford". That first key, jim, carol, johnny, suzie, is what we're going to hash. The hash gives us a number; it's actually a 128-bit value, but I don't know how to say those numbers out loud, so we're going to pretend it's between zero and a thousand. So let's pretend (it doesn't really, in case you're checking) that with Murmur3, jim hashes to 350, carol hashes to 998 and johnny hashes to 250. Each node in the cluster then owns a range. Say you've got a four-node cluster, which would be a very modest Cassandra cluster: node A owns 0 to 249, node B owns 250 to 499, and so on. If you've seen the Cassandra logo you might have noticed a ring; that's pretty much what it's trying to tell you, this hash ring: between A and B is owned by B, between B and C is owned by C. And that means everyone can work out where the data should be: each Cassandra node in the cluster can work it out, and the clients can too, because the drivers execute the same algorithm and can go directly to the node which has the data. So because johnny hashes to 250, we know exactly which node owns him.

To the question about graph data: if it's just users spread around the world, you can increase the replication factor so more users can read locally. But if you've got something like Facebook's graph, which is really connected, where this person is connected to all those people and one of them is connected to someone in another country, then replication, which we're going into next, is not going to be that clever; you'd have to design that yourself. Titan, if you've come across it, is a layer built on top of Cassandra to help with graph-style problems, and Aurelius, the company that built it, has recently been bought by DataStax, so within a year or so there will be proper help for that kind of thing. Facebook-level scalability for a graph problem is quite unique, though. Everybody likes modeling things as a graph, but the graph databases we have aren't particularly scalable: they're normally vertically scalable, not horizontally, so you always hit a bottleneck.

So, if I don't tell you any more about Cassandra, how fault tolerant is it? What happens if node A goes down; do we lose availability? The answer would be yes, and you could set Cassandra up that way, because you get to pick Cassandra's replication. But if we want an always-on, fault-tolerant database, an AP system in CAP terms, we really have to store our data on more than one server, ideally more than one rack, ideally more than one data center. So how does replication work in Cassandra? There are two ways: a stupid way and a clever way. The stupid way, which is actually called the simple strategy, is basically just for setting up a quick dev environment; you never, ever use it in production. You normally use what's called the network topology strategy, and basically what that relies on is each node in the cluster knowing which data center and rack it's on. You can tell it (you can lie, but you probably shouldn't, unless you're doing the kind of workload isolation we'll discuss), or it can be worked out: if you're running in AWS it can work it out from the region and the availability zone, and similarly for GCE, Google Compute Engine, or it can be based off the IP, or it can be as simple as giving it a property file. What it then does is hash to find the first replica, then walk around the ring and try to put your piece of data on different racks. So if you ask Cassandra to replicate the data three times and you've got three racks in your data center, it'll put each piece of data once on each rack.
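To make the ring concrete, here's a toy sketch in Scala. The hash function and the ranges are invented for illustration; real Cassandra uses 128-bit Murmur3 tokens, and each node really owns hundreds of small ranges, as comes up later.

```scala
// Toy consistent hashing: a 0..999 token space and four made-up nodes.
object ToyRing {
  // Each node owns the range from its token up to the next node's token.
  val ring: Vector[(Int, String)] =
    Vector((0, "nodeA"), (250, "nodeB"), (500, "nodeC"), (750, "nodeD"))

  // Stand-in hash; Cassandra really uses Murmur3 over the partition key.
  def token(partitionKey: String): Int =
    math.abs(partitionKey.hashCode) % 1000

  // Every node and every driver can run this same lookup, so a query
  // goes straight to an owner instead of scanning the whole cluster.
  def ownerOf(partitionKey: String): String = {
    val t = token(partitionKey)
    ring.reverse.collectFirst { case (start, node) if t >= start => node }.get
  }
}

ToyRing.ownerOf("jim") // one node, no thousand-node table scan
```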

By default in AWS, if you're running in, say, Europe and you've got three availability zones, it will treat each availability zone as a rack and the region as a data center, and actually store your data across all of the availability zones. Then you can handle partitions between the availability zones, or the racks, and you can even handle two entire racks or availability zones going down without losing data. Unfortunately, Cassandra doesn't conjure these racks up for you, so you do need to get your ops person involved; I think the next couple of slides will show why that's relevant.

I'm going to talk about a scenario where we've got two data centers. Technically it would be one ring, but I think it shows better as two rings. As soon as nodes are in different data centers, Cassandra assumes the network link between them is slow, so it behaves differently between nodes in different data centers than between nodes within the same data center. Let's pretend we've got one data center in Europe and one in Asia. You get to pick a replication factor per data center, so we can say we want three in each; if you're really paranoid you might say five. Odd numbers are really good, because then you can work in majorities. What will a client do? The client knows which is its local data center and connects to one of the local nodes; assuming you've configured it this way (and it is the default), it will talk to a node which is going to handle that read or write. The node it connects to is, just for this one request, called the coordinator. It's not special for any other request, just this one. And for every single interaction with Cassandra you pick a consistency; what you're normally trading off is latency versus consistency. By default it's ONE (we'll talk about the other levels after this slide), which basically means: I want one of the replicas, out of all six, to acknowledge the write before you tell me it's done. Since the coordinator is talking to one of the replicas, that's going to happen pretty much right away, and then a whole host of asynchronous replication goes on. Because we've asked for three replicas, the coordinator gets the write out to the other two replicas in the local data center. It also designates a coordinator in the other data center (it all happens at the same time, I just can't say it all at the same time), copying the data over the slower network once; it's then that node's responsibility to replicate it out to however many replicas are in that data center. You can have a different number of replicas in each data center: people often want three, or perhaps five, for the operational one where they really can't tolerate downtime, while for batch analytics they might feel cheap and not want to buy so many hard drives, so they replicate once and accept that if they lose a node they'll need to rebuild it from the data on the other side.

To fully answer your question, I've got one more slide on this. I often get asked: is that coordinator special, is the first replica special, is that a master, can I turn it off? With most databases there's some form of master/slave, which means the ops person really doesn't want you to turn off one particular node. That's not the case with Cassandra: it will happily use two of the replicas it designated, and because they're on different racks they are all completely equal; the node that originally took the write can be down, as far as Cassandra is concerned. There are no special nodes in Cassandra. There's a coordinator per request, and if that coordinator goes down, the clients just roll over to another one.

Tunable consistency is where you trade off latency and decide whether you want to jump across the WAN. The default is ONE. If you want, you can use ALL, which would be all six in the previous example, but then you don't really have an available database: as soon as one node is down, some queries will fail, because obviously they can't reach all the replicas. So you don't use ALL; it makes people moving from relational databases happy for the first day, and then they stop using it. What you normally do is something based on a majority, a quorum, and there are two types. One is the simple QUORUM: in the previous example we had six replicas, so a quorum, a majority, is four, which means we'd have to jump over the WAN for that user's transaction to complete, which is slow. What you normally do instead is have your local clients talk to the local data center for both reads and writes, and use something called LOCAL_QUORUM, which means: get it to a majority of the replicas in this data center, and let replication to the other data center happen asynchronously.
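As a concrete illustration of per-request consistency, here's a hedged sketch using the DataStax Java driver from Scala (this assumes the 3.x driver; the keyspace, table and values are just examples):

```scala
import com.datastax.driver.core.{Cluster, ConsistencyLevel, SimpleStatement}

val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect()

// LOCAL_QUORUM: a majority of the replicas in the client's local data
// center must acknowledge; with RF=3 per DC that's two local replicas,
// and replication to the other data center stays asynchronous.
val insert = new SimpleStatement(
  "INSERT INTO isd_weather_data.raw_weather_data " +
  "(wsid, year, month, day, hour, temperature) " +
  "VALUES ('725030', 2015, 6, 1, 12, 21.4)")
  .setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM)

session.execute(insert)
```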

The only reason a client would flip data centers is if you had a catastrophic failure of all the local nodes. Does that make sense? You have a question; cool. If you're interested in a much longer talk on just eventual consistency, there's a guy called Christos; he's head of cloud database engineering, a very grand title, at Netflix, and they constantly run tests where they write at consistency ONE and then keep reading in each region of the world to see when the write actually turns up. If you don't read the Netflix tech blog, start reading it; it's awesome, especially if you use things like Cassandra and Spark: they're a huge Spark Streaming user. And yes, the consistency level is included as part of each query, absolutely.

We're going to talk about data modeling after we've talked about Spark, because I want to do it via an example rather than theory, but the no-free-lunch part is that you have to store your data the way you're going to read it. On the question about reading from the fastest replica: the coordinator itself can do something like that, but there's nothing built into the client that does exactly that out of the box. I'm not going to go too much into it, because it would become a full Cassandra talk, but you pick the load-balancing policy, and there's a bunch of default ones: one is based on the latency of the nodes, where you pick whichever has been quickest over a rolling window; another goes directly to the nodes which actually have the data. The coordinator does some of this too: even if you ask for a read consistency of ONE, it might speculatively ask an extra couple of nodes anyway and give you the answer back. But there's nothing built into the driver to do exactly what you said. Most of the drivers are open source and separate, and Netflix actually have their own driver, so there would be nothing stopping you from doing it yourself; the downside, I'd say, is that you'd risk overloading the cluster. We'll definitely be talking about how Spark reads data from Cassandra, because it's pretty important for anything Spark connects to to be aware of how the data is partitioned, and to read it cleverly.

So, last section on Cassandra; we'll skip data modeling until we get to the actual code. Roughly, how does Cassandra scale? I've said the client knows where the data is, and the other nodes know; so what happens when it moves? Why would it move? Well, one reason is adding more nodes to the cluster. Cassandra can scale linearly, both in data size and in read/write throughput, so there are two reasons for scaling, almost. I told you about those ranges, with each node owning a particular range; the truth is each node actually owns hundreds of little ranges. I just didn't explain that at the time. When a new node joins the cluster, it can talk to any node to find out about the topology, and it basically says: give me a bunch of those ranges. They get streamed over; while they're being streamed, reads and writes keep going to the original nodes, and when the streaming is finished, the new node joins fully. The communication between a driver and a Cassandra cluster is two-way and asynchronous, so the cluster can decide to tell the driver something whenever it wants, and one reason it might do that is to tell it about additional nodes in the cluster. That's roughly how scaling works, and it just works in reverse for removing nodes: you tell a node it's going away, and it streams its data out to other places. If the node is catastrophically dead, you just tell Cassandra it's not coming back; there were extra replicas of the data it owned, and those get streamed to the new owners. That's as short as I could make a Cassandra intro, because we want to talk about Spark and the example. Any questions on pure Cassandra before we move on?
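Picking up the load-balancing question: the standard driver policies look roughly like this (again the DataStax Java driver 3.x from Scala; the data center name is invented):

```scala
import com.datastax.driver.core.Cluster
import com.datastax.driver.core.policies.{DCAwareRoundRobinPolicy, TokenAwarePolicy}

// TokenAwarePolicy hashes the partition key client-side and goes straight
// to a replica; DCAwareRoundRobinPolicy keeps requests in the local data
// center unless every local node is dead.
val cluster = Cluster.builder()
  .addContactPoint("127.0.0.1")
  .withLoadBalancingPolicy(
    new TokenAwarePolicy(
      DCAwareRoundRobinPolicy.builder()
        .withLocalDc("EU") // hypothetical data center name
        .build()))
  .build()
```

There's also a LatencyAwarePolicy in the same package, which prefers whichever nodes have been quickest over a rolling window: the latency-based option mentioned above.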

Yeah, so I don't know a lot about HBase; I know quite a lot more about the likes of Couchbase. Conceptually it's the same data model, but it's backed by HDFS, whereas Cassandra doesn't use an external storage system: the storage engine is completely implemented in Cassandra itself, so it doesn't use, say, WiredTiger or anything like that. But no, I don't know a huge amount about HBase.

On whether to start relational and migrate later: one of my colleagues always talks about the technical debt of database migrations once you become successful. So if you're an optimist, by all means. The issue is that Cassandra removes functionality from you that wouldn't work in a distributed environment, on ten nodes, say: you're going to lose things like joins and aggregates. You'd be putting yourself through some pain, but with the advantage that you can start scaling whenever you want to. The features we're going to talk about are exactly why you'd use Spark on top of Cassandra: normally it's to bring back features you don't have, that you'd normally get in a relational database. The kinds of things you do to a relational database to make it scale are basically what you have to do from the start with Cassandra.

Before we jump right into Spark, any more questions? No? Then before we do, I'm going to defend why Cassandra can't do lots of things. It's because Cassandra is all about what I call scalable, predictable performance, which sounds like a marketing term, but I promise it's not. What I mean by performance is how quickly a read or write comes back; what I mean by scalable is whether that changes when the amount of data, or the number of reads and writes, increases. Can I just add nodes to my Cassandra cluster, so that my five-millisecond query on a three-node cluster still takes five milliseconds on a hundred-node cluster? The answer should be yes. It does that through the consistent hashing, going to a single node, or perhaps two if you want higher consistency, and, as we'll see with data modeling, by doing very few disk seeks. It restricts you from queries that would cause multiple disk seeks: you can only read things that are stored sequentially on disk.

But there are times when that's too restrictive, and one of them is ad hoc, reporting-style queries. Say I've stored everything my customer has done; at Sky: watched this movie, streamed this, bought Batman, watched this channel, called the call centre and complained. What happens if I just want a report of what they've been up to, how many times they've watched Batman, how many times they've watched Game of Thrones, all that type of stuff? You could do that in a relational database, assuming you're at an appropriate scale. You cannot do that in Cassandra. You can't do GROUP BY; you can't do aggregates. You're limited in how you can query, because you need to denormalize so your queries are almost always satisfied by a single partition, a single computer. What we're going to talk about today is running Spark on top for the cases when we're happy with a slightly slower response time: for near-real-time analytics that might be 500 milliseconds, or it might be 10 seconds; for a batch job working out what our customers are doing and which shows are becoming popular, we might be happy with six hours. If you're using Hadoop MapReduce you're probably happy with 24 hours. And I've made this one safe: who's seen this cartoon? Can you read it, is it big enough? Sorry, it's inappropriate, but I scribbled out the bad bit. Essentially, when you use the type of technology that scales, it often does screw you over, because it stops you querying in ways you previously could, and you get annoyed. Managers inevitably come up and ask you for, say, a count, or what a customer has been up to. Quite common at Sky: after a missing-person report, the police want to see what that person did on Sky over the course of three days or a week, which actually happens, and you can't do it, because you didn't model your data to store it that way. That's where Spark saves you. When I built most of my first systems with Cassandra, Spark was not yet popular and there was no connector between the two, so a lot of the things I'm going to show you with Spark, we actually had to write programs for, directly against Cassandra. For the people who know MapReduce, Spark is essentially a like-for-like alternative: a distributed compute engine.
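To make the "Spark saves you" point concrete before we get into it: the report that CQL refuses is a few lines with the Spark connector. This is a hedged sketch; the keyspace, table and column names are invented for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._ // adds cassandraTable to SparkContext

val sc = new SparkContext(new SparkConf().setAppName("customer-report"))

// "How many times has each customer watched Batman?" A query pure
// Cassandra rejects, because it would touch every partition in the table.
val batmanCounts = sc.cassandraTable("sky", "customer_events")
  .filter(row => row.getString("asset") == "Batman")
  .map(row => (row.getString("customer_id"), 1))
  .reduceByKey(_ + _)

batmanCounts.collect().foreach(println)
```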

It lets you take very large data sets, be they HDFS files or Cassandra tables, and do distributed computation on them. My favourite part is that it's got a shell, so you can do interactive work. The other great thing is that you can do batch analytics and you can do streaming with the same API: it's not Storm versus Hadoop MapReduce; it's virtually the same programming API and the same infrastructure. And you won't be that cutting edge by using it: if you're using a popular Hadoop distribution it's probably already there, and it's crept into products like DataStax's, so when you install DataStax's version of Cassandra you can have Spark nodes just by clicking a button and start doing these kinds of analytical queries. You can set it up with open source as well if you want.

It's split into a number of components. Spark is not a database: you need data, and normally people use HDFS; in the examples I'm going to show, we're going to use Cassandra. On top of that is the general compute engine, and that's where you're programming Scala. Who here likes Scala? It's nice; it really is nice for this type of programming. Then there are higher abstractions you can use: Spark SQL; streaming, which is what we're going to concentrate on; and machine learning and graph, which we're not going to cover in this talk. The programming model we get to play with is called an RDD, and it's essentially an abstraction over a very large data set, so our puny human minds don't need to worry about the fact that it's large. You can turn a Cassandra table into an RDD; you can turn an HDFS file into an RDD. Once you've got one, you can do lots of operations (and there are more than the ones I'll show): you can filter, flatMap and map, and you can do things which will cause shuffles, like groupBys. Nothing happens in Spark when you do that; it doesn't suddenly kick off computation. It's only when you do something visible, a side effect, which could be saving back to a file, putting it into Cassandra, or, I don't know, emailing it to your mum, that Spark works out the most efficient way to do those fifteen maps you wrote separately: it squashes them all into one. A couple of other things to note: after you've computed an RDD, you can cache it. By default Spark caches nothing, because imagine how big these data sets are and you've only got so much memory inside your Spark cluster, but if you're going to reuse the same RDD over and over again, you can cache it: take some data from Cassandra, do some computation, leave it in memory inside your Spark cluster, and continually reuse it.

Anyone seen Spark word count? Anyone seen Hadoop MapReduce word count? It's pretty nice in Spark; I didn't have enough slides to put the MapReduce one on. This is also what an RDD looks like: you don't have to write the types in Scala, but I did, just to really show the RDDs. We take a file from HDFS, split it by space, map each word into a tuple of the word and a one, and then reduce by key, and that's the equivalent distributed word count.
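The slide isn't reproduced in the transcript, but the word count being described looks roughly like this, runnable in spark-shell where sc already exists (the HDFS paths are placeholders, and the types are written out the way the slide apparently did, to show the RDDs):

```scala
import org.apache.spark.rdd.RDD

val lines: RDD[String] = sc.textFile("hdfs://namenode/path/input.txt")
val words: RDD[String] = lines.flatMap(line => line.split(" "))
val pairs: RDD[(String, Int)] = words.map(word => (word, 1))
val counts: RDD[(String, Int)] = pairs.reduceByKey(_ + _)

// Nothing above has actually run yet; this side effect triggers the job.
counts.saveAsTextFile("hdfs://namenode/path/word-counts")
```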
I have one version of this presentation which does six different word counts, some from Cassandra, some from socket streams, all sorts, but you're only getting one today, because it's quite a short talk. We're actually going to talk about the other side: Spark Streaming. With batch processing you can make the data set as big as you like: as much data as you can fit in Cassandra or HDFS, you can run some kind of batch processing on top of it. With stream processing there is a limit, because you've got to be able to consume the stream within a certain window. That's basically what Spark Streaming does: it takes your stream of data (where does your stream come from? I don't know: a queue, Kafka, ZeroMQ, a socket, Cassandra) and breaks it up into lots of little chunks that look just like RDDs. Then, when you do the same kind of programming, your flatMapping, your mapping, your filtering, it actually runs it every window, and you pick that window; the smallest is around every 500 milliseconds. So what we can do is stream weather data, do some computation every 500 milliseconds, and save the results back to Cassandra, which our dashboard can read from. That's the end goal; let's see how we do it. I've folded the Cassandra data modeling in here too, because we have to model something to solve this problem. The most important thing when you're using Spark is how you deploy it and how it partitions data. If someone says "we support Spark", the first thing you ask them is: how does it avoid moving data around the network?

Because Cassandra already distributes data very nicely, the Spark connector, which is a separate open source project, builds what are called Spark partitions (it's confusing that they share a name with Cassandra partitions) out of pieces of Cassandra data that all live on the same node. Spark will never set off a piece of computation that would mean pulling data from more than one Cassandra node. So the logical thing to do is to run a Spark worker directly on each Cassandra node: the same server, both processes. If you've got DSE, DataStax Enterprise, this just happens; you don't get a choice in the matter. If you're deploying this kind of infrastructure yourself with the open source versions of the projects, you just make sure this is the case. And what you typically would not do is put a Spark worker on a node which is servicing your customers: you designate your analytical Cassandra nodes for it.

This is the batch version of my awesome architecture diagrams, and it essentially says: weather data comes into Cassandra; a batch process takes the raw data out, does some fancy stuff written in Scala, and saves it back; and we read it directly from Cassandra for, say, our dashboarding. We're not going to do that; we're going to do the streaming version, so it's not out of Cassandra and back into Cassandra. The data just comes in through our stream: we're going to save the raw data for later batch analytics, if we want to do them, but we're also going to build up some aggregate tables to draw some fancy graphs to impress managers.

What does the raw data look like? I did not make this up; this is real raw data that you can go and get online as historical weather data for many places. And this is what the Cassandra table looks like; it probably looks familiar. The language you use to interact with Cassandra, CQL, is very similar to SQL but not quite the same: you've got some extra types, types that encourage you to denormalize, like maps and sets. Let's have a look. We've got a weather station ID, so each piece of data is associated with a particular weather station. We've got time: year, month, day and hour, so this data model implies we're not going to receive more than one reading per weather station per hour. Then we've got loads of weather stuff: the temperature, the hourly precipitation, the six-hourly precipitation, which way the wind is blowing. This is the raw data we don't want to throw away; we want to keep it in case we want to do batch analytics. Most people I speak to who do stream analytics always want to be able to reproduce the same results with batch analytics, just in case. Especially if it's doing billing or anything like that, they don't trust their stream analytics quite enough to throw away the original data; not to mention you might decide you want to process it a different way in the future.

Right, let's look at Cassandra primary keys. The only hard thing about working with Cassandra is picking the primary key; it's the only place you can really go wrong. The first part of the primary key, the weather station ID here, is the partition key: it decides that all data for a particular weather station will sit on a single node. If that's not going to work, you partition some more: you can put other fields into the partition key; we could have put the year in there. The rest of the primary key is a bit like Google BigTable: we get to pick how the data is ordered, so here it's ordered by year, month, day, hour; it's ordered in time. The other thing we do is flip it round to descending rather than ascending, so that a LIMIT query in Cassandra gives you everything back in the order you want, and you don't need to sort after you get the data out. You pick ascending or descending depending on how you want to read and display it; with weather data you probably want the latest weather first. So there we go: that's the partition key, and then the clustering columns, and this is how it looks on disk. First comes the partition key, and all that data is together, so we know it's all on one node; then the rest of the data is stored in order, just like that. The clustering columns essentially become the second key (remember, map of maps), and here we're just storing the temperature; it would be a bit big if I showed all the fields. So on the 7th it was minus five, on the 8th it was minus five point one, on the 9th it was minus four point nine. And to reiterate what that buys us: we can have a thousand-node cluster with many petabytes of data, and we can go straight to the right place; once we're in the right partition, we can read a slice of weather data with a single seek on disk followed by a sequential read. That's what gives Cassandra its scalable performance. So here we're doing a WHERE clause: we specify the weather station, the year, the month and the day, and then we take a time slice over the hours.
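As far as it can be reconstructed from the description, the table looks like this (it mirrors the reference application's raw_weather_data, but treat the exact column list as approximate); a sketch using the connector's session, assuming a SparkContext sc:

```scala
import com.datastax.spark.connector.cql.CassandraConnector

CassandraConnector(sc.getConf).withSessionDo { session =>
  session.execute("""
    CREATE TABLE IF NOT EXISTS isd_weather_data.raw_weather_data (
      wsid text,               -- weather station id: the partition key
      year int, month int, day int, hour int,
      temperature double,
      dewpoint double,
      pressure double,
      wind_direction int,
      wind_speed double,
      one_hour_precip double,  -- hourly precipitation
      six_hour_precip double,  -- six-hourly precipitation
      PRIMARY KEY ((wsid), year, month, day, hour)
    ) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC)
  """)
}
```

The slice query from the slide is then along the lines of SELECT temperature FROM raw_weather_data WHERE wsid = '725030' AND year = 2005 AND month = 12 AND day = 1: straight to the partition, one seek, one sequential read.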

Cassandra is going to limit you here, though: it's not going to let us do a range over both the day and the hour, because the results wouldn't be stored next to each other on disk. That's the limitation of pure Cassandra, and we're going to solve it with Spark. This can be quite a hard model to deal with at times, and if you used Cassandra a couple of years ago, this on-disk layout was literally the model you programmed against. It's not any more: CQL is around two years old, and you now get to see the data as those tables, and when you read it back into your application you essentially get it as if it were a table. You do need to understand the primary key; it's hard not to understand it, because you can't query any way other than how you've structured it. But you get the data back inside your application as if it came from a relational database, the programming model we're kind of used to.

We can also store metadata in Cassandra; nothing Spark so far, and if we just wanted the latest weather data, we could do all this with plain Cassandra. We've also got a bit of weather station metadata: again, I've loaded around 20,000 real weather stations, and we store where each one is, its elevation, things like that.

The next thing we're going to do is build things which cannot be done with Cassandra, the kind of things you could do with a relational database if the data set was small enough and you had enough patience. We want to build some aggregates. The first one is the aggregate temperature: if we look at the primary key, this one is per day (the actual application online keeps the aggregate temperature for the day, the month and the year), and the only values I'm really selecting here are the min and the max. You could do that in a relational database; you certainly could not do it in Cassandra; and if you had many petabytes of data, you certainly could not do it in a relational database either, unless you had an awful lot of patience. Another aggregate: let's say we want the precipitation. The graph I showed you at the start was actually the daily precipitation, and you can stream loads of data through this and keep it up to date; the UI really does refresh and constantly gets updated from the data in Cassandra. It's per day: looking at the primary key, we've got year, month, day for a particular weather station, and we're using what's called a Cassandra counter. That's the magic we're going to talk about in just a bit. If we were using, say, an integer in the database, we'd have to read then write; a Cassandra counter allows us to just increment, so for each weather event that streams in we can keep incrementing that day's total until no more come, and then it remains static. Assuming we create some magic Spark Streaming stuff, we're going to end up with tables like this: we'll constantly keep these aggregate tables up to date, which means our dashboards can read directly from Cassandra. You'd scale the streaming part of your application based on how much data is coming in, and the rest of the app, say the dashboard application, based on how many users you've got. So this is what we're going to end up with: no more double arrows now; the data just comes in, goes into Spark Streaming, the raw data gets saved, and the aggregates get saved.
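Before the streaming code, here's a hedged reconstruction of those two aggregate tables (names modeled on the reference application; exact columns approximate). Note the counter table: everything except the counter itself has to be part of the primary key.

```scala
import com.datastax.spark.connector.cql.CassandraConnector

CassandraConnector(sc.getConf).withSessionDo { session =>
  // Daily temperature aggregate: plain columns, rewritten as the day evolves.
  session.execute("""
    CREATE TABLE IF NOT EXISTS isd_weather_data.daily_aggregate_temperature (
      wsid text, year int, month int, day int,
      high double, low double, mean double,
      PRIMARY KEY ((wsid), year, month, day)
    ) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC)""")

  // Daily precipitation: a counter column, so each streamed event just
  // increments the running total; no read-before-write needed.
  session.execute("""
    CREATE TABLE IF NOT EXISTS isd_weather_data.daily_aggregate_precip (
      wsid text, year int, month int, day int,
      precipitation counter,
      PRIMARY KEY ((wsid), year, month, day)
    ) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC)""")
}
```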
So let's look at some Scala code now to see how this works. The full source is online, so you can go and run it as well; you don't need to write any of this down. First, the data: there are big files checked in (bad, I know, but they're zipped) containing real data, and there's also a load generator for creating somewhat random data. The first thing we do is turn each long comma-separated line into a nice case class. Has everyone come across Scala case classes? I wish Java had them; essentially it's an immutable kind of POJO. We just turn that string into a case class with nice names. Creating a stream with Spark Streaming is a one-liner (I haven't shown you how to create the Spark context, but that's just another one-liner): the thing you give a stream is how long you want the micro-batch to be, how often you want the computation to happen, because you get really cheap aggregation inside that little window. I'm not going to talk much about Kafka; where we're streaming the data from is somewhat irrelevant. The first couple of lines of this are to do with Kafka, but the next couple would be the same regardless of whether we were getting the data from a socket, from ActiveMQ, whatever. And remember, it's a comma-separated value, so with the nice Scala API we get to split it, and then there's a bit of Scala magic for creating a case class from an array. So we instantly have a stream of case classes, a stream of nice raw weather data. Now, remember that RDDs and streams are reusable, so we can use this stream twice; inside the actual code it's used many times, for all of the aggregates. And the first thing we can do, with one very complicated line of code, is save it directly to Cassandra, because the case class matches that raw weather data table.
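Pieced together, the ingest pipeline being described looks roughly like this. It's a sketch assuming the Spark-1.x-era Kafka receiver API; the hosts, topic and group names are placeholders, and the case class is trimmed to a few fields:

```scala
import org.apache.spark.streaming.{Milliseconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
import com.datastax.spark.connector.streaming._ // adds saveToCassandra to DStreams

// The immutable case class with nice names, built from one CSV line.
case class RawWeatherData(wsid: String, year: Int, month: Int, day: Int,
                          hour: Int, temperature: Double, oneHourPrecip: Double)

object RawWeatherData {
  // the "Scala magic" for building the case class from a split line
  def apply(f: Array[String]): RawWeatherData =
    RawWeatherData(f(0), f(1).toInt, f(2).toInt, f(3).toInt,
                   f(4).toInt, f(5).toDouble, f(6).toDouble)
}

val ssc = new StreamingContext(sc, Milliseconds(500)) // the micro-batch window

// The first line is Kafka-specific; the rest is the same for any source.
val parsed = KafkaUtils
  .createStream(ssc, "zkhost:2181", "weather-group", Map("raw_weather" -> 1))
  .map { case (_, line) => RawWeatherData(line.split(",")) }

// Save the raw events: the case class fields map onto the table's columns.
parsed.saveToCassandra("isd_weather_data", "raw_weather_data")

ssc.start()
ssc.awaitTermination()
```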

You literally just call saveToCassandra on the stream: the Spark Cassandra connector uses implicits to add all of these methods to streams and RDDs, so you can instantly save the data away for later batch analytics. (Some of the examples in the online version do use batch: you hit it with an HTTP request, it runs a little Spark batch job and gives you an answer.) Now remember that precipitation aggregate; this is the code which actually produces it. We do a map: we map each event to look just like the table, which is just a tuple in Scala, and we don't need to do anything else. Don't we need to read the old values? No, and that's because we use Cassandra counters: if you put five into a counter, it actually adds five; it doesn't set the value to five. So every single message that ends up on Kafka ends up updating a counter in Cassandra, and that's really cheap: it goes directly to that one node and just increments a counter. There's no batch-processing across all of my data to keep the aggregate fresh.
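The counter-updating code from the slide, sketched under the same assumptions as before (parsed and the daily_aggregate_precip table come from the earlier sketches). Because precipitation is a counter column, each write is an increment rather than an overwrite:

```scala
import com.datastax.spark.connector.SomeColumns
import com.datastax.spark.connector.streaming._

// Map each event to look just like the table: a plain Scala tuple.
parsed
  .map(w => (w.wsid, w.year, w.month, w.day, w.oneHourPrecip))
  .saveToCassandra("isd_weather_data", "daily_aggregate_precip",
    SomeColumns("wsid", "year", "month", "day", "precipitation"))
```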
So we end up with something like this. For the real example, you can either take the downloaded files (there's a way to import them and send them on to the Kafka queue) or use the dynamic generation; either way the data ends up on a Kafka queue, the streaming job consumes it, it all goes into Cassandra, and then there's a definitely-not-pretty dashboard which displays it. I haven't yet added all the aggregates to the dashboard, but you can go in, see the weather stations, and look at some of the aggregates. My idea for building this example up is eventually to give people in the audience little Raspberry Pis (you can get Raspberry Pis with little weather stations attached) and have people actually upload data into it, but I'm not quite there yet. If you do want to go and play with this stuff, it's a near-production-ready example, not just noddy code; it's probably the main reference application for Spark and Cassandra integration. Up until now it's been primarily developed by a genius called Helena: she's a Spark Cassandra connector committer and an Akka committer (that's a mouthful), and she works on Cassandra. I've just started adding some stuff to make it more visual.

So that's pretty much it for today. We talked about Cassandra, which I think was new for most people: it's a really interesting database if you want to build an always-on system, and it can handle crazy amounts of durable writes, assuming you've got enough money for the hardware. Spark is really useful too; I anticipate that everyone who uses Cassandra will eventually use Spark on top of it, even if it's just to fix their broken schema, because they saved the data one way and want to save it a new way, or new requirements come in: it's really easy to build a new table from an old table. And stream processing is, I think, a game changer. Previously this took multiple technology stacks, whereas now we can use Kafka, Cassandra and Spark to do many things: it's our operational database, it's our stream processing, and it's our batch processing. It's not three teams and three sets of infrastructure. I'm not saying that if you were building, say, a BI solution from scratch you would use Spark on top of Cassandra rather than HDFS, but if the data is already in your operational system, there's a really good argument for not extracting it and putting it somewhere else: do the analytics directly on top.

On the question of whether the analytics nodes can fall behind the nodes ingesting the data, since they run at the same time: yes, they can. If you have Cassandra data centers with designated nodes of different specs, the analytics side can absolutely get behind, because the nodes actually ingesting the data could be your fancy SSD-backed boxes while the analytics ones are really slow. If it's a small period, Cassandra just catches up: the nodes handling the writes store the missed writes as hints until the other node acknowledges them, by default for three hours. If it's a bigger problem than that, you need to go to your analytics nodes and tell them: you were definitely behind at some point, go and get the data. That's a repair, and it uses Merkle trees: it builds up a Merkle tree of the table on each side and tries to stream over the smallest possible difference, because you can have many terabytes on a Cassandra node, so even the Merkle tree can end up very big. There's also a metric that comes out of Cassandra, called dropped mutations, which tells you this has happened. So you don't have to rebuild the whole node; it can recover itself.

Thank you very much.