Anaconda Cluster – Introductory Webinar

Can people see my desktop? Yes? Is it possible to expand it, is it in full-screen mode? Is that better? Yes. Okay, thank you.

So why don't we get started. Welcome this afternoon to the Continuum Analytics Anaconda Cluster webinar. My name is James McCarthy, and I want to thank you for taking the time to jump on the webinar today. You were invited from a couple of different sources: perhaps you downloaded a copy of our Anaconda Server product and we invited you to the webinar, or you had an interest in Anaconda Cluster and have had some discussions and communications with us at Continuum, or we tweeted this information out over the past 24 hours and you joined as a result. So let's get started and jump to the next slide.

Today you'll be joined by myself, James McCarthy, sales director on the East Coast; Ben Zaitlen, a software developer and data scientist at Continuum, based in Michigan; and Ivan Mills, our inside sales manager, based in Austin, Texas. If you have any questions during the webinar, please use the chat button and we'll try to address them at the end.

Let me give you a little background on Continuum Analytics before we get into the demo and the discussion of Anaconda Cluster. Our mission is to help the world discover, produce, share, and collaborate better by moving expertise to data. We provide software products, training, integration, and consulting services to corporate, government, and educational clients worldwide. We're composed of three integrated and related business units: services, software products, and R&D. The products part of our business involves all the effort that goes into productizing and selling our premium solutions that solve specific customer pain points; Anaconda Server, Anaconda Add-Ons, Wakari Enterprise, and Anaconda Cluster constitute our major products today, with other products in the pipeline. The services side of our business involves consulting and training engagements that provide customers with specific solutions; we have between 30 and 40 developers and consultants working embedded at customers or working remotely to help solve our customers' problems. R&D comprises developers and projects that are almost always open source and are typically funded by open-source-friendly customer contracts, such as the federal government; the effort from this part of our work has produced projects like Numba, Conda, Blaze, and Bokeh, as well as improvements to other important projects such as NumPy and SciPy.

A little more background: the company is based in Austin, Texas, with a large presence here in the New York marketplace, and we've been in business for roughly three years. The interesting metric here is that we've grown from 29 employees to over 90 in just a little over a year. The revenue we generate comes from the consulting, training, and software products that we'll talk about today. As an example, we've created a product called Anaconda; the download rate of Anaconda is about a hundred and forty thousand times per month, and the feedback from people who use it tells us that they love it and that it makes their lives so much easier. What we're trying to do is give back to the open source community by creating this

product and allowing people to use it for free. We're focused on the Python ecosystem, but we're not exclusive: we've also started incorporating support for the R programming language in our products, and of course we'll talk a little bit about Anaconda Cluster and our bridge, if you will, into the Hadoop and Spark platforms. So that's a little bit of background on Continuum Analytics and the rules of the road for the webinar today. At this point I'll let Ben give you an overview of Anaconda Cluster and get into the details of the product and the demo. Again, if you have any questions, please use the chat button and we'll try to address them at the end of the webinar. Thanks, Ben.

Ben: Can people hear me? You can? Okay, great. So Anaconda Cluster was built with two kinds of people in mind. One is the data scientist or statistician who needs to take their local analysis and scale it up across a cluster, often one running Hadoop-type systems; that's what we're focused on now, although we're not limited to that. They need to be able to manage the runtime very easily. They need to be able to say: I want to bring up a Spark cluster, and I need to be able to install, very easily, NumPy, SciPy, scikit-learn, specific versions of those things, across the cluster, and then take my local experiment and run it against data that either exists in the cloud or gets provided there, or something that maybe exists in your company's data warehouse. The other person we have in mind is in the traditional DevOps role: somebody who isn't responsible for doing modeling or machine learning or anything like that, but who is tasked with setting up machines and setting up Python environments easily. Tools like Puppet and Chef and Salt are really great at setting state across a cluster, but they don't flex as well when you need to dynamically manage all the Python packages within the cluster. For example, if you call the system Python on CentOS, you'll get Python 2.6. (Let me shut off my notifications, sorry about that.) So Anaconda Cluster tries to help both of those people out: we can very easily take our local analysis and scale it up, we can underlie all these Hadoop pieces with Anaconda, and we can control the Python runtime across all the nodes. For now we're supporting HDFS, Spark, Hive, and Impala, with YARN as the resource manager we're working with, and we have some niceties around starting up an IPython Notebook on the server, IPython Parallel is coming, things like that. For the moment we're targeting EC2 and DigitalOcean as the cloud providers, but there's really nothing that stops us from targeting other cloud providers like Azure or Rackspace, and we can also do bare-metal installs. The big thing you get, again, is managed Python across the cluster. There are some lead-ins toward managing R across the cluster too; that hasn't been proven out quite yet, but that's where we're going as well. So Conda, for us, solves the Python packaging problem, and really the runtime packaging problem, because we can install R, and there's nothing that keeps us from installing other Python runtimes like PyPy or IronPython. Conda works cross-platform, and Anaconda Cluster similarly lets you launch clusters cross-platform; sorry, you can launch clusters on Linux through these different environments, but you can't manage the state across Windows quite yet. And it works with more than just Python.
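To make that first persona's workflow concrete, a session along the lines Ben describes might look roughly like this; this is a sketch, and the exact subcommand shape and version pins are illustrative rather than the shipped CLI:

    # Pin specific versions of the scientific stack on every node of a
    # running cluster (cluster name and versions are hypothetical)
    conda cluster manage my-spark-cluster install numpy=1.9.2 scipy=0.15.1 scikit-learn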

So I'm a developer and data scientist, and when I think about all the things I have to know in order to do my job, this is what my setup looks like. I have Hadoop, Impala, Hive, and Spark; these are my large-scale data analytics engines. Then I have traditional databases that I often need to work with. I need my statistics and machine-learning toolset; Weka is something I may have used in the past that doesn't really have a Python interface. And I also need the things that underlie these statistics packages: PyTables, NumPy, pandas should obviously be there. And we need to interoperate between all of these things. Spark has actually been really great for a lot of us who need to work with HDFS and also work in Python, who haven't made the leap yet to Scala, or aren't intending to, and who aren't Java developers; we need to work with all these different pieces. So a Spark job might call pandas and scikit-learn, those in turn call PyTables, and the output of that Spark job might need to go hit a Postgres database in the end. Being able to manage all of that and keep it in your head is really hard.

For me it was very difficult, when people internally wanted to play with these systems, to get clusters up and running with all the Python libraries I wanted right away. If I wanted to do a large-scale text data-mining project, I would go through other providers or managers, bring up a CDH cluster, and then write bash scripts, or combinations of Fabric scripts, or anything I could reach in my tool belt, to go and set the state on all the nodes just so I could install NLTK. In the end it was much easier to start writing, along with other devs at Continuum, a proper CLI tool, and that's what Anaconda Cluster is. You get Conda on n nodes, you get managed Conda on n nodes; it solves the remote Python packaging problem. We also, of course, have provisioners built in, and our cloud provisioning is done through libcloud. Libcloud abstracts away the problems of defining specs for what I want to launch on Azure or DigitalOcean or EC2, and again, Anaconda Cluster can target both cloud and bare metal.

A lot of the people I've talked to in the past, especially HPC folks, often want to see this kind of diagram; it helps to understand what the topology of the network looks like. Right now we assume that you're working mostly from your laptop, and that your laptop can actually access all the nodes in your cluster. There's a designated head node, but there's nothing that keeps you from manipulating things on the compute nodes, or slaves; rather than masters and slaves, we prefer head and compute. Most things go through the head node, but there are a few instances where we need to connect to all the nodes in the cluster at once, and we're trying to scale that back as much as possible. The installed tools you get when you launch Anaconda Cluster are Anaconda, HDFS, Hive, Impala, and Spark; currently these are all turned on by default. Somebody asks whether the client usually cannot see the compute nodes: in some topologies you can't see the compute nodes, but our assumption is that you can. Again, we're trying to remove as much of that compute-node interaction with the client as possible and start doing everything through the head node; there are any number of reasons why that's a better model for us and a better model for most people. So you get all of these things by default. Our future work is going to include other streaming tools like Storm and Spark Streaming, other clustering technologies for batch processing like Sun Grid Engine and Condor, and IPython Parallel; there's really nothing that keeps us from targeting these other things, they're just a plugin away. Underlying all the parts of Anaconda Cluster is actually Salt managing all of this; by creating a Salt recipe, munging it a bit, and debugging it a bit, we now have these capabilities across all of our nodes.
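To illustrate the interoperability point above, here is a minimal sketch of the kind of mixed workload described, a PySpark job that uses pandas inside each partition; the input path and column names are hypothetical, and the job assumes PySpark and pandas are installed on every node:

    # Hypothetical PySpark job mixing Spark with pandas
    import io
    import pandas as pd
    from pyspark import SparkContext

    sc = SparkContext(appName="mixed-workload")

    def partition_means(lines):
        # 'lines' is an iterator over the raw CSV lines in one partition;
        # parse them with pandas and yield a single per-partition mean
        text = "\n".join(lines)
        if not text:
            return
        df = pd.read_csv(io.StringIO(text), names=["key", "value"])
        yield df["value"].mean()

    rdd = sc.textFile("hdfs:///data/example.csv")  # hypothetical input path
    print(rdd.mapPartitions(partition_means).collect())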

The way the code is actually designed, it's a library first and a CLI second. There are very basic primitives in here right now: creating a cluster, destroying one, some management of what clusters you have (I often have many clusters running, a dev one, a production one, a cluster for another group; perhaps you might only have one), some configuration management around the providers you have and the profiles we'll speak to in a minute, and then some nice things like easy SSH, being able to easily connect to the head node. At first we were building things along the command line and wanted to expose a lot of those options as flags, and eventually we took all of that out; now we actually encourage users to use documents to store configuration, to store the state of what their cluster should be. In this case they're all YAML configuration files. If you want to launch on AWS US East, you provide your authentication, key name, possibly a security group (if one isn't supplied, we'll build one for you), the path to the private key, and then your particular zone. Similarly, if you wanted to build one for DigitalOcean it would look very similar, but instead of US East you'd say New York 1, plus the particular cloud provider. All of this is codified in a providers file, and I'll show a little bit of that on the command line in a second. Profiles then reference these provider entries, and having the split between the two makes it very easy to share profiles without sharing authentication: I often want to say, go create this cluster, here's what the cluster looks like, without needing to share my authentication or bake it into my configuration. So I have a Spark profile; it has a provider, a number of nodes, the node ID, the node type, and we're also outlining things like plugins, so that very soon you'll be able to pick and choose whether you want Hive or Impala or the IPython Notebook, and maybe the particular Conda environment you want in there. And in this example, here's another one, a one-node machine; this is something I built for Strata that's very beefy, an HVM box, so I can also manage just a single machine. The result of launching a cluster, or of attaching to a bare-metal one, is actually very simple: it's just a bit of metadata about when it was created and the machines you're managing, and in this case we assume homogeneous machines with the same SSH key and user name. This machine was launched February 26, earlier this year; it has two compute nodes and one head node, the private key is in my home directory, it uses this particular provider called simple-aws, and the user was ubuntu. To bootstrap onto bare-metal machines, you just list the IPs, the private key to attach to those machines, and the user name, and you can then bootstrap those machines; so we can install not only on bare metal but in an air-gapped bare-metal situation. And then a lot of the mapping from what you can do locally with Conda to what you can do remotely on n nodes is housed in this manage subcommand. It's a little verbose, and we're working on reducing the amount of verbiage you need to spell these things: conda cluster manage, the cluster name, install, and then the packages that you want.
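A providers file and a profile of the kind Ben describes might look roughly like this; the file layout and field names below are an illustrative reconstruction, not the exact schema, and the credentials, AMI, and key paths are placeholders:

    # providers file (hypothetical schema): credentials live here, not in profiles
    simple-aws:
      cloud_provider: ec2
      keyname: my-keypair
      location: us-east-1
      private_key: ~/.ssh/my-keypair.pem
      secret_id: <your-aws-access-key>
      secret_key: <your-aws-secret-key>

    # a profile references the provider by name and describes the cluster shape
    name: spark_profile
    provider: simple-aws
    num_nodes: 3
    node_id: <base-ami-id>
    node_type: m1.large
    plugins: [hdfs, hive, impala, spark, notebook]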
Even saying all of that is a little tiring, so we're very conscious of how the CLI looks and how it feels. You can get info on your cluster just like you can get info locally; you can create new packages, sorry, you can create new environments; you can set a default environment; again there's easy SSH; and of course there's help. There are some advanced parts too. I got very tired of typing out the cluster name, so you can set a default; in this case the default is demo. You can run a command, in this case hostname, on all of the nodes.
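Put together, the everyday commands look roughly like the following; treat the exact flags and subcommand names as illustrative of the demo rather than definitive:

    # Inspect a cluster and hop onto its head node
    conda cluster info demo
    conda cluster ssh demo

    # Run an ad-hoc command on every node
    conda cluster run demo --all hostname

    # Mirror local conda usage remotely
    conda cluster manage demo list
    conda cluster manage demo install numpy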

You can also submit a Python file. For now the submit command is very naive: all it does is take your file, push it up to the head node, and call Python on it. But it turned out to be very effective for remotely submitting Python jobs against a Spark cluster: I can code up my Python file to call out to YARN (I'll demo this in a second), and it will go and create a Spark job running through YARN. And then a really small feature that's become very useful is conda cluster history, so I can look back at all the things I've done in the past. It became more useful than I thought it would be; it's especially handy when I've worked on Windows machines, which don't have a built-in history if you're using PowerShell or the DOS prompt. You can install Anaconda Cluster very easily from within the conda cluster channel. Right now we authenticate via token, so it's not open to the public quite yet; contact sales for access. I believe it will open up a few weeks from now; we're in limited release. Documentation is hosted on GitHub at continuumio.github.io/conda-cluster, and again, contact sales if you want more information.

All right, so how about a little bit of a demo. Can people see this terminal? Yes? Okay, thank you. (Can people still see my screen? Yes? Okay, sorry about that.) So, creating a cluster: I have one cluster running, and I also have some VMs that I'm testing. To create a cluster running on EC2, it's conda cluster create, the name of the cluster, demo_test, and then the profile we want; in this case I have a profile called medium_test. Maybe we should look at what that looks like. I actually have a lot of profiles defined, and you can see all these different things: a medium test, a small test, something I test against Amazon, a Strata profile, one that's CentOS, one that's designed for Spark in production. I just want to start a medium cluster, though I guess it should be called large_test, since this is an m1.large on EC2. In this case the security group is all open, but you can provide whatever security group you want, or we'll build one for you. This will go through the motions of provisioning the nodes and then installing Anaconda and all the Hadoop pieces. Right now we're installing against the CDH distribution, so we're installing CDH's versions of Hadoop, Hive, and Impala. The one place where we differ is actually with Spark: we can either build installs from CDH, or I actually have a Conda package for Spark as well, which has been really nice for some of our devs who want to play with the Spark 1.3 DataFrames. I was able to build a cluster very easily that has Spark 1.3 distributed throughout, rather than relying on CDH's version of Spark, so I can move a little faster in that regard. This will take around ten minutes. You can see it's installing Anaconda right now; actually, it's Miniconda, because we try to be conservative about space on EC2, where things can get a little problematic if you're trying to install into an instance store or an EBS volume,
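The launch step itself is a one-liner. The cluster and profile names below come from the demo; whether the profile is passed positionally or via a flag is an assumption:

    # Launch a new cluster from a saved profile, then list known clusters
    conda cluster create demo_test large_test
    conda cluster list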

and then it pulls the latest updates from our own Conda repositories. We'll come back to this. Are there any questions before I keep going? I've been talking for a little bit and I realize I haven't paused. If not, I'll just keep going.

Okay, I'll keep going. As I said, this cluster's name is demo_test, but I also have this other cluster that I launched earlier, called demo, and you can see that demo_test has now been added to our list. One thing I can do, as I said before, is just SSH onto the node. It has Python already; if I run conda list here, it will show me the currently installed packages, and conda info shows me a little bit about the setup, just on the head node, and the particular environments installed. Similarly, I can do those same operations across the entire cluster with the manage subcommand. In this case it's going to run on every single node and then report back. Now, this isn't exactly useful information, and this is an area where the entire team is trying to massage things: if I scroll up you can see each node reported back, which is nice, but when I run across a hundred nodes or a thousand nodes that becomes incomprehensible. So we're building out systems that run against every node, bring everything back to the head node, and report back only if there's a difference; that's infinitely more useful than all this information flowing back. Invoking it with the quiet option will show us which environments we have installed. If I want to install a package, or I guess in this case maybe I should list first, I can list all the packages that are installed: there will be a lot of Salt packages there, and then Spark, SQLite, Terminado and Tornado, things that support the IPython Notebook. To install, I just say install numpy. And this is also something we've been massaging: many times when I interact with a lot of nodes in my cluster I want to do something and forget about it, I don't want a lot of information flowing back to me. So in this case it installed numpy, but you don't get a lot of the warm fuzzies, the reassuring information that it did install. If you do want information back, in this case installing a specific version, numpy 1.6, you can always stream data back: it will pipe standard out back to my terminal, and you can see there's already almost too much information about what's happening, since this happens on every single node as it tries to install numpy 1.6. So feel free to do it either way. You can also create new environments. If I want to create a new environment, everything after this point is the same as if you were doing it locally: create, dash n, the name of the environment, the particular Python (in this case Python 3 instead of Python 2.7), and install pandas on Python 3. This will take a little bit of time, not too much. Here I'll pass the verbose flag, which, instead of streaming output back, will collect all the standard out and present it in a way that's a little more organized. Again, it's not the kind of thing you want to do if you're running against 100 nodes or a thousand nodes,
and that's really my metric for whether things are working correctly: does this work for one node, for 100 nodes, and for a thousand? This will build a new environment and install Python 3 and pandas. It will take a little bit of time, because none of these packages are cached; they all have to be pulled again from the Continuum repositories.
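The remote environment workflow mirrors local conda almost exactly; a sketch of the steps just described, where the set-default subcommand name is a guess rather than the documented one:

    # Create a Python 3 environment with pandas on every node
    conda cluster manage demo create -n py3 python=3 pandas

    # Make it the default environment cluster-wide (subcommand name assumed)
    conda cluster manage demo set-default-env py3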

Let's check back in with our other cluster. You can see that it installed and started the Salt minions (again, this is how we're setting state across the cluster), installed Java, and it's still working on bootstrapping HDFS. Again, are there any questions right now? Is this all making sense to people?

James: Oh, sorry, I was going to say that Ivan's been collating the questions, so if anyone has questions, use the chat button, and I'm sure he'll jump in and ask them as appropriate.

Ben: Okay, thanks. If you scroll back up you can see that it installed all these things into the various configuration spots in the cluster. And now, if I go back to that earlier command that listed all the environments on the cluster, you can see I have this new environment here, but it's not the default one, so I set the default environment. This can be useful if you want to support multi-tenancy, or if you don't want to completely hose your production environment while you try something; it's also how we were able to install both Spark 1.3 and 1.2 in the same cluster. Would people care to see that? No? That's okay. Some of the other things I referenced in the slides: the run command can be useful as a debugging or dev tool. By default it runs only on the head node, but passing the --all flag will run on the head node and all the compute nodes; that can be nice if I need to mount a drive, or, in this case, to confirm that something we pushed to HDFS was pulled onto the nodes. Let's check back in here: almost done. We have YARN and ZooKeeper installing now, and the IPython Notebook is already started. If I list the instances, you can see the head node and its address; you could connect to those instances yourself, but now you can manage those services (Notebook, YARN, HDFS, those kinds of things) and easily connect to them and restart them. To connect, and this is just something that becomes helpful from the command line so I don't have to copy and paste all the IPs, conda cluster connect ipython will just bring up the IPython Notebook here. Then I can create a notebook; in this case it's Python 3, in IPython, which is soon going to be renamed Jupyter, and I can start a Spark job from inside here, or launch one from the command line. Perhaps I should launch one from the command line first. Let me get out of here; this is a very naive, improvised Spark job. We tell Spark to use YARN, and then we spread a thousand numbers, chopped into 100 partitions, across the cluster and do a map across all that data. It's embarrassingly parallel: all the map does is print out the hostname of whatever node it ran on, and then we collect all the distinct hostnames returned and print them out.
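Reconstructed from that description, the submitted script is along these lines; a sketch, with the YARN configuration details simplified to the Spark 1.x yarn-client master string:

    # Naive Spark-on-YARN job: report which nodes did the work
    import socket
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setMaster("yarn-client").setAppName("hostname-demo")
    sc = SparkContext(conf=conf)

    # 1000 numbers in 100 partitions; each task returns its node's hostname
    hosts = (sc.parallelize(range(1000), 100)
               .map(lambda _: socket.gethostname())
               .distinct()
               .collect())
    print(hosts)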

Again, we're trying to find the balance between older styles of working with HPC machines and newer styles where I want information back at my command line. By default we don't return data, so I can launch my job and forget about it, without worrying about where the output is going; if I do want data back, I can always stream it. Sorry, submit: this takes the Python file, uploads it to the head node, and calls Python on it. This will take a little bit of time because the cluster is still doing a bunch of things, and it's going to default to running as the HDFS user; for those of you who have played in HDFS and Spark land, permissions can become somewhat problematic, so we try to run everything through the HDFS user. If we look at the application tracker, you can see there's a currently running job, and now it's just reducing everything, and the debug output from Spark has come back. In this case it printed out both compute nodes' hostnames; on EC2 the hostname is typically the internal IP. And this other cluster is almost done; it's bootstrapping Hive. Unless there are any questions, I'd be happy to show one more thing.

James: Thank you, Ben. I think Ivan has a couple of questions that came in through the chat function.

Ivan: Yeah, Ben, I have some general questions here. I have one from John in San Jose, who wanted to know how much time this tool shaved off of your workflow.

Ben: It saves quite a bit. I can go and launch a cluster and then forget about it for a little while, keep doing some other tasks, and then context-switch back to it. In the past, when I've needed to bootstrap machines with Anaconda or install various packages, I've run through any number of ridiculous schemes to get those things on, which actually take a lot of time to get working, and they're not really robust. When I needed to work against a Sun Grid Engine setup and get particular versions of NumPy, I would write no-op Sun Grid Engine functions that would download Anaconda and then install NumPy on all the nodes; that's kind of silly, and it's not repeatable. So Anaconda Cluster actually saves quite a bit of time: as I've shown you, I can just install any package I want from the Anaconda repo, or even from the Blaze channel, or any Binstar channel for that matter. I can install the latest Blaze if I want, so I have all of Blaze on every node in the cluster, and it installs all the defaults as well: pandas and Bokeh and, I think, Numba and things like that. We're also building out some nice things around working with GPUs. It's very popular to look into deep learning and run neural networks on GPUs, so we've been playing with Theano and PyCUDA: you can very quickly spin up a GPU cluster, install the Python packages that machine-learning experts are used to playing with, and then run on those very powerful clusters.
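Channel-based installs across the cluster would look something like this sketch; the Blaze channel is the one named in the answer, while passing -c through the manage subcommand is an assumption about the CLI:

    # Install from the Blaze channel on Binstar across every node
    conda cluster manage demo install -c blaze blaze

    # Any Binstar channel works the same way (placeholder names)
    conda cluster manage demo install -c some-user some-package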

Ivan: Great, thanks, Ben. I have another question here from Denson from New Orleans; he wants to know if we have support for IPython Notebook and IPython Parallel.

Ben: Yes. By default the IPython Notebook is installed; this is the IPython Notebook running on the head node of demo, so it's there. If I wanted to install it on all the nodes I could, but by default you get it baked in on the head node. IPython Parallel is coming; a lot of people have asked for it, so it's coming shortly, and it's definitely on our radar.

Ivan: Great, thanks. And finally I have a question about the cost, but I believe Jim will cover that in one of the next few slides, so I'll let Jim answer that question.

Ben: I guess before that, one thing I didn't show, just so you can see there's nothing up my sleeve: tmux. Sorry, tmux is a useful tool for starting a session and being able to disconnect from it. I'm just going to show you that I have Hive up and running on the head node; the Impala shell is often run on a compute node, so I'll SSH to a compute node. I'll create a table, foo, in Hive; it's nothing fancy. Then, in the Impala shell on the compute node, if I run show tables, well, with Impala you have to invalidate the metadata first, and now we have this table foo. Hive and Impala have become very useful for the Blaze team as a place to benchmark how these various SQL analytics engines work over standard CSVs, or Parquet files, or even HDF5 files for that matter. (A rough sketch of this Hive/Impala exchange follows below.) Are there any other questions?

Ivan: There are no further questions, Ben.

Ben: Okay, in that case I think I'm done. If there are no other questions, I'll come back to the slides. Jim?

James: Yes. Thanks, Ben, that was a great demo, and it gave people an opportunity to see the product in action. As Ivan mentioned, there was one question on pricing. We price it per cluster, I'm sorry, per node: a thousand dollars per node. We do have special pricing for AWS and other cloud deployments, and we also offer a 30-day trial for customers who want to test it out for that period of time. Additionally, we have priority support available for clients: it gives you email and phone support, priority issue handling, and 24/7 access to the issue tracker. It's roughly three thousand dollars, and it allows you to put three named users into an email alias that can contact us, and vice versa, 24/7 for priority handling. So it's pretty simple from a pricing perspective, really just a thousand dollars per node, and as mentioned, there's an opportunity to do trials for a short period of time.
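For reference, the Hive/Impala exchange in the demo went roughly like this; the table name comes from the demo, the column definition is a placeholder:

    -- In the Hive shell on the head node: create a trivial table
    CREATE TABLE foo (x INT);

    -- In impala-shell on a compute node: refresh Impala's view of the
    -- Hive metastore, then the table appears
    INVALIDATE METADATA;
    SHOW TABLES;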

If there are no other questions, that pretty much concludes the webinar and demo. Thank you very much, Ben, for your time running through the product; it was a live demo and it went off pretty flawlessly. Again, it's myself, James McCarthy, here on the East Coast, and Ivan Mills; here's the contact information to reach myself or Ivan if you have any questions, follow-ups, or concerns. We'd love to hear from you, and we'll conclude at this point. Since you did register for the webinar today, we do have your contact information, so myself or Ivan will probably reach out to you via email or phone to see how things went from your perspective and follow up with any questions you may have about the Anaconda Cluster product. Again, thanks for your time, have a great day, and we look forward to talking to you soon.

Ben: Sorry, I missed these other questions. Is Jason still there? Can you type in the chat? Okay, yeah, I see: the contact information is not on the slide. I see a lot of questions here.

Ivan: Yeah, Ben, I just saw that too. Jason, it looks like you might still be online, so maybe you can go ahead and answer in the chat; if not, we can follow up with you.

Ben: Jason, I'd be happy to talk with you, maybe offline, later today or tomorrow; just contact Ivan or Jim. Thank you again.