Intro to Machine Learning: Lesson 1

Okay, so let me introduce everybody to everybody else first of all. We’re here at the University of San Francisco learning machine learning, or you might be at home watching this on video. So, hey, everybody, wave: here are the University of San Francisco graduate students. Thank you, everybody. And wave back from the future and from home to all the students here.

If you’re watching this on YouTube, please stop, and instead go to the course website and watch it from there. There’s nothing wrong with YouTube, but I can’t edit these videos after I’ve created them, so I need to be able to give you updated information about what environments to use and how the technology changes, and for that you need to go here. You can also watch the lessons from here; here are lots of lessons and so forth. So that’s tip number one for the video.

Tip number two for the video: because I can’t edit them, all I can do is add these things called cards. Cards are little things that appear in the top right-hand corner of the screen. By the time this video comes out I’m going to put a little card there for you to click on and try out. Unfortunately, they’re not easy to notice, so keep an eye out for them, because that’s where important updates to the video are going to be.

All right, so welcome. We’re going to be learning about machine learning today. Everybody in the class here, you all have Amazon Web Services set up, so you might want to go ahead and launch your AWS instance now, or go ahead and start a Jupyter notebook on your own computer. If you don’t have Jupyter Notebook set up, then what I recommend is you go to Crestle, sign up, and sign in. You can then turn off “Enable GPU” and click “Start Jupyter”, and you’ll have a Jupyter notebook instantly. That costs you some money.
It’s three cents an hour. So if you don’t mind spending three cents an hour to learn machine learning, here’s a good way. I’m going to go ahead and say “Start Jupyter”. Whatever technique you use, one of the things you’ll find on the website is links to lots of information about the costs, benefits, and approaches to setting up lots of different environments for Jupyter Notebook, both for deep learning and for regular machine learning. Check them out, because there are lots of options.

So if I then open Jupyter in a new tab, here I am, in Crestle or on AWS or on your own computer. We use the Anaconda Python distribution for basically everything. You can install that yourself, and again, there’s lots of information on the website about how to set that up. We’re also assuming that either you’re using Crestle or there’s something else which I really like called Paperspace, which is another place you can fire up a Jupyter notebook pretty much instantly. Both of these already have all of the fastai stuff pre-installed for you, so as soon as you open up Crestle or Paperspace (assuming you chose the Paperspace fast.ai template) you’ll see that there’s a fastai folder.

If you are using your own computer or AWS, you’ll need to go to our GitHub repo, fastai/fastai, and clone it, and then you’ll need to do a conda env update to install the libraries. Again, that’s all information we’ve put on the website, and we’ve got some previous workshop videos to help you through all of those steps. So for this class I’m assuming that you have a Jupyter notebook running.
So here we are in the Jupyter notebook. If I click on fastai, that’s what you get if you git clone, or if you’re in Crestle, and you can see our repo here. All of our lessons are inside the courses folder, and machine learning part one is in the ml1 folder. If you’re ever looking at my screen and wondering “where are you?”, look up here and you’ll see that it tells you the path: fastai/courses/ml1. Today we’re going to be looking at lesson 1, random forests, so here is lesson1-rf.

So there are a couple of different ways you can do this, both here in person and on the video. You can either attempt to follow along as you watch, or you can just watch and then follow along later with the video. It’s up to you. I would have a loose recommendation to watch now and follow along with the video later, just because it’s quite hard to multitask, and if you’re working on something you might miss a key piece of information (which you’re welcome to ask about). If you follow along with the video afterwards, you can pause, stop, experiment, and so forth. But you can choose either way. I’m going to go to View -> Toggle Header, View -> Toggle Toolbar, and then full-screen it to get a bit more space.

The basic approach we’re going to be taking here is to get straight into code and start building models, not to look at theory. We’re going to get to all the theory, but at the point where you deeply understand what it’s for, and at the point where you’re able to be an effective practitioner. So my hope is that you’re going to spend your time focusing on experimenting: take these notebooks, try different variations of what I show you, try it with your own datasets. The more coding you can do, the better; the more you’ll learn. All of my students who have gone away and spent time studying books of theory rather than coding have told me they learned less machine learning, and they often tell me they wish they’d spent more time coding.

A lot of the stuff that we’re showing in this course has never been shown before. This is not a summary of other people’s research.
This is more a summary of 25 years of work that I’ve been doing in machine learning, so a lot of this is going to be shown for the first time. That’s kind of cool, because if you want to write a blog post about something that you learn here, you might be writing something that a lot of people find super useful. So there’s a great opportunity to practice your technical writing, and here are some examples of good technical writing. By showing people this stuff, it’s not like “Hey, I just learned this thing, I bet you all know it”; often it’ll be “I just learned this thing, I’m going to tell you about it, and other people haven’t seen it.” In fact, this is the first course ever that’s been built on top of the fastai library, so even just the stuff in the library is going to be new to everybody.

When we use a Jupyter notebook, or anything else in Python, we have to import the libraries that we’re going to use. Something that’s quite convenient is that if you use these two autoreload commands at the top of your notebook, you can go in and edit the source code of the modules and your notebook will automatically update with those new modules. You won’t have to restart anything, so that’s super handy. Then, to show your plots inside the notebook, you want matplotlib inline. So these three lines appear at the top of all of my notebooks.

You’ll notice when I import the libraries that, for anybody here who is an experienced Python programmer, I am doing something that would be widely considered very inappropriate: I’m importing star (*). Generally speaking, in software engineering we’re taught to figure out specifically what we need and import those things. The more experienced you are as a Python programmer, the more extremely offensive practices you’re going to see me use. For example, I don’t follow what’s called PEP 8, which is the normal style of code used in Python.
So I’m going to mention a couple of things. First: go along with it for a while. Don’t judge me just yet. There are reasons that I do these things, and if it really bothers you, then feel free to change it. But the basic idea is that data science is not software engineering. There’s a lot of overlap: we’re using the same languages, and in the end these things may become software engineering projects. But what we’re doing right now is prototyping models, and prototyping models has a very different set of best practices that are taught basically nowhere. They’re not even really written down. The key is to be able to do things very interactively and very iteratively. So, for example,

“from library import *” means you don’t have to figure out ahead of time what you’re going to need from that library. It’s all there. Also, because we’re in this wonderful interactive Jupyter environment, it lets us understand what’s in the libraries really well. For example, later on I’m using a function called display, so an obvious question is: what is display? You can just type the name of a function and press Shift+Enter (remember, Shift+Enter runs a cell) and it will tell you where it’s from. So any time you see a function you’re not familiar with, you can find out where it’s from. Then, if you want to find out what it does, put a question mark at the start, and here you have the documentation.

This is particularly helpful for the fastai library. In the fastai library I try to make as many functions as possible no more than about five lines of code, so it’s going to be really easy to read. If you put a second question mark at the start, it shows you the source code of the function: all the documentation plus the source code. So nothing has to be mysterious. The other library we’ll use a lot is scikit-learn, which implements a lot of machine learning stuff in Python.
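Outside a notebook, the same three questions (where is it from, what does it do, what is its source) can be answered with the standard library. The sketch below is only a rough plain-Python analogue of typing a name, `?name`, and `??name` in Jupyter; `json.dumps` is an arbitrary stand-in, not a function from the lesson.

```python
import inspect
import json  # arbitrary example module; any pure-Python function would do

func = json.dumps

# Roughly what evaluating a bare function name tells you: where it lives.
print(func.__module__, func.__qualname__)

# Roughly what `?func` shows in Jupyter: the signature and the docstring.
signature = inspect.signature(func)
doc = inspect.getdoc(func)
print(signature)
print(doc.splitlines()[0])

# Roughly what `??func` adds: the source code itself.
source = inspect.getsource(func)
print(source[:200])
```

`inspect.getsource` only works when the source file is available, which is the case for pure-Python libraries but not for functions implemented in C.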
The scikit-learn source code is often pretty readable, so very often if I want to really understand something, I’ll just type “??” and the name of the scikit-learn function and go ahead and read the source code. As I say, the fastai library in particular is designed to have source code that’s very easy to read, and we’re going to be reading it a lot.

All right, so today we’re going to be working on a Kaggle competition called Blue Book for Bulldozers. The first thing we need is to get that data, so if you search for “kaggle bulldozers” you can find it. Kaggle competitions allow you to download a real-world dataset, a real problem that somebody’s trying to solve, and solve it according to a specification that the actual person with that actual problem decided would be actually helpful to them. So these are pretty authentic experiences for applied machine learning. Now, of course, you’re missing all the bits that went before: why did this startup decide that predicting the auction sale price of bulldozers was important? Where did they get the data from? How did they clean the data? That’s all important stuff as well, but the focus of this course is really on what happens next, which is how you actually build the model.

One of the great things about working on Kaggle competitions, whether they’re running now or whether they’re old ones, is that you can submit yours to the leaderboard. Even for old, closed competitions you can submit to the leaderboard and find out how you would have gone. There’s really no other way in the world of knowing whether you’re competent at this kind of data and this kind of model than doing that. Otherwise, if your accuracy is really bad, is it because this is just very hard, that it’s just not possible, that the data is so noisy you can’t do better? Or is it actually that it’s an easy dataset and you’ve made a mistake?
When you finish this course and apply this to your own projects, this is going to be something you’ll find very hard, and there isn’t a simple solution to it: you’re now using something that hasn’t been on Kaggle, your own dataset, so do you have a good enough answer or not? We’ll talk about that more during the course, and in the end we just have to know that we have good, effective techniques for reliably building baseline models. Otherwise there’s really no way to know. There’s no way, other than creating a Kaggle competition or getting a hundred top data scientists to work on your problem, to really know what’s possible. So Kaggle competitions are fantastic for learning, and as I’ve said many times, I’ve learned more from competing in Kaggle competitions than from everything else I’ve done in my life.

To compete in a Kaggle competition you need the data. This one’s an old competition, so it’s not running now, but we can still access everything. First of all we want to understand what the goal is.

And I suggest that you read this later, but basically we’re going to try to predict the sale price of heavy equipment. One of the nice things about this competition is that, if you’re like me, you probably don’t know very much about heavy industrial equipment auctions. I actually know more than I used to, because my toddler loves building equipment, so we watch YouTube videos about front-end loaders and forklifts. But two months ago I was a real layman. Machine learning should help us understand a dataset, not just make predictions about it, so picking an area we’re not familiar with is a good test of whether we can build an understanding. Otherwise, what can happen is that your intuition about the data makes it very difficult for you to be open-minded enough to see what the data really says.

It’s easy enough to download the data to your computer. You just click on the dataset, so here is train.zip, and click Download. You can go ahead and do that if you’re running on your own computer right now. If you’re running on AWS it’s a little bit harder, because unless you’re familiar with text-mode browsers like elinks or lynx, it’s quite tricky to get the dataset from Kaggle. So there are a couple of options. One is you could download it to your computer and then scp it to AWS; scp works just like ssh, but it copies data rather than logging in.

I’ll show you a trick, though, that I really like, and it relies on using Firefox. For some reason Chrome doesn’t work correctly with Kaggle for this. So I go in Firefox to the website, and what we’re going to do is use something called the JavaScript console. Every web browser comes with a set of tools for web developers to help them see what’s going on. You can hit Ctrl+Shift+I to bring up the web developer tools, and one of the tabs is Network. Then if I click on train.zip and click Download (I’m not even going to download it, I’m just going to say Cancel), you’ll see that down here it’s shown me all of the network connections that were just initiated. Here’s one which is downloading a zip file from storage.googleapis.com, blah blah blah. That’s probably what I want. That looks good. What you can do is right-click on that and say Copy, Copy as cURL. curl is a Unix command, like wget, that downloads stuff. So if I choose Copy as cURL, that’s going to create a command that has all of my cookies, headers, everything necessary to download this authenticated dataset. If I now go into my server and paste that, you can see a really, really long curl command.

One thing I notice is that at least recent versions have started adding this “--2.0” thing to the command, and that doesn’t seem to work with all versions of curl. So something you might want to do is pop the command into an editor, find that “2.0”, get rid of it, and then use that instead.

Now, one thing to be very careful about: by default curl downloads the file and displays it in your terminal. If I try to display this, it’s going to dump gigabytes of binary data into my terminal and crash it. To say that I want to output to a file instead, I always type -o for the output file name and then the name of the file, bulldozers, dot, and make sure you give it a suitable extension. So in this case

the file is bulldozers.zip. There it is. So I can make a directory called bulldozers and move my zip file into there (oops, wrong way around; thank you). If you don’t have unzip installed, you may need to run sudo apt install unzip, or if you’re on a Mac that would be brew install unzip. If brew doesn’t work, you haven’t got Homebrew installed, so make sure you install it. And then unzip. So those are the basic steps.

One nice thing is that if you’re using Crestle, most of the datasets should already be pre-installed for you. Here’s a cool trick in Jupyter: you can open a new tab, say New -> Terminal, and get a web-based terminal. You’ll find on Crestle there’s a /datasets folder, with /datasets/kaggle and /datasets/fastai, and often the things you need are going to be in one of those places. (Paperspace should have most of them as well.)

Assuming we don’t have it already downloaded, we would need to go to fastai and into the courses machine learning folder. What I tend to do is put all of my data for a course into a folder called data. If you’re using git, you’ll find that it doesn’t get added to git because it’s in the .gitignore, so don’t worry about creating the data folder; it’s not going to screw anything up. So I generally make a folder called data, and then I create folders for everything I need in there. In this case I’ll make bulldozers and cd into it (remember, the last word of the last command is !$). I’ll grab that curl command again, and then unzip bulldozers. There we go.

You can see that anything that might change from person to person I put in a constant. So here I’ve just defined something called PATH, and if you’ve used the same path I just did, you can go ahead and run that and keep moving along. We’ve now got all of our libraries imported, and we’ve set the path to the data.

You can run shell commands from within Jupyter Notebook by using an exclamation mark. So if I want to check what’s inside that path, I can run !ls data/bulldozers, and you can see that works. You can even use Python variables: if you use a Python variable inside a Jupyter shell command, you have to put it in { }. So that makes me feel good that my path is pointing at the right place. If you say !ls {PATH} and you get nothing at all, then you’re pointing at the wrong spot.

Yes? Yeah, so the curly brackets relate to the fact that I put an exclamation mark at the front, which means the rest of this is not a Python command, it’s a bash command, and bash doesn’t know about capital PATH because capital PATH is part of Python. So this is a special Jupyter thing which says: expand this Python thing, please, before you pass it to the shell.
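The same sanity check on PATH can be written in plain Python, which is handy outside Jupyter where the `!ls {PATH}` syntax is not available. This is a minimal sketch: the temporary directory and the `train.csv` file are stand-ins created only so that the example runs anywhere, and `list_data_dir` is a hypothetical helper, not part of fastai.

```python
import tempfile
from pathlib import Path

# Stand-in for the lecture's data folder so the sketch runs anywhere;
# in the notebook PATH would point at something like data/bulldozers/.
tmp = tempfile.TemporaryDirectory()
PATH = Path(tmp.name) / "data" / "bulldozers"
PATH.mkdir(parents=True)
(PATH / "train.csv").write_text("SalesID,SalePrice\n1,66000\n")


def list_data_dir(path):
    """Return the sorted file names under path, or fail loudly if it is wrong."""
    path = Path(path)
    if not path.is_dir():
        raise FileNotFoundError(f"{path} does not exist; check PATH")
    return sorted(p.name for p in path.iterdir())


files = list_data_dir(PATH)
print(files)
```

An empty list, or the error, plays the same role as `!ls {PATH}` printing nothing: the constant is pointing at the wrong spot.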

Good question, thank you. So the goal here is to use the training set, which contains data through the end of 2011, to predict the sale price of bulldozers. The main thing to start with, then, is of course to look at the data. The data is in CSV format, so one easy way to look at it would be to use the shell command head to look at the first few lines of the file in the bulldozers folder. Even tab completion works here; Jupyter does everything. So here are the first few lines: there’s a bunch of column headers, and then there’s a bunch of data. That’s pretty hard to look at, so what we want to do is take this and read it into a nice tabular format.

Does Terence putting his glasses on mean I should make this bigger, or is this a big enough font size, everybody? Okay.

This kind of data, where you’ve got columns representing a wide range of different types of things, such as an identifier, a value, a currency, a date, a size, I refer to as structured data. I say “I refer to this as structured data” because there have been many arguments in the machine learning community on Twitter about what structured data is.
Weirdly enough, this is the most important distinction in machine learning: between data that looks like this and data like images, where every column is of the same type. Yet we don’t have standard accepted terms. So I’m going to use the terms structured and unstructured, but note that other people you talk to, particularly in NLP, use “structured” to mean something totally different. When I refer to structured data, I mean columns of data that can have varying types of data in them.

By far the most important tool in Python if you’re working with structured data is pandas. Pandas is so important that it’s one of the few libraries for which everybody uses the same abbreviation, which is pd. One of the things I’ve got here is from fastai.imports import *. The fastai.imports module has nothing but imports of a bunch of hopefully useful tools. All of the code for fastai is inside the fastai directory inside the fastai repo, so you can have a look at imports and you’ll see it’s literally just a list of imports, and you’ll find pandas as pd there. Everybody does this, so when you see lots of people using pd.something, they’re always talking about pandas.

Pandas lets us read a CSV file. When we read the CSV file, we just tell it the path to the CSV file and a list of any columns that contain dates, and I always add low_memory=False, which makes it read more of the file to decide what the types are.

This here is something called a Python 3.6 format string. It’s one of the coolest parts of Python 3.6. You’ve probably used lots of different ways in the past of interpolating variables into your strings in Python. Python 3.6 has a very simple way that you’ll probably always want to use from now on: you create a normal string, and you type an f at the start.
Then, if I define a variable called name, I can say “Hello”, curlies, name. This is kind of confusing: these are not the same curlies that we saw earlier in that ls command. That ls command is specific to Jupyter and interpolates Python code into shell code. These curlies are Python 3.6 format string curlies, and they require an f at the start. If I get rid of the f, it doesn’t interpolate; the f tells it to interpolate. The cool thing is that inside the curlies you can write just about any Python code you like, so for example name.upper() gives “Hello JEREMY”. I use this all the time.
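The format-string behaviour described here is easy to check in a few lines. The values below are illustrative, and the `PATH` string and `train.csv` file name are assumptions about how the notebook builds the CSV path rather than a quote from it.

```python
name = "Jeremy"
age = 43  # an integer: no str() conversion is needed inside an f-string

plain = "Hello {name}"              # no f prefix: the braces stay literal
greeting = f"Hello {name}"          # f prefix: the variable is interpolated
shouted = f"Hello {name.upper()}"   # any expression can go inside the braces
with_int = f"{name} is {age}"       # works for non-string values too

# Assumed usage: building the CSV path from a PATH constant.
PATH = "data/bulldozers/"
csv_path = f"{PATH}train.csv"

print(plain, greeting, shouted, with_int, csv_path, sep="\n")
```

By contrast, `"Jeremy is " + 43` raises a `TypeError`, which is the complaint about string concatenation with integers.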

And because it’s a format string, it doesn’t matter if the thing is (I always forget my age; I think I’m 43) an integer. Normally, if you try to do string concatenation with integers, Python complains. No such problem here.

So this is going to read PATH/train.csv into a thing called a data frame. Pandas data frames and R’s data frames are pretty similar, so if you’ve used R before you’ll find this reasonably comfortable. This file is 9.3 meg (sorry, 112 meg) and it has 400,000 rows in it, so it takes a moment to import. Once that’s done, we can type the name of the data frame, df_raw, and use various methods on it. For example, df_raw.tail() will show us the last few rows of the data frame. By default it shows the columns along the top and the rows down the side, but in this case there are a lot of columns, so I’ve added .transpose() to show it the other way around.

I’ve created one extra function here, display_all. Normally, if you just type df_raw and it’s too big to show conveniently, it truncates it and puts a little “…” in the middle. The details don’t matter, but this just changes a couple of settings to say: even if it’s got a thousand rows and a thousand columns, please still show the whole thing.

So this has finished, and I can actually show you that. This is really cool: in Jupyter Notebook you can type a variable of almost any kind (a video, HTML, an image, whatever) and it’ll generally figure out a way of displaying it for you. In this case it’s a pandas data frame, and it figures out a way of displaying it for me. You can see here that by default it doesn’t show me the whole thing. So here’s the dataset. We’ve got a few different rows.
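A minimal sketch of the read, tail, transpose, and display steps follows. The tiny inline CSV and its values are made up so that the example runs without the competition data, and this `display_all` is an approximation that uses `print` (the notebook version relies on Jupyter's display machinery).

```python
import io

import pandas as pd

# Tiny made-up CSV standing in for the real training file.
csv_text = """SalesID,SalePrice,saledate,ProductSize
1,66000,2006-11-16,
2,57000,2004-03-26,Medium
3,10000,2004-02-26,
"""

df_raw = pd.read_csv(
    io.StringIO(csv_text),    # the notebook passes a file path here instead
    parse_dates=["saledate"],  # columns that should be parsed as dates
    low_memory=False,          # read the whole file before inferring column types
)


def display_all(df):
    """Print df with display limits raised so nothing is elided."""
    with pd.option_context("display.max_rows", 1000, "display.max_columns", 1000):
        print(df)


# Last rows, transposed so that columns run down the side.
last_rows = df_raw.tail(2).T
display_all(last_rows)
```

`.T` is shorthand for `.transpose()`; with dozens of columns, the transposed view is the one that fits on screen.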
This is the last bit, the tail of it, the last few rows. This is the thing we want to predict: the price. We call this the dependent variable. The dependent variable is the price, and then we’ve got a whole bunch of things we could predict it with. When I start with a dataset, I tend... Yes, Terence? How can I get this to you?

“Hello, Jeremy.” Hi, Terence. “I’ve read in books that you should never look at the data because of the risk of overfitting. Why do you start by looking at the data?”

Yeah, so I was going to mention that I actually kind of don’t. I want to find out at least enough to know that I’ve managed to import it okay, but I tend not to really study it at all at this point, because I don’t want to make too many assumptions about it. I would actually say most books say the opposite; most books do a whole lot of exploratory data analysis (EDA) first. “The academic books I’ve read say that that’s one of the biggest risks of overfitting, but the practical books say, let’s do some EDA first.” Yeah, so the truth is kind of somewhere in between. I generally try to do machine-learning-driven EDA, and that’s what we’re going to learn today.

The only thing I do care about, though, is the purpose of the project, and for Kaggle projects the purpose is very easy to find: there’s always an Evaluation section. How is it evaluated? This one is evaluated on root mean squared log error. That means they’re going to look at the difference between the log of our prediction of the price and the log of the actual price, and then they’re going to square it and add them up. Because they’re going to be focusing on the difference of the logs, we should focus on the logs as well. This is pretty common for a price. Generally you care not so much about “did I miss by ten dollars” but “did I miss by ten percent”. So if it was a million-dollar thing and you’re a hundred thousand dollars off, or if it’s a ten-thousand-dollar thing and you’re a thousand dollars off, often

we would consider those equivalent-scale issues. So for this auction problem the organizers are telling us they care about ratios more than differences, and so the log is the thing we care about. The first thing I do is take the log.

Now, np is NumPy. I’m assuming that you have some familiarity with NumPy. If you don’t, we’ve got a video called the deep learning workshop, which actually isn’t just for deep learning; it’s really for this as well. One of the parts there, which we’ve got a time-coded link to, is a quick introduction to NumPy. Basically, NumPy lets us treat arrays, matrices, vectors, and high-dimensional tensors as if they’re Python variables, and we can do stuff like take the log of them and it’ll apply it to everything.

NumPy and pandas work together very nicely. In this case df_raw.SalePrice is pulling a column out of a pandas data frame, which gives us a pandas Series that shows us the sale prices and their indexes, and a Series can be passed to a NumPy function, which is pretty handy. You can see here how I can replace a column with a new column; pretty easy.

Okay, now that we’ve replaced the sale price with its log, we can go ahead and try to create a random forest. What’s a random forest?
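Before moving on to that, the log step and the metric can be made concrete. The sketch below uses two made-up prices and predictions that are each 10% too high; `rmse` is a hypothetical helper, and plain `log` is used as described in the lecture (whether the official metric adds 1 before taking logs is not stated here).

```python
import numpy as np
import pandas as pd

# Two made-up sale prices, a hundred times apart.
df = pd.DataFrame({"SalePrice": [10_000.0, 1_000_000.0]})

# Replace the column with its log; NumPy functions accept a pandas Series.
df["SalePrice"] = np.log(df["SalePrice"])


def rmse(pred, actual):
    """Root mean squared error between two array-likes."""
    diff = np.asarray(pred) - np.asarray(actual)
    return float(np.sqrt(np.mean(diff ** 2)))


# Predictions that are 10% too high for both items, also on the log scale.
log_pred = np.log(np.array([11_000.0, 1_100_000.0]))

# RMSE on logs is the root mean squared log error of the original prices.
err = rmse(log_pred, df["SalePrice"])
print(err)
```

Being \$1,000 off on the cheap item and \$100,000 off on the expensive one contributes the same log error, `log(1.1)`, which is the sense in which the metric cares about ratios rather than differences.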
We’ll find out in detail, but in brief a random forest is a kind of universal machine learning technique. It’s a way of predicting something that can be of any kind: it could be a category, like “is it a dog or a cat”, or it could be a continuous variable, like price. It can predict it with columns of pretty much any kind: pixel data, zip codes, revenues, whatever. In general it doesn’t overfit. It can, and we’ll learn to check whether it is, but it doesn’t generally overfit too badly, and it’s very easy to stop it from overfitting. You don’t need a separate validation set in general (we’ll talk more about this); it can tell you how well it generalizes even if you only have one dataset. It has few, if any, statistical assumptions: it doesn’t assume that your data is normally distributed, it doesn’t assume that the relationships are linear, and it doesn’t assume that you’ve specified the interactions. It requires very little feature engineering; for many different types of situation you don’t have to take the log of the data or multiply interactions together. In other words, it’s a great place to start. If your first random forest does very little that’s useful, then that’s a sign that there might be problems with your data, because it’s designed to work pretty much first off.

Can you please throw it over towards this gentleman? Thank you. “What about the curse of dimensionality with random forests?” Yeah, great question. So there’s this concept of the curse of dimensionality.
In fact there are two concepts I’ll touch upon, the curse of dimensionality and the no free lunch theorem. These are two concepts you often hear a lot about. They’re both largely meaningless and basically stupid, and yet I would say maybe the majority of people in the field not only don’t know that but think the opposite, so it’s well worth explaining.

The curse of dimensionality is the idea that the more columns you have, the more empty the space you create. There’s this fascinating mathematical idea, which is that the more dimensions you have, the more all of the points sit on the edge of that space. If you’ve just got a single dimension where things are random, then they’re spread out all over. Whereas if it’s a square, then for a point to be in the middle it has to avoid the edge of both dimensions, so it’s a little less likely that it’s not on an edge. With each dimension you add, it becomes multiplicatively less likely that the point isn’t on the edge of at least one dimension, so in high dimensions basically everything sits on the edge. What that means in theory is that the distance between points is much less meaningful. So if we assume that somehow matters, it would suggest that when you’ve got lots of columns and you just use them without being very careful to remove the ones you don’t care about, things won’t work. That turns out just not to be the case. It’s not the case for a number of reasons.
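The edge argument itself (not the conclusion people draw from it) can be checked numerically. The sketch below assumes points drawn uniformly from a unit hypercube and calls a point "on the edge" if any coordinate is within 5% of a boundary; both the 5% margin and the function names are illustrative choices, not from the lecture.

```python
import random


def edge_fraction(n_dims, margin=0.05):
    """Probability that a uniform point in the unit hypercube lies within
    `margin` of the boundary in at least one dimension."""
    # Each dimension is independently "interior" with probability 1 - 2*margin,
    # so the interior probability shrinks multiplicatively with n_dims.
    return 1.0 - (1.0 - 2.0 * margin) ** n_dims


def simulated_edge_fraction(n_dims, margin=0.05, n_points=5_000, seed=0):
    """Monte Carlo estimate of the same quantity."""
    rng = random.Random(seed)
    near_edge = 0
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        if any(x < margin or x > 1.0 - margin for x in point):
            near_edge += 1
    return near_edge / n_points


for d in (1, 2, 10, 100):
    print(d, round(edge_fraction(d), 4), round(simulated_edge_fraction(d), 4))
```

With this margin the fraction of edge points is 10% in one dimension and 19% in two, and it climbs towards 100% as dimensions are added, which is the "everything sits on the edge" effect.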

One is that the points still do have different distances away from each other. Just because they’re on the edge, they still vary in how far away they are from each other, so this point is more similar to this point than it is to that point. So even things like k-nearest neighbors, which we’ll learn about, actually work really, really well in high dimensions, despite what the theoreticians claimed. What really happened here was that in the 90s theory totally took over machine learning. In particular, there was this concept of support vector machines, which were theoretically very well justified and extremely easy to analyze mathematically, and you could prove things about them. We kind of lost a decade of real practical development, in my opinion, and all these theories became very popular, like the curse of dimensionality. Nowadays, and a lot of theoreticians hate this, the world of machine learning has become very empirical: which techniques actually work? And it turns out that in practice, building models on lots and lots of columns works really, really well.

The other thing to quickly mention is the no free lunch theorem. There’s a mathematical theorem by that name that you will often hear about, which claims that there is no type of model that works well for every kind of dataset. That is true, and obviously true if you think about it in the mathematical sense: any random dataset is by definition random, so there isn’t going to be some way of looking at every possible random dataset that’s in some way more useful than any other approach. In the real world we look at data which is not random. Mathematically, we would say it sits on some lower-dimensional manifold. It was created by some kind of causal structure.
There are some relationships in there. So the truth is that we’re not using random datasets, and so in the real world there are actually techniques that work much better than other techniques for nearly all of the datasets you look at. Nowadays there are empirical researchers who spend a lot of time studying this, which is: which techniques work a lot of the time? And ensembles of decision trees, of which random forest is one, is perhaps the technique which most often comes out at the top. And that is despite the fact that until the library that we’re showing you today, fastai, came along, there wasn’t really any standard way to pre-process them properly and to properly set their parameters. So I think it’s even more strong than that. So yeah, I think this is where the difference between theory and practice is huge.

So when I try to create a RandomForestRegressor, what is that? It’s part of something called sklearn. sklearn is scikit-learn. It is by far the most popular and important package for machine learning in Python.
It does nearly everything. It’s not the best at nearly everything, but it’s perfectly good at nearly everything. So you might find in the next part of this course with Yannet you’re going to look at a different kind of decision tree ensemble called gradient boosting trees, where actually there’s something called XGBoost which is better than gradient boosting trees in scikit-learn. But it’s pretty good at everything, so I’m really going to focus on scikit-learn.

Random forests: you can do two kinds of things with a random forest. If I hit tab... I haven’t imported it, so let’s go back to where we import. So you can hit tab in Jupyter notebook to get tab completion for anything that’s in your environment. You’ll see that there’s also a RandomForestClassifier. In general there’s an important distinction between things which predict continuous variables, that’s called regression, and therefore a method for doing that would be a regressor, and things that predict categorical variables, and that is called classification, and the things that do that are called classifiers. So in our case we’re trying to predict a continuous variable, price, so therefore we are doing regression and therefore we need a regressor. A lot of people incorrectly use the word regression to refer to linear regression, which is just not at all true or appropriate. Regression means a machine learning model that’s trying to predict some kind of continuous outcome; it has a continuous dependent variable.

So pretty much everything in scikit-learn has the same form. You first of all create an instance of an object for the machine learning model you want,

you then call fit, passing in the independent variables, the things you’re going to use to predict, and the dependent variable, the thing that you want to predict. So in our case the dependent variable is the data frame’s sale price column, and the thing we want to use to predict is everything except that. In pandas, the drop method returns a new data frame with a list of rows or columns removed, and axis equals 1 means remove columns. So this here is the data frame containing everything except for sale price.

So if you want to remove some columns, you just pass a list of strings with the column names?

Let’s find out. To find out, I could hit shift-tab, and that will bring up a quick inspection of the parameters. In this case it doesn’t quite tell me what I want, so if I hit shift-tab twice, it gives me a bit more information. Yes, and that tells me it’s a single label or list-like. List-like means anything you can index in Python; there are lots of things. By the way, if I hit it three times, it will give me a whole little window at the bottom.
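The create-then-fit pattern and the drop call described above can be sketched on a toy data frame. The column names and values here are made up for illustration and everything is already numeric; the real dataset needs the preprocessing covered later in the lesson.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Toy, all-numeric data; the column names are illustrative only.
df = pd.DataFrame({
    "YearMade":  [1998, 2004, 2001, 2010, 1995, 2007],
    "Hours":     [5200, 1100, 3300,  400, 8000, 2000],
    "SalePrice": [21000, 48000, 33000, 61000, 15000, 52000],
})

# drop returns a new data frame; axis=1 means "remove columns".
# It accepts a single label or a list of labels.
X = df.drop("SalePrice", axis=1)   # independent variables
y = df["SalePrice"]                # dependent variable

# 1. create the model object, 2. call fit(X, y), 3. predict.
m = RandomForestRegressor(n_estimators=10, random_state=0)
m.fit(X, y)
preds = m.predict(X)
```

Swapping in a different scikit-learn regressor would only change the line that creates the model; the fit and predict calls stay the same.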
Okay, so that was shift-tab. Another way of doing that, of course, which we learned, would be question mark question mark df_raw.drop. Sorry, question mark question mark would be the source code for it; a single question mark is the documentation. So I think that trick of tab complete, shift-tab for parameters, question mark and double question mark for the docs and the source code: if you know nothing else about using Python libraries, know that, because now you know how to find out everything else.

Okay, so we try to run it, and it doesn’t work. So why didn’t it work? Anytime you get a stack trace like this, so an error, the trick is to go to the bottom, because the bottom tells you what went wrong. Above it, it tells you all of the functions that called other functions to get there. "could not convert string to float: 'Conventional'". So there was a column name, sorry, there was a value rather, inside my dataset, the word Conventional, and it didn’t know how to create a model using that string. Now that’s true: we have to pass numbers to most machine learning models, and certainly to random forests. So step one is to convert everything into numbers.

So our dataset contains both continuous variables, so numbers where the meaning is numeric, like price, and it contains categorical variables, which could either be numbers where the meaning is not continuous, like zip code, or it could be a string, like large, small and medium. So there are categorical and continuous variables, and we want to basically get to a point where we have a dataset where we can use all of these variables, so they all have to be numeric and they have to be usable in some way. One issue is that we’ve got something called sale date, which you might remember, right at the top, we told it that that’s a date, so it’s been parsed as a date, and so you can see here its data type. dtype: very important thing.
The data type is datetime64, so that’s not a number. And this is actually where we need to do our first piece of feature engineering. Inside a date there’s a lot of interesting stuff. So since you’ve got the catch box, can you tell me: what are some of the interesting bits of information inside a date?

Well, we can see like a time series.

That’s true, I hadn’t expressed that very well. What are some columns that we could pull out of this? Yeah, month. The date as in, like, it come… at least give me a number. Yeah, month, quarter. Pass it to your right and get some more. Behind you. Just pass it to your right. You’ve got some more columns for us? The day of month, yeah. Keep going to the right. Day of week, yeah. Week of year, yeah. Okay, I’ll give you a few more like that you might want to think about, which would be like:

Is it a holiday? Is it a weekend? Was it raining that day? Was there a sports event that day? It depends a bit on what you’re doing. So if you’re predicting soda sales in SoMa, you would probably want to know: was there a San Francisco Giants ball game on that day? So what’s in a date is one of the most important pieces of feature engineering you can do, and no machine learning algorithm can tell you whether the Giants were playing that day and that it was important. So this is where you need to do feature engineering.

So I do as many things automatically as I can for you. Here I’ve got something called add_datepart. What is that? It’s something inside fastai.structured. And what is it? Well, let’s read the source code. Here it is. You’ll find most of my functions are less than half a page of code. So rather than having docs (I’m going to try to add docs over time), they’re designed so that you can understand them by reading the code. So we’re passing in a data frame and the name of some field, which in this case was sale date. In this case we can’t go df dot field name, because that would actually find a field called field name, literally. So df square bracket field name is how we grab a column where that column name is stored in this variable. So we’ve now got the field itself, the series.

And so what we’re going to do is go through all of these different strings, and this is a piece of Python which actually looks inside an object and finds an attribute with that name. You can google for Python getattr; it’s a cool little advanced technique. So this is going to go through, and for this field it’s going to find its year attribute. Now, pandas has got this interesting idea, which is, if I actually look inside... Let’s go field equals...
This is the kind of experiment I want you to do: play around. Sale date. Okay, so I’ve now got that in a field object, and so I can go field, and I can go field dot tab, and let’s see, is year in there? Oh, it’s not. Why not? Well, that’s because year is only going to apply to pandas series that are datetime objects. So what pandas does is it splits out different methods inside attributes that are specific to what they are. So datetime objects will have a dt attribute defined, and that is where you’ll find all the datetime-specific stuff.

So what I did was I went through all of these and picked out all of the ones that could ever be interesting for any reason. And this is like the opposite of the curse of dimensionality: if there is any column, or any variant of that column, that could ever be interesting at all, add that to your dataset, and every variation of it you can think of. There’s no harm in adding more columns, nearly all the time. So in this case we’re going to go ahead and add all of these different attributes, and for every one I’m going to create a new field that’s going to be called the name of your field with the word date removed, so sale, and then the name of the attribute. So we’re going to get a sale year, sale month, sale week, sale day, et cetera. And then at the very end I’m going to remove the original field, because remember, we can’t use sale date directly, because it’s not a number.

So you’re saying this only worked because it was a date type. Did you make it a date, or was it already saved as one in the original?
Yeah, it’s already a date type, and the reason it was a date type is because when we imported it, we said parse_dates equals, and told pandas it’s a date type. So as long as it looks date-ish and we tell it to parse it as a date, it will turn it into a date type.

Is there a way to do that where we just look through all the columns and say, like, if it looks like a date, make it a date? What would happen to each one?

I think there might be, but for some reason it wasn’t ideal. Like maybe it took lots of time, or it didn’t always work, or for some reason I had to list it here. I would suggest checking out the docs for pandas.read_csv, and maybe on the forum you can tell us what you find, because I can’t remember offhand.
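Here is a minimal sketch of the two ideas just discussed: telling pandas which column to parse as a date, and then expanding that column into numeric parts with getattr on the dt accessor. This is not the fastai source for add_datepart. The helper name add_date_parts, the two made-up CSV rows, and the particular list of attributes are all assumptions for illustration.

```python
import io
import re
import pandas as pd

# Two made-up rows standing in for the real CSV file.
csv = io.StringIO("saledate,SalePrice\n2011-03-31,66000\n2004-12-25,57000\n")

# parse_dates tells pandas which columns to turn into datetimes.
df = pd.read_csv(csv, parse_dates=["saledate"])

def add_date_parts(df, fldname, drop=True):
    """Expand a datetime column into several numeric/boolean columns, in place."""
    fld = df[fldname]                       # df[fldname], not df.fldname
    prefix = re.sub("[Dd]ate$", "", fldname)
    for attr in ("year", "month", "day", "dayofweek", "dayofyear",
                 "is_month_end", "is_month_start",
                 "is_quarter_end", "is_quarter_start"):
        # getattr looks up an attribute by its string name on the .dt accessor
        df[prefix + attr.capitalize()] = getattr(fld.dt, attr)
    if drop:
        df.drop(fldname, axis=1, inplace=True)

add_date_parts(df, "saledate")
print(df.columns.tolist())
```

Extra signals such as holidays or local sports events are not something pandas can derive from the date alone; those would have to be joined in from another source.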

So how about the time zone, how can we get the time zone?

Let’s do that one on the same forum thread that Savanah creates, because I think it’s a reasonably advanced question. But generally speaking, the time zone in a properly formatted date will be included in the string, and it should pull it out correctly and turn it into a universal time zone. So generally speaking, it should handle it for you.

So I noticed that for indexing a column you used square brackets, where elsewhere we use the dot. Is there any consideration?

The square brackets one is safer, particularly if you’re assigning to a column. If it didn’t already exist, you need to use the square brackets format, otherwise you’ll get weird errors. So the square brackets format is safer; the dot version saves me a couple of keystrokes, so I probably use it more than I should. In this particular case, because I wanted to grab something where field name had something inside it, it wasn’t the name itself, I have to use square brackets. So square brackets is going to be your safe bet if in doubt.

So after I run that, you’ll notice that df_raw.columns gives me a list of all of the columns, just as strings, and at the end, there they all are. So it’s removed sale date, and it’s added all those. So that’s not quite enough. The other problem is that we’ve got a whole bunch of strings in there, so there’s one that is like low, high, medium. So pandas actually has a concept of a category data type, but by default it doesn’t turn anything into a category for you. So I’ve created something called train_cats, which creates categorical variables for everything that’s a string. And so what that’s going to do is, behind the scenes, it’s going to create a column that’s actually a number, an integer, and it’s going to store a mapping from the integers to the strings, okay?
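The behaviour described for train_cats can be approximated in a few lines. This is a sketch under assumptions, not the fastai implementation: the helper name strings_to_categories and the toy data are made up, and it simply converts every string column to the pandas category dtype.

```python
import pandas as pd
from pandas.api.types import is_object_dtype, is_string_dtype

df = pd.DataFrame({
    "UsageBand": ["Low", "High", "Medium", "Low"],   # made-up values
    "Hours":     [100, 2500, 900, 50],
})

def strings_to_categories(df):
    """Convert every string column of df to the category dtype, in place."""
    for name in df.columns:
        col = df[name]
        if is_object_dtype(col) or is_string_dtype(col):
            # Square brackets are the safe way to assign to a column.
            df[name] = col.astype("category")

strings_to_categories(df)

print(df["UsageBand"].cat.categories.tolist())  # the integer -> string mapping
print(df["UsageBand"].cat.codes.tolist())       # the integers stored behind the scenes
```

The data frame still displays the original strings; only the underlying storage changes.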
The reason it’s train_cats is that you use it for the training set. More advanced usage is that when we get to looking at the test and validation sets, this is a really important idea. In fact, Terrence came to me the other day and he said, my model’s not working, why not? And he figured it out for himself: it turned out the reason why was because the mappings he was using from string to number in the training set were different to the mappings he was using from string to number in the test set. So in the training set, high might have been three, but in the test set it might have been two, so the two were totally different, and so the model was basically non-predictive. So I have another function called apply_cats, where you can pass in your existing training set and it will use the same mappings, to make sure your test set or validation set uses the same mappings. So when I go train_cats, it’s actually not going to make the data frame look different at all, but behind the scenes it’s going to turn them all into numbers.

When do we finish, at 12? 11:50? Let’s see how we go; I’ll try to finish on time.

So you’ll see now, remember I mentioned there was this dot dt attribute that gives you access to everything about the datetime, assuming it’s a datetime? There’s a dot cat attribute that gives you access to things, assuming something’s a category. And so usageband was a string, and so now that I’ve run train_cats, it’s turned it into a category, so I can go df_raw.usageband.cat, and there’s a whole bunch of other things we’ve got there. So one of the things we’ve got there is dot categories, and you can see here is the list. Now, one of the things you might notice is that this list is in a bit of a weird order: high, low, medium. The truth is, it doesn’t matter too much. But what’s going to happen when we use the random forest is it’s actually going to be that this is going to be 0, this is going

to be 1, this is going to be 2, and we’re going to be creating decision trees. And so we’re going to have a decision tree that can split things at a single point, so it would either be high versus low and medium, or medium versus high and low. That would be kind of weird. It actually turns out not to work too badly, but it’ll work a little bit better if you have these in sensible orders. So if you want to reorder a category, then you can just go cat.set_categories and pass in the order you want, and tell it that it is ordered. And almost every pandas method has an inplace parameter, which rather than returning a new data frame is going to change that data frame. So I’m not going to do that for the others; I didn’t check carefully for categories that should be ordered, but this seems like a pretty obvious one.

Can you reiterate that issue? I don’t understand what the...

Sure. So the usageband column is actually going to be... this is actually what the random forest is going to see: these numbers, one, zero, two, one. And they map to the position in this array. And as we’re going to learn shortly, a random forest consists of a bunch of trees. It’s going to make a single split, and the single split is going to be either greater than or less than one, or greater than or less than two. So we could split it into high versus low and medium, which semantically makes sense, it’s like: is it big? Or we could split it into medium versus high and low, which doesn’t make much sense. So in practice the decision tree could then make a second split, to say medium versus high and low, and then within the high and low, into high and low. But by putting it in a sensible order, if it wants to split out low, it can do it in one decision rather than two. And we’ll be learning more about this shortly. Honestly, it’s not a big deal, but I just wanted to mention.
It’s there, and it’s also good to know that when people talk about different types of categorical variable, specifically you need to know there’s a kind of categorical variable called ordinal. An ordinal categorical variable is one that has some kind of order, like high, medium and low. And random forests aren’t terribly sensitive to that fact, but it’s worth knowing it’s there and trying it out.

Still, ordering wouldn’t help all that much?

That’s what I’m saying: it helps a little bit. It means you can get there with one decision rather than two.

I noticed there is a negative one in that list of categories. Is that like an NA?

Yeah, exactly. So for free we get a negative one, which refers to missing. And one of the things we’re going to do is we’re going to actually add one (can somebody pass the catch box to Paul? Maybe, you guys, let people know it’s coming), so we’re going to add one to all of our codes to make missing a zero later on.

Yeah, we’re going to get to that. So get_dummies, which we’ll get to in a moment, is going to create three separate columns: ones and zeros for high, ones and zeros for medium, and ones and zeros for low, whereas this one creates a single column with an integer zero, one or two. We’re going to get to that one shortly. Yep, did you have a question too, Paul, or were you just pointing? Okay.
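The points about ordering, the minus one for missing, and keeping the same mapping across datasets can be tried out on a toy series. The values are made up, and the way the validation series is built here (reusing the training categories through pd.Categorical) is one possible approach, assumed for illustration rather than taken from apply_cats.

```python
import pandas as pd

train = pd.Series(["Low", "High", None, "Medium", "Low"]).astype("category")
print(train.cat.categories.tolist())   # default order is alphabetical
print(train.cat.codes.tolist())        # missing values get code -1

# Put the levels in a meaningful order and mark the column as ordinal.
train = train.cat.set_categories(["High", "Medium", "Low"], ordered=True)
train_codes = train.cat.codes + 1      # shift so that missing becomes 0

# Reuse the training categories for new data so the same string always
# maps to the same integer.
valid = pd.Series(["Medium", "Low", "High"])
valid = pd.Series(pd.Categorical(valid, categories=train.cat.categories, ordered=True))
valid_codes = valid.cat.codes + 1

# One-hot alternative: one 0/1 column per level instead of one integer column.
dummies = pd.get_dummies(valid)
```

If the validation series were converted with a plain astype("category") instead, its codes would follow its own alphabetical order and could disagree with the training set, which is the kind of mismatch described in the Terrence anecdote.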
Okay, so at this point, as long as we always make sure we use the thing with the numbers in, we’re basically done. All of our strings have been turned into numbers, the date’s been turned into a bunch of numeric columns, and everything else is already a number. The only other main thing we have to do is notice that we have lots of missing values. So here is df_raw.isnull; that’s going to return true or false depending on whether something is empty. sum is going to add up how many are empty for each series, and then I’m going to sort them and divide by the size of the dataset. So here we have some things which have quite high percentages of NaNs, so missing values we call them. It’s in display_all... what did I call it? Maybe I didn’t run it. There we go, okay.
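The missing-value summary described above is a short chain of pandas calls. Here it is on a made-up three-column frame; the column names and values are only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, 2.0, 3.0, 4.0],
    "c": ["x", None, "y", "z"],
})

# isnull() gives True/False per cell, sum() counts the Trues per column,
# and dividing by len(df) turns the counts into fractions.
missing_fraction = df.isnull().sum().sort_index() / len(df)
print(missing_fraction)
```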

We’re going to get to that in a moment, but I will point something out, which is that reading the CSV took a minute or so, and the processing took another ten seconds or so. From time to time, when I’ve done a little bit of work that I don’t want to wait for again, I will tend to save where I’m at. So here I’m going to save it, and I’m going to save it in a format called feather format. This is very, very new. What this is going to do is save it to disk in exactly the same basic format that it’s actually in in RAM. This is by far the fastest way to save something and the fastest way to read it back. Most of the folks you deal with, unless they’re on the cutting edge, won’t be familiar with this format, so this would be something you can teach them about. It’s becoming the standard; it’s actually becoming something that’s going to be used not just in pandas but in Java, in Spark, in lots of things for communicating across computers, because it’s incredibly fast. And it’s actually co-designed by the guy that made pandas, Wes McKinney.

So we can just go df_raw.to_feather and pass in some name. I tend to have a folder called tmp for all of my as-I’m-going-along stuff, and so when you go os.makedirs, you can pass in any path here you like. It won’t complain if it’s already there if you say exist_ok equals true, and if there are some subdirectories, it’ll create them for you, so this is a super handy little function.

Okay, so it’s not installed. Because I’m using Crestle for the first time, it’s complaining about that. So if you get a message that something’s not installed: if you’re using Anaconda, you can conda install. Crestle actually doesn’t use Anaconda, it uses pip. And so we wait for that to go along. Okay, and so now if I run it... And so sometimes you may find you actually have to restart Jupyter, so I won’t do that now.
We’re really out of time. So if you restart Jupyter, you’ll be able to keep moving along. So from now on, you don’t have to rerun all the stuff above; you could just say pd.read_feather, and we’ve got our data frame back.

So the last step we’re going to do is to actually replace the strings with their numeric codes, and we’re going to pull out the dependent variable, sale price, into a separate variable, and we’re going to also handle missing continuous values. So how are we going to do that? You’ll see here we’ve got a function called proc_df. What is that? It’s inside fastai.structured again, and here it is. Quite a lot of the functions have a few additional parameters that you can provide, and we’ll talk about them later, but basically we’re providing the data frame to process and the name of the dependent variable, the y field name. And so what it’s going to do is make a copy of the data frame, grab the y values, drop the dependent variable from the original, and then it’s going to fix missing.

So how do we fix missing? What we do to fix missing is pretty simple. If it’s numeric, then we fix it by basically saying: let’s first of all check that it does have some missing. So if it does have some missing values, in other words the isnull sum is nonzero, then we’re going to create a new column with the name of the original plus underscore na, and it’s going to be a boolean column with a 1 any time that was missing and a 0 any time it wasn’t. We’re going to talk about this again next week, but to give you the quick version: having done that, we’re then going to replace the NAs, the missing values, with the median, okay?
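The numeric fix-missing step just described can be sketched like this. The helper name fix_missing_numeric and the toy column are assumptions for illustration; this is not the fastai source.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

def fix_missing_numeric(df, name):
    """Flag and fill missing values in one numeric column, in place."""
    col = df[name]
    if is_numeric_dtype(col) and col.isnull().sum() > 0:
        df[name + "_na"] = col.isnull()        # True where the value was missing
        df[name] = col.fillna(col.median())    # median() skips NaN by default

df = pd.DataFrame({"Hours": [100.0, np.nan, 300.0, 800.0]})
fix_missing_numeric(df, "Hours")
print(df)
```

Keeping the extra flag column means the model can still learn from the fact that a value was missing, even though the gap itself has been filled.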
So anywhere that used to be missing will be replaced with the median, and we’ll add a new column to tell us which ones were missing. We only do that for numerics. We don’t need it for categories, because pandas handles categorical variables automatically by setting them to minus one. So what we’re going to do is, if it’s not numeric and

it’s a categorical type (we’ll talk about the maximum number of categories later, but let’s assume this is always true), so if it’s not a numeric type, we’re going to replace the column with its codes, the integers, plus one. So by default pandas uses minus one for missing, so now zero will be missing, and one, two, three, four will be all the other categories.

We’re going to talk about dummies later on in the course, but basically, if you already know about dummy variables, optionally you can say that columns with a small number of possible values get put into dummies instead of numericalizing them. But we’re not going to do that for now. So for now all we’re doing is using the categorical codes plus one, replacing missing values with the median, adding an additional column telling us which ones were replaced, and removing the dependent variable. So that’s what proc_df does. It runs very quickly.

So you’ll see now sale price is no longer here. We’ve now got a whole new variable called y that contains sale price. You’ll see we’ve got a couple of extra blah underscore na columns at the end. And if I look at that, everything is a number. These booleans are treated as numbers; they’re just considered as zero or one, they’re just displayed as false and true. You can see here: is it at the end of a month, is it at the start of a month, is it at the end of a quarter. It’s kind of funny, because we’ve got things like a model ID, which presumably is something like, it could be a serial number, it could be the model identifier that’s created by the factory or something.
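The categorical half of what proc_df is described as doing (copy the frame, pull out y, drop it, replace categories with codes plus one) might look like the sketch below. The function name to_model_inputs and the toy data are made up, and the numeric missing-value handling is left out to keep it short.

```python
import pandas as pd

def to_model_inputs(df, y_name):
    """Return (X, y): y pulled out, categorical columns replaced by codes + 1."""
    df = df.copy()                       # leave the caller's frame untouched
    y = df[y_name].values
    df = df.drop(y_name, axis=1)
    for name in df.columns:
        if isinstance(df[name].dtype, pd.CategoricalDtype):
            # pandas uses -1 for missing, so after +1 missing becomes 0
            df[name] = df[name].cat.codes + 1
    return df, y

raw = pd.DataFrame({
    "UsageBand": pd.Categorical(["Low", None, "High"]),
    "SalePrice": [10.0, 20.0, 30.0],
})
X, y = to_model_inputs(raw, "SalePrice")
print(X["UsageBand"].tolist(), y)
```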
We’ve got like a data source ID. Some of these are numbers, but they’re not continuous. It turns out, actually, random forests work fine with those. We’ll talk about why and how in a lot of detail, but for now all you need to know is: no problem.

Okay, so as long as this is all numbers, which it now is, we can now go ahead and create a random forest. So m equals RandomForestRegressor. Random forests are trivially parallelizable. What that means is that if you’ve got more than one CPU, which everybody basically will on their computers at home, and if you’ve got a t2.medium or bigger at AWS you’ve got multiple CPUs, trivially parallelizable means that it will split up the data across your different CPUs and basically linearly scale. So the more CPUs you have, pretty much it will divide the time it takes by that number; not exactly, but roughly. So n_jobs equals minus one tells the random forest regressor to create a separate job, a separate process basically, for each CPU you have. So that’s pretty much what you want all the time. Fit the model using this new data frame we created, using that y value we pulled out, and then get the score. The score is going to be the r-squared.
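Here is a small, self-contained version of that step. The data is synthetic (an assumption for illustration, since the real data frame isn't reproduced here), and n_estimators and random_state are set only to keep the run quick and repeatable.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: y depends on two of the five columns, plus a little noise.
rng = np.random.default_rng(0)
X = rng.random((500, 5))
y = 3 * X[:, 0] + np.sin(6 * X[:, 1]) + rng.normal(0, 0.05, 500)

# n_jobs=-1 asks scikit-learn to use all available CPU cores.
m = RandomForestRegressor(n_estimators=20, n_jobs=-1, random_state=0)
m.fit(X, y)

# For regressors, score() returns R^2, here measured on the training data.
train_r2 = m.score(X, y)
print(round(train_r2, 3))
```

A high number here only says the model fits the rows it was trained on; how it does on held-out rows is a separate question.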
We’ll define that next week. Hopefully some of you already know about the r-squared: one is very good, zero is very bad. So as you can see, we’ve immediately got a very high score. So that looks great, but what we’ll talk about next week a lot more is that it’s not quite great, because maybe we had data that had points that look like this, and we fitted a line that looks like this, when actually we wanted one that looks like that. The only way to know whether we’ve actually done a good job is by having some other dataset that we didn’t use to train the model. Now, we’re going to learn about some ways with random forests we can kind of get away without even having that other dataset, but for now what we’re going to do is split off twelve thousand rows, which we’re going to put in a separate dataset called the validation set, versus the training set, which is going to contain everything else. And our dataset is going to be sorted by date, and so that means that the most recent twelve thousand rows are going to be our validation set. Again, we’ll talk more about this next week; it’s a really important idea.

But for now we can just recognize that if we do that and run it... I’ve created a little thing called print_score, and it’s going to print out the root mean squared error between the predictions and actuals for the training set and for the validation set, and the r-squared for the training set and the validation set. And you’ll see that actually the r-squared for the training was 0.98, but for the validation was 0.89. Then the RMSE, and remember this is on the logs, was 0.09 for the training set and 0.25 for the validation set. Now, if you actually go to Kaggle and go to the leaderboard...
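The validation split and the two metrics that print_score is described as reporting can be written out with numpy alone. The names split_vals, rmse and r_squared, the six toy target values, and the made-up predictions are all illustrative assumptions.

```python
import numpy as np

def split_vals(a, n):
    """First n rows for training, the rest (the most recent rows,
    if the data is sorted by date) for validation."""
    return a[:n], a[n:]

def rmse(pred, actual):
    """Root mean squared error between predictions and actuals."""
    return float(np.sqrt(np.mean((pred - actual) ** 2)))

def r_squared(pred, actual):
    """1 minus the ratio of residual to total sum of squares."""
    ss_res = np.sum((actual - pred) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return float(1 - ss_res / ss_tot)

y = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
n_valid = 2
y_train, y_valid = split_vals(y, len(y) - n_valid)

pred_valid = np.array([5.5, 5.5])   # made-up predictions: always guess the mean
print(rmse(pred_valid, y_valid), r_squared(pred_valid, y_valid))
```

Predicting the mean of the validation targets gives an r-squared of zero, and perfect predictions give one, which matches the "one is very good, zero is very bad" rule of thumb.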

Okay, let’s do it right now. It’s got private and public. I’ll click on public leaderboard, and we can go down and find out where 0.25 is. So there are 475 teams, and generally speaking, if you’re in the top half of a Kaggle competition you’re doing pretty well. So 0.25, here we are, 0.25. What was it exactly? 0.2507. Yeah, about a hundred and tenth, so we’re about in the top 25%. So this is pretty cool: with no thinking at all, using the defaults for everything, we’re in the top 25% of a Kaggle competition. So random forests are insanely powerful, and this totally standardized process is insanely good for any dataset.

So we’re going to wrap up. What I’m going to ask you to do for Tuesday is take as many Kaggle competitions as you can, whether they be running now or old ones, or datasets that you’re interested in for your hobbies or work, and please try it, try this process. And if it doesn’t work, tell us on the forum: here’s the dataset I’m using, here’s where I got it from, here’s the stack trace of where I got an error. Or if you use my print_score function or something like it, show us what the training versus test set looks like, and we’ll try and figure it out. But what I’m hoping we’ll find is that all of you will be pleasantly surprised that with the hour or two of information you got today, you can already get better models than most of the very serious practicing data scientists that compete in Kaggle competitions. Okay, great, good luck, and I’ll see you on the forums.

Oh, one more thing: Friday. The other class said a lot of them had class during my office hours. So if I made them one till three instead of two till four on Fridays, is that okay?
Seminar? Oh, okay, I’ll have to find a whole other time. All right, I will talk to somebody who actually knows what they’re doing, unlike me, about finding other slots. Absolutely.