Welcome everybody, you’re watching mr fugu data science. We’re doing some time series plotting today. I had a viewer reach out to me, wanting a little bit of help. As always feel free to get a hold of me, here on instagram and twitter. If you think this is useful content feel free to help support the channel with this provided here. As always thanks to the recent subscribers and viewers i greatly appreciate it. Let’s get into this here’s the data that we’re going to be using today. It’s a very famous data set for retail online retail sales from a website from 2010 to 2011. I left in the code the original data set because i have cleaned these data as you can see here all the code will be in the link in description below for my Github. What we’re going to focus on today is parsing the invoice dates, taking some of the country information, as well as using the unit price and a quantity. And creating a new column called sales totals. When we parse these invoice dates, we will also be parsing this date time format which is stored as a string initially. We need to convert that to utilize this for our time series. I had to do a lot of conversion and manipulation with this data set. I highly suggest that you take this code and review what i did so you can get a lot of practice. Nevertheless, when you’re dealing with pandas. You can zoom in just a little bit. You can do the two date time formatting as i have highlighted here for the invoice, state column. Just scroll over here and understand what i’m doing. i’m taking that column and i’m using the infer format equals true. Then i have to create two new columns called date and time where i further subset. And i did d t dot strf time, and then the formatting that i wanted for each of those. But you’re going to notice something, so let’s scroll down Here’s our new columns but, the data types are now different than what the data type is when you have this invoice date. We’re going to further take these dates and pull up the month name, the day of the week as well. And then later we will just take the hours from there. I subset these data and day of the week the month and then i created that new column with our sales totals, the quantity and the unit prices when you’re calling the day of the week. You use this percent capital A. When you’re taking the month name, it’s going to be percent capital B with this with this right here If you notice, we now have our three additional columns that we just created right here. Which is going to allow us to start doing our first plots we’re going to look at doing some simple plots First with two bar plots. We can get an idea of how to handle these data in order to move further, you should check if there’s any nan or none values for your date time. For instance, I then checked for null values in my column. So i have zero null values which is great. But, there’s some more hiccups we’re going to stumble on to. So i subset these data and i want to show you something distinctive. Notice, that right here we’re dealing with sales volume and what i did was, i took the specific two columns the month and the sales total. And from there i subset and only did the month and i did dot value counts. Now, this is taking basically a volume count. I can’t use this, if i was looking at total sales i would have to do a group by sum. We’ll see that in a second This is allowing us to seal it by month. We had this many transactions. Whereas, when we do this one we’re actually finding the amount of sales by month. I did a group by month, i summed it all together. You could do it by the mean. Just depends on what you’re actually doing. And then i had to unstack because, what happens is let me print this off for you so you know what i’m doing there. If you see this, this is a nested table. Now you see that this is above, this month. When you call it, you need to actually call this, which is the outer portion of it. That’s a distinction, and if you just wrap this in a list it gives you just the actual values of what you’re looking for Now, we do our actual first plotting if you didn’t have the imports from above, you would need this where you’re just calling your matplotlib as our starter piece. i’m adjusting the figure size so it’s quite big. But, i’m going to make it smaller for the next plots. Because, we won’t be able to see them fully. Then i change the tick size for each axis and i throw in the order of the labels. I want on the x axis in the particular order for the month. Then we use this plot bar to make our bar plot where i’m calling in my labels as my x, my y values that we took from above. And then i’m aligning everything. And if you had multiple plots, then you could do the alpha so you have some kind of transparency so you can look between it. If you had like overlaid plots and then this is going to change the orientation of the x-axis ticks, all right That’s what’s going on here, and then here’s our label. If you’re wondering what this is?
It’s just using the actual symbol for the British pound, and then i increase the plot size text We scroll down we could zoom out just a little bit, and this is going to be sales by month in British pounds. You can see October, November right here are the best sales, and then for some reason March there’s a great big spike. Now these data are interesting because, it’s from December 2010 until December 2011 and what i decided to do is look at day of the week for purchases. But, there’s a problem. There was never in the original data Saturday. For some reason these data no one sold any items on this weird online website, and i think it was the United Kingdom. Because, that’s where most of the purchases came from. For some reason they didn’t operate or take orders on Saturday. I don’t know why? So you have to take this into account when you’re doing plotting Sometimes your range of values don’t work out, and you have to find a way to finagle that so in order for me to make this cohesive. I did the Monday through Friday and i artificially made a new data frame that i’m going to create with the sales total of zero, and the new weekday as Saturday. Then i have to subset the original data that i’m taking from my data frame, and append a new row of Saturday which will go to the end of the data frame. And then what i’m doing here with ignore index is changing the index value for the data frame. So it all starts from zero again. Instead of the original values and then we’re going to do the same thing that we did above. We’re going to take our data frame. We’re going to group by the day of the week sum everything together and just take our total values for our sales for each day of the week. And you’ll say well the magnitude of these numbers looks very different. Yes they do okay. But, if you look at these, if you sum all of these values together and check to see if it’s equal for all the other days you only be off by a few pennies or pence. Nevertheless, fixed everything, got it to zero. Now we do our next plot this one’s pretty much similar to the last one We make a subplot here, we keep our giant plot, we change our x and y tick sizes. But then we have this function right here. This function is taking care of the y-axis for the labels, and you’re gonna see what’s going on in a second. But what’s happening is it’s taking care of the magnitude of the numbers. If it’s a thousand, then it’s gonna get a k, if it’s a million it gets an m, if it’s less than a thousand it just is its own number, etc. And then i’m taking care of the amount of significant figures, and only keeping one from there you’re using this right here. This is called in from matplotlib this function formatter and it’s taking in this function right here this function and it’s setting it to the y-axis that we’re calling. We could change the labels, then we’re doing the same thing from above. We’re calling our x and our y so our label names for x, and then whatever we’re subsetting for our y aligning the plot and then rotating our x ticks once again and then doing our plot title, and our y label and then plot does show so we’ll scroll down and you notice look at the y axis right here. You have two hundred and fifty thousand, one million, two million, zero dollars and it matches up for this Saturday. Now let’s look at something. When you plot this originally the zero position would be over here Because it wasn’t aligned properly and you scroll right back up here. and we’ll notice you see Saturday in this position. This is how it prints out. You’ll put your labels and it’ll say Monday through Friday in the correct order. But your plotting numbers here are in the incorrect order That’s why i did this line again here to subset everything in the correct order. When i plot it, it lines up correctly. That’s why now this lines up correctly. Now we’re good to move on to our time series stuff. I decided for the time series we’re going to investigate the top five countries The first part we’re going to neglect the United Kingdom because, it dwarfs our data and we’ll look at that in the second example later because the scaling will be off. So what do we do here? We took our data frame here for retail sales and we’re only extracting the country and the sales totals. Then i did a group by statement here so i can take it the country and sum all of the values for sales and then i print it off one through six. Which would be one, two, three, four and five instead of zero which would have been the uk which was the highest selling and this is what we’re going to use for our first plot that we’re going to use for mapping here’s the uk. You see it’s three times as big as the other ones, pretty much that’s why it’s kind of an issue. That’s why we had to do a different subsetting for the sales. So what i did was i called those top countries that i wanted to take and do the dot index on the sales total so i can get the actual name of the countries. That’s what this idx stands for from there i’m going to chuck it into this and only subset the columns that relate to idx which is the names of these countries here
Then to verify it you can run the same thing and then just do a dot country dot unique to make sure that this prints out correctly. if you’re in doubt so to self check help from there let’s investigate this a little bit and take on the columns that we have of interest here but let’s figure out a range of dates that we want to look into and if we notice the top it was October, November, December where the top selling month. Let’s investigate those first, with these top five countries. To do that we use this right here. You call in your data frame, the specific column, and then a date range. And then i subset with whatever columns i wanted to do Use this dot idx which is just getting the label names and subset the countries by these dates You could probably do this in a different steps. But, this is how i chose to do it Then what i decided to do is take our date ranges and convert it into our date formatting I iterated through the names for each one of these countries and then what i did here was i basically appended a list for comparisons. This right here, is saying create a list that’s the same length as the set of values which is these dates subtracted by the original dates for our data frame. And what you’re getting from this set difference here is all the dates that were not included in this range. Because what i want here is the range of every day in October. But i know from these data and investigating for each one of these countries. The sales did not include all 31 days. To make up for the difference in this, i did the set difference then i multiplied our iterator here. For the name of the country by the length of this. Because, we’re creating a new data frame. The data frame will have the same length as whatever empty values we have for this. Then we create a list a new list which is our dates that were not included for our country. And then a length of zeros to correspond to the amount of sales that were held within each one of these dates that was not included in the original data frame. So i create this new data frame and i have to use the explode from pandas to flatten out the data frame. Because if we do this real quick, you’ll see what’s going on. You’ll see here each one of these is the specific length low we need but, we need to flatten it out to create more rows. So we could append it to our old data frame and that’s what’s going on here, okay. You’ll see this oops, there now you see this this is including the the dates that weren’t originally included in the data frame and we’re going to append them with the country name and zero sales. We can use this, for our plotting then we need to iterate through each one of these countries and there and store their sales by date So i’m taking in the the data frame and i’m sorting by the country first. Then by the date, then i’m taking the unique names for each of the countries and iterating through it. Then i’m looping through basically the data frame for each one of these countries and i’m taking the dates, i’m summing together all of their sales data and we can see that right here. This is the sales data for each one of the countries that we have, and then that’s the list of lists And then what i’m doing from here. We’ll zoom in just a little bit for the plotting. You’re going to do a list comprehension and basically take your data frame by the date column and take all the unique dates that’s going to be your x value. Then your y is going to be the store list that i just printed off those right there. Then same thing as before we get our plot size and you’re doing an enumerate, so you’re looping through this you’re taking your x. You’re taking your y but, now you have to make your labels coincide. So you’re getting your country names which are unique so you have basically a set or a list of unique country names and you make sure that it works with your iterator that you have from here. That’s how you get it to match up for your legend then you do your same label information we got a legend dictate where you want it to be then you do your x ticks, rotate it, create its size and you’re doing it by date. And then i did a grid portion so you can see what this looks like. We’ll zoom out just some so we can see this. Because, it’s kind of big but, this is your October 2011 sales by country, per day in great British pounds. Right here are your countries we did this by the date of the month so we have all the months you could do this by hours by whatever you want and we’re doing this comparison. This is what it looks like for all five of those countries, you can see there is definitely a trend of what’s going on with sales. How about plotting a single day based on hourly stuff, okay. Fine this one involves quite a bit of manipulations because, now we have to get the hour
separated from the date time formatting. We’re going to use our same date range that we did before. But instead i’m going to use the column of United Kingdom since this had a lot of sales But i decided within that date range i’m picking one particular day. Why did i choose this day? Because when you look up here from the 5th, it looks like there’s a lot of traffic on the fifth for these countries. I could have did somewhere in this range, or this range. But i chose to do it around the fifth. So we sort those values by the invoice state this one and i did that because you have a time stamp in here which you can also sort by which is what we want after you do your sorting to get this in order by your timestamp information You want to pluck out the hours which is done here and highlight it. But there’s another problem You don’t have 24 hours represented in these data. So now we have to manipulate and do the same finessing to create more rows and find the missing hours and append those in at the end of your data frame. Here’s our hours here that we created now let’s do the finessing there’s a lot going on with this one i’m doing my date formatting so i could get all 24 hours. But to get the formatting it’s a 0 8 instead of an 8. So if i’m less than 10 i add that 0 into it and append it otherwise, i chuck in the values so you get 0 1, 0 2, 0 3 and then at 10. It’s just 10, 11, 12, etc . Then we’ve got to do our set difference here and this set difference is our new list of hours that we’re not included in the original data frame. Then you chuck that in and create a dictionary with your values as a list and then we have to create the sales totals. There were zero sales in that time so you need to make the length of that new list of values the same as this so your data frame will match up later then concatenate everything together so you have all of your columns from there we do our plotting same as what we’ve been doing before. Your x is going to be here which is our unique hours. Then our y is going to be this right here so it looks crazy But it’s the actual sum of values for each hour of your sales and then don’t forget your x ticks which is your hour names and it’s just the same as we’ve been doing before. This is the uk sales for one particular day by hours and this is what it looks like and you see these litter at zero Those are all the hours that we’re not including the original dictionary. Now let’s expand this and look at three consecutive days by hour for the uk data. This is quite involved, very similar to what we did above for this previous plot. You subset the country that you want you subset the date range that you want. Then you need to pluck out the hours that you want within your data frame You need to sort by the values for your invoice date that’s what this highlighted portion is here from there we take all of the unique dates there’s only going to be three days October 4th, 5th and 6th. You iterate through those days and you append a list of the unique values for those names. You take those hours and you need to do a assorted list of your set difference which is here. The set difference is going to give you all of the hours that were not involved in the original data set. You take that as a list of values here. For your new hours you will then have to expand this as a data frame because all of your rows will be a list. So you use the explode you make sure to reset your index and then now you’re creating a sales total that’s the same length as this that you’re just creating here. So your data frames can be concatenated and joined together with the same number of rows from there we have to go through all of the days, iterate through that and then we’re taking all of the hours. Taking the length of that set difference we did before and multiplying it times all our unique names to get a list a nested list of each date corresponding to the length of each of the three nested list of hourly values. And then you just do the same thing of creating your data frames and mashing everything together. Let’s see what this looks like right here for this ds because, that’s the most confusing you see what’s going on it’s creating a list of the same length of all of these you see there’s 40 entries from there. We will take our hours and our dates. Iterate through our unique dates which is three Go inside of it and group by our hours and our sales. We can get a list, it’s just the first entry in our list there’s three entries and this is all of the sales for 24 hours from October 4th 2011. The second entry would be October 5th, the third entry October 6th. We do what we’ve done
before a list comprehension for our unique hour names to be our x value. We call in what i just showed you for that above which is our y then we got to do our iteration here. Take our x y and our labels to our unique date names and make sure that you use this here. So you get the correct indexing positions and checking what we’ve done for the last few right here. And we’ll scroll out and this is going to be October 4th through 6th 2011 uk sales by hour. And you’ll see here’s our time series information plotted out fine you can manipulate this anyway. But, i made sure to laboriously go through this so you can understand. As always thank you for watching please like, share and subscribe. And if you subscribe turn on that notification bell. Feel free to help support this channel and buy me a coffee. There’s my handle right here mr fugu data science. I would appreciate it i’ll see you in the next video. Bye