Adventures with a Wine data set – week 13 – How did we get here? – Twitch stream

– Hello everybody, happy Monday.

We are back for the wine adventures. A quick apology to those of you who were expecting the This Week in Neo4j article review: I just realized I broke something on the Meetup page. I’ll get that sorted and fixed, and we’ll be back next week. Anyway, welcome, welcome, welcome.

So, what are we going to be doing this week? Just as a quick reminder to everybody: if there’s something you’d like me to investigate whilst we go through this wine journey that we’re all on together, and it can be anything, then let me know. I’m collecting a list of these, and we can explore them as we go. I know there’s been a request around GraphQL. I think GraphQL is super cool, so that’s definitely going to happen soon; I just need to think of an example where we can use it. There’s something I’m quite keen on myself, too. I come from a .NET background. I haven’t coded for a little while, but originally that’s the space I was working in, so I’m quite keen to see whether we can put something together in the .NET space, maybe something around .NET Core, and see what that looks like. And there are just so many things: NLP, there’s an opportunity to have a play with that; a bunch of stats stuff we can do; a bunch of graph data science stuff we can do; loads of stuff. So if there’s something in particular that you want us to have a look at for one week or a series of weeks, because some of these might be quite meaty projects, let me know and we’ll do that.

So I thought what we’d do this week is have a look at where we’ve got to. I did put out an ask to see whether anybody fancied any particular topics, and nothing has come in. So I thought, well, you know what, we’ve been doing this for a number of weeks, so let’s revisit what we’ve done so far: what’s worked well, what’s not worked so well, what have we learned?
And yes, let’s do that. So maybe it’s a shorter session today. And absolutely, as we’re going through this stuff, if you go, “Hang on a minute, I want to ask a particular question”, drop it in the chat and let’s have a look. Let’s have a bit of fun with this as well. I know we may actually have a question that somebody asked a couple of weeks ago, so maybe we’ll have a go at that too. So let’s get cracking.

What I’m going to do is start off by having a look at the data model. Then we’re going to remind ourselves about the journey: what happened, where we got to, how far we’ve got, and so forth. So first of all, let’s have a look at the database schema. This is where we’re up to so far.

Oh, where was I recently? Because this is week 13, obviously unlucky for some. I’m not like that, but I’ve just remembered: I was somewhere recently with my husband, and he’d spotted there wasn’t a number 13, and for the life of me I can’t remember where we were. We were on holiday. Very recently we were in Lyme Regis, which is very cool. If you like your fossils, go to Lyme Regis: it’s on the Jurassic Coast and there are ammonites everywhere. Super cool place. But anyway, I digress. I think it may have been the... ah, it was the beach huts. So, we have this thing in the UK where a lot of our seaside towns have these things called beach huts. Do have a look on your favorite search engine, but they’re quite small, roughly one and a half meters by two meters. They’re super, super small and they’re really popular. I mean, you can’t get these things for love nor money. They’re typically made out of wood and they easily cost over 30,000 pounds if you want to buy one, but they’re right on the beach, and they’re really good if you have your chair and maybe
have a few bits and pieces, like your favorite books and some things to make your drinks with, and people sit in them by the beach. But yes, we were looking at the numbers and they skipped the number 13. Sorry, I went off on a tangent there. But anyway, this is week 13, so let’s have a look at where we’ve got to so far. Here is the data we’ve got loaded into the graph database, and let’s talk a little bit about what’s going on. It’s a nice bit of revision, as it were, to see where we are, where we’ve got to, and the things we’ve still got to do. There is a lot of stuff to do, but it’s all good. It’s all an exciting adventure.

So let’s just put some nice colors on some of these nodes that we’ve not yet colored up. Province, points: let’s give points some color. Let’s give taster a color too. Let’s expand these out. It’s just the taster node that hasn’t got a color yet. I know some of these colors are repeating, but needs must. Okay, right. Oh, and price, sorry, price. That’s the exciting bit, isn’t it? How much we’re actually paying for this beverage.

So let’s have a look at what’s going on here. We’ve got wine. Talking a little bit about modeling: typically your data model is going to be determined by the kinds of questions you’re looking to answer. And whilst we didn’t have concrete questions at the start of this journey, it was clear very quickly that wine was going to be the center of our universe. What do I mean by center of the universe? When you’re putting together a data model based on the questions you’re looking to ask, or on the important subjects in your data, you’ll spot that the key entity tends to gravitate into the middle of your data model, with lots of relationships coming off that node. That’s what I mean by center of the universe. So here, unsurprisingly, we’ve got a data set of wine, and wine is the center of our universe. And we’ve got a number of things coming off it; I mean, we’ve got quite a reified data model. So what do I mean by that?
I mean, we have a number of things coming off of wine, and they’ve become nodes in their own right, so we don’t just have a list of properties. You see here wine, and on wine we’ve got a property called title. Title is the name of the wine. So for example, and this is where I’m going to fall apart trying to come up with a sensible name, let’s say there’s a Something Chateau 2017 Bordeaux-style red; that’s the title. We could have had some other properties there. We could have had the year on the wine, but instead you’ve got a year node. Maybe I could have included the description there. We could have had the price. For those of you who’ve been following the series, you will recall, and we can have a look in the previous postings, that we had price as a property on wine. Last week, when we actually brought in the price data and started doing something with it, we thought it probably made sense for it to be a node in its own right, so we’ve pulled that property out and turned it into a node.

A really, really common question that gets asked about modeling for the graph is: how do you decide that something’s going to be a node? How do you decide something’s going to be a relationship? How do you decide something’s going to be a label, a property and so forth?
And look, this is going to be driven by the kinds of questions you’re looking to ask, and also by a level of pragmatism around how you’re going to use your data. The beautiful thing about working with graphs is that it’s not a problem to refactor, so we can make changes; it’s not an issue. And because our schema is schema on write, the data we load in dictates, as far as the database is concerned, what the data model looks like. So we’ll go through each of the weeks we’ve done so far and talk about some of the decisions that were made.

So we’ve got wine, and we’ve got the year of the wine. This came from some processing we did, which we’ll talk about in a bit, along with wine group. Wine group and year are both derived from the wine title. In this data set we have something like 120,000 wines, and unfortunately, out of the many, many fields that we had in our data, the one we did not have was the year of the wine. The year was part of the wine title. So we did a piece of work where we extracted the year for each wine, and introduced this concept of wine group, which is basically the wine title without the year. For example, if we had Portuguese red blend 2017, Portuguese red blend 2018 and Portuguese red blend 2019, the wine group would be Portuguese red blend, and then we’ve got those three years, 2017, 2018 and 2019, as separate things. That’s really useful, because now we can ask some interesting questions around wine ratings by year, for example, or blends by year. Maybe we can get some interesting information out of that.

Wine also has a description. This is the thing that talks about the bouquet of the wine: what does it smell like?

What does it taste like? Where did it come from? That kind of thing. And what we discovered as part of our journey is that we can do a bunch of things with it. This is still work in progress, for those of you who’ve been watching, but we can tokenize the description. If we get rid of all the verbs and things like that, we’re left with a bunch of words describing the aromas, such as cherry. I think we had chive, no, clove, not chives. Wine smelling like chives is a bit scary. But you know: cherry, cloves, berries, apple and so forth.

We also had the actual grape varieties in there. For those of you who recall, we did have a column called variety, which tells us the grape variety of the wine, but the third and fourth most popular varieties were red blends. That didn’t tell us what was in the red blend, and it could consist of many things. So we did a piece of work where we used the variety column to generate some reference data, and then used that to dig the grape varieties out of the description to tell us what the red blends were.

Price we did last week, so we’ll talk a little bit more about that. And then we had the taster: the person who was tasting the wine and providing a review, and the number of points that they gave it.
And again, we’ll talk a little bit about the modeling decisions we made there. And then we’ve got designation, which we discovered didn’t tell us anything particularly useful, so we just ignored it. The thing that we still haven’t really done a huge amount of work on is a wine coming from a winery, a winery being part of a province, and a province coming from a country. The significance of this is that, as you’ll remember, we had more than one province, and we had another set of things as well that we need to figure out how to fit in. We haven’t done that piece of work yet. So that’s where we are at the moment.

So let’s revisit what’s happened so far. And again, if you’ve got any questions, drop them in the chat, please do. One moment, please, whilst I just quickly check something.

If you remember, in the first week of this adventure we explored the data set. As some of you who’ve been watching from the start, or who watched some of the catch-up videos, will know, I was super excited about this data set because we had a lot of really interesting information here. We had the country; the description; the designation, which was just a bit of the title, so that’s why we ruled it out; the points; and the price, and we have some really expensive wines in our selection of data here. Province, so where did it come from?
And we have a number of regions, and we haven’t yet figured out what we’re going to do with those. We’ve got the taster name and the Twitter handle. I was wondering whether we could do something with that on Twitter and have a look at how many cross-followers there are, but that might be a bit too stalky, so maybe we won’t do that. And then variety, winery and so forth. I remember talking about why I was so excited about this data and the kinds of cool things we could do with it. We had some rough ideas of the kinds of questions we were going to ask of our data, so we had an initial stab at a model.

So this was our initial data model. The sharp-eyed amongst you may have spotted that the data model I just showed you, the one we’ve now got in the database, and what we started off with are quite different. So what have we got here? Originally we thought, well, here we’ve got wine, and it probably makes sense to have the description and the price on the wine. So we’ve got our property graph model, and we did a couple of things. We said variety probably makes sense as a separate node. The rationale is that if we want to find all of the wines that have the same variety, we just pivot via that variety node and pull all the wines connected to it. Same idea with the winery: it probably made sense to have a winery node, because then we can find out which wineries have the most wines, which are popular, and that kind of thing. I’m going to stick with the easy ones for now. The other easy one was the taster: the person who tasted the wine, gave a description for it, and also gave us a score.

And we thought it probably made sense to have taster as a separate node, because a taster was likely to taste more than one wine. We also thought it made sense to put the review points that the taster gave the wine as a property on that relationship. And then we had this really big messy thing, and we still haven’t resolved it. Maybe next week or the week after, we’ll have to think about what we’re going to do with this big lump of question mark: what is the relationship between wine and region, province and country? The interesting thing here is: is region part of province, or is province part of region? We’re not sure yet how those things connect together, so we’re going to have to do a bit of digging in the data to figure that out. So that was our starting point.

And then what did we do? We started off little by little. We didn’t even put the wine in at first: in week two we just imported the winery, the province and the country. That allowed us to ask questions such as which provinces had the most wineries, which countries had the most wineries, which countries had the most featured provinces, that kind of thing. And we talked about dealing with null values, because our data set has some empty columns. That’s fine, but we still need to do something, because otherwise the database will come back to us when we’re loading the data and say, you’ve provided a null value, I need something. Nulls aren’t stored in Neo4j, so you don’t need to capture a null: if you don’t have a value for something, you don’t need to put a little blank placeholder in it. You can if you want to, but you don’t need to. So we talked about dealing with the nulls and we asked some questions. Then we moved on to the next week and added some more data. Originally we started
off with winery, province and country, and in the next stage we loaded in our wine. So the actual wine with the title and description; the designation, which we discovered later on turned out not to be particularly useful; the variety, so what grape the wine consisted of; and the taster. One thing to mention here is that whilst we do mention the points from the taster, we didn’t import them just yet, because we weren’t ready to answer any questions with them. So we put that data in, and again we dealt with nulls where we needed to. You’ll notice we split out our LOAD CSV queries; that’s to make them efficient and to avoid eager queries. And then we could start to ask interesting questions.

We looked at the most prolific taster. The one thing I completely forgot was to have a look at the count. A couple of weeks ago we did see Roger, who tasted something like 24,000 wines, but I’d obviously forgotten about Roger when we did this query. We also had a look at the varieties that contained “red”. The reason was that when we had a quick look at the data, we discovered we had a lot of red blends: red blend, Portuguese red blend, Bordeaux-style red blend. So it was just to give us a hint of how many of those reds were coming up.

And then we came to week three, where I think it dawned on us that we didn’t have the year data. And why do we care about the year data?
The reason we care is that you can have a wine from exactly the same winery, with exactly the same variety of grape and so forth, but the taste can differ from year to year, and there’s going to be a variety of factors behind that. Variation in the weather is going to have an impact. Maybe something happened and the grapes had to be picked sooner or later. In certain situations, if a vineyard doesn’t produce enough grapes, it will bring grapes in from another country or a different region, and that’s going to affect the taste. There are lots of different things, and maybe it will be interesting to see how those variations show up in the wine. Equally, if we know something about the weather, then we may want to know about wines from a certain year and a certain region, because we expect them all to have a similar taste, or we hope they do. So there’s a lot of value in pulling the year out. And then we had another opportunity

where we thought about revising the data model. Whereas previously we were just bringing data in, here we were trying to make a decision about what we wanted to do with year. So what does this look like? As you can see here, we were thinking: do we have a wine that’s in a wine group and then have the year come off that? Remember, if we had a wine with the title Portuguese table red 2015, as an example, then the wine group would be Portuguese table red and our year would be 2015. I think originally we were thinking about doing something like this, but actually, no, we didn’t do this. I’m just going to quickly recall the rationale behind splitting it. Oh yes: the reason we didn’t go down the route of hanging the year off the wine group is purely from a query perspective.

Think about comparing wines from the same year: we’re going to go to the same year node. The reason we have the year as a node rather than as a property on the wine is that we have a limited range of years. We did some data cleaning, which you’ll see later on, where we tried to figure out the number of years we’re working with, and we had a range of around, I think, 50 or 60 different years. Now bear in mind we have 120,000 wines. Do we really want to look through 120,000 wine nodes to find out which have matching years? Or does it make more sense to find one of the 50 or so year nodes that represents the year, and then follow the connections from that to the wines?
That makes more sense because now we only have to filter 50 or so year nodes and then traverse a relationship to the relevant wines, rather than scanning a property on 120,000 wines. These are the kinds of things you think about as you get a better understanding of what questions you’re looking to ask of your data, and how you iterate further. So that’s why we’ve pulled out the year.

There’s another reason, and this could change; there will be pros and cons to a different approach. The reason I decided to have the wine group, so Portuguese table red, and the year, 2015, hanging separately off the wine, rather than the year being an extension of the wine group, is that we cut down the number of traversals if we’re trying to compare two wines based on year. We don’t have to go through the wine group to figure out the year. That just cuts down the complexity of our query, because at the time I couldn’t think of a good reason why we would want to go via the wine group and then hop back again. So that’s the rationale. Maybe it’s the wrong rationale, and we can update the model later; that’s not a problem. That was the piece of work where we figured out a sensible base model.

Then we had a fun series of, I think, two to three weeks of actually trying to extract the year out of the title. I appreciate there are lots of different tools that we could use to do this, but I thought the fun would be to practice our Cypher skills and use APOC to help us do it within the database, so let’s go that route. I’m not going to go into huge detail about the trials and tribulations we had, but the interesting thing to bear in mind is that we used APOC to help us do this. We used regular expressions to
help with this. Just as an example, you can see here the wine title, and then you can see the group test here, which is us trying to create our wine groups. You remember we had that wine group node; that’s what’s going on here. We had lots of fun with this, because what we discovered was that some wines had two years in them, and one year could be part of the branding. I’m going to come up with a silly example, but let’s say Portuguese table red 2015, Estd 1909, or something like that. So we could end up with two years in there. That made it quite interesting, because now we’re asking: which one’s the right year? Common sense would dictate that the 1909 is probably not going to be the year

of our wine, because we’ve got one that’s 2015, and that’s the sensible one. The kicker was that there were some wines that had 2020 in them. Absolutely, we can have wine from 2020, but this data set was from about two or three years ago, so again, that’s another branding year. So we had a bit of fun with this. We looked at the principles we could use to pull that information out, and this was painful: we loaded the data, thought brilliant, we’ve got a way of doing this, and then had to delete everything the following weekend and do it again. But that’s cool.

So here are the challenges we discovered. We had branding years, so we had to think about that. Some wines just had the branding year. Port, for example, had a branding year in there. If it’s a fancy port, then absolutely, you’re probably going to have a real year for it; if it’s something you’ve paid about 15 or 20 bucks a bottle for, it’s more likely going to be a branding year. Some wines had two years, the actual year of the wine and the branding year. And some wines had no year: many of the champagnes that were reviewed didn’t have one. So we don’t always have a year for every wine.

What we did was some interrogation of the data. The first thing we tried to figure out was the sensible range of years for the wine, and we did this by looking at the price of the wine as well as the year. For example, if a wine has the year 1975 but costs $15 a bottle, that’s probably a branding year, not a real one. Think about the cost of laying down wine: you have to keep it in specially controlled conditions and manage the temperature and the
humidity and all of that stuff to stop the wine from spoiling, the cork from drying out, and that kind of thing. That’s not a cheap process. So if we see a wine that’s got a year of 1975 and it costs $2,000, then I can well believe that’s a real year. This is where we’re using business rules, and we made some assumptions. Any year has to be from 1970 to 2017; anything that didn’t fall within that range we assumed was a branding year and got rid of. Any wines with just a branding year or no year at all, we typed as “no year”. That’s a common approach we’ve used for everything: no country, no year, no taster, et cetera. And what else did we say? Oh yes: if there were two years on the bottle and both fitted within our range, then we went for the first year that appeared, because what we observed was that the first year tended to be the year of the wine and the second year was the branding year. So those are all our business rules, and this works for most of our data.

So we finally got it done. We were there. What was this, week five? That was an exciting week. Well, it was for me, because we finally got that stuff sorted. In week five we applied all of our knowledge and our business rules, and we were able to create the wine group, so that’s the title without the year, and the actual year itself, and to do something with them. So that was a great week. And then we were able to ask some interesting questions. We were able to ask: what was the most popular year?
We’ve got an example here, and I’m going to pick this awesome chart. We saw things like which year had the most wines in our data set. Bearing in mind that our range went up to 2017, 2012 had the most wines, which kind of makes sense, because if you have a lot of reds, you tend to have a lag of about four or five years for your most commonly drunk wines. So that was pretty cool. Now we can look through all of these year questions, so that was a great week. Oh yes, we also discovered that, unfortunately, our wine data set had some duplication in it. So I did a bit of Excel magic, as in, I just removed duplicates, a bit of secret sauce, and then we reloaded the data. That’s pretty quick. So we got rid of those duplicates and that was sorted.
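To make those year rules concrete, here is a minimal sketch of them in Python. On the stream this was done inside the database with Cypher and APOC, so this is a stand-in rather than the query we actually ran. The function name `split_title`, the regular expression and the `"No year"` placeholder string are assumptions for illustration; the 1970 to 2017 range and the first-year-wins rule are the ones described above.

```python
import re

# Assumed constants: the range comes from the business rules above,
# the placeholder spelling is illustrative.
MIN_YEAR, MAX_YEAR = 1970, 2017
NO_YEAR = "No year"

# Any standalone four-digit number is treated as a candidate year.
YEAR_PATTERN = re.compile(r"\b(\d{4})\b")


def split_title(title):
    """Return (wine_group, year) for a wine title.

    Years outside the allowed range are treated as branding years and
    ignored. If several in-range years appear, the first one wins.
    """
    candidates = [
        match for match in YEAR_PATTERN.finditer(title)
        if MIN_YEAR <= int(match.group(1)) <= MAX_YEAR
    ]
    if not candidates:
        # Only a branding year, or no year at all.
        return " ".join(title.split()), NO_YEAR

    first = candidates[0]
    # The wine group is the title with the chosen year cut out.
    remainder = title[:first.start()] + title[first.end():]
    return " ".join(remainder.split()), int(first.group(1))
```

With the silly example from earlier, `split_title("Portuguese Table Red 2015 Estd 1909")` picks 2015 as the year and keeps the 1909 as part of the wine group, since 1909 falls outside the range.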

And this was also very good. Remember we had the wine group, so Portuguese table red, and then the year, 2015. One of the questions we wanted to answer was which wine group had the most years. It looks like this is the de-duplicated data, wonderful. Let’s see, can I zoom this out? No, it’s cut off. I apologize that I’ve chopped off a bit on the side. But you can see here the most popular wine group, the Sebastiani Cherryblock Cabernet Sauvignon from Sonoma Valley, and you can see all of the years there. That’s really cool. Actually, I’m going to take a note of that. Let’s go find out what the ratings were. Now we’ve got the year data, the group data, the rating data and the price data. So a fun question I want to ask is, for this Sebastiani wine, did the price change over time? Maybe it did, though we don’t know when these reviews were collected. And what do the ratings say: did the rating stay predominantly the same? A fun thing we can do is look at the max and min rating, the average, and what the standard deviation was.
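As a rough idea of what that rating question could look like, here is a small Python sketch that summarizes the points for one wine group. In the database this would more naturally be a Cypher aggregation; the function name `rating_summary` and the sample points used in the note below are made up for illustration and are not values from the data set.

```python
from statistics import mean, stdev


def rating_summary(points):
    """Summarize a list of review points for one wine group."""
    if not points:
        raise ValueError("need at least one rating")
    return {
        "min": min(points),
        "max": max(points),
        "avg": mean(points),
        # Sample standard deviation needs at least two ratings.
        "stdev": stdev(points) if len(points) > 1 else 0.0,
    }
```

For three hypothetical ratings of 88, 90 and 92, this gives a minimum of 88, a maximum of 92, an average of 90 and a standard deviation of 2. A small standard deviation across the years would suggest the rating stayed predominantly the same.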
So that should be a bit of fun, and that’s a question I’m going to ask. We also saw we could do the same from the wineries, where we wanted to find out, for each winery... oh, it keeps truncating all these images, I do apologize. Maybe I’ll go back to the non-truncated one. But you can see here, for example, where we’re looking at the winery and seeing how many wines per year they were producing. So we can ask a lot of interesting questions now with that year data.

And oh, week six. Week six was a fun week. They’re all fun weeks. In week six we first wanted to find out what designation was all about. We had a look at it and realized it was part of the title, so it wasn’t particularly interesting. So we thought, let’s go and have a look at variety, which was the grape data. I thought, well, Shiraz is a variety, and what we discovered when we ran the query to pull back all of the wines that had Shiraz in them is that, oh, there is a lot of Shiraz in there, and we’ve got a lot of wine blends mixed in. So we wanted some way of de-duplicating our data, and we used the approach I mentioned in my BBC Good Food blog posts. A really good way to start removing duplicate entities in your data, especially when you’ve got something like this, is first of all to tokenize it. Take something like Cabernet Sauvignon-Merlot-Shiraz as an example. We’ve got three grape varieties in there: Cabernet Sauvignon, Merlot and Shiraz. It could easily happen that we also have a Shiraz-Merlot-Cabernet Sauvignon or a Merlot-Shiraz-Cabernet Sauvignon. The ordering can change, and you
know, more often than not that’s based on the proportions, from the biggest proportion of grape down to the lowest. So these things can move around, but we might not necessarily care about the proportion; what we do care about is finding all the wines that have, say, all three of these grapes in them. We’ve got some tricks we can pull out of the hat to tackle this. The first thing we can do is tokenize these words. Here we do a split on the hyphen. This gives us Cabernet Sauvignon as a node, Merlot as a node and Shiraz as a node, and we say that all three of these come off this particular variety node, the one that has Cabernet Sauvignon-Merlot-Shiraz. Then we take the same approach with all of the varieties: for anything that has Shiraz in it when we tokenize, we create a relationship from that variety to the new Shiraz node, and we keep doing that. Immediately we’ve fixed the ordering problem by tokenizing: anything that has Cabernet Sauvignon, Merlot and Shiraz in whatever order, we can straight away say all of those are the same for the purposes

of what has the same grapes So we can immediately do that And we can use this approach as well for things like phonex and fuzzy matching and all of this good stuff to try and reduce those down And that’s what we discussed in this post about cleaning that up And we talk about using things like a Leverstein similarity to figure out words of similar, and that helps us deal with plurals and typos and that kind of thing So you’ve got a lot of interesting things around there And that effectively left us to refactor the graph yet again and so what we’ve got now is we had our variety, so that was what we originally had So that was our Cabernet Sauvignon, Merlot, Shiraz And we now have variety name and variety name is you know, the tokenized, cleaned up, grape variety which allows us now to very easily find out, you know, things like which wines have all got the same grape variety or which wines contain any grape variety ‘Cause maybe we discover we really like any wine that has Shiraz in it and we like the blends, we liked the pure you know, pure Shiraz that allows us to do that And this is really powerful because this means we can do all sorts of interesting recommendation questions based on grape variety So that was an exciting week and it got more exciting because something we discovered, so we did that in week six That was the tokenizing And then we thought well, okay, let’s do the same thing with description So description was, let’s bring up the new model for that I’m gonna close some of these, so we don’t have a crowded tab Oh, don’t really need to be there And we had the same thing with descriptions Now remember, description was the thing that tells you information about the wine So the wine has aromas of cherries, that the wine has this variety of grapes, the wine tastes like this, it has hints of something else, it was found in this region There’s a lot of interesting information So this is the information that the taster has provided about the wine And the 
interesting thing there is we can’t do much with the description as is, it’s a block of text. It’s unstructured data to some extent, so we can’t do much with it as is, but there’s a lot of fun things we can do if we tokenize that text. So again, similar process: we split by spaces, and there’s lots of other interesting irregularities we have to figure out when splitting. So it’s still a work in progress, but we’re getting there. But what we can do is split it out and clean that text up, and this is where, and I really want to do this, we start thinking about using maybe some NLP to help us out. We take that text and we strip out the things that we’re not interested in. So we’re not really interested in any verbs yet. They’re probably going to be super important in the future, you know, we’ll want to have a look at how something smells versus how something tastes. But for now, we strip out all of the stop words, so stop words are things like and, the, a, at, this, et cetera, you get rid of all of those. If we can also get rid of all the verbs, then all we should be left with is either descriptive words about aroma, smell or appearance, or the grape variety. Because something we spotted, and this is why week eight was so exciting, was that we had grape variety, and the grape variety for a lot of wines was just red blend. All right, red blend of what, you know? 
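As a rough illustration of the tokenizing described above, here is a minimal Python sketch rather than the Cypher used on the stream. It assumes blend names are hyphen-separated, as mentioned earlier, and it uses a tiny made-up stop-word list. The function names `tokenize_variety` and `tokenize_description` are hypothetical. It does not attempt the verb removal, which would likely need an NLP library.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "at", "the", "this", "of", "with", "has", "is", "in"}


def tokenize_variety(variety):
    """Split a blend name on hyphens and return an order-independent set of grapes."""
    return frozenset(part.strip().lower() for part in variety.split("-") if part.strip())


def tokenize_description(description):
    """Lowercase the text, keep runs of letters/apostrophes, and drop stop words."""
    words = re.findall(r"[a-z']+", description.lower())
    return [word for word in words if word not in STOP_WORDS]


# The same three grapes in a different order produce the same token set.
blend_a = tokenize_variety("Cabernet Sauvignon-Merlot-Shiraz")
blend_b = tokenize_variety("Shiraz-Merlot-Cabernet Sauvignon")
print(blend_a == blend_b)

# Stop words are removed; the remaining words are candidate description tokens.
print(tokenize_description("This wine has aromas of cherry and a hint of cinnamon"))
```

Using a set for the variety tokens is what removes the ordering problem: two blends compare equal whenever they contain the same grapes, whatever order they were written in.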
There could be many things in the red blend. But when we went off and had a look at the description for a wine that said red blend, the description would go, this Bordeaux-style red blend contains Cabernet Sauvignon and Shiraz and, you know, a few other grape varieties. And you’d be like, well, hang on a minute, it’s mentioned in the description but we don’t have it in the variety. And the really cool thing is, because we’ve tokenized and dealt with the stuff in variety name, what we can now do is cross-map: treat variety name as a set of reference data. So reference data is a finite list of terms that don’t tend to change very often. Things like country names could be that, or salutations like Mr, Ms, Dr, Lord and so forth. So it’s a fairly static set of data, doesn’t change very often. And the nice thing is we’ve got 120,000 wines in this data set, and whilst a lot of them have things like red blend or white blend, there’s also a lot of them which are just the grape on its own. So when we do that cleanup we have a set of reference data, which we can now map against description, and we can figure out what the varieties are for a red blend. And that’s something we’ve been able to do by cleaning up and using the description.

And again, same idea with description word. We clean that up and we get things like cherry, apple, cinnamon, clove and so forth. And what we can do now is recommend wines based on these attributes. So if you really like fruity wines, you like a wine with lots of berries, well, we can do that. We can find those and then suggest some wines. But even more powerfully, let’s say you say, “Well, here’s a bunch of wines that I quite like.” So going to the data, let’s assume the wines you’ve tried are in the data set. You can now have a look at those wines and go, “Well, I like this wine, this wine and this wine.” And now we have a bunch of things to figure out: why do you like those wines? Is it because of the variety? So we can go up to the variety node and figure out whether there are varieties in common that you like, and we can start going, “Well, you probably want to try these wines.” But maybe it’s more subtle. Maybe there’s something in the tasting profile. So maybe you like the things that have hints of berries, maybe it was cinnamon, maybe you never knew it was the combination of cinnamon and vanilla that you really enjoyed. And now we can spot that, so we can start to dig under the surface and figure out why you like specific wines. So that was great fun, that was a super discovery. I actually did a talk at NODES about this as well, I was so excited by this. I might turn the lights on in a bit, it’s getting a bit dark, bear with me one second. So that was really cool. And so now we’re building in more and more information to be able to understand, you know, why do we like it? 
Why not? And how can we recommend stuff? So that’s what happened that week. So that was a fun week. And yeah, I did a NODES talk on this, on finding the wine blend gaps. So that was week eight and that was super fun. Oh, week seven even, so we did more stuff within week eight. So there is a little issue with the description, and this is something I’m gonna take offline and figure out: the data in there is quite dirty. We have lots of different splitters. We have two different types of spaces, I didn’t realize you can have two different types of spaces, short hyphens, long hyphens, ampersands. There’s a lot of things going on in there. So I need to research and figure out a good way of cleaning that up whilst preserving the nature of the text. So that’s something that’s in progress, bear with me, I’m gonna do that. And definitely, like I said, we want to use fancy tools to process that text as well. I’m really keen to look at lots of the NLP stuff that my colleague Mark Needham is putting together. So that was descriptions and, oh, look at this. There we go, look, we’ve got an example here. We’ve got the variety, which is red blend. And if we look in the description, it tells us, look, Sangiovese and Cabernet Sauvignon. And I’ve got a more detailed example in the NODES talk, where we then pull out Sangiovese and Cabernet Sauvignon to show as well. So that was super cool, that was great fun. At week nine, what did we do? 
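Going back to the red blend example, the cross-mapping of variety names against a description can be sketched in Python as below. This is only an illustration under stated assumptions: the reference set and the description text are made up, the function name `find_varieties_in_description` is hypothetical, and the matching is a plain case-insensitive substring check rather than whatever the graph queries on the stream actually do.

```python
def find_varieties_in_description(description, reference_varieties):
    """Return the reference grape names that appear in the description text."""
    text = description.lower()
    return sorted(name for name in reference_varieties if name.lower() in text)


# Made-up reference data: cleaned-up single grape names taken from other wines.
reference = {"Cabernet Sauvignon", "Sangiovese", "Merlot", "Shiraz"}

# Made-up description for a wine whose variety is only recorded as "red blend".
description = "A blend of Sangiovese and Cabernet Sauvignon with cherry aromas"

print(find_varieties_in_description(description, reference))
```

A plain substring check can give false positives when one grape name is contained in another, so a real version would probably match on tokens or whole phrases instead.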
Oh, some description words, we should skip over that. Yeah, this was disappointing. Right, so the fun stuff. The last two weeks prior to this week, we were looking at adding the rating, the points that the tasters awarded to the wine, and then asking some questions. And, I think, the price. So let’s have a quick look at what we did. You will be shocked and horrified to hear that we changed the data model yet again. It was worth it though. So let’s talk a little bit about what we did. Originally we were thinking about doing, you know, taster, then a rates-wine relationship, and then wine. And then we thought about this. We thought, well, hang on a minute, we’ve got 120,000 wines. So that means you’ve got 120,000 relationships between a taster and a wine, and 120,000 properties on those relationships telling you the points that that taster gave that wine. And that’s problematic, because if we want to do anything interesting around understanding which wines have got the same rating, or what’s the rating profile, which blends get which ratings and so forth, that means, much like what we had previously, we have to go off and filter through 120,000 relationships to check that. And then we thought, well, actually, what is our rating range? So the maximum points you can award a wine is a hundred, the minimum points you can award a wine is zero. And we thought, well, actually, maybe we should find out. Oh yes, this was what we found about Roger. Roger, 23,000 wines tasted, oh my goodness. But anyway, so we had a look and we’ve really got 19 different named reviewers, or 20 including no taster. And we had a look, actually, and it wasn’t even nought or zero: the lowest wine scored 80 points, the highest wine scored a hundred. So that’s 21 possible scores, from 80 to a hundred inclusive. So effectively we are looking at 21 times 20, and this is where my maths fails me miserably, 400 and, is that the right way around? Yeah, 420 nodes. That’s the total number of points nodes we need when we map the reviewer, the wine they had and what score that wine got. And it’s significantly better to just look up 420 nodes than 120,000 relationship properties, that’s significantly faster.

So what we did was we refactored the graph again. Remember we had the relationship from taster to wine, so we’ve split it out a little bit. What we’ve said is that a taster gave points, and a wine got those points. And what that means is, you know, two different reviewers could have given the wine 80 points, but that’s okay. We have two different points nodes connected to those tasters, but they both connect to the wine. So that means we can very easily find the relationships between wines, the points they scored and, for example, the grape variety. So we included that, and then we had a little bit of fun with this. I think some of the questions we asked were, you know, finding all the wines that Roger was reviewing, because he’s reviewed a lot, and we had a look at which wines were well rated. So this is where we looked at wines that had a score of 90 and above, and had a look at what the variety was. So you can see here Pinot Noir, this image is clipping. So Pinot Noir scored 
over 6,000, you know, 6,388 Pinot Noirs got a rating of 90 and above. So that’s, you know, a good bit of fun there. And then what we did the following week was we added the price. And again, we took a similar approach, because we’ve got 120,000 bottles of wine. That’s 120,000 prices. And again, we did that query and had a look at how many different prices we had, and we’ve only got 390 different prices, which is quite amazing because of the range: I think $5 was the cheapest wine and $3,700 was the most expensive. But there were only 390 different prices. So again, 390 nodes is better than trying to look up 120,000 properties on wine nodes. So we took that same approach again: we pulled price out from wine. And we first had a look at the expensive wines, all of the wines that were over a thousand dollars a bottle, and what scores they received. And the really amusing thing here was you can spend a thousand bucks on a bottle of wine, and it doesn’t mean it’s going to score highly on points. So, interesting. I mean, I don’t think I’d ever spend that amount of money on wine, but there you go. Oh, sorry, it was $3,300, it’s $400 cheaper than I thought it was. And then we can do fun things, like, okay, let’s find out the wines that score 97 points and above and how much they cost. And it turns out, you know, for $35 you can acquire a bottle of wine that scored 97 points. And I believe the cheapest bottle of wine that scored a hundred points was $80. A special occasion, why not? And then, now we’ve got this information in, we had a look at the tasters, and what was the cheapest bottle of wine that they tasted, what was the most expensive? 
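The arithmetic behind pulling points and prices out into their own nodes can be sketched as follows. This is a minimal Python illustration, not the graph refactoring itself: the helper names `points_node_count` and `distinct_values` are hypothetical and the price list is made up. The reviewer count and score range are the ones quoted in the recap.

```python
def points_node_count(reviewers, lowest_score, highest_score):
    """Upper bound on points nodes: one per reviewer per possible score."""
    possible_scores = highest_score - lowest_score + 1  # the range is inclusive
    return reviewers * possible_scores


def distinct_values(values):
    """The shared nodes we would need: one per distinct value, not one per wine."""
    return sorted(set(values))


# 19 named reviewers plus "no taster", scores from 80 to 100 inclusive.
print(points_node_count(20, 80, 100))

# Made-up prices: six wines, but only three distinct price nodes are needed.
print(distinct_values([35, 80, 35, 1000, 80, 35]))
```

The same idea applies to price: however many wines there are, the number of nodes to look up is bounded by the number of distinct values.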
And just looking at the average around that too. So you can see that the average pretty much sat between $25 and $40. So you get a flavor of where I was going with that. And then we just had a look as well, the last thing we’re going to leave on in this recap, at which country had the highest average between the low score and the high score, which was England, you know, apparently we do good wine. We do good white sparkling wine, I have to say. So there you go.

So that was a very quick recap. I thought, you know, maybe it’s good to do this quick check-in on what we’re doing. So I’m going to leave it there. I’ll give it a couple of moments. If any of you have got any questions, you’ve seen the recap, and if you’ve got some questions that you’d like me to ask of this wine data, ask me. Otherwise, I’m going to come up with some questions, because I’ve had this nice refresh of what we’ve done. I mentioned a couple as we were going along. So I will do those questions and we can look at them next week. And again, if you’ve got something in particular you want to have a look at next week, let me know. Otherwise I’ve got some thoughts about what we can look at. So with that, I hope you have a fantastic week and I will see you next week. Take care. Bye.