All right, welcome everyone to the first lecture of CS 246. I am Yukari; I'm going to be your co-instructor throughout the quarter. If you were wondering whether I was a bot on Piazza just replying to every answer: no, I exist as a person, so you are going to be seeing me around until the end of the quarter. Before we start this lecture, let me give you two important announcements. First: right after the lecture, you have ten minutes to run to Gates, where we're going to have a refresher on linear algebra. It is going to be extremely useful for your homework and the final exam, so I recommend you be there. We already put the handouts on the website, and we're also going to be putting the video online by tonight or tomorrow. So don't miss the refresher on linear algebra. Second: next week is going to be quite intense for you guys, because we have a bunch of deadlines. We're going to have Homework 0 and Homework 1 due on Gradescope. Many of you have already submitted Homework 0, so good job. Make sure that you start working early on Homework 1, because it is way longer than Homework 0, as you might have realized. We're also going to have two quiz deadlines on Gradiance. We decided to postpone the first one because some of you just realized that you have to work on the quizzes, or you are still deciding whether to drop the class or not, so we made sure that the first one would be postponed by one week. As a reminder: you can use the late periods for the homework; you cannot use late periods for the quizzes, so by Thursday at midnight they have to be submitted. OK, all right, so let's start. This lecture's topic is going to be the theory of locality-sensitive hashing. On Tuesday, Jure gave you an overview of LSH, and today we're going to try to understand exactly why it works the way we want, and we're also going to try to generalize it to different similarity functions. OK, so first of all, let
me give a recap, so we're sure everyone is on the same page on LSH, and then we can build on top of those foundations. So what is the task? What do we care about? Basically, we are given a large collection of documents, say on the order of millions or billions, and we want to find similar ones; we want to find near-duplicates. This is an extremely important task if you run a search engine: you don't want to index again and again pages that are similar. It's also an interesting task if you are running something like a crime-scene investigation and you want to match fingerprints; that's another very common scenario where, at web scale or any very large scale, you want to find similarity in very large datasets. Why is this a challenging problem? Because if you wanted to do pairwise comparisons, you just have too many pairs; it's a quadratic problem, and whenever you want to make something scale, we hate when the complexity is quadratic. We want it to be linear or, even better, sublinear at times. So basically, what we do with LSH is we turn a quadratic problem into a linear problem. OK, all right. So what's the solution that we are going to apply? What we want to do is hash these documents so that similar documents end up in the same bucket, and hashing is a very fast operation; we can do this on each single document, so it's a linear pass. Then, once the hashing has been performed, what we want is to work only on the candidate pairs that we were able to identify, and we perform the expensive similarity comparisons only on those pairs. OK? That's how we bring down the computational cost of running similarity search on such a large dataset. How do we do this?
So it's a straightforward pipeline. The first step is the shingling representation of our documents, and I'm going to give you a one-slide reminder of how it works. Basically, the output of the first step of our pipeline is the set of strings of length k that appear in the document. Once we have this representation, we want to make it even more compact, so we are going to perform min-hashing, and the output of min-hashing is just a set of signatures, one per document. These signatures, in the case of min-hashing, are just short integer vectors, and the important property of min-hashing is that the signatures it outputs maintain, or reflect, the similarity between the documents. So we have a more compact representation which still retains the similarity property among the signatures, and the last step is that we perform the LSH stage. I'm going to remind you today about rows and bands and so forth, and the goal here is that we want to generate only a few candidate pairs, and these are the only pairs on which we are going to perform the similarity check. OK, so let's go one step at a time and have a refresher on each element of this pipeline. First one: shingles. So we said that a k-shingle, sometimes also called a k-gram,
is a sequence of k tokens that appear in the document. I recall the question we got on Tuesday: what is the granularity of k, of what these tokens are? These tokens could be characters, they could be words, they could be sentences. It really depends on what your task is. If you are performing spell checking, then you want to find common k-shingles in terms of characters; if you are just trying to find similarity between documents, then usually your tokens are going to be words, OK? So the granularity depends on your domain application. Now let's have an example. We said k equals two, so we're going to extract 2-shingles. This is our document, "abcab", and the output set of 2-shingles that we obtain is just {ab, bc, ca}. OK, so what we do for this step is to only keep the shingles that are unique, and this is the way in which we obtain our representation: basically, it's a set of all the unique k-shingles. Now, you will realize that up to here, this part of our pipeline is domain dependent, because we are working with documents. From now on, whatever we apply downstream doesn't depend on what your input data is, OK? Because at this point we have reached a representation that is in the form of sets; from now on, anything we can turn into a set can be fed into our similarity measure. And one very natural similarity measure that we have been using throughout the past lecture is the Jaccard similarity, very common, used in a lot of contexts. Let me give a reminder. Basically, we have our two documents, C1 and C2, and what we do is, first of all, we take the size of the intersection of the common shingles, and then we divide it by the size of the union of all the shingles, OK? And this intuitively tells you how many shingles the two documents have in common, OK?
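To make the shingling step and the Jaccard similarity concrete, here is a minimal Python sketch; the document "abcab" matches the example above, and the function names are my own, not from the lecture:

```python
def shingles(doc: str, k: int) -> set:
    """Return the set of unique character k-shingles in doc."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}

def jaccard(s1: set, s2: set) -> float:
    """Jaccard similarity: size of intersection over size of union."""
    return len(s1 & s2) / len(s1 | s2)

# The 2-shingle example from the lecture:
print(shingles("abcab", 2))                          # {'ab', 'bc', 'ca'}
# Comparing against another small document's shingle set:
print(jaccard(shingles("abcab", 2), shingles("abcd", 2)))   # 0.5
```

Note that duplicates disappear automatically: "ab" occurs twice in "abcab" but only once in the set.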
This is the similarity measure that we're going to be using: the Jaccard similarity. Once we have this set representation in place, we move to the next step, which is performing the min-hashing. And what is the main property that we care about in min-hashing? We want to go from these very large sets of shingles — you can imagine that in a very large corpus with millions of documents, we will be finding hundreds of thousands of shingles, depending on the k that we set and the granularity of the token — so the height of this matrix is going to be very, very big, and we want to convert it into a more compact representation, still preserving the similarity. OK, so basically, what we want is that the probability that two documents, two sets, are going to have the same hash is equal to their similarity. So how do we do this? We take our input matrix — columns are the documents, rows are the shingles — and then we perform random permutations of the rows. This was the trick that we applied on Tuesday, so let me give a reminder. Let's say this value here is four, so we're working with document three and with the second hash, the yellow one. What we do is we look for the first instance of a one appearing in the permuted order of the shingles: we go to row number one, which here is a zero; then row number two, zero again; row number three, yet another zero; then we get to four, where we find a one, and we put here the index from our permutation, OK? That's how we perform this conversion: we are going from a wider and more expensive representation into a more compact one. The beauty, or the magic, of this trick is the following: the similarity of the columns — and remember that each column here is representing a document — is proportional to the similarity that you're going to be finding in our input matrix.
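Here is a small sketch of min-hashing via explicit row permutations, as described above. This is the textbook version for illustration (real implementations use random hash functions instead of materializing permutations); the function name and the toy matrix are my own:

```python
import random

def minhash_signatures(matrix, num_hashes, seed=42):
    """matrix[row][col] is 0/1: rows are shingles, columns are documents.
    For each random permutation of the rows, a document's signature entry
    is the position (in permuted order) of the first row where it has a 1.
    Assumes every column contains at least one 1."""
    rng = random.Random(seed)
    n_rows, n_docs = len(matrix), len(matrix[0])
    sig = []
    for _ in range(num_hashes):
        perm = list(range(n_rows))
        rng.shuffle(perm)
        sig.append([next(pos for pos, r in enumerate(perm, start=1)
                         if matrix[r][doc] == 1)
                    for doc in range(n_docs)])
    return sig

# doc0 has shingles {0, 1}, doc1 has shingles {0, 2}; true Jaccard = 1/3
matrix = [[1, 1],
          [1, 0],
          [0, 1]]
sig = minhash_signatures(matrix, 600)
est = sum(s0 == s1 for s0, s1 in sig) / len(sig)
print(est)  # close to 1/3: the fraction of matching signature rows
```

The key property shows up in the last two lines: the fraction of hash functions on which two columns agree estimates their Jaccard similarity.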
Okay, so we can go through an example to check that it's actually working. Let's say we want to compare document one and document two; we see that they have nothing in common, and I invite you to check the same on the bigger matrix. But say you want to check one and three: we find that two of the three signature values are in common, so our signature similarity is going to be two thirds, while if you go and check columns one and three of the input matrix, you will find that three fourths of the values are in common. Now you might say this approximation is not as close as we would wish, and that is absolutely true. The tweak that we're going to be seeing today is that we can make those two values get arbitrarily close to each other, and the tweak to make that happen is to have more hash functions. That is what today I'm going to teach you: how to choose exactly those functions, how to choose the size of the bands, the number of rows, and so forth. But, as with pretty much anything in computer science, what we are going to be doing is a tradeoff: the more hash functions we introduce, the more computations we have to perform.
OK, so there is a tradeoff between how much work we are going to be doing and how well these two quantities map to each other. OK, so that's the first fact I want you to keep in mind for today. And the last step, after the two previous ones, is moving to the locality-sensitive hashing. So we have our signature representation; now we hash the columns of our matrix, and what we want to do is put similar columns into the same buckets. So keep in mind that this step is basically linear, because we are computing a set of hashes per document; we place the documents into different buckets, and then we are going to be performing the similarity check only within a bucket. OK, that's where we go from quadratic to linear complexity. As we were saying last time, rather than just doing the comparison between two whole documents, so two whole columns, we subdivide our space into bands, OK? So this is our column representation that we've seen before, and we consider a pair a candidate pair if it is matching on at least one band. OK, so let's have an example. Say we have documents six and seven: you see they are hashing into the same bucket, so these two guys are going to be candidates for our similarity check. Whereas two documents hashing to different buckets in every band are never going to be compared against each other, all right?
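The banding step described above can be sketched in a few lines; the function name and the toy signatures are illustrative, and here the "hash to a bucket" is simply keying a dictionary on the band's tuple of values:

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(signatures, b, r):
    """signatures: dict doc_id -> list of b*r signature values.
    Split each signature into b bands of r rows each; two documents
    become a candidate pair if they agree on at least one whole band."""
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for doc, sig in signatures.items():
            # the tuple of r values serves as the bucket key for this band
            buckets[tuple(sig[band * r:(band + 1) * r])].append(doc)
        for docs in buckets.values():
            candidates.update(combinations(sorted(docs), 2))
    return candidates

sigs = {1: [3, 1, 4, 1], 2: [3, 1, 2, 2], 3: [9, 9, 9, 9]}
print(candidate_pairs(sigs, b=2, r=2))   # {(1, 2)}: docs 1 and 2 agree on band 0
```

Documents 1 and 2 share band (3, 1), so they become candidates; document 3 never collides with anyone, so it is never compared — that is where the quadratic cost disappears.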
So, ideally, what we would want to happen is the following, which is what we see here: on the x-axis we have the similarity, the actual similarity between the two documents, and on the y-axis we have the probability of sharing at least one bucket, which means they are going to be matching on at least one band, right? So why do we want this behavior? Say we want all the documents, all the sets, that have more than 0.6 similarity: we want to be able to retrieve all of them and have them be part of the same buckets, so that we are going to be comparing them to each other. But anything that is below the threshold of 0.6, we don't want it to be part of the buckets where we are going to be performing the similarity comparisons, OK? So the whole goal of today is how to turn this plot from a line into a nice curve that is as close as possible to our ideal step function, all right? And I'm going to be explaining the tradeoffs of what happens when we introduce false positives and false negatives; I'm going to be repeating this a couple of times, so let's start now. What do you think is more worrisome for this kind of task: having false positives or false negatives? And there is no strictly correct answer, so if anyone wants to give it a shot, I'm happy to talk about this for a minute. Anyone wants to try? Yeah — and why, what's your intuition?
Exactly. So, whenever we introduce false positives, in the second step, where we're actually computing the similarities, we will be able to discard them eventually; they just come at the price of performing more computation, all right? While if we introduce false negatives into our pipeline, what actually happens is that we are never going to consider them as candidates, so this is data that we are going to be missing. And so my goal for the lecture today is basically to give you one of those eureka moments where you know that it was worth getting a master's or an undergrad degree in computer science: rather than going to Stack Overflow and copy-pasting code, I want to give you a reason why it is worth studying so much math for so many years, because then you will be mastering this tool, this technique, and you are going to be able to understand why you're choosing certain parameters, OK? All right, so let's go one step at a time and get there. The other goal for today is to generalize min-hashing and make sure that LSH is not going to be used only for documents, where we can perform this shingle extraction, but can also be applied to basically any data points, all right? And this is extremely important because, as you know, data comes in many different forms and shapes. Especially, you have seen all the craze that we have today with deep learning, the fact that we can do great things with images and multimedia data and so forth; so data cannot always be represented with shingles, and this technique is equally powerful on different types of data, so we're going to try to generalize it in that way. All right. So what do we need to do to make that happen? We have to design a locality-sensitive hash function that is going to work for a specific distance metric, right? So the hash magic that we saw last Tuesday, and up to now, is tied to our Jaccard similarity,
which we're going to see later how it can be converted into a distance. But we want something that can be applied to different distance metrics, OK? We are going to be seeing Euclidean distance, cosine similarity, and so forth. Then everything else stays the same: we hash each point into a short integer signature that reflects the point similarity, and then we apply our LSH technique, doing some of the magic with the bands and rows we're going to be seeing later, and eventually, at the end, we get our set of candidate pairs, on which we are going to perform the similarity check, all right? OK, any questions up to now, or shall we dig into the technical part? All right. So, as I spoiled at the beginning of my lecture, most of the magic that we're going to be seeing today lies in this S-curve, OK? That's where the magic happens. And why is this the case? Say we selected only one single hash function that is able to retain the similarity between the documents. What we would see is exactly a straight line, as in this plot. So again, we have on the x-axis the similarity of the two sets, on the y-axis the probability that they are sharing at least one bucket. So what happens here is: if you have only one hash that is able to represent the similarity between these two documents, the probability is going to be exactly proportional to the actual similarity between them, OK? So it doesn't really do us any service, because we are not going to be able to filter out the bad candidate pairs; we would just try all of them. So the trick is that we want to introduce more hashes, and we want to end up having a shape of this form: whenever our similarity is below the threshold, we don't want any chance of those candidates appearing in our buckets, while when the similarity is above the threshold, we want the probability to be as close as possible to one. So ultimately, we want to turn that line into a step function. And everything that we are going to be seeing
today is in expectation, like many other algorithms that make your pipeline scale with large data: when you work with billions of elements, you cannot really expect to be doing a deterministic, perfect job a hundred percent of the time. Okay, so what we want to do today is to reach the state where we can do very well ninety-nine percent of the time, and that's a really great result. Question? [The question was whether the y-axis, the probability of sharing at least one bucket, really looks like these two plots.] Yeah, yeah, so it's a toy example; it is just to show you the two extremes. This one is useless, and this one will never happen; we are going to try to approximate this step function as much as we can. Thanks for asking. Okay, so how do we shape our S-curve the way we like? We have two tunable parameters: we have the number of bands b, into which we're going to subdivide our input matrix, and then we have the number of rows r which compose each single band. OK, these are the two parameters that we play with. So let's say the similarity between our two sets is equal to s, and we ask: what is the probability that at least one band is equal? We are going to build up to it step by step; eventually I'll be able to give you a formula for how to compute exactly that probability. So we pick some band; then the probability that the elements in a single row of these two columns are equal is s, OK? Now, if we want all the rows in a band to be equal, we need to take s to the power of r, the number of rows; so that's the probability we get. And if you instead want the probability that some row in a band is not equal, we take the complement of that, so one minus s to the power of r. Now bear with me, we're going to do another couple of steps, and then we get to the final formula. Now, the probability that all bands are not equal is going to be the following: we are going to be raising this to
the power of b, the number of bands, because we want all of them not to be equal. And ultimately, what we care about is the following result: we want the probability that at least one band is equal. OK, remember, this was the condition that told us these two columns are a good candidate pair to be checked; they qualify only when they match on at least one band. So that is one minus the probability that we just found, which is going to be one minus (one minus s to the r) to the b. The cool thing is that this is the probability that two columns end up as a candidate pair, and we can simply plot it. That's what we're going to be doing in the next slides: you can plot this curve and then find the sweet spot that exactly matches your use case.
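The derivation above fits in one line of code; this is just the formula restated, with an illustrative function name:

```python
def p_candidate(s, r, b):
    """Probability that two columns with similarity s agree on at
    least one band, given b bands of r rows each:
      P = 1 - (1 - s**r) ** b
    (s**r: all r rows of a band agree; 1 - s**r: some row disagrees;
     (1 - s**r)**b: every band disagrees; 1 - that: at least one agrees.)"""
    return 1 - (1 - s ** r) ** b

print(p_candidate(0.0, 5, 10))   # 0.0: disjoint sets are never candidates
print(p_candidate(1.0, 5, 10))   # 1.0: identical sets are always candidates
```

The two sanity checks confirm the endpoints of the S-curve; everything in between is what we tune with r and b.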
OK. So here is one example. Say that we want to pick our r and b to create the best possible S-curve for our scenario, and we have the budget of computing fifty hash functions, all right? So this is one possible configuration: we can have r equal to five, so five rows per band, and ten bands in total; but as you can imagine, there are way more combinations that you could pick here. So this is the S-curve that you obtain as an output. Again, the x-axis is the actual similarity, the y-axis is the probability that the two columns are sharing at least one bucket and therefore that we're going to be computing the similarity between them, and our threshold in this case is 0.6. What do you think about this S-curve? Do we like it? Is there something that we dislike, or could we do better? Exactly: this is the area that we don't like, right? OK, so how could we still use this S-curve in other contexts, where it would still make sense? Right, what can we do here? Exactly: we can take the threshold from 0.6 and, let's say, move it to 0.8, where we reach this intersection, which is around 99.99…% probability. And this S-curve is going to be great when you want to find all the candidates with a similarity of 0.8 or above, all right? So that's why I was saying to you before that r and b are tunable: once you know what threshold you care about, you change them in such a way that the curve will not return too many false positives — sorry, the other way around: it will not miss too many false negatives, while you're still OK with some false positives. Here I just show you the answer that we were talking about before: with a threshold of 0.8, this S-curve all of a sudden becomes a very good fit. So how can we shape our curve?
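Evaluating the formula for the slide's configuration shows exactly the problem being discussed: at similarity 0.6 the curve only reaches about 0.55, so near the 0.6 threshold we miss many pairs, while at 0.8 it is already above 0.98. A quick check (assuming r = 5, b = 10 as above):

```python
r, b = 5, 10   # the 50-hash-function configuration from the slide
for s in (0.2, 0.4, 0.6, 0.8):
    p = 1 - (1 - s ** r) ** b
    print(f"similarity {s:.1f} -> P(candidate) = {p:.3f}")
```

Low-similarity pairs (0.2, 0.4) are almost never candidates, which is good; the weak spot is right around 0.6, which is why the lecture suggests moving the target threshold to 0.8 for this curve.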
Let's fix one parameter and change the other, and we're going to see how the S-curve is modified thanks to that, OK? So first of all, we fix our threshold — we can take our 0.6 from before — and what we would like to make happen is that we choose r and b in such a way that the probability of becoming a candidate pair has a step right around the threshold, okay? So the S-curve all of a sudden has this drastic growth exactly at the threshold. So let's start. First of all, the top-left axes are exactly the same as before: x is the similarity, y is the probability of being a candidate pair. We fix the number of bands equal to one, and then we vary the number of rows, okay? So we go from the straight line, with one row, to this curve down here when we have ten rows. Can you give me the intuition for why this is happening? Why do you think that considering more rows at the same time makes the probability curve change in that way? What is the intuition behind it? Exactly, exactly. So, basically, what we are doing is we are considering more rows at the same time, and we're checking whether their hashes are matching or not. If you have only one row at a time, the probability is going to be higher; but as soon as you consider more rows together, all of a sudden our condition becomes stricter, OK? So that's why you see here that it will start to return potential candidates only when they have, like, 0.8 similarity or more. The problem with a curve that is stretched in this way is that it is unfortunately not going to be returning a lot of interesting candidates, OK — only the ones just above the intersection line. So it is not enough, basically, to tune only r. Let's do another example. Here we are fixing r equal to one, and then we're playing with the number of bands. So once again, with one band we get a straight line; the more bands we introduce, the more we stretch it towards the top left. Here we're doing exactly the opposite of what we did with the rows: the
more bands we introduce — the more we subdivide our matrix into bands — the higher is the probability that at least one of them will be matching, OK? So if the number of bands is equal to one, it means we are comparing the two columns on each single value; the more bands we have, the higher is the likelihood that at least one of them will be matching. Okay, so we have these two very nice contrasting tradeoffs, and when we start to tune both of them, then we reach these nicely S-shaped curves.
OK, and as you can see here in these two plots on the right, we fix the number of rows to either five or ten, and then we change the number of bands. And as you can see, for instance, if r goes from five to ten, everything is going to be shifted to the right, because the more rows you get, the stricter the condition becomes, so you will be returning candidates only if the similarity is higher. Question? Oh, yes, yes. So the question is: once we have our own threshold, can we reverse-engineer r and b in such a way that it gives us the best possible S-curve? The answer is yes. I'm not going to spend too much time today telling you exactly how to compute it, but the slides are going to give you all the intuitions for how that can happen, and the good libraries for LSH do that for you, most likely. OK, now, before I jump into the theory of LSH, any questions? [Student question.] The question is: how do I decide how many hash functions I want to use, right?
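As an aside, the standard rule of thumb for that reverse-engineering is that the S-curve's steepest point sits near (1/b)^(1/r); a sketch of how a library might pick r and b from a hash budget (the function names are mine, not any particular library's API):

```python
def approx_threshold(r, b):
    """Rule of thumb: the S-curve rises most steeply near (1/b)**(1/r)."""
    return (1 / b) ** (1 / r)

def choose_r_b(budget, target):
    """Among (r, b) pairs with r * b == budget, pick the one whose
    approximate threshold is closest to the target similarity."""
    pairs = [(r, budget // r) for r in range(1, budget + 1) if budget % r == 0]
    return min(pairs, key=lambda rb: abs(approx_threshold(*rb) - target))

print(choose_r_b(50, 0.6))   # (5, 10): threshold ~ 0.63
print(choose_r_b(50, 0.8))   # (10, 5): threshold ~ 0.85
```

With a budget of fifty hash functions, a 0.6 target recovers exactly the r = 5, b = 10 configuration from the example, and a 0.8 target pushes toward more rows per band.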
Oh, yeah, that's quite a good question. So basically your question is: what if I choose a certain number of hash functions and then, you know, I suddenly realize they are not enough to reach the performance I care about, and I have to recompute, to do another full pass over the data? That's the reason why we're doing this lecture today: because you can tune, or find out, what are the bounds on r and b, and those two values give you the number of hash functions that you are going to need. So you can approximate what you need before you have to do a full pass. There are also other ways to make that happen: whenever we have a very large dataset, you can work with samples. That's a very good recipe in general: if you know that the data distribution is not too skewed, you take a small sample, you work with that, you make some assumptions, and then when you do the full pass, you find out whether your assumptions hold or not. That's a great question. Okay, now let's jump into the theory of why LSH really works. So again: we used LSH up to now to find similar documents or, in our case, since we are working with shingles, similar sets. Can we then use LSH for other
distance measures — all right, so think Euclidean distances, cosine distances, and so forth? So let's try to generalize what we learned. Quick refresher: what is a distance measure? It's a function that, given a pair of points x and y, returns a real number with the following properties. The distance is always greater than or equal to zero; it's equal to zero if and only if x and y are the same element; it's symmetric, so the distance between x and y is the same as between y and x; and the triangle inequality property holds, so given any other point z, other than x and y, you know that the distance between x and y is always less than or equal to the distance between x and z plus the distance between z and y, OK? So that is a refresher from the past. And how do we turn, for instance, our Jaccard similarity, which we used for documents, into a distance metric? It is as simple as taking one minus the Jaccard similarity. As you remember, the Jaccard similarity is between zero and one; when two documents are exactly the same, it returns one. In our case, instead, we want two documents to be the same when the distance is equal to zero, so we just take one minus the Jaccard similarity. Other distances we are going to see today are, for example, the cosine distance for vectors, where we are analyzing the angle between two vectors. Another distance that everybody is familiar with is the Euclidean one, usually called the L2 norm: what we do is take the square root of the sum of the squares of the differences between x and y, and this is the most common notion of distance, used in many different contexts, including images and NLP and so forth. Another interesting one is the L1 norm, which is the sum of the absolute values of the differences in each dimension, sometimes called the Manhattan distance, because it's basically the distance you would have to travel in a city that looks like a grid, so one block at
a time, and you take the sum of all the steps you make along the coordinates. This works well in the US, very badly in Europe, as you can imagine, because our cities are not built on a grid. Quick question? Ah,
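The four distance measures just listed can be written down directly; a minimal sketch, where cosine distance is taken as the angle in radians (other conventions, such as 1 minus the cosine of the angle, also exist):

```python
import math

def jaccard_distance(s1, s2):
    """1 - Jaccard similarity: zero iff the two sets are identical."""
    return 1 - len(s1 & s2) / len(s1 | s2)

def cosine_distance(x, y):
    """Angle between the two vectors, in radians."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return math.acos(dot / (nx * ny))

def l2_distance(x, y):
    """Euclidean (L2) distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def l1_distance(x, y):
    """Manhattan (L1) distance: one city block at a time."""
    return sum(abs(a - b) for a, b in zip(x, y))

print(l2_distance((0, 0), (3, 4)))   # 5.0 (the 3-4-5 triangle)
print(l1_distance((0, 0), (3, 4)))   # 7
```

Note how the same pair of points gets different distances under L1 and L2: the Manhattan walk (3 blocks plus 4 blocks) is longer than the straight line.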
ah, yes, we are going to be seeing how. Yeah, it's going to be a relatively different hash function from whatever we have seen up to now — it is based on projections — but I'll give you the intuition for why it can work in all dimensions. Great. Okay, so another key concept for today that I really want to keep alive in your mind is: what are families of hash functions? So let's say we have our min-hashing signatures. We have one min-hash function for each possible permutation of the rows, as you can remember, and all of these together are the space of the possible hash functions we could be using for min-hashing, OK? So basically, as a reminder, a hash function here is just any function that tells us whether two elements are equal or not, and a family of hash functions is any set of hash functions from which we can pick one at random efficiently — and here the key word is efficiently, okay? It's relatively easy to code something that will give you a proper family of hash functions, but if it's extremely expensive to compute, then you are kind of defeating the purpose of doing LSH, OK? So we want to be able to sample from those families very efficiently, and so we saw the example for min-hashing: all the possible permutations of rows are very easy to generate. We're going to be seeing similar tricks for the other hash functions today. Now let's define a locality-sensitive family, OK? So suppose we have our space of points S, where we define our distance measure d. A family of hash functions is said to be (d1, d2, p1, p2)-sensitive if, for any x and y in S — so for every candidate pair that we work on — we have the following two properties: if the distance between x and y is less than or equal to d1, then the probability, over all the possible hashes in our family, that x and y hash to the same value is at least p1; and conversely, if the distance
between the two points is larger than or equal to d2, then the probability is at most p2. So here we are basically defining boundaries, right? We're saying: if our two elements are close to each other, within a certain threshold, then we want the probability of the hashes being equal to be above a certain probability, because these are the candidates that we care about — they're close to each other — and we want p1 to be as big as possible, all right? Conversely, if x and y are far from each other, then we want our probability p2 to become as small as possible, right? And what we're going to be seeing today is going from the standard min-hash that we saw last Tuesday, which has not very exciting values for p1 and p2, and stretching them in such a way that p1 is going to become almost one and p2 is going to go straight down to zero, OK? And that's why this is the most important tweak for the realization, because it really makes the difference in how much computation we have to perform. Here is a visual cue for what is going on. First of all, the first thing that I want you to pay attention to: this is a distance, right? So we moved from similarity to distance. Keep always in mind that, with Jaccard for example, the similarity is nothing but one minus the distance, OK? And that's the reason why our S-curve here is now flipped, right? Don't get confused by this, okay — we just mirrored the curve. And what we want to happen is the following: whenever there is a small distance, we're going to have a high probability that the hashes are going to be the same; when the distance is high, once again, we want the probability of hashing to the same value to be very low. How do you think we want to squeeze this plot? Like, we have these four moving dashed lines: what would you do with d1 and d2, and what would you do with p1 and p2? What should happen to them? Anyone want to give it a shot?
So the p's should be very easy, right? We said we want p1 to become as big as possible and p2 to become as small as possible. So we want p1 to go to the top and p2 to stretch to the bottom. Now, what do we want for d1 and d2? Do we want those two dashed lines to get closer to each other, or further from each other? Closer — perfect! So why do we want to get them closer? Ideally, our threshold lies between d1 and d2, so we want to make sure the curve is shaped in such a way that everything that has a distance less than d1 has a high probability of colliding,
while pairs that have a distance larger than d2 have a very low probability — and that's exactly what we're going to see, okay. So these are distance thresholds here, and this is how we basically approximate the ideal S-curve: we're going to play with the parameters so that d1 and d2 are squeezed together while p1 and p2 get far from each other. All right, hold here. Great, okay, so let's now apply this lesson to MinHash. Once again, S is the space of all possible sets, d is the Jaccard distance, and H is the family of MinHash functions for all the possible permutations of rows. Then, for any hash function h belonging to this family, we want this property to hold: the probability that two sets hash to the same value is one minus the Jaccard distance. We just went from similarity to distance; we haven't done anything different from what we saw on Tuesday. Okay, perfect. So what is our claim — what do we want to establish about the behavior of MinHashing? It is that this is a (1/3, 2/3, 2/3, 1/3)-sensitive family for our space of sets S and the Jaccard distance d. Okay, so this is the behavior of MinHashing without playing any tricks with bands and rows, and that's why I was saying before that these values are not exciting — though once again, it's way better than doing the quadratic comparison. Okay, because what this is telling us is the following: if the distance is one third, which means the similarity is two thirds — because we have to take one minus one third — then the probability that the pair will be returned as a candidate is two thirds. So nothing better than a standard hash function, and we want to do better than this. How do we make it happen? Okay, and let's go faster, otherwise we are out of time. So, can we then reproduce
the S-curve that we saw before for any locality-sensitive hashing family? Okay, so the first trick that we're going to apply is our bands technique, which we've also seen last Tuesday, and we're going to have two different constructions — both of them, as we said before, are trade-offs. The first one is the AND construction, where we're going to be working with r, the number of rows in a band. The second one is going to be called the OR construction, which tunes b, how many bands we are going to extract from our input matrix. All right. Okay, so how do we amplify our hash functions, how do we amplify our S-curve? We can play in the following way. Okay, so first of all, we start from our large family of hash functions H, and then we want to get a new family out of it, which we call H', and here we're talking about the AND construction. Okay, so we are trying to reproduce the effect of the band size: the more rows we have in a band, the stricter this condition becomes. Okay, so essentially, what we want is that the hash functions between elements x and y will be matching for all the rows contained in that band. And all of a sudden what we do is we go from a family that is (d1, d2, p1, p2)-sensitive to something that is going to be (d1, d2, p1^r, p2^r)-sensitive. Okay, so we're making both of those probabilities smaller, right? And,
to make sure that this analysis holds, we have to make the assumption that our hash functions are independent. Okay, we're going to see in a while why this is important. So, do you think that both these changes are beneficial, or is one of them beneficial and the other one detrimental? Who wants to give it a shot? [Student answer.] Exactly, that's right: we lower the probability for the large distances — remember, d2 is when two elements are far apart — so lowering that probability is doing something good, while at the same time we're lowering the probability for the small distances, and this is bad, right? So that's why neither of the two constructions taken in isolation is useful: we are going to have to play with both of them at the same time. Now, before you ask me the question: why do we make this assumption that the hash functions are independent?
Let's jump to this slide — you can read it with more attention on your own later, but here's the gist of the idea. Especially for MinHash, there will always be some permutations that are very much correlated with each other, okay. But we are not talking only about the worst case here: in general, among all the possible members of our family of hash functions, the average case is not going to be the worst case, so two hash functions drawn at random will be pretty much independent. This is one more reason we're introducing a bit of error into our estimates, and that's why I told you before that this is a technique we study in expectation — we don't expect it to have one hundred percent perfect performance. Okay, but this is just to reassure you that the assumption we made about independence makes everything more tractable, and it doesn't introduce a lot of additional bias. All right, so this was the AND construction; let's jump to the OR construction, which is very similar. The only thing that is changing here is that we want our hash functions to be equal for at least one of the bands. So here, what we're saying is: we take one hash function per band — say b equals ten, for instance — and only one of them has to match; if it matches, then we return those two elements as a candidate. So once again we took our original family and then we modified it in this way: the distances stay the same, but p1 becomes 1 − (1 − p1)^b, and the same follows for p2. It shouldn't come as a surprise by now that we are doing something good and something bad at once, again: we're raising the probability for the small distances — you can see why, because we're doing one minus a quantity that is getting smaller — and at the same time we're raising the probability for the large distances, which is bad. Okay, so once again, that's why we need the two constructions
happening at the same time; otherwise we only optimize one aspect of the problem. Okay, so here is the intuition: the AND makes both probabilities shrink, but when we choose r correctly, we can make the lower probability approach zero while the higher one does not. Okay, so this is the first property that we care about: the probability of returning candidates among elements with very low similarity values should be as close as possible to zero, and that's what we get by making r larger. And at the same time, as we said before, the OR construction stretches the curve at the top left, so we raise the probability that, whenever the similarity value is high, we return the pair as a candidate with probability very close to one. Okay, so we went from the intuition to the mathematical instrument to make this happen in our S-curve expression. Any questions so far? Cool, great. Okay, so we can combine the two constructions together. Ah, I'm going to go very quickly through the next slides; the idea that I want to give you is that they can be composed in many different ways — you don't have to strictly apply first the AND and then the OR. You can apply them as a chain: you can do AND-OR-AND-OR. The outcome that you care about at the end of the day is p1 and p2, okay. So let's say, for instance, here we are composing constructions — let me just go to an example with the curves. So here we chose four bands and four rows per band. Okay, you remember, this was the formula that we were using before, with b and r on top. So what happens, for instance, when we have a point with similarity 0.2?
Okay, so the chance that a candidate will be returned if the similarity equals 0.2 is this very small probability there; conversely, with 0.8 it's going to be about a 0.88 probability of returning it as a candidate, okay. So we went from a family with the original properties — not as powerful as we'd hope — to something way stronger, especially against the candidates that have very low similarity values. Okay, so we're starting to stretch this S-curve as we need, and as you can imagine, you can make those numbers bigger, and this shape is going to keep on changing. So now we're going to address your question: how do we choose r and b in a way that does exactly what we care about? Okay, so we want to pick r and b to get our desired performance.
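To put numbers on this, here is a minimal sketch (my own Python, not code from the slides) of the amplified S-curve: with r rows per band (AND) and b bands (OR), a pair with similarity s becomes a candidate with probability 1 − (1 − s^r)^b:

```python
def s_curve(s: float, r: int, b: int) -> float:
    """Probability that a pair with similarity s is returned as a
    candidate when using b bands of r rows each (AND within a band,
    OR across bands)."""
    return 1.0 - (1.0 - s**r) ** b

# The r = 4, b = 4 example from the lecture:
print(round(s_curve(0.2, 4, 4), 4))  # low-similarity pair: 0.0064
print(round(s_curve(0.8, 4, 4), 4))  # high-similarity pair: 0.8785
```

Raising r pushes the low-similarity probabilities toward zero; raising b pushes the high-similarity probabilities toward one — exactly the two trade-offs just discussed.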
So say we fix, once again, fifty hash functions, with r equal to five — five rows per band — and then ten bands in total. What we can accept — look here — is our green area, because it represents the false positives. As we said, we want to make it as small as possible, but we can still afford having it, because we can compute the actual similarity in the next step and then discard the bad candidates. What we tend to worry about instead is this blue area here, because it contains our false negatives, which will never become candidates. Okay, all right. Now, here we start to see curves that behave in totally different ways, even though we have fifty hashes in total in each case. So the computational cost of running LSH with fifty hashes is going to be exactly the same, but just by changing these two parameters the behavior is drastically changing. All right. So, for example, if you take the blue curve, which is r equals five and b equals ten, it's probably suboptimal for a similarity threshold around 0.6, because it will have that blue false-negative area that we don't like; and something like the red curve, with r equals ten and b equals five, is going to be too strict, okay, because it's going to return with probability equal to one only similarity values that are very high. Okay, so by changing those parameters, and not by much, you can change the behavior of the system dramatically. Okay. Okay, so we'll go quickly through this part here. What I want to show you is that you can also do the OR construction first and then the AND, and you get different numbers — you just have to plug those two parameters into our formula, plot the curve, and see how it behaves, okay, and you can change them as I said before. So now let me give you a real-world example with a number that is not shocking: let's say you're using a total of two hundred and fifty-six hashes in total — you compose a 4-way AND construction with a 4-way OR, and then apply the same again on top. Multiplying everything, you get 256 hashes in total, and this is
the behavior of our family: the probability that a pair will be returned as a candidate is pretty much flat — it goes to one for anything that is above a high similarity value, and it is extremely small, practically zero, for anything below 0.2. Okay, so you can see that the performance improves very quickly as soon as you add a few more hashes. You don't need to compute a million hashes per data point, because that would defeat the whole point of doing LSH in the first place. So yes, it's still very cheap, but it gets extremely more powerful by doing that. Okay. We're out of time on this slide — again, check it later — so let's go just to the summary with the key points that we saw before. You pick any two distances d1 and d2 between your points, then you start with a standard LSH family that has the given sensitivity, and then you apply the different constructions to amplify the behavior of this family, okay, and you want to make p1 almost equal to one and p2 almost equal to zero. Okay, all right. Questions before we jump to the other distance metrics? [Student question about less extreme curves.] You mean something like this? Because here we are always plotting curves that are not very extreme — you see, we have only four hashes for the 4-way AND construction and the 4-way OR construction, so you still see this kind of gradual behavior. As soon as you plug in one order of magnitude more hashes, these curves will look almost like a hockey stick. Okay, this was just for the sake of having values in this column and these rows that are different; otherwise we'd be seeing everything equal here. And there was a question at the top — let's go quickly back to take a look at that. So the AND construction is the following: you want to make sure that, when you consider a band — remember what we were saying before, we have all the rows in a band — you want all of them to hash to exactly the same value. So this is the AND construction. It means, if you want
your band to be formed by ten rows, then you're going to need ten different hashes; if each one of them matches, then you say this band is matching. Okay, and that's why it amplifies the probabilities in that way.
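As an illustration (my own sketch, not code from the course), the band rule just described can be written down directly: AND within a band, OR across bands:

```python
def is_candidate(sig_x, sig_y, b: int, r: int) -> bool:
    """Return True if the two signatures collide in at least one band
    (OR across bands), where a band collides only if all r of its rows
    match (AND within the band)."""
    assert len(sig_x) == len(sig_y) == b * r
    for band in range(b):
        rows = slice(band * r, (band + 1) * r)
        if sig_x[rows] == sig_y[rows]:  # all r rows in this band match
            return True
    return False

# Two toy signatures with b = 2 bands of r = 3 rows each:
x = [4, 7, 1, 9, 2, 5]
y = [4, 7, 1, 8, 2, 5]  # first band matches, second does not
print(is_candidate(x, y, b=2, r=3))  # True: one matching band suffices
```

Note that a single differing row kills a whole band (the AND), but a single fully matching band is enough to return the pair (the OR).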
Conversely, the OR construction is the following: you have b bands into which you subdivide your matrix, then you apply a hash for each band, and you are happy if just one of them matches. Okay, so we are just putting numbers on the intuition that we talked about at the beginning, right? Does that make it more clear? Great. Anything else, or shall we jump to the rest? [Student question.] Yeah, you can do bands within bands — that's this construction here: you apply one AND-OR construction, followed by another AND-OR construction. You can do pretty much any kind of composition, and it will change the numbers and bend the curve in different ways. So, as I was saying at the beginning, it depends on your context. Let's say you are trying to do matching of fingerprints: then it will be extremely bad for you if you don't consider certain candidates. But if you're using this as a basis for a recommender system, for instance to recommend shoes and clothes, and you're missing ten good candidates, nothing happens. Okay, so that's how, given your requirements, you play with those numbers. Okay, thanks very much. [Student comment.] Yeah, exactly — because I'm still working with these examples with Jaccard similarity; cosine similarity is what we're going to see in the following slides, which works with a different distance metric, and then we're just going to apply everything in a straightforward manner. Thanks for your clarification. Okay — oh, a question: does the number of hash functions match the number of rows?
No, it matches the number of rows in a band: you're going to have one hash function for each row you consider in a band, and then you're going to have hash functions for however many bands you have in your matrix. Because if you had one hash function per row in your dataset, you would be working with a billion hashes, right, and that's something that doesn't scale. Makes sense? [Follow-up question about the space of possible hash functions.] Yes — the space of possible hash functions that you can have, like what we saw with MinHashing, is all the possible permutations of the rows. That number is extremely big, but you don't need to use all the hash functions in your space. Okay, so that's why we went from the space of possible hash functions to the family that we care about, which has the properties that we like for p1 and p2. We're going to see exactly the same thing in the following examples I'm going to give you: there is basically a basis of all the possible hash functions that you can generate with a few tools of linear algebra — you can generate plenty of them, but we are going to select only the subset that we care about. Does that make it clear? Great. Anything else? Okay, let's jump to the last part. So now we are going to see how we plug different distance metrics into LSH. This should look familiar by now — we've seen it probably ten times: we are feeding points, general data points, into our hash functions, and we want to turn them into their signatures. Now we have learned how to design our (d1, d2, p1, p2)-sensitive family of hash functions; what we need here is to come up with a way to generate this family from a particular distance metric. Up to now we have seen it with the Jaccard similarity and the Jaccard distance; now we are going to see it with different examples. Everything else that we learned up to now is going
to be applied exactly the same way, so that's another beautiful thing about LSH: once you perform this conversion and you obtain the signatures, then whatever the input data is, the rest is going to work exactly the same way. All right, so in the lingo of machine learning, you can think about this as some sort of feature engineering, and this is the model where we are tuning the other parameters. Okay, I guess some of you are taking 229 this quarter, so I tried to make your life easier.
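As a quick refresher on the Jaccard case we keep building on, here is a small sketch (my own, not from the slides) that checks empirically that a MinHash — the first row of a random permutation that belongs to the set — collides with probability equal to the Jaccard similarity:

```python
import random

def minhash(s: set, perm: list) -> int:
    """MinHash of set s under a permutation of the row universe:
    the first row in the permutation that is a member of s."""
    return next(row for row in perm if row in s)

rng = random.Random(0)
A, B = {0, 1, 2, 3}, {2, 3, 4, 5}
universe = sorted(A | B)
jaccard = len(A & B) / len(A | B)  # 2 / 6

trials = 20_000
hits = 0
for _ in range(trials):
    perm = universe[:]
    rng.shuffle(perm)  # one random permutation = one hash function
    hits += minhash(A, perm) == minhash(B, perm)

print(jaccard)          # 0.3333...
print(hits / trials)    # collision rate, close to the Jaccard similarity
```

The collision rate tracks the Jaccard similarity because the minimum is determined by whichever of the union's rows comes first in the permutation, and that row lies in the intersection with probability |A ∩ B| / |A ∪ B|.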
Okay, now, we have seen how this works with MinHash: we have our bands technique, and we go from the representation with the input matrix to something more compact. The first example that we're going to see in a moment has to do with the cosine distance, and instead of using MinHashing we're going to use the random hyperplanes, and this is the kind of representation we get — everything else is going to work exactly the same way. Okay, so let's jump to the cosine distance. Why do we care about it, why is it important? It's used in a lot of different contexts. It has been used a lot in information retrieval, when you want to retrieve documents that are similar to each other, for example: you have a vectorial representation of a document, and you want to measure the angle between two documents — the smaller the angle, the higher the likelihood that these documents are similar — and what we want to do with LSH is, once again, return only the few candidate document pairs that should then be double-checked. Okay. We can express the cosine similarity — you will find it in papers and textbooks — in a lot of different ways. What we care about is that theta has a range between zero and pi, and we divide our angle theta by pi so that we normalize our distance into the range between zero and one; and, as before, you can obtain the cosine similarity just by doing one minus the distance, so the two things are connected in that way. You will also find it defined at times in the following way, with the dot product between the vectors A and B divided by the product of their norms; everything can be normalized to the other representation easily, so I'm giving you basically this advice in case you see it implemented in libraries or you find it in papers — the two expressions are pretty much the same. What we care about, though, is the following key observation: as we were saying before, two vectors that are very similar to each other
form a very small angle, so the cosine goes to one. Two vectors that are orthogonal, and therefore clearly not similar to each other, are going to have a cosine equal to zero. And, interestingly enough, if they point in opposite directions, the cosine will go to minus one — and that's also very interesting in case you want, for instance, to go for serendipity in your recommender system, such as trying to recommend items that are as different as possible from each other. So this is why it's also interesting to have this measure that tells you if two documents are as dissimilar as it gets. Okay, so now that we have this distance defined, how do we create our hash family? We have this technique, which we call random hyperplanes — similar to MinHashing, so a lot of ideas will be in common, but we need to apply a few tweaks here and there to make it work. Okay, so we start from our standard family, which is the one we need to define our distances, and this is our probability: one minus the distance, once again normalized by pi. And this is just a reminder of what this means — we have seen it twenty minutes ago, so it's not worth going through it again. So this is our definition of the random hyperplanes hash family. How does this work, how are we going to use it, and why does it generate a slightly different signature compared to MinHash? So we do the following: for each vector v, we define our hash function h to hash into just two buckets — the return values are going to be either plus one or minus one. And let's see what this very simple hash function does: it takes the dot product, and it returns plus one if the dot product is greater than or equal to zero, or minus one if the dot product is less than zero. What are we doing here, exactly? What is this telling us?
Anyone want to try, if the linear algebra is fresh in your brain? [Student: it tells us on which side of the hyperplane the two vectors are lying.] Right! Okay, so if they're lying on the same side of the hyperplane, we're going to get the same sign; if that's not the case, we're going to get different signs. Okay, so it's a very easy hash function. And how do we build a family? Basically, each possible vector that we can have in our vector space is a potentially good candidate for a hash function, so we just take an arbitrary vector, and then we use it to compute the dot product with the vector of our element. And our claim is that the probability that the hashes of the two elements are the same is equal to one minus the distance between those two points — that is, one minus the normalized angle between our two points. Okay? So the smaller the angle, the higher the probability that the hash function will return the same value for both.
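Here is a minimal sketch of that claim (my own code, not the course's): hash by the sign of a dot product with a random vector, and check that the collision rate matches 1 − θ/π. For two orthogonal vectors, θ = π/2, so the collision probability should be about one half:

```python
import math
import random

def hyperplane_hash(v, x):
    """+1 if x lies on the positive side of the hyperplane normal to v,
    -1 otherwise (i.e., the sign of the dot product)."""
    dot = sum(vi * xi for vi, xi in zip(v, x))
    return 1 if dot >= 0 else -1

rng = random.Random(0)
x, y = (1.0, 0.0), (0.0, 1.0)   # orthogonal vectors: theta = pi / 2
theta = math.pi / 2
expected = 1 - theta / math.pi  # = 0.5

trials = 20_000
hits = 0
for _ in range(trials):
    v = (rng.gauss(0, 1), rng.gauss(0, 1))  # random hyperplane normal
    hits += hyperplane_hash(v, x) == hyperplane_hash(v, y)

print(expected)       # 0.5
print(hits / trials)  # close to 0.5
```

A random hyperplane separates the two points exactly when it falls inside the angle between them, which happens with probability θ/π — that is the geometric fact the next slides illustrate.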
Okay, now let's try to make this intuitive — we're going to see it visually on these slides. These are our two data points x and y, with an angle theta between them, and we select our first vector v. We don't care that much about the vector itself; we rather care about the hyperplane that is normal, orthogonal, to our vector v, exactly for what we were saying before: what we care about in our hash functions is to understand on which side our points are lying — on the right side or on the left side of the hyperplane. And what is important is whether the hyperplane falls inside the angle, not where the vector falls. Okay: if the vector is inside the angle, we don't care, because it's not going to change the output of the hash function; but if the hyperplane cuts between the two points, then the hash of x will be different from the hash of y, okay, because they are lying on different sides of the hyperplane. So, intuitively, what are we trying to build here? We subdivide the space with different hyperplanes, and this set of hash functions defines a polytope in our space: we cut the space with multiple hyperplanes, and then, with the outputs of the different hash functions, we understand on which side of each hyperplane the point lies. Okay, so let's try to see here why the output probability is proportional to what I was showing you before. So again, let's go step by step — I just lost my pointer, sorry, so I'm going to be waving. We have our point x up there, y here; then we have two hyperplanes, one defined by the red vector — so it's the red dashed line — and another one defined by the blue vector, which is the blue dashed line. So what happens here?
x and y are going to hash the same with respect to the blue hyperplane, and we're going to have a different hash for the function defined by the red hyperplane, okay, because one vector is on one side and the other one is on the other side. So why is the probability proportional to the angle? The smaller this angle is, the lower the likelihood that there will be a random hyperplane cutting between those two vectors. Okay, that's the only intuition that you need to care about: the closer they are to each other, the fewer hyperplanes will be cutting between those two vectors, and therefore most of the hash functions will return the same value. Conversely, if this is your initial angle — a wide one — then there is a high likelihood that there will be a hyperplane cutting between them, and therefore a lot of the hash functions will return different values. Okay, that's the whole intuition behind how this random hyperplanes technique works. Everything clear? Questions? Great question — so you ask: are we going to sample the vectors that define the hash functions in a uniform way or not? And I can also add, on top of that, another question that could come to your mind: by the way, how expensive is it to generate random vectors with arbitrary norms and real-valued components? It turns out that we don't need to do that. Okay, so the only thing that we have to do — let me go back to this slide — is we just want to
generate vectors whose components are equal to plus one or minus one, so they're all going to lie on a sphere, and in this way we also make sure that all the hash functions we generate are not skewed: we can just take random values of plus one and minus one, and then we have a uniform distribution of the vectors that are creating the hashes. Okay? So that's why it's cheap — because we don't have to generate random real values — and also that's why we can make it uniform, just by permuting the plus and minus ones on the components of the vectors. Okay, great. [Student question about high-dimensional data, specifically about the size of this family.] Yes, correct — great question. The question was: if your data has n dimensions, doesn't that mean that you cannot have more than two to the power of n hash functions? The answer is yes, and the second answer is that it's way more than enough, right? Because usually
we care about LSH when our data has a very high dimensionality — otherwise this technique is overkill — and two to the power of one hundred is already an astronomical number, so you can generate more than enough hash functions in that case. Okay, right, thanks for your clarification. Okay, let's go quickly through the example now. We covered LSH for cosine; now let's try to get the intuition of how this will work with the Euclidean distance. Now we have another family of hash functions that we're going to use, and those are all based on projections. Okay, so what we do is: our hash functions just correspond to lines — we take lines in our Euclidean space, okay — and then we partition each line into buckets of size a. We take uniform buckets, we partition the line, and what we do is we hash each point to the bucket containing its projection onto the line. We're going to see a visual example that will make it super easy after that, but it's all based on this concept: generate lines — a very cheap operation to do in a Euclidean space — subdivide them into buckets, take the projection, and whichever bucket you are a member of, that's the output of the hash function. And the key intuition here is that nearby points stay close to each other after projection, right? And so there is a high chance that they are going to belong to the same bucket. Conversely, distant points will usually not belong to the same bucket, so the hash output will be different. Now let's see the visual explanation, starting from the lucky case. Okay, so we have one hash function, we only have our line and some equal-size buckets; the points that are close hash into the same buckets, so our green and blue points are going to be matched together, and red is in another bucket — the perfect scenario. It's not always the case, so now we see the two unlucky cases. The first one is when we are not lucky with the partition — with how many buckets we create on our line. So sometimes we could just be unlucky, and indeed the
boundary between two buckets lies between our two points, and therefore they end up split. A simple workaround for this is simply to have more buckets — but then it becomes more expensive, so it's the same trade-off that we've seen before with hashing, for example. And the other unlucky case is when we are unlucky with the projection. Here we're always drawing two dimensions — and I think I'm answering your question now: can we generalize this to multiple dimensions? The answer is clearly yes, and you can now see why, because you can project from n dimensions easily. But the reason why you need multiple hash functions is that, if you pick a direction that is unlucky like this one, all the projections might end up in the same bucket, okay, even if, had you projected in the orthogonal direction, they would belong to all different buckets. So that's why we clearly need more than one line to make it work well. Great. Okay, so we have our animation — that's how it works in case we have multiple directions. So you see, on some of these lines we are getting lucky and the projections are good; on some others they are bad. So, similar to what we've seen before with the cosine similarity, some of the hashes are not going to be ideal, but the majority of them will put the points together in case they're close, and we get the output that we want. One interesting thing to see here: if the distance between the two points is a lot smaller than a, then the chance that the points are going to be in the same bucket is at least one minus d divided by a, okay — so the closer they are to each other, the more chances you have to end up exactly in the same bucket. Conversely, we have the other case: if the distance is a lot larger than a, then for the points to land in the same bucket, the angle between the segment connecting the two points and the projection line basically has to be close to ninety degrees. Okay, so in this case, when the angle is not close to ninety
degrees, the projections will end up in different buckets. Now, you can imagine those two points having an angle of almost ninety degrees with respect to the projection line — that's the unlucky case, where they will be projected into the same bucket even though they are far apart. But once again, the trick of using multiple hash functions saves us: that line is not going to work, but another one will, and with it we can easily make them members of different buckets. Perfect. Now, here once again we are going to be able to play our tricks with bands and rows. So the standard family that we obtain from this, in the first instance, is (a/2, 2a)-sensitive,
and these are the two probabilities that we were seeing before. By playing with our AND and OR constructions we can then amplify the behavior of the S-curve as we have seen before. Okay, so basically, as a summary — and then I'm going to take some questions so that we can clarify everything we've seen today: we have data that comes in very different shapes — it could be text, it could be vectors, it could be images, whatever you want. As long as you can represent it in a way that allows us to build our family of hash functions with the properties that we care about, it can become the input of our LSH pipeline: we produce our signatures, then we go through the LSH pipeline, obtain our candidates, and we can tune the behavior, once again, with our AND and OR constructions. If it makes it easier for you to remember, just think of them as the bands-and-rows construction, because at the end of the day those are the two parameters that we are modifying. And we've seen that this can work with MinHash, or it can work with the random hyperplanes, or with the last set of hashes that we've seen, the one in the Euclidean space. Okay. Questions? [Student: do the hash functions correspond to the lines, or to the rows?] How do the hash functions correspond to rows and bands? They apply to the rows, as we said: we have hash functions for the rows in a band, and then we have hash functions for each single band. What the rows contain really depends on the representation that we have of the data: in MinHash, it's going to be the output of the MinHashes from the permutations; for the cosine similarity, we have seen this sketch representation with the plus and minus ones — so the result is a signature, a sketch, where each component is going to be a plus or minus one; and in the Euclidean version, we have seen that it's based on the bucket of the projection. So regardless of what your output is, at the end of the day,
You're going to be having this input matrix, and there you can subdivide it by bands, you can decide how many rows you have in a band, apply your hashes, get your S-curve, and the magic happens. Thanks. OK, let's go to the recap slide so that we finish on time, and then I'm happy to take any final questions. So the two important points that I want you to take away from this room today are the following. First one: we talked about the property that the probability that two sets S1 and S2 hash to the same value is proportional to the similarity between those two documents. OK? So we can build hash functions that have that kind of property, and this is the essential core of LSH. Without that we can't really do anything at all: if we were unable to build those hash functions, everything we talked about would be impossible. And the additional trick that takes LSH from a very elegant solution to one that is scalable and extremely powerful on large datasets is the fact that we can play with our banding technique: we can decide the size of the bands and the number of bands with which we subdivide our input matrix. Once we apply this, we can make LSH behave in drastically different ways, depending on our domain. OK, so I wrap up the lecture here; I'm happy to take questions or clarifications. Question from the chat: after you've got the candidates and you want to check their similarity, you're not necessarily using a hash function anymore; it could be anything. You could work on software plagiarism, and then you're going to have functions that tell you if two programs have similar constructs or similar blocks of code; you could have images, and then you're going to be analyzing objects in the images. So it really depends on the use case. The beautiful thing about LSH is that first part: you go from your input to a set of signatures, you get out of there the members of your buckets,
and then you can refer back to your data. I guess it's a transformation, like when you do in linear algebra: we go to a different domain, we do the dimensionality reduction in that domain, and then, poof, we go back to our real data. Makes sense? Question: how are we going to partition the lines into buckets in the Euclidean space example?
OK, yeah, great question, thanks. This is a problem in case you have outliers, for instance: if you know that you have points far away from the others, then you will have to stretch your lines, and your buckets are going to become wider just because of those few outliers. LSH, like plenty of other techniques that work with dimensionality reduction, does require some preprocessing of the data first of all, so if you leave outliers in there it will underperform compared to when you remove them. You kind of define a narrow d-dimensional space that the points are lying within, and then the lines are just going to run from one extreme to the other. Thanks for the great question. Time for one last question? Yeah, I'll repeat the question for the sake of all the people in the classroom: how do we overcome the curse of dimensionality with LSH? Ah, the answer is that LSH doesn't help you address that; you need to figure out your own way to get rid of it. Basically, what it's telling you is that if you're using cosine similarity, and you are aware that your data is highly dimensional, and you're getting a lot of sketches with minus ones where the dot products are all tiny, then you know that either cosine similarity is the wrong similarity measure for your use case, or you want to preprocess your data in a way that removes the curse. Great question, thanks. Anything else? I think we are running out of time. Thanks, we'll see you next week. Thank you.
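As a supplementary sketch of the plus/minus-one cosine sketches mentioned in that last answer (assuming NumPy; the dimensions, plane counts, and points are arbitrary), each random hyperplane contributes one sign bit, and the fraction of agreeing components estimates the angle between two vectors:

```python
import numpy as np

def cosine_sketch(x, planes):
    """SimHash-style sketch: one +1/-1 per random hyperplane,
    depending on which side of that hyperplane x falls."""
    return np.sign(planes @ x)

rng = np.random.default_rng(1)
dim, n_planes = 50, 1000
planes = rng.normal(size=(n_planes, dim))

x = rng.normal(size=dim)
y = x + 0.1 * rng.normal(size=dim)   # a slightly perturbed copy of x

sx, sy = cosine_sketch(x, planes), cosine_sketch(y, planes)
agree = np.mean(sx == sy)
# One hyperplane's signs agree with probability 1 - angle/pi,
# so the agreement rate recovers the angle, hence the cosine.
est_cos = np.cos(np.pi * (1.0 - agree))
true_cos = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
print(agree, est_cos, true_cos)
```

For truly high-dimensional, unpreprocessed data, random pairs are nearly orthogonal, the agreement rate sits near one half, and the sketches carry little signal, which is the symptom described above.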