RailsConf 2016 – Multi-table Full Text Search with Postgres By Caleb Thompson

(upbeat, exciting music) – Today we’re gonna talk about a somewhat non-standard approach to search in your applications Often, search is done in some sort of external service Sorry Better? Better? I’ll just stand like this for half an hour Starting over, because I had that extra time, we’re gonna talk about a somewhat non-standard approach for search in your applications Often, search is gonna be done with some sort of an external service Rather than doing that, we’re gonna use the tool that’s already responsible for storing and finding data Welcome to mult-table full text search in Postgres I am Caleb Thompson, we are not doing a Wednesday hug You can find me at these places on the internet, feel free to tweet to me during the talk I love that shit But if you’re not tweeting, please do close your laptops There’s going to be a lot of code here, and it’s going to be hard to get anything out of this talk if you’re not looking Now that I’ve told you that you should close your laptops, I don’t like to jump into credentials at the beginning of a talk, you’re already here, and that’s sort of what the point is We had the abstract, we had the bio So you’ll hear a little bit more about me at the end, but I do have one claim to fame, that that’s that my birthday is in the SimpleDelegator dox, so you know I’m an expert I’m going to talk to you about a real-life feature We’re gonna iterate, explore other options, and optimize, just like we would when we’re developing a feature in our applications We’re gonna talk about full text search What it is, how it can help us Hopefully, you could’ve guessed that from the title We’re gonna talk about views, and no, not the HTML views, we’re gonna talk about database views Naturally, we want to explore the performance implications of whatever we’re doing So, we’re gonna look at some of the performance implications, and how we can mitigate them And we’re gonna talk about materialized views, of course, as one way to do that We’ll look at a couple of gems that can helps us out while we’re doing this fun stuff, and, of course, we’re gonna look at all of the other options, or some of the other options for what we’re doing Alright, let’s look at sort of the classic example Let’s search for articles in our applicaiton The simplest thing we could possibly do that works is to search for substrings So, here we’ve got articles where the body includes, how’s that looking? Terrible? Can somebody hit the lights over there? Just keep going, it’s fine How’s that, better? That looks better, I can see it We’re gonna look at articles where the body includes some substring, fasten that query You’ve probably seen this in your applications This works It works if your users know exactly what they’re searching for, so if they want to find an article with, say, the word book in it, and they know that it’s going to be lowercase, in the middle of a sentence somewhere, then they can search like this, and you can pass that search right in Like I said, that’s doing exact substrings That’s not the most useful thing So, a tiny little step forward that we could take is to do the case insensitive like, with ILIKE I don’t like this But it is slightly better Alright, well, let’s leave that It sort of does what we need for now We don’t need to know where in the sentence the word that you’re searching for is, that’s cool Okay, well, naturally, features expand We need to search based on the titles of our articles We probably should’ve seen that coming when we were searching We can handle this, too We’ll just sort of extend what we had already done We’re gonna pass in that query two times instead of once, and we’ll still do that case insensitive like, and the percent signs in here are so that it’ll show up anywhere in the query Anywhere in a word So you could have a sub, any sort of substring in the text So now we wanna search by the author’s name This is getting a little more complicated, but, you know, again, probably something we should’ve seen coming

We’ll go ahead and join onto our user model I apologize, we’re gonna use the user and author’s tables interchangeably in this talk Users table This is basically that same query, but we’re doing, we’re letting Rails, arell, handle the join, and then we’re doing, you know, pulling out the user’s name and the same two fields that we were already pulling out That query, query, query is starting to stutter, which is something that we don’t really want in our code So, one way that we could refactor something like this is to do it in a query object And this is less performant, but arguably easier to understand when you’re looking at it But when it comes down to it, we still got really poor results We’re only searching for these case insensitive substrings, and, you know, that’s not great What if the word’s gonna be plural, and we have a singular query or what if we’re searching for singular things Google knows how to do these things, and that’s what our users are going to expect from us when we’re buIlding this search feature Enter full text search Full text search allows us to query for rows based on natural language searching Hey, Caleb, what is natural language searching? I’m glad you asked Natural language searching allows us to remove stop works from the query So these are words that appear in all sorts of sentences, they have not a lot of semantic meaning to us, and we want to not really include them in our search results We don’t want every search that includes the word and, or every article that returns the word and, to show up in our search results Again, just like with the like versus ILIKE, we’re gonna eliminate casing Fairly straightforward We want synonyms to show up, so if our user has sort of a concept in mind that they’re searching for but doesn’t remember exactly what it is, then, you know, both of these should return the same results And we’re going to incorporate stemming, is another feature of natural language searching, which means that related words, like try, tyring, and tries, these are all different versions of the same root word, so we record them under the same concept and when we’re doing our searching under the hood, we’re actually searching for that root word instead of the specific words that were passed in So here’s an example of making that same query We’re gonna sort of zoom around in this code a little bit and we’ll highlight the more important pieces So here we’re looking at the text that we wanna join We’re saying title, can catenate that an empty space, and then the body We’ll call that the text, just because we need to give it a name for Postgres to be happy, and so that’s what, the two pipe operators are the catenation And we’re also going to pull in the author’s name as the text And, naturally, we want the ID when we’re pulling out at the article, and the article’s ID when we’re pulling out of authors We want unique results, because we don’t want the same article to show up a bunch of times in our query, if it shows up multiple times in the body, or if it shows up both in the authors table and in the title We’re in the body Alright, so that’s a lot of SQL Where do we put all of this? We can throw that back into our query object that we looked at last time we had some code Basically, just some inline SQL Pass that through an execute or where Same thing with the scope Throw it in there, just pass in that query so it is interpolated, but, to be honest, SQL doesn’t belong in .RV files, that we’ve got an extension for that And so Postgres actually has our answer in the form of views View is a partial query stored in the database that can be selected from, and it returns a set of columns that can be searched on later on The nature of views is that, because they’re basically just a query, they can have multiple source tables So right now we’ve got the user/authors table and we’ve got the articles table, so this view will allow us to sort of abstract that away and just say this is the text that we care about And then we can perform a where Later on, we can do whatever we need to do to complete that query so that it’s meaningful to our users when they’re actually performing a search later on down the road So, if we were going to build, here’s just sort of an example view We’ve got this create view syntax Just give it a name, just like you would a table We can select distinct user ID So right now, obviously we’re pulling users

with recent activity So, we’re gonna look at a couple of different tables We want to see all of the information about the user, and also the last time they were active So, we only want one instance of each user, like I said, we want all of those rows from users table, and we’re going to create this concept of a user’s last active time by pulling in the activities created at column And we’re just going to limit that to activity, or users who were active in the last seven days So, when we’re looking through this view, it looks pretty similar to what searching through a table would look like You select whatever, select everything from users with recent activity, where, order, whatever you need to do And, in fact, it’s so similar that ActiveRecord can use a view as a backend for a model So what we can do is create a fairly vanilla model, looks very familiar, and we can interact with that as if it were a table in our database So, we’ve got this users with recent activities model As you can see, it’s an ActiveRecord subclass We’re gonna give it a table name, just because our naming of that table didn’t match what Rails would’ve expected when given the class name, and we’re gonna tell it that it’s read only And this isn’t strictly speaking true, but it’s easiest to just assume that a view is going to be read only If you need it not to be read only, then there are some special rules for that That’s an exercise for the audience But, what this says, is it tells Rails that nothing can be deleted, and nothing can be written into this table So, it’s read only, you can only query against it Will this work will full text search? Luckily, yes We’re going to talk now about our first gem, Textacular Textacular was originally written by Aaron Patterson, and lives here in GitHub Textacular takes care of the full text search portions of the queries, and it assumes that you wanna search over every text field, because it’s called full text search, so it’s full text full text search, I guess, on a record, on the table And it gives you some variant search options like basic_search and fuzzy search and advanced search For our purposes, all we really care about is this basic_search And that’s gonna be what the most generally useful when you’re building sort of a single field input that your users use to get results back So, that looks like this If you’re searching for a game, any sort of game that included Sonic, Sonic the Hedgehog, or Super Sonic, whatever, I don’t know And you can get a little more complicated, even with the basic search, and say the title needs to include Mario, and the system needs to include Nintendo, but I want any Mario title on any Nintendo platform So this is sort of the next simplest useful thing that you can do with Textaculars Mix them Alright, so let’s go back and take a look at that search that we wrote This is that same sequel from before To get out articles based on either the article’s name, body, or the author’s name So our search result is really simple on the Rails side We’re going to create this three line class We’re gonna include Textacular, and we’re gonna tell it that it belongs to an article, because we named that field article ID And when we wanna actually use it, if we wanna, say, find an article written by Sandi, or that mentions Sandi, then we just do this basic search for Sandi, and map that onto the articles If you wanted to get a little bit crazy, you could do, you could include a numerable into your record A numerable is a super important and very useful feature of Ruby, built into the standard library, and if you don’t know about it, feel free to come up and ask me afterwards But, basically, it’s gonna give you all of those cool each and map and everything else So, you’d be able to use this class with search result dot new, the query, and then dot each So, basically, you can treat it as if it were any sort of other collection, array-like collection So, creating this view, I’m sure I’ve convinced you now that views are great to use and you want to use them Want to know how to use them? So, creating that is fairly straightforward You’ve got ActiveRecord base connection and execute So this is in a migration that you could actually just shorten this to execute, and we’re gonna use that create view SQL that we just had on the screen And then to drop that, we just say, drop view, and then the name of the view How resistant to change is this? Alright, well, let’s find out Let’s look at some feature creep, as we always have in our features Project manager comes back and says, articles whose comments match the query

should also show up in the results If somebody has mentioned Sandi in a comment about an article, we want that to show up in, that article to show up in our search results So, to recap, we’re now searching on an article’s title and body, an author’s name, and a comments body And any article that any of these things are related to should show up in our search results So, the updated SQL looks like this The updated query And the new part is this new union, with a fairly straightforward select and join where we’re pulling in the body and the article ID from a comment So, let’s take a look at that update view migration What we can do is throw the new SQL into that uprecord, just like we had done before And, unfortunately, what we need to do for the down migration is put the old text of the view into that down migration so that when we’re rolling back, Rails knows what state to put the database into That’s sort of a pain, but we can handle that That’s not too bad And, unfortunately, we can’t always update You actually can’t remove any columns from the view, which we happen to be doing when we’re rolling back this migration, because we no longer have Oh, no, I’m sorry, no we don’t But if you had added a new column into your view so that you have more information, say, that active at column from before If you say, we don’t really need that active at column anymore, let’s remove that from the user recent activity, then you couldn’t just do that with an update So, what you have to do is just, first, drop that view, and then create the view again And again, we need to dump that whole SQL into the migration That sucks You also can’t dump a view into db/schema.rb, and so the solution to that is to tell Rails that the database format is gonna be structure, and then you’re gonna dump in the structure.sql It’s gonna dump out an actual, Postgres SQL version of your entire database Sorry Luckily, we got our second gem, Scenic, which adds a few methods to migrations, and allows views to be dumped into db/scheme.rv, which is what you expect, and just generally makes views a little bit easier to work with Of course, I had a little hand in this I am one of two maintainers, the other one is in this room, of this gem So, creating this Scenic migration is pretty straightforward The read me goes over it, but you’re gonna write into a SQL file basically just the query portion of the view So you don’t need to worry about the create view syntax, the dropview syntax, it’ll handle that for you And because you’re writing it into, you actually write it into a .SQL file, and so you’re getting whatever sort of editor support your, you know, Vim or TeamUp or Sublime or whatever people are using these days, I don’t You get whatever benefits that gives you So mine gives me indentation and some nice syntax highlighting in SQL And then for the syntax in the migration, you actually just have this create view, which is reversible, just like create table, and you can go back to using that change method And then if you need to change the view, you can actually just do this update view You tell it what version numbers, and it knows, based on a naming convention, how to find the new and old versions of the SQL, of the SQL for the view Even that’s a little bit tough to remember, so we did create some generators We actually have a monogenerator, that gives you that read me, or read only, and infers the name of the model based on the name of the view, so that that naming will match up, so you end up with a three line method Or a one line method, and, like, a total of five lines in your file And then when you’re writing into that SQL file, it looks just like this, so the first verison of the searches, we just write in This is that same SQL from before Pretty straightforward, and that lives in a SQL file, so that it’s a lot easier to read and look at when you’re in your editor We also have a view generator for when you need to update that view, so you can just do, rather than Scenic model, you can do Scenic view I don’t need notes And that will give you the next version of,

the next numeric version of the view And it dumps in the old version of the SQL, which then you can update with whatever you needed to add Now, there are definitely some performance implications with this approach As I mentioned, this query is pretty slow It has to search across three different tables and a couple of columns to get us all of the results that we need So, what it comes down to is actually, in order of magnitude, slower to get these results And, unfortunately, views can’t be indexed Luckily, underlying tables can So, the recommendation here is add indices There are several types of index The one you’re most familiar with is btree Btree is great for exact matches on a column, so either text or even, like, the substring matches are okay with a btree You might get an index hit with that, and definitely for primary keys where you’re just looking up an ID or a UUID Those are great For full text search, the ones that we’re interested in are GIN and GIST GIN stands for Generalized Inverted Index, and GIST stands for Generalized Inverted Search Tree There’s some information you’ll never use again GIN lookups are generally about three times faster than GIST, they also take about three times longer to build GIN indexes are moderately slower to update than GIST indexes, and they’re two-to-three times larger on disk What does that mean? Who cares? This is what it means You should use GIN if you don’t have hundreds of thousands of rows in your database You’re not concerned about longer writes blocking the db, you’re adding the index late in the game, and you don’t care about disk space Perhaps because it’s 2016 And you want very fast lookups, so, we’re optimizing for read-heavy If we were building a log manager or something like that, then we would want to optimize for write heavy, and maybe GIN isn’t the right solution So, on the other hand, you should use GIST if you have very large tables, millions, billions of records There’s an order of magnitude in there if you’re between those two, it’s up to you to figure out I work at a consultancy, you can pay me You should use it if you have performance concerns right now, and when I say that, I mean that you currently have performance issues, not that you are concerned that in the future, you will have performance concerns You should use it if, for some reason, disk space is important, it’s 1994 And you should use it if your table is very write-heavy Like I said, log aggravators are a great example of this So, adding those indexes as GIN is pretty straightforward These are the four fields that we’ve been using, and you just say using GIN, Rails knows how to handle that Materialized views are another way that we can improve this performance Materialized views are a tool to pre-populate the results of the views So, it’s gonna run that entire search query that we had, and it’s going to store all of those results into a temporary table So we still pay, say, the 400 milliseconds whenever we’re creating that table, but then we can query against the temporary table, which already has the results in it, and that’s much, much faster So we query against that result set, rather than performing the full query, and it’s another order of magnitude, faster even than the ILIKE was This is without the indexes And the downside of a materialized view is that it’s not always updating, because it is storing into that temporary table You have to tell it when you want to pay that 400 milliseconds to get your update happening, or however long your query takes And you can do that as often as you like, you can do that on a rewrite, with a trigger in SQL, or with an after commit hook, looks like this, or you can do it, maybe, on a timer If your search results don’t always need to be up to date, you could have, you know, the heropi schedule, or do it every 10 minutes, or hour, or day, or whatever So you can do that either with a Postgres trigger, exercise to the reader, or with Rails after commit hook, that looks like this Alright, well, what about some pre-built solutions? There’s a lot of options out there, and I did say we would look at them We’ve got elasticsearch with either tire, or elasticsearch-rails with Chewy, it’s including who knows how much into your models I know, it’s a lot You can use Solr via Sunspot, but holy shit Sphinx with ThinkingSphinx actually does use

a separate file, but still, like, I don’t know what any of this means Why do I have to figure this out? I already know things, I know SQL Alright, so what these services are great at is vacidid search If your search doesn’t look like a single box, it looks like Amazon’s sidebar, then full text search is gonna be a little bit more difficult to work with, I’ll admit Or Postgres’ full text search These other tools do full text search for you as well All of these things have to run on your development machine, they have to run on your production machine, which means they have to be running They’re slowing down on your machine, you have to deal with all of these dependencies, you also have to deal with them every time you’re doing an update to your system If you ever change that version, then you need to make sure that development and produciton are all on the same version, if you’re ever gonna go back, you need to make sure that’s handled Big pain in the ass They all need to be faked in tests, because you don’t want to be actually using these things live in tests In fact, I had a couple of coworkers who were doing, were using Solr, I believe, and a great feature of Solr is that it synchronizes its index across the same submap And because they were both sitting on the work, on their work computers and had the same IP addresses, external IP addresses, their test indexes were being synced between each other, and that was a lot of fun for a week All of these have a lot of cruft in the models I said some, but it’s all And the removing of data concern from your database Database, yeah They all have this arcane syntax, and ultimately, they make me make this face So, by combining materialized view, full text search, and Rails magic, we have a pretty cool search feature that doesn’t require any new dependencies, and it makes me smile Thank you (applause) (upbeat, happy music)