Big Data Pipelines and Use Cases at StumbleUpon – SF Data Mining Meetup Talk

So, StumbleUpon: what is the problem that we are trying to solve? I don't know how many of you are familiar with StumbleUpon or have used StumbleUpon, but StumbleUpon is quite an old startup, in the sense that it was founded 11 years ago to solve a problem that no one was solving, and this problem was serendipity search. After 11 years, still no one is solving the problem of serendipity search. The big players are not doing it: not Google, not Bing, not Yahoo. Why? Because it is an extremely difficult problem.

What serendipity search is about is basically this: the user comes to StumbleUpon without a clear information need. It is not that you come and say, "I want to find the address of Trulia in San Francisco." In search (and I worked for many years at Yahoo) it is clear in most cases what the user wants, it is clear how to set up relevance measurements, and it is clear how to say, okay, the user was successful or the user was not successful in fulfilling his information need. With serendipity search this does not happen. The user arrives and wants to find information, but in a fashion that is more similar to being entertained. So we still need to present the user with information that is relevant to the user, but we do not have any clue about what is really relevant to the user. We have broad interest categories, but within an interest you can really browse a large number of different facets, different aspects.

The other big problem of discovery, of serendipity search, is that we present one result at a time. So either all is good and the user likes it, or it is not good; there is no other way. In search the effect is not so direct: the user has at least 10 results to choose from, so even if the right result is not the first result but the third, he can still be happy, he can still find what he was looking for.

Serendipity search never repeats. You never see (well, there are bugs, but in theory you never see) the same page twice. This means that we need to carefully select pages and do near-duplicate detection in
order not to show you pages that are very, very similar. If you think about news, in which you usually have an article that is repeated across many different websites, this is extremely difficult.

The other important part of serendipity search is that it needs to adapt. If you thumb down a result, a page, I need to be sure that the next time you search in that particular interest I am not showing you similar results that you may consider not relevant. In regular search this does not happen. The proof of this: we know that search gives static results, and in many cases people use search (I think it was around ten to fifteen percent of the queries) as a bookmarking tool. They do not save the address of Trulia anywhere; they just search again, and this is a kind of bookmarking.

And obviously discovery, serendipity search in this case, needs to be personal, needs to be very tailored to you. So we need to provide different results if you are young, if you are old, if you are female or male, and some results that are targeted at people located in San Francisco may not be so good for people located on the East Coast, and so on. So, as I told you, I do not think that there is currently any big player (let's say Google, Bing, or Yahoo, but you can correct me) that has tried successfully to solve the serendipity search problem.

How does StumbleUpon work? You come to the site and you register with email and password or with your Facebook account. The second important step: you need to select interests, and it is a daunting step because we have 523 interests. That's it. You start stumbling, one page at a time; you just hit the button and continue. What we hope is to take a big portion of the time that users spend on the internet for, I would say, non-work-related activity, and that is very difficult. When StumbleUpon started, Facebook, Twitter, Pinterest, and Reddit were not there. Now we are competing with the time that you spend looking at pictures of your friends, or tweeting about your personal life, or commenting on news in another similar network. So it becomes more and more difficult every day. But we still have users who have been with StumbleUpon since day one. They keep stumbling, they are loyal, they come almost every day, and they keep enjoying the product.

How do we do what we do? This is a really broad overview. We ingest content, and we have two ways to do it: one is user discovery (if you have a favorite page, you can submit it), the other is crawling a subset of trusted sources, which obviously changes over time. Then we have a number of filters. For some, the goal is to select good content; others are just business rules, or try to understand whether we have adult content. Then there is a second phase, which is what we call sampling. The sampling phase tries to solve the same problem that all recommendation systems have, the cold start problem: I do not have any information about a page and I need to understand how good it is. So I basically keep it in this limbo for a minimum number of
stumbles or a fixed period of time, and we basically monitor how this page behaves. If the page passes the threshold of minimum quality, it becomes one of the pages of our index, and we start recommending it according to our recommendation methodologies.

Obviously, at the center of everything is the user, and the user tells us explicitly what he likes: the set of his interests. Sometimes we see that he has a favorite kind of content, news or trending content versus, I don't know, evergreen content. You can also choose a particular keyword, for example, and this becomes a kind of channel. We actually have channels, though we are going to discontinue them, but we can see that there are some sources, some channels, that the user prefers.
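The sampling phase described above (a new page waits in limbo for a minimum number of stumbles or a fixed period, then enters the index only if it passes a minimum quality threshold) could be sketched roughly as follows. This is only an illustration: the 24-hour window is the figure mentioned later in the talk, while the other thresholds, the `PageStats` fields, and the use of the thumbs-up share as the quality measure are assumptions made for the example, not StumbleUpon's actual rules.

```python
from dataclasses import dataclass

# Hypothetical thresholds, chosen only to make the example concrete.
MIN_STUMBLES = 50     # exposures after which a page can leave limbo early
MAX_HOURS = 24.0      # fixed sampling window (24 hours is quoted in the talk)
MIN_RATINGS = 5       # fewer ratings than this means not enough evidence
MIN_QUALITY = 0.6     # minimum share of thumbs-up among all ratings


@dataclass
class PageStats:
    stumbles: int          # times the page was shown while in limbo
    thumbs_up: int
    thumbs_down: int
    hours_in_limbo: float


def sampling_decision(page: PageStats) -> str:
    """Return 'keep_sampling', 'promote', or 'discard' for a page in limbo."""
    sampling_done = (page.stumbles >= MIN_STUMBLES
                     or page.hours_in_limbo >= MAX_HOURS)
    if not sampling_done:
        return "keep_sampling"
    ratings = page.thumbs_up + page.thumbs_down
    if ratings < MIN_RATINGS:
        # Not enough evidence was collected during the limbo period.
        return "discard"
    quality = page.thumbs_up / ratings
    return "promote" if quality >= MIN_QUALITY else "discard"
```

Note that a page shown to the wrong users can be discarded for lack of ratings even if it is good, which is the weakness the dwell-time project mentioned near the end of the talk is meant to address.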

But we also suggest results based on what friends do. We have a following mechanism similar to Twitter: you can follow and be followed by friends, and obviously suggesting what friends like is one of the most common things to do. Finally, we also have other implicit signals that we leverage. These are like-minded users, that is, users that have a similar interest profile (the same selection of interests, or similar age, similar gender), and also experts. We have a methodology, an algorithm if you want, that allows us to select the top users within each interest. We know that these users usually submit, discover, content that is high quality, and we leverage these users for suggestion purposes.

What is the business model? Obviously, sponsored stumbles. Why is this model successful (and it is quite a successful model, I have to say)? The business model is called native advertising. What is native advertising, and why is it different from other forms of advertising? All the other forms of advertising, if you think about TV or display, basically interrupt the user experience. What do I mean by this? You are watching a movie and you are stopped to see a particular ad; this breaks your experience of watching a movie. Display advertising is the same: you are reading news, maybe politics, and you get a display ad for a car that in most cases does not fit with what you are reading, so you need to switch from what you are doing to watch and click on the ad. This usually does not happen with native advertising. The way it is done, an ad should be just another stumble. In the ideal case the user should not realize that it is not part of the organic experience; the only difference is that we have a "sponsored" mark at the top of the page. We run ads that are tailored in such a way that users like them: they thumb them up, they share ads. For the advertiser this is particularly successful because, in the moment that a user
likes or shares this kind of content, it becomes organic. What I mean is that it is injected into our index and is served like the rest of the organic content, so advertisers can actually reach a large number of users for which they do not pay. Obviously, it is not always easy to give a user experience that feels organic, especially because we have two different types of campaign. One is managed, and for the managed campaigns we can do it; one is unmanaged, so advertisers can submit whatever they want, but we reserve the right to show this content or not. So an ad is also in limbo for a while: if we start collecting a lot of negative feedback, we can decide to stop showing it, because what we care about the most is still the user experience. As you can see here, for example, the way it is done is that they target a segment, a demographic segment in most cases. This one was shown to me probably because I am a woman in a particular segment in San Francisco. But as you can see, what appears here is an article, "mistakes that you make when you take care of your skin", so it seems more like health content than selling you a product, but of course they want to sell me some products.

As I told you, this is an old startup, and it is a startup in the sense that it has 85 employees, but we have a user base of 35 million registered users, a hundred thousand advertisers, and an index of a hundred million pages. As a matter of fact, StumbleUpon is the fourth social media traffic generator; we actually generate more traffic than LinkedIn and Google Plus. A share of our stumbles is on mobile platforms, and so this is another challenge. And again, the number of employees is 85. If we consider the revenue per employee, it is higher than Google's, and that is something to be proud of.

What is the data science role in this? We talked about 85 total employees in two offices, New York and San Francisco. Engineering is in total, I think, 30 to 35, and data science, including me, is now three people plus an analytics engineer who is in charge of the data pipeline. Three people who need to support data-driven (bad word) innovation initiatives for all StumbleUpon products, and when I say StumbleUpon products I mean mail, mobile, site, personalization, and revenue. And we are actually able to do it with, I would say, good results.

So I am going to present now a number of projects we have done. I am not going into the technical details, but if you have questions, of course you can ask. I thought that for a talk at 7pm you really do not want to be bothered with them, but you can contact me. The first two use cases are examples of how interests can be exploited and why they are so important, the third example is an optimization that we did for the revenue team, and then I will briefly speak about what else we are trying to do. So, as I told you, interests are very important, for example from an infographic point of view. We can actually
see and understand how users are using our product just by looking at what they rate. We see, for example, that teenagers prefer nature-related topics: animals, pets, exotic animals. But surprisingly they also top-rate interests like mathematics or writing. So do teenagers really like mathematics? I do not think so. I think that (especially when you see big numbers) they are using the site to bookmark resources, probably for homework. This is different from saying that they are using StumbleUpon just to be entertained. Entertainment is still the most important aspect of StumbleUpon, but by looking at interests we understand differences. If we look at women, for example, young women for sure rate, I don't know, parenting, babies, kids more than men do, and that is not surprising. But they also rate, for example, computer-programming-related content, and I do not think that this is, again, for entertainment; they are probably bookmarking resources that they find useful for work.

So, the problem of gender prediction. Reading the literature recently, I found this 2013 paper that analyzes how the way people disclose personal information on the web has been changing since 2005. They notice that there is a shift: people are more concerned now than in 2005 about disclosing personal information. This is from Facebook: they monitor how the quantity of information that people give (high school, hometown, birthday, interests, and so on) changes. Blue is less dense, red is a lot more information, and this is the heat map of how it goes over the last few years.

Another paper that I read recently, from 2010, gives some numbers: thirty percent of users do not disclose their age on Facebook, twenty percent their gender, and ninety-five percent their political affiliation. StumbleUpon is not different from the rest of the world, but our numbers are a little lower. Ten percent of StumbleUpon users do not declare their gender at registration time. Age is mandatory (you need to be, I think, 13 years old to create an account on StumbleUpon, because there is adult content), but 0.3 percent of our users are more than a hundred years old and 0.4 percent are in the 80-to-100 interval. We are still speaking about hundreds of thousands of users. What you can obviously infer is that you should not trust age in general, so I do not know why we should trust gender either, but let's say that we trust age and gender.

Obviously this is a problem, right? We do not have so many ingredients for a recommendation: we have demographics plus interests. If we cannot trust demographics, we are left with interests, and that is not enough. It is also a problem for targeting: advertisers basically come to StumbleUpon and say, "I want to target women who are in the segment 34 to 46 and who live in San Francisco." If we do not know the gender and we
do not know the age, it is kind of difficult to do the appropriate behavioral targeting for ads. So we need to understand who these ten percent of users are, to increase retention and to increase revenue.

Surprisingly, I found a lot of related work, even recent related work, that tries to solve the problem of gender prediction. There are some methods that use just link-based information, for example in Facebook, and infer gender just from the social network. There is some work that tries to infer gender by looking at textual features: if you write a tweet, or if you write a comment or a review for a movie, I can do some processing on the language that you are using and I can tell if you are a man or a woman. A recent paper from the guys at Technicolor is about inferring user gender based on ratings; this was at RecSys 2012. They say that just by observing the ratings on movies, data that we know is extremely sparse, we can say if you are a woman or a man with good results on Flixster: eighty-four percent using logistic regression and eighty-six using SVM. Pretty good results.

So we said, okay, we have interests. Our model is extremely simple. We just look at the basket of interests: one if the user selected the interest, zero otherwise, and straight random forest. We basically train this model every night on 0.1 percent of our users, and initially we tested it on Facebook-connected users for which we did not have gender information. The AUC is 88, way higher than what is reported in the literature for the same problem, just using interests. AUC is what we use when we speak about machine learning and data mining. Let's go back to the company and try to explain to the execs why this model is good. This model is good because of this: if we can target the ten percent of users that we could not target before, the estimated increase in revenue is ten percent, just with this simple classification.

So interests are really powerful. We use them for recommendation, we use them for understanding users, so we need to give users the possibility to really select the best interests for them, and that is not always easy. I am not saying anything new when I state that content that works on the web usually does not work on a mobile device, unless there are pages that are highly optimized for the mobile experience. What about interest selection on mobile? We have 523 interests. Suppose that you are creating an account on mobile (and users do create accounts using the apps): try to browse 523 interests on your phone. Obviously there is a big bias: we get a lot of interests starting with "A", but are these usually the interests that best characterize that particular user? So we said, okay, we saw that interests are powerful: given the interests, we can understand the user. What if, given the user, we try to present
them with the top interests for that user? What do we know at the moment that the user is creating an account? We know the gender, hopefully, if you told us; the age; the device; the email; and that's it. And what is the metric that I want to optimize at this phase of recommendation? I want to optimize what we call activation. Activation tells us whether a user who created an account is becoming a real user, a user who will come back in seven days, in 14 days, in 30 days, and as a result possibly a loyal user. We have a good model to predict whether a user is activated; there is a paper presented at WWW 2013 if you are interested in seeing our activation metrics. So, given a user and given this information, we can create a data set that says this user is going to be activated or is not.

Big problem: we do not save interests. We do not have in our data the interests that the user saved at registration; we only have the interests at the current moment. And obviously, as part of the recommendation, we present the user with "you like mathematics, do you want to browse physics too?" So we really do not know what the interests were at registration time, but we can consider the interests in the very first session of the user and try to use them to understand which interests behave better on mobile. We sampled the first sessions over a period of three months, so the first session for each new user, and we tried to see whether we could apply any classification framework, any model. We tried a lot of out-of-the-box methodologies; none of them worked. A coin flip was the best.

So what we did was something a little bit different: we framed the problem as an association rule problem. I think most of you are familiar with it; it is quite widely used. We said, okay, let's extract all the rules that say "interest humor, activated equals one", "some other interest, activated equals zero", and so on. We fixed the minimum support to fifteen percent, the minimum confidence to seventy percent, and the minimum lift to 1.1. We also pruned all the rules with the same left side, which are not indicative. As a result of these first two steps we were left with 80 interests, which is about fifteen percent of the total interests that we have.

One natural question would be: maybe this is because users actually select only these 80 interests on mobile? So we did a comparison, and we see that the cumulative distribution of interests stumbled on mobile and on web (toolbar, website) is mostly the same. So there is no such bias that only 80 interests are actually used on mobile.

We did a second step and said, okay, at this point we will use the demographic information, and we just ranked these interests by popularity, so thumbs up and thumbs down. The results are interesting. As is standard, we are going to A/B test this; so far we could not.
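To make the rule-mining step above more concrete, here is a minimal sketch that scores single-interest rules of the form "interest implies activated" by support, confidence, and lift, and keeps those above the thresholds quoted in the talk (reading the lift threshold as the ratio 1.1). The function name `mine_activation_rules`, the session format, and the toy data are assumptions made for illustration; the real pipeline, including the pruning of non-indicative rules and the rules pointing to non-activation, is not shown.

```python
from collections import Counter


def mine_activation_rules(sessions, min_support=0.15,
                          min_confidence=0.70, min_lift=1.1):
    """Score rules {interest} -> activated from first-session data.

    `sessions` is a list of (interests, activated) pairs, one per new user,
    where `interests` is an iterable of interest names and `activated` is a bool.
    """
    n = len(sessions)
    if n == 0:
        return {}
    base_rate = sum(1 for _, activated in sessions if activated) / n
    if base_rate == 0:
        return {}

    seen = Counter()            # sessions containing the interest
    seen_activated = Counter()  # ...in which the user was also activated
    for interests, activated in sessions:
        for interest in set(interests):
            seen[interest] += 1
            if activated:
                seen_activated[interest] += 1

    rules = {}
    for interest, count in seen.items():
        support = seen_activated[interest] / n         # P(interest and activated)
        confidence = seen_activated[interest] / count  # P(activated | interest)
        lift = confidence / base_rate                  # confidence vs. base rate
        if (support >= min_support and confidence >= min_confidence
                and lift >= min_lift):
            rules[interest] = {"support": support,
                               "confidence": confidence,
                               "lift": lift}
    return rules


# Toy first sessions (made-up data): 10 new users, 5 of them activated.
sessions = [
    ({"humor", "food"}, True), ({"humor"}, True), ({"humor", "math"}, True),
    ({"humor"}, True), ({"math"}, False), ({"math", "food"}, False),
    ({"food"}, True), ({"math"}, False), ({"humor", "food"}, False),
    ({"food"}, False),
]
rules = mine_activation_rules(sessions)
print(sorted(rules))  # only "humor" passes all three thresholds
```

In this toy data "food" fails the confidence threshold and "math" fails the support threshold, which mirrors how the thresholds cut 523 interests down to a short list.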
Because our mobile system was not prepared to dynamically select interests (the interface was static), we could not change the order, and we had to wait for the front end to make the necessary modifications.

So this is a comparison for one segment on two different devices, one Android tablet and the other iPad, for men and women under 24 years old. Can you tell which column is a man on Android? I could, without looking, without knowing. This is because I really think that some of the things that we say about the difference between an Android user and an iPad user are actually true. As a matter of fact, you can surely say which two columns are the women and which two columns are the men; it is so evident. These are under 24 years old: food and cooking, food and cooking. No way that a male college student is going to browse cooking recipes. And there is drawing and drawing: only women. And the Android guy here, the Android man? Come on, hacking: it is the Android one. Sorry for that.

What is actually interesting is the thing that I noticed here. I told you that among the top-rated interests on the web I found that teenagers in general also rate mathematics, physics, or something like writing. None of these interests is here, because obviously you are not going to do your homework on a platform like this. So if you are a user who is signing up with a mobile device, your purpose is mostly entertainment. This is a totally different experience, a totally different way of using StumbleUpon than when not on mobile.

Similarly, this is the top comparison for the over-46-year-olds. Now food and cooking is all over, so everyone eats after 46. But you will see, for example, that there is health; health is something that everyone cares about. There is no music anymore, while music was in the other slide. But men are worried about health, plain health, whereas women want to browse alternative health: they really do not have any health problem, they just want to be alternative. And iPad users actually want to travel; Android guys, it seems, do not. So there are differences that the device itself, Android versus iPad, puts in evidence, and when you think about them they are not so hard to understand, right?

So this is the last thing that I want to talk about in some detail, and after that I have just an overview of other problems we are trying to solve; that will be very fast. As I told you, we have two different types of campaigns: one is managed and the other is unmanaged. Prices are static: we do not have bids, we do not have any dynamic pricing. It works so well that we have not needed any form of bid optimization so far. But we have a payment threshold: especially for the managed campaigns, if the user does not stay at least four seconds, we are not
going to get any money. And this seems like a kind of opportunity. I mean, if we have campaigns for which we get money in any case, and campaigns for which we do not get money if the user does not stay for at least four seconds, maybe we should try to optimize. What we did was a very simple statistical analysis: we simply compute the probability, the tendency, of the user to skip. What we noticed is that there are users who naturally skip. Especially on mobile, they skip, skip, skip; they read just the title. It is not about whether the page is good or bad; this is just the way they use the device. So if we want to send ads to them, we send the ones that do not have a threshold for payment. The other thing that we did: we compute the expected amount of money to be made by serving a particular ad, with a particular threshold, to the user-device pair. Then we rank these. This basically happens every night, and we have this optimal selection of ads for each user based on their tendency to skip. As I told you, obviously this model works when we have a lot of ads, for example in months like December, in which we have the peak of advertisers and all of them want to serve a lot of ads. The skip rate improved by seven percent, and this meant, in the top week of December, a net gain of seven thousand dollars per day, just by understanding the tendency of the user to skip.

That was the last use case that I wanted to talk about. We are doing other cool stuff. We are doing a lot for being three people; I think that we really have a throughput that is amazing. I have to say, I worked at Yahoo for five years, and the constant problem as a scientist (there were no data scientists at that moment; we were applied scientists, and at some point I became a data scientist) was, I would say, this limitation: we were never as fast as engineering. Engineering was ready to deploy the next model and we still needed time. This is the first time that I do not have this problem. Our throughput is higher than engineering's, so they actually have the ability to say, "I am deploying this, I am not going to give you any other support; you have two other models, two other algorithms, to deploy."

What are we doing? Some ongoing work. In the first project we are trying to use dwell time to predict ratings. We are doing something very similar to this paper from WSDM 2014. We are segmenting our items on the basis of interest and user behavior, and we are basically using a probabilistic model to understand how dwell time is distributed with respect to ratings. Currently the baseline, with a random forest, has an AUC of 75.7, but it is our baseline. This is truly an important project; it is going to help a lot with the cold start problem. As I told you, we have a limbo state in which pages stay for 24 hours or a minimum number of stumbles, but if we do not reach enough ratings in
that limbo period, the page is discarded. But what if, since we did not know anything about the page, we just presented it to the wrong users? We do not have time to collect enough evidence, even if the page may be good. This will help a lot, and it will also help to recommend. Again, this is a recommendation engine, "I like, I don't like", and all recommendation engines have exactly the same problem: ratings are sparse. A model that exploits dwell time is not a sparse model anymore; it is a model for which you have one vote for each item in your index.

The other project that we are doing is in collaboration with Professor Saran and Professor Dongle from the University of Santa Clara. We want to predict user churn. This is a web product: users get bored and they leave, so we need to counteract that, and predicting user churn is the most natural thing to do. We are using the model from this paper, "Counting Your Customers One by One". It is basically a consumer behavior model, based on the hypotheses of Poisson purchases and exponential lifetime. We have a baseline that is pretty strong, and we are adding covariates, which means demographics or other information about the user. The idea is that we will have a kind of rate for each user that tells us this user is going to stay loyal, this user is going to leave soon. Obviously this will be used for notification purposes, for example if I have to select a number of users to whom I want to send an email saying, "Look, we have extremely good content for you."

Also, we have done, I would say, no work until this moment on our social network: we have not exploited leaders, we do not suggest friends. So we are starting to work on this. And this is not all, but I covered a huge part of what data science means at StumbleUpon. Thank you.
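As a worked illustration of the churn score mentioned above, the two hypotheses named in the talk (Poisson visits while the user is active, exponentially distributed lifetime) already determine the probability that a user is still active, if that user's rates are treated as known. For a user with visit rate \lambda, dropout rate \mu, last visit at time t_x, and current time T, comparing the likelihood of "still active at T" with "dropped out somewhere between t_x and T" gives P(alive) = 1 / (1 + \mu/(\lambda+\mu) * (exp((\lambda+\mu)(T - t_x)) - 1)). The sketch below computes just this quantity; the function and parameter names are assumptions, and the full model in the cited paper additionally estimates the per-user rates with priors and covariates, which is not reproduced here.

```python
import math


def p_alive(visit_rate: float, dropout_rate: float,
            last_visit: float, now: float) -> float:
    """Probability that a user is still active at time `now`.

    Assumes Poisson visits at `visit_rate` while active and an exponential
    lifetime with rate `dropout_rate`. Times are measured from registration,
    in the same unit as the rates (for example days and visits per day).
    """
    if not 0 <= last_visit <= now:
        raise ValueError("need 0 <= last_visit <= now")
    total = visit_rate + dropout_rate
    if visit_rate < 0 or dropout_rate < 0 or total == 0:
        raise ValueError("rates must be non-negative and not both zero")
    silence = now - last_visit  # time elapsed since the last observed visit
    return 1.0 / (1.0 + (dropout_rate / total) * math.expm1(total * silence))


# A user who visits about every other day, last seen 2 vs. 10 days ago.
recent = p_alive(visit_rate=0.5, dropout_rate=0.1, last_visit=10.0, now=12.0)
stale = p_alive(visit_rate=0.5, dropout_rate=0.1, last_visit=10.0, now=20.0)
print(recent > stale)  # a longer silence lowers the probability of being alive
```

Ranking users by a score like this is one way to pick who receives a notification email: users whose probability has started to drop but is not yet close to zero are the natural candidates.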