So, I read this article on LinkedIn today that offered a pretty good perspective on the technical workings of a search engine. In an effort to digest what was a very substantial post, I want to try and put it in my own words here. As the author notes, it takes an engineer with a fairly solid resume to understand the finest details of creating and maintaining an accurate search engine. However, the basic schematic of what happens after you enter your query and hit “enter” is not only fairly straightforward, but is something which is useful to understand.
First, you type in a query and hit enter, or click on the search button. The search engine then looks up your input in something called an inverted index, which maps each word to the webpages that contain it.
When a search engine “crawls” your website, it adds your webpage text to a database, which is then used to build this inverted index, so that your page is found whenever someone queries any word on it, with some exceptions. “Stop words,” so called because they do not inform the search engine as to the intent of the query, are dropped before this data is collected. Stop words include words such as “the” and “to.” So, when you type in a query, “the best place to eat sushi” will probably be truncated to “best place eat sushi” before it is looked up in the inverted index, so both versions of the search yield similar results. Why? Because “the” and “to” don’t provide additional information to the search engine regarding what the searcher is looking for. Looking up the matching pages in the inverted index is often referred to as the “retrieval” phase of a search. Since the number of matches for any query quickly becomes unmanageable across a data set as large as Google’s, scoring and ranking become the main thing that differentiates major search engines.
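To make the idea concrete, here is a minimal sketch in Python of an inverted index and the retrieval step. It assumes the index is built when pages are crawled and only looked up at query time. The stop-word list, the page names, and the function names (tokenize, build_inverted_index, retrieve) are all made up for illustration; no real search engine is this simple.

```python
from collections import defaultdict

# Assumed, tiny stop-word list for illustration only.
STOP_WORDS = {"the", "to", "a", "an", "of", "in"}

def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    words = (w.strip(".,!?\"'") for w in text.lower().split())
    return [w for w in words if w and w not in STOP_WORDS]

def build_inverted_index(pages):
    """Map each word to the set of page ids containing it (crawl time)."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in tokenize(text):
            index[word].add(page_id)
    return index

def retrieve(index, query):
    """Return every page that contains at least one query word (query time)."""
    matches = set()
    for word in tokenize(query):
        matches |= index.get(word, set())
    return matches

# Hypothetical crawled pages.
pages = {
    "sushi-bar": "The best sushi place in town",
    "burger-joint": "A great place to eat burgers",
    "hardware-store": "Hammers and nails",
}
index = build_inverted_index(pages)

print(tokenize("the best place to eat sushi"))
print(sorted(retrieve(index, "the best place to eat sushi")))
```

Notice that the burger joint comes back as a match even though it has nothing to do with sushi, simply because its page contains “place” and “eat.” Retrieval alone casts a very wide net.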
If you’ve followed any search engine optimization blog for any length of time, you have probably seen some discussion of search engine scoring methods. After the retrieval step, the search engine does not yet have anything very useful: in this example, it has a list of every webpage that mentions any combination of “best,” “place,” “eat,” and “sushi.” You can be fairly certain that the list includes nearly every restaurant, whether or not it serves sushi, because almost all restaurant home pages are going to have some variant of “eat” in their copy; that’s what you do at a restaurant. Thus, scoring becomes necessary, and a variety of signals are used to sort through the matches from the inverted index in search of the pages closest to what you are looking for. How each search engine does this is proprietary, but it involves a blend of global signals (such as a website’s popularity), query signals (how well a webpage matches your query), and personalized signals (such as the searcher’s location).
Once scoring is complete, the search engine simply lists the matching pages for the searcher, sorted from best score to worst score.
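The scoring and sorting steps can be sketched the same way. Since the real formulas are proprietary, everything below is an assumption made for illustration: the three signals are reduced to single numbers, the weights are invented, and the names (Candidate, score, rank) are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    url: str
    popularity: float   # global signal, assumed to range from 0 to 1
    query_match: float  # query signal, assumed to range from 0 to 1
    distance_km: float  # personalized signal: distance from the searcher

# Invented weights; a real engine's blend is proprietary and far more complex.
WEIGHTS = {"popularity": 0.3, "query_match": 0.5, "proximity": 0.2}

def score(c):
    """Blend the three signals into one number; higher is better."""
    proximity = 1.0 / (1.0 + c.distance_km)  # closer pages score higher
    return (WEIGHTS["popularity"] * c.popularity
            + WEIGHTS["query_match"] * c.query_match
            + WEIGHTS["proximity"] * proximity)

def rank(candidates):
    """Sort from best score to worst score."""
    return sorted(candidates, key=score, reverse=True)

# Hypothetical matches left over from the retrieval step.
candidates = [
    Candidate("burger-joint.example", popularity=0.9, query_match=0.5, distance_km=1.0),
    Candidate("sushi-bar.example", popularity=0.6, query_match=1.0, distance_km=3.0),
]

for c in rank(candidates):
    print(c.url, round(score(c), 3))
```

With these made-up numbers, the sushi bar outranks the more popular and closer burger joint because it matches the query much better, which is the kind of trade-off a blend of signals is meant to capture.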