#searchengines

waynerad@diasp.org

Yandex, the Russian search engine, had a source code and data leak. I don't know what else is in the source code, but most of the news about it has to do with the search rankings, which are of interest to "search engine optimization" (SEO) people, who rarely get a glimpse inside any search engine. People are always guessing how search engines work, but it's rare to get those guesses confirmed or falsified, or to discover the search engine is doing something different from what they're guessing. For those of you who might be unfamiliar with "search engine optimization", what it basically means is making search results worse for all the rest of us by trying to 'game' search engines and get 'junk' to show up at the top of the rankings -- whatever junk the SEO person is trying to sell. But never mind that... let's look at the factors Yandex uses to rank search results.

Factors include: PageRank (just like Google), pessimization ("Our interpretation is that when a website is penalized (pessimized), its PageRank is reduced to zero"), clicks & click-through-rate (CTR), overall site performance and reliability, URL construction (URL contains words from the query, URL contains city or country of the user, URL doesn't have too many numbers or trailing slashes, URL has a short Levenshtein distance to the query -- the Levenshtein distance is the minimum number of single-character edits required to change one string into another), whether the page has a single product or multiple products, such as a product category page (there is an algorithm called DSSM that determines if a webpage has one product or multiple products listed on it), host quality (if the page is on a site with lots of low quality pages, its quality score gets lowered), how long the page has existed, whether it's linked to from TikTok, what its Yandex.Metrica impact ranking is, age of links, relevancy of text in titles, the Inverse Document Frequency of the search term in paragraphs, the BM25 algorithm ranking, and the time of day and day of week the user did the search. There are a few additional factors for medical, financial, and legal topics. Wikipedia and Vkontakte have additional rules that apply only to them. If you're not familiar with Vkontakte, it's a Russian social network, similar to Facebook.
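To make the Levenshtein distance concrete, here is a minimal sketch of the textbook dynamic-programming version. It is not code from the leak, and I don't know how Yandex actually compares a URL with a query; the function name `levenshtein` and the example strings are just my own illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    # Keep the shorter string in the inner loop so the rows stay small.
    if len(a) < len(b):
        a, b = b, a

    # previous[j] = distance between the empty prefix of a and b[:j]
    previous = list(range(len(b) + 1))

    for i, ch_a in enumerate(a, start=1):
        # Turning a[:i] into the empty string takes i deletions.
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current

    return previous[-1]


if __name__ == "__main__":
    # Hypothetical query vs. URL slug: one substitution (space -> hyphen).
    print(levenshtein("cheap flights", "cheap-flights"))
```

A smaller number means the two strings are closer, so a URL whose text is only a few edits away from the query would count as a better match under this factor.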

Yandex.Metrica is an analytics service analogous to Google Analytics. It uses a variety of statistics about user visits to a web page (number of visits, average time spent, what type of audience the page draws, etc.) to form an impact score.

I noticed the BM25 algorithm because I was just reading about an idea for combining semantic search (AI-powered search, also called vector search) with traditional keyword searching, and it talked about the BM25 algorithm (link below). The BM25 algorithm (BM stands for "best matching") was the algorithm of the traditional search engine, while the semantic/vector search used cosine similarity, and the challenge was to combine them into a single ranking. The BM25 algorithm incorporates the Inverse Document Frequency (IDF) measure also mentioned here. Strictly speaking, IDF by itself only measures how rare a term is across the whole collection of documents; BM25 combines it with how frequently the search terms occur in a given document, so a page scores highest when the terms occur often in it and rarely in other documents.
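For anyone curious what BM25 looks like in practice, here is a small sketch of the standard formula. To be clear about what is assumed: the whitespace tokenizer, the values k1=1.5 and b=0.75 (common defaults), the smoothed form of IDF, and the function name `bm25_scores` are all my choices for illustration. None of this comes from the Yandex code.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one BM25 score per document for the given query.

    k1 controls how quickly repeated terms stop adding to the score;
    b controls how strongly long documents are penalized.
    Both defaults are assumptions, not values from any search engine.
    """
    if not docs:
        return []

    # Naive tokenizer (an assumption): lowercase, split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs or 1.0

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for d in tokenized:
        df.update(set(d))

    scores = []
    for d in tokenized:
        tf = Counter(d)  # term frequency within this document
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # IDF part: rare terms across the collection weigh more.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Term-frequency part, normalized by document length.
            length_norm = 1 - b + b * len(d) / avg_len
            tf_part = tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
            score += idf * tf_part
        scores.append(score)
    return scores


if __name__ == "__main__":
    docs = [
        "the cat sat on the mat",
        "dogs chase the cat",
        "stock markets fell today",
    ]
    for doc, score in zip(docs, bm25_scores("cat mat", docs)):
        print(f"{score:.3f}  {doc}")
```

The first document contains both query terms, so it ranks first; the second contains only "cat", which is also the more common term in this tiny collection; the third matches nothing and scores zero.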

Yandex data leak: Initial findings & SEO learnings (the 1,922)

#solidstatelife #searchengines #cybersecurity