#searchengines

waynerad@diasp.org

Exa Websets purports to turn the whole internet into a searchable database.

"All AI startups building new LLMs chips that are post series A."

"All PhDs who have worked on developer products and graduated from a top university and have a blog."

"Obviously traditional search tools can't do these things. You don't even think to ask them that because they weren't built to be a database."

"So how do we do it? Well, we built the first web-scale embeddings-based search engine. Essentially, we trained an AI system to organize the whole web by meaning."

They claim "Exa's system knows when to use more compute to agentically research and verify each result. That means Exa Websets might take a long time to complete."

But it's not available now. You can join the waitlist. If this works as advertised, it'll be amazing.

Introducing Websets: A breakthrough toward perfect web search

#solidstatelife #ai #genai #embedding #searchengines

psych@diasp.org

Scary timing...
I was speaking only hours ago about this, and about the legacy/bastardization of "search engines" (esp. Google). And poof! This appeared as if by magic.

Google Search: Ten alternative search engines to Google’s offering

FWIW, in revisiting my turn-of-the-century search tools directory, I found many long gone, but I retried Bing, with surprisingly good results. (I hadn't tried it since checking it out at launch time and then returning to Netscape/Firefox and the then-not-evil #Google.)

#GoogleISevil #SearchTools #SearchEngines ++ My museum/archives of the heyday of search engines (IMM) ; http://www.fenichel.com/search

joseph_teller@diaspora.glasswings.com

Google Has Been Lying About Their Search Results

A recent Google API documentation leak revealed that Google has been lying about how it ranks search results. The leak shows that Google does sandbox new sites, does track clicks and time spent on sites, and still uses site authority to rank websites. It also uses Chrome and Chrome-based browsers to feed it information that affects results.

About Google Search

#SearchEngines #ChromeBrowsers #Google #GoogleSearch #API #SearchMetrics

waynerad@diasp.org

Yandex, the Russian search engine, had a source code and data leak. I don't know what else is in the source code, but most of the news about it has to do with the search rankings, which is of interest to "search engine optimization" (SEO) people, who rarely get a glimpse inside any search engine. People are always guessing how search engines work, but it's rare to get those guesses confirmed or falsified, or to discover the search engine is doing something different from what they're guessing. For those of you who might be unfamiliar with "search engine optimization", what it basically means is making search results worse for all the rest of us by trying to 'game' search engines and get 'junk' to show up at the top of the rankings -- whatever junk the SEO person is trying to sell. But never mind that... let's look at the factors Yandex uses to rank search results.

Factors include: PageRank (just like Google), pessimization ("Our interpretation is that when a website is penalized (pessimized), its PageRank is reduced to zero"), clicks & click-through-rate (CTR), overall site performance and reliability, URL construction (URL contains words from the query, URL contains city or country of the user, URL doesn't have too many numbers or trailing slashes, URL has a short Levenshtein distance to the query -- the Levenshtein distance is the minimum number of single-character edits required to change one string into another), whether the page has a single product or multiple products (such as a product category page) (there is an algorithm called DSSM that determines if a webpage has one product or multiple products listed on it), host quality (if the page is on a site with lots of low-quality pages, its quality score gets lowered), how long the page has existed, whether it's linked to from TikTok, what its Yandex.Metrica impact ranking is, age of links, relevancy of text in titles, the Inverse Document Frequency of the search term in paragraphs, the BM25 algorithm ranking, and the time of day and day of week the user did the search. There are a few additional factors for medical, financial, and legal topics. Wikipedia and Vkontakte have additional rules that apply only to them. If you're not familiar with Vkontakte, it's a Russian social network, similar to Facebook.
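To make the Levenshtein distance mentioned above concrete, here is a minimal sketch of the standard dynamic-programming calculation. The function name `levenshtein` is my own, and how Yandex actually applies the distance to URLs and queries (tokenization, normalization, thresholds) is not something I know from the leak coverage, so treat this purely as an illustration of the definition.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    # Keep the shorter string in the inner loop so the rows stay small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] = distance between the empty prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb (or match)
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

A short distance means the two strings are nearly the same, which is presumably why a URL that closely resembles the query would be treated as a positive signal.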

Yandex.Metrica is an analytics service analogous to Google Analytics. It uses a variety of statistics about user visits to a web page (number of visits, average time spent, what type of audience the page draws, etc.) to form an impact score.

I noticed the BM25 algorithm because I had just been reading about an idea for combining semantic search (AI-powered search, also called vector search) with traditional keyword searching, and it talked about the BM25 algorithm (link below). The BM25 algorithm (BM stands for "best matching") was the algorithm of the traditional search engine, while the semantic/vector search used cosine similarity, and the challenge was to combine them into a single ranking. The BM25 algorithm incorporates the Inverse Document Frequency (IDF), also mentioned here. BM25 not only quantifies how frequently the search terms occur in a given document (the term frequency); through the IDF, it also factors in how rarely those terms occur in other documents.
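Here is a rough sketch of how BM25 scoring with IDF works, using the commonly published textbook form of the formula. The function name `bm25_scores`, the whitespace tokenizer, and the parameter values `k1=1.5` and `b=0.75` are my own assumptions for illustration; nothing here is taken from Yandex's code, and real engines use far more careful tokenization.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document in docs (a non-empty list of strings) against query."""
    # Naive tokenization: lowercase and split on whitespace (an assumption).
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n  # average document length

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for d in tokenized:
        df.update(set(d))

    scores = []
    for d in tokenized:
        tf = Counter(d)  # term frequency within this document
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # IDF: terms that appear in few documents get a larger weight.
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Term frequency, saturated by k1 and normalized by length via b.
            norm = tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            )
            score += idf * norm
        scores.append(score)
    return scores


docs = [
    "the cat sat on the mat",
    "dogs chase the cat",
    "search engines rank pages",
]
print(bm25_scores("cat", docs))
```

In this toy example, the third document scores zero because it never mentions "cat", and the second document edges out the first because it is shorter while containing the term the same number of times.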

Yandex data leak: Initial findings & SEO learnings (the 1,922)

#solidstatelife #searchengines #cybersecurity