Democratization of search

Eric Walker, June 13, 2020

A week or so ago I added the twenty-thousandth link to Digraph, counting from November 2018, when the project reached the point of being a useful tool. Let’s take a moment to look at some of the influences that have gone into the effort, along with the speculative possibility that has guided its development: the democratization of search.

In 1994, Yahoo launched a website that provided a directory of links to interesting sites on the World Wide Web (what we now think of as the internet), which was relatively new at the time. Around the time the Yahoo directory became available, I bought a book that listed links to various websites, each with a short description. Today, with search engines, news aggregators, and the billions of links out there, such a book would not be very useful. At the time, though, the Yahoo directory was more convenient than the book because you didn’t need to type in the long URLs.

On Yahoo’s landing page, you saw the top of the hierarchy of categories, with entries like “Arts & Humanities,” “Business & Economy,” “Computers & Internet,” “Education,” and so on. If you clicked on “Arts & Humanities,” you were taken to a page that had subtopics like “Artists,” “By Region,” and “Performing Arts.” A second section of the page had related categories like “Art History” and “Education.” A third section had a long list of links for the current category.

[Image: Yahoo's landing page]

What would happen if you were interested in art education? Yahoo gave you at least two ways to get there: by way of Arts → Education, or, if you started with Education instead, by way of Education → By Subject → Art@. The “@” seems to have been a hint that the category really lived under a different parent, and that this entry was a shortcut added for convenience.

Yahoo, in providing a directory of links, stumbled upon an interesting categorization problem. The directory had thousands of links, and a simple hierarchy of categories would have been confusing for users. If there were only one way to find arts education, some visitors would be lucky enough to take the one path that led there, while others would try a different path and possibly give up before reaching the category. Yahoo handled this challenge by providing multiple ways to reach a given subcategory. In the language of software development, its categories formed a directed acyclic graph rather than a strictly hierarchical tree.
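To make the distinction concrete, here is a minimal sketch in Python of a category graph in which a topic can have more than one parent. The category names and functions are illustrative; they aren’t Yahoo’s or Digraph’s actual data model.

```python
# A minimal sketch of a category graph where a topic can have more
# than one parent, making it a DAG rather than a tree.
from collections import defaultdict

parents = defaultdict(set)  # child category -> set of parent categories

def add_parent(child, parent):
    parents[child].add(parent)

# "Art Education" is reachable both through Arts and through Education.
add_parent("Arts", "Top")
add_parent("Education", "Top")
add_parent("Art Education", "Arts")
add_parent("Art Education", "Education")

def paths_to_root(topic, root="Top"):
    """Enumerate every path from a topic up to the root category."""
    if topic == root:
        return [[root]]
    return [[topic] + path
            for parent in parents[topic]
            for path in paths_to_root(parent, root)]

# Two distinct paths, as in the Yahoo directory (order may vary):
# [['Art Education', 'Arts', 'Top'], ['Art Education', 'Education', 'Top']]
print(paths_to_root("Art Education"))
```

In a strict tree, `parents[topic]` would hold at most one entry and only one path would exist; allowing several parents is what gives visitors several routes to the same category.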

In the late 1990s, Google launched its search engine, which used arrays of computers to index large numbers of web pages. Once the pages had been indexed, they could be found with simple search terms typed into a form on Google’s page. Yahoo’s directory, maintained by hand by Yahoo employees, could not compete and faded into obscurity.

Historically, people have gravitated towards simple ways of organizing things. Aristotle, with his classification, and Linnaeus, with his taxonomy, used the simple approach of a strict hierarchy. A species of plant was in one genus and not any others, and that genus was in one family and not any others, and so on, all the way up to the top of the category scheme. In the case of Linnaeus, the intent was practical. He organized his classification around things that were easy to identify, such as the number of stamens a flower has, which could be used to narrow down the possibilities and eventually get to the entry for the species. Nonetheless, the Linnaean classification was rigid, in the sense that a species has one and only one place within the taxonomy.

When we’re talking about thousands or tens of thousands of distinct things to be organized, the Aristotelian and Linnaean approach probably works well enough. When we’re talking about hundreds of thousands, or millions, or billions of distinct things (e.g., web pages, stars, or proteins of scientific interest), the limitations of the simple approach become insurmountable, and a more nuanced approach like Google’s or Yahoo’s becomes necessary. Google did not attempt to provide a system of categories that the user could navigate down to all of those links. To find a page, the user had to already know enough about its content for it to appear near the top of the search results. This was fine in many situations: you usually started with a phrase of interest rather than a specific page in mind, and hoped you got the phrasing right.

With news aggregators such as Reddit and Hacker News and social media sites, search no longer has the central role in discovering new web pages that it played a decade ago, although it still comes in handy at times.

Wikipedia has followed an approach similar to the old Yahoo directory in the categories included at the bottom of its pages. Unlike the Yahoo directory, Wikipedia is crowd-sourced, which has allowed the site to accumulate millions of pages on different subjects together with a collection of categories large enough to classify all of those pages. Wikipedia’s categories follow a structure similar to those of the Yahoo directory and are not strictly hierarchical. In a way, Wikipedia’s scheme of categories is groundbreaking: its authors are not merely classifying every animal, plant, or mineral ever found; they’re classifying every idea and concept that people use. I gather that this is a harder challenge, given the scope and generality of some of the ideas and concepts involved; where to place them is not always obvious.

When reading a Wikipedia article, one naturally comes across new topics, an experience that gives the site much of its value. Google’s reliance on machine learning to index documents probably makes it difficult to assemble an organized, human-friendly hierarchy of topics from its index.

With the large number of categories available in Wikipedia, something interesting becomes possible: it is now theoretically feasible to search within categories and within intersections of categories. You could wire up Wikipedia to allow a search for all articles within a “Covid-19” category and a “Proteins” category. This would return a list of articles that fall into both categories or, transitively, within subcategories of those categories. What we’re doing here is similar to a Google search, but instead of looking for pages that contain certain phrases and variations of those phrases, we’re looking for pages that mention specialized names and concepts, and subtopics of those names and concepts. Our search has become a concept search. It would turn up not only web pages that mention the words “Covid-19” and “Proteins,” but also ones tagged with ACE2, Recombinant Spike Subunit 1 (S1), Recombinant 2019-nCoV NSP7, and others. There is nothing to prevent us from adding even more specific topics if any of the existing ones become too unwieldy.
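Here is a rough sketch of how such a concept search might work against a category graph: collect each category’s transitive subcategories, gather the articles tagged with any of them, and intersect the results. The categories, articles, and function names below are hypothetical placeholders, not Wikipedia’s actual data model or API.

```python
# A sketch of concept search: find articles that fall under *both*
# categories, either directly or via any transitive subcategory.
from collections import defaultdict

subcategories = defaultdict(set)  # category -> child categories
articles = defaultdict(set)       # category -> articles tagged with it

def descendants(category):
    """The category itself plus all of its transitive subcategories."""
    seen, stack = set(), [category]
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(subcategories[c])
    return seen

def concept_search(*categories):
    """Articles tagged under every given category or a subcategory of it."""
    result = None
    for category in categories:
        tagged = set()
        for c in descendants(category):
            tagged |= articles[c]
        result = tagged if result is None else result & tagged
    return result

# Illustrative data: ACE2 sits under Proteins, and the spike-protein
# article is tagged with both Covid-19 and ACE2.
subcategories["Proteins"].add("ACE2")
articles["Covid-19"].add("Spike protein")
articles["ACE2"].add("Spike protein")
articles["Proteins"].add("Hemoglobin")

print(concept_search("Covid-19", "Proteins"))  # {'Spike protein'}
```

The point of the transitive step is that an article tagged only with a narrow topic like ACE2 still turns up in a search for the broader “Proteins” category.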

Now let’s extend things further: take Wikipedia’s categories, which organize millions of articles, expand upon them, and overlay them on Google’s search corpus, which covers billions of web pages. The result would be a search engine that lets us zoom in and out of concepts as much as needed to explore and research a topic of interest, an intriguing possibility, or a new connection. To make this kind of search possible, the application of topics to large numbers of web pages would need to be precise. Wikipedia has shown that crowd-sourcing can achieve the scale and precision needed. Building out this growing and evolving corpus of links and topics would give us a search engine free of the exigencies of corporate strategy and profit-making, forces that inevitably influence Google’s search results. Our crowd-sourced search engine would be a work in progress and a community effort rather than a finished product handed to us out of corporate largess. This is the democratization of search.

How realistic is such a possibility in the short or medium term?

This question has two parts. The first is whether such an effort is fundamentally feasible. I think it is. In the past 18 months, I’ve added 20,000 links to Digraph, which comes out to roughly 13,000 links per year. Seven or eight million users working at this rate could index 100 billion documents per year at a cursory level. By comparison, Wikipedia has 39 million registered users, 142,000 of whom are active, and GitHub has 40 million users. I doubt that indexing anything close to 100 billion documents would be needed to make the service useful; in reality, many web pages are not worth indexing. I’m going to speculate that 142,000 people actively indexing web pages in the course of their everyday reading would produce enough activity to make the effort competitive with existing search engines, which return as many advertisements disguised as web pages as they do useful search results.
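As a back-of-envelope check of those figures, with the 18-month, 20,000-link rate taken from above and the 100-billion-document target as a stated assumption:

```python
# Back-of-envelope check: how many users indexing at my rate would it
# take to cover 100 billion documents per year?
links_added = 20_000          # links added to Digraph
months = 18                   # over the past 18 months
links_per_user_year = links_added / months * 12   # ~13,300 per year

target_documents = 100e9      # assumed target: 100 billion documents/year
users_needed = target_documents / links_per_user_year

print(f"{links_per_user_year:,.0f} links per user-year")   # 13,333
print(f"{users_needed / 1e6:.1f} million users needed")    # 7.5
```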

The second part of this question is how likely it is, realistically speaking, that a crowd-sourced search engine will come about. That is a harder question to answer over the longer term, since the outcome depends on circumstances and trends that are difficult to predict. In the short term, I would not expect something like this to happen.