Next steps

Eric Walker, May 29, 2022

It’s been a while since the last update, and now is a good time to gather my thoughts about changes to Digraph that would be nice to have. Let’s take a look at some possible next steps, which may or may not happen in the near future.

Digraph is an app for keeping track of links and topics. No topic is too specific, and any link that I spend more than a few moments reviewing is something I want to be able to capture for later retrieval. Topics are often situated in time, especially ones for ongoing developments. A new infectious disease is mentioned here and there and then quickly turns into a pandemic and becomes an all-consuming thing. After a few years it starts to recede from attention. The invasion of Ukraine by Russian forces is first something that is hinted at and foreshadowed by the amassing of troops at the border, and then one day it happens. Topics are split out into new ones as the links accumulate, and they are renamed and given synonyms. Numerous smaller events come and go. Even relatively settled topics, such as the planet Pluto, can undergo significant shifts in our understanding and categorization.

This dimension of time is inseparable from the activity of organizing links into topics. But at the moment, updates to the graph of topics and links in Digraph permanently change what is seen and captured, and there is no way of going back to an earlier point in time to see what things were like. In the process we lose important information about how topics and links developed. Some of these changes can be tracked in an event log of some kind. But it would be nice to be able to go back to an earlier point in time and see what the whole graph looked like at that time.

Coming up with a representation for data that allows one to travel backwards in time like this is a hard problem and requires things like persistent data structures. Despite the difficulties, it is a problem worth solving for a graph of topics and links that will be around, and updated, for many years. Solving this problem using a database like Postgres is probably doable. But Git, an open source project that has been available since 2005, can already do what we need. So a next step will be to transfer most of the data to one or more Git repositories and then begin tracking these changes there. This will allow one to ask, what did this topic look like two years ago? Many modifications to the way that data are represented will be needed to make all of this happen.
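To make the question "what did this topic look like two years ago?" concrete, here is a minimal sketch of an as-of lookup. It is not how Digraph or Git stores anything: the names `TopicSnapshot`, `TopicHistory`, `commit` and `as_of` are made up for illustration, and the sketch naively keeps a full copy of the topic per commit, whereas Git shares unchanged objects between commits.

```rust
use std::collections::BTreeMap;

/// Hypothetical snapshot of one topic at a point in time.
#[derive(Debug, Clone, PartialEq)]
pub struct TopicSnapshot {
    pub name: String,
    pub synonyms: Vec<String>,
    pub links: Vec<String>,
}

/// Append-only history of one topic, keyed by commit time (Unix seconds).
/// Assumption: at most one commit per timestamp.
#[derive(Default)]
pub struct TopicHistory {
    commits: BTreeMap<i64, TopicSnapshot>,
}

impl TopicHistory {
    /// Record the state of the topic as of `timestamp`.
    pub fn commit(&mut self, timestamp: i64, snapshot: TopicSnapshot) {
        self.commits.insert(timestamp, snapshot);
    }

    /// Latest snapshot committed at or before `timestamp`, if any.
    pub fn as_of(&self, timestamp: i64) -> Option<&TopicSnapshot> {
        self.commits
            .range(..=timestamp)
            .next_back()
            .map(|(_, snapshot)| snapshot)
    }
}
```

Calling `as_of` with a timestamp that falls between two commits returns the earlier one, and a timestamp before the first commit returns `None`. In Git the equivalent lookup is a search for the last commit before a date, for example with `git rev-list -n 1 --before=<date> <branch>`.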

Most of the data will be stored in Git, but some will still be stored in Postgres. Things like user permissions should not change when you are looking at an older snapshot in time, for example, so this is one of the things that will continue to be tracked in Postgres.

Saving data in Git will provide the basis for two other features I have wanted for a while. First, it will allow a clean separation of the data into different repositories. Imagine a number of repos with different owners that exist for different purposes. There’s a repo called “Wiki” that holds the largest set of data and that can be updated by anyone with an account. And then there are other repos whose visibility is scoped to an organization or an individual. These repos would be used to track topics and links that are of interest only to that individual or organization, or that represent sensitive information that the owner might not want to share more broadly. These repos would be analogous to private repos in GitHub. Unlike repos in GitHub, however, the contents of these repos would be combined into a single view. A user could tag a link in a private repo with a topic from the Wiki repo rather than making a copy of the whole topic, and the link would be shown under the topic but be visible only to that person. Different departments within an organization could similarly keep track of topics and links in their own repos without needing to store the data in the Wiki repo, while still making use of the topics in the Wiki repo that are already there and relevant. The system will integrate and overlay the data from the different repos to which one has access and present everything in one place. Even further out, you can imagine federating these repos so that they reside in different systems. For now, though, let’s just try to get things working with separate Git repos. A challenge here will be figuring out how to support time travel across all of them.
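Here is a rough sketch of the overlay idea, under some simplifying assumptions of my own: each repo is just a map from a topic id to the links tagged with it, a repo with no reader list is public like the Wiki repo, and access is a plain set of user names. `Repo`, `visible_to` and `links_for` are hypothetical names, not existing Digraph code.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical repo holding links tagged by topic id.
pub struct Repo {
    pub name: String,
    /// `None` means anyone can read the repo (like the Wiki repo);
    /// `Some(users)` restricts it to the listed users.
    pub readers: Option<BTreeSet<String>>,
    pub links_by_topic: BTreeMap<String, Vec<String>>,
}

impl Repo {
    fn visible_to(&self, user: &str) -> bool {
        match &self.readers {
            None => true,
            Some(readers) => readers.contains(user),
        }
    }
}

/// Overlay the links for `topic` from every repo that `user` can read.
/// Each result is a (repo name, link) pair, in the order the repos are given.
pub fn links_for(repos: &[Repo], user: &str, topic: &str) -> Vec<(String, String)> {
    let mut combined = Vec::new();
    for repo in repos.iter().filter(|repo| repo.visible_to(user)) {
        if let Some(links) = repo.links_by_topic.get(topic) {
            for link in links {
                combined.push((repo.name.clone(), link.clone()));
            }
        }
    }
    combined
}
```

A private link tagged with a Wiki topic id shows up under that topic for its owner and for no one else, without the topic itself being copied into the private repo. The sketch ignores history entirely, so it says nothing yet about time travel across repos.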

Second, saving data in Git will make it possible to clone a repo. You will be able to clone the Wiki repo as a starting point and make significant changes to it in your own copy. And normal Git cloning will automatically give people a way to save a backup of a repo with the full history.

Since Digraph was first deployed as an application sometime in 2018, I have been the only real user of it. But I want to get it to a place where it is clear how it might be used by a large number of people, if only as a demonstration of and a proof of concept for a set of ideas I’ve had for years. The system should be able to store millions of links or more — at least as many links as there are interesting web sites and documents on the internet. But there are many links that are not interesting, because they are automatically generated for manipulating search engines or are produced by content farms for other reasons. A lot of links are interesting only to a narrow audience or in the context of a specific project. There should be a place for these links and topics in the system, but let’s not show them under a commonly viewed topic unless the user takes additional steps to remove the filters hiding them.

Which links and subtopics should be shown by default and which should be hidden behind filters? This is something that would ideally be sorted out by subject matter experts, or, at least, by contributors who have built up a reputation for quality contributions and know a topic pretty well. So a reputation system and a way for specific topics to be managed by specific users in a decentralized way would be nice. Just because a user has knowledge of one topic area does not mean that he or she will know much about other unrelated topics. So in a system that is meant to be used by numerous contributors, let’s handle who manages a topic as specifically as possible, instead of, say, allowing admins to do this and allowing other roles to do that. What we’re talking about should be separate from an access control system with roles. Also, we’ll need levels of visibility that a user can turn on and off in order to show or filter out links and topics, depending on how many links and subtopics he or she wants to see and has the patience to sort through. A few of those levels of visibility will be used all over the place, and others might only be needed for a project and be of interest only to a handful of users. And let’s not assume that because a specific link is shown by default under one topic, it should also appear unfiltered under every other topic it is tagged with. Whether a link is shown to everyone is something that should be handled on a case-by-case basis, depending on the link and its relevance to a topic.
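One way to picture per-topic visibility is to attach the level to the pairing of a link with a topic rather than to the link itself. The sketch below assumes exactly that. `Placement`, `visible_links` and the level names are invented for illustration, and nothing here covers reputation or who is allowed to set a level.

```rust
use std::collections::BTreeSet;

/// Hypothetical record of a link tagged with a topic.
/// The visibility level belongs to the (topic, link) pair, not to the link.
pub struct Placement {
    pub topic: String,
    pub link: String,
    pub level: String,
}

/// Links under `topic` whose level is one the user has turned on.
pub fn visible_links<'a>(
    placements: &'a [Placement],
    topic: &str,
    enabled: &BTreeSet<String>,
) -> Vec<&'a str> {
    placements
        .iter()
        .filter(|p| p.topic == topic && enabled.contains(&p.level))
        .map(|p| p.link.as_str())
        .collect()
}
```

With this shape, the same link can sit at a level everyone has enabled under one topic and at a narrower, project-specific level under another, and it only appears under the second topic once the user turns that level on.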

While we’re making such wide-ranging changes to the project, let’s re-implement the project in Rust, which is a programming language that we can be confident will have the performance we need, and one that I look forward to learning more about. Since I’m the only developer contributing to the project at this time, such a big change should not inconvenience anyone.