aqua overview (2023 edition)

aqua is the stupid goddess in charge of managing my photo library. She’s a bit of an idiot, but she gets the job done:

stupid goddess

In general, aqua is a codename for a suite of projects that aim to organize my files more efficiently. The unifying concept behind all the software is that storing files (leaves) hierarchically sucks, because there is rarely a single “path” to retrieving that knowledge.

I have recently been reorganizing my media after moving it to a pair of shiny new 20TB drives, and felt compelled to write about this particular passion project of mine. I also got an iPad this holiday season, and have been learning to use its photo gallery. Eager to find a way to back up and organize those photos, I took to writing some code to process its photo database.

So below is a basic description of aqua, to hopefully provide some context for interested readers. I hope to develop it this year, and will endeavor to write more about it as I do so.

Database model

The basic model of the program is actually remarkably simple:

entries: a record of content-addressable files (SHA256) & system metadata.
tags: tags consist of three key fields: id, schema, and name.
entries_tags: an M:N relation of entries<->tags forms the basis for our database; all query capabilities start here.
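
A minimal sketch of what those three tables might look like, using Python's built-in sqlite3 module. Only the tag fields (id, schema, name) come from the description above; the other column names (hash, mime, created_at) and the helper functions are assumptions made for illustration, not aqua's actual schema.

```python
import sqlite3

# Hypothetical DDL: column names other than tags.id/schema/name are assumed.
SCHEMA = """
CREATE TABLE entries (
    id         INTEGER PRIMARY KEY,
    hash       TEXT NOT NULL UNIQUE,   -- SHA256 hex digest of the content
    mime       TEXT,                   -- system metadata (assumed column)
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE tags (
    id     INTEGER PRIMARY KEY,
    schema TEXT NOT NULL,
    name   TEXT NOT NULL,
    UNIQUE (schema, name)              -- names only need be unique per schema
);
CREATE TABLE entries_tags (
    entry_id INTEGER NOT NULL REFERENCES entries(id),
    tag_id   INTEGER NOT NULL REFERENCES tags(id),
    PRIMARY KEY (entry_id, tag_id)     -- one row per (entry, tag) pair
);
"""

def open_db(path=":memory:"):
    """Open a database and create the three tables."""
    db = sqlite3.connect(path)
    db.executescript(SCHEMA)
    return db

def tag_entry(db, entry_hash, schema, name):
    """Attach schema:name to the entry with this hash, creating rows as needed."""
    db.execute("INSERT OR IGNORE INTO entries (hash) VALUES (?)", (entry_hash,))
    db.execute("INSERT OR IGNORE INTO tags (schema, name) VALUES (?, ?)",
               (schema, name))
    db.execute(
        """INSERT OR IGNORE INTO entries_tags (entry_id, tag_id)
           SELECT e.id, t.id FROM entries e, tags t
           WHERE e.hash = ? AND t.schema = ? AND t.name = ?""",
        (entry_hash, schema, name),
    )
```

Calling tag_entry twice with the same arguments is harmless: the uniqueness constraints turn the second call into a no-op.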

Why content addressable?

I find myself downloading the same file repeatedly over the years. One of the downsides of embracing memetics / having a digital brain is that pernicious ideas have a certain “orbital” property to them.

Content addressable data stores solve two key problems:

Q: Where does the data belong? A: In the corresponding hash bucket.
Q: What do you do with duplicates? A: Discard them.(*)

*Unless you need to collect & merge any interesting new metadata into the database.
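
The two answers above can be sketched in a few lines of Python. The two-character bucket layout (store/ab/abcdef...) and the function names are assumptions for illustration; aqua's real on-disk layout may differ, and the metadata-merging caveat is left out.

```python
import hashlib
import shutil
from pathlib import Path

def sha256_file(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large media never sits fully in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def bucket_path(store: Path, digest: str) -> Path:
    """Where does the data belong? In the bucket named by its hash (assumed layout)."""
    return store / digest[:2] / digest

def ingest(store: Path, src: Path) -> tuple[str, bool]:
    """Copy src into the store. Returns (digest, was_new)."""
    digest = sha256_file(src)
    dest = bucket_path(store, digest)
    if dest.exists():
        # What do you do with duplicates? Discard them.
        return digest, False
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dest)
    return digest, True
```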

One thing that content-addressable storage does not solve is weeding out practical duplicates (e.g., visually identical images at differing compression levels, images that are clearly crops of one another, etc.). This seems to be best addressed by “AI” programs; however, I’m generally not a fan of handing off my multimedia library to corporations so they can run their algorithms on my data.

There are two privacy respecting options I’m aware of:

dupeGuru seems to have this capability and is open source.
Apple claims to do image analysis on-device, but their software (Photos.app) is not open source, so I’m unsure how this claim could ever be verified satisfactorily.

Since I do use an iPad to consume/store media, and would eventually like to ingest from it into aqua, leveraging its analysis results from Apple’s database would be an interesting option to explore.

More broadly, however, applying dupeGuru to the aqua content store seems to be the more practical & controllable option. I am no expert in “AI” but it seems to me that this use-case will be what drives me to finally learn / embrace the technology.

Why tags?

My problem with hierarchical filesystems is thus: having multiple paths to data demands things like symlinks or hardlinks from a filesystem. Using these is like walking into a minefield of platform compatibility hazards. Just to name a few:

On Windows you require elevated privileges to create them (SeCreateSymbolicLinkPrivilege)
Creating them on mobile devices from “trusted” code is virtually impossible.
Hard links on either Linux or Windows must live on the same device as their target. This sounds reasonable at first glance, but prevents some use-cases I have such as: (a) storing the database & content store on different ZFS datasets, (b) accessing the content store remotely but querying & linking it to a local filesystem.

I have dabbled with using links for prototype aqua frontends, but due to the issues above the experience definitely feels like links are a “second class citizen” compared to a real file.

The problem with hierarchies is primarily that they enforce one person’s “lens” on the data which, especially in team environments, conveniently ignores the fact that organization is ultimately a subjective endeavor.

Even if you use links to escape this imposition: you still have the cognitive overhead of dealing with multiple “copies” of the data. This may confuse space usage reporting tools, it makes it difficult to truly delete a file because you have to know all the myriad paths to it, and you now have multiple “sources of truth” as each path is individually encoding data, and finding related paths is a non-trivial operation.

Queries

Consider the following reasons why you might want to watch a particular movie:

You know exactly which film (“title”) you want to watch.
You want to watch a film by your favorite producer or director.
You want to watch a film with a particular actor.
You want to watch any film in a particular genre.
You just finished a serial or film and want to see related entries in the series.

Typically to solve this problem you need to first scan your library, match it to some metadata provider, and then browse the results in a frontend. Oftentimes these frontends are proprietary (think Spotify or Netflix), but even open-source ones rely heavily on metadata provider services. (This creates a potential privacy hazard.)

You can apply this same logic to almost any type of file you want to store. For example consider these queries for finding your typical “back-office” document:

Find all “Quotes” / “Invoices” relating to a given project
Find the above, but for a customer.
Find the above but limit it to a specific year.
Find the presentation you gave to a customer at some point in time, even if you forgot the title.

Modern filesystems can help with some of this if you are judicious about naming conventions, file extensions, etc. However consider dates: they often get bungled when moving data between devices. (Which, in the cloud-connected era, seems to happen far more frequently.) The last query above could easily be messed up for any number of reasons: (a) someone inadvertently modified the file, (b) you restored the file from a backup and your OS bumped the dates.

The current modus operandi seems to be: throw files in an inadequate structure, and then let some background process waste energy & CPU cycles trying to guess how you might query it. However since computing is moving towards low power mobile devices, and indexing post-facto is a relatively expensive operation, we (consumers) have been sold on the idea of trusting all our data to a Cloud Vendor who will index/analyze/provide searchable results for us.

The author whole-heartedly rejects the idea that an entire fucking datacenter is necessary just to have functional search over their data.

Solution

Enter the humble tag. Tags have the following properties:

Any file can have any number of tags.
A tag belongs to exactly one schema. (Though tag names only need be unique within a given schema.)
aqua programs may write relevant tags in the sys:* schema.
aqua programs may expose metadata to the end-user as synthetic sys:* schema tags that do not necessarily exist on disk.

The following properties will likely be added in the future:

Tags can have relations to one another. (E.g., the tag comic-book-movie “implies” movie, or actress “is-a-synonym” for the tag actor.)
Schemas can be queried with tree operators instead of set operators if they parse to valid paths.
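
Since tag relations are not implemented yet, the following is only a sketch of how “implies” and “is-a-synonym” might behave at query time. The dictionaries, the expand function, and the superhero-movie tag are all hypothetical; only comic-book-movie, movie, actress, and actor come from the examples above.

```python
# Hypothetical relation tables; a real implementation would store these in the database.
IMPLIES = {
    "superhero-movie": {"comic-book-movie"},   # invented tag, shows transitivity
    "comic-book-movie": {"movie"},
}
SYNONYMS = {"actress": "actor"}                 # alias -> canonical tag

def expand(tags):
    """Return the canonical tags plus everything they transitively imply."""
    seen = set()
    stack = [SYNONYMS.get(t, t) for t in tags]
    while stack:
        tag = stack.pop()
        if tag in seen:
            continue                            # also guards against cycles
        seen.add(tag)
        stack.extend(SYNONYMS.get(t, t) for t in IMPLIES.get(tag, ()))
    return seen
```

With these tables, a file tagged superhero-movie would also match a query for movie, and a query for actress would be rewritten to actor.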

Schemas exist to help the user organize their tags. They can be used to describe what a tag is, for instance: project, author, or title. They can also be used as a path, which identifies where a tag belongs in some particular taxonomy. (Hierarchies are not inherently bad, after all, they’re just ill-suited for organization of leaf data.)

A fully-formed tag is formatted as follows: <namespace?>:<schema? | path?>:<tag>. As such the fragment separator (:) and path separator (/) characters are reserved by the grammar and must be escaped to be used as part of a schema or tag. Note that the namespace is a convention (a colon-prefix on a schema) and not currently a separate field in the database.
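
A rough sketch of parsing that format follows. The escape character is not specified above, so a backslash is assumed here; the Tag tuple and function names are also made up for illustration. Path schemas are left unsplit, and an escaped path separator is deliberately left untouched so a later path-splitting step could still see it.

```python
from typing import NamedTuple, Optional

class Tag(NamedTuple):
    namespace: Optional[str]   # convention only: stored as a prefix on the schema
    schema: Optional[str]      # may itself be a "/"-separated path
    name: str

def split_fragments(text: str, sep: str = ":", escape: str = "\\") -> list[str]:
    """Split on unescaped separators; only escaped sep/escape chars are unescaped."""
    parts, buf, i = [], [], 0
    while i < len(text):
        ch = text[i]
        if ch == escape and i + 1 < len(text) and text[i + 1] in (sep, escape):
            buf.append(text[i + 1])    # literal separator or backslash
            i += 2
        elif ch == sep:
            parts.append("".join(buf))
            buf = []
            i += 1
        else:
            buf.append(ch)
            i += 1
    parts.append("".join(buf))
    return parts

def parse_tag(text: str) -> Tag:
    """Parse 'tag', 'schema:tag', or 'namespace:schema:tag'."""
    parts = split_fragments(text)
    if len(parts) > 3 or not parts[-1]:
        raise ValueError(f"malformed tag: {text!r}")
    padded = [None] * (3 - len(parts)) + parts
    return Tag(*padded)
```

For example, parse_tag("sys:type:pptx") yields Tag(namespace='sys', schema='type', name='pptx'), while a title containing a colon would be written with the colon escaped.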

Tags can be combined using a fairly simple set-based grammar defining just four operations:

Intersection (+)
Subtraction (-)
Union (*)
Grouping (( ... ))

Some examples of how one might use this to build ad-hoc queries:

Find a facial expression shot of an anime character: (meme:reaction + genre:anime) + character:megumin
Find pictures without a character: (media:image + genre:anime) + (series:suzumiya haruhi - character:asakura ryouko)
Find a movie with an actor: media:movie + actor:Ewan McGregor
Find a presentation for a project: sys:type:pptx + project:aqua
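
To make the grammar concrete, here is a small evaluator over in-memory Python sets. It rests on two assumptions that the description above does not settle: operators must be surrounded by whitespace (so a hyphen inside a tag such as comic-book-movie is not read as subtraction), and all three operators share one precedence level and associate left to right. The lookup callback stands in for a real database query.

```python
import operator
import re

OPS = {"+": operator.and_, "-": operator.sub, "*": operator.or_}
# Parentheses always split; + - * only split when surrounded by whitespace (assumption).
TOKEN_SPLIT = re.compile(r"(\(|\)|\s[+\-*]\s)")

def tokenize(query):
    return [t.strip() for t in TOKEN_SPLIT.split(query) if t.strip()]

def evaluate(query, lookup):
    """Evaluate a tag query; lookup(tag) must return the set of matching entry ids."""
    tokens = tokenize(query)
    pos = 0

    def term():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of query")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            result = expr()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return result
        if tok == ")" or tok in OPS:
            raise ValueError(f"unexpected token: {tok!r}")
        return set(lookup(tok))

    def expr():
        nonlocal pos
        result = term()
        while pos < len(tokens) and tokens[pos] in OPS:
            op = OPS[tokens[pos]]
            pos += 1
            result = op(result, term())
        return result

    result = expr()
    if pos != len(tokens):
        raise ValueError(f"unexpected token: {tokens[pos]!r}")
    return result

# Toy index mapping tags to entry ids (made-up data).
index = {
    "meme:reaction": {1, 2, 3},
    "genre:anime": {2, 3, 4},
    "character:megumin": {3, 5},
}
query = "(meme:reaction + genre:anime) + character:megumin"
print(evaluate(query, lambda tag: index.get(tag, set())))  # {3}
```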

Since tags are stored in an actual RDBMS: indices are automatically maintained for all these tags. There is no scanning process that must be done to rebuild indices, find missing entries, etc. It’s all done at insertion time. Also because SQL naturally works in terms of sets: query planners are very good at optimizing these types of queries.
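
As a sketch of why this maps so naturally onto SQL, the query media:movie + actor:Ewan McGregor can be expressed as an INTERSECT of two lookups against the M:N table. The table and column names below are assumptions (aqua's real schema may differ), and SQLite is used only because it ships with Python. Subtraction and union would map to EXCEPT and UNION in the same way.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE entries (id INTEGER PRIMARY KEY, hash TEXT UNIQUE);
CREATE TABLE tags (id INTEGER PRIMARY KEY, schema TEXT, name TEXT,
                   UNIQUE (schema, name));
CREATE TABLE entries_tags (entry_id INTEGER, tag_id INTEGER,
                           PRIMARY KEY (entry_id, tag_id));
-- Made-up sample data.
INSERT INTO entries (id, hash) VALUES (1, 'aaa'), (2, 'bbb'), (3, 'ccc');
INSERT INTO tags (id, schema, name) VALUES
    (1, 'media', 'movie'), (2, 'actor', 'Ewan McGregor'), (3, 'media', 'image');
INSERT INTO entries_tags VALUES (1, 1), (1, 2), (2, 1), (3, 3), (3, 2);
""")

# One sub-select per tag: the ids of every entry carrying that tag.
TAGGED = """
SELECT et.entry_id FROM entries_tags et
JOIN tags t ON t.id = et.tag_id
WHERE t.schema = ? AND t.name = ?
"""

def tagged_with_all(db, *tags):
    """Entry ids carrying every (schema, name) pair, i.e. the + operator."""
    if not tags:
        return []
    sql = " INTERSECT ".join([TAGGED] * len(tags)) + " ORDER BY 1"
    params = [part for tag in tags for part in tag]
    return [row[0] for row in db.execute(sql, params)]

print(tagged_with_all(db, ("media", "movie"), ("actor", "Ewan McGregor")))  # [1]
```

The UNIQUE and PRIMARY KEY constraints give the database its indices for free, which is the point made above: nothing has to be rescanned after insertion.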

The grammar will eventually be extended to work well with types other than strings. The two types I am most interested in adding support for are:

Integers (page numbers, revisions, etc.)
Dates (release date, modified date, etc.)

Integers would support your typical comparison operations, and dates would additionally be able to support comparison to an interval or duration, e.g., find files modified x days ago, or find films released between 1980 and 1989.
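
None of this exists yet, so the following only illustrates the intended semantics with plain Python values. The record layout and function names are hypothetical, and “modified x days ago” is interpreted here as “modified within the last x days”, which is an assumption.

```python
from datetime import date, timedelta

def released_between(entries, first_year, last_year):
    """Interval comparison: keep entries released in [first_year, last_year]."""
    return [e for e in entries if first_year <= e["released"].year <= last_year]

def modified_within(entries, days, today):
    """Duration comparison: keep entries modified in the last `days` days."""
    cutoff = today - timedelta(days=days)
    return [e for e in entries if e["modified"] >= cutoff]

# Placeholder records; the dates are arbitrary sample values.
library = [
    {"title": "film-a", "released": date(1979, 6, 1), "modified": date(2023, 1, 1)},
    {"title": "film-b", "released": date(1984, 3, 9), "modified": date(2023, 1, 20)},
    {"title": "film-c", "released": date(1989, 12, 31), "modified": date(2022, 6, 5)},
]
```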

Vision

There are two main things that need to be addressed with aqua:

Integration - ideally I want to be able to use aqua within the framework of existing operating systems. (e.g: saving a document and having it automatically ingested w/ reasonable default tags; finding an image and uploading it to SNS, etc.)
Explorability - the end goal of carefully categorizing your media is, of course, to be able to find & enjoy it later. Tool(s) should enable the user to quickly build the queries described above, iterate & refine them, and use the results in a meaningful way.

There are plenty of programs out there that aim to address parts of the media organization problem, but I find they usually share some or all of these shortcomings:

They treat the filesystem as the content store, and do not embrace content addressable storage.
They tend to be focused on a specific type of media, e.g: images. aqua aims to eventually replace the hierarchical filesystem.
They usually have poor to no integration with the existing operating system shell.
The programs are often overly complex (e.g., Hydrus does media tagging, but has a complicated schema, sync protocol, and community/network aspect built in. Many photo library programs are designed for professional photographers and organize by project or shoot, can ingest raws, etc.)
Often even if these apps have a decent cross-platform desktop story: their mobile story is non-existent or an afterthought. Currently the plan is to take advantage of the data already produced by native galleries on Apple/Google devices and use it as part of an ingest program tailored to these programs. (Possibly with bidirectional support.)

Project outline & plan

Currently the code lives in the following repos:

aqua: an old web based interface
aqua2: a reboot of the above libraries with new binaries, minus the web frontend. Updated for the modern Rust ecosystem. (NOTE: this will probably get merged together soon, I just got lazy and didn’t want to figure out how to get a Rust web framework from 2015 to compile in 2022.)
aqua-gui: a prototype C#/WPF desktop client, not currently published.

⚠️ DISCLAIMER: THIS IS ALPHA SOFTWARE WHICH MOVES & DELETES FILES. ⚠️

Use this software at your own risk. Use on a library which is not backed-up is strongly discouraged. The software is not designed for end-users; many tunables are missing or not exposed to clients. Use without knowledge of Rust, SQL, and a basic understanding of the model described above is not advised at this time.

The following command line utilities exist:

aqua-drop / aqua-watch: a program designed to offer a “drop box” where files can be saved & ingested into the database.
aqua-query / aqua-link: a program which exposes a set of tags as links into the content store. This is useful for browsing a tag set with existing gallery software.
aqua-gui: a gallery program designed to aid in browsing & tagging media. (Primarily photos & videos.)
aqua-web: a web gallery designed for browsing media. If you’re a degenerate weeaboo, like me, it was supposed to be a booru-clone.
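
To illustrate the idea behind aqua-query / aqua-link, here is a sketch that exposes a query result as a directory of symlinks into the content store. It is not the actual tool: the function name, the store layout (store/ab/abcdef...), and naming each link after its digest are all assumptions. On Windows it would also run into the symlink privilege issue mentioned earlier.

```python
import os
from pathlib import Path

def link_results(store: Path, digests, out_dir: Path) -> list[Path]:
    """Create one symlink per digest in out_dir, pointing into the store."""
    out_dir.mkdir(parents=True, exist_ok=True)
    links = []
    for digest in digests:
        target = store / digest[:2] / digest   # assumed bucket layout
        link = out_dir / digest
        if link.is_symlink():
            link.unlink()                      # refresh a stale link from an earlier run
        os.symlink(target.resolve(), link)
        links.append(link)
    return links
```

An existing gallery program can then be pointed at out_dir and will see only the files matching the query.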

The following utilities are planned:

aqua-ios: ingest a Photos.app database, possibly with integration to libimobiledevice to automate reading the database and image files.
aqua-explore: a Windows daemon which adds context menus, as well as a “spotlight-esque” pop-over interface to explorer. Used to search, tag, and manipulate files in a document library that is automagically linked to a content store.
aqua-shell: a REPL-style command line interface that lets you iteratively build queries and expose the results on a POSIX filesystem.
aqua-z???: possibly use ZFS internals directly as the content store, and provide interfaces on top of it; basically replace ZPL with an object layer that is queried by tags. Could be interesting in a potential “client-server” model.
aqua-9???: same thing as above, but instead of ZFS build it on 9P for lulz.
