So we tend to have multiple copies of files, because it took less time to buy a new hard drive than to go over the files, find all the very similar ones, and choose the latest version of each.

Moreover, we avoided deleting similar files in different folders, because these directories
provided context for understanding what the files did in different situations.

We forgot where these redundancies were, and now, every time we back up a disk, we just copy the whole lot over, so the space we need balloons with every backup.

[Solution]

If a file is found under several different hierarchies, add it to a Git repository as the same file. Arrange the copies by time, and commit.

For each commit, make the commit message the location where the file was found.

Commit in order of modification time.
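
The steps above can be sketched roughly as follows. This is only a planning sketch, not a finished tool: it assumes that two files with the same name are versions of the same file (an assumption the idea itself leaves open), and the function name `plan_commits` and the sample paths are made up for illustration. A real run would then copy each version into the repository and commit it in this order, for instance with `git commit --date` to preserve the modification time.

```python
from collections import defaultdict
from pathlib import PurePosixPath

def plan_commits(found):
    """Group (path, mtime) pairs by filename and order each group oldest first.

    Returns {filename: [(mtime, commit_message), ...]}.
    Assumption: same filename means same logical file.
    """
    plan = defaultdict(list)
    for path, mtime in found:
        p = PurePosixPath(path)
        # The commit message records where this copy was found.
        plan[p.name].append((mtime, f"found in {p.parent}"))
    return {name: sorted(entries) for name, entries in plan.items()}

# Hypothetical scan results: (path, modification time as a Unix timestamp).
found = [
    ("/backup2019/docs/report.txt", 1560000000),
    ("/home/me/work/report.txt", 1500000000),
    ("/home/me/notes/todo.txt", 1400000000),
]
plan = plan_commits(found)
for name, commits in plan.items():
    for mtime, message in commits:
        print(name, mtime, message)
```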

Then, create a browser for files defined this way, one that can browse multiple virtual hierarchies as defined by labels.

Moreover, based on file location statistics, automatically suggest the best location for each file to be.
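
One simple reading of "file location statistics" is to count how often each directory shows up among the copies of a file and suggest the most frequent one. The sketch below assumes exactly that; the function name `suggest_location` and the sample paths are hypothetical, and a real version would probably weigh recency or usage as well.

```python
from collections import Counter
from pathlib import PurePosixPath

def suggest_location(relative_paths):
    """Return the directory in which the copies of a file were found most often."""
    counts = Counter(str(PurePosixPath(p).parent) for p in relative_paths)
    return counts.most_common(1)[0][0]

# Hypothetical: the same file as found on three different disks.
copies = [
    "projects/alpha/spec.doc",   # disk 1
    "projects/alpha/spec.doc",   # disk 2
    "old/misc/spec.doc",         # disk 3
]
suggestion = suggest_location(copies)
print(suggestion)
```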

"Tired of young people ? Can't seem to make them see your point of view ? Introducing the La Beling Gitifier <shows device resembling a Buck Rogers ray gun>. . . One quick shot to the sternum and they'll be ranting right along with you . . ."

[+] I had an idea that accomplishes some of the
goals on a more limited scale but would be easier
to implement <link>. It appears that the Gitifier
would need to be incorporated into the file
system and completely change how it worked, but
would accomplish everything I wanted with my
idea and much much more.

While we're at it, let's let Git keep a history of
each file. Obviously that would fill up the hard
drive too fast if all history was stored, but history
could be pruned as needed to make space, leaving
more recent changes tracked in case the user
needs to go back to an old version of any file.

I had a dream yesterday that I was on a train going
over an aqueduct with Linus Torvalds and some
friends of his, and I persuaded them to explain Git to
me via the conduit of explaining to me how a two-
stroke engine works.

Doesn't git (and VCSes in general) only work on files that are
text-based? I thought that was one of the reasons why many
file formats (MS Office, CadSoft Eagle, ...) have moved from
being binary to being XML-based recently. But many formats
are still binary (images, videos, ...), so I don't see how this
would work as well for those files.

Whilst not understanding the idea, I would like to
comment on it herewith.

To the extent that I _do_ understand, GIT is some
system for managing files to avoid redundancy,
yes? OK, so how about a simpler option.

Have an application that runs in the background.
Whenever it finds two identical files on the disk,
and provided neither copy is being edited at that
moment, it simply deletes one copy and replaces it
with an alias. (Do aliases exist on non-Mac
systems? I presume so.)

The alias will sit there in the directory where the
file originally was, and will therefore be fully
findable and will retain its context. Problem
solved, no?
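
A minimal sketch of that background job, under some assumptions: symbolic links stand in for aliases, SHA-256 is used to decide that two files are identical, and the function names are invented for illustration. It does not check whether a file is being edited at that moment, which a real version would have to do.

```python
import hashlib
import tempfile
from pathlib import Path

def file_digest(path):
    """Hash a file's contents in blocks so large files need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            h.update(block)
    return h.hexdigest()

def replace_duplicates_with_links(root):
    """Replace byte-identical files under root with symlinks to the first copy.

    Returns the number of files replaced.
    """
    first_seen = {}
    replaced = 0
    for path in sorted(Path(root).rglob("*")):
        if path.is_symlink() or not path.is_file():
            continue
        original = first_seen.setdefault(file_digest(path), path)
        if original is not path:
            path.unlink()                        # delete the duplicate...
            path.symlink_to(original.resolve())  # ...and leave a link in its place
            replaced += 1
    return replaced

# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    for folder in ("a", "b"):
        (Path(tmp) / folder).mkdir()
        (Path(tmp) / folder / "photo.jpg").write_bytes(b"same bytes")
    replaced = replace_duplicates_with_links(tmp)
    is_link = (Path(tmp) / "b" / "photo.jpg").is_symlink()
    still_readable = (Path(tmp) / "b" / "photo.jpg").read_bytes()
```

The link stays in the directory where the duplicate was, so the context is kept, but anything that does not follow symlinks will see a link rather than a file.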

An extension of this system could save yet more
space, by replacing the repetitive parts of large
files with aliases, and then reinstating them on the
fly. Maybe.

I have some experience that seems to contraindicate the use of
aliases or other OSes' equivalents. I have a collection of reference
documents (scientific papers, datasheets, etc.) in my Google Drive. I
had them organized into topic folders, but recently I found that the
folders I had weren't optimal, so I decided to rearrange the files. At
the same time, I decided to consolidate all of the files into one folder
and put only aliases to them in the topic folders, to be able to put a
file in multiple categories without duplication. This worked great until
I tried to access it from my Windows computer or the Google Drive
web interface, which don't understand Mac aliases. I could use
Windows shortcuts (equivalent to aliases) but Mac OS X doesn't
understand those. I could use symlinks, but I found some reason, which I
no longer remember, why those wouldn't work either. I also thought of using
hardlinks, but I realized that Google Drive would see those as separate
files, resulting in duplication in the cloud and then probably in the
local folders when it synced again.

So. I'm currently thinking I have to build some kind of document
management system to keep track of my reference documents (and it
has to be able to sync between my computers and ideally also be
accessible from mobile and web). I would like to be able to tag
documents with multiple tags each, rather than having each one in
just one folder (which is why I started the alias thing in the first
place). I considered Evernote, which would work perfectly for that,
but I use Evernote Basic (the free version), and the 60 MB/month
upload cap is an order of magnitude too small.

Git isn't so much a method of avoiding redundancy, it's
more a temporal directory - if you've a Mac, then it's like
a focused time-machine that you can share with other
people - if you've used "track-changes" in windows, then
it's like that too, only not shit.

The way it works is that it maintains a set of keys based
on the path+filename of each part of your directory tree, and
for each key, a hash of the file content. If the hash
changes, or a new filename appears between one commit
and the next, Git knows to update the repository's object
store with a copy of the contents of that file. What's nice
is that it keeps a copy of the original, and can show you
the precise differences between one version and the next.
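
For what it's worth, the hash Git stores for a file's content depends only on the bytes, not on the path or filename: a blob ID is the SHA-1 of a short header plus the content. The sketch below assumes Git's default SHA-1 object format; the helper name `git_blob_id` is made up for illustration.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Compute the ID Git gives a blob: SHA-1 over 'blob <size>\\0' + content."""
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()

# The same bytes get the same ID wherever the file lives, so Git stores them once.
copy_in_folder_a = b"date,amount\n2020-01-01,10\n"
copy_in_folder_b = b"date,amount\n2020-01-01,10\n"
same_id = git_blob_id(copy_in_folder_a) == git_blob_id(copy_in_folder_b)

empty_id = git_blob_id(b"")
print(empty_id)  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```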

If anything, you get added redundancy, as you're able to
step back through time, to see the entire history of your
project folder, with annotations, revert to pre-mullered
versions of your code, and importantly, share a single
repository with lots of people.

The idea is to leverage some of the features of Git in
order to save space - based on the idea that in a given
store, there is lots of duplication.

You could do this by generating a list of filenames as the
keys in a dictionary, the values being lists of paths in
which those files are found. The first disconnect between
this and Git is that (I think) Git uses the filename (or at
least path+filename) as the unique identifier of a file - but
here, that assumption no longer applies - a file called
accounts.csv in a folder called zentom_personal_account
is best kept separate from another file called
accounts.csv in a folder called
orphans_do_not_embezzle.
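
The dictionary described above might look something like this. It is a sketch only: the function name `index_by_filename` is hypothetical, and the two folders are the ones from the example, with made-up parent paths.

```python
from collections import defaultdict
from pathlib import PurePosixPath

def index_by_filename(paths):
    """Map each filename to the list of directories in which it was found."""
    index = defaultdict(list)
    for p in paths:
        path = PurePosixPath(p)
        index[path.name].append(str(path.parent))
    return dict(index)

index = index_by_filename([
    "/disk/zentom_personal_account/accounts.csv",
    "/disk/orphans_do_not_embezzle/accounts.csv",
    "/disk/photos/cat.jpg",
])
print(index["accounts.csv"])
```

Both accounts.csv files land under one key here, which is exactly the problem: the name alone cannot tell whether they are the same file.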

You could take a measure of the content and do a
comparison such that if two files are 99% the same, and
are found in different folders, then they can be stored
using the same root, and perhaps an accessory 'delta' file
- or you could just say, if they're different, then store
them differently. But that means then that there's no
definite temporal connection you can make on filename
only.
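
One way to put a number on "99% the same" is a line-by-line similarity ratio. The sketch below uses Python's `difflib.SequenceMatcher` for that, which is a swapped-in technique rather than anything Git does; the 0.99 threshold comes from the paragraph above, and the function names are invented for illustration.

```python
from difflib import SequenceMatcher

def similarity(a_lines, b_lines):
    """Return a ratio in [0, 1]: 1.0 for identical sequences of lines."""
    # autojunk=False switches off a heuristic that can skew results on long inputs.
    return SequenceMatcher(None, a_lines, b_lines, autojunk=False).ratio()

def share_root(a_lines, b_lines, threshold=0.99):
    """Decide whether two files are close enough to store as one root plus a delta."""
    return similarity(a_lines, b_lines) >= threshold

original = [f"row {i}" for i in range(200)]
edited = list(original)
edited[57] = "row 57 (corrected)"          # one line out of 200 differs
unrelated = [f"other {i}" for i in range(200)]

close_enough = share_root(original, edited)
not_close = share_root(original, unrelated)
```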

Alternately, you could completely turn git on its head,
and build a 'tig' system that analyses the content of files,
and stores their contents as a graph of referenced nodes
that are formed from stubs of repeated content.

If I've got lots of files containing repeated boilerplate
code, the system might extract that boilerplating and
save it physically as a single unit, then any files
referencing that unit would be able to do so via a
reference. Essentially, you're re-coding the storage to
optimise for performance, based on the assumption that
there's lots of repeated content replicated across your
filesystem. That works for portions of files, but it also
works for exact copies of files as well, so if a single
picture appears in lots of different folders, it only needs
to actually be stored once, and referenced in each other
case.
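
A toy version of such a store, to make the idea concrete. It is a sketch under simplifying assumptions: files are cut into fixed-size chunks (a real system would more likely cut at content-defined boundaries so that shared text is found at any offset), everything lives in memory, and the class name `ChunkStore` is made up.

```python
import hashlib

class ChunkStore:
    """Store each unique chunk once; a file is just a list of chunk references."""

    def __init__(self, chunk_size):
        self.chunk_size = chunk_size
        self.chunks = {}   # digest -> bytes
        self.files = {}    # path -> list of digests

    def put(self, path, data):
        refs = []
        for i in range(0, len(data), self.chunk_size):
            piece = data[i:i + self.chunk_size]
            digest = hashlib.sha256(piece).hexdigest()
            self.chunks.setdefault(digest, piece)  # stored only if not seen before
            refs.append(digest)
        self.files[path] = refs

    def get(self, path):
        """Rebuild a file by following its references."""
        return b"".join(self.chunks[d] for d in self.files[path])

    def stored_bytes(self):
        return sum(len(c) for c in self.chunks.values())

boilerplate = b"# LICENSE HEADER"
file_a = boilerplate + b"alpha body"
file_b = boilerplate + b"beta body"

store = ChunkStore(chunk_size=len(boilerplate))
store.put("a.py", file_a)
store.put("b.py", file_b)
store.put("copy_of_a.py", file_a)   # an exact copy adds no new chunks

print(store.stored_bytes(), "bytes stored for", len(file_a) * 2 + len(file_b), "bytes of files")
```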

Updating files could result in a temporary new copy of
that file, but later it could be decomposed into its
component parts and referenced into the filesystem.

It's interesting, though, in that in some theories our brains
are supposed to work on this principle, with short-term
memory being like a working, high-definition, high-
density, dereferenced data object, which, after it's
finished being used, eventually gets referenced against a
longer-term memory set in which much of the content has
already been encoded - i.e. after some point between
childhood and growing up, our brains stop learning
anything new, and instead move more to a filing/curation
type role.