A lurking global problem

Some years ago a friend of mine, a positive and rather carefree person, woke up
one morning and called me for help: she was crying in despair, for she had lost her
smartphone.
It was too late and only at that moment did she realize how dreadful the
consequences were: her bank details, her contacts, her work, her government IDs,
her signature, all of it was in there and accessible to the lucky one who found her
smartphone the night before.

In hindsight, it may seem obvious and it sounds legitimate to ask: “Why put
yourself in such a fragile situation in the first place? You should have taken
some precautions, right?”

It turns out that it’s not so obvious for most of us. In fact, sit back for a minute
and ask yourself:

What if I lost my computer (or smartphone, or whichever device with personal
data on it) just now? What would be the consequences?

It’s not evident what those consequences are until they’ve been brought
to our attention. So here are a few possible things you could lose:

Your credentials (login to websites for instance).

Your money (bank account credentials, cryptocurrency wallet)

Your personal data, such as pictures and videos.

Your work data.

Your contacts.

Your conversations (emails, etc.).

And probably much more.

A loss can be classified into the following categories:

Destruction: data is gone. Question: Would you be able to precisely know
which files were lost?

Theft: someone else has your data, which means they probably have some of your
credentials, private pictures, maybe money, etc.

Unknown: you don’t know what happened to your device. As suggested above, it could have been
stolen or destroyed. But not knowing may leave you deeply uneasy about
the situation. You should always assume the worst: theft.

Eventually my friend found her phone hidden underneath her bed. A happy
ending that at least served as a good wake-up call… or did it? I’m not quite
sure she spent time working on precautions afterwards. But who can blame her? Unless you
are a techie, it is overwhelming to envision how to even get started
with those precautions.

The effort might not be worth it, so should we care at all, or simply accept the
state of things as they are?

“Not gonna happen to me!”

We would naturally think so. It’s a common psychological fallacy, and we should not fool
ourselves: no one is immune to theft or accidental damage.

In fact, hard drives are among the most failure-prone pieces of hardware. Some
day you’ll start your computer and the hard drive will be gone. Shit happens.

If you’ve got data stored in only one place, you have a single point
of failure: that is all it takes to face certain doom.

“I’m safe, my stuff is in the cloud. Or am I?”

Not so fast: what about your credentials for that “cloud?”
If your device gets stolen, how confident are you that the thief won’t have
access to your data? Even after a password change?

Is that cloud trustworthy? Who owns it? Is it in the owner’s interest
to protect your privacy?
Would you store embarrassing pictures there? Passwords? Work data?

Are you fully confident about what you’ve put in there? What if you’ve leaked a
sensitive piece of data there by mistake? Can you take it back or will it be
persisted forever on the cloud’s servers?

In this day and age of privacy concerns, cloud storage requires extra
caution: at the bare minimum you should know what you are doing and
understand the full extent of the technical implications.

“My weekly backup is enough, right?”

If you use your computer for work, which seems to be increasingly the case in
our society, a weekly backup is probably not enough.

Think about it this way: how would you feel about a week of work going to waste?
Are you ready to go through it all over again? As much as you might love your work,
this is at the very least unproductive, if not outright demotivating.
Should your work be something on the creative side (music, writing or maybe
programming?), can you be certain you would produce the same result the second
time? Will it be better or worse?

Even if you don’t work on a computer, the amount of data accumulated over a week
can be significant enough that it would cost you a lot to lose it.

Conclusion: daily backups are better.

Aftermath: The system and data recovery

When disaster strikes, you are faced with the inevitable rehabilitation phase.
It’s hard to give up on computers (or smartphones) these days, even when we are
fed up with them. We have to get back on track, and this can be an exhausting process.

Have you ever:

Spent a week re-installing and re-configuring your computer? (Or had someone
do it for you?)

Spent the same amount of time getting as much of your old data back as
possible?

Lived on with the discomforting itch of not knowing exactly what you had lost?

Even if you are not the geek kind and happily use the defaults you are provided
with, the system will inevitably be shaped to your liking over time. You
probably have some favourites, bookmarks, and credentials saved somewhere.

“Can we even do anything about it? Yes we can!”

At this point it must be apparent to most of us that those issues are enough of
a concern that we can’t just sweep them under the rug.

In this article I am going to address several possible solutions:

Backup user settings: this makes it trivial to synchronize the exact
user profile to multiple machines. In other words, it allows you to log in
to a new machine and replicate your exact working environment in one click.

Backup data offline and online. The pros, the cons, and most importantly, the
costs.

Unfortunately, this article will be more of an outline than a detailed
walk-through, since the process depends heavily on your operating system
(your choices are quite limited on Windows, for instance). More importantly, it
is an attempt at raising awareness about data protection, privacy and
user-centric control.

For these to work, there are some essential requirements:

Frictionless: if the process is cumbersome and lengthy, let’s face it, we
will procrastinate. Even a simple copy-paste to an external hard drive
becomes tiring in the long run, and we will eventually postpone it for days and
weeks.

Fast and low on resources: the process must be fast enough and light on disk
usage so that it can be run at least once a day. As we saw previously, even
weekly backups could be insufficient.

Automatable: it should be possible to have it run automatically every day.

Then the data itself must be sorted into two categories:


Public: anything that can be found “out there,” on the market. Typically
music, movies, programs, etc.
As a special case, much of your user settings can be safely marked as
“public.”

Private: your vacation pictures, your work, your credentials, etc.

The distinction matters because we are going to store some stuff online, in
which case it must be very clear: your private data must be encrypted so that
only you can access it with your private key.

Before getting started, let’s make sure we are on the same page when it comes to
basic digital security and privacy requirements.

Computer Security 101

Encryption

Regarding private data, there is something very important to
understand: if it’s stored behind a password on the cloud, it
does not mean it’s safe. It might be safe from external
attackers, but the people running the cloud service have full,
unrestricted access to it.

The only sane way to store data on an untrusted third party is
to never let your data leave your machine unencrypted. Don’t
let anyone encrypt data for you if you don’t want them to have
full access to it: you must do it yourself.

Understand that “encrypting your own data” is not an involved
process: user-friendly programs will happily do it for you.
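
For instance, GnuPG can encrypt a file locally with a passphrase before it ever leaves your machine. A minimal sketch (the file name is just an example):

```shell
# Encrypt a file with a passphrase of your choosing; only the
# passphrase holder can decrypt the result.
gpg --symmetric --cipher-algo AES256 vacation-pictures.tar
# This produces vacation-pictures.tar.gpg, which is safe to upload.

# Decrypt it back later:
gpg --decrypt --output vacation-pictures.tar vacation-pictures.tar.gpg
```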

Mobile devices

By their very nature, mobile devices are easy to lose or to have
stolen. When this happens, the thief could gain full access
to your critical data (saved passwords, contacts, bank
details, etc.). The PIN or the login password won’t protect
you much if the storage can be plugged into some other
computer. This is why storage on mobile devices should
always be fully encrypted: without the passphrase, the thief
won’t be able to see anything but binary garbage on the device.
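
On GNU/Linux, full storage encryption is typically set up with LUKS via cryptsetup. A minimal sketch, assuming a blank drive at /dev/sdX (the device name is an example, and these commands erase its contents):

```shell
# Create an encrypted LUKS container on the drive (prompts for a passphrase):
cryptsetup luksFormat /dev/sdX

# Unlock it under the name "mydrive", then put a file system inside:
cryptsetup open /dev/sdX mydrive
mkfs.btrfs /dev/mapper/mydrive

# Without the passphrase, the raw device only shows encrypted garbage.
```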

Software

None of the above matters if you cannot trust the underlying system (the
programs and the operating system). It’s crucial that those are
transparent and open enough for you to trust them. Which means
that they must be free software, open source, and reproducible. Guix
is a good example of such a system.

Password management

This animated overview by the EFF should give you a
good sense of how safe a password manager is and why you need one. As a bonus,
it makes your life easier: it lifts the burden of having to remember
legions of passwords.

Offline backups

Now that we’ve got a good understanding of the security requirements, let’s get
down to the actual issue of safeguarding our data. The most obvious and
straightforward approach is to buy multiple hard drives and duplicate the data.

It also happens to be a rather cheap approach. Renting storage online is
usually more expensive per GB.

Never rely on a single hard drive, as that would reduce your setup to a single
point of failure. Hard drive failures occur quite often, so you are better off
always acquiring hard drives in pairs (at least).

Mirroring

The most obvious way to back up your data is to copy it from one drive to the
next.

In practice, this is not ideal:

It’s too manual. It should be done automatically.

It can be a slow process. If it’s too slow, we won’t do it often. If
backups are spread too far apart, we increase the chances of a
breakdown happening days (or weeks) after the last backup, thus losing much more
data than tolerable.

The answer to this is mirroring (like RAID1): whenever a file is copied on drive
A, the computer automatically copies it on drive B.

There is a pitfall, however: if a file is removed from A, it’s also removed from
B. Removing a file by accident would make it effectively unrecoverable despite
the backup, which defeats the purpose of the whole thing.

Enter snapshots.

Snapshots

Snapshots are like a save point of the drive at some point in time.
A killer feature of snapshots is that they can be mounted like a drive and
browsed just like any other folder.
This effectively allows you to use (and compare!) multiple versions of your data
at the same time. It’s obviously possible to revert back to any snapshot, and
even branch off from them should you decide to work with multiple histories of
your data.

Snapshots are smart enough not to duplicate data, so they are very efficient
both to create (done in a matter of seconds) and to store (they require only a
tiny percentage of your overall data usage).

Say you store some movies from dear Georges Méliès in kick-ass quality:

A Trip to the Moon: 200 GB.

The Impossible Voyage: 300 GB.

You take a snapshot named “dawn”. Total disk usage would be around 500 GB,
snapshot included.

Now remove “A Trip to the Moon” and take a snapshot named “noon”. Total disk
usage would still be around 500 GB because the “dawn” snapshot is still holding
the movie.

Let’s add a new movie:

Plan 9 from Outer Space: 700 GB.

We take a new snapshot named “dusk”, and now disk usage is around 1200 GB, all 3
snapshots included.

Two movies are “visible”: The Impossible Voyage and Plan 9 from Outer Space.
But we can still go back in time and play A Trip to the Moon.

Last, we delete “dawn”: A Trip to the Moon is no longer referenced, so it is
effectively removed from the hard drive and some space is freed: we are now
using about 1000 GB.

Snapshots make it much safer to use mirroring: should you accidentally delete a
file, it can be restored from a snapshot present on both drives.

Snapshots are available only on some file systems (which is determined when you
format the hard drive). As of January 2019, good options include ZFS and
Btrfs. From there, use dedicated tools (such as the btrfs command-line tool)
to create snapshots.
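
To make this concrete, here is roughly how the “dawn” snapshot from the example above could be taken on Btrfs, assuming your data lives in a Btrfs subvolume mounted at /data (paths are examples; most of these commands require root):

```shell
# Keep snapshots in a dedicated directory:
mkdir -p /data/.snapshots

# Take a read-only snapshot of the current state, in a matter of seconds:
btrfs subvolume snapshot -r /data /data/.snapshots/dawn

# Snapshots can be browsed like any other folder:
ls /data/.snapshots/dawn

# And deleted once no longer needed, freeing any unreferenced data:
btrfs subvolume delete /data/.snapshots/dawn
```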

Note that hardware-based mirroring like RAID1 is not really necessary with ZFS
or Btrfs, both of which support mirroring themselves.
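
For example, with Btrfs the mirror is declared when formatting, no RAID controller needed (device names are examples, and this erases both drives):

```shell
# One file system spanning two drives, with both data (-d) and
# metadata (-m) mirrored across them:
mkfs.btrfs -m raid1 -d raid1 /dev/sdX /dev/sdY
```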

As of January 2019, those file systems are still only marginally used. I
think it’s a pity considering what a game changer they are: by safekeeping
the integrity of users’ data, computers suddenly become much friendlier machines!

Summary

So here we go, an ideal starting point for offline backups: two hard drives, both
formatted using a file system with snapshot support and set up for mirroring.
Once formatted, snapshots can be scheduled to run automatically, daily for
instance. Then there is nothing left to do on the user end:

It’s all automatic.

It’s fast.

It’s space efficient.

It preserves the complete history of the data: it safeguards against
accidental deletion, for instance.
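
The scheduling itself is the operating system’s job; a classic approach is a cron entry (the paths are examples, and dedicated tools such as snapper can manage this for you):

```shell
# In root's crontab (crontab -e): take a read-only snapshot of /data
# every day at 01:00, named after the current date.
# (% must be escaped in crontab entries.)
0 1 * * * btrfs subvolume snapshot -r /data /data/.snapshots/$(date +\%F)
```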

Online backups

Now what if your hard drives all burn down at the same time?
One way to cope with this is to have another computer in a remote location,
but that might not be doable for all of us, as it’s more costly.

Another, also costly solution is online storage. The great selling point of
many “cloud” solutions is that they protect you from real-life damage. That is,
assuming the cloud providers have several data-centers and they don’t all burn
down.

But you should not exclusively rely on a cloud service either. That would also
get you back to a single point of failure. What if the company shuts down?
What if they make a mistake and erase your data? What if you lose your
credentials? What if…?

Remote storage is nonetheless a great solution for extra security beyond your
local storage.

Remember however that you should not send anything unencrypted to the remote
server if it belongs to an untrusted third-party.

There are a couple of approaches here:

Synchronize your ZFS encrypted snapshots. (I’ve never done it myself, I just
assume this would work in this scenario.) As of January 2019, Btrfs does not
support snapshot encryption, so it’s a big no-no for remote
synchronization. (Let me know if I’m wrong about this.)

A dedicated backup manager (as of January 2019, BorgBackup is one of the prime
tools in the field). Backup managers work independently of your file system’s
capabilities and let you store encrypted backups remotely. Since they
support data deduplication (much like snapshots), backups scale well and
won’t occupy much more than the sum of the distinct bits of data found
across all backups.
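
As a sketch of what this looks like with BorgBackup (host name and paths are examples):

```shell
# One-time setup: create an encrypted repository on the remote server.
# All data is encrypted on your machine before it is sent.
borg init --encryption=repokey-blake2 user@backup.example.com:backups

# Daily run: deduplicated, compressed, encrypted before upload.
borg create --compression zstd \
    user@backup.example.com:backups::{now} \
    ~/documents ~/pictures

# Thin out old archives to reclaim space while keeping a useful history:
borg prune --keep-daily=7 --keep-weekly=4 --keep-monthly=12 \
    user@backup.example.com:backups
```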

File listings

At this point we’ve covered the question of safeguarding our data. Some
legitimate concerns may have arisen:

Offline and online backups are great, but admittedly come at a price. What if
we cannot afford it, or can only afford it partially, not for all the data?

In the long run, snapshots may eat up too much space if they keep track of
data that was deleted a long time ago. So it’s common to delete the older
snapshots over time to regain some disk space, but then we lose part of the
history.

What about devices without storage space, like mobile devices? What about
laptops without external hard drives?

In all those circumstances, it might not always be possible to keep track of all
our data.
But there is still some information we can preserve for very cheap: the file
listings.

A file listing is a simple text file of all the files found on the drive, one
full path per line.
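
Generating one takes a single command; here is a sketch using find (the output file name is arbitrary):

```shell
# List every file under your home directory, one full path per line:
find "$HOME" -type f > "$HOME/file-listing-$(date +%F).txt"
```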

While file listings don’t get us our data back, they at least tell us what
data we had. This can be very valuable.

Think about it: when you accidentally lose data (e.g. you lose your computer),
can you remember what you lost? Some of it, certainly, but what about the rest?
Our memory isn’t that great, and it could very well be that we are not able to
recall some important data either (as paradoxical as it may sound).

File listings rarely occupy more than a few megabytes and they are
fast to generate.

File listings can then be kept under version control, for instance in some
private repository of yours (a possibly remote storage space), preferably
encrypted. This way you’ll keep not only the list of files but also the
history of all the files you had at every point in time.

None of this should be done manually; just like snapshots, we are better off
if file listings are generated automatically, e.g. once a day.
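
Putting the pieces together, here is a hypothetical sketch of such an automated job: it regenerates the listing inside a Git repository and commits only when something changed. (The snapshot_listing name, the paths and the commit message are all illustrative.)

```shell
#!/bin/sh
# snapshot_listing REPO_DIR DATA_DIR:
# record a file listing of DATA_DIR in the Git repository at REPO_DIR.
snapshot_listing() (
    repo=$1
    data=$2
    cd "$repo" || exit 1
    find "$data" -type f > listing.txt
    git add listing.txt
    # Commit only if the listing actually changed since last time:
    git diff --cached --quiet || git commit -m "File listing for $(date +%F)"
)

# Example invocation, e.g. from a daily cron job (paths are hypothetical):
# snapshot_listing "$HOME/.file-listings" "$HOME"
```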

Reproducible user profile

Data is not everything, and backing up your user settings like regular data is
not the smartest approach. Let’s get down to it without further ado.

Versioning your user settings

User settings are everything about your environment:

Favourite programs.

All the configurations of those programs.

Keyboard shortcuts.

File shortcuts.

Accessibility configurations.

etc.

Why not back them up like regular data, one may ask? Because of a fundamental
difference: user settings are much more akin to a computer program that
glues together all your other programs. They are not static data, and thus they
benefit greatly from being transparent and reproducible.

It might not be obvious, but for the better part, those settings are far
from confidential and it’s often fine, even encouraged, to share them
publicly, like any free software.

There is a long-standing tradition among hackers of sharing their user profile
configuration, often nicknamed dotfiles. Those are often kept under version
control, with tools such as Git. You’ll find mine here :)
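
One popular technique (just one option among many) is a “bare” Git repository whose work tree is your home directory, so the dotfiles stay in place while being versioned:

```shell
# One-time setup: a bare repository, plus a shell alias to manage it.
git init --bare "$HOME/.dotfiles"
alias config='git --git-dir=$HOME/.dotfiles --work-tree=$HOME'
config config --local status.showUntrackedFiles no

# Day-to-day usage:
config add "$HOME/.bashrc"
config commit -m "Track shell configuration"
config push origin master   # assuming a remote named "origin" was added
```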

Depending on your involvement with computers, your user settings might be more
or less extensive. But even with simpler settings, it is often useful to keep
track of them under version control. Version control offers the following
perks:

Decentralized backups: it’s on all your devices, plus on all the servers where
you’ve synchronized them.

You have full control over what’s in it, what is not, what changes and the
history of changes since the beginning of time.

Version control checks data integrity at all times, giving you a
full guarantee over what you are getting. Thus it’s fully reproducible.

Private settings and data

Some of your settings might be private. In general, it’s mostly about our
personal activity on a computer, for instance:

Bookmarks, favourites.

Bucket lists, “Sticky notes.”

All sorts of notes.

Newsfeeds.

Some preferences of your web browser.

Contacts, address book.

Paperwork.

This data can be kept under version control as well, but remember Computer Security 101: encrypt the repository if it’s synchronized with an untrusted
third-party server.

It’s also possible to encrypt only the sensitive files in a repository.
For instance, you can encrypt files with GnuPG and store the resulting .gpg
files in a Git repository.
To display the history of the encrypted file and the differences between two
versions, add the following to a .gitattributes file in the Git repository:

*.gpg diff=gpg
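
The diff=gpg attribute only names a diff driver; Git still needs to be told how that driver converts a .gpg file into readable text. This is done once per repository with a textconv setting:

```shell
# Run inside the repository: let Git decrypt *.gpg files when diffing.
git config diff.gpg.textconv "gpg --decrypt --quiet"
```

After this, git diff and git log -p show plaintext differences between encrypted versions, while the stored files themselves remain encrypted.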

User profile initialization

Your user profile is not just about configuration files and data. There might
be some tasks you’d like to run to initialize your environment back to the
desired state.

Needs vary and it’s hard to fit everyone’s needs at once, so over time
I wrote a script (i.e. a small, quickly-patched-together program) that would fit
all my personal requirements:

Install the list of my programs.

Retrieve my private data.

Retrieve my password manager database.

Retrieve and install my user settings (the “dotfiles”).

Retrieve my emails.

And some other nits…

The result is the following: after a fresh installation, or the first time I log
in on a new machine, I run the script, wait a few seconds (or minutes, depending
on the Internet connection) and there it is: my exact user environment as I left
it last time I synchronized my user profile.

User profile synchronization

The user profile must also be synchronized. While the “dotfiles”
synchronization is done with the version control system, there is more to it
than that.
Again, your mileage may vary. So I wrote another script to do all the above for
me. In particular, it reports all version control repositories that are not
synchronized, so that I remember to finish and synchronize my pending work on
all projects before going to bed. This can be done automatically if need be.

A synchronization takes no more than a couple of seconds to run and can easily
be done every day, even automatically.

At this point, if I lose my computer, I’ll be able to restore an environment
matching the last synchronization, which won’t be older than a day. In my case,
the process boils down to running those scripts.

I originally wrote those scripts a long time ago with different target systems
in mind (FreeBSD and Arch Linux among others). Some requirements were:

Interpretable: I must be able to hack them in case I need to adapt something
to the system.

Retrievable over the network and verifiable. So I would host them under
version control.

Portable: they must run everywhere with no dependencies.

Idempotent: running them multiple times should produce the same result.

Lazy: only perform a task if necessary; a second run should terminate in seconds.
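
To illustrate the last two points, here is a hypothetical POSIX sh task written in that style: safe to re-run, and instantaneous when there is nothing to do (the repository URL and path are made up):

```shell
#!/bin/sh
# Idempotent, lazy task: fetch the dotfiles only if they are missing.
fetch_dotfiles() {
    if [ -d "$HOME/dotfiles/.git" ]; then
        # Goal already met: do nothing and terminate immediately.
        echo "dotfiles: already present, nothing to do"
    else
        git clone https://example.com/user/dotfiles.git "$HOME/dotfiles"
    fi
}
```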

For portability’s sake, I started off writing a POSIX shell script, since it
seems to be the only language that can be understood on almost all systems.

In hindsight, this proved to be a debatable choice as the script grew and more
complex features were added. POSIX shell is a very poor and limited
language to program in.

Today, I mostly use Guix, so portability is less of a concern.
Even then, it’s not far-fetched to ask for one tiny requirement: a widely
available interpreter. The installation process would then only involve one
more step: installing the interpreter.

I could have stuck with a much more powerful programming language. Even then,
portability would not be such an issue: Guile Scheme, for instance, is a
nonrestrictive requirement as it’s rather light and widely available. Finally,
it’s about time we broke with the tradition that the only portable scripting
language should be one of the worst. We need to move on and use better
programming languages globally.

The data frenzy: a social drift of the new millennium?

This was a long article. At this point you might wonder: “Why should we care so
much about our data anyway? Aren’t we getting too attached to technology?”

It’s a vast topic and there is probably too much to say to fit in this one
article.
So I’ll keep it to just a few points for now:

The blame does not have to be put on our attachment to data, but rather on the
setup and the infrastructure. Data attachment and data loss crises
essentially occur because user data is currently under the spotlight while
typical computer setups are extremely fragile. The social and psychological
question of data attachment would be mostly moot if backup technology
and users’ control over their data were appropriate to its level of importance.

This article is not about the effort every user should make; it’s about how
vendors should set up their products so that everything is ready for
backup-and-control out of the box.

We don’t have to be attached to our data. Having the possibility to
control it and to rely on it is a different thing. I believe we should all
have the right either to ignore our data or to depend on it. It should be our
own decision, hence the importance of user-centric control.

Data increasingly reflects power. When external entities own our data
(even part of it), such as corporations with poor incentives to stand up for us,
our democratic rights are threatened on the political level, and our
individuality on the social level. If we are the sole and full proprietors of
our own data, we remain in control and can stand strong as first-class citizens
and individuals. Whether we are data-craving techies or not, society
is making a choice here, and we need to enforce our rights as its members,
lest we lose our place in it.

On a more abstract level, data can be seen as a form of human consciousness
expansion. Our brain is limited and can store only so much information.

There was a time when mankind was little aware of notions such as freedom and
choice. The philosophy of individualism is a rather recent development.
User data could be just another form of human evolution, that of memory
expansion. It might be hard to foresee the benefits at this early stage, but so
it certainly was when the Enlightenment philosophers were devising the ideas of
individualism. Time will tell, I suppose.

An interesting experiment is that of Facebook data, for those who’ve tried
the social network for a couple of months or years. Facebook allows its users
to download an archive of a collection of the data that Facebook has gathered
about them since they created their account. (Note that it’s almost certain that
lots of data is missing from that archive, and Facebook knows way more.) Going
through the archive for five minutes will give you a look back at your own self
from months and years ago, to a level of detail you would not be capable of
digging out from your memory alone. Yes, to some extent, social networks like
Facebook might know more about you than you do yourself.

The Internet-connected society is growing to become an entity that knows more
about human beings than themselves. If we as individuals don’t want to be
overwhelmed and overtaken in this play of power, we might need to extend our
capabilities and what defines us to something that can safeguard our
power against this societal paradigm shift.

Future work

I believe the setup I’ve presented in this article provides some definite
benefits, and yet there is much left to improve. In particular when it comes
to universal accessibility.

Now let’s dream on a little bit and munch over some crazy ideas.

(Don’t hesitate to let me know if this is nowhere close to feasible, or, on the
contrary, if it’s already done or close to being achievable.)

First of all, it’s quite clear today that many people don’t like to have to
bother with data storage. The “cloud” is such an attractive concept, it would
be really nice if we could use it without its privacy-infringing pitfalls.

So if we really want to go in that direction, there are a few requirements:

It should all be encrypted. So regular users must properly learn about
authentication systems and understand what it means to keep a secret key
secret, for real.

It should be distributed, which means there would be no single point of
failure, anywhere in the world. User data should not be censored, blocked or
removed without the user’s consent.

Free and huge (unlimited?) storage space. Paying for data storage poses a
threat to social equality, as richer people would have the possibility to
store more data and thus have a more extensive “memory,” if not individuality.

In their talk, the IPFS team shows how the disk-space-to-cost ratio has
increased more rapidly than the bandwidth-to-cost ratio. This could be
interpreted to mean that if we all
shared our storage space in a storage pool distributed over the Internet, we
could simulate a seemingly infinite storage space available for everyone to use
(with smart space optimizations like data deduplication and compression).

IPFS is a prime implementation of this, but some pieces of the puzzle are still
missing. For one, the incentive for every user to make their storage space
available for everyone to use. Should we work out such a system, we would
basically re-create Silicon Valley’s Pied Piper, where our data is everywhere and
nowhere at the same time, and there would be no more need for a “Download”
button!