From be9a84d7042541363a5c1b57c5f18b23ca02cde8 Mon Sep 17 00:00:00 2001
From: Torsten Grote <t@grobox.de>
Date: Thu, 4 Mar 2021 16:04:15 -0300
Subject: [PATCH] Add storage design document

---
 storage/doc/design.md | 402 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 402 insertions(+)
 create mode 100644 storage/doc/design.md

diff --git a/storage/doc/design.md b/storage/doc/design.md
new file mode 100644
index 00000000..59e149c9
--- /dev/null
+++ b/storage/doc/design.md
@@ -0,0 +1,402 @@
+# Overview
+
+This is a design document for Seedvault Storage backup.
+It is heavily inspired by borgbackup, but simplified and adapted to the Android context.
+
+The aim is to efficiently backup media files from Android's `MediaStore`
+and other files from external storage.
+Apps and their data are explicitly out of scope
+as this is handled already by Seedvault via the Android backup system.
+Techniques introduced here might be applied to app backups in the future.
+
+## Terminology
+
+A **backup snapshot** (or short backup) represents a collection of files at one point in time.
+Making a backup creates such a snapshot and writes it to **backup storage**
+which is an abstract location to save files to (e.g. flash drive or cloud storage).
+Technically the backup snapshot is a file containing metadata
+about the backup such as the included files.
+A **backup run** is the process of backing up files i.e. making a backup.
+
+Large files are split into **chunks** (smaller pieces) by a **chunker**.
+Small files are combined to **zip chunks**.
+
+File information is cached locally in the **files cache** to speed up operations.
+There is also the **chunks cache** to cache information about available chunks.
+
+# Operations
+
+## Making a backup
+
+A backup run is ideally (automatically) triggered when
+
+* the device is charging and connected to an un-metered network in case network storage is used
+* a storage medium is plugged in and the user confirmed the run in case removable storage is used
+
+Files to be backed up are listed based on the user's preference
+using Android's `MediaProvider` and `ExternalStorageProvider`.
+Tests on real world devices have shown ~200ms list times for `MediaProvider`
+and `~10sec` for *all* of `ExternalStorageProvider`
+(which is unlikely to happen, because the entire storage volume cannot be selected on Android 11).
+
+All files will be processed with every backup run.
+If a file is found in the cache, it is checked
+if its content-modification-indicating attributes have not been modified
+and all its chunks are still present in the backup storage.
+We might be able to speed up the latter check by initially retrieving a list of all chunks.
+
+For present unchanged files, an entry will be added to the backup snapshot
+and the TTL in the files cache updated.
+If a file is not found in cache an entry will be added for it.
+New and modified files will be put through a chunker
+which splits up larger files into smaller chunks.
+Initially, the chunker might just return a single chunk,
+the file itself, to simplify the operation.
+
+A chunk is hashed (with a key / MACed),
+then (compressed and) encrypted (with authentication) and written to backup storage,
+if it is not already present.
+New chunks get added to the chunks cache.
+Only after the backup has completed and the backup snapshot was written,
+the reference counters of the included chunks will be incremented.
+
+When all chunks of a file have either been written or were present already,
+the file metadata is added to the backup snapshot with its list of chunk IDs and other metadata.
+
+When all files have been processed, the backup snapshot is finalized
+and written (encrypted) to storage.
+
+If the backup fails, a new run is attempted at the next opportunity creating a new backup snapshot.
+Chunks uploaded during the failed run should still be available in backup storage
+and the cache with reference count `0` providing an auto-resume.
+
+After a successful backup, chunks that still have reference count `0`
+can be deleted from storage and cache without risking to delete chunks that will be needed later.
+
+## Removing old backups
+
+Ideally, the user can decide how many backups should be kept based on available storage capacity.
+These could be a number in the yearly/monthly/weekly/daily categories.
+However, initially, we might simply auto-prune backups older than a month,
+if there have been at least 3 backups within that month (or some similar scheme).
+
+After doing a successful backup run, is a good time to prune old backups.
+To determine which backups to delete, the backup snapshots need to be downloaded and inspected.
+Maybe their file name can be the `timeStart` timestamp to help with that task.
+If a backup is selected for deletion, the reference counter of all included chunks is decremented.
+The backup snapshot file and chunks with reference count of `0` are deleted from storage.
+
+## Restoring from backup
+
+When the user wishes to restore a backup, they select the backup snapshot that should be used.
+The selection can be done based on time and name.
+We go through the list of files in the snapshot,
+download, authenticate, decrypt (and decompress) each chunk of the file
+and re-assemble the file this way.
+Once we have the original chunk,
+we might want to re-calculate the chunk ID to check if it is as expected
+to prevent an attacker from swapping chunks.
+This could also be achieved by including the chunk ID
+in the associated data of the authenticated encryption (AEAD).
+The re-assembled file will be placed into the same directory under the same name
+with its attributes (e.g. lastModified) restored as much as possible on Android.
+
+Restoring to storage that is already in use is not supported.
+However, if a file already exists with the that name and path,
+we check if the file is identical to the one we want to restore
+(by relying on file metadata or re-computing chunk IDs)
+and move to the next if it is indeed identical.
+If it is not identical, we could rely on Android's Storage Access Framework
+to automatically give it a `(1)` suffix when writing it to disk.
+Normally, restores are expected to happen to a clean file system anyway.
+
+However, if a restore fails, the above behavior should give us a seamless auto-resume experience.
+The user can re-try the restore and it will quickly skip already restored files
+and continue to download the ones that are still missing.
+
+After all files have been written to a directory,
+we might want to attempt to restore its metadata (and flags?) as well.
+
+
+# Cryptography
+
+The goal here is to be as simple as possible while still being secure
+meaning that we want to primarily conceal the content of the backed up files.
+Certain trade-offs have to be made though,
+so that for now we do not attempt to hide file sizes.
+E.g. an attacker with access to the backup storage might be able to infer
+that the Snowden files are part of our backup.
+We do however encrypt file names and paths.
+
+## Master Key
+
+Seedvault already uses [BIP39](https://github.com/bitcoin/bips/blob/master/bip-0039.mediawiki)
+to give users a mnemonic recovery code and for generating deterministic keys.
+The derived key has 512 bits
+and Seedvault uses the first 256 bits as an AES key to encrypt app data (out of scope here).
+This key's usage is limited by Android for encryption and decryption.
+Therefore, the second 256 bits will be imported into Android's keystore for use with HMAC-SHA256,
+so that this key can act as a master key we can deterministically derive additional keys from
+by using HKDF ([RFC5869](https://tools.ietf.org/html/rfc5869)).
+These second 256 bits must not be used for any other purpose in the future.
+We use them for a master key to avoid users having to handle another secret.
+
+For deriving keys, we are only using the HKDF's second 'expand' step,
+because the Android Keystore does not give us access
+to the key's byte representation (required for first 'extract' step) after importing it.
+This should be fine as the input key material is already a cryptographically strong key
+(see section 3.3 of RFC 5869 above).
+
+## Choice of primitives
+
+AES-GCM and SHA256 have been chosen,
+because [both are hardware accelerated](https://en.wikichip.org/wiki/arm/armv8#ARMv8_Extensions_and_Processor_Features)
+on 64-bit ARMv8 CPUs that are used in modern phones.
+Our own tests against Java implementations of Blake2s, Blake3 and ChaCha20-Poly1305
+have confirmed that these indeed offer worse performance by a few factors.
+C implementations via JNI have not been evaluated though
+due to difficulties of building those as part of AOSP.
+
+## Chunk ID calculation
+
+We use a keyed hash instead of a normal hash for calculating the chunk ID
+to not leak the file content via the public hash.
+Using HMAC-SHA256 directly with the master key in Android's key store
+resulted in terrible throughput of around 4 MB/sec.
+Java implementations of Blake2s and Blake3 performed better,
+but by far the best performance gave HMAC-SHA256
+with a key we can hold the byte representation for in memory.
+
+Therefore, we suggest to derive a dedicated key for chunk ID calculation from the master key
+and keep it in memory for as long as we need it.
+If an attacker is able to read our memory,
+they have access to the entire device anyway
+and there's no point anymore in protecting content indicators such as chunk hashes.
+
+To derive the chunk ID calculation key, we use HKDF's expand step
+with the UTF-8 byte representation of "Chunk ID calculation" as info input.
+
+## Stream Encryption
+
+When a stream is written to backup storage,
+it starts with a header consisting of a single byte indicating the backup format version
+followed by the encrypted payload.
+
+Each chunk and backup snapshot written to backup storage will be encrypted with a fresh key
+to prevent issues with nonce/IV re-use of a single key.
+Similar to the chunk ID calculation key above, we derive a stream key from the master key
+by using HKDF's expand step with the UTF-8 byte representation of "stream key" as info input.
+This stream key is then used to derive a new key for each stream.
+
+Instead of encrypting, authenticating and segmenting a cleartext stream ourselves,
+we have chosen to employ the [tink library](https://github.com/google/tink) for that task.
+Since it does not allow us to work with imported or derived keys,
+we are only using its [AesGcmHkdfStreaming](https://google.github.io/tink/javadoc/tink-android/1.5.0/index.html?com/google/crypto/tink/subtle/AesGcmHkdfStreaming.html)
+to delegate encryption and decryption of byte streams.
+This follows the OAE2 definition as proposed in the paper
+"Online Authenticated-Encryption and its Nonce-Reuse Misuse-Resistance"
+([PDF](https://eprint.iacr.org/2015/189.pdf)).
+
+It adds its own 40 byte header consisting of header length (1 byte), salt and nonce prefix.
+Then it adds one or more segments, each up to 1 MB in size.
+All segments are encrypted with a fresh key that is derived by using HKDF
+on our stream key with another internal random salt (32 bytes) and associated data as info
+([documentation](https://github.com/google/tink/blob/master/docs/WIRE-FORMAT.md#streaming-encryption)).
+
+When writing files/chunks to backup storage,
+the authenticated associated data (AAD) will contain the backup version as the first byte
+(to prevent downgrade attacks)
+followed by a second type byte depending on the type of file written:
+
+* chunks: `0x00` as type byte and then the chunk ID
+* backup snapshots: `0x01` as type byte and then the backup snapshot timestamp
+
+The chunk ID and the backup snapshot timestamp get added
+to prevent an attacker from renaming and swapping files/chunks.
+
+# Data structures
+
+## Local caches
+
+### Files cache
+
+This cache is needed to quickly look up if a file has changed and if we have all of its chunks.
+It will probably be implemented as a sqlite-based Room database
+which has shown promising performance in early tests.
+
+Contents:
+
+* URI (stripped by scheme and authority?) (`String` with index for fast lookups)
+* file size (`Long`)
+* last modified in milliseconds (`Long`)
+* generation modified (MediaStore only) (`Long`)
+* list of chunk IDs representing the file's contents
+* zip index in case this file is inside a single zip chunk (`Integer`)
+* last seen in milliseconds (`Long`)
+
+If the file's size, last modified timestamp (and generation) is still the same,
+it is considered to not have changed.
+In that case, we check that all file content chunks are (still) present in storage.
+
+If the file has not changed and all chunks are present,
+the file is not read/chunked/hashed again.
+Only file metadata is added to the backup snapshot.
+
+As the cache grows over time, we need a way to evict files eventually.
+This could happen by checking a last seen timestamp or by using a TTL counter
+or maybe even a boolean flag that gets checked after a successful run over all files.
+A flag might not be ideal if the user adds/removes folder as backup targets.
+Current preference is for using a last seen timestamp.
+
+The files cache is local only and will not be included in the backup.
+After restoring from backup the cache needs to get repopulated on the next backup run
+after re-generating the chunks cache.
+The URIs of the restored files will most likely differ from the backed up ones.
+When the `MediaStore` version changes,
+the chunk IDs of all files will need to get recalculated as well.
+
+### Chunks cache
+
+This is used to determine whether we already have a chunk,
+to count references to it and also for statistics.
+
+It could be implemented as a table in the same database as the files cache.
+
+* chunk ID (hex representation of the chunk's MAC)
+* reference count
+* size
+
+If the reference count of a chunk reaches `0`,
+we can delete it from storage (after a successful backup)
+as it isn't used by a backup anymore.
+
+References are only stored in this local chunks cache.
+If the cache is lost (or not available after restoring),
+it can be repopulated by inspecting all backup snapshots
+and setting the reference count to the number of backup snapshots a chunk is referenced from.
+
+When making a backup run and hitting the files cache,
+we need to check that all chunks are still available on storage.
+
+## Remote Files
+
+All types of files written to backup storage have the following format:
+
+    ┏━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+    ┃         ┃ tink payload (with 40 bytes header)                          ┃
+    ┃ version ┃ ┏━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓ ┃
+    ┃  byte   ┃ ┃ header length ┃ salt ┃ nonce prefix ┃ encrypted segments ┃ ┃
+    ┃         ┃ ┗━━━━━━━━━━━━━━━┻━━━━━━┻━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━━┛ ┃
+    ┗━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
+
+### Backup Snapshot
+
+The backup snapshot contains metadata about a single backup
+and is written to the storage after a successful backup run.
+
+* version - the backup version
+* name - a name of the backup
+* media files - a list of all `MediaStore` files in this backup
+  * media type (enum: images, video, audio or downloads)
+  * name (string)
+  * relative path (string)
+  * last modified timestamp (long)
+  * owner package name (string)
+  * is favorite (boolean)
+  * file size (long)
+  * storage volume (string)
+  * ordered list of chunk IDs (to re-assemble the file)
+  * zip index (int)
+* document files - a list of all document files from external storage in this backup
+  * name (string)
+  * relative path (string)
+  * last modified timestamp (long)
+  * file size (long)
+  * storage volume (string)
+  * ordered list of chunk IDs (to re-assemble the file)
+  * zip index (int)
+* total size - sum of the size of all files, for stats
+* timeStart - when the backup run was started
+* timeEnd - when the backup run was finished
+
+All backup snapshots are stored in the root folder.
+The filename is the timeStart timestamp.
+
+### Chunks
+
+The encrypted payload of chunks is just the chunk data itself.
+All chunks are stored in one of 256 sub-folders
+representing the first byte of the chunk ID encoded as a hex string.
+The file name is the chunk ID encoded as a (lower-case) hex string.
+This is similar to how git stores its repository objects
+and to avoid having to store all chunks in a single directory.
+
+### Zip chunks
+
+Transferring many very small files causes a substantial overhead
+when transferring them to the storage medium.
+It would be nice to avoid that.
+Michael Rogers proposed the following idea to address this.
+
+A chunk can either be part of a large file, all of a medium-sized file,
+or a (deterministic) zip containing multiple small files.
+When creating a backup, we sort the files in the small category by last modification
+and pack as many files into each chunk as we can.
+Each small file will be stored in the zip chunk under some artificial name
+that is unique within the scope of the zip chunk like a counter.
+The path to unique name mapping will be stored in the backup snapshot (zip index).
+If a small file is inside a zip chunk,
+that chunk ID will be listed as the only chunk of the file in the backup snapshot
+and likewise for any other files inside that chunk.
+
+When creating the next backup, if none of the small files have changed,
+we just increase the ref count on the existing chunk.
+If some of them have changed, they will be added to a new zip chunk
+together with other new/changed small files.
+
+When fetching a chunk for restore, we know in advance whether it is a zip chunk,
+because the file we need it for contains the zip index,
+so we will not confuse it with a medium-sized zip file.
+Then we unzip the zip chunk and extract the file by its zip index (name).
+
+# Out-of-Scope
+
+The following features would be nice to have,
+but are considered out-of-scope of the current design for time and budget reasons.
+
+* compression (we initially assume that most files are already sufficiently compressed)
+* packing several smaller files into larger combined chunks to improve transfer efficiency
+* using a rolling hash to produce chunks in order to increase likelihood of obtaining same chunks
+  even if file contents change slightly or shift
+* external secret-less corruption checks that would use checksums over encrypted data
+* supporting different backup clients backing up to the same storage
+* concealing file sizes (though zip chunks helps a bit here)
+
+# Known issues
+
+## Changes to files can not be detected reliably
+
+Changes can be detected using file size and lastModified timestamps.
+These have only a precision of seconds,
+so we can't detect a changes happening within a second of a first change.
+Also other apps can reset the lastModified timestamp
+preventing us from registering a change if the file size doesn't change.
+On Android 11, media files have a generation counter that gets incremented when files changes
+to help with this issue.
+However, files on external storage still don't have anything similar
+and usually also don't trigger `ContentObserver` notifications.
+
+# Acknowledgements
+
+The following individuals have reviewed this document and provided helpful feedback.
+
+* Demi M. Obenour
+* Chirayu Desai
+* Kevin Niehage
+* Michael Rogers
+* Thomas Waldmann
+* Tom Hacohen
+
+As they have reviewed different parts and different versions at different times,
+this acknowledgement should not be mistaken for their endorsement of the current design
+or the final implementation.