# Archiving (not backup)



## rfdesigner (Jun 6, 2017)

Hi

So I have too much data, and I've just started playing with video.. so I need to archive the old stuff to release space.

I could work out what I need to do from scratch.. but I thought I'd ask what others do?

I've already cleaned out my "lesser" stuff and I've googled archiving.. and it's a whole new set of problems.

A while back someone here referred to backups as 3-2-1 (3 copies, 2 types, one off site) which is how I now do things.

I wondered if anyone had a similarly simple and effective approach to archives.

Thanks in advance.


----------



## neuroanatomist (Jun 6, 2017)

I don't currently archive, but I'm ready. For RAW images and home movies, I maintain _five_ copies (laptop + 4). I have a pair of HDDs at home and another pair at work. One of each pair is a clone of my laptop SSD, the other gets just the RAWs and movies. The clones are maintained with SuperDuper, the RAWs/movies I just copy manually. 

My RAW and video libraries are organized by year. Once my SSD gets to the point where I need to free up space, I will delete the oldest 1-2 years of RAWs/movies from my laptop. At the next clone update, those files will be deleted from the backups as well, but they'll remain as archive copies on two external drives in separate locations. 
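A rough sketch of the "free up the oldest years" step, assuming a hypothetical layout with one four-digit folder per year (e.g. `2012/`) in the library and on each archive drive. The function name and layout are illustrative only and have nothing to do with SuperDuper itself. It lists the oldest year folders and keeps only those that also exist on every archive drive:

```python
from __future__ import annotations

from pathlib import Path


def oldest_years_safe_to_remove(
    library: Path, archives: list[Path], count: int = 2
) -> list[str]:
    """Return those of the `count` oldest year folders in `library`
    that are also present on every archive drive.

    Assumed (hypothetical) layout: one folder per year, named with
    four digits, directly under `library` and under each archive root.
    """
    years = sorted(
        p.name
        for p in library.iterdir()
        if p.is_dir() and p.name.isdigit() and len(p.name) == 4
    )
    safe = []
    for year in years[:count]:
        # Existence check only: this does not prove the archived files
        # are complete or uncorrupted.
        if all((drive / year).is_dir() for drive in archives):
            safe.append(year)
    return safe
```

This only checks that the folder exists on both archive drives, so it is a guard against deleting something that was never copied, not a substitute for verifying the copies.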

I came close to needing to archive a couple of years ago, but I swapped my laptop's 500 GB HDD for a 1 TB SSD and bought some time (and a big performance boost – my laptop is a late 2011 17" MacBook Pro, and it still runs fine).


----------



## mnclayshooter (Jun 6, 2017)

In terms of what we do in our corporate office - terminology is the key in this discussion:

Backups are just that - disaster recovery - incremental - cover you for some set period(s) and then are expunged to make room for new backups. These are for the immediate recovery of accidentally deleted files, lightning strike, HD crash etc. The risk of losing a backup is low, but we keep multiple incremental backups, just in case.

Archives are files that are old and are kept for "archival" purposes (imagine the audacity of us to call them archives - ha!). We maintain a main archive that is accessible if need be (think read-only, RAID-configured NAS that is kept online - except super-sized for corporate use) - it's slower than the main network storage, but it is rarely accessed, so that's ok. It is also backed up via offline copy (duplicate copies of array disks) and, for good measure, it's also archived and stored in an off-site facility. We don't typically mix media too much, however some of the archive is old enough that it is on tapes and optical disks. Modernization of them hasn't been a big priority.

For home use, you could pretty easily (and cheaply) keep a RAID 1 NAS for your archive. The Drobo or something like it (WD, SIIG, etc NAS) can store a nearly obscene amount of data in redundant form. If you're concerned, you can get a second one and have them back each other up. That should free up quite a bit of space on your working drive(s). You could, in theory, then also power down the NAS(es) once you're satisfied that you've made sufficient redundant archives and only power it on when you need to access it... that will greatly reduce potential for spindle/HD mechanical failure. Take the drives out, and put one in a safe deposit box at the bank or your office etc.. and keep the other at home for use when needed. Set a calendar appt to make new updates annually, quarterly, monthly etc as needed for new archived stuff. 

Just my two cents.


----------



## Mt Spokane Photography (Jun 7, 2017)

There really is no reliable archiving system for digital data. Certainly, a hard drive or SSD is not going to be reliable to retrieve files 50 or 100 years from now. M-DISCs have had some adherents who say they are good for 1000 years, but they are extremely expensive. 4 TB of M-DISCs could cost around $480.00!


I certainly would not keep archived files on an operating NAS; a major electrical event like a nearby lightning strike could blow all the hard drives. Keep two copies on different types of media stored in different places, don't store them in a hot garage or rental storage, and religiously transfer them to new media every few years.

In many ways, having high quality prints made to store is not a bad idea, they can last well over 100 years.

This article is one of the better ones I've read. http://www.pcworld.com/article/2984597/storage/hard-core-data-preservation-the-best-media-and-methods-for-archiving-your-data.html

So far, I have images stored on multiple hard drives, prints, and a limited number of optical disks. I'm not very prepared!


----------



## sanj (Jun 7, 2017)

What is "2 types"?


----------



## Orangutan (Jun 7, 2017)

As with backups, the important first step is to recognize the specific purpose of your archive scheme. With backups, the breakdown is usually something like this:

Risk 1: Corruption or accidental deletion of a few files
Remedy 1: Frequent backups to an easily accessible device, with multiple versions going back in time to avoid locking in an error. Examples include Time Machine on Mac or Shadow Copy on Windows

Risk 2: Corruption, accidental deletion or theft of an entire disk
Remedy 2: Regular full backups to off-site location to guard against theft, fire, etc; this would include backups of your Remedy 1 backup.


Archiving assumes you've deleted the files from your working storage, so you still need Remedy 2 for your archival media. In addition, though, you also need to guard against long-term corruption of your storage medium. Magnetic and optical media suffer, to various degrees, from what is playfully called "bit rot." I.e., the media degrades, destroying data. So that brings you two new risks:

Risk 3: Corruption of individual bits of a medium; i.e., files may become corrupt in a way that's not obvious until you try to use them.
Remedy 3: Use a storage medium that incorporates validation (CRC, etc) to detect degradation, and store multiple copies in at least two different locations. Simply having multiple copies doesn't necessarily help a lot because, without validation, it's a manual process to decide which of your copies is valid.

Risk 4: Obsolescence of the media
Remedy 4: Periodically upgrade your archival storage to new media.

There are lots of archive options out there, just make sure the one you choose addresses all your needs.
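For Remedy 3, one way to get validation on plain drives is a checksum manifest kept next to each archive copy. The following is a minimal sketch, not a finished tool: it assumes SHA-256 is an acceptable checksum, and the function names (`build_manifest`, `verify`) are made up for illustration. Build the manifest when you write the archive, then re-run the verify step on each copy from time to time.

```python
from __future__ import annotations

import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large RAW/video files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return the relative paths that are missing or no longer match."""
    bad = []
    for rel, expected in manifest.items():
        candidate = root / rel
        if not candidate.is_file() or sha256_of(candidate) != expected:
            bad.append(rel)
    return bad
```

With a manifest per copy, deciding which of several copies is still valid stops being a manual process: any file that `verify` reports on one drive can be restored from a drive where it still matches.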


----------



## LDS (Jun 7, 2017)

rfdesigner said:


> I wondered if anyone had a similarly simple and effective approach to archives.



If you can use a "cloud" service, take a look at Amazon Glacier (part of AWS). Remember it's optimized for archiving, so you need to know how retrieval works, and how much it costs:

https://aws.amazon.com/glacier/faqs/#dataretrievals


----------



## mnclayshooter (Jun 7, 2017)

In reality - at roughly $0.03/GB for the now average 4TB drive, there's really not any better bang for the buck in storage. How big is your archive? 10000 RAWs and 20000 jpgs or something to that effect? That's barely 500GB. You can buy a 500GB drive for $40 every 6 months and back up your entire archive over and over and over again. You could simply start a shoebox full of annual archives and have quite literally tons (weight) of redundancy. 
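To put those numbers side by side, here is a small back-of-the-envelope sketch. It assumes decimal units (1 TB = 1000 GB) and takes the prices exactly as quoted in this thread, including the $480 for 4 TB of M-Discs mentioned earlier; these are 2017 forum figures, not current quotes.

```python
GB_PER_TB = 1000  # assumption: decimal units, as drive capacities are advertised


def cost_per_gb(price_usd: float, capacity_gb: float) -> float:
    """Price divided by capacity, in dollars per gigabyte."""
    return price_usd / capacity_gb


# Figures as quoted in the thread.
price_4tb_drive = 0.03 * 4 * GB_PER_TB        # implied price of a 4TB drive
rate_500gb_drive = cost_per_gb(40, 500)       # the $40, 500GB drive
rate_mdisc = cost_per_gb(480, 4 * GB_PER_TB)  # 4TB of M-Discs at $480
yearly_spend = 2 * 40                         # a new $40 drive every 6 months

print(f"4TB drive: about ${price_4tb_drive:.0f}")
print(f"500GB drive: ${rate_500gb_drive:.2f}/GB")
print(f"M-Disc: ${rate_mdisc:.2f}/GB")
print(f"Shoebox scheme: ${yearly_spend} per year")
```

Note that the small 500GB drive is more expensive per GB than the 4TB one, so the "shoebox" approach trades some cost efficiency for having many independent copies.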

For a home/hobby/casual user: The beauty of using a RAID device - be it a HD dock, hot-swappable drive bays, or a NAS with RAID is that the device (and the hardware associated) is doing the dirty work of checksums to verify that the data is complete and that bit rot or other mechanical issues with the drive itself are not occurring, if you've bought a decent system. I'd be shocked, to be honest, if the system being used by the big-players in cloud storage aren't using pretty simple raid systems and are simply hiring trained monkeys to watch the status of the drives day in and day out and replacing them as needed. 

In the course of 30+ years of using computers, I've never had an issue where I was confined to one single storage media type at a single point in time, nor is the pace of advancement so fast that I can't build an archive on one type of media and transition it in its entirety to new types as they become available. Spinning HDs are so common that there will assuredly be compatibility overlap with any new future media type for quite a long time, so you shouldn't get trapped by using them. Something more proprietary (zip disk anyone?) may not be so lucky.

Probably the more important aspect of archiving is the ability to use the files you've saved. JPG, TIFF, even DNG/RAW are so ubiquitous that I can't imagine that they won't have readers/editors in the future. Simply put, too many people have them and they are de facto standards. Something more proprietary like a .LR file (Lightroom) may not be as universal, and there's higher potential for a proprietary format to become abandoned.

My advice: stick to commonly available devices - if there's only one brand and one model, stay away from it, replace them often (don't wait for them to die), use commonly available format for your files, keep regular tabs on and consistent updates for the archives you maintain... keep incremental backups for the working files you have and you likely won't ever experience any problems and probably won't be out a lot of money or heartache in the end.


----------



## LDS (Jun 7, 2017)

mnclayshooter said:


> For a home/hobby/casual user: The beauty of using a RAID device - be it a HD dock, hot-swappable drive bays, or a NAS with RAID is that the device (and the hardware associated) is doing the dirty work of checksums to verify that the data is complete and that bit rot or other mechanical issues with the drive itself are not occurring,



Be careful with hardware RAID. The disks may not be readable by a different controller. The RAID controller is also a point of failure. Software "raid" is usually better, because as long as you use the same OS, it will recognize its format.

RAID may not protect you from "bit rot" (see, for example https://arstechnica.com/information-technology/2014/01/bitrot-and-atomic-cows-inside-next-gen-filesystems/). RAID is for availability, not long-term protection of the data.

Usually archiving is made creating full separate copies of the data (stored also separately), not just one copy and some information to rebuild a missing part. The separate copies may have additional information to check for corruption and fix errors (this can be done by software, no need to be directly in hardware).

What is important with archival media is minimizing wear (you don't reuse them over and over) and ensuring correct storage to minimize environment-induced errors.
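As a sketch of what "additional information to check for corruption and fix errors" can look like in software: assume a SHA-256 digest of each file was recorded when the archive was made, and that you have two or more full copies of the file on separate media. The function name `repair_copies` is hypothetical. Any copy that no longer matches the recorded digest is overwritten from one that still does.

```python
from __future__ import annotations

import hashlib
import shutil
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def repair_copies(copies: list[Path], expected: str) -> list[Path]:
    """Overwrite every copy that is missing or whose digest differs from
    `expected` with a copy that still matches. Returns the repaired paths.

    Raises ValueError when no copy matches, because then there is nothing
    trustworthy to repair from.
    """
    good = [p for p in copies if p.is_file() and file_digest(p) == expected]
    if not good:
        raise ValueError("no intact copy left; cannot repair")
    repaired = []
    for p in copies:
        if p not in good:
            # The destination's parent folder is assumed to exist already.
            shutil.copyfile(good[0], p)
            repaired.append(p)
    return repaired
```

This only repairs at whole-file granularity. Real archive formats and checksumming filesystems work on blocks and can also carry error-correcting data, which this sketch does not attempt.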



mnclayshooter said:


> I'd be shocked, to be honest, if the system being used by the big-players in cloud storage aren't using pretty simple raid systems and are simply hiring trained monkeys to watch the status of the drives day in and day out and replacing them as needed.



Large storage systems are quite complex today; they employ several technologies to obtain enough redundancy, and replicate across different sites to ensure resiliency. The actual storage disks may be quite simple (although they use thousands and thousands of disks and other media, and some legal requirements may dictate WORM media), but they have several layers of software to manage the data written and read. They may use RAID-like techniques, but they have to cope with very large arrays, which also increase the probability of disk failures and bad writes.

Here, for example, is Facebook's "cold storage" architecture:

https://code.facebook.com/posts/1433093613662262/-under-the-hood-facebook-s-cold-storage-system-/?_fb_noscript=1


----------



## hne (Jun 7, 2017)

mnclayshooter said:


> In reality - at roughly $0.03/GB for the now average 4TB drive, there's really not any better bang for the buck in storage. How big is your archive? 10000 RAWs and 20000 jpgs or something to that effect? That's barely 500GB. You can buy a 500GB drive for $40 every 6 months and back up your entire archive over and over and over again. You could simply start a shoebox full of annual archives and have quite literally tons (weight) of redundancy.
> 
> For a home/hobby/casual user: The beauty of using a RAID device - be it a HD dock, hot-swappable drive bays, or a NAS with RAID is that the device (and the hardware associated) is doing the dirty work of checksums to verify that the data is complete and that bit rot or other mechanical issues with the drive itself are not occurring, if you've bought a decent system. I'd be shocked, to be honest, if the system being used by the big-players in cloud storage aren't using pretty simple raid systems and are simply hiring trained monkeys to watch the status of the drives day in and day out and replacing them as needed.
> 
> ...



I've got a bit of work damage to my brain, having worked with the design, development and delivery of media archive systems for the past decade. You kind of stepped on my pet peeve. You've got a few things wrong and quite a few things delightfully right.

Correct observations:

- Your average home user could easily keep shoe boxes full of hard drives for a relatively modest cost.
- You are very likely to always be able to keep reading your last-gen storage to copy the contents over to current-generation storage technology. At least as long as you're quick enough realising you might need to switch storage medium (not hold on to them zip disks for a decade before trying to do the switch). This is built into most commercial big-ticket archive systems, like tape robot controlling solutions.
- File formats are often possible to read for decades. As long as you store a lossless copy of everything important and make sure you have at least one machine with software installed/installable to read it. This can fail if you auto-upgrade cloud subscription software like Lightroom and they silently drop support for your old cameras' raw formats. Keep an eye open!

Incorrect parts:

- You might be shocked, but the storage used by the big cloud players (or even most in-house storage solutions above 100TB at the world's larger media houses) is NOT "pretty simple raid systems". See this for a few examples of the level of engineering that goes into cloud operations these days: https://www.google.com/about/datacenters/inside/data-security/index.html then look at EMC Isilon, IBM GPFS, Harmonic MediaGrid, Quantum StorNext et al.
- RAID is not verifying that data is correct or complete. The best you get is RAID6, where dual parity helps you determine which of three copies (actually orthogonal error correction codes) of data differs from the other two. But people don't run RAID6 (especially not in home scenarios) because it is slow (double the checksumming) and expensive (one extra drive bay with one extra drive for no storage increase).
- Building your routines around trying to switch media before it dies is a recipe for disaster. Better to know that drives fail and build your routines around that: when a drive fails, what would I do to recover, and how long do I have until I start losing data?

I've got 42146 .CR2 images, all really good ones developed to taste into high-quality JPEG files (for long-term compatibility), and 1346 .MOV files in my archive, for a total size of 1.3TB. I've got last month's images on a laptop, one full copy on a desktop machine with a checksumming filesystem (ZFS) and one copy on a NAS with another checksumming filesystem (BTRFS). The NAS has a year's worth of snapshots such that I can recover a lost/overwritten file if needed. Additionally, I keep the most important parts as a manual selection on a loose drive at the office.

Failure scenarios in my setup:

- I remove a whole directory of my best images. I've got a copy on my desktop. Was it on the desktop? I've got a full copy on the NAS. Was it on the NAS too? I've got snapshots I can go back to. If I didn't find out within a year that I removed the files, they were probably not that great to begin with.
- A disk in the desktop machine dies. I make sure the nightly sync to the NAS gets done, then shut down the desktop to avoid having more disks die while I replace the disk (it would be a pain to sync back 1.3TB onto a clean filesystem from the NAS, which can have a drive fail without losing data).
- A disk in the desktop machine starts corrupting data. My monthly scheduled data scrub detects that the block-level checksum is incorrect, overwrites the block with a copy from a different disk where the checksum matches, and increases the checksum error counter. I get a mail that there was an error and can decide if I want to replace the disk or wait a bit more.
- A disk controller or device driver in the desktop starts corrupting data. I'll similarly get email that the checksum error counter has increased, but this time across multiple drives. I still have a copy on the NAS (and the sync will fail on files where there is a block that has no copy which passes checksum verification) so I can continue using the desktop until I have figured out which component needs replacement.
- A disk in the NAS dies. I get a mail. I replace it before both my desktop machine has a chance to have two drives fail AND the NAS has another failed drive.
- A disk drive in the NAS starts silently corrupting data. My monthly scheduled data scrub detects that the block-level checksum is incorrect and sends me an email. I can rewrite the file or change the disk immediately.
- The NAS has a controller or device driver issue that corrupts data. Data on disk will still be clean (unless of course writes are misdirected) so I can shut it down and replace the NAS, just moving the disks. I've still got N+1 redundancy on the desktop.
- Fire! Or a similar catastrophic event rendering all electronics in my home non-working. I've got a copy of the best media off-site.

I've got some plans to replace the manual selection of best images for off-site storage with a simple rotating scheme of drives holding the full archive instead. As you said, hard drives are relatively cheap.


----------



## Zeidora (Jun 8, 2017)

NAS vs. HD. Unless you need the network aspect or want to use it as a web file server, I would just use straight HDs. The NAS communication protocols are rather slow, and are one more point where things can go wrong. Used to have NAS, now on RAID10 HDs. Even cross-platform protocols are now much easier, even at home. My Mac for imaging and PC for microscope control are connected via home 1 Gb Ethernet. And, no, I don't have a degree in computer engineering.

RAID ≠ RAID. There are quite a few flavors of RAID. If you talk parity bits etc. then most likely it refers to RAID5 (maybe 6). The little bit of added storage space is not worth the hassle/risk given the relatively low costs of HDs. I would only use RAID1 (or RAID10 for arrays with more than 2 disks).
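To make the space trade-off concrete, here is a small sketch of nominal usable capacity per RAID level. It uses the textbook formulas for equal-sized drives and ignores filesystem overhead and vendor-specific layouts (a Drobo, for example, does its own thing), so treat the numbers as rough guides. The function name is just for illustration.

```python
def usable_tb(level: str, drives: int, drive_tb: float) -> float:
    """Nominal usable capacity in TB for `drives` equal-sized disks."""
    if level == "RAID1":
        if drives < 2:
            raise ValueError("RAID1 needs at least 2 drives")
        return drive_tb  # every drive holds the same data
    if level == "RAID10":
        if drives < 4 or drives % 2:
            raise ValueError("RAID10 needs an even number of drives, 4 or more")
        return drives * drive_tb / 2  # striped mirrored pairs: half the raw space
    if level == "RAID5":
        if drives < 3:
            raise ValueError("RAID5 needs at least 3 drives")
        return (drives - 1) * drive_tb  # one drive's worth of parity
    if level == "RAID6":
        if drives < 4:
            raise ValueError("RAID6 needs at least 4 drives")
        return (drives - 2) * drive_tb  # two drives' worth of parity
    raise ValueError(f"unknown RAID level: {level}")


for level in ("RAID10", "RAID5", "RAID6"):
    print(level, usable_tb(level, 4, 4.0), "TB usable from 4x4TB")
```

For a 4x4TB enclosure that is 8 TB with RAID10 against 12 TB with RAID5, which is the "little bit of added storage" in question; RAID6 on the same four drives gives 8 TB.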

Geographic separation is key. 

Cloud storage. Personally, I don't trust them, particularly if it is "free".

The biggest problem is human error, and remembering to transfer files. Tried some automated backup systems (Retrospect), but never was happy with them.

My system is: MacPro SSD for OS and apps only, 2x3TB RAID1 drive for day to day, attached 4x4TB RAID10 for local storage, usually switched off. Another 4x4TB RAID10 100 miles apart for alternate storage. 2x1TB RAID1 portable drive to move data between locations, and as travel storage from camera. Have around 80K tifs and CR2s, plus three big book projects. Not even close to filling it up. All HDs are from LaCie, never let me down thus far. Had problems with BuffaloTech NAS drives, and G-Force HD.


----------



## mnclayshooter (Jun 8, 2017)

hne said:


> I've got a bit of work damage to my brain, having worked with the design, development and delivery of media archive systems for the past decade. You kind of stepped on my pet peeve. You've got a few things wrong and quite a few things delightfully right.



Sorry, didn't intend to further the brain damage. Merely trying to point out that unless you're Reuters, SI or Nat Geo, your home photography business or hobby simply doesn't have the archive size or capacity that would require anything more than some spinning HD drives (widely available) and some backup/archive software and some low-cost hardware to check up on it once in a while. Certainly not $1000's to buy specialty backup hardware - likely proprietary in nature.

My system includes my old Ubuntu box with swappable drive bays, in which I essentially use 2 and 4TB drives as mega-floppies for making 1:1 archive backups. While I've lost HDs to head crashes, arm motor freezes, card failures (which I've had some luck in actually replacing and getting access to the data again by buying a matched drive off of eBay), bad/loose cables etc, I've never lost a file I regret losing in 30+ years, simply because of the redundancy of my backup/archive plan for my documents/photos/music etc. It started with floppies back in the day, and now is using HDs and is advantaged by some better hardware (and OS) capabilities.

Having multiple swap bays, I can generate two identical copies - one for the safe deposit box and one for the house - for annual archives. For my working files, I have a small 2TB WD NAS that runs RAID 1... the disks are completely readable by my Windows system when I pop them out of the enclosure and use them. Works great to create a working backup copy on the fly for working files before I transition them to an annual (or sometimes semi-annual) archive. Having peace of mind that I at least have a good working copy on the NAS if a drive goes bad (and the NAS emails me), I can power it off and get it replaced quickly.

If I'm being 100% honest, I was trying to avoid delving into discussing Ubuntu, mainly because I'm not able to provide support if someone doesn't "get it right" and I'm also not a Linux expert by any means. Just good enough to be dangerous. But OpenZFS does a great job for me. Best part is, the OS and the disk management are FREE, provided you want to learn a little about installing/using Ubuntu/Linux.


----------

