Data Storage at home – Safely storing your first bytes
#homelab #selfhosting #unraid #storagespaces
I debated what I wanted my first main topic to be, and after some internal back-and-forth I decided to start the way I start most of my projects: from the bottom up. So today I'm going to talk about how I manage my (apparently) large amounts of data.
Storing data has always been the root cause of all of my selfhosting and homelabbing. Starting back in the days when an mp3 would take an hour to download, I quickly learned that I did not want to download things multiple times, assuming I could even find them a second time. So, I learned to download early and often.
As we started using more data, piles of new things kept cropping up. Photos were going digital, and even some short video files started to appear. I remember being very annoyed when we ran out of disk space and were forced to choose which photos to keep and which to delete, so I started looking into my first few ways to store larger amounts of data.
Today I'm sitting on well over 100TB of data at home: mostly my own personal media, personal projects, backups of backups, and the cloud drives I proudly host for my family. Getting here, though, was a long process.
Getting started
As I said in the first post, I started with a rock solid Pentium 3 with 256MB of RAM way back in the day. At that time, sharing data between computers was as simple for me as setting up a share on Windows and sharing files stored on the main drive of the “server”. I ran Windows 2000 Server, and ran shares for the entire house from that PC. (I actually remember installing games to the shares, and running installed games like Age of Empires on other PCs via Windows shares.)
Unfortunately data grows, and the more I stored, the larger the drives I needed. A massive 120GB drive becomes a 320, and then a 500. Each time I would carefully install the new drive by hooking up the giant ribbon cable, copy everything over through Windows Explorer, and pray that the copy would finish cleanly.
Eventually I outgrew what a single drive could handle, and I got my first external drive, an IOGear 1TB “Drive”, which was actually two 500GB drives in a RAID 0 configuration. I'll explain RAID more below, but if you aren't aware, this is essentially two drives working together to appear as a single 1TB drive. At the time I didn't know what RAID was, that I was actually using it, or how risky it was (especially since it was in an external drive that I took with me everywhere) – but I would learn over time.
This system of a server computer hosting a few drives worked well for years. Into college I continued to grow my data to the point that I finally bought a 3TB internal Seagate drive, and I was amazed at how much storage it could hold. Unfortunately, this is when I learned my first lesson on the importance of data redundancy, as I woke up one morning to that stomach-churning noise: click click click whirrrr – click click click whirrrr.
That was the end of all of that data: about seven years of memories and content, gone overnight. I was devastated, and the thought of paying well over two thousand dollars for professional recovery made a college student living on ramen queasy. So, I had to start all over.
Trial and Error: “pro-sumer” level storage
I went through a few different solutions over a few years before I settled on what I would use. I'll go over a few of them now, and why I eventually ended up choosing Unraid. Foreshadowing...
Hardware RAIDs
The first option was probably the easiest to start with. I needed space, and I needed to store my data in a way that minimized data loss if a disk failed. This meant combining multiple drives to act as one large drive, which is called an array.
I mentioned RAID 0 above. RAID, in its traditional hardware form, is a hardware-level way of combining drives in different configurations so that the operating system above only sees the single drive the controller presents. If you want to combine 3 drives into one super mega drive, that's RAID. You usually configure it in your motherboard's BIOS, and then maybe install a driver in Windows to expose the new drive. To Windows, it looks like any other drive.
An example RAID configuration in a motherboard BIOS
RAID has a few configurations. I'll talk about the core ones, but there's a complete list on the Wikipedia page.
RAID 0 – All data is striped. This means that you are maximizing your capacity. With N drives, your data is split into chunks and distributed across all of them. Like slicing a loaf of bread: the first slice goes to drive 1, the second slice to drive 2, and on and on until you're out of drives and it starts over at drive 1.
- 0 maximizes storage, but your tolerance for failure is zero. If any drive fails, you lose the entire array. There is no redundancy; it is simply lost.
- 0, however, is great for speed. Since you are pulling from N drives at once, you get roughly N × (slowest drive's speed), capped only by what your RAID controller (motherboard) can handle. RAID 0 is a popular choice in gaming machines and in server applications where speed matters more than safety. (It's actually what I still use in my main gaming PC: I have five 2TB SSDs in a RAID 0, and games are fast. See the sketch below for how striping works.)
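To make striping concrete, here's a minimal Python sketch. The chunk size and drive count are made up for illustration; real controllers stripe in much larger blocks.

```python
# A minimal sketch of RAID 0 striping (illustrative only): split data into
# fixed-size chunks and deal them out round-robin across N "drives",
# modeled here as plain byte arrays.

CHUNK = 4  # bytes per stripe chunk; real controllers use much larger stripes

def stripe(data: bytes, n_drives: int) -> list[bytearray]:
    """Distribute data round-robin across n_drives."""
    drives = [bytearray() for _ in range(n_drives)]
    for i in range(0, len(data), CHUNK):
        drives[(i // CHUNK) % n_drives] += data[i:i + CHUNK]
    return drives

def unstripe(drives: list[bytearray], total: int) -> bytes:
    """Reassemble the original data by reading chunks in round-robin order."""
    out = bytearray()
    offsets = [0] * len(drives)
    d = 0
    while len(out) < total:
        out += drives[d][offsets[d]:offsets[d] + CHUNK]
        offsets[d] += CHUNK
        d = (d + 1) % len(drives)
    return bytes(out)

data = b"the quick brown fox jumps over the lazy dog"
drives = stripe(data, 3)
assert unstripe(drives, len(data)) == data  # all drives healthy: data survives

drives[1] = bytearray()  # one drive "fails"...
# ...and with RAID 0, the remaining chunks are useless on their own.
```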
RAID 1 – All data is mirrored. Storage size is not the priority with RAID 1; redundancy is. In our bread example, for each slice of bread it sees, it clones the slice and puts a copy on every drive. So with N drives you end up with N identical loaves of bread. (Okay, the analogy is falling apart – I guess they have Star Trek replicator technology.)
- 1 has maximum redundancy: if a drive fails you simply replace it, the data is copied over from a surviving mirror, and the full array is restored.
- 1 has essentially no speed benefit for writes, as every write still happens at an individual drive's speed (some controllers can spread reads across the mirrors for a modest boost).
- 1 is the simplest approach to redundancy: you carry a complete second copy of everything, which means that when building out your system, you must double your drive costs to carry that copy. See the sketch after this list.
- We are still limited by the size of a single drive, so no additional storage space is gained.
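Continuing the Python sketches, mirroring is almost trivially simple, which is exactly its appeal: rebuilding is just a straight copy from any survivor.

```python
# A minimal sketch of RAID 1 mirroring: every write goes to every drive,
# so any single surviving drive can restore the array on its own.

def mirror_write(data: bytes, n_drives: int) -> list[bytes | None]:
    """Every drive gets an identical copy of the data."""
    return [data for _ in range(n_drives)]

drives = mirror_write(b"family photos", 2)
drives[0] = None                                    # one drive fails...
survivor = next(d for d in drives if d is not None)
drives[0] = survivor                                # ...rebuild is a straight copy
assert drives[0] == b"family photos"
```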
RAID 0+1 – Here we're starting to get a bit more clever, but not completely. Data is both striped and mirrored. This is for when you need the space and speed of RAID 0, but you want the redundancy of RAID 1. Data is first sliced across multiple drives, and that whole stripe set is then cloned to a mirror. You get the benefits of 0 (the extra storage and speed) and the safety of 1 (because you have an entire mirror), but unfortunately you also get the downsides of both.
- 0+1 gives us the speed boost of 0, and the mirroring of 1 together
- 0+1 mitigates some of the failure risk of 0: if one drive fails, you can recover. However, once a drive fails, an entire stripe set is gone, and the rebuild depends on the remaining mirror being read end to end – probably its most intense use to date – without failing. If a drive in that remaining set does fail, the entire array is lost.
RAID 5 – Finally we arrive at something that might work for a real-world use case. 5 introduces the concept of striping with parity. Parity is going to be a big word as we continue. The concept: take a RAID 0 array, and for each bit position across the drives, run a mathematical operation and record the result. If one drive fails, you simply reverse that operation to work out what the failed drive's bit was and write it to the replacement drive. I'll explain this more below, but essentially the number of drives needed to survive a one-drive failure is no longer 2N (where N is the number of data drives, as in 0+1), but N+1: for N drives of data, parity only needs one additional drive. (The exact implementation is a bit different in that it distributes the parity across all the drives, but for now this explanation will work.) So to break it down:
- 5 gives us RAID 0-like read speeds, with slower and more variable write speeds (every write also requires a parity calculation).
- 5 gives us redundancy against any single drive failure, while only needing one extra drive.
- Any one drive can fail, and the entire array can be rebuilt from the parity – however...
- If any one drive fails, the entire array must go through a rebuild cycle to regenerate the failed drive's contents and bring the system back up to full parity.
Again I'll dive into how that parity is calculated below, but that's the gist of it.
Okay, that was a lot, thanks for learning (a few of the standard) RAID types!
Hardware RAIDs are common, and I used one as my primary storage for a while; however, they have a couple of major flaws.
Hardware RAIDs depend on a specific controller – your motherboard's, or an add-in card. If that controller fails, you are at the mercy of finding another controller that works the same way, often an identical model. Portability to a new computer is near non-existent because of this.
Hardware RAIDs also generally want all of the drives to be identical – the same size at a minimum, and ideally the same model. Remember, the hardware is in control of slicing the data up, so mismatched drives are either rejected or limited to the smallest drive's capacity. This is a severe limitation of RAID: you pretty much must know exactly how you want to build your array before building it.
But what if they don't make that drive anymore? What if they changed the drive and didn't tell anyone? What if you simply want to add more storage to your array? Well, then it's time to take it to the next level. To software RAIDs.
Windows Storage Spaces
Storage Spaces was my first foray into the world of software RAIDs. Software RAIDs are similar to hardware RAIDs in that they still combine disks, usually with the same basic algorithms, but being software based they have extra flexibility: chiefly, you don't need to use the same model or even the same capacity of drive. You can add a 4TB drive to an array of 3TB drives and it will work fine. This was a huge selling point for me, because I wanted my array to grow with me.
Windows Storage Spaces is Microsoft's built-in approach to handling multiple disks and spreading data across them. I first heard about it through a friend at work, who recommended it. I decided to try it out, just for fun, and created a virtual machine running Windows Server 2012. I attached 4 virtual “disks” to the VM so I could play with the array. The drives weren't big, only 100MB each, but I was able to create a simple array through Windows' dialogs. The size was fair: it wasn't the full 400MB, but it was clearly keeping parity data, so I ended up with about 320MB of usable space.
I copied some data onto the newly formed Storage Spaces drive, and then I proceeded to mess around. I shut down the VM, detached a drive, and watched what would happen. Storage Spaces saw the failed drive and offered to remove it, or I could even still start the array in a “degraded” state. I detached another drive and the array went offline. I hot-swapped drives, I yanked drives while the VM's “power” was still on, and everything stayed stable. I added drives of different sizes. When I was happy with my testing, I installed Server 2012 on a spare computer with a bunch of SATA ports, set up my Storage Spaces pool, and started the copy.
Which took forever. I assumed my network was just slow or something, but I was getting maybe 100Kbps. I had learned the biggest downfall of the software RAID: it's software. Being in software means that parity calculation must also be done in software; there is no specialized hardware computing the parity for you. Read speeds off the drives were fine, but writing to them became an arduous task. I stuck with Storage Spaces for a while, but it was clear that as long as I used it, I would have to trade write speed for flexibility.
Until...
Unraid
I had been hearing about Unraid on Reddit's /r/DataHoarder for a while. DataHoarder, I realized, was a totally real term that did apply to me: the need to retain and collect all data while the Delete key on the keyboard grows dusty and sad. Essentially we're squirrels, but for data.
Unraid is an entire operating system, meaning it won't be something you can enable on your existing computer; you will need a machine separate from your primary PC to run it.
Its primary feature is, of course, the data array. Unraid offers a software RAID that's similar to Storage Spaces in that it can use arbitrarily sized disks to create an array out of whatever drives you have lying around. It creates a “storage pool” using these disks, mimicking a RAID 0 environment, but with an important caveat: it doesn't stripe the data. Instead of slicing your files up like a loaf of bread, it chooses one of your disks and puts the entire “loaf” (the file) on it. So when you look at an individual disk, you will see whole files sitting there. The pool/array part is that your directories (folders, for you Windows folks) are split up, so one file may be on one disk while another file is on a completely separate disk. The operating system then uses some proprietary magic to create “shares” that combine all of these individual disks into what looks like one large, cohesive drive.
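Here's a rough Python sketch of my mental model of that allocation – not Unraid's actual code. Unraid has several selectable allocation strategies; this sketch just uses a simple “least used” rule.

```python
# Mental-model sketch: whole files land on a single disk, and a "share"
# is just a merged view over every disk's directory tree.

disks = [{}, {}, {}]  # each disk maps filename -> file contents

def write_file(name: str, contents: bytes) -> None:
    """Put the whole file on the least-used disk (one simple allocation rule)."""
    target = min(disks, key=lambda d: sum(len(v) for v in d.values()))
    target[name] = contents

def share_view() -> dict[str, bytes]:
    """The union of all disks, presented as one big drive."""
    merged = {}
    for disk in disks:
        merged.update(disk)
    return merged

write_file("vacation.mp4", b"x" * 1000)
write_file("resume.pdf", b"x" * 10)

# Each file lives intact on exactly one disk, but the share sees both:
assert set(share_view()) == {"vacation.mp4", "resume.pdf"}
assert sum("vacation.mp4" in d for d in disks) == 1
```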
So that's storage – what about parity? Storage Spaces reserves a small portion of each attached drive to store parity bits for all the other drives. Unraid does it differently, so let's talk about parity.
Unraid's Parity system
Parity in Unraid is the same basic function, but instead of reserving space on each drive, you dedicate a separate drive – similar to the RAID 5 concept above – as your parity drive. The caveat is that the parity drive must be at least as large as the largest data drive in the array. This will make sense in a second.
To calculate parity, on each write Unraid does basically the same thing as RAID 5 and Storage Spaces: it runs a calculation across the bits at the same position on every drive to determine what the parity bit should be. If you have 4 data drives plus 1 parity drive, it combines the 4 data bits and computes one parity bit from them.
For you nerds following along, this calculation is an XOR across the drives. For everyone else, think of it like adding up the 0's and 1's and deciding if the total is even or odd: even records a zero, odd records a one. So if one drive failed, all you would need to do is add up those numbers again and use the parity to tell what the missing value was. If a drive is missing and the surviving values are even, but the parity says the total was odd, then you know the missing bit was a one. Some examples:
Drive 1 | Drive 2 | Drive 3 | Drive 4 | Parity |
---|---|---|---|---|
0 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 1 |
1 | 1 | 0 | 0 | 0 |
1 | 1 | 1 | 1 | 0 |
So when writing a file, Unraid checks the bits at the same location on every other drive, and then updates the parity drive with the new value. This is why the parity drive must be the largest drive: every bit on the largest data drive needs a corresponding parity bit, in case that drive is the one that fails.
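Here's the same idea as a minimal Python sketch, with each drive's contents reduced to a couple of illustrative bytes. XOR everything (survivors plus parity) together and a dead drive's contents fall right out:

```python
# A minimal sketch of XOR parity, the same idea RAID 5 and Unraid rely on.
# Each "drive" here is just two bytes.

from functools import reduce

def xor_bytes(blocks: list[bytes]) -> bytes:
    """XOR byte strings together, position by position."""
    return bytes(reduce(lambda acc, blk: [a ^ b for a, b in zip(acc, blk)],
                        blocks, [0] * len(blocks[0])))

data_drives = [b"\x01\x00", b"\x01\x01", b"\x01\x01", b"\x01\x01"]
parity = xor_bytes(data_drives)  # written to the dedicated parity drive

# Drive 2 (index 1) dies. XOR the survivors together with the parity
# drive, and the missing drive's contents are recovered:
survivors = [d for i, d in enumerate(data_drives) if i != 1]
rebuilt = xor_bytes(survivors + [parity])
assert rebuilt == data_drives[1]
```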
Double Parity
Unraid allows you to set up no parity (please don't do this), single parity like above, or double parity. The other systems support this as well: RAID 6 is RAID 5 with an extra parity drive, and Storage Spaces also supports dual parity.
Why do double parity? Well, let's think about how parity works. If a drive fails, your array has no redundancy left; at that moment, you have suffered the maximum number of drive failures it can handle. You then replace the failed disk, and the rebuild process starts. The array rebuild consists of:
- Reading every bit, in sequential order, start to finish, on every drive
- Calculating what the bit should be on the new drive
- Writing the new bit to that new drive
During a parity rebuild this operation runs all of your disks at 100%, for likely many hours (maybe even days, now that we're up to 20+TB drives). Your machine will be at full activity while producing the most heat it ever will. In essence, these are prime conditions for another drive to fail. If you have another drive teetering on the edge of failure, this is the moment it will go – and remember, at this point you are at your most vulnerable, with no redundancy left.
This is why I recommend just biting the bullet and putting the extra drive into your cart. Yes, it's more money, but it'll save you the extra stress and anxiety if the worst should happen.
Cache Drive
Unraid supports using a cache drive alongside the main array. This is extraordinarily useful because of both the write-speed limitations of spinning hard drives and the overhead of calculating parity. (Remember that calculating parity means spinning up all of your drives at once, running the calculations, and then writing to the drive the file is stored on along with updating the parity.) Using a cache drive like a SATA SSD or NVMe drive means you can write that new file at blazing fast speeds, and Unraid will save it to the full data array later. The tradeoff is that for this short time your file is unprotected by the array, but it lets you move on to other things.
The process of moving files to the array is (cleverly enough) called the Mover in Unraid. For my own setup, I schedule the Mover for about midnight every night; Unraid takes any files on the cache drive and saves them to the array. If my cache drive dies, I've lost at most one day's files.
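You pick the schedule in Unraid's GUI; as I understand it, under the hood it amounts to a cron-style schedule. A nightly midnight run looks something like this (the script path is illustrative, not something you need to type in):

```
# minute hour day-of-month month day-of-week  command
0 0 * * * /usr/local/sbin/mover
```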
Full Homelab Suite
Unraid is much more than just a data storage system. It's a full operating system with virtualization built in, which means you can run virtual machines and Docker containers right from Unraid itself. This can be an amazing way to get started with homelabbing, by running your first applications directly on Unraid. There's no need to run separate servers (you can, of course), and it's extremely easy to get your first applications running with Unraid alone. Need Windows? Spin up a Windows VM. Want to run Plex? Use a Plex Docker image and hook your media straight in.
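As a taste, here's roughly what a Plex container looks like from the command line. This is a sketch, not a recipe: in practice you'd use Unraid's Community Applications templates in the GUI, and the image choice, IDs, and host paths below are assumptions you'd adjust for your own setup.

```
# Sketch only: run Plex via the linuxserver.io image.
# PUID/PGID and the /mnt/user/... paths are illustrative placeholders.
docker run -d \
  --name=plex \
  --net=host \
  -e PUID=1000 \
  -e PGID=1000 \
  -e VERSION=docker \
  -v /mnt/user/appdata/plex:/config \
  -v /mnt/user/media:/data \
  lscr.io/linuxserver/plex
```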
Comparisons
My decision to go with Unraid was a personal one. There are many other storage solutions out there I could have chosen, more than I can document here, but ultimately this is how I viewed it.
Pros:
- The array can be expanded, and I can use newer drives to let the system grow with me. Where I started with 3TB drives, I just installed my first 22TB drive into the array, all without needing to do a massive copy of all of my data to a new array.
- Write speeds are better than Storage Spaces because of how parity is calculated.
- If you add a cache drive, write speeds are much faster than Storage Spaces.
- Unraid is software, so you aren't dependent on a controller that can fail, like with a standard RAID. If your motherboard dies, you plug the drives (and the Unraid boot USB stick, which holds the configuration) into a new computer and start it back up.
- Unraid has an amazing feature suite of additional things you can do with it. It deserves its own post, but if you're getting started you can:
  - Run VMs
  - Run Docker containers
  - Download a shocking amount of plugins
Cons:
- Not a true striped RAID array, so no striping performance gains
- By keeping files whole rather than sliced, you are limited to the speed of a single drive
- A full OS, so this will really only work as network-attached storage, not on a primary PC
- Proprietary. While relatively inexpensive (currently $119 for a lifetime license), it is not open source.
- Needs to run regular jobs to maintain parity:
  - Parity check (on your schedule preference; I run monthly)
  - Mover (and until the Mover has run, your cached data is at risk)
Summing up
Overall, I chose Unraid for its extreme flexibility, even with its slight performance hits. For me, this is my primary network storage, where I store large files that may only be accessed once in a while. If that's what I'm storing, I can live with a write operation taking a bit longer than lightning speed. With the cache drive on top, I get the maximum my network allows and can set it and forget it, with the Mover picking up my writes later.
There are many different options out there. One I looked into but didn't have the space to write about here is TrueNAS; I know a lot of people like it. I've also toyed with ZFS a bit on Proxmox. Ultimately it comes down to your use case. There are dozens of comparisons online if you're curious, but I hope you can see the benefits of Unraid, and why I personally chose it.
It is a great system to start with if you're curious about homelabbing/selfhosting, as it provides everything out of the box.
This ended up being much longer than I anticipated, but I wanted to give a full picture of how I manage storage. If you read this far, thank you! This was a lot to write, so we'll see how my other posts turn out. See you next time!