
OpenSolaris As A Storage Server


meilicke

Guest
For nearly a year I have been trying to come up with a sane storage solution that is quiet, reliable, expandable, reasonably high-performing, and inexpensive. My photo library is not terribly large (~130GB), but since I acquired an HD video camera, my storage needs were rapidly getting out of control. One hour of video consumes about 80GB of working space, and about 9GB of space to archive.

After far too many hours surfing the Internet and trying various solutions, I settled on building an OpenSolaris computer, accessed from my Mac via iSCSI. The reason for OpenSolaris vs. Linux, Windows, etc., is the file system, the Zettabyte File System (ZFS). Simply put, it is the best file system I have ever used. More on that later.

To be clear, building your own storage box is not like a Drobo or similar device. There is nothing plug-and-play about this. But it is fun, and you can get the system exactly the way you want it.

The box I built has two mirrored boot drives, and four data drives in a RAIDz, the ZFS equivalent of a RAID-5 (but better). This gives me a little over 2 TB of space, plus about 700 GB from the boot mirror. The whole thing was a bit over $1000.

My iMac connects to the box over a single 1GBit connection, through an inexpensive Linksys switch. I think I paid $40 for an 8 port version.

The data on the server is exposed via iSCSI to my Mac, so that the storage appears like any other disk. This allows me to back up the iSCSI data via Time Machine. It is just as easy to export the data via Windows file sharing or NFS.
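
In case it helps anyone, the server side of that is roughly the following (a minimal sketch only - the pool name "tank" and volume name "macdisk" are made-up examples, and this uses the older shareiscsi property rather than the newer COMSTAR framework; the Mac side needs a third-party iSCSI initiator):

    # carve a 200GB volume (zvol) out of the pool
    zfs create -V 200g tank/macdisk
    # publish it as an iSCSI target
    zfs set shareiscsi=on tank/macdisk
    # confirm the target was created
    iscsitadm list target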

Informal speed tests showed I was getting about 220MB/s on the RAIDz. My technique was simple and error prone - I copied a 3GB movie from one file system to another, all within the same 4 disk RAIDz set. Considering the file was being read and then written with the same disks, that is pretty good performance.

Testing between the iMac and the storage box was a little more disappointing. I used iozone, an excellent I/O tool, and got between 40-50MB/s. Considering the speed of the network and my equipment, I expected around 80-90MB/s (125MB/s is the theoretical max for a single gigabit connection). The same test to another iMac showed speeds around 60MB/s (the limit of the second iMac hard drive), so something is not configured correctly on the storage server. The switch may also play a part. I have read of people with similar hardware to mine getting in the 100MB/s region with a single network card, and 190MB/s with two cards, so it is certainly possible to go faster than what I am getting. My guess is that somewhere my configuration is not correct.
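
For anyone who wants to repeat the test, the iozone run was along these lines (the sizes and mount path here are illustrative, not my exact invocation):

    # sequential write/rewrite (-i 0) and read/reread (-i 1), 4GB file, 128KB records
    iozone -i 0 -i 1 -r 128k -s 4g -f /Volumes/iscsi/iozone.tmp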

But the real glory of the system is ZFS. Wikipedia does a much better job explaining ZFS than I could, but here are the bullet points.

* Lightweight file systems
Think of a ZFS file system as a glorified directory. It takes a single command to create one, they can be nested, and each one can have individual policies such as compression, sharing, snapshots, etc. Nested file systems share the same storage (thin provisioning) and can have quotas, reservations, etc. (see the command sketch after this list).

* Snapshots
Example - before doing a major overhaul on your photo library, take a snapshot. If things just don't work out, revert to the snapshot, and you are back to where you started. Heck, take 2 or 10 or more during the process to protect you at various stages. Writable snapshots (clones) can be created as well.

* Data Integrity
Assuming you are using some sort of redundant RAID (RAIDz, mirror), every read/write is validated to ensure the data has not been silently corrupted. Traditional RAID-5 and mirrors do not do this. Also, disks have an error rate of around 1 in 10^14 bits, and a TB is about 10^13 bits, so it is statistically likely that you will have data corruption with a largish traditional RAID-5 system during the rebuild of a failed drive. ZFS is able to silently recover from such errors. :thumbs::thumbs::thumbs:

* Data portability
A ZFS file system can be securely replicated to another system. Read-only ZFS support is also included in Mac OS X, so if your OpenSolaris box blows up :cry:, you can mount the disks on your Mac, in USB cases if necessary, and read the data! :clap:

* Easy to use
Now, this is relative. Compared to Linux or Windows, it really is easy to create and manipulate ZFS file systems. Compared to plug-and-play Drobo-type devices, it is an incredible pain in the neck. But where is the fun in a Drobo? :thumbup:

* Flexibility
Oh boy, where to start. The combination of file systems as easy as directories, really fast disk I/O, high data integrity, snapshots, multiple access methods, etc. etc. etc. makes for a really flexible system. Not to mention everything else you can do with OpenSolaris - virtualization, iTunes servers, desktop applications, etc.
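
To give a flavor of the bullets above, here is a rough command sketch. Treat it as illustrative only - the pool, file system, and device names are made up, so check the zpool and zfs man pages before copying anything:

    # create a raidz pool from four disks
    zpool create tank raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0
    # lightweight, nested file systems with per-file-system properties
    zfs create tank/photos
    zfs create tank/photos/raw
    zfs set compression=on tank/photos/raw
    # snapshot before a big library overhaul, roll back if it goes wrong
    zfs snapshot tank/photos@before-reorg
    zfs rollback tank/photos@before-reorg
    # verify every block against its checksum, then check the result
    zpool scrub tank
    zpool status tank
    # replicate a snapshot to another machine
    zfs send tank/photos@before-reorg | ssh otherbox zfs receive backup/photos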

I have been running this server for about eight months, and have had two failures. One was my fault, the other was a crash (OK, I power cycled the system in a fit of impatience :mad:). The first time, I mucked up the OS boot mirror, which left my system unbootable. I booted from the install CD, was able to re-mirror my disks with one line at the command prompt, and 10 minutes later rebooted with a healthy system :). In the other case, I could not get the system to boot, and had to reinstall the OS from scratch. The reinstall took about 20 minutes, and all of my data on the RAIDz was fine. In fact, due to the way the file system works, of the hundreds of computer systems I have managed over the years, this is the first where I felt really confident that the data would be OK after an OS crash.
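
For the curious, the one-line re-mirror was essentially a zpool attach, something along these lines (the disk names are examples, not my actual devices):

    # attach a second disk to the root pool; ZFS resilvers it into a mirror
    zpool attach -f rpool c3d0s0 c4d0s0
    # watch the resilver progress
    zpool status rpool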

Sorry this is so long! I am happy to clarify anything or provide more detail if you wish. But honestly, Wikipedia does a great job explaining the benefits of ZFS.

-Scott
 

meilicke

Guest
Two more thoughts, hopefully a bit less geeky than the first. :)

The flexibility of abstracting your disks is really attractive. Recently I wanted to rearrange my Time Machine disks, but to do that I had to reformat them. This would have left me without a backup. So I carved out 500GB on a different set of disks than my main storage, exported it via iSCSI, connected it to my Mac, and did a Time Machine backup. I was then able to reconfigure my external TM disks, all the while being protected from a primary disk failure.

The other thought is that OpenFiler is a good alternative to OpenSolaris, and it is easier to set up and manage. It has all of the storage appliance goodness - snapshots, RAID, iSCSI, etc. You can download a VMware appliance for testing, or an ISO image to install on real hardware. All of the management is done via a straightforward web front end. No command line.

-Scott
 
It's been a while since I played sysadmin, and nothing comes to mind off the top of my head, but I'll let it fester.

In the meantime I thought I'd be pedantic like a sysadmin should ;)

You don't say what type of drives you're using; if they're a SCSI variety they talk to each other directly (not through the CPU) so transfer speeds between file systems on the same disk set should be fast.

125MB/s *is* 1 Gigabit/s. Add overheads for TCP headers and the like, and you won't get near 125MB/s of data - your expectations are more realistic.

190MB/s would need two end-to-end connections for the same reason.

The switch shouldn't be a problem if the connection is full duplex; it should auto-sense but sometimes they need forcing.

I remember reading somewhere that at the time gigabit networks were being started, the Windows kernel wouldn't handle more than about 350Mb/s! I know you're not using Windows, but it's funny anyway...

I'll sleep on your problem, and by tomorrow I'll have forgotten about it :eek:
 

meilicke

Guest
Thanks for your feedback.

You don't say what type of drives you're using; if they're a SCSI variety they talk to each other directly (not through the CPU) so transfer speeds between file systems on the same disk set should be fast.
I am intrigued. If this is software RAID, don't transfers need to go through the CPU? :confused:

125MB/s *is* 1 Gigabit/s. Add overheads for TCP headers and the like, and you won't get near 125MB/s of data - your expectations are more realistic.
90MB/s or so would be delightful. :thumbs:

The switch shouldn't be a problem if the connection is full duplex; it should auto-sense but sometimes they need forcing.
This is where I need to put my energy, especially on the OpenSolaris side. I think the switch is OK (for now). I know my Mac is full-duplex 1000. Initially I had a bad cable which dropped me down to 100Mb. I have found that at least half of the time networking problems are physical (cables, connectors, etc.).
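
If it helps anyone else doing the same, the negotiated speed/duplex can be checked on the OpenSolaris side with dladm, and on the Mac with ifconfig (the interface name en0 is just an example):

    # OpenSolaris: show link state, speed and duplex for each NIC
    dladm show-dev
    # Mac: the media line shows the negotiated speed/duplex
    ifconfig en0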
 
I am intrigued. If this is software RAID, don't transfers need to go through the CPU? :confused:
I hadn't thought about it much until you said. :shocked:

I'll probably have to read up on it when I get a chance, but I would guess something like this:

The OS understands the file system (and software RAID is just a file system, albeit a complicated one), and tells the SCSI driver to move nnn bytes of data from address aaa on HD xx to address bbb on HD yy. The driver, being hardware aware, will tell the devices what data to move where, and let them get on with it until they report they've finished. If the data is fragmented, the OS will have to intervene and get the drives to talk together in chunks rather than consecutively; that of course would require small amounts of processing time.

Regarding your problem, one thing that may be worth looking into is your TCP settings (although I have to admit I'd be getting the manual out as far as OpenSolaris is concerned), e.g. packet sizes. Reducing overheads by making your packets larger (as the files are 60MB(?) in size) may have some effect, although if there's only two machines on your network, contention shouldn't be an issue. Be aware, though, that if the Mac is your internet machine, and you only have the one NIC, altering packet sizes will also affect your browsing speeds.
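
If you do experiment with packet sizes, jumbo frames are the usual thing to try. Only a sketch, and the interface names are examples - the NIC driver and the switch both have to support it, and it has to be set on every machine on the segment:

    # OpenSolaris side
    ifconfig e1000g0 mtu 9000
    # Mac side
    sudo ifconfig en0 mtu 9000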

There are bound to be tests that will let you see where the bottleneck is, but again I'll need to dig into a memory that isn't what it was...:)
 
I hadn't thought about it much until you said. :shocked:

I'll probably have to read up on it when I get a chance
I found this article, although it's not new...

"SCSI still has one advantage over IDE of any flavor: it doesn't bog down your CPU to do its job. IDE, in fact, uses nine times the resources SCSI does for the same operation. With Pentium-III and faster processors, this overhead is usually inconsequential. At one time, the argument against IDE was that any average horse could have beaten Secretariat--if Secretariat had to carry 9 jockeys and the other horse had to carry only one. This metaphor illustrated that no matter how fast an IDE drive became, its resource requirements would bog down the whole system. But, this metaphor is no longer true when it comes to CPUs because a modern CPU in an IDE system isn't carrying nine extra jockeys. It is carrying nine extra ounces on a jockey--something that may not make a noticeable difference at all."

So maybe (with even faster chips these days) it doesn't make as much difference as I thought.
 

meilicke

Guest
I remember those days well. My first PC had SCSI for just that reason.

ZFS actually checksums every block read from or written to the file system (which is one reason I like it so much). I don't have a link at the moment, but I have read that the overhead is not much. Things have come a long way.
 

etrigan63

Active member
Perhaps you should contribute your findings to the OpenFiler project. If they add ZFS support then they would have a truly earthshaking solution!
 

etrigan63

Active member
One other question: Can you post the parts list of the box you built? I cobbled together a parts list, but the total came out to about $2500. Granted, I chose a rackmounted chassis with 8 hotswap bays for SAS/SATA and included a RAID-5 controller with 256MB of cache.
 

cjlacz

Member
I've been busy lately so haven't been reading the forums. I enjoyed your writeup of your ZFS experience. Unfortunately I've had to put my plans to build on hold for a bit, and I think I'm going to wait for WD's 2TB drive to drop in price a little.

You can actually download and install full ZFS support on your Mac if you want.

I like ZFS because it has finally delivered on the promise of RAID: cheap, reliable storage using normal disks. RAID hardware is just too expensive. The thing I maybe dislike most about ZFS is that it's not possible to add disks to an existing raidz array, nor is it possible to go from raidz to raidz2. For companies that isn't a problem. As a home user it would have been nice to expand storage incrementally, but ZFS lets you save enough money that you can just buy all the disks at once.
 

cjlacz

Member
One other question: Can you post the parts list of the box you built? I cobbled a parts list but the total came out to about $2500. Granted, I chose a rackmounted chassis with 8 hotswap bays for SAS/SATA and included a RAID-5 controller with 256MB of cache.
If you are building a ZFS box the RAID-5 controller is a waste. ZFS doesn't use it, and that's part of the appeal. Look at a card like this:

http://www.newegg.com/Product/Product.aspx?Item=N82E16815121009

This card works well with Solaris and has 8 SATA ports. The problem is that it's PCI-X, but I don't think there is a cheap PCI-E card with 8 SATA ports that works with Solaris yet. You can use it in any PCI slot. That should save you a chunk of money. Cards with a Marvell controller are compatible. You could also get a motherboard with 6 SATA ports and a small card for the remaining drives.

ZFS is rather intelligent in how it caches data and uses the memory on your machine. If you want more cache, just add more memory.
 

etrigan63

Active member
Thanks for the tip Charles. I blogged about a project based on this and found some interesting bits of kit to support it. All I would need is a 4-port SATA 1x PCIe card that is compatible, as I would be using the mobo SATA to handle the boot drive (a 32GB SSD).

The chassis I found can handle 4 SAS/SATA drives plus a boot drive and is mini-ITX form-factor. The mobo comes with support for Core2Quad CPUs.

Where is the ZFS for Mac project located?
 

meilicke

Guest
Carlos, sorry about not seeing your earlier post (Feb 11th). I read your blog post, and it sounds like you are on your way. I love that case. Good idea about using flash for the boot disk.

You may still consider mirroring your boot disk. I messed mine up once, and was happy to have the mirror, even though a reinstall is pretty easy, especially since ZFS keeps config info mostly on the disks being managed.

My parts list (from Frys):
Antec Sonata III 500 - very quiet, about $90
ASUS P5E-VM HDMI, about $150
6 Seagate 7200.11 750GB disks, about $75 ea (great sale that day)
Supermicro CSE-M35T 5 bay hot-swap SATA enclosure, $110
Zalman ZM-F2GL 92mm Case Fan (to make the 5 bay disk cage quieter), $10
2GB RAM, $??
Dual core 2.2GHz proc, $??
Intel gigabit NIC, $40
IDE DVD

The ASUS MB that I bought was a mistake. While it has 6 SATA ports, the onboard NIC is not recognized by OpenSolaris, so I had to buy a separate NIC. Live and learn.

With the five disk cage, I think I could fit 11 disks total, but I'm not sure the power supply could deal with that. The case does not have room for the disk cage and the DVD drive, so when I need the drive, I end up hanging it out of the side of the case. It's a pain in the neck, and I would do things differently, maybe just a simple external USB case. The Zalman fan is great, very quiet.

Charles, thanks for the info. You actually can expand an existing ZFS pool, but you need to add disks in whole stripes. So if you have a four-disk raidz pool, you could add another four-disk raidz and stripe across the two. That will expand the entire pool online. It would be nice to be able to do this one disk at a time.
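
Something like this, if I remember the syntax right (device names are examples):

    # add a second four-disk raidz stripe to the existing pool
    zpool add tank raidz c2t0d0 c2t1d0 c2t2d0 c2t3d0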

The Mac ZFS stuff, at least as of about six months ago, seemed fragile in that people reported panics if you unmounted the volumes incorrectly. Carlos, you can download read/write support from Apple. Read support is already in the OS. Snow Leopard is supposed to offer read/write ZFS out of the box.

-Scott
 

cjlacz

Member
Carlos,

Here is a link to the OS X port of ZFS. http://zfs.macosforge.org/trac/wiki I'd probably still recommend sticking with Solaris right now for a ZFS server. If something happened to it, it's good to know you do have a way to mount the disks in OS X. You can see from their issues list they are still working on many features. http://zfs.macosforge.org/trac/wiki/issues I haven't heard about any major stability problems, but I haven't looked into it closely.

Have you thought about how you are going to configure the drives? I like raidz for the capacity, but with only 4 disks, striping across two mirrored pairs would provide great redundancy and higher performance. I do like the small case you found. Four disks is a little limiting for what I want to do, but the small size is great for my place.
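
Roughly this layout, with made-up device names - two mirrored pairs striped together, at the cost of half the raw capacity:

    zpool create tank mirror c1t0d0 c1t1d0 mirror c1t2d0 c1t3d0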

It depends what you'll be using the server for, but your boot drive is probably going to see very little use after booting. I hear about people using just a normal CF card to boot. Caching is all handled in memory. Have you read anything where using a fast SSD would significantly improve performance? I'd love to be proved wrong.

Scott - Yes, I know about adding another raidz and then combining them in the pool. Setting up another raidz means you have another disk for redundancy, and if you use hot spares you'd need two rather than one. It's a good solution for businesses, which is where they are aiming their market, but not so good for home users. It's still the best option around though. That said, the developers did post an algorithm for doing disk expansions of raidz, and it would allow you to convert a raidz to raidz2. It's just low on Sun's priority list. Hopefully another developer will start to work on it.

It looks like Snow Leopard will only have ZFS support in OS X Server. We'll probably have to wait a while longer to get it in the normal release. HFS+ was the same way.

I'm glad you like the Antec case. I was looking at the P182 which should give me access to 11 disks + boot drive and still have space for an optical drive. They seem to make some of the quietest cases around.
 