I'm in the process of implementing a backup service with these major requirements:
and the additional wishlist:
Of course the Perl motto "there is more than one way to do it" is valid for the major goals.
E.g. The external part could be done via:
and the time machine could be done via
My current idea is to use:
So I'm in a hunt for the:
Resources:
Requirements
All the protocols listed below should be interchangeable. I might do some benchmarks at a later stage.
SCSI over internet
Network Block Device
Exporting a device via NBD is a matter of:
root@server:/# apt-get install nbd-server
root@server:/# cat /etc/nbd-server/config
[generic]
[export0]
exportname = /dev/mapper/vg0-nbd6.0
port = 99
root@server:/# /etc/init.d/nbd-server restart
And importing it on a client is:
root@client:/# apt-get install nbd-client
root@client:/# grep -v '^#' /etc/nbd-client
AUTO_GEN="n"
KILLALL="true"
NBD_DEVICE[0]=/dev/nbd0
NBD_TYPE[0]=r
NBD_HOST[0]=SERVER-HOSTNAME
NBD_PORT[0]=99
root@client:/# /etc/init.d/nbd-client restart
You might want to check the manual pages in the respective packages for more configuration options and tweaks. E.g. the nbd-client init script has a feature to auto-mount file systems.
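Once the import is up, /dev/nbd0 behaves like any local block device. A quick sanity check could look like this (just a sketch; the ext4 filesystem and the mount point are arbitrary choices, not part of the setup above):
root@client:/# mkfs.ext4 /dev/nbd0
root@client:/# mkdir -p /mnt/nbd0
root@client:/# mount /dev/nbd0 /mnt/nbd0
root@client:/# df -h /mnt/nbd0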
By default, nbd-client creates a block device with a block size of 1024 bytes:
# On the client
blockdev --getbsz /dev/nbd0
1024
for ((i=0; i<10; i++)); do dd if=/dev/nbd0 of=/dev/null bs=1M count=1000 iflag=direct 2>&1 | grep bytes ; done
1048576000 bytes (1.0 GB) copied, 12.8387 s, 81.7 MB/s
1048576000 bytes (1.0 GB) copied, 14.1621 s, 74.0 MB/s
1048576000 bytes (1.0 GB) copied, 14.1721 s, 74.0 MB/s
1048576000 bytes (1.0 GB) copied, 15.6536 s, 67.0 MB/s
1048576000 bytes (1.0 GB) copied, 15.1352 s, 69.3 MB/s
1048576000 bytes (1.0 GB) copied, 15.5831 s, 67.3 MB/s
1048576000 bytes (1.0 GB) copied, 14.3358 s, 73.1 MB/s
1048576000 bytes (1.0 GB) copied, 15.256 s, 68.7 MB/s
1048576000 bytes (1.0 GB) copied, 13.9433 s, 75.2 MB/s
1048576000 bytes (1.0 GB) copied, 13.0245 s, 80.5 MB/s
# On the server
iostat -dk 10 | egrep '^(sd|Device)'
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 365.70 32194.80 380.00 321948 3800
sdb 316.20 31760.40 319.20 317604 3192
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 361.80 39333.20 281.20 393332 2812
sdb 323.20 39295.20 260.80 392952 2608
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 325.20 35762.80 238.40 357628 2384
sdb 274.90 35794.40 201.20 357944 2012
To summarize, we get about 70-80 MB/s and the server reads about 100 KB per request. The results are pretty much the same with a 2048-byte block size.
A 4k block size drops the transfer rate to 55 MB/s while keeping the 100 KB per IO operation rate.
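In case you want to reproduce the 2k/4k runs: the block size is set on the client at connection time. If memory serves, the switch is spelled -b / -block-size in recent nbd-client versions (older ones may use a different form - check nbd-client(8)), so re-connecting with a 4k block size would look roughly like:
root@client:/# nbd-client -d /dev/nbd0
root@client:/# nbd-client SERVER-HOSTNAME 99 /dev/nbd0 -b 4096
root@client:/# blockdev --getbsz /dev/nbd0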
Let's remove the "direct" flag from dd:
# On the client
blockdev --getbsz /dev/nbd0
1024
for ((i=0; i<10; i++)); do dd if=/dev/nbd0 of=/dev/null bs=1M count=1000 2>&1 | grep bytes ; done
1048576000 bytes (1.0 GB) copied, 14.5043 s, 72.3 MB/s
1048576000 bytes (1.0 GB) copied, 18.6863 s, 56.1 MB/s
1048576000 bytes (1.0 GB) copied, 15.6981 s, 66.8 MB/s
1048576000 bytes (1.0 GB) copied, 15.8664 s, 66.1 MB/s
1048576000 bytes (1.0 GB) copied, 16.7602 s, 62.6 MB/s
1048576000 bytes (1.0 GB) copied, 18.382 s, 57.0 MB/s
1048576000 bytes (1.0 GB) copied, 17.1475 s, 61.2 MB/s
1048576000 bytes (1.0 GB) copied, 15.3853 s, 68.2 MB/s
1048576000 bytes (1.0 GB) copied, 19.3907 s, 54.1 MB/s
1048576000 bytes (1.0 GB) copied, 21.7969 s, 48.1 MB/s
# On the server
iostat -dk 10 | egrep '^(sd|Device)'
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 312.60 30968.40 173.60 309684 1736
sdb 284.80 30978.00 172.00 309780 1720
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 330.40 32506.40 166.00 325064 1660
sdb 280.60 32517.20 152.00 325172 1520
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 224.40 33598.80 51.60 335988 516
sdb 208.20 33604.40 60.80 336044 608
So this time we get around 60 MB/s with about 100 KB per IO operation (note that the server is not totally idle and this is not the only disk activity it sees). With a block size of 2048 bytes this test shows a decreased speed of about 50 MB/s and the number of IO ops per second doubles. A 4k block size gives us an average of 60 MB/s with 50 KB per IO op.
Let's do some write tests:
# On the client
blockdev --getbsz /dev/nbd0
1024
for ((i=0; i<10; i++)); do dd if=/dev/zero of=/dev/nbd0 bs=1M count=1000 oflag=direct 2>&1 | grep bytes ; done
1048576000 bytes (1.0 GB) copied, 10.1818 s, 103 MB/s
1048576000 bytes (1.0 GB) copied, 9.89168 s, 106 MB/s
1048576000 bytes (1.0 GB) copied, 9.73052 s, 108 MB/s
1048576000 bytes (1.0 GB) copied, 9.89912 s, 106 MB/s
1048576000 bytes (1.0 GB) copied, 9.91606 s, 106 MB/s
1048576000 bytes (1.0 GB) copied, 10.0242 s, 105 MB/s
1048576000 bytes (1.0 GB) copied, 9.95247 s, 105 MB/s
1048576000 bytes (1.0 GB) copied, 9.92473 s, 106 MB/s
1048576000 bytes (1.0 GB) copied, 10.0946 s, 104 MB/s
1048576000 bytes (1.0 GB) copied, 10.1183 s, 104 MB/s
# On the server
iostat -dk 10 | egrep '^(sd|Device)'
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 137.80 7.20 51806.80 72 518068
sdb 144.20 1.20 51798.00 12 517980
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 125.70 16.00 52375.20 160 523752
sdb 132.20 4.80 52362.80 48 523628
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 133.20 4.80 52117.60 48 521176
sdb 130.40 5.20 52265.20 52 522652
Write speed is 105 MB/s with about 500 KB per IO operation.
With block sizes of 2k and 4k the results of this test stay the same.
And let's remove the "direct" flag while writing:
# On the client
blockdev --getbsz /dev/nbd0
1024
for ((i=0; i<10; i++)); do dd if=/dev/zero of=/dev/nbd0 bs=1M count=1000 2>&1 | grep bytes ; done
1048576000 bytes (1.0 GB) copied, 9.34019 s, 112 MB/s
1048576000 bytes (1.0 GB) copied, 15.3738 s, 68.2 MB/s
1048576000 bytes (1.0 GB) copied, 15.6453 s, 67.0 MB/s
1048576000 bytes (1.0 GB) copied, 20.3934 s, 51.4 MB/s
1048576000 bytes (1.0 GB) copied, 20.1742 s, 52.0 MB/s
1048576000 bytes (1.0 GB) copied, 19.0891 s, 54.9 MB/s
1048576000 bytes (1.0 GB) copied, 20.4181 s, 51.4 MB/s
1048576000 bytes (1.0 GB) copied, 16.8115 s, 62.4 MB/s
1048576000 bytes (1.0 GB) copied, 18.3555 s, 57.1 MB/s
1048576000 bytes (1.0 GB) copied, 20.0491 s, 52.3 MB/s
# On the server
iostat -dk 10 | egrep '^(sd|Device)'
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 242.30 667.60 28498.00 6676 284980
sdb 261.80 768.00 26874.40 7680 268744
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 236.70 639.60 28760.00 6396 287600
sdb 247.80 653.20 29739.20 6532 297392
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 257.60 760.00 20544.40 7600 205444
sdb 155.30 356.00 21658.40 3560 216584
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 325.80 1026.40 28021.20 10264 280212
sdb 136.60 238.80 26988.80 2388 269888
We see a decreased write speed of around 50-60 MB/s and once again about 100 KB per IO operation.
The results are pretty much the same with a block size of 2048 bytes.
Increasing the block size to 4k, though, raises the transfer speed to about 100 MB/s and gives a nice 500 KB per IO request.
Next: summarize the above results in a table and test with real files and a filesystem.
Block size | Sequential read | Sequential read + direct | Sequential write | Sequential write + direct |
1k | 60 MB/s / 100 KB | 75 MB/s / 100 KB | 55 MB/s / 100 KB | 105 MB/s / 500 KB |
2k | 50 MB/s / 50 KB | 75 MB/s / 100 KB | 55 MB/s / 100 KB | 105 MB/s / 500 KB |
4k | 50 MB/s / 50 KB | 55 MB/s / 100 KB | 100 MB/s / 500 KB | 105 MB/s / 500 KB |
(each cell is throughput / average size per IO request)
Securing who can access the device is a different story though. The server implementation does not support any authentication. Well, it does support IP-based ACLs, but that does not mean much since in most configurations IP addresses can easily be spoofed. I don't see much point in putting such an ACL in the server, as it can be implemented more easily and reliably in the firewall.
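A minimal sketch of what I mean, on the server, with 192.0.2.10 standing in for the only client allowed to reach the export on port 99:
root@server:/# iptables -A INPUT -p tcp --dport 99 -s 192.0.2.10 -j ACCEPT
root@server:/# iptables -A INPUT -p tcp --dport 99 -j DROP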
So if you want/need security with NBD you should:
nbd-server in Debian testing (as of 100110) does not support SDP (Socket Direct Protocol), so TCP/IP is used for the tests. SDP is claimed to offer better performance.
I've also read that NBD does not handle connection problems particularly well.
DST stands for Distributed STorage
Resources:
Merged in the (then recent) 2.6.30 kernel. Update: unfortunately it was removed as of the 2.6.33 kernel.
As far as I can see from various resources, it is implemented as an alternative to NBD and iSCSI.
Its author (Evgeniy Polyakov) looks like a good hacker, and when a good hacker feels he has to come up with a new implementation, there is probably something wrong with the old ones.
Performance tests done by the DST author show that AoE performs better though, so AoE is probably the first thing I will try.
DST looks like the second option I will try, as I also plan to implement a similar backup solution in a distributed environment over insecure channels.
Notes:
ATA over Ethernet
Resources:
Notes:
AoE works directly at layer 2 (Data Link - Ethernet), bypassing the processing overhead of the upper layers (IP, TCP/UDP).
This is a candidate for a performance boost, but it also has some drawbacks.
E.g. it cannot easily be passed through routers. Even if Ethernet-in-IP tunneling is used, fragmentation will likely occur, which will probably slow things down. It looks suitable for use within the data center, where performance is needed and the client and the server are either directly connected or interconnected via a good switch supporting jumbo frames.
The AoE protocol is insecure by design and it is stateless.
So if we want security we should use some additional measures.
Security of the storage
To guarantee the security of the storage we could think of some sort of isolation of the path.
Several options come to my mind:
The first one is, of course, the most secure (switches can also be penetrated).
MAC filtering can easily be circumvented: if you do the filtering only on the server, any other host within the network can be reconfigured to impersonate a client.
Path isolation guarantees that a breach of another host in the same LAN segment will not compromise the storage.
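If MAC filtering is still wanted as an extra hurdle, it is better done on the AoE target itself. vblade appears to have a -m option taking a list of allowed initiator MACs (I have not verified it on the version used here - check vblade(8) before relying on it). With a placeholder MAC and the export used in the tests below, that would look roughly like:
root@server:/# vblade -m 00:11:22:33:44:55 6 0 eth1 /dev/VGNAME/nbd6.0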
Security of the data
Data security is another topic. Although a man-in-the-middle attack does not look too probable within the data center, you might prefer to be paranoid (or you might simply have a different setup requiring it). In that case you can always add an additional layer of encryption on the client, at the cost of more CPU cycles and probably slightly increased latency.
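A sketch of that extra layer using dm-crypt/LUKS on the client, on top of an imported AoE device like the /dev/etherd/e6.0 used in the tests below (the mapping name and mount point are arbitrary):
root@client:/# cryptsetup luksFormat /dev/etherd/e6.0
root@client:/# cryptsetup luksOpen /dev/etherd/e6.0 secure_backup
root@client:/# mkfs.ext4 /dev/mapper/secure_backup
root@client:/# mount /dev/mapper/secure_backup /mnt/secure
The server then only ever sees ciphertext, and the key never leaves the client.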
One additional aspect bugged me.
What if a user account on the client host gets compromised? Could it be used to run an AoE client in userspace to gain access to the data?
Thankfully no. Access to the server is done via raw sockets and a dedicated ethertype, and creating raw sockets under Linux requires the CAP_NET_RAW privilege, which is usually granted only to root.
Both machines are Dell PowerEdge R200:
1U
1 Intel Xeon CPU X3320 @ 2.50GHz with 4 cores
4 GB of memory.
Debian GNU/Linux testing/Squeeze
2.6.30-2-686-bigmem kernel package
2 x Broadcom NetXtreme BCM5721 ( 1Gbit, No jumbo frame support )
2 HDDs each of them being:
Model Family: Seagate Barracuda ES.2
Device Model: ST3750330NS
Firmware Version: SN05
User Capacity: 750,156,374,016 bytes
The servers are connected via a dedicated wire.
The network interfaces are at:
root@client:/# ethtool eth1
Settings for eth1:
Speed: 1000Mb/s
Duplex: Full
Port: Twisted Pair
Auto-negotiation: on
Link detected: yes
Neither system was completely idle during the tests.
Here goes the block device export:
root@server:/# lvcreate --verbose --size 500G --name nbd6.0 VGNAME /dev/md8 /dev/md9
root@server:/# vblade 6 0 eth1 /dev/VGNAME/nbd6.0 2>&1
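Note that vblade runs in the foreground. For anything longer than a quick test, the package (if I read it right) also ships a vbladed wrapper that pushes vblade into the background and sends its output to syslog:
root@server:/# vbladed 6 0 eth1 /dev/VGNAME/nbd6.0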
md8 is soft raid0 (striping) over 2x150 GB partitions at the end of the HDDs. So is md9. Two physical HDDs are used in total. The soft raid is added for performance; the partitioning is done for easier relocation of parts of the space.
Having the partitions at the end of the drives costs roughly a 1.5x to 2x performance penalty for sequential operations. This is due to the geometry of rotating hard drives: inner tracks have a smaller radius and thus circumference, while outer tracks are longer and are divided into more sectors, so more sectors pass under the head per revolution on the outer tracks.
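For reference, such a layout can be assembled roughly like this (the partition names below are invented for the example and are not the real ones):
root@server:/# mdadm --create /dev/md8 --level=0 --raid-devices=2 /dev/sda5 /dev/sdb5
root@server:/# mdadm --create /dev/md9 --level=0 --raid-devices=2 /dev/sda6 /dev/sdb6
root@server:/# pvcreate /dev/md8 /dev/md9
root@server:/# vgcreate VGNAME /dev/md8 /dev/md9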
The performance I was able to get from this raid on the server looks like:
root@server:/# hdparm -tT /dev/VGNAME/nbd6.0
/dev/VGNAME/nbd6.0:
Timing cached reads: 4146 MB in 2.00 seconds = 2073.21 MB/sec
Timing buffered disk reads: 408 MB in 3.00 seconds = 135.88 MB/sec
Here goes the setup on the client side:
root@client:/# cat /etc/default/aoetools
INTERFACES="eth1"
LVMGROUPS=""
AOEMOUNTS=""
root@client:/# /etc/init.d/aoetools restart
Starting AoE devices discovery and mounting AoE filesystems: Nothing to mount.
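aoetools also ships a couple of helpers for poking at the exports: aoe-discover triggers a new discovery round and aoe-stat lists the devices that were found.
root@client:/# aoe-discover
root@client:/# aoe-stat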
At this point /dev/etherd was populated and it was time for some tests.
root@client:/# hdparm -tT /dev/etherd/e6.0
/dev/etherd/e6.0:
Timing cached reads: 3620 MB in 2.00 seconds = 1810.16 MB/sec
Timing buffered disk reads: 324 MB in 3.01 seconds = 107.63 MB/sec
So... wow!
I was not expecting such performance. My hopes were for around 50 MB/s at most. At this point I was wondering whether the bottleneck was on the server side, since several of my hdparm invocations on the server showed a performance of just around 80 MB/s (probably at times of some server load).
So let's create an in-memory (and sparse) file and export it:
root@server:/# dd if=/dev/zero of=/dev/shm/6.1 bs=1M count=1 seek=3071
root@server:/# vblade 6 1 eth1 /dev/shm/6.1
The /dev/etherd/e6.1 device was created on the client automagically.
Let's do the tests once again:
root@client:/# hdparm -tT /dev/etherd/e6.1
/dev/etherd/e6.1:
Timing cached reads: 4006 MB in 2.00 seconds = 2003.68 MB/sec
Timing buffered disk reads: 336 MB in 3.00 seconds = 111.85 MB/sec
Not much difference, so I guess I was lucky and hit the top on my first try.
Let's also try a sequential write test:
root@client:/# dd if=/dev/zero of=/dev/etherd/e6.1 bs=1M count=1024 conv=sync,fsync
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 10.0311 s, 107 MB/s
At the time of the tests the maximum network utilization reported by nload on the client was around 890 Mbits outgoing and 950 Mbits incoming. On the server it was 950 outgoing and 1330 (???) Mbits incoming.
/proc/net/dev on both the server and the client showed no errors or packet drops before or after the tests.
I'm pleased to say that I'm astonished by the performance results from the isolated tests. A read/write speed of around 110-115 MB/s is more than enough for me, given that the theoretical maximum is around 125 MB/s (before excluding the Ethernet frame overhead). The CPU utilization of the vblade server process was around 50% of one core, which is 1/8 of the available CPU resources. This also sounds pretty good to me. I did not bother measuring the CPU utilization on the client, as the work happens inside the kernel (with no dedicated thread to follow). The tests were performed multiple times to verify the results.
Unfortunately, I started observing decreased write performance with AoE during real-world tests. At first I blamed NILFS, but when I did the tests with ext4 the problem appeared again. So I first tested the network throughput, which proved to be fine, and then did write tests (dd if=/dev/zero of=/dev/etherd/e6.0) against the AoE device again. This time I observed peaks and drops on the traffic graphs, with the bandwidth utilization swinging between 10 and 900 Mbits. Sometimes it started fast, other times it ended fast, but the sustained rate was about 100-120 Mbits. I tried various block sizes and tuned some kernel parameters with no real improvement. Searching the net showed that others also had write performance issues with AoE. This nice document - http://www.massey.ac.nz/~chmessom/APAC2007.pdf - shows that the most likely cause is the lack of jumbo frame support on the network interfaces I use. On the other hand it also shows that other protocols (e.g. iSCSI) can perform a lot better at a 1500-byte MTU. So I wonder whether the problem is in the AoE protocol or in the software implementation. I cannot easily switch jumbo frames on, and there are no alternative AoE client implementations, so I guess it is time to test ggaoed.
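For reference, on NICs (and switches) that do support jumbo frames, enabling them should be just a matter of raising the MTU on both ends before starting vblade - something along these lines, which I could not test with this hardware:
root@server:/# ip link set dev eth1 mtu 9000
root@client:/# ip link set dev eth1 mtu 9000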
Fibre Channel over Ethernet
Resources:
root@client:/# mkfs -v -t nilfs2 -L nbd6.0 /dev/etherd/e6.0
FS creation took about 16 minutes for a 500 GB file system (with the above setup) and actually created an ext2 file system!!! So let's try again:
root@client:/# time mkfs.nilfs2 -L nbd6.0 /dev/etherd/e6.0
mkfs.nilfs2 ver 2.0
Start writing file system initial data to the device
Blocksize:4096 Device:/dev/etherd/e6.0 Device Size:536870912000
File system initialization succeeded !!
real 0m0.122s
user 0m0.000s
sys 0m0.008s
Well, quite a bit better - about (16 * 60) / 0.122 ≈ 7869 times faster.
root@client:/# mount -t nilfs2 /dev/etherd/e6.0 /mnt/protected/nbd6.0
mount.nilfs2: WARNING! - The NILFS on-disk format may change at any time.
mount.nilfs2: WARNING! - Do not place critical data on a NILFS filesystem.
root@client:/# df | grep etherd
/dev/etherd/e6.0 500G 16M 475G 1% /mnt/protected/nbd6.0
Two things to notice here: first, there is no initial file system overhead of several gigabytes as with ext2/3, and second, the missing 25 GB are the 5% reserved space (see mkfs.nilfs2).
On the bad side: I tried to fill the file system with data. After the first 70-80 GB I noticed that things were getting pretty slow (network interface utilization of about 50 Mbits) and decided to do FS benchmarks. The throughput I was able to achieve was 5-10 MB/s for sequential writes. Pretty disappointing. I also tried to tune /etc/nilfs_cleanerd.conf by increasing cleaning_interval from 5 seconds to half an hour and nsegments_per_clean from 2 to 800. Unfortunately this did not produce any measurable speedup.
I also observed a network utilization of about 30 Mbits in each direction while the FS was idle. Unmounting it stopped the traffic; remounting it brought the traffic back. So I assumed the cleaner process was doing its business after my "unconsidered" over-increase of the parameters. Sadly, the traffic was still there several hours later.
Additionally, the number of checkpoints kept increasing without any file system activity (contrary to what the docs state).
I don't need the automatic checkpoint feature at all, but the docs did not show me a way to disable it. Doing a manual "mkcp -s" and "rmcp" later will do the job for my needs (see the sketch below). I guess this also makes cleanerd obsolete for my use case.
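The manual checkpoint handling I have in mind looks roughly like this (see mkcp(8), lscp(8) and rmcp(8) from nilfs-utils for the exact syntax; the checkpoint number passed to rmcp is just an example):
root@client:/# mkcp -s /dev/etherd/e6.0
root@client:/# lscp /dev/etherd/e6.0
root@client:/# rmcp /dev/etherd/e6.0 2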
Anyway, I will try to contact the NILFS maintainers and the community to see if anyone has a cure.
I could also implement a different solution, e.g. putting LVM on top of the AoE device and using the LVM snapshotting feature, but I would really like to give NILFS the chance it deserves.