It was time to upgrade from Debian Buster to Bullseye.
The new stable Debian came out this summer. I was in no particular hurry to upgrade, but it was on my TODO list.
First I upgraded the laptop, where there were no problems, but there isn't much configured on it anyway, since I use it rarely.
My desktop, though, was a complex mix:
It serves 3 seats with three video cards, each with several monitors, its own keyboard and mouse, sound cards, printers and so on.
There are active Docker and LXC containers. On top of that there are VirtualBox, systemd-nspawn, SnapD... Over the years all sorts of network configurations have been tested on it: bridges, firewalls, proxies, load balancers, web and file servers, file systems...
The mixed system was held together by a complex combination of apt_preferences, holds and sources.list entries.
I'd say the upgrade went very smoothly; the things I expected trouble from (the binary NVidia drivers, root on ZFS, KDE, multi-seat) went fine.
The only non-obvious problem that came up (at this stage at least) was with akonadiserver.
It kept crashing even after I completely wiped its database and configuration.
The problem turned out to be a newly added apparmor policy, which assumed its files live on the same partition as the HOME directory, while I have configured a separate per-user VOLATILE partition for caches and the like, so they don't bloat my ZFS snapshots. That policy denied akonadiserver access to the files it needed, so it crashed on every start attempt.
So the quick fix was:
aa-complain /etc/apparmor.d/usr.bin.akonadiserver
Proper planning made reverting the configuration easy, with a single zfs rollback to the snapshot I took before the upgrade.
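The flow is simple; a sketch, assuming a hypothetical root dataset name (rpool/ROOT/debian here is an example, not the actual layout):

```shell
# Before the upgrade: snapshot the root dataset
# (rpool/ROOT/debian is an example name - adjust to your layout)
zfs snapshot rpool/ROOT/debian@pre-bullseye

# Confirm it is there
zfs list -t snapshot rpool/ROOT/debian

# If the upgrade goes wrong: revert to the snapshot
# (-r destroys any snapshots taken after @pre-bullseye)
zfs rollback -r rpool/ROOT/debian@pre-bullseye
```

Note that zfs rollback operates on a single dataset; with multiple datasets, each one has to be rolled back individually.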
The upgrade took a few hours, maybe 5-6, most of them spent because I insisted on closely watching what was happening and made sure not to skip steps that could later cost me many times more lost time.
I also went through and cleaned up various old packages and configurations.
Add tags like these to your server:
KEXEC_KERNEL=http://mirror.scaleway.com/kernel/armv7l-mainline-lts-4.9-4.9.93-rev1/vmlinuz
KEXEC_INITRD=http://mirror.scaleway.com/initrd/uInitrd-Linux-armv7l-v3.14.6
KEXEC_APPEND=vmalloc=512M
ScaleWay's "BareMetal" C1 instance is a cheap EUR 3/month cloud infrastructure instance. It has:
ScaleWay offers two lines of servers:
One important difference between the two is that:
Another important difference is that, counter-intuitively, currently in the ScaleWay infrastructure:
Thus a problem arises when you need to change something.
My case was that I wanted to use ZFS, and it is not included in the official Linux kernel; it is instead built as a module. On standard Debian this is easily done by installing the zfs-dkms package.
It is possible to build the module for the C1 instance kernel by preparing the build environment as described here:
The problem was that ZFS on 32-bit Linux:
which is officially stated here:
I'm still to see the former, but I hit the latter quite fast, and, as recommended, I had to add the vmalloc=512M boot parameter.
Unfortunately Scaleway does not support passing parameters to their kernels.
They do, however, support KEXEC via the KEXEC_KERNEL and KEXEC_INITRD params, as documented here:
and they support passing parameters to the KEXEC-ed kernel via the KEXEC_APPEND param.
So I just needed to boot the same kernel and pass the parameter. First I had to find where the current kernel and initrd are. This is done by installing "scaleway-cli":
I just grabbed the pre-built amd64 deb packages, and then used the "scw" command to get info about the instance:
# list servers
$ scw ps
# Show instance details
$ scw inspect SERVER_ID
"bootscript": {
"bootcmdargs": "LINUX_COMMON scaleway boot=local nbd.max_part=16",
"initrd": "initrd/uInitrd-Linux-armv7l-v3.14.6",
"kernel": "kernel/armv7l-mainline-lts-4.9-4.9.93-rev1",
"dtb": "dtb/c1-armv7l-mainline-lts-4.9-4.9.93-rev1",
...
If you inspect a VM instance, you will see that the kernel and initrd are referenced by IP:
"bootscript": {
"bootcmdargs": "LINUX_COMMON scaleway boot=local nbd.max_part=16",
"initrd": "http://169.254.42.24/initrd/initrd-Linux-x86_64-v3.14.6.gz",
"kernel": "http://169.254.42.24/kernel/x86_64-mainline-lts-4.4-4.4.127-rev1/vmlinuz-4.4.127"
A Google search showed that the kernel and the initrd were available at:
I had a problem when trying to use the image referenced in the params above:
# DO NOT USE THIS ONE
KEXEC_INITRD=http://mirror.scaleway.com/initrd/uInitrd-Linux-armv7l-v3.14.6
and I wasted a couple of hours until I realized that this image was in a different format, not usable as KEXEC_INITRD. Then I changed it to:
KEXEC_INITRD=http://mirror.scaleway.com/initrd/initrd-Linux-armv7l-v3.14.6.gz
and this time it worked fine.
The kernel can be found via at least two different URLs:
KEXEC_KERNEL=http://mirror.scaleway.com/kernel/armv7l-mainline-lts-4.9-4.9.93-rev1/vmlinuz
http://mirror.scaleway.com/kernel/armv7l/4.9.93-mainline-rev1/vmlinuz
After the successful boot I just had to add:
KEXEC_APPEND=vmalloc=512M
And my ZFS module was no longer complaining about lack of virtual memory.
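A quick way to confirm the appended parameter actually took effect after the KEXEC boot (a sketch; exact values vary by kernel):

```shell
# The kernel command line should now contain the appended parameter
cat /proc/cmdline

# On 32-bit, VmallocTotal should reflect the enlarged vmalloc area
grep Vmalloc /proc/meminfo
```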
Let me add a few articles that were helpful:
I've wasted about a day investigating this stuff. If you found it helpful and think I might have saved you a couple of hours, you can send a small donation to this PayPal e-mail: krustev-paypal@krustev.net
NOTE: Adobe Flash Player 11.2 will be the last version to target Linux as a supported platform. Adobe will continue to provide security backports to Flash Player 11.2 for Linux.
I recently got a nice webcam - a Logitech C600.
The supported camera outputs are:
The best video quality is in the YUYV mode; however, it uses little (or no) compression, so the high frame rates are only available at lower resolutions: 640x480@30 fps and 800x600@25 fps.
Strangely, the webcam does some cropping when used at high resolutions and high frame rates. The pan/tilt controls are only usable in this crop mode. Skype also switches to one of the crop modes after e.g. 30 seconds of a call (I'm using the Skype option to capture at 640x480, which it probably uses initially).
Useful software:
GUVCview is able to show what your webcam can do. You can easily switch resolutions, frame rates and camera output formats. It can record video in different formats and capture still images. All the V4L2 settings your camera supports can be changed. By default it presents a preview screen, so you can see how changing a setting affects the captured video. The actual frames per second are also displayed in the video preview window. You can also use it as a camera control application while the capture is done by another app (e.g. Skype). Just start it like:
guvcview -o
Another very nice feature is that you can capture video with sound. You can easily choose which mic to use: the camera's built-in one or the one sitting on your desktop.
It is a good idea to keep an eye on the processor load (and on the terminal window) while capturing. Some formats use the CPU heavily, and video/audio can easily get out of sync.
MPlayer is usable for fast preview. To play video with mplayer you can just do:
mplayer tv://
or give it some more options:
mplayer -tv driver=v4l2:input=0:width=640:height=480:device=/dev/video0
It appeared hard to get mencoder to capture the video right, especially when the camera switches frame rates during capture. Mine does that when the "Exposure auto priority" option is checked. I was not able to get mplayer to play video and audio at the same time either, though maybe I did not try hard enough. VLC, on the other hand, can do this.
VLC needs to know the video and sound devices when you open a capture device. I've specified them as:
/dev/video0 (the webcam)
hw:1 (or hw:1,0) (this was my webcam mic)
You can list your capture devices by:
arecord -l
VLC output is a little laggy in comparison to the mplayer or guvcview preview windows. I was able to fix this by specifying a smaller buffer time (300 ms by default); however, on a later try this did not work. I haven't played with VLC enough either. As you might know, it is quite powerful - maybe the most mature GUI video player available for Linux. I still use mplayer from the command line for video playback, though, and haven't found a reason to replace it with anything else :-)
v4l2ucp is covered by the "Image control" tab of guvcview. Luvcview looks like an older version of its big brother. You can get the list of video modes your camera supports with:
luvcview -L
Another program which I've barely tried is the popular "cheese".
The ultimate webcam software for Linux is GUVCview.
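For a pure command-line alternative, v4l2-ctl from the v4l-utils package can also enumerate what the camera offers; a sketch (assumes the camera sits at /dev/video0):

```shell
# List supported pixel formats, frame sizes and frame rates
v4l2-ctl --device=/dev/video0 --list-formats-ext

# List the available V4L2 controls (brightness, exposure, pan/tilt, ...)
v4l2-ctl --device=/dev/video0 --list-ctrls

# Change one of them, e.g. brightness
v4l2-ctl --device=/dev/video0 --set-ctrl=brightness=128
```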
Some extra commands to test sound from your webcam mic:
$ arecord -l
**** List of CAPTURE Hardware Devices ****
card 0: Intel [HDA Intel], device 0: ALC883 Analog [ALC883 Analog]
Subdevices: 1/1
Subdevice #0: subdevice #0
card 0: Intel [HDA Intel], device 2: ALC883 Analog [ALC883 Analog]
Subdevices: 1/1
Subdevice #0: subdevice #0
card 1: U0x46d0x808 [USB Device 0x46d:0x808], device 0: USB Audio [USB Audio]
Subdevices: 1/1
Subdevice #0: subdevice #0
# No sound here
$ arecord -D hw:U0x46d0x808,0 | aplay
Recording WAVE 'stdin' : Unsigned 8 bit, Rate 8000 Hz, Mono
arecord: set_params:1065: Sample format non available
Available formats:
- S16_LE
aplay: playback:2467: read error
# This played the sound. Note that some of the times I started a command
# the sound did not show up. Next time I've tried it it did. The same was
# true for VLC sound capture tests. So I guess the device is not
# always initialized right.
$ arecord -D hw:U0x46d0x808,0 -f S16_LE | aplay
Recording WAVE 'stdin' : Signed 16 bit Little Endian, Rate 8000 Hz, Mono
Warning: rate is not accurate (requested = 8000Hz, got = 16000Hz)
please, try the plug plugin
Playing WAVE 'stdin' : Signed 16 bit Little Endian, Rate 16000 Hz, Mono
^CAborted by signal Interrupt...
Aborted by signal Interrupt...
# Specify the proper rate
$ arecord -D hw:U0x46d0x808,0 -f S16_LE -r 16 | aplay
Recording WAVE 'stdin' : Signed 16 bit Little Endian, Rate 16000 Hz, Mono
Playing WAVE 'stdin' : Signed 16 bit Little Endian, Rate 16000 Hz, Mono
^CAborted by signal Interrupt...
Aborted by signal Interrupt...
# Use mmap instead of read:
$ arecord -D hw:U0x46d0x808,0 -f S16_LE -r 16 -M | aplay
Recording WAVE 'stdin' : Signed 16 bit Little Endian, Rate 16000 Hz, Mono
Playing WAVE 'stdin' : Signed 16 bit Little Endian, Rate 16000 Hz, Mono
underrun!!! (at least -1900024418,571 ms long)
^CAborted by signal Interrupt...
Aborted by signal Interrupt...
$
Another note here is that kmix was not always showing the webcam mic. Sometimes it showed unplug events without me actually touching the camera, and thus the webcam mic became unmanageable with it. So alsamixer was my friend.
Links:
I think I will call 3.11 Linux for Workgroups.
... application developers are very important. They're not "real men" like kernel developers, he says, but still are "necessary" for Linux to succeed.
Linus Torvalds
http://www.linux.com/news/enterprise/biz-enterprise/485159-a-conversation-with-linus-torvalds
Happy System Administrators' Day! :-)
Thanks to a power outage last night, I had the privilege of updating my desktop:
root@work:/# cat /etc/issue.net
Debian GNU/Linux wheezy/sid
root@work:/# last reboot | head -3
reboot system boot 3.0.0-1-686-pae Fri Jul 29 15:11 - 20:29 (05:17)
reboot system boot 2.6.38-2-686-big Fri Jul 29 13:25 - 15:09 (01:43)
reboot system boot 2.6.38-2-686-big Thu Apr 7 20:41 - 15:09 (112+18:28)
root@work:/# uname -a
Linux work 3.0.0-1-686-pae #1 SMP Sun Jul 24 14:27:32 UTC 2011 i686 GNU/Linux
And even though it's slightly on crutches - happy 20th birthday to Linux, and happy version 3 to us! :-)
Tonight I tried text recognition with various open source tools. The input was images packaged as a PDF. The text in the images was bad-looking, but readable.
To summarize my experience:
None of the tools did the job even close to what I expected. Maybe it was my fault, but I cannot spend a day on it every time I need to do a simple job that I don't even do every month.
In the end I did the job by googling for "Online OCR" and using (guess what?!) http://www.onlineocr.net/ for the first five pages. It had a limit of five pages per hour for non-registered users (and 5 pages total for registered ones), so I registered and OCRed the last, sixth page.
BTW, just to prove my point about not reading enough, I later found this site, http://www.free-ocr.com/, which also did the job and uses one of the tools I had tried - Tesseract.
I hit this about a week ago. The first time I saw it was on my office desktop running Debian unstable. Since I was not doing much Java on it, I decided it was a problem with JConsole. I nearly lost a bet over this:
I was pretty sure JConsole was able to attach to local processes even when they were started without any JMX options enabled. Borislav Tonchev was pretty sure it wasn't. I quickly wrote a Java class with its main method sleeping for 100 seconds and tried to attach to its process. Unfortunately I wasn't able to do so. At that point Borislav walked away with 10 bucks coming out of my pocket.
I was curious enough to check this stuff, and at first it appeared that Java didn't like the bsdgroups option my ext3 /tmp file system was mounted with. Trying the same thing on my home PC, with bsdgroups disabled, showed this java.net.SocketException: Network is unreachable. At this point I was starting to lose ground. I decided to check the docs ( http://java.sun.com/javase/6/docs/technotes/guides/management/jconsole.html ) and they confirmed my point. I checked the documented behavior in a JVM running inside a Windows XP installation I have (a VirtualBox image for the corporate stuff in the office) and Borislav unhappily brought my money back.
At this point I decided the exception under Debian was caused by a bug in JConsole - probably it has not been maintained much in recent releases, as a similar tool, VisualVM, has appeared.
Several days after this long background, on Saturday, I hit the same exception on a production server running Tomcat. Pretty damn strange, and I was not able to figure it out immediately. The actual problem was introduced in Debian at the beginning of December last year, with the netbase package setting:
# cat /etc/sysctl.d/bindv6only.conf
net.ipv6.bindv6only=1
This did not show up on the server immediately, since the netbase upgrade did not apply the new setting. The exception appeared after a restart, almost two months after the upgrade.
The workaround is to set the above back to "0", as it was before, or to add the option -Djava.net.preferIPv4Stack=true to each Java process you start. I prefer the former, as I do not want to configure every Java program (e.g. I use azureus/vuze) manually.
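As a sketch, the two workarounds look like this (the sysctl part needs root; some-app.jar is just a placeholder name):

```shell
# System-wide: revert the netbase change and apply it immediately
echo 'net.ipv6.bindv6only = 0' > /etc/sysctl.d/bindv6only.conf
sysctl -p /etc/sysctl.d/bindv6only.conf

# Per-process: force the IPv4 stack for a single Java program
java -Djava.net.preferIPv4Stack=true -jar some-app.jar
```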
More information can be found in Debian bug #560044.
I was unable to access my E-banking at https://e-fibank.bg. It first happened on my Debian unstable box in the office. A few weeks later it also showed on my home PC running Debian testing.
My observations also showed that all the browsers stopped working at once. I'm using Iceweasel (Firefox) for the e-banking itself. Google Chrome also showed some weird (unknown) SSL error.
This was enough for me to decide that the problem had been caused by a recent package upgrade. I was pretty sure the SSL libraries were to blame, especially given some recent Bugtraq posts about SSL vulnerabilities.
So what I did was:
So the command I came up with was:
ls -rtl /var/lib/dpkg/info/*.list | \
grep 2010-01-02 | awk '{print $8}' | \
cut -d / -f 6 | \
cut -d . -f 1 | \
sort | \
egrep \
`dpkg -s google-chrome-unstable | \
grep Depends | \
tr ',' '\n' | \
grep '^ ' | \
awk '{print $1}' | \
xargs echo | tr ' ' '|'`
So this showed two things only:
libfontconfig1
libnss3-1d
I was not familiar with libnss3, but looking at its package description (SSL-related) was enough for me to blame it. So I checked the aptitude logs:
# grep libnss /var/log/aptitude
[UPGRADE] libnss3-1d 3.12.4-1 -> 3.12.5-1
and saw which older version I had been using. I then checked /var/cache/apt/archives, and it was just sitting there waiting to be restored:
dpkg -i /var/cache/apt/archives/libnss3-1d_3.12.4-1_i386.deb
Then I restarted Iceweasel and voila...
I then also checked the Debian bug reports to see if this had already been reported or was waiting for me to do it. This bug report showed up:
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=561918
Hope this saves you some time ..
I'm in the process of implementing a backup service with these major requirements:
and the additional wishlist:
Of course the Perl motto "there is more than one way to do it" is valid for the major goals.
E.g. the external part could be done via:
and the time machine could be done via:
My current idea is to use:
So I'm on a hunt for the:
Resources:
Requirements
All the protocols listed below should be interchangeable. I might do some benchmarks at a later stage.
SCSI over internet
Network Block Device
Exporting a device via NBD is a matter of:
root@server:/# apt-get install nbd-server
root@server:/# cat /etc/nbd-server/config
[generic]
[export0]
exportname = /dev/mapper/vg0-nbd6.0
port = 99
root@server:/# /etc/init.d/nbd-server restart
And importing it on a client is:
root@client:/# apt-get install nbd-client
root@client:/# grep -v '^#' /etc/nbd-client
AUTO_GEN="n"
KILLALL="true"
NBD_DEVICE[0]=/dev/nbd0
NBD_TYPE[0]=r
NBD_HOST[0]=SERVER-HOSTNAME
NBD_PORT[0]=99
root@client:/# /etc/init.d/nbd-client restart
You might want to check the manual pages in the respective packages for more configuration options and tweaks. E.g. the nbd-client init script has a feature to auto-mount file systems.
By default, nbd-client creates a block device with a block size of 1024 bytes:
# On the client
blockdev --getbsz /dev/nbd0
1024
for ((i=0; i<10; i++)); do dd if=/dev/nbd0 of=/dev/null bs=1M count=1000 iflag=direct 2>&1 | grep bytes ; done
1048576000 bytes (1.0 GB) copied, 12.8387 s, 81.7 MB/s
1048576000 bytes (1.0 GB) copied, 14.1621 s, 74.0 MB/s
1048576000 bytes (1.0 GB) copied, 14.1721 s, 74.0 MB/s
1048576000 bytes (1.0 GB) copied, 15.6536 s, 67.0 MB/s
1048576000 bytes (1.0 GB) copied, 15.1352 s, 69.3 MB/s
1048576000 bytes (1.0 GB) copied, 15.5831 s, 67.3 MB/s
1048576000 bytes (1.0 GB) copied, 14.3358 s, 73.1 MB/s
1048576000 bytes (1.0 GB) copied, 15.256 s, 68.7 MB/s
1048576000 bytes (1.0 GB) copied, 13.9433 s, 75.2 MB/s
1048576000 bytes (1.0 GB) copied, 13.0245 s, 80.5 MB/s
# On the server
iostat -dk 10 | egrep '^(sd|Device)'
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 365.70 32194.80 380.00 321948 3800
sdb 316.20 31760.40 319.20 317604 3192
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 361.80 39333.20 281.20 393332 2812
sdb 323.20 39295.20 260.80 392952 2608
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 325.20 35762.80 238.40 357628 2384
sdb 274.90 35794.40 201.20 357944 2012
To summarize, we get a throughput of about 70-80 MB/s, and the server reads about 100 KB in each request. The results are pretty much the same with a 2048-byte block size.
A 4k block size drops the transfer rate to 55 MB/s and keeps the 100 KB per IO op rate.
Let's remove the "direct" flag from dd:
# On the client
blockdev --getbsz /dev/nbd0
1024
for ((i=0; i<10; i++)); do dd if=/dev/nbd0 of=/dev/null bs=1M count=1000 2>&1 | grep bytes ; done
1048576000 bytes (1.0 GB) copied, 14.5043 s, 72.3 MB/s
1048576000 bytes (1.0 GB) copied, 18.6863 s, 56.1 MB/s
1048576000 bytes (1.0 GB) copied, 15.6981 s, 66.8 MB/s
1048576000 bytes (1.0 GB) copied, 15.8664 s, 66.1 MB/s
1048576000 bytes (1.0 GB) copied, 16.7602 s, 62.6 MB/s
1048576000 bytes (1.0 GB) copied, 18.382 s, 57.0 MB/s
1048576000 bytes (1.0 GB) copied, 17.1475 s, 61.2 MB/s
1048576000 bytes (1.0 GB) copied, 15.3853 s, 68.2 MB/s
1048576000 bytes (1.0 GB) copied, 19.3907 s, 54.1 MB/s
1048576000 bytes (1.0 GB) copied, 21.7969 s, 48.1 MB/s
# On the server
iostat -dk 10 | egrep '^(sd|Device)'
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 312.60 30968.40 173.60 309684 1736
sdb 284.80 30978.00 172.00 309780 1720
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 330.40 32506.40 166.00 325064 1660
sdb 280.60 32517.20 152.00 325172 1520
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 224.40 33598.80 51.60 335988 516
sdb 208.20 33604.40 60.80 336044 608
So this time we get around 60 MB/s with a 100 KB per IO operation ratio (note that the server is not totally idle, and this is not the only disk activity it sees). With a block size of 2048 bytes this test shows a decreased speed of about 50 MB/s, and the number of IO ops per second doubles. A 4k block size gives us an average of 60 MB/s with 50 KB per IO op.
Let's do some write tests:
# On the client
blockdev --getbsz /dev/nbd0
1024
for ((i=0; i<10; i++)); do dd if=/dev/zero of=/dev/nbd0 bs=1M count=1000 oflag=direct 2>&1 | grep bytes ; done
1048576000 bytes (1.0 GB) copied, 10.1818 s, 103 MB/s
1048576000 bytes (1.0 GB) copied, 9.89168 s, 106 MB/s
1048576000 bytes (1.0 GB) copied, 9.73052 s, 108 MB/s
1048576000 bytes (1.0 GB) copied, 9.89912 s, 106 MB/s
1048576000 bytes (1.0 GB) copied, 9.91606 s, 106 MB/s
1048576000 bytes (1.0 GB) copied, 10.0242 s, 105 MB/s
1048576000 bytes (1.0 GB) copied, 9.95247 s, 105 MB/s
1048576000 bytes (1.0 GB) copied, 9.92473 s, 106 MB/s
1048576000 bytes (1.0 GB) copied, 10.0946 s, 104 MB/s
1048576000 bytes (1.0 GB) copied, 10.1183 s, 104 MB/s
# On the server
iostat -dk 10 | egrep '^(sd|Device)'
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 137.80 7.20 51806.80 72 518068
sdb 144.20 1.20 51798.00 12 517980
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 125.70 16.00 52375.20 160 523752
sdb 132.20 4.80 52362.80 48 523628
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 133.20 4.80 52117.60 48 521176
sdb 130.40 5.20 52265.20 52 522652
Write speed is 105 MB/s with about 500 KB per IO operation.
With block sizes of 2k and 4k the results of this test stay the same.
And let's remove the "direct" flag while writing:
# On the client
blockdev --getbsz /dev/nbd0
1024
for ((i=0; i<10; i++)); do dd if=/dev/zero of=/dev/nbd0 bs=1M count=1000 2>&1 | grep bytes ; done
1048576000 bytes (1.0 GB) copied, 9.34019 s, 112 MB/s
1048576000 bytes (1.0 GB) copied, 15.3738 s, 68.2 MB/s
1048576000 bytes (1.0 GB) copied, 15.6453 s, 67.0 MB/s
1048576000 bytes (1.0 GB) copied, 20.3934 s, 51.4 MB/s
1048576000 bytes (1.0 GB) copied, 20.1742 s, 52.0 MB/s
1048576000 bytes (1.0 GB) copied, 19.0891 s, 54.9 MB/s
1048576000 bytes (1.0 GB) copied, 20.4181 s, 51.4 MB/s
1048576000 bytes (1.0 GB) copied, 16.8115 s, 62.4 MB/s
1048576000 bytes (1.0 GB) copied, 18.3555 s, 57.1 MB/s
1048576000 bytes (1.0 GB) copied, 20.0491 s, 52.3 MB/s
# On the server
iostat -dk 10 | egrep '^(sd|Device)'
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 242.30 667.60 28498.00 6676 284980
sdb 261.80 768.00 26874.40 7680 268744
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 236.70 639.60 28760.00 6396 287600
sdb 247.80 653.20 29739.20 6532 297392
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 257.60 760.00 20544.40 7600 205444
sdb 155.30 356.00 21658.40 3560 216584
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 325.80 1026.40 28021.20 10264 280212
sdb 136.60 238.80 26988.80 2388 269888
We see a decreased write speed of around 50-60 MB/s, and once again about 100 KB per IO operation.
The results are pretty much the same with block size of 2048 bytes.
Increasing the block size to 4k, though, raises the transfer speed to about 100 MB/s and gives a nice 500 KB per IO request.
Next: Summarize the above results in a nice table and test with real files and filesystem
Block size | Sequential read | Sequential read + direct | Sequential write | Sequential write + direct |
1k | 60 MB/s, 100 KB/IO | 75 MB/s, 100 KB/IO | 55 MB/s, 100 KB/IO | 105 MB/s, 500 KB/IO |
2k | 50 MB/s, 50 KB/IO | 75 MB/s, 100 KB/IO | 55 MB/s, 100 KB/IO | 105 MB/s, 500 KB/IO |
4k | 50 MB/s, 50 KB/IO | 55 MB/s, 100 KB/IO | 100 MB/s, 500 KB/IO | 105 MB/s, 500 KB/IO |
Securing who can access the device is a different story, though. The server implementation does not support any authentication. Well, it does support IP-based ACLs, but that is worth little, since in most configurations IP addresses can easily be spoofed. I don't see much point in putting such ACLs in the server, as they can be more easily and reliably implemented in the firewall.
So if you want/need security with NBD you should:
nbd-server in Debian testing (as of 2010-01-10) does not support SDP (Sockets Direct Protocol), so TCP/IP is used for the tests. SDP is claimed to offer better performance.
I've read somewhere that NBD is not particularly good in case of connection problems.
DST stands for Distributed STorage
Resources:
Merged in the (then recent) 2.6.30 kernel. Update: unfortunately, it was removed as of the 2.6.33 kernel.
As far as I can see from various resources, it is implemented as an alternative to NBD and iSCSI.
Its author (Evgeniy Polyakov) looks like a good hacker, and when a good hacker feels he has to come up with a new implementation, there must be something wrong with the old one.
Performance tests done by the DST author show that AoE performs better, though, so AoE is probably the first thing I will try.
DST looks like the second option to try, as I also plan to implement a similar backup solution in a distributed environment over insecure channels.
Notes:
ATA over Ethernet
Resources:
Notes:
AoE works at layer 2 (Data Link - Ethernet) directly, bypassing the processing overhead of the upper layers (IP, TCP/UDP).
This is a candidate for a performance boost but it also has some drawbacks.
E.g. it cannot easily be passed through routers. Even if Ethernet-in-IP tunneling is used, IP fragmentation will likely occur, which will probably slow things down. It looks suitable for use within the data center, where performance is needed and the client and the server are either directly connected or interconnected via a good switch supporting jumbo frames.
The AoE protocol is insecure by design, and it is stateless.
So if we want security, we need additional measures.
Security of the storage
To guarantee the security of the storage we could think of some sort of isolation of the path.
Several options come to my mind:
With the first one, of course, being the most secure (switches can also be penetrated).
MAC filtering can easily be circumvented: if you do the filtering only on the server, any other host within the network can be reconfigured to become a client.
Path isolation guarantees that a breach in another host in the same LAN segment will not compromise the storage.
Security of the data
Data security is another topic. Although a man-in-the-middle attack does not look too probable within the data center, you might prefer to be paranoid (or you might simply have a different setup requiring it). In that case you can always add an additional layer of encryption on the client, at the cost of more CPU cycles and probably slightly increased latency.
One additional aspect bugged me.
What if a user account on the client host gets compromised? Could it be used to run an AoE client in userspace to gain access to the data?
Thankfully, no. Access to the server is done via raw sockets and a dedicated ethertype, and creating raw sockets under Linux requires the CAP_NET_RAW privilege, which is usually granted only to root.
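This is easy to verify from an unprivileged account. A small Python sketch (the AoE ethertype 0x88A2 is the registered one; the function itself is purely illustrative):

```python
import socket

AOE_ETHERTYPE = 0x88A2  # ethertype registered for ATA over Ethernet

def can_open_aoe_raw_socket():
    """Try to open a raw packet socket bound to the AoE ethertype.

    Without CAP_NET_RAW the kernel refuses this with EPERM, so a
    compromised unprivileged account cannot speak AoE in userspace.
    """
    try:
        s = socket.socket(socket.AF_PACKET, socket.SOCK_RAW,
                          socket.htons(AOE_ETHERTYPE))
        s.close()
        return True   # we do have CAP_NET_RAW (e.g. running as root)
    except PermissionError:
        return False  # unprivileged: raw sockets are denied

print(can_open_aoe_raw_socket())
```

Run as a regular user this prints False; as root (or with CAP_NET_RAW granted) it prints True.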
Both machines are Dell PowerEdge R200:
1U
1 Intel Xeon CPU X3320 @ 2.50GHz with 4 cores
4 GB of memory.
Debian GNU/Linux testing/Squeeze
2.6.30-2-686-bigmem kernel package
2 x Broadcom NetXtreme BCM5721 ( 1Gbit, No jumbo frame support )
2 HDDs each of them being:
Model Family: Seagate Barracuda ES.2
Device Model: ST3750330NS
Firmware Version: SN05
User Capacity: 750,156,374,016 bytes
The servers are connected via a dedicated wire.
The network interfaces are at:
root@client:/# ethtool eth1
Settings for eth1:
Speed: 1000Mb/s
Duplex: Full
Port: Twisted Pair
Auto-negotiation: on
Link detected: yes
Neither system was completely idle during the tests.
Here goes the block device export:
root@server:/# lvcreate --verbose --size 500G --name nbd6.0 VGNAME /dev/md8 /dev/md9
root@server:/# vblade 6 0 eth1 /dev/VGNAME/nbd6.0 2>&1
md8 is soft RAID0 (striping) over 2x150 GB partitions at the end of the HDDs. So is md9. Two physical HDDs are used in total. The soft RAID is added for performance; the partitioning is done for easier relocation of parts of the space.
The partitions being at the end of the drives gives roughly a 1.5x to 2x performance penalty for sequential operations. This is due to the circular design of Winchester hard drives: inner tracks have a smaller radius, and thus length, while outer tracks are longer, offer more storage points and are divided into more sectors. So for each revolution, more sectors are read from the outer tracks.
The performance I was able to get from this raid on the server looks like:
root@server:/# hdparm -tT /dev/VGNAME/nbd6.0
/dev/VGNAME/nbd6.0:
Timing cached reads: 4146 MB in 2.00 seconds = 2073.21 MB/sec
Timing buffered disk reads: 408 MB in 3.00 seconds = 135.88 MB/sec
Here goes the setup on the client side:
root@client:/# cat /etc/default/aoetools
INTERFACES="eth1"
LVMGROUPS=""
AOEMOUNTS=""
root@client:/# /etc/init.d/aoetools restart
Starting AoE devices discovery and mounting AoE filesystems: Nothing to mount.
At this point /dev/etherd was populated and it was time for some tests.
root@client:/# hdparm -tT /dev/etherd/e6.0
/dev/etherd/e6.0:
Timing cached reads: 3620 MB in 2.00 seconds = 1810.16 MB/sec
Timing buffered disk reads: 324 MB in 3.01 seconds = 107.63 MB/sec
So .. WOW !
I was not expecting such performance; my hopes were for around 50 MB/s max. At this point I wondered whether the bottleneck was on the server side, since several of my hdparm invocations on the server showed a performance of just around 80 MB/s (probably at times of some server load).
So let's create an in-memory ( and sparse ) file and export it:
root@server:/# dd if=/dev/zero of=6.1 bs=1M count=1 seek=3071
root@server:/# vblade 6 1 eth1 /dev/shm/6.1
The /dev/etherd/e6.1 device was created on the client automagically.
Let's do the tests once again:
root@client:/# hdparm -tT /dev/etherd/e6.1
/dev/etherd/e6.1:
Timing cached reads: 4006 MB in 2.00 seconds = 2003.68 MB/sec
Timing buffered disk reads: 336 MB in 3.00 seconds = 111.85 MB/sec
Not too much difference, so I guess I was lucky and hit the top on my first try.
Let's also try a sequential write test:
root@client:/# dd if=/dev/zero of=/dev/etherd/e6.1 bs=1M count=1024 conv=sync,fsync
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 10.0311 s, 107 MB/s
During the tests, the maximum network utilization reported by nload on the client was around 890 Mbit/s outgoing and 950 Mbit/s incoming. On the server it was 950 outgoing and 1330 (???) Mbit/s incoming.
/proc/net/dev on both the server and the client showed no errors or packet drops prior or after the tests.
I'm pleased to say that I'm astonished by the performance results from the isolated tests. A read/write speed of around 110-115 MB/s is more than enough for me, given that the theoretical maximum is around 125 MB/s (before excluding Ethernet frame overhead). The CPU utilization of the vblade server process was around 50% of 1 core, which is 1/8 of the available CPU resources. This also sounds pretty good to me. I did not bother measuring the CPU utilization on the client, as it happens inside the kernel (with no dedicated thread to follow). The tests were performed multiple times to verify the results.
Unfortunately, I started observing decreased write performance with AoE during real-world tests. At first I blamed NILFS, but when I did the tests with EXT4 the problem appeared again. So I first tested the network throughput, which proved to be fine, and then did write tests with the AoE device again (dd if=/dev/zero of=/dev/etherd/e6.0). This time I observed peaks and falls on the traffic graphs, with the bandwidth utilization going from 10 to 900 Mbit/s. Sometimes it started fast, other times it ended fast, but the sustained rate was about 100-120 Mbit/s. I tried various block sizes and tuned some kernel parameters with no real improvement. Searching the net showed that others also had write performance issues with AoE. This nice document - http://www.massey.ac.nz/~chmessom/APAC2007.pdf - shows that the most likely cause is the lack of jumbo frame support in the network interfaces that I use. On the other hand, it also shows that others (e.g. iSCSI) can perform a lot better with a 1500-byte MTU, so I wonder whether the problem is in the AoE protocol or in the software implementation. I cannot easily switch jumbo frames on, and there are no alternative AoE client implementations. I guess it is time to test ggaoed.
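For NICs and switches that do support jumbo frames, enabling them is just an MTU change on both ends; a sketch (eth1 and the 9000-byte MTU are examples, and the commands need root):

```shell
# Check the current MTU (1500 means standard frames only)
ip link show eth1

# Raise it on both the server and the client;
# the switch in between must support jumbo frames too
ip link set dev eth1 mtu 9000
```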
Fiber Channel over Ethernet
Resources:
root@client:/# mkfs -v -t nilfs2 -L nbd6.0 /dev/etherd/e6.0
FS creation took about 16 minutes for a 500 GB file system (with the above setup) and actually created an ext2 file system!!! So let's try again:
root@client:/# time mkfs.nilfs2 -L nbd6.0 /dev/etherd/e6.0
mkfs.nilfs2 ver 2.0
Start writing file system initial data to the device
Blocksize:4096 Device:/dev/etherd/e6.0 Device Size:536870912000
File system initialization succeeded !!
real 0m0.122s
user 0m0.000s
sys 0m0.008s
Well, quite a bit better - about (16 * 60) / 0.122 = 7869 times faster.
root@client:/# mount -t nilfs2 /dev/etherd/e6.0 /mnt/protected/nbd6.0
mount.nilfs2: WARNING! - The NILFS on-disk format may change at any time.
mount.nilfs2: WARNING! - Do not place critical data on a NILFS filesystem.
root@client:/# df | grep etherd
/dev/etherd/e6.0 500G 16M 475G 1% /mnt/protected/nbd6.0
Two things to notice here. First, there is no initial file system overhead of several gigs as with ext2/3, and second, the missing 25 gigs are the 5% reserved space (see mkfs.nilfs2).
On the bad side: I tried to fill the file system with data. After the first 70-80 gigs I noticed things were getting pretty slow (network interface utilization of about 50 Mbit/s) and decided to do FS benchmarks. The throughput I was able to achieve was 5-10 MB/s for sequential writes. Pretty disappointing. I also tried to tune /etc/nilfs_cleanerd.conf by increasing cleaning_interval from 5 seconds to half an hour and nsegments_per_clean from 2 to 800. Unfortunately, it did not produce any measurable speedup.
I also observed a network utilization of about 30 Mbit/s in each direction while the FS was idle. Unmounting it stopped the traffic; remounting it made it show up again. So I decided that the cleaner process was doing its business after my "unconsidered" over-increase of the parameters. Sadly, the traffic was still there several hours later.
Additionally, the number of checkpoints kept increasing without any file system activity (contrary to the statement in the docs).
I don't need the auto checkpoint feature at all, but the docs did not show me a way to disable it. Doing a manual "mkcp -s" and "rmcp" later will do the job for my needs. I guess this also makes cleanerd obsolete for my use case.
Anyway. I will try to contact the NILFS maintainers and the community to see if anyone has a cure.
I could also implement a different solution, e.g. using LVM over the AoE device with LVM's snapshotting feature, but I would really like to give NILFS the chance it deserves.