
Pages tagged: Linux

Debian upgrade - Buster to BullsEye

The time came to upgrade from Debian Buster to BullsEye.

The new stable Debian came out this summer. I was in no particular hurry to upgrade, but it was on my TODO list.

I upgraded the laptop first. There were no problems there, but there is not much configured on it either, since I rarely use it.

My desktop was a complex mix of:

  • Buster (main)
  • Buster-Backports
  • BullsEye
  • Testing
  • Unstable
  • Custom repositories

It hosts 3 seats (workplaces) with three video cards, each seat with several monitors, its own keyboards and mice, sound cards, printers, etc.

There are active Docker and LXC containers. Separately there are VirtualBox, systemd-nspawn, SnapD .. Over the years all sorts of network configurations have been tested on it, for all kinds of purposes: bridges, firewalls, proxies, load balancers, web and file servers, file systems ..

The mixed system was maintained through a complex mix of apt_preferences, holds and sources.list.

I would say the upgrade went very smoothly, and the things I expected trouble from (the NVidia binary drivers, root on ZFS, KDE, multi seat) went fine.

The only non-obvious problem that came up (at this stage at least) was with akonadiserver. It kept crashing even after I had completely wiped its database and configuration. The problem turned out to be a newly added apparmor policy which assumed that its files live on the same partition as the HOME directory, while I have configured a separate per-user VOLATILE partition for caches and the like, so that they do not fill up my ZFS snapshots.

So the policy in question denied it access to the files it needed, and it crashed on every start attempt.

So the quick fix was:

aa-complain /etc/apparmor.d/usr.bin.akonadiserver
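
To see which paths a policy is actually denying, the kernel log can be filtered. This is only a sketch: the function name denied_paths is made up for illustration, and it assumes the usual audit message layout with apparmor="DENIED", profile="..." and name="..." fields.

```shell
# Print the unique paths AppArmor denied for profiles whose name
# contains $1. Reads kernel/audit log lines on stdin.
# Assumption: messages carry apparmor="DENIED", profile="..." and name="...".
denied_paths() {
    grep 'apparmor="DENIED"' \
        | grep "profile=\"[^\"]*$1" \
        | sed -n 's/.* name="\([^"]*\)".*/\1/p' \
        | sort -u
}

# Typical use (the log source is an assumption, it differs per system):
#   journalctl -k | denied_paths akonadiserver
```

Paths that show up on an unexpected partition would point at the same kind of mismatch described above.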

Proper planning made restoring the configuration easy: a single zfs rollback to the snapshot I had taken before the upgrade.
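
The snapshot and rollback pair looks roughly like this. It is a dry-run sketch that only prints the commands; the dataset name rpool/home/user and the label pre-bullseye are hypothetical, and the function name is made up for illustration.

```shell
# Print (do not run) the ZFS commands for a pre-upgrade safety net.
# $1 = dataset (hypothetical name), $2 = snapshot label.
upgrade_snapshot_plan() {
    printf 'zfs snapshot %s@%s\n' "$1" "$2"   # take this before the upgrade
    printf 'zfs rollback %s@%s\n' "$1" "$2"   # run only if something must be reverted
}

# Example:
#   upgrade_snapshot_plan rpool/home/user pre-bullseye
```

Note that zfs rollback discards everything written to the dataset after the snapshot, and it refuses to go past newer snapshots unless -r is given.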

The upgrade took several hours, maybe 5-6. Most of that time went into following closely what was happening and not skipping steps, since a skipped step could have cost me many times more time later.

I also looked through and cleaned up various old packages and configurations.


Passing boot parameters to ScaleWay's baremetal C1 instance Linux kernel


Short story

Add tags like these to your server:

KEXEC_KERNEL=http://mirror.scaleway.com/kernel/armv7l-mainline-lts-4.9-4.9.93-rev1/vmlinuz
KEXEC_INITRD=http://mirror.scaleway.com/initrd/initrd-Linux-armv7l-v3.14.6.gz
KEXEC_APPEND=vmalloc=512M

Longer story

ScaleWay's "BareMetal" "C1" instance is a cheap (EUR 3 / month) cloud infrastructure instance. It has:

  • 4 32bit armv7l cores
  • 2 GB RAM
  • 50 GB network attached storage
  • 1 public IP included in the price

ScaleWay offers two lines of servers:

  • BareMetal
  • VirtualMachines (KVM based)

One important difference between the two is that:

  • A VM can only be booted with as much storage as included in its offer
  • Bare metal instances support attaching up to 15x150 GB additional network block device drives ( charged EUR 1/month per 50 GB )

Another important difference is that currently, counterintuitively, in ScaleWay's infrastructure:

  • Only VMs can run custom kernels
  • Bare metal servers come with pre-built kernels and ScaleWay does not officially support changing these kernels. You can't even run the official kernel that comes with the chosen Linux distro.

Thus a problem arises when you need to change something.

My case was that I wanted to use ZFS, which is not included in the official Linux kernel. Instead it is built as a module. On standard Debian this is easily done by installing the zfs-dkms package.

It is possible to build the module for the C1 instance kernel by preparing the build env as described here:

The problem was that ZFS on 32bit Linux:

  • "May encounter stability problems"
  • "May bump up against the virtual memory limit"

which is officially stated here:

I have yet to see the former, but I hit the latter quite fast, and as recommended I had to add the vmalloc=512M boot parameter.

Unfortunately Scaleway does not support passing parameters to their kernels. They do, however, support KEXEC via the KEXEC_KERNEL and KEXEC_INITRD params as documented here:

and they support parameters to the KEXEC-ed kernel via the KEXEC_APPEND param.

So I just needed to boot the same kernel and pass the parameter. First I had to find where the current kernel and initrd are. This is done by installing "scaleway-cli":

I just grabbed the pre-built amd64 deb package and then used the "scw" command to get info about the instance:

# list servers
$ scw ps 

# Show instance details 
$ scw inspect SERVER_ID

"bootscript": {
    "bootcmdargs": "LINUX_COMMON scaleway boot=local nbd.max_part=16",
    "initrd": "initrd/uInitrd-Linux-armv7l-v3.14.6",
    "kernel": "kernel/armv7l-mainline-lts-4.9-4.9.93-rev1",
    "dtb": "dtb/c1-armv7l-mainline-lts-4.9-4.9.93-rev1",
    ...

If you inspect a VM instance you will see that the kernel and initrd are referred to by IP:

"bootscript": {
    "bootcmdargs": "LINUX_COMMON scaleway boot=local nbd.max_part=16",
    "initrd": "http://169.254.42.24/initrd/initrd-Linux-x86_64-v3.14.6.gz",
    "kernel": "http://169.254.42.24/kernel/x86_64-mainline-lts-4.4-4.4.127-rev1/vmlinuz-4.4.127"

And a google search showed me that the kernel and the initrd were available at:

I had a problem when trying to use the initrd image referred to in the params above:

# DO NOT USE THIS ONE
KEXEC_INITRD=http://mirror.scaleway.com/initrd/uInitrd-Linux-armv7l-v3.14.6

and I wasted a couple of hours until I realized that this image was in a different format, not usable for KEXEC_INITRD. Then I changed it to:

 KEXEC_INITRD=http://mirror.scaleway.com/initrd/initrd-Linux-armv7l-v3.14.6.gz

and this time it worked fine.
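
A quick way to tell such images apart is to look at their first bytes. The sketch below rests on an assumption the text does not confirm: that the "uInitrd" file is wrapped in a U-Boot legacy image header (magic 0x27051956), while the ".gz" one is a plain gzip stream (magic 0x1f8b). The function name is hypothetical.

```shell
# Guess the container format of an initrd image from its first 4 bytes.
# Assumption: only U-Boot legacy images and gzip streams are of interest.
initrd_format() {
    magic=$(od -An -tx1 -N4 "$1" | tr -d ' \n')
    case "$magic" in
        27051956) echo "u-boot legacy image" ;;
        1f8b*)    echo "gzip" ;;
        *)        echo "unknown ($magic)" ;;
    esac
}

# Example, after downloading the file locally:
#   initrd_format initrd-Linux-armv7l-v3.14.6.gz
```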

The kernel can be found via at least two different URLs:

KEXEC_KERNEL=http://mirror.scaleway.com/kernel/armv7l-mainline-lts-4.9-4.9.93-rev1/vmlinuz
             http://mirror.scaleway.com/kernel/armv7l/4.9.93-mainline-rev1/vmlinuz

And after the successful boot I just had to add:

KEXEC_APPEND=vmalloc=512M

And my ZFS module was no longer complaining about lack of virtual memory.
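
To confirm that the parameter really reached the kexec-ed kernel, the running kernel's command line can be checked. A minimal sketch; the helper name cmdline_param is made up for illustration.

```shell
# Print the value of kernel parameter $2 found in the command line string $1.
# Prints nothing if the parameter is absent.
cmdline_param() {
    for word in $1; do
        case "$word" in
            "$2"=*) echo "${word#*=}" ;;
        esac
    done
}

# On the running instance:
#   cmdline_param "$(cat /proc/cmdline)" vmalloc
```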

Let me add a few articles that were helpful:

I spent about a day investigating this stuff. If you find it helpful and think that I might have saved you a couple of hours, you can send me a small donation at this PayPal e-mail: krustev-paypal@krustev.net


Posted in dir: /articles/
Tags: BareMetal Debian Linux ScaleWay ZFS

Yeah, well ..

NOTE: Adobe Flash Player 11.2 will be the last version to target Linux as a supported platform. Adobe will continue to provide security backports to Flash Player 11.2 for Linux.


Posted in dir: /blog/
Tags: adobe flash linux

Linux and webcams

I've recently got a nice webcam - Logitech C600

  • UVC driver
  • Wide angle (about 70 deg)
  • Good video ( capable of 30 fps at 1280x720 when using anything but YUYV camera output format )
  • Good sound

The supported camera outputs are:

  • YUYV
  • MJPEG
  • RGB3
  • BGR3
  • YU12
  • YV12

The best video quality is in YUYV mode. However, it uses less (or no) compression, so high frame rates are only available at lower resolutions: 640x480 @ 30 fps and 800x600 @ 25 fps.

Strangely, the webcam does some cropping when used at high video resolutions and high frame rates. The pan/tilt controls are only usable in this crop mode. Skype also switches to one of the crop modes after e.g. 30 seconds of a call (I'm using the Skype option to capture at 640x480, which it probably uses initially).

Useful software:

  • guvcview
  • mplayer
  • vlc
  • v4l2ucp (video for linux 2 universal control panel)
  • luvcview

GUVCView

GUVCview is able to show what your webcam can do. You can easily switch resolutions, frame rates and the camera output format. It can record video in different formats and capture still images. All of the V4L2 settings which your camera supports can be changed. By default it presents a preview screen, so you can see how a change of settings affects the captured video. The actual frames per second are also displayed in the video preview window. You can also use it as a camera control application when the capture is done by another app (e.g. Skype). Just start it like:

guvcview -o

Another very nice feature is that you can capture video with sound. You can easily choose which mic to use - the camera built in or the one sitting on your desktop.

It is a good idea to keep an eye on the processor load (and on the terminal window) while capturing. Some formats use the CPU heavily and video/audio can easily get out of sync.

MPlayer

MPlayer is usable for fast preview. To play video with mplayer you can just do:

mplayer tv://

or give it some more options:

mplayer -tv driver=v4l2:input=0:width=640:height=480:device=/dev/video0

It turned out to be hard to get mencoder to capture the video right, especially when the camera switches frame rate during the capture. Mine does that when the option "Exposure auto priority" is checked. I was not able to get mplayer to play video and audio at the same time either, but maybe I have not tried hard enough. VLC, on the other hand, can do this.

VLC

VLC needs to know the video and sound devices when you open a capture device. I've specified them as:

/dev/video0 (the webcam)
hw:1 (or hw:1,0) (this was my webcam mic)

You can list your capture devices by:

arecord -l
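
The card and device numbers that arecord -l prints (lines of the form "card N: NAME [...], device M: ...") map directly onto hw:N,M. A small sketch that does the conversion; the function name is made up for illustration, and the English output of arecord is assumed.

```shell
# Turn `arecord -l` output (stdin) into "hw:CARD,DEVICE NAME" lines.
# Assumption: arecord prints its messages in English (e.g. LC_ALL=C).
alsa_capture_devices() {
    sed -n 's/^card \([0-9][0-9]*\): \([^ ]*\) .*, device \([0-9][0-9]*\):.*/hw:\1,\3 \2/p'
}

# Example:
#   LC_ALL=C arecord -l | alsa_capture_devices
```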

VLC output is a little laggy in comparison to the mplayer or guvcview preview window. I was able to fix this by specifying a smaller buffer time (300 ms by default); however, on a later try this did not work. I have not played with VLC enough either. As you might know it is quite powerful, maybe the most mature video player with a GUI available for Linux. I still use mplayer from the command line for video playing though and haven't found a reason to replace it with anything else :-)

v4l2ucp/luvcview

v4l2ucp is covered by the "Image control" tab of guvcview. Luvcview looks like an older version of its G brother. You can get the list of video modes your camera supports by doing:

luvcview -L

Another program which I've barely tried is the popular "cheese".

The ultimate webcam software for linux is GUVCView.

Some extra commands to test sound from your webcam mic:

$ arecord -l
**** List of CAPTURE Hardware Devices ****
card 0: Intel [HDA Intel], device 0: ALC883 Analog [ALC883 Analog]
  Subdevices: 1/1
  Subdevice #0: subdevice #0
card 0: Intel [HDA Intel], device 2: ALC883 Analog [ALC883 Analog]
  Subdevices: 1/1
  Subdevice #0: subdevice #0
card 1: U0x46d0x808 [USB Device 0x46d:0x808], device 0: USB Audio [USB Audio]
  Subdevices: 1/1
  Subdevice #0: subdevice #0

# No sound here
$ arecord -D hw:U0x46d0x808,0 | aplay
Recording WAVE 'stdin' : Unsigned 8 bit, Rate 8000 Hz, Mono
arecord: set_params:1065: Sample format non available
Available formats:
- S16_LE
aplay: playback:2467: read error


# This played the sound. Note that sometimes when I started a command
# the sound did not show up; the next time I tried it, it did. The same was
# true for the VLC sound capture tests. So I guess the device is not
# always initialized right.
$ arecord -D hw:U0x46d0x808,0 -f S16_LE | aplay
Recording WAVE 'stdin' : Signed 16 bit Little Endian, Rate 8000 Hz, Mono
Warning: rate is not accurate (requested = 8000Hz, got = 16000Hz)
         please, try the plug plugin 
Playing WAVE 'stdin' : Signed 16 bit Little Endian, Rate 16000 Hz, Mono
^CAborted by signal Interrupt...
Aborted by signal Interrupt...

# Specify the proper rate
$ arecord -D hw:U0x46d0x808,0 -f S16_LE -r 16 | aplay
Recording WAVE 'stdin' : Signed 16 bit Little Endian, Rate 16000 Hz, Mono
Playing WAVE 'stdin' : Signed 16 bit Little Endian, Rate 16000 Hz, Mono
^CAborted by signal Interrupt...
Aborted by signal Interrupt...

# Use mmap instead of read:
$ arecord -D hw:U0x46d0x808,0 -f S16_LE -r 16 -M | aplay
Recording WAVE 'stdin' : Signed 16 bit Little Endian, Rate 16000 Hz, Mono
Playing WAVE 'stdin' : Signed 16 bit Little Endian, Rate 16000 Hz, Mono
underrun!!! (at least -1900024418,571 ms long)
^CAborted by signal Interrupt...
Aborted by signal Interrupt...
$

Another note here is that kmix was not always showing the webcam mic. Sometimes it showed unplug events without me actually touching the camera, so the webcam mic became unmanageable with it. Thus alsamixer was my friend.

Links:


Posted in dir: /articles/
Tags: guvcview Linux webcam

Linux 3.11 - Linux for workgroups

I think I will call 3.11 Linux for Workgroups.

... application developers are very important. They're not "real men" like kernel developers, he says, but still are "necessary" for Linux to succeed.

Linus Torvalds

http://www.linux.com/news/enterprise/biz-enterprise/485159-a-conversation-with-linus-torvalds


Posted in dir: /blog/
Tags: linux quotes

Sys admin day

Happy holiday to all system administrators! :-)

Thanks to a power outage last night I had the privilege of updating my desktop:

root@work:/# cat /etc/issue.net
Debian GNU/Linux wheezy/sid
root@work:/# last reboot | head -3
reboot   system boot  3.0.0-1-686-pae  Fri Jul 29 15:11 - 20:29  (05:17)    
reboot   system boot  2.6.38-2-686-big Fri Jul 29 13:25 - 15:09  (01:43)    
reboot   system boot  2.6.38-2-686-big Thu Apr  7 20:41 - 15:09 (112+18:28) 
root@work:/# uname -a
Linux work 3.0.0-1-686-pae #1 SMP Sun Jul 24 14:27:32 UTC 2011 i686 GNU/Linux

And, although slightly on crutches: happy 20th anniversary to Linux, and congratulations to us all on version 3! :-)


Posted in dir: /blog/
Tags: debian linux

Open source OCR sucks

Tonight I tried text recognition with various open source tools. The input was images packaged as a PDF. The text in the images looked bad, but was readable.

To summarize my experience:

  • Too much reading
  • Too much hassle to convert between various input formats accepted by the tools
  • Totally unacceptable results
  • Even segmentation faults by some of the apps

None of the tools did the job even close to what I expected. Maybe it was my fault, but I cannot spend a day each time I need to do a simple job which I do not even do every month.

In the end I did the job by googling for "Online OCR" and using (guess what ?!) http://www.onlineocr.net/ for the first five pages. It had a limit of five pages per hour for non-registered users (and 5 pages total for registered ones), so I registered and OCRed the last, sixth page.

BTW, just to prove my point about not reading enough: I later found this site, http://www.free-ocr.com/, which also did the job and used one of the programs I had tried, Tesseract.


Posted in dir: /blog/
Tags: linux ocr oss

Debian, Java, SocketException Network unreachable

java.net.SocketException: Network is unreachable

I hit this about a week ago. The first time I saw it was on my office desktop running Debian unstable. Since I was not doing much Java on it I decided it was a problem with JConsole. I nearly lost a bet over this:

I was pretty sure JConsole was able to attach to local processes even when they were started without any JMX options enabled. Borislav Tonchev was pretty sure it wasn't. I quickly wrote a Java class with its main method sleeping for 100 seconds and tried to attach to its process. Unfortunately I wasn't able to do so. At that point Borislav walked away with 10 bucks coming out of my pocket.

I was curious enough to check this stuff and at first it appeared that Java didn't like the bsdgroups option my ext3 /tmp file system was mounted with. Trying the same thing on my home PC, with bsdgroups disabled, showed this java.net.SocketException: Network is unreachable. At this point I was starting to lose ground. I decided to check the docs ( http://java.sun.com/javase/6/docs/technotes/guides/management/jconsole.html ) and they confirmed my point. I checked the documented behavior in a JVM running inside a Windows XP installation I have ( VirtualBox image for the corporate stuff in the office ) and Borislav unhappily brought my money back.

At this point I decided the exception under Debian was caused by a bug in JConsole - probably it was not maintained too much in recent releases as a similar tool appeared - VisualVM.

Several days after this long background, on Saturday, I've also hit the same exception on a production server running Tomcat. Pretty damn strange. I was not able to figure it out immediately. The actual problem was introduced in Debian in the beginning of December last year, with the netbase package setting:

# cat /etc/sysctl.d/bindv6only.conf
net.ipv6.bindv6only=1

This did not show up on the server immediately, since the netbase upgrade did not apply the new setting. The exception appeared after a restart almost two months after the upgrade.

The workaround is to set the above to "0" as it was before, or to add the option -Djava.net.preferIPv4Stack=true to each Java process you start. I prefer the former as I did not want to configure every Java program (e.g. I use azureus/vuze) manually.
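
The first workaround can be sketched as below. The helper name is hypothetical; it only rewrites the setting in text passed on stdin, and the commands that touch the real system are left as comments to be run as root.

```shell
# Rewrite sysctl.d content from stdin so that bindv6only is set to 0.
# Lines that do not set net.ipv6.bindv6only to 1 pass through unchanged.
disable_bindv6only() {
    sed 's/^\(net\.ipv6\.bindv6only\)[[:space:]]*=[[:space:]]*1[[:space:]]*$/\1=0/'
}

# Apply (as root):
#   disable_bindv6only < /etc/sysctl.d/bindv6only.conf > /tmp/bindv6only.conf \
#       && mv /tmp/bindv6only.conf /etc/sysctl.d/bindv6only.conf
#   sysctl -w net.ipv6.bindv6only=0    # takes effect without a reboot
```

Already running Java processes need a restart to pick up the change. The per-process alternative is to start them as java -Djava.net.preferIPv4Stack=true followed by the usual arguments.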

More information could be found in Debian bug #560044


Posted in dir: /blog/
Tags: debian java linux

Problems accessing e-fibank.bg with Firefox

I was unable to access my e-banking at https://e-fibank.bg. It first happened on my Debian unstable box in the office. A few weeks later it also showed up on my home PC running Debian testing.

My observations also showed that all the browsers stopped working at once. I'm using Iceweasel (Firefox) for the e-banking itself. Google Chrome also showed some weird (unknown) SSL error.

This was enough for me to decide that the problem had been caused by a recent package upgrade. I was pretty sure this was caused by the SSL libraries, especially given some recent Bugtraq posts about SSL vulnerabilities.

So what I did was:

  • remembered when was the last time I successfully used the e-banking
  • checked which packages I upgraded recently (only one date showed after the last successful e-banking use - 2010-01-02 )
  • looked at browser dependencies ( google-chrome-unstable deps were easier to use since iceweasel is deeply integrated within Debian with many chained dependencies )
  • and put the results against each other

So the command I came up with was:

ls -rtl /var/lib/dpkg/info/*.list | \
    grep 2010-01-02 | awk '{print $8}' | \
    cut -d / -f 6 | \
    cut -d . -f 1 | \
    sort | \
    egrep \
       `dpkg -s google-chrome-unstable | \
          grep Depends | \
          tr ',' '\n' | \
          grep '^ ' | \
          awk '{print $1}' | \
          xargs echo | tr ' ' '|'`

So this showed two things only:

libfontconfig1
libnss3-1d
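
The same cross-check can also be done from the dpkg log instead of file timestamps. This is a sketch and not the command I actually ran: the function name suspects is made up, and it assumes /var/log/dpkg.log lines of the form "DATE TIME upgrade PACKAGE OLD NEW".

```shell
# Print packages that were upgraded on date $1 (YYYY-MM-DD) and also
# appear in the space-separated dependency list $2. dpkg.log on stdin.
suspects() {
    awk -v d="$1" -v deps="$2" '
        BEGIN { n = split(deps, a, " "); for (i = 1; i <= n; i++) want[a[i]] = 1 }
        $1 == d && $3 == "upgrade" {
            pkg = $4; sub(/:.*/, "", pkg)   # drop an architecture suffix, if any
            if (pkg in want) print pkg
        }' | sort -u
}

# Example (the dependency list comes from dpkg-query):
#   deps=$(dpkg-query -W -f='${Depends}\n' google-chrome-unstable \
#            | tr ',|' '\n\n' | awk '{print $1}')
#   suspects 2010-01-02 "$deps" < /var/log/dpkg.log
```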

I was not familiar with libnss3, but looking at its package description (SSL related) was enough for me to blame it. So I checked the aptitude logs:

# grep libnss /var/log/aptitude
[UPGRADE] libnss3-1d 3.12.4-1 -> 3.12.5-1

and saw which older version I had been using. Then I checked /var/cache/apt/archives and it was just sitting there waiting to be restored:

dpkg -i /var/cache/apt/archives/libnss3-1d_3.12.4-1_i386.deb

Then restarted Iceweasel and voila ..

I've then also checked the Debian bug reports to see if this has already been reported or was waiting for me to do that. This bug report showed up:

http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=561918

Hope this saves you some time ..


Backup service and software block devices over the net

Backup service

I'm in the process of implementing a backup service with these major requirements:

  • Is external and thus protects against hardware failures
  • Provides a time machine ( so older versions of files can also be restored )

and the additional wishlist:

  • Backups are fast
  • The backup process is lightweight ( The servers are used in production and loaded around the clock )
  • Service is reliable
  • Implementing it is as simple as possible
  • The interface is universal (e.g. it's better to use a filesystem than custom solution over dump/restore)

Of course the Perl motto "there is more than one way to do it" is valid for the major goals.

E.g. The external part could be done via:

  • some sort of network file system
  • synchronization via a network protocol to a file system living on an external host

and the time machine could be done via

  • incremental backups (e.g. dump/restore)
  • a version control system, with Git being 1st in my list

My current idea is to use:

  • Software block device over the net ( External )
  • NILFS2 ( Time machine )

So I'm on the hunt for a:


Software block device over the net

Resources:

Requirements

  • Reliable
  • Fast
  • Simple ( avoid over-complication in implementation, configuration, features, dependencies .. )
  • Supported by Linux
    • Both server and client
    • Strongly preferred to be merged in mainline kernel
    • Strongly preferred tooling to be packaged in Debian

All the protocols listed below should be interchangeable. I might do some benchmarks at a later stage.


iSCSI

SCSI over internet


NBD

Network Block Device

Implementation

Exporting a device via NBD is a matter of:

root@server:/# apt-get install nbd-server
root@server:/# cat /etc/nbd-server/config
[generic]

[export0]
    exportname = /dev/mapper/vg0-nbd6.0
    port = 99
root@server:/# /etc/init.d/nbd-server restart

And importing it on a client is:

root@client:/# apt-get install nbd-client
root@client:/# grep -v '^#' /etc/nbd-client
AUTO_GEN="n"
KILLALL="true"
NBD_DEVICE[0]=/dev/nbd0
NBD_TYPE[0]=r
NBD_HOST[0]=SERVER-HOSTNAME
NBD_PORT[0]=99
root@client:/# /etc/init.d/nbd-client restart

You might want to check the manual pages in the respective packages for more configuration options and tweaks. E.g. the nbd-client init script can auto-mount file systems.

Benchmarks

By default, nbd-client creates a block device with a block size of 1024 bytes:

# On the client
blockdev --getbsz /dev/nbd0
1024

for ((i=0; i<10; i++)); do dd if=/dev/nbd0 of=/dev/null bs=1M count=1000 iflag=direct 2>&1 | grep bytes ; done
1048576000 bytes (1.0 GB) copied, 12.8387 s, 81.7 MB/s
1048576000 bytes (1.0 GB) copied, 14.1621 s, 74.0 MB/s
1048576000 bytes (1.0 GB) copied, 14.1721 s, 74.0 MB/s
1048576000 bytes (1.0 GB) copied, 15.6536 s, 67.0 MB/s
1048576000 bytes (1.0 GB) copied, 15.1352 s, 69.3 MB/s
1048576000 bytes (1.0 GB) copied, 15.5831 s, 67.3 MB/s
1048576000 bytes (1.0 GB) copied, 14.3358 s, 73.1 MB/s
1048576000 bytes (1.0 GB) copied, 15.256 s, 68.7 MB/s
1048576000 bytes (1.0 GB) copied, 13.9433 s, 75.2 MB/s
1048576000 bytes (1.0 GB) copied, 13.0245 s, 80.5 MB/s

# On the server
iostat -dk 10 | egrep '^(sd|Device)'
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             365.70     32194.80       380.00     321948       3800
sdb             316.20     31760.40       319.20     317604       3192
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             361.80     39333.20       281.20     393332       2812
sdb             323.20     39295.20       260.80     392952       2608
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             325.20     35762.80       238.40     357628       2384
sdb             274.90     35794.40       201.20     357944       2012

To summarize, we have a performance of about 70-80 MB/s and the server is reading about 100 KB in each request. The results are pretty much the same with a 2048 byte block size.
A 4k block size drops the transfer rate to 55 MB/s and keeps the 100 KB per IO op rate.
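
The "KB per IO request" figures are simply the throughput divided by the number of operations per second reported by iostat. A small sketch of that arithmetic; the function name is made up, and it counts reads and writes together because tps covers both.

```shell
# For each sdX line of `iostat -dk` output on stdin, print the device and
# the average KB moved per IO operation: (kB_read/s + kB_wrtn/s) / tps.
kb_per_io() {
    awk '$1 ~ /^sd/ && $2 > 0 { printf "%s %.0f\n", $1, ($3 + $4) / $2 }'
}

# Example, on the server while a test is running:
#   iostat -dk 10 | kb_per_io
```

For the first iostat sample above this gives roughly 89 KB for sda and 101 KB for sdb.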

Let's remove the "direct" flag from dd:

# On the client
blockdev --getbsz /dev/nbd0
1024

for ((i=0; i<10; i++)); do dd if=/dev/nbd0 of=/dev/null bs=1M count=1000 2>&1 | grep bytes ; done
1048576000 bytes (1.0 GB) copied, 14.5043 s, 72.3 MB/s
1048576000 bytes (1.0 GB) copied, 18.6863 s, 56.1 MB/s
1048576000 bytes (1.0 GB) copied, 15.6981 s, 66.8 MB/s
1048576000 bytes (1.0 GB) copied, 15.8664 s, 66.1 MB/s
1048576000 bytes (1.0 GB) copied, 16.7602 s, 62.6 MB/s
1048576000 bytes (1.0 GB) copied, 18.382 s, 57.0 MB/s
1048576000 bytes (1.0 GB) copied, 17.1475 s, 61.2 MB/s
1048576000 bytes (1.0 GB) copied, 15.3853 s, 68.2 MB/s
1048576000 bytes (1.0 GB) copied, 19.3907 s, 54.1 MB/s
1048576000 bytes (1.0 GB) copied, 21.7969 s, 48.1 MB/s

# On the server
iostat -dk 10 | egrep '^(sd|Device)'
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             312.60     30968.40       173.60     309684       1736
sdb             284.80     30978.00       172.00     309780       1720
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             330.40     32506.40       166.00     325064       1660
sdb             280.60     32517.20       152.00     325172       1520
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             224.40     33598.80        51.60     335988        516
sdb             208.20     33604.40        60.80     336044        608

So this time we have around 60 MB/s with a ratio of 100 KB per IO operation (note that the server is not totally idle and this is not the only disk activity it sees). With a block size of 2048 bytes this test shows a decreased speed of about 50 MB/s and the number of IO ops per second doubles. A 4k block size gives us an average of 60 MB/s with 50 KB per IO op.
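
The "around N MB/s" figures are eyeballed averages over the ten dd runs. To compute them instead, the dd summary lines can be averaged. A sketch with a made-up function name; it assumes the GNU dd summary format shown above, with the rate reported in MB/s.

```shell
# Average the MB/s figures from GNU dd summary lines on stdin.
# Lines that do not end in "MB/s" are ignored.
dd_avg_mbs() {
    awk '$(NF) == "MB/s" { sum += $(NF - 1); n++ }
         END { if (n) printf "%.1f\n", sum / n }'
}

# Example (same loop as above, with the average appended):
#   for ((i=0; i<10; i++)); do
#       dd if=/dev/nbd0 of=/dev/null bs=1M count=1000 2>&1 | grep bytes
#   done | dd_avg_mbs
```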

Let's do some write tests:

# On the client
blockdev --getbsz /dev/nbd0
1024

for ((i=0; i<10; i++)); do dd if=/dev/zero of=/dev/nbd0 bs=1M count=1000 oflag=direct 2>&1 | grep bytes ; done
1048576000 bytes (1.0 GB) copied, 10.1818 s, 103 MB/s
1048576000 bytes (1.0 GB) copied, 9.89168 s, 106 MB/s
1048576000 bytes (1.0 GB) copied, 9.73052 s, 108 MB/s
1048576000 bytes (1.0 GB) copied, 9.89912 s, 106 MB/s
1048576000 bytes (1.0 GB) copied, 9.91606 s, 106 MB/s
1048576000 bytes (1.0 GB) copied, 10.0242 s, 105 MB/s
1048576000 bytes (1.0 GB) copied, 9.95247 s, 105 MB/s
1048576000 bytes (1.0 GB) copied, 9.92473 s, 106 MB/s
1048576000 bytes (1.0 GB) copied, 10.0946 s, 104 MB/s
1048576000 bytes (1.0 GB) copied, 10.1183 s, 104 MB/s

# On the server
iostat -dk 10 | egrep '^(sd|Device)'
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             137.80         7.20     51806.80         72     518068
sdb             144.20         1.20     51798.00         12     517980
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             125.70        16.00     52375.20        160     523752
sdb             132.20         4.80     52362.80         48     523628
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             133.20         4.80     52117.60         48     521176
sdb             130.40         5.20     52265.20         52     522652

Write speed is 105 MB/s with about 500 KB per IO operation.
With block sizes of 2k and 4k the results of this test stay the same.

And let's remove the "direct" flag while writing:

# On the client
blockdev --getbsz /dev/nbd0
1024

for ((i=0; i<10; i++)); do dd if=/dev/zero of=/dev/nbd0 bs=1M count=1000 2>&1 | grep bytes ; done
1048576000 bytes (1.0 GB) copied, 9.34019 s, 112 MB/s
1048576000 bytes (1.0 GB) copied, 15.3738 s, 68.2 MB/s
1048576000 bytes (1.0 GB) copied, 15.6453 s, 67.0 MB/s
1048576000 bytes (1.0 GB) copied, 20.3934 s, 51.4 MB/s
1048576000 bytes (1.0 GB) copied, 20.1742 s, 52.0 MB/s
1048576000 bytes (1.0 GB) copied, 19.0891 s, 54.9 MB/s
1048576000 bytes (1.0 GB) copied, 20.4181 s, 51.4 MB/s
1048576000 bytes (1.0 GB) copied, 16.8115 s, 62.4 MB/s
1048576000 bytes (1.0 GB) copied, 18.3555 s, 57.1 MB/s
1048576000 bytes (1.0 GB) copied, 20.0491 s, 52.3 MB/s

# On the server
iostat -dk 10 | egrep '^(sd|Device)'
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             242.30       667.60     28498.00       6676     284980
sdb             261.80       768.00     26874.40       7680     268744
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             236.70       639.60     28760.00       6396     287600
sdb             247.80       653.20     29739.20       6532     297392
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             257.60       760.00     20544.40       7600     205444
sdb             155.30       356.00     21658.40       3560     216584
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             325.80      1026.40     28021.20      10264     280212
sdb             136.60       238.80     26988.80       2388     269888

We see decreased write speed, around 50-60 MB/s, and once again about 100 KB per IO operation.
The results are pretty much the same with a block size of 2048 bytes.
Increasing the block size to 4k though raises the transfer speed to about 100 MB/s and gives a nice 500 KB per IO request.

Next: Summarize the above results in a nice table and test with real files and filesystem

Block size | Sequential read  | Sequential read + direct | Sequential write  | Sequential write + direct |
1k         | 60 MB/s / 100 KB | 75 MB/s / 100 KB         | 55 MB/s / 100 KB  | 105 MB/s / 500 KB         |
2k         | 50 MB/s / 50 KB  | 75 MB/s / 100 KB         | 55 MB/s / 100 KB  | 105 MB/s / 500 KB         |
4k         | 50 MB/s / 50 KB  | 55 MB/s / 100 KB         | 100 MB/s / 500 KB | 105 MB/s / 500 KB         |

Security

Securing who can access the device is a different story though. The server implementation does not support any authentication. Well, it does support IP based ACLs, but that amounts to little since in most configurations IP addresses can easily be spoofed. I don't see much point in putting such an ACL in the server, as it can be implemented more easily and more reliably in the firewall.

So if you want/need security with NBD you should:

  • On the server: make sure you limit access to the TCP port the server is listening on. E.g. only allow a certain interface (explicitly disallowing the "lo" interface might also be a good idea) and only allow certain IP and/or MAC addresses.
  • On the network: make sure the IP and/or MAC addresses that are in the server ACL cannot be spoofed. E.g. provide a dedicated wire/vlan/etc and/or use managed switches to guarantee the path between the clients and the servers.
  • On the network: if you intend to route NBD traffic via some public network you might want to add an additional layer of encryption/authentication. IPSec or another tunneling scheme sounds useful.
  • On the client: it might be a good idea to limit the NBD traffic to a particular UID (most probably root). This is especially important if you have some untrusted apps running.
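
The server and client items above could look roughly like the iptables rules below. This is a sketch that only prints the rules so they can be reviewed first; the function names, interface, addresses and MAC in the example are all hypothetical, and port 99 is taken from the NBD config earlier.

```shell
# Print (not apply) iptables rules restricting an NBD export on the server.
# $1 = server interface, $2 = client IP, $3 = client MAC, $4 = TCP port
nbd_server_rules() {
    printf 'iptables -A INPUT -i %s -p tcp --dport %s -s %s -m mac --mac-source %s -j ACCEPT\n' \
        "$1" "$4" "$2" "$3"
    printf 'iptables -A INPUT -p tcp --dport %s -j DROP\n' "$4"
}

# Print a client rule rejecting NBD connections from any UID except root.
# $1 = server IP, $2 = TCP port
nbd_client_rules() {
    printf 'iptables -A OUTPUT -p tcp -d %s --dport %s -m owner ! --uid-owner 0 -j REJECT\n' \
        "$1" "$2"
}

# Example (all values made up):
#   nbd_server_rules eth1 192.0.2.10 00:11:22:33:44:55 99
#   nbd_client_rules 192.0.2.1 99
```

The ACCEPT rule must come before the DROP rule, which is why the server rules are printed in that order. Traffic arriving on "lo" or on any other interface falls through to the DROP.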

Notes

nbd-server in Debian testing (as of 2010-01-10) does not support SDP (Sockets Direct Protocol), so TCP/IP is used for the tests. SDP is claimed to offer better performance.

I've read somewhere that NBD is not particularly good in case of connection problems.


DST (Obsolete)

DST stands for Distributed STorage

Resources:

Merged in the (recent) 2.6.30 kernel. Update: Unfortunately it was removed as of the 2.6.33 kernel.
As far as I can see from various resources it is implemented as an alternative to NBD and iSCSI.

Its author ( Evgeniy Polyakov ) looks like a good hacker, and when a good hacker feels that he has to come up with a new implementation there must be something wrong with the old one.

Performance tests done by the DST author show that AoE performs better though, so AoE is probably the first thing that I will try.

DST looks like the second option I will try, as I also plan to implement a similar backup solution in a distributed environment over insecure channels.

Notes:

  • New and probably unstable implementation. As of Linux v2.6.32 it is still in the "staging" area.
  • Native encryption support, so it is usable over insecure channels
  • Both the client and the server are implemented in the kernel
  • Single vendor

AoE

ATA over Ethernet

Resources:

Notes:

  • Multiple vendors for the server implementation
  • The client is implemented in the kernel ( "aoe" module )
  • I will probably try the vblade+aoetools option first as it is already packaged in Debian.
  • GGAOED has Debian build scripts
  • TODO: Post some results from real world tests

AoE works directly at layer 2 (Data Link - Ethernet), bypassing the processing overhead of the upper layers (IP, TCP/UDP).
This makes it a candidate for a performance boost, but it also has some drawbacks.
E.g. it cannot easily be passed through routers. Even if Ethernet-in-IP tunneling is used, IP fragmentation will likely occur, which will probably slow things down. It looks suitable for use within the data center, where performance is needed and the client and the server are either directly connected or interconnected via a good switch supporting jumbo frames.

Security

The AoE protocol is insecure by design and it is stateless.
So if we want security we should use some additional measures.

Security of the storage

To guarantee the security of the storage we could think of some sort of isolation of the path.
Several options come to my mind:

  • Dedicated Ethernet interfaces and a dedicated wire between the client and the server
  • VLAN isolation
  • MAC filtering on the server and on the switch(es)

The first one is, of course, the most secure ( switches could also be penetrated ).

MAC filtering is easy to circumvent. If you do the filtering only on the server, then any other host within the network could be reconfigured to become a client.

The path isolation will guarantee that a breach in another host in the same LAN segment will not compromise the storage.

Security of the data

Data security is another topic. Although a man-in-the-middle attack does not look too probable within the data center, you might prefer to be paranoid ( or you might simply have a different setup requiring it ). In that case you could always add an additional layer of encryption on the client, at the cost of more CPU cycles and probably slightly increased latency.

One additional aspect bugged me.
What if a user account on the client host gets compromised? Could it be used to run an AoE client in userspace to gain access to the data?
Thankfully no. Access to the server is done via raw sockets and a dedicated ethertype. Creating raw sockets under Linux requires the CAP_NET_RAW privilege, which is usually granted only to root.
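
A quick way to check whether a given account could open raw sockets is to look at its effective capability set. The sketch below is my own illustration, not part of the original setup: it reads the CapEff bitmask from /proc and tests bit 13, which is the number CAP_NET_RAW has in linux/capability.h. The function name has_cap_net_raw is made up for the example.

```shell
#!/bin/sh
# CAP_NET_RAW is capability number 13 in <linux/capability.h>.
CAP_NET_RAW_BIT=13

has_cap_net_raw() {
    # CapEff is the effective capability set as a hex bitmask. It is read by
    # awk here, which inherits the capabilities of the calling shell.
    cap_eff=$(awk '/^CapEff:/ {print $2}' /proc/self/status)
    [ $(( (0x$cap_eff >> CAP_NET_RAW_BIT) & 1 )) -eq 1 ]
}

if has_cap_net_raw; then
    echo "CAP_NET_RAW: yes (this process could open raw sockets)"
else
    echo "CAP_NET_RAW: no (raw sockets, and thus AoE frames, are off limits)"
fi
```

Run as an ordinary user this should normally report "no", and as root on a plain host "yes". Containers and processes with tweaked capability sets may differ either way.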

Implementation

Both machines are Dell PowerEdge R200:

  • 1U

  • 1 Intel Xeon CPU X3320 @ 2.50GHz with 4 cores

  • 4 GB of memory.

  • Debian GNU/Linux testing/Squeeze

  • 2.6.30-2-686-bigmem kernel package

  • 2 x Broadcom NetXtreme BCM5721 ( 1Gbit, No jumbo frame support )

  • 2 HDDs each of them being:

    Model Family: Seagate Barracuda ES.2 Device Model: ST3750330NS
    Firmware Version: SN05
    User Capacity: 750,156,374,016 bytes

The servers are connected via a dedicated wire.

The network interfaces are at:

root@client:/# ethtool eth1
Settings for eth1:
    Speed: 1000Mb/s
    Duplex: Full
    Port: Twisted Pair
    Auto-negotiation: on
    Link detected: yes

Neither system was completely idle during the tests.

Here is how the block device was exported:

root@server:/# lvcreate --verbose --size 500G --name nbd6.0 VGNAME /dev/md8 /dev/md9
root@server:/# vblade 6 0 eth1 /dev/VGNAME/nbd6.0 2>&1

md8 is a software RAID0 (striping) over 2x150 GB partitions at the end of the HDDs. So is md9. Two physical HDDs are used in total. The software RAID is added for performance. The partitioning is done for easier relocation of parts of the space.
Placing the partitions at the end of the drive (i.e. on the inner tracks) costs roughly 1.5x to 2x in sequential performance. This is due to the geometry of the platters: inner tracks have a smaller radius and thus a shorter length, so outer tracks hold more data and are divided into more sectors. As a result, more sectors are read per revolution from the outer tracks.

The performance I was able to get from this raid on the server looks like:

root@server:/# hdparm -tT /dev/VGNAME/nbd6.0                            
/dev/VGNAME/nbd6.0:
 Timing cached reads:   4146 MB in  2.00 seconds = 2073.21 MB/sec
 Timing buffered disk reads:  408 MB in  3.00 seconds = 135.88 MB/sec

Here goes the setup on the client side:

root@client:/# cat /etc/default/aoetools
INTERFACES="eth1"
LVMGROUPS=""
AOEMOUNTS=""

root@client:/# /etc/init.d/aoetools restart
Starting AoE devices discovery and mounting AoE filesystems: Nothing to mount.

At this point /dev/etherd was populated and it was time for some tests.

root@client:/# hdparm -tT /dev/etherd/e6.0
/dev/etherd/e6.0:
 Timing cached reads:   3620 MB in  2.00 seconds = 1810.16 MB/sec
 Timing buffered disk reads:  324 MB in  3.01 seconds = 107.63 MB/sec

So .. WOW !
I was not expecting such performance. My hopes were around 50 MB/s at most. At this point I wondered whether the bottleneck was on the server side, since several of my hdparm invocations on the server showed only around 80 MB/s (probably at times of some server load).

So let's create an in-memory ( and sparse ) file and export it:

root@server:/# dd if=/dev/zero of=/dev/shm/6.1 bs=1M count=1 seek=3071
root@server:/# vblade 6 1 eth1 /dev/shm/6.1
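
For anyone wondering why that dd line yields a sparse file: it writes a single 1 MiB block at an offset of 3071 MiB, so the file appears to be 3 GiB while only the last block is actually allocated. Here is a scaled-down sketch of the same trick (64 MiB instead of 3 GiB, and a temporary file instead of /dev/shm, both my choice for the illustration):

```shell
#!/bin/sh
# Scaled-down version of the sparse file trick: 64 MiB instead of 3 GiB.
# The temporary path is hypothetical; the original file lived in /dev/shm.
f=$(mktemp)
# Write one 1 MiB block at offset 63 MiB; everything before it is a hole.
dd if=/dev/zero of="$f" bs=1M count=1 seek=63 2>/dev/null
apparent=$(stat -c %s "$f")               # size the file claims to have, in bytes
allocated_kib=$(du -k "$f" | cut -f1)     # space actually allocated, in KiB
echo "apparent:  $((apparent / 1024 / 1024)) MiB"
echo "allocated: $allocated_kib KiB"
rm -f "$f"
```

On a file system that supports holes the allocated figure should be close to 1 MiB, while the apparent size is (seek + count) * bs.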

The /dev/etherd/e6.1 device was created on the client automagically.
Let's do the tests once again:

root@client:/# hdparm -tT /dev/etherd/e6.1
/dev/etherd/e6.1:
 Timing cached reads:   4006 MB in  2.00 seconds = 2003.68 MB/sec
 Timing buffered disk reads:  336 MB in  3.00 seconds = 111.85 MB/sec

Not much of a difference, so I guess I was lucky and hit the top on my first try.
Let's also try a sequential write test:

root@client:/# dd if=/dev/zero of=/dev/etherd/e6.1 bs=1M count=1024 conv=sync,fsync
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 10.0311 s, 107 MB/s

During the tests the maximum network utilization reported by nload on the client was around 890 Mbit/s (outgoing) and 950 Mbit/s (incoming). On the server it was 950 Mbit/s outgoing and 1330 (???) Mbit/s incoming.

/proc/net/dev on both the server and the client showed no errors or packet drops before or after the tests.

I'm pleased to say that I'm astonished by the performance results from the isolated tests. A read/write speed of around 107-112 MB/s is more than enough for me when the theoretical maximum is around 125 MB/s (before subtracting the Ethernet frame overhead). The CPU utilization of the vblade server process was around 50% of 1 core, which is 1/8 of the available CPU resources. This also sounds pretty good to me. I did not bother measuring the CPU utilization on the client, as the work happens inside the kernel ( with no dedicated thread to follow ). The tests were performed multiple times to verify the results.
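
To put the 125 MB/s figure in perspective, here is a rough calculation of what is left after Ethernet framing. It assumes the usual 38 bytes of per-frame overhead (preamble and start delimiter 8, Ethernet header 14, FCS 4, inter-frame gap 12) and treats the whole payload as useful data, so it ignores the AoE headers and is only an upper bound. The function name wire_rate is made up for the example.

```shell
#!/bin/sh
# wire_rate LINK_MBIT PAYLOAD_BYTES -> usable MB/s after Ethernet framing.
# Upper bound only: AoE's own headers inside the payload are not accounted for.
wire_rate() {
    awk -v link="$1" -v payload="$2" 'BEGIN {
        overhead = 8 + 14 + 4 + 12      # preamble+SFD, header, FCS, inter-frame gap
        raw = link / 8                  # Mbit/s -> MB/s (decimal megabytes)
        printf "%.1f\n", raw * payload / (payload + overhead)
    }'
}

wire_rate 1000 1500     # standard MTU, prints 121.9
wire_rate 1000 9000     # jumbo frames, prints 124.5
```

So on 1500 byte frames the ceiling is closer to 122 MB/s than to 125 MB/s, and the measured numbers are within roughly 10% of it.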

Unfortunately, I started observing decreased write performance with AoE during real world tests. At first I blamed NILFS, but when I repeated the tests with EXT4 the problem appeared again. So I first tested the network throughput, which proved to be fine, and then ran the write tests ( dd if=/dev/zero of=/dev/etherd/e6.0 ) against the AoE device again. This time I observed peaks and drops on the traffic graphs, with the bandwidth utilization ranging from 10 to 900 Mbit/s. Sometimes it started fast, other times it ended fast, but the sustained rate was about 100-120 Mbit/s. I tried various block sizes and tuning some kernel parameters, with no real improvement. Searching the net showed that others also had write performance issues with AoE. This nice document, http://www.massey.ac.nz/~chmessom/APAC2007.pdf, shows that the most likely cause is the lack of jumbo frame support in the network interfaces that I use. On the other hand, it also shows that others (e.g. iSCSI) can perform a lot better with a 1500 byte MTU. So I wonder if the problem is in the AoE protocol or in the software implementation. I cannot easily switch jumbo frames on, and there are no multiple AoE client implementations to compare. I guess it is time to test ggaoed.
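
The block size experiments were along the lines of the loop below. This is a reconstruction for illustration, not the exact commands I ran: the block sizes and the amount of data are arbitrary, and TARGET defaults to a scratch file so that running the sketch does not overwrite a real device. The 8 MiB written per run is far too little for meaningful numbers; it only shows the shape of the test.

```shell
#!/bin/sh
# TARGET defaults to a scratch file. Pointing it at an AoE device such as
# /dev/etherd/e6.0 overwrites whatever is stored there.
TARGET=${TARGET:-$(mktemp)}
TOTAL_KIB=8192          # data written per run; tiny, just to show the loop

write_tests() {
    for bs_kib in 4 64 1024; do
        count=$((TOTAL_KIB / bs_kib))
        # dd reports throughput on stderr; the last line is the summary.
        summary=$(dd if=/dev/zero of="$TARGET" bs="${bs_kib}k" count="$count" conv=fsync 2>&1 | tail -n 1)
        echo "bs=${bs_kib}k count=$count -> $summary"
    done
}

write_tests
```

Watching the interface with nload while such a loop runs against the real device is what revealed the peaks and drops described above.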


FCoE

Fibre Channel over Ethernet

etc.


NILFS2

Resources:

Implementation

root@client:/# mkfs -v -t nilfs2 -L nbd6.0 /dev/etherd/e6.0

FS creation took about 16 minutes for a 500 GB file system (with the above setup) and actually created an ext2 file system !!! So let's try again:

root@client:/# time mkfs.nilfs2 -L nbd6.0 /dev/etherd/e6.0
mkfs.nilfs2 ver 2.0
Start writing file system initial data to the device
   Blocksize:4096  Device:/dev/etherd/e6.0  Device Size:536870912000
File system initialization succeeded !!

real    0m0.122s
user    0m0.000s
sys     0m0.008s

Well, much better - about (16 * 60) / 0.122 = 7869 times faster.

root@client:/# mount -t nilfs2 /dev/etherd/e6.0 /mnt/protected/nbd6.0
mount.nilfs2: WARNING! - The NILFS on-disk format may change at any time.
mount.nilfs2: WARNING! - Do not place critical data on a NILFS filesystem.
root@client:/# df | grep etherd
/dev/etherd/e6.0      500G   16M  475G   1% /mnt/protected/nbd6.0

Two things to notice here. First, there is no initial file system overhead of several gigs as with ext2/3. Second, the missing 25 gigs are the 5% reserved space ( see mkfs.nilfs2 ).

On the bad side: I tried to fill the file system with data. After the first 70-80 gigs I noticed that things were pretty slow (network interface utilization of about 50 Mbit/s) and decided to do FS benchmarks. The throughput I was able to achieve was 5-10 MB/s for sequential writes. Pretty disappointing. I also tried to tune /etc/nilfs_cleanerd.conf by increasing cleaning_interval from 5 seconds to half an hour and nsegments_per_clean from 2 to 800. Unfortunately it did not produce any measurable speedup.

I also observed a network utilization of about 30 Mbit/s in each direction while the FS was idle. Unmounting it stopped the traffic. Remounting it made the traffic show up again. So I concluded that the cleaner process was doing its business after my "unconsidered" over-increase of the parameters. Sadly the traffic was still there several hours later.
Additionally, the checkpoint number kept increasing without any file system activity (contrary to the statement in the docs).
I don't need the auto checkpoint feature at all, but the docs did not show me a way to disable it. Doing a manual "mkcp -s" and "rmcp" later will do the job for my needs. I guess this also makes the cleanerd unnecessary for my use case.

Anyway, I will try to contact the NILFS maintainers and the community to see if anyone has a cure.
I could also implement a different solution, e.g. using LVM over the AoE device and the LVM snapshotting feature, but I would really like to give NILFS the chance it deserves.


Posted in dir: /articles/
Tags: linux
