Sunday, 2 June 2013

OpenCL on Ubuntu 13.04

Unfortunately, two years after my post about getting Intel's OpenCL to work on Ubuntu, the out-of-box situation for using OpenCL isn't much better - you can install the OpenCL library/headers/package config files with a nice and quick `apt-get install ocl-icd-opencl-dev`, but doing that will probably just get you a CL_PLATFORM_NOT_FOUND_KHR (-1001) error from clGetPlatformIDs. That's because you have the dispatcher, but no actual OpenCL drivers. Therefore I decided to play with the three major OpenCL implementations (Intel, AMD, NVidia).

The reason the simple apt-get install works at all is a free software project called ocl-icd - it implements the ICD dispatcher that every ICD-compatible OpenCL implementation can plug into, and therefore provides a nice basis for using OpenCL. But to actually build and run CL kernels you need at least one driver, so let's look at them.
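To make the dispatch mechanism concrete: ocl-icd discovers drivers purely through one-line text files in /etc/OpenCL/vendors, each naming a driver library to dlopen. A small sketch, using a scratch directory instead of /etc (the intel library path is just an example of what such a file holds):

```shell
# ocl-icd scans the vendors directory when clGetPlatformIDs() is first
# called; every *.icd file holds a single line of text - either a bare
# soname or an absolute path of a driver library to dlopen.
mkdir -p demo-vendors
echo "/usr/lib/x86_64-linux-gnu/OpenCL/vendors/intel/libintelocl.so" \
    > demo-vendors/example64.icd
cat demo-vendors/example64.icd
# The real loader reads /etc/OpenCL/vendors instead of ./demo-vendors.
```

This is the whole "registration" protocol - installing a driver really just means dropping its libraries somewhere and writing one such file.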

Intel's OpenCL

Intel recently released the Intel SDK for OpenCL Applications XE 2013 - but Ubuntu/Debian users are out of luck, you won't find an official .deb on Intel's website. Fortunately that is not a deal breaker, and with a few commands you can turn the rpms into deb packages (which is what I recommended in my old post). But since we have ocl-icd these days, I went for a different approach this time - install just the bare minimum for ocl-icd to pick up the Intel driver (and add proper dependencies on ocl-icd-libopencl1 and libnuma1) - so I created the debs manually this time.

So, what I did was:
  1. Grab the tgz from Intel's website.
  2. Extract it into a temporary location, where you'll find a license, a readme, a bunch of scripts and five rpms:
    • opencl-1.2-base-[version] - contains Intel's copy of the ICD loader library (libOpenCL.so), which we don't need because we have ocl-icd-libopencl1
    • opencl-1.2-devel-[version] - contains the OpenCL headers, which also aren't needed, as those are in the opencl-headers package, a dependency of ocl-icd-opencl-dev
    • opencl-1.2-intel-cpu-[version] - bingo, the ICD
    • opencl-1.2-intel-devel-[version] - a few development tools - offline compiler and Qt-based KernelBuilder application
    • opencl-1.2-intel-mic-[version] - additional libraries to also support the Xeon Phi coprocessor besides Core processors - I ignored this one, because I don't have access to such a coprocessor.
  3. Now, let's make binary debian packages from the unpacked files (note that only the intel-cpu package is required to run OpenCL apps; the devel tools are optional). This is pretty easy - following a simple debian package building HOWTO, I put the files from the intel-cpu and intel-devel rpms into the following directory structure:
     opencl-driver-intel-cpu
     | \-DEBIAN
     | \-etc
     |   \-OpenCL
     |     \-vendors
     | \-usr
     |   \-lib
     |     \-x86_64-linux-gnu
     |       \-OpenCL
     |         \-vendors
     |           \-intel
     |   \-share
     |     \-doc
     |       \-opencl-driver-intel-cpu
     | \-bin

     opencl-driver-intel-tools
     | \-DEBIAN
     | \-usr
     |   \-lib
     |     \-x86_64-linux-gnu
     |       \-intel
     |         \-opencl-1.2-3.0.67279
     |   \-share
     |     \-doc
     |       \-opencl-driver-intel-tools

    The DEBIAN directories contain the control files:
    opencl-driver-intel-cpu/DEBIAN/control:
    Package: opencl-driver-intel-cpu
    Version: 3.0.67279-1
    Section: libs
    Priority: optional
    Architecture: amd64
    Depends: ocl-icd-libopencl1 (>= 2.0), libnuma1
    Maintainer: Your Name
    Description: Intel OpenCL CPU implementation
     This package provides Intel OpenCL implementation which can utilize Intel Core processors.

    opencl-driver-intel-tools/DEBIAN/control:

    Package: opencl-driver-intel-tools
    Version: 3.0.67279-1
    Section: libs
    Priority: optional
    Architecture: amd64
    Depends: ocl-icd-libopencl1 (>= 2.0)
    Maintainer: Your Name
    Description: Intel SDK for OpenCL Applications development tools
     This package contains the following tools:
      - Intel SDK for OpenCL - Kernel Builder, which enables building and analyzing OpenCL kernels and provides full offline OpenCL language compilation.
      - Intel SDK for OpenCL - Offline Compiler, a command-line utility, which enables offline compilation and building of OpenCL kernels.
    The leaf directory opencl-driver-intel-cpu/usr/lib/x86_64-linux-gnu/OpenCL/vendors/intel contains all the object files from the intel-cpu rpm, and opencl-driver-intel-tools/usr/lib/x86_64-linux-gnu/intel/opencl-1.2-3.0.67279/ contains the binaries from the intel-devel rpm, with tiny changes so the bash scripts point to the correct locations.

    Once this is done, the only remaining file to tamper with is the actual .icd file, opencl-driver-intel-cpu/etc/OpenCL/vendors/intel64.icd, which contains just one line: the absolute path of the driver library installed under /usr/lib/x86_64-linux-gnu/OpenCL/vendors/intel.
  4. Now just run `dpkg-deb --build opencl-driver-intel-cpu` and `dpkg-deb --build opencl-driver-intel-tools`, and voilà - your debs with Intel's OpenCL are ready.
You can also download my opencl-driver-intel-cpu.deb and opencl-driver-intel-tools.deb.

Note that these are here just for reference - you should delete them after making sure that your own package looks the same; i.e. I'm not redistributing these.
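For completeness, here are steps 3 and 4 condensed into a small shell sketch for the cpu package. The maintainer address and the driver library filename (libintelocl.so) are placeholders - substitute whatever your extracted rpm actually contains:

```shell
#!/bin/sh
set -e
PKG=opencl-driver-intel-cpu
LIBDIR=$PKG/usr/lib/x86_64-linux-gnu/OpenCL/vendors/intel

mkdir -p "$PKG/DEBIAN" "$PKG/etc/OpenCL/vendors" "$LIBDIR"

# The control file drives dpkg; Depends pulls in the ocl-icd loader and libnuma.
cat > "$PKG/DEBIAN/control" <<'EOF'
Package: opencl-driver-intel-cpu
Version: 3.0.67279-1
Section: libs
Priority: optional
Architecture: amd64
Depends: ocl-icd-libopencl1 (>= 2.0), libnuma1
Maintainer: Your Name <you@example.com>
Description: Intel OpenCL CPU implementation
 This package provides Intel OpenCL implementation which can utilize Intel Core processors.
EOF

# The one-line ICD file pointing at the driver library.
echo "/usr/lib/x86_64-linux-gnu/OpenCL/vendors/intel/libintelocl.so" \
    > "$PKG/etc/OpenCL/vendors/intel64.icd"

# ...copy the shared objects from the extracted intel-cpu rpm into $LIBDIR...

if command -v dpkg-deb >/dev/null; then
    dpkg-deb --build "$PKG"   # produces opencl-driver-intel-cpu.deb
fi
```

The tools package is built the same way, minus the etc/OpenCL/vendors part.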

Once installed, a simple OpenCL app that lists the available platforms should output something like this:
  VENDOR: Intel(R) Corporation
    DEVICE:       Intel(R) Core(TM) i7-3632QM CPU @ 2.20GHz
    DEVICE VENDOR: Intel(R) Corporation
    DEVICE VERSION: OpenCL 1.2 (Build 67279)

NVidia's OpenCL

In Ubuntu 13.04 there are now multiple nvidia packages containing various versions of NVidia's driver, but I couldn't find any package with the .icd file, so even though all the nvidia-3* packages do contain the driver, the ICD loader isn't able to find it. Nonetheless, the fix is easy:

Run `locate libnvidia-opencl` - this will probably find a few files. For some reason though, the *.so.1 files were broken symlinks in my case (maybe because I have an Optimus laptop), therefore I created the .icd in the following manner:

echo /usr/lib/nvidia-304/libnvidia-opencl.so.304.88 > /etc/OpenCL/vendors/nvidia64.icd

The problem with this is that if the driver gets updated, you'll have to modify the .icd again, so if your *.so.1 symlink is not broken, you should point to that instead.
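A slightly more robust way to generate the .icd is to prefer the stable symlink and only fall back to the versioned file when the symlink is broken. A sketch (the demo paths are made up; for real use, run the function as root against /usr/lib/nvidia-* and /etc/OpenCL/vendors):

```shell
# make_nvidia_icd DRIVER_DIR VENDORS_DIR
# Prefer the stable libnvidia-opencl.so.1 symlink; fall back to the
# versioned library when that symlink is broken (as on my machine).
make_nvidia_icd() {
    driver_dir=$1
    vendors_dir=$2
    lib="$driver_dir/libnvidia-opencl.so.1"
    if [ ! -e "$lib" ]; then        # -e follows symlinks, so broken ones fail
        lib=
        for f in "$driver_dir"/libnvidia-opencl.so.*; do
            [ -f "$f" ] && lib=$f && break
        done
    fi
    [ -n "$lib" ] && echo "$lib" > "$vendors_dir/nvidia64.icd"
}

# Real usage (as root): make_nvidia_icd /usr/lib/nvidia-304 /etc/OpenCL/vendors
# Demo with a scratch directory and a deliberately broken symlink:
mkdir -p demo/driver demo/vendors
touch demo/driver/libnvidia-opencl.so.304.88
ln -sf missing-target demo/driver/libnvidia-opencl.so.1
make_nvidia_icd demo/driver demo/vendors
cat demo/vendors/nvidia64.icd   # -> demo/driver/libnvidia-opencl.so.304.88
```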

Note that the driver also ships its own OpenCL loader library (libOpenCL.so), and depending on your LD_LIBRARY_PATH settings, it might get used instead of the one provided by ocl-icd. That is not necessarily terrible, but keep in mind that NVidia's implementation is OpenCL 1.1, so even if other ICDs support 1.2, you'll be stuck with 1.1. Solution? Just remove the extra libOpenCL.so* files from /usr/lib/nvidia-*/.

Once done, listing CL platforms and devices should also list NVidia:
  VERSION: OpenCL 1.1 CUDA 4.2.1
  VENDOR: NVIDIA Corporation
    DEVICE: GeForce GT 635M
    DRIVER VERSION: 304.88

OpenCL and Bumblebee

As previously mentioned, I have an Optimus laptop (with an integrated Intel GPU as well as an NVidia one which is used for more demanding applications). If the NVidia GPU is shut down, you won't see the NVidia platform as available (which I find a bit strange - shouldn't it be a platform with 0 available devices?), but once it is turned on with `optirun bash`, things should work properly. I haven't tried CL-GL interop though; I can imagine that might not work with Bumblebee.

One issue I noticed though is that if you have both the NVidia and Intel drivers installed, Intel's driver will crash any OpenCL app run inside the optirun shell, which seems to be caused by the LD_PRELOAD libraries that VirtualGL uses. So either run your apps inside that shell with `LD_PRELOAD= ./myApp`, or just run them from a regular shell - as long as the GPU is active, it is perfectly able to perform calculations even without VirtualGL set up.

AMD's driver

AMD's driver supports both CPU and GPU devices. I don't have an AMD GPU, so I only got to try the CPU implementation (luckily Intel's and AMD's CPUs are still compatible enough). The installer they provide installs the whole SDK into /opt/AMDAPP, changes your /etc/profile to include the SDK directories in LD_LIBRARY_PATH, and installs the .icd into /etc/OpenCL/vendors.

What I don't like about this is that their SDK also ships its own libOpenCL.so, so it will be used instead of ocl-icd's. In this case it is less of a problem than with NVidia, because AMD's implementation isn't limited to OpenCL 1.1, but if you want to use ocl-icd, just remove the libOpenCL.so* files from /opt/AMDAPP/lib/x86_64.

It would be nice to have a similar deb package for AMD's CL driver, but I didn't get to that - maybe someone else wants to? ;) Anyway:
PLATFORM_NAME: AMD Accelerated Parallel Processing
  VERSION: OpenCL 1.2 AMD-APP (1113.2)
  VENDOR: Advanced Micro Devices, Inc.
    DEVICE: Intel(R) Core(TM) i7-3632QM CPU @ 2.20GHz
    DEVICE VENDOR: GenuineIntel
    DEVICE VERSION: OpenCL 1.2 AMD-APP (1113.2)
    DRIVER VERSION: 1113.2 (sse2,avx)


As I was testing the various drivers I encountered quite a few issues - Intel's implementation crashes on ratGPU tests; AMD's pretends to work with my OpenCL face detection but doesn't detect anything (Intel's and NVidia's work fine); and on top of that there are the crashes with Intel inside the Bumblebee/VirtualGL LD_PRELOAD shell. Samples from AMD's SDK crash when used with the ocl-icd ICD loader because they call clReleaseContext(NULL) - they work with AMD's loader though. But in the end there is also a lot more that actually does work - for example, a year ago my face detection didn't work at all with Intel's implementation and now it's fine, and many of the SDK samples worked with all three drivers. I'd say there has been some good progress.

So that's the current OpenCL state - it's usable, just not out-of-the-box. I do hope that a year from now this post can be reduced to "To use OpenCL just run `apt-get install opencl-driver-*`".

Friday, 3 August 2012

GUADEC 2012 and Zeitgeist hackfest

Like many others, I've been to great La Coruna to meet up with fellow gnomies and zeitgeistians, and even though I only arrived on Sunday, I still managed to make it to a couple of interesting talks. On Monday we started the Zeitgeist hackfest, which lasted till Wednesday.

The biggest chunk of work I did was reviewing RainCT's libzeitgeist2 branch (a more than three-thousand-line Vala diff), which extracts the datamodel and DBus interface bits from the not-that-long-ago-rewritten-in-Vala zeitgeist-daemon and puts them in a library that will supersede the current version of libzeitgeist. The old version was conceived during my GSoC (in 2010) and was purely C-based; since the daemon was written in Python back then, it shared no code with the current sources, which made it a maintenance burden for us - you can imagine that it's easier to keep the lib up-to-date when the daemon itself is built with it. By the end of the hackfest the branch was merged into master, and even though some small pieces are still missing (like documentation and syntax sugar), we should finish those in a couple of days. The API is currently very similar to the old libzeitgeist, although we did change the stealing behaviour it used, so it's not as convenient to use from C as it used to be. On the other hand, it's straightforward to use from introspected languages as well as Vala itself.

Other than reviews of huge and small branches, we brainstormed about Zeitgeist's FTS extension (which does textual search of the log for us, but has issues). Unfortunately it seems that all the open source search engine libraries have some problem - be it memory ballooning, being written in Java (which I think is pretty unusable on the desktop), a limbo state of commits, or lack of features. Currently the best option seems to be LucenePlusPlus, but it falls into the "limbo state of commits" category. That being said, perhaps proclaiming our interest in it could change that? Pretty please? :)

Besides Zeitgeist, I also managed to stop by at the PyGObject hackfest and bother Pitti with memory leaks we're seeing when using libdee. Although we didn't manage to tackle them, I have high hopes that we will. I also discussed with Ryan ways to make a library as optional as possible, and will apply that to the instrumentation lib I'm currently working on.

One thing that pleasantly surprised me was the increased general interest in Zeitgeist from the community (at least compared to last year's GUADEC) and the number of smaller contributions - these are of course great, and integrating with Zeitgeist is a way to improve the overall user experience. Plus it's nice to see this after pushing for it for the past couple of years. Hopefully we will even see direct support for Zeitgeist in GTK soon. ;)

Last but not least, I want to thank GNOME foundation for sponsoring my stay.

Saturday, 21 April 2012

FTS engines - memory usage

Following up on Mathias's great post on Full Text Search engines, I decided to take a look at the memory usage of some of the engines while performing queries. Mathias looked at Lucene++, SQLite, QtCLucene, Tracker and Xapian, I focused only on three of them - Lucene++, SQLite and Xapian (version numbers match those that Mathias used as I'm also testing on Ubuntu 12.04).

The procedure was simple - I grabbed the benchmark repository, used it to build two sets of databases with 17251 and 121587 movies, and then ran valgrind's massif while only performing queries on the already-built databases. Here are the peak memory usage values:

movies   Lucene++   SQLite    Xapian
17251    1.4 MiB    2.5 MiB   1.2 MiB
121587   3.1 MiB    2.6 MiB   5.2 MiB

Of course, peak memory usage by itself isn't a terribly interesting value - what also matters is how the engine works with memory over time, so let's look at that as well (images are courtesy of Milian's fantastic massif-visualizer; note that their scales are not relative to each other):

[massif graphs: Lucene++, SQLite, Xapian]

We can see that both Lucene++ and SQLite seem to build a cache on the first query and then use it; Xapian on the other hand doesn't seem to keep a cache, as the rapid drops in memory usage suggest - but maybe there's a different explanation.

So that's it for memory usage with standard queries, but what I was particularly interested in was memory usage with wildcard queries (as I saw some strange behaviour here and there with Xapian). Therefore I added one simple wildcard query, "T*", to the list of executed queries and ran it on the largest DBs (the ones with 121587 movies). As you can imagine, the "T*" query is really generic and matches around 110 thousand documents from the dataset, which is why I also added a limit of 10k results per query to each backend (although that shouldn't make much of a difference).

Let's look at the results:

                Lucene++   SQLite    Xapian
Peak mem usage  4.6 MiB    7.3 MiB   442.2 MiB

Now we can clearly see that Xapian uses a huge amount of memory during expansion of the wildcard query (can this be considered a bug report? :)), SQLite has a couple of peaks but nothing to worry about too much, and Lucene++ shines with its fairly constant (and really low) memory usage.

You saw the data, so I'll leave any conclusions up to you. ;)

The small number of changes I had to make to the original benchmark repository is available as a simple diff here.

Sunday, 4 March 2012

Face detection with OpenCL

I've been meaning to write about the topic of my thesis for quite some time, but didn't really get to it until now, so even though it's almost a year late, here we go.

Before I get into some technical details, here's a youtube video where you can see the OpenCL implementation of my detector in action:

Pretty neat, right? :) So what you just saw was an implementation of a detector based on the WaldBoost algorithm (a variant of AdaBoost) that had as its input a classifier trained for detecting frontal faces (and an awesome video of course) running on a GPU.

If you know anything about boosting algorithms, you'll know that one strong classifier is usually composed of many weak classifiers (which are usually very simple and computationally inexpensive functions) - in my case there are 1000 weak classifiers, each using Local Binary Patterns to extract a feature from the input texture. Unfortunately such a strong classifier is resolution dependent, and to be able to detect objects of various sizes in the input image, we need a pre-processing step.

During pre-processing we create a pyramid of images by gradually down-scaling the input (and since we don't need colors, we also convert it to greyscale). The detector itself still only detects faces at a resolution of 24x24, but using a mapping function we know when it actually detected something in one of the downscaled versions of the image - and there we have a resolution-independent detector. Interesting tidbit: it turned out that creating the pyramid texture by putting the downscaled images horizontally instead of vertically (which you can see in the image below) slightly improved performance of the detector, simply because the texture cache unit had a higher hit ratio in that setup. But since the pyramid texture is then approximately 3.6 times larger than the width of the original image, the detector wouldn't be able to process HD (1280x720) or Full-HD (1920x1080) videos, because the maximum texture size for an OpenCL image is 4096 pixels (with the vertical layout though, 1080 x 3.6 ~= 3900, so even Full-HD videos can be processed).

Left - original image, right - pyramid of downscaled images (real pyramid texture also has the original on top)
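The size arithmetic behind the layout choice can be sanity-checked quickly (3.6 is the approximate pyramid-to-original ratio quoted above; the exact value depends on the downscaling step used):

```shell
# Horizontal layout scales the 1920px width, vertical layout the 1080px
# height; only the latter stays under the 4096px OpenCL image limit.
awk 'BEGIN {
    factor = 3.6; limit = 4096
    w = int(1920 * factor); h = int(1080 * factor)
    printf "horizontal: 1920 * %.1f = %d px (%s)\n", factor, w, (w <= limit ? "fits" : "exceeds 4096")
    printf "vertical:   1080 * %.1f = %d px (%s)\n", factor, h, (h <= limit ? "fits" : "exceeds 4096")
}'
```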

Once we have our pyramid image, it's divided into small blocks which are processed by the GPU cores, and each work item (or thread, if you wish) in a block evaluates the strong classifier at a particular window position of the pyramid image. Overall we evaluate every window position - think every pixel. (In reality it's more complicated than that: the detector uses multiple kernels, each evaluating only a part of the strong classifier. That's because WaldBoost can reject a window early without evaluating all weak classifiers, so when a kernel finishes, it has reduced the number of candidate windows and the next kernel continues to evaluate only the windows that survived the previous steps - this also keeps most of the work items in the work groups busy.)

Once the detector finishes, we have a handful of window positions in the pyramid image together with the strong classifier's response value at each of them, and these are sent back to the host. The CPU can then finish the detection (by simply thresholding the response values) and map the coordinates back to the input image. If you watched the video carefully, you'll have noticed that there are multiple positive responses around a face, so this would also be a good place to do some post-processing and merge them. Plus there's a false detection from time to time, so again a good place to get rid of those.

You're surely asking how this compares to a pure CPU implementation. As you can imagine, having to evaluate every window position in the pyramid image is very costly, and even optimized SSE implementations can't get close to the performance of a GPU (even though you need to copy a lot of data between the host and the GPU). A simple graph to answer that (note the logarithmic scale):

Processed video frames per second (CPU: Core2 Duo E8200 @ 2.66GHz; GPU: GeForce GTX 285 - driver ver 270)
So why am I talking about all this on my free-software-related blog? Of course because I'm making the source available for anyone to play with, optimize further (there's still plenty of room for that), or do whatever you feel like doing with it. But I need to warn you first - the implementation is heavily optimized for NVidia's hardware and was never really tested on anything else (the AMD CPU implementation of OpenCL doesn't support images; the Intel CPU implementation does support images, but not the image formats I'm using; so that basically leaves only the AMD GPU implementation, but I didn't have such hardware available). I'm also making assumptions that are true only on NVidia's hardware - like that there are 32 work items running at a given time (which is true for NVidia's warp). There are even some helper methods that allowed this to run on hardware without local atomic operations (so even OpenCL 1.0 was enough), but I see now that I can no longer run it on my old GeForce 9300 with the latest NVidia driver (although it did work with version 270). So I don't even know if it works at all with the compiler in the latest driver... you've been warned.

Grab the code branch from Launchpad (bzr branch lp:~mhr3/+junk/ocl-detector), or get the tarball (the only dependencies are glib, opencv, plus a libOpenCL.so somewhere the linker can find it). Run it with `./oclDetector -s CAM` (and if that doesn't seem to detect anything, try `./oclDetector -r -20 -s CAM`).

Thursday, 3 November 2011

News from the Zeitgeist land

Hey everyone,

on behalf of the Zeitgeist team I'd like to announce that today we're releasing the latest version of Zeitgeist (0.8.99-alpha1). It's quite unusual for Zeitgeist to do alpha releases, but this one is special - the entire daemon was rewritten from Python to Vala, which most likely brings a couple of new bugs, but also fixes a bunch of old ones. Therefore we need people to test the release to see if there are any outstanding issues we missed.

As usual, the tarball is available on Launchpad. Of course we'll also be pushing the package into our PPA soon (although the alpha release may only be available for Oneiric users). Please report any bugs you encounter to either Zeitgeist's bugzilla or our Launchpad bug page.

The biggest difference you'll be able to see at this point is a much faster startup time; other than that the changes are minimal - we're still using the same database as well as the same DBus API, so everything should work as before. One thing we did break is the API for the Activity Journal extension, so if you want to continue using it with this and future releases, you'll need to update Activity Journal as well.

What's still missing is the rewrite of our FTS extension (which provides search capabilities), so for the time being we're still using the old one written in Python, which is included in the tarball.

Before I wrap up, I'd like to say huge thank you to Collabora and Canonical, who sponsored the development, and of course to the whole team: Seif (seiflotfy), Siegfried (RainCT), Mikkel (kamstrup) and Manish (m4n1sh).

Wednesday, 3 August 2011

Desktop Summit 2011

To make sure I don't forget: I'll also be at this year's Desktop Summit in Berlin. It'll be my first time at such a huge conference, so I'm quite excited. If you have any Synapse / Zeitgeist questions, feel free to ask. :) Last but not least, we'll also have a BoF about GtkRecent.

See you there!

Tuesday, 17 May 2011

Intel's OpenCL on Ubuntu

Since I work with OpenCL a lot, and yesterday I found out that Intel's OpenCL is now finally available for Linux, I thought I'd share a few words on how to get it to work on Ubuntu (even though Intel currently provides only an rpm package for RHEL and Suse).

First of all, I'm testing all of this on Lucid 64bit, but I suppose it'd work on newer Ubuntu releases as well (though you need to be using the 64bit version, because Intel's package is 64bit-only).

So let's get to it.

  1. First of all, grab the rpm package from Intel's website.
  2. Install the rpm and alien packages (`sudo apt-get install rpm alien`).
  3. Convert the rpm package to deb using alien - `fakeroot alien --to-deb <intel's rpm package filename>`. The conversion spits out some warnings; I wouldn't pay any attention to them.
  4. Install the newly created deb package. `sudo dpkg -i intel-ocl-sdk-suse+11.1_1.1-2_amd64.deb`
  5. One extra package you need to install for the library to work is libnuma. `sudo apt-get install libnuma1`
  6. Make sure the ICD file is in place: `sudo sh -c 'echo /usr/lib64/OpenCL/vendors/intel/libintelocl.so > /etc/OpenCL/vendors/intelocl64.icd'` - note that a plain `sudo echo ... > file` wouldn't work, since the redirection is performed by your unprivileged shell (also, the library filename may differ between SDK versions, so check what the package actually installed).
  7. The package is nice and also installs the OpenCL headers into /usr/include/CL. The main library (libOpenCL.so) is installed in /usr/lib64 - if you don't have any other OpenCL platform installed on your system, I suggest moving it to /usr/lib (run `sudo ldconfig` afterwards); if you already have this library (for example the nvidia driver also contains it), just leave it there.
  8. Since the libraries are installed in non-standard location for Ubuntu (/usr/lib64/OpenCL/vendors/intel), you'll need to adjust your LD_LIBRARY_PATH. I usually do this using a script, but you can just run:
    export LD_LIBRARY_PATH=/usr/lib64/OpenCL/vendors/intel:$LD_LIBRARY_PATH
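For reference, the script I use is just a tiny wrapper that prepends Intel's directory and execs the target program (the wrapper filename is arbitrary):

```shell
# Create a small launcher that extends LD_LIBRARY_PATH only for the
# program being run, instead of polluting the whole session.
cat > with-intel-cl <<'EOF'
#!/bin/sh
export LD_LIBRARY_PATH=/usr/lib64/OpenCL/vendors/intel${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
exec "$@"
EOF
chmod +x with-intel-cl

./with-intel-cl echo "launcher works"   # real usage: ./with-intel-cl ./myCLApp
```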
Running an OpenCL program that just lists the available platforms should now return at least one platform. Or, if you have multiple platforms with their ICDs installed, you'd get something like:
There are 3 platforms available
  VENDOR: Intel(R) Corporation
    DEVICE: Intel(R) Core(TM)2 Duo CPU     P7370  @ 2.00GHz
  VERSION: OpenCL 1.1 ATI-Stream-v2.3 (451)
  VENDOR: Advanced Micro Devices, Inc.
    DEVICE: Intel(R) Core(TM)2 Duo CPU     P7370  @ 2.00GHz
    DEVICE VERSION: OpenCL 1.1 ATI-Stream-v2.3 (451)
  VERSION: OpenCL 1.0 CUDA 3.2.1
  VENDOR: NVIDIA Corporation
    DEVICE: GeForce 9300M GS
    DRIVER VERSION: 260.19.29

Good luck implementing your OpenCL kernels. :)