
Generic Functions with Python

PEAK has been offering generic functions similar to CLOS for Python for quite some time. I always wanted to play around with it, but for a long time it was just part of PyProtocols, and the installation was a bit tricky. However, since September of this year, it has been decoupled and much easier to install. So I dove right in.

And I must say: wow. What Phillip J. Eby has accomplished is truly fantastic. The integration with Python (works from Python 2.3 - he even invented his own implementation of decorators for Python 2.3) is superb, even if, of course, some things take a bit of getting used to.

A small example:

import dispatch

[dispatch.generic()]
def anton(a, b):
    "handle two objects"

[anton.when('isinstance(a,int) and isinstance(b,int)')]
def anton(a, b):
    return a + b

[anton.when('isinstance(a,str) and isinstance(b,str)')]
def anton(a, b):
    return a + b

[anton.when('isinstance(a,str) and isinstance(b,int)')]
def anton(a, b):
    return a * b

[anton.when('isinstance(a,int) and isinstance(b,str)')]
def anton(a, b):
    return b * a

[anton.before('True')]
def anton(a, b):
    print type(a), type(b)

This small example simply provides a function called 'anton', which executes different code based on the parameter types. The example is of course completely nonsensical, but it shows some important properties of generic functions:

  • Generic functions are - unlike classic object/class methods - not bound to any class or object. Instead, the implementation to execute is selected based on the parameters.
  • The parameter conditions must therefore be declared - this usually happens via a mini-language in which the selection conditions are formulated. This is also the only syntactic part I don't like much: the conditions are stored as strings. The integration is very good, however, and you already get clean syntax errors at load time.
  • A generic function can be overloaded with arbitrary conditions - it's not just the first parameter that decides. Conditions can also dispatch on values - any arbitrary Python expression can be used there.
  • With method combination (methods are the concrete implementations of a generic function here), you can hook code in before or after a call without touching the code itself. The example uses a before method that always fires (hence the 'True') to generate debugging output. Of course, you can also attach before/after methods with conditions to specific invocations of the generic function - making generic functions a full-fledged event system.
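To make the mechanism tangible, here is a toy model of predicate dispatch in modern Python - emphatically not RuleDispatch's implementation (which parses the condition strings and builds efficient dispatch indexes), just the bare idea of selecting a method implementation by predicates:

```python
class Generic:
    """Toy generic function: methods are (predicate, implementation) pairs."""
    def __init__(self, doc=""):
        self.__doc__ = doc
        self.methods = []
        self.befores = []

    def when(self, pred):
        def register(fn):
            self.methods.append((pred, fn))
            return fn
        return register

    def before(self, pred):
        def register(fn):
            self.befores.append((pred, fn))
            return fn
        return register

    def __call__(self, a, b):
        for pred, fn in self.befores:   # before-methods run first
            if pred(a, b):
                fn(a, b)
        for pred, fn in self.methods:   # first matching method wins here
            if pred(a, b):
                return fn(a, b)
        raise TypeError("no applicable method")

anton = Generic("handle two objects")

@anton.when(lambda a, b: isinstance(a, int) and isinstance(b, int))
def add_ints(a, b):
    return a + b

@anton.when(lambda a, b: isinstance(a, str) and isinstance(b, int))
def repeat_string(a, b):
    return a * b

print(anton(2, 3))     # 5
print(anton("ab", 3))  # ababab
```

A real system also has to order overlapping conditions by specificity; the toy simply takes the first match.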

A pretty good article about RuleDispatch (the generic functions package) can be found at IBM developerWorks.

The example, by the way, shows the Python 2.3 syntax for decorators. With Python 2.4, of course, the @ syntax can also be used. One disadvantage should not be kept secret: the definition of generic functions and their methods is not possible interactively - at least not with the Python 2.3 syntax. Unfortunately, you generally have to work with external definitions in files here.

RuleDispatch will definitely find a place in my toolbox - the syntax is simple enough, the possibilities, however, are gigantic. As an event system, it surpasses any other system in flexibility, and as a general way of structuring code, it comes very close to CLOS. It's a shame that Django will likely align with PyDispatch - in my opinion, RuleDispatch would fit much better (as many aspects in Django could be written as dispatch on multiple parameter types).

Blogcounter, Penis Size Comparisons, and Other Lies

Right now, people are once again wildly discussing hit counts and similar nonsense. Usually I don't care (my server has an absurdly high traffic allowance that I never come close to using, and the server load is low as well - so why should I care how much comes in?), but the various announcements of hit counts, page views and visits always make me smile a little.

So, a small analysis of the whole story. First, the most important part: where do these numbers come from? Basically, there are two possibilities. One relies on pages containing a small element that gets counted (e.g., an image - sometimes invisible - a piece of JavaScript, or an iframe - all commonly referred to as web bugs). The other method goes to the web server's log files and evaluates them. There is a third, where the individual visitor is identified via a cookie - but this is rather rarely used, except by some rather unpopular advertising systems.

Basically, there are only a few real numbers such a system can actually deliver (setting aside individualization via cookies): hits on the one hand, and bytes transferred on the other. Of rather remote usefulness, there is also the number of distinct hosts (IP addresses) that accessed the site.

But these numbers have a problem: they are purely technical. And thus strongly dependent on technology. Hits go up if you have many external elements. Bytes go up if you have many long pages (or large images or ...). IP addresses go down if many visitors are behind proxies. And they go up if you have many ISDN users - because of the dynamic dial-up addresses. Changes in the numbers are therefore due to both changes in visitors and changes in the pages.

All these numbers are about as meaningful as the coffee grounds in the morning cup. That's why people derive other numbers from these - at least technically well-defined - figures, numbers that are supposed to actually say something. Worth mentioning here are visits (visits to the website), page impressions (accesses to real page addresses), and visitors (distinct visitors).

Let's take the simplest number, which at least has a rudimentary connection to the real world: page impressions. There are different ways to get at it. You can put the aforementioned web bugs on the pages that are to be counted - then the number is about as reliable as the counting system. Unfortunately, the counting systems are anything but reliable, but more on that in a moment. The alternative - going through the web server log files - is a bit better. Here you simply count how many hits with the MIME type text/html (or whatever is used for your own pages) are delivered. You could also count .html addresses - but many sites no longer have that suffix in their URLs, so the MIME type is more reliable.
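For illustration, counting page impressions from log lines takes only a couple of lines of Python - note the assumption that your log format appends the response MIME type, which stock common/combined log formats do not (the sample lines below are made up):

```python
# Hypothetical access-log lines with the response Content-Type appended at the
# end; a real server would need a custom log format to record this.
LOG = [
    '1.2.3.4 - - "GET / HTTP/1.1" 200 5123 text/html',
    '1.2.3.4 - - "GET /style.css HTTP/1.1" 200 811 text/css',
    '5.6.7.8 - - "GET /about/ HTTP/1.1" 200 4200 text/html',
]

def page_impressions(lines):
    # Count only deliveries of real pages, not embedded objects.
    return sum(1 for line in lines if line.rstrip().endswith('text/html'))

print(page_impressions(LOG))  # 2
```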

Significance? Well, rather doubtful. Many users are forced through proxies by their providers - and a proxy has the very purpose of avoiding hits. Once one visitor has retrieved a page, it may (depending on the proxy configuration) be delivered to other visitors from the cache instead of being fetched from the server. This affects all of AOL, for example - the numbers there are clearly distorted. And the more of an A-list blogger someone really is, the more distorted the numbers often are (since cache hits become more frequent than on less-visited blogs).

In addition, browsers also do such things - cache pages. Or visitors do something else - reload pages. Proxies repeat some loading process automatically because the first one may not have gone through completely due to timeout - all of these are distortions of the numbers. Nevertheless, page impressions are still at least halfway usable. Unless you use web bugs.

Because web bugs have a general problem: they are not main pages but embedded objects. Here browsers behave even more stubbornly - what is in the cache is displayed from the cache. Why fetch the little picture again? Of course you can prevent this with suitable headers - nevertheless, it often goes wrong. JavaScript-based techniques miss users without JavaScript entirely (and believe me, there are considerably more of them than is commonly admitted). In the end, web bugs have all the problems of the actual pages, plus a few additional problems of their own. Why are they still used? Because they are the only way to have your statistics counted on a system other than your own - indispensable for global length comparisons, after all.

Well, let's leave page impressions and thus the area of rationality. Let's come to visits, and thus closely related to visitors. Visitors are mysterious beings on the web - you only see the accesses, but who it is and whether you know them, that is not visible. All the more important for marketing purposes, because everything that is nonsense and cannot be verified can be wonderfully exploited for marketing.

To a web server, visitors are only recognizable via the IP address of the access, plus the headers the browser sends. Unfortunately, that is much more than one would like to admit - but (except for the cookie setters with individual user tracking) not enough for unique identification. Because users share IPs - every proxy is counted as one IP. Users may use something like tor - and then the IP is often different from last time. Users share a computer in an Internet café - so it is really computers being counted, not users. There are headers set by caches that allow some assignment - but if the users behind the cache all use private IP addresses (the 10.x.x.x, 172.16.x.x-172.31.x.x and 192.168.x.x addresses known from the relevant literature), that doesn't help either.
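As an aside from a much later Python (the stdlib ipaddress module only appeared in 3.3, long after this was written), checking whether an address falls in those private RFC 1918 ranges is easy - and it confirms that only 172.16.x.x through 172.31.x.x counts, not all of 172.x.x.x:

```python
import ipaddress  # standard library since Python 3.3

def behind_private_ip(addr):
    """True for addresses in the private ranges 10/8, 172.16/12, 192.168/16."""
    return ipaddress.ip_address(addr).is_private

print(behind_private_ip('172.16.0.1'))  # True
print(behind_private_ip('172.32.0.1'))  # False
```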

Visitors can still be assigned a bit if the period is short - but over days? Sorry, but in the age of dynamic IP addresses, that doesn't help at all. The visitors of today and those of tomorrow can be the same or different - no idea. Nevertheless, it is proudly announced how many visitors one had in a month. Of course, this no longer has any meaning. Even daily numbers are already strongly changed by dynamic dial-ups (not everyone uses a flat rate and has the same address for 24 hours).

But to add to the madness, not only the visitors are counted (allegedly), but also their visits. Yes, that's really exciting. Because what is a visit? Ok, recognizing a visitor again over a short period of time (with all the problems that proxies and the like bring about, of course) works quite well - and you also know exactly when a visit begins. Namely, with the first access. But when does it end? Because there is no such thing as ending a web visit (a logout). You just go away. Don't come back so quickly (if at all).

Yes, that's when it gets really creative. Do you just look at the time intervals between hits? Or - because visitors surely read the content - do you derive the interval after which a hit counts as a new visit from the size of the last retrieved page? How do you filter out regular refreshes? How do you deal with the visitor-counting problems above?

Not at all. The numbers are simply sucked out of thin air. Then a number comes out - usually based on a time interval between hits: long pause, new visit. That's simply counted and added to a sum. Never mind that a visit may have been interrupted by a phone call - so two counted visits were really one visit with a pause in it. Never mind that users share computers or IP addresses - so one counted visit was in reality 10 interleaved visits.
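That interval heuristic is trivially sketched - which also makes its weakness obvious (pick a different gap and you get a different "truth"):

```python
def count_visits(hit_times, gap=1800):
    """The usual naive rule: a pause longer than `gap` seconds starts a new visit."""
    visits, last = 0, None
    for t in sorted(hit_times):
        if last is None or t - last > gap:
            visits += 1
        last = t
    return visits

# One reader pausing 80 minutes mid-read is counted as two visits:
print(count_visits([0, 60, 120, 4920, 4980]))  # 2
```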

Oh, yes, I know that some software uses the referrer headers of the browser to assign paths through the system and thus build clearer visits. Which of course no longer works smoothly if the user goes back with the back button or enters an address again without a referrer being produced. Or uses a personal firewall that partially filters referrers.

What is really cute is that all these numbers are thrown on the market without clear statements being made. Of course, sometimes it is said which service the numbers were determined via - but what does that say? Can the numbers be faked there? Does the operator count correctly (at blogcounter.de you can certainly fake the numbers in the simplest way) and does he count sensibly at all? Oh well, just take numbers.

The argument is often brought up that although the numbers cannot be compared directly as absolute numbers across counter boundaries, you can compare numbers from the same counter - companies are founded on this, which make money by renting out this coffee ground technology to others and thus realizing the great cross-border rankings. Until someone notices how the counters can be manipulated in a trivial way ...

It gets really cute when the numbers are brought into line with the time axis and things like average dwell time are derived from this and then, in combination with the page size, it is determined how many pages were read and how many were just clicked (based on the usual reading speed, such a thing is actually "evaluated" by some software).

So let's summarize: there is a limited set of information to build on. These are hits (i.e., retrievals from the server), hosts (i.e., the retrieving IP addresses), and the amount transferred (summing the bytes of the retrievals). In addition, there is auxiliary information such as referrers and possibly cookies. All these numbers can be manipulated and falsified - and many actually are falsified by common Internet technologies (the most common case being caching proxies).

These rather unreliable numbers are chased through - partly non-public - algorithms and then mumbo jumbo is generated, which is used to show what a cool frood you are and where the towel hangs.

And I'm supposed to participate in such nonsense?

PS: According to the awstats evaluation, the author of this posting had 20,172 visitors, 39,213 visits, 112,034 page views in 224,402 accesses, and pushed 3.9 gigabytes over the line last month - which, as noted above, is completely irrelevant and meaningless, except that he might look for more sensible hobbies.

Living Data

Funny title, isn't it? Well, I just noticed something while dealing with web frameworks and other applications, specifically in the Ruby and Python world: the way small amounts of data are stored and how configuration data is handled, for example.

In the Java environment, there is an inflation of XML mini-languages - mountains of dead data. Dead because this data only exists in XML format and can only be processed and modified using XML tools. For example, if I have constantly repeating or algorithmically describable configuration blocks (e.g., a mountain of quite similar-looking URL patterns for a web framework), I can only generate these using XML tools - e.g., generate them from simpler formats using XSLT. Or I write small tools for this.

In Ruby, the situation is similar - only that instead of XML, YAML is used here. Ultimately, however, this is not better - the configuration is still a dead file.

But both in the Python environment and in various other dynamic languages, there is a good alternative to this: just use a module in your programming language. For example, Python modules live - if the structure is complex but partially repetitive - simply write a small Python function that helps with the dynamic creation of the config. If the config should partially come from database contents - simply write a Python function that reads this data from the DB at runtime and mixes it into the config. Living configuration data, after all.
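To make this concrete, here is a little sketch of such a living configuration in plain Python - the module layout, view names and URL patterns are all made up. Repetitive blocks are generated by an ordinary function instead of being copied out by hand in XML or YAML:

```python
# config.py -- a "living" configuration module (all names hypothetical).

def crud_patterns(prefix, view_module):
    """Generate the repetitive URL patterns for one resource."""
    return [
        (r'^%s/$' % prefix,       '%s.index' % view_module),
        (r'^%s/add/$' % prefix,   '%s.add' % view_module),
        (r'^%s/(\d+)/$' % prefix, '%s.detail' % view_module),
    ]

urlpatterns = []
for name in ('articles', 'comments', 'users'):
    urlpatterns += crud_patterns(name, 'myapp.views.%s' % name)

print(len(urlpatterns))  # 9
```

The same trick extends naturally to mixing in values read from a database or the environment at import time - something no dead data file can do.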

Of course, security issues come into play here - we don't want to repeat the PHP mistake with the eternal eval. What is urgently necessary for this would be a clean sandbox for such modules. Unfortunately, there is a massive hole in the implementation right there in Python. There were bytecode hacks in the past, which were also revived - but these are just hacks. The method of building a pseudo-sandbox using restricted imports and proxy objects, as Zope does, is also not the be-all and end-all.

Perl offers a very clean method here to regulate in minute detail what code in such a sandbox is allowed to do - which, as usual for security features in Perl, is of course used by almost no project - so a configuration via Perl module is definitely better secured than in languages without such a concept.

Java itself, of course, has a pretty sophisticated security management system - necessarily, as it is also supposed to run in browsers with very restricted rights. This security model is also usable for applications and could be used, for example, for servlets or Java configs - especially since you can also easily translate files at runtime and load them dynamically with Java. Now explain to me why the Java people are so fixated on XML when they have the best foundations for secure living data ...

We will ignore PHP's safe_mode here, because it is an all-or-nothing model - either all code runs under safe_mode, or none does. What we would need is selective activation of different security classes for a single code block or module import (ok, PHP doesn't have module imports either, only includes - as I said, we'll just ignore it).

So far, you can only work with living configurations in Python if you are sure that the configurations are only edited by users without malicious intent. Django, for example, only uses living configurations - it would therefore be a pretty stupid idea to make the configuration files editable via the web for centrally hosted applications.

We urgently need a clean sandbox for Python. I even believe that this would be a more important subproject than the various syntactic extensions that are repeatedly addressed.

Software Patents - Commentary in the NY Times

The NY Times asks why Bill Gates wants 3,000 new patents and finds a massive siege of the patent office with mountains of software patents, which are often mere trivial patents (like the cited patent for adding/removing spaces in documents). The commentator makes a demand (after considering whether Microsoft shouldn't simply have all the patents it already holds revoked):

Perhaps that is going too far. Certainly, we should go through the lot and reinstate the occasional invention embodied in hardware. But patent protection for software? No. Not for Microsoft, nor for anyone else.

And this from the country that has had software patents for a long time and that is repeatedly cited by software patent proponents in the EU as a reason for a necessary worldwide harmonization.

No, software patents are also not popular there and not really useful. Dan Bricklin, known to some as the father of VisiCalc, also thinks so:

Mr. Bricklin, who has started several software companies and defensively acquired a few software patents along the way, says he, too, would cheer the abolition of software patents, which he sees as the bane of small software companies. "The number of patents you can run into with a small product is immense," he said. As for Microsoft's aggressive accumulation in recent years, he asked, "Isn't Microsoft the poster child of success without software patents?"

And why is Microsoft doing this now? The responsible manager gives a reason so dumb only a business administrator could have come up with it:

"We realized we were underpatenting," Mr. Smith explained. The company had seen studies showing that other information technology companies filed about two patents for every $1 million spent on research and development. If Microsoft was spending $6 billion to $7.5 billion annually on its R&D, it would need to file at least 3,000 applications to keep up with the Joneses.

Ok, the idea of orienting patent applications purely towards industry averages is absurd enough, but how stupid do you have to be to draw a connection between the number of patents filed and the amount spent on research and development?

The NY Times also draws a parallel to the pharmaceutical industry, which - at least according to its own statements - is happy to get a patent for a drug when it invests 20 million in research (which is already critical enough, as can be seen in the fight against AIDS in Africa).

And the fallout is also well summarized in the NY Times:

Last year at a public briefing, Kevin R. Johnson, Microsoft's group vice president for worldwide sales, spoke pointedly of "intellectual property risk" that corporate customers should take into account when comparing software vendors. On the one side, Microsoft has an overflowing war chest and bulging patent portfolio, ready to fight - or cross-license with - any plaintiff who accuses it of patent infringement. On the other are the open-source developers, without war chest, without patents of their own to use as bargaining chips and without the financial means to indemnify their customers.

The question of what Jefferson (the founder of the US patent system) would say about what is now being patented is quite justified. In his sense - which was actually more about protecting real inventive genius from exploitation by corporations - this is definitely not the case.

Writing a Simple Filesystem Browser with Django

This article is in English for a change, since it might also be interesting to the people on #django. This posting will show how to build a very simple filesystem browser with Django. This filesystem browser behaves mostly like a static webserver that allows directory traversal. The only speciality is that you can use the Django admin to define filesystems that are mounted into the namespace of the Django server. This is just to demonstrate how a Django application can make use of data sources other than the database; it's not really meant to serve static content (although with added authentication it could come in quite handy for restricted static content!).

Even though the application makes very simple security checks on passed-in filenames, you shouldn't run this on a public server - I didn't do any security tests and there might be buttloads of bad things in there that might expose your private data to the world. You have been warned. We start as usual by creating the filesystems application with the django-admin.py startapp filesystems command. Just do it like you did with your polls application in the first tutorial. As an orientation, this is how the myproject directory looks on my development machine:


.
|-- apps
|   |-- filesystems
|   |   |-- models
|   |   |-- urls
|   |   `-- views
|   `-- polls
|       |-- models
|       |-- urls
|       `-- views
|-- public_html
|   `-- admin_media
|       |-- css
|       |-- img
|       |   `-- admin
|       `-- js
|           `-- admin
|-- settings
|   `-- urls
`-- templates
    `-- filesystems

After creating the infrastructure, we start by building the model. The model for the filesystems is very simple - just a name for the filesystem and a path where the files are actually stored. So here it is, the model:


from django.core import meta

class Filesystem(meta.Model):
    fields = (
        meta.CharField('name', 'Name', maxlength=64),
        meta.CharField('path', 'Path', maxlength=200),
    )

    def __repr__(self):
        return self.name

    def get_absolute_url(self):
        return '/files/%s/' % self.name

    def isdir(self, path):
        import os
        p = os.path.realpath(os.path.join(self.path, path))
        if not p.startswith(self.path):
            raise ValueError(path)
        return os.path.isdir(p)

    def files(self, path=''):
        import os
        import mimetypes
        p = os.path.realpath(os.path.join(self.path, path))
        if not p.startswith(self.path):
            raise ValueError(path)
        l = os.listdir(p)
        if path:
            l.insert(0, '..')
        return [(f,
                 os.path.isdir(os.path.join(p, f)),
                 mimetypes.guess_type(f)[0] or 'application/octet-stream')
                for f in l]

    def file(self, path):
        import os
        import mimetypes
        p = os.path.realpath(os.path.join(self.path, path))
        if not p.startswith(self.path):
            raise ValueError(path)
        (t, e) = mimetypes.guess_type(p)
        return (p, t or 'application/octet-stream')

    admin = meta.Admin(
        fields = (
            (None, {'fields': ('name', 'path')}),
        ),
        list_display = ('name', 'path'),
        search_fields = ('name', 'path'),
        ordering = ['name'],
    )


As you can see, the model and the admin are rather boring. What is interesting, though, are the additional methods isdir, files and file. isdir just checks whether a given path below the filesystem is a directory or not. files returns the files of the given path below the filesystem's base path, and file returns the real pathname and the mimetype of a given file below the filesystem's base path. All three methods check the validity of the passed-in path - if the resulting path isn't below the filesystem's base path, a ValueError is raised. This makes sure that nobody uses .. in the path name to break out of the defined filesystem area. So the model includes special methods you can use to access the filesystem's content itself, without your views having to care how that is done. It's the job of the model to know about such stuff.
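The containment check the model relies on can be isolated into a small standalone helper - the same realpath-then-prefix idea, sketched independently of Django (with one extra guard, slightly stricter than a plain startswith, so /foo doesn't accidentally match /foobar):

```python
import os
import os.path

def safe_join(base, path):
    """Resolve `path` below `base`; raise ValueError if it escapes, e.g. via '..'."""
    base = os.path.realpath(base)
    p = os.path.realpath(os.path.join(base, path))
    # Compare against base plus a separator so '/foo' can't match '/foobar':
    if p != base and not p.startswith(base + os.sep):
        raise ValueError(path)
    return p
```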

The next part of our little filesystem browser is the URL configuration. It's rather simple: it consists of one line in settings/urls/main.py and the myproject.apps.filesystems.urls.filesystems module. First, the line in the main urls module:


 from django.conf.urls.defaults import *

urlpatterns = patterns('',
 (r'^files/', include('myproject.apps.filesystems.urls.filesystems')),
 )

Next the filesystems own urls module:


 from django.conf.urls.defaults import *

urlpatterns = patterns('myproject.apps.filesystems.views.filesystems',
 (r'^$', 'index'),
 (r'^(?P<filesystem_name>.*?)/(?P<path>.*)$', 'directory'),
 )

You can now add the application to the main settings file so you don't forget to do that later on. Just look for the INSTALLED_APPS setting and add the filebrowser:


 INSTALLED_APPS = (
 'myproject.apps.polls',
 'myproject.apps.filesystems'
 )

One part is still missing: the views. This module defines the externally reachable methods we referenced in the urlmapper, so we need two methods, index and directory. The second one doesn't actually only work on directories - if it gets passed a file, it presents the contents of that file with the right mimetype. The view makes use of the methods defined in the model to access the actual filesystem contents. Here is the source for the views module:


from django.core import template_loader
from django.core.extensions import DjangoContext as Context
from django.core.exceptions import Http404
from django.models.filesystems import filesystems
from django.utils.httpwrappers import HttpResponse

def index(request):
    fslist = filesystems.get_list(order_by=['name'])
    t = template_loader.get_template('filesystems/index')
    c = Context(request, {
        'fslist': fslist,
    })
    return HttpResponse(t.render(c))

def directory(request, filesystem_name, path):
    try:
        fs = filesystems.get_object(name__exact=filesystem_name)
        if fs.isdir(path):
            files = fs.files(path)
            t = template_loader.get_template('filesystems/directory')
            c = Context(request, {
                'dlist': [f for (f, d, t) in files if d],
                'flist': [{'name': f, 'type': t} for (f, d, t) in files if not d],
                'path': path,
                'fs': fs,
            })
            return HttpResponse(t.render(c))
        else:
            (f, mimetype) = fs.file(path)
            return HttpResponse(open(f).read(), mimetype=mimetype)
    except ValueError:
        raise Http404
    except filesystems.FilesystemDoesNotExist:
        raise Http404
    except IOError:
        raise Http404

See how the elements of the directory pattern are passed in as parameters to the directory method - the filesystem name is used to find the right filesystem and the path is used to access content below that filesystem's base path. Mimetypes are discovered using the mimetypes module from the Python standard library, by the way.

The last part of our little tutorial are the templates. We need two templates - one for the index of the defined filesystems and one for the content of some path below some filesystem. We don't need a template for the files content - file content is delivered raw. So first the main index template:


{% if fslist %}
<h1>defined filesystems</h1>
<ul>
{% for fs in fslist %}
<li><a href="{{ fs.get_absolute_url }}">{{ fs.name }}</a></li>
{% endfor %}
</ul>
{% else %}
<p>Sorry, no filesystems have been defined.</p>
{% endif %}

The other template is the directory template, which shows the contents of a path below the filesystem's base path:


{% if dlist or flist %}
<h1>Files in //{{ fs.name }}/{{ path }}</h1>
<ul>
{% for d in dlist %}
<li><a href="{{ fs.get_absolute_url }}{{ path }}{{ d }}/">{{ d }}</a></li>
{% endfor %}
{% for f in flist %}
<li><a href="{{ fs.get_absolute_url }}{{ path }}{{ f.name }}">{{ f.name }}</a> ({{ f.type }})</li>
{% endfor %}
</ul>
{% endif %}

Both templates need to be stored somewhere in your TEMPLATE_DIRS. I have set up a directory there named after the application, filesystems, and stored the files in it as index.html and directory.html. Of course you would normally build a base template for the site and extend that in your own templates, and you would add a 404.html to handle 404 errors. But that's left as an exercise for the reader.

After you start up the development server for your admin (don't forget to set DJANGO_SETTINGS_MODULE accordingly!), you can add a filesystem to your database (you did run django-admin.py install filesystems somewhere in between? No? Do it now, before you start your server). Now stop the admin server, change your DJANGO_SETTINGS_MODULE and start the server with the main settings. You can now surf to http://localhost:8000/files/ (at least if you set up your URLs and server like I do) and browse the files in your filesystem. That's it. Wasn't very complicated, right? Django really is simple to use.

Django, lighttpd and FCGI, second take

In my first take at this stuff I gave a sample on how to run django projects behind lighttpd with simple FCGI scripts integrated with the server. I will elaborate a bit on this stuff, with a way to combine lighttpd and Django that gives much more flexibility in distributing Django applications over machines. This is especially important if you expect high loads on your servers. Of course you should make use of the Django caching middleware, but there are times when even that is not enough and the only solution is to throw more hardware at the problem.

Update: I maintain my descriptions now in my trac system. See the lighty+FCGI description for Django.

Caveat: since Django is very new software, I don't have production experience with it. So this is more from a theoretical standpoint, incorporating knowledge I gained from running production systems for several larger portals. In the end it doesn't matter much what your software is - it only matters how you can distribute it over your server farm.

To follow this documentation, you will need the following packages and files installed on your system:

  • [Django][2] itself - currently fetched from SVN. Follow the setup instructions or use python setup.py install .
  • [Flup][3] - a package of different ways to run WSGI applications. I use the threaded WSGIServer in this documentation.
  • [lighttpd][4] itself of course. You need to compile at least the fastcgi, the rewrite and the accesslog module, usually they are compiled with the system.
  • [Eunuchs][5] - only needed if you are using Python 2.3, because Flup uses socketpair in the preforked servers and that is only available starting with Python 2.4
  • [django-fcgi.py][6] - my FCGI server script, might some day be part of the Django distribution, but for now just fetch it here. Put this script somewhere in your $PATH, for example /usr/local/bin and make it executable.
  • If the above doesn't work for any reason (maybe your system doesn't support socketpair and so can't use the preforked server), you can fetch [django-fcgi-threaded.py][7] - an alternative that uses the threading server with all its problems. I use it on Mac OS X for development, for example.

Before we start, let's talk a bit about server architecture, Python and heavy load. The still-preferred installation of Django is behind Apache 2 with mod_python. mod_python is a quite powerful extension to Apache that integrates a full Python interpreter (or even many interpreters with separate namespaces) into the Apache process. This allows Python to control many aspects of the server. But it has a drawback: if the only use is to pass requests from users on to the application, it's quite an overkill - every Apache process or thread will carry a full Python interpreter with stack, heap and all loaded modules. Apache processes get a bit fat that way.

Another drawback: Apache is one of the most flexible servers out there, but it's a resource hog compared to small servers like lighttpd. And - due to the architecture of Apache modules - mod_python will run the full application in the security context of the web server. Two things you usually don't want in production environments.

So a natural approach is to use lighter HTTP servers and put your application behind those - using the HTTP server itself only for media serving, and using FastCGI to pass requests from the user on to your application. Sometimes you put that small HTTP server behind an Apache front that only uses mod_proxy (either directly or via mod_rewrite) to proxy requests to your application's web server - and believe it or not, this is actually a lot faster than serving the application with Apache directly!

The second pitfall is Python itself. Python has quite a nice threading library, so it would seem ideal to build your application as a threaded server - threads use much fewer resources than processes. But one special feature of Python will bite you: the GIL, the dreaded global interpreter lock. This isn't an issue if your application is 100% Python - the GIL only really hurts when C extensions hold it during long calls. Too bad that almost all DBAPI libraries use at least some database client code in a C extension: you start a SQL command and threading is effectively disabled until the call returns. No multiple queries running in parallel ...

So the better option is to use a forking server, because that way the GIL won't get in the way - each process has its own interpreter and lock. This allows a forking server to make efficient use of multiple processors in your machine, and so be much faster in the long run, despite the overhead of processes vs. threads.
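This reasoning can be illustrated with a small sketch (not from the original post; the function and numbers are invented): CPU-bound work spread over worker processes runs in parallel because each process has its own interpreter and GIL, while the same work in threads of one process would be serialized.

```python
# Illustration only: each worker process has its own GIL, so CPU-bound
# work runs truly in parallel - the core argument for a preforked server.
from multiprocessing import Pool

def busy_sum(n):
    # stand-in for CPU-bound work that would hold the GIL in a thread
    return sum(range(n))

if __name__ == '__main__':
    with Pool(2) as pool:  # two worker processes, like a tiny prefork pool
        print(pool.map(busy_sum, [10, 100, 1000]))  # [45, 4950, 499500]
```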

For this documentation I take a three-layer approach to distributing the software: the front will be your trusted Apache, just proxying everything out to your project-specific lighttpd. The lighttpd will have access to your project's document root and will pass special requests on to your FCGI server. The FCGI server itself will be able to run on a different machine, if that's needed for load distribution. It will use a preforked server because of the threading problem in Python, and so will be able to make use of multiprocessor machines.

I won't talk much about the first layer, because you can easily set that up yourself. Just proxy everything out to the machine where your lighttpd is running (in my case the Apache usually runs on different machines than the applications). Look it up in the mod_proxy documentation; usually it's just ProxyPass and ProxyPassReverse.
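For illustration, a minimal front-end fragment of the kind meant here (hostname and port are placeholders, not taken from the post):

```
# hypothetical Apache front-end fragment: proxy everything to the
# machine running lighttpd (adjust host and port to your setup)
ProxyPass        / http://lighttpd-host:8000/
ProxyPassReverse / http://lighttpd-host:8000/
```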

The second layer is more interesting. lighttpd is a bit weird about configuring FCGI - you need FCGI scripts in the filesystem and need to hook those up to your FCGI server process. The FCGI scripts actually don't need to contain anything - they just need to exist in the file system.

So we start with your Django project directory. Just put a directory public_html in there - that's the place where you put your media files, for example the admin media directory. This directory will be the document root for your project server. Be sure to put only files in there that don't contain private data - private data like configs and modules is better kept in places not accessible by the webserver.

Next, set up a lighttpd config file. You will only use the rewrite and the fastcgi modules. No need to keep an access log - that one is written by your first layer, the Apache server. In my case the project is in /home/gb/work/myproject - you will need to adjust that to your own situation. Store the following content as /home/gb/work/myproject/lighttpd.conf


 server.modules = ( "mod_rewrite", "mod_fastcgi" )
 server.document-root = "/home/gb/work/myproject/public_html"
 server.indexfiles = ( "index.html", "index.htm" )
 server.port = 8000
 server.bind = "127.0.0.1"
 server.errorlog = "/home/gb/work/myproject/error.log"

 fastcgi.server = (
   "/main.fcgi" => (
     "main" => (
       "socket" => "/home/gb/work/myproject/main.socket"
     )
   ),
   "/admin.fcgi" => (
     "admin" => (
       "socket" => "/home/gb/work/myproject/admin.socket"
     )
   )
 )

 url.rewrite = (
   "^(/admin/.*)$" => "/admin.fcgi$1",
   "^(/polls/.*)$" => "/main.fcgi$1"
 )

 mimetype.assign = (
   ".pdf" => "application/pdf",
   ".sig" => "application/pgp-signature",
   ".spl" => "application/futuresplash",
   ".class" => "application/octet-stream",
   ".ps" => "application/postscript",
   ".torrent" => "application/x-bittorrent",
   ".dvi" => "application/x-dvi",
   ".gz" => "application/x-gzip",
   ".pac" => "application/x-ns-proxy-autoconfig",
   ".swf" => "application/x-shockwave-flash",
   ".tar.gz" => "application/x-tgz",
   ".tgz" => "application/x-tgz",
   ".tar" => "application/x-tar",
   ".zip" => "application/zip",
   ".mp3" => "audio/mpeg",
   ".m3u" => "audio/x-mpegurl",
   ".wma" => "audio/x-ms-wma",
   ".wax" => "audio/x-ms-wax",
   # corrected: sample configs often mis-map .ogg to audio/x-wav
   ".ogg" => "application/ogg",
   ".wav" => "audio/x-wav",
   ".gif" => "image/gif",
   ".jpg" => "image/jpeg",
   ".jpeg" => "image/jpeg",
   ".png" => "image/png",
   ".xbm" => "image/x-xbitmap",
   ".xpm" => "image/x-xpixmap",
   ".xwd" => "image/x-xwindowdump",
   ".css" => "text/css",
   ".html" => "text/html",
   ".htm" => "text/html",
   ".js" => "text/javascript",
   ".asc" => "text/plain",
   ".c" => "text/plain",
   ".conf" => "text/plain",
   ".text" => "text/plain",
   ".txt" => "text/plain",
   ".dtd" => "text/xml",
   ".xml" => "text/xml",
   ".mpeg" => "video/mpeg",
   ".mpg" => "video/mpeg",
   ".mov" => "video/quicktime",
   ".qt" => "video/quicktime",
   ".avi" => "video/x-msvideo",
   ".asf" => "video/x-ms-asf",
   ".asx" => "video/x-ms-asf",
   ".wmv" => "video/x-ms-wmv"
 )

I bind the lighttpd only to the localhost interface because in my test setting the lighttpd runs on the same host as the Apache server. In multi-server settings you will bind to the public interface of your lighttpd servers, of course. The FCGI scripts communicate via sockets here, because in this test setting I only use one server for everything. If your machines are distributed, you would use the "host" and "port" settings instead of the "socket" setting to connect to FCGI servers on different machines. And you would add multiple entries for the "main" part, to distribute the application load over several machines. Look up the available options in the lighttpd documentation.

I set up two FCGI servers for this - one for the admin settings and one for the main settings. All application requests will be routed through the main settings FCGI and all admin requests to the admin server. That's done with the two rewrite rules - you will need to add a rewrite rule for every application you are using.

Since lighttpd needs the FCGI scripts to exist in order to pass along the PATH_INFO to the FastCGI server, you will need to touch the following files: /home/gb/work/myproject/public_html/admin.fcgi and /home/gb/work/myproject/public_html/main.fcgi

They don't need to contain any code, they just need to exist in the directory. Starting with lighttpd 1.3.16 (at the time of this writing only in svn) you will be able to run without the stub .fcgi files - you just add "check-local" => "disable" to the two FCGI settings, and the local files are no longer needed. So if you want to extend this config file, you just have to keep some very basic rules in mind:

  • every settings file needs its own .fcgi handler
  • every .fcgi needs to be touched in the filesystem - this might go away in a future version of lighttpd, but for now it is needed
  • load distribution is done on .fcgi level - add multiple servers or sockets to distribute the load over several FCGI servers
  • every application needs a rewrite rule that connects the application with the .fcgi handler
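With lighttpd 1.3.16 or later as mentioned above, a single fastcgi.server entry without the stub file could look like this (a sketch using the paths from the config above, not tested against that version):

```
"/main.fcgi" => (
  "main" => (
    "socket" => "/home/gb/work/myproject/main.socket",
    # no stub file in public_html needed with this option
    "check-local" => "disable"
  )
)
```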

Now we have to start the FCGI servers. That's actually quite simple, just use the provided django-fcgi.py script as follows:


 django-fcgi.py --settings=myproject.work.main \
   --socket=/home/gb/work/myproject/main.socket \
   --minspare=5 --maxspare=10 --maxchildren=100 \
   --daemon

 django-fcgi.py --settings=myproject.work.admin \
   --socket=/home/gb/work/myproject/admin.socket \
   --maxspare=2 --daemon

Those two commands will start two FCGI server processes that use the given sockets to communicate. The admin server will only use two processes - the admin server usually isn't the one getting the most hits; that's the main server. So the main server gets higher-than-default settings for spare processes and maximum child processes. Of course this is just an example - tune it to your needs.

The last step is to start your lighttpd with your configuration file: lighttpd -f /home/gb/work/myproject/lighttpd.conf

That's it. If you now access either the lighttpd directly at http://localhost:8000/polls/ or through your front apache, you should see your application output. At least if everything went right and I didn't make too many errors.

Running Django with FCGI and lighttpd

This documentation is intended for a wider audience than just .de, hence the whole thing is in New Westphalian English. Sorry. Update: I now maintain the current description in my trac system; see the FCGI+lighty description for Django. There are different ways to run Django on your machine. One way is only for development: use the django-admin.py runserver command as documented in the tutorial. The builtin server isn't good for production use, though. The other option is running it with mod_python, which is currently the preferred method. This posting is here to document a third way: running Django behind lighttpd with FCGI.

First you need to install the needed packages. Fetch them from their respective download address and install them or use preinstalled packages if your system provides those. You will need the following stuff:

  • [Django][2] itself - currently fetched from SVN. Follow the setup instructions or use python setup.py install.
  • [Flup][3] - a package of different ways to run WSGI applications. I use the threaded WSGIServer in this documentation.
  • [lighttpd][4] itself, of course. You need to compile at least the fastcgi, the rewrite and the accesslog modules; usually they are compiled with the system.

First, after installing lighttpd, you need to create a lighttpd config file. The config file given here uses my own paths - you will need to change them to your own situation. This config file starts a server on port 8000 on localhost - just like the runserver command would do. But this one is a production-quality server with multiple FCGI processes spawned and very fast media delivery.


 # lighttpd configuration file
 #
 ############ Options you really have to take care of ####################

 server.modules = ( "mod_rewrite", "mod_fastcgi", "mod_accesslog" )

 server.document-root = "/home/gb/public_html/"
 server.indexfiles = ( "index.html", "index.htm", "default.htm" )

 # these settings attach the server to the same IP and port as runserver would
 server.port = 8000
 server.bind = "127.0.0.1"

 server.errorlog = "/home/gb/log/lighttpd-error.log"
 accesslog.filename = "/home/gb/log/lighttpd-access.log"

 fastcgi.server = (
   "/myproject-admin.fcgi" => (
     "admin" => (
       "socket" => "/tmp/myproject-admin.socket",
       "bin-path" => "/home/gb/public_html/myproject-admin.fcgi",
       "min-procs" => 1,
       "max-procs" => 1
     )
   ),
   "/myproject.fcgi" => (
     "polls" => (
       "socket" => "/tmp/myproject.socket",
       "bin-path" => "/home/gb/public_html/myproject.fcgi"
     )
   )
 )

 url.rewrite = (
   "^(/admin/.*)$" => "/myproject-admin.fcgi$1",
   "^(/polls/.*)$" => "/myproject.fcgi$1"
 )

This config file will start only one FCGI handler for your admin stuff and the default number of handlers (each one multithreaded!) for your own site. You can fine-tune these settings with the usual lighttpd FCGI settings, even make use of external FCGI spawning and offloading of FCGI processes to a distributed FCGI cluster! Admin media files need to go into your lighttpd document root.

The config works by translating all standard URLs to be handled by the FCGI script for the respective settings file - to add more applications to the system you would just duplicate the rewrite rule for the /polls/ line and change it to choices or whatever your module is named. The next step is to create the .fcgi scripts. Here are the two I am using:


 #!/bin/sh
 # this is myproject.fcgi - put it into your docroot

export DJANGO_SETTINGS_MODULE=myprojects.settings.main

/home/gb/bin/django-fcgi.py

 #!/bin/sh
 # this is myproject-admin.fcgi - put it into your docroot

export DJANGO_SETTINGS_MODULE=myprojects.settings.admin

/home/gb/bin/django-fcgi.py

These two files only make use of a django-fcgi.py script. This is not part of the Django distribution (not yet - maybe they will incorporate it) and its source is given here:


 #!/usr/bin/python2.3

 def main():
     from flup.server.fcgi import WSGIServer
     from django.core.handlers.wsgi import WSGIHandler
     WSGIServer(WSGIHandler()).run()

 if __name__ == '__main__':
     main()

As you can see it's rather simple. It uses the threaded WSGIServer from the fcgi module, but you could just as easily use the forked server - although, as lighttpd already does the preforking, I don't think there is much use in forking at the FCGI level. This script should be somewhere in your path, or just reference it with a fully qualified path as I do.

Now you have all the parts together. I put my lighttpd config into /home/gb/etc/lighttpd.conf, the .fcgi scripts into /home/gb/public_html and the django-fcgi.py into /home/gb/bin. Then I can start the whole mess with /usr/local/sbin/lighttpd -f etc/lighttpd.conf. This starts the server, preforks all FCGI handlers and detaches from the tty to become a proper daemon. The nice thing: this will not run under some special system account but under your normal user account, so your own file restrictions apply.

lighttpd+FCGI is quite powerful and should give you a very nice and very fast option for running Django applications. Problems:
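A forked variant of the script above could be sketched as follows. This is not from the post: flup's fcgi_fork module and the keyword names minSpare, maxSpare and maxChildren are assumptions about the flup package and should be checked against your flup version.

```python
# Sketch of a preforked django-fcgi.py. The fcgi_fork module and its
# keyword arguments are assumptions about flup, not taken from the post.

def prefork_options(minspare=5, maxspare=10, maxchildren=100):
    # kwargs handed through to flup's preforked server (assumed names)
    return {'minSpare': minspare, 'maxSpare': maxspare,
            'maxChildren': maxchildren}

def run(settings_module):
    import os
    os.environ['DJANGO_SETTINGS_MODULE'] = settings_module
    from flup.server.fcgi_fork import WSGIServer   # forked, not threaded
    from django.core.handlers.wsgi import WSGIHandler
    WSGIServer(WSGIHandler(), **prefork_options()).run()

# e.g. run('myprojects.settings.main') from a wrapper script
```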

  • under heavy load some FCGI processes segfault. I first suspected the fcgi library, but after a bit of fiddling (core debugging) I found out it's actually the psycopg on my system that segfaults. So you might have more luck (unless you run Debian Sarge, too)

  • Performance behind a front apache isn't what I would have expected. A lighttpd with front apache and 5 backend FCGI processes only achieves 36 requests per second on my machine, while django-admin.py runserver achieves 45 requests per second! (Still faster than mod_python via apache2: only 27 requests per second.)

Updates:

  • the separation of the two FCGI scripts didn't work right at first. Now I don't match only on the .fcgi extension but on the script name; that way /admin/ really uses myproject-admin.fcgi and /polls/ really uses myproject.fcgi.

  • I have [another document online][6] that goes into more details with regard to load distribution

Pass-Chips and their possible misuse

Owl Content

A bit older, but still interesting: Biometrics/BSI Lecture Program at CeBIT 2005. Particularly interesting are the statements about the authorization of the passport chip readers:

The ICAO standard suggests an optional passive authentication mechanism against unauthorized reading (Basic Access Control). Kügler estimated its effectiveness as only minor. However, Basic Access Control would be suitable for the facial image, as this involves only weakly sensitive data.

This is the part currently being discussed regarding the passport - the authentication of the reader by the passport via the data of the machine-readable zone. This method is not protected against copying the key - once it is determined, it can be used to identify a passport. Even from a greater distance.

The contactless chip in the passport according to ISO 14443 will (naturally) be machine-readable and digitally signed as well as contain the biometric data. As the reading distance, Kügler mentioned a few centimeters, but pointed out that with current technology, reading from several meters away is possible. To ensure copy protection, the RFID chip should actively authenticate itself using an individual key pair, which is also signed.

Important here: the copy protection is handled by active two-way authentication. A passport could therefore only be read with a stored key if it actively cooperates. The keys transmitted are, so to speak, bound to the respective session - because both the passport and the reader have their own key pair. This makes attacks via sniffing the authentication significantly more complicated, as two key pairs must be cracked to do anything with the data. Unfortunately, however, only the basic procedure is currently planned, i.e. only the keys per reader. And it gets worse:

Kügler rated the fingerprint as a highly sensitive feature. Therefore, access protection must be ensured by an active authentication mechanism (Extended Access Control). This was not defined in the ICAO standard and is therefore only usable for national purposes or on a bilateral basis.

Otto Orwell dreams of storing fingerprints - but the procedure for securing them is not yet defined or standardized, so such storage would not be usable across the board. It is also important to ensure that only authorized devices are allowed to read. To this end, all readers would receive a key pair, which must be signed by a central authority. Anyone who has ever dealt with a certification authority knows that there must inevitably be a revocation list - a way to withdraw certificates. This is especially important for passport readers if, for example, they are stolen (don't laugh, devices do disappear at border facilities - hey, entire X-ray gates have been stolen from airports). Unfortunately, the experts see it differently:

In the subsequent short discussion, the question was asked whether a mechanism is provided to revoke the keys of the readers. Kügler indicated that this is not the case so far. However, it is currently under discussion to limit the validity of the keys temporally, but this has not yet been decided.

Hello? So there is no way to revoke a device's key, and - currently - no expiration of keys either. Whoever gains access to a reader has the device's key and technology at their disposal to read every passport in the vicinity, with no possibility of shutting out a device that is being misused. This is like a computer system with no way to change a password and no way to delete a user - even in case of proven misconduct.

And once again, the extended check (and this key technology plus certificate in the reader is probably only intended for this) is only a proposal (which may not even be implemented due to the lack of interest of the Americans in the whole thing):

Kügler then described the BSI's proposal regarding Extended Access Control. According to this, an asymmetric key pair with a corresponding, verifiable certificate is generated for each reader (authorization only per reader). Therefore, the chip must be able to provide computing power for Extended Access Control. [...] Within the EU, access protection by Extended Access Control is currently only to be seen as a proposal, said Kügler. Another (unnamed) BSI colleague agreed with him and added that the Americans do not demand a fingerprint as a biometric feature on the chip at all, but rather the digital facial image would suffice for them. Only within America is a digital recording of the fingerprint planned. For this reason, the technical implementation of Extended Access Control is not urgent.

Only in this proposal is it provided that the devices receive unique key pairs and certificates based on them. Why is all this so critical? Well, the discussion constantly focuses only on the data and the reading of the data - but the data is not even that critical. Even the stored fingerprints are not complete fingerprints suitable for reconstruction, but only the characteristics relevant for re-identification (although it is still being discussed whether these stored characteristics are really unique - especially in the global context we are talking about here - or whether more data has to be stored than in a purely national approach).

But what is always possible with such passports is the authentication and identification of a person. Two-way authentication alone can already tell me who is near me. If, for example, I have stored the key of a passport for the simplified procedure, I can determine at any time, without contact, whether this passport is nearby - only within the limits of the cryptographic algorithms' security, of course, but that would already be a fairly reliable confirmation, because it would be a pretty serious failure of the whole procedure if two passports with the same key could both authenticate - and the developers have hopefully ruled that out.

I can therefore obtain persons' keys - for the simplified procedure, the machine-readable line of the passport suffices - for example through simple means such as burglary, pickpocketing or social engineering - and store them. I can then feed them to a reader that, for example, checks several passports of interest to me as people pass through a gate in a defined area - a revolving door with a predefined speed is very practical for this. Only the passport with the matching data in its machine-readable zone will release its data, or confirm the authentication.

I could therefore, for example, determine when a person enters and leaves a building - without the knowledge of that person and fully automatically. With an authentication time of 5 seconds, you can already check several keys while someone walks through the revolving door.

Of course, this is still not the identification of the person - only of the passport. But especially when the person being monitored does not know about the monitoring, the passport will be carried on the person. There is no reason not to have the passport with you. And abroad it is often a bad idea not to have your passport with you - so in these cases it is necessarily near the person.

Well, but according to Otto Orwell, all this is just scaremongering and anyway not true and completely wrong. Unfortunately, it is based on statements by employees of the BSI - who are basically his people.

Cleared for takedown

In the Zeit: Open Season, a dossier about the victims of attention-seeking à la Raab and Bild ...

The major problem I see here is not just the Bild newspaper and Raab and similar media garbage - the real problem is the acceptance with which this crap is consumed. After months, you no longer know where you read or heard something - and in doing so, you contribute as a vector to the spread of this nonsense.

When I then imagine the Springer publishing group getting its hands on the Pro7/Sat.1 group - with Bild newspaper and Raab then presumably pulling on the same rope - I feel sick ...

A democratic society lives, among other things, on the diversity of opinion, which must also be reflected in media diversity. But when the media landscape becomes dominated, across media, by a corporation with a clear political agenda (anyone who doubts that can just look at Bild's coverage around the last election to the Hamburg state parliament - best have a sickness bag ready or it'll hit the keyboard), an important factor for democracy is lost.

And so an ugly alliance forms between business associations and a media culture in which one no longer wants to use the word culture - and it degenerates into incitement against the sick, the unemployed, foreigners and left-wing politicians, which already uncomfortably resembles times one actually thought were over ...

Off to the police state

Owl Content

German cabinet approves bill to expand DNA analysis:

... DNA analyses of individuals may in future also be stored if they have committed only minor offenses such as property damage or trespassing, or if it is expected that they will commit such offenses in the future. Furthermore, investigators will be granted the right to order DNA analyses in an expedited procedure without a judge having to approve them.

You participate in a demo that someone doesn't like? No problem, your data will be recorded and filed. Trespassing at a demo can happen quickly, property damage can be quickly attributed to you, and if you don't need to ask a judge, you can also move much faster. And so, a small and fine DNA database of all those unpleasant subjects will quickly be collected that a state really doesn't need - namely people who engage publicly and speak up.

What, civil rights are left behind in the process? Forget it, it doesn't interest Otto Orwell nor the combined incompetence in the Ministry of Justice.

Oh, and who believes that I am only paranoid, here is the case example cited by the Ministry of Justice:

A has been convicted because he repeatedly scratched the paint of motor vehicles with a screwdriver. The prognosis is that corresponding criminal offenses are also to be expected from him in the future.

Yes, you are a wheelchair user, you get upset about idiotically parked cars and have scratched the paint of one? Hey, you are still in a wheelchair, and we simply assume you will keep getting upset about the idiot drivers - so off to the DNA file with you, alongside the murderers, terrorists and sex offenders. After all, you are at least as threatening to society as they are.

What kind of shit is this red/green puppet theater in Berlin getting us into. It is absolutely unbelievable.

angry face

And if you think it would be better with the Union:

... on the other hand, the proposed amendment to DNA analysis by the CDU is by no means sufficient. "The bill is a step in the right direction, but it falls short," said the deputy chairman of the Union faction, Wolfgang Bosbach. The Union will further tighten the existing legal situation in the event of an election victory, explained the interior and legal affairs politician. There is no right for offenders to remain anonymous.

Anyone who spontaneously thinks of recording every striking worker is probably on the right track, as far as their ideas go ...

And all this from people who, under the guise of neo-liberalism, have written a reduction of the state to its core functions on their banner - and see surveillance, exploitation, and harassment of citizens as core functions.

We are moving straight towards something that can no longer be associated with a democratic society and a rule of law.

How FileVault works

As a follow-up to the previous entry about the problems with backing up FileVaults from an active FileVault account, I took a closer look at what Apple actually does for FileVault. I'm not particularly enthusiastic about the approach.

First of all, a FileVault is nothing more than a so-called sparse image - a disk image in which only the actually used blocks are stored. If it is empty, it doesn't matter how large it was dimensioned - it only takes up a little disk space. The image grows with the stored data, and you can have it compacted - data blocks that have become free (e.g. through deletions) are then released again in the sparse image, so the image shrinks. Additionally, encryption is enabled for FileVault images. The compacting happens semi-automatically at logout: the system asks the user for permission, and if the user agrees, the image is cleaned up.

But this is only the mechanism of how the files are stored - namely as an HFS+ volume in a special file. How is it automatically opened at login, and how is it ensured that programs find the data in the places where they look for it? For this, the FileVault image must be mounted. In principle the process is the same as double-clicking an image file: the file is mounted as a drive and appears in the Finder's list of drives and on the desktop. For FileVault images, however, the desktop icon is suppressed, and instead of mounting to /Volumes/ as usual, the mounting is somewhat modified.

A FileVault image normally sits in the user's home directory as a single file. So for a logged-out user hugo, there is a hugo.sparseimage in /Users/hugo/. As soon as the user hugo logs in, a number of things happen. First, the sparse image is moved from /Users/hugo/ to /Users/.hugo/ and renamed from hugo.sparseimage to .hugo.sparseimage. Then it is mounted directly onto /Users/hugo/ (which is now empty) - this is why it must be moved out of the user directory first, as it would otherwise be inaccessible once another file system is mounted over it.

Now the volume is accessible as the user's home directory. Additionally, all programs see the data in the usual place, as it is mounted directly to /Users/hugo and thus, for example, /Users/hugo/Preferences/ is a valid directory in the image. When logging out, the whole thing is reversed: unmounting the image and then moving it back and removing the /Users/.hugo/ directory. Additionally - optionally - compressing the image.
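The path shuffle can be summarized in a few lines - this is not Apple code, just a pure-Python restatement of the sequence described above, using the hugo example from the text:

```python
# Pure-Python restatement of the FileVault path shuffle described in the
# text (not Apple code): where the sparse image and the home directory
# live before and after login for a user like 'hugo'.
import os.path

def filevault_paths(user, logged_in):
    """Return (sparse image path, home directory) for the given state."""
    home = os.path.join('/Users', user)
    if logged_in:
        # image moved aside and hidden so the volume can mount over home
        image = os.path.join('/Users', '.' + user, '.' + user + '.sparseimage')
    else:
        image = os.path.join(home, user + '.sparseimage')
    return image, home

print(filevault_paths('hugo', False)[0])  # /Users/hugo/hugo.sparseimage
print(filevault_paths('hugo', True)[0])   # /Users/.hugo/.hugo.sparseimage
```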

Now it also becomes clear what problem backup programs have: while the backup runs, the home directory is empty and the image has been moved into the dot directory. Booting into a backup created this way would not find the user's home directory and would present an empty home - as if all files had been lost.

This is also one of the major problems of FileVault: if the computer crashes while you are logged in, the directories and files are still moved and renamed. So if you use FileVault and can't access your files after a crash, it may help to log in with another, FileVault-free user (which you should have anyway, for backups!) and repair the home directory by hand. I don't know whether Apple's disk repair program would do that - so far none of my FileVault installations has crashed. But you might want to remember this for an emergency.

Overall the whole thing gives me a rather hacked-together impression - I would prefer a design without renaming and moving. For example, the FileVault image could simply sit peacefully next to /Users/hugo as /Users/.hugo.sparseimage and only be mounted - then backups would have no problems, as the structure would be identical whether the user is logged in or out. I don't know why Apple chose this rather complicated form - probably because of the permissions on the sparse image and the resulting storage location in the user's home directory.

Experts Advocate for VAT Increase

Experts advocate for VAT increase - if you look at these alleged experts, you find IW director Hüther and the chief economist of Deutsche Bank. Completely neutral experts, of course. Why do these allegedly professional journalists write such nonsense? Every idiot from some employers' association, employer-affiliated institute or major bank is called an expert - but if something comes from the employees' camp, they are "critics from the unions". This is how the neoliberal crap is beautifully upheld and the citizen is told where to look for his experts - regardless of whether these experts are anything but experts (I still think with horror of the mathematically completely untalented and otherwise quite incompetent financial expert Mertz) or pursue their own political agenda.

That something must be rotten with the experts in this specific case should be noticeable even to the dumbest journalist: the VAT is supposed to be increased, but of course only with accompanying measures. Look at these measures. One screams for a reduction in wage-related costs as an accompanying measure and the abolition of the solidarity surcharge - but only the latter is relevant for the consumer. Now look at what someone on social assistance or unemployment benefit II pays in solidarity surcharge - nothing. But this person still fully bears the VAT increase.

The other says the risk of reduced consumption must be accepted, as the advantages of lowering labor costs outweigh it - because he also wants to cut various payments. At least for both sides - at least he did not explicitly speak only for the employers, though presumably he simply forgot that there is also an employees' side. And here too: social assistance and unemployment benefit II recipients get no relief and bear the full VAT increase.

None of the so-called experts has spoken about the fact that a VAT increase must be accompanied by an increase in social assistance and unemployment benefit II. Both accept that people who are already impoverished will be even worse off and that more people will fall below the poverty line. They act as if they were experts - but in the end they are only the henchmen of the exploiters and swindlers and want only the same thing that the employers' side has been demanding all along: to squeeze the employees even more.

VAT is the most unsocial tax we have. On the one hand, it is only relevant for consumers, and indeed for domestic consumers. On the other hand, it is based on consumption - and this can of course not fall below a certain level, because everyone has to live and has to pay for it - and thus this tax hits the hardest those who have the least. Because their consumption can hardly be reduced any further.

Contributions will not decrease again

Survey: Health insurance funds' financial situation worsening again - we're all being fooled. By politicians who promise to lower contribution rates and naturally can't. By funds that are supposed to represent our interests but naturally don't. By doctors who promise cooperation in cost reduction but naturally don't want to give up their income (*). By pharmacists who are supposed to serve as a trusted source for patients but have long since lost that trust.

Of course, the contribution reduction for employers - there's always money for that. Only the patients, they have to pay for all of this again. Funds, doctors, and pharmacists, on the other hand, sit on their vested interests and refuse to contribute even minimally to a reduction that would also affect their income.

Funds then do great things like the family doctor model and the in-house pharmacy model - but it doesn't help if the doctors simply refuse to participate (which happens here in Münster quite often). Correct billing of the practice fee is also rarely experienced - if a prescription is simply picked up, without the doctor providing even a bit of service (except for his signature), if the medication has been taken for years - doesn't matter, the practice fee is quickly taken again.

Quality control of doctors? Nothing doing - they refuse, that would give the patient too much influence. So they keep hiding behind the allegedly free choice of doctor - which has long since become a joke, if only because of the exodus of specialists from the associations of statutory health insurance physicians. In some specialties, as a statutory health insurance patient, your only chance of meeting a really qualified doctor is in the hospital - outside you only find quacks ...

At the same time, more and more politicians and functionaries of the various associations are talking about patients taking more responsibility and having to bear more of the costs. Of course, we are supposed to trust the doctors in consultation. We are supposed to trust the pharmacists in choosing the drug manufacturer. We are supposed to trust the funds in billing. How are we supposed to take on more responsibility in such a situation that is based on trust without control? What does taking responsibility mean in this context at all - it's not about responsibility, it's solely about cost shifting. And risk shifting: What, your complaints have worsened because you stopped the treatment too early because of the costs? Your own fault, why do you do such a thing. If patients are asked to take more responsibility, they must also be given the means to do so in the form of possibilities of influence and controls. Otherwise, these are just empty phrases.

Doctors receive perks from the pharmaceutical industry and then obediently prescribe its products - it's so conveniently comfortable, and you profit from it. The funds sit there and busy themselves more with their own bureaucracy and their own security than with keeping an eye on the doctors and making sure this very connection to the pharmaceutical industry does not get out of hand. The pharmacists fight for the preservation of their privileges, oppose any alternative form of drug supply, and point to their consulting services - which de facto often no longer exist when only one or two trained pharmacists work in a pharmacy and the rest are at best glorified drugstore clerks ... (and the main turnover in pharmacies is made with care products, gummy bears, and all kinds of obscure nonsense - hey, why should one trust people who sell homeopathic nonsense and "advise" on it?)

And the pharmaceutical industry? It is the laughing fifth party in the background. Hefty profit margins - and of course job cuts, because the margins have to keep growing. De facto monopolies through absurd patent policy (I recall the nitrogen patent from Linde - which fortunately was overturned) and an increasingly opaque approval bureaucracy. Of course medicines must be tested before approval - but what the current tests are really worth has been seen in several recent cases (Lipobay, Vioxx and other COX-2 inhibitors, to name just two).

What is needed is a much more radical restructuring of the health system, a restructuring designed to enable the patient to actually take responsibility, because he is given the information he needs for this and because he is given advisory facilities that support him in this.

Separation of the billing system and the control function in the funds - the control function is not sufficiently exercised by them anyway, it belongs to independent institutions financed by mandatory contributions from those involved in the health system (doctors, pharmacists, pharmaceutical industry and proportionally health insurance contributions).

The billing procedures should be handled by independent accounting offices for patients and doctors, which should only finance themselves through their billing services - this is already common practice in the economy, where billing services are outsourced to separate companies that are then financed by shares in the cost savings of the parties involved.

More transparency in the pharmaceutical industry - research results must be released if a company wants to obtain approval for medicines. Many research institutions are partly state-financed anyway or are close to universities through their state affiliation. A transparent testing guideline for medicines must be introduced - one that scientists and physicians can understand and in which these people are involved, so that problems can be detected earlier - and cannot be concealed by the company (as was the case with Vioxx).

At the same time, effective cost control for medicines must be introduced - justifications based on research costs are not sufficient here; the figures must be traceable. If you add up the alleged research costs the pharmaceutical industry claims for its various medicines, you eventually arrive at sums suggesting the entire gross domestic product is generated in the industry's research departments alone. Much greater transparency is needed here to effectively prevent price gouging for medicines.

And the pharmacists? Sorry, but they simply have to think about what role they still have. This would include that they take their consulting services seriously again and concentrate on what their task would be: the application advice for medicines and the advice on the use of non-prescription medicines. However, a specialist saleswoman with a drugstore education cannot provide this. Justifying one's own existence with a sales monopoly for medicines is certainly not enough. And reading the package leaflet is not enough either.

(*) Here, of course, doctors in hospitals are excluded - their job is then pretty much the last in the health industry and decent working hours cannot be spoken of for them.

Genetic Engineering - It's Not Just About the Sausage

Bundesrat rejects GMO law - the Union wants us to eat GenFood, and it couldn't care less what the consequences are, or whether, for example, organic farming near GM fields becomes impossible (because farmers cannot meet the strict requirements, since genetically modified plants do spread after all). That most farmers don't value this Genshit at all is equally irrelevant. That in the end only the big corporations win - and they are the ones interested in the whole genetic technology, because it lets them put farmers in a stranglehold and squeeze them even more - is probably not irrelevant. The donation millions have to come from somewhere, after all ...

Genetically modified crops serve to tie seeds to fertilizers or pesticides (a forced bundling!) and to protect the seed's use by patent. This directly attacks the classic, traditional way farmers work - for example, keeping part of the harvest for the next sowing is usually either impossible (because the seed is sterile) or prohibited (by contract). There is no biological reason for it in Germany - we have neither extreme climatic conditions nor particularly catastrophic pest pressure to endure. It is solely about maximizing the profits of the companies that produce the genetically modified seeds.

If you then look at who is behind it, something else becomes apparent: another point is the elimination of the classic production sites for seeds - many of the genetic engineering companies are more associated with the pharmaceutical or chemical industry than with classical agriculture (although there are also black sheep among the seed producers - but these also belong more to the industry). Here, industry is simply moving into an area it could not serve before and wants to break into - ultimately with coercive means.

With genetically modified seeds, not only are foods produced whose consumption is rejected by the majority of consumers - an entire economic sector is also being strangled or possibly even destroyed. At least severely damaged.

Agriculture, through its structures with cooperatives, associations, interest groups and political lobbying, has a fairly large power and influence on its fate - so far. But now the bad guys want to play along, whose goal is exactly the takeover of this - previously self-managed - power.

Of course, the Union - which has repeatedly revealed itself to be industry-dependent - hitches itself to the cart. And of course, our industry chancellor performs this balancing act and Minister Künast has to present a law that is already watered down to the extreme - and even that is rejected in the council (which has a Union majority).

PostgreSQL 8.0.2 released with patent fix

Just found: PostgreSQL 8.0.2 released with patent fix. PostgreSQL has received a new minor version in which a patented caching algorithm (ARC) was replaced with a non-patented one (2Q). The interesting part: this is one of the patents that IBM has released for open source. So why switch at all? Because IBM released these patents for open source use only, not for commercial use - PostgreSQL, however, is under the BSD license, which explicitly allows completely free commercial use.
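For readers curious what such an algorithm swap means concretely: the core idea of 2Q is to keep a small FIFO of pages seen only once and to promote a page into the main LRU queue only on its second access, which filters out one-off sequential scans. A heavily simplified, hypothetical sketch of that idea (real 2Q additionally keeps a "ghost" queue of recently evicted keys, and PostgreSQL's implementation of course lives in C):

```python
from collections import OrderedDict

class SimplifiedTwoQ:
    """Heavily simplified 2Q-style cache: a FIFO probation queue for
    first-time keys and an LRU main queue for re-accessed keys."""

    def __init__(self, fifo_size, lru_size):
        self.fifo = OrderedDict()  # A1in: keys seen once, in FIFO order
        self.lru = OrderedDict()   # Am: keys accessed at least twice, in LRU order
        self.fifo_size = fifo_size
        self.lru_size = lru_size

    def access(self, key, value):
        if key in self.lru:
            # Hit in the main queue: refresh its LRU position.
            self.lru.move_to_end(key)
        elif key in self.fifo:
            # Second access: promote from probation to the main queue.
            del self.fifo[key]
            self.lru[key] = value
            if len(self.lru) > self.lru_size:
                self.lru.popitem(last=False)  # evict least recently used
        else:
            # First access: park in the FIFO probation queue.
            self.fifo[key] = value
            if len(self.fifo) > self.fifo_size:
                self.fifo.popitem(last=False)  # evict oldest first-timer

    def cached(self, key):
        return key in self.fifo or key in self.lru
```

A long sequential scan then only ever churns the small FIFO and leaves the hot pages in the LRU part untouched - the same scan resistance ARC provides, just without the patent.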

For PostgreSQL itself, this would not have been a problem: as long as it remains BSD, the use of the IBM patent would not have caused any problems. Only a later license change - such as when someone chooses BSD software as the basis for a commercial product - would have been excluded.

A nice example of how even liberally handled software patents cause problems. Because medium-sized companies that build commercial products on open source would have lost a previously available basis - solely due to the patented caching algorithm (efficient storage of and efficient access to data - so patentable according to Clements' idea).

In the case of PostgreSQL, it went smoothly: the patented algorithm is not faster or better than its non-patented counterpart. And for the software itself, nothing really world-shattering has changed. But this does not have to (and will not) always go so smoothly. In the field of audio processing and video processing, the patented minefields are much more extensive and therefore much more critical for free projects.

Okay, one might still argue that this would not have happened with a GPL license. But with a GPL license, certain forms of use as they already exist in PostgreSQL today (e.g., companies building special databases on PostgreSQL without making these special databases open source) are not possible. You can take a stand on this as you like - ideology aside - the PostgreSQL project has chosen the BSD license as its basis.

Even well-intentioned patent handling in the context of open source software would therefore be problematic. Exactly this is the reason why I am generally against software patents.

Police Fear Anonymity and Cryptography on the Internet

The police fear anonymity and cryptography on the internet - and therefore rail, for example, against state-funded anonymization services. This is simply the usual dual-use conflict of technology: it can be applied in two ways. Nobody talks about the legitimate reasons for using anonymization services and encryption systems; only criminal use is the topic. Should we ban hammers and sickles, then? After all, you can kill people with both.

What is worrying about this development is that the use of cryptography will probably be restricted - or as it is called in modern German: regulated - in the short or long term. And at some point, the situation will arise where encrypted emails are already considered suspicious. Suspicion is no longer needed to spy on someone. And what is more obvious than to assume illegality of someone who encrypts their emails?

Every society must deal with abuse of the system and abuse of society - and with those who completely fall out of societal norms. This is annoying and in many cases even tragic - but cannot be changed. However, the problem is not solved by putting the entire society under general suspicion. Ultimately, what remains is a society that is no longer worth living in and preserving because everything is based on surveillance and denunciation. Restricting the rights of ordinary citizens does not result in a single fewer criminal - rather more, because more and more citizens will resist the regulations (and according to the definition of people like Otto Orwell, are then simply criminals).

What is completely ignored here, in my opinion, is that crime never consists only of the perhaps technically hard-to-access encrypted channel - there are always effects outside it as well. Child pornography is not only traded on the internet - at some point it is also produced. Organized crime does not just exchange PGP keys on the internet - it organizes human smuggling, illegal gambling, drug trafficking, and who knows what else. Every crime therefore has facets that take place quite openly and recognizably in society. To this day, investigations are primarily carried out in this area - eavesdropping has not yet produced reproducibly better results than normal investigative work already achieves. On the contrary: eavesdropping, dragnet searches, and similar approaches have all failed, especially when you consider the immense personnel deployment (and thus cost) of these operations. And no, the DNA sample was not decisive even in the Moshammer murder case.

Regulating network technologies will not prevent their use for criminal purposes - it will only make legal use more difficult or stigmatize it. Someone who smuggles people certainly has far fewer scruples about violating cryptography laws than someone who only uses cryptography because they don't like the idea of the state reading everything.

Install grsecurity

I played around with grsecurity before, but the installation was a bit tricky - above all, you didn't know what to configure as a starting point or how to begin a reasonable rule-based security setup. The whole thing was more trial and error than a comprehensible installation. For a security solution for an operating system, however, it is rather bad if you never get the feeling of understanding what is happening.

With the current versions of grsecurity, this has largely changed. On the one hand, the patches now apply completely cleanly to the kernel; on the other, there are two essential features that ease the start: a Quick Guide and RBAC Full System Learning.

The Quick Guide provides a short and concise installation guide for grsecurity, with a starting configuration of options that already offers a fairly good basis while leaving out problematic options (those that could lock out some system services). This way you get a grsecurity installation that offers a lot of protection but usually does not conflict with common system services. This is especially important for people with root servers - a wrong basic configuration could lock them out of their own system, making it unusable and a support case.

But the Full System Learning is really nice: here the RBAC engine is transformed into a logging system and it is logged which users execute what and what rights are needed for this. The whole thing is still controlled by corresponding basic configs that classify different system areas differently (e.g. ensure that the user can access everything in his home, but not necessarily everything in various system directories). You just let the system run for a few days (to also catch cron jobs) and then generate a starting configuration for RBAC from it. You can of course still fine-tune this (you should also do this later - but as a start it is already quite usable).

RBAC is basically a second security/rights layer above the classic user/group mechanisms of Linux. The root user does not automatically have all rights and access to all areas. Instead, a user must log in to the RBAC subsystem in parallel to his normal login (which happens implicitly through the system start for system services!). Rules are stored there that describe how different roles in the system have different access permissions.
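The principle - an extra permission layer that is consulted even for root - can be illustrated with a tiny, entirely hypothetical Python model (grsecurity itself is of course a kernel patch with its own policy language; the role names and paths below are made up for illustration):

```python
# Hypothetical model of RBAC as a second check layer on top of the
# classic Unix permission check: even uid 0 is limited to its role's rules.
ROLE_RULES = {
    # role name -> path prefixes the role may write to
    "webserver": {"/var/www", "/var/log/httpd"},
    "admin": {"/"},  # the high role one must explicitly authenticate into
}

def classic_unix_allows(uid: int, path: str) -> bool:
    # Placeholder for the normal owner/group/mode check; root passes everything.
    return uid == 0

def rbac_allows(role: str, path: str) -> bool:
    # The role's rules are checked regardless of the uid.
    return any(path.startswith(prefix) for prefix in ROLE_RULES.get(role, set()))

def may_write(uid: int, role: str, path: str) -> bool:
    # Both layers must agree - that is the whole point of RBAC.
    return classic_unix_allows(uid, path) and rbac_allows(role, path)

# A web service running as root, but confined to the "webserver" role:
print(may_write(0, "webserver", "/var/www/index.html"))  # True
print(may_write(0, "webserver", "/etc/shadow"))          # False: the role forbids it
```

A compromised service in the "webserver" role thus cannot touch /etc/shadow even with root privileges - only after authenticating into the "admin" role (which in grsecurity normally requires a manual password entry) would that path open up.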

The advantage: even automatically started system services are only allowed to access what is provided for in the RBAC configuration - even if they run under root rights. They only have limited capabilities in the system until they log in to the RBAC subsystem - but for this, a manual password entry is usually required for the higher roles. Attackers from the outside can indeed gain the user rights restricted by RBAC, but usually cannot get to the higher roles and therefore cannot interfere with the system as much as would be possible without RBAC.

The disadvantage (should not be concealed): RBAC is complex. And complicated. If you do something wrong, the system is locked - quite annoying for root servers that are somewhere out there in the network. You should always have fallback strategies so that you can still reach a blocked system. For example, after changes to the RBACs, comment out the automatic activation at system startup so that a reboot puts the system in a more open state in case of problems. Or have an emergency access through which you can still administer a blocked system to some extent. In general, as with all complex systems: Keep your hands off if you don't know what you're doing.

In addition to the very powerful RBAC, grsecurity offers a whole range of other mechanisms. The second major block is PaX (important: use a current version here, as all older ones contain a nasty security hole) - a subsystem that curbs buffer overflow attacks by removing executability and/or writability from memory blocks. This is especially important for the stack, where most buffer overflows start. PaX ensures that writable areas are never executable at the same time.

A third larger block is the better protection of chroot jails. The classic possibilities for processes to break out of a chroot jail are no longer given, since many functions necessary for this are simply deactivated in a chroot jail. Especially for admins who run their services in chroot jails, grsecurity offers important tools, as these chroot jails were only very cumbersome to make really escape-proof.

The rest of grsecurity deals with a whole collection of smaller patches and changes in the system, many of which deal with better randomization of ports/sockets/pids and other system IDs. This makes attacks more difficult because the behavior of the system is less predictable - especially important for various local exploits, where, for example, the knowledge of the PID of a process is used to gain access to areas that are identified via the PID (memory areas, temporary files, etc.). The visibility of system processes is also restricted - normal users simply do not get access to the entire process list and are also restricted in the /proc file system - and can therefore not so easily attack running system processes.

A complete list of grsecurity features is online.

All in all, grsecurity offers a very sensible collection of security patches that should be recommended to every server operator - the possibility of remote exploits is drastically restricted and local system security is significantly enhanced by RBAC. There is no reason not to use the patch, for example, on root servers as a standard, given the rather simple implementation of the grsecurity patch in an existing system (simply patch the kernel and reinstall, boot, learn, activate - done). Actually, a security patch should be part of the system setup just like a backup strategy.

Now it would of course be even nicer if the actual documentation of the system was a bit larger than the man pages and a few whitepapers - and above all was up to date. This is still a real drawback, because the right feeling of understanding the system does not really set in without qualified documentation ...

mod_fastcgi and mod_rewrite

Well, I actually tried using PHP as FastCGI - among other things because it would let me use a newer PHP version. And what happened? Nothing worked. There was a massive problem with mod_rewrite rules: in the WordPress .htaccess, everything is rewritten to index.php, and the actual path that was requested is appended to index.php as PATH_INFO. PHP then picks this information up again and does the right thing.

But with FastCGI activated, that didn't work - PHP kept claiming that no input file was passed, as if I had called PHP without parameters. The WordPress administration - which works with plain PHP files - ran wonderfully. The permission side also worked well; everything ran under my own user.

Only the rewrite rules didn't work - and thus neither did the whole site. Pretty annoying, especially since I can't test it properly without taking down my main site. It's also annoying that suexec apparently looks for the actual FCGI starters in the document root of the primary virtual server - not in those of the individual virtual servers. This makes the setup a bit murky, as the programs (the starters are small shell scripts) do not live where the site's files are. Unless you create your virtual servers below the primary virtual server - but I personally consider that highly unwise, since Perl modules loaded in a virtual server could then be bypassed by direct path access via the default server.

Ergo: a failure. Unfortunately. Annoying. Now I have to somehow put together a test box with which I can analyze this problem ...

Update: a bit of searching and digging on the net and a short test later, I'm wiser: PATH_INFO with PHP as a FastCGI version under Apache is broken. Apparently PHP gets the wrong PATH_INFO entry and the wrong SCRIPT_NAME. As a result, the interpreter simply does not find its script when PATH_INFO is set, and nothing works anymore. Now I have to keep searching for a solution. cgi.fix_pathinfo = 1 (which is generally suggested as a remedy) does not help in any case. As far as I can see, there is no usable solution - at least none obvious to me. Damn.
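For reference, this is what the split should look like according to the CGI convention: for a rewritten request like /index.php/2005/03/some-post, SCRIPT_NAME should be /index.php and PATH_INFO the /2005/03/some-post remainder. A small sketch of that split (the URI is a made-up example; this is what PHP's FastCGI SAPI was getting wrong here):

```python
def split_script_and_path_info(request_uri: str, script: str = "/index.php"):
    """Split a rewritten URI into (SCRIPT_NAME, PATH_INFO) the way the
    CGI specification intends."""
    if request_uri == script:
        # Script addressed directly, no extra path.
        return script, ""
    if request_uri.startswith(script + "/"):
        # Everything after the script name becomes PATH_INFO.
        return script, request_uri[len(script):]
    # URI does not address the script at all; no PATH_INFO applies.
    return request_uri, ""

print(split_script_and_path_info("/index.php/2005/03/some-post"))
# → ('/index.php', '/2005/03/some-post')
```

When the server/SAPI combination gets this wrong, the interpreter looks for a script file literally named like the full URI, finds nothing, and reports exactly the "no input file" error described above.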

Update 2: I found a solution. It consists of simply not using Apache for these sites, but lighttpd - with Apache in front as a transparent proxy. This works quite well, especially since I can strip Apache down considerably and throw PHP out of it, which makes it much slimmer. And lighttpd can run under different user accounts, so I also save myself the wild hacking with suexec. One lighttpd process then runs per user (lighttpd needs only one process per server, as it works with asynchronous communication), and the PHPs run as FastCGI processes rather than as Apache-integrated modules. Apache itself is then only responsible for purely static sites or sites with Perl modules - I still have quite a few of those. At the moment only a game site runs this way, but maybe this site will be switched over in the next few days.

The method for producing cruft-free URIs is quite funny: in WordPress you can simply register index.php as the error document. ErrorDocument 404 /index.php?error=404 would be the entry in the .htaccess; lighttpd has an equivalent setting. This automatically redirects non-existent files (and cruft-free URIs do not exist as physical files) to WordPress. There it is checked whether there really is no data for the URI, and if something is there (because it is a WordPress URI), the status is simply reset. For the latter I had to install a small patch in WordPress. This saves you all the RewriteRules and works with almost any server. And because it's now 1:41, I'm going to bed ...
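The Apache-in-front construction can be sketched with stock mod_proxy directives; the hostname and port here are made-up examples, and the ProxyPassReverse line is needed so that redirects issued by the backend point back at Apache:

```apache
# Hypothetical vhost: Apache forwards everything for this site to a
# local lighttpd instance, which runs PHP via FastCGI under its own user.
<VirtualHost *:80>
    ServerName blog.example.org
    ProxyPass        / http://127.0.0.1:8080/
    ProxyPassReverse / http://127.0.0.1:8080/
</VirtualHost>
```

With one such block per site (each pointing at its own lighttpd port), Apache degrades to a thin dispatcher while all the PHP work happens under unprivileged per-user processes.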

Back to Camino from Firefox ...

... and back. Odyssey of the web browsers.

After working with Firefox for a few days, I switched back to Camino. Why? Well, under OS X, Firefox is suboptimal. For one, I have the impression that fonts are generally displayed smaller than in Camino or other real Mac programs. It might be an illusion. However, it is not an illusion that Firefox under OS X does not support Services. And that is annoying - what's the point if a bunch of programs hook into the Services menu and provide useful services that build on highlighted text in other programs, if the main application in which I spend my time on the computer does not support it at all?

Just as annoying was the fact that Tab-X is not supported under OS X. This extension attaches a close icon to every tab. I don't know what the UI designer of Firefox was thinking, but I consider neither the mandatory activation of a tab and then clicking on a tiny X at the right edge of the toolbar to be ergonomic, nor closing a tab via the context menu. Okay, you can get used to that if necessary.

Furthermore, I was constantly bothered by the fact that Firefox has its own password manager and does not use the KeyChain. I find it simply practical that all kinds of programs can register at a central location and that I can delete my passwords there if I need to. In addition, this helps to avoid constantly having to re-enter passwords just because you visit a page with a different browser.

Unfortunately, I lose all the nice things that are available via Firefox extensions - for example, the Web Developer Toolbar. Only that it doesn't work on my Mac anyway, who knows why - so I've only ever had it under Linux, and there I continue to use Firefox. I will miss the plugin for the Google PageRank status and the plugin for mozcc, however - both were quite practical. It's somehow stupid that I can't have both - a Firefox with proper integration into OS X, that would be it ...

Due to the pretty broken 0.8.2 of Camino, I downloaded and installed the 0.8.1 again. At least it has functioning tabs and doesn't crash all the time. I have no idea what they did with the 0.8.2, but it was definitely not to the benefit of Camino.

And of course, right after I wrote this, Camino started acting up. I can't believe it. The 0.8.1 had worked flawlessly before. Nevertheless, there were the same problems as with the 0.8.2 - probably triggered by some sites with which I work more frequently now than before? I have no idea - I haven't installed any special tools under OS X, on the contrary, I have uninstalled one.

So, trying other browsers again. Safari 1.0 under OS X 10.2.8 is clearly behind in features - but it would still remain as an alternative, but it crashes on some pages. OmniWeb is basically a souped-up Safari, but it crashes even more frequently. And Opera doesn't get along with the CSS of the WordPress admin at all - it's wildly mixed up. In addition, it always asks multiple times for passwords and Keychain access when I access some protected pages. And it has had this quirk for months - not very confidence-inspiring.

The IE for Mac is not even a desperation option. Netscape? No, sorry, but that's not necessary. Mozilla also not - then rather Firefox, because Mozilla not only does not integrate well into the system, it also looks completely different from OS X applications ...

The only really usable alternative browser under OS X 10.2 is - despite its problems - OmniWeb. As a last resort, Safari, though OmniWeb's rendering is further along on some pages. However, it still does not support things like clicking the label of a checkbox to toggle it - the WordPress admin uses this, and it spares you some silly target practice, except in OmniWeb and Safari, where it doesn't work. Okay, the fact that the QuickTag bar is missing in OmniWeb and Safari is intentional on WordPress's part - the JavaScript is not quite compatible.

So, back to the whole thing and use Firefox again and complain about the missing services (which, by the way, can also work in Carbon applications - if the programmer has considered this in his program)? Or just play with OmniWeb and see if you can get around the problems?

And what do we learn from this? All browsers suck. Even the good ones.

New Game, New Luck: b2evolution

Today I took a look at b2evolution (as usual, just a brief superficial test flight). It's related to WordPress and that alone is interesting - let's see what others have done with the same base code. So I got the software, grabbed the Kubrick skin (hey, I'm liking Kubrick these days), and got started.

What immediately stands out: b2evolution places much more emphasis on multi-everything. Multi-blog (it comes pre-installed with 4 blogs, one of which is an "all blogs" blog and one is a link blog), multi-user (with permissions for blogs etc. - so suitable as a blogging platform for smaller user groups) and multi-language (nice: you can set the language for each post, set languages per blog). That's already appealing. The backend is reasonably easy to use and you can find most things pretty quickly.

But then the documentation. Okay, the important stuff is documented and findable. But as soon as you go deeper, almost nothing is self-explanatory or documented. Admittedly, I shouldn't have immediately set out to make the URIs as complicated as possible - namely via so-called stub files. These are alternative PHP entry files through which everything is routed and which preset special settings. Apparently you can use them to get a URI structure like WordPress's - the b2evolution default is that index.php always appears in the URI, with the additional elements tacked on at the end. That's ugly. I don't want that. Changing it apparently only works with hand-crafted Apache configuration (no, nothing like WordPress's friendly support with its auto-generated .htaccess file) plus corresponding settings in b2evolution. Fine, I can do that - I know Apache well enough. But why so complicated when there's an easier way?

Well, but the real catch for me comes next: b2evolution can only do blogs. At least in the standard configuration. Exactly - just lists of posts ordered chronologically. Boring. Not even simple static pages - sorry, but where do I put the imprint? Manually created files that you put alongside it? Possible, sure. But not exactly user-friendly.

There are also some anti-spam measures, for example a centrally maintained banned-words list (well, I personally don't think word lists are that suitable) and user registration. Not much, but sufficient for now. You can certainly do more via plugins. Speaking of plugins, there's a very nice feature to mention: the set of active filters can be chosen anew for each post. Very nice - WordPress has a real deficit there: the activated filters apply to everything across the board, so one change and old posts suddenly get formatted wrong (if it's an output filter).

Also nice: the hierarchical categories really behave hierarchically - in WordPress they're only hierarchically grouped, but e.g. not much is done with the hierarchy. In b2evolution, posts from a category automatically move to the parent category when a category is deleted. Also, thanks to the multi-blog feature, you can activate categories from different blogs for a single post and thus cross-post - if it's allowed in the settings.

Layout adjustments work via templates and skins. Templates are comparable to the WordPress 1.2 mode and skins are more like the WordPress 1.5 mode. So with templates everything is pulled through a PHP file and with skins multiple templates are combined and then the blog is built from that. Special customizations can then be done via your own stub files (the same ones that are supposed to be used for prettier URIs) and via those you could, for example, build fixed layouts with which you could simulate static pages.

All in all, the result of the short flight: nice system (despite the somewhat baroque corners in URI creation and quite sparse documentation) for hackers and people who like to dig into the code. For just getting started directly, I find it less suitable - WordPress is much easier to understand and get going with. And to compete with Drupal, b2evolution is too thin on features - just too focused on blogs. You can certainly bend it in the right direction - but why would you want to do that when you could just use something off-the-shelf that can already do all that?

Hmm. Sounds relatively similar to what I wrote about b2evolution almost a year ago. There hasn't been much development there in the meantime.

And log files again

Since I had an interesting study object, I wanted to see how much I could uncover in my logfiles with a bit of cluster analysis. So I created a matrix from referrers and accessing IP addresses and got an overview of typical user scenarios - how do normal users look in the log, how do referrer spammers look, and how does our friend look.

All three variants can be distinguished well, even though I'd currently rather shy away from capturing it algorithmically - all of it can be simulated quite well. Still, a few peculiarities are noticeable. First, a completely normal user:
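Building that referrer-by-IP matrix takes only a few lines of Python (a minimal sketch; it assumes the Apache "combined" log format, and the field positions in the regex are an assumption):

```python
import re
from collections import Counter, defaultdict

# Matches the fields of an Apache "combined" log line that we need:
# client IP, timestamp, request, status, size, referrer.
LINE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "[^"]*" \d+ \S+ "([^"]*)"')

def referrers_by_ip(lines):
    """Build the referrer/IP matrix: per client IP, count each referrer."""
    matrix = defaultdict(Counter)
    for line in lines:
        m = LINE.match(line)
        if m:
            ip, _timestamp, referrer = m.groups()
            matrix[ip][referrer or "-"] += 1
    return matrix

sample = [
    '1.2.3.4 - - [05/Feb/2005:03:01:45 +0100] "GET / HTTP/1.1" 200 512 "-" "Mozilla"',
    '1.2.3.4 - - [05/Feb/2005:03:02:01 +0100] "GET /a HTTP/1.1" 200 512 "http://www.heise.de/newsticker/meldung/55992" "Mozilla"',
]
matrix = referrers_by_ip(sample)
# matrix["1.2.3.4"] now counts one "-" and one heise referrer
```

Sorting each Counter by count gives exactly the kind of per-IP listing shown below.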


aa.bb.cc.dd: 7 accesses, 2005-02-05 03:01:45.00 - 2005-02-04 16:18:09.00
 0065*-
 0001*http://www.tagesschau.de/aktuell/meldungen/0,1185,OID4031994 ...
 0001*http://www.tagesschau.de/aktuell/meldungen/0,1185,OID4031612 ...
 0001*http://mudbomb.com/archives/2005/02/02/wysiwyg-plugin-for-wo ...
 0001*http://www.heise.de/newsticker/meldung/55992
 0001*http://log.netbib.de/archives/2005/02/04/nzz-online-archiv-n ...
 0001*http://www.heise.de/newsticker/meldung/56000
 0001*http://a.wholelottanothing.org/2005/02/no_one_can_have.html

You can nicely see how this user clicked away from my weblog and came back - the referrers are by no means all links to me, but incorrect referrers that browsers send when switching from one site to another. Referrers are actually supposed to be sent only when a link is really followed - hardly any browser gets that right. The visit took place on a single day, and the user got in directly by typing the domain name (the "-" referrers come first, and the earliest referrer appears at the top).

Or here's an access from me:


aa.bb.cc.dd: 6 accesses, 2005-02-04 01:11:56.00 - 2005-02-03 08:27:09.00
 0045*-
 0001*http://www.aylwardfamily.com/content/tbping.asp
 0001*http://temboz.rfc1437.de/view
 0001*http://web.morons.org/article.jsp?sectionid=1&id=5947
 0001*http://www.tagesschau.de/aktuell/meldungen/0,1185,OID4029220 ...
 0001*http://sport.ard.de/sp/fussball/news200502/03/bvb_verpfaende ...
 0001*http://www.cadenhead.org/workbench/entry/2005/02/03.html

I recognize myself by the referrer with temboz.rfc1437.de - that's my online aggregator. Looks similar - a lot of incorrectly sent referrers. Another user:


aa.bb.cc.dd: 19 accesses, 2005-02-12 14:45:35.00 - 2005-01-31 14:17:07.00
 0015*http://www.muensterland.org/system/weblogUpdates.py
 0002*-
 0001*http://www.google.com/search?q=cocoa+openmcl&ie=UTF-8&oe=UTF ...
 0001*http://blog.schockwellenreiter.de/8136
 0001*http://www.google.com/search?q=%22Rainer+Joswig%22&ie=UTF-8& ...
 0001*http://www.google.com/search?q=IDEKit&hl=de&lr=&c2coff=1&sta ...

This one came more often (across multiple days) via my update page on muensterland.org and also searched for Lisp topics. And they came over from the Schockwellenreiter once. Absolutely typical behavior.

Now in comparison, a typical referrer spammer:


aa.bb.cc.dd: 6 accesses, 2005-02-12 17:27:27.00 - 2005-02-02 09:25:22.00
 0002*http://tramadol.freakycheats.com/
 0001*http://diet-pills.ronnieazza.com/
 0001*http://phentermine.psxtreme.com/
 0001*http://free-online-poker.yelucie.com/
 0001*http://poker-games.psxtreme.com/

All referrers are direct domain referrers. No "-" referrers - so no accesses without a referrer. No other accesses - if I analyzed it more precisely by page type, it would be noticeable that no images, etc. are accessed. Easy to recognize - just looks sparse. Typical is also that each URL is listed only once or twice.

Now our new friend:


aa.bb.cc.dd: 100 accesses, 2005-02-13 15:06:16.00 - 2005-02-11 07:07:55.00
 0039*-
 0030*http://irish.typepad.com
 0015*http://www208.pair.com
 0015*http://blogs.salon.com
 0015*http://hfilesreviewer.f2o.org
 0015*http://betas.intercom.net
 0005*http://vowe.net
 0005*http://spleenville.com

What stands out are the referrers without a trailing slash - atypical for referrer spam. Also, just normal sites. Also noticeable is that pages are accessed without a referrer - hidden behind these are the RSS feeds. This one is also easily distinguishable from users. Especially since there's a certain rhythm to it - apparently always 15 accesses with one referrer, then switch the referrer. Either the referrer list is quite small, or I was lucky that it tried the same one with me twice - one of them is there 30 times.

Normal bots don't need much comparison - few of them send referrers and are therefore completely uninteresting. I had one that caught my attention:


aa.bb.cc.dd: 5 accesses, 2005-02-13 15:21:26.00 - 2005-01-31 01:01:07.00
 2612*-
 0003*http://www.everyfeed.com/admin/new_site_validation.php?site= ...
 0002*http://www.everyfeed.com/admin/new_site_validation.php?site= ...

A new search engine for feeds that I didn't know yet. Apparently the admin had just entered my address somewhere beforehand and then the bot started collecting pages. After that, he activated my newly found feeds in the admin interface. Seems to be a small system - the bot runs from the same IP as the admin interface. Most other bots come from entire bot farms, web spidering is an expensive affair after all ...

In summary, it can be concluded that the current generation of referrer spammer bots and other bad bots are still quite primitive in structure. They don't use botnets to use many different addresses and hide that way, they use pure server URLs instead of page URLs and have other quite typical characteristics such as certain rhythms. They also almost always come multiple times.

Unfortunately, these are not good features to capture algorithmically - unless you load your referrers into an SQL database and check each one with appropriate queries against the typical criteria. That way you could definitely catch the usual suspects and block them right on the server. Because normal user accesses look quite different.
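A rough version of such a check, directly in Python rather than SQL (a minimal sketch; the criteria follow the observations above, but the exact thresholds are my guesses):

```python
from urllib.parse import urlparse

def looks_like_referrer_spam(referrer_counts):
    """Heuristics from the log analysis above (thresholds are assumptions):
    - no "-" (empty) referrers at all,
    - every referrer is a bare domain (path empty or "/"),
    - each referrer shows up only once or twice."""
    if referrer_counts.get("-", 0) > 0:
        return False
    urls = [r for r in referrer_counts if r != "-"]
    if not urls:
        return False
    bare = all(urlparse(u).path in ("", "/") for u in urls)
    sparse = all(referrer_counts[u] <= 2 for u in urls)
    return bare and sparse

# The two patterns from the samples above:
spammer = {"http://tramadol.freakycheats.com/": 2,
           "http://diet-pills.ronnieazza.com/": 1}
user = {"-": 65, "http://www.heise.de/newsticker/meldung/55992": 1}
```

Running the check classifies the spammer as spam and the normal user as clean - though, as said, all of this can be simulated by a determined spammer.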

However, new generations are already in the works - as my little friend shows, the one with the missing slash. And thanks to the stupid browsers with their incorrectly generated referrers (which say much more about the browser's history than about actual link following), you can't simply counter-check the referenced pages, since many referrers are pure blind referrers.

Away with Trackback

Isotopp is pondering trackback spam on the occasion of spam day and presents several approaches. One of them uses a counter-check of the trackback URL against the IP of the submitting computer - if the computer has a different IP than the server advertised in the trackback, it would probably be spam. I've written down my own comments on this - and explained why I'd rather be rid of trackback today than tomorrow. Completely. And yes, that's a complete 180-degree turn on my part regarding trackback.

The IP test approach once again comes from the perspective of purely server-based blogs. But there's unfortunately a large heap of trackback-capable software installations that don't need to run (and often don't run) on the server where the blog pages live - all tools that produce static output, for example. Radio Userland blogs are a large installed base of these; PyDS blogs a smaller one. Or Blosxom variants in offline mode (provided trackback-capable versions exist by now - but since these are typical hacker tools, they certainly do).

Then there are the various tools that aren't trackback-capable, where users then use an external trackback agent to submit trackbacks.

And last but not least, there are the various Blogger/MetaWeblogAPI clients that submit the trackback themselves because, for example, only Movable Type's MetaWeblogAPI implementation allows triggering trackbacks, while the other APIs don't.

Because of this, the IP approach is either only to be seen as a filter that lets through some of the trackbacks, or it's a prevention of trackbacks from the users mentioned above. And the latter would be extremely unpleasant.

Actually, the problem is quite simple: Trackback is a sick protocol that was hastily stitched together, without the developer giving the whole thing a moment's thought. It therefore belongs, in my opinion, on the garbage heap of API history. The fact that I support it here is simply because WordPress implements it by default. Once the manual moderation effort gets too high, trackback will be removed here completely.

Sorry, but on the trackback point the Movable Type makers really displayed Microsoft-like behavior: pushing through a completely inadequate pseudo-standard via market dominance - without a single thought for the security implications. Why do you think RFCs always require a section on security considerations? Unfortunately, all the blog developers faithfully followed along (yes, me too - in Python Desktop Server) and now we're stuck with this silly protocol. And its - completely predictable - problems.

Better to develop and push a better alternative now - for example PingBack. PingBack specifies that the page wanting to ping another page must really contain that link, exactly as claimed - the API always transmits two URLs, the source URL and the target URL. The source page must actually link to the target URL; only then will the target server accept the PingBack.

For spammers this is pretty absurd to handle - they would have to rebuild the page before every spam or ensure through appropriate server mechanisms that the spammed weblogs then present a page during testing that contains this link. Of course that's quite doable - but the effort is significantly higher and due to the necessary server technology, this is no longer feasible with foreign open proxies and/or dial-up access.
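The verification step a receiving server has to do can be sketched like this (a minimal sketch; a real implementation would parse the HTML for `<a href>` links instead of doing a substring check):

```python
from urllib.request import urlopen

def verify_pingback(source_url, target_url, fetch=None):
    """Accept a PingBack only if the source page really contains a link
    to the target URL, as the protocol requires."""
    # `fetch` is injectable for testing; the default does a plain HTTP GET.
    fetch = fetch or (lambda url: urlopen(url).read().decode("utf-8", "replace"))
    try:
        page = fetch(source_url)
    except Exception:
        return False
    # A real server would parse <a href="..."> properly; the substring
    # check is enough for the sketch.
    return target_url in page

page = '<p>see <a href="http://example.org/post">this post</a></p>'
print(verify_pingback("http://example.com/", "http://example.org/post",
                      fetch=lambda url: page))  # -> True
```

This is exactly the step that forces spammers to actually publish the link before pinging - the effort the text above describes.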

Because of this, the right approach is simply to switch the linking protocol. Away with Trackback. You can't plug the trackback hole.

PS: anyone who looks at my trackback in Isotopp's post will immediately see the second problem with trackback: apart from the huge security problem, the character-set handling of trackbacks is simply a complete disaster. The original author of the pseudo-standard didn't spend a minute thinking about the possible problems here either. And then some people wonder why TypeKey from the Movable Type people isn't better accepted - sorry, but people who produce such lousy standards won't be getting my login management either ...

Zope Hosting and Performance - English Version

Somebody asked for an English translation of my article on Zope hosting and performance. Here it is - ok, it's not so much a direct translation as a rewrite of the story in English. Enjoy.

Recently the Schockwellenreiter had problems with his blog server. He is using Zope with Plone and CoreBlog. Since I have been doing professional Zope hosting for some years now, running systems in the 2000-3000 hits per minute range, I thought I'd put together some of the things I have learned (sometimes the hard way) about Zope and performance.

  • The most important step I would take: slim down your application. Throw out everything you might have in the Zope database that doesn't need to stay there. If it doesn't need content management, store it in folders that are served by Apache. Use mod_rewrite to seamlessly integrate it into your site so that people from the outside won't notice a difference. This works best for layout images, stylesheets etc. - Apache is much faster at delivering those.
  • Use Zope caching wherever possible. The main parameter you need to check: whether you have enough RAM. Zope will grow when using caching (especially with the RAMCacheManager). The automatic cleanup won't rescue you - Zope will still grow. Set up process monitoring that automatically kills and restarts Zope processes that grow above an upper bound, to prevent paging due to excessive memory consumption. This is a good idea even if you don't use caching at all.
  • There are two notable cache managers: one uses RAM and the other uses an HTTP accelerator. The RAMCacheManager caches results of objects in memory and so can be used to cache small objects that take much time or many resources to construct. The HTTPCacheManager is for use with an HTTP accelerator - most likely Squid, but an appropriately configured Apache works, too. The cache manager will provide the right Expires and Cache-Control headers so that most traffic can be delivered out of the HTTP accelerator instead of Zope.
  • Large Zope objects kill Zope's performance. When caching is used, they destroy caching efficiency by polluting the cache with large blobs that aren't often required, and they drag down Zope's own performance, too. The reason is that Zope output is constructed in memory, and constructing large objects in memory takes many resources due to the security and architectural layers in Zope. Better to create them with cronjobs or other means outside the Zope server and deliver them directly with Apache. Apache is much faster. A typical situation is users creating PDF documents in Zope instead of creating them outside. Bad idea.
  • Use ZEO. ZEO rocks. Really. In essence it's just the ZODB with a small communication layer on top. This layer is used in Zope instances instead of accessing the ZODB directly. That way you can run several process groups on your machine, all connecting to the same database. This helps with the above-mentioned process restarting: when one is down, another does the work. Use mod_backhand in Apache to distribute the load between the process groups, or use other load balancing tools. ZEO makes regular database packs easier, too: they run on the server and not in the Zope instances - those actually don't notice much of the running pack.
  • If you have an SMP machine, use it. Or buy one. Really - it helps. You need to run ZEO and multiple Zope instances, though - otherwise the global interpreter lock of Python will hit you over the head and Zope will just use one of your processors. That's one reason why you want multiple process groups in the first place - distributing the load on the machine itself and making use of multiple processors.
  • You can gain performance by reducing the architectural layers your code goes through. Python scripts are faster than DTML. Zope products are faster than Python scripts. Remove complex code from your server and move it into products or other outside places. This requires rewriting application code, so it isn't always an option - but if you do it, it will pay off.
  • Don't let your ZODB file grow too large. The ZODB only appends on write access - so the file grows. It grows quite large if you don't pack regularly. If you don't pack and you have multi-GB ZODB files, don't complain about slow server starts ...
  • If you have complex code in your Zope application, it might be worthwhile to move it into an outside server and trigger execution from Zope via some RPC means. I use my TooFPy for stuff like this - just pull out the code, build a tool and hook it into the Zope application via XMLRPC. Yes, XMLRPC can be quite fast - for example pyXMLRPC is a C-based implementation that is very fast. Moving code outside Zope helps because this code then can't block one of the statically allocated listeners while it calculates. Just upping the number of listener threads doesn't pay off as you would expect: due to the global interpreter lock, still only one thread will run at a time, and if your code uses C extensions it might even block all other threads while it runs.
  • If you use PostgreSQL, use PsycoPG as the database driver. PsycoPG uses session pooling and is very fast when your system gets lots of hits. Other drivers often block Zope due to limitations like only one query at a time and other such nonsense. Many admins had to learn the hard way that 16 listener threads aren't really 16 available slots once SQL drivers come into play ...
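The process-monitoring tip above can be sketched in a few lines of Python (Linux-only, since it reads `/proc`; the memory bound and the restart command are assumptions, not what I actually run):

```python
import os
import signal
import subprocess

MEM_LIMIT_KB = 512 * 1024  # upper bound per process; the value is an assumption

def vmrss_kb(status_text):
    """Extract VmRSS (resident set size, in KB) from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return 0

def check_and_restart(pids, restart_cmd):
    """Kill any process whose RSS grew past the bound and start a replacement.
    Meant to be run from cron every few minutes."""
    alive = []
    for pid in pids:
        with open("/proc/%d/status" % pid) as f:
            status = f.read()
        if vmrss_kb(status) > MEM_LIMIT_KB:
            os.kill(pid, signal.SIGTERM)
            alive.append(subprocess.Popen(restart_cmd).pid)
        else:
            alive.append(pid)
    return alive
```

With ZEO and multiple process groups, killing one instance this way is invisible from the outside - the other group keeps answering requests.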

There are more ways to help performance, but the above are doable with relatively little work and depend mostly on whether you have enough memory and maybe an SMP machine. Memory is important - the more the better. If you can put memory into your machine, do so. There is no such thing as too much memory (as long as your OS supports that amount, of course).

What to do if even the tips above don't work? Yes, I have been in that situation. If you get there, only one - rather brutish - solution remains: active caching. By that I mean pulling pages from the Zope server with cronjobs or other means, storing them in Apache folders and using mod_rewrite to deliver only static content to users. mod_rewrite is your friend. In essence you take the pages that are currently killing you and make them pseudo-static - they are only updated once in a while, but the hits won't reach Zope at all.

Another step, of course, is more hardware. If you use ZEO, it's no problem to put a farm of Zope servers in front of your ZEO machine (we currently have 5 dual-processor machines running the Zope instances, two rather big, fat, ugly servers in the background for the databases, and a frontend of two Apache servers that look almost like dwarves in comparison to the backend stuff).

Zope is fantastic software - don't get me wrong. I like it. Especially the fact that it is an integrated development environment for web applications and content management is very nice. And the easy integration of external data sources is nice, too. But Zope is a resource hog - there's no discussing that. There's no such thing as a free lunch.

Zope Hosting and Performance

The Schockwellenreiter is having problems with his Zope server. Since I've been doing professional Zope hosting in my company for several years now and run quite a few massive portals (between 2000 and 3000 hits per minute are not uncommon - though distributed across many systems), here are some tips from me on scaling Zope.

  • The most important step I would recommend to everyone is to streamline. Remove from Zope everything that doesn't need to be there - what can be created statically, what rarely changes, where no content management is needed: get rid of it. Put it in regular Apache directories. Use Apache's mod_rewrite to ensure the old URLs still work, but are served from Apache. This especially applies to all those little nuisances like layout graphics - they don't need to come from Zope, they're better served from Apache.
  • Use Zope caching whenever possible. Whenever possible means: enough memory on the server so that even memory-hungry processes have some breathing room. Generally, Zope's built-in caching causes processes to get fatter and fatter - the cleanup in its own cache is quite useless. So implement process monitoring that shoots down and restarts a Zope process when it uses too much memory. Yes, that really is sensible and necessary.
  • There are two good caching options in Zope: the RAMCacheManager and the HTTPCacheManager. The former stores results of Zope objects in main memory and can therefore cache individual page components - put the complex stuff in there. The second (HTTPCache) works together with Squid. Put a Squid in front of your Zope as an HTTP accelerator and configure the HTTP Cache Manager accordingly so that Zope generates the appropriate Expires headers. Then a large part of your traffic will be handled by Squid. It's faster than your Zope. Alternatively, you can configure an Apache as an HTTP accelerator with local cache - ideal for those who can't or don't want to install Squid, but do have options for further Apache configuration.
  • Large Zope objects (and I mean really large in terms of KB) kill Zope. With caching they destroy your best cache strategy, and Zope itself becomes incredibly slow when objects get too large. The reason lies in Zope's architecture: all objects are first laboriously pieced together through multiple layers by various software layers. In memory - and therefore take up corresponding space in memory. Get rid of complex objects with huge KB numbers. Make them smaller. Create them statically via cron job. Serve them from Apache - there's nothing dumber than storing all your large PDFs in Zope in the ZODB, or even generating them dynamically there.
  • Install ZEO. That thing rocks. Basically it's just the ZODB with a primitive server protocol. What's important: your Zope can be split into multiple process groups. You want this when you're using process monitoring to kill a rogue Zope process, but want the portal to appear as undamaged as possible from the outside - in that case just add mod_backhand to Apache, or another balancing technique between Apache and Zope. Additionally, ZEO also makes packing the ZODB (which should run daily) easier, since the pack runs in the background on the ZEO and the Zope servers themselves aren't greatly affected.
  • If you have it, use an SMP server. Or buy one. Really - it brings a lot. The prerequisite is the aforementioned technique with multiple process groups - Python has a global interpreter lock, which means that even on a multiprocessor machine, never more than one Python thread runs at a time. Therefore you want multiple process groups.
  • Performance is also gained by disabling layers. Unfortunately this often can only be realized with software changes, so it's more interesting for those who build it themselves. Move complex processes out of the Zope server and put them in Zope Products. Zope Products run natively without restrictions in the Python interpreter. Zope Python scripts and DTML documents, on the other hand, are dragged through many layers that ensure you respect Zope's access rights, don't do anything bad, and are generally well-behaved. And they make you slower. Products are worthwhile - but cost work and, unlike the other technical tips, aren't always feasible.
  • Additionally, it has proven useful not to put too much data in the ZODB, especially nothing that expands it - the ZODB only gets bigger, it only gets smaller when packing. After some time you easily have a ZODB in the GB range and shouldn't be surprised by slow server starts...
  • If more complex processes occur in the system, it can make sense to outsource them completely. I always use TooFPy for that. Simply convert all the more complex stuff into a tool and stick it in there - the code runs at full speed. Then simply access the tool server from Zope with a SOAP client or XMLRPC client and execute the functions there. Yes, the multiple XML conversion is actually less critical than running complex code in Zope - especially if that code demands considerable runtime. Zope then blocks one of its listeners - the number is static. And simply pushing it up doesn't help - thanks to the global interpreter lock, only more processes would wait for this lock to be released (e.g., for every C extension that's used). There's a good and fast C implementation for XMLRPC communication that can be integrated into Python, making the XML overhead problem irrelevant.
  • If you use PostgreSQL as a database: use PsycoPG as the database driver. Session pooling really gets Zope going. Generally you should check whether the corresponding database driver supports some form of session pooling - if necessary via an external SQL proxy. Otherwise, Zope might hang the entire system during SQL queries because a heavy query waits for its result. Many have already fallen into this trap and learned that 16 Zope threads doesn't necessarily mean 16 parallel processed Zope accesses when SQL databases are involved.
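The outsourcing tip can be sketched with Python's standard XMLRPC modules (a minimal sketch: the tool name, port and function are made up, and in reality the tool side would live in TooFPy rather than inline like this):

```python
import threading
import xmlrpc.client
from xmlrpc.server import SimpleXMLRPCServer

# Hypothetical "tool": an expensive computation moved out of Zope.
def render_report(n):
    return sum(i * i for i in range(n))

# The external tool server - runs at full speed in its own process.
server = SimpleXMLRPCServer(("127.0.0.1", 8271), logRequests=False)
server.register_function(render_report)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Inside Zope (a Product or External Method) the call is just this -
# the heavy work runs in the external process, not in a Zope listener:
proxy = xmlrpc.client.ServerProxy("http://127.0.0.1:8271/")
result = proxy.render_report(100)
```

The XML round trip costs a bit, but as noted above that is less critical than tying up one of Zope's statically allocated listener threads with long-running code.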

Of course there's a lot more you can do, but the above are largely manageable on the fly and mainly depend on you having enough memory in the server (and possibly a multiprocessor machine - but it works without one too). Memory is important - the more the better. If you can, just put more memory in. You can't have too much memory...

What to do if even all that's not enough? (Yes, I've been there - sometimes only the really heavy-handed approach helps.) Well, in that case there are variations of the above techniques. My favorite technique in this area is active caching. By this I mean that Zope is configured in one place with the documents that should be actively cached. This then requires a script on the machine that fetches the pages from Zope and puts them in a directory. Apache rewrite rules then ensure that the static content is served to the outside. Basically you're ensuring that the pages most frequently visited and suitable for this technique (i.e., containing no personalization data, for example) simply go out as static pages, no matter what else happens - the normal caching techniques just aren't brutal enough, too much traffic still gets through to the server.
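Such an active-caching cronjob can be sketched like this (a minimal sketch; the page list, base URL and docroot are assumptions - the real configuration lives wherever your Zope setup keeps it):

```python
import pathlib
import urllib.request

BASE = "http://localhost:8080/site"  # the Zope server; URL is an assumption
PAGES = {                            # made-up list of pages worth caching
    "/": "index.html",
    "/archive/": "archive/index.html",
}

def refresh_cache(docroot, fetch=None):
    """Pull each configured page from Zope and drop it as a static file
    below `docroot`, where Apache (via mod_rewrite) serves it directly."""
    # `fetch` is injectable for testing; the default does a plain HTTP GET.
    fetch = fetch or (lambda path: urllib.request.urlopen(BASE + path).read())
    for path, filename in PAGES.items():
        target = pathlib.Path(docroot) / filename
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(fetch(path))
```

Run it from cron every few minutes and the hottest pages never touch Zope at all - brutal, but effective.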

Another step is of course the use of additional machines - simply put more machines alongside and connect them using the ZEO technique.

Zope is fantastic software - especially the high integration of development environment, CMS, and server is often incredibly practical, and the easy integration of external data sources is also very nice. But Zope is a resource hog, you have to put it that simply.

Caching for PHP Systems

Caching Strategies for PHP-Based Systems

There are basically two ways to implement caching in a PHP-based system. Okay, there are many more, but two main approaches are clearly identifiable. I've compiled what's interesting in this context - especially since some colleagues are currently suffering under high server load. The whole thing is kept general, but for understandable reasons also considers the specific implications for WordPress.

  • Caching of pre-compiled PHP pages
  • Caching of page output

There are numerous variations of both main approaches. PHP pages exist on the web server as source code - unprocessed and not optimized in any way for loading. With complex PHP systems, every request means parsing and compiling each PHP file into internal code. For systems with many includes and many class libraries, this can be quite substantial. The first main direction of caching starts at this point: the generated intermediate code is simply stored away, either in shared memory (memory blocks available collectively to many processes of a system) or on the hard disk. There are a number of solutions here - I personally use turck-mmcache. The reason is mainly that it caches not only in shared memory but also on disk (which, as far as I know, the other similar solutions don't), and that there is a Debian package for turck-mmcache. And that I've had relatively few negative experiences with it so far (at least on Debian stable - on Debian testing things look different, there PHP applications crash on you). Since WordPress is built on a larger set of library modules with quite substantial source content, such a cache does quite a bit to reduce WordPress's baseline load. Since these caches are usually completely transparent - no visible effects except the speed improvement - you can generally enable such a cache across the board.

The second main direction for caching is the intermediate storage of page contents. Here's a special feature: pages are often dynamically generated depending on parameters - and therefore a page doesn't always produce the same output. Just think of mundane things like displaying the username when a user is logged in (and has stored a cookie for it). Page contents can also be different due to HTTP Basic Authentication (the login technique where the popup window for username and password appears). And POST requests (forms that don't send their contents via the URL) also produce output that depends on this data.

Basically, an output cache must consider all these input parameters. A good strategy is often not to cache POST results at all - because error messages etc. would also appear there, which depending on external sources (databases) could produce different outputs even with identical input values. So really only GET requests (URLs with parameters directly in the URL) can be meaningfully cached. However, you must consider both the sent cookies and the sent parameters in the URL. If your own system works with basic authentication, that must also factor into the caching concept.

A second problem is that pages are rarely purely static - even static pages certainly contain elements that you'd prefer to have dynamically. Here you need to make a significant decision: is purely static output enough, or does a mix come in? Furthermore, you still need to decide how page updates should affect things - how does the cache notice that something has changed?

One approach you can pursue is a so-called reverse proxy. You simply put a normal web proxy in front of the web server so that all access to the web server itself is technically routed through this web proxy. The proxy sits directly in front of the web server and is thus mandatory for all users. Since web proxies should already handle the problem of user authentication, parameters, and POST/GET distinction quite well (in the normal application situation for proxies, the problems are the same), this is a very pragmatic solution. Updates are also usually handled quite well by such proxies - and in an emergency, users can persuade the proxy to fetch the contents anew through a forced reload. Unfortunately, this solution only works if you have the server under your own control - and the proxy also consumes additional resources, which means there might not be room for it on the server. It also heavily depends on the application how well it works with proxies - although problems between proxy and application would also occur with normal users and therefore need to be solved anyway.

The second approach is the software itself - ultimately, the software can know exactly when contents are recreated and what needs to be considered for caching. Here there are again two directions of implementation. MovableType, PyDS, Radio Userland, Frontier - these all generate static HTML pages and therefore don't have the problem with server load during page access. The disadvantage is obvious: data changes force the pages to be recreated, which can be annoying on large sites (and led me to switch from PyDS to WordPress).

The second direction is caching from the dynamic application itself: on first access, the output is stored under a cache key. On the next access to the cache key, you simply check whether the output is already available, and if so, it's delivered. The cache key is composed of the GET parameters and the cookies. When database contents change, the corresponding entries in the cache are deleted and thus the pages are recreated on the next access.
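A minimal sketch of such an in-application cache, assuming an invented `OutputCache` API (none of these names come from WordPress or any real framework): the key combines path, GET parameters, and cookies, and entries are dropped when the database tables they were built from change.

```python
class OutputCache:
    def __init__(self):
        self._store = {}      # cache key -> rendered page
        self._by_table = {}   # table name -> set of keys to purge on change

    def make_key(self, path, get_params, cookies):
        # Sort both mappings so equivalent requests produce the same key.
        return (path,
                tuple(sorted(get_params.items())),
                tuple(sorted(cookies.items())))

    def get(self, key):
        return self._store.get(key)

    def put(self, key, page, tables=()):
        self._store[key] = page
        for table in tables:
            self._by_table.setdefault(table, set()).add(key)

    def invalidate(self, table):
        # Called when database contents change: drop every page
        # that was built from that table.
        for key in self._by_table.pop(table, ()):
            self._store.pop(key, None)

cache = OutputCache()
key = cache.make_key("/index", {"page": "2"}, {"lang": "de"})
if cache.get(key) is None:
    # Expensive rendering happens only on a cache miss.
    cache.put(key, "<html>rendered page</html>", tables=("posts",))
cache.invalidate("posts")   # a post changed -> page is rebuilt on next access
```

The table-to-keys index is what makes targeted invalidation possible; a cruder scheme (like Staticize's, discussed below) simply empties the whole cache on every change.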

WordPress itself has Staticize, a very practical plugin for this purpose. In the current beta, it's already included in the standard distribution. This plugin creates a cache entry for pages as described above, taking parameters and cookies into account - basic authentication isn't used in WordPress anyway. The trick, though, is that Staticize saves the pages as PHP. The cached pages are thus themselves dynamic again. This dynamism can be used to mark parts of the page with special comments - which allows dynamic function calls to be used for exactly these parts of the page. The advantage is obvious: while the big efforts of page creation, like loading the various library modules and reading from the database, are skipped entirely, individual areas of the site can remain dynamic. Of course, the functions for this must be structured so they don't need WordPress's entire library infrastructure - but dynamic counters or displays of currently active users, for example, can thus remain dynamic in the cached pages. Matt Mullenweg uses it, for example, to display a random image from his library even on cached pages. Staticize simply deletes the entire cache when a post is created or changed - very primitive, and with many files in the cache it can take a while, but it's very effective and pragmatic.
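The Staticize idea of keeping marked regions dynamic inside an otherwise cached page can be illustrated with a small Python analogue - the marker syntax and function names here are invented, not Staticize's actual format:

```python
import re

# Registry of functions that still run on every delivery of a cached
# page; the names are illustrative stand-ins for real dynamic features.
DYNAMIC = {
    "hit_counter": lambda: "Visitor 12345",
    "random_image": lambda: "<img src='pic.jpg'>",
}

MARKER = re.compile(r"<!--dynamic:(\w+)-->")

def serve_cached(page):
    # The expensive rendering already happened when the page was cached;
    # only the marked regions are evaluated again on each request.
    return MARKER.sub(lambda m: DYNAMIC[m.group(1)](), page)

cached = "<html><body>Static text. <!--dynamic:hit_counter--></body></html>"
print(serve_cached(cached))
# → <html><body>Static text. Visitor 12345</body></html>
```

The point of the design is that `serve_cached` needs none of the heavy application infrastructure - only the small registered functions run per request.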

Which caches should you sensibly deploy, and how? With more complex systems, I would always check whether a PHP code cache can be deployed - Turck MMCache, the Zend Optimizer, PHP Accelerator, or whatever else is out there.

I would personally only activate the application cache itself when it's really necessary due to load - with WordPress you can keep a plugin on hand and only activate it when needed. After all, caches with static page generation have their problems - layout changes only become active after cache deletion, etc.

If you can deploy a reverse proxy and the machine's resources suffice, it's certainly always recommended. If only because you then experience first-hand the problems your own application might have with proxies - problems that would also trouble every user sitting behind a web proxy. If you use Zope, for example, there are very good facilities in Zope to improve communication with the reverse proxy - a cache manager is available for this. Other systems also offer good foundations - but ultimately, any system that produces clean ETag and Last-Modified headers and correctly handles conditional GET (conditional requests that tell the server which version you already have locally, so that only updated content is sent) should be suitable.
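The conditional-GET mechanics mentioned at the end can be sketched roughly like this - a simplified illustration; real servers must handle more header variants and edge cases:

```python
import email.utils
import hashlib

def respond(body, request_headers, last_modified):
    """Compare the validators the client sent (If-None-Match /
    If-Modified-Since) against the current ETag and Last-Modified,
    and answer 304 Not Modified when nothing has changed."""
    etag = '"%s"' % hashlib.md5(body.encode()).hexdigest()
    modified_http = email.utils.formatdate(last_modified, usegmt=True)

    # ETag comparison takes precedence: same content hash -> 304.
    if request_headers.get("If-None-Match") == etag:
        return 304, {}, ""

    # Otherwise fall back to the date-based check.
    ims = request_headers.get("If-Modified-Since")
    if ims is not None:
        ims_ts = email.utils.mktime_tz(email.utils.parsedate_tz(ims))
        if ims_ts >= int(last_modified):
            return 304, {}, ""

    return 200, {"ETag": etag, "Last-Modified": modified_http}, body

status, headers, _ = respond("hello", {}, 1000.0)
status2, _, _ = respond("hello", {"If-None-Match": headers["ETag"]}, 1000.0)
print(status, status2)   # → 200 304
```

A proxy (or browser cache) in front of a server that answers like this never has to transfer an unchanged page twice - which is exactly why clean validators matter so much for cacheability.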

Delayed Execution with Python

The original text has moved to the PyDS weblog. The reason is that I cannot manage the text properly with the new software because the necessary tools are not available here (specifically, source code formatting doesn't work here, and besides the text is too huge - at least when it is saved as XHTML).

Visual Studio Magazine - Guest Opinion - Save the Hobbyist Programmer

An older, but interesting article that points to an important problem: hobby programmers are increasingly being excluded from creating small hacks and simple solutions by ever more complex system interfaces and constant changes to APIs and programming tools in the Windows world. And it's not just the Windows world that suffers from this. Linux and OS X suffer from it in part as well.

Of course, there are still small utilities with simple programming capabilities. Or scripting languages that are easy to learn and use - for example, Python. But that's not really a solution for these tinkerers. What was once the omnipresent Basic for hobby tinkerers, or for example the - admittedly problematic - language in dBase, is missing today. Hardly any programming environment that doesn't come with an object-oriented approach. Hardly any solution approaches that don't try to be a general development environment for complete programs right away.

There are still some nice exceptions - FileMaker on the Mac still tries to appeal to the hobby hacker. But it's still true: the simple entry points are becoming fewer.

Even AppleScript on the Mac has become so complex and bloated in the meantime that it's hardly possible for a newcomer to just get started with it. Some corners of AppleScript are obscure and complicated even for old programming veterans like me. And of course, while there are many great integration possibilities for all these scripting languages, the documentation for precisely these parts is downright terrible.

To stick with the AppleScript example: while there are application dictionaries that document an application's AppleScript capabilities, nearly all the descriptions I've read in them assume that the user already has complete and extensive knowledge of AppleScript and its structures (what objects are in AppleScript, how you work with containers, etc.). Although these dictionaries could serve as a starting point for the hobby programmer, their creators (professional programmers in software companies) design them in such a way that often only they themselves can make sense of them.

It's similar in the Linux world. TCL was once the standard scripting language for a simple entry: simple structure, an almost primitive extension interface, and the ability for even non-programmers to arrive at solutions quickly. Today, TCL in the standard distribution (nicely called "Batteries Included" - only, unfortunately, the understandable instructions are missing) already consists of mountains of packages, many of which deal with meta-language aspects (e.g., incrTCL and the widget libraries built on it and on TK - good grief, this brief mention of the contents alone has more words incomprehensible to a beginner than filler words), which a beginner will never understand.

And I don't need to go into the dismal situation under Windows with the scripting host and the OLE Automation interfaces (or whatever they're called these days) - anyone who has experienced a version change of an application and had to completely rewrite their entire solution due to a total change in the scripting model of, say, Access, knows what I'm talking about.

Ultimately, we (we == professional programmers) are taking a piece of freedom away from end users - the freedom to tinker around and, yes, also the freedom to shoot themselves in the foot. And I think that precisely in the world of free software, programmers should start giving this some thought again. It's nice that almost every larger program embeds some scripting language. What's not so nice is that hardly any of these embeddings have decent documentation of their capabilities, and that only the most primitive examples and complete solutions for very complex applications are available as starting points for learning. Hobby programmers in particular learn most easily by reading existing tools.

And yes, I'm not exactly a good example myself: the Python Desktop Server has a number of extension points that are also intended for end users - but I too wrote far too little documentation for it. Somehow a shame, because that's how many projects become incestuous affairs, with the actual end users left out.

No, I don't have a real solution - especially with free projects, documentation is often an annoying and unpopular part of the work and is therefore treated like a stepchild. Besides, most programmers aren't able to write generally understandable documentation anyway. But maybe that's an opportunity for initiatives that try to increase activity in large open source community projects with lower participation so far - debian-women comes to mind spontaneously (since Jutta is currently working on it). Greater participation by women would certainly also help with documentation and information that doesn't require a fully trained master hacker. After all, not everyone wants to spend their entire life learning new APIs and tools ... Here's the original article.

Cactus Mite

112-640-640.jpeg

For size comparison: the distance between the centers of the two darker bars is one millimeter! The image is not optimally sharp, as I had to capture the whole thing relatively primitively - for example, at the time of shooting I didn't have a macro focusing stage for focus adjustment, but had to do it manually. Still, it's impressive what kind of images you can get with relatively little effort.

Python Community Server

Muensterland.org works with the Python Community Server, so here's a bit about it from me. The Python Community Server is an open source implementation of the xmlStorageSystem. This is a protocol based on XML-RPC for storing static content. Essentially, the Python Community Server is nothing more than a web server with a somewhat unconventional upload protocol and a few pre-made CGIs - there are comments on articles, there's a mail form, and a few simple ways to subscribe to a website as an RDF channel.
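To give an idea of what such an XML-RPC upload protocol looks like on the wire, here is a rough sketch in Python. The method name `xmlStorageSystem.saveMultipleFiles` and the exact argument layout are my assumptions about the protocol's general shape, not a verified signature - consult the xmlStorageSystem spec before relying on them.

```python
import base64
import hashlib
import xmlrpc.client

def build_upload(email, password, files):
    """Prepare an upload struct: relative paths plus base64-encoded
    file contents, with the password sent as an MD5 digest.
    (Hypothetical layout, modeled on the protocol's general shape.)"""
    paths = sorted(files)
    texts = [base64.b64encode(files[p].encode()).decode() for p in paths]
    return {
        "email": email,
        "password": hashlib.md5(password.encode()).hexdigest(),
        "relativePathList": paths,
        "fileTextList": texts,
    }

args = build_upload("me@example.org", "secret",
                    {"index.html": "<html>Hello</html>"})

# The actual call would go to the server's XML-RPC endpoint:
# proxy = xmlrpc.client.ServerProxy("http://example.org/RPC2")
# proxy.xmlStorageSystem.saveMultipleFiles(args)   # hypothetical call
print(args["relativePathList"])   # → ['index.html']
```

The essential point is how thin the protocol is: static file contents shipped over XML-RPC, with everything interesting (rendering, layout) happening on the client side.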

So what's the point of all this? Why the hype over this tool? Things really get interesting only when you use Radio UserLand as a client. This is because xmlStorageSystem is the protocol that Radio UserLand uses as its backend system. Radio Communities are controlled through it.

Radio UserLand is a combination of a news aggregator, a website designer, and a weblog tool. The news aggregator collects news from the internet and makes it available locally. The user can then post individual articles to their own weblog. Additionally, they can simplify their website design with fairly powerful functions. What's special about Radio UserLand is that it's essentially a local website on the user's computer. And from this website, replication to other servers can take place. This can happen via standard protocols like the Blogger API (where only weblog content is transported, the layout remains with the server operator), via FTP (where static HTML exports are created, essentially Radio is just an overblown mirror script, so interactive features are quite limited), and via xmlStorageSystem. And this closes the circle back to the Python Community Server, because it's nothing more than the implementation of the latter.

By the way, there's also a tool for Linux, but this is more oriented toward classical weblog tools and doesn't offer the advanced layout tools that Radio UserLand does. And of course, there's now also the Python Desktop Server, which essentially works like Radio. It's available for almost every POSIX platform where Python runs.

Otherwise: just register a weblog here and use it. Try it all out. Muensterland.org is free for now; anyone can set up a weblog there. It's - as you can tell from the domain - of course primarily intended for the Münsterland region, but others can participate too. There are expat Münsterlanders after all.

Comparison of Rollei 6008 and Hasselblad System

Since I recently saw someone arrive at this site searching for "Rollei and Hasselblad comparison", I got to thinking about why I actually have a Rollei 6008 and not a Hasselblad. Alongside the M6, a Hasselblad would fit much better - both being mechanical cameras. The Rollei, on the other hand, is a high-tech monster. Ok, one reason was that the Rollei was sitting in the shop window and the price was good, sure. But I could have left it there and waited for a Hasselblad. So why Rollei?

For me, the Rollei is in many respects the crowning achievement in the development of manual-focus cameras. I couldn't imagine what else you could build into it. The Rollei has a whole range of special features compared to many other MF cameras. Top of the list is light metering even with the waist-level finder. But it's not just that: it's also the way exposure is metered and controlled. That's exactly how I always imagined it: free choice of metering mode, arbitrarily combinable with aperture priority, shutter priority, or manual match-needle metering. Ok, it also has a program auto mode for hectic situations. You simply set to automatic whatever should be automatic - if both aperture and shutter speed are set to automatic, you have program auto. No silly mode dial.

Then there are of course the other Rollei features that convinced me: the built-in motor (it's not fast, but it's built in and therefore compact). The roller blind in the magazines is also a great thing - no more lost dark slides. The long film path in the magazines helps against the annoying film-flatness problem of the classic Hasselblad and Zeiss magazines. The electronic transmission of film speed from magazine to camera makes switching magazines with different film speeds practical and quick.

And then the Rollei of course has its "fine points": the 1/1000th of a second with the PQS lenses, for example. The purely electronic signal transmission, which required no change to the bayonet even for the new AF lenses. The absolutely excellent Zeiss optical designs that make for really fine lenses - even though I only own a single one (the 2.8/80 PQS). And the whole thing comes in a robust housing.

My conclusion: one of the large Hasselblad models with integrated exposure metering and an add-on winder would certainly have many of the Rollei's features, but definitely not all of them. And not in such a pleasant-to-operate form. And certainly not at the used price I paid.

Hmm. I really need to go out with the Rollei again soon.