Тех-Детали: 2015

воскресенье, 27 декабря 2015 г.

nginx module to enable haskell binding to nginx configuration files

Do you like haskell and nginx? I love them both and this inspired me to write an nginx module for inlining haskell source code straight into nginx configuration files. The module was published on github as nginx-haskell-module and shipped with an nginx configuration example to show its basic usage. Let’s look at it.

user                    nobody;
worker_processes        2;

events {
    worker_connections  1024;
}

http {
    default_type        application/octet-stream;
    sendfile            on;

    haskell ghc_extra_flags '-hide-package regex-pcre';

    haskell compile /tmp/ngx_haskell.hs '

import qualified Data.Char as C
import           Text.Regex.PCRE
import           Safe

toUpper = map C.toUpper
NGX_EXPORT_S_S (toUpper)

takeN x y = take (readDef 0 y) x
NGX_EXPORT_S_SS (takeN)

NGX_EXPORT_S_S (reverse)

matches :: String -> String -> Bool
matches a b = not $ null (getAllTextMatches $ a =~ b :: [String])
NGX_EXPORT_B_SS (matches)

    ';

    server {
        listen       8010;
        server_name  main;
        error_log    /tmp/nginx-test-haskell-error.log;
        access_log   /tmp/nginx-test-haskell-access.log;

        location / {
            haskell_run toUpper $hs_a $arg_a;
            echo "toUpper ($arg_a) = $hs_a";
            if ($arg_b) {
                haskell_run takeN $hs_a $arg_a $arg_b;
                echo "takeN ($arg_a, $arg_b) = $hs_a";
                break;
            }
            if ($arg_c) {
                haskell_run reverse $hs_a $arg_c;
                echo "reverse ($arg_c) = $hs_a";
                break;
            }
            if ($arg_d) {
                haskell_run matches $hs_a $arg_d $arg_a;
                echo "matches ($arg_d, $arg_a) = $hs_a";
                break;
            }
        }
    }
}

Haskell source code is placed inside the second argument of the directive haskell compile. In this example it contains some imports, three definitions of functions and four special export directives to introduce the functions on the nginx configuration level. There are four types of export directives: NGX_EXPORT_S_S, NGX_EXPORT_S_SS, NGX_EXPORT_B_S and NGX_EXPORT_B_SS for functions of types String -> String, String -> String -> String, String -> Bool and String -> String -> Bool respectively. The code gets written into the path specified in the first argument of the directive haskell compile (it must be an absolute path) and then compiled to a shared library at the very start of nginx. Sometimes ghc may require extra options besides defaults. And here is the case. As soon as import of Text.Regex.PCRE can be ambiguous (because two haskell packages regex-pcre and regex-pcre-builtin provide it), ghc must know which package to use. There is a special ghc flag -hide-package for hiding unwanted packages and it was used in this example by the directive haskell ghc_extra_flags. There is another nginx haskell directive haskell load which is similar to the haskell compile except it does not require the second argument (i.e. the haskell source code). The directive tries to load compiled shared library that corresponds to the path specified in its first argument (/tmp/ngx_haskell.so in this example). If the code argument is present but there is not compiled shared library, the latter will be first compiled and then loaded. If the haskell code fails to compile then nginx won’t start (or won’t reload workers if the code had been wrongly changed in the configuration file and nginx has been sent the SIGHUP signal). Any errors will be logged. To run the compiled haskell code in the nginx context there is another nginx directive haskell_run. It is allowed in server, location and location-if configuration clauses and may accept three or four arguments depending on the arity of the exported function to run which is specified in the first argument of the directive. The second argument introduces an nginx variable where return value of the haskell function will be saved. For example directive

                haskell_run takeN $hs_a $arg_a $arg_b;

introduces a new nginx variable $hs_a which will be calculated on demand as result of running an exported haskell function takeN with arguments $arg_a and $arg_b. Let’s do some curl tests. First of all nginx must be built and run. Besides the haskell module the build requires nginx echo module. It must be specified in options --add-module of nginx configure script.

./configure --add-module=<path-to-nginx-echo-module> --add-module=<path-to-nginx-haskell-module>

Placeholders <path-to-nginx-... are to be replaced with real paths to the modules. After running make we start the nginx daemon.

./objs/nginx -c /home/lyokha/devel/nginx-haskell-module/nginx.conf
[1 of 1] Compiling NgxHaskellUserRuntime ( /tmp/ngx_haskell.hs, /tmp/ngx_haskell.o )
Linking /tmp/ngx_haskell.so ...

Nginx option -c specifies location of the configuration file. The shared library /tmp/ngx_haskell.so was built upon start which was printed on the terminal. And now we are going to ask nginx server to do some haskell calculations!

curl 'http://localhost:8010/?a=hello_world'
toUpper (hello_world) = HELLO_WORLD
curl 'http://localhost:8010/?a=hello_world&b=4'
takeN (hello_world, 4) = hell
curl 'http://localhost:8010/?a=hello_world&b=oops'
takeN (hello_world, oops) = 
curl 'http://localhost:8010/?c=intelligence'
reverse (intelligence) = ecnegilletni
curl 'http://localhost:8010/?d=intelligence&a=^i'
matches (intelligence, ^i) = 1
curl 'http://localhost:8010/?d=intelligence&a=^I'
matches (intelligence, ^I) = 0

Hmm, I did not escape caret inside the URL in the matches examples: seems that curl allowed it, anyway the calculations were correct. Let’s do some changes in the haskell code inside the configuration file, say takeN will be

takeN x y = ("Incredible! " ++) $ take (readDef 0 y) x

, reload nginx configuration

pkill -HUP nginx

and do curl again.

curl 'http://localhost:8010/?a=hello_world&b=4'
takeN (hello_world, 4) = Incredible! hell

Nice, kind of runtime haskell reload! Now I want to explain some details of implementation and concerns about efficiency and exceptions. Haskell code gets compiled while reading nginx configuration file. It means that ghc must be accessible during nginx runtime (except when using haskell load with already compiled haskell library). Only one haskell compile (or haskell load) directive is allowed through the whole configuration. Directive haskell compile (or haskell load) must be placed before directives haskell_run and after directive haskell ghc_extra_flags (if any). Compiled haskell library gets loaded by dlopen() call when starting a worker process (i.e. haskell runtime is loaded for each nginx worker process separately). The haskell code itself is wrapped inside a special haskell FFI code (it can be found in /tmp/ngx_haskell.hs in case of the above example). The FFI tries its best to minimize unnecessary allocations when reading arguments that were passed from nginx, however for S_S and S_SS function types it does allocate a new string for return value which gets promptly copied to an nginx memory pool and freed. It was possible to avoid allocations for return values, but in this case ghc should have known about internal nginx string implementation, so I decided to sacrifice efficiency to avoid runtime dependency on nginx source code (again, this dependency would not have been necessary if this special FFI interface had been compiled in a separate module, but… probably I’ll do that in future). Another concern about efficiency is using in the exported haskell handlers haskell Strings which are simple lists of characters (which is assumed to be inefficient). On the other hand they are lazy (and this often means efficient). Ok, this is a traditional haskell trade-off matter… What about haskell exceptions? C code is unable to handle them and this is a bad news. However writing pure haskell code makes a strong guarantee about high level of exception safety (at least all IO exceptions must go). In the above example I used function readDef from module Safe instead of its partial counterpart read which increased safety level as well.

понедельник, 7 декабря 2015 г.

Шрифты infinality-ultimate для Fedora 23

Добрый человек собрал репозиторий с пропатченными пакетами для infinality. Спасибо ему большое! Некоторое время я вообще думал, что infinality мертв, оказывается нет: здесь находится активный репозиторий разработки, но без поддержки Fedora. Лично я установил только основные патчи, без шрифтов — так, как написано в секции Installation — на мой вкус этого вполне достаточно. До этого у меня стояли версии freetype и fontconfig от Russian Fedora и, в общем-то, я не жаловался на рендеринг, однако теперь я вижу разницу: шрифты стали четче, что-ли. Короче, чувствую, что стало лучше, но доказать не могу :) И сравнительные картинки приводить не стану: не сохранил оригинальный вид, да и в интернете такие сравнения найти несложно.

вторник, 1 декабря 2015 г.

Не такой уж простой модуль nginx для создания комбинированных апстримов

Когда-то давно я написал статью о простом модуле nginx для создания комбинированных апстримов. В ней шла речь о реализации простой директивы add_upstream на уровне блока upstream в конфигурации nginx, которая позволяет добавлять серверы из других апстримов: очень удобно, когда вам требуется собрать апстримы, скомбинированные из нескольких других апстримов, без копирования объявлений составляющих их серверов. На данный момент я больше не могу назвать этот модуль простым, поскольку кроме расширенной функциональности, в его реализации появились разнообразные механизмы nginx, такие как фильтрация заголовков и тела ответов, доступ к переменным и подзапросы (subrequests). Теперь модуль называется модулем комбинированных апстримов, он выложен на гитхабе и снабжен подробной документацией на английском языке. В этой статье я хочу перечислить все возможности данного модуля с примерами их использования.

Директива add_upstream в блоке upstream. Это то, с чего все начиналось.

Конфигурация

events {
    worker_connections  1024;
}

http {
    upstream u1 {
        server localhost:8020;
    }
    upstream u2 {
        server localhost:8030;
    }
    upstream u3 {
        server localhost:8040;
    }
    upstream ucombined {
        add_upstream u1;
        add_upstream u2;
        add_upstream u3 backup;
    }

    server {
        listen       127.0.0.1:8010;
        server_name  main;

        location / {
            proxy_pass http://ucombined;
        }
    }
    server {
        listen       127.0.0.1:8020;
        server_name  server1;
        location / {
            echo "Passed to $server_name";
        }
    }
    server {
        listen       127.0.0.1:8030;
        server_name  server2;
        location / {
            echo "Passed to $server_name";
        }
    }
    server {
        listen       127.0.0.1:8040;
        server_name  server3;
        location / {
            echo "Passed to $server_name";
        }
    }
}

Тест
```
for i in `seq 10` ; do curl 'http://localhost:8010/' ; done
Passed to server1
Passed to server1
Passed to server2
Passed to server2
Passed to server1
Passed to server1
Passed to server2
Passed to server2
Passed to server1
Passed to server1
```
Правильно. Если вам интересно, почему каждый сервер опрашивается по два раза подряд, то ответ таков. Мой системный файл /etc/hosts содержит следующие две строки.
```
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
```
Это значит, что loopback интерфейс имеет два адреса — для IPv4 и IPv6 (по крайней мере в /etc/hosts, который nginx читает на старте). Для каждого адреса nginx создает отдельный элемент-сервер в списке round robin peers. Достаточно закомментировать вторую строку в /etc/hosts и перезапустить nginx, чтобы получить настоящий round robin цикл в этом тесте.

Директива combine_server_singlets в блоке upstream. Эта штука позволяет плодить апстримы в невероятных количествах :) Представьте, что у вас есть такой апстрим

    upstream u1 {
        server localhost:8020;
        server localhost:8030;
        server localhost:8040;
    }

и вы хотите создать три следующих производных апстрима-синглета (не надо спрашивать зачем, у меня была такая задача и я точно знаю, что она имеет смысл).

    upstream u11 {
        server localhost:8020;
        server localhost:8030 backup;
        server localhost:8040 backup;
    }
    upstream u12 {
        server localhost:8020 backup;
        server localhost:8030;
        server localhost:8040 backup;
    }
    upstream u13 {
        server localhost:8020 backup;
        server localhost:8030 backup;
        server localhost:8040;
    }

Не нужно их создавать вручную! Достаточно поместить новую директиву внутрь порождающего апстрима

    upstream u1 {
        server localhost:8020;
        server localhost:8030;
        server localhost:8040;
        combine_server_singlets;
    }

и апстримы-синглеты будут созданы автоматически. Для тонкой настройки имен порожденных апстримов директива предоставляет два опциональных параметра: суффикс и разрядное выравнивание порядкового номера апстрима.

Конфигурация

events {
    worker_connections  1024;
}

http {
    upstream u1 {
        server localhost:8020;
        server localhost:8030;
        server localhost:8040;
        combine_server_singlets;
        combine_server_singlets _tmp_ 2;
    }

    server {
        listen       127.0.0.1:8010;
        server_name  main;

        location /1 {
            proxy_pass http://u11;
        }
        location /2 {
            proxy_pass http://u1_tmp_02;
        }
        location /3 {
            proxy_pass http://u1$cookie_rt;
        }
    }
    server {
        listen       127.0.0.1:8020;
        server_name  server1;
        location / {
            add_header Set-Cookie "rt=1";
            echo "Passed to $server_name";
        }
    }
    server {
        listen       127.0.0.1:8030;
        server_name  server2;
        location / {
            add_header Set-Cookie "rt=2";
            echo "Passed to $server_name";
        }
    }
    server {
        listen       127.0.0.1:8040;
        server_name  server3;
        location / {
            add_header Set-Cookie "rt=3";
            echo "Passed to $server_name";
        }
    }
}

Тест

curl 'http://localhost:8010/1'
Passed to server1
curl 'http://localhost:8010/2'
Passed to server2
curl 'http://localhost:8010/3'
Passed to server1
curl -D- -b 'rt=2' 'http://localhost:8010/3'
HTTP/1.1 200 OK
Server: nginx/1.8.0
Date: Tue, 01 Dec 2015 10:59:00 GMT
Content-Type: text/plain
Transfer-Encoding: chunked
Connection: keep-alive
Set-Cookie: rt=2

Passed to server2
curl -D- -b 'rt=3' 'http://localhost:8010/3'
HTTP/1.1 200 OK
Server: nginx/1.8.0
Date: Tue, 01 Dec 2015 10:59:10 GMT
Content-Type: text/plain
Transfer-Encoding: chunked
Connection: keep-alive
Set-Cookie: rt=3

Passed to server3

Обмен кукой rt дает подсказку, где синглетные апстримы могут быть полезны.

Апстрэнды (upstrands). Это такие комбинированные апстримы, внутри которых составляющие их апстримы не теряют свою целостность и идентичность. Слово upstrand образовано из двух составляющих: up~~stream~~ и strand и означает пучок или жилу апстримов. Я касался деталей реализации апстрэндов в этой статье на английском языке. В двух словах, апстрэнд представляет собой высокоуровневую структуру, которая может опрашивать составляющие ее апстримы по кругу (round robin) до тех пор, пока не найдет апстрим, удовлетворяющий заданному условию — код ответа апстрима (HTTP response) не должен входить в список, заданный директивой next_upstream_statuses. Технически апстрэнды являются блоками — такими же как и апстримы. Они точно так же задаются внутри секции http конфигурации nginx, но вместо серверов составляющими их компонентами являются обычные апстримы. Апстримы добавляются в апстрэнд с помощью директивы upstream. Если имя апстрима начинается с символа тильда, то оно рассматривается как регулярное выражение. Отдельные апстримы внутри апстрэнда могут быть помечены как бэкапные, также имеется возможность блэклистить апстримы на определенное время с помощью параметра blacklist_interval. Опрос нескольких апстримов внутри апстрэнда реализован с помощью механизма подзапросов (subrequests). Этот механизм запускается в результате доступа к встроенной переменной upstrand_NAME, где NAME соответствует имени существующего апстрэнда. Я предполагаю, что в основном апстрэнды будут применяться в директиве proxy_pass модуля nginx proxy, однако здесь нет искусственных ограничений: доступ к механизму запуска подзапросов через переменную позволяет использовать апстрэнды в любом пользовательском модуле. На случай, если имя апстрэнда заранее неизвестно (например, оно приходит в куке), предусмотрена директива dynamic_upstrand, которая записывает имя следующего апстрима предполагаемого апстрэнда в свой первый аргумент-переменную на основании оставшегося списка аргументов (имя апстрэнда будет соответствовать первому не пустому аргументу из этого списка). Директива доступна на уровнях конфигурации server, location и location-if. Апстрэнды предоставляют несколько статусных переменных, среди них upstrand_addr, upstrand_status, upstrand_cache_status, upstrand_connect_time, upstrand_header_time, upstrand_response_time, upstrand_response_length — все они соответствуют аналогичным переменным из модуля upstream, только хранят значения всех посещенных апстримов в рамках данного HTTP запроса — и upstrand_path, в которой записан хронологический порядок (путь) посещения апстримов в рамках данного запроса. Статусные переменные полезны для анализа работы апстрэндов в access логе. А теперь пример конфигурации и curl-тест.
- Конфигурация
```
events {
    worker_connections  1024;
}

http {
    upstream u01 {
        server localhost:8020;
    }
    upstream u02 {
        server localhost:8030;
    }
    upstream b01 {
        server localhost:8040;
    }
    upstream b02 {
        server localhost:8050;
    }

    upstrand us1 {
        upstream ~^u0 blacklist_interval=10s;
        upstream b01 backup;
        next_upstream_statuses 5xx;
    }
    upstrand us2 {
        upstream b02;
        next_upstream_statuses 5xx;
    }

    log_format  fmt '$remote_addr [$time_local]\n'
                    '>>> [path]          $upstrand_path\n'
                    '>>> [addr]          $upstrand_addr\n'
                    '>>> [response time] $upstrand_response_time\n'
                    '>>> [status]        $upstrand_status';

    server {
        listen       127.0.0.1:8010;
        server_name  main;
        error_log    /tmp/nginx-test-upstrand-error.log;
        access_log   /tmp/nginx-test-upstrand-access.log fmt;

        dynamic_upstrand $dus1 $arg_a us2;

        location /us {
            proxy_pass http://$upstrand_us1;
        }
        location /dus {
            dynamic_upstrand $dus2 $arg_b;
            if ($arg_b) {
                proxy_pass http://$dus2;
                break;
            }
            proxy_pass http://$dus1;
        }
    }
    server {
        listen       8020;
        server_name  server1;

        location / {
            echo "Passed to $server_name";
            #return 503;
        }
    }
    server {
        listen       8030;
        server_name  server2;

        location / {
            echo "Passed to $server_name";
            #return 503;
        }
    }
    server {
        listen       8040;
        server_name  server3;

        location / {
            echo "Passed to $server_name";
        }
    }
    server {
        listen       8050;
        server_name  server4;

        location / {
            echo "Passed to $server_name";
        }
    }
}
```
- Тест
```
for i in `seq 6` ; do curl 'http://localhost:8010/us' ; done
Passed to server1
Passed to server2
Passed to server1
Passed to server2
Passed to server1
Passed to server2
```
  В логах nginx мы увидим
```
tail -f /tmp/nginx-test-upstrand-*
==> /tmp/nginx-test-upstrand-access.log <==

==> /tmp/nginx-test-upstrand-error.log <==

==> /tmp/nginx-test-upstrand-access.log <==
127.0.0.1 [01/Dec/2015:16:52:03 +0300]
>>> [path]          u01
>>> [addr]          (u01) 127.0.0.1:8020
>>> [response time] (u01) 0.000
>>> [status]        (u01) 200
127.0.0.1 [01/Dec/2015:16:52:03 +0300]
>>> [path]          u02
>>> [addr]          (u02) 127.0.0.1:8030
>>> [response time] (u02) 0.000
>>> [status]        (u02) 200
127.0.0.1 [01/Dec/2015:16:52:03 +0300]
>>> [path]          u01
>>> [addr]          (u01) 127.0.0.1:8020
>>> [response time] (u01) 0.000
>>> [status]        (u01) 200
127.0.0.1 [01/Dec/2015:16:52:03 +0300]
>>> [path]          u02
>>> [addr]          (u02) 127.0.0.1:8030
>>> [response time] (u02) 0.001
>>> [status]        (u02) 200
127.0.0.1 [01/Dec/2015:16:52:03 +0300]
>>> [path]          u01
>>> [addr]          (u01) 127.0.0.1:8020
>>> [response time] (u01) 0.000
>>> [status]        (u01) 200
127.0.0.1 [01/Dec/2015:16:52:03 +0300]
>>> [path]          u02
>>> [addr]          (u02) 127.0.0.1:8030
>>> [response time] (u02) 0.001
>>> [status]        (u02) 200
```
  А теперь давайте закомментируем директивы echo и раскомментируем директивы return 503 в локейшнах двух первых бэкендов (server1 и server2), перезапустим nginx и протестируем снова.
```
for i in `seq 6` ; do curl 'http://localhost:8010/us' ; done
Passed to server3
Passed to server3
Passed to server3
Passed to server3
Passed to server3
Passed to server3
```
  Логи nginx.
```
==> /tmp/nginx-test-upstrand-access.log <==
127.0.0.1 [01/Dec/2015:16:58:06 +0300]
>>> [path]          u01 -> u02 -> b01
>>> [addr]          (u01) 127.0.0.1:8020 (u02) 127.0.0.1:8030 (b01) 127.0.0.1:8040
>>> [response time] (u01) 0.001 (u02) 0.000 (b01) 0.000
>>> [status]        (u01) 503 (u02) 503 (b01) 200
127.0.0.1 [01/Dec/2015:16:58:06 +0300]
>>> [path]          b01
>>> [addr]          (b01) 127.0.0.1:8040
>>> [response time] (b01) 0.000
>>> [status]        (b01) 200
127.0.0.1 [01/Dec/2015:16:58:06 +0300]
>>> [path]          b01
>>> [addr]          (b01) 127.0.0.1:8040
>>> [response time] (b01) 0.000
>>> [status]        (b01) 200
127.0.0.1 [01/Dec/2015:16:58:06 +0300]
>>> [path]          b01
>>> [addr]          (b01) 127.0.0.1:8040
>>> [response time] (b01) 0.000
>>> [status]        (b01) 200
127.0.0.1 [01/Dec/2015:16:58:06 +0300]
>>> [path]          b01
>>> [addr]          (b01) 127.0.0.1:8040
>>> [response time] (b01) 0.000
>>> [status]        (b01) 200
127.0.0.1 [01/Dec/2015:16:58:06 +0300]
>>> [path]          b01
>>> [addr]          (b01) 127.0.0.1:8040
>>> [response time] (b01) 0.000
>>> [status]        (b01) 200
```
  Ждем десять секунд — заблэклисченные апстримы должны разблэклиститься, и повторяем снова.
```
for i in `seq 2` ; do curl 'http://localhost:8010/us' ; done
Passed to server3
Passed to server3
```
  Логи nginx.
```
127.0.0.1 [01/Dec/2015:17:01:44 +0300]
>>> [path]          u01 -> u02 -> b01
>>> [addr]          (u01) 127.0.0.1:8020 (u02) 127.0.0.1:8030 (b01) 127.0.0.1:8040
>>> [response time] (u01) 0.000 (u02) 0.000 (b01) 0.001
>>> [status]        (u01) 503 (u02) 503 (b01) 200
127.0.0.1 [01/Dec/2015:17:01:44 +0300]
>>> [path]          b01
>>> [addr]          (b01) 127.0.0.1:8040
>>> [response time] (b01) 0.000
>>> [status]        (b01) 200
```
  А теперь протестируем работу динамических апстрэндов (предварительно вернув оригинальные настройки локейшнов двух первых бэкендов).
```
curl 'http://localhost:8010/dus?a=us1'
Passed to server1
curl 'http://localhost:8010/dus?a=us2'
Passed to server4
curl 'http://localhost:8010/dus?a=foo&b=us1'
Passed to server2
curl 'http://localhost:8010/dus'
Passed to server4
curl 'http://localhost:8010/dus?b=foo'
<html>
<head><title>500 Internal Server Error</title></head>
<body bgcolor="white">
<center><h1>500 Internal Server Error</h1></center>
<hr><center>nginx/1.8.0</center>
</body>
</html>
```
  В первом запросе мы через аргумент a попали на один из апстримов апстрэнда us1 — им оказался апстрим u01, который содержит единственный сервер server1. Во втором запросе, тоже через аргумент a, мы попали на апстрэнд us2 — апстрим b02 — сервер server4. В третьем запросе мы задействовали новый динамический апстрим dus2 через аргумент b, который отправил нас на второй апстрим (round robin же) u02 апстрэнда us1 и сервер server2. В четвертом запросе мы не предоставили аргументов и сработал последний не пустой элемент динамического апстрэнда dus1 — us2 с его единственным апстримом b02 и единственным сервером server4. В последнем запросе я показал, что может произойти, если динамический апстрэнд вернет пустое значение. В данном случае значение dus2 оказалось пустым и директива proxy_pass, попытавшись выполнить проксирование на неверно сформированный адрес http://, вернула ошибку 500.
Апстрэнды могут быть полезны, во-первых, для создания двумерных round robin циклов, когда вы знаете, что если некоторый сервер из определенного апстрима вернул неудовлетворительный ответ, то нет необходимости обращаться к другим серверам этого апстрима, а следует незамедлительно переходить к следующему апстриму — простой upstream round robin механизм не способен эмулировать такое поведение, поскольку серверы внутри апстрима не могут образовывать кластеры, и, во-вторых, для переноса части логики протокола уровня приложения с клиента на роутер. Например, если в логике приложения код ответа 204, присланный из некоторого апстрима, обозначает отсутствие данных и клиенту следует тут же проверить наличие данных в другом апстриме, то можно ограничить общение клиента с бэкендом всего одним запросом, перенеся опрос всех направлений-апстримов на плечи роутера, в котором все эти апстримы будут помещены в один апстрэнд. Такой подход полезен еще тем, что инкапсулирует знание топологии бэкендов внутри роутера, ведь клиентам это знание больше не нужно.

среда, 11 ноября 2015 г.

cgrep vs ack

Cgrep и ack — это две grep-подобные программы, приспособленные для поиска внутри исходных текстов на разных языках программирования. Ack написана на perl и существует довольно давно, в то время как cgrep — относительно новая программа, написанная на haskell и обладающая большим количеством экспериментальных возможностей. Я не буду подробно сравнивать всевозможные аспекты использования этих программ, но отмечу их наиболее яркие положительные качества и недостатки и в конце приведу мои собственные настройки для декорирования их вывода на терминал. Итак, начнем с положительных качеств ack.

Возможность настроить рекурсивный поиск только внутри исходников указанного языка программирования (или нескольких языков). Соответствие файла определенному языку программирования определяется несколькими способами: расширением имени файла, точным именем файла (например, Makefile для make) и соответствием имени файла или первой строки в файле заданному шаблону регулярного выражения. Настройки соответствий языков по умолчанию можно вывести с помощью опции ack --dump: список языков находится внутри опций --type-add. Собственно включение в поиск определенного языка задается синтетической опцией --язык, где язык соответствует одному из значений в выведенном списке, стоящих сразу после символа =. Например, чтобы включить поиск внутри исходников C++, нужно указать опцию --cpp.
Возможность гибкой настройки вывода контекста вокруг строк с найденными совпадениями: для этого имеются целых три опции -A, -B и -C.
Гибкая настройка формата вывода, включая цветовую подсветку найденных совпадений и возможность перенаправления во внешнюю программу или скрипт с опцией --pager.
Задание настроек по умолчанию в файле .ackrc в домашней директории и поддиректориях.
Поиск файлов по имени или регулярному выражению (опция -g). Это просто киллер-фича! Теперь от громоздкой и негибкой команды find . -name '*pattern*' можно отказаться. Или почти отказаться… Есть два тонких момента, даже три. Во-первых, имеются опции по умолчанию --ignore-file, которые можно вывести с помощью опции --dump: в них указано, какие файлы будут игнорироваться при поиске. В частности, туда должны входить файлы jpg, png и тому подобные. Чтобы отменить все настройки по умолчанию, в том числе настройки игнорирования файлов, нужно просто добавить опцию --ignore-ack-defaults. Но… Вряд ли после этого ack начнет находить jpg и png файлы! Потому что они бинарные! И это было во-вторых. Чтобы преодолеть это ограничение, нужно поступить так, как сказано здесь, то есть добавить еще две опции --type-set='all:match:.*' -k (см. также определение алиаса ackf в конце этой статьи). Ну а в-третьих, пустые директории, а также всякого рода потоки и другие не-файлы, искусственно привязанные к файловой системе (сокеты, FIFO и т.п.), ack будет игнорировать в любом случае.

Прежде чем перейти к плюсам cgrep, я отмечу ее серьезный недостаток по сравнению с ack. Это невозможность вывода контекста вокруг строк с найденными совпадениями. А теперь плюсы.

Скорость и возможность параллельного исполнения. Это все-таки haskell!
Поддержка разных типов поиска. Кроме дословного поиска и поиска совпадения с регулярным выражением (включая POSIX и PCRE), сюда входят префиксный и суффиксный поиски, а также поиск с учетом расстояния Левенштейна. Последний тип поиска позволяет находить похожие слова или слова с возможными ошибками.
Поддержка семантического поиска позволяет искать совпадения в коде, комментариях или строковых литералах отдельно (опции -c, -m и -l), ограничивать поиск специфическими типами идентификаторов, такими как ключевые слова, литералы, директивы препроцессора и операторы (наилучшим образом поддерживаются языки C и C++). Опция -S позволяет настроить семантический поиск с помощью специального семантического языка (простейшие примеры можно увидеть, запустив cgrep --help).
Как и в ack, можно настроить поиск только внутри исходников определенного языка (или языков). Список поддерживаемых языков и соответствующие им расширения имен файлов можно вывести с помощью опции --lang-maps. Чтобы включить в поиск определенный язык, нужно указать его в опции --lang, например, --lang=Cpp.
Как и в ack, можно гибко настроить формат вывода с помощью опции --format. Вывод подсветки задается, но сами цвета не настраиваются (в функции cgr, которую я приведу ниже, изменение цветов подсветки достигается с помощью обработки в sed).
Некоторые настройки по умолчанию можно записать в файле .cgreprc в домашней директории.

Нужно отметить, что поиск на основе регулярных выражений в cgrep отличается от того, как это реализовано в ack или в обычном grep. Если последние ищут совпадения в каждой отдельной строке ввода, то cgrep применяет шаблон ко всему входному тексту. Это может приводить к тонким семантическим отличиям для некоторых шаблонов. Например, класс символов \s включает переносы строк, а значит найденное совпадение после применения шаблона с \s может быть многострочным. Ситуация осложняется тем, что cgrep выведет только первую строку совпадения (а это уже баг, на мой взгляд), которая может не содержать значимой информации, тем самым дезориентируя пользователя. Поэтому я советую вместо \s использовать класс горизонтальных пробелов \h (правда, он доступен только в PCRE) — это будет гарантировать отсутствие переносов строк в найденном совпадении. Ниже я привожу картинку с выводом ack и cgrep на моем терминале, а также настройки и скрипты, которые позволяют добиться этого результата.

Файл .ackrc.

--color-match=rgb551
--color-filename=rgb342
--color-lineno=rgb233

--noheading
--pager=sed -e 's/\(.\+\)\([:-]\)\(\x1b\[38;5;109m[[:digit:]]\+\x1b\[0m\)\2/\1 ⎜ \3 ⎜ /'
            -e 's/\x1b/@eCHr@/g' | column -t -s⎜ -o⎜ |
        sed -e 's/@eCHr@/\x1b/g'
            -e 's/\(\s\+\)\(\x1b\[38;5;109m[[:digit:]]\+\x1b\[0m\)\(\s\+\)/\3\2\1/' |
        hl -u -255 -b '^--$' -rb -216 '\x{239C}'

Обратите внимание, я разбил последнюю строку с опцией --pager на пять отдельных строк для удобочитаемости. На самом деле это должна быть одна строка и разбивать ее на части нельзя! А это функция-обертка cgr (ее можно поместить в .bashrc).

function cgr
{
    `env which cgrep` --color --format='#f ⎜ #n ⎜ #l' -r "$@" |
            sed -e 's/\x1b\[1m\x1b\[94m/@eCHrF@/' \
                -e 's/\x1b\[1m/@eCHrB@/g' \
                -e 's/\x1b\[m/@eCHrE@/g' |
            column -t -s⎜ -o⎜ |
            sed -e 's/@eCHrF@/\x1b\[38;5;150m/' \
                -e 's/@eCHrB@/\x1b\[38;5;227m/g' \
                -e 's/@eCHrE@/\x1b\[m/g' \
                -e 's/\(\s\+\)\([[:digit:]]\+\)\(\s\+\)/\3\2\1/' |
            hl -u -73 '\h+\d+\h+(?=\x{239C})' -216 '\x{239C}'
}

(Здесь переносы допустимы). Программа hl доступна отсюда — она нужна для дополнительной подсветки колонки с номерами строк, я писал о ней здесь и здесь. Ну и алиас для поиска файлов с помощью ack.

alias ackf='ack --ignore-ack-defaults --type-set=all:match:. -k --color --nopager -g'

Update. Начиная с версии util-linux 2.28 программа column научилась игнорировать управляющие последовательности ANSI, поэтому ужасные вре́менные замены текста типа @eCHr@ в опции --pager из .ackrc и в функции cgr больше не нужны. Соответственно, теперь они будут выглядеть так (содержимое опции --pager должно по-прежнему находиться в одной строке).

--pager=sed 's/\(.\+\)\([:-]\)\(\x1b\[38;5;109m[[:digit:]]\+\x1b\[0m\)\2/\1 ⎜ \3 ⎜ /' |
        column -t -s⎜ -o⎜ |
        sed 's/\(\s\+\)\(\x1b\[38;5;109m[[:digit:]]\+\x1b\[0m\)\(\s\+\)/\3\2\1/' |
        hl -u -255 -b '^--$' -rb -216 '\x{239C}'

function cgr
{
    `env which cgrep` --color --format='#f ⎜ #n ⎜ #l' --no-column -r "$@" |
            column -t -s⎜ -o⎜ |
            sed -e 's/\x1b\[1;94m/\x1b\[38;5;150m/' \
                -e 's/\x1b\[1m/\x1b\[38;5;227m/g' \
                -e 's/\(\s\+\)\([[:digit:]]\+\)\(\s\+\)/\3\2\1/' |
            hl -u -73 '\h+\d+\h+(?=\x{239C})' -216 '\x{239C}'
}

Также обратите внимание на некоторые изменения в функции cgr, связанные с обновленным cgrep, в частности на новую опцию --no-column.

пятница, 18 сентября 2015 г.

nginx upstrand to configure super-layers of upstreams

Recently I equipped my nginx combined upstreams module with a new entity that I named upstrand. An upstrand is an nginx configuration block that can be defined in the http clause after upstream blocks. It was designed to combine upstreams in super-layers where all upstreams keep their identity and do not get flattened down to separate servers. This can be useful in multiple areas. Let me rephrase here an example from the module documentation. Imagine a geographically distributed system of backends to deliver tasks to an executor. Let the executor be implemented as a poller separated from the backends by an nginx router proxying requests from the poller to an arbitrary upstream in a list. These upstreams (i.e. geographical parts) may have multiple servers within. Let them send HTTP status 204 if they do not have any tasks at the moment. If an upstream has 10 servers and the first server sends 204 No tasks upon receiving a request from the poller then other 9 servers will presumably send the same status in a short time interval. The nginx upstreams are smart enough to skip checking all servers if the first server returns status 204: the status will be sent to the poller and then the poller must decide what to do next. This scheme has several shortcomings. Remember words arbitrary upstream that I intentionally emphasized in the previous paragraph? Nginx cannot choose arbitrary upstreams! It can combine several upstreams into a bigger upstream but in this case the poller may sequentially receive up to 10 204 responses from the same upstream trying to get a next task. The concrete value (10 or less) of repeated requests depends on how the bigger upstream combines separate servers from the geographical upstreams. On the other hand the poller may know the topology of the backends and send requests to concrete upstreams. I even do not want to explain how bad this is. And again it will send new requests when already polled upstreams have no tasks. What if encapsulate all the polling logic inside the proxy? In this case the proxy router would initiate a new request to another upstream itself if the previous server responds with 204 No tasks. It would eventually send back to the poller a new task or the final 204 status when there were no tasks in all upstreams. The single request from the poller and no need for knowledge about the backends topology downstream the router! This was a good example of what the new upstrand thing can accomplish. Let’s look at a simple nginx configuration that involves upstrands.

worker_processes  1;
error_log  /var/log/nginx/error.log  info;

events {
    worker_connections  1024;
}

http {
    upstream u01 {
        server localhost:8020;
    }
    upstream u02 {
        server localhost:8030;
    }
    upstream b01 {
        server localhost:8040;
    }
    upstream b02 {
        server localhost:8050;
    }
    
    upstrand us1 {
        upstream ~^u0;
        upstream b01 backup;
        order start_random;
        next_upstream_statuses 204 5xx;
    }
    upstrand us2 {
        upstream ~^u0;
        upstream b02 backup;
        order start_random;
        next_upstream_statuses 5xx;
    }

    server {
        listen       8010;
        server_name  main;

        location /us1 {
            proxy_pass http://$upstrand_us1;
        }
        location /us2 {
            rewrite ^ /index.html last;
        }
        location /index.html {
            proxy_pass http://$upstrand_us2;
        }
    }
    server {
        listen       8020;
        server_name  server01;

        location / {
            return 503;
        }
    }
    server {
        listen       8030;
        server_name  server02;

        location / {
            return 503;
        }
    }
    server {
        listen       8040;
        server_name  server03;

        location / {
            echo "In 8040";
        }
    }
    server {
        listen       8050;
        server_name  server04;

        location / {
            proxy_pass http://rsssf.com/;
        }
    }
}

Upstrand us1 imports 2 upstreams u01 and u02 using a regular expression (directive upstream regards names as regular expressions when they start with tilde) and backup upstream b01. While searching through the upstreams for a response status that may be sent to the client the upstrand performs 2 cycles: normal and backup. Backup cycle starts only when all upstreams in the normal cycle responded with unacceptable statuses. Directive order start_random indicates that the first upstream in the upstrand after the worker started up will be chosen randomly, next upstreams will follow the round-robin order. Directive next_upstream_statuses lists HTTP response statuses that will sign to the router to postpone the response and send the request to the next upstream. The directive accepts 4xx and 5xx statuses notation. If all servers in normal and backup cycles responded with unacceptable statuses (like 204 No tasks in the above example) the last response is sent to the client. Upstrand us2 does not differ a lot. It imports same u01 and u02 upstreams in normal cycle and b02 in backup cycle. Each upstream in this configuration has a single server. Servers from upstreams u01 and u02 simply return status 503, server from b01 says In 8040, server from b02 proxies the request to a good old days site rsssf.com (which is a great place for soccer stats!) that I chose because it won’t redirect to https and will send back a huge response: a nice thing to test if buffering of the response won’t break it while going through the proxy router and filters of the combined upstreams module. Let’s look how it works. Start nginx with the above configuration and a sniffer on the loopback interface (better in another terminal).

nginx -c /path/to/our/nginx.conf
ngrep -W byline -d lo '' tcp

Run a client for location /us1.

curl 'http://localhost:8010/us1'
In 8040

Nice. We are in the backup upstream b01. However it shows too little information. That’s why we ran a sniffer. Here is its output.

######
T 127.0.0.1:44070 -> 127.0.0.1:8010 [AP]
GET /us1 HTTP/1.1.
User-Agent: curl/7.40.0.
Host: localhost:8010.
Accept: */*.
.

#####
T 127.0.0.1:37341 -> 127.0.0.1:8030 [AP]
GET /us1 HTTP/1.0.
Host: u02.
Connection: close.
User-Agent: curl/7.40.0.
Accept: */*.
.

##
T 127.0.0.1:8030 -> 127.0.0.1:37341 [AP]
HTTP/1.1 503 Service Temporarily Unavailable.
Server: nginx/1.8.0.
Date: Fri, 18 Sep 2015 12:58:01 GMT.
Content-Type: text/html.
Content-Length: 212.
Connection: close.
.
<html>.
<head><title>503 Service Temporarily Unavailable</title></head>.
<body bgcolor="white">.
<center><h1>503 Service Temporarily Unavailable</h1></center>.
<hr><center>nginx/1.8.0</center>.
</body>.
</html>.

########
T 127.0.0.1:50128 -> 127.0.0.1:8020 [AP]
GET /us1 HTTP/1.0.
Host: u01.
Connection: close.
User-Agent: curl/7.40.0.
Accept: */*.
.

##
T 127.0.0.1:8020 -> 127.0.0.1:50128 [AP]
HTTP/1.1 503 Service Temporarily Unavailable.
Server: nginx/1.8.0.
Date: Fri, 18 Sep 2015 12:58:01 GMT.
Content-Type: text/html.
Content-Length: 212.
Connection: close.
.
<html>.
<head><title>503 Service Temporarily Unavailable</title></head>.
<body bgcolor="white">.
<center><h1>503 Service Temporarily Unavailable</h1></center>.
<hr><center>nginx/1.8.0</center>.
</body>.
</html>.

########
T 127.0.0.1:55270 -> 127.0.0.1:8040 [AP]
GET /us1 HTTP/1.0.
Host: b01.
Connection: close.
User-Agent: curl/7.40.0.
Accept: */*.
.

##
T 127.0.0.1:8040 -> 127.0.0.1:55270 [AP]
HTTP/1.1 200 OK.
Server: nginx/1.8.0.
Date: Fri, 18 Sep 2015 12:58:01 GMT.
Content-Type: text/plain.
Connection: close.
.
In 8040

#####
T 127.0.0.1:8010 -> 127.0.0.1:44070 [AP]
HTTP/1.1 200 OK.
Server: nginx/1.8.0.
Date: Fri, 18 Sep 2015 12:58:01 GMT.
Content-Type: text/plain.
Transfer-Encoding: chunked.
Connection: keep-alive.
.
8.
In 8040
.
0.
.

The client sent the request to port 8010 (our router’s frontend). Then the router started the normal cycle: it proxied the request to the upstream u02 (8030 server) which responded with unacceptable status 503, after that the router tried the next upstream u01 from the normal cycle with the only 8020 server and received 503 status response again. Finally the router started backup cycle and received response In 8040 with HTTP status 200 from 8040 server that belonged to upstream b01, This response was sent back to the client and shown on the terminal. I won’t show results of testing location /us2: the response was very large while the scenario is essentially the same. To search through the sniffer output is boring. Let’s better show what we are visiting during waiting for response on the terminal. An addition of a simple location

        location /echo/us1 {
            echo $upstrand_us1;
        }

into the frontend 8010 server will make our testing easy. (Beware that older versions of the echo module compiled for nginx-1.8.0 will behave badly in the next tests! I used latest version v0.58 and it worked fine.) Restart nginx and run some curls again.

curl 'http://localhost:8010/echo/us1'
u02
curl 'http://localhost:8010/echo/us1'
u01
curl 'http://localhost:8010/echo/us1'
u02
curl 'http://localhost:8010/echo/us1'
u01
curl 'http://localhost:8010/echo/us1'
u02

Hmm. Servers from the normal cycle following each other in the round-robin manner. Upstrand us1 must return response from the first upstream in normal cycle whose status is not listed in next_upstream_statuses. Directive echo normally returns 200 and this status is not present in the list. Thus the list of the upstreams shown on the terminal corresponds to first upstreams in the normal cycle: they start randomly and proceed in the round-robin sequence. Let’s now show the last upstreams chosen during the two cycles. As soon as servers from upstreams u01 and u02 in the normal cycle will always return unacceptable status 503 we must expect that upstream b01 from backup cycle will always be the last. To check this we will add status 200 which is normally returned by echo into the list of directive next_upstream_statuses in the upstrand us1.

    upstrand us1 {
        upstream ~^u0;
        upstream b01 backup;
        order start_random;
        next_upstream_statuses 200 204 5xx;
    }

Run curls.

curl 'http://localhost:8010/echo/us1'
b01
curl 'http://localhost:8010/echo/us1'
b01
curl 'http://localhost:8010/echo/us1'
b01

Nice, as expected. And now let’s show all upstreams the router tries before sending the last response. For such a task there is another upstrand block directive debug_intermediate_stages (do not use it in other cases besides testing because it is not designed for normal usage in upstrands).

    upstrand us1 {
        upstream ~^u0;
        upstream b01 backup;
        order start_random;
        next_upstream_statuses 200 204 5xx;
        debug_intermediate_stages;
    }

Restart nginx and run curls again.

curl 'http://localhost:8010/echo/us1'
b01
u02
u01
curl 'http://localhost:8010/echo/us1'
b01
u01
u02
curl 'http://localhost:8010/echo/us1'
b01
u02
u01
curl 'http://localhost:8010/echo/us1'
b01
u01
u02

Looks interesting. The last upstream is shown first and the first upstream is shown last. Do not worry, this is an artefact of the upstrand implementation. And now I want to say a little bit about the implementation. All standard and third-party directives like proxy_pass and echo that magically traverse upstreams in an upstrand refer to a magic variable whose name starts with upstrand_ and ends with the upstrand name. Such magic variables are created automatically in the configuration handler ngx_http_upstrand_block() for each declared upstrand. The handler of these variables ngx_http_upstrand_variable() creates the combined upstreams module’s request context with imported from the module’s main configuration indices of the next upstreams in both normal and backup cycles and shifts the main configuration’s indices forward for the future requests. The module installs response header and body filters. The header filter checks whether the response context exists (i.e. an upstrand variable was accessed) and if so it may start up a subrequest: this depends on the response status of the main request or the previous subrequest if they have been already launched. In nginx a subrequest runs the location’s content handler again. Remember that the both proxy and echo directives referred to our magic variable? This means that their content handlers accessed the variable’s handler on every subrequest because the upstrand variables had attribute nocacheable. The variable’s handler might feed the subrequests with the shifted index of the next upstream in the upstrand after checking whether to move to the backup cycle or finish the subrequests when all upstreams in the both cycles had been exhausted. In the latter case the final response was extracted from the subrequests’ response headers and body buffer chains and returned to the client. Want to know how? In case when the subrequests have been launched, original headers of the main request are replaced by headers of the last subrequest. Thanks to running subrequests from the header filter the response bodies nest in reverse order (this explains why our last test listed visited upstreams from last to first). This makes possible to break feeding next body filters with older responses bodies following the last response body and thus to return the body of the last subrequest. The beginning of the older response bodies is recognized by the last buffer in the last response’s buffer chain. Replacing the main request output headers and the response body buffer chain by those from the last response does the trick.

воскресенье, 27 декабря 2015 г.