Sunday, April 12, 2020

Recursive calculations in Nginx with recursive subrequests from NgxExport.Tools.Subrequest

Disclaimer. Do not use the recursive calculations demonstrated below in production code: they consume Nginx and system resources very heavily. Recursion is always attractive but dangerous when implemented naively! Here I want to show something weird and beautiful. Module NgxExport.Tools.Subrequest provides natural HTTP subrequests running inside normal client requests. When directed at the Nginx server itself (e.g. via 127.0.0.1), they can form recursive flows. Which means… yes, we can calculate things like factorials and Fibonacci numbers simply by making nested subrequests to the server! Let’s calculate them. First, we have to install the latest version of package ngx-export-tools-extra with Cabal.
cabal v1-install ngx-export-tools-extra
(You may prefer the new-style v2-install instead.) Then write a bit of Haskell code.
{-# OPTIONS_GHC -Wno-unused-imports #-}

{-# LANGUAGE TemplateHaskell, TypeApplications #-}

module NgxSubrequestRecursion where

import           NgxExport
import           NgxExport.Tools
import           NgxExport.Tools.Subrequest

import           Data.ByteString (ByteString)
import qualified Data.ByteString.Lazy as L
import qualified Data.ByteString.Lazy.Char8 as C8L

showAsLazyByteString :: Show a => a -> L.ByteString
showAsLazyByteString = C8L.pack . show

type IntegerTuple = (Integer, Integer)

multiply :: ByteString -> L.ByteString
multiply = showAsLazyByteString .
    maybe 0 (uncurry (*)) . readFromByteString @IntegerTuple

ngxExportYY 'multiply

add :: ByteString -> L.ByteString
add = showAsLazyByteString .
    maybe 0 (uncurry (+)) . readFromByteString @IntegerTuple

ngxExportYY 'add

prev :: ByteString -> L.ByteString
prev = showAsLazyByteString .
    maybe 0 pred . readFromByteString @Integer

ngxExportYY 'prev
Besides the asynchronous handler makeSubrequest imported from module NgxExport.Tools.Subrequest, module NgxSubrequestRecursion exports a few simple arithmetic handlers (multiply, add, and prev) that will serve as auxiliary functions in the recursive algorithms.
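These handlers read their arguments with readFromByteString, so each expects the shown form of a Haskell value: a pair of integers for multiply and add, and a single integer for prev; this is exactly what the Nginx configuration below passes in, e.g. '($arg_v, $hs_subrequest)'. A quick GHCi check of the same parsing logic (a sketch using readMaybe, which for these inputs parses the same textual representation as readFromByteString) illustrates the expected formats.
ghci> import Text.Read (readMaybe)
ghci> maybe 0 (uncurry (*)) (readMaybe "(4, 6)" :: Maybe (Integer, Integer))
24
ghci> maybe 0 (uncurry (+)) (readMaybe "(4, 6)" :: Maybe (Integer, Integer))
10
ghci> maybe 0 pred (readMaybe "10" :: Maybe Integer)
9
Now let’s compile the module.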
ghc -O2 -dynamic -shared -fPIC -lHSrts_thr-ghc$(ghc --numeric-version) ngx_subrequest_recursion.hs -o ngx_subrequest_recursion.so
Gather all dependent libraries (see Utility hslibdeps).
hslibdeps -t /var/lib/nginx/x86_64-linux-ghc-8.8.3 ngx_subrequest_recursion.so
And finally, as a superuser, install all the artifacts into directory /var/lib/nginx.
cp .hslibs/libHS* /var/lib/nginx/x86_64-linux-ghc-8.8.3/
cp ngx_subrequest_recursion.so /var/lib/nginx/
Now let’s turn to the calculation of factorials and Fibonacci numbers based on recursive subrequests spawned right out of the Nginx configuration. Remember how the recursive equations for factorials and Fibonacci numbers are defined?

Factorial: $F_0 = 1$, $F_n = F_{n-1} \cdot n$.

Fibonacci numbers: $F_0 = 0$, $F_1 = 1$, $F_n = F_{n-1} + F_{n-2}$.

We are going to encode them inside the configuration!
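But first, for reference, here is a naive Haskell transcription of the same rules. This is only a sketch to make the recursion explicit; it plays no part in the Nginx setup.
-- Naive recursive definitions, mirroring the equations above
factorial :: Integer -> Integer
factorial 0 = 1
factorial n = factorial (n - 1) * n

fibonacci :: Integer -> Integer
fibonacci 0 = 0
fibonacci 1 = 1
fibonacci n = fibonacci (n - 1) + fibonacci (n - 2)
And now the Nginx configuration itself.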
user                    nobody;
worker_processes        2;

events {
    worker_connections  1024;
}

http {
    default_type        application/octet-stream;
    sendfile            on;

    haskell load /var/lib/nginx/ngx_subrequest_recursion.so;

    haskell_var_empty_on_error $hs_subrequest, $hs_subrequest2;

    server {
        listen       8010;
        server_name  main;
        error_log    /tmp/nginx-test-haskell-error.log;
        access_log   /tmp/nginx-test-haskell-access.log;

        location = /Factorial {
            rewrite ^ /Calculate/Factorial last;
        }

        location = /Fibonacci {
            rewrite ^ /Calculate/Fibonacci last;
        }

        location ~ ^/Calculate/(.*) {
            internal;

            set $algorithm $1;

            if ($arg_v !~ '^\d+$') {
                return 400;
            }

            if ($arg_v = 1) {
                echo 1;
                break;
            }

            set $check0 $arg_v$algorithm;

            if ($check0 = 0Factorial) {
                echo 1;
                break;
            }

            if ($check0 = 0Fibonacci) {
                echo 0;
                break;
            }

            rewrite ^ /Internal/$algorithm last;
        }

        location = /Internal/Factorial {
            internal;

            haskell_run prev $hs_prev_v $arg_v;

            haskell_run_async makeSubrequest $hs_subrequest
                    '{"uri": "http://127.0.0.1:8010/Factorial?v=$hs_prev_v"}';

            haskell_run multiply $hs_result '($arg_v, $hs_subrequest)';

            if ($hs_result = 0) {
                echo_status 500;
                echo "Error while getting factorial $arg_v";
                break;
            }

            echo $hs_result;
        }

        location = /Internal/Fibonacci {
            internal;

            haskell_run prev $hs_prev_v $arg_v;
            haskell_run prev $hs_prev2_v $hs_prev_v;

            haskell_run_async makeSubrequest $hs_subrequest
                    '{"uri": "http://127.0.0.1:8010/Fibonacci?v=$hs_prev_v"}';
            haskell_run_async makeSubrequest $hs_subrequest2
                    '{"uri": "http://127.0.0.1:8010/Fibonacci?v=$hs_prev2_v"}';

            haskell_run add $hs_result '($hs_subrequest, $hs_subrequest2)';

            if ($hs_result = 0) {
                echo_status 500;
                echo "Error while getting Fibonacci number $arg_v";
                break;
            }

            echo $hs_result;
        }
    }
}
There are two entry points: locations /Factorial and /Fibonacci. They rewrite to the internal location /Calculate which filters out bad requests (when the value of variable $arg_v is not a natural number) and handles the base cases of the recursion rules. If the number passed in $arg_v is not a base case, the request gets rewritten to location /Internal/Factorial or /Internal/Fibonacci where calls to makeSubrequest recurse over the same locations until the base cases are reached. Let’s calculate basic factorials and Fibonacci numbers.
curl 'http://localhost:8010/Factorial?v=0'
1
curl 'http://localhost:8010/Factorial?v=1'
1
curl 'http://localhost:8010/Fibonacci?v=0'
0
curl 'http://localhost:8010/Fibonacci?v=1'
1
Factorials of big numbers.
time curl 'http://localhost:8010/Factorial?v=10'
3628800

real    0m0,015s
user    0m0,005s
sys     0m0,006s
If you look into the access log /tmp/nginx-test-haskell-access.log, you will see 10 lines: the original request (v=10, made by curl) and its 9 nested subrequests.
127.0.0.1 - - [11/Apr/2020:23:42:29 +0300] "GET /Factorial?v=1 HTTP/1.1" 200 12 "-" "-"
127.0.0.1 - - [11/Apr/2020:23:42:29 +0300] "GET /Factorial?v=2 HTTP/1.1" 200 12 "-" "-"
127.0.0.1 - - [11/Apr/2020:23:42:29 +0300] "GET /Factorial?v=3 HTTP/1.1" 200 12 "-" "-"
127.0.0.1 - - [11/Apr/2020:23:42:29 +0300] "GET /Factorial?v=4 HTTP/1.1" 200 13 "-" "-"
127.0.0.1 - - [11/Apr/2020:23:42:29 +0300] "GET /Factorial?v=5 HTTP/1.1" 200 14 "-" "-"
127.0.0.1 - - [11/Apr/2020:23:42:29 +0300] "GET /Factorial?v=6 HTTP/1.1" 200 14 "-" "-"
127.0.0.1 - - [11/Apr/2020:23:42:29 +0300] "GET /Factorial?v=7 HTTP/1.1" 200 15 "-" "-"
127.0.0.1 - - [11/Apr/2020:23:42:29 +0300] "GET /Factorial?v=8 HTTP/1.1" 200 16 "-" "-"
127.0.0.1 - - [11/Apr/2020:23:42:29 +0300] "GET /Factorial?v=9 HTTP/1.1" 200 17 "-" "-"
127.0.0.1 - - [11/Apr/2020:23:42:29 +0300] "GET /Factorial?v=10 HTTP/1.1" 200 18 "-" "curl/7.66.0"
time curl 'http://localhost:8010/Factorial?v=100'
93326215443944152681699238856266700490715968264381621468592963895217599993229915608941463976156518286253697920827223758251185210916864000000000000000000000000

real    0m0,124s
user    0m0,011s
sys     0m0,010s
Subrequests are spawned sequentially (the time to calculate factorial 100 is roughly 10 times that of factorial 10), but they are still asynchronous. So, when run in a bunch, they may well hit the limit of open file descriptors.
for i in {1..8} ; do curl 'http://localhost:8010/Factorial?v=100' & done
Error while getting factorial 100
Error while getting factorial 100
Error while getting factorial 100
Error while getting factorial 100
93326215443944152681699238856266700490715968264381621468592963895217599993229915608941463976156518286253697920827223758251185210916864000000000000000000000000
93326215443944152681699238856266700490715968264381621468592963895217599993229915608941463976156518286253697920827223758251185210916864000000000000000000000000
93326215443944152681699238856266700490715968264381621468592963895217599993229915608941463976156518286253697920827223758251185210916864000000000000000000000000
93326215443944152681699238856266700490715968264381621468592963895217599993229915608941463976156518286253697920827223758251185210916864000000000000000000000000
If you look into the error log /tmp/nginx-test-haskell-error.log, you will find a great number of HttpExceptionRequest errors with messages Error while getting factorial …, but more interesting are the messages failed to create async event channel for future async result, skipping IO task (24: Too many open files). There should be four such messages: they are the real cause of the four errors.

We can estimate the number of open file descriptors needed for running 8 simultaneous calculations of factorial 100. Every request creates 99 nested subrequests ($1 + 99 = 100$ requests in total), and each of them creates 2 TCP sockets (one for the Haskell HTTP client and one for the Nginx HTTP server) plus 1 eventfd channel for Haskell/Nginx communication. Altogether that is $3 \cdot 100 = 300$ open file descriptors for a single request at the peak, or $8 \cdot 300 = 2400$ for 8 simultaneous calculations, which is more than twice the value of worker_connections we set (1024).

The same story holds for Fibonacci numbers: each sub-calculation is sequential and nevertheless asynchronous. This means that the bigger the number, the longer the whole calculation takes; on the other hand, independent calculations run in parallel and may exhaust open file descriptors as quickly as the factorial calculations.
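The descriptor estimate above fits in a tiny Haskell helper; this is purely illustrative (the helper and its name peakFds exist only in this sketch) and uses the 3-descriptors-per-request figure derived above.
-- Rough peak number of open file descriptors: 2 TCP sockets plus
-- 1 eventfd channel per in-flight request, times the recursion depth,
-- times the number of parallel top-level calculations.
peakFds :: Int -> Int -> Int
peakFds depth parallel = 3 * depth * parallel

-- peakFds 100 8 == 2400, well above the configured worker_connections (1024)
Now let’s time a couple of Fibonacci calculations.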
time curl 'http://localhost:8010/Fibonacci?v=10'
55

real    0m0,113s
user    0m0,016s
sys     0m0,005s
time curl 'http://localhost:8010/Fibonacci?v=20'
6765

real    0m7,539s
user    0m0,007s
sys     0m0,004s
Running the computations in parallel brings a real TCP storm.
for i in {1..8} ; do curl 'http://localhost:8010/Fibonacci?v=20' & done
6765
6765
6765
6765
6765
6765
6765
6765
This added 175128 new records to the access log! Notice that this time there were no errors, since recursion depth 20 is not enough to exhaust the open file descriptors.
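As a cross-check of that number: the count $T_n$ of requests needed to compute the $n$-th Fibonacci number satisfies $T_0 = T_1 = 1$ and $T_n = T_{n-1} + T_{n-2} + 1$, which solves to $T_n = 2 F_{n+1} - 1$. So a single calculation of $F_{20}$ produces $T_{20} = 2 \cdot 10946 - 1 = 21891$ logged requests (including the top-level one), and $8 \cdot 21891 = 175128$.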

Monday, December 16, 2019

A new keyboard layout switcher g3kb-switch for Gnome 3 and vim-xkbswitch

People say (and I have verified it) that xkb-switch no longer works in Gnome Shell (and probably has not for a long time already). So I decided to write an analogue of it for this popular environment. It is called g3kb-switch and behaves almost exactly like xkb-switch. Just like xkb-switch, it can be set up to work with vim-xkbswitch. In the basic case this requires only two lines in file .vimrc.
let g:XkbSwitchEnabled = 1
let g:XkbSwitchLib = '/usr/local/lib/libg3kbswitch.so'
The implementation of g3kb-switch differs greatly from xkb-switch: it does not operate at the level of the X protocol, but performs remote actions in Gnome Shell via synchronous calls sent over the D-Bus message bus.

Monday, May 20, 2019

Exploring Nginx workers load arbitration

In the documentation of Haskell module NgxExport.Tools.Aggregate there is a small example of how to set up monitoring of the Nginx workers' load. In a few words, it is possible to run an internal server that sits on an arbitrary Nginx worker process and accumulates various data from all the workers. This data can be retrieved via an interface configured in the Nginx configuration script. In the example, the internal server collects the number of requests and the number of bytes sent back to clients from each worker process. The data is accessible in JSON format via a dedicated virtual server listening on port 8020. Say, to retrieve the current load, we can simply use curl and jq (to pretty-print the JSON).

curl 'http://127.0.0.1:8020/' | jq
[
  "2019-04-22T14:29:04Z",
  {
    "5910": [
      "2019-04-22T14:31:34Z",
      {
        "bytesSent": 17751,
        "requests": 97,
        "meanBytesSent": 183
      }
    ],
    "5911": [
      "2019-04-22T14:31:31Z",
      {
        "bytesSent": 549,
        "requests": 3,
        "meanBytesSent": 183
      }
    ]
  }
]

Now I want to show how to retrieve this data repeatedly and render it on an interactive dashboard online using R and a Shiny application with plotly.

Below is the annotated code saved in file load_monitoring.r.

library(shiny)
library(plotly)
library(jsonlite)

ui <- fluidPage(
    fluidRow(
        headerPanel(h1("Nginx workers load arbitration", align = "center"),
                    "Nginx workers load arbitration")),
    fluidRow(
        wellPanel(div(align = "center",
                      div(style = "display: inline-block; margin-right: 20px",
                          textInput("i_url", NULL, "http://127.0.0.1:8020/",
                                    width = "200px")), span(),
                      div(style = "display: inline-block; margin-right: 20px",
                          radioButtons("rb_mode", NULL,
                                       c("Requests" = "reqs",
                                         "Bytes_sent" = "bsent"),
                                       selected = "reqs", inline = TRUE)),
                      div(style = "display: inline-block",
                          actionButton("b_reset", "Reset Traces"))))),
    fluidRow(
        div(plotlyOutput("plot"), id = "graph"))
)

server <- function(input, output, session) {
    values <- reactiveValues()
    values$init <- TRUE
    values$load <- list("", list())
    values$pids_prev <- list()

    observe({
            invalidateLater(5000, session)

            if (class(values$load) != "try-error") {
                values$pids_prev <- names(values$load[[2]])
            }
            values$load <- try(fromJSON(input$i_url))

            m <- `if`(input$rb_mode == "reqs", 2, 1)

            if (class(values$load) == "try-error" ||
                length(values$load[[2]]) == 0) {
                invalidateLater(1000, session)
            } else {
                pids <- names(values$load[[2]])

                if (values$init) {
                    values$init <- FALSE
                    values$p <- plot_ly(type = "scatter",
                                        mode = "lines",
                                        colors = "YlOrRd")

                    for (i in 1:length(values$load[[2]])) {
                        xs <- as.POSIXct(values$load[[2]][[i]][[1]],
                                         format = "%Y-%m-%dT%H:%M:%S")
                        values$p <- add_trace(values$p,
                            name = paste(pids[i],
                                         names(values$load[[2]][[i]][[2]][m])),
                            x = xs,
                            y = values$load[[2]][[i]][[2]][[m]],
                            line = list(width = 2)) %>%
                        add_annotations(
                            x = xs,
                            y = values$load[[2]][[i]][[2]][[m]],
                            text = "<span />",
                            showarrow = TRUE, arrowcolor = "#bbb")
                    }
                    values$p <- layout(values$p, yaxis = list(range = 0))

                    output$plot <- renderPlotly(values$p)
                } else {
                    vs <- list()
                    xs <- list()
                    ts <- list()
                    len <- length(pids)

                    if (length(names(values$load[[2]])) !=
                            length(values$pids_prev) ||
                        length(setdiff(names(values$load[[2]]),
                                       values$pids_prev)) > 0) {
                        invalidateLater(1000, session)
                        values$init <- TRUE
                    } else {
                        i <- 1
                        while (i <= len) {
                            vs[[i]] <- list(values$load[[2]][[i]][[2]][[m]])
                            xs[[i]] <- list(values$load[[2]][[i]][[1]])
                            ts[[i]] <- i
                            i <- i + 1
                        }

                        plotlyProxy("plot", session) %>%
                            plotlyProxyInvoke("extendTraces",
                                              list(x = xs, y = vs), ts)
                    }
                }
            }
        }
    )

    observeEvent(input$b_reset, {
            values$init <- TRUE
        }
    )

    observeEvent(input$rb_mode, {
            values$init <- TRUE
        }
    )
}

In lines 1–3 all required libraries are loaded: shiny for UI, plotly for interactive plotting, and jsonlite for reading JSON data. The user interface is built in lines 5–23. It consists of the header (lines 6–8), a panel with control widgets (lines 9–20), and the plot area (lines 21–22).

In lines 25–110, the Shiny server that renders the data online is defined. In lines 26–29 a number of reactive values are declared and initialized: they will be used in the reactive observer observe defined in lines 31–99. The observer runs every 5 seconds (line 32), retrieving new data from Nginx (line 37) and plotting it on the dashboard in case of success (lines 44–97). If the retrieval fails, then observe re-runs after 1 second (lines 41–43).

There are two branches of execution on successful data retrieval: initialization of the plot (lines 47–70) and extending the traces (lines 71–97). The initialization triggers when the reactive value values$init is TRUE: in this phase the plotly object values$p gets initialized in calls to plot_ly, add_trace, and add_annotations. The plot type is set to scatter, its mode is lines, and the color palette for traces is YlOrRd (lines 49–51). The traces and the annotations are added in a loop (lines 53–67) whose number of iterations depends on the JSON data saved in values$load and corresponds to the number of Nginx worker processes. The annotations are empty strings (they have value <span />): they are only needed for drawing arrows at the beginnings of the traces.

Extending traces comes after the initialization phase, and only if the PIDs of the Nginx worker processes (found in line 45) did not change since the previous data retrieval (this is checked in lines 77–82). If the PIDs have changed, the plot will be redrawn in 1 second (lines 81–82). A more graceful solution would be to add new traces dynamically, without redrawing the whole plot; however, this would be more challenging for such a simple example and not very useful, given that the workers’ PIDs should not change often in the normal case. So, when the PIDs do not change, the traces get extended with new values (lines 83–95).

In lines 101–109 actions for the Reset button and the Mode radio-button are defined: they simply reset the value of values$init to TRUE in order to redraw the plot.

Let’s start Nginx with this configuration.

nginx -c /path/to/nginx.conf
ps -ef | grep nginx
root     29516     1  0 13:01 ?        00:00:00 nginx: master process nginx -c /path/to/nginx.conf
nobody   29523 29516  1 13:01 ?        00:00:00 nginx: worker process
nobody   29524 29516  1 13:01 ?        00:00:00 nginx: worker process

The PIDs of the worker processes are 29523 and 29524. Later we should see them on the plot.

Now let’s start an R shell and run the server.

source("load_monitoring.r")
shinyApp(ui, server)

Listening on http://127.0.0.1:6678

The application will open in a browser. Now run a number of requests to the Nginx server from a shell.

for i in {1..100} ; do curl 'http://127.0.0.1:8010/' & done
  ...
for i in {1..100} ; do curl 'http://127.0.0.1:8010/' & done
  ...

The application in the browser should look like the image below.

The plot gets updated in real time. The address of the Nginx stats server and types of traces can be altered using control widgets on the grayish panel above the plot.

Sunday, March 3, 2019

Offline visualization of geolocation data from Statcounter logs with R

Statcounter is a nice web traffic analysis tool. It collects ISP and geolocation data of visitors of a tracked site. The data is logged on the Statcounter site and can be downloaded by the tracked site’s owner in XLSX or CSV format. In this article I want to show how I managed to visualize geolocation data from the CSV log using R. First of all, I have to admit that I am a newbie in R. I’ve been using Statcounter for years, but have only one month of experience in R. All my older scripts were written in awk, perl and bash: they could download and merge Statcounter logs and do some basic data visualization such as plotting bar charts with Gnuplot. I will refer to some of them below in this article; you can find them in my project statcounter-utils. I discovered R when I decided to put all visits of my blog on a world map. I checked out a plethora of complete GIS solutions, but they all seemed to me unnecessarily heavy-weight and rigid. Then I read somewhere on Stackoverflow about Leaflet and R. This was an excellent finding, because it promised programming, and I love to program! Below is the annotated solution for putting the visits on a world map, with examples (in the statcounter-utils, the R code is located in a file named cities.r).
library(leaflet)
library(htmltools)
library(plyr)
library(dplyr)
library(tidyr)

cities <- function(gcities, geocode, len = as.integer(.Machine$integer.max),
                   FUN = function(x) TRUE) {
    if (!is.data.frame(gcities)) {
        gcities <- read.csv(gcities, header = TRUE, sep = ";", as.is = TRUE)
    }
    geocode <- read.csv(geocode, header = TRUE, sep = ";", as.is = TRUE,
                        na.strings = "null")

    d <- merge(gcities, geocode, 1:3)
    d <- d[order(-d$Count), ]
    d <- d[!is.na(d$Longitude) & FUN(d), ]

    m <- leaflet() %>% addTiles()

    dh <- head(d, len)

    nrow <- nrow(dh)
    if (nrow == 0) {
        print("No cities to render", quote = FALSE)
        return(m)
    }

    color <- c("#FF3300", "#FF9900", "#0033FF", "#666666")
    #           Country    Region     City       Unknown location

    dh$nc <- case_when(
                 nzchar(dh$City) ~ paste0(htmlEscape(dh$City), color[3]),
                 nzchar(dh$Region) ~ paste0(htmlEscape(dh$Region), color[2]),
                 nzchar(dh$Country) ~ paste0(htmlEscape(dh$Country), color[1]),
                 TRUE ~ paste0(htmlEscape("<UNKNOWN LOCATION>"), color[4]))

    dh <- separate(dh, "nc", c("Name", "Color"), -7)

    m <- addCircleMarkers(m, lng = dh$Longitude, lat = dh$Lattitude,
                          color = dh$Color, radius = 5 * log(dh$Count, 10),
                          popup = paste(dh$Name, ",", dh$Count),
                          label = dh$Name)

    m <- addLegend(m, "bottomright",
                   colors = c(circle_marker_to_legend_color(color[3]),
                              circle_marker_to_legend_color(color[2]),
                              circle_marker_to_legend_color(color[1])),
                   labels = c("City", "Region", "Country"),
                   opacity = 0.5)

    print(paste(nrow, "cities rendered"), quote = FALSE)

    return(m)
}


cities_df <- function(statcounter_log_csv, cities_spells_filter_awk = NULL,
                      warn_suspicious = TRUE, type = "page view") {
    df <- read.csv(`if`(is.null(cities_spells_filter_awk),
                        statcounter_log_csv,
                        pipe(paste("awk -f", cities_spells_filter_awk,
                                   `if`(warn_suspicious,
                                        "-v warn_suspicious=yes", NULL),
                                   statcounter_log_csv))),
                   header = TRUE, sep = ",", quote = "\"", as.is = TRUE)

    if (!is.null(type)) {
        df <- df[df$Type == type, ]
    }

    return(df)
}

gcities <- function(cs) {
    d <- plyr::count(cs, c("Country", "Region", "City"))
    names(d)[4] <- "Count"

    return(d[order(-d$Count), ])
}

circle_marker_to_legend_color <- function(color,
                                          marker_opacity = 0.3,
                                          stroke_opacity = 0.7,
                                          stroke_width = "medium") {
    c <- col2rgb(color)
    cv <- paste("rgba(", c[1], ", ", c[2], ", ", c[3], ", ", sep = "")

    return(paste(cv, marker_opacity, "); border-radius: 50%; border: ",
                 stroke_width, " solid ", cv, stroke_opacity,
                 "); box-sizing: border-box", sep = ""))
}
In lines 1–5 all required libraries are loaded: leaflet for the map, htmltools for function htmlEscape, plyr for function count, dplyr for function case_when, and tidyr for function separate.

Function cities (lines 7–55) renders locations (not only cities, but also regions and countries, as hinted by the legend of the map) on an interactive world map that opens in a browser window. The function accepts a list of cities gcities tagged with a count. The gcities can be a file name or a data frame. Geolocation data for the locations found in gcities is expected to be passed in parameter geocode: this can only be a CSV file. The other two parameters, len and FUN, define how many top cities to put on the map and a custom subsetting function.

Files for parameters gcities and geocode can be obtained with bash utility group_cities which extracts geolocation data from a Statcounter log and groups cities by count. For geocoding, group_cities makes use of the Python Geocoder. Script group_cities can be found in the statcounter-utils.
group_cities -f cities_spells_fix.awk StatCounter-Log.csv > gcities.csv
group_cities -g -f cities_spells_fix.awk StatCounter-Log.csv > geocode.csv
A sample script cities_spells_fix.awk can also be found in the statcounter-utils. It is a manually crafted database of city and region synonyms, various transcriptions, misspellings, and apparent errors met in Statcounter logs: the script collapses all variants of a single location to a single value. Files gcities.csv and geocode.csv have schemes with headers Country;Region;City;Count and Country;Region;City;Longitude;Lattitude respectively.

In lines 15–16 the data get merged by the first 3 fields (Country, Region, and City) and ordered by field Count from gcities. Then, in line 17, cities with wrong geocode data (more specifically, when field Longitude from geocode is null) get filtered out, and the custom subsetting function FUN is applied. Basic leaflet construction takes place in line 19, and in line 21 the top len cities from the set that survived all the previous filters are picked. In lines 29–43 cities (as well as regions if there is no city in the record, and countries if there is neither a city nor a region in the record) are marked with circle markers and annotated with popups containing the name and the count. The size of a circle marker is a logarithmic function of the count, whereas its color depends on whether the location is a city, a region, or a country. In lines 45–50 a legend with colored circles is added to hint to the user why the circle markers have different colors. Putting circles on a legend is not a trivial task: function circle_marker_to_legend_color (lines 82–92) accomplishes this using the fact that Leaflet legend’s parameter colors is hackable by supplying specially crafted HTML code.

Selecting cities from a Statcounter log and grouping them by count is a trivial task for R. In other words, there is no need to pass a preliminarily crafted file gcities.csv; instead, it makes sense to create a data frame inside R. This also makes it possible to apply yet more sophisticated subsettings to the original data, because now we get access to all the fields of the log directly from R. But remember that we still have to apply the cities spells database cities_spells_fix.awk. This seems to be the only complication for function cities_df defined in lines 58–73. This function reads a Statcounter log and returns the corresponding data frame. Its obscure parameters warn_suspicious and type correspond to whether the awk script should print suspicious replacements on stderr, and what type of visits to select from the Statcounter log: the default value “page view” is what a user normally expects. A data frame returned from cities_df can be further subset by a custom function as it contains all the original data fields. Function gcities (lines 75–80) collapses the data frame fields to the scheme with headers Country;Region;City;Count compatible with the input of function cities, and orders the data by count.

Let’s run a few examples in an R shell. For all of them, we have to load script cities.r and collect all page view visits from a Statcounter log StatCounter-Log.csv.
source("cities.r")
pv <- cities_df("StatCounter-Log.csv", "cities_spells_fix.awk")
pvC <- gcities(pv)
Now let’s render all collected cities on a world map.
cities(pvC, "geocode.csv")
Here is how it looks in my browser.

The map seems to be cluttered with myriads of circle markers (function cities printed [1] 1920 cities rendered). No problem! The map is interactive (though not in this blog) and can be zoomed (look at the buttons at the top-left corner). The legend at the bottom-right corner shows why the markers have different colors. Let’s put the cities from the Moscow region on a map.
pvMosk <- pv[grepl("^(Moskva|Moscow)", pv$Region), ]
pvMoskC <- gcities(pvMosk)
cities(pvMoskC, "geocode.csv")

In the next examples I won’t show the maps any longer, so as not to clutter the article. Render cities from the Russian Federation only.
cities(pvC, "geocode.csv", FUN = function(x) grepl("Russia", x$Country))
Render top 10 cities all over the world with total visits from 10 to 20.
cities(pvC, "geocode.csv", 10, FUN = function(x) x$Count %in% 10:20)
Render all the cities with visits in year 2018.
pv2018 <- pv[grepl("^2018", pv$Date.and.Time), ]
cities(gcities(pv2018), "geocode.csv")
Function cities looks good to me, except that it seems to make sense to create the geocode data frame separately and pass it to the function in the same way as the gcities data frame. Perhaps I will implement this in the future. Other improvements may include using library leaflet.extras for searching the marked cities. (Update: both improvements were implemented in the statcounter-utils.) Now let’s turn to bar charts of cities. I said that I used Gnuplot for that, but R is capable of making them as well! The following solution (the rest of cities.r) makes use of ggplot2 and plotly. As such, the lines
library(ggplot2)
library(plotly)
must be put on the top of the script.
gcities.compound <- function(cs) {
    d <- plyr::count(cs, c("Country", "Region", "City"))
    d$City <- paste(d$Country, "/", d$Region, "/", d$City)
    names(d)[4] <- "Count"

    return(d[order(-d$Count), c("City", "Count")])
}

gcountries <- function(cs) {
    d <- plyr::count(cs, c("Country"))
    names(d)[2] <- "Count"

    return(d[order(-d$Count), ])
}

cities.plot <- function(cs, title = NULL, tops = NULL, width = NULL) {
    w0 <- 1200
    wf <- if (is.null(width)) 1 else w0 / width
    mf <- wf * (max(cs$Count) / 10000)
    cw <- 21
    to <- (cw * nchar(cs$Count) + 300) * mf
    ym <- cs[1, ][["Count"]] + to[1] * 2
    nrow <- nrow(cs)

    p <- ggplot(cs, aes(reorder(cs[[1]], cs$Count), cs$Count)) +
        scale_x_discrete(limits = rev(cs[[1]])) +
        scale_y_continuous(expand = c(0, 50 * mf, 0, 300 * mf),
                           limits = c(0, NA)) +
        coord_flip() +
        geom_col(fill = "darkseagreen", alpha = 1.0) +
        geom_text(aes(label = cs$Count, y = cs$Count + to, alpha = 0.75),
                  size = 3.4) +
        theme(axis.ticks.y = element_blank(),
              axis.ticks.x = element_blank(),
              axis.text.x = element_blank(),
              panel.grid.major = element_blank(),
              panel.grid.minor = element_blank(),
              panel.background = element_blank()
              ) +
        labs(title = title, x = NULL, y = NULL)

    if (is.null(tops)) {
        p <- p + annotate("rect", xmin = 0.1, xmax = 0.9, ymin = 0, ymax = ym,
                          fill = alpha("green", 0.0))
    } else {
        cur <- 0
        ac <- 0.1
        for (i in 1:length(tops)) {
            if (is.na(tops[i]) || tops[i] > nrow) {
                tops[i] <- nrow
            }
            p <- p + annotate("rect",
                              xmin = nrow - tops[i] + 0.5,
                              xmax = nrow - cur + 0.5,
                              ymin = 0, ymax = ym,
                              fill = alpha("green", ac),
                              color = alpha("firebrick1", 0.4),
                              size = 0.4, linetype = "solid") +
                     annotate("text",
                              x = nrow - tops[i] + 1,
                              y = ym - 300 * mf, color = "blue",
                              label = tops[i], size = 3.0, alpha = 0.5)
            cur <- tops[i]
            ac <- ac / 2
            if (tops[i] == nrow) {
                break
            }
        }
    }

    # Cairo limits linear canvas sizes to 32767 pixels!
    height <- min(25 * nrow, 32600)
    p <- ggplotly(p, height = height, width = width)
    p <- config(p, toImageButtonOptions =
                list(filename = `if`(is.null(title), "cities",
                                     gsub("[^[:alnum:]_\\-]", "_", title)),
                     height = height,
                     width = `if`(is.null(width), w0, width), scale = 1))

    print(paste(nrow, "cities plotted"), quote = FALSE)

    return(p)
}
Functions gcities.compound and gcountries (lines 1–14) accept an original data frame read from a Statcounter log and return a data frame with two columns: a location (a compound location pasted together from the Country, Region and City columns, or just the Country column) and the count. They provide the input for function cities.plot which renders a plot in a browser window. Function cities.plot additionally accepts a title, a tops, and a width. The title and the width correspond to the title and the width of the plot; they can be left unset. The tops is a set of counts that form a number of emphasis boxes which help to highlight top cities on the plot.

In the preamble of function cities.plot (lines 17–23) a number of (somewhat empirical) factors are configured. The value of w0 corresponds to the standard width used for saving the plot as a PNG image and for calculating the factors for text positions of bar labels, wf is the width factor, mf is the ultimate width factor which takes into account not only the width of the plot but also the extent of the Count axis (which depends on the length of the top bar), cw stands for character width, to is the text offset, i.e. the final y (Count) position of a text label on the plot, and ym is the y extent of the emphasis boxes. All these complicated calculations are needed because ggplot2 seems to have no tool for binding a text label to a bar: there is an appropriate package ggrepel, but it is not supported in plotly.

The first iteration of the plot creation takes place in lines 25–40. We bind data directly to the plot (cs is an argument of ggplot), flip the coordinates (line 29) since the bars must lie horizontally, declare the x axis discrete (line 26) to enable setting exact positions of the emphasis box borders, reverse the x scale (on the same line) to put cities with the same count alphabetically from top to bottom, and reorder cs by count (aes settings in line 25). Then we put the bars (line 30), the bar labels (lines 31–32), and the title (line 40). In lines 33–39 unneeded elements of the x and y axes, as well as the grid, get removed.

In lines 42–69 the emphasis boxes get applied to the plot as annotations. Notice that, being annotations, they lie on top of the bars and the bar labels like sheets of colored glass. This means that the boxes must be very transparent so as not to obscure the bars and the labels under them, and the colors of the bars must be close to the colors of the boxes. When tops is not specified (lines 42–44), a single fully transparent box is still applied: this ensures that the lengths of the bars stay the same regardless of whether tops is specified or not. The alpha channels of the emphasis boxes decrease steadily from value 0.1 (line 47) by a factor of 2 (line 64). Each box is additionally annotated with a number at the bottom-right corner which corresponds to the position of the lowest bar it covers (lines 59–62). When the value NA (or any number greater than the number of all the bars) is met in tops, the current emphasis box gets extended to the bottom of the plot, and the loop over the tops stops (lines 49–51 and 65–67).

The height of the plot is limited by the maximum linear canvas size allowed in Cairo (lines 71–73). Finally, in lines 74–78, the plotly toolbar gets configured. In particular, button toImage (Download plot as a png when hovered) no longer proposes to save the image as newplot.png (lines 75–76); instead, it proposes a name based on the title of the plot.
Besides this, the width of the saved PNG image defaults to 1200 pixels if it has not been specified in the width parameter of function cities.plot (line 78). Now let’s render a bar chart of all visits from the Moscow region with top 10, top 40, and top-to-the-end emphasis boxes.
pvMoskCc <- gcities.compound(pvMosk)
cities.plot(pvMoskCc, paste("Moscow region", format(Sys.time(), "(%F %R)")), c(10, 40, NA), 1200)

I cut out a piece of the image (the white lacuna across its lower part) to conform to the limitations on image sizes in this blog.