Data analysis in PostgreSQL for Openfire activity audit
Background
Openfire XMPP is serving as a presence, multi-user chat room server, and communications component in a larger application. Openfire users are connected not via standard chat clients, but by custom applications which communicate via XMPP. One class of users has corresponding MUC Rooms and each goes through a workflow, represented as a Finite State Machine (FSM).
For auditing activity in Openfire, most XMPP events are captured for later analysis or data extraction.
The Use Case(s)
- For each user/MUC Room, provide a list of states entered, and duration of time in each state. The same state may have been entered more than once, according to the FSM. This use case will be used further for examples.
- For each user/MUC Room, provide the total time between certain phases of workflow, categorized by entering the starting state and finally reaching the end-state for that part of the workflow. Average time for part of the workflow provides a useful metric.
- For each User, provide a list of times he/he dropped connection or reconnected. Show the duration of connections, disconnections, and counts.
- For each MUC Room, provide an extracted copy of the messages sent through it, and sender.
Ingredients
- Openfire XMPP Server
- Openfire Custom Audit plugin – Kafka publisher
- Openfire Custom Workflow / FSM plugin
- Kafka
- Standard Event Schema for Kafka Audit/Event Topic
- Sink Connector
- PostgreSQL
The PostgreSQL toolbox
- CTE – Common Table Expressions
- Substring, with regular expressions
- Partitioning over data and ordering
- Assignment of virtual row numbers within partition
- AGE – Calculating duration between a partition’s virtual rows
WITH or CTE
A Common Table Expression (CTE) or a WITH clause allows a query to be used as if it is a table, for latter parts of the same SQL statement. In the example below, a relevant data rows are extracted and manipulated for further use.
From the PostgreSQL docs:
WITH provides a way to write auxiliary statements for use in a larger query. These statements, which are often referred to as Common Table Expressions or CTEs, can be thought of as defining temporary tables that exist just for one query. Each auxiliary statement in a WITH clause can be a SELECT, INSERT, UPDATE, or DELETE; and the WITH clause itself is attached to a primary statement that can also be a SELECT, INSERT, UPDATE, or DELETE.
CTE PostgreSQL Example
In the below example, from the PostgreSQL website, both regional_sales and top_regions are CTEs (WITH queries), and get used in the main query to show total units and product sales per region and product.
WITH regional_sales AS (
SELECT region, SUM(amount) AS total_sales
FROM orders
GROUP BY region
), top_regions AS (
SELECT region
FROM regional_sales
WHERE total_sales > (SELECT SUM(total_sales)/10 FROM regional_sales)
)
SELECT region,
product,
SUM(quantity) AS product_units,
SUM(amount) AS product_sales
FROM orders
WHERE region IN (SELECT region FROM top_regions)
GROUP BY region, product;
CTE – Our Use Case Example
The following CTE gets the name of the Openfire MUC Room, the Subject, and the Timestamp for the Subject change.
WITH data AS (
SELECT
SUBSTRING(actor FROM '^(.*)@(.*)$') actor
, target
, (((timestamp 'epoch' + event_timestamp * interval '1 ms') AT TIME ZONE 'UTC') AT TIME ZONE 'America/Los_Angeles') event_timestamp
, ROW_NUMBER() OVER (
PARTITION BY actor
ORDER BY (((timestamp 'epoch' + event_timestamp * interval '1 ms') AT TIME ZONE 'UTC') AT TIME ZONE 'America/Los_Angeles')
) state_num
FROM public.standard_event
WHERE
action='room_subject_changed'
AND (((timestamp 'epoch' + event_timestamp * interval '1 ms') AT TIME ZONE 'UTC') AT TIME ZONE 'America/Los_Angeles') > '2018-12-11 10:30:31.476'
AND (((timestamp 'epoch' + event_timestamp * interval '1 ms') AT TIME ZONE 'UTC') AT TIME ZONE 'America/Los_Angeles') < '2018-12-11 12:05:31.476'
)
Substring with regular Expressions
The actor’s (or the MUC Room) is logged as muc_room@conference.bogus.domain.
However, we’re only interested in the 1st part. PostgreSQL’s substring with a regular expression functionality is idea for this and allows getting the desired data in a single step.
SUBSTRING(actor FROM '^(.)@(.)$')
Partitioning over data and assigning virtual row numbers
In the subquery snippet below, you can see use of ROW_NUMBER() OVER (PARTITION BY actor, ORDER BY example_timestamp)
This tells PostgreSQL to assign sequentially increasing row numbers, restarting for each ‘actor’, and sequenced according to example_timestamp’s for that partition element.
ROW_NUMBER() OVER (
PARTITION BY actor
ORDER BY example_timestamp
) state_num
AGE – Calculating duration
The duration between two virtual row numbers can be calculated by joining twice to the CTE, where there is a difference of 1 between the row numbers. Then, the AGE between the two rows’ event timestamps can be calculated.
SELECT
d1.actor
,d1.state_num
,AGE(d2.event_timestamp,d1.event_timestamp) duration
FROM
data d1
INNER JOIN data d2
ON d2.actor = d1.actor
AND d2.state_num - d1.state_num = 1
Query – Putting it all together
WITH data AS (
SELECT
SUBSTRING(actor FROM '^(.*)@(.*)$') actor
, target
, (((timestamp 'epoch' + event_timestamp * interval '1 ms') AT TIME ZONE 'UTC') AT TIME ZONE 'America/Los_Angeles') event_timestamp
, ROW_NUMBER() OVER (
PARTITION BY actor
ORDER BY (((timestamp 'epoch' + event_timestamp * interval '1 ms') AT TIME ZONE 'UTC') AT TIME ZONE 'America/Los_Angeles')
) state_num
FROM public.standard_event
WHERE
action='room_subject_changed'
AND (((timestamp 'epoch' + event_timestamp * interval '1 ms') AT TIME ZONE 'UTC') AT TIME ZONE 'America/Los_Angeles') > '2018-12-11 10:30:31.476'
AND (((timestamp 'epoch' + event_timestamp * interval '1 ms') AT TIME ZONE 'UTC') AT TIME ZONE 'America/Los_Angeles') < '2018-12-11 12:05:31.476'
)
SELECT
d1.actor
,d1.state_num
,d1.target initial_state
,d2.target end_state
,AGE(d2.event_timestamp,d1.event_timestamp) duration
FROM
data d1
INNER JOIN data d2
ON d2.actor = d1.actor
AND d2.state_num - d1.state_num = 1
ORDER BY d2.actor,d2.state_num
Sample Result
Below, you can see two “actors”, with the duration between several example states.