Data analysis in PostgreSQL for Openfire activity audit

Background

Openfire XMPP is serving as a presence, multi-user chat room server, and communications component in a larger application.   Openfire users are connected not via standard chat clients, but by custom applications which communicate via XMPP.  One class of users has corresponding MUC Rooms and each goes through a workflow, represented as a Finite State Machine (FSM).

For auditing activity in Openfire, most XMPP events are captured for later analysis or data extraction.

The Use Case(s)

  1. For each user/MUC Room, provide a list of states entered, and duration of time in each state. The same state may have been entered more than once, according to the FSM. This use case will be used further for examples.
  2. For each user/MUC Room, provide the total time between certain phases of workflow, categorized by entering the starting state and finally reaching the end-state for that part of the workflow.  Average time for part of the workflow provides a useful metric.
  3. For each User, provide a list of times he/he dropped connection or reconnected. Show the duration of connections, disconnections, and counts.
  4. For each MUC Room, provide an extracted copy of the messages sent through it, and sender.

Ingredients

  • Openfire XMPP Server
  • Openfire Custom Audit plugin – Kafka publisher
  • Openfire Custom Workflow / FSM plugin
  • Kafka
  • Standard Event Schema for Kafka Audit/Event Topic
  • Sink Connector
  • PostgreSQL

The PostgreSQL toolbox

  • CTE – Common Table Expressions
  • Substring, with regular expressions
  • Partitioning over data and ordering
  • Assignment of virtual row numbers within partition
  • AGE – Calculating duration between a partition’s virtual rows

WITH or CTE 

A Common Table Expression (CTE) or a WITH clause allows a query to be used as if it is a table, for latter parts of the same SQL statement.  In the example below,  a relevant data rows are extracted and manipulated for further use.

From the PostgreSQL docs:

WITH provides a way to write auxiliary statements for use in a larger query. These statements, which are often referred to as Common Table Expressions or CTEs, can be thought of as defining temporary tables that exist just for one query. Each auxiliary statement in a WITH clause can be a SELECT, INSERT, UPDATE, or DELETE; and the WITH clause itself is attached to a primary statement that can also be a SELECT, INSERT, UPDATE, or DELETE.

CTE PostgreSQL Example

In the below example, from the PostgreSQL website, both regional_sales and top_regions are CTEs (WITH queries),  and get used in the main query to show total units and product sales per region and product.

WITH regional_sales AS (
        SELECT region, SUM(amount) AS total_sales
        FROM orders
        GROUP BY region
     ), top_regions AS (
        SELECT region
        FROM regional_sales
        WHERE total_sales > (SELECT SUM(total_sales)/10 FROM regional_sales)
     )
SELECT region,
       product,
       SUM(quantity) AS product_units,
       SUM(amount) AS product_sales
FROM orders
WHERE region IN (SELECT region FROM top_regions)
GROUP BY region, product;

CTE – Our Use Case Example

The following CTE gets the name of the Openfire MUC Room, the Subject, and the Timestamp for the Subject change. 

WITH data AS (
	SELECT 
	SUBSTRING(actor FROM '^(.*)@(.*)$') actor
	, target
	, (((timestamp 'epoch' + event_timestamp * interval '1 ms') AT TIME ZONE 'UTC') AT TIME ZONE 'America/Los_Angeles') event_timestamp
	, ROW_NUMBER() OVER (
	    PARTITION BY actor
	    ORDER BY (((timestamp 'epoch' + event_timestamp * interval '1 ms') AT TIME ZONE 'UTC') AT TIME ZONE 'America/Los_Angeles')
	) state_num
	FROM public.standard_event
	WHERE
	  action='room_subject_changed'
    AND (((timestamp 'epoch' + event_timestamp * interval '1 ms') AT TIME ZONE 'UTC') AT TIME ZONE 'America/Los_Angeles') > '2018-12-11 10:30:31.476'
    AND (((timestamp 'epoch' + event_timestamp * interval '1 ms') AT TIME ZONE 'UTC') AT TIME ZONE 'America/Los_Angeles') < '2018-12-11 12:05:31.476'     
)

Substring with regular Expressions
The actor’s (or the MUC Room) is logged as muc_room@conference.bogus.domain. 

However, we’re only interested in the 1st part.  PostgreSQL’s substring with a regular expression functionality is idea for this and allows getting the desired data in a single step. 

SUBSTRING(actor FROM '^(.)@(.)$')

Partitioning over data and assigning virtual row numbers
In the subquery snippet below, you can see use of ROW_NUMBER()  OVER (PARTITION BY actor, ORDER BY example_timestamp)

This tells PostgreSQL to assign sequentially increasing row numbers, restarting for each ‘actor’, and sequenced according to example_timestamp’s for that partition element.

ROW_NUMBER() OVER (
	    PARTITION BY actor
	    ORDER BY example_timestamp
) state_num

AGE – Calculating duration

The duration between two virtual row numbers can be calculated by joining twice to the CTE, where there is a difference of 1 between the row numbers. Then, the AGE between the two rows’ event timestamps can be calculated.

SELECT 
   d1.actor
  ,d1.state_num 
  ,AGE(d2.event_timestamp,d1.event_timestamp) duration  
FROM 
  data d1
INNER JOIN data d2
  ON d2.actor = d1.actor
  AND d2.state_num - d1.state_num = 1

Query – Putting it all together

WITH data AS (
	SELECT 
	SUBSTRING(actor FROM '^(.*)@(.*)$') actor
	, target
	, (((timestamp 'epoch' + event_timestamp * interval '1 ms') AT TIME ZONE 'UTC') AT TIME ZONE 'America/Los_Angeles') event_timestamp
	, ROW_NUMBER() OVER (
	    PARTITION BY actor
	    ORDER BY (((timestamp 'epoch' + event_timestamp * interval '1 ms') AT TIME ZONE 'UTC') AT TIME ZONE 'America/Los_Angeles')
	) state_num
	FROM public.standard_event
	WHERE
	  action='room_subject_changed'
    AND (((timestamp 'epoch' + event_timestamp * interval '1 ms') AT TIME ZONE 'UTC') AT TIME ZONE 'America/Los_Angeles') > '2018-12-11 10:30:31.476'
    AND (((timestamp 'epoch' + event_timestamp * interval '1 ms') AT TIME ZONE 'UTC') AT TIME ZONE 'America/Los_Angeles') < '2018-12-11 12:05:31.476'     
)
SELECT 
  d1.actor
  ,d1.state_num 
  ,d1.target initial_state
  ,d2.target end_state
  ,AGE(d2.event_timestamp,d1.event_timestamp) duration  
FROM 
  data d1
INNER JOIN data d2
  ON d2.actor = d1.actor
  AND d2.state_num - d1.state_num = 1
ORDER BY d2.actor,d2.state_num

Sample Result

Below, you can see two “actors”,  with the duration between several example states.

Leave a Reply

Your email address will not be published. Required fields are marked *