Keeping track of graph changes using temporal versioning
In this post we’re going to cover why versioning is important and how to do time-based versioning in Neo4j including managing retrospective (bi-temporal) changes.
Introduction
Both Ian Robinson and Tom Geudens wrote excellent posts on the topic of time-based versioning in Neo4j. Unfortunately those posts are no longer available. This post is a write-up of the NODES2019 talk I did on the subject.
Why versioning?
First and foremost, versioning allows us to keep track of change. We can see what has been altered and retrace our steps if necessary.
Another reason for versioning would be for what-if analysis. Whereas keeping track of change shows us historical changes, suggesting a change that hasn’t already occurred allows us to start examining what might happen under certain scenarios and project into the future.
There are many use-cases where you might see versioning in use, for example:
- Identity and access management: we can keep track of who is accessing what and when, and analyse any unusual behaviour that warrants investigation.
- Monitoring changes in networks: as in the example above, we can track what has been accessed to watch for unusual behaviour, and we can also predict the impact of a change before we make it.
- Collaborative working: think GitHub and other collaborative working tools. We can track changes being made, see when they occurred, and reduce the likelihood of losing contributions.
Versioning in Neo4j
As with many other database systems, there is no out-of-the-box versioning solution in Neo4j. You will need to build versioning in as part of your data modelling process. We’ll show you some approaches later on in the post.
One thing to bear in mind: the community has created a versioning plugin called Versioner-core, which provides some degree of automated versioning. You can check it out from the author’s GitHub repository.
Introduction to time-based versioning
Time-based versioning is very useful for a number of situations:
- You can use it to track changes, reversing any errors
- You can make updates to your data without deleting anything
- You too can become a time-traveller and move through time to understand state change.
The principles behind time-based versioning are pretty straightforward:
- Separate the object from its state, and link the two with a relationship
- Capture change times as properties on the relationship linking these two entities.
And so, it is time to introduce our example (no pun intended!):
Scenario 1
There is a company that produces a product called Widget. On the 4th May 2016, a couple of decisions were made:
- Rename Widget to MiniWidget, due to a new product similar to Widget coming out
- Reduce the price of this product down to 3.99
As you can see, we’ve split out our Product node, which contains only the property that uniquely identifies it (id), and we capture other information such as name and price within ProductState nodes. We then connect Product to ProductState using the HAS_PRODUCT_STATE relationship. Lastly, we capture when each state was valid with from and to properties on the relationship itself.
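As a rough sketch, the graph described above could be created along these lines. It assumes dates are stored as yyyymmdd integers, as the queries in this post suggest. The original price and the start date of the first state are not given in the scenario, so the values used here are hypothetical placeholders.

```cypher
// Assumption: dates are yyyymmdd integers, matching the queries in this post.
// Hypothetical placeholders: the original price (4.99) and the start date
// of the first state (20150101) are not stated in the scenario.
CREATE (p:Product {id: 123})
CREATE (previous:ProductState {name: 'Widget', price: 4.99})
CREATE (current:ProductState {name: 'MiniWidget', price: 3.99})
// Superseded state: valid until the change on 4th May 2016.
CREATE (p)-[:HAS_PRODUCT_STATE {from: 20150101, to: 20160504}]->(previous)
// Current state: no `to` property, because it is still valid.
CREATE (p)-[:HAS_PRODUCT_STATE {from: 20160504}]->(current);
```

Leaving `to` off the current relationship, rather than setting a far-future date, is what lets the queries below find the current state by checking for a missing property.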
With our newly refactored graph, we can start to ask some questions of it.
Query 1: What is the current name of product with id:123?
MATCH (:Product {id:123})-[r:HAS_PRODUCT_STATE]->(ps:ProductState)
WHERE NOT EXISTS(r.to)
RETURN ps.name;
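We use WHERE NOT EXISTS to bring back the most recent state, as that will be the one whose relationship has no to property. On newer Neo4j versions, where the exists() property function has been removed, WHERE r.to IS NULL expresses the same check.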
Query 2: How much did the product with id:123 cost 3 years ago?
MATCH (:Product {id:123})-[r:HAS_PRODUCT_STATE]->(ps:ProductState)
WHERE r.from <= 20161010 AND (r.to >= 20161010 OR NOT EXISTS(r.to))
WITH ps, r ORDER BY r.from LIMIT 1
RETURN ps.price;
Here we look for the ProductState node that was valid 3 years ago, bearing in mind that the state from that time may still be the current one and so have no to property.
Query 3: What is the SKU (Stock Keeping Unit) for MiniWidget?
MATCH (:ProductState {name:"MiniWidget"})<-[:HAS_PRODUCT_STATE]-(p:Product)
RETURN p.id;
As you can see, just by separating out object from state, we are able to capture a lot of information about changes, and pull back information depending on the time filter.
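The queries so far only read the model. One possible way to record a new state without deleting anything is sketched below. The parameter names ($productId, $changeDate, $name, $price) are hypothetical, and the sketch uses r.to IS NULL as the check for the current state, which is equivalent to the NOT EXISTS(r.to) test used in the queries above.

```cypher
// Hypothetical parameters: $productId, $changeDate (yyyymmdd integer),
// $name and $price.
MATCH (p:Product {id: $productId})-[r:HAS_PRODUCT_STATE]->(:ProductState)
WHERE r.to IS NULL            // the current state is the one with no `to`
SET r.to = $changeDate        // close the current state instead of deleting it
CREATE (p)-[:HAS_PRODUCT_STATE {from: $changeDate}]->
       (:ProductState {name: $name, price: $price});
```

If the product has no current state, the MATCH finds nothing and no new state is created, so the first state of a product would need to be created separately.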
Versioning relationships
As well as versioning objects, we’re quite likely to want to version the relationships between objects, so that we can understand how entities are or were connected to other entities and how that changed over time.
The principles behind versioning relationships are much the same as those for versioning object states:
- Connect the two nodes involved in the relationship
- Provide a time range for when that relationship became live, if relevant
Time to have a look at our next scenario!
In the next iteration of our data model, you can see how we’ve extended versioning to relationships by joining Customer and Product with the BUYS relationship. By adding a date as a property, we show when that relationship occurred, hence versioning it. Some of you may have spotted that Customer is not versioned; more on that in the next model.
Scenario 2
A customer buys a product on 18th September 2016. Sometime after the purchase has been made, the customer moves home and updates their address.
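A minimal sketch of how this scenario might be loaded is shown below. The customer id (456), the name Jane and the purchase date come from this post. The addresses, the start date of the first address and the move date are hypothetical placeholders.

```cypher
// From the post: customer id 456 (Jane), product id 123, purchase on 20160918.
// Hypothetical placeholders: both addresses, the first `from` date (20160101)
// and the move date (20170301).
MATCH (p:Product {id: 123})
CREATE (c:Customer {id: 456})
// The purchase is versioned by the date on the relationship.
CREATE (c)-[:BUYS {date: 20160918}]->(p)
// Old address: closed off with a `to` date when the customer moved.
CREATE (c)-[:HAS_CUSTOMER_DETAILS {from: 20160101, to: 20170301}]->
       (:CustomerDetails {name: 'Jane', address: '1 Old Street'})
// New address: still current, so it has no `to` property.
CREATE (c)-[:HAS_CUSTOMER_DETAILS {from: 20170301}]->
       (:CustomerDetails {name: 'Jane', address: '2 New Road'});
```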
As previously, let’s ask some questions!
Query 4: Which customer last purchased a product with id:123?
MATCH (:Product {id:123})<-[r1:BUYS]-(c:Customer)
WITH c, r1 ORDER BY r1.date DESC LIMIT 1
MATCH (c)-[r2:HAS_CUSTOMER_DETAILS]->(cd:CustomerDetails)
WHERE NOT EXISTS(r2.to)
RETURN cd.name;
We use ORDER BY r1.date DESC LIMIT 1 to get the newest BUYS relationship.
Query 5: Where has Jane lived and when did she move?
MATCH (:Customer {id:456})-[r:HAS_CUSTOMER_DETAILS]->(cd:CustomerDetails)
RETURN DISTINCT cd.address AS Address, r.from AS From, r.to AS To
ORDER BY From;
Managing retrospective changes
Sometimes just capturing one date or timestamp is not enough. There are scenarios where we capture (or don’t capture!) something happening. Later on we discover we need to apply a correction for that event. We can’t just go and correct it, because we lose information about something going wrong. We need to be able to manage retrospective changes. For example, you may need to manage retrospective changes in situations such as:
- Reconciling newly discovered information: for example, you deposited some money into your savings account, but the bank missed it.
- Providing an audit trail for regulatory purposes: a company needs to demonstrate to a governing body what went wrong and how it was corrected.
- Supporting what-if and other analysis based on events happening at different potential points in time: recording when a process should take effect in the future alongside the date on which it was authorised, e.g. a price increase in a monthly subscription or a change to how much power flows through a network.
This type of versioning is also commonly known as bi-temporal versioning.
The principles are fairly simple. We now use two dates/timestamps instead of one:
- one to represent when something should have happened. In our following scenario we shall refer to this as business date, or bizDate for short
- one to represent when something actually happened, for example when a correction has taken place. In the scenario we shall refer to this as process date, or procDate for short
So, on to our final scenario!
Scenario 3
A customer buys a product from the company. Unfortunately something goes wrong during the process and the transaction to capture the sale fails. During the bi-annual stock-take, it is identified that there is one fewer item of the product in stock than the records show. After some investigation, the missing transaction is identified and rectified.
In our latest iteration we added bi-temporal versioning elements to the BUYS relationships:
- bizDate captures the business date of when a transaction took place (or should have taken place). E.g. this is the date we’d show the sale went through
- procDate captures the date of when the transaction or any corrections actually took place.
If everything is working as expected, bizDate and procDate will be identical. If, per our scenario, we miss and then later identify a transaction, we would set procDate as the current date when the correction is applied, and bizDate would be the retrospective date of when the transaction should have taken place.
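To illustrate, a retrospective correction for the missed sale could be recorded roughly as follows. Both dates are hypothetical, as is the choice of customer 456 and product 123 as the parties to the sale. The scenario does not give specific values.

```cypher
// Hypothetical values: the sale should have been recorded on 20190315
// (bizDate) but was only captured after the stock-take, on 20190701
// (procDate). Customer 456 and product 123 are reused for illustration.
MATCH (c:Customer {id: 456}), (p:Product {id: 123})
CREATE (c)-[:BUYS {bizDate: 20190315, procDate: 20190701}]->(p);
```

Because bizDate is earlier than procDate, this relationship is distinguishable from a sale that was captured correctly at the time, where the two dates would match.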
Query 6: How many transactions were missed that we retrospectively captured?
MATCH (:Customer)-[r:BUYS]->(:Product)
WHERE r.bizDate < r.procDate
RETURN count(*);
Query 7: What were the captured transactions for the past year?
MATCH (:Customer)-[r:BUYS]->(p:Product)
WHERE r.bizDate = r.procDate AND r.bizDate >= 20181010
RETURN p.id AS SKU, r.bizDate AS `Sale Date`
ORDER BY `Sale Date`;
Query 8: What are all of the actual transactions for the past year?
MATCH (:Customer)-[r:BUYS]->(p:Product)
WHERE r.bizDate >= 20181010
RETURN p.id AS SKU, r.bizDate AS `Sale Date`, r.procDate AS `Transaction Date`
ORDER BY `Sale Date`;
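Keeping both dates also makes it possible to ask what the records looked like at some earlier point, before later corrections were applied. The following is a sketch of such a query, with hypothetical parameters $since and $asKnownOn (both yyyymmdd integers).

```cypher
// Hypothetical parameters: $since and $asKnownOn (yyyymmdd integers).
MATCH (:Customer)-[r:BUYS]->(p:Product)
WHERE r.bizDate >= $since
  AND r.procDate <= $asKnownOn   // only sales already recorded by that date
RETURN p.id AS SKU, r.bizDate AS `Sale Date`, r.procDate AS `Transaction Date`
ORDER BY `Sale Date`;
```

A sale that was missed and corrected after $asKnownOn is left out of the result, even though its bizDate falls within the period.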
Advantages and Disadvantages of these approaches for capturing change
As with all modelling decisions, there will be advantages and disadvantages, and time-based versioning is no different.
Advantages:
- All changes to the data are captured, including relationship changes
- Able to step backwards and forwards in time according to the questions we are looking to answer.
Disadvantages:
- Need to do additional work to model changes in relationships
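- No indexing on relationship properties at the time of writing (Neo4j releases from 4.3 onwards do support relationship property indexes), so further model iteration is required if there are many state changes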
- Querying is a bit more complex.
What has been shown here is the formalised view of time-based versioning. In practice, your data and the questions you want to ask will let you be pragmatic and use only the components needed for the level of versioning you require.
Summary
In this post, we’ve looked at why you may want to have versioning in your graph database and some use-cases where it is useful. We discussed that, as versioning is not available out of the box, we need to incorporate it into our data model.
We looked at time-based versioning, and how this can be extended for versioning relationships, as well as capturing retrospective changes.
Last but not least, we discussed some of the advantages and disadvantages with this method, and the importance of pragmatism in your versioning approaches.