This is a repost of my article on an implementation of the traclus algorithm using C++. If you just want to use TRACLUS, consider the blog post on how to use it from within R. If you want to understand or modify the implementation, read on...

Trajectory clustering is a specific form of data clustering in which the data objects at hand are trajectories. To keep it simple, these are just ordered sequences of spatial points, often together with a timestamp. They occur often in modern data science, they might be created from GPS logs, from smart phone data or from simulations or measurements (e.g., weather, hurricanes), from handwriting and in many other ways.

Trajectories are computationally interesting, because most algorithms of spatial data analysis do not (scalably) work with them. This is due to the fact that a lot of these algorithms are based on indexing structures exploiting a specific principle of locality, which is not supported by trajectories, which can be arbitrarily large.

Some algorithms, and especially our current example TRACLUS (see http://hanj.cs.illinois.edu/pdf/sigmod07_jglee.pdf), try to overcome this problem by segmenting trajectories into small pieces that behave more like points. In the mentioned TRACLUS framework, this is done using an information-theoretic approach of minimum description length and a set of special distance measures defined for line segments. The idea is to cut the large trajectories into small pieces (line segments), which behave more like points and use a classical clustering algorithm on these line segments. The only additional step taken by the TRACLUS algorithm is to remove clusters of segments that have been generated from too few different trajectories.

Implementing TRACLUS was not actually a challenge, but it wasn’t easy either. Mainly, because the results I expected (especially from the figures in the original paper) were difficult to generate. After a colleague implemented a sketch of TRACLUS in octave (just to see how it works), I decided to write a reasonably fast version in modern C++ (C++11, STL and templates).

The result is a small data science demonstration tool, which is best summarized by the following video on YouTube:

The implementation is a header-only library and has no dependencies other than my trajectory computing headers. So, if you download the source and want to integrate it into your data science project, all you need are the *.hpp header files.

The demonstration itself depends on my data science lab software, which I will be covering in a different article, but basically, it uses OpenGL to render and plain C++ to perform calculations.

# The Source Code

After you got an idea of what TRACLUS is and how it works, let us have a look at the implementation. It is contained in a single header trajcomp_traclus.hpp, which you can download from the end of this document (altogether with other useful resources). Let us now discuss the complete file step by step:

## Basic Geometry

The file starts with including some STL headers, namely queue, set, unordered_map and the trajcomp libary (for distance measurements; You could also use boost::geometry with some small changes…)

```#include<queue>
#include<set>
#include<unordered_map>
#include<trajcomp/trajcomp.hpp>

#define UNCLASSIFIED -1
#define NOISE -2
#define REMOVED -3```

Then three values are defined: UNCLASSIFIED will mean that a segment has not been assigned to a cluster. NOISE will mean, that the segment is considered noise as there are too few sufficiently similar segments. REMOVED will be used for segments that belong to a cluster of segments, but not to a cluster of trajectories as the segments come from too few different trajectories

Then, TRACLUS is using line segments. Though the following approach is neither very elegant nor very performant, it is easy to read and understand: We create a class tLineSegment managing our line segments:

```template<typename point>
class tLineSegment
{
public:
point s;  /// the start of the segment
point e;  /// the end of the segment
size_t trajectory_index;  /// the index of the trajectory, where this segment has been extracted from
int cluster; /// the cluster ID of the segment

/// The default segment is created as UNCLASSIFIED
tLineSegment(point &_s, point &_e, size_t index):s(_s),e(_e),trajectory_index(index),cluster(UNCLASSIFIED)
{
}

friend ostream& operator<<(ostream& os, const tLineSegment<point>& dt)
{
os << trajcomp::tools::make_string(dt.s) << " " << trajcomp::tools::make_string(dt.e) << " " << dt.trajectory_index << " " << dt.cluster;
return os;
}
};
```

This code snippet is a simple C++ class. However, we did not want to fix the data type of a point. Therefore, we create a template class parametrized by the type of the point. The class of a line segment stores two such points (data elements of not yet defined type point) called s and e to denote the beginning and the end of the line segment. Additionally, we add a size_t variable (positive integer) to store the index of the trajectory from which a line segment was generated and we store the clutser ID of the cluster.

The constructor tLineSegment now initializes all data suitably taking a reference to a start and an end point and the index of the trajectory. Note that due to this construction, the type tLineSegment cannot be instantiated without giving these three data objects. However, we will never do this. This can become a tricky error, when you once create a vector and want to resize this:

```std::vector<tLineSegment> segments;
segments.resize(100); // this will fail, as there is no way to create a line segment without arguments...```

The last part of the class is a typical friend implementation of a stream opeerator: When this type is being used with any C++ stream (let it be cout, cerr, some file or something else), this function controls the output of a line segment. In the given implementation, the points are converted to strings using some tool function of libtrajcomp (replace this for boost::geometry).

The next part of the header handles some basic geometry operations, which will be used by the TRACLUS algorithm: First we instantiate a functional d_euc calculating the Euclidean distance between two points. Again, this functional is provided by libtrajcomp…

`trajcomp::default_element_distance<std::vector<double>> d_euc;`

Then, a function is given, which calculates a projection of a point p on a segment from u to v. The previous instantiation of d_euc already showed that we will be using a vector of doubles as our point type. With this type, it is quite easy to calculate the nearest point on the (infinite) line as follows:

```/*Projection point calculation*/
template<typename sample_type>
sample_type projection_point(sample_type &u,sample_type &v, sample_type &p)
{

// the length of the segment
trajcomp::default_element_distance_squared<sample_type> d2;

double l = d2(u,v);

if (fabs(l) < 1E-12)
return u;

double t = 0;
for (size_t i=0; i <  u.size(); i++)
t += ((v[i]-u[i])*(p[i]-u[i]));
t/=l;

sample_type proj;
for (size_t i=0; i <  u.size(); i++)
proj.push_back(u[i]+t*(v[i]-u[i]));

return proj;
}
```

This function is again based on a not clearly specified type sample_type. However, this type has to support several operations used by the implementation. The collection of all such features a specific template type must support is often called concept. For this implementation, the type sample_type must support a bracket operator [], the method push_back and empty initialization. Furthermore, it should be compatible with double and the value type must support arithmetic operations as seen in the source code. A good candidate for this implementation is std::vector<double>, which is exactly the point type, we will be using.

Now we are ready to define the first three functions specific to the TRACLUS framework: The paper mentions three distance: The perpendicular distance, the angle distance and the parallel distance. These distance more or less do what one should expect and are best explained with concrete implementations:

```template<typename point>
double perpen_dist(point &si,point &ei,point &sj,point &ej)
{
point ps = projection_point(si,ei,sj);
point pe = projection_point(si,ei,ej);

double dl1 = d_euc(sj,ps);
double dl2 = d_euc(ej,pe);

if (fabs(dl1+dl2) < 0.0001)
return 0;

return (dl1*dl1 + dl2*dl2)/(dl1 + dl2);
};

template<typename point>
double angle_dist(point &si,point &ei,point &sj,point &ej)
{
double alpha1 = atan2(ei - si,ei - si);
double alpha2 = atan2(ej - sj,ej - sj);

double l = d_euc(sj,ej);

return l * fabs(sin(alpha2-alpha1));
}

template<typename point>
double par_dist(point &si,point &ei,point &sj,point &ej)
{
point ps = projection_point(si,ei,sj);
point pe = projection_point(si,ei,ej);

double l1 = std::min(d_euc(ps,si),d_euc(ps,ei));
double l2 = std::min(d_euc(pe,si),d_euc(pe,ei));

return std::min(l1,l2);
}
```

The first distance metric for a line segment (represented by points si,ei and sj, ej) takes distances between projected points, the second distance uses the angle difference (sin (0) = 0 and scales some length with it, the third distance measures a distance between two constructed parallel lines. See the paper for details. At this point, one should note that all these three functions return length values. This might become useful when trying to use spatial indices such as an R* tree to index nearest neighbors during the clustering phase.

These three distances are now combined using a weighted sum of all three distances as shown in the following implementation:

```/*Total distance, which is a weighted sum of above*/
template<typename point>
double total_dist(point &si,point &ei,point &sj,point &ej,
double w_perpendicular=0.33, double w_parallel=0.33, double w_angle=0.33)
{
double td =   w_perpendicular * perpen_dist(si,ei,sj,ej)
+ w_parallel      *    par_dist(si,ei,sj,ej)
+ w_angle         *  angle_dist(si,ei,sj,ej);
return td;
}```

This will be the final similarity measure for line segments in the TRACLUS framework. Still, the framework allows for replacing these measures with anything else. The weighting expresses, how much influence will be given to the angle, the parallel and the orthogonal distance components.

## TRACLUS Trajectory Segmentation

Now, the TRACLUS framework segments trajectories by using a specific idea based on the minimum description length (MDL) principle. The following code segments show this: Basically, some information value (those logarithms) are compared for a line segment and the subtrajectory between the beginning and the end of the line segment. This gives the result MDL_PAR:

```template<typename iteratortype>
double MDL_PAR(iteratortype st, iteratortype en)
{
double d1 = d_euc(*st, *en);

double PER_DIST=0, ANG_DIST=0;
iteratortype it = st;
iteratortype it2 = it;
it2 ++;

while (true)
{
double d2 = d_euc(*it, *it2);
if (d1 >= d2)
{
PER_DIST += perpen_dist(*st,*en,*it,*it2);
ANG_DIST += angle_dist(*st,*en,*it,*it2);
}else{
PER_DIST += perpen_dist(*it,*it2,*st,*en);
ANG_DIST += angle_dist(*it,*it2,*st,*en);
}

if (it2 == en)
break;
it ++;
it2 ++;
}

// information calculation
double LH = log2(d1);
double LDH = log2(PER_DIST) + log2(ANG_DIST);

return LH + LDH;
}```

This is now used to segment trajectories (yes, now we are ready to do the first real thing to our trajectories!). A trajectory is segmented whenever the MDL_PAR value becomes larger than the cost of a line segment. This is implemented as follows:

```template<typename ttraj>
ttraj traclus_partition (ttraj &A)
{
ttraj CP;
CP.push_back(A);
typename ttraj::iterator it,it2,it2_old;
it =  A.begin();
it2_old = it2 = it;
it2 ++;
while (it2 != A.end())
{
double cost = MDL_PAR(it, it2);
double cost2 = log2(d_euc(*it,*it2));
if (cost > cost2 && !(fabs(cost) < 0.0001)) // right side: skip over equal points
{
CP.push_back(*it2_old);
it = it2_old;
it2++;
it2_old ++;
}else{
it2 ++;
it2_old ++;
}
}
CP.push_back(A.back());
return CP;
}```

This creates a novel trajectory by adding points to CP. Note that this function expects the trajectory begin non-empty, such that the push_back and the it2 ++ line can be run at least once.

## The TRACLUS approach (equivalent to DBSCAN)

For clustering segments, the TRACLUS framework uses an unmodified DBSCAN algorithm with a single step of post-processing. Inside DBSCAN, the epsilon neighborhood of an object is used a lot, so we need to be able to compute this. The following linear scan implementation is not very efficient, but compatible with any change in the geometry. For large datasets, however, you will have to use spatial indexing (such as an R* tree) in order to construct it in a faster way. Still, we make the linear scan fully parallel by exploiting annotation based parallelism provided by OpenMP:

The following function computes the indices (into the vector of line segments) of the neighborhood of a line segment, without adding the query line segment itself.

```template<typename point, typename distance_functor>
std::vector<size_t> compute_NeIndizes(std::vector<tLineSegment<point>> &L,size_t idx,
double eps, distance_functor &_d)
{
std::vector<size_t> ret;
#pragma omp parallel for
for (size_t i=0; i < L.size(); i++)
if (idx != i)
if (_d(L,i,idx) <= eps)
{
#pragma omp critical
ret.push_back(i);
}

return ret;
}```

In this case, the functional _d will be given by the caller (see the template type distance_functor) and can be anything (a class with an operator () or a plain C function) that can be called as shown in the code segment. Basically, this creates all indices for which the distance as calculated by the parameter _d is smaller than epsilon expect the query line segment.

For DBSCAN, any object can be expanded to its epsilon-connected neighborhood adding objects for which a sequence of epsilon-near objects exists. This is best implemented using a queue in which elements are being added until no epsilon-near object exists in the environment of the current epsilon-connected region. This is done in the next function

```/**
* Expand the epsilon-connected neighborhood of a cluster
* using the line segment collection L, the queue of unprocessed elements
* to be looked at, epsilon, minlines, the clusterID to be assigned and
* the distance functional.
*
*/

template<typename point,typename distance_functor>
void expandCluster(std::vector<tLineSegment<point>> &L,
std::queue<size_t> &Q,
double eps,    size_t minLines, size_t clusterID,distance_functor &_d)
{
while (!Q.empty())
{
size_t m = Q.front();
auto Ne = compute_NeIndizes(L,m,eps,_d);
Ne.push_back(m);
if (Ne.size() >= minLines)
{
for (auto it = Ne.begin(); it != Ne.end(); it++)
{

if (L[*it].cluster == UNCLASSIFIED)
Q.push(*it);
if (L[*it].cluster < 0)
L[*it].cluster = clusterID;
}
}
Q.pop();
}
}```

There are two interesting facts about this function: First of all, it takes a reference to a queue. So there will be only one queue in total, which is changed by every call to this function. Additionally, the queue should not be empty in which case the function would do nothing. Furthermore, only bad elements (UNCLASSIFIED or NOISE) can be expanded into a cluster. As all segments start out UNCLASSIFIED (see the constructor of line segments), this will be no restriction until elements have been inserted into any other cluster.

With these preparations, we can write down a simple DBSCAN implementation called grouping and using a lot of template parameters as follows:

``` /**
* Perform the grouping. This is the central clustering work:
*
* Finds the next UNCLASSIFIED segment, calculates the epsilon-neighborhood,
* if it is large enough (minLines) expands the Cluster using density-connectedness and
* the current clusterID, which is then incremented,
* otherwise it marks the eps-neighborhood as NOISE segments
* */

template<typename point,typename distance_functor, typename progress_visitor>
void grouping(std::vector<tLineSegment<point>> &L, double eps, size_t minLines,
distance_functor &_d, progress_visitor &pvis)
{
size_t clusterID=0;
std::queue<size_t> Q;
pvis.init(L.size());
for (size_t i=0; i < L.size(); i++)
{
if (i%100==0)
pvis(i,L.size(),"Phase 2: Clustering Segments");
if (L[i].cluster == UNCLASSIFIED)
{
std::vector<size_t> Ne = compute_NeIndizes<point>(L,i,eps,_d);
if (Ne.size()+1 >= minLines)
{
L[i].cluster = clusterID;
for (auto it = Ne.begin(); it != Ne.end(); it++)
{
L[*it].cluster = clusterID;
Q.push(*it);
}
expandCluster(L,Q, eps,minLines,clusterID,_d);
clusterID++;
}else  // not minLines
{
L[i].cluster = NOISE;
}
}
}

};

```

This function is very simple, the pvis() calls are for a progress bar given as a parameter (vis stands for visitor points, google the concept of a visitor in C++ algorithms for details) or look at the complete source to understand how it works. This algorithm works linearly through the dataset, expands each UNCLASSIFIED line segment and creates a cluster if there are enough segments in this expanded neighborhood or marks all neighborhood segments as NOISE.

For TRACLUS, this function is extended by a step removing all clutsers of trajectory segments to which not enough different trajectories contribute:

```for (int i=0; i < (int)clusterID; i++)
{
std::set<size_t> sources;
for (size_t j = 0; j < L.size(); j++)
if (L[j].cluster == i)
sources.insert (L[j].trajectory_index);

if (sources.size() < minLines)
{
for (size_t j = 0; j < L.size(); j++)
if (L[j].cluster == i)
L[j].cluster = REMOVED;
}
}```

This post-processing is performed as part of the grouping method after the DBSCAN work as explained before.

## TRACLUS function

Now, we are ready to define the complete TRACLUS function, which performs segmentation and grouping and provides a visitor interface to progress bars:

```template<typename TrajectoryCollection, typename partitioning_functional,
typename distance_functional, typename progress_visitor>
std::vector<tLineSegment<typename TrajectoryCollection::value_type::value_type>>
traclus(TrajectoryCollection &A, double eps, size_t minLines, partitioning_functional &part,
distance_functional &_d, progress_visitor  &pvis)
{

typedef typename TrajectoryCollection::value_type TrajectoryType;
std::vector<size_t> index_map;
std::vector<tLineSegment<typename TrajectoryType::value_type>> segments;
TrajectoryType L;
size_t i=0;
pvis.init(A.size());
auto it = A.begin();
while (it != A.end())
{
if((*it).size() == 0)        // ignore empty
continue;
TrajectoryType CPi;
CPi = part(*it);
for (size_t k=0; k < CPi.size() -1; k++)
{
segments.push_back(tLineSegment<typename TrajectoryType::value_type>(CPi[k],CPi[k+1],i));
}
it++;i++;
pvis(i,A.size(),"Phase 1: Segment Creation");
}

grouping(segments,eps,minLines,_d, pvis);
pvis.finish();
return segments;
};```

This is now very obvious (if we ignore, that most types are template parameters): First, take the dataset named A (a collection of trajectories) and create segments storing the index of the trajectory where a segment came from and update a (possibly existent) progress bar with the text „Phase 1: Segment Creation“. Then, perform the grouping of the segments with epsilon and minlines. The functional _d is again taken from outside. So a user of this TRACLUS algorithm can give the distance as a parameter. This is also the reason, why spatial indexing is not integrated: We do not know how this distance behaves…

As a result, the segments are returned. They now contain the clusterIDs. It is worth noting that a trajectory is always completely contained in a single cluster as all segments of the trajectory will be in the same clusterID as the segments touch with each other.

Now, in the example program, the TRACLUS algorithm is invoked as follows showing how powerfull the template approach is: We replace the distance with a function doing a distance lookup on a global data object containing precalculated distances:

```static UTSquareMatrix<double> D;
template<typename point>
double distance_lookup(std::vector<tLineSegment<point>> &L,size_t i, size_t j)
{
return D[i][j];
}

[...]
segments = traclus(coll,eps,(size_t) minLines, traclus_partition<ttraj>,distance_lookup<ttraj::value_type>,  p);```

In this way, distance computations can be precached and the user can perform interactive clustering in very short running times, especially, as we did not enforce any specific type of distance by integrating a spatial index.

In the next days, I will publish the complete source code of the sample along with a binary distirbution for Microsoft Windows 64-bit with which you can play around.

If you have comments, just drop me a line…