STLAlgorithmExtensions/ClusteringAlgorithms

Boost Clustering Algorithm

The overall objective of the Boost Clustering Library is to provide a modern and portable C++ clustering algorithm library suitable for wide use and eventual standardization.

What is clustering?

A loose definition of clustering could be “the process of organizing objects into groups whose members are similar in some way”. A cluster is therefore a collection of objects which are “similar” between them and are “dissimilar” to the objects belonging to other clusters.

Clustering algorithms can be applied in many fields, for instance:

Marketing: finding groups of customers with similar behavior given a large database of customer data containing their properties and past buying records;
Biology: classification of plants and animals given their features;
Libraries: book ordering;
Insurance: identifying groups of motor insurance policy holders with a high average claim cost; identifying frauds;
City-planning: identifying groups of houses according to their house type, value and geographical location;
Earthquake studies: clustering observed earthquake epicenters to identify dangerous zones;
WWW: document classification; clustering weblog data to discover groups of similar access patterns.

Requirements

Scalability
Dealing with different types of attributes
Discovering clusters with arbitrary "points"
Minimal requirements for domain knowledge to determine input parameters
Ability to deal with noise and outliers
Insensitivity to order of input records
High dimensionality

Action items

Name space? Library name?
- Jon proposes boost::algorithm::cluster
- Anders proposes #include <boost/algorithm/cluster/algoname.hpp>

Define "point" concept. Multi-attribute data point? Tuple syntax? (Jon, Anders)
- Jon proposes N-Tuple concept, modeled after Boost.Tuple.
- Algorithm(s) would iterate on sequences of N-tuples.

Define cluster output mechanism. Specify output container type as template parameter? Use hash_map as default? (Jon, Anders)
- After talking with Doug Gregor, we should probably define a special Partition or Clustering object
- Partition object should be returned by value, and contain a smart ptr to partition data generated by algorithm
- Partition object can then be passed to other related algorithms. e.g. classify/add
- Partition object should be function object, usable in std::transform and other algorithms.

Implement dbscan (Jon)

Implement k-mean (Yuan)

Resolve questions about range query implementation (Jon)
- Currently should only be needed for dbscan.

Determine if we need both clustering and classification algorithms in the library (Anders)

Can we add points to an existing cluster? (Jon, Anders)
- Current thinking is Yes. Need to define interface for this.

Write design rationale and other docs.
- Post current iters on wiki?

Create sandbox for code and docs in svn.
- Directory created: https://svn.boost.org/svn/boost/sandbox/boost/algorithm/cluster

Create tests for each algorithm.

Algorithms

DBSCAN

K-Means

Concepts

The relevant concepts of the library are specified in the following sub sections.

Point

Should support a Distance Comparable concept.
p is a point instance of type Point
- get<N>(p) should be a valid expression that return the N'th element of the point.
- length< D >::value should be a valid expression that return the number of elements (arrity) of the type P.

Distance Function

A Distance Function is a type that is callable (like a standard C++ function) with two arguments of types that model the Point concept. The Distance Function should return a distance of type Ret between the points (using some metric).

A valid expression would be: ret = f(t1, t2);
- f is of type F
- t1 is of type T1 and T1 is a model of Point
- t2 is of type T2 and T2 is a model of Point
- ret is of type Ret

Return type should be LessThanComparable? (http://www.sgi.com/tech/stl/LessThanComparable.html).

Open issues:
- Should the name of the concept be Distance Function?
  - Anders: I'm in favor of this name.
- Should we require T1 == T2?
  - Anders: I don't see any reason to do that if we require T1 and T2 to model Point. It will probably be a rare and esoteric case if they were different though.
  - Jon: Since the points come from a sequence, they will be the same in practice. However, I have no issue w/ allowing T1 != T2. I don't *think* it can hurt.
- Should we allow two different invocations of f to return different results?
  - Anders: Is this ever going to be use full? I would like to add a requirement that two calls with the same arguments should all ways return the same result. If we don't have that requirement it will be impossible to cache distance function values in the algorithms.
  - Jon: Same value, or same type? Should probably be both for any given invocation.

Supporting Objects

cluster_data

Should model the Container concept (http://www.sgi.com/tech/stl/Container.html)
Contains clusters, which are in turn containers of tuples.

Disclaimer: This site not officially maintained by Boost Developers