mcl-algorithm is a scalable cluster algorithm for graphs based on stochastic flow.
The flow process employed by the algorithm is mathematically sound and intrinsically tied to cluster structure in graphs, which is revealed as the imprint left by the process.
The threaded implementation has handled graphs of up to one million nodes within hours, and is widely used in the field of protein family analysis. It comes with a wide range of sibling utilities for handling and analyzing graphs, matrices, and clusterings.
The MCL algorithm simulates flow using (alternating) two simple algebraic operations on matrices. Its formulation is simple and elegant. There are no high-level procedural instructions for assembling, joining, or splitting of groups - cluster structure is bootstrapped via a flow process that is inherently affected by any cluster structure present.
The first operation used by MCL is expansion, which coincides with normal matrix multiplication. Expansion models the spreading out of flow, it becoming more homogeneous. The second is inflation, which is mathematically speaking a Hadamard power followed by a diagonal scaling. Inflation models the contraction of flow, it becoming thicker in regions of higher current and thinner in regions of lower current. The MCL process causes flow to spread out within natural clusters and evaporate inbetween different clusters.
By varying parameters, clusterings on different scales of granularity can be found. The number of clusters can not and need not be specified in advance, but the algorithm can be adapted to different contexts.
The issue 'how many clusters?' is not dealt with in an arbitrary manner, but rather by strong internal logic. Cluster structure leaves its marks on the flow process simulated by the algorithm, and the flow parameters control the granularity of the cluster imprint.
The limit of the MCL process (the process simulated by the algorithm) is in general extremely sparse, and the iterands are sparse in a weighted sense. This gives the means to scale the algorithm drastically, leading to a worst-case complexity of order Nk^2, where N is the number of nodes of the input graph, and where k is a threshold for the number of resources allocated per node.
The rate of convergence of the MCL process, and projection of the iterands afterwards onto the resulting clustering, give hooks for unsupervised parameter adjustment.
The iterands of the MCL process have structural properties which allow a cluster interpretation, and which generalize the mapping of MCL limits onto clusterings. The mathematics associated with the MCL process shows that there is an intrinsic relationship between the MCL process and cluster structure in graphs. This is very valuable given the many heuristic approaches in cluster analysis.
What's New in This Release: [ read full changelog ]
· The clustering program mcl is now considerably faster due to optimizations in the memory management code.
· Additionally, it utilises vanilla matrix/vector multiplication where this is faster than sparse matrix/vector multiplication.
· The speed gains range from slightly below twofold to sixfold depending on graph size and edge density.
· Large speed increases may be obtained for graphs of sizes up to several hundred thousands of nodes.
· The graph query program has acquired several edge weight output and summary statistics options, and documentation was fixed and updated.