Posts Tagged ‘Zipf’s law’

As reported in my previous post on Debian Dependency Maps we started to study the properties of dependency relation and the kind of networks the relation can generate. One preliminary study we (me along with Arnab K. Ray and Rajiv Nair) posit a nonlinear model for the global analysis of data pertaining to the semantic network of a complex operating system (free and open-source software). While the distribution of links in the dependency network of this system is scale-free for the intermediate nodes, we found that the richest nodes deviate from this trend, and exhibit a nonlinearity-induced saturation effect. This also distinguishes the two directed networks of incoming and outgoing links from each other. The initial condition for a dynamic model, evolving towards the steady dependency distribution, determines the saturation properties of the mature scale-free network.

Here I give some of the motivations on conducting this study and some conclusions.
The full paper with all the technical details is uploaded at arxiv.org.

Scale-free distributions in complex networks have been very well studied by now. The ubiquity of scale-free properties is quite noteworthy, and spans across vastly diverse domains like (to name a few) the World Wide Web and the Internet, the social, ecological, biological and linguistic networks, income and wealth distributions, trade and business networks, and semantic networks.

It should occasion no surprise, therefore, that further developments have led to the discovery of scale-free features in the architecture of computer software as well. A recent
work has shown that the structure of object-oriented software is a heterogeneous network characterised by a power-law distribution. More in keeping with the purpose of this present paper, an earlier work on complex networks in software engineering had found evidence of power-law behaviour in the inter-package dependency networks in free and open-source software (FOSS). It is a matter of common knowledge that when it comes to installing a software package from the Debian GNU/Linux repository, many other packages — the “dependencies” — are also called for as prerequisites. This leads to a network of these dependencies, and every such package may be treated as a node in a network of dependency relationships. Each dependency relationship connecting any two packages (nodes) is treated as a link (an edge), and every link establishes a relation between a prior package and a posterior package, whereby the functions defined in the
prior package are called in the posterior package. This enables reuse (economy) of functions and eliminates duplicate development. As a result the whole operating system emerges as a coherent and stable semantic network. However, unlike other semantic networks, the network of nodes in the Debian repository is founded on a single relation spanning across all its nodes: Y depends on X; its inverse, X is required for Y .
So, given any particular node, its links (the relations with other nodes) can be of two types — incoming links and outgoing links — as a result of which, there will arise two distinct kinds of directed network. For the network of incoming links, a newly-reported work has empirically established the relevance of Zipf’s law and the conditions attendant on it in Debian GNU/Linux distribution. Carrying further along these very lines, the present study purports to analyse and model the finite-size effects in a FOSS network. There is a general appreciation that for any system with a finite size, the power-law trend is not manifested indefinitely, and in the context of the FOSS network, this is a matter that is recognised as one worthy of a more thorough investigation. Deviations from the power-law trend appear for both the heavily-linked and the sparsely-linked nodes. The former case corresponds to the distribution of a disproportionately high number of links connected to a very few special nodes — the so-called “top nodes” (or rich nodes). The importance of these nodes is, therefore, a self-evident fact.

The data needed for the modelling pertain to the current stable Debian release, Etch (Debian GNU/Linux 4.0). The respective networks of both the incoming links and the outgoing links span 18630 nodes (software packages).
The study argues for the signiﬁcance of non-linearity and saturation, as regards a quantitative characterisation of the incoming and outgoing distribution in the Debian GNU/Linux network. One might rightly expect to encounter similar features in other networks. And indeed, given the possibility that the entire network of software packages in an operating system can be construed to be a semantic (albeit non-autonomous) system, its characteristics can furnish a model that can shed light on much more complex but realistic autonomous semantic and cognitive systems, such as the human society, or
even the human mind.

In the road ahead, the gnowledge lab will conduct a similar study for the dependencies between concepts and activities as and when we obtain sufficient number of nodes at gnowledge.org. Currently we have only about 1000 dependency relations. As more people get to know about the need of establishing dependency relation between concepts, and as and when the portal itself matures with features to attract more users we can study the properties of the resulting knowledge network.

The full paper with all the technical details is uploaded at arxiv.org.