Wikis as Social Networks: Evolution and Dynamics Ralf Klamma and Christian Haasler RWTH Aachen University Information Systems Ahornstr. 55, 52056 Aachen, Germany (klamma|haasler)

Abstract. Despite the enormous success of wikis in public and corporate knowledge sharing projects, we do not know much about the evolution and dynamics of wikis. Our approach is to analyze wikis as social networks and apply dynamic network analysis to them. In our prototypical environment we handle the complex data management problems arising when dealing with different wiki engines and different sizes of wiki dumps. The analysis and visualization of evolving wiki networks allow wiki stakeholders to research the social dynamics of their wikis.



Wikis have become very successful in huge collaborative projects like Wikipedia, an encyclopedia with millions of entries in hundreds of languages edited by a countless crowd of editors. Wikis have also been introduced in organizational settings as Web 2.0 tools for knowledge sharing and project management [1–3]. Therefore, questions such as what makes a wiki project particularly successful have become very interesting for research and practice. Many added-value services and businesses have been built around the wiki concept, which has led to a big variety of wiki software and wiki hosting services. Even if one can start a wiki with a simple ‘wiki-on-a-stick’ solution like TiddlyWiki, the maintenance of large collaborative wikis demands more elaborate platforms. Two well-known providers of wiki platforms based on the MediaWiki engine are the Wikimedia Foundation and the hosting service Wikia. A variety of wiki projects run on the MediaWiki engine, e.g. Wikipedia. We concentrate on this engine here. Due to the chosen architecture, other engines can be supported as well. We demonstrate this with the TikiWiki engine. The enormous number of public and organizational wikis has created a long tail [4] of wikis. Besides the very successful and very visible wiki-based knowledge creation and sharing projects, there are many others with far fewer editors and edits. If we want to give wiki stakeholders tools for analyzing the dynamics and the evolution of their wikis, we have to deal with different wiki hosting software, with wikis ranging from a few hundred nodes to millions of nodes, with

wiki dumps from unreliable public and corporate sources, with data management problems and with complex algorithmic problems. In particular, hosting the analysis and visualization of different wiki engines for the whole long tail of wikis is still a true challenge. Social networks in computer-mediated communication have also drawn a lot of scientific attention, e.g. [5–7, 1]. In this paper we concentrate on the dynamic analysis of wikis, especially dynamic network analysis (DNA). DNA [8] is an emerging area of science that advances traditional social network analysis (SNA) with the idea that networks evolve over time through changes of nodes and changes of links between nodes. We argue in our paper that DNA is applicable to wikis. For wiki users, wiki managers, and wiki hosting services it is extremely important to know whether a wiki is still going to grow in numbers of authors, edits and articles or whether it is entering a phase of stagnation. It is important to know if and when non-existing articles will be created and edited by users. When a node (a wiki page, an editor, a URL) is ‘important’ at the moment, will it stay important over the lifetime of the wiki or will its importance change over time? If a network is heterogeneous, will it become homogeneous after a while or will it stay that way forever? In the Web 2.0 not only wikis but also other new media have become tremendously successful [9–11]. By developing standard operations for handling Web 2.0 data analysis and visualization we hope to encourage communities to apply dynamic network analysis, thus increasing their agency in a world where we leave billions of virtual footprints day by day. To serve the needs of different stakeholders and communities in DNA we have developed a framework called the MediaBase [12].
A MediaBase consists of three elements: (a) a collection of crawlers specialized for distinct Web 2.0 media like blogs, wikis, pods, feeds, and so on; (b) multimedia databases fed by the crawlers, with a common metamodel for all the different media, artifacts, actors, and communities, leading to a community-oriented cross-media repository; (c) a collection of web-based analysis and visualization tools for DNA. MediaBases are available for technology enhanced learning communities, for German cultural science communities, and for the cultural heritage management of the UNESCO world heritage site Bamiyan Valley in Afghanistan. The WikiWatcher introduced in this paper is part of the MediaBase. The rest of the paper is organized as follows. In Section 2 we analyze prior approaches and open issues. In Section 3 we characterize wikis as social networks where DNA is applicable. Section 4 describes the design and architecture of our software prototype WikiWatcher. In Section 5 we present the main results of our analysis of different wikis. We conclude our paper with a discussion and an outlook on further research.


Static Analysis of Wikis

A substantial body of literature on the analysis of wikis already exists, and most studies concentrate on static aspects. In general, these studies can be divided into studies that use the publicly available wiki data (dumps) themselves and studies that use additional data such as access log files [13]. In this paper, we concentrate on the analysis of publicly available wiki dumps. In this regard, we can further distinguish studies concentrating on the static analysis of wiki dumps [14, 15] from those concentrating on dynamic aspects. But first, let us start with a well-known example: Wikipedia, the most researched wiki. We mention only a few studies to characterize the scientific progress in the DNA of wikis. Among the first comprehensive studies of Wikipedia was the 2005 study by Voß [14], which measured Wikipedia according to its network characteristics. In particular, the article covered the changes in the size of the wiki database and in the number of articles, words, users and links. Among other things, Voß found that the distribution of links is scale-free, in line with growth and preferential attachment [16]. Wilkinson and Huberman evaluated the collaboration of the Wikipedia community qualitatively. They showed that the accretion of edits to an article is described by a simple stochastic mechanism, resulting in a heavy tail of highly visible articles with a large number of edits [17]. They found that the quality of an article depends on the number of its modifications. Kittur et al. [7] examined the success of Wikipedia. In particular, they analyzed whether the work is done by a great number of contributors who each deal with only a few articles (‘wisdom of the crowd’) or by a small elite group of contributors that has the lion’s share (‘power of the few’). The latter is true.
The work of Priedhorsky et al. [13] builds on this qualitative view of Wikipedia. They dealt with vandalism in Wikipedia articles. For this purpose, two types of information were used: the Wikipedia articles themselves and their log files. In this way it could be measured which article revisions the visitors had viewed and whether each was an intact or a damaged version. The researchers aimed to quantify the influence of article edits and revisions on the visitors. The number of vandalized pages viewed by real readers is extremely low. Further research classified users with respect to their position in online communities like wikis. Despite Wikipedia’s equal treatment of editors, some members seem to take a leading role [18].

Most research in this area targets Wikipedia and is not ready for arbitrary wikis. A general concept for handling and analyzing any wiki is missing. Wikis are applied in a variety of social and organizational environments, and it would be useful to obtain methods and tools for interpreting the social structures that emerge there. Hence, the motivations of this work are to provide a view on wikis as social networks, to build formal network models, to apply SNA measures, to visualize wiki networks, and to consider the dynamic aspect of wiki networks. The data basis of most projects and studies consists of wiki log files or direct database access. In keeping with the ‘open’ wiki concept, this work uses only public wiki data. Most wikis offer automatically generated dumps, which can also be requested, e.g. via the MediaWiki page Special:Export. Articles, links and references as well as authors are treated as network components (actors), not only as a growing number. We apply Actor-Network Theory [19] for the data management, i.e. we do not differentiate between human and nonhuman actors and we can aggregate groups of actors as a new actor. Actor relationships and dependencies evolve during any given period. The static network structure of classic SNA is extended accordingly. Qualitative characteristics such as a leading role in the community become measurable by network analysis methods. Characteristics like scale-freeness, hierarchical structures, shortest paths and centrality (‘importance’) of network components can be measured and analyzed in a time series context. Thus, it is possible to illustrate social change and evolution in wiki networks. Collaborative work, which plays a fundamental role in wikis, can now be visualized by means of dynamic network visualization. Not only ‘clinical’ numbers and measured values but also graph visualization help to identify strategic actors and their activities. These possibilities help in dealing with wiki actors, e.g. in a social, economic or security-relevant way. Our concept and our implementation are introduced in the following.


Wikis as Dynamic Networks

Social science deals with the analysis of relations between different kinds of actors, such as single persons, interacting groups or organisations. Social network analysis is concerned with patterns of relationships between social actors [20]. Social networks can be seen as constructs of relations and entities like actors and artifacts. Wikis conform to these qualitative aspects of social networks. The main idea of ‘writing articles in common’ by Ward Cunningham [21] can be realized only if wiki users collaborate in creating, modifying and maintaining articles. These writing processes imply different kinds of social networks. Wiki users as well as wiki pages can be seen as objects in a social network that help to achieve the aim of establishing the wiki. Writing articles in common creates relations between the participating authors and hence edges between author nodes in the network. Wiki pages (articles) and their link structure can only be sustained by wiki users. Like most networks in social science, our different kinds of wiki networks evolve during the editing process. At this point one can see the restriction of SNA: its lack of dynamic components becomes noticeable. Social networks exhibit characteristics like growth and adjustment. To address this challenge we apply dynamic network analysis. The static view is enhanced by a dynamic one, and the evolution process takes the agency and behavior of the network actors into account. This is realized by adding one or more time parameters to the networks. The corresponding models are introduced in the next sections. First, a well-known classification of network topology is introduced briefly. Existing empirical and theoretical results indicate that complex networks can be divided into the two major classes of homogeneous and heterogeneous networks [22]. This classification is based on the connectivity distribution $P(k)$

which gives the probability that an arbitrary node is connected to $k$ other nodes [22]. Homogeneous networks are characterized by almost the same number of links at each node. In contrast, heterogeneous networks are often characterized by the existence of clusters, i.e. the aggregation of nodes. Furthermore, their degree distribution is such that not all nodes in the network have the same number of edges [23]. The corresponding distribution function follows the power law $P(k) \sim k^{-\gamma}$ [22, 24]. Some seminal indices for determining the ‘important components’ in social networks are centrality indices [25]. They quantify central nodes, which can often be spotted intuitively in the visualized networks. Degree centrality, which refers to the distribution of links as described above, is one of the simplest measures of the influence of a node on its neighbors. For an undirected graph, $d(v)$ is the number of edges adjacent to node $v$; analogously, $d^-(v)$ and $d^+(v)$ are used for directed graphs. Closeness centrality measures the closeness of a node to all other nodes in the network [26]. In contrast to degree centrality, there is no local restriction any more. With $d(v, t)$ denoting the distance from $v$ to a node $t \in V$, the closeness centrality $C_C(v)$ of a node $v$ is defined as

$$C_C(v) = \frac{1}{\sum_{t \in V} d(v, t)}$$

Betweenness centrality is based on shortest-path measurements. It indicates which nodes have a strong influence on the network: they control the information flow through the network since many shortest paths go through them [26]. With $\sigma_{st}$ denoting the number of shortest paths between nodes $s$ and $t$, and $\sigma_{st}(v)$ the number of those paths that pass through $v$, betweenness centrality is defined as

$$C_B(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}$$
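The three indices defined above can be illustrated on a small undirected, unweighted graph. The following is a minimal sketch, not the WikiWatcher implementation (which relies on yFiles): the adjacency-dictionary representation and all function names are assumptions made for this example, the graph is assumed to be connected, and each unordered pair of endpoints is counted once in the betweenness sum. Betweenness is computed with Brandes' accumulation scheme.

```python
from collections import deque


def degree_centrality(adj):
    """d(v): number of edges adjacent to v in an undirected graph."""
    return {v: len(neighbors) for v, neighbors in adj.items()}


def bfs_distances(adj, source):
    """Hop distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def closeness_centrality(adj):
    """C_C(v) = 1 / sum_t d(v, t); assumes a connected graph."""
    result = {}
    for v in adj:
        total = sum(bfs_distances(adj, v).values())
        result[v] = 1.0 / total if total > 0 else 0.0
    return result


def betweenness_centrality(adj):
    """C_B(v) = sum over s != v != t of sigma_st(v) / sigma_st."""
    cb = {v: 0.0 for v in adj}
    for s in adj:
        order = []                       # nodes in order of discovery
        preds = {v: [] for v in adj}     # predecessors on shortest paths
        sigma = {v: 0 for v in adj}      # number of shortest paths from s
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate pair dependencies from the farthest nodes back to s.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                cb[w] += delta[w]
    # Every unordered pair {s, t} was visited from both ends: halve.
    return {v: value / 2.0 for v, value in cb.items()}
```

For the path graph `{"a": ["b"], "b": ["a", "c"], "c": ["b"]}`, node `b` has degree 2 and lies on the only shortest path between `a` and `c`, so it receives the highest value under all three indices.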

With respect to the interdependent wiki actors, two network models are established. While the term network refers to the informal concept describing an object composed of elements and interactions or connections between these elements, the natural means to model networks mathematically is provided by the notion of graphs [27]. An article graph is such a graph with directed edges. It can be constructed intuitively by considering wikis as a part of the World Wide Web (WWW). Each wiki article (‘page’) induces a node that is labeled by the page title and its namespace. As one can observe in the XML dumps as well as in the page URLs, almost every page is denoted by its namespace and title (separated by a colon). Namespaces help to group wiki pages. For example, the Wikipedia page Category:Football denotes the category page of football. Page names without a namespace prefix refer to the main namespace of the wiki (denoted by ARTICLE in the following). As in the WWW, articles are linked to each other.

Links can be set arbitrarily by wiki users, either to other articles or to external resources like ‘normal’ web pages. Furthermore, it is possible to set links to other wiki articles that do not exist yet. In the standard wiki theme those links are colored red. Due to the evolution of a wiki and its time-dependent graphs, four different types of nodes have to be considered:

– ‘Normal’ article nodes, type existing: They already exist in the wiki and in the XML dump. They have a text body and at least one revision.
– Article nodes of type requested: They refer to requested articles to which a link is set in another article. A usual way to create new articles is to set a link to them in special pages called Seed or Sandbox. Requested articles change their type to existing at some later point in the XML dump.
– Article nodes of type never exists: Wiki dumps correspond to a certain time period that begins at the creation of the wiki and ends at the moment of the dump creation. Never existing articles can be seen as part of the set of requested articles, but in contrast to them they are not created before the end of the wiki dump.
– URL nodes: They refer to URL artifacts that are referenced in the text body.

Naturally, the last three node types only possess incoming edges. The set of all nodes is denoted by $V_{article}$, the set of edges by $E_{article}$. Since the graph depends on a certain timestamp, it is defined as $G_{article}(t) = (V_{article}, E_{article})$ where $t \in TS$ with $TS$ a set of timestamps. The ‘oldest’ element corresponds to the wiki creation, the ‘youngest’ one to the point of the wiki dump.

Author graphs cannot be perceived in such an intuitive way as article graphs. They are built on the collaboration of wiki users (authors). The main idea of modifying wiki articles is the equality of wiki authors. Every web user is allowed to participate.
Of course, due to vandalism there are some exceptions and restrictions for some articles with more or less sensitive content. Furthermore, just a few users have special admin rights, but this does not affect the model. A simple classification divides authors into two types, anonymous authors and registered authors. Anonymous authors are denoted by their IP address, registered authors by their username. Consequently, the node set $V_{author}$ contains all authors involved in the wiki. A social relation (undirected edge) between two authors arises when they have worked in common on a wiki article, i.e. through an intentional or unintentional collaboration by modifying the text body. $E_{author}$ denotes the set of the collaboration edges. Due to the high dynamics and growth of a wiki, an author graph is time-dependent, too. In contrast to article graphs, author graphs have two timestamps $t_0, t_1 \in TS$ as input parameters. Thus, $G_{author}(t_0, t_1) = (V_{author}, E_{author})$ determines the graph in which those nodes are connected whose authors have worked on a common article during the given time period. According to the introduced wiki graphs and network models a system database was established. It considers all entities and their dependencies that are required to generate author networks and article networks.
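The construction of the author graph for a time window can be sketched as follows. This is an illustrative sketch under stated assumptions, not the WikiWatcher code: revisions are assumed to be given as `(article, author, timestamp)` tuples with ISO 8601 timestamps in a uniform format (so that string comparison matches chronological order), the function name `author_graph` is hypothetical, and only authors active within the window are returned as nodes.

```python
from itertools import combinations


def author_graph(revisions, t0, t1):
    """Build G_author(t0, t1) from (article, author, timestamp) tuples.

    Two authors are joined by an undirected edge if both modified the
    same article at some point within [t0, t1].
    """
    authors_per_article = {}
    for article, author, timestamp in revisions:
        if t0 <= timestamp <= t1:
            authors_per_article.setdefault(article, set()).add(author)

    nodes = set()
    edges = set()
    for authors in authors_per_article.values():
        nodes |= authors
        # Sorting gives each undirected edge one canonical orientation.
        for a, b in combinations(sorted(authors), 2):
            edges.add((a, b))
    return nodes, edges
```

Anonymous authors would simply appear with their IP address as the author label; widening the window `[t0, t1]` can only add nodes and edges, which is what makes a time series of such graphs comparable.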

Figure 1 shows the corresponding entity relationship diagram.

Fig. 1. Entity relationship diagram (entities: wiki author, wiki article, article revision; relationships: isA, modified, refersTo; keys: author_id, revision_id).

The entities wiki author and wiki article take center stage. They correspond to the network nodes described above. Articles can be identified by their labels consisting of namespace and title. The attribute type reflects one of the three article node types. An article revision is a previous version of an article, but also the current article itself. Due to the wiki concept every article revision is saved in the wiki database. Revisions are additionally identified by their revision timestamp. Their size on disk can be an interesting attribute, too. Every article revision may contain an arbitrary number of links in its text body to other articles, but not to article revisions. This follows from the link format, which does not contain any revision or time information, e.g. [[Mathematics]] and [[Category:Science]]. Each article text body may also contain web links of the format [http://…] (the square brackets are optional). They correspond to the entity URL, which can be part of an arbitrary number of revisions. Last but not least, every wiki author can be identified by name and type (anonymous/registered). Every author may have created or modified 0 to n article revisions, but every revision was modified by exactly one author.
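The entities and dependencies described above can be turned into a relational schema along the following lines. This is a hedged sketch only: the paper's system database runs on IBM DB2, whereas SQLite is used here as a stand-in so that the example is self-contained, and all table and column names are assumptions chosen for illustration.

```python
import sqlite3

# Hypothetical schema mirroring the entities of the ER diagram.
SCHEMA = """
CREATE TABLE authors (
    author_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL UNIQUE,
    type      TEXT NOT NULL CHECK (type IN ('anonymous', 'registered'))
);
CREATE TABLE pages (
    page_id   INTEGER PRIMARY KEY,
    namespace TEXT NOT NULL,
    title     TEXT NOT NULL,
    type      TEXT NOT NULL
              CHECK (type IN ('existing', 'requested', 'never_exists')),
    UNIQUE (namespace, title)
);
CREATE TABLE revisions (
    revision_id INTEGER PRIMARY KEY,
    page_id     INTEGER NOT NULL REFERENCES pages(page_id),
    author_id   INTEGER NOT NULL REFERENCES authors(author_id),
    timestamp   TEXT NOT NULL,
    size        INTEGER
);
CREATE TABLE urls (
    url_id INTEGER PRIMARY KEY,
    url    TEXT NOT NULL UNIQUE
);
CREATE TABLE revision_links (
    revision_id    INTEGER NOT NULL REFERENCES revisions(revision_id),
    target_page_id INTEGER NOT NULL REFERENCES pages(page_id)
);
CREATE TABLE revision_urls (
    revision_id INTEGER NOT NULL REFERENCES revisions(revision_id),
    url_id      INTEGER NOT NULL REFERENCES urls(url_id)
);
"""


def create_database(path=":memory:"):
    """Create the sketch schema and return the open connection."""
    conn = sqlite3.connect(path)
    # SQLite enforces foreign keys only when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn
```

The `UNIQUE` constraints absorb the redundancy of author names and page titles in a dump, and the foreign keys encode that every revision belongs to exactly one author and one page, so a revision row cannot be stored before its author row exists.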


Design and Architecture

For the realization of the dynamic network analysis of wikis a two-stage system was developed. As mentioned before, it is called WikiWatcher. WikiWatcher consists of a prototype that contains parsing modules for extracting the required network data from the dump files. Furthermore, a graphical user interface was built that offers functions for the generation, visualization and measurement of wiki networks. The data used by the system derives from wikis running the MediaWiki engine and conforms to the introduced network models. The conceptual design of the system is shown in Figure 2.

Fig. 2. Conceptual design of the system (stage 1: generating XML dump/export files, parsing wiki data and transferring authors, article pages, URLs and revisions into the relational database; stage 2: generating networks, metadata visualization, network analysis).

Stage 1 realizes the SAX-based parser tool that takes XML dumps as input data. It extracts the data needed for generating networks later on and transfers them into a system database. The system database serves as an interface between both stages. Stage 2 uses the network data previously stored in the system database. The prototype on this stage offers functions for generating wiki networks according to input parameters that correspond to the network models mentioned before. It also visualizes networks and applies methods and algorithms of social network analysis to them. These methods support the verification of assumptions and hypotheses about the behavior and characteristics of wiki networks and help to accomplish the dynamic network analysis.

Perl was chosen as the programming language for stage 1 because of its convenient existing modules and its support for regular expressions. One of the most important questions was how to extract network information from a wiki. Due to the tremendous amount of information in wikis like Wikipedia, Wikiversity or Wikia Search, the most feasible and effective way was to use the SAX standard (Simple API for XML). It makes it possible to parse XML documents in linear time and constant memory space. These properties are essential when treating XML dumps whose size is in the range of several gigabytes. In general, the SAX parser works in an event-based manner, i.e. whenever a certain XML tag or attribute is encountered the parser invokes user-defined methods. Entity or attribute

values can be read and prepared for further processing. Afterwards the memory is cleared and can be reused. The parser tool of WikiWatcher considers in particular the dynamic aspect of wikis. This means that it processes only dumps that were generated with the option full. These dumps contain wiki pages with all their revisions and timestamps. The schema of the system database that was built for storing the wiki information follows the ER diagram. For example, one dependency between relations specifies that each revision is made by exactly one author. Thus, the authors table has to be filled before the revisions table can be filled. Hence, the parser is divided into sub-modules that take care of these restrictions. After extracting some ‘heading information’ like the wiki name, wiki URL, namespaces etc., the first pass through the XML dump extracts all participating authors. Because redundancies occur ‘naturally’ in XML documents, some database constraints have to be set. In wiki dumps, for instance, author names and IP addresses are stored redundantly. Another problem occurs when considering the different types of article nodes that result from the evolution of a wiki. This also affects the parsing sequence. The text body of a revision may contain links to other articles (pages) that do not exist at this point in time. Because of the sequential parsing process, two cases can arise: the article occurs later on, or the article never exists in the whole dump. To obtain all articles with their types (existing, requested, never existing) and to store them in the pages table, there must be a first pass that collects the existing pages and a second pass through the text body of each revision that scans for links to requested articles (‘timestamp of the current revision must be smaller than the timestamp of the first revision of the requested article’) and for links to articles that will not exist at all.
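The two-pass typing of article nodes described above can be sketched with an event-based parser. The original stage 1 tool is written in Perl; the following Python sketch uses the standard `xml.sax` module instead and is an assumption-laden illustration, not the authors' code. It assumes the element names `page`, `title`, `revision` and `timestamp` of a MediaWiki export, uses the bare title without namespace handling as page label, and assumes ISO 8601 timestamps so that string comparison matches chronological order. The names `FirstRevisionHandler`, `first_revision_times` and `classify_link` are hypothetical.

```python
import xml.sax


class FirstRevisionHandler(xml.sax.ContentHandler):
    """First pass: record the timestamp of each page's earliest revision."""

    def __init__(self):
        super().__init__()
        self.first_seen = {}   # page title -> earliest revision timestamp
        self._title = None
        self._capture = False
        self._buffer = []

    def startElement(self, name, attrs):
        # Only short fields are buffered; revision text is never kept,
        # so memory use does not grow with the size of the dump.
        self._capture = name in ("title", "timestamp")
        self._buffer = []

    def characters(self, content):
        if self._capture:
            self._buffer.append(content)

    def endElement(self, name):
        text = "".join(self._buffer).strip()
        if name == "title":
            self._title = text
        elif name == "timestamp":
            known = self.first_seen.get(self._title)
            if known is None or text < known:
                self.first_seen[self._title] = text
        self._capture = False
        self._buffer = []


def first_revision_times(xml_bytes):
    """Run the first pass over an XML dump given as bytes."""
    handler = FirstRevisionHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.first_seen


def classify_link(target, revision_timestamp, first_seen):
    """Second pass: type of the node a link points to at revision time."""
    created = first_seen.get(target)
    if created is None:
        return "never exists"
    if revision_timestamp < created:
        return "requested"
    return "existing"
```

For dumps on disk, `xml.sax.parse(path, handler)` streams the file instead of loading it; the dictionary of first revision times is the only state that grows, and it grows with the number of pages rather than with the size of the dump.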
While scanning the revision text, all occurring URLs are stored in the system database. The last step is to store further revision data like timestamps, size and participating authors (their IDs) and to save the link structure from each revision to other articles and URLs (with references to previously saved data records). Scanning links and URLs requires appropriate regular expressions. Links may have the form [[Football]], [[Football (Soccer)|Football]] or [http://some.url]. One challenge was that the MediaWiki software allows links to external web pages to be set either with or without square brackets. For URLs without brackets, trailing symbols like question marks, exclamation marks or commas are not considered part of the generated web page link, but ambiguities may occur. Another challenge is posed by typing or syntax errors made by wiki users. Malformed ‘links’ like [[Football] or [[]] may appear and result in incorrect data records. When editing Wikipedia articles, users are often required to preview a modification before saving it, but this problem cannot be avoided completely. Because XML dumps are treated as sequential data streams, the time complexity is O(n) and the space complexity is O(1), with n the length of the XML document. For the representation of wiki networks the

Java graph toolkit yFiles has been integrated into WikiWatcher on stage 2. It provides classes and methods for the generation, visualization and measurement of networks. A graphical user interface was built that provides elements for choosing parameters like namespaces, node types (article networks), author types (author networks) as well as timestamps and measurement methods. It further provides functions to compute mean values, standard deviations etc. as well as functions to export visualized networks into common formats. A general problem in research is the representation and visualization of very large networks with more than 10,000 nodes and edges. While the parser on stage 1 runs in linear time and constant space, some measurement methods in stage 2, such as centrality indices, are costly to compute [28]. For unweighted graphs the complexity of computing betweenness centrality using yFiles is $O(|V| \cdot |E|)$, and that of closeness centrality is $O(|V|^2 + |V| \cdot |E|)$. The system database that serves as an intermediary between both stages is realized with IBM DB2.
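The link scanning performed on stage 1 depends on regular expressions for the link forms listed earlier in this section. The following sketch shows one possible pair of patterns; they are assumptions made for illustration and not the expressions used in the Perl parser. Malformed links such as [[Football] or [[]] are skipped, and trailing punctuation is stripped from URLs, which can be wrong for the rare URL that legitimately ends in such a character.

```python
import re

# [[Target]] or [[Target|label]]; the target must be non-empty and
# must not contain brackets or a pipe.
WIKI_LINK = re.compile(r"\[\[([^\[\]|]+)(?:\|[^\[\]]*)?\]\]")

# External URLs, with or without surrounding square brackets.
URL = re.compile(r"https?://[^\s\[\]<>\"]+")

# Punctuation that is not treated as part of a URL when it ends one.
TRAILING_PUNCTUATION = ".,;:!?"


def extract_links(text):
    """Return (article link targets, external URLs) found in wiki text."""
    targets = []
    for match in WIKI_LINK.finditer(text):
        target = match.group(1).strip()
        if target:
            targets.append(target)
    urls = [m.group(0).rstrip(TRAILING_PUNCTUATION) for m in URL.finditer(text)]
    return targets, urls
```

A namespace prefix such as `Category:` stays part of the returned target, so the caller can split it at the first colon to obtain namespace and title.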


Dynamic Network Analysis of Wikis

The characteristics of wiki networks, their behavior, rates of change and distinctive features during the evolution process are the basis of the DNA. The developed prototype WikiWatcher makes it possible to conduct the DNA by offering methods for the measurement and representation of wiki networks. It permits the verification of assumptions about network characteristics. Not only the stage 2 prototype but also the system database can be used to query and analyze wiki data. While the measurement data at stage 2 mainly refer to network structure, the information gained at the database level refers more to the network dimension of wikis. Structural aspects correspond to characteristics like centrality, clustering, network diameter or shortest-path issues, whereas dimensional aspects cover properties like the number of articles and authors, the number of modifications, the size of articles as well as their rates of change. We start with a couple of well-known ideas and come to some newer hypotheses later.

The rate of new authors/articles entering a wiki network falls off after a period of time. The idea stems from the notion of a ‘foundation fever’ in a wiki. Figure 3 shows the growth rate of both the number of authors and the number of articles for a few wikis. In general, the assumption cannot be verified: a fall-off in the rate of growth could not be determined in either case. The characteristics of growth may depend on semantic aspects of a wiki, e.g. current events that motivate new users to write new articles. This has to be examined for each wiki individually. In the case of Wikia Search the cause seems to be clear. It went online for the public in January 2008, which is visible as a sharp bend in both network types. The measurements of Wikipedia (Simple English) show a progressive growth rate, whereas the rate fluctuates for Wikiversity and shows sudden jumps in other wikis. Because Wikiversity’s articles are strongly categorised, further namespaces are included.
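The growth of the number of authors or articles can be derived from the timestamp at which each entity first appears in the dump. The sketch below is one simple way to do this and is not necessarily how the curves in Figure 3 were produced: it assumes a mapping from entity label to the ISO 8601 timestamp of its first appearance and aggregates per calendar month; the function name `monthly_growth` is hypothetical.

```python
from collections import Counter


def monthly_growth(first_seen):
    """Return (month, new entities, cumulative entities) rows.

    first_seen maps an author or article label to the ISO 8601
    timestamp of its first appearance, e.g. '2008-01-15T10:00:00Z'.
    Months without any new entity are omitted from the result.
    """
    new_per_month = Counter(ts[:7] for ts in first_seen.values())
    rows = []
    total = 0
    for month in sorted(new_per_month):
        total += new_per_month[month]
        rows.append((month, new_per_month[month], total))
    return rows
```

A fall-off in growth would show up as a declining second column over consecutive months, while a launch event such as the one described for Wikia Search would appear as a sudden jump.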
There is a remarkable observation that was not anticipated when considering both growth rates separately: new authors joining a wiki mostly means new articles, not work on already existing articles.

Fig. 3. Rate of growth (author/article networks).

Wiki networks are heterogeneous during the whole evolution process. In homogeneous networks the number of links $k$ per node is about the average $\langle k \rangle$ [22]. Such a uniform distribution could not be verified in (social) wiki networks. Measuring the degree centrality showed an imbalance between the network nodes in terms of their links. As in many social structures, a small portion of actors have an above-average number of links and do most of the work, i.e. editing articles and establishing new relations. This is shown in Figure 4, where two author networks are given (left, center). To make contact with other users, one needs to edit a lot of articles, but users of this kind are the minority. This distinctive heterogeneity occurs not only in author networks but also in article networks (see Figure 4, right).

Fig. 4. Heterogeneous author/article networks

For article networks this is shown in Figure 5 by using the degree centrality. Incoming as well as outgoing article edges (links) are observed over a certain time period. In all considered wikis the measurements showed a continuously high standard deviation of the number of edges per node. Depending on semantic issues, there may be a very high standard deviation of outgoing links. This is the case in the Aachen Wiki, which serves as an information wiki for the city of Aachen and as an index, which naturally has many outgoing references.

Fig. 5. Article networks: degree centrality and standard deviation

Central nodes hold their important role during the evolution process. As described, the ‘importance’ of a node can be determined by using the betweenness centrality. A high value means that many shortest paths in the network go through this node. This measurement was done for Wikia Search for the time period August 2004 to August 2005. The left side of Figure 6 gives the betweenness centrality of every registered author over time (unnormalized for a better view). As with the degree centrality, only a small part of the authors have a high betweenness centrality. In general they hold or increase their high value during the evolution process. The same observation holds for article networks. The right side of Figure 6 shows the betweenness centrality for Jabber Wiki, a wiki about Jabber, as the name suggests. One of the most evident characteristics of wiki networks is the



Fig. 6. Betweenness centrality of author and article networks (vertical axes: betweenness centrality $C_B$; left panel: registered authors; right panel: articles).

heterogeneity during their entire evolution processes. In homogeneous networks the number of links k per node is close to the average ⟨k⟩ [22]. Such homogeneous structures do not appear in wiki author or wiki article networks. Instead, there are a few nodes with many adjacent edges and plenty of nodes with only a few edges. Figure 7 gives an impression of heterogeneous networks. The author network (circular layout) is a collaboration network of anonymous and registered users of the BerlinWiki, which is hosted on Wikia. The article network (organic layout) gives the status of the German Wikia itself in May 2008 including all namespaces. Links to requested but non-existent articles as well as URLs are excluded because such nodes have only incoming edges. Even so, a strongly unbalanced edge distribution can be observed.

Fig. 7. Heterogeneous author/article networks

The exponential distribution in wiki networks during the whole evolution is also shown by the betweenness and degree centrality indices. Furthermore, the standard deviation of the number of edges points to these characteristics. It can also be explained semantically by considering the ‘intention’ of certain articles. For instance, a few articles have the character of an index, with many outgoing links and references. Given this, a natural step is to divide (registered) authors into two classes. The distinguishing criterion is how intensively an author participates in articles across the whole wiki. The system database outputs the number of revisions for every author. With the authors ordered by this number, a line could be drawn where the discrepancy between the authors' revision numbers was high. For example, in the Simple English Wikipedia only 377 authors did 93% of the work (revisions), while almost 5,000 authors did only a small part of it (7%). This phenomenon could be observed in all considered wikis, independent of their size. A predominant number of revisions can be allocated to a small group of authors. After a short period of time a small group of users gathers around an article. Registered authors often serve as ‘connectors’ of anonymous author network components. This is another phenomenon that could be discovered in the connections between registered and anonymous wiki users. Working with a number of wikis has shown that it has to be verified for each wiki individually. In the given example of Wikia Search (see left side of Figure 8), it is remarkable that anonymous authors

can be identified in a certain way. Although they can only be spotted by their

Fig. 8. ‘Connectors’ in author and article networks

IP addresses, one can divide them after a period of time into single groups or graph components that are completely separated from each other. For further research it would be interesting to decompose the IP addresses and to allocate them geographically. In this way, one could observe which addresses belong to single users. By adding registered authors to an anonymous author network one obtains only one strongly connected network component. The Wikia Search example gives the state of the anonymous author network with t0 = July 15, 2004 and t1 = January 7, 2008 (shortly before the ‘official start’). Nodes with a high betweenness rate are gateways to the rest of the web. As shown on the right side of Figure 8, which gives the state of the article-URL network of the AachenWiki in May 2008, there are a few article nodes with many outgoing edges to external resources (web pages). These articles can be important as ‘connectors’ to the WWW. An interesting question is whether there is a correlation between these articles with a high degree centrality to external pages and articles with a high betweenness centrality to other wiki articles. Articles with a high CB control the information flow within a network. They are important for navigating through the wiki and must be protected against vandalism and other damage. Figure 9 covers ten of the most important article nodes of the Unofficial Google Wiki from January to December 2007. The wiki is hosted on Wikia. First, one can observe that every node keeps an almost constant betweenness centrality during the whole period. At this point a nice side effect emerges: vandalism was detected by using the model. On July 31, 2007 (see month 8) the content of the main article Google Wiki was deleted completely. This of course means that all edges to other articles vanished. Hence no shortest

Fig. 9. Betweenness and degree centrality of article nodes

path could go through the main page. This implies a betweenness centrality of 0, visualized in the diagram as a ‘gap’. The right side of Figure 9 shows a stable number of outgoing edges to URLs. Both measurements show the strong heterogeneity of wiki networks. In general, a correlation between the article-article CB and the article-URL CD cannot be assumed. However, in the treated wikis all articles with a CB greater than 0 contain a number of URLs. Hence, these nodes are important in both ways: for the internal structure of the wiki as well as a connector to the ‘real world’. Complex networks become denser during their evolution. The work of Leskovec et al. [29, 30] on the Densification Power Law showed that complex networks may become denser during their evolution and growth. In general, this could not be verified for wiki author networks. Figure 10 reflects two essential characteristics. A few wikis were examined with respect to their shortest path lengths. The measurements begin at the creation of the wikis (first month) and end at the moment of the XML dump. At each measurement point the greatest

Fig. 10. Lengths of shortest paths in author networks

strongly connected component of an author network was considered, and the average shortest path length from one author node to another was computed. Because collaboration (intended or unintended) is so easy, users are connected to other users very quickly. (Remember, two authors only need to work on the same article.) On average, the shortest path length to another author is not longer than 3. After a period of ‘self-discovery’, however, the average distances stagnate at nearly 2 for all treated author networks. This kind of wiki self-discovery is depicted in Figure 11. It shows the author networks (anonymous and registered authors) of Wikia (de) in July and August 2007 respectively. In the beginning, authors work in small groups on ‘their’ articles. More and more new authors join the network. Once this first evolution process is complete, the strongly connected components are merged into one single component (apart from some isolated nodes). The figure shows the important author link between both components. Of course, the average distance increases when a new component is connected (see the peak in Figure 10). But due to more and more interactions between authors, the average distances level off at 2 until the measurement ends. Hence, a growing densification during the evolution process could not be determined.
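The path-length measurement can be illustrated with a minimal sketch. It is not the system used in the study: it assumes an undirected author network in which two authors are linked as soon as they have edited the same article, and it uses the largest connected component as a stand-in for the greatest strongly connected component. The function names and the toy snapshots `july` and `august` are hypothetical.

```python
from collections import deque
from itertools import combinations


def author_network(edits):
    """Build an undirected author network from (author, article) pairs.

    Assumption: two authors are linked if they edited the same article.
    """
    by_article = {}
    for author, article in edits:
        by_article.setdefault(article, set()).add(author)
    graph = {}
    for authors in by_article.values():
        for a in authors:
            graph.setdefault(a, set())  # keep isolated authors as nodes
        for a, b in combinations(sorted(authors), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def bfs_distances(graph, source):
    """Hop distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def largest_component(graph):
    """Node set of the largest connected component."""
    seen, best = set(), set()
    for node in graph:
        if node not in seen:
            component = set(bfs_distances(graph, node))
            seen |= component
            if len(component) > len(best):
                best = component
    return best


def average_path_length(graph):
    """Average shortest path length inside the largest component."""
    component = largest_component(graph)
    n = len(component)
    if n < 2:
        return 0.0
    total = sum(sum(bfs_distances(graph, node).values()) for node in component)
    return total / (n * (n - 1))


# Hypothetical snapshots: two separate groups, then one connecting edit.
july = [("A", "p1"), ("B", "p1"), ("C", "p2"), ("D", "p2")]
august = july + [("B", "p2")]

length_july = average_path_length(author_network(july))
length_august = average_path_length(author_network(august))
```

In this toy example the connecting edit by author B merges the two groups, and the average distance in the largest component rises from 1 to 4/3, which mirrors the peak described for Figure 10 on a very small scale.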

Fig. 11. Author network evolution
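The betweenness ‘gap’ discussed for Figure 9 can be reproduced on a toy article network. The sketch below is an illustration under stated assumptions, not the implementation used in the study: the article network is modelled as a directed, unweighted graph, the unnormalized betweenness centrality is computed with Brandes' algorithm, and the article names are hypothetical.

```python
from collections import deque


def betweenness(graph):
    """Unnormalized betweenness centrality (Brandes' algorithm).

    graph: directed, unweighted, given as {node: set_of_successors}.
    """
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    cb = dict.fromkeys(nodes, 0.0)
    for s in nodes:
        # Breadth-first search from s, counting shortest paths (sigma).
        stack = []
        preds = {v: [] for v in nodes}
        sigma = dict.fromkeys(nodes, 0)
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in graph.get(v, ()):
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies in order of decreasing distance from s.
        delta = dict.fromkeys(nodes, 0.0)
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                cb[w] += delta[w]
    return cb


# Hypothetical wiki: every article links to the main page and back.
before = {
    "Main": {"A", "B", "C"},
    "A": {"Main"},
    "B": {"Main"},
    "C": {"Main"},
}
# Blanking the main page removes all of its outgoing links.
after = dict(before, Main=set())

cb_before = betweenness(before)["Main"]
cb_after = betweenness(after)["Main"]
```

Before the deletion all six ordered pairs of the other articles are connected only via the main page, so its betweenness centrality is 6. After the deletion no shortest path passes through it and the value drops to 0, which is the kind of gap that made the vandalism visible.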


Conclusions and Outlook

In this paper, our aim was to establish a dynamic network analysis view on wikis. Wikis are continuously mutating and growing network structures. We introduced different quantitative and qualitative characterizations of wiki networks to model evolution and dynamics of wikis. In the formal network models we introduced

different node types in the network like authors, articles, revisions, and URLs. Each of the nodes is annotated with a time component, allowing us to track complex changes in the structures over time. Due to limited space, we cannot present all the hypotheses we tested for the study; see [31, 32] for more details. We highlighted here that a predominant number of revisions can be allocated to a small group of authors. We described that an anonymous user can be spotted by her editing behavior regardless of the IP address. We demonstrated that wiki pages with a high betweenness centrality also contain a lot of external links, thus serving as a gateway to the external web. Finally, we had a closer look at the assumed densification of wiki networks, which could not be confirmed. The applied DNA refers to structural aspects of wiki networks. Measurements of centrality indices revealed a growing heterogeneity in wiki networks. As in other social networks, we could determine a strong hierarchical structure of important and unimportant nodes. Furthermore, we have built a bridge to the Small World Phenomenon [33–35], which is frequently found in social science. A continuous growth in the number of authors and articles, with a remarkable correlation between the two, was shown. However, no general assertion could be made about the kind of growth; this has to be checked in each particular case. It nevertheless offers interesting starting points for further research on cross-media network types like author–article networks. What effect does a weighting of edges have? What influence do minor edits have? In addition, a semantic analysis of the corresponding discussion, talk or user pages in terms of growth and change may be interesting. What benefits does DNA have for wikis? Besides giving an overview of hidden interrelationships and pointing out remarkable actors, there are some further applications. Vandalism is widespread on the web, and wikis are affected by it, too.
So, it is necessary to protect particular areas and articles. Wikipedia protects its articles based on semantic decisions, i.e., if an article has sensitive content. By means of network analysis, however, articles could be protected according to their importance in order to guarantee a secure information flow in the network. There may be further advantages of considering wiki networks, e.g. economic or social aspects that are based on network measurements. Such measurements can be used to give recommendations to users based on the gained network data. This may be commercial advertisement or social information. In this paper only wikis based on the MediaWiki software were considered. For generating the networks according to the models we implemented a two-stage system. It consists of a crawler that takes care of extracting the data, transferring them into a system database, and preparing them for generating and visualizing networks as well as for applying measurement methods. Stage 1 is able to manage XML dumps of arbitrary file size. Parsing is done in linear time and constant space using SAX. Stage 2 uses the advantages of existing graph drawing libraries and their network analysis algorithms. One of the biggest problems was handling ‘big’ wikis as measured by their number of nodes. The English Wikipedia contains more than 2 million articles (nodes) and its German counterpart still 1 million articles (nodes). It remains a research challenge how to generate, analyze and visualize such huge networks. The main aspect of

wikis (‘writing articles in common’) echoes in all wiki systems. Every wiki can be represented as a mutating and developing network. Due to the two-stage design approach, modifications for other wiki software are easily made. We already implemented a modified first stage for the content management system TikiWiki. Stage 2 can remain untouched. Among other things like export format, namespaces and author types, the different tagging of links had to be considered (see Table 1).

MediaWiki                  TikiWiki
[[article]]                ((article))
[[article|description]]    ((article|description))
[]                         []
[ eg]                      [|eg]

Table 1. Tagging of links
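The two internal link notations of Table 1 can be handled by one extraction routine with an engine-specific pattern. The following is a simplified sketch and not the parser module of the described system: the regular expressions cover only the `article` and `article|description` forms, ignore external links and any further syntax of the real engines, and all names are hypothetical.

```python
import re

# One pattern per engine; group 1 is the target article,
# group 2 the optional description after the '|' separator.
LINK_PATTERNS = {
    "mediawiki": re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]"),
    "tikiwiki": re.compile(r"\(\(([^()|]+)(?:\|([^()]*))?\)\)"),
}


def internal_links(text, engine):
    """Return (target, description) pairs; description is None if absent."""
    links = []
    for match in LINK_PATTERNS[engine].finditer(text):
        target = match.group(1).strip()
        description = (match.group(2) or "").strip() or None
        links.append((target, description))
    return links


media = internal_links("See [[Aachen]] and [[RWTH|the university]].", "mediawiki")
tiki = internal_links("See ((Aachen)) and ((RWTH|the university)).", "tikiwiki")
```

Both calls yield the same list of article edges, so the later network generation does not need to know which engine the text came from.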

By adjusting the parser module of stage 1 according to the new requirements it is possible to adapt the system to the most common wikis and even to TikiWiki. In this manner it is possible to apply DNA methods to arbitrary wikis based on arbitrary wiki engines. For dynamic network analysis, incremental dumps are more appropriate than the static dumps we used for extracting timed information from the evolving wiki. Future work also includes the design of such incremental dump options for existing wiki software.
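The stage 1 idea of parsing a dump in linear time and constant space with SAX can be sketched as follows. This is a hedged illustration rather than the implemented crawler: it assumes the element names of the MediaWiki XML export format (`page`, `title`, `revision`, `timestamp`, `contributor`, `username`, `ip`), the handler class and the sample data are hypothetical, and the article text is deliberately not buffered.

```python
import xml.sax


class RevisionHandler(xml.sax.ContentHandler):
    """Stream a dump and report one tuple per revision:
    (title, author, is_anonymous, timestamp)."""

    KEEP = {"title", "timestamp", "username", "ip"}

    def __init__(self, on_revision):
        super().__init__()
        self.on_revision = on_revision
        self.fields = {}
        self.current = None
        self.buffer = []

    def startElement(self, name, attrs):
        if name == "page":
            self.fields = {}
        elif name == "revision":
            # Keep the page title, forget data of the previous revision.
            self.fields = {"title": self.fields.get("title")}
        self.current = name if name in self.KEEP else None
        self.buffer = []

    def characters(self, content):
        # Only small metadata fields are collected, never the article text.
        if self.current is not None:
            self.buffer.append(content)

    def endElement(self, name):
        if name == self.current:
            self.fields[name] = "".join(self.buffer).strip()
        self.current = None
        if name == "revision":
            author = self.fields.get("username") or self.fields.get("ip")
            self.on_revision((
                self.fields.get("title"),
                author,
                "username" not in self.fields,
                self.fields.get("timestamp"),
            ))


# Hypothetical miniature dump with one registered and one anonymous edit.
SAMPLE = b"""<mediawiki>
  <page>
    <title>Main_Page</title>
    <revision>
      <timestamp>2007-07-01T10:00:00Z</timestamp>
      <contributor><username>Alice</username></contributor>
      <text>[[FAQ]]</text>
    </revision>
    <revision>
      <timestamp>2007-07-31T12:00:00Z</timestamp>
      <contributor><ip>192.0.2.1</ip></contributor>
      <text></text>
    </revision>
  </page>
</mediawiki>"""

revisions = []
xml.sax.parseString(SAMPLE, RevisionHandler(revisions.append))
```

For a real dump, `xml.sax.parse` can be given a file name or an open file object instead of an in-memory string, and the callback would write each tuple to the system database rather than to a list, so that memory use stays independent of the dump size.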



This work was supported by the German National Science Foundation (DFG) within the collaborative research center SFB/FK 427 ‘Media and Cultural Communication’, within the research cluster established under the excellence initiative of the German government ‘Ultra High-Speed Mobile Information and Communication (UMIC)’ and within the cluster project CONTICI. We thank our colleagues for the inspiring discussions.

