1. Multirelational k-Anonymity --- dotnet/java Nergiz, Mehmet Ercan Clifton, Christopher Nergiz, Ahmet Erhan Sabanci University, Istanbul, Purdue University, West Lafayette; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: Aug. 2009 Volume: 21, Issue: 8 On page(s): 1104-1117 ISSN: 1041-4347 Digital Object Identifier: 10.1109/TKDE.2008.210 First Published: 2008-10-17 Current Version Published: 2009-06-23 Abstract k-Anonymity protects privacy by ensuring that data cannot be linked to a single individual. In a k-anonymous data set, any identifying information occurs in at least k tuples. Much research has been done to modify a single-table data set to satisfy anonymity constraints. This paper extends the definitions of k-anonymity to multiple relations and shows that previously proposed methodologies either fail to protect privacy or overly reduce the utility of the data in a multiple relation setting. We also propose two new clustering algorithms to achieve multirelational anonymity. Experiments show the effectiveness of the approach in terms of utility and efficiency.
2. Olex: Effective Rule Learning for Text Categorization – java/dotnet Rullo, Pasquale Policicchio, Veronica Lucia Cumbo, Chiara Iiritano, Salvatore University of Calabria, Rende; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: Aug. 2009 Volume: 21, Issue: 8 On page(s): 1118-1132 ISSN: 1041-4347 Digital Object Identifier: 10.1109/TKDE.2008.206 First Published: 2008-10-10 Current Version Published: 2009-06-23 Abstract This paper describes Olex, a novel method for the automatic induction of rule-based text classifiers. Olex supports a hypothesis language of the form "if T_1 or ... or T_n occurs in document d, and none of T_{n+1}, ..., T_{n+m} occurs in d, then classify d under category c," where each T_i is a conjunction of terms. The proposed method is simple and elegant. Despite this, the results of a systematic experimentation performed on the Reuters-21578, the Ohsumed, and the ODP data collections show that Olex provides classifiers that are accurate, compact, and comprehensible. A comparative analysis
conducted against some of the most well-known learning algorithms (namely, Naive Bayes, Ripper, C4.5, SVM, and Linear Logistic Regression) demonstrates that it is more than competitive in terms of both predictive accuracy and efficiency.
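The hypothesis language quoted above is concrete enough to sketch directly. The following is a minimal illustration of that rule form only (not the Olex induction algorithm); class and method names such as OlexRuleSketch and fires are hypothetical.

    import java.util.List;
    import java.util.Set;

    // Minimal sketch of the Olex hypothesis language: classify d under category c
    // if some positive conjunction T_1..T_n occurs in d and no negative conjunction does.
    class OlexRuleSketch {
        // A conjunction of terms occurs in a document iff every term is present.
        static boolean occurs(Set<String> docTerms, Set<String> conjunction) {
            return docTerms.containsAll(conjunction);
        }

        // Returns true if the document should be classified under the rule's category.
        static boolean fires(Set<String> docTerms,
                             List<Set<String>> positiveConjunctions,
                             List<Set<String>> negativeConjunctions) {
            boolean anyPositive = positiveConjunctions.stream().anyMatch(t -> occurs(docTerms, t));
            boolean anyNegative = negativeConjunctions.stream().anyMatch(t -> occurs(docTerms, t));
            return anyPositive && !anyNegative;
        }
    }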
3. Ranking and Suggesting Popular Items – java/dotnet Vojnović, Milan Cruise, James Gunawardena, Dinan Marbach, Peter Microsoft Research Ltd., Cambridge; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: Aug. 2009 Volume: 21, Issue: 8 On page(s): 1133-1146 ISSN: 1041-4347 Digital Object Identifier: 10.1109/TKDE.2009.34 First Published: 2009-01-23 Current Version Published: 2009-06-23 Abstract We consider the problem of ranking the popularity of items and suggesting popular items based on user feedback. User feedback is obtained by iteratively presenting a set of suggested items, and users selecting items based on their own preferences either from this suggestion set or from the set of all possible items. The goal is to quickly learn the true popularity ranking of items (unbiased by the suggestions made), and to suggest the truly popular items. The difficulty is that making suggestions to users can reinforce the popularity of some items and distort the resulting item ranking. The described problem of ranking and suggesting items arises in diverse applications including search query suggestions and tag suggestions for social tagging systems. We propose and study several algorithms for ranking and suggesting popular items, provide analytical results on their performance, and present numerical results obtained using the inferred popularity of tags from a month-long crawl of a popular social bookmarking service. Our results suggest that lightweight, randomized update rules that require no special configuration parameters provide good performance.
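To give a flavor of what a lightweight randomized rule can look like, the sketch below mixes the currently most-selected items with occasional uniformly random items in the suggestion set, so that popular-but-unsuggested items can still surface. This is an illustrative rule under assumed names (SuggestionSketch, explorationProb), not one of the specific algorithms analyzed in the paper.

    import java.util.*;

    // Illustrative randomized suggestion rule: most slots show the currently
    // most-selected items, but each slot is replaced with a uniformly random item
    // with small probability, reducing the bias the suggestions themselves introduce.
    class SuggestionSketch {
        private final Map<String, Integer> counts = new HashMap<>();
        private final List<String> allItems;
        private final Random rng = new Random();

        SuggestionSketch(List<String> allItems) { this.allItems = allItems; }

        void recordSelection(String item) { counts.merge(item, 1, Integer::sum); }

        List<String> suggest(int k, double explorationProb) {
            List<String> ranked = new ArrayList<>(allItems);
            ranked.sort((a, b) -> counts.getOrDefault(b, 0) - counts.getOrDefault(a, 0));
            List<String> result = new ArrayList<>(ranked.subList(0, Math.min(k, ranked.size())));
            for (int i = 0; i < result.size(); i++) {
                if (rng.nextDouble() < explorationProb) {
                    result.set(i, allItems.get(rng.nextInt(allItems.size())));
                }
            }
            return result;
        }
    }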
4. Similarity-Profiled Temporal Association Mining – java/dotnet Yoo, Jin Soung Shekhar, Shashi Indiana University-Purdue University, Fort Wayne; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: Aug. 2009 Volume: 21, Issue: 8 On page(s): 1147-1161
ISSN: 1041-4347 Digital Object Identifier: 10.1109/TKDE.2008.185 First Published: 2008-09-12 Current Version Published: 2009-06-23 Abstract Given a time-stamped transaction database and a user-defined reference sequence of interest over time, similarity-profiled temporal association mining discovers all associated item sets whose prevalence variations over time are similar to the reference sequence. The similar temporal association patterns can reveal interesting relationships of data items which co-occur with a particular event over time. Most works in temporal association mining have focused on capturing special temporal regulation patterns such as cyclic patterns and calendar scheme-based patterns. In contrast, our model is flexible in representing interesting temporal patterns using a user-defined reference sequence. The dissimilarity degree of the sequence of support values of an item set to the reference sequence is used to capture how well its temporal prevalence variation matches the reference pattern. By exploiting interesting properties such as an envelope of the support time sequence and a lower bounding distance for early pruning of candidate item sets, we develop an algorithm for effectively mining similarity-profiled temporal association patterns. We prove that the algorithm is correct and complete in its mining results and provide a computational analysis. Experimental results on real data as well as synthetic data show that the proposed algorithm is more efficient than a sequential method using a traditional support-pruning scheme.
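To make the notion of a support time sequence concrete, the sketch below computes an itemset's support in each time slot and its Euclidean distance to the reference sequence; itemsets whose distance exceeds a threshold would simply be discarded. This is only the dissimilarity computation, not the envelope-based lower bounding or pruning machinery of the paper, and all names are hypothetical.

    import java.util.List;
    import java.util.Set;

    // Sketch: support sequence of an itemset over time slots and its Euclidean
    // dissimilarity to a user-defined reference sequence (illustrative only).
    class SupportSequenceSketch {
        // transactionsPerSlot.get(t) holds the transactions time-stamped in slot t.
        static double[] supportSequence(List<List<Set<String>>> transactionsPerSlot, Set<String> itemset) {
            double[] seq = new double[transactionsPerSlot.size()];
            for (int t = 0; t < seq.length; t++) {
                List<Set<String>> slot = transactionsPerSlot.get(t);
                long hits = slot.stream().filter(tx -> tx.containsAll(itemset)).count();
                seq[t] = slot.isEmpty() ? 0.0 : (double) hits / slot.size();
            }
            return seq;
        }

        static double euclidean(double[] a, double[] b) {
            double sum = 0.0;
            for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
            return Math.sqrt(sum);
        }
    }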
5. RiMOM: A Dynamic Multistrategy Ontology Alignment Framework – java/dotnet Li, Juanzi Tang, Jie Li, Yi Luo, Qiong Tsinghua University, Beijing; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: Aug. 2009 Volume: 21, Issue: 8 On page(s): 1218-1232 ISSN: 1041-4347 Digital Object Identifier: 10.1109/TKDE.2008.202 First Published: 2008-09-26 Current Version Published: 2009-06-23 Abstract Ontology alignment identifies semantically matching entities in different ontologies. Various ontology alignment strategies have been proposed; however, few systems have explored how to automatically combine multiple strategies to improve the matching effectiveness. This paper presents a dynamic multistrategy ontology alignment framework, named RiMOM. The key insight in this framework is that similarity characteristics between ontologies may vary widely. We propose a systematic approach to quantitatively estimate the similarity characteristics for each alignment task and propose
a strategy selection method to automatically combine the matching strategies based on two estimated factors. In the approach, we consider both textual and structural characteristics of ontologies. With RiMOM, we participated in the 2006 and 2007 campaigns of the Ontology Alignment Evaluation Initiative (OAEI). Our system is among the top three performers in benchmark data sets.
6. Role Engineering via Prioritized Subset Enumeration – java Jaideep Vaidya, Member, IEEE, Vijayalakshmi Atluri, Senior Member, IEEE, Janice Warner, Student Member, IEEE, and Qi Guo, Student Member, IEEE; IEEE Transactions on Dependable and Secure Computing, Vol. X, No. X, 2009 Abstract—Today, role-based access control (RBAC) has become a well-accepted paradigm for implementing access control because of its convenience and ease of administration. However, in order to realize the full benefits of the RBAC paradigm, one must first define the roles accurately. This task of defining roles and associating permissions with them, also known as role engineering, is typically accomplished either in a top-down or in a bottom-up manner. Under the top-down approach, a careful analysis of the business processes is done to first define job functions and then to specify appropriate roles from them. While this approach can help in defining roles more accurately, it is tedious and time consuming since it requires that the semantics of the business processes be well understood. Moreover, it ignores existing permissions within an organization and does not utilize them. On the other hand, under the bottom-up approach, existing permissions are used to derive roles from them. As a result, it may help automate the process of role definition. In this paper, we present an unsupervised approach, called RoleMiner, for mining roles from existing user-permission assignments. Since a role, when semantics are unavailable, is nothing but a set of permissions, the task of role mining is essentially that of clustering users having the same (or similar) permissions. However, unlike the traditional applications of data mining that ideally require identification of nonoverlapping clusters, roles will have overlapping permissions and thus permission sets that define roles should be allowed to overlap. It is this distinction from traditional clustering that makes the problem of role mining nontrivial. Our experiments with real and simulated data sets indicate that our role mining process is quite accurate and efficient. Since our role mining approach is based on subset enumeration, it is fairly robust to reasonable levels of noise.
7. On the Effect of Location Uncertainty in Spatial Querying – java Elias Frentzos, Kostas Gratsias, and Yannis Theodoridis, Member, IEEE; IEEE Transactions on Knowledge and Data Engineering, Vol. 21, No. 3, March 2009
Abstract—An emerging topic in the field of spatial data management is the handling of location uncertainty of spatial objects, mainly due to inaccurate measurements. The literature on location uncertainty so far has focused on modifying traditional spatial search algorithms in order to handle the impact of objects’ location uncertainty on the query results. In this paper, we present the first, to the best of our knowledge, theoretical analysis that estimates the average number of false hits introduced in the results of rectangular range queries in the case of data points uniformly distributed in 2D space. Then, we relax the original distribution assumptions, showing how to deal with arbitrarily distributed data points and more realistic location uncertainty distributions. The accuracy of the results of our analytical approach is demonstrated through an extensive experimental study using various synthetic and real data sets. Our proposal can be directly employed in spatial database systems in order to provide users with the accuracy of spatial query results based only on known data set and query parameters. Index Terms—Spatial databases, GIS
8. Design and Evaluation of the iMed Intelligent Medical Search Engine – dotnet Gang Luo, IBM T.J. Watson Research Center, 19 Skyline Drive, Hawthorne, NY 10532, USA; IEEE International Conference on Data Engineering, 2009 Abstract — Searching for medical information on the Web is popular and important. However, medical search has its own unique requirements that are poorly handled by existing medical Web search engines. This paper presents iMed, the first intelligent medical Web search engine that extensively uses medical knowledge and a questionnaire to help ordinary Internet users search for medical information. iMed introduces and extends expert system technology into the search engine domain. It uses several key techniques to improve its usability and search result quality. First, since ordinary users often cannot clearly describe their situations due to lack of medical background, iMed uses a questionnaire-based query interface to guide searchers to provide the most important information about their situations. Second, iMed uses medical knowledge to automatically form multiple queries from a searcher’s answers to the questions. Using these queries to perform search can significantly improve the quality of search results. Third, iMed structures all the search results into a multilevel hierarchy with explicitly marked medical meanings to facilitate searchers’ viewing. Lastly, iMed suggests diversified, related medical phrases at each level of the search result hierarchy. These medical phrases are extracted from the MeSH ontology and can help searchers quickly digest search results and refine their inputs. We evaluated iMed under a wide range of medical scenarios. The results show that iMed is effective and efficient for medical search.
9. SPOT Databases: Efficient Consistency Checking and Optimistic Selection in Probabilistic Spatial Databases – java/dotnet
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 21, NO. 1, JANUARY 2009 ABSTRACT Spatial PrObabilistic Temporal (SPOT) databases are a paradigm for reasoning with probabilistic statements about where a vehicle may be now or in the future. They express statements of the form “Object O is in spatial region R at some time t with some probability in the interval [L, U].” Past work on SPOT databases has developed selection operators based on selecting SPOT atoms that are entailed by the SPOT database—we call this “cautious” selection. In this paper, we study several problems. First, we note that the runtime of consistency checking and cautious selection algorithms in past work is influenced greatly by the granularity of the underlying Cartesian space. We then introduce the notion of “optimistic” selection, where we are interested in returning all SPOT atoms in a database that are consistent with respect to a query, rather than having an entailment relationship. We then develop an approach to scaling SPOT databases that has three main contributions: 1) We develop methods to eliminate variables from the linear programs used in past work, thus greatly reducing the size of the linear programs used—the resulting advances apply to consistency checking, optimistic selection, and cautious selection. 2) We develop a host of theorems to show how we can prune the search space when we are interested in optimistic selection. 3) We use the above contributions to build an efficient index to execute optimistic selection queries over SPOT databases. Our approach is superior to past work in two major respects: First, it makes fewer assumptions than all past works. Second, our experiments, which are based on real-world data about ship movements, show that our algorithms are much more efficient.
10. Robust Blind Watermarking of Point-Sampled Geometry – java Agarwal, P.; Prabhakaran, B. Information Forensics and Security, IEEE Transactions on Volume 4, Issue 1, Date: March 2009, Pages: 36-48 Digital Object Identifier 10.1109/TIFS.2008.2011081 ABSTRACT Watermarking schemes for copyright protection of point cloud representations of 3D models operate only on the geometric data, and are also applicable to mesh-based representations of 3D models, defined using geometry and topological information. For building such generic copyright schemes for 3D models, this paper presents a robust spatial blind watermarking mechanism for 3D point-sampled geometry. To find the order in which points are to be encoded/decoded, a clustering approach is proposed. The points are divided into clusters, and ordering is achieved using inter-cluster and intra-cluster ordering. Intra-cluster ordering achieves local ordering of points, whereas inter-cluster ordering does it globally. Once ordered, a sequence of clusters is chosen based on a nearest-neighbor heuristic. An extension of the quantization index modulation bit encoding scheme is proposed, and used to encode and decode inside the clusters. The encoding mechanism makes the technique robust against uniform affine transformations (rotation, scaling, and translation), reordering attack and topology altering (e.g. retriangulation) attack when
applied to 3D meshes as well. Replication of the watermark provides robustness against localized noise addition, cropping, simplification and global noise addition attacks. Security of the scheme is analyzed, and the time complexity is estimated as O(n log n), where n is the number of 3D points. Theoretical bounds on hiding capacity are estimated, and experiments show that the hiding capacity is high, with an embedding rate greater than 3 bits/point. The bit encoding method reduces the distortions and makes the watermark imperceptible, as indicated by a signal-to-noise ratio greater than 100 dB.
11. A Hypothesis Testing Approach to Semifragile Watermark-Based Authentication – java Chuhong Fei; Kwong, R.H.; Kundur, D. Information Forensics and Security, IEEE Transactions on Volume 4, Issue 2, Date: June 2009, Pages: 179-192 Digital Object Identifier 10.1109/TIFS.2009.2015039
This paper studies the problem of achieving watermark semi-fragility in multimedia authentication through a composite hypothesis testing approach. The embedding of a semi-fragile watermark serves to distinguish legitimate distortions caused by signal processing manipulations from illegitimate ones caused by malicious tampering. This leads us to consider authentication verification as a composite hypothesis testing problem with the watermark as a priori information. Based on the hypothesis testing model, we investigate the best embedding strategy which assists the watermark verifier to make correct decisions. Our results show that the quantization; based watermarking method is more appropriate than the spread spectrum method to achieve the best tradeoff between two error probabilities. This observation is confirmed by a case study of additive Gaussian white noise channel with Gaussian source using two figures of merit: relative entropy of the two hypothesis distributions and the receiver operating characteristic. Finally, we focus on certain common signal processing distortions such as JPEG compression and image filtering, and investigate the best test statistic and optimal decision regions to distinguish legitimate and illegitimate distortions. The results of the paper show that our approach provides insights for authentication watermarking and allows better control of semi-fragility in specific applications.
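For readers unfamiliar with the quantization-based embedding referred to above, the sketch below shows the basic quantization index modulation idea on a single sample: the host value is quantized onto one of two interleaved lattices depending on the watermark bit, and the detector decides the bit from the nearest lattice. This is the generic textbook scheme, not the specific embedding strategies optimized in the paper; QimSketch and its method names are hypothetical.

    // Basic quantization index modulation (QIM) on one sample, the standard scheme
    // behind "quantization-based watermarking" (illustrative sketch only).
    class QimSketch {
        // Embed bit b into host value x using quantization step delta:
        // quantize x onto the lattice shifted by 0 (b = 0) or delta/2 (b = 1).
        static double embed(double x, int b, double delta) {
            double shift = b == 0 ? 0.0 : delta / 2.0;
            return Math.round((x - shift) / delta) * delta + shift;
        }

        // Decode the bit as the lattice closest to the received (possibly distorted) value y.
        static int decode(double y, double delta) {
            double d0 = Math.abs(y - embed(y, 0, delta));
            double d1 = Math.abs(y - embed(y, 1, delta));
            return d0 <= d1 ? 0 : 1;
        }
    }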
12. Spread-Spectrum Watermarking Security – java/dotnet Perez-Freire, L.; Perez-Gonzalez, F. Information Forensics and Security, IEEE Transactions on Volume 4, Issue 1, Date: March 2009, Pages: 2-24 Digital Object Identifier 10.1109/TIFS.2008.2009603 Abstract This paper presents both theoretical and practical analyses of the security offered by watermarking and data hiding methods based on spread spectrum. In this context,
security is understood as the difficulty of estimating the secret parameters of the embedding function based on the observation of watermarked signals. On the theoretical side, the security is quantified from an information-theoretic point of view by means of the equivocation about the secret parameters. The main results reveal fundamental limits and bounds on security and provide insight into other properties, such as the impact of the embedding parameters, and the tradeoff between robustness and security. On the practical side, workable estimators of the secret parameters are proposed and theoretically analyzed for a variety of scenarios, providing a comparison with previous approaches, and showing that the security of many schemes used in practice can be fairly low.
13. A Game Theoretical Framework on Intrusion Detection in Heterogeneous Networks – java Lin Chen; Leneutre, J. Information Forensics and Security, IEEE Transactions on Volume 4, Issue 2, Date: June 2009, Pages: 165-178 Digital Object Identifier 10.1109/TIFS.2009.2019154 Abstract Due to the dynamic, distributed, and heterogeneous nature of today's networks, intrusion detection systems (IDSs) have become a necessary addition to the security infrastructure and are widely deployed as a complementary line of defense to classical security approaches. In this paper, we address the intrusion detection problem in heterogeneous networks consisting of nodes with different noncorrelated security assets. In our study, two crucial questions are: What are the expected behaviors of rational attackers? What is the optimal strategy of the defenders (IDSs)? We answer the questions by formulating network intrusion detection as a noncooperative game and performing an in-depth analysis of the Nash equilibrium and the engineering implications behind it. Based on our game theoretical analysis, we derive the expected behaviors of rational attackers, the minimum monitor resource requirement, and the optimal strategy of the defenders. We then provide guidelines for IDS design and deployment. We also show how our game theoretical framework can be applied to configure the intrusion detection strategies in realistic scenarios via a case study. Finally, we evaluate the proposed game theoretical framework via simulations. The simulation results show both the correctness of the analytical results and the effectiveness of the proposed guidelines.
14. Unseen Visible Watermarking: A Novel Methodology for Auxiliary Information Delivery via Visual Contents – java/dotnet Chun-Hsiang Huang; Shang-Chih Chuang; Yen-Lin Huang; Ja-Ling Wu Information Forensics and Security, IEEE Transactions on Volume 4, Issue 2, Date: June 2009, Pages: 193-206 Digital Object Identifier 10.1109/TIFS.2009.2020778
Abstract A novel data hiding scheme, denoted as unseen visible watermarking (UVW), is proposed. In UVW schemes, hidden information can be embedded covertly and then directly extracted using the human visual system as long as appropriate operations (e.g., gamma correction provided by almost all display devices or changes in viewing angles relative to LCD monitors) are performed. UVW eliminates the requirement of invisible watermarking that specific watermark extractors must be deployed to the receiving end in advance, and it can be integrated with 2-D barcodes to transmit machine-readable information that conventional visible watermarking schemes fail to deliver. We also adopt visual cryptographic techniques to guard the security of hidden information and, at the same time, increase the practical value of visual cryptography. Since UVW can be alternatively viewed as a mechanism for visualizing patterns hidden with least-significant-bit embedding, its security against statistical steganalysis is demonstrated by empirical tests. Limitations and other potential extensions of UVW are also addressed.
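A small calculation illustrates why gamma correction can expose differences hidden in dark regions (this is only the display-side principle, not the actual UVW embedding; GammaSketch and the chosen pixel values are hypothetical): two dark pixel values that differ by 8 levels before correction are pushed roughly 20 levels apart after a gamma = 2.2 correction.

    // Why gamma correction reveals small differences hidden in dark regions:
    // values 8 and 16 differ by 8 levels before correction, but by about 20 levels
    // (roughly 53 vs. 72) after gamma = 2.2 correction. Illustration only.
    class GammaSketch {
        static int gammaCorrect(int value, double gamma) {
            return (int) Math.round(255.0 * Math.pow(value / 255.0, 1.0 / gamma));
        }

        public static void main(String[] args) {
            System.out.println(gammaCorrect(8, 2.2));   // ~53
            System.out.println(gammaCorrect(16, 2.2));  // ~72
        }
    }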
15. A Fast Search Algorithm for a Large Fuzzy Database – java/dotnet Feng Hao; Daugman, J.; Zielinski, P. Information Forensics and Security, IEEE Transactions on Volume 3, Issue 2, Date: June 2008, Pages: 203-212 Digital Object Identifier 10.1109/TIFS.2008.920726 Abstract In this paper, we propose a fast search algorithm for a large fuzzy database that stores iris codes or data with a similar binary structure. The fuzzy nature of iris codes and their high dimensionality render many modern search algorithms, mainly relying on sorting and hashing, inadequate. The algorithm that is used in all current public deployments of iris recognition is based on a brute force exhaustive search through a database of iris codes, looking for a match that is close enough. Our new technique, Beacon Guided Search (BGS), tackles this problem by dispersing a multitude of "beacons" in the search space. Despite random bit errors, iris codes from the same eye are more likely to collide with the same beacons than those from different eyes. By counting the number of collisions, BGS shrinks the search range dramatically with a negligible loss of precision. We evaluate this technique using 632,500 iris codes enrolled in the United Arab Emirates (UAE) border control system, showing a substantial improvement in search speed with a negligible loss of accuracy. In addition, we demonstrate that the empirical results match theoretical predictions.
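The beacon idea can be sketched as follows (a simplified illustration of the collision-counting principle, not the exact indexing of the paper; all names are hypothetical): each binary code is registered under several short "beacon" keys taken from fixed bit slices, and at query time only codes that collide with the query on enough beacons are compared with the full Hamming distance.

    import java.util.*;

    // Simplified beacon-guided search: index each binary code under several short
    // beacon keys (fixed bit slices); at query time, count beacon collisions and
    // run the exact comparison only on codes with enough collisions.
    // Assumes each code has at least beacons * bitsPerBeacon bits.
    class BeaconSearchSketch {
        private final int beacons;
        private final int bitsPerBeacon;
        private final List<Map<Integer, List<boolean[]>>> index = new ArrayList<>();

        BeaconSearchSketch(int beacons, int bitsPerBeacon) {
            this.beacons = beacons;
            this.bitsPerBeacon = bitsPerBeacon;
            for (int i = 0; i < beacons; i++) index.add(new HashMap<>());
        }

        private int beaconKey(boolean[] code, int beacon) {
            int key = 0;
            for (int j = 0; j < bitsPerBeacon; j++) {
                key = (key << 1) | (code[beacon * bitsPerBeacon + j] ? 1 : 0);
            }
            return key;
        }

        void enroll(boolean[] code) {
            for (int i = 0; i < beacons; i++) {
                index.get(i).computeIfAbsent(beaconKey(code, i), k -> new ArrayList<>()).add(code);
            }
        }

        // Return enrolled codes that collide with the query on at least minCollisions beacons.
        List<boolean[]> candidates(boolean[] query, int minCollisions) {
            Map<boolean[], Integer> collisions = new IdentityHashMap<>();
            for (int i = 0; i < beacons; i++) {
                for (boolean[] code : index.get(i).getOrDefault(beaconKey(query, i), List.of())) {
                    collisions.merge(code, 1, Integer::sum);
                }
            }
            List<boolean[]> result = new ArrayList<>();
            collisions.forEach((code, c) -> { if (c >= minCollisions) result.add(code); });
            return result;
        }
    }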
16. Digital Image Forensics via Intrinsic Fingerprints – java Swaminathan, A.; Min Wu; Liu, K.J.R. Information Forensics and Security, IEEE Transactions on Volume 3, Issue 1, Date: March 2008, Pages: 101-117 Digital Object Identifier 10.1109/TIFS.2007.916010 Abstract Digital imaging has experienced tremendous growth in recent decades, and digital camera images have been used in a growing number of applications. With such increasing popularity and the availability of low-cost image editing software, the integrity of digital image content can no longer be taken for granted. This paper introduces a new methodology for the forensic analysis of digital camera images. The proposed method is based on the observation that many processing operations, both inside and outside acquisition devices, leave distinct intrinsic traces on digital images, and these intrinsic fingerprints can be identified and employed to verify the integrity of digital data. The intrinsic fingerprints of the various in-camera processing operations can be estimated through a detailed imaging model and its component analysis. Further processing applied to the camera-captured image is modelled as a manipulation filter, for which a blind deconvolution technique is applied to obtain a linear time-invariant approximation and to estimate the intrinsic fingerprints associated with these postcamera operations. The absence of camera-imposed fingerprints from a test image indicates that the test image is not a camera output and is possibly generated by other image production processes. Any changes or inconsistencies among the estimated camera-imposed fingerprints, or the presence of new types of fingerprints, suggest that the image has undergone some kind of processing after the initial capture, such as tampering or steganographic embedding. Through analysis and extensive experimental studies, this paper demonstrates the effectiveness of the proposed framework for nonintrusive digital image forensics.
17. Watermarking Robustness Evaluation Based on Perceptual Quality via Genetic Algorithms Boato, G.; Conotter, V.; De Natale, F.G.B.; Fontanari, C. Information Forensics and Security, IEEE Transactions on Volume 4, Issue 2, Date: June 2009, Pages: 207-216 Digital Object Identifier 10.1109/TIFS.2009.2020362 Abstract This paper presents a novel and flexible benchmarking tool based on genetic algorithms (GA) and designed to assess the robustness of any digital image watermarking system. The main idea is to evaluate robustness in terms of perceptual quality, measured by weighted peak signal-to-noise ratio. Through a stochastic approach, we optimize this quality metric by finding the minimal degradation that needs to be introduced in a marked image in order to remove the embedded watermark. Given a set of attacks, chosen according to the considered application scenario, GA support the optimization of the parameters to be assigned to each processing operation, in order to obtain an
unmarked image with perceptual quality as high as possible. Extensive experimental results demonstrate the effectiveness of the proposed evaluation tool.
18. Low-Complexity Iris Coding and Recognition Based on Directionlets – java Velisavljevic, V. Information Forensics and Security, IEEE Transactions on Volume PP, Issue 99, Date: 0, Pages: 1-1 Digital Object Identifier 10.1109/TIFS.2009.2024025 Abstract A novel iris recognition method is presented. In the method, the iris features are extracted using oriented separable wavelet transforms (directionlets) and they are compared in terms of a weighted Hamming distance. The feature extraction and comparison are shift-, size-, and rotation-invariant with respect to the location of the iris in the acquired image. The generated iris code is binary; its length is fixed (and therefore commensurable), independent of the iris image, and comparatively short. The novel method shows good performance when applied to a large database of irises and provides reliable identification and verification. At the same time, it preserves conceptual and computational simplicity and allows for a quick analysis and comparison of iris samples.
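The weighted Hamming distance used for matching can be stated compactly. The sketch below compares two binary iris codes with per-bit weights and a mask of valid bits, which is a generic formulation with hypothetical names rather than the paper's exact weighting.

    // Generic weighted Hamming distance between two binary codes, with per-bit weights
    // and a validity mask (e.g., to ignore bits occluded by eyelids). Illustrative only.
    class WeightedHammingSketch {
        static double distance(boolean[] a, boolean[] b, double[] weight, boolean[] valid) {
            double disagreement = 0.0, total = 0.0;
            for (int i = 0; i < a.length; i++) {
                if (!valid[i]) continue;
                total += weight[i];
                if (a[i] != b[i]) disagreement += weight[i];
            }
            return total == 0.0 ? 0.0 : disagreement / total;
        }
    }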
19 Lanczos Vectors versus Singular Vectors for Effective Dimension Reduction – java/dotnet Chen, Jie Saad, Yousef University of Minnesota, Minneapolis; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: Aug. 2009 Volume: 21, Issue: 8 On page(s): 1091-1103 ISSN: 1041-4347 Digital Object Identifier: 10.1109/TKDE.2008.228 First Published: 2008-11-17 Current Version Published: 2009-06-23 Abstract This paper takes an in-depth look at a technique for computing filtered matrix-vector (mat-vec) products which are required in many data analysis applications. In these applications, the data matrix is multiplied by a vector and we wish to perform this product accurately in the space spanned by a few of the major singular vectors of the matrix. We examine the use of the Lanczos algorithm for this purpose. The goal of the method is identical with that of the truncated singular value decomposition (SVD),
namely to preserve the quality of the resulting mat-vec product in the major singular directions of the matrix. The Lanczos-based approach achieves this goal by using a small number of Lanczos vectors, but it does not explicitly compute singular values/vectors of the matrix. The main advantage of the Lanczos-based technique is its low cost when compared with that of the truncated SVD. This advantage comes without sacrificing accuracy. The effectiveness of this approach is demonstrated on a few sample applications requiring dimension reduction, including information retrieval and face recognition. The proposed technique can be applied as a replacement for the truncated SVD technique whenever the problem can be formulated as a filtered mat-vec multiplication.
20. Optimal-Location-Selection Query Processing in Spatial Databases – java Gao, Yunjun Zheng, Baihua Chen, Gencai Li, Qing Singapore Management University, Singapore; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: Aug. 2009 Volume: 21, Issue: 8 On page(s): 1162-1177 ISSN: 1041-4347 Digital Object Identifier: 10.1109/TKDE.2009.81 First Published: 2009-04-17 Current Version Published: 2009-06-23 Abstract This paper introduces and solves a novel type of spatial query, namely, Optimal-Location-Selection (OLS) search, which has many applications in real life. Given a data object set D_A, a target object set D_B, a spatial region R, and a critical distance d_c in a multidimensional space, an OLS query retrieves those target objects in D_B that are outside R but have maximal optimality. Here, the optimality of a target object b in D_B located outside R is defined as the number of the data objects from D_A that are inside R and meanwhile have their distances to b not exceeding d_c. When there is a tie, the accumulated distance from the data objects to b serves as the tie breaker, and the one with the smaller distance has the better optimality. In this paper, we present the optimality metric, formalize the OLS query, and propose several algorithms for processing OLS queries efficiently. A comprehensive experimental evaluation has been conducted using both real and synthetic data sets to demonstrate the efficiency and effectiveness of the proposed algorithms.
21. Evaluating the Effectiveness of Personalized Web Search – java/dotnet Dou, Zhicheng Song, Ruihua Wen, Ji-Rong Yuan, Xiaojie Microsoft Research Asia, Beijing;
This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: Aug. 2009 Volume: 21, Issue: 8 On page(s): 1178-1190 ISSN: 1041-4347 Digital Object Identifier: 10.1109/TKDE.2008.172 First Published: 2008-08-22 Current Version Published: 2009-06-23 Abstract Although personalized search has been under way for many years and many personalization algorithms have been investigated, it is still unclear whether personalization is consistently effective on different queries for different users and under different search contexts. In this paper, we study this problem and provide some findings. We present a large-scale evaluation framework for personalized search based on query logs and then evaluate five personalized search algorithms (including two click-based ones and three topical-interest-based ones) using 12-day query logs of Windows Live Search. By analyzing the results, we reveal that personalized Web search does not work equally well under various situations. It represents a significant improvement over generic Web search for some queries, while it has little effect and even harms query performance in some situations. We propose click entropy as a simple measure of whether a query should be personalized. We further propose several features to automatically predict when a query will benefit from a specific personalization algorithm. Experimental results show that using a personalization algorithm for queries selected by our prediction model is better than using it simply for all queries.
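Click entropy, as used above, is the Shannon entropy of the click distribution over results for a query, H(q) = -sum_d p(d|q) log2 p(d|q): a query whose clicks concentrate on one result has entropy near zero and gains little from personalization, while a query whose clicks spread over many results is a better candidate. The sketch below computes it from raw click counts; this is the standard entropy computation, and the exact estimator used in the paper may differ.

    import java.util.Map;

    // Click entropy of a query: Shannon entropy of the distribution of clicks over
    // result URLs. Low entropy suggests the query needs little personalization.
    class ClickEntropySketch {
        static double clickEntropy(Map<String, Integer> clicksPerUrl) {
            double total = clicksPerUrl.values().stream().mapToInt(Integer::intValue).sum();
            double h = 0.0;
            for (int clicks : clicksPerUrl.values()) {
                if (clicks == 0) continue;
                double p = clicks / total;
                h -= p * (Math.log(p) / Math.log(2));
            }
            return h;
        }
    }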
22. A Divide-and-Conquer Approach for Minimum Spanning Tree-Based Clustering – java Xiaochun Wang Xiali Wang Wilkes, D.M. Dept. of Electr. Eng. & Comput. Sci., Vanderbilt Univ., Nashville, TN; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: July 2009 Volume: 21, Issue: 7 On page(s): 945-958 ISSN: 1041-4347 INSPEC Accession Number: 10666309 Digital Object Identifier: 10.1109/TKDE.2009.37 First Published: 2009-02-02 Current Version Published: 2009-05-26 Abstract Due to their ability to detect clusters with irregular boundaries, minimum spanning tree-based clustering algorithms have been widely used in practice. However, in such clustering algorithms, the search for nearest neighbors in the construction of minimum
spanning trees is the main source of computation and the standard solutions take O(N²) time. In this paper, we present a fast minimum spanning tree-inspired clustering algorithm, which, by using an efficient implementation of the cut and the cycle property of the minimum spanning trees, can have much better performance than O(N²).
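The baseline that the paper accelerates can be summarized in a few lines: build the minimum spanning tree of the points, remove the k-1 longest edges, and take the resulting connected components as the clusters. The sketch below shows that baseline step on a precomputed MST edge list; it is the standard scheme only, not the divide-and-conquer algorithm of the paper, and the names are hypothetical.

    import java.util.*;

    // Baseline MST-based clustering: given the MST edges of n points, remove the
    // k-1 longest edges; the remaining connected components are the k clusters.
    class MstClusteringSketch {
        record Edge(int u, int v, double weight) {}

        static int[] clusterLabels(int n, List<Edge> mstEdges, int k) {
            List<Edge> kept = new ArrayList<>(mstEdges);
            kept.sort(Comparator.comparingDouble(Edge::weight));          // ascending by length
            kept = kept.subList(0, Math.max(0, kept.size() - (k - 1)));   // drop the k-1 longest

            int[] parent = new int[n];
            for (int i = 0; i < n; i++) parent[i] = i;
            for (Edge e : kept) union(parent, e.u(), e.v());

            int[] labels = new int[n];
            for (int i = 0; i < n; i++) labels[i] = find(parent, i);
            return labels;                                                 // same label = same cluster
        }

        private static int find(int[] parent, int x) {
            while (parent[x] != x) { parent[x] = parent[parent[x]]; x = parent[x]; }
            return x;
        }

        private static void union(int[] parent, int a, int b) {
            parent[find(parent, a)] = find(parent, b);
        }
    }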
23. Determining Attributes to Maximize Visibility of Objects – java Miah, M. Das, G. Hristidis, V. Mannila, H. Dept. of Comput. Sci. & Eng., Univ. of Texas at Arlington, Arlington, TX; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: July 2009 Volume: 21, Issue: 7 On page(s): 959-973 ISSN: 1041-4347 INSPEC Accession Number: 10666311 Digital Object Identifier: 10.1109/TKDE.2009.72 First Published: 2009-03-21 Current Version Published: 2009-05-26 Abstract In recent years, there has been significant interest in the development of ranking functions and efficient top-k retrieval algorithms to help users in ad hoc search and retrieval in databases (e.g., buyers searching for products in a catalog). We introduce a complementary problem: How to guide a seller in selecting the best attributes of a new tuple (e.g., a new product) to highlight so that it stands out in the crowd of existing competitive products and is widely visible to the pool of potential buyers. We develop several formulations of this problem. Although the problems are NP-complete, we give several exact and approximation algorithms that work well in practice. One class of exact algorithms is based on integer programming (IP) formulations of the problems. Another class of exact methods is based on maximal frequent item set mining algorithms. The approximation algorithms are based on greedy heuristics. A detailed performance study illustrates the benefits of our methods on real and synthetic data.
24. Efficient Skyline Computation in Structured Peer-to-Peer Systems – java Bin Cui Lijiang Chen Linhao Xu Hua Lu Guojie Song Quanqing Xu Dept. of Comput. Sci., Peking Univ., Beijing; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: July 2009 Volume: 21, Issue: 7 On page(s): 1059-1072 ISSN: 1041-4347 INSPEC Accession Number: 10666307 Digital Object Identifier: 10.1109/TKDE.2008.235
First Published: 2008-12-31 Current Version Published: 2009-05-26 Abstract An increasing number of large-scale applications exploit peer-to-peer network architecture to provide highly scalable and flexible services. Among these applications, data management in peer-to-peer systems is one of the interesting domains. In this paper, we investigate the multidimensional skyline computation problem on a structured peer-to-peer network. In order to achieve low communication cost and quick response time, we utilize the iMinMax(θ) method to transform high-dimensional data to one-dimensional values and distribute the data in a structured peer-to-peer network called BATON. Thereafter, we propose a progressive algorithm with an adaptive filter technique for efficient skyline computation in this environment. We further discuss some optimization techniques for the algorithm, and summarize the key principles of our algorithm into a query routing protocol with detailed analysis. Finally, we conduct an extensive experimental evaluation to demonstrate the efficiency of our approach.
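The skyline itself is defined by a dominance test: point p dominates q if p is at least as good as q in every dimension and strictly better in at least one. The sketch below gives this test and a naive skyline computation, assuming smaller values are better; it pins down the definition only, while the paper's contribution of distributing the computation over BATON is not shown.

    import java.util.*;

    // Skyline dominance (smaller is better in every dimension): p dominates q if p
    // is no worse than q everywhere and strictly better somewhere. Naive O(n^2)
    // skyline, shown only to pin down the definition.
    class SkylineSketch {
        static boolean dominates(double[] p, double[] q) {
            boolean strictlyBetter = false;
            for (int i = 0; i < p.length; i++) {
                if (p[i] > q[i]) return false;
                if (p[i] < q[i]) strictlyBetter = true;
            }
            return strictlyBetter;
        }

        static List<double[]> skyline(List<double[]> points) {
            List<double[]> result = new ArrayList<>();
            for (double[] candidate : points) {
                boolean dominated = false;
                for (double[] other : points) {
                    if (other != candidate && dominates(other, candidate)) { dominated = true; break; }
                }
                if (!dominated) result.add(candidate);
            }
            return result;
        }
    }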
25. A Communication Perspective on Automatic Text Categorization – java/dotnet Capdevila, M. Florez, O.W.M. Signal & Commun. Process. Dept., Univ. of Vigo, Vigo; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: July 2009 Volume: 21, Issue: 7 On page(s): 1027-1041 ISSN: 1041-4347 INSPEC Accession Number: 10666308 Digital Object Identifier: 10.1109/TKDE.2009.22 First Published: 2009-01-19 Current Version Published: 2009-05-26 Abstract The basic concern of a Communication System is to transfer information from its source to a destination some distance away. Textual documents also deal with the transmission of information. Particularly, from a text categorization system point of view, the information encoded by a document is the topic or category it belongs to. Following this initial intuition, a theoretical framework is developed where Automatic Text Categorization (ATC) is studied under a Communication System perspective. Under this approach, the problem of indexing feature space dimensionality reduction has been tackled by a two-level supervised scheme, implemented by noisy term filtering and subsequent redundant term compression. Gaussian probabilistic categorizers have been revisited and adapted to the sparsity inherent in ATC. Experimental results pertaining to the 20 Newsgroups and Reuters-21578 collections validate the theoretical approaches. The noise filter and redundancy compressor allow an aggressive term vocabulary reduction (reduction factor greater than 0.99) with a minimum loss (lower
than 3 percent) and, in some cases, gain (greater than 4 percent) of final classification accuracy. The adapted Gaussian Naive Bayes classifier reaches classification results similar to those obtained by state-of-the-art Multinomial Naive Bayes (MNB) and Support Vector Machines (SVMs).
26. Predicting Missing Items in Shopping Carts – j2ee/dotnet Wickramaratna, K. Kubat, M. Premaratne, K. Dept. of Electr. & Comput. Eng., Univ. of Miami, Coral Gables, FL; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: July 2009 Volume: 21, Issue: 7 On page(s): 985-998 ISSN: 1041-4347 INSPEC Accession Number: 10666304 Digital Object Identifier: 10.1109/TKDE.2008.229 First Published: 2008-12-02 Current Version Published: 2009-05-26 Abstract Existing research in association mining has focused mainly on how to expedite the search for frequently co-occurring groups of items in "shopping cart" type of transactions; less attention has been paid to methods that exploit these "frequent itemsets" for prediction purposes. This paper contributes to the latter task by proposing a technique that uses partial information about the contents of a shopping cart for the prediction of what else the customer is likely to buy. Using the recently proposed data structure of itemset trees (IT-trees), we obtain, in a computationally efficient manner, all rules whose antecedents contain at least one item from the incomplete shopping cart. Then, we combine these rules by uncertainty processing techniques, including the classical Bayesian decision theory and a new algorithm based on the Dempster-Shafer (DS) theory of evidence combination.
27. Discovery of Structural and Functional Features in RNA Pseudoknots – java/dotnet Chen, Q. Chen, Y.-P.P. Fac. of Sci. & Technol., Deakin Univ., VIC; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: July 2009 Volume: 21, Issue: 7 On page(s): 974-984 ISSN: 1041-4347 INSPEC Accession Number: 10666305 Digital Object Identifier: 10.1109/TKDE.2008.231
First Published: 2008-12-09 Current Version Published: 2009-05-26 Abstract An RNA pseudoknot consists of nonnested double-stranded stems connected by single-stranded loops. There is increasing recognition that RNA pseudoknots are one of the most prevalent RNA structures and fulfill a diverse set of biological roles within cells, and there is an expanding rate of studies into RNA pseudoknotted structures as well as increasing allocation of function. These not only produce valuable structural data but also facilitate an understanding of structural and functional characteristics in RNA molecules. PseudoBase is a database providing structural, functional, and sequence data related to RNA pseudoknots. To capture the features of RNA pseudoknots, we present a novel framework using quantitative association rule mining to analyze the pseudoknot data. The derived rules are classified into specified association groups regarding structure, function, and category of RNA pseudoknots. The discovered association rules assist biologists in filtering out significant knowledge of structure-function and structure-category relationships. A brief biological interpretation of the relationships is presented, and their potential correlations with each other are highlighted.
28. Predictive Ensemble Pruning by Expectation Propagation – java Huanhuan Chen Tino, P. Xin Yao Sch. of Comput. Sci., Univ. of Birmingham, Birmingham; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: July 2009 Volume: 21, Issue: 7 On page(s): 999-1013 ISSN: 1041-4347 INSPEC Accession Number: 10666302 Digital Object Identifier: 10.1109/TKDE.2009.62 First Published: 2009-03-06 Current Version Published: 2009-05-26 Abstract An ensemble is a group of learners that work together as a committee to solve a problem. The existing ensemble learning algorithms often generate unnecessarily large ensembles, which consume extra computational resources and may degrade the generalization performance. Ensemble pruning algorithms aim to find a good subset of ensemble members to constitute a small ensemble, which saves computational resources and performs as well as, or better than, the unpruned ensemble. This paper introduces a probabilistic ensemble pruning algorithm by choosing a set of "sparse" combination weights, most of which are zeros, to prune the ensemble. In order to obtain the set of sparse combination weights and satisfy the nonnegative constraint of the combination weights, a left-truncated, nonnegative Gaussian prior is adopted over every combination weight. The expectation propagation (EP) algorithm is employed to approximate the posterior estimation of the weight vector. The leave-one-out (LOO) error can be
obtained as a by-product in the training of EP without extra computation and is a good indication of the generalization error. Therefore, the LOO error is used together with the Bayesian evidence for model selection in this algorithm. An empirical study on several regression and classification benchmark data sets shows that our algorithm utilizes far fewer component learners but performs as well as, or better than, the unpruned ensemble. Our results are very competitive compared with other ensemble pruning algorithms.
29. Rough Cluster Quality Index Based on Decision Theory – java/dotnet Lingras, P. Min Chen Duoqian Miao Dept. of Math. & Comput. Sci., St. Mary's Univ., Halifax, NS; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: July 2009 Volume: 21, Issue: 7 On page(s): 1014-1026 ISSN: 1041-4347 INSPEC Accession Number: 10666306 Digital Object Identifier: 10.1109/TKDE.2008.236 First Published: 2008-12-31 Current Version Published: 2009-05-26 Abstract Quality of clustering is an important issue in the application of clustering techniques. Most traditional cluster validity indices are geometry-based cluster quality measures. This paper proposes a cluster validity index based on the decision-theoretic rough set model by considering various loss functions. Experiments with synthetic, standard, and real-world retail data show the usefulness of the proposed validity index for the evaluation of rough and crisp clustering. The measure is shown to help determine the optimal number of clusters, as well as an important parameter called the threshold in rough clustering. The experiments with a promotional campaign for the retail data illustrate the ability of the proposed measure to incorporate financial considerations in evaluating the quality of a clustering scheme. This ability to deal with monetary values distinguishes the proposed decision-theoretic measure from other distance-based measures. The proposed validity index can also be extended for evaluating other clustering algorithms such as fuzzy clustering.
30. ANGEL: Enhancing the Utility of Generalization for Privacy Preserving Publication – java/dotnet Yufei Tao Hekang Chen Xiaokui Xiao Shuigeng Zhou Donghui Zhang Dept. of Comput. Sci. & Eng., Chinese Univ. of Hong Kong, Hong Kong; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: July 2009 Volume: 21, Issue: 7
On page(s): 1073-1087 ISSN: 1041-4347 INSPEC Accession Number: 10666310 Digital Object Identifier: 10.1109/TKDE.2009.65 First Published: 2009-03-06 Current Version Published: 2009-05-26 Abstract Generalization is a well-known method for privacy preserving data publication. Despite its vast popularity, it has several drawbacks such as heavy information loss, difficulty of supporting marginal publication, and so on. To overcome these drawbacks, we develop ANGEL, a new anonymization technique that is as effective as generalization in privacy protection, but is able to retain significantly more information in the microdata. ANGEL is applicable to any monotonic principles (e.g., l-diversity, t-closeness, etc.), with its superiority (in correlation preservation) especially obvious when tight privacy control must be enforced. We show that ANGEL lends itself elegantly to the hard problem of marginal publication. In particular, unlike generalization, which can release only restricted marginals, our technique can be easily used to publish any marginals with strong privacy guarantees.
31. Clustering and Sequential Pattern Mining of Online Collaborative Learning Data – j2ee/dotnet Perera, D. Kay, J. Koprinska, I. Yacef, K. Zaiane, O.R. Sch. of Inf. Technol., Univ. of Sydney, Sydney, NSW; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: June 2009 Volume: 21, Issue: 6 On page(s): 759-772 ISSN: 1041-4347 INSPEC Accession Number: 10588847 Digital Object Identifier: 10.1109/TKDE.2008.138 First Published: 2008-07-15 Current Version Published: 2009-04-24 Abstract Group work is widespread in education. The growing use of online tools supporting group work generates huge amounts of data. We aim to exploit this data to support mirroring: presenting useful high-level views of information about the group, together with desired patterns characterizing the behavior of strong groups. The goal is to enable the groups and their facilitators to see relevant aspects of the group's operation, to receive feedback on whether these are more likely to be associated with positive or negative outcomes, and to see where the problems are. We explore how useful mirror information can be extracted via a theory-driven approach and a range of clustering and sequential pattern mining techniques. The context is a senior software development project where
students use the collaboration tool TRAC. We extract patterns distinguishing the better from the weaker groups and gain insights into the success factors. The results point to the importance of leadership and group interaction, and give promising indications of whether they are occurring. Patterns indicating good individual practices were also identified. We found that some key measures can be mined from early data. The results are promising for advising groups at the start and for early identification of effective and poor practices, in time for remediation.
32. Monitoring Online Tests through Data Visualization – j2ee/dotnet Costagliola, G. Fuccella, V. Giordano, M. Polese, G. Dipt. di Mat. ed Inf., Univ. degli Studi di Salerno, Fisciano; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: June 2009 Volume: 21, Issue: 6 On page(s): 773-784 ISSN: 1041-4347 INSPEC Accession Number: 10588843 Digital Object Identifier: 10.1109/TKDE.2008.133 First Published: 2008-07-15 Current Version Published: 2009-04-24 Abstract We present an approach and a system to let tutors monitor several important aspects related to online tests, such as learner behavior and test quality. The approach includes the logging of important data related to learner interaction with the system during the execution of online tests and exploits data visualization to highlight information useful to let tutors review and improve the whole assessment process. We have focused on the discovery of behavioral patterns of learners and conceptual relationships among test items. Furthermore, we have conducted several experiments in our faculty in order to assess the whole approach. In particular, by analyzing the data visualization charts, we have detected several previously unknown test strategies used by the learners. Lastly, we have detected several correlations among questions, which gave us useful feedback on test quality.
33. Communities and Emerging Semantics in Semantic Link Network: Discovery and Learning Hai Zhuge Inst. of Comput. Technol., Chinese Acad. of Sci., Beijing;
This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: June 2009 Volume: 21, Issue: 6 On page(s): 785-799 ISSN: 1041-4347 INSPEC Accession Number: 10588848 Digital Object Identifier: 10.1109/TKDE.2008.141 First Published: 2008-07-15 Current Version Published: 2009-04-24 Abstract The World Wide Web provides plentiful content for Web-based learning, but its hyperlink-based architecture connects Web resources for browsing freely rather than for effective learning. To support effective learning, an e-learning system should be able to discover and make use of the semantic communities and the emerging semantic relations in a dynamic complex network of learning resources. Previous graph-based community discovery approaches are limited in their ability to discover semantic communities. This paper first suggests the semantic link network (SLN), a loosely coupled semantic data model that can semantically link resources and derive implicit semantic links according to a set of relational reasoning rules. By studying the intrinsic relationship between semantic communities and the semantic space of SLN, approaches to discovering reasoning-constraint, rule-constraint, and classification-constraint semantic communities are proposed. Further, the approaches, principles, and strategies for discovering emerging semantics in dynamic SLNs are studied. The basic laws of the semantic link network motion are revealed for the first time. An e-learning environment incorporating the proposed approaches, principles, and strategies to support effective discovery and learning is suggested.
34 Toward a Fuzzy Domain Ontology Extraction Method for Adaptive e-Learning Lau, R.Y.K. Dawei Song Yuefeng Li Cheung, T.C.H. Jin-Xing Hao Dept. of Inf. Syst., City Univ. of Hong Kong, Kowloon; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: June 2009 Volume: 21, Issue: 6 On page(s): 800-813 ISSN: 1041-4347 INSPEC Accession Number: 10588846 Digital Object Identifier: 10.1109/TKDE.2008.137 First Published: 2008-07-15 Current Version Published: 2009-04-24 Abstract With the widespread applications of electronic learning (e-Learning) technologies to
education at all levels, an increasing number of online educational resources and messages are generated from the corresponding e-Learning environments. Nevertheless, it is quite difficult, if not totally impossible, for instructors to read through and analyze the online messages to predict the progress of their students on the fly. The main contribution of this paper is the illustration of a novel concept map generation mechanism which is underpinned by a fuzzy domain ontology extraction algorithm. The proposed mechanism can automatically construct concept maps based on the messages posted to online discussion forums. By browsing the concept maps, instructors can quickly identify the progress of their students and adjust the pedagogical sequence on the fly. Our initial experimental results reveal that the accuracy and the quality of the automatically generated concept maps are promising. Our research work opens the door to the development and application of intelligent software tools to enhance e-Learning.
35. Open Smart Classroom: Extensible and Scalable Learning System in Smart Space Using Web Service Technology Yue Suo Miyata, N. Morikawa, H. Ishida, T. Yuanchun Shi Dept. of Comput. Sci. & Technol., Tsinghua Univ., Beijing; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: June 2009 Volume: 21, Issue: 6 On page(s): 814-828 ISSN: 1041-4347 INSPEC Accession Number: 10588841 Digital Object Identifier: 10.1109/TKDE.2008.117 First Published: 2008-06-27 Current Version Published: 2009-04-24 Abstract A real-time interactive virtual classroom with a tele-education experience is an important approach in distance learning. However, most current systems fail to meet new challenges in extensibility and scalability, which mainly lie in three issues. First, an open system architecture is required to better support the integration of increasing numbers of human-computer interfaces and personal mobile devices in the classroom. Second, the learning system should facilitate opening its interfaces, which will help easy deployment that copes with different circumstances and allows other learning systems to talk to each other. Third, problems emerge when binding existing classroom systems together in different places or even different countries, such as tackling intersystem communication and distant intercultural learning in different languages. To address these issues, we build a prototype application called Open Smart Classroom, built on our software infrastructure based on a multiagent system architecture using Web Service technology in Smart Space. Besides the evaluation of the extensibility and scalability of the system, an experiment connecting two Open Smart Classrooms deployed in different countries is also undertaken, which demonstrates the influence of these new features on the
educational effect. The interesting and encouraging results obtained show a significant research prospect for developing future distance learning systems.
36. NNexus: An Automatic Linker for Collaborative Web-Based Corpora Gardner, J. Krowne, A. Xiong, L. Dept. of Math. & Comput. Sci., Emory Univ., Atlanta, GA; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: June 2009 Volume: 21, Issue: 6 On page(s): 829-839 ISSN: 1041-4347 INSPEC Accession Number: 10588844 Digital Object Identifier: 10.1109/TKDE.2008.136 First Published: 2008-07-15 Current Version Published: 2009-04-24 Abstract In this paper, we introduce the Noosphere Networked Entry eXtension and Unification System (NNexus), a generalization of the automatic linking engine of Noosphere (at PlanetMath.org) and the first system that automates the process of linking disparate "encyclopedia" entries into a fully connected conceptual network. The main challenges of this problem space include: 1) linking quality (correctly identifying which terms to link and which entry to link to with minimal effort on the part of users), 2) efficiency and scalability, and 3) generalization to multiple knowledge bases and web-based information environments. We present the NNexus approach that utilizes subject classification and other metadata to address these challenges. We also present evaluation results demonstrating the effectiveness and efficiency of the approach and discuss ongoing and future directions of research.
37. Effective Collaboration with Information Sharing in Virtual Universities Hua Wang Yanchun Zhang Jinli Cao Dept. of Math. & Comput., Univ. of Southern Queensland, Toowoomba, QLD; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: June 2009 Volume: 21, Issue: 6 On page(s): 840-853 ISSN: 1041-4347 INSPEC Accession Number: 10588852 Digital Object Identifier: 10.1109/TKDE.2008.132 First Published: 2008-07-15 Current Version Published: 2009-04-24
Abstract A global education system, as a key area in future IT, has encouraged developers to provide various learning systems at low cost. While a variety of e-learning advantages has been recognized for a long time and many advances in e-learning systems have been implemented, the needs for effective information sharing in a secure manner have to date been largely ignored, especially for virtual university collaborative environments. Information sharing of virtual universities usually occurs in broad, highly dynamic network-based environments, and formally accessing the resources in a secure manner poses a difficult and vital challenge. This paper aims to build a new rule-based framework to identify and address issues of sharing in virtual university environments through role-based access control (RBAC) management. The framework includes a role-based group delegation granting model, group delegation revocation model, authorization granting, and authorization revocation. We analyze various revocations and the impact of revocations on role hierarchies. The implementation with XML-based tools demonstrates the feasibility of the framework and authorization methods. Finally, the current proposal is compared with other related work.
38 Learning in an Ambient Intelligent World: Enabling Technologies and Practices Xiang Li Ling Feng Lizhu Zhou Yuanchun Shi Tsinghua Univ., Beijing; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: June 2009 Volume: 21, Issue: 6 On page(s): 910-924 ISSN: 1041-4347 INSPEC Accession Number: 10588849 Digital Object Identifier: 10.1109/TKDE.2008.143 First Published: 2008-07-18 Current Version Published: 2009-04-24 Abstract The rapid evolution of information and communication technology opens a wide spectrum of opportunities to change our surroundings into an Ambient Intelligent (AmI) world. AmI is a vision of future information society, where people are surrounded by a digital environment that is sensitive to their needs, personalized to their requirements, anticipatory of their behavior, and responsive to their presence. It emphasizes on greater user friendliness, user empowerment, and more effective service support, with an aim to bring information and communication technology to everyone, every home, every business, and every school, thus improving the quality of human life. AmI unprecedentedly enhances learning experiences by endowing the users with the opportunities of learning in context, a breakthrough from the traditional education settings. In this survey paper, we examine some major characteristics of an AmI learning environment. To deliver a feasible and effective solution to ambient learning, we
overview a few latest developed enabling technologies in context awareness and interactive learning. Associated practices are meanwhile reported. We also describe our experience in designing and implementing a smart class prototype, which allows teachers to simultaneously instruct both local and remote students in a context-aware and natural way. 39Interactive Correction and Recommendation for Computer Language Learning and Training Pahl, C. Kenny, C. Sch. of Comput., Dublin City Univ., Dublin; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: June 2009 Volume: 21, Issue: 6 On page(s): 854-866 ISSN: 1041-4347 INSPEC Accession Number: 10588850 Digital Object Identifier: 10.1109/TKDE.2008.144 First Published: 2008-07-18 Current Version Published: 2009-04-24 Abstract Active learning and training is a particularly effective form of education. In various domains, skills are equally important to knowledge. We present an automated learning and skills training system for a database programming environment that promotes procedural knowledge acquisition and skills training. The system provides meaningful knowledge-level feedback such as correction of student solutions and personalized guidance through recommendations. Specifically, we address automated synchronous feedback and recommendations based on personalized performance assessment. At the core of the tutoring system is a pattern-based error classification and correction component that analyzes student input in order to provide immediate feedback and in order to diagnose student weaknesses and suggest further study material. A syntax-driven approach based on grammars and syntax trees provides the solution for a semantic analysis technique. Syntax tree abstractions and comparison techniques based on equivalence rules and pattern matching are specific approaches. 40 Subontology-Based Resource Management for Web-Based e-Learning Zhaohui Wu Yuxin Mao Huajun Chen Sch. of Comput. Sci., Zhejiang Univ., Hangzhou; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: June 2009 Volume: 21, Issue: 6 On page(s): 867-880
ISSN: 1041-4347 INSPEC Accession Number: 10588842 Digital Object Identifier: 10.1109/TKDE.2008.127 First Published: 2008-07-15 Current Version Published: 2009-04-24 Abstract Recent advances in Web and information technologies have resulted in many e-learning resources. There is an emerging requirement to manage and reuse relevant resources together to achieve on-demand e-learning in the Web. Ontologies have become a key technology for enabling semantic-driven resource management. We argue that to meet the requirements of semantic-based resource management for Web-based e-learning, one should go beyond using domain ontologies statically. In this paper, we provide a semantic mapping mechanism to integrate e-learning databases by using ontology semantics. Heterogeneous e-learning databases can be integrated under a mediated ontology. Taking into account the locality of resource reuse, we propose to represent context-specific portions from the whole ontology as sub-ontologies. We present a sub-ontology-based approach for resource reuse by using an evolutionary algorithm. We also conduct simulation experiments to evaluate the approach with a traditional Chinese medicine e-learning scenario and obtain promising results. 41 Enhancing Learning Objects with an Ontology-Based Memory Zouaq, A. Nkambou, R. Univ. of Quebec at Montreal, Montreal, QC; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: June 2009 Volume: 21, Issue: 6 On page(s): 881-893 ISSN: 1041-4347 INSPEC Accession Number: 10588851 Digital Object Identifier: 10.1109/TKDE.2009.49 First Published: 2009-02-20 Current Version Published: 2009-04-24 Abstract The reusability of learning objects has always been a hot issue. However, we believe that current approaches to e-Learning have failed to find a satisfying answer to this concern. This paper presents an approach that enables capitalization of existing learning resources by first creating "content metadata" through text mining and natural language processing and second by dynamically creating learning knowledge objects, i.e., active, adaptable, reusable, and independent learning objects. The proposed model also suggests explicitly integrating instructional theories in an on-the-fly composition process of learning objects. Semantic Web technologies are used to satisfy such an objective by creating an ontology-based organizational memory able to act as a knowledge base for multiple training environments.
42 Providing Flexible Process Support to Project-Centered Learning Ceri, S. Daniel, F. Matera, M. Raffio, A. Dipt. di Elettron. e Inf., Politec. di Milano, Milan; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: June 2009 Volume: 21, Issue: 6 On page(s): 894-909 ISSN: 1041-4347 INSPEC Accession Number: 10588845 Digital Object Identifier: 10.1109/TKDE.2008.134 First Published: 2008-07-15 Current Version Published: 2009-04-24 Abstract While business process definition is becoming more and more popular as an instrument for describing human activities, there is a growing need for software tools supporting business process abstractions to help users organize and monitor their desktop work. Tools are most effective when they embed some knowledge about the process, e.g., in terms of the typical activities required by the process, so that users can execute the activities without having to define them. Tools must be lightweight and flexible, so as to enable users to create or change the process as soon as there is a new need. In this article, we first describe an application-independent approach to flexible process support by discussing the abstractions required for modeling, creating, enacting, and modifying flexible processes. Then, we show our approach at work in the context of project-centered learning. In this application, learners are challenged to perform concrete tasks in order to master specific subjects; in doing so, they have to conduct significant projects and cope with realistic (or even real-life) working conditions and scenarios. Often, students are geographically dispersed or under severe timing constraints, because these activities intertwine with their normal university activity. As a result, they need communication technology in order to interact and workflow technology in order to organize their work. The developed platform provides a comprehensible, e-learning-specific set of activities and process templates, which can be combined through a simple Web interface into project-centered collaboration processes. We discuss how the general paradigm of flexible processes was adapted to the learning concept, implemented, and experienced by students.
43 An Implementation of the CORDRA Architecture Enhanced for Systematic Reuse of Learning Objects Lin, F.H. Shih, T.K. Won Kim Dept. of Inf. Manage., Chihlee Inst. of Technol., Taipei;
This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: June 2009 Volume: 21, Issue: 6 On page(s): 925-938 ISSN: 1041-4347 INSPEC Accession Number: 10588853 Digital Object Identifier: 10.1109/TKDE.2008.130 First Published: 2008-07-15 Current Version Published: 2009-04-24 Abstract The Sharable Content Object Reference Model (SCORM) specification defines metadata of learning objects, which are used as the elementary reusable components in distance learning. The Content Object Repository Discovery And Registration/Resolution Architecture (CORDRA) specification provides a common architecture for the resolution, discovery, and sharing of these learning objects. These two specifications together define standardized ways in which learning objects can be discovered and reused by content designers. However, the current CORDRA and the definition of objects in SCORM only allow an object to be copied, updated, and reorganized in a new content aggregation, which is used as a delivery package to end users. This paper proposes a revised CORDRA architecture and a reusability mechanism to make instruction design easier. In particular, it proposes a structure called a reusability tree for tracking the history of reuse of learning objects in CORDRA. This paper also defines the notions of similarity, diversity, and relevancy of learning objects to make it easier for users to precisely search for and reuse learning objects. 44 On the Expressive Power of the Relational Algebra on Finite Sets of Relation Pairs Fletcher, G.H.L. Gyssens, M. Paredaens, J. Van Gucht, D. Sch. of Eng. & Comput. Sci., Washington State Univ., Vancouver, WA; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: June 2009 Volume: 21, Issue: 6 On page(s): 939-942 ISSN: 1041-4347 INSPEC Accession Number: 10588854 Digital Object Identifier: 10.1109/TKDE.2008.221 First Published: 2008-10-31 Current Version Published: 2009-04-24 Abstract We give a language-independent characterization of the expressive power of the relational algebra on finite sets of source-target relation instance pairs. The associated decision problem is shown to be co-graph-isomorphism hard and in co NP. The main result is also applied in providing a new characterization of the generic relational queries.
45 Adapted One-versus-All Decision Trees for Data Stream Classification Hashemi, S. Ying Yang Mirzamomen, Z. Kangavari, M. Monash Univ., Clayton, VIC; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: May 2009 Volume: 21, Issue: 5 On page(s): 624-637 ISSN: 1041-4347 INSPEC Accession Number: 10520351 Digital Object Identifier: 10.1109/TKDE.2008.181 First Published: 2008-08-29 Current Version Published: 2009-03-24 Abstract One versus all (OVA) decision trees learn k individual binary classifiers, each one to distinguish the instances of a single class from the instances of all other classes. Thus OVA is different from existing data stream classification schemes, most of which use multiclass classifiers, each discriminating among all the classes. This paper advocates some outstanding advantages of OVA for data stream classification. First, there is low error correlation and hence high diversity among OVA's component classifiers, which leads to high classification accuracy. Second, OVA is adept at accommodating new class labels that often appear in data streams. However, there also remain many challenges to deploying traditional OVA for classifying data streams. First, as every instance is fed to all component classifiers, OVA is known as an inefficient model. Second, OVA's classification accuracy is adversely affected by the imbalanced class distribution in data streams. This paper addresses those key challenges and consequently proposes a new OVA scheme that is adapted for data stream classification. Theoretical analysis and empirical evidence reveal that the adapted OVA can offer faster training, faster updating, and higher classification accuracy than many existing popular data stream classification algorithms.
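Since the abstract above describes OVA only at a high level, the following minimal sketch illustrates the general one-versus-all idea on a stream: one binary learner per class, updated incrementally, with new class labels added as they appear mid-stream. The OnlinePerceptron component learner, the class names, and the toy stream are assumptions made for illustration; this is not the adapted OVA scheme proposed in the paper.

# Minimal one-versus-all (OVA) sketch for streaming classification (illustrative only).

class OnlinePerceptron:
    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def score(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x)) + self.b

    def update(self, x, y):          # y in {+1, -1}
        if y * self.score(x) <= 0:   # mistake-driven update
            for i, xi in enumerate(x):
                self.w[i] += self.lr * y * xi
            self.b += self.lr * y

class OVAStream:
    def __init__(self, n_features):
        self.n_features = n_features
        self.models = {}             # one binary classifier per class seen so far

    def learn_one(self, x, label):
        if label not in self.models:          # new class labels can appear mid-stream
            self.models[label] = OnlinePerceptron(self.n_features)
        for c, m in self.models.items():
            m.update(x, +1 if c == label else -1)

    def predict_one(self, x):
        return max(self.models, key=lambda c: self.models[c].score(x))

# Usage on a toy stream
clf = OVAStream(n_features=2)
for x, y in [([1.0, 0.1], "a"), ([0.1, 1.0], "b"), ([0.9, 0.2], "a")]:
    clf.learn_one(x, y)
print(clf.predict_one([1.0, 0.0]))   # -> "a"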
46 Bayes Vector Quantizer for Class-Imbalance Problem Diamantini, C. Potena, D. Dipt. di Ing. Inf., Gestionale e dell'Autom., Università Politec. delle Marche, Ancona; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: May 2009 Volume: 21, Issue: 5 On page(s): 638-651 ISSN: 1041-4347 INSPEC Accession Number: 10520352
Digital Object Identifier: 10.1109/TKDE.2008.187 First Published: 2008-09-12 Current Version Published: 2009-03-24 Abstract The class-imbalance problem is the problem of learning a classification rule from data that are skewed in favor of one class. On these datasets traditional learning techniques tend to overlook the less numerous class, to the advantage of the majority class. However, the minority class is often the most interesting one for the task at hand. For this reason, the class-imbalance problem has received increasing attention in the last few years. In the present paper we draw the reader's attention to a learning algorithm for the minimization of the average misclassification risk. In contrast to some popular class-imbalance learning methods, this method has its roots in statistical decision theory. A particularly interesting characteristic is that, when class distributions are unknown, the method can work by resorting to a stochastic gradient algorithm. We study the behavior of this algorithm on imbalanced datasets, demonstrating that this principled approach allows us to obtain better classification performance than the principal methods proposed in the literature.
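For readers unfamiliar with the risk criterion mentioned above, the LaTeX fragment below states the average misclassification risk and its minimizer in the generic cost-sensitive Bayes decision-theoretic form; the notation (loss L, decision rule \delta) is a textbook convention assumed here, not taken from the paper.

R(\delta) \;=\; \sum_{c=1}^{C} P(c) \int L\bigl(c, \delta(x)\bigr)\, p(x \mid c)\, dx,
\qquad
\delta^{*}(x) \;=\; \arg\min_{\hat{c}} \sum_{c=1}^{C} L(c, \hat{c})\, P(c \mid x).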
47 Catching the Trend: A Framework for Clustering Concept-Drifting Categorical Data Hung-Leng Chen Ming-Syan Chen Su-Chen Lin Dept. of Electr. Eng., Nat. Taiwan Univ., Taipei; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: May 2009 Volume: 21, Issue: 5 On page(s): 652-665 ISSN: 1041-4347 INSPEC Accession Number: 10520353 Digital Object Identifier: 10.1109/TKDE.2008.192 First Published: 2008-09-19 Current Version Published: 2009-03-24 Abstract Sampling has been recognized as an important technique to improve the efficiency of clustering. However, with sampling applied, those points that are not sampled will not have their labels after the normal process. Although there is a straightforward approach in the numerical domain, the problem of how to allocate those unlabeled data points into proper clusters remains as a challenging issue in the categorical domain. In this paper, a mechanism named MAximal Resemblance Data Labeling (abbreviated as MARDL) is proposed to allocate each unlabeled data point into the corresponding appropriate cluster based on the novel categorical clustering representative, namely, N-Nodeset Importance Representative (abbreviated as NNIR), which represents clusters by the importance of the combinations of attribute values. MARDL has two advantages: (1) MARDL exhibits high
execution efficiency, and (2) MARDL can achieve high intracluster similarity and low intercluster similarity, which are regarded as the most important properties of clusters, thus benefiting the analysis of cluster behaviors. MARDL is empirically validated on real and synthetic data sets and is shown to be significantly more efficient than prior works while attaining results of high quality.
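As a rough illustration of resemblance-based labeling of unsampled categorical points, the sketch below scores each unlabeled point against per-cluster attribute-value frequencies and assigns it to the highest-scoring cluster. This simplifies the NNIR representative to single attribute values (roughly 1-nodesets); the data and function names are invented for the example and are not the paper's exact procedure.

# Simplified resemblance-based labeling for categorical data (MARDL-inspired sketch).

from collections import Counter

def build_representatives(clusters):
    """clusters: dict cluster_id -> list of categorical tuples (the sampled, labeled points)."""
    reps = {}
    for cid, points in clusters.items():
        n = len(points)
        counts = [Counter(col) for col in zip(*points)]          # one Counter per attribute
        reps[cid] = [{v: c / n for v, c in cnt.items()} for cnt in counts]
    return reps

def label(point, reps):
    """Assign an unsampled point to the cluster it resembles most."""
    def resemblance(rep):
        return sum(rep[i].get(v, 0.0) for i, v in enumerate(point))
    return max(reps, key=lambda cid: resemblance(reps[cid]))

clusters = {0: [("red", "small"), ("red", "medium")],
            1: [("blue", "large"), ("green", "large")]}
reps = build_representatives(clusters)
print(label(("red", "medium"), reps))   # -> 0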
49 GridVideo: A Practical Example of Nonscientific Application on the Grid Bruneo, D. Iellamo, G. Minutoli, G. Puliafito, A. Dept. of Math., Univ. of Messina, Messina; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: May 2009 Volume: 21, Issue: 5 On page(s): 666-680 ISSN: 1041-4347 INSPEC Accession Number: 10520354 Digital Object Identifier: 10.1109/TKDE.2008.191 First Published: 2008-09-19 Current Version Published: 2009-03-24 Abstract From the 1990s until now, Grid computing has mainly been used in scientific laboratories. Only in the last few years has it been evolving into a business-innovating technology that is driving commercial adoption. In this paper, we describe GridVideo, a Grid-based multimedia application for the distributed tailoring and streaming of media files. The objective is to show, starting from a real experience, how Grid technologies can be used for the development of nonscientific applications. Relevant performance aspects are analyzed, regarding both user-oriented (in terms of responsiveness) and provider-oriented (in terms of system efficiency) requirements. Different multimedia data dissemination strategies have been analyzed and an innovative technique, based on the Fibonacci series, is proposed. To respond to the stringent quality-of-service (QoS) requirements, typical of soft real-time applications, a reservation-based architecture is presented. Such an architecture is able to manage the Grid resource allocation, thus enabling the provisioning of advanced services with different QoS levels. Technical and practical problems encountered during the development are discussed, and a thorough performance evaluation of the developed prototype is presented.
49 Hierarchically Distributed Peer-to-Peer Document Clustering and Cluster Summarization Hammouda, K.M. Kamel, M.S. Desire2Learn Inc., Kitchener, ON;
This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: May 2009 Volume: 21, Issue: 5 On page(s): 681-698 ISSN: 1041-4347 INSPEC Accession Number: 10520355 Digital Object Identifier: 10.1109/TKDE.2008.189 First Published: 2008-09-19 Current Version Published: 2009-03-24 Abstract In distributed data mining, adopting a flat node distribution model can affect scalability. To address the problem of modularity, flexibility and scalability, we propose a Hierarchically-distributed Peer-to-Peer (HP2PC) architecture and clustering algorithm. The architecture is based on a multi-layer overlay network of peer neighborhoods. Supernodes, which act as representatives of neighborhoods, are recursively grouped to form higher level neighborhoods. Within a certain level of the hierarchy, peers cooperate within their respective neighborhoods to perform P2P clustering. Using this model, we can partition the clustering problem in a modular way across neighborhoods, solve each part individually using a distributed K-means variant, then successively combine clusterings up the hierarchy where increasingly more global solutions are computed. In addition, for document clustering applications, we summarize the distributed document clusters using a distributed keyphrase extraction algorithm, thus providing interpretation of the clusters. Results show decent speedup, reaching 165 times faster than centralized clustering for a 250-node simulated network, with comparable clustering quality to the centralized approach. We also provide comparison to the P2P K-means algorithm and show that HP2PC accuracy is better for typical hierarchy heights. Results for distributed cluster summarization match those of their centralized counterparts with up to 88% accuracy. 50 Exact Knowledge Hiding through Database Extension Gkoulalas-Divanis, A. Verykios, V.S. Dept. of Comput. & Commun. Eng., Univ. of Thessaly, Volos; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: May 2009 Volume: 21, Issue: 5 On page(s): 699-713 ISSN: 1041-4347 INSPEC Accession Number: 10520356 Digital Object Identifier: 10.1109/TKDE.2008.199 First Published: 2008-09-26 Current Version Published: 2009-03-24
Abstract In this paper, we propose a novel, exact border-based approach that provides an optimal solution for the hiding of sensitive frequent itemsets by (i) minimally extending the original database by a synthetically generated database part - the database extension, (ii) formulating the creation of the database extension as a constraint satisfaction problem, (iii) mapping the constraint satisfaction problem to an equivalent binary integer programming problem, (iv) exploiting underutilized synthetic transactions to proportionally increase the support of non-sensitive itemsets, (v) minimally relaxing the constraint satisfaction problem to provide an approximate solution close to the optimal one when an ideal solution does not exist, and (vi) using a partitioning of the universe of items to increase the efficiency of the proposed hiding algorithm. Extending the original database for sensitive itemset hiding is proved to provide optimal solutions to an extended set of hiding problems compared to previous approaches and to provide solutions of higher quality. Moreover, the application of binary integer programming enables the simultaneous hiding of the sensitive itemsets and thus allows for the identification of globally optimal solutions. 51 GLIP: A Concurrency Control Protocol for Clipping Indexing Chang-Tien Lu Jing Dai Ying Jin Mathuria, J. Dept. of Comput. Sci., Virginia Tech., Falls Church, VA; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: May 2009 Volume: 21, Issue: 5 On page(s): 714-728 ISSN: 1041-4347 INSPEC Accession Number: 10520357 Digital Object Identifier: 10.1109/TKDE.2008.183 First Published: 2008-09-12 Current Version Published: 2009-03-24 Abstract Multidimensional databases are beginning to be used in a wide range of applications. To meet this fast-growing demand, the R-tree family is being applied to support fast access to multidimensional data, for which the R+-tree exhibits outstanding search performance. In order to support efficient concurrent access in multiuser environments, concurrency control mechanisms for multidimensional indexing have been proposed. However, these mechanisms cannot be directly applied to the R+-tree because an object in the R+-tree may be indexed in multiple leaves. This paper proposes a concurrency control protocol for R-tree variants with object clipping, namely, Granular Locking for clipping indexing (GLIP). GLIP is the first concurrency control approach specifically designed for the R+-tree and its variants, and it supports efficient concurrent operations with serializable isolation, consistency, and deadlock freedom. Experimental tests on both real and synthetic
data sets validated the effectiveness and efficiency of the proposed concurrent access framework. 52 Fast Query Point Movement Techniques for Large CBIR Systems Danzhou Liu Hua, K.A. Khanh Vu Ning Yu Sch. of Electr. Eng. & Comput. Sci., Univ. of Central Florida, Orlando, FL; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: May 2009 Volume: 21, Issue: 5 On page(s): 729-743 ISSN: 1041-4347 INSPEC Accession Number: 10520358 Digital Object Identifier: 10.1109/TKDE.2008.188 First Published: 2008-09-19 Current Version Published: 2009-03-24 Abstract Target search in content-based image retrieval (CBIR) systems refers to finding a specific (target) image such as a particular registered logo or a specific historical photograph. Existing techniques, designed around query refinement based on relevance feedback, suffer from slow convergence, and do not guarantee to find intended targets. To address these limitations, we propose several efficient query point movement methods. We prove that our approach is able to reach any given target image with fewer iterations in the worst and average cases. We propose a new index structure and query processing technique to improve retrieval effectiveness and efficiency. We also consider strategies to minimize the effects of users' inaccurate relevance feedback. Extensive experiments in simulated and realistic environments show that our approach significantly reduces the number of required iterations and improves overall retrieval performance. The experimental results also confirm that our approach can always retrieve intended targets even with poor selection of initial query points.
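The classic Rocchio update below is one simple, well-known form of query point movement driven by relevance feedback; it is shown only to make the notion concrete and is not the paper's proposed method, which comes with stronger convergence guarantees. The weights alpha, beta, and gamma are conventional illustrative values.

# Rocchio-style query point movement from relevance feedback (illustrative sketch).

def move_query(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    dim = len(query)
    def centroid(vecs):
        if not vecs:
            return [0.0] * dim
        return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
    rel_c, nonrel_c = centroid(relevant), centroid(nonrelevant)
    # move toward relevant examples and away from non-relevant ones
    return [alpha * q + beta * r - gamma * n
            for q, r, n in zip(query, rel_c, nonrel_c)]

q = [0.2, 0.8]
print(move_query(q, relevant=[[0.6, 0.9], [0.7, 0.7]], nonrelevant=[[0.1, 0.1]]))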
53 Schema Vacuuming in Temporal Databases Roddick, J.F. Sch. of Comput. Sci., Eng. & Math., Flinders Univ., Adelaide, SA; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: May 2009 Volume: 21, Issue: 5 On page(s): 744-747 ISSN: 1041-4347 INSPEC Accession Number: 10520359 Digital Object Identifier: 10.1109/TKDE.2008.201
First Published: 2008-09-26 Current Version Published: 2009-03-24 Abstract Temporal databases facilitate the support of historical information by providing functions for indicating the intervals during which a tuple was applicable (along one or more temporal dimensions). Because data are never deleted, only superseded, temporal databases are inherently append-only, resulting, over time, in a large historical sequence of database states. Data vacuuming in temporal databases allows for this sequence to be shortened by strategically, and irrevocably, deleting obsolete data. Schema versioning allows users to maintain a history of database schemata without compromising the semantics of the data or the ability to view data through historical schemata. While the techniques required for data vacuuming in temporal databases have been relatively well covered, the associated area of vacuuming schemata has received less attention. This paper discusses this issue and proposes a mechanism that fits well with existing methods for data vacuuming and schema versioning.
54 The Subgraph Similarity Problem De Nardo, L. Ranzato, F. Tapparo, F. Dipt. di Mat. Pura ed Applicata, Univ. of Padova, Padova; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: May 2009 Volume: 21, Issue: 5 On page(s): 748-749 ISSN: 1041-4347 INSPEC Accession Number: 10520360 Digital Object Identifier: 10.1109/TKDE.2008.205 First Published: 2008-10-10 Current Version Published: 2009-03-24 Abstract Similarity is a well known weakening of bisimilarity where one system is required to simulate the other and vice versa. It has been shown that the subgraph bisimilarity problem, a variation of the subgraph isomorphism problem where isomorphism is weakened to bisimilarity, is NP-complete. We show that the subgraph similarity problem and some related variations thereof still remain NP-complete.
55 Histogram-Based Global Load Balancing in Structured Peer-to-Peer Systems Quang Hieu Vu Beng Chin Ooi Rinard, M. Kian-Lee Tan Sch. of Comput., Nat. Univ. of Singapore, Singapore;
This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: April 2009 Volume: 21, Issue: 4 On page(s): 595-608 ISSN: 1041-4347 INSPEC Accession Number: 10477132 Digital Object Identifier: 10.1109/TKDE.2008.182 First Published: 2008-09-05 Current Version Published: 2009-02-24 Abstract Over the past few years, peer-to-peer (P2P) systems have rapidly grown in popularity and have become a dominant means for sharing resources. In these systems, load balancing is a key challenge because nodes are often heterogeneous. While several load-balancing schemes have been proposed in the literature, these solutions are typically ad hoc, heuristic based, and localized. In this paper, we present a general framework, HiGLOB, for global load balancing in structured P2P systems. Each node in HiGLOB has two key components: 1) a histogram manager maintains a histogram that reflects a global view of the distribution of the load in the system, and 2) a load-balancing manager that redistributes the load whenever the node becomes overloaded or underloaded. We exploit the routing metadata to partition the P2P network into nonoverlapping regions corresponding to the histogram buckets. We propose mechanisms to keep the cost of constructing and maintaining the histograms low. We further show that our scheme can control and bound the amount of load imbalance across the system. Finally, we demonstrate the effectiveness of HiGLOB by instantiating it over three existing structured P2P systems: Skip Graph, BATON, and Chord. Our experimental results indicate that our approach works well in practice. 56 Progressive Parametric Query Optimization Bizarro, P. Bruno, N. DeWitt, D.J. CISUC/DEI, Univ. of Coimbra, Coimbra; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: April 2009 Volume: 21, Issue: 4 On page(s): 582-594 ISSN: 1041-4347 INSPEC Accession Number: 10477131 Digital Object Identifier: 10.1109/TKDE.2008.160 First Published: 2008-08-01 Current Version Published: 2009-02-24
Abstract Commercial applications usually rely on pre-compiled parameterized procedures to interact with a database. Unfortunately, executing a procedure with a set of parameters different from those used at compilation time may be arbitrarily sub-optimal. Parametric query optimization (PQO) attempts to solve this problem by exhaustively determining the optimal plans at each point of the parameter space at compile time. However, PQO is likely not cost-effective if the query is executed infrequently or if it is executed with values only within a subset of the parameter space. In this paper we propose instead to progressively explore the parameter space and build a parametric plan during several executions of the same query. We introduce algorithms that, as parametric plans are populated, are able to frequently bypass the optimizer but still execute optimal or near-optimal plans. 57 Multiscale Representations for Fast Pattern Matching in Stream Time Series Xiang Lian Lei Chen Yu, J.X. Jinsong Han Jian Ma Dept. of Comput. Sci. & Eng., Hong Kong Univ. of Sci. & Technol., Kowloon; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: April 2009 Volume: 21, Issue: 4 On page(s): 568-581 ISSN: 1041-4347 INSPEC Accession Number: 10479185 Digital Object Identifier: 10.1109/TKDE.2008.184 First Published: 2008-09-12 Current Version Published: 2009-02-24 Abstract Similarity-based time-series retrieval has been a subject of long-term study due to its wide usage in many applications, such as financial data analysis, weather data forecasting, and multimedia data retrieval. Its original task was to find those time series similar to a pattern (query) time-series data, where both the pattern and data time series are static. Recently, with an increasing demand on stream data management, similarity-based stream time-series retrieval has raised new research issues due to its unique requirements during the stream processing, such as one-pass search and fast response. In this paper, we address the problem of matching both static and dynamic patterns over stream time-series data. We will develop a novel multiscale representation, called multiscale segment mean, for stream time-series data, which can be incrementally computed and is thus perfectly adapted to the stream characteristics. Most importantly, we propose a novel multistep filtering mechanism over the multiscale representation. Analysis indicates that the mechanism can greatly prune the search space and thus offer fast response. Furthermore, batch processing optimization and the dynamic case where patterns are also from stream time series are discussed. Extensive experiments
show that the multiscale representation, together with the multistep filtering scheme, can efficiently filter out false candidates and detect patterns, compared to the multiscale wavelet representation.
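A minimal sketch of a multiscale segment-mean style representation: level 0 stores means over fixed-length segments and each coarser level averages pairs of means from the level below, so the pyramid can be maintained incrementally as new values arrive. Segment length, number of levels, and the toy series are assumptions for illustration, not the paper's exact construction or filtering bounds.

# Segment means at multiple scales for a (windowed) time series; illustrative sketch.

def multiscale_segment_means(series, w=2, levels=3):
    # Level 0: means of consecutive non-overlapping segments of length w
    base = [sum(series[i:i + w]) / w for i in range(0, len(series) - w + 1, w)]
    pyramid = [base]
    for _ in range(1, levels):
        prev = pyramid[-1]
        nxt = [(prev[i] + prev[i + 1]) / 2 for i in range(0, len(prev) - 1, 2)]
        if not nxt:
            break
        pyramid.append(nxt)
    return pyramid

print(multiscale_segment_means([1, 3, 2, 4, 6, 8, 5, 7], w=2, levels=3))
# [[2.0, 3.0, 7.0, 6.0], [2.5, 6.5], [4.5]]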
58 Optimal Lot Sizing Policies For Sequential Online Auctions Tripathi, A.K. Nair, S.K. Karuga, G.G. Dept. of Inf. Syst. & Oper. Manage., Univ. of Washington, Seattle, WA; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: April 2009 Volume: 21, Issue: 4 On page(s): 554-567 ISSN: 1041-4347 INSPEC Accession Number: 10477130 Digital Object Identifier: 10.1109/TKDE.2008.145 First Published: 2008-07-18 Current Version Published: 2009-02-24 Abstract This study proposes methods for determining the optimal lot sizes for sequential auctions that are conducted to sell sizable quantities of an item. These auctions are fairly common in business to consumer (B2C) auctions. In these auctions, the tradeoff for the auctioneer is between the alacrity with which funds are received, and the amount of funds collected by the faster clearing of inventory using larger lot sizes. Observed bids in these auctions impact the auctioneer's decision on lot sizes in future auctions. We first present a goal programming approach for estimating the bid distribution for the bidder population from the observed bids, readily available in these auctions. We then develop models to compute optimal lot sizes for both stationary and non-stationary bid distributions. For stationary bid distribution, we present closed form solutions and structural results. Our findings show that the optimal lot size increases with inventory holding costs and number of bidders. Our model for non-stationary bid distribution captures the inter-auction dynamics such as the number of bidders, their bids, past winning bids, and lot size. We use simulated data to test the robustness of our model.
59 A Pure Nash Equilibrium-Based Game Theoretical Method for Data Replication across Multiple Servers Khan, S.U. Ahmad, I. Dept. of Electr. & Comput. Eng., North Dakota State Univ., Fargo, ND; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: April 2009 Volume: 21, Issue: 4
On page(s): 537-553 ISSN: 1041-4347 INSPEC Accession Number: 10477129 Digital Object Identifier: 10.1109/TKDE.2008.171 First Published: 2008-08-22 Current Version Published: 2009-02-24 Abstract This paper proposes a non-cooperative game based technique to replicate data objects across a distributed system of multiple servers in order to reduce user perceived Web access delays. In the proposed technique computational agents represent servers and compete with each other to optimize the performance of their servers. The optimality of a non-cooperative game is typically described by Nash equilibrium, which is based on spontaneous and non-deterministic strategies. However, Nash equilibrium may or may not guarantee system-wide performance. Furthermore, there can be multiple Nash equilibria, making it difficult to decide which one is the best. In contrast, the proposed technique uses the notion of pure Nash equilibrium, which if achieved, guarantees stable optimal performance. In the proposed technique, agents use deterministic strategies that work in conjunction with their self-interested nature but ensure system-wide performance enhancement. In general, the existence of a pure Nash equilibrium is hard to achieve, but we prove the existence of such equilibrium in the proposed technique. The proposed technique is also experimentally compared against some well-known conventional replica allocation methods, such as branch and bound, greedy, and genetic algorithms.
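To make the game-theoretic framing concrete, the toy loop below runs best-response dynamics for a deliberately simplified replication "game" in which each server independently decides whether holding a replica is cheaper than fetching the object remotely; iteration stops when no agent wants to deviate, i.e., at a pure Nash equilibrium of this toy game. The cost model, the function name, and the numbers are hypothetical and much simpler than the paper's actual technique.

# Best-response iteration for a simplified replica-placement game (illustrative only).
# read_cost[s][o]: cost for server s to fetch object o remotely; store_cost: cost to hold a replica.

def best_response_replication(n_servers, n_objects, read_cost, store_cost):
    replicas = [[False] * n_objects for _ in range(n_servers)]
    changed = True
    while changed:                      # stop when no agent wants to deviate
        changed = False
        for s in range(n_servers):
            for o in range(n_objects):
                want = read_cost[s][o] > store_cost   # replicate iff it lowers this agent's own cost
                if replicas[s][o] != want:
                    replicas[s][o] = want
                    changed = True
    return replicas

print(best_response_replication(2, 2, read_cost=[[5, 1], [2, 4]], store_cost=3))
# [[True, False], [False, True]]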
60 On the Design and Applicability of Distance Functions in High-Dimensional Data Space Chih-Ming Hsu Ming-Syan Chen Dept. of Electr. Eng., Nat. Taiwan Univ., Taipei; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: April 2009 Volume: 21, Issue: 4 On page(s): 523-536 ISSN: 1041-4347 INSPEC Accession Number: 10477128 Digital Object Identifier: 10.1109/TKDE.2008.178 First Published: 2008-08-29 Current Version Published: 2009-02-24 Abstract Effective distance functions in high dimensional data space are very important in solutions for many data mining problems. Recent research has shown that if the Pearson variation of the distance distribution converges to zero with increasing dimensionality, the distance function will become unstable (or meaningless) in high dimensional space,
even with the commonly used Lp metric in the Euclidean space. This result has spawned many studies along the same lines. However, the necessary condition for instability of a distance function, which is required for function design, remains unknown. In this paper, we shall prove that several important conditions are in fact equivalent to instability. Based on these theoretical results, we employ some effective and valid indices for testing the stability of a distance function. In addition, this theoretical analysis suggests that unstable phenomena are rooted in the variation of the distance distribution. To demonstrate the theoretical results, we design a meaningful distance function, called the shrinkage-divergence proximity (SDP), based on a given distance function. It is shown empirically that the SDP significantly outperforms other measures in terms of stability in high dimensional data space, and is thus more suitable for distance-based clustering applications. 61 Mining Projected Clusters in High-Dimensional Spaces Bouguessa, M. Shengrui Wang Dept. of Comput. Sci., Univ. of Sherbrooke, Sherbrooke, QC; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: April 2009 Volume: 21, Issue: 4 On page(s): 507-522 ISSN: 1041-4347 INSPEC Accession Number: 10477127 Digital Object Identifier: 10.1109/TKDE.2008.162 First Published: 2008-08-01 Current Version Published: 2009-02-24 Abstract Clustering high-dimensional data has been a major challenge due to the inherent sparsity of the points. Most existing clustering algorithms become substantially inefficient if the required similarity measure is computed between data points in the full-dimensional space. To address this problem, a number of projected clustering algorithms have been proposed. However, most of them encounter difficulties when clusters hide in subspaces with very low dimensionality. These challenges motivate our effort to propose a robust partitional distance-based projected clustering algorithm. The algorithm consists of three phases. The first phase performs attribute relevance analysis by detecting dense and sparse regions and their location in each attribute. Starting from the results of the first phase, the goal of the second phase is to eliminate outliers, while the third phase aims to discover clusters in different subspaces. The clustering process is based on the k-means algorithm, with the computation of distance restricted to subsets of attributes where object values are dense. Our algorithm is capable of detecting projected clusters of low dimensionality embedded in a high-dimensional space and avoids the computation of the
distance in the full-dimensional space. The suitability of our proposal has been demonstrated through an empirical study using synthetic and real datasets.
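The assignment step below illustrates the core idea of restricting distance computation to each cluster's relevant attributes, as in the k-means-based third phase described above; the relevant dimensions are fixed by hand here, whereas the paper detects them through dense-region analysis. The data and names are illustrative assumptions.

# Projected-clustering assignment step: distances use only each cluster's relevant dimensions.

def assign(points, centroids, relevant_dims):
    labels = []
    for p in points:
        best, best_d = None, float("inf")
        for c, (centre, dims) in enumerate(zip(centroids, relevant_dims)):
            # average squared difference over the cluster's relevant dimensions only
            d = sum((p[i] - centre[i]) ** 2 for i in dims) / max(len(dims), 1)
            if d < best_d:
                best, best_d = c, d
        labels.append(best)
    return labels

points = [(0.1, 9.0, 0.2), (0.2, 1.0, 0.1), (5.0, 9.1, 4.9), (5.1, 2.0, 5.0)]
centroids = [(0.15, 0.0, 0.15), (5.05, 0.0, 4.95)]
relevant_dims = [[0, 2], [0, 2]]   # dimension 1 is noise for both clusters
print(assign(points, centroids, relevant_dims))   # [0, 0, 1, 1]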
62 IMine: Index Support for Item Set Mining Baralis, E. Cerquitelli, T. Chiusano, S. Dipt. di Autom. e Inf., Politec. di Torino, Torino; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: April 2009 Volume: 21, Issue: 4 On page(s): 493-506 ISSN: 1041-4347 INSPEC Accession Number: 10477126 Digital Object Identifier: 10.1109/TKDE.2008.180 First Published: 2008-08-29 Current Version Published: 2009-02-24 Abstract This paper presents the IMine index, a general and compact structure which provides tight integration of item set extraction in a relational DBMS. Since no constraint is enforced during the index creation phase, IMine provides a complete representation of the original database. To reduce the I/O cost, data accessed together during the same extraction phase are clustered on the same disk block. The IMine index structure can be efficiently exploited by different item set extraction algorithms. In particular, IMine data access methods currently support the FP-growth and LCM v.2 algorithms, but they can straightforwardly support the enforcement of various constraint categories. The IMine index has been integrated into the PostgreSQL DBMS and exploits its physical level access methods. Experiments, run for both sparse and dense data distributions, show the efficiency of the proposed index and its linear scalability also for large datasets. Item set mining supported by the IMine index shows performance always comparable with, and sometimes better than, state of the art algorithms accessing data on flat file.
63 Compression and Aggregation for Logistic Regression Analysis in Data Cubes Ruibin Xi Nan Lin Yixin Chen Dept. of Math., Washington Univ., St. Louis, MO; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: April 2009 Volume: 21, Issue: 4 On page(s): 479-492 ISSN: 1041-4347 INSPEC Accession Number: 10477125
Digital Object Identifier: 10.1109/TKDE.2008.186 First Published: 2008-09-12 Current Version Published: 2009-02-24 Abstract Logistic regression is an important technique for analyzing and predicting data with categorical attributes. In this paper, we consider supporting online analytical processing (OLAP) of logistic regression analysis for multi-dimensional data in a data cube where it is expensive in time and space to build logistic regression models for each cell from the raw data. We propose a novel scheme to compress the data in such a way that we can reconstruct logistic regression models to answer any OLAP query without accessing the raw data. Based on a first-order approximation to the maximum likelihood estimating equations, we develop a compression scheme that compresses each base cell into a small compressed data block with essential information to support the aggregation of logistic regression models. Aggregation formulae for deriving high-level logistic regression models from lower level component cells are given. We prove that the compression is nearly lossless in the sense that the aggregated estimator deviates from the true model by an error that is bounded and approaches zero as the data size increases. The results show that the proposed compression and aggregation scheme can make OLAP of logistic regression in a data cube feasible. Further, it supports real-time logistic regression analysis of stream data, which can only be scanned once and cannot be permanently retained. Experimental results validate our theoretical analysis and demonstrate that our method can dramatically save time and space costs with almost no degradation of the modeling accuracy.
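As a hedged sketch of how per-cell logistic regression fits can be combined without revisiting the raw data, the LaTeX fragment below gives the standard information-weighted (first-order) aggregation form; the paper's exact compressed quantities and error bound differ in detail, so this should be read as a generic approximation, not the paper's formula.

\hat{\beta}_{\mathrm{agg}}
  \;=\;
  \Bigl(\sum_{i=1}^{m} \mathbf{I}_i\Bigr)^{-1}
  \sum_{i=1}^{m} \mathbf{I}_i\, \hat{\beta}_i ,
\qquad
\mathbf{I}_i \;=\; X_i^{\top} W_i X_i ,
\quad
W_i = \mathrm{diag}\bigl(\hat{p}_{ij}(1-\hat{p}_{ij})\bigr),

where \hat{\beta}_i and \mathbf{I}_i are the fitted coefficients and (approximate) information matrix stored for base cell i.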
64 A Generic Local Algorithm for Mining Data Streams in Large Distributed Systems Wolff, R. Bhaduri, K. Kargupta, H. Dept. of Manage. Inf. Syst., Haifa Univ., Haifa; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: April 2009 Volume: 21, Issue: 4 On page(s): 465-478 ISSN: 1041-4347 INSPEC Accession Number: 10477124 Digital Object Identifier: 10.1109/TKDE.2008.169 First Published: 2008-08-22 Current Version Published: 2009-02-24 Abstract In a large network of computers or wireless sensors, each of the components (henceforth, peers) has some data about the global state of the system. Much of the system's functionality such as message routing, information retrieval and load sharing relies on modeling the global state. We refer to the outcome of the function (e.g., the load
experienced by each peer) as the model of the system. Since the state of the system is constantly changing, it is necessary to keep the models up-to-date. Computing global data mining models, e.g., decision trees or k-means clustering, in large distributed systems may be very costly due to the scale of the system and due to communication cost, which may be high. The cost further increases in a dynamic scenario when the data changes rapidly. In this paper we describe a two-step approach for dealing with these costs. First, we describe a highly efficient local algorithm which can be used to monitor a wide class of data mining models. Then, we use this algorithm as a feedback loop for the monitoring of complex functions of the data such as its k-means clustering. The theoretical claims are corroborated with a thorough experimental analysis.
65 Improving Personalization Solutions through Optimal Segmentation of Customer Bases Tianyi Jiang Tuzhilin, A. AvePoint Inc., Jersey, NJ; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: March 2009 Volume: 21, Issue: 3 On page(s): 305-320 Location: Los Angeles, CA, USA, ISSN: 1041-4347 INSPEC Accession Number: 10416748 Digital Object Identifier: 10.1109/TKDE.2008.163 First Published: 2008-08-08 Current Version Published: 2009-01-23 Abstract On the Web, where the search costs are low and the competition is just a mouse click away, it is crucial to segment the customers intelligently in order to offer more targeted and personalized products and services to them. Traditionally, customer segmentation is achieved using statistics-based methods that compute a set of statistics from the customer data and group customers into segments by applying distance-based clustering algorithms in the space of these statistics. In this paper, we present a direct grouping-based approach to computing customer segments that groups customers not based on computed statistics, but in terms of optimally combining transactional data of several customers to build a data mining model of customer behavior for each group. Then, building customer segments becomes a combinatorial optimization problem of finding the best partitioning of the customer base into disjoint groups. This paper shows that finding an optimal customer partition is NP-hard, proposes several suboptimal direct grouping segmentation methods, and empirically compares them among themselves, traditional statistics-based hierarchical and affinity propagation-based segmentation, and one-to-one methods across multiple experimental conditions. It is shown that the best direct grouping method
significantly dominates the statistics-based and one-to-one approaches across most of the experimental conditions, while still being computationally tractable. It is also shown that the distribution of the sizes of customer segments generated by the best direct grouping method follows a power law distribution and that microsegmentation provides the best approach to personalization. 66 Effective and Efficient Query Processing for Video Subsequence Identification Heng Tao Shen Jie Shao Zi Huang Xiaofang Zhou Sch. of Inf. Technol. & Electr. Eng., Univ. of Queensland, Brisbane, QLD; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: March 2009 Volume: 21, Issue: 3 On page(s): 321-334 Location: Los Angeles, CA, USA, ISSN: 1041-4347 INSPEC Accession Number: 10416749 Digital Object Identifier: 10.1109/TKDE.2008.168 First Published: 2008-08-15 Current Version Published: 2009-01-23 Abstract With the growing demand for visual information of rich content, effective and efficient manipulations of large video databases are increasingly desired. Many investigations have been made on content-based video retrieval. However, despite the importance, video subsequence identification, which is to find the similar content to a short query clip from a long video sequence, has not been well addressed. This paper presents a graph transformation and matching approach to this problem, with extension to identify the occurrence of potentially different ordering or length due to content editing. With a novel batch query algorithm to retrieve similar frames, the mapping relationship between the query and database video is first represented by a bipartite graph. The densely matched parts along the long sequence are then extracted, followed by a filter-and-refine search strategy to prune some irrelevant subsequences. During the filtering stage, maximum size matching is deployed for each subgraph constructed by the query and candidate subsequence to obtain a smaller set of candidates. During the refinement stage, submaximum similarity matching is devised to identify the subsequence with the highest aggregate score from all candidates, according to a robust video similarity model that incorporates visual content, temporal order, and frame alignment information. The performance studies conducted on a long video recording of 50 hours validate that our approach is promising in terms of both search accuracy and speed.
67 Automatically Determining the Number of Clusters in Unlabeled Data Sets Liang Wang Leckie, C. Ramamohanarao, K. Bezdek, J. Dept. of Comput. Sci. & Software Eng., Univ. of Melbourne, Melbourne, VIC; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: March 2009 Volume: 21, Issue: 3 On page(s): 335-350 Location: Los Angeles, CA, USA, ISSN: 1041-4347 INSPEC Accession Number: 10416751 Digital Object Identifier: 10.1109/TKDE.2008.158 Current Version Published: 2009-01-23 Abstract Clustering is a popular tool for exploratory data analysis. One of the major problems in cluster analysis is the determination of the number of clusters in unlabeled data, which is a basic input for most clustering algorithms. In this paper we investigate a new method called DBE (dark block extraction) for automatically estimating the number of clusters in unlabeled data sets, which is based on an existing algorithm for visual assessment of cluster tendency (VAT) of a data set, using several common image and signal processing techniques. Basic steps include: 1) generating a VAT image of an input dissimilarity matrix; 2) performing image segmentation on the VAT image to obtain a binary image, followed by directional morphological filtering; 3) applying a distance transform to the filtered binary image and projecting the pixel values onto the main diagonal axis of the image to form a projection signal; 4) smoothing the projection signal, computing its first-order derivative, and then detecting major peaks and valleys in the resulting signal to decide the number of clusters. Our new DBE method is nearly "automatic", depending on just one easy-to-set parameter. Several numerical and real-world examples are presented to illustrate the effectiveness of DBE.
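The fragment below sketches only the final DBE-style step: smoothing a 1-D projection signal and counting its major peaks as the cluster-count estimate. The smoothing window, prominence threshold, and toy signal are illustrative assumptions; the full method's VAT imaging, segmentation, and distance-transform steps are omitted.

# Count major peaks of a smoothed 1-D projection signal (illustrative last step only).

def count_major_peaks(signal, window=3, min_prominence=0.5):
    half = window // 2
    # simple moving-average smoothing
    smooth = [sum(signal[max(0, i - half):i + half + 1]) /
              len(signal[max(0, i - half):i + half + 1]) for i in range(len(signal))]
    lo, hi = min(smooth), max(smooth)
    peaks = 0
    for i in range(1, len(smooth) - 1):
        is_local_max = smooth[i] > smooth[i - 1] and smooth[i] >= smooth[i + 1]
        is_prominent = (smooth[i] - lo) >= min_prominence * (hi - lo)
        if is_local_max and is_prominent:
            peaks += 1
    return peaks

projection = [0, 1, 5, 6, 5, 1, 0, 1, 6, 7, 6, 1, 0]
print(count_major_peaks(projection))   # 2 -> estimate of two clusters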
68 Efficient Processing of Metric Skyline Queries Lei Chen Xiang Lian Dept. of Comput. Sci. & Eng., Hong Kong Univ. of Sci. & Technol., Hong Kong; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: March 2009 Volume: 21, Issue: 3 On page(s): 351-365 Location: Los Angeles, CA, USA, ISSN: 1041-4347 INSPEC Accession Number: 10416744 Digital Object Identifier: 10.1109/TKDE.2008.146
First Published: 2008-07-18 Current Version Published: 2009-01-23 Abstract Skyline query is of great importance in many applications, such as multi-criteria decision making and business planning. In particular, a skyline point is a data object in the database whose attribute vector is not dominated by that of any other objects. Previous methods to retrieve skyline points usually assume static data objects in the database (i.e. their attribute vectors are fixed), whereas several recent work focus on skyline queries with dynamic attributes. In this paper, we propose a novel variant of skyline queries, namely metric skyline, whose dynamic attributes are defined in the metric space (i.e. not limited to the Euclidean space). We illustrate an efficient and effective pruning mechanism to answer metric skyline queries through a metric index. Most importantly, we formalize the query performance of the metric skyline query in terms of the pruning power, by a cost model, in light of which we construct an optimized metric index aiming to maximize the pruning power of metric skyline queries. Extensive experiments have demonstrated the efficiency and effectiveness of our proposed pruning techniques as well as the constructed index in answering metric skyline queries.
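To make the notion of a metric skyline concrete, the brute-force sketch below treats each object's dynamic attributes as its distances to a set of query objects (Euclidean here, but any metric would do) and keeps the objects not dominated on those distances. The index-based pruning that is the paper's actual contribution is deliberately omitted, and the data points are invented for the example.

# Brute-force metric skyline over distances to query objects (illustrative, no index).

import math

def dominates(p, q, queries):
    dp = [math.dist(p, f) for f in queries]
    dq = [math.dist(q, f) for f in queries]
    # p dominates q if p is no farther from every query object and strictly closer to at least one
    return all(x <= y for x, y in zip(dp, dq)) and any(x < y for x, y in zip(dp, dq))

def metric_skyline(objects, queries):
    return [p for p in objects
            if not any(dominates(q, p, queries) for q in objects if q != p)]

objects = [(1, 1), (2, 2), (5, 5), (0, 5)]
queries = [(0, 0), (6, 6)]
print(metric_skyline(objects, queries))
# [(1, 1), (2, 2), (5, 5)] -- (0, 5) is dominated by (2, 2) on both query distances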
69 On the Effect of Location Uncertainty in Spatial Querying Frentzos, E. Gratsias, K. Theodoridis, Y. Univ. of Piraeus, Piraeus; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: March 2009 Volume: 21, Issue: 3 On page(s): 366-383 Location: Los Angeles, CA, USA, ISSN: 1041-4347 INSPEC Accession Number: 10416746 Digital Object Identifier: 10.1109/TKDE.2008.164 First Published: 2008-08-08 Current Version Published: 2009-01-23 Abstract An emerging topic in the field of spatial data management is the handling of location uncertainty of spatial objects, mainly due to inaccurate measurements. The literature on location uncertainty so far has focused on modifying traditional spatial search algorithms in order to handle the impact of objects' location uncertainty in query results. In this paper, we present the first, to the best of our knowledge, theoretical analysis that estimates the average number of false hits introduced in the results of rectangular range queries in the case of data points uniformly distributed in 2D space. Then, we relax the original distribution assumptions showing how to deal with arbitrarily distributed data points and more realistic location uncertainty distributions. The accuracy of the results of
our analytical approach is demonstrated through an extensive experimental study using various synthetic and real datasets. Our proposal can be directly employed in spatial database systems in order to provide users with the accuracy of spatial query results based only on known dataset and query parameters.
70 Distributed Skyline Retrieval with Low Bandwidth Consumption Lin Zhu Yufei Tao Shuigeng Zhou Fudan Univ., Shanghai; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: March 2009 Volume: 21, Issue: 3 On page(s): 384-400 Location: Los Angeles, CA, USA, ISSN: 1041-4347 INSPEC Accession Number: 10416743 Digital Object Identifier: 10.1109/TKDE.2008.142 First Published: 2008-07-15 Current Version Published: 2009-01-23 Abstract We consider skyline computation when the underlying data set is horizontally partitioned onto geographically distant servers that are connected to the Internet. The existing solutions are not suitable for our problem, because they have at least one of the following drawbacks: (1) applicable only to distributed systems adopting vertical partitioning or restricted horizontal partitioning, (2) effective only when each server has limited computing and communication abilities, and (3) optimized only for skyline search in subspaces but inefficient in the full space. This paper proposes an algorithm, called feedback-based distributed skyline (FDS), to support arbitrary horizontal partitioning. FDS aims at minimizing the network bandwidth, measured in the number of tuples transmitted over the network. The core of FDS is a novel feedback-driven mechanism, where the coordinator iteratively transmits certain feedback to each participant. Participants can leverage such information to prune a large amount of local data, which otherwise would need to be sent to the coordinator. Extensive experimentation confirms that FDS significantly outperforms alternative approaches in both effectiveness and progressiveness. 71 semQA: SPARQL with Idempotent Disjunction Shironoshita, E.P. Jean-Mary, Y.R. Bradley, R.M. Kabuka, M.R. INFO-TECH Soft Inc., Miami, FL;
This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: March 2009 Volume: 21, Issue: 3 On page(s): 401-414 Location: Los Angeles, CA, USA, ISSN: 1041-4347 INSPEC Accession Number: 10416750 Digital Object Identifier: 10.1109/TKDE.2008.91 First Published: 2008-05-12 Current Version Published: 2009-01-23 Abstract The SPARQL LeftJoin abstract operator is not distributive over Union; this limits the algebraic manipulation of graph patterns, which in turn restricts the ability to create query plans for distributed processing or query optimization. In this paper, we present semQA, an algebraic extension for the SPARQL query language for RDF, which overcomes this issue by transforming graph patterns through the use of an idempotent disjunction operator Or as a substitute for Union. This permits the application of a set of equivalences that transform a query into distinct forms. We further present an algorithm to derive the solution set of the original query from the solution set of a query where Union has been substituted by Or. We also analyze the combined complexity of SPARQL, proving it to be NP-complete. It is also shown that the SPARQL query language is not, in the general case, fixed-parameter tractable. Experimental results are presented to validate the query evaluation methodology presented in this paper against the SPARQL standard, to corroborate the complexity analysis, and to illustrate the gains in processing cost reduction that can be obtained through the application of semQA.
72 Detecting, Assessing and Monitoring Relevant Topics in Virtual Information Environments Ontrup, J. Ritter, H. Scholz, S.W. Wagner, R. Fac. of Technol., Bielefeld Univ., Bielefeld; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: March 2009 Volume: 21, Issue: 3 On page(s): 415-427 ISSN: 1041-4347 INSPEC Accession Number: 10458278 Digital Object Identifier: 10.1109/TKDE.2008.149 First Published: 2008-07-18 Current Version Published: 2009-01-23 Abstract The ability to assess the relevance of topics and related sources in information-rich
environments is a key to success when scanning business environments. This paper introduces a hybrid system to support managerial information gathering. The system is made up of three components: 1) a hierarchical hyperbolic SOM for structuring the information environment and visualizing the intensity of news activity with respect to identified topics, 2) a spreading activation network for the selection of the most relevant information sources with respect to an already existing knowledge infrastructure, and 3) measures of interestingness for association rules, together with statistical testing, to facilitate the monitoring of already identified topics. Embedding the system in a framework describing three modes of human information-seeking behavior supports the active organization, exploration, and selection of information that matches the needs of decision makers at all stages of the information gathering process. By applying our system in the domain of the hotel industry, we demonstrate how typical information gathering tasks are supported. Moreover, we present an empirical study investigating the effectiveness and efficiency of the visualization framework of our system.
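For the third component of entry 72, the measures of interestingness for association rules are standard ones; the sketch below computes support, confidence, and lift for a rule over a set of transactions. The function and data names are illustrative.

```python
def rule_interestingness(transactions, antecedent, consequent):
    """Compute support, confidence, and lift of the rule antecedent -> consequent
    over a list of transactions (each transaction is a set of items)."""
    n = len(transactions)
    a = sum(1 for t in transactions if antecedent <= t)            # antecedent count
    c = sum(1 for t in transactions if consequent <= t)            # consequent count
    both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = both / n
    confidence = both / a if a else 0.0
    lift = confidence / (c / n) if c else 0.0
    return support, confidence, lift

docs = [{"hotel", "booking"}, {"hotel", "review"}, {"hotel", "booking", "review"}, {"flight"}]
print(rule_interestingness(docs, {"hotel"}, {"booking"}))  # (0.5, 0.666..., 1.333...)
```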
73 Distributional Features for Text Categorization Xiao-Bing Xue Zhi-Hua Zhou Nanjing Univ., Nanjing; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: March 2009 Volume: 21, Issue: 3 On page(s): 428-442 Location: Los Angeles, CA, USA, ISSN: 1041-4347 INSPEC Accession Number: 10416747 Digital Object Identifier: 10.1109/TKDE.2008.166 First Published: 2008-08-08 Current Version Published: 2009-01-23 Abstract Text categorization is the task of assigning predefined categories to natural language text. With the widely used 'bag of words' representation, previous research usually assigns a word values indicating whether the word appears in the document concerned or how frequently it appears. Although these values are useful for text categorization, they do not fully express the rich information contained in the document. This paper explores the effect of other types of values, which express the distribution of a word in the document. These novel values assigned to a word are called distributional features, which include the compactness of the appearances of the word and the position of the first appearance of the word. The proposed distributional features are exploited through a tf-idf style equation, and different features are combined using ensemble learning techniques. Experiments show that the distributional features are useful for text
categorization. Compared with using the traditional term frequency values alone, including the distributional features requires only a little additional cost, while the categorization performance can be significantly improved. Further analysis shows that the distributional features are especially useful when documents are long and the writing style is casual. 74 The Development of Fuzzy Rough Sets with the Use of Structures and Algebras of Axiomatic Fuzzy Sets Xiaodong Liu Pedrycz, W. Tianyou Chai Mingli Song Res. Center of Inf. & Control, Dalian Univ. of Technol., Dalian; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: March 2009 Volume: 21, Issue: 3 On page(s): 443-462 Location: Los Angeles, CA, USA, ISSN: 1041-4347 INSPEC Accession Number: 10416745 Digital Object Identifier: 10.1109/TKDE.2008.147 First Published: 2008-07-18 Current Version Published: 2009-01-23 Abstract The notion of a rough set, originally proposed by Pawlak, has undergone a number of extensions and generalizations. Dubois and Prade (1990) introduced fuzzy rough sets, which combine rough sets and fuzzy sets within a single framework. Radzikowska and Kerre (2002) proposed a broad family of fuzzy rough sets, referred to as (phi, t)-fuzzy rough sets, which are determined by some implication operator (implicator) phi and a certain t-norm. In order to describe the linguistically represented concepts coming from data available in some information system, the concept of fuzzy rough sets is redefined and further studied in the setting of the axiomatic fuzzy set (AFS) theory. Compared with the (phi, t)-fuzzy rough sets, the advantages of AFS fuzzy rough sets are twofold. They can be directly applied to data analysis present in any information system without resorting to the details concerning the choice of the implication phi, the t-norm, and a similarity relation S. Furthermore, such rough approximations of fuzzy concepts come with well-defined semantics and therefore offer a sound interpretation. Some examples are included to illustrate the effectiveness of the proposed construct. It is shown that the AFS fuzzy rough sets provide far higher flexibility and effectiveness in comparison with rough sets and some of their generalizations.
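To illustrate the distributional features of entry 73 above, the sketch below computes a normalized first-appearance position and a simple compactness measure for a word's occurrences in a document; the exact definitions in the paper may differ, so treat these formulas as stand-ins.

```python
def distributional_features(document_tokens, word):
    """Return (first_appearance_position, compactness) for a word in a document.

    first_appearance_position: index of the first occurrence, normalized by
    document length (0 = at the very start, close to 1 = near the end).
    compactness: mean absolute deviation of the occurrence positions from their
    centroid, normalized by document length (small = appearances are clustered).
    Both are illustrative stand-ins for the paper's exact definitions."""
    positions = [i for i, tok in enumerate(document_tokens) if tok == word]
    if not positions:
        return None
    n = len(document_tokens)
    first_pos = positions[0] / n
    centroid = sum(positions) / len(positions)
    compactness = sum(abs(p - centroid) for p in positions) / (len(positions) * n)
    return first_pos, compactness

doc = "the cat sat on the mat and the cat slept".split()
print(distributional_features(doc, "cat"))  # (0.1, 0.35)
```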
75 Learning Image-Text Associations Tao Jiang Ah-Hwee Tan ecPresence Technol. Pte Ltd., Singapore; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: Feb. 2009 Volume: 21, Issue: 2 On page(s): 161-177 Location: Los Angeles, CA, USA, ISSN: 1041-4347 INSPEC Accession Number: 10370659 Digital Object Identifier: 10.1109/TKDE.2008.150 First Published: 2008-08-01 Current Version Published: 2008-12-30 Abstract Web information fusion can be defined as the problem of collating and tracking information related to specific topics on the World Wide Web. Whereas most existing work on Web information fusion has focused on text-based multidocument summarization, this paper concerns the topic of image and text association, a cornerstone of cross-media Web information fusion. Specifically, we present two learning methods for discovering the underlying associations between images and texts based on small training data sets. The first method, based on vague transformation, measures the information similarity between the visual features and the textual features through a set of predefined domain-specific information categories. The second method uses a neural network to learn a direct mapping between the visual and textual features by automatically and incrementally summarizing the associated features into a set of information templates. Despite their distinct approaches, our experimental results on a terrorist domain document set show that both methods are capable of learning associations between images and texts from a small training data set. 76 Decompositional Rule Extraction from Support Vector Machines by Active Learning Martens, D. Baesens, B.B. Van Gestel, T. Dept. of Bus. Adm. & Public Manage., Hogeschool Gent, Ghent; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: Feb. 2009 Volume: 21, Issue: 2 On page(s): 178-191 Location: Los Angeles, CA, USA, ISSN: 1041-4347 INSPEC Accession Number: 10370654 Digital Object Identifier: 10.1109/TKDE.2008.131 First Published: 2008-07-15 Current Version Published: 2008-12-30
Abstract Support vector machines (SVMs) are currently state-of-the-art for the classification task and, generally speaking, exhibit good predictive performance due to their ability to model nonlinearities. However, their strength is also their main weakness, as the generated nonlinear models are typically regarded as incomprehensible black-box models. In this paper, we propose a new active learning-based approach (ALBA) to extract comprehensible rules from opaque SVM models. Through rule extraction, some insight is provided into the logic of the SVM model. ALBA extracts rules from the trained SVM model by explicitly making use of key concepts of the SVM: the support vectors, and the observation that these are typically close to the decision boundary. Active learning implies a focus on apparent problem areas, which for rule induction techniques are the regions close to the SVM decision boundary, where most of the noise is found. By generating extra data close to these support vectors and labeling them with the trained SVM model, rule induction techniques are better able to discover suitable discrimination rules. This performance increase, in terms of both predictive accuracy and comprehensibility, is confirmed in our experiments, where we apply ALBA to several publicly available data sets. 77 Multiclass MTS for Simultaneous Feature Selection and Classification Chao-Ton Su Yu-Hsiang Hsiao Nat. Tsing Hua Univ., Hsinchu; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: Feb. 2009 Volume: 21, Issue: 2 On page(s): 192-205 Location: Los Angeles, CA, USA, ISSN: 1041-4347 INSPEC Accession Number: 10370662 Digital Object Identifier: 10.1109/TKDE.2008.128 First Published: 2008-07-15 Current Version Published: 2008-12-30 Abstract Multiclass Mahalanobis-Taguchi system (MMTS), the extension of MTS, is developed for simultaneous multiclass classification and feature selection. In MMTS, the multiclass measurement scale is constructed by establishing an individual Mahalanobis space for each class. To increase the validity of the measurement scale, the Gram-Schmidt process is performed to mutually orthogonalize the features and eliminate the multicollinearity. The important features are identified using the orthogonal arrays and the signal-to-noise ratio, and are then used to construct a reduced model measurement scale. The contribution of each important feature to classification is also derived according to the effect gain to develop a weighted Mahalanobis distance, which is finally used as the distance metric for MMTS classification. Using the reduced model measurement
scale, an unknown example will be classified into the class with minimum weighted Mahalanobis distance considering only the important features. For evaluating the effectiveness of MMTS, a numerical experiment is conducted, and the results show that MMTS outperforms other well-known algorithms not only on classification accuracy but also on feature selection efficiency. Finally, a real case about gestational diabetes mellitus is studied, and the results indicate the practicality of MMTS in real-world applications. 78 k-Anonymization with Minimal Loss of Information Gionis, A. Tassa, T. Yahoo! Res., Barcelona; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: Feb. 2009 Volume: 21, Issue: 2 On page(s): 206-219 Location: Los Angeles, CA, USA, ISSN: 1041-4347 INSPEC Accession Number: 10399653 Digital Object Identifier: 10.1109/TKDE.2008.129 First Published: 2008-07-15 Current Version Published: 2008-12-30 Abstract The technique of k-anonymization allows databases containing personal information to be released while ensuring some degree of individual privacy. Anonymization is usually performed by generalizing database entries. We formally study the concept of generalization, and propose three information-theoretic measures for capturing the amount of information that is lost during the anonymization process. The proposed measures are more general and more accurate than those that were proposed by Meyerson and Williams and Aggarwal et al. We study the problem of achieving k-anonymity with minimal loss of information. We prove that it is NP-hard and study polynomial approximations for the optimal solution. Our first algorithm gives an approximation guarantee of O(ln k) for two of our measures as well as for the previously studied measures. This improves the best-known O(k)-approximation from earlier work. While the previous approximation algorithms relied on the graph representation framework, our algorithm relies on a novel hypergraph representation that enables the improvement in the approximation ratio from O(k) to O(ln k). As the running time of the algorithm is O(n^{2k}), we also show how to adapt an earlier algorithm in order to obtain an O(k)-approximation algorithm that is polynomial in both n and k. 79 Cost-Based Predictive Spatiotemporal Join Wook-Shin Han Jaehwa Kim Byung Suk Lee Yufei Tao Rantzau, R. Markl, V. Dept. of Comput. Eng., Kyungpook Nat. Univ., Daegu;
This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: Feb. 2009 Volume: 21, Issue: 2 On page(s): 220-233 Location: Los Angeles, CA, USA, ISSN: 1041-4347 INSPEC Accession Number: 10370660 Digital Object Identifier: 10.1109/TKDE.2008.159 First Published: 2008-08-01 Current Version Published: 2008-12-30 Abstract A predictive spatiotemporal join finds all pairs of moving objects satisfying a join condition on future time and space. In this paper, we present CoPST, the first and foremost algorithm for such a join using two spatiotemporal indexes. In a predictive spatiotemporal join, the bounding boxes of the outer index are used to perform window searches on the inner index, and these bounding boxes enclose objects with increasing laxity over time. CoPST constructs globally tightened bounding boxes "on the fly" to perform window searches during join processing, thus significantly reducing overlap and improving join performance. CoPST adapts gracefully to large-scale databases by dynamically switching between main-memory buffering and disk-based buffering through a novel probabilistic cost model. Our extensive experiments validate the cost model and show its accuracy for realistic data sets. We also showcase the superiority of CoPST over algorithms adapted from state-of-the-art spatial join algorithms, with a speedup of up to an order of magnitude. 80 BMQ-Processor: A High-Performance Border-Crossing Event Detection Framework for Large-Scale Monitoring Applications Jinwon Lee Seungwoo Kang Youngki Lee Sang Jeong Lee Junehwa Song Div. of Comput. Sci., Korea Adv. Inst. of Sci. & Technol. (KAIST), Daejeon; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: Feb. 2009 Volume: 21, Issue: 2 On page(s): 234-252 Location: Los Angeles, CA, USA, ISSN: 1041-4347 INSPEC Accession Number: 10370656 Digital Object Identifier: 10.1109/TKDE.2008.140 First Published: 2008-07-15 Current Version Published: 2008-12-30 Abstract In this paper, we present BMQ-Processor, a high-performance border-crossing event (BCE) detection framework for large-scale monitoring applications. We first characterize
a new query semantics, namely, border monitoring query (BMQ), which is useful for BCE detection in many monitoring applications. It monitors the values of data streams and reports them only when they cross the borders of the query range. We then propose BMQ-Processor to efficiently handle a large number of BMQs over a high volume of data streams. BMQ-Processor efficiently processes BMQs in a shared and incremental manner. It develops and operates over a novel stateful query index, achieving a high level of scalability over continuous data updates. Also, it utilizes the locality embedded in data streams and greatly accelerates successive BMQ evaluations. We present data structures and algorithms to support 1D as well as multidimensional BMQs. We show that the semantics of border monitoring can be extended toward more advanced ones and build region transition monitoring as a sample case. Lastly, we demonstrate excellent processing performance and low storage cost of BMQ-Processor through extensive analysis and experiments. 81 Privacy-Preserving Kth Element Score over Vertically Partitioned Data Vaidya, J. Clifton, C.W. Manage. Sci. & Inf. Syst. Dept., Rutgers Univ., Newark, NJ; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: Feb. 2009 Volume: 21, Issue: 2 On page(s): 253-258 Location: Los Angeles, CA, USA, ISSN: 1041-4347 INSPEC Accession Number: 10370661 Digital Object Identifier: 10.1109/TKDE.2008.167 First Published: 2008-08-15 Current Version Published: 2008-12-30 Abstract Given a large integer data set shared vertically by two parties, we consider the problem of securely computing a score that separates the kth and the (k + 1)st elements, and propose a protocol to compute such a score while revealing little additional information. The proposed protocol is implemented using the Fairplay system, and experimental results are reported. We show a real application of this protocol as a component used in the secure processing of top-k queries over vertically partitioned data. 82 Semantic Access to Multichannel M-Services Xu Yang Bouguettaya, A. Spirent Commun., Rockville, MD; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: Feb. 2009 Volume: 21, Issue: 2 On page(s): 259-272 Location: Los Angeles, CA, USA,
ISSN: 1041-4347 INSPEC Accession Number: 10370658 Digital Object Identifier: 10.1109/TKDE.2008.157 First Published: 2008-08-01 Current Version Published: 2008-12-30 Abstract M-services provide mobile users with wireless access to Web services. In this paper, we present a novel infrastructure for supporting M-services in wireless broadcast systems. The proposed infrastructure provides a generic framework for mobile users to look up, access, and execute Web services over wireless broadcast channels. Access efficiency is an important issue in wireless broadcast systems. We discuss different semantics that have an impact on the access efficiency of composite M-services. A multiprocess workflow is proposed for effectively accessing composite M-services from multiple broadcast channels based on these semantics. We also present and compare different broadcast channel organizations for M-services and wireless data. Analytical models are provided for these channel organizations. Practical studies are presented to demonstrate the impact of different semantics and channel organizations on the access efficiency. 83 Online Scheduling Sequential Objects with Periodicity for Dynamic Information Dissemination Chih-Lin Hu Ming-Syan Chen Dept. of Commun. Eng., Nat. Central Univ., Jhongli; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: Feb. 2009 Volume: 21, Issue: 2 On page(s): 273-286 Location: Los Angeles, CA, USA, ISSN: 1041-4347 INSPEC Accession Number: 10370657 Digital Object Identifier: 10.1109/TKDE.2008.148 First Published: 2008-07-18 Current Version Published: 2008-12-30 Abstract The scalability of data broadcasting has been demonstrated by prior studies on the basis of traditional data management systems, where data objects, mapped to a pair of state and value in the database, are independent, persistent, and static against simple queries. However, many modern information applications spread dynamic data objects and process complex queries for retrieving multiple data objects. In particular, information servers dynamically generate data objects that are dependent and can be assembled into a complete response to complex queries. Accordingly, the study in this paper considers the problem of scheduling dynamic broadcast data objects in a clients-providers-servers system from the standpoint of data association, dependency, and dynamics. Since the data broadcast problem is NP-hard, we derive the lower and the upper bounds of the mean service access time. In light of the theoretical analyses, we
further devise a deterministic algorithm with several gain measure functions to approximate the optimal schedule. The experimental results show that the proposed algorithm generates a dynamic broadcast schedule whose mean service access time is very close to the theoretical optimum.
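Returning to the border monitoring query (BMQ) semantics of entry 80 above: a BMQ reports a stream value only when it crosses the border of the monitored range, not on every update. The minimal sketch below captures that semantics for a single 1D range; BMQ-Processor's shared, stateful index over many queries is considerably more involved.

```python
class BorderMonitor:
    """Report a value only when the stream crosses the border of [low, high]."""

    def __init__(self, low, high):
        self.low, self.high = low, high
        self.inside = None  # unknown until the first update

    def update(self, value):
        now_inside = self.low <= value <= self.high
        crossed = self.inside is not None and now_inside != self.inside
        self.inside = now_inside
        return value if crossed else None  # None means "no border-crossing event"

m = BorderMonitor(10, 20)
print([m.update(v) for v in [5, 8, 12, 15, 25, 18]])  # [None, None, 12, None, 25, 18]
```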
84 Storing and Indexing Spatial Data in P2P Systems Kantere, V. Skiadopoulos, S. Sellis, T. Ecole Polytech. Fed. de Lausanne, Lausanne; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: Feb. 2009 Volume: 21, Issue: 2 On page(s): 287-300 Location: Los Angeles, CA, USA, ISSN: 1041-4347 INSPEC Accession Number: 10370655 Digital Object Identifier: 10.1109/TKDE.2008.139 First Published: 2008-07-15 Current Version Published: 2008-12-30 Abstract The peer-to-peer (P2P) paradigm has become very popular for storing and sharing information in a totally decentralized manner. At first, research focused on P2P systems that host 1D data. Nowadays, the need for P2P applications with multidimensional data has emerged, motivating research on P2P systems that manage such data. The majority of the proposed techniques are based either on the distribution of centralized indexes or on the reduction of multidimensional data to one dimension. Our goal is to create from scratch a technique that is inherently distributed and also maintains the multidimensionality of data. Our focus is on structured P2P systems that share spatial information. We present SpatialP2P, a totally decentralized indexing and searching framework that is suitable for spatial data. SpatialP2P supports P2P applications in which spatial information of various sizes can be dynamically inserted or deleted, and peers can join or leave. The proposed technique preserves the locality and directionality of space well. 85 Unsupervised Multiway Data Analysis: A Literature Survey Acar, E. Yener, B. Dept. of Comput. Sci., Rensselaer Polytech. Inst., Troy, NY; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: Jan. 2009 Volume: 21, Issue: 1 On page(s): 6-20 Location: Los Angeles, CA, USA, ISSN: 1041-4347
INSPEC Accession Number: 10324287 Digital Object Identifier: 10.1109/TKDE.2008.112 First Published: 2008-06-06 Current Version Published: 2008-11-25 Abstract Two-way arrays or matrices are often not enough to represent all the information in the data, and standard two-way analysis techniques commonly applied to matrices may fail to find the underlying structures in multi-modal datasets. Multiway data analysis has recently become popular as an exploratory analysis tool for discovering the structures in higher-order datasets, where data have more than two modes. We provide a review of significant contributions in the literature on multiway models and algorithms, as well as their applications in diverse disciplines including chemometrics, neuroscience, social network analysis, text mining, and computer vision.
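A small illustration of the most basic multiway primitive that the surveyed models (e.g., Tucker, PARAFAC) build on: mode-n unfolding (matricization) of a three-way array. The sketch uses NumPy and one common unfolding convention; others exist.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-n unfolding of a multiway array: move the given mode to the front
    and flatten the remaining modes into columns."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

# A 2 x 3 x 4 array unfolds into matrices of shape 2x12, 3x8, and 4x6.
X = np.arange(24).reshape(2, 3, 4)
for mode in range(3):
    print(mode, unfold(X, mode).shape)
```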
86 Comparing Scores Intended for Ranking Bhamidipati, N.L. Pal, S.K. Data Min. & Res. Group, Yahoo! Software Dev. India Pvt. Ltd., Bangalore; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: Jan. 2009 Volume: 21, Issue: 1 On page(s): 21-34 Location: Los Angeles, CA, USA, ISSN: 1041-4347 INSPEC Accession Number: 10324288 Digital Object Identifier: 10.1109/TKDE.2008.111 First Published: 2008-06-06 Current Version Published: 2008-11-25 Abstract Often, ranking is performed on the basis of some scores available for each item. The existing practice for comparing scoring functions is to compare the induced rankings by one of the multitude of rank comparison methods available in the literature. We suggest that it may be better to compare the underlying scores themselves. To this end, a generalized Kendall distance is defined, which takes into consideration not only the final ordering that the two schemes produce, but also the spacing between pairs of scores. This is shown to be equivalent to comparing the scores after fusing with another set of scores, making it theoretically interesting. A top-k version of the score comparison methodology is also provided. Experimental results clearly show the advantages score comparison has over rank comparison.
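A hedged sketch of the idea behind entry 86: instead of counting discordant pairs with unit weight, as plain Kendall distance does, weight each discordant pair by how far apart the two scoring schemes place the items. The paper's generalized Kendall distance may be defined differently; this is only an illustration of score-aware comparison.

```python
from itertools import combinations

def weighted_kendall_distance(scores_a, scores_b):
    """Compare two score vectors over the same items.

    A pair of items contributes only if the two schemes order them differently
    (a discordant pair), and its contribution is weighted by the product of the
    score gaps, so disagreements on widely separated scores cost more.
    Illustrative definition, not necessarily the paper's."""
    total = 0.0
    for i, j in combinations(range(len(scores_a)), 2):
        da = scores_a[i] - scores_a[j]
        db = scores_b[i] - scores_b[j]
        if da * db < 0:          # discordant: opposite orderings
            total += abs(da) * abs(db)
    return total

print(weighted_kendall_distance([0.9, 0.5, 0.1], [0.2, 0.8, 0.3]))  # 0.32
```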
87 Online Skyline Analysis with Dynamic Preferences on Nominal Attributes Wong, R.C.-W. Jian Pei Fu, A.W.-C. Ke Wang Dept. of Comput. Sci. & Eng., Hong Kong Univ. of Sci. & Technol., Kowloon; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: Jan. 2009 Volume: 21, Issue: 1 On page(s): 35-49 Location: Los Angeles, CA, USA, ISSN: 1041-4347 INSPEC Accession Number: 10324289 Digital Object Identifier: 10.1109/TKDE.2008.115 First Published: 2008-06-17 Current Version Published: 2008-11-25 Abstract The importance of skyline analysis has been well recognized in multi-criteria decision making applications. All of the previous studies assume a fixed order on the attributes in question. However, in some applications, users may be interested in skylines with respect to various total or partial orders on nominal attributes. In this paper, we identify and tackle the problem of online skyline analysis with dynamic preferences on nominal attributes. We investigate how changes of orders on attributes lead to changes of skylines. We address two novel types of interesting queries: a viewpoint query returns the orders with respect to which a point is (or is not) in the skyline, and an order-based skyline query retrieves the skyline with respect to a specific order. We develop two methods systematically and report an extensive performance study using both synthetic and real data sets to verify their effectiveness and efficiency. 88 Self-Learning Disk Scheduling Yu Zhang Bhargava, B. Dept. of Comput. Sci., Purdue Univ., West Lafayette, IN; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: Jan. 2009 Volume: 21, Issue: 1 On page(s): 50-65 Location: Los Angeles, CA, USA, ISSN: 1041-4347 INSPEC Accession Number: 10324290 Digital Object Identifier: 10.1109/TKDE.2008.116 First Published: 2008-06-20 Current Version Published: 2008-11-25 Abstract Performance of disk I/O schedulers is affected by many factors, such as workloads, file
systems, and disk systems. Disk scheduling performance can be improved by tuning scheduler parameters, such as the length of read timers. Scheduler performance tuning is mostly done manually. To automate this process, we propose four self-learning disk scheduling schemes: change-sensing Round-Robin, feedback learning, per-request learning, and two-layer learning. Experiments show that the novel two-layer learning scheme performs best. It integrates the workload-level and request-level learning algorithms. It employs feedback learning techniques to analyze workloads, change scheduling policy, and tune scheduling parameters automatically. We discuss schemes to choose features for workload learning, divide and recognize workloads, generate training data, and integrate machine learning algorithms into the two-layer learning scheme. We conducted experiments to compare the accuracy, performance, and overhead of five machine learning algorithms: decision tree, logistic regression, naive Bayes, neural network, and support vector machine algorithms. Experiments with real-world and synthetic workloads show that self-learning disk scheduling can adapt to a wide variety of workloads, file systems, disk systems, and user preferences. It outperforms existing disk schedulers by as much as 15.8% while consuming less than 3%-5% of CPU time. 89 Discriminative Training of the Hidden Vector State Model for Semantic Parsing Deyu Zhou Yulan He Inf. Res. Centre, Univ. of Reading, Reading; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: Jan. 2009 Volume: 21, Issue: 1 On page(s): 66-77 Location: Los Angeles, CA, USA, ISSN: 1041-4347 INSPEC Accession Number: 10324291 Digital Object Identifier: 10.1109/TKDE.2008.95 First Published: 2008-05-16 Current Version Published: 2008-11-25 Abstract In this paper, we discuss how discriminative training can be applied to the hidden vector state (HVS) model in different task domains. The HVS model is a discrete hidden Markov model (HMM) in which each HMM state represents the state of a push-down automaton with a finite stack size. In previous applications, maximum-likelihood estimation (MLE) is used to derive the parameters of the HVS model. However, MLE makes a number of assumptions and unfortunately some of these assumptions do not hold. Discriminative training, without making such assumptions, can improve the performance of the HVS model by discriminating the correct hypothesis from the competing hypotheses. Experiments have been conducted in two domains: the travel
domain for the semantic parsing task, using the DARPA Communicator data and the Air Travel Information Services (ATIS) data, and the bioinformatics domain for the information extraction task, using the GENIA corpus. The results demonstrate modest improvements in the performance of the HVS model using discriminative training. In the travel domain, discriminative training of the HVS model gives a relative error reduction rate of 31 percent in F-measure when compared with MLE on the DARPA Communicator data and 9 percent on the ATIS data. In the bioinformatics domain, a relative error reduction rate of 4 percent in F-measure is achieved on the GENIA corpus. 90 SPOT Databases: Efficient Consistency Checking and Optimistic Selection in Probabilistic Spatial Databases Parker, A. Infantes, G. Grant, J. Subrahmanian, V.S. Dept. of Comput. Sci., Univ. of Maryland, College Park, MD; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: Jan. 2009 Volume: 21, Issue: 1 On page(s): 92-107 Location: Los Angeles, CA, USA, ISSN: 1041-4347 INSPEC Accession Number: 10324293 Digital Object Identifier: 10.1109/TKDE.2008.93 First Published: 2008-05-12 Current Version Published: 2008-11-25 Abstract Spatial probabilistic temporal (SPOT) databases are a paradigm for reasoning with probabilistic statements about where a vehicle may be now or in the future. They express statements of the form "Object O is in spatial region R at some time t with some probability in the interval [L,U]." Past work on SPOT databases has developed selection operators based on selecting SPOT atoms that are entailed by the SPOT database; we call this "cautious" selection. In this paper, we study several problems. First, we note that the runtime of consistency checking and cautious selection algorithms in past work is influenced greatly by the granularity of the underlying Cartesian space. We first introduce the notion of "optimistic" selection, where we are interested in returning all SPOT atoms in a database that are consistent with respect to a query, rather than having an entailment relationship. We then develop an approach to scaling SPOT databases that has three main contributions: 1) We develop methods to eliminate variables from the linear programs used in past work, thus greatly reducing the size of the linear programs used; the resulting advances apply to consistency checking, optimistic selection, and cautious selection. 2) We develop a host of theorems to show how we can prune the search space when we are interested in optimistic selection. 3) We use the above contributions to build an efficient index to execute optimistic selection queries over SPOT databases. Our approach is superior to past work in two major respects: First, it makes fewer assumptions than all past works on this topic except one. Second, our
experiments, which are based on real-world data about ship movements, show that our algorithms are much more efficient than those of past work. 91 Efficient Evaluation of Probabilistic Advanced Spatial Queries on Existentially Uncertain Data Man Lung Yiu Mamoulis, N. Xiangyuan Dai Yufei Tao Vaitis, M. Dept. of Comput. Sci., Aalborg Univ., Aalborg; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: Jan. 2009 Volume: 21, Issue: 1 On page(s): 108-122 Location: Los Angeles, CA, USA, ISSN: 1041-4347 INSPEC Accession Number: 10324294 Digital Object Identifier: 10.1109/TKDE.2008.135 First Published: 2008-07-15 Current Version Published: 2008-11-25 Abstract We study the problem of answering spatial queries in databases where objects exist with some uncertainty and are associated with an existential probability. The goal of a thresholding probabilistic spatial query is to retrieve the objects that qualify the spatial predicates with probability that exceeds a threshold. Accordingly, a ranking probabilistic spatial query selects the objects with the highest probabilities to qualify the spatial predicates. We propose adaptations of spatial access methods and search algorithms for probabilistic versions of range queries, nearest neighbors, spatial skylines, and reverse nearest neighbors, and conduct an extensive experimental study that evaluates the effectiveness of the proposed solutions. 92 A Relation-Based Page Rank Algorithm for Semantic Web Search Engines Lamberti, F. Sanna, A. Demartini, C. Dipt. di Autom. ed Inf., Politec. di Torino, Torino; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: Jan. 2009 Volume: 21, Issue: 1 On page(s): 123-136 Location: Los Angeles, CA, USA, ISSN: 1041-4347 INSPEC Accession Number: 10324295 Digital Object Identifier: 10.1109/TKDE.2008.113
First Published: 2008-06-06 Current Version Published: 2008-11-25 Abstract With the tremendous growth of information available to end users through the Web, search engines come to play an ever more critical role. Nevertheless, because of their general-purpose approach, it is increasingly common that the result sets they return include a burden of useless pages. The next-generation Web architecture, represented by the Semantic Web, provides a layered architecture that may allow this limitation to be overcome. Several search engines have been proposed that increase information retrieval accuracy by exploiting a key feature of Semantic Web resources, namely relations. However, in order to rank results, most of the existing solutions need to work on the whole annotated knowledge base. In this paper, we propose a relation-based page rank algorithm, to be used in conjunction with Semantic Web search engines, that relies simply on information that can be extracted from user queries and from annotated resources. Relevance is measured as the probability that a retrieved resource actually contains those relations whose existence was assumed by the user at the time of query definition. 93 CDNs Content Outsourcing via Generalized Communities Katsaros, D. Pallis, G. Stamos, K. Vakali, A. Sidiropoulos, A. Manolopoulos, Y. Dept. of Comput. & Commun. Eng., Thessaly Univ., Volos; This paper appears in: Knowledge and Data Engineering, IEEE Transactions on Publication Date: Jan. 2009 Volume: 21, Issue: 1 On page(s): 137-151 Location: Los Angeles, CA, USA, ISSN: 1041-4347 INSPEC Accession Number: 10324296 Digital Object Identifier: 10.1109/TKDE.2008.92 First Published: 2008-05-12 Current Version Published: 2008-11-25 Abstract Content distribution networks (CDNs) balance costs and quality in services related to content delivery. Devising an efficient content outsourcing policy is crucial since, based on such policies, CDN providers can deliver client-tailored content, improve performance, and achieve significant economic gains. Earlier content outsourcing approaches may often prove ineffective since they drive prefetching decisions by assuming knowledge of content popularity statistics, which are not always available and are extremely volatile. This work addresses this issue by proposing a novel self-adaptive technique under a CDN framework in which outsourced content is identified with no a priori knowledge of (earlier) request statistics. This is achieved by a structure-based approach that identifies coherent clusters of "correlated" Web server content objects, the so-called Web page communities. These communities are the core outsourcing unit,
and a detailed simulation experimentation shows that the proposed technique is robust and effective in reducing user-perceived latency compared with competing approaches, namely two community-based approaches, Web caching, and non-CDN delivery.
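As a final illustration, the thresholding probabilistic range query of entry 91 above reduces, in its most naive form, to filtering existentially uncertain points by window containment and existential probability; the sketch below shows that baseline, whereas the paper adapts spatial access methods to avoid the full scan. Names and data are illustrative.

```python
def thresholding_prob_range_query(objects, window, threshold):
    """Probabilistic range query over existentially uncertain points:
    return objects that lie in the rectangular window and whose existential
    probability exceeds the threshold. Naive scan used only for illustration.

    objects: list of (x, y, existence_probability) tuples
    window:  ((x_min, x_max), (y_min, y_max))"""
    (x0, x1), (y0, y1) = window
    return [(x, y, p) for (x, y, p) in objects
            if x0 <= x <= x1 and y0 <= y <= y1 and p > threshold]

pts = [(1, 1, 0.9), (2, 3, 0.4), (5, 5, 0.8)]
print(thresholding_prob_range_query(pts, ((0, 4), (0, 4)), 0.5))  # [(1, 1, 0.9)]
```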