JAVA IEEE 2012 PROJECT ABSTRACTS
DOMAIN - KNOWLEDGE AND DATA ENGINEERING
PROTECTING LOCATION PRIVACY AGAINST LOCATION-DEPENDENT ATTACKS IN MOBILE SERVICES
Privacy protection has recently received considerable attention in location-based services. A large number of location cloaking algorithms have been proposed for protecting the location privacy of mobile users. In this paper, we consider the scenario where different location-based query requests are continuously issued by mobile users while they are moving.
We show that most of the existing k-anonymity location cloaking algorithms are concerned with snapshotuser locations only and cannot effectively prevent location-dependent attacks when users’ locations are continuously updated. Therefore, adopting both the location k-anonymity and cloaking granularity as privacy metrics, we propose a new incremental clique-based cloaking algorithm, called ICliqueCloak, to defend against location-dependent attacks.
The main idea is to incrementally maintain maximal cliques needed for location cloaking in an undirected graph that takes into consideration the effect of continuous location updates. Thus, a qualified clique can be quickly identified and used to generate the cloaked region when a new request arrives.
The efficiency and effectiveness of the proposed ICliqueCloak algorithm are validated by a series of carefully designed experiments. The experimental results also show that the price paid for defending against location-dependent attacks is small
*------------*------------*------------*------------*------------*------------*
DOMAIN - KNOWLEDGE AND DATA ENGINEERING
OPTIMIZING THE CALCULATION OF CONDITIONAL PROBABILITY TABLES IN HYBRID BAYESIAN NETWORKS USING BINARY FACTORIZATION
Reducing the computational complexity of inference in Bayesian Networks (BNs) is a key challenge. Current algorithms for inference convert a BN to a junction tree structure made up of clusters of the BN nodes and the resulting complexity is time exponential in the size of a cluster.
The need to reduce the complexity is especially acute where the BN contains continuous nodes. We propose a new method for optimizing the calculation of Conditional Probability Tables (CPTs) involving continuous nodes, approximated in Hybrid Bayesian Networks (HBNs), using an approximation algorithm called dynamic discretization.
We present an optimized solution to this problem involving binary factorization of the arithmetical expressions declared to generate the CPTs for continuous nodes for deterministic functions and statistical distributions.
The proposed algorithm is implemented and tested in a commercial Hybrid Bayesian Network software package and the results of the empirical evaluation show significant performance improvement over unfactorized models
*------------*------------*------------*------------*------------*------------*
DOMAIN - KNOWLEDGE AND DATA ENGINEERING
MAXIMUM AMBIGUITY-BASED SAMPLE SELECTION IN FUZZY DECISION TREE INDUCTION
Sample selection is to select a number of representative samples from a large database such that a learning algorithm can have a reduced computational cost and an improved learning accuracy. This paper gives a new sample selection mechanism, i.e., the maximum ambiguity-based sample selection in fuzzy decision tree induction.
Compared with the existing sample selection methods, this mechanism selects the samples based on the principle of maximal classification ambiguity. The major advantage of this mechanism is that the adjustment of the fuzzy decision tree is minimized when adding selected samples to the training set.
This advantage is confirmed via the theoretical analysis of the leaf-nodes’ frequency in the decision trees. The decision tree generated from the selected samples usually has a better performance than that from the original database.
Furthermore, experimental results show that generalization ability of the tree based on our selection mechanism is far more superior to that based on random selection mechanism
*------------*------------*------------*------------*------------*------------*
DOMAIN - KNOWLEDGE AND DATA ENGINEERING
MODEL-BASED METHOD FOR PROJECTIVE CLUSTERING
Clustering high-dimensional data is a major challenge due to the curse of dimensionality. To solve this problem, projective clustering has been defined as an extension to traditional clustering that attempts to find projected clusters in subsets of the dimensions of a data space.
In this paper, a probability model is first proposed to describe projected clusters in high-dimensional data space. Then, we present a model-based algorithm for fuzzy projective clustering that discovers clusters with overlapping boundaries in various projected subspaces.
The suitability of the proposal is demonstrated in an empirical study done with synthetic data set and some widely used real-world data set
*------------*------------*------------*------------*------------*------------*
DOMAIN - KNOWLEDGE AND DATA ENGINEERING
EXTENDING RECOMMENDER SYSTEMS FOR DISJOINT USER/ITEM SETS: THE CONFERENCE RECOMMENDATION PROBLEM
In this paper, we describe the problem of recommending conference sessions to attendees and show how novel extensions to traditional model-based recommender systems, as suggested in Adomavicius and Tuzhilin [1], can address this problem.
We introduce Recommendation Engine by CONjoint Decomposition of ITems and USers ( RECONDITUS)—a technique that is an extension of preference-based recommender systems to recommend items from a new disjoint set to users from a new disjoint set. The assumption being that preferences exhibited by users with known usage behavior (e.g., past conference session attendance), which can be abstracted by projections of user and item matrices, will be similar to ones of new (different) users where the basic environment and item domain are the same (e.g., new conference). RECONDITUS requires no item ratings, but operates on observed user behavior such as past conference session attendance.
The RECONDITUS paradigm consists of projections of both user and item data, and the learning of relationships in projected space. Once established, the relationships enable predicting new relationships and provide associated recommendations.
The approach can encompass several traditional data mining problems where both clustering and prediction are necessary.
*------------*------------*------------*------------*------------*------------*
DOMAIN - KNOWLEDGE AND DATA ENGINEERING
ENERGY-EFFICIENT REVERSE SKYLINE QUERY PROCESSING OVER WIRELESS SENSOR NETWORKS
Reverse skyline query plays an important role in many sensing applications, such as environmental monitoring, habitat monitoring, and battlefield monitoring. Due to the limited power supplies of wireless sensor nodes, the existing centralized approaches, which do not consider energy efficiency, cannot be directly applied to the distributed sensor environment.
In this paper, we investigate how to process reverse skyline queries energy efficiently in wireless sensor networks. Initially, we theoretically analyzed the properties of reverse skyline query and proposed a skyband-based approach to tackle the problem of reverse skyline query answering over wireless sensor networks.
Then, an energy-efficient approach is proposed to minimize the communication cost among sensor nodes of evaluating range reverse skyline query. Moreover, optimization mechanisms to improve the performance of multiple reverse skylines are also discussed. Extensive experiments on both real-world data and synthetic data have demonstrated the efficiency and effectiveness of our proposed approaches with various experimental settings
*------------*------------*------------*------------*------------*------------*
DOMAIN - KNOWLEDGE AND DATA ENGINEERING
ENERGY EFFICIENT SCHEMES FOR ACCURACY-GUARANTEED SENSOR DATA AGGREGATION USING SCALABLE COUNTING
Sensor networks have received considerable attention in recent years, and are employed in many applications. In these applications, statistical aggregates such as Sum over the readings of a group of sensor nodes are often needed. One challenge for computing sensor data aggregates comes from the communication failures, which are common in sensor networks.
To enhance the robustness of the aggregate computation, multipath-based aggregation is often used. However, the multipath-based aggregation suffers from the problem of overcounting sensor readings. The approaches using the multipath-based aggregation therefore need to incorporate techniques that avoid overcounting sensor readings. In this paper, we present a novel technique named scalable counting for efficiently avoiding the overcounting problem.
We focus on having an (", ) accuracy guarantee for computing an aggregate, which ensures that the error in computing the aggregate is within a factor of " with probability (1 ). Our schemes using the scalable counting technique efficiently compute the aggregates under a given accuracy guarantee.
We provide theoretical analyses that show the advantages of the scalable counting technique over previously proposed techniques. Furthermore, extensive experiments are made to validate the theoretical results and manifest the advantages of using the scalable counting technique for sensor data aggregation
*------------*------------*------------*------------*------------*------------*
DOMAIN - KNOWLEDGE AND DATA ENGINEERING
EFFICIENT AND PROGRESSIVE ALGORITHMS FOR DISTRIBUTED SKYLINE QUERIES OVER UNCERTAIN DATA
The skyline operator has received considerable attention from the database community, due to its importance in many applications including multicriteria decision making, preference answering, and so forth. In many applications where uncertain data are inherently exist, i.e., data collected from different sources in distributed locations are usually with imprecise measurements, and thus exhibit kind of uncertainty.
Taking into account the network delay and economic cost associated with sharing and communicating large amounts of distributed data over an internet, an important problem in this scenario is to retrieve the global skyline tuples from all the distributed local sites with minimum communication cost.
Based on the well-known notation of the probabilistic skyline query over centralized uncertain data, in this paper, we propose the notation of distributed skyline queries over uncertain data. Furthermore, two communication- and computation-efficient algorithms are proposed to retrieve the qualified skylines from distributed local sites.
Extensive experiments have been conducted to verify the efficiency, the effectiveness and the progressiveness of our algorithms with both the synthetic and real data sets
*------------*------------*------------*------------*------------*------------*
DOMAIN - KNOWLEDGE AND DATA ENGINEERING
EDISC: A CLASS-TAILORED DISCRETIZATION TECHNIQUE FOR RULE-BASED CLASSIFICATION
Discretization is a critical component of data mining whereby continuous attributes of a data set are converted into discrete ones by creating intervals either before or during learning. There are many good reasons for preprocessing discretization, such as increased learning efficiency and classification accuracy, comprehensibility of data mining results, as well as the inherent limitation of a great majority of learning algorithms to handle only discrete data.
Many preprocessing discretization techniques have been proposed to date, of which the Entropy-MDLP discretization has been accepted as by far the most effective in the context of both decision tree learning and rule induction algorithms. This paper presents a new discretization technique EDISC which utilizes the entropy-based principle but takes a class-tailored approach to discretization.
The technique is applicable in general to any covering algorithm, including those that use the class-per-class rule induction methodology such as CN2 as well as those that use a seed example during the learning phase, such as the RULES family.
Experimental evaluation has proved the efficiency and effectiveness of the technique as a preprocessing discretization procedure for CN2 as well as RULES-7, the latest algorithm among the RULESfamily of inductive learning algorithms .
*------------*------------*------------*------------*------------*------------*
DOMAIN - KNOWLEDGE AND DATA ENGINEERING
DISCRIMINATIVE FEATURE SELECTION BY NONPARAMETRIC BAYES ERROR MINIMIZATION
Feature selection is fundamental to knowledge discovery from massive amount of high-dimensional data. In an effort to establish theoretical justification for feature selection algorithms, this paper presents a theoretically optimal criterion, namely, the discriminative optimal criterion (DoC) for feature selection.
Compared with the existing representative optimal criterion (RoC, [21]) which retains maximum information for modeling the relationship between input and output variables, DoC is pragmatically advantageous because it attempts to directly maximize the classification accuracy and naturally reflects the Bayes error in the objective.
To make DoC computationally tractable for practical tasks, we propose an algorithmic framework, which selects a subset of features by minimizing the Bayes error rate estimated by a nonparametric estimator.
A set of existing algorithms as well as new ones can be derived naturally from this framework. As an example, we show that the Relief algorithm [20] greedily attempts to minimize the Bayes error estimated by the k -Nearest-Neighbor ( k NN) method.
This new interpretation insightfully reveals the secret behind the family of margin-based feature selection algorithms [28], [14] and also offers a principled way to establish new alternatives for performance improvement.
In particular, by exploiting the proposed framework, we establish the Parzen-Relief (P-Relief) algorithm based on Parzen window estimator, and the MAP-Relief (M-Relief) which integrates label distribution into the max-margin objective to effectively handle imbalanced and multiclass data. Experiments on various benchmark data sets demonstrate the effectiveness of the proposed algorithms.
*------------*------------*------------*------------*------------*------------*
DOMAIN - KNOWLEDGE AND DATA ENGINEERING
DISCOVERY OF DELTA CLOSED PATTERNS AND NONINDUCED PATTERNS FROM SEQUENCES
Discovering patterns from sequence data has significant impact in many aspects of science and society, especially in genomics and proteomics. Here we consider multiple strings as input sequence data and substrings as patterns. In the real world, usually a large set of patterns could be discovered yet many of them are redundant, thus degrading the output quality.
This paper improves the output quality by removing two types of redundant patterns. First, the notion of delta tolerance closed itemset is employed to remove redundant patterns that are not delta closed. Second, the concept of statistically induced patterns is proposed to capture redundant patterns which seem to be statistically significant yet their significance is induced by their strong significant subpatterns.
It is computationally intense to mine these nonredundant patterns (delta closed patterns and noninduced patterns). To efficiently discover these patterns in very large sequence data, two efficient algorithms have been developed through innovative use of suffix tree.
Three sets of experiments were conducted to evaluate their performance. They render excellent results when applying to genomics. The experiments confirm that the proposed algorithms are efficient and that they produce a relatively small set of patterns which reveal interesting information in the sequences
*------------*------------*------------*------------*------------*------------*
DOMAIN - KNOWLEDGE AND DATA ENGINEERING
DATA MINING FOR XML QUERY-ANSWERING SUPPORT
Extracting information from semistructured documents is a very hard task, and is going to become more and more critical as the amount of digital information available on the Internet grows. Indeed, documents are often so large that the data set returned as answer to a query may be too big to convey interpretable knowledge.
In this paper, we describe an approach based on Tree-Based Association Rules (TARs): mined rules, which provide approximate, intensional information on both the structure and the contents of Extensible Markup Language (XML) documents, and can be stored in XML format as well.
This mined knowledge is later used to provide:
1) a concise idea—the gist—of both the structure and the content of the XML document and
2) quick, approximate answers to queries. In this paper, we focus on the second feature. A prototype system and experimental results demonstrate the effectiveness of the approach
*------------*------------*------------*------------*------------*------------*
DOMAIN - KNOWLEDGE AND DATA ENGINEERING
CLUSTERING WITH MULTIVIEWPOINT-BASED SIMILARITY MEASURE
All clustering methods have to assume some cluster relationship among the data objects that they are applied on. Similarity between a pair of objects can be defined either explicitly or implicitly. In this paper, we introduce a novel multiviewpoint-based similarity measure and two related clustering methods.
The major difference between a traditional dissimilarity/similarity measure and ours is that the former uses only a single viewpoint, which is the origin, while the latter utilizes many different viewpoints, which are objects assumed to not be in the same cluster with the two objects being measured. Using multiple viewpoints, more informative assessment of similarity could be achieved. Theoretical analysis and empirical study are conducted to support this claim.
Two criterion functions for document clustering are proposed based on this new measure. We compare them with several well-known clustering algorithms that use other popular similarity measures on various document collections to verify the advantages of our proposal.
*------------*------------*------------*------------*------------*------------*
DOMAIN - KNOWLEDGE AND DATA ENGINEERING
CONTINUOUS DETOUR QUERIES IN SPATIAL NETWORKS
We study the problem of finding the shortest route between two locations that includes a stopover of a given type. An example scenario of this problem is given as follows: “On the way to Bob’s place, Alice searches for a nearby take-away Italian restaurant to buy a pizza.” Assuming that Alice is interested in minimizing the total trip distance, this scenario can be modeled as a query where the current Alice’s location (start) and Bob’s place (destination) function as query points.
Based on these two query points, we find the minimum detour object (MDO), i.e., a stopover that minimizes the sum of the distances: 1) from the start to the stopover, and 2) from the stopover to the destination. In a realistic location-based application environment, a user can be indecisive about committing to a particular detour option.
The user may wish to browse multiple ( k) MDOs before making a decision. Furthermore, when a user moves, thekMDOresults at one location may become obsolete. We propose a method for continuous detour query (CDQ) processing based on incremental construction of a shortest path tree.
We conducted experimental studies to compare the performance of our proposed method against two methods derived from existing k-nearest neighbor querying techniques using real road-network data sets. Experimental results show that our proposed method significantly outperforms the two competitive techniques
*------------*------------*------------*------------*------------*------------*
DOMAIN - KNOWLEDGE AND DATA ENGINEERING
ANOMALY DETECTION FOR DISCRETE SEQUENCES: A SURVEY
This survey attempts to provide a comprehensive and structured overview of the existing research for the problem of detecting anomalies in discrete/symbolic sequences. The objective is to provide a global understanding of the sequence anomaly detection problem and how existing techniques relate to each other.
The key contribution of this survey is the classification of the existing research into three distinct categories, based on the problem formulation that they are trying to solve. These problem formulations are: 1) identifying anomalous sequences with respect to a database of normal sequences; 2) identifying an anomalous subsequence within a long sequence; and 3) identifying a pattern in a sequence whose frequency of occurrence is anomalous.
We show how each of these problem formulations is characteristically distinct from each other and discuss their relevance in various application domains. We review techniques from many disparate and disconnected application domains that address each of these formulations.
Within each problem formulation, we group techniques into categories based on the nature of the underlying algorithm. For each category, we provide a basic anomaly detection technique, and show how the existing techniques are variants of the basic technique.
This approach shows how different techniques within a category are related or different from each other. Our categorization reveals new variants and combinations that have not been investigated before for anomaly detection.
We also provide a discussion of relative strengths and weaknesses of different techniques. We show how techniques developed for one problem formulation can be adapted to solve a different formulation, thereby providing several novel adaptations to solve the different problem formulations. We also highlight the applicability of the techniques that handle discrete sequences to other related areas such as online anomaly detection and time series anomaly detection
*------------*------------*------------*------------*------------*------------*
DOMAIN - KNOWLEDGE AND DATA ENGINEERING
ADDING TEMPORAL CONSTRAINTS TO XML SCHEMA
If past versions of XML documents are retained, what of the various integrity constraints defined in XML Schema on those documents?
This paper describes how to interpret such constraints as sequenced constraints, applicable at each point in time. We also consider how to add new variants that apply across time, so-called nonsequenced constraints.
Our approach supports temporal documents that vary over both valid and transaction time, whose schema can vary over transaction time. We do this by replacing the schema with a (possibly time-varying) temporal schema and replacing the document with a temporal document, both of which are upward compatible with conventional XML and with conventional tools like XMLLINT , which we have extended to support the temporal constraints introduced here
*------------*------------*------------*------------*------------*------------*
DOMAIN - KNOWLEDGE AND DATA ENGINEERING
A UNIFIED PROBABILISTIC FRAMEWORK FOR NAME DISAMBIGUATION IN DIGITAL LIBRARY
Despite years of research, the name ambiguity problem remains largely unresolved. Outstanding issues include how to capture all information for name disambiguation in a unified approach, and how to determine the number of peopleK in the disambiguation process.
In this paper, we formalize the problem in a unified probabilistic framework, which incorporates both attributes and relationships. Specifically, we define a disambiguation objective function for the problem and propose a two-step parameter estimation algorithm.
We also investigate a dynamic approach for estimating the number of people K . Experiments show that our proposed framework significantly outperforms four baseline methods of using clustering algorithms and two other previous methods. Experiments also indicate that the number K automatically found by our method is close to the actual number.
*------------*------------*------------*------------*------------*------------*
DOMAIN - KNOWLEDGE AND DATA ENGINEERING
A PERFORMANCE ANOMALY DETECTION AND ANALYSIS FRAMEWORK FOR DBMS DEVELOPMENT
Detecting performance anomalies and finding their root causes are tedious tasks requiring much manual work. Functionality enhancements in DBMS development as in most software development often introduce performance problems in addition to bugs.
To detect the problems as soon as they are introduced, which often happens during the early phases of a development cycle, we adopt performance regression testing early in the process. In this paper, we describe a framework that we developed to manage performance anomalies after establishing a set of conditions for a problem to be considered an anomaly.
The framework uses Statistical Process Control (SPC) charts to detect performance anomalies and differential profiling to identify their root causes. By automating the tasks within the framework we were able to remove most of the manual overhead in detecting anomalies and reduce the analysis time for identifying the root causes by about 90 percent in most cases.
The tools developed and deployed based on the framework allow us continuous, automated daily monitoring of performance in addition to the usual functionality monitoring in our DBMS development
*------------*------------*------------*------------*------------*------------*
DOMAIN - KNOWLEDGE AND DATA ENGINEERING
A LOOK-AHEAD APPROACH TO SECURE MULTIPARTY PROTOCOLS
Secure multiparty protocols have been proposed to enable noncolluding parties to cooperate without a trusted server. Even though such protocols prevent information disclosure other than the objective function, they are quite costly in computation and communication.
The high overhead motivates parties to estimate the utility that can be achieved as a result of the protocol beforehand. In this paper, we propose a look-ahead approach, specifically for secure multiparty protocols to achieve distributed k -anonymity, which helps parties to decide if the utility benefit from the protocol is within an acceptable range before initiating the protocol.
The look-ahead operation is highly localized and its accuracy depends on the amount of information the parties are willing to share. Experimental results show the effectiveness of the proposed methods
*------------*------------*------------*------------*------------*------------*
DOMAIN - KNOWLEDGE AND DATA ENGINEERING
A KNOWLEDGE-DRIVEN APPROACH TO ACTIVITY RECOGNITION IN SMART HOMES
This paper introduces a knowledge-driven approach to real-time, continuous activity recognition based on multisensor data streams in smart homes. The approach goes beyond the traditional data-centric methods for activity recognition in three ways.
First, it makes extensive use of domain knowledge in the life cycle of activity recognition. Second, it uses ontologies for explicit context and activity modeling and representation. Third and finally, it exploits semantic reasoning and classification for activity inferencing, thus enabling both coarse-grained and fine-grained activity recognition.
In this paper, we analyze the characteristics of smart homes and Activities of Daily Living (ADL) upon which we built both context and ADL ontologies. We present a generic system architecture for the proposed knowledge-driven approach and describe the underlying ontology-based recognition process. Special emphasis is placed on semantic subsumption reasoning algorithms for activity recognition.
The proposed approach has been implemented in a function-rich software system, which was deployed in a smart home research laboratory. We evaluated the proposed approach and the developed system through extensive experiments involving a number of various ADL use scenarios. An average activity recognition rate of 94.44 percent was achieved and the average recognition runtime per recognition operation was measured as 2.5 seconds.
*------------*------------*------------*------------*------------*------------*
DOMAIN - KNOWLEDGE AND DATA ENGINEERING
A FRAMEWORK FOR PERSONAL MOBILE COMMERCE PATTERN MINING AND PREDICTION
Due to a wide range of potential applications, research on mobile commerce has received a lot of interests from both of the industry and academia. Among them, one of the active topic areas is the mining and prediction of users’ mobile commerce behaviors such as their movements and purchase transactions.
In this paper, we propose a novel framework, calledMobile Commerce Explorer (MCE), for mining and prediction of mobile users’ movements and purchase transactions under the context of mobile commerce.
The MCEframework consists of three major components: 1) Similarity Inference Model ðSIMÞ for measuring the similarities among stores and items, which are two basic mobile commerce entities considered in this paper; 2)Personal Mobile Commerce Pattern Mine ( PMCP-Mine) algorithm for efficient discovery of mobile users’ Personal Mobile Commerce Patterns ( PMCPs); and 3) Mobile Commerce Behavior PredictorðMCBPÞ for prediction of possible mobile user behaviors.
To our best knowledge, this is the first work that facilitates mining and prediction of mobile users’ commerce behaviors in order to recommend stores and items previously unknown to a user. We perform an extensive experimental evaluation by simulation and show that our proposals produce excellent results
*------------*------------*------------*------------*------------*------------*
DOMAIN - KNOWLEDGE AND DATA ENGINEERING
A DECISION-THEORETIC FRAMEWORK FOR NUMERICAL ATTRIBUTE VALUE RECONCILIATION
One of the major challenges of data integration is to resolve conflicting numerical attribute values caused by data heterogeneity.
In addressing this problem, existing approaches proposed in prior literature often ignore such data inconsistencies or resolve them in an ad hoc manner. In this study, we propose a decision-theoretical framework that resolves numerical value conflicts in a systematic manner.
The framework takes into consideration the consequences of incorrect numerical values and selects the value that minimizes the expected cost of errors for all data application problems under consideration. Experimental results show that significant savings can be achieved by adopting the proposed framework instead of ad hoc approaches
*------------*------------*------------*------------*------------*------------*
DOMAIN - KNOWLEDGE AND DATA ENGINEERING
VISUAL ROLE MINING: A PICTURE IS WORTH A THOUSAND ROLES
This paper offers a new role engineering approach to Role-Based Access Control(RBAC), referred to as visual role mining . The key idea is to graphically represent user-permission assignments to enable quick analysis and elicitation of meaningful roles.
First, we formally define the problem by introducing a metric for the quality of the visualization. Then, we prove that finding the best representation according to the defined metric is a NP-hard problem. In turn, we propose two algorithms: ADVISER and EXTRACT.
The former is a heuristic used to best represent the user-permission assignments of a given set of roles. The latter is a fast probabilistic algorithm that, when used in conjunction with ADVISER, allows for a visual elicitation of roles even in absence of predefined roles.
Besides being rooted in sound theory, our proposal is supported by extensive simulations run over real data. Results confirm the quality of the proposal and demonstrate its viability in supporting role engineering decisions
*------------*------------*------------*------------*------------*------------*
DOMAIN - KNOWLEDGE AND DATA ENGINEERING
SEGMENTATION AND SAMPLING OF MOVING OBJECT TRAJECTORIES BASED ON REPRESENTATIVENESS
Moving Object Databases (MOD), although ubiquitous, still call for methods that will be able to understand, search, analyze, and browse their spatiotemporal content. In this paper, we propose a method for trajectory segmentation and sampling based on the representativeness of the (sub)trajectories in the MOD.
In order to find the most representative subtrajectories, the following methodology is proposed. First, a novel global voting algorithm is performed, based on local density and trajectory similarity information. This method is applied for each segment of the trajectory, forming a local trajectory descriptor that represents line segment representativeness.
The sequence of this descriptor over a trajectory gives the voting signal of the trajectory, where high values correspond to the most representative parts. Then, a novel segmentation algorithm is applied on this signal that automatically estimates the number of partitions and the partition borders, identifying homogenous partitions concerning their representativeness.
Finally, a sampling method over the resulting segments yields the most representative subtrajectories in the MOD. Our experimental results in synthetic and real MOD verify the effectiveness of the proposed scheme, also in comparison with other sampling techniques.
*------------*------------*------------*------------*------------*------------*
DOMAIN - KNOWLEDGE AND DATA ENGINEERING
SATURN: RANGE QUERIES, LOAD BALANCING AND FAULT TOLERANCE IN DHT DATA SYSTEMS
In this paper, we present Saturn, an overlay architecture for large-scale data networks maintained over Distributed Hash Tables (DHTs) that efficiently processes range queries and ensures access load balancing and fault-tolerance.
Placing consecutive data values in neighboring peers is desirable in DHTs since it accelerates range query processing; however, such a placement is highly susceptible to load imbalances.
At the same time, DHTs may be susceptible to node departures/failures and high data availability and fault tolerance are significant issues. Saturn deals effectively with these problems through the introduction of a novel multiple ring, order-preserving architecture. The use of a novel order-preserving hash function ensures fast range query processing.
Replication across and within data rings (termed vertical and horizontal replication) forms the foundation over which our mechanisms are developed, ensuring query load balancing and fault tolerance, respectively.
Our detailed experimentation study shows strong gains in range query processing efficiency, access load balancing, and fault tolerance, with low replication overheads.
The significance of Saturn is not only that it effectively tackles all three issues together—i.e., supporting range queries, ensuring load balancing, and providing fault tolerance over DHTs—but also that it can be applied on top of any order-preserving DHT enabling it to dynamically handle replication and, thus, to trade off replication costs for fair load distribution and fault tolerance.
PROTECTING LOCATION PRIVACY AGAINST LOCATION-DEPENDENT ATTACKS IN MOBILE SERVICES
Privacy protection has recently received considerable attention in location-based services. A large number of location cloaking algorithms have been proposed for protecting the location privacy of mobile users. In this paper, we consider the scenario where different location-based query requests are continuously issued by mobile users while they are moving.
We show that most of the existing k-anonymity location cloaking algorithms are concerned with snapshotuser locations only and cannot effectively prevent location-dependent attacks when users’ locations are continuously updated. Therefore, adopting both the location k-anonymity and cloaking granularity as privacy metrics, we propose a new incremental clique-based cloaking algorithm, called ICliqueCloak, to defend against location-dependent attacks.
The main idea is to incrementally maintain maximal cliques needed for location cloaking in an undirected graph that takes into consideration the effect of continuous location updates. Thus, a qualified clique can be quickly identified and used to generate the cloaked region when a new request arrives.
The efficiency and effectiveness of the proposed ICliqueCloak algorithm are validated by a series of carefully designed experiments. The experimental results also show that the price paid for defending against location-dependent attacks is small
*------------*------------*------------*------------*------------*------------*
DOMAIN - KNOWLEDGE AND DATA ENGINEERING
OPTIMIZING THE CALCULATION OF CONDITIONAL PROBABILITY TABLES IN HYBRID BAYESIAN NETWORKS USING BINARY FACTORIZATION
Reducing the computational complexity of inference in Bayesian Networks (BNs) is a key challenge. Current algorithms for inference convert a BN to a junction tree structure made up of clusters of the BN nodes and the resulting complexity is time exponential in the size of a cluster.
The need to reduce the complexity is especially acute where the BN contains continuous nodes. We propose a new method for optimizing the calculation of Conditional Probability Tables (CPTs) involving continuous nodes, approximated in Hybrid Bayesian Networks (HBNs), using an approximation algorithm called dynamic discretization.
We present an optimized solution to this problem involving binary factorization of the arithmetical expressions declared to generate the CPTs for continuous nodes for deterministic functions and statistical distributions.
The proposed algorithm is implemented and tested in a commercial Hybrid Bayesian Network software package and the results of the empirical evaluation show significant performance improvement over unfactorized models
*------------*------------*------------*------------*------------*------------*
DOMAIN - KNOWLEDGE AND DATA ENGINEERING
MAXIMUM AMBIGUITY-BASED SAMPLE SELECTION IN FUZZY DECISION TREE INDUCTION
Sample selection is to select a number of representative samples from a large database such that a learning algorithm can have a reduced computational cost and an improved learning accuracy. This paper gives a new sample selection mechanism, i.e., the maximum ambiguity-based sample selection in fuzzy decision tree induction.
Compared with the existing sample selection methods, this mechanism selects the samples based on the principle of maximal classification ambiguity. The major advantage of this mechanism is that the adjustment of the fuzzy decision tree is minimized when adding selected samples to the training set.
This advantage is confirmed via the theoretical analysis of the leaf-nodes’ frequency in the decision trees. The decision tree generated from the selected samples usually has a better performance than that from the original database.
Furthermore, experimental results show that generalization ability of the tree based on our selection mechanism is far more superior to that based on random selection mechanism
*------------*------------*------------*------------*------------*------------*
DOMAIN - KNOWLEDGE AND DATA ENGINEERING
MODEL-BASED METHOD FOR PROJECTIVE CLUSTERING
Clustering high-dimensional data is a major challenge due to the curse of dimensionality. To solve this problem, projective clustering has been defined as an extension to traditional clustering that attempts to find projected clusters in subsets of the dimensions of a data space.
In this paper, a probability model is first proposed to describe projected clusters in high-dimensional data space. Then, we present a model-based algorithm for fuzzy projective clustering that discovers clusters with overlapping boundaries in various projected subspaces.
The suitability of the proposal is demonstrated in an empirical study done with synthetic data set and some widely used real-world data set
*------------*------------*------------*------------*------------*------------*
DOMAIN - KNOWLEDGE AND DATA ENGINEERING
EXTENDING RECOMMENDER SYSTEMS FOR DISJOINT USER/ITEM SETS: THE CONFERENCE RECOMMENDATION PROBLEM
In this paper, we describe the problem of recommending conference sessions to attendees and show how novel extensions to traditional model-based recommender systems, as suggested in Adomavicius and Tuzhilin [1], can address this problem.
We introduce Recommendation Engine by CONjoint Decomposition of ITems and USers ( RECONDITUS)—a technique that is an extension of preference-based recommender systems to recommend items from a new disjoint set to users from a new disjoint set. The assumption being that preferences exhibited by users with known usage behavior (e.g., past conference session attendance), which can be abstracted by projections of user and item matrices, will be similar to ones of new (different) users where the basic environment and item domain are the same (e.g., new conference). RECONDITUS requires no item ratings, but operates on observed user behavior such as past conference session attendance.
The RECONDITUS paradigm consists of projections of both user and item data, and the learning of relationships in projected space. Once established, the relationships enable predicting new relationships and provide associated recommendations.
The approach can encompass several traditional data mining problems where both clustering and prediction are necessary.
*------------*------------*------------*------------*------------*------------*
DOMAIN - KNOWLEDGE AND DATA ENGINEERING
ENERGY-EFFICIENT REVERSE SKYLINE QUERY PROCESSING OVER WIRELESS SENSOR NETWORKS
Reverse skyline query plays an important role in many sensing applications, such as environmental monitoring, habitat monitoring, and battlefield monitoring. Due to the limited power supplies of wireless sensor nodes, the existing centralized approaches, which do not consider energy efficiency, cannot be directly applied to the distributed sensor environment.
In this paper, we investigate how to process reverse skyline queries energy efficiently in wireless sensor networks. Initially, we theoretically analyzed the properties of reverse skyline query and proposed a skyband-based approach to tackle the problem of reverse skyline query answering over wireless sensor networks.
Then, an energy-efficient approach is proposed to minimize the communication cost among sensor nodes of evaluating range reverse skyline query. Moreover, optimization mechanisms to improve the performance of multiple reverse skylines are also discussed. Extensive experiments on both real-world data and synthetic data have demonstrated the efficiency and effectiveness of our proposed approaches with various experimental settings
*------------*------------*------------*------------*------------*------------*
DOMAIN - KNOWLEDGE AND DATA ENGINEERING
ENERGY EFFICIENT SCHEMES FOR ACCURACY-GUARANTEED SENSOR DATA AGGREGATION USING SCALABLE COUNTING
Sensor networks have received considerable attention in recent years, and are employed in many applications. In these applications, statistical aggregates such as Sum over the readings of a group of sensor nodes are often needed. One challenge for computing sensor data aggregates comes from the communication failures, which are common in sensor networks.
To enhance the robustness of the aggregate computation, multipath-based aggregation is often used. However, the multipath-based aggregation suffers from the problem of overcounting sensor readings. The approaches using the multipath-based aggregation therefore need to incorporate techniques that avoid overcounting sensor readings. In this paper, we present a novel technique named scalable counting for efficiently avoiding the overcounting problem.
We focus on having an (", ) accuracy guarantee for computing an aggregate, which ensures that the error in computing the aggregate is within a factor of " with probability (1 ). Our schemes using the scalable counting technique efficiently compute the aggregates under a given accuracy guarantee.
We provide theoretical analyses that show the advantages of the scalable counting technique over previously proposed techniques. Furthermore, extensive experiments are made to validate the theoretical results and manifest the advantages of using the scalable counting technique for sensor data aggregation
*------------*------------*------------*------------*------------*------------*
DOMAIN - KNOWLEDGE AND DATA ENGINEERING
EFFICIENT AND PROGRESSIVE ALGORITHMS FOR DISTRIBUTED SKYLINE QUERIES OVER UNCERTAIN DATA
The skyline operator has received considerable attention from the database community, due to its importance in many applications including multicriteria decision making, preference answering, and so forth. In many applications where uncertain data are inherently exist, i.e., data collected from different sources in distributed locations are usually with imprecise measurements, and thus exhibit kind of uncertainty.
Taking into account the network delay and economic cost associated with sharing and communicating large amounts of distributed data over an internet, an important problem in this scenario is to retrieve the global skyline tuples from all the distributed local sites with minimum communication cost.
Based on the well-known notation of the probabilistic skyline query over centralized uncertain data, in this paper, we propose the notation of distributed skyline queries over uncertain data. Furthermore, two communication- and computation-efficient algorithms are proposed to retrieve the qualified skylines from distributed local sites.
Extensive experiments have been conducted to verify the efficiency, the effectiveness and the progressiveness of our algorithms with both the synthetic and real data sets
*------------*------------*------------*------------*------------*------------*
DOMAIN - KNOWLEDGE AND DATA ENGINEERING
EDISC: A CLASS-TAILORED DISCRETIZATION TECHNIQUE FOR RULE-BASED CLASSIFICATION
Discretization is a critical component of data mining whereby continuous attributes of a data set are converted into discrete ones by creating intervals either before or during learning. There are many good reasons for preprocessing discretization, such as increased learning efficiency and classification accuracy, comprehensibility of data mining results, as well as the inherent limitation of a great majority of learning algorithms to handle only discrete data.
Many preprocessing discretization techniques have been proposed to date, of which the Entropy-MDLP discretization has been accepted as by far the most effective in the context of both decision tree learning and rule induction algorithms. This paper presents a new discretization technique EDISC which utilizes the entropy-based principle but takes a class-tailored approach to discretization.
The technique is applicable in general to any covering algorithm, including those that use the class-per-class rule induction methodology such as CN2 as well as those that use a seed example during the learning phase, such as the RULES family.
Experimental evaluation has proved the efficiency and effectiveness of the technique as a preprocessing discretization procedure for CN2 as well as RULES-7, the latest algorithm among the RULESfamily of inductive learning algorithms .
*------------*------------*------------*------------*------------*------------*
DOMAIN - KNOWLEDGE AND DATA ENGINEERING
DISCRIMINATIVE FEATURE SELECTION BY NONPARAMETRIC BAYES ERROR MINIMIZATION
Feature selection is fundamental to knowledge discovery from massive amount of high-dimensional data. In an effort to establish theoretical justification for feature selection algorithms, this paper presents a theoretically optimal criterion, namely, the discriminative optimal criterion (DoC) for feature selection.
Compared with the existing representative optimal criterion (RoC, [21]) which retains maximum information for modeling the relationship between input and output variables, DoC is pragmatically advantageous because it attempts to directly maximize the classification accuracy and naturally reflects the Bayes error in the objective.
To make DoC computationally tractable for practical tasks, we propose an algorithmic framework, which selects a subset of features by minimizing the Bayes error rate estimated by a nonparametric estimator.
A set of existing algorithms as well as new ones can be derived naturally from this framework. As an example, we show that the Relief algorithm [20] greedily attempts to minimize the Bayes error estimated by the k -Nearest-Neighbor ( k NN) method.
This new interpretation insightfully reveals the secret behind the family of margin-based feature selection algorithms [28], [14] and also offers a principled way to establish new alternatives for performance improvement.
In particular, by exploiting the proposed framework, we establish the Parzen-Relief (P-Relief) algorithm based on Parzen window estimator, and the MAP-Relief (M-Relief) which integrates label distribution into the max-margin objective to effectively handle imbalanced and multiclass data. Experiments on various benchmark data sets demonstrate the effectiveness of the proposed algorithms.
*------------*------------*------------*------------*------------*------------*
DOMAIN - KNOWLEDGE AND DATA ENGINEERING
DISCOVERY OF DELTA CLOSED PATTERNS AND NONINDUCED PATTERNS FROM SEQUENCES
Discovering patterns from sequence data has significant impact in many aspects of science and society, especially in genomics and proteomics. Here we consider multiple strings as input sequence data and substrings as patterns. In the real world, usually a large set of patterns could be discovered yet many of them are redundant, thus degrading the output quality.
This paper improves the output quality by removing two types of redundant patterns. First, the notion of delta tolerance closed itemset is employed to remove redundant patterns that are not delta closed. Second, the concept of statistically induced patterns is proposed to capture redundant patterns which seem to be statistically significant yet their significance is induced by their strong significant subpatterns.
It is computationally intense to mine these nonredundant patterns (delta closed patterns and noninduced patterns). To efficiently discover these patterns in very large sequence data, two efficient algorithms have been developed through innovative use of suffix tree.
Three sets of experiments were conducted to evaluate their performance. They render excellent results when applying to genomics. The experiments confirm that the proposed algorithms are efficient and that they produce a relatively small set of patterns which reveal interesting information in the sequences
*------------*------------*------------*------------*------------*------------*
DOMAIN - KNOWLEDGE AND DATA ENGINEERING
DATA MINING FOR XML QUERY-ANSWERING SUPPORT
Extracting information from semistructured documents is a very hard task, and is going to become more and more critical as the amount of digital information available on the Internet grows. Indeed, documents are often so large that the data set returned as answer to a query may be too big to convey interpretable knowledge.
In this paper, we describe an approach based on Tree-Based Association Rules (TARs): mined rules, which provide approximate, intensional information on both the structure and the contents of Extensible Markup Language (XML) documents, and can be stored in XML format as well.
This mined knowledge is later used to provide:
1) a concise idea—the gist—of both the structure and the content of the XML document and
2) quick, approximate answers to queries. In this paper, we focus on the second feature. A prototype system and experimental results demonstrate the effectiveness of the approach
*------------*------------*------------*------------*------------*------------*
DOMAIN - KNOWLEDGE AND DATA ENGINEERING
CLUSTERING WITH MULTIVIEWPOINT-BASED SIMILARITY MEASURE
All clustering methods have to assume some cluster relationship among the data objects that they are applied on. Similarity between a pair of objects can be defined either explicitly or implicitly. In this paper, we introduce a novel multiviewpoint-based similarity measure and two related clustering methods.
The major difference between a traditional dissimilarity/similarity measure and ours is that the former uses only a single viewpoint, which is the origin, while the latter utilizes many different viewpoints, which are objects assumed to not be in the same cluster with the two objects being measured. Using multiple viewpoints, more informative assessment of similarity could be achieved. Theoretical analysis and empirical study are conducted to support this claim.
Two criterion functions for document clustering are proposed based on this new measure. We compare them with several well-known clustering algorithms that use other popular similarity measures on various document collections to verify the advantages of our proposal.
*------------*------------*------------*------------*------------*------------*
DOMAIN - KNOWLEDGE AND DATA ENGINEERING
CONTINUOUS DETOUR QUERIES IN SPATIAL NETWORKS
We study the problem of finding the shortest route between two locations that includes a stopover of a given type. An example scenario of this problem is given as follows: “On the way to Bob’s place, Alice searches for a nearby take-away Italian restaurant to buy a pizza.” Assuming that Alice is interested in minimizing the total trip distance, this scenario can be modeled as a query where the current Alice’s location (start) and Bob’s place (destination) function as query points.
Based on these two query points, we find the minimum detour object (MDO), i.e., a stopover that minimizes the sum of the distances: 1) from the start to the stopover, and 2) from the stopover to the destination. In a realistic location-based application environment, a user can be indecisive about committing to a particular detour option.
The user may wish to browse multiple ( k) MDOs before making a decision. Furthermore, when a user moves, thekMDOresults at one location may become obsolete. We propose a method for continuous detour query (CDQ) processing based on incremental construction of a shortest path tree.
We conducted experimental studies to compare the performance of our proposed method against two methods derived from existing k-nearest neighbor querying techniques using real road-network data sets. Experimental results show that our proposed method significantly outperforms the two competitive techniques
*------------*------------*------------*------------*------------*------------*
DOMAIN - KNOWLEDGE AND DATA ENGINEERING
ANOMALY DETECTION FOR DISCRETE SEQUENCES: A SURVEY
This survey attempts to provide a comprehensive and structured overview of the existing research for the problem of detecting anomalies in discrete/symbolic sequences. The objective is to provide a global understanding of the sequence anomaly detection problem and how existing techniques relate to each other.
The key contribution of this survey is the classification of the existing research into three distinct categories, based on the problem formulation that they are trying to solve. These problem formulations are: 1) identifying anomalous sequences with respect to a database of normal sequences; 2) identifying an anomalous subsequence within a long sequence; and 3) identifying a pattern in a sequence whose frequency of occurrence is anomalous.
We show how each of these problem formulations is characteristically distinct from each other and discuss their relevance in various application domains. We review techniques from many disparate and disconnected application domains that address each of these formulations.
Within each problem formulation, we group techniques into categories based on the nature of the underlying algorithm. For each category, we provide a basic anomaly detection technique, and show how the existing techniques are variants of the basic technique.
This approach shows how different techniques within a category are related or different from each other. Our categorization reveals new variants and combinations that have not been investigated before for anomaly detection.
We also provide a discussion of relative strengths and weaknesses of different techniques. We show how techniques developed for one problem formulation can be adapted to solve a different formulation, thereby providing several novel adaptations to solve the different problem formulations. We also highlight the applicability of the techniques that handle discrete sequences to other related areas such as online anomaly detection and time series anomaly detection
*------------*------------*------------*------------*------------*------------*
DOMAIN - KNOWLEDGE AND DATA ENGINEERING
ADDING TEMPORAL CONSTRAINTS TO XML SCHEMA
If past versions of XML documents are retained, what of the various integrity constraints defined in XML Schema on those documents?
This paper describes how to interpret such constraints as sequenced constraints, applicable at each point in time. We also consider how to add new variants that apply across time, so-called nonsequenced constraints.
Our approach supports temporal documents that vary over both valid and transaction time, whose schema can vary over transaction time. We do this by replacing the schema with a (possibly time-varying) temporal schema and replacing the document with a temporal document, both of which are upward compatible with conventional XML and with conventional tools like XMLLINT , which we have extended to support the temporal constraints introduced here
*------------*------------*------------*------------*------------*------------*
DOMAIN - KNOWLEDGE AND DATA ENGINEERING
A UNIFIED PROBABILISTIC FRAMEWORK FOR NAME DISAMBIGUATION IN DIGITAL LIBRARY
Despite years of research, the name ambiguity problem remains largely unresolved. Outstanding issues include how to capture all information for name disambiguation in a unified approach, and how to determine the number of peopleK in the disambiguation process.
In this paper, we formalize the problem in a unified probabilistic framework, which incorporates both attributes and relationships. Specifically, we define a disambiguation objective function for the problem and propose a two-step parameter estimation algorithm.
We also investigate a dynamic approach for estimating the number of people K . Experiments show that our proposed framework significantly outperforms four baseline methods of using clustering algorithms and two other previous methods. Experiments also indicate that the number K automatically found by our method is close to the actual number.
*------------*------------*------------*------------*------------*------------*
DOMAIN - KNOWLEDGE AND DATA ENGINEERING
A PERFORMANCE ANOMALY DETECTION AND ANALYSIS FRAMEWORK FOR DBMS DEVELOPMENT
Detecting performance anomalies and finding their root causes are tedious tasks requiring much manual work. Functionality enhancements in DBMS development as in most software development often introduce performance problems in addition to bugs.
To detect the problems as soon as they are introduced, which often happens during the early phases of a development cycle, we adopt performance regression testing early in the process. In this paper, we describe a framework that we developed to manage performance anomalies after establishing a set of conditions for a problem to be considered an anomaly.
The framework uses Statistical Process Control (SPC) charts to detect performance anomalies and differential profiling to identify their root causes. By automating the tasks within the framework we were able to remove most of the manual overhead in detecting anomalies and reduce the analysis time for identifying the root causes by about 90 percent in most cases.
The tools developed and deployed based on the framework allow us continuous, automated daily monitoring of performance in addition to the usual functionality monitoring in our DBMS development
*------------*------------*------------*------------*------------*------------*
DOMAIN - KNOWLEDGE AND DATA ENGINEERING
A LOOK-AHEAD APPROACH TO SECURE MULTIPARTY PROTOCOLS
Secure multiparty protocols have been proposed to enable noncolluding parties to cooperate without a trusted server. Even though such protocols prevent information disclosure other than the objective function, they are quite costly in computation and communication.
The high overhead motivates parties to estimate the utility that can be achieved as a result of the protocol beforehand. In this paper, we propose a look-ahead approach, specifically for secure multiparty protocols to achieve distributed k -anonymity, which helps parties to decide if the utility benefit from the protocol is within an acceptable range before initiating the protocol.
The look-ahead operation is highly localized and its accuracy depends on the amount of information the parties are willing to share. Experimental results show the effectiveness of the proposed methods
*------------*------------*------------*------------*------------*------------*
DOMAIN - KNOWLEDGE AND DATA ENGINEERING
A KNOWLEDGE-DRIVEN APPROACH TO ACTIVITY RECOGNITION IN SMART HOMES
This paper introduces a knowledge-driven approach to real-time, continuous activity recognition based on multisensor data streams in smart homes. The approach goes beyond the traditional data-centric methods for activity recognition in three ways.
First, it makes extensive use of domain knowledge in the life cycle of activity recognition. Second, it uses ontologies for explicit context and activity modeling and representation. Third and finally, it exploits semantic reasoning and classification for activity inferencing, thus enabling both coarse-grained and fine-grained activity recognition.
In this paper, we analyze the characteristics of smart homes and Activities of Daily Living (ADL) upon which we built both context and ADL ontologies. We present a generic system architecture for the proposed knowledge-driven approach and describe the underlying ontology-based recognition process. Special emphasis is placed on semantic subsumption reasoning algorithms for activity recognition.
The proposed approach has been implemented in a function-rich software system, which was deployed in a smart home research laboratory. We evaluated the proposed approach and the developed system through extensive experiments involving a number of various ADL use scenarios. An average activity recognition rate of 94.44 percent was achieved and the average recognition runtime per recognition operation was measured as 2.5 seconds.
*------------*------------*------------*------------*------------*------------*
DOMAIN - KNOWLEDGE AND DATA ENGINEERING
A FRAMEWORK FOR PERSONAL MOBILE COMMERCE PATTERN MINING AND PREDICTION
Due to a wide range of potential applications, research on mobile commerce has received a lot of interests from both of the industry and academia. Among them, one of the active topic areas is the mining and prediction of users’ mobile commerce behaviors such as their movements and purchase transactions.
In this paper, we propose a novel framework, calledMobile Commerce Explorer (MCE), for mining and prediction of mobile users’ movements and purchase transactions under the context of mobile commerce.
The MCEframework consists of three major components: 1) Similarity Inference Model ðSIMÞ for measuring the similarities among stores and items, which are two basic mobile commerce entities considered in this paper; 2)Personal Mobile Commerce Pattern Mine ( PMCP-Mine) algorithm for efficient discovery of mobile users’ Personal Mobile Commerce Patterns ( PMCPs); and 3) Mobile Commerce Behavior PredictorðMCBPÞ for prediction of possible mobile user behaviors.
To our best knowledge, this is the first work that facilitates mining and prediction of mobile users’ commerce behaviors in order to recommend stores and items previously unknown to a user. We perform an extensive experimental evaluation by simulation and show that our proposals produce excellent results
*------------*------------*------------*------------*------------*------------*
DOMAIN - KNOWLEDGE AND DATA ENGINEERING
A DECISION-THEORETIC FRAMEWORK FOR NUMERICAL ATTRIBUTE VALUE RECONCILIATION
One of the major challenges of data integration is to resolve conflicting numerical attribute values caused by data heterogeneity.
In addressing this problem, existing approaches proposed in prior literature often ignore such data inconsistencies or resolve them in an ad hoc manner. In this study, we propose a decision-theoretical framework that resolves numerical value conflicts in a systematic manner.
The framework takes into consideration the consequences of incorrect numerical values and selects the value that minimizes the expected cost of errors for all data application problems under consideration. Experimental results show that significant savings can be achieved by adopting the proposed framework instead of ad hoc approaches
*------------*------------*------------*------------*------------*------------*
DOMAIN - KNOWLEDGE AND DATA ENGINEERING
VISUAL ROLE MINING: A PICTURE IS WORTH A THOUSAND ROLES
This paper offers a new role engineering approach to Role-Based Access Control(RBAC), referred to as visual role mining . The key idea is to graphically represent user-permission assignments to enable quick analysis and elicitation of meaningful roles.
First, we formally define the problem by introducing a metric for the quality of the visualization. Then, we prove that finding the best representation according to the defined metric is a NP-hard problem. In turn, we propose two algorithms: ADVISER and EXTRACT.
The former is a heuristic used to best represent the user-permission assignments of a given set of roles. The latter is a fast probabilistic algorithm that, when used in conjunction with ADVISER, allows for a visual elicitation of roles even in absence of predefined roles.
Besides being rooted in sound theory, our proposal is supported by extensive simulations run over real data. Results confirm the quality of the proposal and demonstrate its viability in supporting role engineering decisions
*------------*------------*------------*------------*------------*------------*
DOMAIN - KNOWLEDGE AND DATA ENGINEERING
SEGMENTATION AND SAMPLING OF MOVING OBJECT TRAJECTORIES BASED ON REPRESENTATIVENESS
Moving Object Databases (MOD), although ubiquitous, still call for methods that will be able to understand, search, analyze, and browse their spatiotemporal content. In this paper, we propose a method for trajectory segmentation and sampling based on the representativeness of the (sub)trajectories in the MOD.
In order to find the most representative subtrajectories, the following methodology is proposed. First, a novel global voting algorithm is performed, based on local density and trajectory similarity information. This method is applied for each segment of the trajectory, forming a local trajectory descriptor that represents line segment representativeness.
The sequence of this descriptor over a trajectory gives the voting signal of the trajectory, where high values correspond to the most representative parts. Then, a novel segmentation algorithm is applied on this signal that automatically estimates the number of partitions and the partition borders, identifying homogenous partitions concerning their representativeness.
Finally, a sampling method over the resulting segments yields the most representative subtrajectories in the MOD. Our experimental results in synthetic and real MOD verify the effectiveness of the proposed scheme, also in comparison with other sampling techniques.
*------------*------------*------------*------------*------------*------------*
DOMAIN - KNOWLEDGE AND DATA ENGINEERING
SATURN: RANGE QUERIES, LOAD BALANCING AND FAULT TOLERANCE IN DHT DATA SYSTEMS
In this paper, we present Saturn, an overlay architecture for large-scale data networks maintained over Distributed Hash Tables (DHTs) that efficiently processes range queries and ensures access load balancing and fault-tolerance.
Placing consecutive data values in neighboring peers is desirable in DHTs since it accelerates range query processing; however, such a placement is highly susceptible to load imbalances.
At the same time, DHTs may be susceptible to node departures/failures and high data availability and fault tolerance are significant issues. Saturn deals effectively with these problems through the introduction of a novel multiple ring, order-preserving architecture. The use of a novel order-preserving hash function ensures fast range query processing.
Replication across and within data rings (termed vertical and horizontal replication) forms the foundation over which our mechanisms are developed, ensuring query load balancing and fault tolerance, respectively.
Our detailed experimentation study shows strong gains in range query processing efficiency, access load balancing, and fault tolerance, with low replication overheads.
The significance of Saturn is not only that it effectively tackles all three issues together—i.e., supporting range queries, ensuring load balancing, and providing fault tolerance over DHTs—but also that it can be applied on top of any order-preserving DHT enabling it to dynamically handle replication and, thus, to trade off replication costs for fair load distribution and fault tolerance.