Data Mining with Semantic Features Represented as Vectors of Semantic Clusters
July 2012
Merwyn Taylor, The MITRE Corporation
ABSTRACT
Data mining with taxonomies merged with categorical data has been
studied in the past, but the work has often been limited to small taxonomies. Taxonomies are used
to aggregate categorical data such that patterns induced from the data can be
expressed at higher levels of conceptual generality. Semantic similarity and relatedness
measures can be used to aggregate categorical values for cluster-based
data mining algorithms. Many aggregation techniques rely solely on hierarchical
relationships to aggregate categorical values. While computationally attractive,
these approaches have conceptual limitations that can lead to spurious
data mining results. Alternatively, categorical data can be aggregated using hierarchical
relationships and other semantic relationships that are expressed in
ontologies and conceptual graphs, which requires graph-based
similarity/relatedness measures. Scaling these techniques to large ontologies can be
computationally expensive since there is a wider search space for expressing
patterns. An alternative representation of semantic data is presented that has attractive
computational properties when applied to data mining. Semantic data
is represented as vectors of cluster memberships. The representation supports
the use of cosine similarity measures to improve the run-time performance of
data mining with ontologies. The method is illustrated via examples of KMeans
clustering and Association Rule mining.
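
As a rough illustration of the representation described above, the following sketch encodes each categorical value as a vector of semantic cluster memberships and compares values with cosine similarity. The cluster names, the example values, the binary (0/1) membership weights, and the helper names to_vector and cosine are all assumptions made for illustration; the method itself may assign memberships and weights differently.

```python
import math

# Hypothetical cluster memberships: each categorical value maps to the
# semantic clusters it belongs to (names are illustrative only).
MEMBERSHIPS = {
    "sedan": {"vehicle", "road_transport"},
    "truck": {"vehicle", "road_transport", "freight"},
    "barge": {"vehicle", "water_transport", "freight"},
    "apple": {"food", "fruit"},
}

# Fixed ordering of clusters so every vector has the same dimensions.
CLUSTERS = sorted({c for members in MEMBERSHIPS.values() for c in members})


def to_vector(value):
    """Return a binary membership vector over CLUSTERS (an assumed encoding)."""
    members = MEMBERSHIPS[value]
    return [1.0 if c in members else 0.0 for c in CLUSTERS]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    for a, b in [("sedan", "truck"), ("truck", "barge"), ("sedan", "apple")]:
        print(a, b, round(cosine(to_vector(a), to_vector(b)), 3))
```

Because each comparison reduces to a dot product over fixed-length vectors, no graph traversal is needed at comparison time, which is the kind of computational property the abstract refers to.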

Additional Search Keywords
Data mining, ontologies, taxonomies, semantics, vectors, semantic similarity, semantic vectors