Enterprise machine learning deployments are limited by two consequences of outdated data management practices widely used today. The first is the protracted time-to-insight that stems from antiquated data replication approaches. The second is the lack of unified, contextualized data that spans the organization horizontally.
Excessive data replication and the resulting “second-order effects” are creating enormous inefficiencies and waste for data scientists in most organizations. According to IDC, over 60 zettabytes of data were produced last year, and this is forecast to increase at a compound annual growth rate (CAGR) of 23 percent until 2025. Worse, the ratio of unique to replicated data is 1:10, which implies that most organizations’ data management methods are based on copying data.
When creating machine learning models, firms usually section off relevant data by replicating it from different sources. Models are typically trained on 20 percent of this data, while the other 80 percent remains for testing. The rigors of data cleansing, feature engineering, and model evaluation can take six months or more. The data goes stale during this process, which delays time-to-insight and compromises findings.
The second repercussion of traditional, outdated data management approaches is the reduced quality of insights. This effect is attributable not only to building models with stale data, but also to the inadequate relationship awareness, disconnected vertical data silos, poor contextualization, and schema limitations of relational data management techniques.
Properly implementing knowledge graphs in a modern data fabric corrects these data management issues while increasing machine learning’s value. Deploying data virtualization within a knowledge graph-powered data fabric enables data scientists to bring machine learning to their data rather than moving data to machine learning, which wastes time and resources.
Moreover, the inherent flexibility of graph models and their ability to leverage interconnected relationships make preparing data for machine learning much easier, since they provide capabilities like improved feature engineering, root cause analysis, and graph analytics. This functionality is also key to helping knowledge graphs become the dominant data management construct for the next 20 years as data management and AI converge. In short, knowledge graphs will help AI as much as AI will help knowledge graphs.
Data Scientists Need Strategic Data Management
The growing volumes and varieties of data that organizations are dealing with prolong machine learning deployments. Varying data formats, schemas, and terminologies across silos or data lakes delay the machine learning initiatives that require this training data. The lack of context and semantic annotations makes it difficult to understand the data’s meaning and its use for specific models. Even when data is sufficiently contextualized, this information rarely persists, so organizations must start over for subsequent projects. The months of training required when replicating this varied data become even harder to manage with fast-moving data, such as information collected by IoT devices. Organizations are forced to deal with this obstacle by replicating fresh data again, restarting a time-consuming process that impairs models’ functionality.
A far better approach is to train models at the data fabric layer instead of replicating data into silos. Organizations can easily create training and testing datasets without moving data. They can even specify, for example, a randomized 20 percent sample of their data with a query that extracts features and delivers a training dataset via this data virtualization approach underpinned by knowledge graphs. This methodology illustrates the connection between data management and machine learning to accelerate time-to-insight with the added benefit of training models on more current data.
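The randomized 20 percent sample described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a real data fabric API: the in-memory list of triples stands in for a virtualized knowledge graph, and `build_demo_graph`, `query_features`, and `split_train_test` are hypothetical helper names invented for this example.

```python
import random


def build_demo_graph(n_customers=50, seed=7):
    """Create hypothetical (subject, predicate, object) triples as a stand-in graph."""
    rng = random.Random(seed)
    triples = []
    for i in range(n_customers):
        cid = f"customer:{i}"
        triples.append((cid, "age", rng.randint(18, 80)))
        triples.append((cid, "orders", rng.randint(0, 40)))
        triples.append((cid, "churned", rng.random() < 0.3))
    return triples


def query_features(triples, predicates):
    """Pivot triples into one feature row per subject, as a graph query might."""
    rows = {}
    for subject, predicate, obj in triples:
        if predicate in predicates:
            rows.setdefault(subject, {})[predicate] = obj
    return rows


def split_train_test(rows, train_fraction=0.2, seed=42):
    """Randomly assign train_fraction of the subjects to training, the rest to testing."""
    rng = random.Random(seed)
    ids = sorted(rows)  # sort first so the shuffle is reproducible for a given seed
    rng.shuffle(ids)
    n_train = round(len(ids) * train_fraction)
    train = {i: rows[i] for i in ids[:n_train]}
    test = {i: rows[i] for i in ids[n_train:]}
    return train, test


graph = build_demo_graph()
rows = query_features(graph, {"age", "orders", "churned"})
train, test = split_train_test(rows)
print(len(train), len(test))  # 10 40
```

In a real deployment the feature extraction and sampling would be expressed in the graph's own query language and evaluated where the data lives. The point of the sketch is only that the split is a property of the query, not of a copied dataset.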
Achieving Quality Machine Learning Insights
Knowledge graphs provide a richer, superior foundation for understanding enterprise data compared with relational or other approaches. Graphs store data as nodes connected by edges, so they offer contextualized understanding and relationship detection across those connections. This capability is significantly enhanced by semantic graph data models that standardize business-specific terminology as a hierarchical set of vocabularies or taxonomies. Thus, data scientists can readily understand data’s meaning and relation to any use case, such as machine learning. Semantic graph data models also align data at the schema level, provide intelligent inferences about concepts or business categories, and avoid conventional problems with terminology or synonyms while delivering a complete view of enterprise data.
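To make the ideas of standardized terminology and category inference concrete, here is a small sketch. It assumes a hand-written vocabulary and taxonomy. The `SYNONYMS` and `BROADER` tables and the function names are hypothetical and do not come from any particular product or standard.

```python
# Hypothetical vocabulary: silo-specific labels mapped to one canonical concept.
SYNONYMS = {
    "client": "Customer",
    "customer": "Customer",
    "acct holder": "Customer",
    "laptop": "Laptop",
    "notebook": "Laptop",
}

# Hypothetical taxonomy: each concept points to its broader (parent) concept.
BROADER = {
    "Laptop": "Computer",
    "Computer": "Electronics",
    "Electronics": "Product",
    "Customer": "Party",
}


def canonical(term):
    """Resolve a raw label to its canonical concept; unknown terms pass through."""
    return SYNONYMS.get(term.strip().lower(), term)


def ancestors(concept):
    """Walk up the taxonomy and return every broader concept, nearest first."""
    found = []
    seen = set()  # guards against accidental cycles in the taxonomy
    while concept in BROADER and concept not in seen:
        seen.add(concept)
        concept = BROADER[concept]
        found.append(concept)
    return found


def is_a(term, category):
    """Infer whether a raw label belongs to a category via the taxonomy."""
    concept = canonical(term)
    return concept == category or category in ancestors(concept)


print(canonical("Client"))            # Customer
print(ancestors("Laptop"))            # ['Computer', 'Electronics', 'Product']
print(is_a("notebook", "Product"))    # True
print(is_a("acct holder", "Product")) # False
```

Two silos that call the same thing a "client" and an "acct holder" resolve to one concept, and a record labeled "notebook" is inferred to be a product without anyone tagging it as such.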
These characteristics are pivotal for decreasing the time required to prepare data for machine learning while producing highly nuanced, contextualized insights from the available data. Another benefit of this approach is the relevance of graph-specific algorithms for machine learning. They allow data scientists to take advantage of techniques such as clustering, dimensionality reduction, Principal Component Analysis (PCA), and unsupervised learning that are well suited to getting training data ready in graph settings. These techniques and others (like graph embedding) can accelerate the feature generation process or provide impact analysis for data preparation.
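As a rough illustration of graph-based feature generation, the sketch below derives three per-node features from a toy edge list: degree, local clustering coefficient, and a connected-component label used as a very simple unsupervised grouping. It is not PCA or graph embedding, which would normally rely on a numerical library. The edge list and all function names are assumptions made for this example.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy graph: undirected edges between entity identifiers.
EDGES = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("e", "f")]


def build_adjacency(edges):
    """Build an undirected adjacency map: node -> set of neighbors."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def local_clustering(adj, node):
    """Fraction of a node's neighbor pairs that are themselves connected."""
    neighbors = adj[node]
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(sorted(neighbors), 2) if v in adj[u])
    return 2 * links / (k * (k - 1))


def connected_components(adj):
    """Label each node with the id of its connected component (depth-first search)."""
    label = {}
    next_id = 0
    for start in sorted(adj):
        if start in label:
            continue
        label[start] = next_id
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbor in adj[node]:
                if neighbor not in label:
                    label[neighbor] = next_id
                    stack.append(neighbor)
        next_id += 1
    return label


def node_features(edges):
    """Return one feature row per node, ready to join onto a training table."""
    adj = build_adjacency(edges)
    component = connected_components(adj)
    return {
        node: {
            "degree": len(adj[node]),
            "clustering": local_clustering(adj, node),
            "component": component[node],
        }
        for node in sorted(adj)
    }


features = node_features(EDGES)
for node, row in features.items():
    print(node, row)
```

Features like these capture how an entity is connected rather than only what attributes it has, which is the kind of signal a purely tabular extract tends to lose.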
Fusing Data Management and Knowledge Management
The overarching utility of knowledge graphs for machine learning demonstrates the mutually reinforcing nature of data management and knowledge management. To paraphrase acclaimed Google researcher Peter Norvig, with enough data, one doesn’t need a fancy algorithm. That’s just what merging data management and knowledge management within a uniform data fabric, supported by knowledge graphs and data virtualization, provides: richer, higher-quality data that enables organizations to optimize machine learning without a perfect algorithm.
With sufficient data about customers’ purchasing habits, for example, one doesn’t need fancy algorithms to predict which of them would be interested in a new product offering. The convergence of data management and knowledge management maximizes AI by giving organizations trained models and algorithmically augmented intelligence to inform decision-making.