
Assessing AI Data Readiness

September 26, 2023

Intro

Artificial intelligence (AI) project success hinges on the data used to design, train, and deploy machine learning (ML) models. Before starting any AI project, or considering further adopting ML within an organization, the readiness of the supporting data should be assessed. 

To better help organizations determine if their data is ready to use as part of machine learning, we’ve put together a data readiness assessment, which includes:

  1. The Organization: goals, AI/ML strategies, and the definition of success
  2. The Current State: uses, sources, and processing of data
  3. Other Considerations: infrastructure, tools, security and cost
  4. The Future

The Bigger Picture

Before diving into the details, it should be recognized that assessing the readiness of an organization's data, and the associated systems and people, to support current and future business AI initiatives may be only part of preparing and updating an organization's AI roadmap. Alternatively, the data assessment could be performed prior to a stand-alone project to establish, or re-establish, a baseline to work from.

An AI roadmap typically contains, besides data assessment, opportunity evaluation, proof-of-concepts, project strategy, organizational considerations, etc. This article treats the second step in our Agile for AI process - assessing AI data readiness - as a stand-alone activity even though, in practice, it is one of several steps on the AI journey. (Download our full guide to help you evaluate if your business is ready for AI here)

Sample timeline of Agile for AI process phases and subtasks.

The Organization

The organization assessment includes conversations with all stakeholders about what success with AI looks like for the business. Those conversations may include chats with:

  • Business leaders, who help establish the organizational goals
  • Product strategists, who create roadmaps and verify the data will support them
  • Information technologists, who often control and manage internal and external data resources
  • Data scientists, who analyze the potential uses of the data in hand and identify gaps where more is needed

In chatting with each group, you’ll also want to ask which Key Performance Indicators (KPIs) each group is most interested in, and understand the data needed to support those KPIs. For example, an insurance company's revenue growth KPI might be backtracked to the products needed to generate that growth, and from those products to the data needed to launch them successfully. In this example, the goal of a 10% increase in revenue based on two new products, bicycle and solar panel insurance, suggests that historical data on bicycle and solar panel sales, theft and damage rates, etc. is needed.
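
To make the KPI-to-data mapping concrete, here is a minimal sketch in Python. The KPI, product, and dataset names are hypothetical placeholders, not drawn from any real assessment:

    # Hypothetical mapping from a KPI to the products that drive it and
    # the data needed to launch those products.
    kpi_data_map = {
        "revenue_growth_10pct": {
            "products": ["bicycle_insurance", "solar_panel_insurance"],
            "required_data": [
                "historical_bicycle_sales",
                "bicycle_theft_and_damage_rates",
                "historical_solar_panel_sales",
                "solar_panel_damage_claims",
            ],
        },
    }

    for kpi, needs in kpi_data_map.items():
        print(f"KPI: {kpi}")
        print("  products:", ", ".join(needs["products"]))
        print("  data needed:", ", ".join(needs["required_data"]))

Even a simple mapping like this makes data gaps visible: if a required dataset has no known source, that gap becomes a work item for the assessment.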

You’ll also want to consider asking the following questions when generating the organizational perspective on data readiness:

  • Are there questions about cause and effect that are important to the business?
  • ~For example, it’s been observed that certain marketing campaigns are much more successful than others; why is that?
  • Is there an overall data strategy? What are its elements? Here are some examples:
  • ~Consolidate all data into one place, e.g., a single data platform.
  • ~Create metrics defining what counts as high-quality data, e.g., what percentage of records may be missing certain fields (see the sketch after this list).
  • ~Enable better human learning from data, i.e., traditional business intelligence
  • ~Make predictions – leveraging ML
  • ~Use data to improve applications. An example: A word processing app that is “smarter” about finding and fixing typos.
  • ~Initiate new application development. Create new apps, such as chatbots and recommendation engines, that depend on data.
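
As an illustration of the quality-metric bullet above, the following sketch computes the percentage of records missing each field. It assumes pandas is installed; records.csv and the 5% threshold are hypothetical stand-ins:

    import pandas as pd

    # Hypothetical input file; substitute your own data source.
    df = pd.read_csv("records.csv")

    # Percentage of records missing each field.
    missing_pct = df.isna().mean() * 100

    # Flag fields that exceed an agreed quality threshold, e.g., 5% missing.
    THRESHOLD = 5.0
    for field, pct in missing_pct.items():
        status = "FAIL" if pct > THRESHOLD else "ok"
        print(f"{field}: {pct:.1f}% missing [{status}]")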

For an organization, what does it mean to be successful with data? The answers to this question provide a guiding light to evaluating existing data and proposals to obtain more. Here are some possibilities to consider:

  • Success means, in part, that value (increased revenue, lower costs, etc.) is realized quickly: project costs pay back in months, not years, and iterating on project deliverables generates more and more value. For example, additional training data improves prediction accuracy, leading to measurable customer outcomes.
  • The ongoing change to data and related systems is managed in a planned and deterministic fashion.
  • People use the data. Personal and cultural biases can block adoption. Success is, in part, ensuring the theoretical benefits are realized.
  • Related to the preceding point, trust in the data must be established, built, and maintained within the organization. If people don’t trust the data, they won’t use it or believe it. This is one important aspect of quality.
  • Related to the earlier point of creating a single repository, one measure of success is the extent to which data is used across the organization. The opposite of success here is, for example, the sales and marketing departments each building and using their own customer database.
  • A culture and practice of continuous improvement is in place.

The Current State

Once you’ve evaluated the organization and drawn an organizational roadmap, the next step is to assess the current state of data and related systems inside the organization. This second step in our overall data readiness process is subdivided into the following topics:

  • Uses
  • Sources and processing
  • Data characterization
  • Machine Learning

Uses

Ask the following questions across all departments and functions and compile the answers, categorized by data source and type. (For example, website click data might be used by both marketing and IT for different purposes.)

  • Is the data used for descriptive, predictive, or causal analysis?
  • ~Descriptive analytics is traditional analytics: describing what happened in the past.
  • ~Predictive analysis, or just predicting, is the bread and butter of machine learning, but it is also done manually using spreadsheets (“what-if” analysis) and business intelligence tools.
  • ~Causal analysis attempts to relate causes to effects, per the earlier example of using marketing data to determine which campaigns are the most successful and why.
  • Is the data used for exploration and visualization?
  • ~Data scientists often have the responsibility to explore data to find correlations and causal relations and explain them to others using visualizations. Many other roles may do this same work, for example, a marketing intern using Excel to investigate click data and display statistics.
  • Where does the data come from?
  • ~Data is sourced both internally and externally. The data may be “raw”, processed, or synthesized.
  • ~An example of raw data is sensor readings taken as voltages from a thermostat. The voltages might be scaled (processed) to represent temperatures. Given temperatures at fixed times, values might be interpolated (synthesized) to estimate intermediate values.
  • How much does it vary? To what degree does it change over time?
  • ~Does the data change for environmental reasons, because of user behavior, or due to integration partners? An advantage of keeping track of why and when data changes is that it allows machine learning systems to proactively adapt rather than simply fail due to “model drift” (the tendency of a model to become less and less accurate over time); see the sketch after this list.
  • ~How fast does the data change? Is the data significantly different minute to minute, hour to hour, year to year?
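
One way to quantify how much the data changes over time, per the model-drift point above, is to compare a feature’s distribution across two time windows. This is a minimal sketch using a two-sample Kolmogorov-Smirnov test from SciPy; the window data is simulated for illustration:

    import numpy as np
    from scipy.stats import ks_2samp

    # Simulated stand-ins for one feature sampled in two time windows.
    rng = np.random.default_rng(0)
    last_month = rng.normal(loc=0.0, scale=1.0, size=5_000)
    this_month = rng.normal(loc=0.3, scale=1.0, size=5_000)

    # The KS statistic grows as the two distributions diverge.
    stat, p_value = ks_2samp(last_month, this_month)
    if p_value < 0.01:
        print(f"Possible drift detected (KS={stat:.3f}, p={p_value:.2e})")
    else:
        print("No significant distribution change detected")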

Sources and Processing of Data

Data comes from internal and external sources. And there may be readily available sources of data that aren’t yet in use. Here are some questions to ask:

  • Are the existing data warehouses and data lakes known and accessible?
  • Is there negative data as well as positive? This may come from existing A/B testing. It’s important to record both successes and failures.
  • Are both planned and unplanned experiments captured in the data? Unplanned experiments are instances where apps or services fail; is the data associated with those failure cases being recorded? Failures could be due to service outages, integration errors, etc.
  • Planned experiments may have control groups associated with them. Control groups could be from actual populations or synthetic ones.
  • What data is already available? Are the sources and access methods documented?
  • What are the storage requirements? Where is the data stored?
  • What is the quality of the available data? What metrics are used to measure the quality?
  • Do the data types align with requirements or is conversion needed?
  • Is data being generated or augmented for AI purposes?
  • Is there a need for additional data sources? Is a search underway?

Data characterization

Is it structured, semi-structured, or unstructured data?

  • Structured data, such as in relational databases, is usually the most common, most used, and best understood across an organization.
  • Unstructured data, as might be stored in NoSQL databases, could be entire text documents.
  • Semi-structured data spans a gamut from very loosely structured (perhaps some metadata, such as title and author, associated with a document) to data stored as JSON (for example, key–value pairs, lists, dictionaries, etc.).
  • Data may also be characterized by the amount, heterogeneity, and rate of processing. For example:
  • ~What are the amounts of data in bytes, records, etc.?
  • ~What are the different types of data? Numerical, text, images, etc.
  • ~What are the formats and file types used to structure and store data?
  • ~How fast is the data being processed today? How much processing is manual versus automated? For the automation, is the data moved in batches or streaming? What’s the frequency and size of the batches? What are the data rates of the streaming?
  • Distributions in the data
  • ~To prepare for ML applications of the data, it’s important to measure and record statistical information about it. For example, the population distributions of the raw data and how the data is sampled for use.
  • ~~The statistics might be described using common distributions such as normal or binomial or other methods. (Details are outside of the scope of this article.)
  • ~~If data sets are large, samples of the data might be used to create tractable sets for exploration, training, etc. How are the samples made? Random or on some other basis? What size samples? (See the sketch after this list.)
  • Other factors to consider when characterizing the data include:
  • ~Response times - What are the requirements around how quickly new data must be processed?
  • ~Typicality - Is the existing data processing typical of all the available data, or is only part of the data being processed today? For example, are all website visitor clicks being reported on, or only some of them?
  • ~Rate of change - Does the nature of the data change over time? How fast? Re-using the website example, how often and to what degree does the website change necessitating changes to the data and its processing?
  • ~Legal/ethical - Have all legal and ethical issues been addressed with existing data? How? Manual policies/procedures, automation, review board, etc.
  • ~Applicability - Is all the data being gathered applicable to the business? Is there data being collected that doesn’t have a clear business purpose?
  • ~Comprehension - Does someone in the organization understand the data and what it means? How and where is that understanding documented in writing?
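
As a sketch of the distribution and sampling questions above, the following records basic statistics for a dataset and draws a reproducible random sample to keep exploration tractable. It assumes pandas; events.csv and the sample size are hypothetical:

    import pandas as pd

    # Hypothetical dataset; substitute your own source.
    df = pd.read_csv("events.csv")

    # Record basic statistics about the raw population.
    print(df.describe(include="all"))

    # Draw a reproducible random sample for exploration and training.
    sample = df.sample(n=min(10_000, len(df)), random_state=42)
    print(f"Sampled {len(sample)} of {len(df)} records")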

Data Processing for Machine Learning

For machine learning purposes, how is the data used today?

  • Supervised vs. unsupervised vs. semi-supervised vs. self-supervised training? A very simplified definition of these terms is:
  • ~Supervised – the ML model is trained with inputs where the expected outputs are known
  • ~Unsupervised – the outputs are not known
  • ~Semi-supervised – some outputs are known, some not
  • ~Self-supervised – the model can discover the expected outputs for itself.
  • Is the use offline or online? (Data is used to train a model that is not in use versus one that is.)
  • Is human expertise relevant and available? Humans must often be kept in the loop to evaluate the correctness of an ML-based system and provide additional training data.
  • Have all of the attributes and their characteristics (e.g., names, types, missing values, noisiness, type of noise, utility, type of distribution) of the data been identified?
  • ~Which attributes are in use? Which attributes might be used for future projects?
  • ~Are the correlations between attributes documented?
  • Have transformations been tried or put in place? Normalizations, standardizations, etc. (See the sketch after this list.)
  • What comparisons does the ML depend on? For example, comparisons between different types of website visitors.
  • What assumptions, e.g., about the marketplace, are built into the data? For example, if it’s assumed that all customers are local, and some are not, it will distort the results from automation. Documenting the assumptions, and the elasticity of the results if the assumptions change, is important to supporting the correct interpretation of ML outputs.
  • Stakeholder business decision-making: Does the current data support them? By interviewing the business stakeholders, the gap, if any, between what data is available to them and what is needed can be documented. Gap prioritization helps drive future data projects.
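
Here is a minimal sketch of two items from the list above: documenting correlations between attributes and applying a standardization transform. It assumes pandas and scikit-learn are installed; features.csv is a hypothetical feature table:

    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    # Hypothetical feature table; substitute your own data.
    df = pd.read_csv("features.csv")
    numeric = df.select_dtypes(include="number")

    # Document pairwise correlations between numeric attributes.
    print(numeric.corr())

    # Standardize features to zero mean and unit variance.
    scaler = StandardScaler()
    standardized = pd.DataFrame(
        scaler.fit_transform(numeric), columns=numeric.columns
    )
    print(standardized.describe())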

Other Considerations

Infrastructure and Tools

In addition to the data itself, the infrastructure that stores and processes the data, tools that are used to manipulate it, and business processes that define how the data is managed and treated, should be reviewed and documented:

  • What computer systems (laptops, servers, etc.) are used to extract, transform, load, clean, and store data?
  • ~Transformations include converting data between formats (see the sketch after this list).
  • What are the network architectures the data traverses? Private or public? How is the data compressed and/or encrypted? How are the networks secured?
  • Is data stored and processed using user equipment, a private cloud, public clouds, or some combination?
  • What applications are used to process the data? Common examples include Microsoft Excel, business intelligence tools, etc.
  • What bespoke or internally developed software is in the data path? For example, are there developers on staff, or outside consultants, who’ve created software to process the data?
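
As an example of the format-conversion transformation mentioned above, here is a hedged sketch converting a CSV file to Parquet with pandas. It assumes a Parquet engine such as pyarrow is installed; the file paths and the created_at column are hypothetical:

    import pandas as pd

    # Extract: read from a hypothetical CSV source.
    df = pd.read_csv("raw/transactions.csv")

    # Transform: normalize column names and parse a hypothetical timestamp column.
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    df["created_at"] = pd.to_datetime(df["created_at"], errors="coerce")

    # Load: write to a columnar format for downstream analytics.
    # (Assumes the clean/ directory already exists.)
    df.to_parquet("clean/transactions.parquet", index=False)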

Security and Governance

  • How is the data secured? Is there an organizational information security policy or set of procedures? Is it followed?
  • Are the different sources of data secured appropriately? For example, internal sources of data for internal use may be treated differently than data brought in from the outside that is re-sold or made available to external parties.
  • Are all governing regulations and laws understood and followed? For example, PII, PCI, HIPAA, etc.
  • Is the principle of “least privilege” implemented?
  • Is data anonymized appropriately?
  • What governance policies and mechanisms are in place? For example, to eliminate explicit or implicit biases in the data and avoid liability issues.

Data Return on Investment (ROI)

  • What are the hard and soft costs of the data? Hard costs include paying for data, costs to store and process it, etc. Soft costs include employee time spent manually processing the data.
  • Are those costs expected to increase or decrease in the future? Perhaps the organization expects to purchase more data or collect more internally (necessitating increased storage costs).
  • Are there mechanisms in place to calculate the ROI of different data streams? (See the sketch below.)
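
One simple mechanism, sketched below, is to compute ROI per data stream as (value - cost) / cost. The streams and dollar figures here are invented for illustration:

    # Hypothetical annual value and cost per data stream.
    streams = {
        "website_clickstream": {"value": 120_000, "cost": 40_000},
        "purchased_demographics": {"value": 55_000, "cost": 60_000},
    }

    for name, s in streams.items():
        roi = (s["value"] - s["cost"]) / s["cost"]
        print(f"{name}: ROI = {roi:.0%}")

A negative ROI, as in the second stream above, flags a data source whose cost is not yet justified by the value it delivers.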

Future State

Answering the questions laid out in the previous sections builds a good understanding of the current state of data in the organization. Going forward, there may be significant changes especially as machine learning is adopted or utilized by more groups.

If it is an organization's goal to widely adopt ML, then a “data first” stance can help. Within the business, are data and the data platform considered and treated as mission-critical?

All of the earlier questions should be answered in the context of the desired future state. In addition, it may be appropriate to consider building a cloud data platform. The rest of this section focuses on that strategy.

  • What is a Cloud Data Platform (CDP)? A cloud data platform is a single, managed repository and processing system for all business-critical data within an organization. It can be private cloud-based, completely public cloud, or a combination of the two.
  • Why a Cloud Data Platform?
  • ~Operational efficiency – efficiencies such as economy of scale, better utilization of scarce resources (people), consolidated metrics, monitoring, and management, etc.
  • ~Grow revenue – a single platform allows data to be more easily leveraged by different parts of the organization in developing and delivering products and services.
  • ~Improved customer experience – a single view of the customer enables better marketing (e.g., less over-communication, a comprehensive view of their activity) and better service (for example, interactions across all products and services in one view for the support representative).
  • ~Drive innovation – synergy between different data sets and streams may spark new ideas for products. For example, an insurance company selling life and homeowner products could see an opportunity to create a new offering at the intersection of a particular demographic and home type such as younger people owning condos.
  • ~Improve compliance – governance and compliance with the myriad state and national laws and regulations can be made less expensive and complicated if all data is in one place and subjected to a consistent set of manual and automated oversight processes.
  • Creating a Data Catalog in the CDP
  • ~A data catalog is a single, centralized repository of metadata. Metadata is data that describes other data; for example, its source, format, and uses.
  • ~Existing data repositories or platforms may need to be cataloged as well as new entries created as new sources of data are added and/or processing of existing data changes.
  • ~A data catalog should also include monitoring and alerting to notify users and IT of events of significance.
  • In the Catalog
  • ~A data catalog contains metadata, pipeline configurations, data quality checks, and pipeline activity, and it also serves as a schema registry.
  • ~The broad categories of metadata in the catalog are:
  • ~~Business metadata, e.g., data quality, sources, etc.
  • ~~Technical metadata, e.g., where data comes from and goes to, volume, issues.
  • ~Pipeline configurations describe the various paths used to process data and the components of each path.
  • ~Automated data quality checks ensure the timely delivery of the expected volumes of information or send alerts if the service level agreements aren’t met.
  • ~Pipeline activity stores the normal, and exceptional, processing events. For example, when a batch file is retrieved from a source.
  • ~A schema registry contains the descriptions of the data, e.g., the formats, storage types, etc.
  • Managing Schema Changes
  • ~It’s important to manage schema changes (changes to how the data is structured, formatted, and stored), manually if necessary but preferably via automation.
  • ~Proactive management allows the construction of resilient data processing pipelines that detect and respond to schema changes before failure (see the sketch after this list).
  • ~An up-to-date schema catalog allows for data discovery and self-service. Manual data discovery lets users find the data they need and operate on it without additional support. Automated discovery supports automated pipeline adaptation to change.
  • ~In addition, having a history of schema changes and archived data to work with simplifies pipeline debugging and troubleshooting. If a pipeline starts throwing errors, one place to start is with schema changes.
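
To illustrate detecting schema changes before a pipeline fails, here is a minimal sketch that compares incoming data columns and types against a registered schema. The registry dictionary is a hypothetical stand-in for a real schema registry:

    import pandas as pd

    # Hypothetical registered schema for one pipeline input.
    registered_schema = {"user_id": "int64", "event": "object", "ts": "datetime64[ns]"}

    def check_schema(df, expected):
        """Report added, removed, and retyped columns before the pipeline runs."""
        actual = {col: str(dtype) for col, dtype in df.dtypes.items()}
        added = set(actual) - set(expected)
        removed = set(expected) - set(actual)
        retyped = {c for c in set(actual) & set(expected) if actual[c] != expected[c]}
        if added or removed or retyped:
            raise ValueError(
                f"Schema change: added={added}, removed={removed}, retyped={retyped}"
            )

    # A batch arrives missing the "ts" column; the check catches it up front.
    incoming = pd.DataFrame({"user_id": [1, 2], "event": ["click", "view"]})
    try:
        check_schema(incoming, registered_schema)
    except ValueError as err:
        print(err)  # e.g., alert and pause the pipeline instead of crashing mid-run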

Conclusion

The success of almost any AI initiative critically depends on the data and supporting systems used. A thorough data readiness assessment process carried out before starting an AI project or creating an AI roadmap significantly reduces both technical and business risk.

Identifying and documenting all significant data sources, processing, and destinations into a data catalog may be time-consuming and painstaking but it is a worthwhile project to ground AI strategies and planning in reality. 

Resources

  • For the next steps in the AI readiness process, download our free guide created by our AI experts here
  • For more information on everything AI implementation, check out our growing guide here.
  • To keep your AI project on track, check out our Agile for AI process here
  • To help evaluate if your business is ready for AI, contact us for a free consultation here

