• Looking to build good artificial intelligence (AI)? Don’t let the speed and availability of open source frameworks, modules, libraries, and languages lull you into a false sense of confidence.
• Good AI needs to start with good data, and good data needs to be ingested, registered, described, validated, and processed well before it reaches the ready hands of AI practitioners.
These are heady times. Enterprises have at their disposal both the raw materials and the necessary tools to achieve great things with AI, be that something as grandiose as self-driving cars or as unassuming as a fraud detection algorithm. The trouble with an abundance of materials (e.g., data) and tools (e.g., open source machine learning models), however, is speed. Speed kills, as they say.
For AI practitioners, this means learning to run before learning to walk by hastily automating decisions via AI models built on unsound data. With a few simple open source frameworks, modules, libraries, and languages, seemingly useful but ultimately erroneous predictions and conclusions can be drawn from any old data set in very short order. What’s the answer? More or better tools? No. As with most human problems, good old human know-how and understanding are necessary. And that begins with data.
And right now the shortest route to that understanding appears to lie within the notion of data cataloging, specifically the ingestion, registration, description, and validation of data. For that we are starting to see a number of tools like Microsoft Azure Data Catalog and Tableau Data Catalog enter the market, promising to bring the focus back to the front end of the pipeline without enforcing (or interfering with) existing data warehousing or master data management and governance requirements.
Enterprise cloud data management heavyweight Informatica has certainly been an active proponent of data intelligence through ideas like cataloging (along with data management, quality, governance, and security) for some time now. Unlike many analytics- or platform-centric rivals, though, Informatica has a portfolio broad enough to market its Enterprise Data Catalog both as a standalone product and in the context of data governance, analytics, application modernization, and other key initiatives: not as an isolated cure-all for data distrust but as a trust-increasing component within the enterprise data pipeline, alongside standalone data governance, data preparation, data integration, data quality, data protection, and data operationalization.
When it comes to AI-based decisions, this kind of data-first value chain is of particular importance for the simple reason that AI is an iterative, communal endeavor among data analysts, engineers, scientists, and other stakeholders. Each participant in that pipeline, whether cataloging data sets, crafting a k-means clustering data model or just searching for quarterly sales numbers, needs to have trust in the underlying data. For that to happen, users must have trust in the work done by one another. After all, it only takes one faulty assumption at any point in the pipeline to scupper the desired outcome.
That’s what makes catalogs (and catalogs as part of a broader context) so valuable: they create a metadata system of record that allows practitioners to know the lineage of any and all data that passes through the pipeline. With its broad winter 2019 update this month, Informatica upped its metadata game with the addition of several new cataloging features powered by its internal AI augmentation and automation engine, Informatica CLAIRE. Informatica’s Enterprise Data Catalog is essentially a catalog of catalogs, which gives it the ability to profile and scan the metadata of many types of data sources: databases, applications (on-premises, cloud, and legacy), BI and visualization tools, file systems, ETL tools, and even other data catalogs. The result is a highly detailed, end-to-end view of data lineage.
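To make the lineage idea concrete, here is a minimal sketch of how a metadata system of record can answer the "where did this data come from?" question: each registered asset records its direct upstream sources, and a walk of that graph yields end-to-end lineage. The asset names and graph structure below are illustrative assumptions, not Informatica's actual implementation.

```python
# Illustrative lineage graph: each asset maps to its direct upstream sources.
# Asset names here are hypothetical examples, not real catalog entries.
LINEAGE = {
    "sales_dashboard": ["sales_mart"],
    "sales_mart": ["crm_extract", "erp_extract"],
    "crm_extract": [],
    "erp_extract": [],
}

def full_lineage(asset, graph):
    """Return every upstream asset reachable from `asset` (end-to-end lineage)."""
    seen = set()
    stack = list(graph.get(asset, []))
    while stack:
        upstream = stack.pop()
        if upstream not in seen:
            seen.add(upstream)
            # Walk further upstream from this source.
            stack.extend(graph.get(upstream, []))
    return seen

print(sorted(full_lineage("sales_dashboard", LINEAGE)))
# ['crm_extract', 'erp_extract', 'sales_mart']
```

A real catalog stores far richer metadata per edge (scan timestamps, owners, transformation logic), but the trust argument rests on exactly this kind of traversal: any asset can be traced back to its origins.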
Interesting new features include CLAIRE-powered intelligent structure discovery and streaming data lineage tools as well as data pipeline recommendations and categorization capabilities. Informatica also rolled out more AI-driven features such as automatic discovery of data prep errors (like null values) and recommended data prep recipes. This notion of AI-informed automation and augmentation is quite a common requirement these days, especially with data prep use cases.
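The "automatic discovery of data prep errors" idea is easy to picture in miniature. The sketch below assumes a simple in-memory table and a made-up drop/impute threshold; it is a toy illustration of the concept, not Informatica's or CLAIRE's actual logic.

```python
def find_null_columns(rows):
    """Return a dict mapping column name -> count of null/empty values."""
    counts = {}
    for row in rows:
        for col, value in row.items():
            if value is None or value == "":
                counts[col] = counts.get(col, 0) + 1
    return counts

def suggest_recipe(null_counts, total_rows, drop_threshold=0.5):
    """Suggest a prep step per affected column: drop if mostly null, else impute.

    The 0.5 threshold is an arbitrary assumption for illustration.
    """
    return {
        col: ("drop_column" if n / total_rows > drop_threshold else "impute")
        for col, n in null_counts.items()
    }

orders = [
    {"id": 1, "region": "EMEA", "amount": 120.0},
    {"id": 2, "region": None,   "amount": 80.5},
    {"id": 3, "region": None,   "amount": None},
]
nulls = find_null_columns(orders)
print(nulls)                              # {'region': 2, 'amount': 1}
print(suggest_recipe(nulls, len(orders)))  # {'region': 'drop_column', 'amount': 'impute'}
```

Commercial tools layer ML-driven pattern recognition on top of checks like these, but the principle is the same: surface the errors before the data reaches a model.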
What is unique about this December string of updates, however, has nothing to do with AI. Building on its heavy use of Apache Spark as a data processing engine, Informatica has introduced blockchain to the data pipeline mix. That’s right, blockchain. Informatica can now use standard RESTful APIs and an OpenAPI Specification (aka a Swagger file) to read, write, and perform lookups on the data inside a Hyperledger-based blockchain. Why is this important? If you adhere to the adage of garbage in, garbage out, it’s easy to imagine the gravity of allowing garbage into what is designed to be an immutable data store such as an implementation of Hyperledger. The same goes for getting clean, trusted information out of the blockchain.
It’s this kind of “data intelligent” thinking that’s necessary to save us from ourselves, from our relentless pursuit of speed.