The Data Analytics Platform

A 100% open source, integrated framework that accelerates application development for data analytics

The team behind CDAP

  • Ajai Narayanan
  • Albert Shau
  • Ali Anwar
  • Andreas Neumann
  • Bhooshan Mogal
  • Derek Wood
  • Edwin Elia
  • Jay Jin
  • Lea Cuniberti-Duran
  • Nitin Motgi
  • Poorna Chandra
  • Rohit Sinha
  • Sagar Kapare
  • Sreevatsan Raman
  • Terence Yim
  • Tony Hajdari
  • Vinisha Shah
  • Yaojie Feng

Get started in the Cloud

Run CDAP on any major public cloud provider, including Amazon Web Services, Microsoft Azure and Google Cloud Platform.

Get started on-premises

Run CDAP on-premises on your own Apache Hadoop-based clusters.


CDAP lets developers, business analysts and data scientists focus on insights, analytics and business value instead of wrestling with infrastructure and integration.

Reduced complexity

CDAP's easy-to-use abstractions over complex technologies shift the focus from infrastructure and integration to insights.

Increased velocity

With an extensible framework and reusable templates, CDAP accelerates time to value and breaks down silos, so you can build once, run anywhere.

Increased flexibility

CDAP is 100% open source, portable and extensible. It integrates with the latest Big Data and Cloud technologies.

Improved visibility

CDAP gives you greater visibility into your data by letting you search metadata and by providing insight into data lineage.

CDAP features

Rapid development

  • Developer SDK and APIs with abstractions over common data processing patterns
  • Sandbox mode, programmatic and UI-driven debugging
  • In-memory mode and testing framework to simplify testing
  • Support for cutting-edge Cloud, Apache Hadoop and Apache Spark technologies


Enterprise ready

  • Metadata repository with automatic technical and operational metadata capture
  • Business metadata annotations
  • Data discovery through metadata-based search
  • Data governance with dataset-level and field-level lineage and auditing
  • Integration with enterprise security systems


Seamless operations

  • REST APIs and CLI for every interaction
  • Time-based and process-based scheduling
  • Standardized logs and metrics for all execution environments
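As a rough illustration of the REST-driven interaction mentioned above, the sketch below builds and calls a URL that lists the applications in a namespace. The `/v3/namespaces/<namespace>/apps` path is assumed here from CDAP's v3 REST API; the host and port (`localhost:11015`, a typical sandbox router address) and the helper names `apps_url` and `list_apps` are illustrative assumptions, so check the REST API reference for your CDAP version before relying on them.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: a local CDAP sandbox whose router listens on localhost:11015.
BASE_URL = "http://localhost:11015/v3"


def apps_url(namespace="default", base_url=BASE_URL):
    """Build the URL that lists the applications deployed in a namespace."""
    return f"{base_url}/namespaces/{quote(namespace, safe='')}/apps"


def list_apps(namespace="default", base_url=BASE_URL, timeout=10):
    """Fetch the application list as parsed JSON.

    Requires a reachable CDAP instance; raises URLError otherwise.
    """
    with urlopen(apps_url(namespace, base_url), timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Only prints the URL, so the script runs without a CDAP instance.
    print(apps_url())
```

Keeping URL construction separate from the network call makes the request easy to inspect or test without a running instance.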


Portable runtime environments

Build once, run anywhere through portability across runtime environments such as Apache Hadoop YARN and Docker.


Extensible and reusable

  • Templates and blueprints for common use cases
  • Hub for sharing pre-built plugins, applications and solutions
  • Extensible APIs for security, metadata, runtimes and storage


Hybrid and multi-cloud

  • Interoperability across on-premises and Cloud environments
  • Support for all major public cloud providers such as Amazon Web Services, Microsoft Azure and Google Cloud Platform




Pipelines provides an easy-to-use graphical data integration interface to bring together data from a myriad of different sources and define transformations visually.
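Conceptually, a pipeline of this kind reads records from a source, passes them through a chain of transformations and writes the result to a sink. The sketch below is not CDAP's API; it is a minimal, hypothetical model of that flow in plain Python, and every name in it (`run_pipeline`, `drop_non_positive`, `uppercase_name`) is invented for illustration.

```python
from typing import Callable, Iterable, Iterator, List

Record = dict
Transform = Callable[[Iterable[Record]], Iterator[Record]]


def drop_non_positive(records: Iterable[Record]) -> Iterator[Record]:
    """Hypothetical transform: keep only records with a positive amount."""
    for rec in records:
        if rec["amount"] > 0:
            yield rec


def uppercase_name(records: Iterable[Record]) -> Iterator[Record]:
    """Hypothetical transform: return copies with the name upper-cased."""
    for rec in records:
        yield {**rec, "name": rec["name"].upper()}


def run_pipeline(source: Iterable[Record],
                 transforms: List[Transform],
                 sink: List[Record]) -> List[Record]:
    """Chain the transforms lazily over the source, then write to the sink."""
    stream: Iterable[Record] = iter(source)
    for transform in transforms:
        stream = transform(stream)
    for rec in stream:
        sink.append(rec)
    return sink


if __name__ == "__main__":
    rows = [{"name": "ada", "amount": 5}, {"name": "bob", "amount": -1}]
    print(run_pipeline(rows, [drop_non_positive, uppercase_name], []))
```

Because each stage is a generator, records stream through one at a time instead of being materialized between stages.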



Wrangler allows you to visually and interactively cleanse and prepare raw data, with the aim of making it consumable for further processing. It provides a standardized, UI-driven interactive flow that takes the pain out of preprocessing tasks for data engineering, data science and data analysis.
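To make the kind of preprocessing described above concrete, here is a minimal sketch of a cleansing flow in plain Python: parse raw CSV text, trim whitespace, drop rows missing a value and convert a column's type. It does not use Wrangler itself; the function names (`parse_csv`, `trim`, `drop_missing`, `set_type`, `prepare`) and the sample data are assumptions made for illustration.

```python
import csv
import io


def parse_csv(raw):
    """Parse CSV text with a header row into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(raw)))


def trim(rows):
    """Strip surrounding whitespace from every column name and value."""
    return [{k.strip(): (v or "").strip() for k, v in row.items()}
            for row in rows]


def drop_missing(rows, column):
    """Drop rows whose value in `column` is empty."""
    return [row for row in rows if row.get(column)]


def set_type(rows, column, cast):
    """Convert `column` in every row using `cast` (for example, int)."""
    return [{**row, column: cast(row[column])} for row in rows]


def prepare(raw):
    """Run the cleansing steps in order on raw CSV text."""
    rows = parse_csv(raw)
    rows = trim(rows)
    rows = drop_missing(rows, "age")
    return set_type(rows, "age", int)


if __name__ == "__main__":
    print(prepare("name, age\n Ada ,36\nBob,\n"))
```

Each step takes rows and returns new rows, so steps can be reordered or previewed one at a time, which mirrors the interactive style of data preparation.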



Analytics provides a simple, interactive, automated interface for users to easily develop, train, test, evaluate and deploy their machine learning models.



Rules Engine provides a way for business analysts to create and manage a knowledge base of data transformation rules that need to be automatically applied to your data.
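The idea of a knowledge base of transformation rules can be sketched as a list of condition and action pairs that are applied to each record in order. This is a hypothetical model in plain Python, not the Rules Engine's actual rule format; `Rule`, `apply_rules` and the two sample rules are invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule: if `when` holds for a record, apply `then`."""
    name: str
    when: Callable[[dict], bool]
    then: Callable[[dict], dict]


def apply_rules(record: dict, rules: List[Rule]) -> Tuple[dict, List[str]]:
    """Apply matching rules in order; return the record and fired rule names."""
    fired = []
    for rule in rules:
        if rule.when(record):
            record = rule.then(record)
            fired.append(rule.name)
    return record, fired


# Sample knowledge base (illustrative only).
RULES = [
    Rule("default-country",
         when=lambda r: not r.get("country"),
         then=lambda r: {**r, "country": "unknown"}),
    Rule("normalize-email",
         when=lambda r: "email" in r,
         then=lambda r: {**r, "email": r["email"].strip().lower()}),
]


if __name__ == "__main__":
    print(apply_rules({"email": " A@X.org "}, RULES))
```

Returning the names of the rules that fired alongside the transformed record gives a simple audit trail of which rules touched which data.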
