CDAP is highly extensible and supports plugins that allow users to extend its capabilities. This page lists all the plugins available in CDAP. Refer to the community page to learn about contributing your own plugin.

  • ADLSBatchSink
    Sink
    ADLSBatchSink
    Azure Data Lake Store Batch Sink writes data to an Azure Data Lake Store directory in Avro, ORC, or text format.
  • ADLSDelete
    Action
    ADLSDelete
    Deletes a file or files within the ADLS file system.
  • AddField
    Transform
    AddField
    Adds a new field to each record. The field value can either be a new UUID, or it can be set directly through configuration. This transform is used when you want to add a unique id field to each record, or when you want to tag each record with some constant value. For example, you may want to add the logical start time as a field to each record.
  • AmazonS3Client
    Action
    AmazonS3Client
    The Amazon S3 Client Action is used to work with S3 buckets and objects before or after the execution of a pipeline.
  • HTTP Argument Setter
    Action
    HTTP Argument Setter
    Performs an HTTP request to an endpoint to get a driver specification. Based on the spec, it will make another call to a nebula endpoint to get data about the dataset, which it will use to set 'input.path', 'input.properties', 'directives', and 'output.schema' arguments that can be used later on in the pipeline through macros.
  • Avro Dynamic Partitioned Dataset
    Sink
    Avro Dynamic Partitioned Dataset
  • AzureBlobStore
    Source
    AzureBlobStore
    Batch source to use Microsoft Azure Blob Storage as a source.
  • AzureDataLakeStore
    Source
    AzureDataLakeStore
    Azure Data Lake Store Batch Source reads data from Azure Data Lake Store files and converts it into StructuredRecord.
  • AzureDecompress
    Action
    AzureDecompress
    The Azure Decompress Action plugin decompresses gz files from one container on the Azure Blob Storage service into another container.
  • AzureDelete
    Action
    AzureDelete
    The Azure Delete Action plugin deletes a container on the Azure Blob Storage service.
  • AzureEventHub
    Source
    AzureEventHub
    Azure Event Hub streaming source. Emits a record with the schema specified by the user. If no schema is specified, it will emit a record with 'message'(bytes).
  • AzureFaceExtractor
    Transform
    AzureFaceExtractor
  • Google Big Query Table
    Sink
    Google Big Query Table
    This plugin exports a BigQuery table as a sink to be ingested into the processing pipeline. The plugin requires a service account to access the BigQuery table. To configure the service account, visit https://cloud.google.com. Make sure you give the service account the right permissions for accessing the BigQuery API.
  • CDC Google Cloud Bigtable Sink
    Sink
    CDC Google Cloud Bigtable Sink
    This plugin takes input from a CDC source and writes the changes to Cloud Bigtable.
  • CDCDatabase
    Source
    CDCDatabase
    This plugin reads Change Data Capture (CDC) events from a Golden Gate Kafka topic.
  • CDCHBase
    Sink
    CDCHBase
    This plugin takes input from a CDC source and writes the changes to HBase. It will write to the HBase instance running on the cluster.
  • CDCKudu
    Sink
    CDCKudu
    This plugin takes input from a CDC source and writes the changes to Kudu.
  • CSVFormatter
    Transform
    CSVFormatter
  • CSVParser
    Transform
    CSVParser
  • Record Duplicator
    Transform
    Record Duplicator
    Copies every input record to the output a configured number of times. This transform does not change any record fields or types; it is an identity transform.
  • Compressor
    Transform
    Compressor
    Compresses configured fields. Multiple fields can be compressed using different compression algorithms. The plugin supports SNAPPY, ZIP, and GZIP compression.
  • Conditional
    Condition
    Conditional
    A control flow plugin that allows conditional execution within pipelines. Conditions are specified as expressions whose variables can include values specified as runtime arguments of the pipeline, tokens from plugins prior to the condition, and globals that provide information about the pipeline such as the stage, pipeline, logical start time, and plugin.
  • Cube
    Sink
    Cube
    Batch sink that writes data to a Cube dataset.
  • DataFactoryDriver
    Action
    DataFactoryDriver
    Performs an HTTP request to an endpoint to get a driver specification. Based on the spec, it will make another call to a nebula endpoint to get data about the dataset, which it will use to set 'input.path', 'input.properties', 'directives', and 'output.schema' arguments that can be used later on in the pipeline through macros.
  • DataProfiler
    Analytics
    DataProfiler
    Calculates statistics for each input field. For every field, a total count and null count will be calculated. For numeric fields, min, max, mean, stddev, zero count, positive count, and negative count will be calculated. For string fields, min length, max length, mean length, and empty count will be calculated. For boolean fields, true and false counts will be calculated. When calculating means, only non-null values are considered.
  • Database
    Source
    Database
    Writes records to a database table. Each record will be written to a row in the table.
  • DatabaseQuery
    Action
    DatabaseQuery
    Runs a database query at the end of the pipeline run. Can be configured to run only on success, only on failure, or always at the end of the run.
  • DateTransform
    Transform
    DateTransform
    This transform takes a date in either a unix timestamp or a string, and converts it to a formatted string. (Macro-enabled)
  • DecisionTreePredictor
    Analytics
    DecisionTreePredictor
    Loads a Decision Tree Regression model from a FileSet and uses it to label the records based on the predicted values.
  • DecisionTreeTrainer
    Sink
    DecisionTreeTrainer
    Trains a regression model based upon a particular label and features of a record. Saves this model to a FileSet.
  • Decoder
    Transform
  • Decompress
    Action
    Decompress
  • Field Decompressor
    Transform
    Field Decompressor
  • Field Decrypter
    Transform
    Field Decrypter
    Decrypts one or more fields in input records using a keystore that must be present on all nodes of the cluster.
  • Deduplicate
    Analytics
    Deduplicate
  • Distinct
    Analytics
    Distinct
    De-duplicates input records so that all output records are distinct. Can optionally take a list of fields, which will project out all other fields and perform a distinct on just those fields.
  • ADLS Batch Sink
    Sink
    ADLS Batch Sink
  • DynamicMultiFileset
    Sink
    DynamicMultiFileset
    This plugin is normally used in conjunction with the MultiTableDatabase batch source to write records from multiple databases into multiple filesets in text format. Each fileset it writes to will contain a single 'ingesttime' partition, which will contain the logical start time of the pipeline run. The plugin expects that the filesets it needs to write to will be set as pipeline arguments, where the key is 'multisink.[fileset]' and the value is the fileset schema. Normally, you rely on the MultiTableDatabase source to set those pipeline arguments, but they can also be manually set or set by an Action plugin in your pipeline. The sink will expect each record to contain a special split field that will be used to determine which records are written to each fileset. For example, suppose the split field is 'tablename'. A record whose 'tablename' field is set to 'activity' will be written to the 'activity' fileset.
  • DynamoDB
    Source
    DynamoDB
    DynamoDB Batch Source that reads data items from an AWS DynamoDB table and converts each item into a StructuredRecord according to the schema specified by the user, so it can be further processed downstream in the pipeline. The user can provide a query to read items from the DynamoDB table.
  • Elasticsearch
    Sink
    Elasticsearch
    Takes the Structured Record from the input source and converts it to a JSON string, then indexes it in Elasticsearch using the index, type, and idField specified by the user. The Elasticsearch server should be running prior to creating the application.
  • Email
    Action
    Email
    Sends an email at the end of a pipeline run.
  • Field Encoder
    Transform
    Field Encoder
  • Encryptor
    Transform
    Encryptor
    Encrypts one or more fields in input records using a java keystore that must be present on all nodes of the cluster.
  • ErrorCollector
    Error Handler
    ErrorCollector
    The ErrorCollector plugin takes errors emitted from the previous stage and flattens them by adding the error message, code, and stage to the record and outputting the result.
  • Excel
    Source
    Excel
    The Excel plugin provides the ability to read data from one or more Excel files.
  • FTP
    Source
    FTP
    Batch source for an FTP or SFTP source. Prefix of the path ('ftp://...' or 'sftp://...') determines the source server type, either FTP or SFTP.
  • FTPCopy
    Action
    FTPCopy
    Copies files from an FTP server to the specified destination.
  • Fail Pipeline
    Sink
    Fail Pipeline
    This batch sink fails the running pipeline as soon as the first record flows into it.
  • FastFilter
    Transform
    FastFilter
    Filters out messages based on specified criteria.
  • File
    Source
    File
    File streaming source. Watches a directory and streams file contents of any new files added to the directory. Files must be atomically moved or renamed.
  • File Appender
    Sink
    File Appender
    Writes to a CDAP FileSet in text format. HDFS append must be enabled for this to work. One line is written for each record sent to the sink. All record fields are joined using a configurable separator. Each time a batch is written, the sink will examine all existing files in the output directory. If there are any files that are smaller in size than the size threshold, or more recent than the age threshold, new data will be appended to those files instead of written to new files.
  • File Contents Checker
    Action
    File Contents Checker
    This action plugin can be used to check if a file is empty or if the contents of a file match a given pattern.
  • FileCopySink
    Sink
    FileCopySink
    The File Copy plugin is a sink plugin that takes file metadata records as inputs and copies the files into the local HDFS or the local filesystem.
  • FileDelete
    Action
    FileDelete
    Deletes a file or files.
  • FileMetadataSource
    Source
    FileMetadataSource
    The File Metadata plugin is a source plugin that allows users to read file metadata from a local HDFS or a local filesystem.
  • FileMove
    Action
    FileMove
    Moves a file or files.
  • GCS
    Sink
    GCS
    This plugin writes records to one or more files in a directory on Google Cloud Storage. Files can be written in various formats such as csv, avro, parquet, and json.
  • GCSAvro
    Sink
    GCSAvro
    A GCS sink to write records as AVRO records.
  • Google Cloud Storage Path Create
    Action
    Google Cloud Storage Path Create
    This plugin is used for creating directories on Google Cloud Storage (GCS).
  • Google Cloud Storage Path Delete
    Action
    Google Cloud Storage Path Delete
    This plugin is used for deleting directories on Google Cloud Storage (GCS).
  • Google Cloud Storage File
    Source
    Google Cloud Storage File
  • Google Cloud Storage File Blob
    Source
    Google Cloud Storage File Blob
  • GCSParquet
    Sink
    GCSParquet
    A GCS sink to write records as AVRO records into Parquet files.
  • GCSText
    Sink
    GCSText
    A GCS sink to write records as comma-, tab-, pipe-, or CTRL+A-separated text files, or as JSON text files.
  • GoldenGateNormalizer
    Transform
    GoldenGateNormalizer
  • GooglePublisher
    Sink
    GooglePublisher
  • GoogleSubscriber
    Source
    GoogleSubscriber
  • Group By
    Analytics
    Group By
  • HBase
    Source
    HBase
    Batch source that reads from a column family in an HBase table. This source differs from the Table source in that it does not use a CDAP dataset, but reads directly from HBase.
  • HDFS
    Sink
  • HDFSDelete
    Action
    HDFSDelete
    Deletes a file or files within an HDFS cluster.
  • HDFS File Merge
    Action
    HDFS File Merge
    Merges small files within an HDFS cluster.
  • HDFSMove
    Action
    HDFSMove
    Moves a file or files within an HDFS cluster.
  • HTTP
    Sink
    HTTP
    Sink plugin that sends messages from the pipeline to an external HTTP endpoint.
  • HTTPCallback
    Action
    HTTPCallback
    Performs an HTTP request at the end of a pipeline run.
  • HTTPPoller
    Source
    HTTPPoller
  • HTTPToHDFS
    Action
    HTTPToHDFS
    Action to fetch data from an external HTTP endpoint and create a file in HDFS.
  • Hasher
    Transform
  • Hive Bulk Export
    Action
    Hive Bulk Export
  • Hive Bulk Import
    Action
    Hive Bulk Import
  • JSONFormatter
    Transform
    JSONFormatter
  • JSONParser
    Transform
    JSONParser
    Parses an input JSON event into a record. The input JSON event could be either a map of string fields to values or it could be a complex nested JSON structure. The plugin allows you to express JSON paths for extracting fields from complex nested input JSON.
  • JavaScript
    Transform
    JavaScript
    Executes user-provided JavaScript that transforms one record into zero or more records. Input records are converted into JSON objects which can be directly accessed in JavaScript. The transform expects to receive a JSON object as input, which it can process and emit zero or more records or emit error using the provided emitter object.
  • Joiner
    Analytics
  • KVTable
    Source
    KVTable
    Reads the entire contents of a KeyValueTable, outputting records with a 'key' field and a 'value' field. Both fields are of type bytes.
  • Kafka
    Alert Publisher
  • Kafka Alert Publisher
    Alert Publisher
    Kafka Alert Publisher
    Publishes alerts to Kafka as JSON objects. The plugin internally uses the Kafka producer APIs to publish alerts, and allows you to specify the Kafka topic to publish to as well as additional Kafka producer properties. This plugin uses the Kafka 0.10.2 Java APIs.
  • KinesisSink
    Sink
    KinesisSink
    Kinesis sink that outputs to a specified Amazon Kinesis Stream.
  • KinesisSource
    Source
    KinesisSource
    Spark streaming source that reads from AWS Kinesis streams.
  • Kudu
    Source
    Kudu
    CDAP plugin for reading data from an Apache Kudu table.
  • LoadToSnowflake
    Action
    LoadToSnowflake
  • LogParser
    Transform
    LogParser
    Parses logs from any input source for relevant information such as URI, IP, browser, device, HTTP status code, and timestamp.
  • LogisticRegressionClassifier
    Analytics
    LogisticRegressionClassifier
    Loads a Logistic Regression model from a file of a FileSet dataset and uses it to classify records.
  • LogisticRegressionTrainer
    Sink
    LogisticRegressionTrainer
    Trains a classification model based upon a particular label and features of a record. Saves this model to a FileSet.
  • MLPredictor
    Analytics
    MLPredictor
    Uses a model trained by the ModelTrainer plugin to add a prediction field to incoming records. The same features used to train the model must be present in each input record, but input records can also contain additional non-feature fields. If the trained model uses categorical features, and if the record being predicted contains new categories, that record will be dropped. For example, suppose categorical feature 'city' was used to train a model that predicts housing prices. If an incoming record has 'New York' as the city, but 'New York' was not in the training set, that record will be dropped.
  • MQTT
    Source
    MQTT
    The MQTT Streaming Source allows you to subscribe to an MQTT broker in a streaming context. You specify the topic to subscribe to as an MQTT client.
  • MainframeReader
    Source
    MainframeReader
    This is a source plugin that allows users to read and process mainframe files.
  • MapR-DB JSON Table
    Sink
    MapR-DB JSON Table
    The MapR-DB JSON Table sink writes JSON documents to a MapR-DB table.
  • MapR Stream Consumer
    Source
    MapR Stream Consumer
    MapR streaming source. Reads events from a MapR stream.
  • MultiFieldAdder
    Transform
    MultiFieldAdder
    The Multi Field Adder transform allows you to add one or more fields to the output. Each field specified has a name and a value. Values are currently always of type string.
  • Multiple Database Tables
    Source
    Multiple Database Tables
    Reads from multiple tables within a database using JDBC. Often used in conjunction with the DynamicMultiFileset sink to perform dumps from multiple tables to HDFS files in a single pipeline. The source will output a record for each row in the tables it reads, with each record containing an additional field that holds the name of the table the record came from. In addition, for each table that will be read, this plugin will set pipeline arguments where the key is 'multisink.[tablename]' and the value is the schema of the table. This is to make it work with the DynamicMultiFileset.
  • NaiveBayesClassifier
    Analytics
    NaiveBayesClassifier
    Loads a Naive Bayes model from a file of a FileSet dataset and uses it to classify records.
  • NaiveBayesTrainer
    Sink
    NaiveBayesTrainer
    Using a Naive Bayes algorithm, trains a model based upon a particular label and text field of a record. Saves this model to a file in a FileSet dataset.
  • Normalize
    Transform
    Normalize
    Normalize is a transform plugin that breaks one source row into multiple target rows. Attributes stored in the columns of a table or a file may need to be broken into multiple records: for example, one record per column attribute. In general, the plugin allows the conversion of columns to rows.
  • NullFieldSplitter
    Transform
    NullFieldSplitter
  • OracleExport
    Action
    OracleExport
    A Hydrator Action plugin to efficiently export data from Oracle to HDFS or local file system. The plugin uses Oracle's command line tools to export data. The data exported from this tool can then be used in Hydrator pipelines.
  • OrientDB
    Sink
    OrientDB
    Writes data to an OrientDB database.
  • PDFExtractor
    Transform
    PDFExtractor
  • Parquet Dynamic Partitioned Dataset
    Sink
    Parquet Dynamic Partitioned Dataset
  • Projection
    Transform
    Projection
    The Projection transform lets you drop, keep, rename, and cast fields to a different type. Fields are first dropped based on the drop or keep field, then cast, then renamed.
  • PySpark Program
    Action
    PySpark Program
    Executes user-provided Spark code in Python.
  • PythonEvaluator
    Transform
    PythonEvaluator
    Executes user-provided python code that transforms one record into zero or more records. Each input record is converted into a dictionary which can be directly accessed in python. The transform expects to receive a dictionary as input, which it can process and emit zero or more transformed dictionaries, or emit an error dictionary using the provided emitter object.
  • RecordSplitter
    Transform
    RecordSplitter
    Given a field and a delimiter, emits one record for each split of the field.
  • RedshiftToS3
    Action
    RedshiftToS3
  • Repartitioner
    Analytics
    Repartitioner
    This plugin repartitions a Spark RDD.
  • RowDenormalizer
    Analytics
    RowDenormalizer
    Converts raw data into denormalized data based on a key column. The user can specify the list of fields that should be used in the denormalized record, with an option to use an alias for the output field name. For example, 'ADDRESS' in the input can be mapped to 'addr' in the output schema.
  • Run
    Transform
    Run
    Runs an executable binary that is installed and available on the local filesystem of the Hadoop nodes. The Run transform plugin reads a structured record as input and returns an output record, which can be further processed downstream in the pipeline.
  • S3
    Source
    S3
    Batch source to use Amazon S3 as a Source.
  • S3Avro
    Sink
    S3Avro
    A batch sink for writing to Amazon S3 in Avro format.
  • Amazon S3 Whole File Copier
    Sink
    Amazon S3 Whole File Copier
    The S3 File Copy plugin is a sink plugin that takes file metadata records as inputs and copies the files into an Amazon S3 filesystem.
  • S3FileMetadataSource
    Source
    S3FileMetadataSource
    The S3 File Metadata plugin is a source plugin that allows users to read file metadata from an S3 Filesystem.
  • S3Parquet
    Sink
    S3Parquet
    A batch sink to write to S3 in Parquet format.
  • S3ToRedshift
    Action
    S3ToRedshift
    S3ToRedshift Action that loads data from an AWS S3 bucket into an AWS Redshift table.
  • SFTPCopy
    Action
    SFTPCopy
  • SFTPDelete
    Action
    SFTPDelete
  • SFTPPut
    Action
  • Remote Program Executor
    Action
    Remote Program Executor
    Establishes an SSH connection with a remote machine to execute a command on that machine.
  • Sampling
    Analytics
    Sampling
    Samples a large dataset flowing through this plugin to pull random records. Supports two types of sampling: systematic sampling and reservoir sampling.
  • ScalaSparkCompute
    Analytics
    ScalaSparkCompute
    Executes user-provided Spark code in Scala that transforms RDD to RDD with full access to all Spark features.
  • ScalaSparkProgram
    Action
    ScalaSparkProgram
    Executes user-provided Spark code in Scala.
  • ScalaSparkSink
    Sink
    ScalaSparkSink
    Executes user-provided Spark code in Scala that operates on an input RDD or Dataframe with full access to all Spark features.
  • SnapshotAvro
    Sink
    SnapshotAvro
    A batch sink for a PartitionedFileSet that writes snapshots of data as a new partition. Data is written in Avro format. A corresponding SnapshotAvro source can be used to read only the most recently written snapshot.
  • SnapshotParquet
    Source
    SnapshotParquet
    A batch source that reads from a corresponding SnapshotParquet sink. The source will only read the most recent snapshot written to the sink.
  • SnapshotText
    Sink
    SnapshotText
    A batch sink for a PartitionedFileSet that writes snapshots of data as a new partition. Data is written in Text format.
  • Spanner
    Source
    Spanner
    This source reads from a Google Cloud Spanner table. Cloud Spanner is a fully managed, mission-critical, relational database service that offers transactional consistency at global scale, schemas, SQL (ANSI 2011 with extensions), and automatic, synchronous replication for high availability.
  • Google Cloud Speech-to-Text
    Transform
    Google Cloud Speech-to-Text
    This plugin converts audio files to text by using Google Cloud Speech-to-Text.
  • SpeechTranslator
    Transform
    SpeechTranslator
  • StateRestore
    Action
    StateRestore
  • CDAP Stream
    Source
    CDAP Stream
  • StructuredRecordToGenericRecord
    Transform
    StructuredRecordToGenericRecord
    Transforms a StructuredRecord into an Avro GenericRecord.
  • TMS
    Alert Publisher
    TMS
    Publishes alerts to the CDAP Transactional Messaging System (TMS) as json objects. The plugin allows you to specify the topic and namespace to publish to, as well as a rate limit for the maximum number of alerts to publish per second.
  • Avro Time Partitioned Dataset
    Sink
    Avro Time Partitioned Dataset
  • TPFSOrc
    Sink
  • TPFSParquet
    Sink
    TPFSParquet
  • Table
    Source
    Table
    Outputs the entire contents of a CDAP Table each batch interval. The Table contents will be refreshed at configurable intervals.
  • ToUTF8
    Action
  • TopN
    Analytics
    TopN
    Top-N returns the top "n" records from the input set, based on the criteria specified in the plugin configuration.
  • Trash
    Sink
    Trash
    Trash consumes all input records and discards them; no output is generated or stored anywhere.
  • Twitter
    Source
    Twitter
    Samples tweets in real-time through Spark streaming. Output records will have this schema:
  • UnionSplitter
    Transform
    UnionSplitter
    The union splitter is used to split data by a union schema, so that type-specific logic can be done downstream.
  • Validator
    Transform
    Validator
    Validates a record, writing to an error dataset if the record is invalid. Otherwise it passes the record on to the next stage.
  • ValueMapper
    Transform
    ValueMapper
    Value Mapper is a transform plugin that maps string values of a field in the input record to a mapping value using a mapping dataset.
  • VerticaBulkExportAction
    Action
    VerticaBulkExportAction
    Bulk exports data from a Vertica table into a file.
  • VerticaBulkImportAction
    Action
    VerticaBulkImportAction
    The Vertica Bulk Import Action plugin executes after a successful MapReduce or Spark job. It reads all the files in a given directory and bulk imports the contents of those files into a Vertica table.
  • Window
    Analytics
    Window
    The Window plugin is used to window a part of a streaming pipeline.
  • WindowsShareCopy
    Action
    WindowsShareCopy
    Copies a file or files on a Microsoft Windows share to an HDFS directory.
  • Wrangler
    Transform
    Wrangler
    This plugin applies data transformation directives on your data records. The directives are generated either through an interactive user interface or by manual entry into the plugin.
  • XMLMultiParser
    Transform
    XMLMultiParser
    The XML Multi Parser Transform uses XPath to extract fields from an XML document. It will generate records from the children of the element specified by the XPath. If there is some error parsing the document or building the record, the problematic input record will be dropped.
  • XMLParser
    Transform
    XMLParser
    The XML Parser Transform uses XPath to extract fields from a complex XML event. This plugin should generally be used in conjunction with the XML Reader Batch Source. The XML Reader will provide individual events to the XML Parser, which will be responsible for extracting fields from the events and mapping them to the output schema.
  • XMLReader
    Source
    XMLReader
    The XML Reader plugin is a source plugin that allows users to read XML files stored on HDFS.
  • XML to Json String
    Transform
    XML to Json String
    Accepts a field that contains a properly formatted XML string and outputs a properly formatted JSON string version of the data. This is meant to be used with the JavaScript transform for parsing complex XML documents into parts. Once the XML is a JSON string, you can convert it into a JavaScript object in the JavaScript transform, as sketched below:
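    A minimal sketch, assuming the JSON string produced by this plugin lands in a hypothetical field named 'body' and that the JavaScript transform's usual transform(input, emitter, context) entry point is used; JSON.parse is the standard call for turning the string into an object:

      function transform(input, emitter, context) {
        // 'body' is a hypothetical field holding the JSON string produced by this plugin
        var obj = JSON.parse(input.body);
        // use the parsed object as needed, then emit the (possibly modified) record
        input.parsedOk = (obj !== null);
        emitter.emit(input);
      }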