Plugins
CDAP is highly extensible and exposes plugins that allow users to extend its capabilities. This page lists all the plugins available in CDAP. Refer to the community page to learn about contributing your own plugin.
- ADLSBatchSink (Sink): Azure Data Lake Store Batch Sink writes data to an Azure Data Lake Store directory in Avro, ORC, or text format.
- ADLSDelete (Action)
- AddField (Transform): Adds a new field to each record. The field value can either be a new UUID, or it can be set directly through configuration. This transform is used when you want to add a unique ID field to each record, or when you want to tag each record with some constant value. For example, you may want to add the logical start time as a field to each record.
- Amazon S3 Client (Action): The Amazon S3 Client action is used to work with S3 buckets and objects before or after the execution of a pipeline.
- Argument Setter (Action)
- AvroDynamicPartitionedDataset (Sink)
- AzureBlobStore (Source)
- ADLS Batch Source (Source): Azure Data Lake Store Batch Source reads data from Azure Data Lake Store files and converts it into StructuredRecords.
- AzureDecompress (Action): Decompresses gz files from a container on the Azure Blob Storage service into another container.
- AzureDelete (Action): Deletes a container on the Azure Blob Storage service.
- AzureFaceExtractor (Transform)
- BigQuery Multi Table (Sink): This sink writes to multiple BigQuery tables. BigQuery is Google's serverless, highly scalable enterprise data warehouse. Data is first written to a temporary location on Google Cloud Storage, then loaded into BigQuery from there.
- BigQuery (Sink): This sink writes to a BigQuery table. BigQuery is Google's serverless, highly scalable enterprise data warehouse. Data is first written to a temporary location on Google Cloud Storage, then loaded into BigQuery from there.
- BigQuery (Source): This source reads the entire contents of a BigQuery table. BigQuery is Google's serverless, highly scalable enterprise data warehouse. Data from the BigQuery table is first exported to a temporary location on Google Cloud Storage, then read into the pipeline from there.
- Bigtable (Sink): This sink writes data to Google Cloud Bigtable. Cloud Bigtable is Google's NoSQL big data database service. It's the same database that powers many core Google services, including Search, Analytics, Maps, and Gmail.
- Bigtable (Source): This source reads data from Google Cloud Bigtable. Cloud Bigtable is Google's NoSQL big data database service. It's the same database that powers many core Google services, including Search, Analytics, Maps, and Gmail.
- CSVFormatter (Transform)
- CSVParser (Transform)
- Cassandra (Sink)
- Cassandra (Source)
- CloneRecord (Transform): Copies every input record a configured number of times to the output. This transform does not change any record fields or types; it is an identity transform.
- Compressor (Transform): Compresses configured fields. Multiple fields can be compressed using different compression algorithms. The plugin supports SNAPPY, ZIP, and GZIP compression.
- Conditional (Condition): A control flow plugin that allows conditional execution within pipelines. Conditions are specified as expressions, and the variables can include values specified as runtime arguments of the pipeline, tokens from plugins that run before the condition, and global information about the pipeline such as the stage, pipeline, logical start time, and plugin.
- Cube (Sink)
- Data Profiler (Analytics): Calculates statistics for each input field. For every field, a total count and null count will be calculated. For numeric fields, min, max, mean, stddev, zero count, positive count, and negative count will be calculated. For string fields, min length, max length, mean length, and empty count will be calculated. For boolean fields, true and false counts will be calculated. When calculating means, only non-null values are considered.
- Database (Action)
- Database (Sink): Writes records to a database table. Each record will be written to a row in the table.
- Database (Source): Reads from a database using a configurable SQL query. Outputs one record for each row returned by the query.
- DatabaseQuery (Action): Runs a database query at the end of the pipeline run. Can be configured to run only on success, only on failure, or always at the end of the run.
- Google Cloud Datastore (Sink): This sink writes data to Google Cloud Datastore. Datastore is a NoSQL document database built for automatic scaling and high performance.
- Datastore (Source): This source reads data from Google Cloud Datastore. Datastore is a NoSQL document database built for automatic scaling and high performance.
- DateTransform (Transform): Takes a date as either a Unix timestamp or a string and converts it to a formatted string. (Macro-enabled)
- Db2 (Action)
- IBM DB2 (Sink): Writes records to a DB2 table. Each record will be written to a row in the table.
- Db2 (Source): Reads from DB2 using a configurable SQL query. Outputs one record for each row returned by the query.
- Db2 (Action): Runs a DB2 query at the end of the pipeline run. Can be configured to run only on success, only on failure, or always at the end of the run.
- Field Decoder (Transform)
- Decompress (Action)
- Field Decompressor (Transform)
- Decryptor (Transform): Decrypts one or more fields in input records using a keystore that must be present on all nodes of the cluster.
- Deduplicate (Analytics)
- Distinct (Analytics): De-duplicates input records so that all output records are distinct. Can optionally take a list of fields, which will project out all other fields and perform a distinct on just those fields.
- DynHBase (Sink): Supports writing records with dynamic schemas to a local or remote HBase table. It also supports writing regular structured records to tables.
- CDAP Table with Dynamic Schema (Sink): Supports writing records with dynamic schemas to a CDAP Table dataset. It also supports writing regular structured records to tables.
- ADLS Batch Sink (Sink)
- DynamicMultiFileset (Sink): This plugin is normally used in conjunction with the MultiTableDatabase batch source to write records from multiple databases into multiple filesets in text format. Each fileset it writes to will contain a single 'ingesttime' partition, which will contain the logical start time of the pipeline run. The plugin expects that the filesets it needs to write to will be set as pipeline arguments, where the key is 'multisink.[fileset]' and the value is the fileset schema (an example of such an argument appears after this list). Normally, you rely on the MultiTableDatabase source to set those pipeline arguments, but they can also be set manually or by an Action plugin in your pipeline. The sink expects each record to contain a special split field that determines which fileset each record is written to. For example, suppose the split field is 'tablename'; a record whose 'tablename' field is set to 'activity' will be written to the 'activity' fileset.
- Elasticsearch (Sink): Takes the StructuredRecord from the input source, converts it to a JSON string, then indexes it in Elasticsearch using the index, type, and idField specified by the user. The Elasticsearch server should be running prior to creating the application.
- Elasticsearch (Source): Pulls documents from Elasticsearch according to the query specified by the user and converts each document to a StructuredRecord with the fields and schema specified by the user. The Elasticsearch server should be running prior to creating the application.
- Email (Action)
- Field Encoder (Transform)
- Encryptor (Transform): Encrypts one or more fields in input records using a Java keystore that must be present on all nodes of the cluster.
- ErrorCollector (Error Handler): Takes errors emitted from the previous stage and flattens them by adding the error message, code, and stage to the record and outputting the result.
- Excel (Source): Provides the ability to read data from one or more Excel files.
- FTP (Source): Batch source for an FTP or SFTP server. The prefix of the path ('ftp://...' or 'sftp://...') determines the server type, either FTP or SFTP.
- FTPCopy (Action)
- FTPPut (Action)
- Fail Pipeline (Sink): Fails the running pipeline as soon as the first record flows into this sink.
- FastFilter (Transform)
- File (Sink)
- File (Source): This source is used whenever you need to read from a distributed file system. For example, you may want to read in log files from S3 every hour and then store the logs in a TimePartitionedFileSet.
- File (Source): File streaming source. Watches a directory and streams the contents of any new files added to the directory. Files must be atomically moved or renamed.
- FileAppender (Sink): Writes to a CDAP FileSet in text format. HDFS append must be enabled for this to work. One line is written for each record sent to the sink, with all record fields joined using a configurable separator. Each time a batch is written, the sink examines all existing files in the output directory; if any files are smaller than the size threshold or more recent than the age threshold, new data is appended to those files instead of being written to new files.
- FileContents (Action): Checks whether a file is empty or whether the contents of a file match a given pattern.
- FileDelete (Action)
- FileMove (Action)
- Google Cloud Storage (Sink): This plugin writes records to one or more files in a directory on Google Cloud Storage. Files can be written in various formats such as CSV, Avro, Parquet, and JSON.
- GCSBucketCreate (Action): This plugin creates objects in a Google Cloud Storage bucket. Cloud Storage allows world-wide storage and retrieval of any amount of data at any time.
- GCSBucketDelete (Action): This plugin deletes objects in a Google Cloud Storage bucket. Cloud Storage allows world-wide storage and retrieval of any amount of data at any time.
- GCSCopy (Action): This plugin copies objects from one Google Cloud Storage bucket to another. A single object can be copied, or a directory of objects can be copied.
- GCSFile (Source)
- GCSMove (Action): This plugin moves objects from one Google Cloud Storage bucket to another. A single object can be moved, or a directory of objects can be moved.
- GCSMultiFiles (Sink): This plugin is normally used in conjunction with the MultiTableDatabase batch source to write records from multiple databases into multiple directories in various formats. The plugin expects that the directories it needs to write to will be set as pipeline arguments, where the key is 'multisink.[directory]' and the value is the schema of the data.
- GooglePublisher (Sink): This sink writes to a Google Cloud Pub/Sub topic. Cloud Pub/Sub brings the scalability, flexibility, and reliability of enterprise message-oriented middleware to the cloud. By providing many-to-many, asynchronous messaging that decouples senders and receivers, it allows for secure and highly available communication between independently written applications.
- GoogleSubscriber (Source): This source reads from a Google Cloud Pub/Sub subscription in real time. Cloud Pub/Sub brings the scalability, flexibility, and reliability of enterprise message-oriented middleware to the cloud. By providing many-to-many, asynchronous messaging that decouples senders and receivers, it allows for secure and highly available communication between independently written applications.
- Group By (Analytics)
- HBase (Sink): Writes records to a column family in an HBase table, with one record field mapping to the row key and all other record fields mapping to table column qualifiers. This sink differs from the Table sink in that it does not use CDAP datasets, but writes to HBase directly.
- HBase (Source): Batch source that reads from a column family in an HBase table. This source differs from the Table source in that it does not use a CDAP dataset, but reads directly from HBase.
- HTTP (Sink)
- HTTPCallback (Action)
- HTTPPoller (Source)
- HTTPToHDFS (Action): Fetches data from an external HTTP endpoint and creates a file in HDFS.
- MD5/SHA Field Dataset (Transform)
- Hive Bulk Export (Action)
- Hive Bulk Import (Action)
- JSONFormatter (Transform)
- JSONParser (Transform): Parses an input JSON event into a record. The input JSON event can be either a map of string fields to values or a complex nested JSON structure. The plugin allows you to express JSON paths for extracting fields from complex nested input JSON.
- JavaScript (Transform): Executes user-provided JavaScript that transforms one record into zero or more records. Input records are converted into JSON objects which can be accessed directly in JavaScript. The transform expects to receive a JSON object as input, which it can process to emit zero or more records, or emit errors, using the provided emitter object (a minimal sketch of such a script appears after this list).
- Joiner (Analytics)
- KVTable (Sink): Writes records to a KeyValueTable, using configurable fields from input records as the key and value.
- KVTable (Source): Reads the entire contents of a KeyValueTable, outputting records with a 'key' field and a 'value' field. Both fields are of type bytes.
- Kafka (Alert Publisher)
- Kafka (Sink): Kafka sink that allows you to write events to Kafka in CSV or JSON format. The plugin can push data to a Kafka topic and can be configured to partition events being written to Kafka based on a configurable key. The sink can also be configured to operate in sync or async mode and to apply different compression types to events. This plugin uses the Kafka 0.10.2 Java APIs.
- Kafka (Source): Kafka batch source. Emits records from Kafka based on the schema and format you specify; if no schema or format is specified, the message payload is emitted. The source remembers the offset it read in the last run and continues from that offset in the next run. The Kafka batch source supports providing additional Kafka properties for the Kafka consumer, reading from Kerberos-enabled Kafka, and limiting the number of records read. This plugin uses the Kafka 0.10.2 Java APIs.
- Kafka (Source): Kafka streaming source. Emits a record with the schema specified by the user. If no schema is specified, it emits a record with two fields: 'key' (nullable string) and 'message' (bytes). This plugin uses the Kafka 0.10.2 Java APIs.
- KafkaAlerts (Alert Publisher): Publishes alerts to Kafka as JSON objects. The plugin internally uses the Kafka producer APIs to publish alerts, and it allows you to specify the Kafka topic to publish to as well as additional Kafka producer properties. This plugin uses the Kafka 0.10.2 Java APIs.
- KinesisSink (Sink)
- KinesisSource (Source)
- Kudu (Sink): CDAP plugin for ingesting data into Apache Kudu. The plugin can be configured for both batch and real-time pipelines.
- Kudu (Source)
- LoadToSnowflake (Action)
- LogParser (Transform): Parses logs from any input source for relevant information such as URI, IP, browser, device, HTTP status code, and timestamp.
- MLPredictor (Analytics): Uses a model trained by the ModelTrainer plugin to add a prediction field to incoming records. The same features used to train the model must be present in each input record, but input records can also contain additional non-feature fields. If the trained model uses categorical features and a record being predicted contains new categories, that record will be dropped. For example, suppose the categorical feature 'city' was used to train a model that predicts housing prices; if an incoming record has 'New York' as the city but 'New York' was not in the training set, that record will be dropped.
- MultiFieldAdder (Transform): Allows you to add one or more fields to the output. Each field specified has a name and a value. The value is currently set to be of type string.
- MultiTableDatabase (Source): Reads from multiple tables within a database using JDBC. Often used in conjunction with the DynamicMultiFileset sink to perform dumps from multiple tables to HDFS files in a single pipeline. The source outputs a record for each row in the tables it reads, with each record containing an additional field that holds the name of the table the record came from. In addition, for each table that will be read, this plugin sets pipeline arguments where the key is 'multisink.[tablename]' and the value is the schema of the table, so that it works with the DynamicMultiFileset sink.
- MySQL Execute (Action)
- MySQL (Sink): Writes records to a MySQL table. Each record will be written to a row in the table.
- MySQL (Source): Reads from a MySQL instance using a configurable SQL query. Outputs one record for each row returned by the query.
- Mysql (Action): Runs a MySQL query at the end of the pipeline run. Can be configured to run only on success, only on failure, or always at the end of the run.
- NGramTransform (Analytics): Transforms the input features into n-grams, where an n-gram is a sequence of n tokens (typically words) for some integer n.
- Netezza Execute (Action)
- Netezza (Sink): Writes records to a Netezza table. Each record will be written to a row in the table.
- Netezza (Source): Reads from Netezza using a configurable SQL query. Outputs one record for each row returned by the query.
- Netezza (Action): Runs a Netezza query at the end of the pipeline run. Can be configured to run only on success, only on failure, or always at the end of the run.
- Normalize (Transform): A transform plugin that breaks one source row into multiple target rows. Attributes stored in the columns of a table or a file may need to be broken into multiple records, for example one record per column attribute. In general, the plugin allows the conversion of columns to rows.
- NullFieldSplitter (Transform)
- ORCDynamicPartitionedDataset (Sink)
- Oracle (Action)
- Oracle (Sink): Writes records to an Oracle table. Each record will be written to a row in the table.
- Oracle (Source): Reads from an Oracle table using a configurable SQL query. Outputs one record for each row returned by the query.
- Oracle (Action): Runs an Oracle query at the end of the pipeline run. Can be configured to run only on success, only on failure, or always at the end of the run.
- OracleExport (Action): A Hydrator action plugin to efficiently export data from Oracle to HDFS or the local file system. The plugin uses Oracle's command-line tools to export data, and the exported data can then be used in Hydrator pipelines.
- PDFExtractor (Transform)
- ParquetDynamicPartitionedDataset (Sink)
- Postgres (Action)
- Postgres (Sink): Writes records to a PostgreSQL table. Each record will be written to a row in the table.
- Postgres (Source): Reads from PostgreSQL using a configurable SQL query. Outputs one record for each row returned by the query.
- Postgres (Action): Runs a PostgreSQL query at the end of the pipeline run. Can be configured to run only on success, only on failure, or always at the end of the run.
- Projection (Transform): Lets you drop, keep, rename, and cast fields to a different type. Fields are first dropped based on the drop or keep field, then cast, then renamed.
- PySparkProgram (Action)
- Python (Transform): Executes user-provided Python code that transforms one record into zero or more records. Each input record is converted into a dictionary which can be accessed directly in Python. The transform expects to receive a dictionary as input, which it can process to emit zero or more transformed dictionaries, or emit an error dictionary, using the provided emitter object.
- RecordSplitter (Transform): Given a field and a delimiter, emits one record for each split of the field.
- RedshiftToS3 (Action)
- Repartitioner (Analytics)
- RowDenormalizer (Analytics): Converts raw data into denormalized data based on a key column. The user can specify the list of fields that should be used in the denormalized record, with an option to use an alias for the output field name. For example, 'ADDRESS' in the input can be mapped to 'addr' in the output schema.
- Run (Transform): Runs an executable binary that is installed and available on the local file system of the Hadoop nodes. The plugin reads a structured record as input and returns an output record to be further processed downstream in the pipeline.
- S3 (Sink): This sink is used whenever you need to write to Amazon S3 in various formats. For example, you might want to create daily snapshots of a database by reading the entire contents of a table and writing it to this sink, so that other programs can then analyze the contents of the specified file.
- Amazon S3 (Source): This source is used whenever you need to read from Amazon S3. For example, you may want to read in log files from S3 every hour and then store the logs in a TimePartitionedFileSet.
- S3ToRedshift (Action): Loads data from an AWS S3 bucket into an AWS Redshift table.
- SFTPCopy (Action)
- SFTPDelete (Action)
- SFTPPut (Action)
- Remote Program Executor (Action): Establishes an SSH connection with a remote machine to execute a command on that machine.
- Salesforce (Sink): A batch sink that inserts sObjects into Salesforce. Examples of sObjects are opportunities, contacts, accounts, leads, and any custom objects.
- Salesforce (Source): This source reads sObjects from Salesforce. Examples of sObjects are opportunities, contacts, accounts, leads, and any custom objects.
- Salesforce (Source): This source tracks updates to Salesforce sObjects. Examples of sObjects are opportunities, contacts, accounts, leads, and any custom objects.
- Salesforce Marketing (Sink): This sink inserts records into a Salesforce Marketing Cloud Data Extension. The sink requires Server-to-Server integration with the Salesforce Marketing Cloud API. See https://developer.salesforce.com/docs/atlas.en-us.mc-app-development.meta/mc-app-development/api-integration.htm for more information about creating an API integration.
- SalesforceMultiObjects (Source): This source reads multiple sObjects from Salesforce. The data to read is specified using a list of sObjects and incremental or range date filters. The source outputs a record for each row in the sObjects it reads, with each record containing an additional field that holds the name of the sObject the record came from. In addition, for each sObject that will be read, this plugin sets pipeline arguments where the key is 'multisink.[SObjectName]' and the value is the schema of the sObject.
- Sampling (Analytics): Samples a large dataset flowing through this plugin to pull out random records. Supports two types of sampling: systematic sampling and reservoir sampling.
- ScalaSparkCompute (Analytics): Executes user-provided Spark code in Scala that transforms an RDD into another RDD, with full access to all Spark features.
- ScalaSparkProgram (Action)
- Spark (Sink): Executes user-provided Spark code in Scala that operates on an input RDD or DataFrame, with full access to all Spark features.
- Avro Snapshot Dataset (Sink): A batch sink for a PartitionedFileSet that writes snapshots of data as a new partition. Data is written in Avro format. A corresponding SnapshotAvro source can be used to read only the most recently written snapshot.
- Avro Snapshot Dataset (Source): A batch source that reads from a corresponding SnapshotAvro sink. The source will only read the most recent snapshot written to the sink.
- Parquet Snapshot Dataset (Sink): A batch sink for a PartitionedFileSet that writes snapshots of data as a new partition. Data is written in Parquet format. A corresponding SnapshotParquet source can be used to read only the most recently written snapshot.
- Parquet Snapshot Dataset (Source): A batch source that reads from a corresponding SnapshotParquet sink. The source will only read the most recent snapshot written to the sink.
- SnapshotText (Sink): A batch sink for a PartitionedFileSet that writes snapshots of data as a new partition. Data is written in text format.
- Google Cloud Spanner (Sink): This sink writes to a Google Cloud Spanner table. Cloud Spanner is a fully managed, mission-critical relational database service that offers transactional consistency at global scale, schemas, SQL (ANSI 2011 with extensions), and automatic, synchronous replication for high availability.
- Spanner (Source): This source reads from a Google Cloud Spanner table. Cloud Spanner is a fully managed, mission-critical relational database service that offers transactional consistency at global scale, schemas, SQL (ANSI 2011 with extensions), and automatic, synchronous replication for high availability.
- Google Cloud Speech-to-Text (Transform): Converts audio files to text by using Google Cloud Speech-to-Text.
- SQL Server Execute (Action)
- SQL Server (Sink): Writes records to a SQL Server table. Each record will be written to a row in the table.
- SQL Server (Source): Reads from SQL Server using a configurable SQL query. Outputs one record for each row returned by the query.
- SqlServer (Action): Runs a SQL Server query at the end of the pipeline run. Can be configured to run only on success, only on failure, or always at the end of the run.
- StructuredRecordToGenericRecord (Transform): Transforms a StructuredRecord into an Avro GenericRecord.
- Transactional Alert Publisher (Alert Publisher): Publishes alerts to the CDAP Transactional Messaging System (TMS) as JSON objects. The plugin allows you to specify the topic and namespace to publish to, as well as a rate limit for the maximum number of alerts to publish per second.
- Avro Time Partitioned Dataset (Sink)
- TPFSAvro (Source)
- ORC Time Partitioned Dataset (Sink)
- TPFSParquet (Sink)
- Parquet Time Partitioned Dataset (Source): Reads from a TimePartitionedFileSet whose data is in Parquet format.
- CDAP Table Dataset (Sink): Writes records to a CDAP Table, with one record field mapping to the Table row key and all other record fields mapping to Table columns.
- Table (Source): Reads the entire contents of a CDAP Table. Outputs one record for each row in the Table. The Table must conform to a given schema.
- TopN (Analytics): Returns the top 'n' records from the input set, based on the criteria specified in the plugin configuration.
- Trash (Sink): Consumes all records on the input and discards them; no output is generated or stored anywhere.
- Twitter (Source): Samples tweets in real time through Spark Streaming. Output records will have this schema:
- UnionSplitter (Transform): The union splitter is used to split data by a union schema, so that type-specific logic can be applied downstream.
- Validator (Transform): Validates a record, writing to an error dataset if the record is invalid; otherwise it passes the record on to the next stage.
- ValueMapper (Transform): A transform plugin that maps string values of a field in the input record to a mapping value using a mapping dataset.
- VerticaBulkExportAction (Action)
- VerticaBulkImportAction (Action): Executed after a successful MapReduce or Spark job. It reads all the files in a given directory and bulk imports the contents of those files into a Vertica table.
- Window (Analytics)
- WindowsShareCopy (Action): Copies a file or files on a Microsoft Windows share to an HDFS directory.
- Wrangler (Transform): Applies data transformation directives to your data records. The directives are generated either through an interactive user interface or by manual entry into the plugin.
- XMLMultiParser (Transform): Uses XPath to extract fields from an XML document. It generates records from the children of the element specified by the XPath. If there is an error parsing the document or building the record, the problematic input record is dropped.
- XMLParser (Transform): Uses XPath to extract fields from a complex XML event. This plugin should generally be used in conjunction with the XML Reader batch source. The XML Reader provides individual events to the XML Parser, which is responsible for extracting fields from the events and mapping them to the output schema.
- XMLReader (Source): A source plugin that allows users to read XML files stored on HDFS.
- XML to Json String (Transform): Accepts a field that contains a properly formatted XML string and outputs a properly formatted JSON string version of the data. This is meant to be used with the JavaScript transform for parsing complex XML documents into parts. Once the XML is a JSON string, you can convert it into a JavaScript object, as shown in the example after this list.
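For the XML to Json String entry above, the conversion inside a JavaScript transform is presumably just a standard JSON.parse call; the field name 'body' below is only an illustration, not part of the plugin's documented configuration:

```javascript
// Inside a JavaScript transform: turn the JSON string produced by the
// XML to Json String plugin back into a JavaScript object.
// 'body' is a hypothetical field name used for illustration.
var parsed = JSON.parse(input.body);
```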
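The JavaScript transform entry above describes the contract in prose: the script receives each record as a JSON object and uses an emitter to pass records (or errors) downstream. The following is a minimal sketch of such a script; the transform(input, emitter, context) entry point follows the usual CDAP convention but should be treated as an assumption here, and the 'price' and 'price_with_tax' fields are purely illustrative:

```javascript
/**
 * Minimal sketch of a user-provided JavaScript transform script.
 * 'input' is the incoming record as a JSON object; output records and
 * errors are passed downstream through the provided 'emitter'.
 */
function transform(input, emitter, context) {
  if (input.price == null) {
    // Invalid records can be emitted as errors instead of silently dropped.
    emitter.emitError({
      'errorCode': 10,
      'errorMsg': 'price is missing',
      'invalidRecord': input
    });
    return;
  }
  input.price_with_tax = input.price * 1.08; // add a derived field
  emitter.emit(input);                       // emit zero or more output records
}
```

The Python transform described earlier follows an analogous record-in, emitter-out contract, with records exposed as dictionaries instead of JSON objects.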
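The 'multisink.[name]' pipeline arguments mentioned for the DynamicMultiFileset, GCSMultiFiles, and SalesforceMultiObjects entries are ordinary key/value runtime arguments whose value is the schema of the corresponding output. A hypothetical example, written as a JavaScript object purely for illustration (the 'activity' table, its fields, and the Avro-style schema layout are assumptions, not taken from the plugin documentation):

```javascript
// Hypothetical runtime arguments as the MultiTableDatabase source might set them
// for a table named 'activity'; the DynamicMultiFileset sink looks them up by key.
var runtimeArguments = {
  'multisink.activity': JSON.stringify({
    type: 'record',
    name: 'activity',
    fields: [
      { name: 'id',   type: 'long' },
      { name: 'name', type: 'string' }
    ]
  })
};
```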