Apache NiFi is open source software that automates and manages the flow of data between different systems. It provides a web-based UI for creating, monitoring, and controlling data flows. NiFi processors are highly configurable and can also be used to transform data at runtime.
NiFi helps ingest data from different source systems into a data lake, and from the data lake into other target systems. The data lake can be Amazon S3, a Hadoop cluster, or any other storage.
Some of the key benefits of Apache NiFi:
- Guaranteed delivery of data: NiFi offers guaranteed delivery of data with the help of its content repository and write-ahead log.
- Visual data flows: NiFi lets you build data flows visually, which makes them easy to understand and develop.
- Integration with other data processing tools: NiFi can integrate with other data processing tools such as Spark and Kafka.
- Back pressure: Queues are the link between two processors and buffer data until the downstream processor is ready for it. If the downstream processor consumes data more slowly than it arrives in the queue, the queue can apply back pressure to the upstream processor so that it stops sending new data until the queue drains.
- Data flow can be prioritized: Data in a queue can be prioritized before the downstream processor fetches it. The priority can be oldest first, newest first, largest first, or some other custom rule.
- Latency vs. throughput trade-off: In some scenarios you want the lowest latency, that is, data is processed as soon as it arrives. In others you want higher throughput and are willing to sacrifice some latency, for example by allowing a delay of 1 or 2 seconds. These latency vs. throughput decisions are made while configuring processors.
- Data provenance: NiFi lets us trace data and its movement through the different processors, which helps in troubleshooting and optimizing a data flow.
- Independent control of components: The different data flow components can be started and stopped separately.
Apart from these features, NiFi also provides content encryption. NiFi offers secure exchange of data through encrypted protocols such as 2-way SSL, shared keys, or other mechanisms.
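The back pressure behaviour described in the list above can be illustrated with a toy simulation. This is only a minimal sketch and not how NiFi is implemented: the function name `simulate_backpressure`, the tick-based model, and all numbers are assumptions made up for illustration. On each tick the upstream side adds items unless the queue has reached its threshold, and the downstream side removes a few items.

```shell
#!/usr/bin/env bash
# Toy model of back pressure on a queue between two processors.
# Arguments (all hypothetical): threshold, number of ticks,
# items produced per tick, items consumed per tick.
# Prints "<items left in queue> <ticks the upstream side was held back>".
simulate_backpressure() {
  local threshold="$1" ticks="$2" produce="$3" consume="$4"
  local queued=0 blocked=0 t=0
  while [ "$t" -lt "$ticks" ]; do
    if [ "$queued" -ge "$threshold" ]; then
      # Queue is at or over the threshold: upstream does not run this tick
      blocked=$((blocked + 1))
    else
      queued=$((queued + produce))
    fi
    # Downstream removes up to $consume items per tick
    if [ "$queued" -ge "$consume" ]; then
      queued=$((queued - consume))
    else
      queued=0
    fi
    t=$((t + 1))
  done
  echo "$queued $blocked"
}
```

With a threshold of 5, `simulate_backpressure 5 6 3 1` prints `6 2`: the upstream side is held back on two of the six ticks. With a threshold that is never reached, `simulate_backpressure 1000 6 3 1` prints `12 0` and the queue simply keeps growing.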
NiFi Setup and Installation
NiFi is typically set up on an edge node. However, this is not mandatory: it can be configured on any node. You just need to provide the location of the Hadoop configuration files in order to work with HDFS and other Hadoop-based components. For high availability it can also be configured on multiple nodes.
To work with the HDFS-related processors in NiFi we need a running Hadoop cluster. In the NiFi processor configuration we need to pass the paths of the configuration files from the Hadoop installation: typically core-site.xml and hdfs-site.xml for HDFS processors, and hive-site.xml for Hive processors.
To integrate NiFi with Spark or Kafka, we can either set up a Hadoop cluster first and then install NiFi, or install NiFi on an existing Hadoop cluster and integrate it with the existing tools.
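HDFS processors such as PutHDFS take these paths as a comma-separated list in their "Hadoop Configuration Resources" property. A small helper along the following lines could build that list and warn about missing files. It is a sketch under assumptions: the function name `hadoop_conf_resources` is hypothetical, and the directory `/etc/hadoop/conf` in the example call is only a common location that may differ on your cluster.

```shell
#!/usr/bin/env bash
# Build a comma-separated list of Hadoop config file paths.
# $1 is the config directory, the remaining arguments are file names.
# Files that do not exist are skipped with a warning on stderr.
hadoop_conf_resources() {
  local conf_dir="$1" result="" name
  shift
  for name in "$@"; do
    if [ -f "$conf_dir/$name" ]; then
      # Add a comma only between entries, never at the start
      result="${result:+$result,}$conf_dir/$name"
    else
      echo "warning: $conf_dir/$name not found" >&2
    fi
  done
  echo "$result"
}

# Example call (the directory is an assumption, adjust it for your cluster):
# hadoop_conf_resources /etc/hadoop/conf core-site.xml hdfs-site.xml
```

The printed value can then be pasted into the processor property in the NiFi UI.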
Installation of NiFi on a GCP Dataproc or Amazon EMR cluster
GCP Dataproc and Amazon EMR clusters come with Hadoop, Spark, and other tools preinstalled. We can leverage such a cluster and install NiFi on it.
First, create and launch a Dataproc cluster with as many data nodes as your data processing requires.
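On GCP, the cluster can also be created from the command line with the gcloud CLI instead of the console. The following is a minimal sketch, assuming gcloud is installed and authenticated. The function name `create_dataproc_cluster`, the `DRY_RUN` switch, and the example cluster name and region are hypothetical.

```shell
#!/usr/bin/env bash
# Create a Dataproc cluster with a given number of worker (data) nodes.
# Set DRY_RUN=1 to print the gcloud command instead of running it.
create_dataproc_cluster() {
  local name="$1" region="$2" workers="$3"
  ${DRY_RUN:+echo} gcloud dataproc clusters create "$name" \
    --region "$region" \
    --num-workers "$workers"
}

# Example (hypothetical name and region):
# DRY_RUN=1 create_dataproc_cluster nifi-demo us-central1 2
```

Running it with `DRY_RUN=1` first is a cheap way to review the exact command before any resources are created.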
Steps to install NiFi on a GCP Dataproc or Amazon EMR cluster:
1. Log in to the master node through SSH and download the NiFi tar.gz file using the wget command from the Apache NiFi mirror page.
Command to download the tar file:
2. Extract the archive using the tar xzf command.
3. Update the bash profile and add the NiFi path using the following commands:
a. vi ~/.bash_profile
b. Add the following lines as shown in the screenshot
c. source ~/.bash_profile
d. Verify that the new NiFi path is set by running the “echo $PATH” command
4. Run the command “nifi.sh start”.
5. Check whether NiFi is running with the command “nifi.sh status”.
6. Once you start NiFi, a logs folder is created containing the log file.
You can check the log file here:
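The installation steps above can be sketched as a set of small shell functions. This is a hedged outline rather than an official install script: the function names are hypothetical, the download URL must be copied from the Apache NiFi download page, and the `NIFI_HOME` variable written to the profile is an assumed convention for adding NiFi's bin directory to the PATH.

```shell
#!/usr/bin/env bash
# Step 1: download the release archive into a directory.
# The URL is not hard-coded; pass the link from the NiFi download page.
download_nifi() {
  local url="$1" dest_dir="$2"
  wget -P "$dest_dir" "$url"
}

# Step 2: extract the archive (x = extract, z = gunzip, f = archive file).
extract_nifi() {
  local archive="$1" dest_dir="$2"
  tar xzf "$archive" -C "$dest_dir"
}

# Step 3: add NiFi's bin directory to PATH through a profile file.
# Does nothing if the profile already defines NIFI_HOME.
add_nifi_to_path() {
  local profile_file="$1" nifi_home="$2"
  if ! grep -q 'NIFI_HOME=' "$profile_file" 2>/dev/null; then
    {
      echo "export NIFI_HOME=\"$nifi_home\""
      echo 'export PATH="$PATH:$NIFI_HOME/bin"'
    } >> "$profile_file"
  fi
}

# Steps 4 and 5: start NiFi, then ask for its status.
start_and_check_nifi() {
  local nifi_sh="${1:-nifi.sh}"
  "$nifi_sh" start || return 1
  "$nifi_sh" status
}

# Step 6: print the last lines of the most recently modified log file.
tail_latest_log() {
  local logs_dir="$1" lines="${2:-50}" latest
  latest=$(ls -t "$logs_dir"/*.log 2>/dev/null | head -n 1)
  [ -n "$latest" ] || { echo "no log files in $logs_dir" >&2; return 1; }
  tail -n "$lines" "$latest"
}
```

A typical session would call `download_nifi` and `extract_nifi`, then `add_nifi_to_path ~/.bash_profile <extracted NiFi directory>` followed by `source ~/.bash_profile`, and finally `start_and_check_nifi` and `tail_latest_log <NiFi directory>/logs`.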
NiFi provides a web UI that runs on port 8080 (newer NiFi releases default to HTTPS on port 8443 instead, so check the web port settings in conf/nifi.properties). To access the web UI from outside the cluster or from your local machine, open that port in the firewall rules of the GCP or AWS instance where NiFi is installed.
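Opening the port can be done in the cloud console or from the command line. The sketch below wraps the gcloud and AWS CLI calls for this. It assumes both CLIs are installed and authenticated. The function names, the `DRY_RUN` switch, and every value in the example calls (rule name, network tag, security group ID, source address) are hypothetical placeholders. Restricting the source range to your own IP address is safer than opening the port to everyone.

```shell
#!/usr/bin/env bash
# GCP: create an ingress firewall rule for one TCP port.
# Set DRY_RUN=1 to print the command instead of running it.
open_port_gcp() {
  local rule="$1" port="$2" source_cidr="$3" tag="$4"
  ${DRY_RUN:+echo} gcloud compute firewall-rules create "$rule" \
    --direction INGRESS \
    --allow "tcp:$port" \
    --source-ranges "$source_cidr" \
    --target-tags "$tag"
}

# AWS: allow inbound traffic on one TCP port in a security group.
open_port_aws() {
  local group_id="$1" port="$2" source_cidr="$3"
  ${DRY_RUN:+echo} aws ec2 authorize-security-group-ingress \
    --group-id "$group_id" \
    --protocol tcp \
    --port "$port" \
    --cidr "$source_cidr"
}

# Examples (all values are placeholders):
# DRY_RUN=1 open_port_gcp allow-nifi-ui 8080 203.0.113.7/32 nifi
# DRY_RUN=1 open_port_aws sg-0123456789abcdef0 8080 203.0.113.7/32
```

On GCP the rule only applies to instances that carry the given network tag, so the master node needs that tag as well.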