Description
Talend's data integration platform extends its capabilities to Big Data technologies such as Hadoop (HDFS, HBase, HCatalog, Hive and Pig) and the NoSQL databases Cassandra and MongoDB. This course will give you the foundations needed to make proper use of the Talend components designed to communicate with Big Data systems.
Who is this training for?
Data managers, architects, business intelligence consultants.
Prerequisites
Experience using the Talend Open Studio for Data Integration tool, or the skills acquired during the "Talend Open Studio, implementing data integration" training.
Training objectives
Training program
- Presentation of Talend Open Studio for Big Data
- Big Data challenges: the 3V model, use cases.
- The Hadoop ecosystem (HDFS, MapReduce, HBase, Hive, Pig...)
- Unstructured data and NoSQL databases.
- TOS for Big Data versus TOS for Data Integration.
- Practical work: Installation/configuration of TOS for Big Data and a Hadoop cluster (Cloudera or Hortonworks), verification of proper operation.
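As a point of reference, the "verification of proper operation" step can also be reproduced from the command line. The sketch below simply lists the HDFS root and round-trips a small test file; the paths are placeholders and it assumes the Hadoop CLI is available on an edge node of the Cloudera/Hortonworks cluster.

```python
# Quick HDFS sanity check, assuming the `hdfs` CLI is available
# (e.g. on an edge node of the Cloudera/Hortonworks sandbox).
import subprocess

def run(*cmd):
    print("$", " ".join(cmd))
    subprocess.run(cmd, check=True)

# List the root of the distributed filesystem.
run("hdfs", "dfs", "-ls", "/")

# Round-trip a small test file to confirm read/write access.
run("hdfs", "dfs", "-mkdir", "-p", "/tmp/tos_check")
run("hdfs", "dfs", "-put", "-f", "/etc/hosts", "/tmp/tos_check/hosts")
run("hdfs", "dfs", "-cat", "/tmp/tos_check/hosts")
```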
- Data integration in a cluster and NoSQL databases
- Definition of Hadoop cluster connection metadata.
- Connection to a MongoDB, Neo4j, Cassandra or HBase database and data export.
- Simple data integration with a Hadoop cluster.
- Capture tweets (via extension components) and import them directly into HDFS.
- Practical work: Read tweets and store them as files in HDFS, analyze the frequency of the themes covered and store the result in HBase (see the sketch below).
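Outside the Studio's graphical components (tHDFSOutput, tHBaseOutput, etc.), the same flow can be sketched in plain Python to make the idea concrete. The WebHDFS URL, HBase Thrift host, table and column family names below are hypothetical, and the hdfs and happybase client libraries stand in for the Talend components.

```python
# Hypothetical sketch of the practical work: push raw tweets to HDFS,
# count themes (hashtags), and store the counts in HBase.
# Assumes a WebHDFS endpoint and an HBase Thrift server are reachable.
import json
from collections import Counter

import happybase                      # HBase client over Thrift
from hdfs import InsecureClient       # WebHDFS client

tweets = [
    {"text": "Loving #bigdata and #hadoop"},
    {"text": "#bigdata pipelines with Talend"},
]

# 1) Store the raw tweets as a file in HDFS.
hdfs_client = InsecureClient("http://namenode:9870", user="etl")
hdfs_client.write(
    "/data/tweets/batch_001.json",
    data="\n".join(json.dumps(t) for t in tweets),
    encoding="utf-8",
    overwrite=True,
)

# 2) Count the themes (here, simply the hashtags).
themes = Counter(
    word.lstrip("#").lower()
    for t in tweets
    for word in t["text"].split()
    if word.startswith("#")
)

# 3) Persist the counts in an HBase table (one row per theme).
hbase = happybase.Connection("hbase-thrift-host")
table = hbase.table("tweet_themes")
for theme, count in themes.items():
    table.put(theme.encode(), {b"stats:count": str(count).encode()})
hbase.close()
```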
- Import/Export with Sqoop
- Use Sqoop to import, export and update data between RDBMS systems and HDFS.
- Partial and incremental table imports/exports.
- Import/export a SQL database to and from HDFS.
- Storage formats in Big Data (Avro, Parquet, ORC, etc.).
- Practical work: Migrate relational tables to HDFS and vice versa (see the sketch below).
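Sqoop itself is driven from the command line; the hedged sketch below shows what a full import, an incremental import and an export might look like when wrapped in a Python script. The JDBC URL, credentials, table names and directories are placeholders.

```python
# Illustrative Sqoop calls (placeholder connection details), wrapped in
# Python for scripting. Each call shells out to the regular `sqoop` CLI.
import subprocess

JDBC_URL = "jdbc:mysql://dbhost:3306/sales"   # hypothetical source RDBMS

def sqoop(*args):
    subprocess.run(["sqoop", *args], check=True)

# Full import of one relational table into HDFS, stored as Parquet.
sqoop(
    "import",
    "--connect", JDBC_URL,
    "--username", "etl", "--password-file", "/user/etl/.db_password",
    "--table", "orders",
    "--target-dir", "/data/sales/orders",
    "--as-parquetfile",
)

# Incremental import: only rows whose key exceeds the last value seen.
sqoop(
    "import",
    "--connect", JDBC_URL,
    "--username", "etl", "--password-file", "/user/etl/.db_password",
    "--table", "orders",
    "--target-dir", "/data/sales/orders",
    "--incremental", "append",
    "--check-column", "order_id",
    "--last-value", "150000",
)

# Export the (possibly transformed) HDFS data back to a relational table.
sqoop(
    "export",
    "--connect", JDBC_URL,
    "--username", "etl", "--password-file", "/user/etl/.db_password",
    "--table", "orders_agg",
    "--export-dir", "/data/sales/orders_agg",
)
```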
- Perform manipulations on the data
- Presentation of the Pig framework and its Pig Latin language.
- Talend's main Pig components, Pig flow design.
- Development of UDF routines.
- Practical work: Identify website usage trends by analyzing its logs (see the sketch below).
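To give an idea of what the Pig components generate under the hood, here is a hedged sketch of an equivalent Pig Latin script, launched from Python with the standard `pig` launcher. The log layout, paths and field names are assumptions.

```python
# Hypothetical Pig Latin equivalent of the log-analysis exercise: count
# hits per URL in a web server log and keep the most visited pages.
# The log layout (space-separated ip / timestamp / url) is an assumption.
import subprocess

PIG_SCRIPT = """
logs   = LOAD '/data/logs/access.log'
         USING PigStorage(' ')
         AS (ip:chararray, ts:chararray, url:chararray);
by_url = GROUP logs BY url;
hits   = FOREACH by_url GENERATE group AS url, COUNT(logs) AS nb_hits;
ranked = ORDER hits BY nb_hits DESC;
STORE ranked INTO '/data/logs/top_pages';
"""

with open("site_trends.pig", "w") as f:
    f.write(PIG_SCRIPT)

# Submit to the cluster ("pig -x local site_trends.pig" would run it locally).
subprocess.run(["pig", "-f", "site_trends.pig"], check=True)
```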
- Architecture and best practices in a Hadoop cluster
- Design efficient storage in Hadoop.
- Data lake versus data warehouse: do you have to choose?
- Hadoop and the disaster recovery plan (PRA) in the event of a major incident.
- Automate your workflows.
- Practical work: Create your data lake and automate its operation.
- Analyze and store your data with Hive
- Hive connection and schema metadata.
- The HiveQL language.
- Hive flow design, query execution.
- Implement Hive's ELT components.
- Practical work: Store the evolution of a stock price in HBase, then consolidate this flow with Hive to materialize its hour-by-hour evolution for a given day (see the sketch below).
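As an illustration of the HiveQL side of this exercise, the hourly consolidation could look like the sketch below. The stock_quotes table, its columns, the HiveServer2 host and the chosen date are hypothetical, and the PyHive client is used here in place of Talend's Hive components.

```python
# Hedged sketch of the hourly consolidation in HiveQL, executed through
# the PyHive client instead of Talend's tHiveConnection/tHiveRow components.
# The stock_quotes table and HiveServer2 host are assumptions.
from pyhive import hive

conn = hive.Connection(host="hiveserver2-host", port=10000, username="etl")
cursor = conn.cursor()

# Materialize the hour-by-hour evolution of one stock for a given day.
cursor.execute("""
    CREATE TABLE IF NOT EXISTS stock_hourly AS
    SELECT
        symbol,
        to_date(quote_ts) AS quote_day,
        hour(quote_ts)    AS quote_hour,
        avg(price)        AS avg_price,
        max(price)        AS max_price,
        min(price)        AS min_price
    FROM stock_quotes
    WHERE symbol = 'ACME' AND to_date(quote_ts) = '2024-03-01'
    GROUP BY symbol, to_date(quote_ts), hour(quote_ts)
""")

# Read back the consolidated rows, hour by hour.
cursor.execute("SELECT * FROM stock_hourly ORDER BY quote_hour")
for row in cursor.fetchall():
    print(row)
conn.close()
```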