Introduction to Hadoop

Software

mindsmapped-consulting
TSM RP05 Antivirus Steering Committee Meeting – #01 16 March 2012
Introduction to Big Data & Hadoop
Big Data Hadoop Training
Page ‹#› Subject Title Classification: Restricted

Introduction to Big Data

Importance Of Data
  “Data is the new oil,” said Andreas Weigend, social data guru and former chief scientist at Amazon.com. “Oil needs to be refined before it can be useful.”

ESG Report on Analytics:
  The majority of organizations view data analytics as a top-5 business and IT priority.
  Reduced costs and process improvement are the top data analytics platform benefits.
  No leading data analytics platform has emerged yet.
  Nearly one-third of the organizations surveyed are using a custom-developed solution.
  Big data is driving changes in analytics tools, infrastructure, and processes.

Meaning of the term BigData

Size of the largest dataset for processing

Number of Data Sources to integrate

Update frequency of the largest data set

Challenges while processing data

Key benefits from processing data

Big Data & its hype
  Gartner: Hadoop will be in two-thirds of advanced analytics products by 2015.
  Livemint.com: SMAC (social, mobile, analytics, cloud) is the new flavour of IT companies. SMAC will allow the IT industry to offer more value to its clients.
  Offshore Insights: Growth of IT companies will be dictated by cloud, mobile, analytics, big data and social media services, according to a survey of 410 global IT decision-makers by research firm Offshore Insights, released in February.

What is Big Data?
  Lots of data (in terms of terabytes or petabytes).
  It is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage and process within a tolerable elapsed time.
  Systems and enterprises generate huge amounts of data, from terabytes to even petabytes.

Structured Vs Unstructured

Big Data Characteristics
  Big Data is characterized by 3 Vs (commonly given as volume, velocity and variety).

Time for Quiz
  For each of the given file formats, identify the category of data it belongs to:
  Word docs, PDFs, text files
  Email body
  XML files
  Data generated by ERPs, CRMs, etc.

Big Data Users & Scenarios

Challenges Of Big Data
  Problem #1: Slow disk reads/writes
  Problem #2: Hardware failures
  Problem #3: Data integration and transfer

Why Distributed Processing?
  To read 1 TB of data:
  Disk transfer rate: 100 MB/sec per disk
  One disk: 1 TB / (100 MB/sec) = 10,485 sec, or about 175 min.
  Five disks read in parallel: 1 TB / (5 × 100 MB/sec) = 2,097 sec, or about 35 min.

Introduction to Hadoop

Course Contents:
  History of Hadoop
  Hadoop Ecosystem
  Hadoop Animal Planet
  What is Hadoop?
  Distinctions of Hadoop
  Hadoop Components
  Anatomy of a File Write
  Anatomy of a File Read
  Replication & Rack awareness

History of Hadoop

Hadoop Ecosystem

Hadoop Animal Planet

What is Hadoop?
  The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.

Key Distinctions of Hadoop
  Scalable
  Robust
  Accessible
  Simple

Hadoop Components
  HDFS – Hadoop Distributed File System (storage):
    Data is split and distributed across nodes
    Each split is replicated
    NameNode is the master and DataNodes are the slaves
  MapReduce (processing):
    Splits a task across processors
    Execution is near the data and the results are merged
    Self-healing
    JobTracker is the master and TaskTrackers are the slaves

Hadoop Components (cluster diagram)
  MapReduce: one Job Tracker, three Task Trackers
  HDFS: one NameNode, three Data Nodes

Storage Components
  NameNode
    It is the master node and is responsible for the entire cluster
    Manages the filesystem namespace
    Enterprise level software is used
  DataNode
    Slaves which run on commodity/cheap hardware
    Store and retrieve data when they are told to (by a client or the NameNode)
    Send heartbeat signals to the NameNode (NN), along with the list of blocks that they store
  Secondary Node
    Periodically merges the NameNode's edit log into its filesystem image (checkpointing); despite the name, it is a checkpoint helper rather than a hot standby for the NameNode

Processing components
  Job Tracker:
    Coordinates all the jobs run on the system by scheduling tasks
    Keeps a record of the overall progress of each job
    If a task fails, reschedules it on a different TaskTracker
  Task Tracker:
    Slave daemon which accepts tasks to be run on a block of data
    Sends progress reports as heartbeat signals to the JobTracker at regular intervals

HDFS

MapReduce Job

Anatomy of a File Read

Anatomy of a File Write

Replication & Rack awareness
  Block A, Block B, Block C (diagram legend)
  Rack 1: nodes 1, 2, 3, 4
  Rack 2: nodes 5, 6, 7, 8
  Rack 3: nodes 9, 10, 11, 12
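
The read-time figures on the "Why Distributed Processing?" slide can be checked with a few lines of Python. This is only a sketch of the slide's arithmetic: the function name `read_time_seconds` is made up for illustration, 1 TB is taken as 1024 × 1024 MB (which is what the slide's 10485 sec implies), and throughput is assumed to scale linearly with the number of disks reading in parallel.

```python
# Sketch of the slide's arithmetic. Assumptions: binary units
# (1 TB = 1024 * 1024 MB) and perfectly linear scaling across disks.
MB_PER_TB = 1024 * 1024


def read_time_seconds(size_tb: float, rate_mb_per_s: float, disks: int = 1) -> float:
    """Seconds needed to read size_tb terabytes when `disks` disks
    each stream data at rate_mb_per_s megabytes per second."""
    if disks < 1 or rate_mb_per_s <= 0:
        raise ValueError("need at least one disk and a positive transfer rate")
    return size_tb * MB_PER_TB / (rate_mb_per_s * disks)


if __name__ == "__main__":
    for disks in (1, 5):
        secs = read_time_seconds(1, 100, disks)
        print(f"{disks} disk(s): {secs:.2f} s, {secs / 60:.1f} min")
```

Real clusters will not scale this cleanly (network transfer, stragglers and scheduling overhead all add time), but the calculation shows why spreading reads over many disks is the starting point for Hadoop's design.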
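
The "Replication & Rack awareness" diagram did not survive extraction, so the replica positions it showed are unknown. As a rough illustration, the sketch below follows the commonly documented default HDFS placement for a replication factor of 3: first replica on the writer's node, second on a node in a different rack, third on another node in that same remote rack. The rack layout is copied from the slide; the names `RACKS`, `rack_of` and `place_replicas` are hypothetical and are not part of any Hadoop API.

```python
import random

# Rack layout from the slide: 3 racks with 4 nodes each.
RACKS = {
    "rack1": [1, 2, 3, 4],
    "rack2": [5, 6, 7, 8],
    "rack3": [9, 10, 11, 12],
}


def rack_of(node):
    """Return the name of the rack that holds the given node."""
    for rack, nodes in RACKS.items():
        if node in nodes:
            return rack
    raise ValueError(f"unknown node {node}")


def place_replicas(writer_node, rng=random):
    """Choose 3 nodes for one block (assumed default-style policy):
    the writer's own node, then two distinct nodes on one other rack."""
    local_rack = rack_of(writer_node)
    remote_rack = rng.choice([r for r in RACKS if r != local_rack])
    second, third = rng.sample(RACKS[remote_rack], 2)
    return [writer_node, second, third]


if __name__ == "__main__":
    for block, writer in zip("ABC", (1, 6, 11)):
        print(f"Block {block}: nodes {place_replicas(writer)}")
```

With this layout every block survives the loss of any single rack, while only one of the two extra copies has to cross a rack boundary during the write.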
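
The MapReduce description on the "Hadoop Components" slide (split the task, run it near the data, merge the results) can be illustrated with a toy word count. This is an in-memory sketch in plain Python and does not use Hadoop's API: each string stands in for one input split, and the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are chosen for illustration only.

```python
from collections import defaultdict


def map_phase(split):
    """Map step: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: merge the values of each key into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(splits):
    """Run map over every split, then shuffle and reduce the combined output."""
    pairs = [pair for split in splits for pair in map_phase(split)]
    return reduce_phase(shuffle(pairs))


if __name__ == "__main__":
    print(word_count(["big data", "Big hadoop data"]))
```

In a real cluster each call to the map step would run as a separate task on the node that stores that split, which is the "execution near the data" point made on the slide.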