Recently my work required me to look into the new features added in Informatica 9.1, but I never thought the journey would lead me to explore further and write a blog post. Let’s see how I traversed the different new aspects that are closely related to data management and Business Intelligence. First we will look at what Big Data is and where it stands now.

People often wonder how organizations like Yahoo, Google and Facebook store such large amounts of user data. It is worth noting that Facebook stores more photos than Google’s Picasa. Any guesses how?

What is Hadoop

The answer is Hadoop, a way to store and process very large amounts of data, on the scale of petabytes and beyond. Its storage layer is called the Hadoop Distributed File System (HDFS). Hadoop was developed by Doug Cutting based on ideas described in Google’s papers. Much of the data at this scale is machine generated. For example, the Large Hadron Collider, built to study the origins of the universe, produces about 15 petabytes of data every year.

MapReduce

The next question that comes to mind is how quickly we can access such large amounts of data. Hadoop uses MapReduce, a programming model that first appeared in a Google research paper. It follows a ‘Divide and Conquer’ approach, and the data is organized as key-value pairs. A job is submitted from a single node, but the work is split into chunks that are processed in parallel on the many machines holding the data (the map step). The intermediate results are then sorted and grouped by key, and combined into the final output (the reduce step).
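The map, sort and reduce steps described above can be sketched in a few lines of plain Python. This is only a single-machine illustration of the idea, not Hadoop's actual API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample text are made up for this example, and in a real cluster each chunk would be mapped on a different node.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) key-value pair for every word in one chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all values by key and return the keys in sorted order."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_phase(key, values):
    """Reduce: combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(chunks):
    """Run map over every chunk, then shuffle, then reduce each key."""
    mapped = []
    for chunk in chunks:  # on a cluster, these map calls would run in parallel
        mapped.extend(map_phase(chunk))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input: two chunks that would live on two different nodes.
chunks = ["big data big cluster", "data node data"]
result = word_count(chunks)
print(result)
```

The point of the design is that `map_phase` only ever looks at its own chunk, so chunks can be processed independently, and only the grouped key-value pairs have to be brought together for the reduce step.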

Hadoop runs on standard PC servers: it connects them into a cluster and distributes the data files across the nodes. It uses all of these nodes as one large file system to store and process the data, which makes it a truly distributed file system. Extra nodes can be added when the data approaches the installed capacity, so the setup is highly scalable. It is also cheap, because it is open source and does not require the specialized hardware used in traditional servers. Hadoop is often counted among the NoSQL technologies as well.
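To make the idea of spreading one file over many nodes concrete, here is a rough sketch that cuts a file into fixed-size blocks and assigns each block to several nodes. Everything in it is an assumption for illustration: the block size, the replication factor of 3, the node names and the simple round-robin placement are invented here, and HDFS uses its own placement policy.

```python
def split_into_blocks(file_size, block_size):
    """Return the sizes of the blocks a file of file_size units is cut into."""
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


def place_blocks(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            nodes[(block + offset) % len(nodes)] for offset in range(replication)
        ]
    return placement


# Toy numbers (hypothetical): a 200-unit file, 64-unit blocks, 4 nodes.
blocks = split_into_blocks(file_size=200, block_size=64)
nodes = ["node1", "node2", "node3", "node4"]
placement = place_blocks(len(blocks), nodes, replication=3)

for block_id, holders in placement.items():
    print(block_id, blocks[block_id], holders)
```

Adding capacity in this sketch just means appending another name to `nodes`; later blocks then start landing on the new machine, which is the scaling property described above.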

Hadoop in the Real World

The Tennessee Valley Authority (TVA) uses smart-grid field devices to collect data on its power-transmission lines and facilities across the country. These sensors send in data 30 times per second; at that rate, the TVA estimates it will have half a petabyte of data archived within a few years. TVA uses Hadoop to store and analyse this data. In India, the Power Grid Corporation of India intends to install similar smart devices in its grids to collect data and reduce transmission losses. It would do well to follow TVA's example. Recently Facebook moved to a 30-petabyte Hadoop cluster, which sounds incredible; it is hard to digest the fact that we are using such an enormous volume of data.
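To get a feel for how a feed of 30 readings per second adds up, the arithmetic can be sketched as below. Only the rate of 30 per second comes from the text. The number of sensors and the size of each reading are made-up values for illustration, so the total is not an estimate of TVA's actual data volume.

```python
SAMPLES_PER_SECOND = 30          # rate quoted in the text
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400


def samples_per_year(rate_hz, days=365):
    """Number of readings one sensor sends in a year at the given rate."""
    return rate_hz * SECONDS_PER_DAY * days


def yearly_bytes(rate_hz, sensors, bytes_per_sample):
    """Raw bytes per year for a fleet of identical sensors."""
    return samples_per_year(rate_hz) * sensors * bytes_per_sample


per_sensor = samples_per_year(SAMPLES_PER_SECOND)

# Hypothetical fleet: 1,000 sensors, 100 bytes per reading (assumptions).
total = yearly_bytes(SAMPLES_PER_SECOND, sensors=1000, bytes_per_sample=100)

print(per_sensor)        # readings per sensor per year
print(total / 10**12)    # terabytes per year (decimal units)
```

A single sensor at this rate sends close to a billion readings a year, and under the assumed fleet above the raw volume comes to roughly 95 terabytes a year, which shows why a storage system that scales out becomes attractive.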

Data Warehouse and Business Intelligence Products supporting Hadoop and MapReduce

  • Greenplum
  • Informatica
  • Teradata
  • Pentaho
  • Talend

If Hadoop and other NoSQL implementations become widely used, limitations of traditional SQL systems, such as the difficulty of storing unstructured data, can be overcome. With the volume of data increasing exponentially, Hadoop will be commercialized on a large scale, and data integration tools will play a key role in mining data for business.

Readers, please share your experiences if you have worked with Hadoop alongside other ETL and BI tools available in the market.

Posted by Hari harasudhan S
August 18th, 2011