John Naisbitt, author of the bestselling book Megatrends, stated in 1995 that “we are drowning in information but starved for knowledge.” Around the same time, a study at the University of California, Berkeley estimated that by the end of 1999 the sum of human-produced information (including all audio, video recordings, and text) would be about 12 exabytes of data, where one exabyte is one million terabytes. By 2009, only a decade later, 494 exabytes of information were transferred across the globe every day, according to The Digital Britain Final Report.

A recent publication from the McKinsey Global Institute, titled “Big data: The next frontier for innovation, competition, and productivity,” reports a 40% growth in globally generated data and a shortage of at least 140,000 people with the skills to analyze these data.

“Big data” refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze. The community has broadly come to agree that big data has four properties:
Volume – the sheer amount of data exceeds what typical database software tools can capture, store, manage, and analyze.
Velocity – data arrives at a high rate.
Variety – heterogeneous data must be combined, e.g. structured and unstructured data (from, say, social media or insurance claims) mixed with geospatial data (contextual data), and
Voracity – the data, all of it, must be consumed very fast; fast response times are required to analyze and respond, e.g. scrutinizing a million transaction events online to identify potential fraud.

This in some sense goes against the statistical notion of sampling, but compelling business cases are already in place and more are coming – e.g. analyzing clickstream data for better-targeted advertising, social media analytics, or even power grid monitoring. Sophisticated analytics can substantially improve decision making, minimize risks, and unearth valuable insights that would otherwise remain hidden. Data collected from various sources, when combined, can generate new insights. Products and services can be offered to customers at any time over digital media. As large infrastructures and cities become automated, they too will generate tremendous amounts of information and require round-the-clock real-time monitoring and state-of-the-art analytics built using machine learning and statistical techniques.

According to experts, big data can generate significant financial value across sectors: US health care alone could see $300 billion of value per year with 0.7% productivity growth, and global personal location data could yield $100 billion+ in revenue for service providers and up to $700 billion of value to end users.

We have built in-depth capabilities in the following areas to handle and analyze Big Data:

1. Development with Hadoop, HDFS, MapReduce, Hadoop Streaming, Hive, Pig, Mahout and HBase
2. Development with MongoDB
3. Development with memcached
4. Experience in handling:

  • ETL with Hadoop – Pig, Hive, HBase and Data on HDFS
  • ETL with MongoDB
  • Data profiling and cleansing in distributed Hadoop-like environments
  • Handling image data on Hadoop using HIPI
  • Serialization on Hadoop with Google protocol-buffer
  • Unstructured text data management on Hadoop
  • Time-series data analysis with Hadoop (e.g. from smart grid)
  • Combined solution using memcached, MongoDB, Hadoop MapReduce, and webserver
  • Extensive log collection and analytics (log4j, Chukwa)
  • Benchmarking: YCSB, GridMix3
  • Theta-joins using MapReduce
  • Self-similarity joins / self-joins using MapReduce
  • Projections and joins using MapReduce (SQL-to-MapReduce translators)
  • Distributed scheduling and load balancing including fair scheduler
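To make the join items above concrete, here is a minimal sketch of a reduce-side equi-join, the standard pattern behind SQL-to-MapReduce join translation. The map, shuffle, and reduce phases are simulated in plain Python; the table names and records are illustrative assumptions, not data from an actual engagement.

```python
from collections import defaultdict

def map_phase(tagged_records):
    """Emit (join_key, (table_tag, record)) pairs, as a Hadoop mapper would."""
    for tag, record in tagged_records:
        yield record[0], (tag, record)

def reduce_phase(grouped):
    """For each key, pair every left-table record with every right-table one."""
    for key, values in grouped.items():
        left = [r for t, r in values if t == "L"]
        right = [r for t, r in values if t == "R"]
        for l in left:
            for r in right:
                yield key, l, r

def mapreduce_join(left_table, right_table):
    # Tag each record with its table of origin, then group by join key
    # (this grouping step is what the shuffle phase does in Hadoop).
    tagged = [("L", r) for r in left_table] + [("R", r) for r in right_table]
    groups = defaultdict(list)
    for key, value in map_phase(tagged):
        groups[key].append(value)
    return list(reduce_phase(groups))

customers = [(1, "alice"), (2, "bob")]
orders = [(1, "book"), (1, "pen"), (3, "lamp")]
print(mapreduce_join(customers, orders))
# customer 1 joins with both of her orders; keys 2 and 3 have no match
```

On a real cluster the two phases would run as separate mapper and reducer scripts (e.g. via Hadoop Streaming), with the framework performing the grouping.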

Currently we offer services in following areas in Big Data:

Social CRM:
We are working on analyzing real-time data streams from Twitter and Facebook using various machine learning algorithms in conjunction with natural language processing techniques on Big Data platforms. We use language modelling (N-gram), named entity recognition and parsing techniques (PCFG), with various classifiers, clustering algorithms, and regression techniques.
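As a toy illustration of the N-gram language modelling mentioned above, the sketch below trains a bigram model (N=2) on a two-sentence corpus. The corpus and token markers are illustrative assumptions; a production pipeline would train on large social-media streams and add smoothing for unseen bigrams.

```python
from collections import Counter, defaultdict

def train_bigrams(sentences):
    """Count bigram occurrences, with sentence-boundary markers."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            counts[prev][cur] += 1
    return counts

def bigram_prob(counts, prev, cur):
    """Maximum-likelihood estimate P(cur | prev)."""
    total = sum(counts[prev].values())
    return counts[prev][cur] / total if total else 0.0

corpus = ["the service is great", "the service is slow"]
model = train_bigrams(corpus)
print(bigram_prob(model, "service", "is"))   # 1.0: "service" is always followed by "is"
print(bigram_prob(model, "is", "great"))     # 0.5: "is" precedes "great" or "slow" equally
```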

One of our important focus areas is providing solutions for enterprise business analytics requirements by leveraging Hadoop and related big data technologies such as MongoDB and memcached. We are currently working on customer profile analysis, fraud detection and analysis, and workflow modelling in the Banking and Insurance domains.

We gather large volumes of real-time operational measurement data from multi-vendor, multi-technology networks. Using the MapReduce framework, we calculate various KPIs and KQIs and generate meaningful information and actionable insights much faster, in near real time.
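The KPI calculation above follows the same map/shuffle/reduce pattern; here is a minimal sketch computing an average-latency KPI per network cell. The field names (`cell_id`, latency in milliseconds) are illustrative assumptions, not an actual KPI definition.

```python
from collections import defaultdict

def map_measurement(record):
    """Mapper: emit (cell_id, (latency, count=1)) for each raw measurement."""
    cell_id, latency_ms = record
    yield cell_id, (latency_ms, 1)

def reduce_kpi(key, values):
    """Reducer: sum latencies and counts to produce the average-latency KPI."""
    total = sum(v for v, _ in values)
    count = sum(c for _, c in values)
    return key, total / count

def run(records):
    # Group mapper output by key, as the Hadoop shuffle phase would.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_measurement(record):
            groups[key].append(value)
    return dict(reduce_kpi(k, vs) for k, vs in groups.items())

measurements = [("cell-7", 40), ("cell-7", 60), ("cell-9", 30)]
print(run(measurements))   # {'cell-7': 50.0, 'cell-9': 30.0}
```

Emitting (sum, count) pairs rather than raw averages keeps the reducer associative, so the same function can serve as a combiner on each mapper node to cut shuffle traffic.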