Friday 4 March 2016

What is Big Data


We have been dealing with data for many years, but in today's landscape the emphasis has shifted to analytics and Big Data.

Analytics gives its best results only when it is fed a large quantity of high-quality data: the more good data we have, the better our decisions. The data we deal with today is measured in petabytes and will scale to zettabytes in the future. As technology has evolved over the years, we have become proficient at handling massive databases, data marts and data warehouses. But things have changed. We now receive data from many different sources, and much of it is unstructured. Handling this vast amount of structured and unstructured data is a new challenge for organizations, and it is the challenge that Big Data addresses.

We have reached a point of data explosion. Where is all this data coming from? The diagram below illustrates the sources.
[Figure: What is Big Data]



The data comes from multiple sources: sensors that gather climate information, content posted on social media, online transaction records, call detail records, cell phone GPS signals and CCTV cameras.

Characteristics of Big Data

Big Data is characterized by four V's.

i) Volume : As data volume increases, traditional infrastructure is unable to handle it, and managing such huge amounts of data within current budgets is not feasible. Organisations are flooded with growing data, sometimes in the range of petabytes.

ii) Velocity : We now have multiple data sources. Some of them, such as sensors, generate data at such a high rate and in such large volume that retaining it has become a challenge. Response times also have to improve: some real-time workloads, such as fraud detection, must be processed immediately.

iii) Variety : We now have both structured and unstructured data, such as text, sensor data, and audio and video clips. Analysing both together requires a new approach. Ironically, by commonly quoted estimates about 80% of data is unstructured.

iv) Veracity : Establishing trust in the data is also a challenge, because bad input results in bad output. Since we devote so much time to analysing the data, the data must be trustworthy.





Big Data Strategy

Organizations must fully exploit every source of data. When making decisions, executives should consider not only operational data and customer demographics but also customer feedback, details in contracts and agreements, and other types of unstructured data and content.

Factors for Big Data Strategy

i) Integrate and manage the full variety, velocity and volume of data
ii) Apply advanced analytics to information in its native form
iii) Visualize all available data for ad hoc analysis
iv) Provide a development environment for building new analytic applications
v) Workload optimization and scheduling
vi) Security and governance



People often mistake Big Data for a technology. It is not just a technology; it is a business strategy for utilizing information resources. Success at each entry point is accelerated by products within the Big Data platform, which help build a foundation for future requirements as the organization expands further into the platform.

Big Data Tools

i) Hadoop
ii) Cloudera
iii) MongoDB
iv) Talend

Hadoop - "Hadoop is Big Data and Big Data is Hadoop" is what most people think, but that is not the case. Hadoop is just one flavour of Big Data. It is an open-source software framework for storing and processing very large datasets. It offers enormous storage capacity for any kind of data, coupled with an efficient processing system, and it can run many tasks concurrently.

Cloudera - Cloudera is an enterprise solution built around Hadoop. It adds features that give people working in an organisation better access to the data, along with stronger security, which matters because the data being stored is often sensitive.

MongoDB - MongoDB is a document-oriented NoSQL database. It stores unstructured and semi-structured data as flexible, JSON-like documents instead of rows in fixed tables.

Talend - Talend is a company with open-source roots that offers a number of data integration products.

Thursday 3 March 2016

Real Time Analytics of Big Data

Big Data deals with storing enormous amounts of data, both structured and unstructured, coming from different sources such as sensors. In this post I am going to explain real-time analytics of Big Data.

The data that we deal with can be analyzed in two ways.
  1. While the data is in motion, that is, while it is still flowing and has not yet been inserted into a database.
  2. After the data has been inserted into a database.

The world now moves so fast that if we wait for the data to be inserted into a database before analyzing it, the result is sometimes useless by the time we get it.

Let me give an example. We have CCTV cameras at every traffic signal, and together they generate an enormous amount of data every second. Traditionally, when a crime happens we analyze the database afterwards and try to identify the criminal. This is a reactive approach. The better option is to analyze everything at the source in real time. We put a face scanner at every source, and the moment it finds a suspect it alerts the nearest crime control system. In this case we don't have to wait for the data to be inserted into a database, so we can identify suspects and catch them before they commit a crime.

There are other areas where real-time analytics is useful as well.

Nowadays there is so much data everywhere that it is practically impossible to store all of it. So we analyze the data before storing it in the database and discard the unwanted part. In this way we store only the important data.
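The idea of analyzing data before it is stored can be sketched in a few lines of Python. This is only an illustration, not how any particular product does it: the reading format, the `keep_important` and `store` helpers and the 45-degree threshold are all assumptions made up for the example.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical reading format: (sensor_id, temperature in Celsius).
Reading = Tuple[str, float]


def keep_important(readings: Iterable[Reading], threshold: float) -> Iterator[Reading]:
    """Yield only readings at or above the threshold; the rest are dropped in flight."""
    for sensor_id, value in readings:
        if value >= threshold:
            yield sensor_id, value


def store(readings: Iterable[Reading]) -> List[Reading]:
    """Stand-in for a database insert: keep whatever survives the filter."""
    return list(readings)


# Assumed sample data; in practice this would be an endless stream.
incoming = [("s1", 21.5), ("s2", 48.0), ("s1", 22.1), ("s3", 51.3)]
stored = store(keep_important(incoming, threshold=45.0))
print(stored)
```

Because `keep_important` is a generator, each reading is examined as it arrives and unwanted ones never reach the storage step.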

Real-time analytics tools

i) IBM InfoSphere Streams
ii) Apache Spark
iii) Apache Storm

IBM InfoSphere Streams is a core IBM product that focuses on real-time analysis of Big Data. The aim is to analyse data in real time and draw meaningful conclusions. It is based on the principle of a graph. Just as a graph is a set of vertices and edges, a Streams application is a set of operators (the vertices) connected by streams (the edges). We write code in the operators, and tuples flow through the streams. A tuple is simply a row of data. There are different types of operators, each with a specific function.

Operators

Source Type : Any outside data first enters through this operator; it is the entry point for data. It can interact with external devices, so it is the meeting point between software and hardware. This operator parses external data and creates tuples from it.

Sink Type : The main job of this operator is to load data into the database.

Filter : It filters tuples. Any tuple that does not meet the criteria is dropped.

Punctor : A Punctor operator inserts punctuation into the output stream based on a user-supplied condition.

Aggregate : An Aggregate operator groups and summarizes incoming tuples.

Join : A Join operator correlates two streams.

Sort : A Sort operator imposes an order on incoming tuples.
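To make the operator descriptions a little more concrete, here is a rough Python imitation of Filter, Aggregate and Sort. It is not SPL and not the InfoSphere Streams API: the function names, the dictionary-based tuples and the simple fixed-size windows are all assumptions chosen to keep the sketch short.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List

Row = Dict[str, Any]  # one tuple = one row of data


def filter_op(stream: Iterable[Row], predicate: Callable[[Row], bool]) -> Iterator[Row]:
    """Filter: pass on only the tuples that meet the criteria."""
    for row in stream:
        if predicate(row):
            yield row


def _summarize(batch: List[Row], key: str, value: str) -> List[Row]:
    """Sum `value` per distinct `key` within one window of tuples."""
    totals: Dict[Any, float] = {}
    for row in batch:
        totals[row[key]] = totals.get(row[key], 0) + row[value]
    return [{key: k, "total": v} for k, v in totals.items()]


def aggregate_op(stream: Iterable[Row], key: str, value: str, window: int) -> Iterator[Row]:
    """Aggregate: group every `window` tuples by `key` and sum `value`."""
    batch: List[Row] = []
    for row in stream:
        batch.append(row)
        if len(batch) == window:
            yield from _summarize(batch, key, value)
            batch = []
    if batch:  # flush a final, partially filled window
        yield from _summarize(batch, key, value)


def sort_op(stream: Iterable[Row], key: str, window: int) -> Iterator[Row]:
    """Sort: impose an order on each window of `window` tuples."""
    batch: List[Row] = []
    for row in stream:
        batch.append(row)
        if len(batch) == window:
            yield from sorted(batch, key=lambda r: r[key])
            batch = []
    yield from sorted(batch, key=lambda r: r[key])


# Assumed sample tuples: call durations per caller.
calls = [
    {"caller": "A", "seconds": 30},
    {"caller": "B", "seconds": 0},
    {"caller": "A", "seconds": 45},
    {"caller": "C", "seconds": 10},
    {"caller": "B", "seconds": 20},
]
connected = filter_op(calls, lambda r: r["seconds"] > 0)
totals = aggregate_op(connected, key="caller", value="seconds", window=4)
ordered = list(sort_op(totals, key="total", window=3))
print(ordered)
```

Each function takes a stream and returns a stream, so operators can be chained in any order, which mirrors the way operators are linked by streams in the graph.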



So we have a source operator and a sink operator. The source interacts with the outside world and gets the required data from hardware or from a file.
The sink loads the final data into the database.
In between we have different operators, linked to each other by edges, which in our case are called streams. All the data flows through these streams.
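The source, intermediate operator and sink arrangement described above might look like this in plain Python. Again this is a sketch under stated assumptions, not InfoSphere Streams code: the CSV layout, the number plates, the 80 km/h limit and the `violations` table are invented for the example, and an in-memory SQLite database stands in for the real database.

```python
import csv
import io
import sqlite3
from typing import Dict, Iterable, Iterator, TextIO


def source(text_file: TextIO) -> Iterator[Dict[str, str]]:
    """Source operator: parse external CSV lines into tuples (dicts)."""
    yield from csv.DictReader(text_file)


def only_speeding(stream: Iterable[Dict[str, str]], limit_kmh: float) -> Iterator[Dict[str, str]]:
    """Intermediate operator: keep only tuples above the speed limit."""
    for row in stream:
        if float(row["speed_kmh"]) > limit_kmh:
            yield row


def sink(stream: Iterable[Dict[str, str]], conn: sqlite3.Connection) -> int:
    """Sink operator: load the final tuples into the database."""
    conn.execute("CREATE TABLE IF NOT EXISTS violations (plate TEXT, speed_kmh REAL)")
    rows = [(r["plate"], float(r["speed_kmh"])) for r in stream]
    conn.executemany("INSERT INTO violations VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)


# Assumed input; a file or a device feed would take the place of this string.
raw = io.StringIO("plate,speed_kmh\nKA01,42\nKA02,97\nKA03,61\nKA04,88\n")
conn = sqlite3.connect(":memory:")
loaded = sink(only_speeding(source(raw), limit_kmh=80), conn)
plates = conn.execute("SELECT plate FROM violations ORDER BY plate").fetchall()
print(loaded, plates)
```

Only the sink touches the database; everything before it works on data that is still in motion.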

Some cases where real-time analytics of data is useful

i) Crime detection and prevention
ii) Stock market - In the stock market, trading happens so fast that a fraction of a second can change
    everything. If we analyse the patterns in real time, we can draw meaningful
    conclusions.
iii) Telecommunication - Nowadays the world is so densely connected that managing call detail records
     (CDRs) has become a headache for the companies. One can imagine the vast quantity of data present
     in a CDR, and not all of it is relevant. To store CDRs efficiently, InfoSphere Streams can be
     used: it parses all the details and removes the irrelevant ones.
iv) Health monitoring - The system can also be used for health monitoring. Data from
    devices can be monitored and studied to find out whether the patient is suffering from a
    disease.
v) Transportation - Real-time data about the movement of buses or other vehicles can be made
    available, and customers can benefit from it.
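As an illustration of the telecommunication case, the following Python sketch parses CDR lines and keeps only the fields worth storing. The pipe-delimited layout, the field names, the phone numbers and the rule that unanswered calls are irrelevant are all assumptions; real CDR formats vary by vendor.

```python
from typing import Dict, Iterable, Iterator

# Assumed pipe-delimited CDR layout (hypothetical field names).
FIELDS = ["caller", "callee", "start", "duration_s", "cell_id", "switch_id", "trunk"]
# Fields we pretend are relevant for storage.
KEEP = ("caller", "callee", "start", "duration_s")


def parse_cdr(line: str) -> Dict[str, str]:
    """Split one raw CDR line into named fields."""
    return dict(zip(FIELDS, line.strip().split("|")))


def trim_cdrs(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Drop unanswered calls and strip fields that are not worth storing."""
    for line in lines:
        record = parse_cdr(line)
        if int(record["duration_s"]) == 0:
            continue  # assumed rule: zero-duration calls are irrelevant
        yield {name: record[name] for name in KEEP}


# Made-up sample records.
raw_cdrs = [
    "9000000001|9000000002|2016-03-03T10:00:00|125|C17|SW3|T9",
    "9000000003|9000000004|2016-03-03T10:00:05|0|C02|SW1|T4",
]
trimmed = list(trim_cdrs(raw_cdrs))
print(trimmed)
```

Only one of the two sample records survives, and it carries four fields instead of seven, which is the kind of reduction that makes storage manageable.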

InfoSphere Streams and IoT (Internet of Things)

IoT is one of the technologies of the future, and many companies are investing heavily in this field. Streaming technology can be used to implement IoT.

Successful implementation of IoT requires two things: the system must be able to handle large amounts of data, and it must be able to communicate with hardware. InfoSphere Streams qualifies on both counts, so it can be one of the technologies used to implement IoT.

Let me give an example of IoT.

With the arrival of IoT, everything will become smart, so we will have a smart chair. I will be able to find out from anywhere in the world whether someone is sitting in my chair. For this we give the chair a unique ID and connect it to a network. We use a sensor, such as a pressure sensor, to determine whether someone is sitting in it. The pressure sensor generates data continuously at a fixed interval. Our Source operator communicates with the sensor and generates the required tuples, which are then parsed to find out whether the chair is occupied. So from anywhere in the world we can tell if someone is sitting in my chair.
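The smart chair flow can be sketched in Python as three small stages. Everything specific here is an assumption for illustration: the chair ID, the readings in kilograms, the 20 kg threshold and the function names. A real deployment would read from the sensor hardware instead of a list.

```python
from typing import Any, Dict, Iterable, Iterator, Tuple

CHAIR_ID = "chair-001"      # hypothetical unique ID for the chair
OCCUPIED_ABOVE_KG = 20.0    # assumed threshold for "someone is sitting"


def source(readings_kg: Iterable[float], chair_id: str) -> Iterator[Dict[str, Any]]:
    """Source operator: turn raw sensor readings into tuples."""
    for tick, kg in enumerate(readings_kg):
        yield {"chair_id": chair_id, "tick": tick, "pressure_kg": kg}


def parse_occupancy(stream: Iterable[Dict[str, Any]], threshold: float) -> Iterator[Dict[str, Any]]:
    """Parser: decide for each tuple whether the chair is occupied."""
    for row in stream:
        yield {**row, "occupied": row["pressure_kg"] > threshold}


def state_changes(stream: Iterable[Dict[str, Any]]) -> Iterator[Tuple[int, bool]]:
    """Emit (tick, occupied) only when the occupancy state changes."""
    previous = None
    for row in stream:
        if row["occupied"] != previous:
            yield row["tick"], row["occupied"]
            previous = row["occupied"]


# Made-up readings taken at a fixed interval.
readings = [0.0, 0.4, 68.2, 67.9, 68.5, 0.3]
events = list(
    state_changes(parse_occupancy(source(readings, CHAIR_ID), OCCUPIED_ABOVE_KG))
)
print(events)
```

Reporting only the state changes keeps the output small even though the sensor itself reports at every interval.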