Friday, 4 March 2016

What is Big Data


We have been dealing with data for many years, but in today's landscape the emphasis has shifted to analytics and Big Data.

Analytics gives its best results only when it is fed a large quantity of high-quality data. The more data we have, the better the decisions we can make. The data we deal with today is measured in petabytes, and in the future it will scale to zettabytes. As technology has evolved over the years, we have become proficient in dealing with massive databases, data marts and data warehouses. But now things have changed. We are getting data from many different sources, and it is largely unstructured. So organizations face a new challenge: how to handle this vast amount of data, both structured and unstructured. Big Data is the answer to this situation.

We have reached a point of data explosion. Where is all this data coming from? The diagram below explains this.
[Figure: What is Big Data]



The data comes from multiple sources: sensors that gather climate information, content posted on social media, online transaction records, call detail records, cell phone GPS signals, and CCTV cameras.

Characteristics of Big Data

Big Data is characterized by four V's.
i) Volume : As data volume increases, traditional infrastructure is unable to handle it. Managing such huge amounts of data within current budgets is not feasible. Organisations are flooded with growing data, sometimes in the range of petabytes.
ii) Velocity : We now have multiple data sources. Some of them, like sensors, generate data at such a pace and in such volume that retaining it has become a challenge. We also have to improve our response time: some real-time workloads, like fraud detection, must be processed immediately.
iii) Variety : We now have both structured and unstructured data, such as text, sensor data, audio and video clips. Analysing both together requires a new approach. And the irony is that about 80% of data is unstructured.
iv) Veracity : Establishing trust in the data is also a challenge, as bad input results in bad output. Since we devote so much time to analysing the data, the data must be trustworthy.





Big Data Strategy

Organizations must fully exploit every source of data. When making decisions, executives should consider not only operational data and customer demographics, but also customer feedback, details in contracts and agreements, and other types of unstructured data and content.

Factors for Big Data Strategy

i) Integrate and manage the full variety, velocity and volume of data
ii) Apply advanced analytics to information in its native form
iii) Visualize all available data for ad hoc analysis
iv) Provide a development environment for building new analytic applications
v) Workload optimization and scheduling
vi) Security and governance



People often mistake Big Data for a technology. It is not just a technology; it is a business strategy for utilising information resources. Success at each entry point is accelerated by products within the Big Data platform, which help in building the foundation for future requirements by expanding further into the platform.

Big Data Tools
i) Hadoop
ii) Cloudera
iii) MongoDB
iv) Talend

Hadoop - "Hadoop is Big Data and Big Data is Hadoop." This is what most people think, but it is not so. Hadoop is just one flavour of Big Data. It is an open source software framework for storing and processing very large data sets. It offers enormous storage for any kind of data, coupled with an efficient processing system, and it can handle many concurrent tasks.

Cloudera - Cloudera has some additional features which give people working in an organisation better access to the data. It is an enterprise solution through which Hadoop can be implemented. It is also more secure, which matters because we are storing sensitive data.

MongoDB - It is a document-oriented NoSQL database, a modern approach which helps in storing unstructured data in an organized way.

Talend - It is also an open source company, with a number of data integration products.

Thursday, 3 March 2016

Real Time Analytics of Big Data

Big Data is used for storing enormous amounts of data, both structured and unstructured, coming from different sources such as sensors. In this post I am going to explain real-time analytics of Big Data.

The data that we deal with can be analyzed in two ways.
  1. When the data is in motion, that is, while it is still flowing and has not yet been inserted into a database.
  2. After the data has been inserted into a database.

The world now moves so fast that if we wait for the data to be inserted into a database and only then analyze it, the result is sometimes useless.

Let me give an example. We have CCTV cameras at every traffic signal, generating millions of data points every second. Traditionally, when a crime happens we analyze the database afterwards and try to identify the criminal. This is the bottom-up approach. The better option is to analyze everything at the source in real time. We put a face scanner at every source, and the moment it finds a suspect it alerts the nearest crime control system. In this case we don't have to wait for the data to be inserted into a database, so we can identify suspects and catch them before they can commit a crime.

There are other areas where we can use real-time analytics as well.

Nowadays there is so much data everywhere that it is practically impossible to store all of it. So we analyze the data before storing it in a database and remove the unwanted data. In this way we store only the important data.
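The idea of analyzing data before it is stored can be sketched in a few lines of Python. This is only an illustration: the reading format, the `keep_relevant` function and the threshold of 5.0 are assumptions made for the example, not part of any particular product.

```python
from typing import Iterable, Iterator

def keep_relevant(readings: Iterable[dict], threshold: float) -> Iterator[dict]:
    """Yield only readings whose value reaches the threshold; drop the rest."""
    for reading in readings:
        if reading["value"] >= threshold:
            yield reading

# Hypothetical in-flight sensor readings that have not reached any database yet.
incoming = [
    {"sensor": "s1", "value": 0.2},
    {"sensor": "s2", "value": 7.5},
    {"sensor": "s1", "value": 9.1},
    {"sensor": "s3", "value": 0.4},
]

database = []  # stand-in for the real data store
for reading in keep_relevant(incoming, threshold=5.0):
    database.append(reading)

print(len(incoming), "readings seen,", len(database), "stored")
```

Because `keep_relevant` is a generator, each reading is examined as it arrives and unwanted readings are never held in memory or written to the store.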

Real-time analytics tools

i) IBM InfoSphere Streams
ii) Apache Spark
iii) Apache Storm

IBM InfoSphere Streams is a core IBM product that focuses on real-time analysis of Big Data. The aim is to analyse the data in real time and come out with meaningful conclusions. It works on the principle of a graph. Just as a graph is a set of vertices and edges, a Streams application is a set of operators (the vertices) connected by streams (the edges). We write the code in the operators, and tuples flow through the streams. A tuple is nothing but a row of data. There are different types of operators, each with a specific function.

Operators

Source Type : Any outside data first comes into this operator. This is the entry point of data. It is capable of interacting with external devices, so it is the meeting point between software and hardware. This operator parses external data and creates tuples from it.

Sink Type : The main work of this operator is to load the data into a database.

Filter : It does the tuple filtering. Tuples which do not meet the criteria are omitted.

Punctor : A Punctor operator can insert punctuation into the output stream based on a user-supplied condition.

Aggregate : An Aggregate operator is used for grouping and summarizing incoming tuples.

Join : A Join operator is used for correlating two streams.

Sort : A Sort operator is used for imposing an order on incoming tuples.
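To make the operator roles concrete, here is a minimal Python sketch of a source, filter, aggregate, sort and sink chain. It is not Streams code and does not use the Streams programming language; the function names, the "symbol,price" input format and the price cut-off are all assumptions chosen for illustration. A real Aggregate operator works on windows of an endless stream, whereas this sketch summarizes a small finite input in one go. Punctor and Join are left out to keep it short.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, object]  # a "tuple" in the sense used above: one row of data

def source(lines: Iterable[str]) -> Iterator[Row]:
    """Source: parse raw external input ("symbol,price") into tuples."""
    for line in lines:
        symbol, price = line.strip().split(",")
        yield {"symbol": symbol, "price": float(price)}

def filter_op(stream: Iterable[Row], keep: Callable[[Row], bool]) -> Iterator[Row]:
    """Filter: omit tuples that do not meet the criteria."""
    return (row for row in stream if keep(row))

def aggregate_op(stream: Iterable[Row]) -> Iterator[Row]:
    """Aggregate: group tuples by symbol and summarize each group as an average."""
    groups = defaultdict(list)
    for row in stream:
        groups[row["symbol"]].append(row["price"])
    for symbol, prices in groups.items():
        yield {"symbol": symbol, "avg_price": sum(prices) / len(prices)}

def sort_op(stream: Iterable[Row], key: str) -> Iterator[Row]:
    """Sort: impose an order on the incoming tuples."""
    return iter(sorted(stream, key=lambda row: row[key]))

def sink(stream: Iterable[Row], table: List[Row]) -> None:
    """Sink: load the final tuples into a (stand-in) database table."""
    table.extend(stream)

# Hypothetical raw input arriving from outside the application.
raw = ["IBM,150.0", "ABC,0.5", "IBM,152.0", "XYZ,20.0", "XYZ,22.0"]

table: List[Row] = []
filtered = filter_op(source(raw), lambda row: row["price"] >= 1.0)
sink(sort_op(aggregate_op(filtered), key="symbol"), table)
print(table)
```

Each function plays the part of one vertex, and the generator passed from one function to the next plays the part of the stream connecting them.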



So we have a source operator and a sink operator. The source interacts with the outside world and gets the required data from a hardware device or a file.
The sink loads the final data into the database.
In between, we have different operators linked to each other via edges, known as streams in our case. All the data flows through these streams.

Some cases where real-time analytics of data is useful
i) Crime detection and prevention
ii) Stock market - In the stock market, trading happens so fast that a fraction of a second can change everything. If we analyse the patterns in real time, we can draw meaningful conclusions.
iii) Telecommunication - Nowadays the world is so densely connected that managing call detail records (CDRs) has become a headache for the companies. One can imagine the vast quantity of data present in the CDRs. Not all of that data is relevant, so in order to store it efficiently InfoSphere Streams can be used. It will parse all the details and remove the irrelevant ones.
iv) Health monitoring - The system can also be used for proper monitoring of health. Data from devices can be monitored and studied in order to find out whether the patient is suffering from some disease.
v) Transportation - Real-time data can be made available about the movement of buses or other vehicles, and customers can benefit from it.

InfoSphere Streams and IoT (Internet of Things)

One of the technologies of the future is IoT. Every company is investing heavily in this field. Streaming technology can be used in the implementation of IoT.

For a successful implementation of IoT two things are required: the system must be capable of handling a large amount of data, and it must be capable of communicating with hardware. InfoSphere Streams qualifies on both counts, so it can be one of the technologies with which IoT is implemented.

Let me give an example of IoT:

With the onset of IoT everything will become smart. So we will have a smart chair. I can find out from anywhere in the world whether someone has occupied my chair. For this we give my chair a unique ID and connect it to a network. We use a sensor, such as a pressure sensor, to determine whether someone has occupied the chair. The pressure sensor continuously generates data at a fixed interval. Our Source operator communicates with the sensor and generates the required tuples, which are then parsed to find out whether someone is occupying the chair. So from anywhere in the world we can tell if someone has occupied my chair.
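The smart chair example can be sketched as follows. This is plain Python, not Streams code, and everything specific in it is an assumption for illustration: the "chair_id:pressure" sample format, the chair ID `chair-42`, and the 20 kg threshold above which the chair is treated as occupied.

```python
from typing import Iterable, Iterator, List, NamedTuple

class ChairReading(NamedTuple):
    """One tuple produced by the source: which chair, and the measured pressure."""
    chair_id: str
    pressure_kg: float

# Assumed cut-off: readings at or above this are treated as "someone is sitting".
OCCUPIED_THRESHOLD_KG = 20.0

def source(raw_samples: Iterable[str]) -> Iterator[ChairReading]:
    """Turn raw sensor samples ("chair_id:pressure") into tuples."""
    for sample in raw_samples:
        chair_id, pressure = sample.split(":")
        yield ChairReading(chair_id, float(pressure))

def is_occupied(reading: ChairReading) -> bool:
    """Decide from a single tuple whether the chair is occupied."""
    return reading.pressure_kg >= OCCUPIED_THRESHOLD_KG

# Hypothetical samples emitted by the pressure sensor at a fixed interval.
samples = ["chair-42:0.0", "chair-42:63.5", "chair-42:64.1", "chair-42:0.3"]

states: List[bool] = [is_occupied(reading) for reading in source(samples)]
print(states)
```

The unique chair ID travels with every tuple, so the same logic could serve many chairs on the same network.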


   
   


 




Saturday, 9 January 2016

IBM InfoSphere Streams

In April of 2009, IBM made available a revolutionary product named IBM InfoSphere Streams (Streams). Streams is a product architected specifically to help clients continuously analyze massive volumes of streaming data at extreme speeds to improve business insight and decision making. Based on ground-breaking work from an IBM Research team working with the U.S. Government, Streams is one of the first products designed specifically for the new business, informational, and analytical needs of the Smarter Planet Era.


Overview of Streams

As the amount of data available to enterprises and other organizations dramatically increases, more and more companies are looking to turn this data into actionable information and intelligence in real time. Addressing these requirements requires applications that are able to analyze potentially enormous volumes and varieties of continuous data streams to provide decision makers with critical information almost instantaneously. Streams provides a development platform and runtime environment where you can develop applications that ingest, filter, analyze, and correlate potentially massive volumes of continuous data streams based on defined, proven, and analytical rules that alert you to take appropriate action, all within an appropriate time frame for your organization. The Streams product goes further by allowing the applications to be modified dynamically. Although there are other systems that embrace the stream computing paradigm, Streams takes a fundamentally different approach to how it performs continuous processing and therefore differentiates itself from the rest with its distributed runtime platform, programming model, and tools for developing continuously processing applications. The data streams that are consumable by Streams can originate from sensors, cameras, news feeds, stock tickers, or a variety of other sources, including traditional databases. The streams of input sources are defined and can be numeric, text, or non-relational types of information, such as video, audio, sonar, or radar inputs. Analytic operators are specified to perform their actions on the streams. The applications, once created, are deployed to the Streams Runtime.

Thursday, 2 July 2015

Hadoop Distributed File System

Hadoop Distributed File System (HDFS) is the file system that is used to store the data in Hadoop. How it stores data is special. When a file is saved in HDFS, it is first broken down into blocks, with any remaining data occupying the final block. The size of the block depends on the way that HDFS is configured. The default block size in Hadoop 1.x is 64 megabytes (MB); to improve performance for larger files, Hadoop 2.x and later raise the default to 128 MB per block. Then, each block is sent to a different data node and written to the hard disk drive (HDD). When the data node writes the block to disk, it then sends the data to a second data node, where the block is written again. When this process completes, the second data node sends the data to a third data node. The third node confirms the completion of the write back to the second, which confirms it back to the first. The NameNode is then notified and the block write is complete. After all blocks are written successfully, the result is a file that is broken down into blocks, with a copy of each block on three data nodes. The location of all of this data is stored in memory by the NameNode.
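The block splitting and three-way replication described above can be illustrated with a small Python sketch. It is a simplification under stated assumptions: the node names are made up, and replicas are assigned round-robin, whereas real HDFS chooses data nodes with its own placement policy.

```python
from typing import Dict, List

MB = 1024 * 1024

def split_into_blocks(file_size: int, block_size: int) -> List[int]:
    """Return the size of each block; the last block holds the remainder."""
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)
    return sizes

def place_replicas(num_blocks: int, data_nodes: List[str],
                   replication: int = 3) -> Dict[int, List[str]]:
    """Assign each block to `replication` distinct nodes (round-robin, assumed)."""
    if replication > len(data_nodes):
        raise ValueError("need at least as many data nodes as replicas")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            data_nodes[(block + i) % len(data_nodes)] for i in range(replication)
        ]
    return placement

# A hypothetical 300 MB file stored with a 128 MB block size.
blocks = split_into_blocks(300 * MB, 128 * MB)
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])

for block, nodes in placement.items():
    print(f"block {block}: {blocks[block] // MB} MB on {nodes}")
```

The `placement` dictionary plays the role of the block location map that the NameNode keeps in memory.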

http://techniquetechnology.blogspot.in/2015/07/hadoop-distributed-file-system.html

Scalability

Hadoop is designed to run on many commodity servers. The Hadoop software architecture also lends itself to be scalable within each server. HDFS can deal with individual files that are terabytes in size and Hadoop clusters can be petabytes in size if required. Individual nodes can be added to Hadoop at any time. The only cost to the system is the input/output (I/O) of redistributing the data across all of the available nodes, which ultimately might speed up access. The upper limit of how large you can make your cluster is likely to depend on the hardware that you have assembled. For example, the NameNode stores metadata in random access memory (RAM) that is roughly equivalent to a GB for every TB of data in the cluster.

Tuesday, 30 June 2015

MapReduce Technique : Hadoop Big Data

As a batch processing architecture, the major value of Hadoop is that it enables ad hoc queries to run against an entire data set and return results within a reasonable time frame. Distributed computing across a multi-node cluster is what allows this level of data processing to take place.
MapReduce applications can process vast amounts (multiple terabytes) of data in parallel on large clusters in a reliable, fault-tolerant manner. MapReduce is a computational paradigm in which an application is divided into self-contained units of work. Each of these units of work can be issued on any node in the cluster.

http://bigdataconcept.blogspot.in/2015/06/mapreduce-hadoop-big-data.html


A MapReduce job splits the input data set into independent chunks that are processed by map tasks in parallel. The framework sorts the map outputs, which are then input to reduce tasks. Job inputs and outputs are stored in the file system. The MapReduce framework and the HDFS (Hadoop Distributed File System) are typically on the same set of nodes, which enables the framework to schedule tasks on nodes that contain data.
The MapReduce framework consists of a single primary JobTracker and one secondary TaskTracker per node. The primary node schedules job component tasks, monitors tasks, and re-executes failed tasks. The secondary node runs tasks as directed by the primary node.

MapReduce is composed of the following phases:

i)Map
ii)Reduce

The map phase

The map phase is the first part of the data processing sequence within the MapReduce framework. Map functions serve as worker nodes that can process several smaller snippets of the entire data set. The MapReduce framework is responsible for dividing the data set input into smaller chunks, and feeding them to a corresponding map function. When you write a map function, there is no need to incorporate logic to enable the function to create multiple maps that can use the distributed computing architecture of Hadoop. These functions are oblivious to both data volume and the cluster in which they are operating. As such, they can be used unchanged for both small and large data sets (which is most common for those using Hadoop).

Important: Hadoop is a great engine for batch processing. However, if the data volume is small, the processor usage that is incurred by using the MapReduce framework might negate the benefits of using this approach.
Based on the data set that one is working with, a programmer must construct a map function to use a series of key/value pairs. After processing the chunk of data that is assigned to it, each map function also generates zero or more output key/value pairs to be passed forward to the next phase of the data processing sequence in Hadoop. The input and output types of the map can be (and often are) different from each other.
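As an illustration, here is a word-count style map function written in plain Python rather than against the Hadoop API. The function name `map_words` and the input of (line offset, line text) pairs are assumptions for the example. Note that the input pair is (number, text) while the output pairs are (text, number), and that an empty line produces zero output pairs.

```python
from typing import Iterator, List, Tuple

def map_words(offset: int, line: str) -> Iterator[Tuple[str, int]]:
    """Map one input pair (line offset, line text) to zero or more (word, 1) pairs."""
    for word in line.lower().split():
        yield (word, 1)

# A hypothetical chunk handed to one map task by the framework.
chunk: List[Tuple[int, str]] = [(0, "big data is big"), (16, "")]

pairs = [pair for offset, line in chunk for pair in map_words(offset, line)]
print(pairs)
```

The function contains no logic about chunking or the cluster; it only sees one input pair at a time, which is why the same code works unchanged on small and large data sets.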

The reduce phase

As with the map function, developers must also create a reduce function. The key/value pairs from map outputs must be routed to the appropriate reducer partition so that the final results are aggregates of the corresponding data. This process of moving map outputs to the reducers is known as shuffling. When the shuffle process is completed and the reducer has copied all of the map task outputs, the reducers can go into what is known as a merge process. During this part of the reduce phase, all map outputs are merged together in a way that maintains the sort ordering established during the map phase. This merge is done in rounds for performance optimization purposes. When the final merge is complete, the final reduce task consolidates the results for every key within the merged output, and the final result set is written to disk on HDFS.
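The shuffle, merge and reduce steps can be imitated in plain Python for a word count. This is a single-process sketch, not Hadoop code: the names `shuffle` and `reduce_counts` are made up for the example, there is only one reducer, and the map outputs are hard-coded instead of coming from real map tasks.

```python
from itertools import groupby
from operator import itemgetter
from typing import Dict, Iterable, List, Tuple

Pair = Tuple[str, int]

def shuffle(map_outputs: Iterable[List[Pair]]) -> List[Pair]:
    """Merge the outputs of all map tasks and sort by key so equal keys are adjacent."""
    merged = [pair for output in map_outputs for pair in output]
    return sorted(merged, key=itemgetter(0))

def reduce_counts(key: str, values: Iterable[int]) -> Pair:
    """Reduce: consolidate all values seen for one key into a single result."""
    return key, sum(values)

# Hypothetical outputs of two map tasks.
map_outputs = [
    [("big", 1), ("data", 1), ("is", 1), ("big", 1)],
    [("data", 1), ("flows", 1)],
]

results: Dict[str, int] = {}
for key, group in groupby(shuffle(map_outputs), key=itemgetter(0)):
    word, total = reduce_counts(key, (value for _, value in group))
    results[word] = total

print(results)
```

In a real cluster the pairs would be partitioned by key across several reducers, and each reducer would write its part of the result set to HDFS.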

Development languages: Java is a common language that is used to develop these functions. However, there is support for a host of other development languages and frameworks, which include Ruby, Python, and C++.

Sunday, 28 June 2015

Operational Vs Analytical : Big Data Technology

Two kinds of technology are used in Big Data: operational and analytical. Operational capabilities include capturing and storing data in real time, whereas analytical capabilities include complex analysis of all the data. The two are complementary and hence are deployed together.




Operational and analytical Big Data technologies have different requirements, and different architectures have evolved to address them. Operational systems include NoSQL databases, which deal with responding to many concurrent requests. Analytical systems focus on complex queries which touch almost all the data. Both systems work in tandem and manage hundreds of terabytes of data spanning billions of records.

Operational Big Data

For operational Big Data, NoSQL is generally used. It was developed to address the shortcomings of traditional databases: it is faster and can deal with large quantities of data spread over multiple servers. Cloud computing architectures are also used, since they allow massive computations to run effectively and cost-efficiently. This has made Big Data workloads easier to manage, faster to implement and cheaper.
In addition to handling interaction with the user, operational systems also provide intelligence about the active data. For example, in games the moves of the user are studied and the next course of action is suggested. NoSQL systems can analyse real-time data and draw conclusions from it.

Analytical Big Data

Analytical Big Data is addressed by massively parallel processing (MPP) database systems and MapReduce. These technologies evolved in response to the shortcomings of traditional databases, which run on a single server only. MapReduce, in addition, provides a new method of analyzing data that goes beyond the scope of SQL.

As the volume of data generated by users increases, the real-time analytical workload has also increased. MapReduce has therefore emerged as the first choice for Big Data analytics. NoSQL systems also provide limited MapReduce capabilities, but generally we prefer copying the data from the NoSQL system to an analytical system such as Hadoop for MapReduce.

SpaceX's Falcon 9 rocket explodes.

On 28 June 2015, SpaceX's Falcon 9, an unmanned rocket bound for the International Space Station, exploded just minutes after lift-off. NASA officials are not sure what caused the explosion and are investigating the matter.

The rocket was launched from Cape Canaveral, Fla. at 10:21 a.m. Things were going smoothly when, after about 2 minutes, it suddenly exploded. It was carrying more than 4000 pounds of food and supplies to the space station. The flight was unmanned, so no astronauts were on board. American astronaut Scott Kelly is on the space station, and the supplies were for him.

Two earlier supply launches to the space station had also failed, although neither was a SpaceX flight: Orbital's Antares rocket and the Russian Progress 59.