Big Data

Introduction

Big Data refers to datasets whose characteristics exceed the ability of traditional data architectures to handle them efficiently. The characteristics of Big Data that force new architectures are:

  • Volume (size of the dataset)
  • Variety (data from different sources)
  • Velocity (rate of flow)
  • Variability (change in the other characteristics over time)

Descriptive analytics

  • Data aggregation – grouping, summing, averaging (see the sketch after this list)
  • Document search and retrieval – indexing, ranking, clustering
  • Detecting patterns in data – time series analysis and anomaly detection
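
As a small illustration of the aggregation case, here is a sketch using the PySpark DataFrame API (covered later in these notes), with made-up sales data:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("aggregation-demo").getOrCreate()

# Hypothetical sales records: (region, amount)
sales = spark.createDataFrame(
    [("north", 120.0), ("north", 80.0), ("south", 200.0)],
    ["region", "amount"],
)

# Grouping, summing, and averaging in a single aggregation
sales.groupBy("region").agg(
    F.sum("amount").alias("total"),
    F.avg("amount").alias("average"),
).show()
```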

Predictive analytics

Scalability – a system's ability to efficiently utilise increased processing resources.

  • Scaling out – adding more nodes to a distributed system
  • Scaling up – adding more resources to a single node in the system

Speed-up – ideally, speed increases in proportion to the number of processors (linear speed-up), as illustrated below.
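
A minimal sketch of that claim in Python, with hypothetical timings:

```python
def speedup(t_one_node: float, t_p_nodes: float) -> float:
    """Speed-up S = T(1) / T(p); ideal (linear) scaling gives S == p."""
    return t_one_node / t_p_nodes

# Hypothetical measurements: 100 s on 1 node, 26 s on 4 nodes
print(speedup(100.0, 26.0))  # ~3.85, just short of the ideal speed-up of 4
```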

MapReduce

MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster. The map step performs filtering and sorting, and the reduce step produces the summary result. Implementations include Google's MapReduce and Apache Hadoop.

The MapReduce framework partitions the data and assigns portions of it to workers, which operate in parallel. A master process then coordinates the many operations that need to be carried out, as sketched below.
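
A sketch of the model itself in plain Python (not the Hadoop or Google API), using word count as the classic example:

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Map: emit an intermediate (key, value) pair for every word
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle: group intermediate values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Reduce: summarise each group; here the summary is a sum
    return {key: sum(values) for key, values in groups.items()}

# Each partition would go to a different worker; the master coordinates
partitions = ["the quick brown fox", "the fox jumps"]
intermediate = chain.from_iterable(map_phase(p) for p in partitions)
print(reduce_phase(intermediate))
# {'the': 2, 'quick': 1, 'brown': 1, 'fox': 2, 'jumps': 1}
```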

Problems with MapReduce

  • Distributed programming is difficult
  • Network transfer is a limiting factor
  • Need to be resistant to single node failure

Cloud computing

  • SaaS – applications delivered over the internet as services
  • The cloud – data centre hardware and systems software
  • Public clouds
    • Pay as you go
    • AWS (Amazon Web Services), Microsoft Azure, Google App Engine
  • Private clouds
    • Internal data centres
    • Normally not included under cloud computing

These platforms offer different amounts of built-in management: Amazon EC2 is the lowest level (the user manages most of the software stack themselves), Microsoft Azure sits in the middle, and Google App Engine provides the most management by the platform, at the cost of flexibility.

PySpark

Spark Streaming represents a data stream as a sequence of small batches called a discretized stream (DStream). DStreams can be created from various input sources such as Flume, Kafka or HDFS.
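
A minimal sketch of the idea, assuming the legacy pyspark.streaming API and a hypothetical socket source on localhost:9999 (the Flume, Kafka and HDFS connectors follow the same pattern):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "StreamingWordCount")
ssc = StreamingContext(sc, 5)  # a DStream batch every 5 seconds

# Hypothetical source: lines of text arriving on a local socket
lines = ssc.socketTextStream("localhost", 9999)

# Each transformation is applied per batch of the DStream
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()  # print the first counts of every batch

ssc.start()             # begin receiving and processing data
ssc.awaitTermination()  # run until stopped
```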