Big Data

Story Horse?

I was talking with some friends in IT and found that most did not know what “SNA” means, yet at the same time I have seen several articles claiming that “Big Data” is the future. “Big Data” is a hot topic but, in many ways, we are still trying to define what the phrase means. Are “Big Data” and its related technologies the future? (This is a topic for a new post.) For now, I’ll just give a brief explanation of this new world.
Or is it just another buzzword?

OK, SNA means “Social Network Analysis”.

A Brief Introduction

Social network analysis: A methodological introduction

Big Data

What it is and why it matters

To complete the alphabet soup, we will look at two more terms: “NoSQL” and “Hadoop”.

Martin Fowler: NoSQL


Facebook claims to have over 30 petabytes of information. If you were to store all of this data on 1TB hard disks and stack them on top of one another, you would have a tower twice as high as the Empire State Building in New York. The height of this tower illustrates that processing and analyzing such data needs to be distributed across multiple machines rather than done on a single system. However, this kind of processing has always been very complex, and much time is spent solving recurring problems, such as processing in parallel, distributing data to the compute nodes, and, in particular, handling errors during processing. To free developers from these repetitive tasks, Google introduced the “MapReduce” framework.
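The tower comparison can be checked with some rough arithmetic. This is only a back-of-the-envelope sketch: the disk height (about 26.1 mm for a typical 3.5-inch drive) and the building height (about 381 m to the roof, without the antenna) are my own assumptions, not numbers from the source of the claim.

```python
# Back-of-the-envelope check of the "tower of disks" comparison.
# DISK_HEIGHT_M and EMPIRE_STATE_ROOF_M are assumptions for illustration.
DATA_PB = 30                # from the text: "over 30 petabytes"
DISK_TB = 1                 # from the text: 1TB hard disks
DISK_HEIGHT_M = 0.0261      # assumed: a typical 3.5-inch drive is ~26.1 mm tall
EMPIRE_STATE_ROOF_M = 381   # assumed: height to the roof, without the antenna

disks = DATA_PB * 1000 // DISK_TB          # 1 PB = 1000 TB (decimal units)
tower_m = disks * DISK_HEIGHT_M            # height of the stacked disks
ratio = tower_m / EMPIRE_STATE_ROOF_M      # how many buildings tall

print(f"{disks} disks, about {tower_m:.0f} m, about {ratio:.1f}x the building")
```

Under these assumptions the stack comes out at roughly twice the height of the building, which matches the claim.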

An example of MapReduce

Google developed an abstraction layer that splits the data flow into two main phases: the map phase and the reduce phase. In a style similar to functional programming, computations can take place in parallel on multiple computers in the map phase. The same thing also applies to the reduce phase, so that MapReduce applications can be massively parallelized on a computer cluster.
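To make the two phases a bit more concrete, here is a minimal word count sketch in plain Python. It runs on a single machine and only imitates the data flow; it is not Google's or Hadoop's API, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are just names I picked for illustration. In a real cluster, the framework would run the map calls on many nodes in parallel and handle the grouping step between the phases.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group all emitted values by key (done by the framework between phases)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Each map_phase call is independent, so it could run on a different machine.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    # Each key is reduced independently, so this step can be parallelized too.
    return dict(reduce_phase(key, values) for key, values in grouped.items())


documents = ["big data is big", "data about data"]
counts = word_count(documents)
print(counts)  # {'big': 2, 'data': 3, 'is': 1, 'about': 1}
```

The point of the split is that neither `map_phase` nor `reduce_phase` shares any state with other calls, which is what lets a framework distribute them freely.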
MapReduce became popular with its open source implementation, Apache Hadoop. Hadoop can be installed on standard hardware and possesses excellent scaling characteristics, which means you can run it on a cluster and then extend the cluster dynamically by adding more computers.
The Apache Hadoop project is an open source implementation of Google’s distributed filesystem (Google File System, GFS) and the MapReduce framework. One of the most important supporters promoting this project over the years is Yahoo. Today, many other well-known enterprises, such as Facebook and IBM, as well as an active community, contribute to its development.

Hadoop has an ecosystem of its own (but again, this is a topic for a new post).

More alphabet soup:
HDFS, Hive, HBase, Cassandra, Pig, ZooKeeper, Phoenix, Oozie, Presto and more … a world without end!

The era of Polyglot Persistence has begun
(again this is a topic for a new post.)

Models and Methods in Social Network Analysis – John Scott;
Social Network Analysis: Methods and Applications – Stanley Wasserman;
Social Network Analysis (Quantitative Applications in the Social Sciences) – David Knoke


One thought on “Big Data”

  1. Maurício Reis

    Nice article for getting an overview of the terms and technologies of the Big Data universe. Lots of technologies to read about.

    I just think the article is missing the term Data Scientist, the person who transforms all this data into valuable information for companies. This is not an exclusively Big Data related term, but I think it is worth mentioning because its importance grew exponentially together with the Big Data field.

    But I think that, together with Data Science, another alphabet soup comes along, so maybe this is a subject for another post as well.

