Author Archives: igfasouza

About igfasouza

My name is Igor Souza … I am a Software Engineer with over eight years’ experience in Java development. I’ve worked in various sectors with the Java platform and I have experience with several frameworks. I have worked as a Systems Analyst, Developer, Java J2EE designer, technical lead and Scrum Master. For the last two years I have worked with SNA (Social Network Analysis) and Big Data (the Hadoop ecosystem). Brazilian geek who likes to play with Android and Arduino in my free time. Sepultura fan and hockey player. Based in Dublin.

Hack.guides() 2016 Tutorial Contest – I support

Story Horse?

Today I want to share the Hack.guides() 2016 Tutorial Contest, a tutorial competition.

It’s a good idea to share knowledge and help the community. I really liked the idea, so I decided to participate and support it. I’ve already published my tutorial there and I’m thinking of doing another one. I wrote about Red Sqirl:

You can see more about Red Sqirl here:


Red Sqirl – first step tutorial with Pokemon

Story Horse?

Today I’ll show you how to take your first steps with Red Sqirl.

This tutorial is based on this Docker image:

I’m going to show you how to start a basic ETL using Red Sqirl with a major trending topic of the moment: Pokemon Go.

I just Googled and found a list of Pokemon:

and a list of all the Pokemon inside the Pokemon Go game:

I just copied the tables into two CSV files, pokemon.csv and pokemonGo.csv. I’ve also removed the special characters and images.
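If you’d rather script that cleanup than do it by hand, here is a minimal Python sketch. It just strips non-ASCII characters (the leftover gender symbols and image junk from the copy-paste) from each cell; the sample data is invented for illustration.

```python
import csv
import io
import re

def clean_rows(raw_text):
    """Strip non-ASCII characters (leftover symbols/images) from each cell."""
    cleaned = []
    for row in csv.reader(io.StringIO(raw_text)):
        cleaned.append([re.sub(r"[^\x20-\x7E]", "", cell).strip() for cell in row])
    return cleaned

# A tiny sample with the kind of junk the copy-paste leaves behind.
raw = "Bulbasaur\u2640,Grass/Poison,318\nIvysaur\u2605,Grass/Poison,405\n"
rows = clean_rows(raw)
```

Writing the cleaned rows back out with `csv.writer` gives you the two files used below.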

To start, we are going to copy one file into the Docker container and then copy that file to Hadoop.

sudo docker ps // this shows the container ID, for example 3d7ac2fccb23

sudo docker inspect 3d7ac2fccb23

sudo docker cp pokemon.csv 3d7ac2fccb23:/tmp

Now you have the file inside the Docker container.

Now on Red Sqirl click on Remote File System.

  • Click the plus symbol to create a new ssh connection to a remote server.
  • Host : localhost
  • Port : (do not select)
  • Password : give your password (in this case it is redsqirl)
  • Save (check this if you want to save)
  • Now we are able to see the file.
  • Click the plus symbol to create a new directory on the Hadoop file system and give it the name pokemon.mrtxt.
  • Now drag and drop the file from Remote to HDFS.
  • Now you can do the same with the other file.
  • Create a folder called pokemonGo.mrtxt.

Create a Workflow
Setup a Source Action

This task will show you how the source “action” can be configured to select flat files and change properties such as the file’s delimiter, headings and types.

In the Pig footer drag a new Pig source icon onto the canvas.

  • Double click to open the source.
  • Name the action “pokemon”.
  • Comment the action “this is a tutorial using Pokemon data”.
  • Click OK.
  • On the data set screen, click on the path field or on the button.
  • Click on the radio button beside “pokemon.mrtxt” – if you cannot find it, refresh the view by clicking on the search button, or navigate the file system.
  • At this stage you will see the data correctly displayed on the screen; the field names are “Field1 string, Field2 string…”.
  • On the feature title line, click on the edit button.
  • Once it appears, choose “Change Header”.
  • Copy and paste “Name STRING,Type STRING,Total INT,HP INT,Attack INT,Defence INT,Spatk INT,Spdef INT,Speed INT” into the value field.
  • Click OK. You will get confirmation that the header is correct.
  • Click OK to exit the configuration window.
  • If you leave the mouse cursor on the source action you will be able to see some configuration details.
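The header string above follows a simple comma-separated “field TYPE” format. A quick Python sketch of how such a string maps to (name, type) pairs – illustrative only, not Red Sqirl’s actual parser:

```python
header = ("Name STRING,Type STRING,Total INT,HP INT,Attack INT,"
          "Defence INT,Spatk INT,Spdef INT,Speed INT")

# Split each comma-separated entry into a (field name, type) pair.
fields = [tuple(part.strip().split()) for part in header.split(",")]
```

So `fields[0]` is `("Name", "STRING")`, giving nine typed columns that match the pokemon.csv data.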

Now repeat the same steps, but call this action “pokemonGo” and select pokemonGo.mrtxt as the source. The header is “NAME STRING, TYPE1 STRING, TYPE2 STRING”.

With two sources on the canvas we can do the Pig join.

Perform a Pig Join Action

Drop a Pig join onto the canvas.

  • Create a link from “pokemon” to the new Pig join action.
  • Create a link from “pokemonGo” to the new Pig join action.
  • Double click the Pig join and call it “pokemonData”.
  • The first page lists the table aliases; click next.
  • On the following page, make sure that “copy” is selected as the generator and click OK.
  • Here we remove name and type from the pokemon table.
  • Click next.
  • This page has two interactions that specify the join type and the fields to join on; we use the default join type, “Join”, so this does not need to be changed.
  • In the “Join Field” column, type “” and “”. This condition will join the two tables together.
  • Click next.
  • Click next on the sorting page.
  • Click OK on the final page.
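Conceptually, this is an inner join of the two tables on the Pokemon name. A rough Python sketch of the same operation (the join keys and sample rows here are my assumptions, based on the headers defined earlier):

```python
# Stats table (header: Name, Type, Total, ...) – trimmed for the example.
pokemon = [
    {"Name": "Pikachu", "Type": "Electric", "Total": 320},
    {"Name": "Mew", "Type": "Psychic", "Total": 600},
]
# Pokemon Go table (header: NAME, TYPE1, TYPE2).
pokemon_go = [
    {"NAME": "Pikachu", "TYPE1": "Electric", "TYPE2": ""},
]

# Inner join on the name, dropping the duplicated name/type columns
# from the stats table (as in the "copy" generator step above).
joined = [
    {**{k: v for k, v in p.items() if k not in ("Name", "Type")}, **g}
    for p in pokemon
    for g in pokemon_go
    if p["Name"] == g["NAME"]
]
```

Only Pikachu appears in both tables, so the result is a single row combining its stats with its Pokemon Go types.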

Now you have a simple Pokemon Go dataset to start your data analysis …


Red Sqirl – Overview

Story Horse?

Today I want to show a little bit of the project that I have been working on for the last 3 years, called “Red Sqirl”.

Red Sqirl is a web based big data application that simplifies the analysis of large data sets.

I’m going to talk a little bit about the architecture, but you can have a look here: to see all the other details.

Red Sqirl is a web application that you can install directly on top of a Hadoop cluster. It is currently available only for Tomcat.
It uses Tomcat as a web service, but when you log in, it creates another process owned by the logged-in user and makes key components available over RMI. Every action in the application runs through the user’s process to avoid permission conflicts.

Architecture – a Java web application based on the JSF framework and HTML5. You can see all the source on GitHub.

Red Sqirl provides a drag & drop view. The user drags objects onto a canvas to build a workflow. The technology here is kinetic.js – you can see a basic intro:

The canvas is where a workflow is contained, which Red Sqirl uses to manage a job’s processes or flow. A workflow is built up of processes that chain together and perform actions that produce a desired output. It is a way of managing a job so that each aspect of the job can be modified to use the desired parameters.

Basically, the user double-clicks an object and fills in configuration parameters to perform a task. These tasks are submitted to Oozie, and Oozie manages workflows so they can run in parallel with other jobs.
More about Oozie –

Red Sqirl runs in parallel with the Hadoop platform and other Hadoop technologies. The main technologies for storage are Hive and HDFS. Jobs can be saved and reused in the future. Saved jobs can be opened and modified to run with different parameters. The output of these jobs is saved to the appropriate storage facility (Hive or HDFS).

Hadoop is a distributed system that allows MapReduce processes to be run over the data stored in these technologies.
More about Hadoop –

Once you finish, you can share your canvas. We call this a Model. A model can be shared in a marketplace with other users.
You just need to fill in a form with some information about it and upload the zip file.

Red Sqirl is extensible, so you can create a new plug-in for a new technology. We call this a Package. Packages are groups of actions that are used to perform specific processes in Red Sqirl. You can see here how to install or upload a Package: and here how to create one:

You can see all Models and Packages here:

Red Sqirl supports all the trending Hadoop ecosystem APIs. The idea is an online tool where you can do data analysis in a simple way. You don’t need to know Pig syntax or Spark syntax to use them.

In the future I’ll do a first-steps Red Sqirl tutorial.

Software Craftsmanship

Story Horse?

Software Craftsmanship is a metaphor that can radically transform the way we create and deliver software systems, with implications for the way we develop software, manage teams and deliver value to users. It is an approach to software development that emphasizes the coding skills of the developers themselves.

The software craftsmanship movement talks about practising as a way to develop programming skills and become software craftsmen. Technical practices are considered important; it takes time to learn them and become better programmers.

The manifesto for software craftsmanship

The book Clean Code: A Handbook of Agile Software Craftsmanship (Robert C. Martin) is an excellent place to start if you haven’t read it. I’m not the first and definitely not the last to compare coding to craftsmanship, both in today’s world and at previous times in history.

I would suggest reading Software Craftsmanship: The New Imperative

Doing a Google search, I found Building Software Craftsmen:
“To become craftsmen, programmers need to gain real-world experience and practical applications of knowledge.”

How can programmers develop their skills to become software craftsmen?
I Googled again and found “Why I Don’t Do Code Katas”:
“if you want to get better at something, repeating practice alone is not enough. You must practice with increased difficulty and challenge.”

“A craftsperson is someone who not only creates something from nothing from materials of their choice, but usually puts a part of themselves into what they make.”

The Codesmith

Anyone Can Be A Codesmith


Top 10 Books for Developers

Story Horse?

Recently I saw the website “41 Websites Every Java Developer Should Bookmark” and had the idea to make my own list – a list of top books. I did some research and found this.

My top ten list includes many of the same books as his list, but mine has a few that are different.

A list of books that every programmer should read.

Domain-driven Design: Tackling Complexity in the Heart of Software – Eric Evans

Patterns of Enterprise Application Architecture – Martin Fowler

Refactoring: Improving the Design of Existing Code – Martin Fowler

Clean Code – Robert C. Martin

The Clean Coder – Robert C. Martin

Design Patterns: Elements of Reusable Object-Oriented Software – Erich Gamma

The Pragmatic Programmer – Andrew Hunt

Refactoring to Patterns – Joshua Kerievsky

Head First Design Patterns – Kathy Sierra

Enterprise Integration Patterns: Designing, Building, and Deploying Messaging Solutions – Gregor Hohpe

Startup: the basics

Story Horse?

I was talking with some friends about startups and discovered that I have some good reference materials for people who are interested in the area.

First of all, what is a Startup?

Top 10 books

Business Model Generation – Alexander Osterwalder –

The Lean Startup – Eric Ries –

Running Lean – Ash Maurya –

The Four Steps to the Epiphany – Steve Blank

The Entrepreneur’s Guide to Customer Development – Brant Cooper –

The Startup Owner’s Manual – Steve Blank and Bob Dorf

Art of the Start – Guy Kawasaki –

The Other Side of Innovation – Vijay Govindarajan –

Sua ideia ainda não vale nada –

O Livro Negro do Empreendedor – Fernando Trías

Links I recommend

Blog –

Blog steveblank –

Blog Venture Hacks –

Blog Startup Marketing –

Blog OnStartups –

Blog For Entrepreneurs –

Blog instigator –

Blog A Smart Bear –

Blog Startup Lessons Learned –

Motivational Video –

Big Data

Story Horse?

I was talking with some friends in IT and found that most did not know what “SNA” means, but at the same time I have seen several articles saying “Big Data” is the future. “Big Data” is a hot topic but, in many ways, we are still trying to define what the phrase “Big Data” means. Could “Big Data” and related fields be the future? (this is a topic for a new post.) I’ll just give a brief explanation of this new world.
Could it be just another buzzword?

OK, SNA means “Social Network Analysis”.

A Brief Introduction

Social network analysis: A methodological introduction

Big Data

What it is and why it matters

To complete the alphabet soup, we will see two more acronyms: “NoSQL” and “Hadoop”.

Martin Fowler on NoSQL


Facebook claims to have over 30 petabytes of information. If you were to store all of this data on 1TB hard disks and stack them on top of one another, you would have a tower twice as high as the Empire State Building in New York. The height of this tower illustrates that processing and analyzing such data needs to take place as a distributed process on multiple machines rather than on a single system. However, this kind of processing has always been very complex, and much time is spent solving recurring problems, like processing in parallel, distributing data to the compute nodes and, in particular, handling errors during processing. To free developers from these repetitive tasks, Google introduced the “MapReduce” framework.
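The back-of-the-envelope arithmetic behind that tower image works out roughly like this (the drive thickness and building height are my assumed figures):

```python
data_tb = 30_000           # 30 PB expressed in (decimal) terabytes
drive_thickness_m = 0.026  # ~2.6 cm per 3.5-inch hard disk
empire_state_m = 381       # Empire State Building, height to the roof

tower_m = data_tb * drive_thickness_m  # one 1 TB drive per terabyte
ratio = tower_m / empire_state_m       # roughly 2x the building
```

With these assumptions the stack comes out at around 780 metres, about twice the height of the building.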

An example of MapReduce

Google developed an abstraction layer that splits the data flow into two main phases: the map phase and the reduce phase. In a style similar to functional programming, computations can take place in parallel on multiple computers in the map phase. The same thing also applies to the reduce phase, so that MapReduce applications can be massively parallelized on a computer cluster.
MapReduce’s popularity grew with its Apache Hadoop open source implementation. Hadoop can be installed on standard hardware and has excellent scaling characteristics, which means you can run it on a cluster and then extend the cluster dynamically by purchasing more computers.
The Apache Hadoop project is an open source implementation of Google’s distributed filesystem (Google File System, GFS) and the MapReduce framework. One of the most important supporters promoting this project over the years has been Yahoo. Today, many other well-known enterprises, such as Facebook and IBM, as well as an active community, contribute to its development.
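To make the map and reduce phases concrete, here is a minimal single-machine sketch of the classic word count in Python. Real MapReduce distributes these phases across a cluster (with shuffling between them); this only illustrates the data flow.

```python
from collections import defaultdict

def map_phase(documents):
    """Emit (word, 1) pairs; in Hadoop each mapper handles one input split."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Group pairs by key and sum; in Hadoop each reducer handles one key range."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(map_phase(["big data", "big hadoop"]))
```

Here `counts` ends up as `{"big": 2, "data": 1, "hadoop": 1}` – the same computation Hadoop would run in parallel over many machines.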

Hadoop has an ecosystem (but again, this is a topic for a new post.)

More alphabet soup:
HDFS, Hive, HBase, Cassandra, Pig, ZooKeeper, Phoenix, Oozie, Presto and more … a world without end!

The era of Polyglot Persistence has begun
(again this is a topic for a new post.)

Models and Methods in Social Network Analysis – John Scott
Social Network Analysis: Methods and Applications – Stanley Wasserman
Social Network Analysis (Quantitative Applications in the Social Sciences) – David Knoke