Today I’ll show you how to take your first steps with Red Sqirl.
This tutorial is based on this Docker image: https://hub.docker.com/r/redsqirl/cloudera/
I’m going to show you how to start a basic ETL in Red Sqirl, using a major trending topic at the moment: Pokemon Go.
I Googled and found a list of all Pokemon: http://pokemondb.net/pokedex/all
and a list of all the Pokemon in the Pokemon Go game: http://www.pokemongodb.net/2016/05/pokemon-go-pokedex.html
I copied the tables into two CSV files, pokemon.csv and pokemonGo.csv. I also removed the special characters and images.
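I did the cleanup by hand, but if you prefer to script it, here is a minimal sketch of one possible approach. It assumes "special characters" means anything outside plain ASCII; the function name strip_special is my own and is not part of Red Sqirl.

```python
import unicodedata


def strip_special(text):
    """Return text reduced to plain ASCII (hypothetical helper).

    Accented letters are decomposed first so that the base letter survives;
    symbols with no ASCII equivalent are dropped.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii").strip()


# Example: clean every cell of one row before writing it to the CSV file.
row = ["Nidoran\u2640", "Poison"]
clean_row = [strip_special(cell) for cell in row]
```

Apply it to each cell while you write the CSV file, so that the names in both files are cleaned in the same way and still match in the join later on.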
To start, we are going to copy one file into the Docker container and then copy that file to Hadoop.
sudo docker ps   # lists the running containers; note the container ID (for example 3d7ac2fccb23)
sudo docker inspect 3d7ac2fccb23
sudo docker cp pokemon.csv 3d7ac2fccb23:/tmp
Now you have the file inside the Docker container.
Now, in Red Sqirl, click on Remote File System.
- Click the plus symbol to create a new SSH connection to a remote server.
- Host : localhost
- Port : (do not select)
- Password : Enter your password (in this case it is redsqirl)
- Save (check this if you want to save)
- Now we are able to see the file.
- Click the plus symbol to create a new directory on the Hadoop file system and name it pokemon.mrtxt
- Now drag and drop the file from Remote to HDFS.
- Now you can do the same for the other file.
- Create a folder called pokemonGo.mrtxt
Create a Workflow
Set Up a Source Action
This task shows how the source action can be configured to select flat files and to change properties such as the file’s delimiter, headings and field types.
In the Pig footer, drag a new Pig source icon onto the canvas.
- Double click to open source.
- Name the action “pokemon”.
- Comment the action “this is a tutorial using Pokemon data”.
- Click OK.
- On the data set screen, click on the path field or on the button.
- Click on the radio button beside “pokemon.mrtxt”. If you cannot find it, refresh the view by clicking on the search button, or navigate through the file system.
- At this stage, you will see the data correctly displayed on the screen. The field names are “Field1 string, Field2 string…”
- On the feature title line, click on the edit button.
- Once it appears, choose “Change Header”.
- Copy and paste “Name STRING,Type STRING,Total INT,HP INT,Attack INT,Defence INT,Spatk INT,Spdef INT,Speed INT” into the value field.
- Click OK. You will have the confirmation that the Header is correct.
- Click OK to exit from the Configuration window.
- If you leave the mouse cursor on the source action, you will be able to see some configuration details.
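To make it clearer what the header string does, here is a rough Python sketch of the idea: each “name TYPE” pair gives a field its name and tells the tool how to read the value. This only illustrates the concept and is not how Red Sqirl or Pig implement it. The function names, the comma delimiter and the sample row are my own assumptions.

```python
HEADER = ("Name STRING,Type STRING,Total INT,HP INT,Attack INT,"
          "Defence INT,Spatk INT,Spdef INT,Speed INT")

# Assumed mapping from the header's type names to Python types.
CASTS = {"STRING": str, "INT": int}


def parse_header(header):
    """Split 'Name STRING,Total INT' into [('Name', 'STRING'), ('Total', 'INT')]."""
    schema = []
    for part in header.split(","):
        name, type_name = part.split()
        schema.append((name, type_name.upper()))
    return schema


def cast_row(values, schema):
    """Pair each raw value with its field name and convert it to the declared type."""
    return {name: CASTS[type_name](value)
            for (name, type_name), value in zip(schema, values)}


schema = parse_header(HEADER)
# Sample line typed by hand for illustration; the delimiter is assumed to be a comma.
row = cast_row("Bulbasaur,Grass Poison,318,45,49,49,65,65,45".split(","), schema)
```

With the default “Field1 string, Field2 string…” header every value stays a string, which is why the header is changed before doing anything numeric with the data.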
Now repeat the same steps for the second file: name this action “pokemonGo” and select pokemonGo.mrtxt as the source. The header is “NAME STRING, TYPE1 STRING, TYPE2 STRING”.
With two sources on the canvas, we can do the Pig join.
Perform a Pig Join Action
Drop a Pig join onto the canvas.
- Create a link from “pokemon” to the new pig join action.
- Create a link from “pokemonGo” to the new pig join action.
- Double click the Pig join and call it “pokemonData”.
- The first page lists the table aliases; click next.
- On the following page, make sure that “copy” is selected as the generator and click OK.
- Here we remove the name and type fields of the pokemon table.
- Click next.
- This page has two interactions that specify the join type and the fields to join on. We use the default join type, “Join”, so it does not need to be changed.
- In the “Join Field” column, type “pokemon.name” and “pokemonGo.name”. This condition joins the two tables together.
- Click next.
- Click next on the sorting page.
- Click OK on the final page.
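If you want to picture what the join produces, here is a small Python sketch of the same idea, using a couple of hand-written sample rows. It assumes the default “Join” type keeps only the names found in both tables (an inner join). The function inner_join and its parameters are hypothetical names of mine, not something Red Sqirl or Pig provides.

```python
def inner_join(left_rows, right_rows, left_key, right_key, drop_from_left=()):
    """Keep only the rows whose key appears in both tables and merge their fields."""
    # Index the right-hand table by its join key for quick lookups.
    by_key = {}
    for right in right_rows:
        by_key.setdefault(right[right_key], []).append(right)

    joined = []
    for left in left_rows:
        for right in by_key.get(left[left_key], []):
            # Drop the fields removed on the field-selection page.
            merged = {k: v for k, v in left.items() if k not in drop_from_left}
            merged.update(right)
            joined.append(merged)
    return joined


# Hand-written sample rows, for illustration only.
pokemon = [
    {"Name": "Bulbasaur", "Type": "Grass Poison", "Total": 318},
    {"Name": "Chikorita", "Type": "Grass", "Total": 318},
]
pokemon_go = [
    {"NAME": "Bulbasaur", "TYPE1": "Grass", "TYPE2": "Poison"},
]

pokemon_data = inner_join(pokemon, pokemon_go, "Name", "NAME",
                          drop_from_left=("Name", "Type"))
```

In this sample, Chikorita has no matching row in the second table, so it does not appear in the joined result.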
Now you have a simple Pokemon Go dataset to start your data analysis.