28/06/2013

[Python] Map/Reduce examples

These are simple algorithms written following the Map/Reduce model in Python. To successfully run them, you must download this archive containing:
  • MapReduce.py and MapReduce.pyc: a simple implementation of the Map/Reduce programming model in Python
  • books.json, dna.json, friends.json, matrix.json and records.json: simple datasets
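
I won't reproduce MapReduce.py here, so as a self-contained stand-in the sketch below shows the general map / shuffle / reduce pattern on a toy word count. The function names and the tiny dataset are purely illustrative and may not match the module's actual interface:

# Self-contained sketch of the Map/Reduce pattern (word count); it does not
# depend on the MapReduce.py module shipped with the archive.
from collections import defaultdict

def mapper(record):
    # record is a (document_id, text) pair; emit one (word, 1) pair per word
    doc_id, text = record
    for word in text.split():
        yield word, 1

def reducer(key, values):
    # sum all the partial counts collected for the same word
    return key, sum(values)

def run(records):
    # "shuffle" phase: group the intermediate values by key
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [reducer(key, values) for key, values in groups.items()]

if __name__ == "__main__":
    data = [("doc1", "the quick brown fox"), ("doc2", "the lazy dog")]
    print(run(data))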

[SQL] Matrix multiplication

The second assignment in my Coursera Big Data course concerned data analysis in SQL using SQLite3. We used a test dataset, which you can download from here, composed of:
  • matrix.db: a simple database representing two square sparse matrices, stored as tables A and B, each with row_num, col_num and value columns
  • reuters.db: a database containing a single table frequency(docid, term, count), where docid is an identifier corresponding to a particular file, term is an English word, and count is the number of times the term appears within the document indicated by docid

I will not post all the scripts I produced, since many were just simple SELECTs; instead, let's focus on the most interesting part: matrix multiplication in SQL.

It sounds more complicated than it really is: given the two sparse matrices, their product AxB can be computed with a simple JOIN of A's columns with B's rows, GROUPing BY A's rows and B's columns, and finally SELECTing, for each cell of the result, the SUM of the products of the matching values:

SELECT a.row_num, b.col_num, SUM(a.value*b.value)
FROM A a JOIN B b ON a.col_num=b.row_num
GROUP BY a.row_num, b.col_num;

This is exactly the matrix multiplication formula, C[i,j] = SUM over k of A[i,k] * B[k,j], expressed in SQL: the JOIN condition pairs up the values that share the same k, the GROUP BY fixes i and j, and the SUM adds up the products.
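
Outside of the sqlite3 shell, the same query can be run from Python with the standard sqlite3 module. A quick, illustrative runner (assuming matrix.db sits in the working directory with the A and B tables described above) could look like this:

# Run the sparse matrix multiplication query against matrix.db and print
# the computed cells of the product A x B.
import sqlite3

QUERY = """
SELECT a.row_num, b.col_num, SUM(a.value * b.value)
FROM A a JOIN B b ON a.col_num = b.row_num
GROUP BY a.row_num, b.col_num;
"""

conn = sqlite3.connect("matrix.db")
for row_num, col_num, value in conn.execute(QUERY):
    print("%s %s %s" % (row_num, col_num, value))
conn.close()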


[Python] Twitter tweets analysis

I recently took a Big Data course on Coursera, where I had the chance to apply different analysis techniques to huge datasets. One of the assignments required us to perform some sort of tweet sentiment analysis using code written in Python (version 2.7.3).

To successfully run the examples, you must download this archive, which contains:
  • README.html: instructions on how to register on Twitter and get your own token and consumer keys
  • AFINN-README.txt: a description of the AFINN-111.txt file, which is used to score the various tweets
  • AFINN-111.txt: a list of English terms and their sentiment scores, manually set by Finn Årup Nielsen
  • twitterstream.py: a Python script used to query Twitter and collect a stream of tweets. It runs indefinitely, so you will have to stop it manually after some time

First, run twitterstream.py as:

python twitterstream.py > output.txt

stopping it when you feel you've collected enough data. It will collect tweets in JSON format and store them in a new file called output.txt, which will be our test dataset. Note that all scripts were automatically graded, so many fine-tuning techniques had to be avoided to keep the grader from failing to evaluate each solution correctly.

Lastly, to understand the scripts, you should take a look at Twitter's official documentation of the tweet JSON format.
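
For reference, a bare-bones sentiment scorer in the spirit of the assignment could look like the sketch below. It assumes AFINN-111.txt is tab-separated (term, score) and that output.txt holds one JSON object per line; anything without a text field (rate-limit notices and the like) is skipped. This is only an illustration, not the graded solution:

# Sketch: score every tweet collected in output.txt with the AFINN-111 word list.
import json

def load_scores(path):
    scores = {}
    with open(path) as afinn:
        for line in afinn:
            term, score = line.rsplit("\t", 1)  # terms may contain spaces
            scores[term] = int(score)
    return scores

def tweet_sentiment(text, scores):
    # naive scoring: sum the score of every known word in the tweet
    return sum(scores.get(word.lower(), 0) for word in text.split())

if __name__ == "__main__":
    afinn_scores = load_scores("AFINN-111.txt")
    with open("output.txt") as stream:
        for line in stream:
            if not line.strip():
                continue
            tweet = json.loads(line)
            if "text" in tweet:  # skip rate-limit and delete notices
                print(tweet_sentiment(tweet["text"], afinn_scores))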

19/06/2013

[PL/SQL] Read, parse and store data from an XML file

Suppose you have a fairly complex XML file with this structure:

 <myRoot>  
      <myElement>  
           <aNode>  
                <aData>123</aData>  
                <bData><![CDATA[some lengthy text with special <characters>in</it>]]></bData>  
           </aNode>   
           <otherNode>  
                <otherData>31/12/1900</otherData>  
                .  
                .  
                .  
                <otherData>01/01/2000</otherData>  
           </otherNode>  
           <lastNode>  
                <simpleTag>  
                     <simpleValue1>some text</simpleValue1>  
                     <simpleValue2>other text</simpleValue2>  
                     <simpleFilter>0</simpleFilter>  
                </simpleTag>  
                <simpleTag>  
                     <simpleValue1>blah blah</simpleValue1>  
                     <simpleValue2>other blah</simpleValue2>  
                     <simpleFilter>1</simpleFilter>  
                </simpleTag>  
                <simpleTag>  
                     <simpleValue1>foo</simpleValue1>  
                     <simpleValue2>bar</simpleValue2>  
                     <simpleFilter>1</simpleFilter>  
                </simpleTag>  
                .  
                .  
                .  
           </lastNode>  
      </myElement>  
      .  
      .  
      .  
 </myRoot>  


Also, we won't have a proper schema file, but we know that the document will contain one or more myElement nodes, each of which will have:
  • a single aNode node with an integer aData element and a CDATA bData text element
  • a single otherNode node with multiple otherData elements representing dates
  • a single lastNode node with multiple simpleTag child nodes, each composed of three elements: a string in both simpleValue1 and simpleValue2 and an integer in simpleFilter. We also know that, for every lastNode node, one and only one simpleTag node will have its simpleFilter element set to 0.

Now you want to store it in a table, then read it, parse its contents and save them in another table.
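
Before diving into the PL/SQL, it may help to see the extraction rules above spelled out in a short throwaway sketch. The following Python snippet is only an illustration of the logic, not the PL/SQL solution; the element names come from the sample document, everything else (including the file name) is hypothetical:

# Illustration only: walk the structure described above and print, for each
# myElement, its aData/bData values, the otherData dates and the one
# simpleTag whose simpleFilter is 0.
import xml.etree.ElementTree as ET

tree = ET.parse("sample.xml")  # hypothetical file name
for element in tree.getroot().findall("myElement"):
    a_data = int(element.findtext("aNode/aData"))
    b_data = element.findtext("aNode/bData")  # the CDATA content comes back as plain text
    dates = [d.text for d in element.findall("otherNode/otherData")]
    # one and only one simpleTag per lastNode has simpleFilter = 0
    chosen = [tag for tag in element.findall("lastNode/simpleTag")
              if tag.findtext("simpleFilter") == "0"][0]
    print("%s | %s | %s | %s / %s" % (a_data, b_data, dates,
                                      chosen.findtext("simpleValue1"),
                                      chosen.findtext("simpleValue2")))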

15/06/2013

[PL/SQL] Oracle send mail through SMTP server with authentication

Sending mail through a PL/SQL script in Oracle is pretty easy thanks to the UTL_SMTP package.

Much like when sending mail by hand over Telnet, our script will have to issue the HELO, MAIL FROM, RCPT TO and DATA commands. Thus, when sending a mail to multiple recipients - whether direct or in CC - we must issue the RCPT TO command once for each of them and add the CC mail header to our message in order for the CC recipients to appear as such.
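
To see that conversation in action before touching any PL/SQL, here is the same flow sketched with Python's smtplib; server, port, credentials and addresses are all placeholders:

# Illustration of the SMTP conversation described above, in Python.
import smtplib

HOST, PORT = "smtp.example.com", 587      # placeholders
USER, PASSWORD = "username", "password"   # placeholders
SENDER = "me@example.com"
TO = ["alice@example.com"]
CC = ["bob@example.com"]

# The Cc header only controls how clients display the message; actual
# delivery is decided exclusively by the RCPT TO commands issued below.
message = (
    "From: %s\r\nTo: %s\r\nCc: %s\r\nSubject: Test\r\n\r\nHello from the sketch."
    % (SENDER, ", ".join(TO), ", ".join(CC))
)

smtp = smtplib.SMTP(HOST, PORT)
smtp.ehlo()                               # modern HELO
smtp.starttls()                           # most servers require TLS before AUTH
smtp.login(USER, PASSWORD)
smtp.mail(SENDER)                         # MAIL FROM
for recipient in TO + CC:                 # one RCPT TO per recipient, To and Cc alike
    smtp.rcpt(recipient)
smtp.data(message)                        # DATA: headers + body
smtp.quit()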