I'm not really sure how I ended up at Innovation Weekend 2010 in Vegas.
Background and history
For me it goes back to Hackers night 2009 in Vienna - when Craig Cmehil (our very own Simon Cowell) and I were talking late at night, not long before Craig, Tom Jung and Duane Chaos and I got kicked out the MesseCentrum building. We were discussing the RIA hackers night, which I felt had turned into a late night learning session based on our own laptops. And whilst it was still a cool thing to do, I for one felt that it was missing something. Mostly, real hacking and a deliverable. And real purpose. And what I didn't realise at that time was that Marilyn Pratt had already taken a load of hackers into the BPX slam and those guys were really hacking.
Craig and I kept in touch and whilst I don't know how much I shaped the end product - he probably already had a plan, he usually does - the end result was the Innovation Weekend. An all night hackers night running from Sunday to Monday. I wasn't able to attend in Berlin and, as luck had it, the cheap tickets to Vegas were on Saturday, which meant that I was just about recovered from jetlag by Sunday at 1pm.
Now I have to take you back again 6 months to the SAP Inside Track in London that Darren Hague kindly organised. I met a PhD student called Sarah Otner, who was doing a PhD on the recognition system in the SAP Community Network. I loved her passion and interest in the system and she was really frustrated, because she needed data in order to do the mining she needed to do to write her thesis. SAP were blocking her desire to get the data out, either for technical or legal reasons. I don't think that it was an orchestrated attack - but rather that it was the typical problems that you see in a large corporation.
I saw her in Berlin last week and she looked slightly downtrodden - no progress on data in the 6 months since I saw her at SIT London. I felt that for SAP there was no downside - free research and exposure for one of the most exciting community networks in the world.
Fast forward to Vegas
... and I found myself in the amazing Innovation Weekend masterminded by Marilyn Pratt and Craig Cmehil. Without those guys it would be nothing.
They had prepared 8 BPX focussed business cases and one of these was as follows:
8. "Physician: Heal Thyself": Improving the SCN from within!
Posted by: Sarah Otner GOAL: Improve the recognition systems of SCN by examining the historical data
- Does the SCN recognition system reward the right kinds of behaviors and contributions?
- What's the real value of being a Top Contributor?
PROBLEM: Initial attempts to pull the source data already available on SCN into Excel failed as they only returned 10 lines and the same 10 lines upon each request (a problem when one Top Contributor table has 17,000 individuals).
- A database of community members and their activity year-on-year for as many years as is available.
- Scrape the Contributor Recognition Program, the Top Contributors' lists, the Topic Leaders' lists, and the Mentors' rosters into a format easily manipulable (by me! :)) for analysis.
Fellow SAP Mentor Thorsten Franz turned up at the table along with a number of other great individuals. And it became clear that this was a pretty easy technology challenge, provided we could get the data. So I set about getting the data whilst Thorsten, Arun, Laurant and others worked on analytics and presentation.
Mounting a DOS on SCN (aka making friends and influencing people)
So it turns out that the only way to get points data out of SCN is to read the RSS feeds on the contributor pages. Only the contributor page version is broken. The company version does however work, and it is possible to see points - by Company by Person by Year by Development Area. Can you see where I am headed?
So if you want to find out the contributors for Bluefinsolutions.com - for 2010 and for Mobile, you can go here.
So all I needed to do was to write a script to get this for all companies, all years and all points areas. Simple, right. Here's the bash script to do it:
# one feed per year, development area and company; the output file is named year,area,company
for year in $(cat ../year); do
  for devel in $(cat ../devel); do
    for comp in $(cat ../companynames); do
      wget -O "$year,$devel,$comp" "http://www.sdn.sap.com/irj/sdn/topcontributorsrss?periodId=$year&minimumPointsCount=20&areaIds=$devel&organization=$comp"
    done
  done
done
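Note that I downloaded the years, company names and development areas using the same techniques and put them in files - and note that each output filename is built from the year, development area and company, so that it can be appended to the CSV rows later. But... in the first version I forgot to escape the & by surrounding the URL in inverted commas. An unquoted & tells the shell to run the command in the background, so instead of one request at a time I opened up 2500 threads (I used the top 2500 companies). And SDN died for 3 hours.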
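To make the quoting point concrete, here is a minimal sketch. The helper name build_url is my own invention for illustration and is not part of the original script; the values y08, P and jonerp.com are taken from the sample output line further down.

```shell
#!/bin/sh
# build_url is a hypothetical helper: it assembles the feed URL from the
# three loop variables used in the download one-liner.
build_url() {
    year=$1 devel=$2 comp=$3
    printf '%s\n' "http://www.sdn.sap.com/irj/sdn/topcontributorsrss?periodId=${year}&minimumPointsCount=20&areaIds=${devel}&organization=${comp}"
}

# Quoted: the shell hands the whole URL to wget as a single argument.
#   wget -O "$year,$devel,$comp" "$(build_url "$year" "$devel" "$comp")"
#
# Unquoted: the shell reads each & as "run this in the background", so wget
# starts with a truncated URL and the loop carries on without waiting.
#   wget -O $year,$devel,$comp http://...?periodId=$year&minimumPointsCount=20&...

build_url y08 P jonerp.com
```

Printing the URL before handing it to wget is a cheap way to check that the shell has not eaten anything after the first &.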
After SDN came back up I fixed my script and parallelised it by year - so just 8 threads running. It took 10 hours to download all the data into some 180,000 XML files. Thankfully, we have lots of CPU power these days. So I wrote some scripts around that too.
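The fixed version is not shown here, but running one job per year could look roughly like the sketch below. Everything in it is an assumption for illustration: fetch_year is a stand-in that only writes a log file instead of calling wget, and the year codes other than y08 are placeholders.

```shell
#!/bin/sh
# OUTDIR is a scratch directory for this demo only.
OUTDIR=${OUTDIR:-$(mktemp -d)}

# Placeholder year codes; only y08 appears in the real data shown in this post.
YEARS="y03 y04 y05 y06 y07 y08 y09 y10"

# fetch_year is a hypothetical stand-in for the inner two loops (development
# area and company). Here it just records which year it was asked for.
fetch_year() {
    echo "fetching $1" > "$OUTDIR/$1.log"
}

# Start exactly one background job per year, then block until all are done.
for year in $YEARS; do
    fetch_year "$year" &
done
wait

# Count the log files to confirm that every job ran.
ls "$OUTDIR" | wc -l
```

The important part is that the & is deliberate and sits on the outer loop only, so the number of concurrent jobs is fixed by the number of years rather than by the number of companies.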
First, files that are 409 bytes long don't actually have any data in them. So we strip them out of the list of files to process as follows:
for a in `find -not -size 409c -print | sed '1d'| cut -c3-100`; do echo $a; done > ../filled
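A slightly more defensive variant of the same filter is sketched below. It is not what I ran on the day; list_filled is a hypothetical helper name, and the demo files are made up. Using -type f means the directory entry itself never shows up, so there is no need to delete the first line of the output.

```shell
#!/bin/sh
# list_filled prints every regular file under directory $1 whose size is
# not exactly 409 bytes, with the leading "./" removed.
list_filled() {
    ( cd "$1" && find . -type f ! -size 409c | sed 's|^\./||' )
}

# Demo with made-up files: one 409-byte "empty" feed and one with content.
demo=$(mktemp -d)
dd if=/dev/zero of="$demo/y08,P,empty.example" bs=409 count=1 2>/dev/null
printf '<rss>some data</rss>\n' > "$demo/y08,P,jonerp.com"

list_filled "$demo"
```

The c suffix on -size makes find compare the size in bytes; without it the unit is 512-byte blocks and the test would match the wrong files.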
And then we strip the XML out, turn it into a flat file and append the filename that relates to it, to each line.
for a in `cat ../filled`; do cat $a | sed '1,9d' | more | sed ':a;N;$!ba;s/<\/title>\n/,/g' | sed 's/<title>//' | sed ':a;N;$!ba;s/<\/link>\n/,/g' | sed 's/<link>//' | sed ':a;N;$!ba;s/<\/description>\n/,/g' | sed 's/<description>//' | sed ':a;N;$!ba;s/<\/pubDate>\n/,/g' | sed 's/<pubDate>//'| sed ':a;N;$!ba;s/<\/scn:rank>\n/,COMPANY/g' | sed 's/<scn:rank>//'| sed ':a;N;$!ba;s/<\/item>\n//g' | sed 's/<item>//'| sed ':a;N;$!ba;s/<\/rss>//g'| sed ':a;N;$!ba;s/<\/channel>//g' | sed '1d' | sed '$d' | sed s/COMPANY/$a/; done >> ../fillout.csv
This gives us a bunch of data that looks like this:
Jon Reed,https://www.sdn.sap.com/irj/servlet/prt/portal/prtroot/com.sap.sdn.businesscard.SDNBusinessCard?u=gLyawsX5bMI%3D,80,Tue, 10 Feb 2009 2:43:19,1,y08,P,jonerp.com
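One thing to watch in that line: the date itself contains a comma, so a naive split on commas gives 9 fields rather than 8. A minimal sketch of one way to deal with it is below. The helper name quote_pubdate is hypothetical, and it assumes that no other field (a name, for instance) contains a comma; lines that do not split into exactly 9 fields are silently dropped.

```shell
#!/bin/sh
# quote_pubdate re-joins the two halves of the date (fields 4 and 5) and
# wraps them in double quotes, so that each record has 8 CSV fields.
# Lines that do not split into exactly 9 comma-separated fields are dropped.
quote_pubdate() {
    awk -F, 'BEGIN { OFS = "," }
        NF == 9 { print $1, $2, $3, "\"" $4 "," $5 "\"", $6, $7, $8, $9 }'
}

printf '%s\n' 'Jon Reed,https://www.sdn.sap.com/irj/servlet/prt/portal/prtroot/com.sap.sdn.businesscard.SDNBusinessCard?u=gLyawsX5bMI%3D,80,Tue, 10 Feb 2009 2:43:19,1,y08,P,jonerp.com' | quote_pubdate
```

With the date quoted, most spreadsheet and BI import tools should treat it as a single column.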
All we do then is convert the data and replace some years and development areas, and we've got a nice big CSV file with people by year, development area and company.
The rest is easy
The rest of our demo was easy - we uploaded the big CSV file into SAP's cloud BI Service - http://bi.ondemand.com and used SAP BusinessObjects Explorer to look at the data. We also used the new beta BUPA dashboarding service which worked pretty well.
Well, we have done what we set out to achieve. We have 7 years of SCN data explorable by most of the metrics that Sarah was looking for. There are some things that were hard to do - especially scraping the master data from SCN business cards and that is a work in progress.
But what we're hoping, and there's a number of us that share this vision, is that as the SCN team start to realise the value of students analysing the data, we will be able to break down the walls and make more detailed information available to people like Sarah who want to write PhD theses on the community.
Huge thanks to Marilyn and Craig for making it possible. To Kai and Mark and Chip and everyone from the SCN team who I inconvenienced. Sorry about that.