Friday, July 15, 2011

I scrape as a timepass, and it's fun

'Scraping': the name itself sounds negative, even though in my view scraping is a real boon to developers when there's no API (REST, XML-based, etc.) already published. REST APIs are very widely adopted, so let's discuss REST a bit first:
REST is basically like this: there are some resources (objects) on a server, and these resources can be accessed via the same old HTTP GET, POST, DELETE... (maybe after authentication, which these days is mostly via OAuth). In response to the HTTP request the server sends back a representation of the resource, these days usually a JSON string. The good thing is that JSON is quite readable and can be parsed very easily. So clients talk to the resources/services on a server via HTTP requests and JSON responses.
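To show how little work a JSON response needs, here is a minimal sketch. The response body and its field names are made up for illustration; a real API defines its own.

```python
import json

# Hypothetical response body, invented for this example.
response_body = '{"train": {"number": "12345", "name": "Example Express", "stops": ["A", "B", "C"]}}'

# One call turns the JSON string into plain dicts and lists.
resource = json.loads(response_body)

train_name = resource["train"]["name"]
stop_count = len(resource["train"]["stops"])
print(train_name, stop_count)
```

No guessing about structure is needed: the keys in the response are the API.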
Let's compare that with scraping..
  • When we want to scrape there's again an HTTP request to a server.
  • Server sends some HTML, CSS and JS.
Now there are issues with the HTML and the JS content sent.
HTML issues:
  • Browsers seem to handle any hopeless HTML sent to them.
  • So the content becomes tough to parse, coz people (i.e. the HTML content writers) can afford to be sloppy.
If every content writer followed the HTML standard properly, the content would be very easy to parse and use.. but when the HTML is a gone case, we rely on libraries which parse it heuristically and give you nice searchable objects to play with programmatically. The logic behind why these libraries can work is fairly simple..

If browsers can parse hopeless HTML to the degree that bad HTML still renders fine, it implies there is parsing code available within the browser. Mozilla Firefox's source code is open. There's lots to learn there..
But to get a quick scrape done we really don't need to look into the browser's HTML parsing code. Instead, we've got people who've made great libraries, probably by looking at the browsers' HTML parsing code.
For my scraping needs I've enjoyed using BeautifulSoup when I code in Python and Jsoup when I code in Java. Playing with these libraries is sheer fun.. To get a feel for how good BeautifulSoup is, check this code that I wrote a long time back. It just gets all the Indian Railways train data, like the schedules of different trains and their stoppages, and generates SQL queries to persist the same data.
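To give a feel for lenient parsing without installing anything, here is a minimal sketch using Python's standard library html.parser instead of BeautifulSoup. It is not the railway code mentioned above; the markup and the train names are made up, and the table deliberately leaves its tags unclosed.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td>, even when closing tags are missing."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            # A new <td> starts a new cell, whether or not the previous one was closed.
            self.cells.append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "tr", "table"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.cells[-1] += data.strip()


# Sloppy markup: no </td> or </tr> anywhere (made-up data).
sloppy_html = "<table><tr><td>12345<td>Example Express<tr><td>67890<td>Sample Mail</table>"

collector = CellCollector()
collector.feed(sloppy_html)
collector.close()
print(collector.cells)
```

Libraries like BeautifulSoup and Jsoup go much further and build a full searchable tree, but the idea is the same: keep going when the markup misbehaves.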

Ok, so now that we've discussed that HTML parsing is more or less a solved issue, let's talk about the JavaScript part of the HTTP response..
  • Scraping static pages is easy, as explained above.
  • Today pages are dynamic. And AJAX calls really make the user experience better.
So now if we are scraping a dynamic page we need to understand and execute the JavaScript code too.. Whoa.. But just think of it in terms of the browser..
"The browser understands JS and Ajax and renders the content received from AJAX calls."
But this case is far tougher than the HTML issue.. The JS code interacts with the HTML and CSS, so now we need to emulate the whole browser itself. That is when COM (Component Object Model) objects enter the picture on Windows: COM lets a program drive a real browser instead of emulating one. Check out how to programmatically use Internet Explorer COM objects with some Python code:
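from win32com.client import Dispatch  # needs the pywin32 package; Windows only
ie = Dispatch("InternetExplorer.Application")  # start Internet Explorer as a COM object
# Do stuff with the ie object, e.g. ie.Navigate(url), then read ie.Document once the page has loaded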
Ok, so scraping dynamic pages is tougher in comparison to static pages.. But it can be done.
Also, many a time one can get the right data by finding out the URL to which the AJAX call goes and then using the response received directly.
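A minimal sketch of that shortcut, assuming the AJAX endpoint returns JSON. The URL in the comment, the headers and the "lastPrice" field are all hypothetical; the real ones have to be read off the browser's network tab for the page in question.

```python
import json
from urllib.request import Request, urlopen


def fetch_json(url):
    """Call the AJAX endpoint directly and decode its JSON body."""
    # Some endpoints expect browser-like headers; these values are assumptions.
    request = Request(url, headers={
        "User-Agent": "Mozilla/5.0",
        "X-Requested-With": "XMLHttpRequest",
    })
    with urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def extract_price(payload):
    """Pull the price out of the decoded payload (hypothetical field name)."""
    return float(payload["lastPrice"])


# Offline demonstration with a made-up payload, so nothing is fetched here.
sample_payload = json.loads('{"symbol": "XYZ", "lastPrice": "101.50"}')
print(extract_price(sample_payload))

# Against a real endpoint it would look something like:
# price = extract_price(fetch_json("https://example.com/ajax/quote?symbol=XYZ"))
```

When this works there is no JavaScript to run at all, so the browser emulation can be skipped.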
There's still a lot of functionality which can be achieved from the many static pages still available..
For example:
Visualizing realtime stock data is crucial for traders. Now, on many of the stock tracking websites available, the stock data being viewed is some 2 minutes behind the actual time. There are also websites where an HTTP request to a server for a stock returns its realtime price.. But the problem there is that pages have to be refreshed again and again to check the current price.. Also, in one web page in the browser you can check only one stock at a time..
So a friend of mine (he is a big market lover :) ) and I tried to solve the above two issues, and this was the minimum functionality we needed:
  1. Choose the companies from NSE that the user wants to track. 
  2. Aggregate all the stock prices of the companies in one page.
  3. Auto refresh the prices via AJAX every 10 seconds or so.
So we made a simple and basic realtime stock monitoring application which achieves the above. It uses GWT, with Google App Engine for hosting, which are interesting things to talk about.. Probably in some other post. The user interface isn't good, but it still serves the purpose.
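The three requirements above boil down to a small polling loop. The sketch below is in Python purely for illustration (the actual app used GWT and App Engine); the symbols, the prices and the fake_fetch_price stand-in are all made up so that it runs offline.

```python
import time


def poll_prices(symbols, fetch_price, interval_seconds=10, rounds=1, sleep=time.sleep):
    """Yield one {symbol: price} snapshot per round, waiting between rounds."""
    for round_number in range(rounds):
        if round_number:
            # Wait before every round except the first one.
            sleep(interval_seconds)
        # Aggregate all tracked symbols into a single snapshot.
        yield {symbol: fetch_price(symbol) for symbol in symbols}


def fake_fetch_price(symbol):
    """Stand-in for a real scraper: returns made-up prices."""
    return {"AAA": 100.0, "BBB": 250.5}[symbol]


# The user's chosen symbols, polled twice; sleeping is stubbed out for the demo.
snapshots = list(
    poll_prices(["AAA", "BBB"], fake_fetch_price, rounds=2, sleep=lambda seconds: None)
)
print(snapshots)
```

Swapping fake_fetch_price for a function that scrapes one stock page gives the aggregated, auto-refreshing view.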

Ok, so scraping static pages can be visualized as an API in itself, except that the content will change with time.. And the chance that the scraping code breaks when the content changes is always there.. One small tip to avoid breakages to some extent is to use CSS class names, if they exist, while scraping. That way, hopefully, if the page author wants to restyle, he/she would change the CSS properties of that class instead of the class name..
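Here is a minimal sketch of selecting by class name, again with the standard library's html.parser so that it runs as-is. The markup and the "price" class are made up. With BeautifulSoup or Jsoup the same lookup is a one-liner.

```python
from html.parser import HTMLParser


class ClassTextCollector(HTMLParser):
    """Collect the text inside elements that carry a given CSS class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.matches = []
        self._depth = 0  # greater than 0 while inside a matching element

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self._depth:
            # Nested tag inside a match. Void tags like <br> are not handled in this sketch.
            self._depth += 1
        elif self.class_name in classes:
            self._depth = 1
            self.matches.append("")

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.matches[-1] += data


# Made-up page fragment: the class name, not the position, identifies the price.
page = '<div id="x1"><span class="price big">101.50</span><span class="label">NSE</span></div>'

collector = ClassTextCollector("price")
collector.feed(page)
collector.close()
print(collector.matches)
```

If the author later moves the span elsewhere on the page or adds more classes to it, the lookup by class name still finds it.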

Finally, scraping is fun for developers like me coz it helps solve many interesting problems.. But think in terms of how the search engine giants, Google etc., work. They rely big time on scraping and crawling.. Their bots have to literally crawl the whole web and scrape data based on their algos to help their servers index and search.. So scraping itself is a huge revenue generator too, but for now I will enjoy the fun part of scraping.. Hopefully some time in the future I'll scrape to create revenue.. :)

P.S. I scrape just for timepass.. Please don't sue me.. ;)

Saturday, May 28, 2011


This post is about an interesting event that happened last weekend..
It's 21/5, ~10pm.. I am returning from the airport to Gurgaon.. The following is the sequence of events that takes place..
  • Me in a shared cab; a pretty gal sits next to me..
  • The cab goes from the airport to Gurgaon
  • Lots of silence initially... Finally, when Gurgaon seems to be nearby, she asks the driver about some hotel.. Driver doesn't know..
  • I step into the conversation.. She tells me about the place.. Even I don't know.. but I tell her to ask Justdial and give her the no.
  • Fortunately or unfortunately, Justdial doesn't give out any concrete information..
  • So I call up a friend of mine to know where exactly this hotel is.. And he knows..
  • I tell her about it.. 
  • She agrees to take an auto with me to the metro station, where I had kept the bike.. So I tell her about it.. I tell her that I'll drop her at that hotel on the bike.. she agrees..
  • So after the auto ride and some talk we head to the bike..
  • Even on the bike I get to talk to her and all.. and even she's talking vigorously and all, like a lot of gals do..
  • She tells me she is an airhostess and is new in Gurgaon..
  • She's searching for a PG..
  • The hotel arrives..
  • She says bye..
Eventually I don't get to know her name.. Also I could've given her my cell no., coz I could've searched for PGs on the internet and helped airhostesses, you know.. ;)

But I'm just happy I helped a gal who was lucky enough to find a decent guy like me.. at night.. or else things can get pretty dirty and bad here in Gurgaon/NCR when a gal who's alone meets nutcases at night..

Such interesting things should keep on happening.. But I shouldn't be screwing up as I did this time..

Saturday, May 7, 2011

Learnings from punctures and Lassis - Note to myself.. :)

I've always loved learning new stuff..
I've enjoyed learning stuff which I find interesting, mostly things which are logical and have some kind of pattern..
Be it programming the computer to do useful stuff or learning to play the guitar..
The quest for learning should never die..
Coz when learning stops I guess life stops.. 

This post is about an interesting learning I had about learning today..
The series of events starts off the day before yesterday..

  1. I found that the bike which I use to commute between home and office (Thanks Hims :) ) was punctured when I was leaving the office to go out and swim...
  2. I go to Mishraji who's an ever helpful and differently awesome colleague in office and I borrow his huge Royal Enfield Bike.. 
  3. I swim and return back to office.. 
  4. And then Mishraji drops me to my home too.. 
  5. Next day I'm back to office with Hims my flatmate.. 
  6. Now starts off the struggle to get the puncture fixed..
  7. I borrow the Activa from the only female colleague in office at 5 pm, with a 45 min timeline..
  8. There's this puncture shop which I go to, and I ask the roughly 35-year-old guy who runs the shop to send someone with me to the office to remove the tyre from the bike, which was parked there, instead of me getting the bike to his shop..
  9. Now here comes this kid quite young but he's got a smile out there.. 
  10. Me and this kid we head to the office.. 
  11. This guy is into his work of removing the tyre etc. He leaves the tools etc there in the office parking in the basement itself..
  12. Now we take the tyre back to the puncture shop..
  13. He fixes the puncture and we head back to office..
  14. Now comes the challenging task for him to put the tyre to the right spot.. He seems to try out all the ways he knows..  But he doesn't succeed.. 
  15. There's a combination of anger and pity in my mind.. Pity coz this chap is so young.. He shouldn't be doing this.. So even I go in and try to set the tyre at the right place.. But even I can't do anything.. I was wearing a new shirt that day and I was very conscious of not spoiling it.. The guard uncle in the basement, who's been observing the whole fiasco, also steps in finally; even he angrily questions the kid.. The kid unfortunately has no answers.. He mumbles that he probably has to call his 35 yr old uncle to fix this.. But I summarily reject that option coz the lady who gave me the Activa has left for the day with it, and I don't have any vehicle to take the kid from office to the shop and back.. I communicate the same to him.. The guard then starts helping the kid out and boom.. Finally the tyre is set into the right place.. I praise the guard.. He scolds the kid..
  16. Next, when all the final touches are done I take the kid back to his shop.. This is when I start being curious, I ask him all kinds of questions.. 
      1. What's his name? Ashok
      2. What's his age? 13
      3. Where's he from? Near itarsi, MP
      4. Where do his parents stay and what do they do? No mom, Dad works as a daily wager.. 
      5. How did he get into repairing punctures? Parents told him to do so.. 
      6. Who's the 35 yr old uncle.. ? His answer - He's ustaad.. He teaches me.. I love learning.. 
      7. How much are you paid? Nothing .. The ustaad uncle gives me food twice daily and I am learning something every day.. :) .. This is something which made me happy and sad all at once.. 
      8. Tum school jate ho? Silence.. 
  17. His silence creates silence in my mind too.. I hate it when I can't do anything..
  18. I am scared that coz this kid took extra time to get the bike's tyre fixed, his ustaad uncle might punish him.. But all I do is go pay the ustaad the money.. and return home..
  19. Next day is again normal.. I go to office in the morning.. It's around 6.30 in the evening, I head to the bike to go swim again.. After about 200 metres I find to my despair that the same tyre is punctured again..
  20. I again drag the bike to Ashok's shop.. When I reach there.. This kid again smiles.. He's again into fixing the bike.. He reveals that the walnut itself has torn away from the tube, so the tube itself has to be replaced.. Now I'm reminded of the fact that the kid had not tightened the nut of the tube to the tyre properly.. But I now think that it's coz of his inexperience, and also the society which makes many kids do work they don't deserve at their age.. These and many other thoughts like these keep my mind preoccupied.. In the meantime this kid has again set the tube into the tyre.. It's again his task to set the tyre into the right place on the bike.. I again ask him.. Can you do it? He smiles: I'll try, if I can't I'll ask ustaad uncle to fix it.. The ustaad finally fixes the tyre, I pay him and he leaves for some time.. Now again a conversation starts off with this kid Ashok..
    1. I find that there's another kid sitting in the shop.. This kid is dressed better.. His hair is combed neatly.. He's managing the cash.. So I ask Ashok who the other kid is.. He tells me that the other kid is the ustaad's son..
    2. Does ustaad's son go to school? Yes he does.. - Hmm.. That saddens me.. Double standards..  
    3. Where do you stay? He points out a house nearby.. 
    4. What did you have for lunch? - Roti.. 
    5. Take this extra money.. Use it when you want.. - He says no, I can't take that..
  21. I see a juice corner nearby.. I tell Ashok to come with me.. Luckily he comes with me.. I ask him what he would want to have: juice or lassi? - He says lassi..
    1. I again find that the juice corner guy is also some 16 years old or so..
    2. I tell him to make two glasses of lassi.. 
    3. He starts off preparing .. I again converse with Ashok.. 
    4. Now even this juice corner guy springs up into the conversation... 
      1. He starts off saying how much he likes Maths.. 
      2. Juice guy - "If someone teaches me Maths 1 hour daily I can do well.. I can learn all the other things and finish my tenth.. "
        1. I don't know what comes to my mind, I ask him.. How much is 13*7? - He promptly says 91.
        2. 16*9 - 144
        3. Ok.. 19* 19? -- He starts thinking and gives up.. He says.. he can do it.. All along Mr. Ashok is smiling .. When he hears 19*19 even he asks the juice guy ... Bolo Bolo.. 
        4. Finally the juice guy says .. 19* 19 bhi ho jayega.. Take this lassi its done.. 
    5. I give the lassi to Ashok.. 
    6. I think about what I can do.. And I just leave after paying for the lassi.. The juice guy is like.. Apka bhi ban gaya hai.. I tell him: you have the lassi.. or give it to Ashok..
  22. I just leave the shop and all.. I drive to the pool with lots of thoughts in my mind.. There's helplessness.. Lots of it..
  23. I had just learnt freestyle properly some 2-3 days ago.. I decide that it's time to learn some other stroke.. I ask the trainers there about the toughest stroke out there.. And now I'm learning the butterfly stroke.. Ashok is still learning how to fix punctures and tyres, the juice guy is learning Maths.. Life moves on.. The wheel of learning is also moving steadily :) , (:

Friday, April 22, 2011

Unix + Marketing

A friend of mine had posted this:
The Microsoft vs. Linux war didn’t help with the disdain toward marketing among software geeks, hackers and slashdotters. It feels good to believe that Windows succeeded mostly because of marketing and money spent by the mega-corporation. If Linux had the marketing muscle of Windows, it would rule the world.

But I thought otherwise. Coz the reality contradicts how it looks to the normal eye..

Mac OS X is a Unix-based OS, and I guess even Apple has spent huge cash on marketing the Mac. In my view Apple has a marketing-overload kind of strategy, i.e. they sell everything and anything they can. They add value to some open source product and pitch it fervently to the market to earn cash. Remember, the only way in which not-so-geeky chaps could download music on a Mac or any Apple device was iTunes, which proves that they are marketing pros. Also, Android is based on a modified Linux kernel. Same goes for Google Chrome OS. So even Google has taken up Unix + marketing seriously to address two huge requirements, i.e. the mobile space and the cloud.. iOS, which powers the Apple devices, i.e. the iPhones and the iPads, is again Unix-based.. So Unix + Marketing has been tried for a long time with some success.. iOS / Android / Chrome OS might give a definitive answer to the question in the coming future.. Ultimately all these upcoming OSes have their roots in Unix, so Unix is already ruling the world with whatever marketing has been done for it.