Friday, June 11, 2010

The Empire Avenue Crawl

The other night, I downloaded the profile pages of the first 3235 EA users. Tonight, I'm finishing up the code to pull user data out of the pages & load it into a database. As soon as that's done, I'll start working on my next post: a complete member list of every community! So be prepared!
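For the curious, here's a minimal sketch of what the parse-and-load step might look like in Python. The file names, regexes, and table layout are placeholders (the real profile markup is more involved), so treat this as an outline rather than the actual code:

```python
# Minimal parse-and-load sketch. Assumes the crawled pages are saved in
# pages/ as user_1.html ... user_3235.html. The regexes and the table
# layout below are placeholders; the real Empire Avenue markup differs.
import re
import sqlite3
from pathlib import Path

conn = sqlite3.connect("empireavenue.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, ticker TEXT, name TEXT)"
)

# Hypothetical patterns; swap in whatever the profile pages actually use.
TICKER_RE = re.compile(r'class="ticker">\(e\)(\w+)<')
NAME_RE = re.compile(r"<h1[^>]*>([^<]+)</h1>")

for path in sorted(Path("pages").glob("user_*.html")):
    html = path.read_text(encoding="utf-8", errors="replace")
    user_id = int(path.stem.split("_")[1])  # "user_42.html" -> 42
    ticker = TICKER_RE.search(html)
    name = NAME_RE.search(html)
    conn.execute(
        "INSERT OR REPLACE INTO users (id, ticker, name) VALUES (?, ?, ?)",
        (user_id, ticker.group(1) if ticker else None,
         name.group(1) if name else None),
    )

conn.commit()
conn.close()
```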

10 comments:

  1. Sounds great! The cool stuff folks are doing with the API makes me want to do some data mining myself.

  2. This is actually NOT using the API & I have to tread VERY lightly to keep from getting banned. I HAVE talked to (e)DUPS, the CEO & (I think) the lead developer about it, though. Granted, it was AFTER the crawl that he became aware that that's how I was getting my data, and there IS that old adage "it's easier to beg forgiveness than to ask for permission"...but I still think the only reason I haven't been banned already is that I discovered a bug as a result of the crawl ;)

  3. Hi Niv!
    Yes, I downloaded the HTML of all the profiles. Specifically, I wrote a small Windows batch file that iterated over profiles 1-3235 & grabbed each one using curl; a rough equivalent of that loop is sketched after the comments.
    (yeah...I'm doing all this development in Windows 7 64-bit right now because I've been too busy to redo my Ubuntu install on the other partition).

    I definitely plan on hosting the raw data, but before I can, I have to figure out where/how I'm going to accomplish that.
    I'm flexible & open to suggestions, not to mention welcoming any hosting offers!

  4. If you send me the files, I'll be more than happy to host them with full attribution on my blog at http://www.innerlogics.com/blog (it's about time I wrote a new post... :)

  5. To be honest, I'd rather have space that I can access, simply because this isn't a one-time operation & the data will be constantly kept updated.
    How about this: I'll create a torrent for the raw data & host it from home (10Mbps download & upload), pending approval from DUPS to distribute the data.
    Once the API overhaul that's planned goes into effect, the data will be much more readily available to everybody & I won't have to deal with parsing through hundreds of megabytes worth of files to generate 3 tables in a database :P

  6. Torrent sounds good, naturally, pending approval from @dups. If you want, I can web-seed it for you, and/or provide you FTP access.

    http://en.wikipedia.org/wiki/BitTorrent_(protocol)#Web_seeding

  7. Hello Crawford!
    Really looking forward to reading your results.

    Did you just download the HTML of the pages or were you scraping specific information?
    Would you consider sharing the raw data so that I (and probably others) won't have to hit the servers too much? :)

    Thanks,
    (e)NIVS

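For reference, here's a rough Python stand-in for the batch-file-and-curl loop described in comment 3. The profile URL pattern below is a placeholder (the actual crawl drove curl from a Windows batch file), and the pause between requests is there to keep the load on the servers light, since none of this goes through the API:

```python
# Rough Python stand-in for the batch-file-plus-curl crawl from comment 3.
# The profile URL pattern below is a placeholder, not the real endpoint.
import time
import urllib.request
from pathlib import Path

BASE_URL = "http://www.empireavenue.com/profile/{}"  # placeholder pattern
OUT_DIR = Path("pages")
OUT_DIR.mkdir(exist_ok=True)

for user_id in range(1, 3236):  # profiles 1-3235
    target = OUT_DIR / f"user_{user_id}.html"
    if target.exists():
        continue  # already fetched on an earlier run
    try:
        with urllib.request.urlopen(BASE_URL.format(user_id), timeout=30) as resp:
            target.write_bytes(resp.read())
    except Exception as exc:
        print(f"profile {user_id}: {exc}")
    time.sleep(2)  # not using the API, so keep the request rate gentle
```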