Friday, June 11, 2010

The Empire Avenue Crawl

The other night, I downloaded the profile pages of the first 3235 EA users. Tonight, I'm finishing up the code to pull user data out of the pages & load it into a database. As soon as that's done, I'll start working on my next post: a complete member list of every community! So be prepared!
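For the curious, here's a minimal sketch of what the parse-and-load step might look like in Python. The file names, regexes, and table layout are placeholders (the real profile markup is more involved), so treat this as an outline rather than the actual code:

```python
# Minimal parse-and-load sketch. Assumes the crawled pages are saved in
# pages/ as user_1.html ... user_3235.html. The regexes and the table
# layout below are placeholders; the real Empire Avenue markup differs.
import re
import sqlite3
from pathlib import Path

conn = sqlite3.connect("empireavenue.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, ticker TEXT, name TEXT)"
)

# Hypothetical patterns; swap in whatever the profile pages actually use.
TICKER_RE = re.compile(r'class="ticker">\(e\)(\w+)<')
NAME_RE = re.compile(r"<h1[^>]*>([^<]+)</h1>")

for path in sorted(Path("pages").glob("user_*.html")):
    html = path.read_text(encoding="utf-8", errors="replace")
    user_id = int(path.stem.split("_")[1])  # "user_42.html" -> 42
    ticker = TICKER_RE.search(html)
    name = NAME_RE.search(html)
    conn.execute(
        "INSERT OR REPLACE INTO users (id, ticker, name) VALUES (?, ?, ?)",
        (user_id, ticker.group(1) if ticker else None,
         name.group(1) if name else None),
    )

conn.commit()
conn.close()
```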

10 comments:

  1. Sounds great! The cool stuff folks are doing with the API makes me want to do some data mining myself.

  2. This is actually NOT using the API & I have to tread VERY lightly to keep from getting banned. I HAVE talked to (e)DUPS, the CEO & (I think) the lead developer about it, though. Granted, it was AFTER the crawl that he became aware that that's how I was getting my data, and there IS that old adage "it's easier to beg forgiveness than to ask for permission"...but I still think the only reason I haven't been banned already is that I discovered a bug as a result of the crawl ;)

  3. Hi Niv!
    Yes, I downloaded the HTML of all the profiles. Specifically, I wrote a small Windows batch file that iterated over profiles 1-3235 & grabbed each one using curl; a rough equivalent of that loop is sketched after the comments.
    (yeah...I'm doing all this development in Windows 7 64-bit right now because I've been too busy to redo my Ubuntu install on the other partition).

    I definitely plan on hosting the raw data, but before I can, I have to figure out where/how I'm going to accomplish that.
    I'm flexible & open to suggestions, not to mention welcoming any hosting offers!

  4. If you send me the files, I'll be more than happy to host them with full attribution on my blog at http://www.innerlogics.com/blog (it's about time I wrote a new post... :)

  5. To be honest, I'd rather have space that I can access, simply because this isn't a one-time operation & the data will be constantly kept updated.
    How about this: I'll create a torrent for the raw data & host it from home (10Mbps download & upload), pending approval from DUPS to distribute the data.
    Once the API overhaul that's planned goes into effect, the data will be much more readily available to everybody & I won't have to deal with parsing through hundreds of megabytes worth of files to generate 3 tables in a database :P

  6. Torrent sounds good, naturally, pending approval from @dups. If you want, I can web-seed it for you, and/or provide you FTP access.

    http://en.wikipedia.org/wiki/BitTorrent_(protocol)#Web_seeding

  7. Hello Crawford!
    Really looking forward to reading your results.

    Did you just download the HTML of the pages or were you scraping specific information?
    Would you consider sharing the raw data so that I (and probably others) won't have to hit the servers too much? :)

    Thanks,
    (e)NIVS

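For reference, here's a rough Python stand-in for the batch-file-and-curl loop described in comment 3. The profile URL pattern below is a placeholder (the actual crawl drove curl from a Windows batch file), and the pause between requests is there to keep the load on the servers light, since none of this goes through the API:

```python
# Rough Python stand-in for the batch-file-plus-curl crawl from comment 3.
# The profile URL pattern below is a placeholder, not the real endpoint.
import time
import urllib.request
from pathlib import Path

BASE_URL = "http://www.empireavenue.com/profile/{}"  # placeholder pattern
OUT_DIR = Path("pages")
OUT_DIR.mkdir(exist_ok=True)

for user_id in range(1, 3236):  # profiles 1-3235
    target = OUT_DIR / f"user_{user_id}.html"
    if target.exists():
        continue  # already fetched on an earlier run
    try:
        with urllib.request.urlopen(BASE_URL.format(user_id), timeout=30) as resp:
            target.write_bytes(resp.read())
    except Exception as exc:
        print(f"profile {user_id}: {exc}")
    time.sleep(2)  # not using the API, so keep the request rate gentle
```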