What Does RT Look Like?
An interesting question. I'm not talking about the layout or the coding or anything visual. I'm talking about relationships.
Many of you may have seen the Internet Map.
It's an interesting application of network theory and web scraping. Basically, it takes a huge percentage of the major websites on the planet and groups them by a number of metrics.
Well, I don't have access to an internet trunk line and I'm a bit impatient. But it's still possible to do something similar on a small scale. So when I started a job that required me to learn Python and saw some of @Desayjin's Ask an Economist journals, I decided to write my own site scraper for RT.
How does a scraper work, you might ask. Well, hold on. I was just getting to that. A scraper is a program that visits a number of webpages based on certain criteria and "scrapes" information off of them. I built mine on top of the urllib2 module in Python. It starts with one or more seeds, in this case users with lots of friends, and gathers their data. Then it records a list of their friends and adds all of those friends to a queue. As each user comes off the queue, their pages are scraped. If the scrape finds they have more friends than a certain threshold (in my case, 500), it adds their data to a list and then puts all their friends on the queue. If not, they're ignored. You have to ignore a bunch of people, otherwise you'll be running the program for a week.
As it is, I processed about 75,000 users in 18 hours and ended up with a list of 319 users with over 500 friends. The list isn't necessarily complete, but it would have been difficult for anyone over the threshold to be missed. The data takes up about 7 MB, which doesn't sound like much until you realize that an average ebook is less than half a megabyte.
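For the curious, here is a minimal sketch of that queue-and-threshold loop. It is not my actual scraper: it's written for Python 3, it never touches the network, and `fetch_friends` is a made-up stand-in for whatever function downloads a user's page and pulls out their friend list. The `toy` data and the names in it are invented purely for illustration, and for simplicity the sketch applies the threshold to the seeds as well.

```python
from collections import deque


def crawl(seeds, fetch_friends, threshold=500):
    """Breadth-first crawl that only expands users with more than `threshold` friends.

    `fetch_friends` is a hypothetical callable: it takes a username and
    returns that user's friends as a list of usernames.
    """
    queue = deque(seeds)   # users waiting to be scraped
    seen = set(seeds)      # users already queued, so nobody is scraped twice
    kept = {}              # users over the threshold, mapped to their friends
    processed = 0          # how many users were scraped in total

    while queue:
        user = queue.popleft()
        processed += 1
        friends = fetch_friends(user)

        # Below the threshold: ignore the user and do not follow their friends.
        if len(friends) <= threshold:
            continue

        kept[user] = friends
        for friend in friends:
            if friend not in seen:
                seen.add(friend)
                queue.append(friend)

    return kept, processed


# Toy data standing in for the real site (entirely made up).
toy = {
    "alice": ["bob", "carol", "dave"],
    "bob": ["alice"],
    "carol": ["alice", "bob", "dave"],
    "dave": ["alice", "carol"],
}

kept, processed = crawl(["alice"], lambda user: toy.get(user, []), threshold=2)
print(sorted(kept), processed)
```

With a threshold of 2, the toy run scrapes all four users but keeps only the two with three friends each. Swap the lambda for a function that actually fetches and parses a profile page and the loop itself stays the same.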
But plaintext is difficult to visualize, and they say a picture is worth a thousand words. So I used the NetworkX module to start manipulating the data. The results look a ...