ArXiv Fever

ArXiv is a public repository where researchers make their new work available. It's a great resource when you are trying to find cutting-edge research undertaken by physicists, computer scientists, or biologists, among others. Since these are people who do a lot to advance fields like social network analysis and data mining, sociologists should not shy away from this work, despite the many equations.

In this post, I'll develop a quick and dirty way to get an impression of what is happening on arXiv using the service's API.

In [1]:
%pylab inline
Populating the interactive namespace from numpy and matplotlib
In [2]:
matplotlib.rcParams['figure.figsize'] = (24.0, 16.0)

The above two commands prepare our IPython environment for graphical output, which awaits you further down.

Next we'll load a few libraries and modules. Requests and wordcloud are not in the standard library, but they are easy to install using a system package manager or pip.

In [3]:
import requests
from urllib.parse import urlencode
from xml.dom.minidom import parseString as xml_parse
from collections import Counter
from wordcloud import WordCloud, STOPWORDS

Next, I'll define a helper function, arxiv_query, to get the data from arXiv and parse it. Thankfully arXiv does not require registration or any kind of access key, so this function is quite simple.

Note: arxiv_query asks for up to 1,000 items in a single request, which could take a while. If you are planning on doing research using queries that could return hundreds of items or more, it is probably better to page through the results.

In [4]:
def arxiv_query(q, method='query'):
    ARXIV_API = 'http://export.arxiv.org/api/{}?{}'
    d = {'max_results': 1000,
         'search_query': q}
    query_url = ARXIV_API.format(method, urlencode(d))
    r = requests.get(query_url)
    if r.status_code == 200:
        return xml_parse(r.text)
    else:
        raise ConnectionError('arXiv API returned status {}'.format(r.status_code))
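
The paging mentioned in the note above could look roughly like the following. This is only a sketch: it assumes the API's start and max_results parameters are used to request one slice of the results at a time, and the helper name paged_query_urls and the page size of 100 are made up for illustration. It only builds the URLs, so it runs without touching the network.

```python
from urllib.parse import urlencode

ARXIV_API = 'http://export.arxiv.org/api/query?{}'

def paged_query_urls(q, total, page_size=100):
    """Build one query URL per page (hypothetical helper, not part of the post)."""
    urls = []
    for start in range(0, total, page_size):
        params = {'search_query': q,
                  'start': start,  # offset of the first result on this page
                  'max_results': min(page_size, total - start)}
        urls.append(ARXIV_API.format(urlencode(params)))
    return urls

urls = paged_query_urls('all:instagram', 250)
for u in urls:
    print(u)
```

Each URL can then be fetched in turn with requests.get, ideally with a short pause between requests; check the API documentation for its current guidance on request rates.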

My preoccupation these days is with Instagram research, so I'm going to see what people on arXiv have uploaded under that query.

In [5]:
t = arxiv_query('all:instagram')

Let's see how many entries we got back.

In [6]:
len(t.getElementsByTagName('entry'))
Out[6]:
15

As you might expect, research using Instagram data is not very voluminous yet. Only fifteen papers on arXiv so far.

How long has research been going on?

In [7]:
dates = [d.firstChild.nodeValue for d in t.getElementsByTagName('published')]
In [8]:
sorted(dates)
Out[8]:
['2014-06-30T14:22:39Z',
 '2014-08-02T03:45:53Z',
 '2014-08-07T10:39:52Z',
 '2014-08-21T05:13:04Z',
 '2014-10-29T19:02:21Z',
 '2014-11-16T20:10:30Z',
 '2015-03-05T06:03:53Z',
 '2015-03-12T22:35:06Z',
 '2015-05-07T09:26:31Z',
 '2015-05-26T08:39:07Z',
 '2015-07-10T02:58:04Z',
 '2015-07-13T16:20:43Z',
 '2015-08-18T00:54:01Z',
 '2015-08-25T19:27:23Z',
 '2015-09-07T13:21:48Z']

The first entry dates from June 2014; the latest is just a few days old. Clearly this is an area that's only beginning to capture researchers' attention.
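
To put a number on that impression, the timestamps can be tallied by year or by month. A minimal sketch, which simply re-enters the fifteen dates printed above and relies on the fact that the first four characters of each timestamp are the year and the first seven are the year and month:

```python
from collections import Counter

# The 'published' timestamps from the output above, copied by hand
dates = ['2014-06-30T14:22:39Z', '2014-08-02T03:45:53Z', '2014-08-07T10:39:52Z',
         '2014-08-21T05:13:04Z', '2014-10-29T19:02:21Z', '2014-11-16T20:10:30Z',
         '2015-03-05T06:03:53Z', '2015-03-12T22:35:06Z', '2015-05-07T09:26:31Z',
         '2015-05-26T08:39:07Z', '2015-07-10T02:58:04Z', '2015-07-13T16:20:43Z',
         '2015-08-18T00:54:01Z', '2015-08-25T19:27:23Z', '2015-09-07T13:21:48Z']

per_year = Counter(d[:4] for d in dates)    # 'YYYY'
per_month = Counter(d[:7] for d in dates)   # 'YYYY-MM'

print(sorted(per_year.items()))
```

Six papers fall in 2014 and nine in 2015, the latter covering only the months up to early September.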

What is it that they are researching? Since there are only fifteen papers, we could easily read through them or at least skim their abstracts. But since we have the data right here, we might as well do a bit of aggregation to get an overview. Besides, isn't it always nice to have an excuse to create a word cloud?

In [9]:
summaries = [s.firstChild.nodeValue for s in t.getElementsByTagName('summary')]
s = ' '.join(summaries).lower()

We can count how often certain words appear, sans predefined stop words.

In [10]:
Counter([w for w in s.split() if w not in STOPWORDS]).most_common(15)
Out[10]:
[('social', 42),
 ('users', 19),
 ('online', 16),
 ('media', 14),
 ('analysis', 14),
 ('instagram', 13),
 ('also', 12),
 ('study', 10),
 ('popular', 9),
 ('network', 9),
 ('find', 8),
 ('twitter', 8),
 ('data', 8),
 ('using', 8),
 ('cyberbullying', 8)]
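
One caveat about counts like these: splitting on whitespace leaves punctuation attached, so "instagram," and "instagram" are counted as different words. A rough alternative, if that matters, is to pull out word-like tokens with a regular expression first. The helper count_words and the tiny stop list below are made up for illustration; in the post itself the STOPWORDS set from wordcloud would take the place of EXAMPLE_STOPS.

```python
import re
from collections import Counter

# Stand-in stop list for this example only
EXAMPLE_STOPS = {'the', 'on', 'of', 'and', 'a', 'is'}

def count_words(text, stops):
    """Count lowercase word tokens, ignoring punctuation and stop words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(w for w in words if w not in stops)

sample = "Cyberbullying on Instagram: a study of Instagram users, and of cyberbullying."
counts = count_words(sample, EXAMPLE_STOPS)
print(counts.most_common(2))
```

With a plain split(), the sample sentence would yield 'instagram:' and 'cyberbullying.' as separate tokens; here both words are counted twice.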

Some of these words look a bit too obvious -- they may just detract from more interesting word occurrences -- so I'm going to append them to the list of stop words that will be removed from the visualization.

In [11]:
stops = list(STOPWORDS) + ['social', 'online', 'media', 'analysis', 'instagram', 'also', 'study', 'using']

Okay, now we're all set for word cloud prettiness.

In [12]:
wordcloud = WordCloud(stopwords=stops, 
                      background_color='white', 
                      width=2400, height=1600, 
                      min_font_size=6).generate(s)
In [13]:
plt.imshow(wordcloud)
Out[13]:
<matplotlib.image.AxesImage at 0x780b0ff2c400>

Here we have it: a visualization of the most cutting-edge Instagram research. It's not the most surprising result, but hey, this is more of a proof of concept anyway.

But still, we can say a few things about Instagram research on the basis of this word cloud.

  • The high incidence of the words "user", "follower" and "network" suggests that researchers are looking at characteristics of social networks formed on the platform.
  • Researchers are relating what they observe on Instagram to research on Twitter (and, to a far lesser extent, Flickr), but interestingly Facebook, Instagram's parent company, does not appear to be referenced very much.
  • The actual content of what is pictured on Instagram does not seem to play a huge role. Words like "health", "fashion", "restaurants" and "food" do appear in the word cloud, but they are pretty small.
  • Business uses of Instagram seem to be a focus for some researchers, hence the occurrence of words like "market" and "merchant".
  • Researchers are studying incidents of cyberbullying on Instagram.

That's all for now. Stay tuned for more fun with APIs.