American Sociology and Zombies
- Authored by John D. Boy
A frightening tale of imperfect data, a contrived metric, and undead scholarship.
Read on, if you dare, and you'll find out what zombies haunt American sociology! 🧟♀️🧟♂️🎓
I couldn't resist the Halloween theme for a quick writeup of my recent foray into bibliometric research. I wanted to figure out whether I could identify "undead" sociological scholarship—work that, after being dead for a while, found an unexpected second lease on life. I understand "dead" in the scientometric sense of not being cited.
How to find such undead scholarship? With a bibliometric approach, it's possible to cast a wide net. I decided to first focus on work that appeared in two major journals, the American Journal of Sociology and the American Sociological Review. Both journals have a long publication history (starting in 1895 and 1936, respectively), so they've had plenty of time to breed a few zombies.
Here's how I created my data set:
- I collected the digital object identifier (e.g., 10.1007/s12108-015-9254-0) of every article to appear in the two journals from their beginnings until 1990, figuring that work published after that hasn't been around long enough to attain zombie status. This is a simple matter of making a few API calls to Crossref (see the first sketch after this list). Crossref responds not only with DOIs but also with a range of other useful information, including a citation count.
- I then queried Lens for additional data on all AJS and ASR articles. What I was most interested in was the list of citing articles for each DOI I queried. Without knowing which articles are or aren't being cited, how will I find the zombies?
- Because Lens does not supply full bibliographic information for each citing article but only a Lens ID (e.g., 016-637-274-198-210), I needed to submit another batch of queries. This time I sent the Lens IDs of the citing articles and asked to get back their publication dates (the second sketch after this list shows what such queries might look like). That way I can determine not just whether articles are being cited, but also when.
- Finally, I merge the list of AJS and ASR articles with the list of citing articles and their publication years. For each article, I can then find out not just how often it was cited (a popular vanity metric), but also how long the gaps between citations are.
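For the curious, here is roughly what the Crossref step could look like. This is a minimal sketch using Crossref's public REST API with cursor-based deep paging; the ISSNs (0002-9602 for AJS, 0003-1224 for ASR) are my additions here, so double-check them before reusing this.

import requests

def journal_dois(issn, until='1990-12-31'):
    """Collect DOIs and Crossref citation counts for one journal up to a cutoff date."""
    url = f'https://api.crossref.org/journals/{issn}/works'
    params = {'filter': f'until-pub-date:{until}', 'rows': 1000, 'cursor': '*'}
    dois = {}
    while True:
        message = requests.get(url, params=params).json()['message']
        if not message['items']:
            return dois
        for item in message['items']:
            # 'is-referenced-by-count' is Crossref's citation count
            dois[item['DOI']] = item.get('is-referenced-by-count', 0)
        params['cursor'] = message['next-cursor']  # continue deep paging

ajs = journal_dois('0002-9602')  # ISSNs assumed; verify against the journals
asr = journal_dois('0003-1224')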
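The two Lens steps would then look something like the sketch below. The endpoint is the Lens scholarly search API; the query syntax (the terms filter and batching in particular) is my best reconstruction from the Lens documentation rather than the exact code I ran, and you need your own API token.

import requests

LENS_URL = 'https://api.lens.org/scholarly/search'
HEADERS = {'Authorization': 'Bearer <YOUR-LENS-TOKEN>'}  # placeholder token

def lens_records(id_field, ids, fields):
    """Fetch a batch of Lens records, returning only the requested fields."""
    # In practice, chunk ids into batches the API accepts
    query = {'query': {'terms': {id_field: ids}},
             'include': fields,
             'size': len(ids)}
    return requests.post(LENS_URL, json=query, headers=HEADERS).json()['data']

# Step two: Lens records (incl. citing articles) for the AJS/ASR DOIs
articles = lens_records('doi', list(ajs) + list(asr),
                        ['lens_id', 'date_published', 'scholarly_citations'])
# Step three: publication dates for the citing articles' Lens IDs
citing_ids = [c for a in articles for c in a.get('scholarly_citations', [])]
citing = lens_records('lens_id', citing_ids, ['lens_id', 'date_published'])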
How do I get from citation gaps to finding zombies? This is where it gets extra scary. I concocted a metric that I call the zombie score. It is a function of the maximum gap between citations, the standard deviation of all gaps, the total "lifetime" of the article (from publication date to date of last citation), and the total number of citations. Why, you ask? Do you really want to know? Really? I told you the answer is scary. Alright, I'll tell you, but don't say I didn't warn you. I simply developed the score through trial and error, trying slightly different calculations until the score seemed to surface the kinds of articles I was hoping to identify.
The horror.
Alright, you asked for it...
(Want to hunt zombies as well? You can find the data in this repository.)
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sb

plt.rcParams['figure.figsize'] = [18, 12]

# Article-level data and citing-article data (read_msgpack requires pandas < 1.0)
amsoc_lens = pd.read_msgpack('amsoc_lens.msgpack')
amsoc_citations = pd.read_msgpack('amsoc_citations_lens.msgpack')
After loading the two data sets (AJS and ASR articles, citing articles), I merged them so that I could calculate citation gaps for each article. This requires a little intermediate work to bridge the two, which is what the following inelegant code does.
# Build a long-format table with one row per (cited article, citing article) pair
citation_pairs = []
for i, row in amsoc_lens[['scholarly_citations']].dropna(how='all').iterrows():
    citation_pairs.extend([(i, cit) for cit in row['scholarly_citations']])
citations_long = pd.DataFrame(citation_pairs, columns=['doi', 'citing_article_lens_id'])

# Attach article-level fields, then the citing articles' metadata
citations_long = citations_long.merge(amsoc_lens[['year',
                                                  'date_published',
                                                  'citations_crossref',
                                                  'citations_lens']],
                                      left_on='doi', right_index=True)
citations_long = citations_long.merge(amsoc_citations,
                                      left_on='citing_article_lens_id',
                                      right_on='lens_id', how='left')\
                               .drop(columns='lens_id').set_index('doi')
Now that the data sets are merged, the following calculates gaps between citations.
# Sort citations chronologically within each article, then compute year gaps
citations_long = citations_long.reset_index().sort_values(['doi', 'citing_article_year']).set_index('doi')
citations_long['prev_cit_year'] = citations_long.groupby('doi')['citing_article_year'].shift()
# The first gap runs from the publication year to the first citation
citations_long['prev_cit_year'] = citations_long['prev_cit_year'].fillna(citations_long['year'])
citations_long['cit_gap'] = citations_long['citing_article_year'] - citations_long['prev_cit_year']

# Per-article summaries: longest gap, gap variability, and year of last citation
max_cit_gaps = citations_long.groupby('doi')[['cit_gap']].max()\
                             .rename(columns={'cit_gap': 'max_cit_gap'})
std_cit_gaps = citations_long.groupby('doi')[['cit_gap']].std(ddof=0)\
                             .rename(columns={'cit_gap': 'std_cit_gap'})
last_cit = citations_long.groupby('doi')[['citing_article_year']].max()\
                         .rename(columns={'citing_article_year': 'last_cit'})
Finally, I merge the citation gap metrics back into the article-level data.
amsoc_lens = amsoc_lens.merge(max_cit_gaps, on='doi')\
                       .merge(std_cit_gaps, on='doi')\
                       .merge(last_cit, on='doi')
I also calculate the difference between the number of citations reported by Lens and by Crossref. This is a crude indicator of data quality: I'm relying on the Lens data, but if Lens reports fewer citations than Crossref, the Lens data likely have substantial problems. Citation gaps would then appear too large, resulting in spurious zombie sightings. Can't have that.
amsoc_lens['lens_crossref_diff'] = amsoc_lens['citations_lens'] - amsoc_lens['citations_crossref']
I will restrict the analysis to articles with at least five citations and no obvious data problems.
# Keep articles with at least five Lens citations and no Lens/Crossref discrepancy
amsoc_lens = amsoc_lens[(amsoc_lens['citations_lens'] >= 5) &
                        (amsoc_lens['lens_crossref_diff'] >= 0)]
Now we're ready to calculate the zombie score. 😱
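In plain math, here is what the next cell computes (with lifetime being the years from publication to last citation, and citations being the Lens count):

$$\text{zombie score} = \frac{\text{max gap} \times \sigma_{\text{gaps}}}{\text{lifetime} \times \log_{10}(\text{citations})}$$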
amsoc_lens['lifetime'] = amsoc_lens['last_cit'] - amsoc_lens['year']
amsoc_lens['zombie_score'] = ((amsoc_lens['max_cit_gap'] * amsoc_lens['std_cit_gap']) /
                              (amsoc_lens['lifetime'] * np.log10(amsoc_lens['citations_lens'])))
sb.lineplot(x='year', y='zombie_score', data=amsoc_lens)
Not surprisingly, zombies were mostly born a long time ago. In fact, the most zombie-ish articles are from a time before ASR even existed, meaning that ASR zombies are probably a rare breed.
The jaggedness of the line in the early decades indicates that there are a lot of data problems with older articles. But let's just ignore that for now, shall we?
What does it mean for an article to have a low zombie score? It means it's an evergreen article, the kind that's been continually cited. Unsurprisingly, that characteristic is highly correlated with a high absolute citation count. Before we look at the zombies, let's look at their sprightly counterparts.
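(If you want to verify that correlation yourself, a quick check along these lines should do; I use a rank correlation since the relationship is unlikely to be linear. Expect a substantial negative coefficient.)

# Rank correlation between the zombie score and raw citation counts
amsoc_lens[['zombie_score', 'citations_lens']].corr(method='spearman')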
amsoc_lens.sort_values('zombie_score')\
          .head(10)[['family_names', 'title', 'journal', 'year', 'last_cit', 'citations_lens']]
Most of these highly cited "evergreen" articles are from the 1970s and 1980s. (Remember that I didn't collect any data past 1990.) Interestingly, rival sociologists Parsons and Gouldner are two exceptions, breaking into these high ranks with considerably older articles.
But enough of these pedestrian pieces of scholarship. Let's finally look at the zombies.
Doomsday drumroll please...
amsoc_lens.sort_values('zombie_score', ascending=False)\
          .head(20)[['family_names', 'title', 'journal', 'year', 'citations_lens', 'last_cit']]
There we have them, the 20 most undead articles of American sociology. They were, with one exception, all published before World War I, and they all appeared in AJS. They have only been cited a handful of times, but they've all been cited in the past decade.
I plan to look more into who necrobumped these articles and why, but a glance over the list already suggests a few possible reasons. G.H. Mead and Albion Small were major figures in the (first) Chicago School, and it seems like their work in AJS has been rediscovered by scholars taking an interest in the Chicago School's intellectual history. Annie Marion MacLean is on the list (twice) for a similar reason, except that she wasn't a major figure in the Chicago School, but one of the neglected women toiling away at ethnographic studies. Nellie Mason Auten is another such example.
The note by Durkheim published as a "short editorial" is the only thing the eminent French sociologist ever published in AJS, as far as I can see.
But many of the other names are more puzzling. Can we expect an Ellwood revival, or a Hayes revival, or a Chapin revival? Unlikely. Ratzel coined the phrase Lebensraum beloved by the Nazis; that's a zombie to slay. I'm glad I learned about Vladimir Karapetoff and his mathematical ideas about human satisfaction though, and I'll be sure to inspect more zombies up close.
What about zombies born after ASR came into being?
amsoc_lens[amsoc_lens['year'] >= 1936].sort_values('zombie_score', ascending=False)\
                                      .head(20)[['family_names', 'title', 'journal', 'year', 'citations_lens', 'last_cit']]
We see a fairly even split between the two journals. Some of the authors here are clearly superstars being rediscovered, as with the Chicago School scholars before. Karen Horney! Robert Merton! Everett Hughes! But others are again more puzzling...
It's also noticeable that all zombies have 5 or 6 citations—right at the cutoff. That suggests that my metric needs adjusting—but the thought of that is just too scary for me right now...
Happy Halloween! 🎃