Summarizing Tweets in a Disaster

On April 25th 2015, just before noon, Nepal experienced an earthquake of magnitude 7.8 on the moment magnitude scale. The earthquake ripped through Kathmandu valley, and a series of aftershocks leveled entire villages.

Immediately after the earthquake, volunteers from around the world were instrumental in guiding emergency operations, using satellite imagery to identify infrastructure destruction throughout the region.

However, people on the ground in Nepal were also generating tremendous amounts of information which could be of use to rescue operations, albeit less directly: on Twitter. Between April 25th and May 28th, people in Nepal posted 33,610 tweets. These tweets were full of useful information, but 33,610 tweets is simply too many for a rescue operation to comb through.

This is the motivation for this project:

Can I take a large body of tweets, and from them extract a useful, short summary which could have been useful to rescue operations on the ground?

Contents

  1. Getting the tweets
  2. Finding all the useful tweets, using content words and tf-idf scores
  3. Picking the best tweets to make a short summary
  4. Conclusion

1. Getting the tweets

Link to code

I obtained my tweets from ‘Twitter as a Lifeline: Human-annotated Twitter Corpora for NLP of Crisis-related Messages’; this project collected thousands of tweets from crises, and labelled each with one of 8 categories (eg. ‘displaced people and evacuations’, or ‘sympathy and emotional support’).

However, since a rescue team wouldn’t be able to label the tweets, I only used the dataset for the tweets themselves, not for the labels.

Twitter’s policy says that only tweet IDs, not the tweets themselves, can be saved — this way, if a user deletes their tweets, they are not stored elsewhere. I therefore had to use Twitter’s API to obtain the tweet contents from this corpus. This required me to register an app:

[Image: screenshot of the Twitter app registration]

I then used Twython to download the tweets, and was on my way!

2. Extracting the situational tweets

Link to code

It was immediately clear that not all tweets would be equally useful to rescue teams. For instance,

@Gurmeetramrahim: #MSGHelpEarthquakeVictims Shocked!!!hearing
earthquake in Nepal n some parts of India. I pray to GOD to
save His child…

contains no information which is useful to rescue teams, especially compared to:

MEA opens 24 hour Control Room for queries regarding the Nepal #Earthquake.
Numbers:
+91 11 2301 2113
+91 11 2301 4104
+91 11 2301 7905

Useful tweets are categorized as situational tweets: they may contain status updates, or immediately useful information (such as the phone numbers of nearby hospitals).

Non-situational tweets contain, for example, sentiments, opinions, or event analyses. These do not immediately help rescue efforts.

Before I could begin summarizing tweets, I needed to separate the situational tweets from non-situational ones. I did this in two steps: first, I manually identified characteristics of tweets which contribute to their usefulness (2.1 — content words). Then I used a document analysis measure called tf-idf (2.2) to find words which are significant to this particular event and group of tweets.

2.1. Content words

In their 2015 paper, Rudra et al. identified three classes of terms which provide important information during disasters: numerals, nouns, and main verbs. Words which fall into these classes are called content words. I found this too general to differentiate tweets, and defined two classes of my own:

  1. Numerals (eg. number of casualties, important phone numbers)
  2. Entities (eg. places, dates, events, organisations, etc.)

SpaCy (a natural language processing library which automatically analyzes and extracts information from text) is a very useful tool for identifying content words. When SpaCy tokenizes text, it attaches a lot of additional information to the tokens, such as whether a token is an entity (and if so, what type of entity it is), what its part of speech is (i.e. is it a noun? a verb? a numeral?), or even the token’s sentiment.

[Image: an example tweet annotated by SpaCy]

I used SpaCy to tokenize the tweets. This meant breaking the tweets down into their components (primarily words, but also punctuation and numbers), and turning these components into tokens. For instance,

: Over 110 killed in earthquake: Nepal Home Ministry (PTI)

becomes

[:, Over, 110, killed, in, earthquake, :, Nepal, Home, Ministry,
(, PTI, )]

The power of SpaCy is that these tokens are full of additional information; for example, I can find out the part of speech of all of these tokens:

[u'PUNCT', u'ADP', u'NUM', u'VERB', u'ADP', u'NOUN', u'PUNCT',
u'PROPN', u'PROPN', u'PROPN', u'PUNCT', u'PROPN', u'PUNCT']

Another very useful attribute of a token is its entity type; SpaCy can tell me that Kathmandu is a city, or that 25 April is a date. I included a token as a content word if its entity type (token.ent_type_) was:

  1. NORP: Nationality or religious or political group
  2. FACILITY: Buildings, airports, highways, bridges, etc.
  3. ORG: Companies, agencies, institutions, etc.
  4. GPE: Countries, cities, states.
  5. LOC: Non-GPE locations, mountain ranges, bodies of water.
  6. EVENT: Named hurricanes, battles, wars, sports events, etc.
  7. DATE: Absolute or relative dates or periods.
  8. TIME: Times smaller than a day.

I also included tokens if their part of speech (token.pos_) marked them as a number, or if they were one of a list of key words (eg. ‘killed’, ‘injured’, ‘hurt’).

This classification broadly allowed me to start sorting my tweets; situational tweets contain more content words, while non-situational tweets contain fewer. For instance,

@timesofindia: #Earthquake | Helpline numbers of the Indian Embassy
in Nepal:\r+9779581107021\r\r+9779851135141'

is a highly useful tweet, and as expected, many content words are extracted:

[the, Indian, Embassy, Nepal, 977, 9581107021, 977, 9851135141]

On the other hand,

Pray for #Nepal where a powerful earthquake has struck. May the
Almighty grant them ease to face this calamity with utmost s…

contains very little situational information, and the only content word extracted from it is [Nepal].

2.2. tf-idf scores

A drawback of content words is that they fail to capture any information about the words themselves. For instance, for this disaster, the word Nepal is going to be a strong indicator of whether or not a tweet is situational, but it’s not weighted any differently than any other content word.

It is possible to introduce such a weighting, using term frequency — inverse document frequency (tf-idf) scores.

Despite its long name, the logic behind a tf-idf score is straightforward:

“Words which occur fairly frequently in a body of documents are probably more important, but if they occur too often, then they are too general to be helpful.”

Basically, I wanted to score the word ‘Nepal’ highly (and it should occur once, in many tweets), but not the word ‘the’ (which should occur many times, in many tweets).

Mathematically, the tf-idf score for some word t can be described as

\[ \textrm{tf-idf score} = \overline{c_{t}} \times \log\left(\frac{N}{n_{t}}\right) \]

where c̄_t is the average number of times the word t appears in a document, N is the total number of documents, and n_t is the number of documents in which the word t appears.

Textacy, a library built on top of SpaCy, made it super easy for me to assign tf-idf scores to the words in the tweets:

WORD:morning -- tf-idf SCORE:6.45446604904
WORD:bishnu -- tf-idf SCORE:8.06390396147
WORD:nagarkot -- tf-idf SCORE:12.2359876248
WORD:search -- tf-idf SCORE:6.35915586923
WORD:kathmandu -- tf-idf SCORE:5.27350744616

Now, if I picked only the tweets which had lots of content words, or even only the tweets with lots of content words with high tf-idf scores, I would still end up with far too many tweets for rescue teams to realistically find useful.

What I want is a short summary of tweets which covers as many of the highest-scoring content words as possible.

3. Content Word-based Tweet Summarization

To be useful to rescue teams, the summary needs to be short (and therefore quick to read). It also needs to contain information which would be of use to them — so the summary needs to be full of content words with high tf-idf scores.

I can define this precisely as an objective to maximize, subject to constraints:

Objective: Maximize the total score of the content words in my summary.

Constraint 1: The summary must be shorter than 150 words.

Constraint 2: If I pick a content word to be in my summary, then I must pick some tweet which contains that content word to be in my summary.

Constraint 3: If I pick some tweet to be in my summary, then all the content words in that tweet must be included.

I need to maximize this objective, subject to the constraints. The variables I am solving for (whether a given tweet or content word is in the summary or not) are integers. Specifically, each choice is binary — 1 if the tweet or word is included, 0 if it is not.

This approach of maximizing some objective (the total score of content words) subject to some constraints, with integer variables, is known as Integer Linear Programming (ILP).

Using ILP, I can define mathematically what I wrote above as maximizing

\[ \sum_{i=1}^{n}x_{i} + \sum_{j=1}^{m} Score(j) \cdot y_{j}\]

where x_i and y_j are binary variables indicating whether tweet i and content word j are selected, and Score(j) is content word j’s tf-idf score. The constraints are defined as

\[ \sum_{i=1}^{n} x_{i}\cdot \textrm{Length}(i) \leq L \]

\[ \sum_{j \in C_{i}} y_{j} \geq | C_{i} | \times x_{i}, \quad i = 1, \ldots, n \]

\[ \sum_{i \in T_{j}} x_{i} \geq y_{j}, \quad j = 1, \ldots, m \]

(Here, C_i is the set of content words in tweet i, and T_j is the set of tweets containing content word j; these are described in more detail in my Jupyter notebook.)

Using PyMathProg to optimize this ILP problem yielded the following summary:

1. TV: 2 dead, 100 injured in Bangladesh from Nepal quake: DHAKA, Bangladesh (AP) - A TV r...
-------------
2. : Earthquake helpline at the Indian Embassy in Kathmandu-+977 98511 07021, +977 98511 35141
-------------
3. +91 11 2301 7905om no for Nepal #earthquake +91 11 2301 2113
-------------
4. Tremors felt in West Bengal after 7.9 Magnitude Earthquake in Nepal
-------------
5. This mobile App may help these people to predict this powerfull M7.9 earthquake
-------------
6. 5.00 earthquake occured 31km NNW of Nagarkot, Nepal at 09:30 UTC! #earthquake #Nagarkot
-------------
7. Earthquake in bihar up punjab haryana delhi ncr and nepal
-------------
8. : Whole Himalayan region is becoming non stable. Two yrs back Uttrakhand, then Kashmir now Nepal n north east. Even Tibet is…
-------------
9. WellingtonHere Nepal's Home Ministry Says at Least 71 People Killed in the Earthquake: Nepal'...  WellingtonHere
-------------
10. 934s of major earthquake-s in Nepal:
-------------
11. Historic Dharahara tower collapses in Kathmandu after quake | Latest News & Updates at Daily...
-------------
12. Powerful quake near Nepal capital causes widespread damage, avalanche near Everest base camp

Comparing this to random tweets:

[Image: a random selection of tweets, for comparison]

There definitely is some noise, but it’s not bad! The summary is especially good at providing locational information, such as where the epicenter is and which areas were impacted by the earthquake.

4. Conclusions

If I were to do this project again, I would use a Twitter-specific tokenizer. SpaCy’s tokenizer was actually pretty poor at tokenizing the data, because of all the abbreviations and Twitter-specific lingo. I would also have cleaned the data more, since misspellings and abbreviations may also have hurt the tokenizer’s performance.

Overall, it was awesome to experiment with methods other than word embeddings to compare and quantify textual data. Also, the speed of this method makes me confident it could be implemented by rescue teams; in particular, if smaller subsets of tweets were used (eg. daily tweets, instead of tweets covering the whole event), it could be very useful!

4.1. Sources, and further reading

This project involved implementing a variety of different papers; here are some cool papers to further explore Twitter summarization in disasters:

Twitter as a Lifeline: Human-annotated Twitter Corpora for NLP of Crisis-related Messages

Natural Language Processing to the Rescue? Extracting “Situational Awareness” Tweets During Mass Emergency

Summarizing Situational Tweets in a Crisis Scenario (this post essentially implemented this paper)

Update — Using NLTK’s Twitter tokenizer

Link to code

I repeated the exercise using NLTK instead of spaCy. This allowed me to tokenize the tweets using NLTK’s Twitter-specific tokenizer, yielding the following summary (compared to the spaCy output):

[Image: the NLTK-based summary, side by side with the spaCy output]

(NDRF is India’s National Disaster Response Force)

Because the content words are defined differently, its difficult to quantitatively compare the two outputs, but it is worth noting that NLTK’s entity recognition system (I used Stanford’s NER) was significantly slower than spaCy’s.