academictwitteR is an R package for querying the Twitter Academic Research Product Track, which provides access to full-archive search and other v2 API endpoints. Its functions are written with academic research in mind, with a particular focus on data storage.
Authorization for Academic API access is covered in the package vignette: https://cran.r-project.org/web/packages/academictwitteR/vignettes/academictwitteR-auth.html
Once you have been granted access to the API, you can use academictwitteR to pull tweets.
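As a sketch of that one-time setup: set_bearer() stores your bearer token in ~/.Renviron as the TWITTER_BEARER environment variable, so you do not have to pass bearer_token to every call (the exact prompts may differ between package versions):
library(academictwitteR)
set_bearer() # opens ~/.Renviron; add a line like TWITTER_BEARER=YOURTOKENHERE, then restart R
# get_bearer() retrieves the stored token afterwards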
The main workhorse function of academictwitteR is get_all_tweets(), which retrieves tweets from the full archive of public tweets matching a specified query within a specified time period.
get_all_tweets(
  query = "halloween",
  start_tweets = "2020-10-01T00:00:00Z",
  end_tweets = "2021-10-25T00:00:00Z",
  n = 1000, # set an upper limit; the default is 100
  # bearer_token = "...", # only needed if you haven't run set_bearer(); not recommended
  data_path = "halloween", # recommended for large-scale collection; data are stored as JSON files
  bind_tweets = FALSE # recommended for large-scale collection
)
## query: halloween
## Warning: Directory already exists. Existing JSON files may be parsed and
## returned, choose a new path if this is not intended.
## Total pages queried: 1 (tweets captured this page: 498).
## Total pages queried: 2 (tweets captured this page: 496).
## Total pages queried: 3 (tweets captured this page: 494).
## Total tweets captured now reach 1000 : finishing collection.
## Data stored as JSONs: use bind_tweets function to bundle into data.frame
Now we use bind_tweets() to bundle the JSONs into a data frame object. There are two output formats to choose from: “raw” and “tidy”.
tweets <- bind_tweets(data_path = "halloween", output_format = "raw")
class(tweets)
## [1] "list"
bind_tweets(data_path = "halloween", output_format = "raw") %>% names
## [1] "tweet.entities.annotations" "tweet.entities.mentions"
## [3] "tweet.entities.urls" "tweet.entities.hashtags"
## [5] "tweet.entities.cashtags" "tweet.public_metrics.retweet_count"
## [7] "tweet.public_metrics.reply_count" "tweet.public_metrics.like_count"
## [9] "tweet.public_metrics.quote_count" "tweet.attachments.media_keys"
## [11] "tweet.attachments.poll_ids" "tweet.geo.place_id"
## [13] "tweet.geo.coordinates" "tweet.referenced_tweets"
## [15] "tweet.main" "user.public_metrics.followers_count"
## [17] "user.public_metrics.following_count" "user.public_metrics.tweet_count"
## [19] "user.public_metrics.listed_count" "user.entities.url"
## [21] "user.entities.description" "user.main"
## [23] "sourcetweet.main"
So the output in the “raw” format is a list of data frames, and some columns are still nested as list columns. One way to avoid this is to use the “tidy” format:
tweets_tidy <- bind_tweets(data_path = "halloween", output_format = "tidy")
class(tweets_tidy)
## [1] "tbl_df" "tbl" "data.frame"
colnames(tweets_tidy)
## [1] "tweet_id" "user_username" "text"
## [4] "source" "conversation_id" "lang"
## [7] "created_at" "possibly_sensitive" "author_id"
## [10] "in_reply_to_user_id" "user_protected" "user_location"
## [13] "user_description" "user_pinned_tweet_id" "user_name"
## [16] "user_url" "user_profile_image_url" "user_verified"
## [19] "user_created_at" "retweet_count" "like_count"
## [22] "quote_count" "user_tweet_count" "user_list_count"
## [25] "user_followers_count" "user_following_count" "sourcetweet_type"
## [28] "sourcetweet_id" "sourcetweet_text" "sourcetweet_lang"
## [31] "sourcetweet_author_id"
Now the data come in the tibble class. There are some caveats about this format:
- It contains data about tweets, authors, and “source tweets” (i.e. retweeted or quoted tweets), keyed by the columns tweet_id, author_id, and sourcetweet_id respectively.
- To retrieve the tweet that a reply responds to, you have to use the conversation_id.
- Some data fields are lost, such as the lists of hashtags, cashtags, URLs, entities, and context annotations (see the sketch below for recovering these from the “raw” format).
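If you need those entity fields, here is a minimal sketch of recovering hashtags from the “raw” output, assuming the tweet.entities.hashtags element follows the v2 entity schema with a tag column:
tweets_raw <- bind_tweets(data_path = "halloween", output_format = "raw")
hashtags <- tweets_raw[["tweet.entities.hashtags"]]
head(hashtags$tag) # hashtag text, without the leading "#"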
If you only need the number of matching tweets rather than the tweets themselves, count_all_tweets() returns counts at a chosen time granularity:
count_all_tweets(query = "halloween",
                 start_tweets = "2021-10-01T00:00:00Z",
                 end_tweets = "2021-10-25T00:00:00Z",
                 granularity = "day")
## query: halloween
## Total pages queried: 1 (tweets captured this page: 24).
## This is the last page for halloween : finishing collection.
## end start tweet_count
## 1 2021-10-02T00:00:00.000Z 2021-10-01T00:00:00.000Z 543489
## 2 2021-10-03T00:00:00.000Z 2021-10-02T00:00:00.000Z 495990
## 3 2021-10-04T00:00:00.000Z 2021-10-03T00:00:00.000Z 412397
## 4 2021-10-05T00:00:00.000Z 2021-10-04T00:00:00.000Z 415268
## 5 2021-10-06T00:00:00.000Z 2021-10-05T00:00:00.000Z 357890
## 6 2021-10-07T00:00:00.000Z 2021-10-06T00:00:00.000Z 354965
## 7 2021-10-08T00:00:00.000Z 2021-10-07T00:00:00.000Z 352597
## 8 2021-10-09T00:00:00.000Z 2021-10-08T00:00:00.000Z 333127
## 9 2021-10-10T00:00:00.000Z 2021-10-09T00:00:00.000Z 300663
## 10 2021-10-11T00:00:00.000Z 2021-10-10T00:00:00.000Z 285580
## 11 2021-10-12T00:00:00.000Z 2021-10-11T00:00:00.000Z 346884
## 12 2021-10-13T00:00:00.000Z 2021-10-12T00:00:00.000Z 359464
## 13 2021-10-14T00:00:00.000Z 2021-10-13T00:00:00.000Z 363654
## 14 2021-10-15T00:00:00.000Z 2021-10-14T00:00:00.000Z 430863
## 15 2021-10-16T00:00:00.000Z 2021-10-15T00:00:00.000Z 475678
## 16 2021-10-17T00:00:00.000Z 2021-10-16T00:00:00.000Z 458902
## 17 2021-10-18T00:00:00.000Z 2021-10-17T00:00:00.000Z 431112
## 18 2021-10-19T00:00:00.000Z 2021-10-18T00:00:00.000Z 467959
## 19 2021-10-20T00:00:00.000Z 2021-10-19T00:00:00.000Z 499991
## 20 2021-10-21T00:00:00.000Z 2021-10-20T00:00:00.000Z 502286
## 21 2021-10-22T00:00:00.000Z 2021-10-21T00:00:00.000Z 531884
## 22 2021-10-23T00:00:00.000Z 2021-10-22T00:00:00.000Z 543159
## 23 2021-10-24T00:00:00.000Z 2021-10-23T00:00:00.000Z 563103
## 24 2021-10-25T00:00:00.000Z 2021-10-24T00:00:00.000Z 615789
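A quick sketch of visualising those daily counts, assuming ggplot2 is installed (as.Date() drops the time portion of the ISO timestamps in the start column):
counts <- count_all_tweets(query = "halloween",
                           start_tweets = "2021-10-01T00:00:00Z",
                           end_tweets = "2021-10-25T00:00:00Z",
                           granularity = "day")
library(ggplot2)
counts$start <- as.Date(counts$start) # "2021-10-01T00:00:00.000Z" -> 2021-10-01
ggplot(counts, aes(x = start, y = tweet_count)) +
  geom_line() +
  labs(x = "Day", y = "Tweet count")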
You can also collect tweets posted by specific users. Passing a users vector builds a query of the form (from:user1 OR from:user2), as echoed below:
user_twt <-
  get_all_tweets(
    # query = "", # optionally combine with a search query
    users = c("CornellUCOMM", "Cornell"),
    start_tweets = "2020-01-01T00:00:00Z",
    end_tweets = "2021-06-01T00:00:00Z",
    n = 1000
    # n = Inf # to get all tweets from each user; subject to RAM limits and rate limits
  )
## Warning: Recommended to specify a data path in order to mitigate data loss when
## ingesting large amounts of data.
## Warning: Tweets will not be stored as JSONs or as a .rds file and will only be
## available in local memory if assigned to an object.
## query: (from:CornellUCOMM OR from:Cornell)
## Total pages queried: 1 (tweets captured this page: 500).
## Total pages queried: 2 (tweets captured this page: 500).
## Total tweets captured now reach 1000 : finishing collection.
head(user_twt$text)
## [1] "RT @CornellAlumni: Today, and every day, we remember the Cornell alumni who gave the ultimate sacrifice in service of our country. #Memoria…"
## [2] "With tickets for two in hand, families of undergraduate seniors and graduate school candidates poured into Ithaca for the first in-person graduation since December 2019. https://t.co/zN1FlZq9r5 #Cornell2021"
## [3] "A new New York Times article quotes @nataliebazarova and cites her recent research on social media disclosure during the pandemic. https://t.co/N2GzTKUUaL"
## [4] "We honor the lives of the brave women and men we’ve lost while serving our country. #MemorialDay https://t.co/6ek6yTD2j6"
## [5] "RT @CREarle: Commencement is always bittersweet as we say goodbye to amazing @Cornell students who now go forth to change the world. Thank…"
## [6] "RT @EzraCornell: Congratulations to the #Cornell2021 graduating class! You persevered through uncertainty and challenges, and you and your…"
A note on n = Inf: if you hit the error “vector memory exhausted (limit reached?)”, you can increase the memory available to R by raising the R_MAX_VSIZE environment variable, or use a machine with more RAM.
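A minimal sketch of raising that limit (R_MAX_VSIZE is read at startup, so it must be set in ~/.Renviron and R restarted; the 64Gb value is only an example):
usethis::edit_r_environ() # opens ~/.Renviron for editing
# add a line such as:
# R_MAX_VSIZE=64Gb
# then restart R for the new limit to take effect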
You can also collect all the replies in a thread by passing a conversation_id:
conversation <-
  get_all_tweets(
    # replace with a tweet ID of your choice to get its replies
    conversation_id = "1403738886275096605",
    start_tweets = "2020-10-01T00:00:00Z",
    end_tweets = "2021-10-25T00:00:00Z",
    bind_tweets = TRUE
  )
## Warning: Recommended to specify a data path in order to mitigate data loss when
## ingesting large amounts of data.
## Warning: Tweets will not be stored as JSONs or as a .rds file and will only be
## available in local memory if assigned to an object.
## query: conversation_id:1403738886275096605
## Total pages queried: 1 (tweets captured this page: 14).
## This is the last page for conversation_id:1403738886275096605 : finishing collection.
colnames(conversation)
## [1] "id" "public_metrics" "referenced_tweets"
## [4] "text" "lang" "author_id"
## [7] "entities" "created_at" "in_reply_to_user_id"
## [10] "possibly_sensitive" "source" "conversation_id"
head(conversation$text)
## [1] "@suhemparack @AcademicChatter This is cool. Good work!"
## [2] "@suhemparack Been checking every two days or so for some news on home timeline v2 endpoint :D"
## [3] "@thelifeofrishi I actually don’t know about that one but you can check if it is on our roadmap here: https://t.co/exGQTQAm0Q"
## [4] "@suhemparack Nice! I was wondering if v2 version of home_timeline is in the works, can you please confirm? Mentions and User v2 is there now we just need home_timeline to make things work with v2"
## [5] "@suhemparack this looks like an excellent way to stalk someone 🙄"
## [6] "@suhemparack The ability to search by geographic regions would be helpful (if you haven't incorporated it already). Sorting by information on engagement and impression would be helpful as well. THANKS!"
# Not run: an example restricting the search to geo-tagged tweets
# covid <- get_all_tweets(
#   query = "covid-19 has:geo",
#   start_tweets = "2021-01-01T01:00:00Z",
#   end_tweets = "2021-01-01T02:00:00Z"
# )
# View(tweets) # inspect the collected tweets interactively
You can also look up user profiles directly with get_user_profile():
# Replace with user IDs of your choice
user_ids <- c("2244994945", "6253282")
users <- get_user_profile(user_ids)
## Processing from 1 to 2
colnames(users)
## [1] "id" "url" "protected"
## [4] "created_at" "verified" "pinned_tweet_id"
## [7] "name" "entities" "public_metrics"
## [10] "location" "profile_image_url" "description"
## [13] "username"
get_all_tweets() also accepts filter arguments, which are translated into operators in the query string (echoed below):
halloween_us <- get_all_tweets(
query = "halloween",
start_tweets = "2021-10-01T00:00:00Z",
end_tweets = "2021-10-25T00:00:00Z",
n = 500,
country = "US", is_verified = FALSE,
lang = "en", is_retweet = FALSE
)
## Warning: Recommended to specify a data path in order to mitigate data loss when
## ingesting large amounts of data.
## Warning: Tweets will not be stored as JSONs or as a .rds file and will only be
## available in local memory if assigned to an object.
## query: halloween -is:retweet -is:verified place_country:US lang:en
## Rate limit reached. Rate limit will reset at 2021-10-27 16:10:09
## Sleeping for 509 seconds.
## ================================================================================
## Total pages queried: 1 (tweets captured this page: 500).
## Total tweets captured now reach 500 : finishing collection.
The full list of arguments you can pass to get_all_tweets() can be found at https://github.com/cjbarrie/academictwitteR.
Passing a character vector as the query combines the terms with OR; putting several words inside a single string (e.g. "halloween trick treat") combines them with AND:
halloween_us <- get_all_tweets(
  query = c("halloween", "trick or treat", "candies"),
  start_tweets = "2021-10-01T00:00:00Z",
  end_tweets = "2021-10-25T00:00:00Z",
  n = 100
)
## Warning: Recommended to specify a data path in order to mitigate data loss when
## ingesting large amounts of data.
## Warning: Tweets will not be stored as JSONs or as a .rds file and will only be
## available in local memory if assigned to an object.
## query: (halloween OR trick or treat OR candies)
## Total pages queried: 1 (tweets captured this page: 497).
## Total tweets captured now reach 100 : finishing collection.
halloween_us %>% as_tibble()
## # A tibble: 100 × 14
## conversation_id author_id id created_at public_metrics$r… $reply_count
## <chr> <chr> <chr> <chr> <int> <int>
## 1 145242466608193… 993817397… 14524… 2021-10-24… 9 0
## 2 145242466607368… 140361496… 14524… 2021-10-24… 440 0
## 3 145242466590165… 789593646… 14524… 2021-10-24… 915 0
## 4 145242466530617… 122436164… 14524… 2021-10-24… 532 0
## 5 145242466333899… 244707145 14524… 2021-10-24… 137 0
## 6 145242466179558… 1278330990 14524… 2021-10-24… 0 0
## 7 145242466148511… 702164473 14524… 2021-10-24… 7 0
## 8 145242466069656… 141496693… 14524… 2021-10-24… 0 0
## 9 145242465947613… 140560806… 14524… 2021-10-24… 7993 0
## 10 145241964097877… 136908712… 14524… 2021-10-24… 0 0
## # … with 90 more rows, and 9 more variables: lang <chr>, text <chr>,
## # referenced_tweets <list>, entities <df[,5]>, source <chr>,
## # possibly_sensitive <lgl>, attachments <df[,1]>, in_reply_to_user_id <chr>,
## # geo <df[,2]>
You can also call build_query() separately and pass its output as the query to get_all_tweets() later. See ?build_query for the available arguments.
# ?build_query
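As a sketch, assuming build_query() accepts the same filter arguments that get_all_tweets() forwards to it:
q <- build_query(query = "halloween", is_retweet = FALSE,
                 country = "US", lang = "en")
q # e.g. "halloween -is:retweet place_country:US lang:en"
get_all_tweets(query = q,
               start_tweets = "2021-10-01T00:00:00Z",
               end_tweets = "2021-10-25T00:00:00Z",
               n = 100)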
If a collection is interrupted, resume_collection() picks it up where it stopped in a given data path, and update_collection() extends an existing collection to a newer end date:
# resume_collection(data_path = "data")
# update_collection(data_path = "data", end_tweets = "2020-05-10T00:00:00Z")
Further resources:
- Introduction to Twitter data processing and storage on AWS: https://dev.to/twitterdev/introduction-to-twitter-data-processing-and-storage-on-aws-1og
- Package documentation: https://cran.r-project.org/web/packages/academictwitteR/ and https://github.com/cjbarrie/academictwitteR
- Details about the tweet object: https://developer.twitter.com/en/docs/twitter-api/data-dictionary/object-model/tweet
- Tutorial for beginners: https://github.com/twitterdev/getting-started-with-the-twitter-api-v2-for-academic-research
- Contact point for questions about academic Twitter: @suhemparack on Twitter, or the Q&A forum at https://twittercommunity.com/c/academic-research/62