academictwitteR is an R package for querying the Twitter Academic Research Product Track, which provides access to full-archive search and other v2 API endpoints. Its functions are written with academic research in mind, with a particular focus on data storage.
Authorization for Academic API access is covered in the package vignette: https://cran.r-project.org/web/packages/academictwitteR/vignettes/academictwitteR-auth.html
Once you have been granted access to the API, you can use academictwitteR to pull tweets.
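As a sketch of that one-time setup: set_bearer() stores your bearer token in ~/.Renviron as the TWITTER_BEARER environment variable, so you do not have to pass bearer_token to every call (the exact prompts may differ between package versions):
library(academictwitteR)
set_bearer() # opens ~/.Renviron; add a line like TWITTER_BEARER=YOURTOKENHERE, then restart R
# get_bearer() retrieves the stored token afterwards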
The main workhorse function of academictwitteR is get_all_tweets(), which retrieves tweets from the full archive of public tweets matching a specified query within a specified time period.
get_all_tweets(
  query = "halloween",
  start_tweets = "2020-10-01T00:00:00Z",
  end_tweets = "2021-10-25T00:00:00Z",
  n = 1000, # set an upper limit; the default is 100
  # bearer_token = "...", # only needed if you haven't run set_bearer(); not recommended
  data_path = "halloween", # recommended for large-scale collection; data are stored as JSON files
  bind_tweets = FALSE # recommended for large-scale collection
)
## query: halloween
## Warning: Directory already exists. Existing JSON files may be parsed and
## returned, choose a new path if this is not intended.
## Total pages queried: 1 (tweets captured this page: 498).
## Total pages queried: 2 (tweets captured this page: 496).
## Total pages queried: 3 (tweets captured this page: 494).
## Total tweets captured now reach 1000 : finishing collection.
## Data stored as JSONs: use bind_tweets function to bundle into data.frame
Now we use bind_tweets() to bundle the JSONs into a data frame object. There are two output formats to choose from: “raw” and “tidy”.
tweets <- bind_tweets(data_path = "halloween", output_format = "raw")
class(tweets)
## [1] "list"
bind_tweets(data_path = "halloween", output_format = "raw") %>% names
## [1] "tweet.entities.annotations" "tweet.entities.mentions"
## [3] "tweet.entities.urls" "tweet.entities.hashtags"
## [5] "tweet.entities.cashtags" "tweet.public_metrics.retweet_count"
## [7] "tweet.public_metrics.reply_count" "tweet.public_metrics.like_count"
## [9] "tweet.public_metrics.quote_count" "tweet.attachments.media_keys"
## [11] "tweet.attachments.poll_ids" "tweet.geo.place_id"
## [13] "tweet.geo.coordinates" "tweet.referenced_tweets"
## [15] "tweet.main" "user.public_metrics.followers_count"
## [17] "user.public_metrics.following_count" "user.public_metrics.tweet_count"
## [19] "user.public_metrics.listed_count" "user.entities.url"
## [21] "user.entities.description" "user.main"
## [23] "sourcetweet.main"
So the output in the “raw” format is a list of data frames, and some columns are still nested as list columns. One way to avoid this is to use the “tidy” format:
tweets_tidy <- bind_tweets(data_path = "halloween", output_format = "tidy")
class(tweets_tidy)
## [1] "tbl_df" "tbl" "data.frame"
colnames(tweets_tidy)
## [1] "tweet_id" "user_username" "text"
## [4] "source" "conversation_id" "lang"
## [7] "created_at" "possibly_sensitive" "author_id"
## [10] "in_reply_to_user_id" "user_protected" "user_location"
## [13] "user_description" "user_pinned_tweet_id" "user_name"
## [16] "user_url" "user_profile_image_url" "user_verified"
## [19] "user_created_at" "retweet_count" "like_count"
## [22] "quote_count" "user_tweet_count" "user_list_count"
## [25] "user_followers_count" "user_following_count" "sourcetweet_type"
## [28] "sourcetweet_id" "sourcetweet_text" "sourcetweet_lang"
## [31] "sourcetweet_author_id"
Now the data come in the tibble class. There are some caveats about this format:
- It contains data about tweets, authors, and “source tweets” (i.e. retweeted or quoted tweets), keyed by the columns tweet_id, author_id, and sourcetweet_id respectively.
- To retrieve the tweet that a reply responds to, you have to use the conversation_id.
- Some data fields are lost, such as the lists of hashtags, cashtags, URLs, entities, and context annotations (see the sketch below for recovering these from the “raw” format).
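If you need those entity fields, here is a minimal sketch of recovering hashtags from the “raw” output, assuming the tweet.entities.hashtags element follows the v2 entity schema with a tag column:
tweets_raw <- bind_tweets(data_path = "halloween", output_format = "raw")
hashtags <- tweets_raw[["tweet.entities.hashtags"]]
head(hashtags$tag) # hashtag text, without the leading "#"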
If you only need the number of matching tweets rather than the tweets themselves, count_all_tweets() returns counts at a chosen time granularity:
count_all_tweets(query = "halloween",
                 start_tweets = "2021-10-01T00:00:00Z",
                 end_tweets = "2021-10-25T00:00:00Z",
                 granularity = "day")
## query: halloween
## Total pages queried: 1 (tweets captured this page: 24).
## This is the last page for halloween : finishing collection.
## end start tweet_count
## 1 2021-10-02T00:00:00.000Z 2021-10-01T00:00:00.000Z 543489
## 2 2021-10-03T00:00:00.000Z 2021-10-02T00:00:00.000Z 495990
## 3 2021-10-04T00:00:00.000Z 2021-10-03T00:00:00.000Z 412397
## 4 2021-10-05T00:00:00.000Z 2021-10-04T00:00:00.000Z 415268
## 5 2021-10-06T00:00:00.000Z 2021-10-05T00:00:00.000Z 357890
## 6 2021-10-07T00:00:00.000Z 2021-10-06T00:00:00.000Z 354965
## 7 2021-10-08T00:00:00.000Z 2021-10-07T00:00:00.000Z 352597
## 8 2021-10-09T00:00:00.000Z 2021-10-08T00:00:00.000Z 333127
## 9 2021-10-10T00:00:00.000Z 2021-10-09T00:00:00.000Z 300663
## 10 2021-10-11T00:00:00.000Z 2021-10-10T00:00:00.000Z 285580
## 11 2021-10-12T00:00:00.000Z 2021-10-11T00:00:00.000Z 346884
## 12 2021-10-13T00:00:00.000Z 2021-10-12T00:00:00.000Z 359464
## 13 2021-10-14T00:00:00.000Z 2021-10-13T00:00:00.000Z 363654
## 14 2021-10-15T00:00:00.000Z 2021-10-14T00:00:00.000Z 430863
## 15 2021-10-16T00:00:00.000Z 2021-10-15T00:00:00.000Z 475678
## 16 2021-10-17T00:00:00.000Z 2021-10-16T00:00:00.000Z 458902
## 17 2021-10-18T00:00:00.000Z 2021-10-17T00:00:00.000Z 431112
## 18 2021-10-19T00:00:00.000Z 2021-10-18T00:00:00.000Z 467959
## 19 2021-10-20T00:00:00.000Z 2021-10-19T00:00:00.000Z 499991
## 20 2021-10-21T00:00:00.000Z 2021-10-20T00:00:00.000Z 502286
## 21 2021-10-22T00:00:00.000Z 2021-10-21T00:00:00.000Z 531884
## 22 2021-10-23T00:00:00.000Z 2021-10-22T00:00:00.000Z 543159
## 23 2021-10-24T00:00:00.000Z 2021-10-23T00:00:00.000Z 563103
## 24 2021-10-25T00:00:00.000Z 2021-10-24T00:00:00.000Z 615789
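A quick sketch of visualising those daily counts, assuming ggplot2 is installed (as.Date() drops the time portion of the ISO timestamps in the start column):
counts <- count_all_tweets(query = "halloween",
                           start_tweets = "2021-10-01T00:00:00Z",
                           end_tweets = "2021-10-25T00:00:00Z",
                           granularity = "day")
library(ggplot2)
counts$start <- as.Date(counts$start) # "2021-10-01T00:00:00.000Z" -> 2021-10-01
ggplot(counts, aes(x = start, y = tweet_count)) +
  geom_line() +
  labs(x = "Day", y = "Tweet count")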
You can also collect tweets posted by specific users. Passing a users vector builds a query of the form (from:user1 OR from:user2), as echoed below:
user_twt <-
  get_all_tweets(
    # query = "", # optionally combine with a search query
    users = c("CornellUCOMM", "Cornell"),
    start_tweets = "2020-01-01T00:00:00Z",
    end_tweets = "2021-06-01T00:00:00Z",
    n = 1000
    # n = Inf # to get all tweets from each user; subject to RAM limits and rate limits
  )
## Warning: Recommended to specify a data path in order to mitigate data loss when
## ingesting large amounts of data.
## Warning: Tweets will not be stored as JSONs or as a .rds file and will only be
## available in local memory if assigned to an object.
## query: (from:CornellUCOMM OR from:Cornell)
## Total pages queried: 1 (tweets captured this page: 500).
## Total pages queried: 2 (tweets captured this page: 500).
## Total tweets captured now reach 1000 : finishing collection.
head(user_twt$text)
## [1] "RT @CornellAlumni: Today, and every day, we remember the Cornell alumni who gave the ultimate sacrifice in service of our country. #Memoria…"
## [2] "With tickets for two in hand, families of undergraduate seniors and graduate school candidates poured into Ithaca for the first in-person graduation since December 2019. https://t.co/zN1FlZq9r5 #Cornell2021"
## [3] "A new New York Times article quotes @nataliebazarova and cites her recent research on social media disclosure during the pandemic. https://t.co/N2GzTKUUaL"
## [4] "We honor the lives of the brave women and men we’ve lost while serving our country. #MemorialDay https://t.co/6ek6yTD2j6"
## [5] "RT @CREarle: Commencement is always bittersweet as we say goodbye to amazing @Cornell students who now go forth to change the world. Thank…"
## [6] "RT @EzraCornell: Congratulations to the #Cornell2021 graduating class! You persevered through uncertainty and challenges, and you and your…"
A note on n = Inf: if you hit the error “vector memory exhausted (limit reached?)”, you can increase the memory available to R by raising the R_MAX_VSIZE environment variable, or use a machine with more RAM.
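A minimal sketch of raising that limit (R_MAX_VSIZE is read at startup, so it must be set in ~/.Renviron and R restarted; the 64Gb value is only an example):
usethis::edit_r_environ() # opens ~/.Renviron for editing
# add a line such as:
# R_MAX_VSIZE=64Gb
# then restart R for the new limit to take effect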
You can also collect all the replies in a thread by passing a conversation_id:
conversation <-
  get_all_tweets(
    # replace with a tweet ID of your choice to get its replies
    conversation_id = "1403738886275096605",
    start_tweets = "2020-10-01T00:00:00Z",
    end_tweets = "2021-10-25T00:00:00Z",
    bind_tweets = TRUE
  )
## Warning: Recommended to specify a data path in order to mitigate data loss when
## ingesting large amounts of data.
## Warning: Tweets will not be stored as JSONs or as a .rds file and will only be
## available in local memory if assigned to an object.
## query: conversation_id:1403738886275096605
## Total pages queried: 1 (tweets captured this page: 14).
## This is the last page for conversation_id:1403738886275096605 : finishing collection.
colnames(conversation)
## [1] "id" "public_metrics" "referenced_tweets"
## [4] "text" "lang" "author_id"
## [7] "entities" "created_at" "in_reply_to_user_id"
## [10] "possibly_sensitive" "source" "conversation_id"
head(conversation$text)
## [1] "@suhemparack @AcademicChatter This is cool. Good work!"
## [2] "@suhemparack Been checking every two days or so for some news on home timeline v2 endpoint :D"
## [3] "@thelifeofrishi I actually don’t know about that one but you can check if it is on our roadmap here: https://t.co/exGQTQAm0Q"
## [4] "@suhemparack Nice! I was wondering if v2 version of home_timeline is in the works, can you please confirm? Mentions and User v2 is there now we just need home_timeline to make things work with v2"
## [5] "@suhemparack this looks like an excellent way to stalk someone 🙄"
## [6] "@suhemparack The ability to search by geographic regions would be helpful (if you haven't incorporated it already). Sorting by information on engagement and impression would be helpful as well. THANKS!"
# Not run: an example restricting the search to geo-tagged tweets
# covid <- get_all_tweets(
#   query = "covid-19 has:geo",
#   start_tweets = "2021-01-01T01:00:00Z",
#   end_tweets = "2021-01-01T02:00:00Z"
# )
# View(tweets) # inspect the collected tweets interactively
You can also look up user profiles directly with get_user_profile():
# Replace with user IDs of your choice
user_ids <- c("2244994945", "6253282")
users <- get_user_profile(user_ids)
## Processing from 1 to 2
colnames(users)
## [1] "id" "url" "protected"
## [4] "created_at" "verified" "pinned_tweet_id"
## [7] "name" "entities" "public_metrics"
## [10] "location" "profile_image_url" "description"
## [13] "username"
get_all_tweets() also accepts filter arguments, which are translated into operators in the query string (echoed below):
halloween_us <- get_all_tweets(
query = "halloween",
start_tweets = "2021-10-01T00:00:00Z",
end_tweets = "2021-10-25T00:00:00Z",
n = 500,
country = "US", is_verified = FALSE,
lang = "en", is_retweet = FALSE
)
## Warning: Recommended to specify a data path in order to mitigate data loss when
## ingesting large amounts of data.
## Warning: Tweets will not be stored as JSONs or as a .rds file and will only be
## available in local memory if assigned to an object.
## query: halloween -is:retweet -is:verified place_country:US lang:en
## Rate limit reached. Rate limit will reset at 2021-10-27 16:10:09
## Sleeping for 509 seconds.
## ================================================================================
## Total pages queried: 1 (tweets captured this page: 500).
## Total tweets captured now reach 500 : finishing collection.
The full list of arguments you can pass to get_all_tweets() can be found at https://github.com/cjbarrie/academictwitteR.
Passing a character vector as the query combines the terms with OR; putting several words inside a single string (e.g. "halloween trick treat") combines them with AND:
halloween_us <- get_all_tweets(
  query = c("halloween", "trick or treat", "candies"),
  start_tweets = "2021-10-01T00:00:00Z",
  end_tweets = "2021-10-25T00:00:00Z",
  n = 100
)
## Warning: Recommended to specify a data path in order to mitigate data loss when
## ingesting large amounts of data.
## Warning: Tweets will not be stored as JSONs or as a .rds file and will only be
## available in local memory if assigned to an object.
## query: (halloween OR trick or treat OR candies)
## Total pages queried: 1 (tweets captured this page: 497).
## Total tweets captured now reach 100 : finishing collection.
halloween_us %>% as_tibble()
## # A tibble: 100 × 14
## conversation_id author_id id created_at public_metrics$r… $reply_count
## <chr> <chr> <chr> <chr> <int> <int>
## 1 145242466608193… 993817397… 14524… 2021-10-24… 9 0
## 2 145242466607368… 140361496… 14524… 2021-10-24… 440 0
## 3 145242466590165… 789593646… 14524… 2021-10-24… 915 0
## 4 145242466530617… 122436164… 14524… 2021-10-24… 532 0
## 5 145242466333899… 244707145 14524… 2021-10-24… 137 0
## 6 145242466179558… 1278330990 14524… 2021-10-24… 0 0
## 7 145242466148511… 702164473 14524… 2021-10-24… 7 0
## 8 145242466069656… 141496693… 14524… 2021-10-24… 0 0
## 9 145242465947613… 140560806… 14524… 2021-10-24… 7993 0
## 10 145241964097877… 136908712… 14524… 2021-10-24… 0 0
## # … with 90 more rows, and 9 more variables: lang <chr>, text <chr>,
## # referenced_tweets <list>, entities <df[,5]>, source <chr>,
## # possibly_sensitive <lgl>, attachments <df[,1]>, in_reply_to_user_id <chr>,
## # geo <df[,2]>
You can also call build_query() separately and pass its output as the query to get_all_tweets() later. See ?build_query for the available arguments.
# ?build_query
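As a sketch, assuming build_query() accepts the same filter arguments that get_all_tweets() forwards to it:
q <- build_query(query = "halloween", is_retweet = FALSE,
                 country = "US", lang = "en")
q # e.g. "halloween -is:retweet place_country:US lang:en"
get_all_tweets(query = q,
               start_tweets = "2021-10-01T00:00:00Z",
               end_tweets = "2021-10-25T00:00:00Z",
               n = 100)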
If a collection is interrupted, resume_collection() picks it up where it stopped in a given data path, and update_collection() extends an existing collection to a newer end date:
# resume_collection(data_path = "data")
# update_collection(data_path = "data", end_tweets = "2020-05-10T00:00:00Z")
Further resources:
- Introduction to Twitter data processing and storage on AWS: https://dev.to/twitterdev/introduction-to-twitter-data-processing-and-storage-on-aws-1og
- Package documentation: https://cran.r-project.org/web/packages/academictwitteR/ and https://github.com/cjbarrie/academictwitteR
- Details about the tweet object: https://developer.twitter.com/en/docs/twitter-api/data-dictionary/object-model/tweet
- Tutorial for beginners: https://github.com/twitterdev/getting-started-with-the-twitter-api-v2-for-academic-research
- Contact point for questions about academic Twitter: @suhemparack on Twitter, or the Q&A forum at https://twittercommunity.com/c/academic-research/62