The newscatcheR package provides three simple functions for reading RSS feeds from news outlets and returning them conveniently as a tibble.
The first function, get_news(), returns a tibble of the RSS feed of a given site.
# adding a small time delay to avoid simultaneous posts to the API
Sys.sleep(3)
get_news(website = "news.ycombinator.com")
#> GET request successful. Parsing...
#> Warning: Predicate functions must be wrapped in `where()`.
#>
#> # Bad
#> data %>% select(is.character)
#>
#> # Good
#> data %>% select(where(is.character))
#>
#> ℹ Please update your code.
#> This message is displayed once per session.
#> # A tibble: 30 x 10
#> feed_title feed_link feed_description feed_pub_date item_title
#> <chr> <chr> <chr> <dttm> <chr>
#> 1 Hacker Ne… https://… Links for the i… 2020-06-05 13:42:29 There Are…
#> 2 Hacker Ne… https://… Links for the i… 2020-06-05 13:42:29 SimRefine…
#> 3 Hacker Ne… https://… Links for the i… 2020-06-05 13:42:29 Why Is th…
#> 4 Hacker Ne… https://… Links for the i… 2020-06-05 13:42:29 Mental We…
#> 5 Hacker Ne… https://… Links for the i… 2020-06-05 13:42:29 Google2Cs…
#> 6 Hacker Ne… https://… Links for the i… 2020-06-05 13:42:29 People tr…
#> 7 Hacker Ne… https://… Links for the i… 2020-06-05 13:42:29 Ask HN: H…
#> 8 Hacker Ne… https://… Links for the i… 2020-06-05 13:42:29 Julia as …
#> 9 Hacker Ne… https://… Links for the i… 2020-06-05 13:42:29 First pho…
#> 10 Hacker Ne… https://… Links for the i… 2020-06-05 13:42:29 High-Spee…
#> # … with 20 more rows, and 5 more variables: item_link <chr>,
#> # item_description <chr>, item_pub_date <dttm>, item_category <list>,
#> # item_comments <chr>
The second function, get_headlines(), is a helper function that returns a tibble of just the headlines instead of the full RSS feed.
# adding a small time delay to avoid simultaneous posts to the API
Sys.sleep(3)
get_headlines(website = "news.ycombinator.com")
#> GET request successful. Parsing...
#> feed_entries$item_title
#> 1 There Are No Bugs, Just TODOs
#> 2 SimRefinery Recovered
#> 3 Why Is the Human Brain So Efficient? (2018)
#> 4 Mental Wealth
#> 5 Google2Csv is a simple Google scraper that saves the results on a CSV file
#> 6 People try to do right by each other, no matter the motivation, study finds
#> 7 Ask HN: How do I reach making $1-1.5k/mo in 13 months?
#> 8 Julia as a CLI Calculator
#> 9 First photo of HS2 tunnel boring machines
#> 10 High-Speed Pool and Billiards Video Clips
#> 11 Germany, France launch Gaia-X platform in bid for ‘tech sovereignty’
#> 12 A History of Clojure [pdf]
#> 13 Why So Many Police Are Handling the Protests Wrong
#> 14 Ask HN: How to Disagree with the Rest of Management?
#> 15 WeChat permanently closes account after user sets CCP-offensive password
#> 16 Synthetic red blood cells mimic natural ones, and have new abilities
#> 17 Julialang Antipatterns
#> 18 Signal app downloads spike as US protesters seek message encryption
#> 19 Ask HN: Which Coursera courses/specializations you recommend?
#> 20 The Beauty of Unix Pipelines
#> 21 Kids and Time
#> 22 Ask HN: Have you ever gone without a computer or phone for an extended period?
#> 23 Ask HN: Are my expectations on code quality and professionalism too high?
#> 24 Containers from first principles
#> 25 Words that don't translate into English
#> 26 The Story Behind The Unmarked Federal Agents Occupying Washington, D.C
#> 27 Homoiconicity Revisited
#> 28 The Go Compiler Needs to Be Smarter
#> 29 Open source 5G core network
#> 30 macOS in QEMU in Docker
The third function, tld_sources(), is a helper function for browsing news sites by top-level domain. It is useful for seeing which news sites from a given country are present in the database.
tld_sources("de")
#> # A tibble: 40 x 2
#> url rss_endpoint
#> <chr> <chr>
#> 1 spiegel.de https://www.spiegel.de/international/index.rss
#> 2 zeit.de http://newsfeed.zeit.de/index
#> 3 thelocal.de https://www.thelocal.de/feeds/rss.php
#> 4 deutschland.de https://www.deutschland.de/en/feed-news/rss.xml
#> 5 raccoon.onyxbits.de https://raccoon.onyxbits.de/blog/index.xml
#> 6 abendblatt.de http://www.abendblatt.de/?service=Rss
#> 7 berliner-zeitung.de https://www.berliner-zeitung.de/feed/index.rss
#> 8 bild.de http://www.bild.de/rss-feeds/rss-16725492,feed=home.bild…
#> 9 bz-berlin.de http://www.bz-berlin.de/rss
#> 10 capital.de http://www.capital.de/rss
#> # … with 30 more rows
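As a rough illustration of what browsing by top-level domain involves, the sketch below filters a small hand-made table by the ending of its url column. The table sites_db and the helper filter_by_tld() are hypothetical and are not part of newscatcheR; the package's own database and implementation may differ.

```r
# Hypothetical stand-in for the package's site database. Only the first two
# endpoints are taken from the output above; the others are left as NA.
sites_db <- data.frame(
  url = c("spiegel.de", "zeit.de", "bbc.com", "washingtonpost.com"),
  rss_endpoint = c(
    "https://www.spiegel.de/international/index.rss",
    "http://newsfeed.zeit.de/index",
    NA_character_,
    NA_character_
  ),
  stringsAsFactors = FALSE
)

# filter_by_tld() is an illustrative helper, not a newscatcheR function.
filter_by_tld <- function(db, tld) {
  # Keep the rows whose url ends in ".<tld>", e.g. "spiegel.de" for tld = "de".
  db[endsWith(db$url, paste0(".", tld)), , drop = FALSE]
}

de_sites <- filter_by_tld(sites_db, "de")
print(de_sites$url)
#> [1] "spiegel.de" "zeit.de"
```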
This package can be convenient if you need to fetch news from various websites for further analysis and you don’t want to search manually for the URL of their RSS feeds.
Assuming we have the news sites we want to follow:
sites <- c("bbc.com", "spiegel.de", "washingtonpost.com")
We can get a list of data frames with:
lapply(sites, get_news)
#> GET request successful. Parsing...
#>
#> GET request successful. Parsing...
#>
#> GET request successful. Parsing...
#> [[1]]
#> # A tibble: 35 x 14
#> feed_title feed_link feed_description feed_language feed_pub_date
#> <chr> <chr> <chr> <chr> <dttm>
#> 1 BBC News … https://… BBC News - World en-gb 2020-06-05 12:01:05
#> 2 BBC News … https://… BBC News - World en-gb 2020-06-05 12:01:05
#> 3 BBC News … https://… BBC News - World en-gb 2020-06-05 12:01:05
#> 4 BBC News … https://… BBC News - World en-gb 2020-06-05 12:01:05
#> 5 BBC News … https://… BBC News - World en-gb 2020-06-05 12:01:05
#> 6 BBC News … https://… BBC News - World en-gb 2020-06-05 12:01:05
#> 7 BBC News … https://… BBC News - World en-gb 2020-06-05 12:01:05
#> 8 BBC News … https://… BBC News - World en-gb 2020-06-05 12:01:05
#> 9 BBC News … https://… BBC News - World en-gb 2020-06-05 12:01:05
#> 10 BBC News … https://… BBC News - World en-gb 2020-06-05 12:01:05
#> # … with 25 more rows, and 9 more variables: feed_last_build_date <dttm>,
#> # feed_generator <chr>, feed_ttl <chr>, item_title <chr>, item_link <chr>,
#> # item_description <chr>, item_pub_date <dttm>, item_guid <chr>,
#> # item_category <list>
#>
#> [[2]]
#> # A tibble: 20 x 12
#> feed_title feed_link feed_description feed_language feed_pub_date
#> <chr> <chr> <chr> <chr> <dttm>
#> 1 DER SPIEG… https://… Deutschlands fü… de 2020-06-05 16:44:07
#> 2 DER SPIEG… https://… Deutschlands fü… de 2020-06-05 16:44:07
#> 3 DER SPIEG… https://… Deutschlands fü… de 2020-06-05 16:44:07
#> 4 DER SPIEG… https://… Deutschlands fü… de 2020-06-05 16:44:07
#> 5 DER SPIEG… https://… Deutschlands fü… de 2020-06-05 16:44:07
#> 6 DER SPIEG… https://… Deutschlands fü… de 2020-06-05 16:44:07
#> 7 DER SPIEG… https://… Deutschlands fü… de 2020-06-05 16:44:07
#> 8 DER SPIEG… https://… Deutschlands fü… de 2020-06-05 16:44:07
#> 9 DER SPIEG… https://… Deutschlands fü… de 2020-06-05 16:44:07
#> 10 DER SPIEG… https://… Deutschlands fü… de 2020-06-05 16:44:07
#> 11 DER SPIEG… https://… Deutschlands fü… de 2020-06-05 16:44:07
#> 12 DER SPIEG… https://… Deutschlands fü… de 2020-06-05 16:44:07
#> 13 DER SPIEG… https://… Deutschlands fü… de 2020-06-05 16:44:07
#> 14 DER SPIEG… https://… Deutschlands fü… de 2020-06-05 16:44:07
#> 15 DER SPIEG… https://… Deutschlands fü… de 2020-06-05 16:44:07
#> 16 DER SPIEG… https://… Deutschlands fü… de 2020-06-05 16:44:07
#> 17 DER SPIEG… https://… Deutschlands fü… de 2020-06-05 16:44:07
#> 18 DER SPIEG… https://… Deutschlands fü… de 2020-06-05 16:44:07
#> 19 DER SPIEG… https://… Deutschlands fü… de 2020-06-05 16:44:07
#> 20 DER SPIEG… https://… Deutschlands fü… de 2020-06-05 16:44:07
#> # … with 7 more variables: feed_last_build_date <dttm>, item_title <chr>,
#> # item_link <chr>, item_description <chr>, item_pub_date <dttm>,
#> # item_guid <chr>, item_category <list>
#>
#> [[3]]
#> # A tibble: 26 x 11
#> feed_title feed_link feed_description feed_language feed_pub_date
#> <chr> <chr> <chr> <chr> <dttm>
#> 1 World http://w… The Washington … en-US 2020-06-05 08:01:43
#> 2 World http://w… The Washington … en-US 2020-06-05 08:01:43
#> 3 World http://w… The Washington … en-US 2020-06-05 08:01:43
#> 4 World http://w… The Washington … en-US 2020-06-05 08:01:43
#> 5 World http://w… The Washington … en-US 2020-06-05 08:01:43
#> 6 World http://w… The Washington … en-US 2020-06-05 08:01:43
#> 7 World http://w… The Washington … en-US 2020-06-05 08:01:43
#> 8 World http://w… The Washington … en-US 2020-06-05 08:01:43
#> 9 World http://w… The Washington … en-US 2020-06-05 08:01:43
#> 10 World http://w… The Washington … en-US 2020-06-05 08:01:43
#> # … with 16 more rows, and 6 more variables: item_title <chr>, item_link <chr>,
#> # item_description <chr>, item_pub_date <dttm>, item_guid <chr>,
#> # item_category <list>
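When fetching several sites in one go, it can help to pause between requests, to keep going when a single feed fails, and to name each result after its site. The following is a minimal sketch of that idea. fetch_all() and its arguments are hypothetical and not part of newscatcheR; the fetch argument is assumed to default to get_news() and can be swapped for any function that takes a site name.

```r
# fetch_all() is an illustrative helper, not part of newscatcheR.
# `fetch` defaults to newscatcheR::get_news; because R evaluates default
# arguments lazily, the package is only needed when no other function is given.
fetch_all <- function(sites, fetch = newscatcheR::get_news, delay = 3) {
  # Evaluate `fetch` up front so a missing package fails loudly here
  # instead of being swallowed by tryCatch() below.
  force(fetch)
  results <- lapply(seq_along(sites), function(i) {
    # Pause between requests, but not before the first one.
    if (i > 1) Sys.sleep(delay)
    # Return NULL instead of stopping when a single site fails.
    tryCatch(fetch(sites[[i]]), error = function(e) NULL)
  })
  names(results) <- sites
  # Drop the sites that failed.
  Filter(Negate(is.null), results)
}

# Usage (requires newscatcheR and network access):
# feeds <- fetch_all(c("bbc.com", "spiegel.de", "washingtonpost.com"))
# feeds[["spiegel.de"]]
```

Naming the list elements makes it possible to look up a feed by site rather than by position, and sites that could not be fetched are simply absent from the result.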