Prints a table of all corpora available through the text2map.corpora package, along with their metadata: type, category, number of rows and columns, and download status.

Usage

list_corpora(type = NULL, category = NULL, downloaded_only = FALSE)

Arguments

type

Optional filter: "bundled" for corpora included with the package, "download" for corpora that must be downloaded, or NULL for all.

category

Optional filter: "corpus" for text corpora, "tweetids" for tweet ID lists, or NULL for all.

downloaded_only

If TRUE, show only corpora that are available locally: bundled corpora, which are always included, plus any downloadable corpora that have already been downloaded. Defaults to FALSE.

Value

A data frame of corpus metadata, one row per corpus, returned invisibly. The table itself is printed to the console.
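
Because the data frame is returned invisibly, it has to be assigned to a name before it can be filtered or inspected further. The sketch below illustrates that pattern with a small stand-in data frame so that it runs without the package installed. The rows are copied from the example output on this page; the column names and types are assumptions based on the printed table, not a confirmed description of the returned object.

```r
# Stand-in for the invisible return value of list_corpora(). With the package
# installed, this would be: corpora <- text2map.corpora::list_corpora()
# Column names and types are assumed from the printed table.
corpora <- data.frame(
  corpus   = c("corpus_usnss", "corpus_enron", "corpus_disaster", "tweetids_gme"),
  type     = c("bundled", "download", "download", "download"),
  category = c("corpus", "corpus", "corpus", "tweetids"),
  n_rows   = c(18, 30965, 10860, 15594),
  n_cols   = c(2, 7, 3, 1),
  stringsAsFactors = FALSE
)

# Keep downloadable text corpora with fewer than 20,000 rows
small <- subset(corpora, type == "download" & category == "corpus" & n_rows < 20000)
small$corpus
```

Filtering the captured data frame this way complements the type and category arguments, which only cover the built-in filters.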

Examples

list_corpora()
#> Available corpora:
#>                   corpus     type category  n_rows n_cols     downloaded
#>     corpus_annual_review  bundled   corpus      70      7        Bundled
#>       corpus_tng_season5  bundled   corpus   10834      5        Bundled
#>             corpus_usnss  bundled   corpus      18      2        Bundled
#>      corpus_envsociology  bundled   corpus     817      8        Bundled
#>      corpus_cmu_blogs100  bundled   corpus     100      6        Bundled
#>   corpus_europarl_subset  bundled   corpus   10000      4        Bundled
#>           corpus_beyonce  bundled   corpus      83     10        Bundled
#>      corpus_taylor_swift  bundled   corpus     110     10        Bundled
#>      corpus_presidential  bundled   corpus    2475     13        Bundled
#>     corpus_senti_bench4k  bundled   corpus    4044      6        Bundled
#>  corpus_isot_fake_news2k  bundled   corpus    2000      5        Bundled
#>      corpus_finefoods10k  bundled   corpus    9999      9        Bundled
#>        corpus_atn_immigr  bundled   corpus    3230      8        Bundled
#>             corpus_ittpr  bundled   corpus     976      7        Bundled
#>    corpus_reddit_aita10k  bundled   corpus   10157     18        Bundled
#>             corpus_enron download   corpus   30965      7 Not downloaded
#>     corpus_nytimes_covid download   corpus     982     28 Not downloaded
#>          corpus_disaster download   corpus   10860      3 Not downloaded
#>        corpus_web_dubois download   corpus   12757      5 Not downloaded
#>    corpus_isot_fake_news download   corpus   44244      5 Not downloaded
#>         corpus_pitchfork download   corpus   20783     13 Not downloaded
#>           corpus_dsj_vox download   corpus   22789      8 Not downloaded
#>               corpus_atn download   corpus  204135     13 Not downloaded
#>              corpus_atn2 download   corpus 2688879     11 Not downloaded
#>         corpus_finefoods download   corpus   50000      9 Not downloaded
#>       corpus_reddit_aita download   corpus   32766     18 Not downloaded
#>       corpus_senti_bench download   corpus   11557      6 Not downloaded
#>      corpus_black_mirror download   corpus   18972      5 Not downloaded
#>        corpus_scifi_pulp download   corpus    2110     11 Not downloaded
#>     corpus_moral_stories download   corpus   24000     10 Not downloaded
#>           tweetids_covid download tweetids    1922      1 Not downloaded
#>       tweetids_covid_geo download tweetids    1999      1 Not downloaded
#>        tweetids_stayhome download tweetids   23737      1 Not downloaded
#>             tweetids_gme download tweetids   15594      1 Not downloaded
list_corpora(type = "bundled")
#> Corpus results:
#>                   corpus    type category n_rows n_cols downloaded
#>     corpus_annual_review bundled   corpus     70      7    Bundled
#>       corpus_tng_season5 bundled   corpus  10834      5    Bundled
#>             corpus_usnss bundled   corpus     18      2    Bundled
#>      corpus_envsociology bundled   corpus    817      8    Bundled
#>      corpus_cmu_blogs100 bundled   corpus    100      6    Bundled
#>   corpus_europarl_subset bundled   corpus  10000      4    Bundled
#>           corpus_beyonce bundled   corpus     83     10    Bundled
#>      corpus_taylor_swift bundled   corpus    110     10    Bundled
#>      corpus_presidential bundled   corpus   2475     13    Bundled
#>     corpus_senti_bench4k bundled   corpus   4044      6    Bundled
#>  corpus_isot_fake_news2k bundled   corpus   2000      5    Bundled
#>      corpus_finefoods10k bundled   corpus   9999      9    Bundled
#>        corpus_atn_immigr bundled   corpus   3230      8    Bundled
#>             corpus_ittpr bundled   corpus    976      7    Bundled
#>    corpus_reddit_aita10k bundled   corpus  10157     18    Bundled
list_corpora(category = "tweetids")
#> Corpus results:
#>              corpus     type category n_rows n_cols     downloaded
#>      tweetids_covid download tweetids   1922      1 Not downloaded
#>  tweetids_covid_geo download tweetids   1999      1 Not downloaded
#>   tweetids_stayhome download tweetids  23737      1 Not downloaded
#>        tweetids_gme download tweetids  15594      1 Not downloaded
list_corpora(downloaded_only = TRUE)
#> Corpus results:
#>                   corpus    type category n_rows n_cols downloaded
#>     corpus_annual_review bundled   corpus     70      7    Bundled
#>       corpus_tng_season5 bundled   corpus  10834      5    Bundled
#>             corpus_usnss bundled   corpus     18      2    Bundled
#>      corpus_envsociology bundled   corpus    817      8    Bundled
#>      corpus_cmu_blogs100 bundled   corpus    100      6    Bundled
#>   corpus_europarl_subset bundled   corpus  10000      4    Bundled
#>           corpus_beyonce bundled   corpus     83     10    Bundled
#>      corpus_taylor_swift bundled   corpus    110     10    Bundled
#>      corpus_presidential bundled   corpus   2475     13    Bundled
#>     corpus_senti_bench4k bundled   corpus   4044      6    Bundled
#>  corpus_isot_fake_news2k bundled   corpus   2000      5    Bundled
#>      corpus_finefoods10k bundled   corpus   9999      9    Bundled
#>        corpus_atn_immigr bundled   corpus   3230      8    Bundled
#>             corpus_ittpr bundled   corpus    976      7    Bundled
#>    corpus_reddit_aita10k bundled   corpus  10157     18    Bundled