To reduce the load on Wikidata’s servers and to speed up data processing, tidywikidatar makes extensive use of local caching.
What data are cached locally
There are a few types of data that are cached locally:
- searches run with tw_search()
- data about an item, typically retrieved with tw_get() or tw_get_property()
- labels or descriptions of properties, typically retrieved with tw_get_property_label() and tw_get_property_description()
- qualifiers of properties, typically retrieved with tw_get_qualifiers()
- data retrieved from (or about) Wikipedia pages, with tw_get_wikipedia() and tw_get_wikipedia_page_links()
To reduce the space used for local caching and to speed up processing, it is possible to store only labels and information available in a given language, when relevant.
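For example, setting a default language once per session with tw_set_language() means that subsequent queries, and the corresponding cache tables, refer only to that language (the language code below is just an illustration):

library("tidywikidatar")

# cache only English-language labels and descriptions;
# cache tables are created per language (e.g. tw_item_en, see below)
tw_set_language(language = "en")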
Caching with SQLite
In tidywikidatar, it is possible to enable caching with:

tw_enable_cache()

If you do not include further parameters, tidywikidatar will by default use a local SQLite database for caching.
You can choose the folder where the SQLite database will be stored with tw_set_cache_folder(); if it does not already exist, you can create that folder with tw_create_cache_folder().
tw_set_cache_folder(path = fs::path(
fs::path_home_r(),
"R",
"tw_data"
))
tw_create_cache_folder()
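Putting these pieces together, a typical session setup might look like the following sketch; the ask = FALSE argument skips the interactive confirmation when creating the folder, and the item Q180099 is the same identifier used in the examples further below:

library("tidywikidatar")

tw_enable_cache()
tw_set_cache_folder(path = fs::path(
  fs::path_home_r(),
  "R",
  "tw_data"
))
tw_create_cache_folder(ask = FALSE)

# the first call queries Wikidata; identical later calls read from the local cache
tw_get(id = "Q180099", language = "en")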
Caching with other database backends
Support for other database backends is now available. They can be accessed most easily using the following approach, having first ensured that the relevant driver and the odbc package have been installed:
tw_enable_cache(SQLite = FALSE)
tw_set_cache_db(
driver = "MySQL",
host = "localhost",
port = 3306,
database = "tidywikidatar",
user = "secret_username",
pwd = "secret_password"
)
# for testing, consider running a local database, e.g. with:
# docker run --name tidywikidatar_db -p 3306:3306 \
#   -e MYSQL_ROOT_PASSWORD=secret_root_password \
#   -e MYSQL_USER=secret_username \
#   -e MYSQL_PASSWORD=secret_password \
#   -e MYSQL_DATABASE=tidywikidatar \
#   mysql:latest
It is also possible to pass a connection generated with DBI::dbConnect() directly to each function.
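As a minimal sketch of this approach, assuming the cache_connection argument through which tidywikidatar functions accept an externally created connection, and using an illustrative SQLite file:

library("tidywikidatar")
library("DBI")

# an illustrative DBI connection; any supported backend should work the same way
con <- DBI::dbConnect(RSQLite::SQLite(), "tw_cache.sqlite")

tw_get(
  id = "Q180099",
  language = "en",
  cache_connection = con
)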
Name of tables in cached databases
Each database has a table for each language and type of content. For example, item information retrieved with tw_get(id = "Q180099", language = "en") will be stored in a table called tw_item_en.
The name of the table is unique and is generated by tw_get_cache_table_name(). For example:
tw_get_cache_table_name(type = "item", language = "en")
#> [1] "tw_item_en"
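To inspect which cache tables have actually been created locally, you can list them through the cache connection (the same tw_connect_to_cache() helper is used in the indexing example below; tw_disconnect_from_cache() then closes the connection):

library("tidywikidatar")
library("DBI")

db <- tw_connect_to_cache()
# list all tables currently present in the local cache
DBI::dbListTables(conn = db)
tw_disconnect_from_cache()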
Column types and indexing
Due to limited familiarity with different database backends and limited time for testing, the creation of database tables is left to the default values of DBI::dbWriteTable(). For occasional use, this should not be an issue. However, when the local cache reaches millions rather than merely thousands of rows, response times from a MySQL database can stretch to a few seconds, rather than the fraction of a second that would otherwise be expected. To deal with this, new functions for adding indexes to cache tables have been introduced: tw_check_cache_index(), tw_index_cache_item(), and tw_index_cache_search(). It is possible to apply these to all existing tables of a given type, as outlined below. This speeds up retrieval dramatically on MySQL databases; the impact on other types of databases has not been thoroughly tested.
db <- tw_connect_to_cache()
tables_v <- DBI::dbListTables(conn = db)
# for search cache tables
purrr::walk(
.x = tables_v[stringr::str_starts(string = tables_v, "tw_search_item")],
.f = function(x) {
tw_index_cache_search(table_name = x)
}
)
# for item cache tables
purrr::walk(
.x = tables_v[stringr::str_starts(string = tables_v, "tw_item")],
.f = function(x) {
tw_index_cache_item(table_name = x)
}
)
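After indexing, tw_check_cache_index() can be used to verify that an index is in place. The sketch below reuses the tables_v vector from the previous block and assumes that tw_check_cache_index() accepts the same table_name argument as its companion functions and returns a logical value; this is inferred from its name rather than confirmed here:

# check which item cache tables already have an index (assumed interface)
purrr::map_lgl(
  .x = tables_v[stringr::str_starts(string = tables_v, "tw_item")],
  .f = function(x) {
    tw_check_cache_index(table_name = x)
  }
)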