bibnets reads bibliographic data from two kinds of
source. The first is the standard database exports — Scopus, Web of
Science, OpenAlex, Lens.org, Dimensions, Crossref, BibTeX, and RIS —
which it recognises and parses automatically; give it a single file,
several files, or a whole folder and it works out each format on its
own. The second is any custom table of your own: a CSV
or data frame that is not a known export, where you simply name the
columns that hold the authors, references, or keywords and
bibnets reads it in. Either way you
get the same structure, the bibnets format, that
every network builder (author_network(),
keyword_network(), reference_network(),
document_network(), source_network(),
country_network(), institution_network(),
conetwork()) works from. The bibnets format is a data frame
with one row per paper: most columns hold a single value (title, year,
journal), while the fields that can have many values per paper —
authors, references, keywords — hold a list in each row.
In full, the bibnets format has these columns:
| Column | Type | Meaning |
|---|---|---|
| id | chr | Document identifier (EID, OpenAlex W-ID, DOI, etc.) |
| title | chr | Document title |
| year | int | Publication year |
| journal | chr | Source / journal / venue name |
| doi | chr | DOI without the https://doi.org/ prefix |
| cited_by_count | int | Citations received (as reported by source) |
| abstract | chr | Abstract text; NA for sources that do not expose it |
| type | chr | Document type (article, review, book-chapter, …) |
| authors | list | Character vector of author names per row |
| references | list | Character vector of cited references per row |
| keywords | list | Character vector of keywords per row |

Some sources add extra columns (such as index_keywords,
keywords_plus, affiliations, or
countries); these are kept after the standard ones.
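
As a rough illustration of the schema above, the following base-R sketch reports columns whose types do not match the table. `check_bibnets_frame()` is a hypothetical helper written for this example, not a function exported by bibnets, and it skips `id` because the custom-frame examples in this vignette also use integer identifiers.

```r
# Hypothetical helper (not part of bibnets): list schema problems in a frame.
check_bibnets_frame <- function(df) {
  scalar_chr <- c("title", "journal", "doi", "abstract", "type")
  scalar_num <- c("year", "cited_by_count")
  list_cols  <- c("authors", "references", "keywords")
  problems <- character(0)

  for (col in intersect(scalar_chr, names(df))) {
    if (!is.character(df[[col]])) {
      problems <- c(problems, paste(col, "should be character"))
    }
  }
  for (col in intersect(scalar_num, names(df))) {
    if (!is.numeric(df[[col]])) {
      problems <- c(problems, paste(col, "should be integer"))
    }
  }
  for (col in intersect(list_cols, names(df))) {
    # A list-column must be a list whose elements are character vectors.
    ok <- is.list(df[[col]]) && all(vapply(df[[col]], is.character, logical(1)))
    if (!ok) {
      problems <- c(problems, paste(col, "should be a list of character vectors"))
    }
  }
  problems
}

df <- data.frame(id = c("p1", "p2"), year = c(2020L, 2021L),
                 stringsAsFactors = FALSE)
df$authors  <- list(c("ALICE", "BOB"), "CAROL")
df$keywords <- c("graph; network", "embedding")  # plain character, not a list

problems <- check_bibnets_frame(df)
problems
```

Only columns that are present are checked, so a frame that omits optional fields passes.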
This vignette documents the read_biblio() entry point
and each reader, the generic-CSV path, network construction directly
from custom columns and separators, the split_field()
helper, and the manual construction of a compatible data frame.
For CSV files that do not match any of the recognised signatures
(in-house exports, custom dumps, public datasets), map each source
column onto a standard field by name. The identifier
column is named via id; each multi-valued field is named
via its own argument — authors, keywords,
references, countries,
affiliations — and journal for the scalar
source/venue. sep is the delimiter applied inside those
cells. Naming any of these columns implies
format = "generic", so you do not need to pass
format yourself.
Hypothetical call:
data <- read_biblio(
  "my_data.csv",
  id = "doc_id",
  authors = "Authors",
  keywords = "Keywords",
  sep = ";"
)

The same call works on the bundled OpenAlex CSV, which uses |
as the delimiter. The source columns have long dotted names; mapping
them by argument yields the standard authors and
keywords list-columns:
f <- system.file("extdata", "openalex_works.csv", package = "bibnets")
generic <- read_biblio(
  f,
  id = "id",
  authors = "authorships.author.display_name",
  keywords = "primary_topic.display_name",
  sep = "|"
)
generic$authors[[1]]
#> [1] "Jakub Kužílek" "Martin Hlosta" "Zdeněk Zdráhal"
generic$keywords[[1]]
#> [1] "Online Learning and Analytics"

Each mapped column is split on sep and stored under its
standard name as a list-column; the original source column is left in
place. For any further columns that have no dedicated argument,
list_cols splits them in place (keeping their original
names).
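
The mapping step described above can be approximated in base R. This is a sketch only: `map_column()` is a hypothetical stand-in, not the package's code, and the two-row data frame is made up for the example.

```r
# Hypothetical stand-in for the mapping step; not the package's implementation.
map_column <- function(df, source_col, standard_name, sep) {
  # fixed = TRUE treats sep literally, so "|" is not read as a regex operator.
  parts <- strsplit(df[[source_col]], sep, fixed = TRUE)
  df[[standard_name]] <- lapply(parts, trimws)
  df  # the original source column is still present
}

raw <- data.frame(
  doc_id = c("d1", "d2"),
  `authorships.author.display_name` = c("Ann Lee|Bo Chen", "Ann Lee"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

mapped <- map_column(raw, "authorships.author.display_name", "authors", sep = "|")
mapped$authors[[1]]
names(mapped)
```

The result has both the new `authors` list-column and the untouched source column, which mirrors the behaviour the text describes.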
Often a dataset is already a plain data frame or CSV with its own
column names and its own delimiter — you do not need to coerce it into
the standard schema first. Every network builder accepts a column
argument named after the entity it builds (authors,
keywords, references, journal,
countries, affiliations) plus a
sep for splitting a delimited character column. The builder
splits, normalises, and builds in one call.
papers <- data.frame(
  id = 1:4,
  `Author Names` = c("Smith J, Doe A, Lee K", "Smith J, Lee K",
                     "Doe A, Lee K", "Smith J, Doe A"),
  Tags = c("ml, ai", "ml, nlp", "ai, nlp", "ml, ai"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)
# Point the builder at the column and give it the delimiter — no renaming.
author_network(papers, authors = "Author Names", sep = ",")
#> # bibnets network: author_collaboration | 3 nodes · 3 edges | counting: full
#> from to weight count
#> 1 DOE A LEE K 2 2
#> 2 DOE A SMITH J 2 2
#> 3 LEE K SMITH J 2 2
keyword_network(papers, keywords = "Tags", sep = ",")
#> # bibnets network: keyword_co_occurrence | 3 nodes · 3 edges | counting: full
#> from to weight count
#> 1 AI ML 2 2
#> 2 AI NLP 1 1
#> 3 ML NLP 1 1

The works dimension (the rows of the works x entities
matrix) is the id column. You do not have to supply one:
id = NULL (the default) uses an existing id
column when present and otherwise numbers the rows, treating each row as
one document. The example above does include an id column, but it would
work without one for that reason. To use a differently-named identifier column,
name it with the id argument:
papers2 <- data.frame(
  paper_id = c("P1", "P2", "P3"),
  authors = c("Alice, Bob", "Alice, Carol", "Bob, Carol"),
  stringsAsFactors = FALSE
)
author_network(papers2, authors = "authors", sep = ",", id = "paper_id")
#> # bibnets network: author_collaboration | 3 nodes · 3 edges | counting: full
#> from to weight count
#> 1 ALICE BOB 1 1
#> 2 ALICE CAROL 1 1
#> 3 BOB CAROL 1 1

Two entities are linked when they share the same id, so
the identifier controls what counts as “the same document” during
projection.
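
To make the role of the identifier concrete, here is a base-R sketch of the projection idea: a works x entities incidence matrix multiplied by its own transpose gives shared-document counts. It illustrates the concept only and is not how the package builds its sparse matrix.

```r
# Long table: one row per (document id, author) pair.
pairs <- data.frame(
  id     = c("P1", "P1", "P2", "P2", "P3", "P3"),
  author = c("ALICE", "BOB", "ALICE", "CAROL", "BOB", "CAROL"),
  stringsAsFactors = FALSE
)

# Works x entities incidence matrix: 1 when the author appears on the paper.
incidence <- unclass(table(pairs$id, pairs$author))

# Entities x entities projection: number of documents each pair shares.
co <- crossprod(incidence)
diag(co) <- 0  # drop self-links
co
```

Each pair of authors shares exactly one id here, so every off-diagonal entry is 1, matching the edge weights printed above.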
sep is any literal delimiter, so BibTeX-style
" and " or pipe-delimited exports work too:
bib <- data.frame(
  id = 1:3,
  creators = c("Alice and Bob", "Alice and Carol", "Bob and Carol"),
  stringsAsFactors = FALSE
)
author_network(bib, authors = "creators", sep = " and ")
#> # bibnets network: author_collaboration | 3 nodes · 3 edges | counting: full
#> from to weight count
#> 1 ALICE BOB 1 1
#> 2 ALICE CAROL 1 1
#> 3 BOB CAROL 1 1

In a coupling network the entity column and the
references column can use different delimiters. Reference
strings frequently contain internal commas
("Smith J, 2020, Journal"), so references is
split on ";" by default, independent of sep.
Override it with references_sep when your references use
another delimiter:
d <- data.frame(
  id = c("P1", "P2", "P3"),
  auth = c("Alice, Bob", "Alice, Carol", "Bob, Carol"),
  references = c("R1, R2", "R1, R3", "R2, R3"),
  stringsAsFactors = FALSE
)
author_network(d, "coupling", authors = "auth", sep = ",",
               references_sep = ",")
#> # bibnets network: author_coupling | 3 nodes · 3 edges | counting: full
#> from to weight count
#> 1 ALICE BOB 3 3
#> 2 ALICE CAROL 3 3
#> 3 BOB CAROL 3 3

Values exported with surrounding quotes ("Alice", or the
CSV doubled form ""Alice"") are cleaned automatically —
strip_quotes = TRUE is the default, so a quoted label and
its bare form collapse to the same node. Internal apostrophes
(e.g. O'Brien) are left untouched. Set
strip_quotes = FALSE to keep the quotes as part of the
label.
q <- data.frame(
  id = 1:3,
  authors = c('"Alice"; "Bob"', '"Alice"; "Carol"', '"Bob"; "Carol"'),
  stringsAsFactors = FALSE
)
author_network(q) # quotes stripped -> ALICE, BOB, CAROL
#> # bibnets network: author_collaboration | 3 nodes · 3 edges | counting: full
#> from to weight count
#> 1 ALICE BOB 1 1
#> 2 ALICE CAROL 1 1
#> 3 BOB CAROL 1 1

If you pass a sep that does not actually split the
column — for example the data is pipe-delimited but you left
sep = ";" — and the values contain a structural delimiter
(";", "|", or a tab), the builder warns you
instead of silently treating each whole cell as one entity:
bad <- data.frame(
  id = 1:3,
  authors = c("Smith J| Doe A", "Smith J| Lee K", "Doe A| Lee K"),
  stringsAsFactors = FALSE
)
invisible(author_network(bad)) # warns: values contain "|"
#> Warning: Splitting column 'authors' on sep = ";" produced no multi-entry rows,
#> but most values contain "|". If entries are separated by "|", pass that as sep.

The check is deliberately quiet for commas and " and ",
which appear inside perfectly valid single labels
("Last, First" names, one-reference-per-row citation
strings, organisations like "Smith and Sons").
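
A heuristic of this kind can be sketched in a few lines of base R. `suggest_sep()` is hypothetical and written only for illustration; the package's own check may differ, and the 50% threshold is an assumption standing in for "most values".

```r
# Hypothetical heuristic; not the package's implementation.
suggest_sep <- function(x, sep, candidates = c(";", "|", "\t")) {
  n_parts <- lengths(strsplit(x, sep, fixed = TRUE))
  if (any(n_parts > 1)) return(NULL)        # sep does split at least one cell
  for (cand in setdiff(candidates, sep)) {
    share <- mean(grepl(cand, x, fixed = TRUE))
    if (share > 0.5) return(cand)           # assumed cut-off for "most values"
  }
  NULL  # commas and " and " are not candidates, so they never trigger
}

piped  <- suggest_sep(c("Smith J| Doe A", "Smith J| Lee K", "Doe A| Lee K"), sep = ";")
labels <- suggest_sep(c("Last, First", "Smith and Sons"), sep = ";")
piped
labels
```

Restricting the candidates to structural delimiters is what keeps the check quiet for valid single labels.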
read_biblio()

read_biblio() accepts a single file, a vector of file
paths, or a directory. When format = "auto" (the default)
it detects the format from the contents of the file:
data <- read_biblio("export.csv") # auto-detect format
data <- read_biblio("scopus_dir/") # entire directory, rbind'd
data <- read_biblio(c("a.csv", "b.csv")) # multiple files, rbind'd
data <- read_biblio("file.csv", format = "scopus") # force a format

When given a directory, read_biblio() collects every
.csv, .txt, .bib,
.ris, .xls, and .xlsx file in it,
reads each one, and combines the results with rbind(). For
more than one file a summary message is emitted:
Read 3 files: 1247 rows total
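Format detection is performed on the first non-empty line of the file:

- A leading @ indicates BibTeX.
- A leading TY - indicates RIS.
- A leading FN or PT indicates Web of Science plaintext.
- Otherwise the line is treated as a CSV header. When line 1 is a
  Dimensions preamble ("About the data: ..."),
  line 2 is used instead. Header tokens determine the format:
  eid for Scopus, lens id for Lens.org,
  publication id or dimensions url for
  Dimensions, authorships.author.display_name for the
  OpenAlex flat CSV.

If detection fails, read_biblio() raises an error that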
lists the supported formats and indicates how to pass
format explicitly or name the entity columns
(authors, keywords, …), which reads the file
as a generic CSV.
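
The signature rules above can be written out as a small base-R function. This is a sketch under assumptions: `detect_format_sketch()` is hypothetical, the returned labels other than "scopus" are illustrative names, and the exact matching the package performs is not shown in this vignette.

```r
# Hypothetical sketch of signature-based detection; not the package's internals.
detect_format_sketch <- function(lines) {
  lines <- lines[nzchar(trimws(lines))]
  if (length(lines) == 0) stop("no non-empty lines")
  first <- lines[1]

  if (startsWith(first, "@")) return("bibtex")
  if (grepl("^TY\\s+-", first)) return("ris")
  if (grepl("^(FN|PT)\\s", first)) return("wos")

  # Dimensions preamble: use the next line as the CSV header instead.
  if (grepl("About the data", first, fixed = TRUE) && length(lines) > 1) {
    first <- lines[2]
  }
  header <- tolower(first)

  if (grepl("lens id", header, fixed = TRUE)) return("lens")
  if (grepl("publication id", header, fixed = TRUE) ||
      grepl("dimensions url", header, fixed = TRUE)) return("dimensions")
  if (grepl("authorships.author.display_name", header, fixed = TRUE)) return("openalex")
  if (grepl("\\beid\\b", header)) return("scopus")
  stop("Could not detect file format; pass `format` explicitly")
}

detect_format_sketch('"Authors","Title","Year","EID"')
detect_format_sketch(c('"About the data: ..."', "Rank,Publication ID,DOI"))
```

Checking the tagged formats before the CSV header tokens avoids misreading a tagged record as a header line.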
Two readers are not dispatched by read_biblio():

- read_openalex() accepts an in-memory tibble from
  openalexR::oa_fetch(), not a file path.
- read_crossref() accepts the data element
  of rcrossref::cr_works().

Both take R objects rather than files and are called directly.

read_scopus() ingests the standard Scopus CSV export
(File -> Export -> CSV from the Scopus search UI).
Mappings from Scopus columns to the bibnets schema:
| Scopus column | Standard column |
|---|---|
| EID (or Article No.) | id |
| Title | title |
| Year | year |
| Source title | journal |
| DOI | doi (prefix stripped) |
| Cited by | cited_by_count |
| Abstract | abstract |
| Document Type | type |
| Authors (;-delimited) | authors (list) |
| References (;-delimited) | references (list) |
| Author Keywords (;-delimited) | keywords (list) |
| Index Keywords (;-delimited) | index_keywords (list, extra) |
| Affiliations (;-delimited) | affiliations (list, extra) |
| Language of Original Document | language (extra) |

Scopus stores each cited reference as one semicolon-delimited string
in a single cell. read_scopus() splits on ;
and applies standardize_refs() to each entry: uppercasing,
whitespace normalisation, and removal of a trailing DOI where present.
References differing only in case or trailing DOI then resolve to the
same node in co-citation and reference networks.
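
The three normalisation steps named above can be illustrated in base R. `standardize_ref_sketch()` is a hypothetical re-implementation written for this example; in particular, the regular expression for a trailing DOI is an assumption and may not match what standardize_refs() actually removes.

```r
# Hypothetical re-implementation of the three steps; not the package's code.
standardize_ref_sketch <- function(x) {
  x <- toupper(x)                       # 1. uppercase
  x <- gsub("\\s+", " ", trimws(x))     # 2. collapse and trim whitespace
  # 3. Assumed trailing-DOI pattern: optional "DOI" label or doi.org URL,
  #    then "10.<registrant>/<suffix>" running to the end of the string.
  x <- sub("[,; ]*(DOI[: ]*)?(HTTPS?://(DX\\.)?DOI\\.ORG/)?10\\.[0-9]{4,9}/\\S+$",
           "", x, perl = TRUE)
  trimws(x)
}

refs <- c(
  "Smith J.,  Graph methods, (2020)",
  "SMITH J., GRAPH METHODS, (2020), doi: 10.1000/xyz123"
)
cleaned <- standardize_ref_sketch(refs)
cleaned
```

Both inputs reduce to the same string, which is why they would resolve to a single node.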
WoS exports come in two shapes:
wos1 <- read_wos("savedrecs.txt") # plaintext (default)
wos2 <- read_wos("savedrecs.tsv", format = "tab") # tab-delimited

The plaintext format is a tagged record syntax. Each record begins
with a PT (publication type) tag and ends with
ER (end record). Within the record, every field is
introduced by a 2-letter tag at the start of a line, with continuation
lines indented:
| Tag | Field |
|---|---|
| AU | Authors (one per line) |
| TI | Title |
| SO | Source / journal |
| PY | Year |
| DI | DOI |
| TC | Times cited |
| AB | Abstract |
| DT | Document type |
| DE | Author keywords |
| ID | Keywords plus (extra: keywords_plus) |
| CR | Cited references (one per line) |

read_wos() walks the file, splitting on ER
boundaries, and emits one row per record. The tab-delimited variant
carries the same fields in a flat CSV-like grid. Either way the output
schema is identical.
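
The tagged layout is simple enough to parse with a short loop. The function below is a minimal sketch, assuming well-formed input with only the tags used in the toy record; `parse_tagged_sketch()` is hypothetical and real exports contain more tags and edge cases than it handles.

```r
# Hypothetical minimal parser for the tagged layout; not read_wos() itself.
parse_tagged_sketch <- function(lines) {
  records <- list()
  current <- list()
  tag <- NULL
  for (line in lines) {
    if (!nzchar(trimws(line))) next
    head2 <- substr(line, 1, 2)
    if (head2 == "ER") {                          # end of record: store, reset
      records[[length(records) + 1]] <- current
      current <- list()
      tag <- NULL
    } else if (grepl("^[A-Z][A-Z0-9]$", head2)) { # a new two-letter field tag
      tag <- head2
      current[[tag]] <- c(current[[tag]], trimws(substring(line, 3)))
    } else if (!is.null(tag)) {                   # indented continuation line
      current[[tag]] <- c(current[[tag]], trimws(line))
    }
  }
  records
}

lines <- c(
  "PT J", "AU Smith, J", "   Doe, A",
  "TI Graph methods for", "   bibliometrics", "PY 2020", "ER",
  "",
  "PT J", "AU Lee, K", "PY 2021", "ER"
)
recs <- parse_tagged_sketch(lines)

# One row per record; multi-line fields become list-column entries.
wos_df <- data.frame(
  title = vapply(recs, function(r) paste(r$TI, collapse = " "), character(1)),
  year  = vapply(recs, function(r) as.integer(r$PY), integer(1)),
  stringsAsFactors = FALSE
)
wos_df$authors <- lapply(recs, function(r) r$AU)
wos_df$authors[[1]]
```

Continuation lines are appended to the most recent tag, which is how one-per-line fields such as AU end up as a character vector.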
The Dimensions CSV begins with a metadata row of the form
"About the data: This export was generated on YYYY-MM-DD ..."
before the column header. read_dimensions() detects this
preamble and skips it. If the line has been removed (for example, by
manual editing of the file), the reader still works because it
identifies the header row by the Dimensions header tokens
Publication ID and Dimensions URL.
Extras returned: affiliations and countries
as list-columns, analogous to the OpenAlex schema.
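
Locating the header by token rather than by position is what makes the reader robust to a missing preamble. The sketch below shows the idea in base R; `find_header_row()` and `read_dim_sketch()` are hypothetical, and the toy columns other than Publication ID are made up for the example.

```r
# Hypothetical sketch: find the header row by token rather than by position.
find_header_row <- function(lines) {
  hits <- grep("Publication ID|Dimensions URL", lines)
  if (length(hits) == 0) stop("no Dimensions header found")
  hits[1]
}

read_dim_sketch <- function(lines) {
  start <- find_header_row(lines)
  read.csv(text = paste(lines[start:length(lines)], collapse = "\n"),
           check.names = FALSE, stringsAsFactors = FALSE)
}

with_preamble <- c(
  '"About the data: This export was generated on ..."',
  "Rank,Publication ID,DOI,Title",
  "1,pub.1,10.1000/a,Paper A"
)
without_preamble <- with_preamble[-1]

a <- read_dim_sketch(with_preamble)
b <- read_dim_sketch(without_preamble)
identical(a, b)
```

Both inputs yield the same data frame, whether or not the preamble line is present.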
Key Lens columns and how they map:
| Lens column | Standard column |
|---|---|
| Lens ID | id |
| Title | title |
| Publication Year | year |
| Source Title | journal |
| DOI | doi |
| Cited by Count | cited_by_count |
| Abstract | abstract |
| Publication Type | type |
| Author/s | authors (list) |
| Reference Identifiers | references (list) |
| Keywords | keywords (list) |

read_bibtex() parses
@type{key, field = {value}, ...} blocks.
read_ris() parses tagged TY - ... ER -
blocks; the structure is equivalent to WoS plaintext, but with a
different tag dictionary.
Standard BibTeX and RIS do not contain cited-reference data, so the
references column in the resulting data frame is empty on
every row. These formats are sufficient for co-authorship and keyword
co-occurrence networks. For co-citation, coupling, or direct citation
networks, the appropriate sources are Scopus, Web of Science, OpenAlex
(via oa_fetch()), Dimensions, Lens, or Crossref.
library(rcrossref)
raw <- cr_works(query = "graph neural networks", limit = 100)
data <- read_crossref(raw$data)

read_crossref() accepts the data element of
the cr_works() result (a data frame, not the wrapping
list). The function handles the two field-naming variants Crossref
returns (container.title vs container-title;
is.referenced.by.count vs
is-referenced-by-count) and maps both to the standard
schema.
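
Handling two naming variants comes down to picking whichever candidate column exists. The sketch below shows one way to do that in base R; `first_present()` and `to_standard()` are hypothetical helpers, and the toy values (including counts arriving as character) are assumptions for the example.

```r
# Hypothetical helper: return the first candidate column that exists.
first_present <- function(df, candidates) {
  hit <- intersect(candidates, names(df))
  if (length(hit) == 0) return(rep(NA, nrow(df)))
  df[[hit[1]]]
}

to_standard <- function(df) {
  data.frame(
    journal = first_present(df, c("container.title", "container-title")),
    cited_by_count = as.integer(
      first_present(df, c("is.referenced.by.count", "is-referenced-by-count"))
    ),
    stringsAsFactors = FALSE
  )
}

raw_a <- data.frame(container.title = "Journal A", is.referenced.by.count = "12",
                    stringsAsFactors = FALSE)
raw_b <- data.frame(`container-title` = "Journal B", `is-referenced-by-count` = "7",
                    check.names = FALSE, stringsAsFactors = FALSE)

standardised <- rbind(to_standard(raw_a), to_standard(raw_b))
standardised
```

After the mapping, frames from either variant share column names and can be bound together.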
OpenAlex ships data through two routes that bibnets supports separately.
The package includes a 30-row OpenAlex flat CSV at
inst/extdata/openalex_works.csv, corresponding to the
export produced by downloading “Works” results from the OpenAlex web
interface. Multi-valued fields use | as the delimiter.
f <- system.file("extdata", "openalex_works.csv", package = "bibnets")
oa <- read_openalex_csv(f)
str(oa, max.level = 1)
#> 'data.frame': 30 obs. of 13 variables:
#> $ id : chr "W2769342982" "W2264893711" "W2612059685" "W3118164373" ...
#> $ title : chr "Open University Learning Analytics dataset" "Educational Data Mining and Learning Analytics in Programming" "Predicting Student Performance using Advanced Learning Analytics" "Predicting Student Performance Using Data Mining and Learning Analytics Techniques: A Systematic Literature Review" ...
#> $ year : int 2017 2015 2017 2020 2022 2016 2020 2024 2016 2020 ...
#> $ journal : chr "Scientific Data" "" "" "Applied Sciences" ...
#> $ doi : chr "10.1038/sdata.2017.171" "10.1145/2858796.2858798" "10.1145/3041021.3054164" "10.3390/app11010237" ...
#> $ cited_by_count: int 432 312 235 417 247 163 122 133 131 177 ...
#> $ abstract : chr NA NA NA NA ...
#> $ type : chr "article" "article" "article" "article" ...
#> $ authors :List of 30
#> $ references :List of 30
#> $ keywords :List of 30
#> $ affiliations :List of 30
#> $ countries :List of 30
head(oa[, c("id", "title", "year", "journal", "type")], 5)
#> id
#> 1 W2769342982
#> 2 W2264893711
#> 3 W2612059685
#> 4 W3118164373
#> 5 W4300484403
#> title
#> 1 Open University Learning Analytics dataset
#> 2 Educational Data Mining and Learning Analytics in Programming
#> 3 Predicting Student Performance using Advanced Learning Analytics
#> 4 Predicting Student Performance Using Data Mining and Learning Analytics Techniques: A Systematic Literature Review
#> 5 Artificial Intelligence and Learning Analytics in Teacher Education: A Systematic Review
#> year journal type
#> 1 2017 Scientific Data article
#> 2 2015 article
#> 3 2017 article
#> 4 2020 Applied Sciences article
#> 5 2022 Education Sciences review

The list-columns:
oa$authors[[1]]
#> [1] "Jakub Kužílek" "Martin Hlosta" "Zdeněk Zdráhal"
oa$affiliations[[1]]
#> [1] "The Open University"
#> [2] "Czech Technical University in Prague"
#> [3] "The Open University"
#> [4] "The Open University"
#> [5] "Czech Technical University in Prague"
oa$countries[[1]]
#> [1] "CZ" "GB" "GB" "CZ" "GB"

References and abstracts are absent from the OpenAlex flat export:
references is empty and abstract is
NA because the web download does not include those fields.
Use OpenAlex via openalexR::oa_fetch() and
read_openalex() when you need cited references or
abstracts.
The remaining fields support several network constructions that do not require references: co-authorship, country, institution, keyword, source, and document networks.
openalexR

This path is used when references and abstracts are required.
openalexR::oa_fetch() returns a nested tibble with
author, referenced_works,
concepts, and keywords list-columns;
read_openalex() converts it to the standard schema:
library(openalexR)
raw <- oa_fetch(entity = "works", search = "learning analytics", per_page = 200)
data <- read_openalex(raw)

References are returned as OpenAlex Work IDs
(e.g. W2769342982) rather than formatted citation strings.
The IDs are stable identifiers suitable for co-citation and
direct-citation networks; visualisations that need human-readable labels
can join the IDs back to titles in a separate step.
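
The join back to titles can be done with `match()` on an id-to-title lookup. The sketch below uses toy data: the edge list, the short IDs, and `label_for()` are all made up for illustration and do not reflect the exact shape of a bibnets network object.

```r
# Toy inputs: an edge list keyed by Work ID and a lookup of id -> title.
edges <- data.frame(from = c("W1", "W1"), to = c("W2", "W3"),
                    weight = c(2, 1), stringsAsFactors = FALSE)
works <- data.frame(id = c("W1", "W2"),
                    title = c("Paper one", "Paper two"),
                    stringsAsFactors = FALSE)

# Hypothetical helper: look up a title, falling back to the raw ID.
label_for <- function(ids, lookup) {
  titles <- lookup$title[match(ids, lookup$id)]
  ifelse(is.na(titles), ids, titles)
}

edges$from_label <- label_for(edges$from, works)
edges$to_label   <- label_for(edges$to, works)
edges
```

The fallback matters because cited works are often outside the fetched set and so have no title in the lookup.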
When data does not come from any of the supported sources, a bibnets-compatible data frame can be constructed directly. The requirement is: standard scalar columns are character or integer; multi-valued fields are list-columns whose elements are character vectors.
df <- data.frame(
  id = c("p1", "p2", "p3"),
  title = c("Paper A", "Paper B", "Paper C"),
  year = c(2020L, 2021L, 2022L),
  stringsAsFactors = FALSE
)
df$authors <- list(
  c("ALICE", "BOB"),
  c("BOB", "CAROL"),
  c("ALICE", "CAROL", "DAVE")
)
df$references <- list(
  c("R1", "R2"),
  c("R1", "R3"),
  c("R2", "R3", "R4")
)
df$keywords <- list(
  c("graph", "network"),
  c("network", "embedding"),
  c("graph", "embedding", "neural")
)
author_network(df, "collaboration")
#> # bibnets network: author_collaboration | 4 nodes · 5 edges | counting: full
#> from to weight count
#> 1 ALICE BOB 1 1
#> 2 ALICE CAROL 1 1
#> 3 BOB CAROL 1 1
#> 4 ALICE DAVE 1 1
#> 5 CAROL DAVE 1 1
keyword_network(df)
#> # bibnets network: keyword_co_occurrence | 4 nodes · 5 edges | counting: full
#> from to weight count
#> 1 EMBEDDING GRAPH 1 1
#> 2 EMBEDDING NETWORK 1 1
#> 3 GRAPH NETWORK 1 1
#> 4 EMBEDDING NEURAL 1 1
#> 5 GRAPH NEURAL 1 1
reference_network(df)
#> # bibnets network: reference_co_citation | 4 nodes · 5 edges | counting: full
#> from to weight count
#> 1 R1 R2 1 1
#> 2 R1 R3 1 1
#> 3 R2 R3 1 1
#> 4 R2 R4 1 1
#> 5 R3 R4 1 1

build_bipartite() applies
toupper(trimws(...)) to every entity label before
constructing the sparse matrix, so "graph",
"Graph", and "GRAPH" are mapped to the same
node "GRAPH". Tests or comparisons that reference node
names should use uppercase strings.
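
The effect of that normalisation is easy to reproduce in base R, which can help when preparing expected node names for a test. The toy labels and the `kw` list below are made up for the example.

```r
labels <- c("graph", " Graph", "GRAPH ", "network")

# Same transformation the text describes: trim, then uppercase.
normalised <- toupper(trimws(labels))
unique(normalised)

# Applied to a list-column of a manually built frame:
kw <- list(c("graph", "Network"), c(" network", "embedding"))
kw_norm <- lapply(kw, function(k) unique(toupper(trimws(k))))
kw_norm[[2]]
```

Normalising a manual frame up front means the labels you see in the data are the labels that appear as nodes.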
split_field() helper

split_field() converts a character column with
semicolon-delimited (or otherwise delimited) values into a list-column
without going through read_biblio(format = "generic"):
split_field(c("Alice; Bob; Carol", "Dave; Eve"))
#> [[1]]
#> [1] "Alice" "Bob" "Carol"
#>
#> [[2]]
#> [1] "Dave" "Eve"
split_field(c("a|b|c", "d|e"), sep = "|")
#> [[1]]
#> [1] "a" "b" "c"
#>
#> [[2]]
#> [1] "d" "e"

This is the same operation that read_scopus() and the
other readers apply internally to multi-valued columns; it is exported
for use in custom pipelines.
Different readers expose different extras: WoS provides
keywords_plus, Scopus provides index_keywords,
OpenAlex provides countries. To combine sources, restrict
each frame to the standard columns and bind:
common <- c("id", "title", "year", "journal", "doi", "cited_by_count",
"abstract", "type", "authors", "references", "keywords")
data(biblio_data)
b1 <- biblio_data
b2 <- biblio_data
b2$id <- paste0(b2$id, "_dup")
cols <- intersect(common, names(b1))
combined <- rbind(b1[, cols], b2[, cols])
nrow(combined)
#> [1] 20

A practical note: source-specific extras (such as keywords_plus) should
be retained on the per-source frame and merged selectively rather than
coerced into the combined frame.

After reading, basic checks on the list-column sizes and the scalar columns help detect silent corruption. Empty list-columns and out-of-range years are common indicators that an export is incomplete.
data(scopus_quantum_cloud)
sc <- scopus_quantum_cloud
range(lengths(sc$authors))
#> [1] 0 40
range(lengths(sc$references))
#> [1] 0 245
range(lengths(sc$keywords))
#> [1] 0 20
head(sort(table(sc$journal), decreasing = TRUE), 5)
#>
#> IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
#> 24
#> IEEE Transactions on Circuits and Systems I: Regular Papers
#> 20
#> IEEE Access
#> 18
#> IEEE Transactions on Very Large Scale Integration (VLSI) Systems
#> 14
#> IEEE Internet of Things Journal
#> 12
range(sc$year, na.rm = TRUE)
#> [1] 2020 2025
table(sc$type)
#>
#> Article Book Book chapter Conference paper
#> 279 1 15 191
#> Conference review Review
#> 3 10

Indicators to check:

- lengths() of 0 on every row of
  references for a Scopus or WoS file indicates that the
  export did not include the references column. Re-export from the source
  with the references field selected.
- A value of 0 or NA indicates an empty
  source field.
- A single document type (for example "article")
  is expected for filtered searches; broader mixes are expected for
  thematic searches.

| Symptom | Cause | Fix |
|---|---|---|
| Could not detect file format | First line doesn’t match any signature | Pass format = "scopus" (etc.) explicitly, or name the entity columns (authors, keywords, …) to read it as a generic CSV |
| Empty references list on every row | BibTeX/RIS or OpenAlex flat CSV; these don’t carry citations | Use Scopus/WoS, OpenAlex via oa_fetch(), Dimensions, Lens, or Crossref |
| Invalid multibyte string on read | Wrong encoding | Most readers accept encoding = "latin1"; pass it through read_biblio(..., encoding = "latin1") |
| Author names look like LASTNAME, F.J. not FJ LASTNAME | Default is flip_names = FALSE | The reader returns names as-is from the source. Cluster them by string match downstream, or pass flip_names = TRUE if all names follow Last, First |
| Dimensions file silently fails | “About the data” preamble removed and column header edited | read_dimensions() detects the standard preamble and falls back to header-token detection; the failure mode requires the column header itself to have been edited |
| Co-authorship network contains duplicate nodes (e.g. "Alice" and "ALICE") | Mixed casing in the source | The standard readers and build_bipartite() apply toupper(trimws(...)) to entity labels. Manually constructed frames should apply the same normalisation |

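
For the "Invalid multibyte string" symptom, base R can show what goes wrong and what re-encoding does. This sketch works on a single toy string and uses only `validUTF8()` and `iconv()`; it does not show how the readers handle the encoding argument internally.

```r
# A Latin-1 byte (0xE9, "e" with acute accent) is not valid UTF-8 on its own.
x <- "Caf\xe9"
validUTF8(x)

# Declaring the source encoding converts the bytes to valid UTF-8.
fixed <- iconv(x, from = "latin1", to = "UTF-8")
validUTF8(fixed)
fixed
```

If `validUTF8()` returns FALSE for values read from an export, the file was probably written in a legacy encoding and should be read with that encoding declared.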
- vignette("bibnets") covers
  network construction on the in-package datasets.
- vignette("parsing-author-names") covers
  parse_names() for normalising author labels before a
  network is built.
- All network builders (author_network(),
  keyword_network(), reference_network(),
  document_network(), source_network(),
  country_network(), institution_network(),
  conetwork()) accept the same core arguments
  (type, counting, similarity,
  threshold, top_n, format) plus
  the custom-column arguments shown above (id, the entity
  column, sep, references_sep,
  strip_quotes), so switching between network types on data
  already in the standard schema requires only a function-name
  change.