textpress

A lightweight toolkit for text retrieval and NLP with a consistent API: Fetch, Read, Process, and Search. Functions cover the full pipeline from web data to text processing and indexing, and offer multiple search strategies: regex, BM25, cosine similarity, and dictionary matching. All functions follow a verb_noun naming scheme, are pipe-friendly, avoid heavy dependencies, and return plain data frames.


Installation

From CRAN:

install.packages("textpress")

Development version:

remotes::install_github("jaytimm/textpress")

The textpress API map

1. Data acquisition (fetch_*)

These functions talk to the outside world to find locations of information. They return URLs or metadata, not full text.
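A minimal sketch of the acquisition step, assuming the fetch_urls() signature used in the quick start below:

```r
library(textpress)

# Search the web for a topic; returns a data frame of result
# locations (URLs plus metadata), not the page text itself.
links <- fetch_urls("R high performance computing")
head(links$url)
```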

2. Ingestion (read_*)

Once you have locations, bring the data into R.

3. Processing (nlp_*)

Prepare raw text for analysis or indexing. Designed to be used with the pipe |>.
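The processing stage chains naturally with the base pipe; a sketch using the functions from the quick start, assuming a corpus data frame with a doc_id column:

```r
library(textpress)

# From raw text to a searchable BM25 index in one pipeline.
index <- corpus |>
  nlp_tokenize_text(by = "doc_id", include_spans = FALSE) |>
  nlp_index_tokens()
```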

4. Retrieval (search_*)

Four ways to query your data. Subject-first: first argument is the data (corpus, index, or embeddings); the second is the query/needle. Pipe-friendly.

Function                              Primary input (needle)        Use case
search_regex(corpus, query, …)        Character (pattern)           Specific strings/patterns, KWIC.
search_dict(corpus, terms, …)         Character (vector of terms)   Exact phrases/MWEs; no partial-match risk.
search_index(index, query, …)         Character (keywords)          BM25 ranked retrieval.
search_vector(embeddings, query, …)   Numeric (vector/matrix)       Semantic neighbors.

search_dict is the exact n-gram matcher: pass a character vector of terms and get a table of where each term appears. It is optimized for high-speed extraction of thousands of specific terms (MWEs) across large corpora. Categories can be attached afterwards with a left_join on the matched term.
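One way to attach categories after matching; a sketch assuming search_dict() returns the matched term in a column named term (the actual column name may differ):

```r
library(textpress)
library(dplyr)

# Hypothetical lexicon: terms plus a category label.
lexicon <- data.frame(
  term     = c("OpenMP", "Socket"),
  category = c("shared-memory", "cluster")
)

# Match terms, then join the categories onto the hit table.
hits <- search_dict(corpus, terms = lexicon$term, by = "doc_id")
hits <- left_join(hits, lexicon, by = "term")
```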

Quick start (all four stages):

library(textpress)
links  <- fetch_urls("R high performance computing")
corpus <- read_urls(links$url)
corpus$doc_id <- seq_len(nrow(corpus))
toks   <- nlp_tokenize_text(corpus, by = "doc_id", include_spans = FALSE)
index  <- nlp_index_tokens(toks)
search_regex(corpus, "parallel|future", by = "doc_id")
search_dict(corpus, terms = c("OpenMP", "Socket"), by = "doc_id")
search_index(index, "distributed computing")
# search_vector(embeddings, query)  # use util_fetch_embeddings() for embeddings

Extension: Using textpress with LLMs & agents

While textpress is a general-purpose text toolkit, its design fits LLM-based workflows (e.g. RAG) and autonomous agents.

Lightweight RAG (retrieval-augmented generation)
You can build a local-first RAG pipeline without a heavy vector database: ingest a corpus, chunk and index it with the nlp_* functions, then retrieve relevant passages with search_index() (BM25) or search_vector() (embeddings) and pass them to an LLM as context.
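A sketch of such a pipeline using the functions from the API map above; the hits$text column used at the end is an assumption about the shape of the result table:

```r
library(textpress)

# 1. Acquire and ingest
links  <- fetch_urls("retrieval augmented generation")
corpus <- read_urls(links$url)
corpus$doc_id <- seq_len(nrow(corpus))

# 2. Process and index
index <- corpus |>
  nlp_tokenize_text(by = "doc_id", include_spans = FALSE) |>
  nlp_index_tokens()

# 3. Retrieve passages relevant to a user question
question <- "How does retrieval reduce hallucination?"
hits <- search_index(index, question)

# 4. Paste the top hits into an LLM prompt as context
context <- paste(head(hits$text, 5), collapse = "\n\n")
```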

Tool-use for autonomous agents
If you are building an agent in R, textpress functions work well as tools: flat verb_noun naming and predictable data-frame outputs make them easy for a model to call.
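A sketch of exposing a search function as an agent tool; the wrapper and its argument names are illustrative, not part of textpress:

```r
library(textpress)

# A thin wrapper an agent framework can register as a tool:
# plain arguments in, a plain data frame out.
tool_search_corpus <- function(query, top_n = 5) {
  hits <- search_index(index, query)  # `index` built beforehand
  head(hits, top_n)
}
```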


License

MIT © Jason Timm, MA, PhD

Citation

If you use this package in your research, please cite:

citation("textpress")

Issues

Report bugs or request features at https://github.com/jaytimm/textpress/issues

Contributing

Contributions welcome! Please open an issue or submit a pull request.