
This is an updated and better-extracted version of the wikimedia/Wikipedia dataset originally released in 2023. We carefully parsed Wikipedia HTML dumps from August 2025, covering 325 languages.
This dataset:
- keeps tables and math content (wikimedia/Wikipedia removes all tables and math content)
- adds a has_math metadata attribute

You can explore the dataset, compare it to wikimedia/Wikipedia and preview the live Wikipedia pages on our space.
| Subset | Name | Size | Pages |
|---|---|---|---|
| en | English | 35.1 GB | 6,614,655 |
| de | German | 13.1 GB | 2,713,646 |
| fr | French | 12.1 GB | 2,566,183 |
| ru | Russian | 10.7 GB | 1,817,813 |
| ja | Japanese | 9.9 GB | 1,354,269 |
| es | Spanish | 8.5 GB | 1,948,965 |
| it | Italian | 7.4 GB | 1,799,759 |
| uk | Ukrainian | 5.4 GB | 1,239,253 |
| zh | Chinese (written vernacular Chinese) | 5.1 GB | 1,295,955 |
| pl | Polish | 4.4 GB | 1,543,918 |
| ceb | Cebuano | 4.4 GB | 5,647,436 |
| pt | Portuguese | 4.3 GB | 1,135,383 |
| nl | Dutch | 3.5 GB | 2,072,865 |
| ca | Catalan | 3.5 GB | 962,290 |
| ar | Arabic | 3.4 GB | 1,230,456 |
| sv | Swedish | 2.9 GB | 2,470,063 |
| cs | Czech | 2.2 GB | 534,563 |
| fa | Persian | 2.2 GB | 1,021,336 |
| vi | Vietnamese | 2.1 GB | 1,279,087 |
| hu | Hungarian | 2.1 GB | 515,004 |
| ko | Korean | 2.0 GB | 582,035 |
| he | Hebrew | 2.0 GB | 372,053 |
| sr | Serbian | 2.0 GB | 664,345 |
| id | Indonesian | 1.8 GB | 723,099 |
| tr | Turkish | 1.6 GB | 629,762 |
| fi | Finnish | 1.5 GB | 572,900 |
| no | Norwegian (Bokmål) | 1.3 GB | 620,802 |
| el | Greek | 1.2 GB | 242,517 |
| hy | Armenian | 1.2 GB | 309,820 |
| ro | Romanian | 1.2 GB | 493,462 |
| ... | | | |
| Total | | 184.7 GB | 61,550,610 |
A detailed list is available here.
See the table above for the subset name of the language you want to download.
We currently do not provide smaller sample versions, but by setting limit or using streaming=True you can easily fetch a sample of the data. If there is interest from the community we might upload smaller sampled versions later on.
Using datatrove:
from datatrove.pipeline.readers import ParquetReader
# limit determines how many documents will be streamed (remove for all)
# this will fetch the Portuguese data
data_reader = ParquetReader("hf://datasets/HuggingFaceFW/finewiki/data/ptwiki", limit=1000)
for document in data_reader():
    # do something with document
    print(document)
###############################
# OR for a processing pipeline:
###############################
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.readers import ParquetReader
from datatrove.pipeline.filters import LambdaFilter
from datatrove.pipeline.writers import JsonlWriter
pipeline_exec = LocalPipelineExecutor(
    pipeline=[
        ParquetReader("hf://datasets/HuggingFaceFW/finewiki/data/ptwiki", limit=1000),
        LambdaFilter(lambda doc: "hugging" in doc.text),
        JsonlWriter("some-output-path")
    ],
    tasks=10
)
pipeline_exec.run()
Using huggingface_hub:
from huggingface_hub import snapshot_download
folder = snapshot_download(
    "HuggingFaceFW/finewiki",
    repo_type="dataset",
    local_dir="./finewiki/",
    # download the English subset
    allow_patterns=["data/enwiki/*"])
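After downloading, you can read the local parquet shards directly. A minimal sketch, assuming the shards end up directly under ./finewiki/data/enwiki/ as implied by the allow_patterns call above:

```python
from datasets import load_dataset

# Load the locally downloaded English shards (the path layout is an assumption
# based on the local_dir and allow_patterns used above).
ds = load_dataset("parquet", data_files="./finewiki/data/enwiki/*.parquet", split="train")
print(ds[0]["title"])
```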
Using datasets:
from datasets import load_dataset
# get Spanish data
fw = load_dataset("HuggingFaceFW/finewiki", name="eswiki", split="train", streaming=True)
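As a sketch of pulling a small sample while streaming and using the has_math flag described below (the subset name and sample size here are arbitrary choices):

```python
from itertools import islice
from datasets import load_dataset

# Stream the Spanish subset and keep only pages flagged as containing math.
fw = load_dataset("HuggingFaceFW/finewiki", name="eswiki", split="train", streaming=True)
math_pages = (page for page in fw if page["has_math"])

for page in islice(math_pages, 10):
    print(page["title"], page["url"])
```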
Example from the English subset (values truncated for readability):
{
"text": "# 10th Tank Corps\nThe 10th Tank Corps was a tank corps of the Red Army, formed twice.\n\n## First Formation\nIn May–June 1938, ...",
"id": "enwiki/32552979",
"wikiname": "enwiki",
"page_id": 32552979,
"title": "10th Tank Corps",
"url": "https://en.wikipedia.org/wiki/10th_Tank_Corps",
"date_modified": "2023-07-26T12:32:03Z",
"in_language": "en",
"wikidata_id": "Q12061605",
"bytes_html": 115017,
"wikitext": "{{short description|Tank corps of the Soviet military}}\n\n{{Infobox military unit...",
"version": 1167219203,
"infoboxes": "[{\"title\": \"10th Tank Corps\", \"data\": {\"Active\": \"...\"}}]",
"has_math": false
}
- text (string): cleaned, structured article text preserving headings, lists, code/pre blocks, tables and math. Has some markdown formatting (headings, tables, lists)
- id (string): dataset-unique identifier; typically <wikiname>/<page_id>
- wikiname (string): wiki project name, e.g., enwiki, ptwiki
- page_id (int): MediaWiki page identifier
- title (string): article title
- url (string): canonical article URL
- date_modified (string): ISO-8601 timestamp of the last page revision
- in_language (string): article language code (e.g., en, pt)
- wikidata_id (string|null): Wikidata QID associated with the page
- bytes_html (int): size in bytes of the original HTML body
- wikitext (string): original wikitext when available
- version (int|string): revision/version identifier of the page
- infoboxes (string): JSON-encoded array of extracted infobox objects with title and key-value data
- has_math (bool): whether math content was detected on the page

The full pipeline processing code is available here. It runs on datatrove. While we tried to offer robust support for most language variants of Wikipedia, the lack of standardization at the HTML level means that for some subsets the extraction might be sub-optimal. If this is the case for the languages you are interested in, we recommend adapting our code to address your specific concerns.
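Since infoboxes is stored as a JSON-encoded string, decode it before use. A small sketch, assuming pages without an infobox store an empty value:

```python
import json
from itertools import islice
from datasets import load_dataset

fw = load_dataset("HuggingFaceFW/finewiki", name="enwiki", split="train", streaming=True)
for page in islice(fw, 100):
    # pages without an infobox are assumed to store an empty string or "[]"
    for box in json.loads(page["infoboxes"] or "[]"):
        # each infobox has a "title" and a "data" dict of key-value pairs
        print(page["title"], "->", box["title"], list(box["data"].items())[:3])
```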
We used the Wikimedia Enterprise HTML dump API (https://api.enterprise.wikimedia.com/v2/snapshots) and downloaded main-namespace (NS0) snapshots for the different language versions of Wikipedia. We intentionally relied on pre-rendered HTML over the more commonly used wikitext/markdown dumps:
wikitext often encodes templates and formatting as parser functions/macros, which makes large sections of wiki pages harder to reconstruct faithfully, whereas the Enterprise HTML already expands those structures. Snapshots from August 2025 were used. We record rich per-page attributes (IDs, titles, URLs, language, version, timestamps, Wikidata IDs) as part of the metadata.
We heavily adapted mwparserfromhtml to parse the HTML content into a clean, structured text representation that preserves meaningful formatting. Redirect and disambiguation pages are removed reliably (via redirect markers in wikitext/HTML and disambiguation signals, including Wikidata IDs and page-props). Reference-like sections filled with non-article content (e.g., “References”, “Notes”, “External links”, localized per language) are excluded using a curated heading list and structural cues (reference list containers), so citations/notes are dropped without harming the main body. Visual/navigation boilerplate (ToC, navboxes, messageboxes, authority control, categories) is filtered out, while infoboxes are carefully extracted into the metadata as key-value structured data that can be useful for knowledge search applications. We additionally strive to keep math content (marking pages that contain it with the has_math flag) as well as tables, where much of Wikipedia's knowledge is contained.
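The actual extraction lives in our adapted mwparserfromhtml-based pipeline (linked above); purely as an illustration of the boilerplate-stripping idea, a rough BeautifulSoup sketch could look like this (the CSS class names are common in Wikipedia HTML but are illustrative, not the exact selectors we use):

```python
from bs4 import BeautifulSoup

def strip_boilerplate(html: str) -> str:
    """Drop navigation/reference boilerplate and return plain body text.

    Illustrative only: the real pipeline uses an adapted mwparserfromhtml parser
    with curated, per-language heading lists and structural cues.
    """
    soup = BeautifulSoup(html, "html.parser")
    # Navboxes, tables of contents, edit links, reference lists, category links.
    for selector in (".navbox", ".toc", ".mw-editsection", ".reflist", ".catlinks"):
        for node in soup.select(selector):
            node.decompose()
    return soup.get_text(separator="\n", strip=True)
```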
One common issue with low-resource language Wikipedias is the large prevalence of content from other languages, particularly English (often from articles or boilerplate pages copied over from the English Wikipedia). To ensure language quality and consistency, we apply language- and script-aware checks tailored to each wiki. Pages are kept only if their predicted writing system matches the expected scripts for that language. For non-English wikis, pages that are predominantly English above a confidence threshold are removed to reduce cross-language leakage. We also drop ultra-short pages without infoboxes to avoid low-signal content.
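The exact language-identification model and thresholds are part of the pipeline code; as a rough sketch of the script-consistency idea (a crude Unicode-name heuristic, not the classifier we actually use, with an illustrative threshold):

```python
import unicodedata

def script_ratio(text: str, expected: str) -> float:
    """Fraction of alphabetic characters whose Unicode name starts with `expected`
    (e.g. "CYRILLIC", "LATIN"). A crude stand-in for a real script classifier."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    hits = sum(1 for ch in letters if unicodedata.name(ch, "").startswith(expected))
    return hits / len(letters)

# Keep a ruwiki page only if most of its letters are Cyrillic.
sample = "Это статья на русском языке with an English quote."
print(script_ratio(sample, "CYRILLIC") > 0.5)  # True
```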
This dataset contains text from Wikipedia, licensed under Creative Commons Attribution-ShareAlike 4.0 (CC BY-SA 4.0) and also available under GFDL. See Wikipedia’s licensing and Terms of Use: https://dumps.wikimedia.org/legal.html
Our processed release is an adaptation of that text and is licensed under CC BY-SA 4.0.
@dataset{penedo2025finewiki,
  author    = {Guilherme Penedo},
  title     = {FineWiki},
  year      = {2025},
  publisher = {Hugging Face Datasets},
  url       = {https://huggingface.co/datasets/HuggingFaceFW/finewiki},
  urldate   = {2025-10-20},
  note      = {Source: Wikimedia Enterprise Snapshot API (https://api.enterprise.wikimedia.com/v2/snapshots). Text licensed under CC BY-SA 4.0 with attribution to Wikipedia contributors.}
}