FineWiki

This is an updated and better-extracted version of the wikimedia/Wikipedia dataset originally released in 2023. We carefully parsed Wikipedia HTML dumps from August 2025, covering 325 languages.

This dataset:

  • fully renders templates, since it was extracted from HTML rather than from wikitext/markdown dumps
  • removes redirects, disambiguation pages, and other non-main-article pages
  • includes detailed metadata such as page ID, title, last modified date, Wikidata ID, version, and the original wikitext of the page
  • preserves elements and formatting such as headings, lists, code/pre blocks, tables and math content
  • notably, wikimedia/Wikipedia removes all tables and math content
  • excludes most of the "References", "See also", "Notes", "External links", and similar citations/notes sections across all languages
  • besides keeping all math content, pages containing math are flagged with a has_math metadata attribute
  • extracts infoboxes (the high-level summary boxes shown on the right of some Wikipedia pages) into a structured format in the metadata, for RAG and other uses
  • only keeps pages whose script (writing alphabet) matches the expected list for that language
  • for non-English wikis, any page fully or mostly in English is removed (a common issue when training language identifiers/classifiers)

Visualize and Compare

You can explore the dataset, compare it to wikimedia/Wikipedia and preview the live Wikipedia pages on our space.

Available subsets

| Subset | Name | Size | Pages |
|--------|------|------|-------|
| en | English | 35.1 GB | 6,614,655 |
| de | German | 13.1 GB | 2,713,646 |
| fr | French | 12.1 GB | 2,566,183 |
| ru | Russian | 10.7 GB | 1,817,813 |
| ja | Japanese | 9.9 GB | 1,354,269 |
| es | Spanish | 8.5 GB | 1,948,965 |
| it | Italian | 7.4 GB | 1,799,759 |
| uk | Ukrainian | 5.4 GB | 1,239,253 |
| zh | Chinese (written vernacular Chinese) | 5.1 GB | 1,295,955 |
| pl | Polish | 4.4 GB | 1,543,918 |
| ceb | Cebuano | 4.4 GB | 5,647,436 |
| pt | Portuguese | 4.3 GB | 1,135,383 |
| nl | Dutch | 3.5 GB | 2,072,865 |
| ca | Catalan | 3.5 GB | 962,290 |
| ar | Arabic | 3.4 GB | 1,230,456 |
| sv | Swedish | 2.9 GB | 2,470,063 |
| cs | Czech | 2.2 GB | 534,563 |
| fa | Persian | 2.2 GB | 1,021,336 |
| vi | Vietnamese | 2.1 GB | 1,279,087 |
| hu | Hungarian | 2.1 GB | 515,004 |
| ko | Korean | 2.0 GB | 582,035 |
| he | Hebrew | 2.0 GB | 372,053 |
| sr | Serbian | 2.0 GB | 664,345 |
| id | Indonesian | 1.8 GB | 723,099 |
| tr | Turkish | 1.6 GB | 629,762 |
| fi | Finnish | 1.5 GB | 572,900 |
| no | Norwegian (Bokmål) | 1.3 GB | 620,802 |
| el | Greek | 1.2 GB | 242,517 |
| hy | Armenian | 1.2 GB | 309,820 |
| ro | Romanian | 1.2 GB | 493,462 |
| ... | | | |
| Total | | 184.7 GB | 61,550,610 |

A detailed list is available here.

How to download and use 🌐 FineWiki

See the table above for the subset name of the language you want to download.

We currently do not provide smaller sample versions, but by setting `limit` (datatrove) or using `streaming=True` (datasets) you can easily fetch a sample of the data. If there is interest from the community, we might upload smaller sampled versions later on.
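For example, here is a minimal sketch of grabbing a small sample via streaming with the datasets library; the subset name and sample size below are arbitrary choices for illustration (see also the sections that follow for the full loading options):

```python
from datasets import load_dataset

# Stream the English subset and keep a small sample instead of downloading everything.
# "enwiki" and the sample size of 1,000 are only illustrative; pick any subset from the table above.
fw = load_dataset("HuggingFaceFW/finewiki", name="enwiki", split="train", streaming=True)
sample = list(fw.take(1000))  # roughly the equivalent of datatrove's limit=1000
print(sample[0]["title"])
```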

Using 🏭 datatrove

```python
from datatrove.pipeline.readers import ParquetReader

# limit determines how many documents will be streamed (remove for all)
# this will fetch the Portuguese data
data_reader = ParquetReader("hf://datasets/HuggingFaceFW/finewiki/data/ptwiki", limit=1000)
for document in data_reader():
    # do something with document
    print(document)

###############################
# OR for a processing pipeline:
###############################
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.readers import ParquetReader
from datatrove.pipeline.filters import LambdaFilter
from datatrove.pipeline.writers import JsonlWriter

pipeline_exec = LocalPipelineExecutor(
    pipeline=[
        ParquetReader("hf://datasets/HuggingFaceFW/finewiki/data/ptwiki", limit=1000),
        LambdaFilter(lambda doc: "hugging" in doc.text),
        JsonlWriter("some-output-path")
    ],
    tasks=10
)
pipeline_exec.run()
```

Using huggingface_hub

```python
from huggingface_hub import snapshot_download

folder = snapshot_download(
    "HuggingFaceFW/finewiki",
    repo_type="dataset",
    local_dir="./finewiki/",
    # download the English subset
    allow_patterns=["data/enwiki/*"]
)
```

Using datasets

```python
from datasets import load_dataset

# get Spanish data
fw = load_dataset("HuggingFaceFW/finewiki", name="eswiki", split="train", streaming=True)
```

Dataset Structure

Data Instances

Example from the English subset (values truncated for readability):

{ "text": "# 10th Tank Corps\nThe 10th Tank Corps was a tank corps of the Red Army, formed twice.\n\n## First Formation\nIn May–June 1938, ...", "id": "enwiki/32552979", "wikiname": "enwiki", "page_id": 32552979, "title": "10th Tank Corps", "url": "https://en.wikipedia.org/wiki/10th_Tank_Corps", "date_modified": "2023-07-26T12:32:03Z", "in_language": "en", "wikidata_id": "Q12061605", "bytes_html": 115017, "wikitext": "{{short description|Tank corps of the Soviet military}}\n\n{{Infobox military unit...", "version": 1167219203, "infoboxes": "[{\"title\": \"10th Tank Corps\", \"data\": {\"Active\": \"...\"}}]", "has_math": false }

Data Fields

  • text (string): cleaned, structured article text preserving headings, lists, code/pre blocks, tables and math. Has some markdown formatting (headings, tables, lists)
  • id (string): dataset‑unique identifier; typically <wikiname>/<page_id>
  • wikiname (string): wiki project name, e.g., enwiki, ptwiki
  • page_id (int): MediaWiki page identifier
  • title (string): article title
  • url (string): canonical article URL
  • date_modified (string): ISO‑8601 timestamp of the last page revision
  • in_language (string): article language code (e.g., en, pt)
  • wikidata_id (string|null): Wikidata QID associated with the page
  • bytes_html (int): size in bytes of the original HTML body
  • wikitext (string): original wikitext when available
  • version (int|string): revision/version identifier of the page
  • infoboxes (string): JSON‑encoded array of extracted infobox objects with title and key‑value data (see the parsing sketch after this list)
  • has_math (bool): whether math content was detected on the page
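As a small illustration of working with these fields, the sketch below decodes the infoboxes field for a single record. The record variable is hypothetical and stands for any row loaded with one of the methods above; the title/data keys follow the example instance shown earlier.

```python
import json

def extract_infoboxes(record: dict) -> list[dict]:
    """Decode the JSON-encoded infoboxes field into a list of {"title": ..., "data": ...} objects."""
    return json.loads(record.get("infoboxes") or "[]")

# Hypothetical usage with a row loaded via load_dataset or datatrove:
# boxes = extract_infoboxes(record)
# if record["has_math"] and boxes:
#     print(record["title"], "->", boxes[0]["data"])
```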

Data Processing

The full pipeline processing code is available here. It runs on datatrove. While we tried to offer robust support for most language variants of Wikipedia, the lack of standardization at the HTML level means that for some subsets the extraction might be sub-optimal. If this is the case for the languages you are interested in, we recommend adapting our code to address your specific concerns.

Downloading

We used the Wikimedia Enterprise HTML dump API (https://api.enterprise.wikimedia.com/v2/snapshots) and downloaded main-namespace (NS0) snapshots for the different language versions of Wikipedia. We intentionally relied on pre-rendered HTML over the more commonly used wikitext/markdown dumps: wikitext often encodes templates and formatting as parser functions/macros, which makes large sections of wiki pages harder to reconstruct faithfully, whereas the Enterprise HTML already expands those structures. We used snapshots from August 2025. We record rich per‑page attributes (IDs, titles, URLs, language, version, timestamps, Wikidata IDs) as part of the metadata.
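For reference, a rough sketch of listing available snapshots from that endpoint is shown below. The Bearer-token header, the WIKIMEDIA_ENTERPRISE_TOKEN environment variable, and the response handling are assumptions based on the public Wikimedia Enterprise documentation, not part of this dataset's pipeline code.

```python
import os
import requests

# Rough sketch only: consult the Wikimedia Enterprise docs for the exact auth flow and response schema.
API_URL = "https://api.enterprise.wikimedia.com/v2/snapshots"
token = os.environ["WIKIMEDIA_ENTERPRISE_TOKEN"]  # hypothetical variable holding an access token

resp = requests.get(API_URL, headers={"Authorization": f"Bearer {token}"})
resp.raise_for_status()

snapshots = resp.json()
print(f"{len(snapshots)} snapshots listed")  # FineWiki only uses main-namespace (NS0) Wikipedia snapshots
```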

Extraction

We heavily adapted mwparserfromhtml to parse the HTML content into a clean, structured text representation that preserves meaningful formatting. Redirect and disambiguation pages are removed reliably (via redirect markers in wikitext/HTML and disambiguation signals, including Wikidata IDs and page‑props). Reference‑like sections filled with non-article content (e.g., “References”, “Notes”, “External links”, localized per language) are excluded using a curated heading list and structural cues (reference list containers), so citations/notes are dropped without harming the main body. Visual/navigation boilerplate (ToC, navboxes, message boxes, authority control, categories) is filtered out, while infoboxes are carefully extracted into key-value structured data in the metadata, which can be useful for knowledge-search applications. We additionally strive to keep math content (and mark pages containing it with a has_math flag) as well as tables, where much of Wikipedia's knowledge is contained.
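The snippet below is only a conceptual sketch of the reference-section dropping idea, applied to the markdown-style output text with a tiny English-only heading blocklist; the actual pipeline works on parsed HTML via the adapted mwparserfromhtml and relies on curated, per-language heading lists and structural cues.

```python
import re

# English-only toy blocklist; the real pipeline uses curated, localized heading lists.
DROP_HEADINGS = {"references", "notes", "external links", "see also", "bibliography"}

def drop_reference_sections(text: str) -> str:
    """Drop sections whose heading matches the blocklist from markdown-style article text."""
    kept, skipping = [], False
    for line in text.splitlines():
        heading = re.match(r"^#{1,6}\s*(.+?)\s*$", line)
        if heading:
            skipping = heading.group(1).lower() in DROP_HEADINGS
        if not skipping:
            kept.append(line)
    return "\n".join(kept)
```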

Filtering

One common issue with low-resource language Wikipedias is the high prevalence of content from other languages, particularly English (often from articles or boilerplate pages copied over from the English Wikipedia). To ensure language quality and consistency, we apply language‑ and script‑aware checks tailored to each wiki. Pages are kept only if their predicted writing system matches the expected scripts for that language. For non‑English wikis, pages that are predominantly English above a confidence threshold are removed to reduce cross‑language leakage. We also drop ultra‑short pages without infoboxes to avoid low‑signal content.
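As a rough illustration of the script check, the heuristic below estimates how much of a page's alphabetic text is written in an expected script using Unicode character names; the real pipeline uses dedicated language/script classifiers and per-language thresholds, so treat this purely as a sketch.

```python
import unicodedata

EXPECTED_SCRIPTS = {"CYRILLIC"}  # e.g. for the Russian subset; purely illustrative

def script_match_ratio(text: str) -> float:
    """Fraction of alphabetic characters whose Unicode name starts with an expected script."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    matching = sum(1 for c in letters if unicodedata.name(c, "").split(" ")[0] in EXPECTED_SCRIPTS)
    return matching / len(letters)

# A page would be kept only if this ratio clears some threshold, e.g. script_match_ratio(text) > 0.5
```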

Licensing Information

This dataset contains text from Wikipedia, licensed under Creative Commons Attribution-ShareAlike 4.0 (CC BY-SA 4.0) and also available under GFDL. See Wikipedia’s licensing and Terms of Use: https://dumps.wikimedia.org/legal.html

Our processed release is an adaptation of that text and is licensed under CC BY-SA 4.0.

Citation Information

```bibtex
@dataset{penedo2025finewiki,
  author    = {Guilherme Penedo},
  title     = {FineWiki},
  year      = {2025},
  publisher = {Hugging Face Datasets},
  url       = {https://huggingface.co/datasets/HuggingFaceFW/finewiki},
  urldate   = {2025-10-20},
  note      = {Source: Wikimedia Enterprise Snapshot API (https://api.enterprise.wikimedia.com/v2/snapshots). Text licensed under CC BY-SA 4.0 with attribution to Wikipedia contributors.}
}
```