FineWiki

This is an updated and better-extracted version of the wikimedia/Wikipedia dataset originally released in 2023. We carefully parsed Wikipedia HTML dumps from August 2025, covering 325 languages.

This dataset:

  • fully renders templates, since it was extracted from HTML rather than from wikitext/markdown dumps
  • removes redirects, disambiguation pages, and other non-main-article pages
  • includes detailed metadata such as page ID, title, last modified date, Wikidata ID, version, and the original wikitext of the page
  • preserves elements and formatting such as headings, lists, code/pre blocks, tables and math content
  • notably, wikimedia/Wikipedia removes all tables and math content
  • excludes most of the "References", "See also", "Notes", "External links", and similar citations/notes sections across all languages
  • besides keeping all math content, pages containing math are flagged with a has_math metadata attribute
  • extracts infoboxes (the high-level summary boxes shown on the right of some Wikipedia pages) into a structured format in the metadata, for RAG and other uses
  • only keeps pages whose script (writing alphabet) matches the expected list for that language
  • for non-English wikis, any page fully or mostly in English is removed (a common issue when training language identifiers/classifiers)

Visualize and Compare

You can explore the dataset, compare it to wikimedia/Wikipedia and preview the live Wikipedia pages on our space.

Available subsets

| Subset | Name | Size | Pages |
|--------|------|------|-------|
| en | English | 35.1 GB | 6,614,655 |
| de | German | 13.1 GB | 2,713,646 |
| fr | French | 12.1 GB | 2,566,183 |
| ru | Russian | 10.7 GB | 1,817,813 |
| ja | Japanese | 9.9 GB | 1,354,269 |
| es | Spanish | 8.5 GB | 1,948,965 |
| it | Italian | 7.4 GB | 1,799,759 |
| uk | Ukrainian | 5.4 GB | 1,239,253 |
| zh | Chinese (written vernacular Chinese) | 5.1 GB | 1,295,955 |
| pl | Polish | 4.4 GB | 1,543,918 |
| ceb | Cebuano | 4.4 GB | 5,647,436 |
| pt | Portuguese | 4.3 GB | 1,135,383 |
| nl | Dutch | 3.5 GB | 2,072,865 |
| ca | Catalan | 3.5 GB | 962,290 |
| ar | Arabic | 3.4 GB | 1,230,456 |
| sv | Swedish | 2.9 GB | 2,470,063 |
| cs | Czech | 2.2 GB | 534,563 |
| fa | Persian | 2.2 GB | 1,021,336 |
| vi | Vietnamese | 2.1 GB | 1,279,087 |
| hu | Hungarian | 2.1 GB | 515,004 |
| ko | Korean | 2.0 GB | 582,035 |
| he | Hebrew | 2.0 GB | 372,053 |
| sr | Serbian | 2.0 GB | 664,345 |
| id | Indonesian | 1.8 GB | 723,099 |
| tr | Turkish | 1.6 GB | 629,762 |
| fi | Finnish | 1.5 GB | 572,900 |
| no | Norwegian (Bokmål) | 1.3 GB | 620,802 |
| el | Greek | 1.2 GB | 242,517 |
| hy | Armenian | 1.2 GB | 309,820 |
| ro | Romanian | 1.2 GB | 493,462 |
| ... | | | |
| Total | | 184.7 GB | 61,550,610 |

A detailed list is available here.

How to download and use 🌐 FineWiki

See the table above for the subset name of the language you want to download.

We currently do not provide smaller sample versions, but by setting `limit` (datatrove) or using `streaming=True` (datasets) you can easily fetch a sample of the data. If there is interest from the community, we might upload smaller sampled versions later on.
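For example, here is a minimal sketch of grabbing a small sample via streaming with the datasets library; the subset name and sample size below are arbitrary choices for illustration (see also the sections that follow for the full loading options):

```python
from datasets import load_dataset

# Stream the English subset and keep a small sample instead of downloading everything.
# "enwiki" and the sample size of 1,000 are only illustrative; pick any subset from the table above.
fw = load_dataset("HuggingFaceFW/finewiki", name="enwiki", split="train", streaming=True)
sample = list(fw.take(1000))  # roughly the equivalent of datatrove's limit=1000
print(sample[0]["title"])
```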

Using 🏭 datatrove

```python
from datatrove.pipeline.readers import ParquetReader

# limit determines how many documents will be streamed (remove for all)
# this will fetch the Portuguese data
data_reader = ParquetReader("hf://datasets/HuggingFaceFW/finewiki/data/ptwiki", limit=1000)
for document in data_reader():
    # do something with document
    print(document)

###############################
# OR for a processing pipeline:
###############################
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.readers import ParquetReader
from datatrove.pipeline.filters import LambdaFilter
from datatrove.pipeline.writers import JsonlWriter

pipeline_exec = LocalPipelineExecutor(
    pipeline=[
        ParquetReader("hf://datasets/HuggingFaceFW/finewiki/data/ptwiki", limit=1000),
        LambdaFilter(lambda doc: "hugging" in doc.text),
        JsonlWriter("some-output-path")
    ],
    tasks=10
)
pipeline_exec.run()
```

Using huggingface_hub

```python
from huggingface_hub import snapshot_download

folder = snapshot_download(
    "HuggingFaceFW/finewiki",
    repo_type="dataset",
    local_dir="./finewiki/",
    # download the English subset
    allow_patterns=["data/enwiki/*"]
)
```

Using datasets

```python
from datasets import load_dataset

# get Spanish data
fw = load_dataset("HuggingFaceFW/finewiki", name="eswiki", split="train", streaming=True)
```

Dataset Structure

Data Instances

Example from the English subset (values truncated for readability):

{ "text": "# 10th Tank Corps\nThe 10th Tank Corps was a tank corps of the Red Army, formed twice.\n\n## First Formation\nIn May–June 1938, ...", "id": "enwiki/32552979", "wikiname": "enwiki", "page_id": 32552979, "title": "10th Tank Corps", "url": "https://en.wikipedia.org/wiki/10th_Tank_Corps", "date_modified": "2023-07-26T12:32:03Z", "in_language": "en", "wikidata_id": "Q12061605", "bytes_html": 115017, "wikitext": "{{short description|Tank corps of the Soviet military}}\n\n{{Infobox military unit...", "version": 1167219203, "infoboxes": "[{\"title\": \"10th Tank Corps\", \"data\": {\"Active\": \"...\"}}]", "has_math": false }

Data Fields

  • text (string): cleaned, structured article text preserving headings, lists, code/pre blocks, tables and math. Has some markdown formatting (headings, tables, lists)
  • id (string): dataset‑unique identifier; typically <wikiname>/<page_id>
  • wikiname (string): wiki project name, e.g., enwiki, ptwiki
  • page_id (int): MediaWiki page identifier
  • title (string): article title
  • url (string): canonical article URL
  • date_modified (string): ISO‑8601 timestamp of the last page revision
  • in_language (string): article language code (e.g., en, pt)
  • wikidata_id (string|null): Wikidata QID associated with the page
  • bytes_html (int): size in bytes of the original HTML body
  • wikitext (string): original wikitext when available
  • version (int|string): revision/version identifier of the page
  • infoboxes (string): JSON‑encoded array of extracted infobox objects with title and key‑value data (see the parsing sketch after this list)
  • has_math (bool): whether math content was detected on the page
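As a small illustration of working with these fields, the sketch below decodes the infoboxes field for a single record. The record variable is hypothetical and stands for any row loaded with one of the methods above; the title/data keys follow the example instance shown earlier.

```python
import json

def extract_infoboxes(record: dict) -> list[dict]:
    """Decode the JSON-encoded infoboxes field into a list of {"title": ..., "data": ...} objects."""
    return json.loads(record.get("infoboxes") or "[]")

# Hypothetical usage with a row loaded via load_dataset or datatrove:
# boxes = extract_infoboxes(record)
# if record["has_math"] and boxes:
#     print(record["title"], "->", boxes[0]["data"])
```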

Data Processing

The full pipeline processing code is available here. It runs on datatrove. While we tried to offer robust support for most language variants of Wikipedia, the lack of standardization at the HTML level means that for some subsets the extraction might be sub-optimal. If this is the case for the languages you are interested in, we recommend adapting our code to address your specific concerns.

Downloading

We used the Wikimedia Enterprise HTML dump API (https://api.enterprise.wikimedia.com/v2/snapshots) and downloaded main-namespace (NS0) snapshots for the different language versions of Wikipedia. We intentionally relied on pre-rendered HTML over the more commonly used wikitext/markdown dumps: wikitext often encodes templates and formatting as parser functions/macros, which makes large sections of wiki pages harder to reconstruct faithfully, whereas the Enterprise HTML already expands those structures. We used snapshots from August 2025. We record rich per‑page attributes (IDs, titles, URLs, language, version, timestamps, Wikidata IDs) as part of the metadata.
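For reference, a rough sketch of listing available snapshots from that endpoint is shown below. The Bearer-token header, the WIKIMEDIA_ENTERPRISE_TOKEN environment variable, and the response handling are assumptions based on the public Wikimedia Enterprise documentation, not part of this dataset's pipeline code.

```python
import os
import requests

# Rough sketch only: consult the Wikimedia Enterprise docs for the exact auth flow and response schema.
API_URL = "https://api.enterprise.wikimedia.com/v2/snapshots"
token = os.environ["WIKIMEDIA_ENTERPRISE_TOKEN"]  # hypothetical variable holding an access token

resp = requests.get(API_URL, headers={"Authorization": f"Bearer {token}"})
resp.raise_for_status()

snapshots = resp.json()
print(f"{len(snapshots)} snapshots listed")  # FineWiki only uses main-namespace (NS0) Wikipedia snapshots
```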

Extraction

We heavily adapted mwparserfromhtml to parse the HTML content into a clean, structured text representation that preserves meaningful formatting. Redirect and disambiguation pages are removed reliably (via redirect markers in wikitext/HTML and disambiguation signals, including Wikidata IDs and page‑props). Reference‑like sections filled with non-article content (e.g., “References”, “Notes”, “External links”, localized per language) are excluded using a curated heading list and structural cues (reference list containers), so citations/notes are dropped without harming the main body. Visual/navigation boilerplate (ToC, navboxes, message boxes, authority control, categories) is filtered out, while infoboxes are carefully extracted into key-value structured data in the metadata, which can be useful for knowledge-search applications. We additionally strive to keep math content (and mark pages containing it with a has_math flag) as well as tables, where much of Wikipedia's knowledge is contained.
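The snippet below is only a conceptual sketch of the reference-section dropping idea, applied to the markdown-style output text with a tiny English-only heading blocklist; the actual pipeline works on parsed HTML via the adapted mwparserfromhtml and relies on curated, per-language heading lists and structural cues.

```python
import re

# English-only toy blocklist; the real pipeline uses curated, localized heading lists.
DROP_HEADINGS = {"references", "notes", "external links", "see also", "bibliography"}

def drop_reference_sections(text: str) -> str:
    """Drop sections whose heading matches the blocklist from markdown-style article text."""
    kept, skipping = [], False
    for line in text.splitlines():
        heading = re.match(r"^#{1,6}\s*(.+?)\s*$", line)
        if heading:
            skipping = heading.group(1).lower() in DROP_HEADINGS
        if not skipping:
            kept.append(line)
    return "\n".join(kept)
```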

Filtering

One common issue with low-resource language Wikipedias is the high prevalence of content from other languages, particularly English (often from articles or boilerplate pages copied over from the English Wikipedia). To ensure language quality and consistency, we apply language‑ and script‑aware checks tailored to each wiki. Pages are kept only if their predicted writing system matches the expected scripts for that language. For non‑English wikis, pages that are predominantly English above a confidence threshold are removed to reduce cross‑language leakage. We also drop ultra‑short pages without infoboxes to avoid low‑signal content.
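As a rough illustration of the script check, the heuristic below estimates how much of a page's alphabetic text is written in an expected script using Unicode character names; the real pipeline uses dedicated language/script classifiers and per-language thresholds, so treat this purely as a sketch.

```python
import unicodedata

EXPECTED_SCRIPTS = {"CYRILLIC"}  # e.g. for the Russian subset; purely illustrative

def script_match_ratio(text: str) -> float:
    """Fraction of alphabetic characters whose Unicode name starts with an expected script."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    matching = sum(1 for c in letters if unicodedata.name(c, "").split(" ")[0] in EXPECTED_SCRIPTS)
    return matching / len(letters)

# A page would be kept only if this ratio clears some threshold, e.g. script_match_ratio(text) > 0.5
```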

Licensing Information

This dataset contains text from Wikipedia, licensed under Creative Commons Attribution-ShareAlike 4.0 (CC BY-SA 4.0) and also available under GFDL. See Wikipedia’s licensing and Terms of Use: https://dumps.wikimedia.org/legal.html

Our processed release is an adaptation of that text and is licensed under CC BY-SA 4.0.

Citation Information

```bibtex
@dataset{penedo2025finewiki,
  author    = {Guilherme Penedo},
  title     = {FineWiki},
  year      = {2025},
  publisher = {Hugging Face Datasets},
  url       = {https://huggingface.co/datasets/HuggingFaceFW/finewiki},
  urldate   = {2025-10-20},
  note      = {Source: Wikimedia Enterprise Snapshot API (https://api.enterprise.wikimedia.com/v2/snapshots). Text licensed under CC BY-SA 4.0 with attribution to Wikipedia contributors.}
}
```