Dataset information
Available languages
French
Keywords
texte, ocr, text-mining, pdf
Dataset description
# Text extracted from PDFs found on data.gouv.fr
## Description
This dataset contains text extracted from 6602 files that have the `pdf` extension in the resource catalogue of data.gouv.fr.
Only PDFs of 20 MB or less that are still available at the URL given in the catalogue are included.
The extraction was done with [PDFBox](https://pdfbox.apache.org/) via its Python wrapper [python-PDFBox](https://pypi.org/project/python-pdfbox/). PDFs that are images (scans, maps, etc.) are detected with a simple heuristic: if the text file produced by PDFBox is smaller than 20 bytes, the PDF is considered to be an image.
In that case, OCR is performed with [Tesseract](https://github.com/tesseract-ocr/tesseract) via its Python wrapper [pyocr](https://github.com/openpaperwork/pyocr).
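The fallback logic described above can be sketched as follows. This is a minimal illustration, not the project's actual script: `extract_text` and `run_ocr` are hypothetical callables standing in for the PDFBox and Tesseract steps, and only the 20-byte threshold comes from the description.

```python
import os

MIN_TEXT_BYTES = 20  # threshold stated in the description above


def looks_like_image(text_path, min_bytes=MIN_TEXT_BYTES):
    """Return True when the extracted text file is smaller than min_bytes."""
    return os.path.getsize(text_path) < min_bytes


def pdf_to_text(pdf_path, text_path, extract_text, run_ocr):
    """Extract text first; fall back to OCR when almost nothing comes out.

    extract_text and run_ocr are hypothetical callables (assumptions for
    this sketch) that each write their result to text_path.
    Returns the name of the method that produced the final text.
    """
    extract_text(pdf_path, text_path)
    if looks_like_image(text_path):
        run_ocr(pdf_path, text_path)
        return "ocr"
    return "pdfbox"
```

Passing the two steps in as callables keeps the heuristic independent of how PDFBox and Tesseract are actually invoked.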
The result is a set of `txt` files, one per converted PDF, grouped by organisation (the organisation that published the resource). There are 175 organisations in this dataset, so 175 files.
Each file is named `{id-du-dataset}--{id-de-la-resource}.txt`, that is, the dataset ID followed by the resource ID.
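As an illustration of this naming scheme, a file name can be split back into its two identifiers. The helper name `split_name` is an assumption made for this sketch; it relies on the double hyphen appearing only as the separator, which holds for the file names shown in this description.

```python
from pathlib import Path


def split_name(filename):
    """Split '{dataset_id}--{resource_id}.txt' into (dataset_id, resource_id)."""
    stem = Path(filename).stem  # drop the '.txt' extension
    dataset_id, sep, resource_id = stem.partition("--")
    if not sep or not dataset_id or not resource_id:
        raise ValueError(f"unexpected file name: {filename!r}")
    return dataset_id, resource_id
```

For example, `split_name("53ba55c4a3a729219b7beae2--0cf9f9cd-e398-4512-80de-5fd0e2d1cb0a.txt")` returns the dataset ID `53ba55c4a3a729219b7beae2` and the resource ID `0cf9f9cd-e398-4512-80de-5fd0e2d1cb0a`.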
#### Input
Catalogue of [data.gouv.fr resources](https://www.data.gouv.fr/en/datasets/catalogue-des-donnees-de-data-gouv-fr/).
#### Output
One text file for each `pdf` resource in the catalogue that was successfully converted and met the constraints above.
The tree is as follows:
```bash
.
├── ACTION_Nogent-sur-Marne
│   ├── 53ba55c4a3a729219b7beae2--0cf9f9cd-e398-4512-80de-5fd0e2d1cb0a.txt
│   ├── 53ba55c4a3a729219b7beae2--1ffcb2cb-2355-4426-b74a-946dadeba7f1.txt
│   ├── 53ba55c4a3a729219b7beae2--297a0466-daaa-47f4-972a-0d5bea2ab180.txt
│   ├── 53ba55c4a3a729219b7beae2--3ac0a881-181f-499e-8b3f-c2b0ddd528f7.txt
│   ├── 53ba55c4a3a729219b7beae2--3ca6bd8f-05a6-469a-a36b-afda5a7444a4.txt
│   ...
├── Aeroport_La_Rochelle-Ile_de_Re
├── Agency_de_services_and_payment_ASP
├── Agency_du_Numerique
...
```
## Distribution of texts [as of 20 May 2020]
The 10 organisations with the largest number of documents are:
```python
[('Les_Lilas', 1294),
 ('Ville_de_Pirae', 1099),
 ('Region_Hauts-de-France', 592),
 ('Ressourcerie_datalocale', 297),
 ('NA', 268),
 ('CORBION', 244),
 ('Education_Nationale', 189),
 ('Incubator_of_Services_Numeriques', 157),
 ('Ministere_des_Solidarites_and_de_la_Sante', 148),
 ('Communaute_dAgglomeration_Plaine_Vallee', 142)]
```
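A ranking of this kind could be recomputed from the extracted files roughly as follows. This is a sketch that assumes the layout shown in the tree above (one directory per organisation containing `.txt` files); the function name `top_organisations` is hypothetical and not taken from the project's scripts.

```python
from collections import Counter
from pathlib import Path


def top_organisations(root, n=10):
    """Count .txt files per organisation directory and return the n largest."""
    counts = Counter()
    for org_dir in Path(root).iterdir():
        if org_dir.is_dir():
            # One text file per converted PDF resource.
            counts[org_dir.name] = sum(1 for _ in org_dir.glob("*.txt"))
    return counts.most_common(n)
```

The result is a list of `(organisation, count)` tuples sorted by decreasing count, the same shape as the list above.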
A 2D preview of the texts, obtained with [HashFeatures](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html) + [TruncatedSVD](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html) + t-SNE:
![t-SNE plot of DGF texts](https://raw.githubusercontent.com/psorianom/data_gouv_text/master/img/samplefigure.png)
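A projection of this kind can be sketched with scikit-learn as below. The parameter values (`n_features`, `svd_components`, `perplexity`) are assumptions chosen for illustration, not the settings used to produce the figure; the function name `project_2d` is likewise hypothetical.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.manifold import TSNE


def project_2d(texts, n_features=2**12, svd_components=50, perplexity=30.0, seed=0):
    """Map a list of documents to 2D points: hashing -> SVD -> t-SNE."""
    # Hash each document into a fixed-width sparse vector (no vocabulary to fit).
    hashed = HashingVectorizer(n_features=n_features).transform(texts)
    # Keep the SVD rank below the number of documents so small corpora still work.
    rank = max(2, min(svd_components, len(texts) - 1))
    reduced = TruncatedSVD(n_components=rank, random_state=seed).fit_transform(hashed)
    # t-SNE requires perplexity to be smaller than the number of documents.
    tsne = TSNE(n_components=2, perplexity=perplexity, init="random", random_state=seed)
    return tsne.fit_transform(reduced)
```

The intermediate SVD step reduces the hashed vectors to a few dense dimensions before t-SNE, which is considerably cheaper than running t-SNE on the full hashed space.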
## Code
The Python scripts used to do this extraction are [here](https://github.com/psorianom/data_gouv_text).
## Remarks
Due to the quality of the original PDFs (low-resolution scans, misaligned PDFs, ...) and the limitations of the pdf->txt conversion methods, the results can be very noisy.