Text from pdfs found on data.gouv.fr

Open data API in a single place

Provided by etalab

Get early access to Text from pdfs found on data.gouv.fr API!

Let us know and we will figure it out for you.

Dataset information

Country of origin
2020.05.20 16:14
Available languages
texte, ocr, text-mining, pdf
Quality scoring

Dataset description

# Text extracted from pdfs found on data.gouv.fr

## Description

This dataset contains text extracted from 6602 files that have the ‘pdf’ extension in the resource catalog of data.gouv.fr. Only PDFs of 20 MB or less that are still available at the URL indicated are included.

The extraction was done with [PDFBox](https://pdfbox.apache.org/) via its Python wrapper [python-PDFBox](https://pypi.org/project/python-pdfbox/). PDFs that are images (scans, maps, etc.) are detected with a simple heuristic: if the text file produced by ‘PDFBox’ is smaller than 20 bytes, the PDF is considered to be an image. In that case, OCR is performed with [Tesseract](https://github.com/tesseract-ocr/tesseract) via its Python wrapper [pyocr](https://github.com/openpaperwork/pyocr).

The result is a set of ‘txt’ files extracted from the PDFs, sorted by organisation (the organisation that published the resource). There are 175 organisations in this dataset, hence 175 folders. The name of each file follows the pattern ‘{id-du-dataset}--{id-de-la-resource}.txt’.

#### Input

Catalogue of [data.gouv.fr resources](https://www.data.gouv.fr/en/datasets/catalogue-des-donnees-de-data-gouv-fr/).

#### Output

Text files of each ‘pdf’ resource found in the catalogue that was successfully converted and satisfied the above constraints. The tree is as follows:

```bash
.
ACTION_Nogent-sur-Marne
    53ba55c4a3a729219b7beae2--0cf9f9cd-e398-4512-80de-5fd0e2d1cb0a.txt
    53ba55c4a3a729219b7beae2--1ffcb2cb-2355-4426-b74a-946dadeba7f1.txt
    53ba55c4a3a729219b7beae2--297a0466-daaa-47f4-972a-0d5bea2ab180.txt
    53ba55c4a3a729219b7beae2--3ac0a881-181f-499e-8b3f-c2b0ddd528f7.txt
    53ba55c4a3a729219b7beae2--3ca6bd8f-05a6-469a-a36b-afda5a7444a4.txt
    ...
Aeroport_La_Rochelle-Ile_de_Re
Agency_de_services_and_payment_ASP
Agency_du_Numerique
...
```

## Distribution of texts [as of 20 May 2020]

The 10 organisations with the largest number of documents are:

```python
[('Les_Lilas', 1294),
 ('Ville_de_Pirae', 1099),
 ('Region_Hauts-de-France', 592),
 ('Ressourcerie_datalocale', 297),
 ('NA', 268),
 ('CORBION', 244),
 ('Education_Nationale', 189),
 ('Incubator_of_Services_Numeriques', 157),
 ('Ministere_des_Solidarites_and_de_la_Sante', 148),
 ('Communaute_dAgglomeration_Plaine_Vallee', 142)]
```

Their 2D preview ([HashFeatures](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html) + [TruncatedSVD](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html) + t-SNE):

[Plot t-SNE of DGF texts](https://raw.githubusercontent.com/psorianom/data_gouv_text/master/img/samplefigure.png)

## Code

The Python scripts used to do this extraction are [here](https://github.com/psorianom/data_gouv_text).

## Remarks

Due to the quality of the original PDFs (low-resolution scans, non-aligned PDFs, ...) and the performance of the pdf->txt conversion methods, the results can be very noisy.
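To make the output layout described above more concrete, here is a minimal Python sketch. It is not taken from the authors' scripts. It only illustrates the three conventions stated in the description: the 20-byte image heuristic, the ‘{id-du-dataset}--{id-de-la-resource}.txt’ naming pattern, and one folder per organisation. The function names, the choice of measuring size as UTF-8 bytes, and the assumption that the organisation folders sit directly under a local root directory are all hypothetical.

```python
from __future__ import annotations

from collections import Counter
from pathlib import Path

# Threshold quoted in the dataset description (assumed here to mean UTF-8 bytes).
IMAGE_TEXT_THRESHOLD_BYTES = 20


def looks_like_image_pdf(extracted_text: str,
                         threshold: int = IMAGE_TEXT_THRESHOLD_BYTES) -> bool:
    """Return True when the extracted text is smaller than the threshold."""
    return len(extracted_text.encode("utf-8")) < threshold


def parse_output_name(filename: str) -> tuple[str, str]:
    """Split '<dataset id>--<resource id>.txt' into its two identifiers."""
    stem = Path(filename).stem  # drops the trailing '.txt'
    dataset_id, separator, resource_id = stem.partition("--")
    if not separator or not dataset_id or not resource_id:
        raise ValueError(f"unexpected file name: {filename!r}")
    return dataset_id, resource_id


def documents_per_organisation(root: Path, top: int = 10) -> list[tuple[str, int]]:
    """Count the .txt files in each organisation folder directly under root."""
    counts: Counter[str] = Counter()
    for org_dir in root.iterdir():
        if org_dir.is_dir():
            counts[org_dir.name] = sum(1 for _ in org_dir.glob("*.txt"))
    return counts.most_common(top)
```

Calling `documents_per_organisation(Path("path/to/extracted/dataset"))` on a local copy of the dataset should yield a list of `(organisation, count)` pairs in the same shape as the distribution listing; the path shown is a placeholder.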

Frequently Asked Questions

Some basic information about API Store ®.

The operation and development of the APIs are currently fully funded by Apitalks, and their use is free.
Yes, you can.
All important information, such as the time of the last update and the license, is included in the response to each API call.
In case of a major update that is not compatible with the previous version of an API, we keep both versions available for 30 days so that you have enough time to migrate to the new version. We will inform you about the changes in advance by e-mail.


API Store provides access to European Open Data via scalable and reliable REST API interface.
Copyright © 2025. Made with ♥ by Apitalks