Dataset highlights:

- 223,739 documents from doc4web.ru, a document hosting platform for students and teachers
- Primarily in Russian, with some English and potentially other languages
- Each entry includes: URL, title, download link, file path, and content (where available)
- Contains original document files in addition to metadata
- Data reflects a wide range of educational topics and materials
- Licensed under Creative Commons Zero (CC0) for unrestricted use

The dataset can be used for analyzing educational content in Russian, text classification tasks, and information retrieval systems. It's also valuable for examining trends in educational materials and document sharing practices in the Russian-speaking academic community. The inclusion of original files allows for in-depth analysis of various document formats and structures.
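
As a rough starting point for the analyses mentioned above, the sketch below separates entries that have content from those that do not, then guesses the dominant script of each text by counting Cyrillic and Latin letters. The key names `title` and `content` are assumptions, since the exact field keys are not given here, and the sample records are invented for illustration rather than taken from the dataset. The character-count heuristic only stands in for real language detection.

```python
import re

# Assumed key names: the dataset lists URL, title, download link, file path,
# and content as fields, but the exact keys are not specified here.
CYRILLIC = re.compile(r"[\u0400-\u04FF]")
LATIN = re.compile(r"[A-Za-z]")


def has_content(record):
    """Return True if the record carries non-empty text content."""
    content = record.get("content")
    return isinstance(content, str) and bool(content.strip())


def guess_script(text):
    """Crude heuristic: compare counts of Cyrillic and Latin letters."""
    cyrillic = len(CYRILLIC.findall(text))
    latin = len(LATIN.findall(text))
    if cyrillic == 0 and latin == 0:
        return "unknown"
    return "cyrillic" if cyrillic >= latin else "latin"


def summarize(records):
    """Count records by dominant script, plus those without content."""
    counts = {"cyrillic": 0, "latin": 0, "unknown": 0, "no_content": 0}
    for record in records:
        if not has_content(record):
            counts["no_content"] += 1
        else:
            counts[guess_script(record["content"])] += 1
    return counts


# Invented sample records, for illustration only.
sample = [
    {"title": "Урок физики", "content": "Закон Ома для участка цепи"},
    {"title": "English test", "content": "Choose the correct answer"},
    {"title": "Empty entry", "content": None},
]

print(summarize(sample))
```

Because content is only present "where available", checking for missing or empty text before any downstream step avoids errors in classification or retrieval pipelines.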