Enrico Shippole

conceptofmind

AI & ML interests

None yet

Recent Activity

updated a dataset 11 days ago
DataProvenanceInitiative/Megawika_corrected
updated a dataset about 1 month ago
conceptofmind/MegaWika
updated a dataset about 1 month ago
DataProvenanceInitiative/Megawika_subset

Organizations

MedARC, LawInformedAI, Dolphin Brothers Unite, CarperAI, Nomos AI, Experimental Models, Data Provenance Initiative, Citation Max, FreeLaw, ZeroGPU Explorers, lul2, Social Post Explorers, genlaw, aidos-lab, Distillation Hugs, Teraflop AI

Posts 2

Teraflop AI is excited to help support the Caselaw Access Project and the Harvard Library Innovation Lab in the release of over 6.6 million state and federal court decisions published throughout U.S. history. It is important to democratize fair access to data for the public, the legal community, and researchers. This is a processed and cleaned version of the original CAP data.

During the digitization of these texts, OCR errors were introduced. We post-processed each text for model training to fix encoding, normalization, repetition, redundancy, parsing, and formatting issues.
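
As a rough illustration of the kinds of fixes listed above, here is a minimal Python sketch using only the standard library. It is not Teraflop AI's actual pipeline; the function name `clean_ocr_text` and the specific rules (NFKC normalization, control-character removal, rejoining hyphenated line breaks, dropping consecutive duplicate lines) are assumptions chosen for illustration.

```python
import re
import unicodedata


def clean_ocr_text(text: str) -> str:
    """Apply a few generic cleanup passes to OCR-derived text (illustrative only)."""
    # Normalization: fold compatibility characters such as ligatures via NFKC.
    text = unicodedata.normalize("NFKC", text)
    # Encoding debris: drop control characters other than newline and tab.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    # Parsing: rejoin words that were hyphenated across a line break.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Formatting: collapse runs of spaces and tabs, then strip each line.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.split("\n")]
    # Repetition: drop a non-empty line that exactly repeats the previous line.
    deduped = []
    for line in lines:
        if line and deduped and line == deduped[-1]:
            continue
        deduped.append(line)
    # Redundancy: collapse three or more newlines into a single blank line.
    return re.sub(r"\n{3,}", "\n\n", "\n".join(deduped)).strip()
```

A real pipeline would likely need domain-specific rules (for example, handling reporter citations and page headers), which this sketch does not attempt.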

Teraflop AI’s data engine allows for the massively parallel processing of web-scale datasets into cleaned text form.

Link to the processed dataset: https://huggingface.co/datasets/TeraflopAI/Caselaw_Access_Project

The Caselaw Access Project dataset is licensed under the CC0 License.

We plan to release trillions of commercially licensed text tokens, images, audio, videos, and other datasets spanning numerous domains and modalities over the coming months. If you are interested in contributing commercially licensed data, be sure to reach out: https://twitter.com/EnricoShippole

Follow us for the next collaborative dataset releases: https://twitter.com/TeraflopAI
A 1B-parameter dense causal language model begins to "saturate" in terms of accuracy at around 5 epochs on 1.2T tokens.

Models

None public yet