hunterhector committed 4cc0103 (parent: a4dc57a): "a lot of text fixes."

Files changed:
- bibliography.bib +9 -0
- curated.py +44 -16
- main.py +22 -22
- results.py +40 -36
- web.py +4 -4
bibliography.bib
CHANGED
@@ -529,3 +529,12 @@ url = {https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md}
   pages={994-998},
   keywords={Couplings;Databases;Data mining;Algorithm design and analysis;Social network services;Feature extraction;Cleaning;Transitive Closure;Connected Components;Large Scale Graphs;Hadoop;MapReduce},
   doi={10.1109/ICCNC.2014.6785473}}
+@misc{lozhkov2024starcoder2stackv2,
+      title={StarCoder 2 and The Stack v2: The Next Generation},
+      author={Anton Lozhkov and Raymond Li and Loubna Ben Allal and Federico Cassano and Joel Lamy-Poirier and Nouamane Tazi and Ao Tang and Dmytro Pykhtar and Jiawei Liu and Yuxiang Wei and Tianyang Liu and Max Tian and Denis Kocetkov and Arthur Zucker and Younes Belkada and Zijian Wang and Qian Liu and Dmitry Abulkhanov and Indraneil Paul and Zhuang Li and Wen-Ding Li and Megan Risdal and Jia Li and Jian Zhu and Terry Yue Zhuo and Evgenii Zheltonozhskii and Nii Osae Osae Dade and Wenhao Yu and Lucas Krauß and Naman Jain and Yixuan Su and Xuanli He and Manan Dey and Edoardo Abati and Yekun Chai and Niklas Muennighoff and Xiangru Tang and Muhtasham Oblokulov and Christopher Akiki and Marc Marone and Chenghao Mou and Mayank Mishra and Alex Gu and Binyuan Hui and Tri Dao and Armel Zebaze and Olivier Dehaene and Nicolas Patry and Canwen Xu and Julian McAuley and Han Hu and Torsten Scholak and Sebastien Paquet and Jennifer Robinson and Carolyn Jane Anderson and Nicolas Chapados and Mostofa Patwary and Nima Tajbakhsh and Yacine Jernite and Carlos Muñoz Ferrandis and Lingming Zhang and Sean Hughes and Thomas Wolf and Arjun Guha and Leandro von Werra and Harm de Vries},
+      year={2024},
+      eprint={2402.19173},
+      archivePrefix={arXiv},
+      primaryClass={cs.SE},
+      url={https://arxiv.org/abs/2402.19173},
+}
curated.py
CHANGED
@@ -16,7 +16,7 @@ overview = (
     H2("Curated Sources Processing"),
     H3("What This Section Contains"),
     P(
-        "This section provides a complete discussion on the filtering applied to the 14 curated sources that comprise the non-…
+        "This section provides a complete discussion of the filtering applied to the 14 curated sources that comprise the non-Common Crawl data section of TxT360. The section is split into the following topic areas: "
     ),
     Ul(
         Li("Curated Sources Data Processing Summary", style="margin-bottom: 5px"),
@@ -30,18 +30,16 @@ overview = (
 )
 
 curated_sources_intro = Div(
-    H2("Curated Sources…
+    H2("Domain Specific Curated Sources"),
     P(
-        "…
-        …
-        …
-        …
-        …
-    ),
-        " These sources, such as Arxiv, Wikipedia, and Stack Exchange, provide valuable data that is excluded from the web dataset mentioned above. Analyzing and processing non-web data can yield insights and opportunities for various applications. Details about each of the sources are provided below. ",
+        "While a massive amount of data can be crawled from the Internet, certain sources contain data in additional formats (e.g., PDF documents) or are organized and published as official dumps (e.g., Wikipedia). We refer to these as curated sources. These datasets often comprise high-quality, domain-specific data, such as academic publications or domain-specific discussions. TxT360 was strongly influenced by The Pile",
+        D_cite(bibtex_key="thepile"),
+        " regarding both the inclusion of the datasets and the filtering techniques.",
+    ),
+    P("These sources, such as Arxiv, Wikipedia, and Stack Exchange, provide high quality data. As mentioned above, they are excluded from the web dataset via URL matching. Details about each of the sources are provided below. ",
     ),
     P(
-        "TxT360 respects the copyright of the data sources and have not included the controversial data that was used in The Pile like YouTube and Opensubtitles, Reddit threads, and…
+        "TxT360 respects the copyright of the data sources and has not included the controversial data that was used in The Pile, such as YouTube and OpenSubtitles, Reddit threads, and Books3."
     ),
 )
 
@@ -1198,10 +1196,22 @@ filtering_process = Div(
                 ". The dataset was parsed using the Story ID. In this dataset each post is a story, and each reply is considered a subsequent story. Story IDs between 1 and 37500000 were considered. The URL for each Story ID was pinged; if the ID returned an error, it was removed. Each request was given a 2 second wait to account for network time.",
             ),
             P(
-                "The HackerNews dataset contains a vast amount of stories and is known for lively discussions. Due to the number of replies a story may contain, only longest comment thread for each story was sampled past level 3. All stories included the title (1st level) and all direct replies (2nd level)."
+                "The HackerNews dataset contains a vast number of stories and is known for lively discussions. Due to the number of replies a story may contain, only the longest comment thread for each story was sampled past level 3. All stories included the title (1st level) and all direct replies (2nd level). We may consider relaxing this constraint to extract more data."
             ),
             P(B("Unique Data Preparation Challenges: ")),
             Ul(
+                Li(
+                    "The conversation and forum style structure can be a very helpful signal for language model training. While processing the dataset, we try to encode this structure without introducing too much noise. We chose to use an ",
+                    D_code("<AUTHOR>", language="html"),
+                    " tag to encode the main thread text by the original poster, and a ",
+                    D_code("<COMMENT>", language="html"),
+                    " tag to encode the replies. We initially chose ",
+                    D_code("<P>", language="html"),
+                    " as the tag since it is used by some instruction tuning datasets, but realized the ",
+                    D_code("<P>", language="html"),
+                    " tag can easily conflict with the original text.",
+                    style="margin-bottom: -3px",
+                ),
                 Li(
                     "As discussed above, the comment hierarchies required a thoughtful approach to extracting meaningful data. ",
                     style="margin-bottom: -3px",
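The tagged-thread encoding above can be illustrated with a short sketch. This is an assumed illustration only (the field names and the flatten_story helper are not from the repository); it simply wraps a story title in an <AUTHOR> tag and each direct reply in a <COMMENT> tag.

# Illustrative sketch (assumed field names), not the repository's actual code:
# flatten one HackerNews story plus its direct replies into one tagged document.
def flatten_story(story: dict) -> str:
    parts = [f"<AUTHOR> {story['title']}"]
    if story.get("text"):
        parts.append(story["text"])
    for reply in story.get("replies", []):  # direct (2nd level) replies
        parts.append(f"<COMMENT> {reply['text']}")
    return "\n".join(parts)

example = {
    "title": "Show HN: a tiny parser",
    "text": "I wrote a tiny parser over the weekend.",
    "replies": [{"text": "Nice. How does it handle unicode?"}],
}
print(flatten_story(example))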
@@ -1279,6 +1289,7 @@ filtering_process = Div(
                 "All content was downloaded, leading to a high number of documents filtered during local deduplication. Following The Pile, priority was given to plain_text first, followed by the columns in the table in reverse order."
             ),
             P(B("Unique Data Preparation Challenges: ")),
+            P("The FreeLaw text uses a lot of whitespace and newlines to format the documents visually. These are not necessary for language model learning and sometimes carry confusing semantic meaning. We attempt to unify how whitespace appears in this dataset with the following heuristics."),
             Ul(
                 Li(
                     "Consecutive whitespaces and tabs were found. Consecutive whitespaces and tabs were reduced to one single whitespace.",
@@ -1296,7 +1307,11 @@ filtering_process = Div(
                     "Converted all single new lines to whitespace. If whitespace was found after a new line with no text, the whitespace was removed. All leading and trailing whitespace was removed.",
                     style="margin-bottom: -3px",
                 ),
-                Li(
+                Li(
+                    "All form feed (",
+                    D_code("\\f", language="bash"),
+                    ") characters were removed.", style="margin-bottom: -3px"
+                ),
             ),
             P(B("Filters Applied: ")),
             Ul(
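The whitespace heuristics listed above can be sketched with plain regular expressions. The rules and their ordering here are assumptions for illustration; the pipeline's actual implementation may differ.

import re

# Illustrative sketch of the FreeLaw whitespace heuristics described above;
# the actual pipeline's rules and their ordering may differ.
def normalize_whitespace(text: str) -> str:
    text = text.replace("\f", "")                 # drop form feed characters
    text = re.sub(r"[ \t]+", " ", text)           # collapse runs of spaces and tabs
    text = re.sub(r"\n[ \t]+", "\n", text)        # drop whitespace right after a newline
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)  # single newline -> single space
    return text.strip()                           # remove leading/trailing whitespace

print(normalize_whitespace("  Case  No.\t42\nheld   that\n\n   the rule\f applies  "))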
@@ -1422,6 +1437,9 @@ filtering_process = Div(
                     "Similar to the HackerNews challenges, we had to map comments and sub-comments to the original question.",
                     style="margin-bottom: -3px",
                 ),
+                Li(
+                    "The dataset comes with the usernames of post authors. We attempt to replace them with strings such as <USER1> to remove PII. This step may also reduce the language model's tendency to memorize user names."
+                ),
             ),
             P(B("Filters Applied: ")),
             Ul(
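One possible shape for the username replacement mentioned above, assuming the author names for a thread are known in advance; the helper below is illustrative, not the repository's code.

import re

# Illustrative sketch: replace known author usernames in a thread with stable
# placeholders such as <USER1>, <USER2>, ... The inputs are assumptions; the
# actual PII handling in the pipeline may differ.
def anonymize_users(text: str, usernames: list[str]) -> str:
    # longer names first so that overlapping names are not partially replaced
    for i, name in enumerate(sorted(set(usernames), key=len, reverse=True), start=1):
        text = re.sub(re.escape(name), f"<USER{i}>", text)
    return text

thread = "alice42 asked about locks. bob_the_dev replied to alice42 with an example."
print(anonymize_users(thread, ["alice42", "bob_the_dev"]))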
@@ -1457,7 +1475,7 @@ filtering_process = Div(
             P(B("Unique Data Preparation Challenges: ")),
             Ul(
                 Li(
-                    "…
+                    "In one of our versions, we saved the string as a byte string instead of raw text, introducing additional byte indicators at the string level.",
                     style="margin-bottom: -3px",
                 ),
                 Li('No space before keyword "Answer:"', style="margin-bottom: -3px"),
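Where a document was accidentally stored as the string form of a byte literal (e.g. "b'...'"), a recovery step along the following lines could strip the byte indicators. This is an assumed illustration, not the actual fix used in the pipeline.

import ast

# Illustrative sketch: recover plain text from a document serialized as the
# *string form* of a Python byte literal, e.g. "b'Answer: 42'". Assumed
# behavior, not the repository's actual code.
def strip_byte_literal(text: str) -> str:
    if text.startswith(("b'", 'b"')) and text.endswith(("'", '"')):
        try:
            return ast.literal_eval(text).decode("utf-8", errors="replace")
        except (ValueError, SyntaxError):
            pass  # not a well-formed byte literal; keep the original text
    return text

print(strip_byte_literal("b'Answer: 42'"))  # -> Answer: 42
print(strip_byte_literal("Answer: 42"))     # unchanged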
@@ -1500,15 +1518,25 @@ filtering_process = Div(
             P(B("Unique Data Preparation Challenges: ")),
             Ul(
                 Li(
-                    "…
+                    "The original books use a lot of whitespace to format the text, similar to the case of FreeLaw. Sometimes 10+ consecutive whitespaces were found. These were reduced to one single whitespace.",
+                    style="margin-bottom: -3px",
+                ),
+                Li(
+                    "For similar reasons, consecutive new lines were found in some documents. All runs of more than two consecutive new lines were reduced to two new lines.",
                     style="margin-bottom: -3px",
                 ),
                 Li(
-                    "…
+                    "The books are formatted with end-of-line hyphenation that breaks a single word across two lines. Hence a regular word such as ",
+                    D_code("text", language="bash"),
+                    " could become ",
+                    D_code("te-\\nxt", language="bash"),
+                    ". We detect the ",
+                    D_code("-\\n", language="bash"),
+                    " combination and heuristically restore the original word.",
                     style="margin-bottom: -3px",
                 ),
                 Li(
-                    "…
+                    "Text delimiters such as * * * * * * * * were used to indicate structures like sections. We removed such known delimiters and replaced them with proper whitespace and new lines. For the rest, we make sure there is no additional leading or trailing whitespace.",
                     style="margin-bottom: -3px",
                 ),
             ),
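A compact sketch of the de-hyphenation and delimiter cleanup described above, again using plain regular expressions as an assumed illustration; the real heuristics may handle more edge cases.

import re

# Illustrative sketch of the PG-19 cleanup heuristics described above;
# the real pipeline may use more careful, case-by-case rules.
def clean_book_text(text: str) -> str:
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)  # te-\nxt -> text
    text = re.sub(r"^\s*(\* ?){3,}\*?\s*$", "", text, flags=re.MULTILINE)  # * * * * delimiters
    text = re.sub(r"[ \t]{2,}", " ", text)        # collapse long runs of spaces
    text = re.sub(r"\n{3,}", "\n\n", text)        # at most two consecutive newlines
    return text.strip()

sample = "The te-\nxt was set in     type.\n\n\n* * * * *\nChapter II"
print(clean_book_text(sample))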
main.py
CHANGED
@@ -77,7 +77,7 @@ front_matter = {
         "affiliationURL": "",
     },
     {
-        "author": "An…
+        "author": "Li An",
         "authorURL": "https://huggingface.co/an1118",
         "affiliation": "UCSD",
         "affiliationURL": "",
@@ -216,7 +216,7 @@ def main():
             ),
             Div(
                 A(
-                    "…
+                    "Common Crawl Data",
                     href="#section21",
                 )
             ),
@@ -256,7 +256,7 @@ def main():
             ),
             Div(
                 A(
-                    "Curated Sources…
+                    "Curated Sources",
                     href="#section31",
                 )
             ),
@@ -853,9 +853,12 @@ def intro():
     return Div(
         Section(
             H2("About TxT360"),
-            P( "TL;DR ",
-                …
-            )
+            P( B("TL;DR "),
+                "We introduce ",
+                A(B("TxT360 (Trillion eXtracted Text),"), href="https://huggingface.co/datasets/LLM360/TxT360"),
+                " the first dataset to globally deduplicate 99 CommonCrawl snapshots and 14 high-quality data sources from diverse domains (e.g., FreeLaw, PG-19, etc.). The large-scale deduplication process and the rich metadata we store enable precise control over data distribution. We demonstrate a simple but effective upsampling recipe that creates a 15+ trillion-token corpus, outperforming FineWeb 15T on several key metrics. With this information, TxT360 empowers pre-trainers to explore more advanced weighting techniques, a feature not commonly available in previous pre-training datasets. Our findings highlight the importance of both high-quality data sources and appropriate weighting for optimal blending in LLM training."
+            ),
+            P("In line with our 360° open source spirit, we document all detailed steps, the reasons for our decisions, detailed statistics, our code (stay tuned!), analysis results, and more, in addition to the dataset itself. We hope this can serve as a useful resource for future developers."
             ),
             plotly2fasthtml(all_eval_res_figs["MMLU"]),
             P(
@@ -865,52 +868,49 @@ def intro():
                 D_cite(bibtex_key="c4"),
                 D_cite(bibtex_key="muennighoff2023scaling"),
                 D_cite(bibtex_key="dolma"),
-                ", TxT360 carefully implements data processing steps including extraction, filtering, deduplication, personally identifiable information removal, and other steps.",
-            ),
-            P(
-                "Metadata is stored along the processing stpes, enabling fine-grained control to create data distributions and corpus of desired size. As an example, we present one simple upsampling scheme that takes into account the duplication counts, resulting in a 15~16 trillion token corpus, outperforming FineWeb and our non-upsampling baselines, on diverse evaluations. Unlike DCLM",
+                ", TxT360 carefully implements data processing steps including extraction, filtering, deduplication, personally identifiable information removal, and other steps. Unlike DCLM",
                 D_cite(bibtex_key="dclm"),
                 "and RedPajama V2,",
                 D_cite(bibtex_key="redpajama-v2"),
-                "we…
-            ),
-            P(
-                "In line with our 360° open-source initiative, we’ve documented all implementation details in this blog post and will be open-sourcing the code soon (stay tuned!). We also provide examples of each filter along with the rationale behind every decision, with the goal of informing and inspiring future work."
+                "we also hope to provide a dataset at this scale that is ready to go, without requiring further filtering."
             ),
             id="section11",
         ),
         Section(
             H2("Why TxT360"),
-            P(
-                "TxT360 is the first dataset to combine…
+            P(
+                "This year we have seen excellent datasets released by the community. Among those, most datasets focus on one source (e.g., crawled websites, code bases, papers). However, it is not trivial to combine these sources due to the potential duplication across them. TxT360 is the first dataset to combine most of the sources commonly used in pretraining."
             ),
             new_table_div_1,
             # table_div_1,
             # table_div_2,
             P(
-                "In pretraining, it is common to combine…
+                "In LLM pretraining, it is common to combine all possible text sources due to the Scaling Law. Crawled web pages are included to provide a vast quantity of data covering long-tail and diverse information, while curated datasets such as Wikipedia are also used, as they often provide 'deep-dive' domain information. By integrating the reach of web data with the quality of curated sources, TxT360 meets and surpasses the rigorous standards required for state-of-the-art LLM pre-training."
            ),
             P(
-                "** TxT360 does not include code. This decision was made due to the perceived low duplication code with other sources."
+                "** TxT360 does not include very specific domains such as code and math. This decision was made due to the perceived low duplication of code with other sources, and the different logic required to build those datasets. We leave this to future work and recommend users refer to existing projects such as Stack V2",
+                D_cite(bibtex_key="lozhkov2024starcoder2stackv2"),
+                ".",
             ),
             # P("Table 2: Basic TxT360 Statistics."),
             # table_div_data,
             id="section12",
         ),
         Section(
-            H2("Our…
+            H2("Our Approach"),
             P(
-                "To produce TxT360, a comprehensive…
+                "To produce TxT360, a comprehensive data processing pipeline was designed to account for the nuances of both web and curated datasets. The pipeline presents a unified framework for processing both data types, making it convenient and easily adaptable for users to revise and fine-tune for their own use cases."
             ),
             P(
                 "Web datasets are inherently noisy and varied. The TxT360 pipeline implements sophisticated filtering and deduplication techniques to clean and remove redundancies while preserving data integrity."
             ),
             P(
-                "Curated datasets are typically structured and consistently formatted. TxT360 filters these sources with selective steps to maintain their integrity while providing seamless integration into the larger dataset. Both data source types are globally deduplicated together resulting in…
+                "Curated datasets are typically structured and consistently formatted, but can also cause trouble with their own special formatting preferences. TxT360 filters these sources with selective steps to maintain their integrity while providing seamless integration into the larger dataset. Both data source types are globally deduplicated together, resulting in ~5T tokens of high-quality data. The table below shows the source distribution of TxT360 tokens. ",
+                B("Note that we do not recommend using the raw distribution of the deduplicated dataset; a simple recipe is provided in the studies section."),
             ),
             table_div_data,
             P(
-                "We provide details and context for the choices behind TxT360 in the respective…
+                "We provide details and context for the choices behind TxT360 in the respective Common Crawl Data Processing and Curated Sources Processing sections. A deep dive describing the deduplication process can be found in the Shared Processing Steps section."
             ),
             # Img(src="images/pipeline.png", height="300", width="600"),
             # P(
results.py
CHANGED
@@ -144,7 +144,7 @@ for bucket, perplexities in data.items():
 
 # Update layout
 fig22.update_layout(
-    title="Perplexity Across Different Years…
+    title="Perplexity Across Different Years",
     xaxis_title="Year",
     yaxis_title="Average Perplexity",
     legend_title="Bucket (duplicate count range)"
@@ -254,7 +254,7 @@ for year, values in data.items():
 
 # Update layout
 fig.update_layout(
-    title="Perplexity Across Different Dump Duplication Counts…
+    title="Perplexity Across Different Dump Duplication Counts",
     xaxis_title="Number of Dumps Duplication",
     yaxis_title="Average Perplexity",
     legend_title="Year"
@@ -296,7 +296,7 @@ for year, values in data.items():
 
 # Update layout
 fig.update_layout(
-    title="Perplexity Across Different Buckets…
+    title="Perplexity Across Different Buckets",
     xaxis_title="Bucket (Duplicate Count Range)",
     yaxis_title="Average Perplexity",
     legend_title="Year"
@@ -403,7 +403,7 @@ for year, values in data.items():
 
 # Update layout
 fig.update_layout(
-    title="Perplexity Across Different Dump Duplication Counts…
+    title="Perplexity Across Different Dump Duplication Counts",
     xaxis_title="Number of Dumps Duplication",
     yaxis_title="Average Perplexity",
     legend_title="Year"
@@ -442,7 +442,7 @@ for year, perplexities in data.items():
 
 # Update layout
 fig.update_layout(
-    title="Perplexity Across Different Buckets…
+    title="Perplexity Across Different Buckets",
     xaxis_title="Bucket (duplicate count range)",
     yaxis_title="Average Perplexity",
     legend_title="Year"
@@ -477,7 +477,7 @@ for bucket, perplexities in data.items():
 
 # Update layout
 fig.update_layout(
-    title="Perplexity Across Different Years…
+    title="Perplexity Across Different Years",
     xaxis_title="Year",
     yaxis_title="Average Perplexity",
     legend_title="Bucket (duplicate count range)"
@@ -543,7 +543,7 @@ for year, year_data in data.items():
 
 # Update layout
 fig.update_layout(
-    title="Perplexity Across Different Dump Duplication Counts…
+    title="Perplexity Across Different Dump Duplication Counts",
     xaxis_title="Number of Dumps Duplication",
     yaxis_title="Average Perplexity",
     legend_title="Year"
@@ -611,7 +611,7 @@ for year, year_data in data.items():
 
 # Update layout
 fig.update_layout(
-    title="Perplexity Across Different Buckets…
+    title="Perplexity Across Different Buckets",
     xaxis_title="Bucket (Duplicate Count Range)",
     yaxis_title="Average Perplexity",
     legend_title="Year"
@@ -675,7 +675,7 @@ for year, year_data in data.items():
 
 # Update layout
 fig.update_layout(
-    title="Perplexity Across Different Dump Duplication Counts…
+    title="Perplexity Across Different Dump Duplication Counts",
     xaxis_title="Number of Dumps Duplication",
     yaxis_title="Average Perplexity",
     legend_title="Year"
@@ -821,47 +821,48 @@ upsampling_exp = Div(
 preplexity_intro_div = Div(
     H2("Perplexity Evaluation on Duplicate Data"),
     H3("Model based Quality Estimation"),
-    P("We took one of the model-based data quality evaluation strategies adopted by", A("DataComp-LM",href="https://arxiv.org/abs/2406.11794"), "which used perplexity filtering as a candidate for quality filtering.…
+    P("We took one of the model-based data quality evaluation strategies adopted by ", A("DataComp-LM", href="https://arxiv.org/abs/2406.11794"), " which used perplexity filtering as a candidate for quality filtering. The DCLM results show that a simple perplexity filter is still quite strong. DCLM followed ", A("CCNet's", href="https://arxiv.org/abs/1911.00359"), " practice of using a 5-gram Kneser-Ney model as implemented in the ", A("KenLM", href="https://github.com/kpu/kenlm"), " library for efficient perplexity calculation. In order to gain more insight into our dataset, we also took a ", A("KenLM model", href="https://huggingface.co/edugp/kenlm"), " trained on English Wikipedia data to compute perplexity on data with different duplication patterns, and tried to observe how such signals correlate with those patterns."),
     H3("Sampling Strategy"),
-    P("We…
+    P("We took an early version of the TxT360 Common Crawl (CC) portion and bucketed the documents by the number of duplicates each has. For each CC snapshot, we bucketed the documents by their duplicate counts into the following buckets (1, 2-5, 6-10, 11-100, 101-1000, 1001-infinite). We sampled the first 10k documents from each bucket."),
 )
 
 
 perp1_div = Div(
+    # this looks basically the same as the figure below, comment it out for now.
+    # Section(
+    #     H3("Perplexity vs Buckets"),
+    #     P("For each bucket, we aggregated all the chunks that belong to a single year and calculated the average perplexity for each (bucket, year) data point. We observe the perplexity is generally dropping. This could be biased since we always keep the newest document if we find a duplicate."),
+    #     #Img(src="images/prep-diff-buckets-global.png", height = "300", width = "600" ),
+    #     plotly2fasthtml(Perplexity_Across_Different_Buckets_global_graph),
+    # ),
     Section(
-        H3("Perplexity vs…
-        P("…
-        #Img(src="images/prep-diff-buckets-global.png", height = "300", width = "600" ),
-        plotly2fasthtml(Perplexity_Across_Different_Buckets_global_graph),
-    ),
-    Section(
-        H3("Perplexity vs Years"),
-        P("Taking the same data, we can convert it into a graph indicating the yearly trend. For most buckets, the average perplexity of dumps from more recent years seem to be lower than that of former years."),
+        H3("Perplexity vs. Years"),
+        P("Taking the same data, we can convert it into a graph indicating the yearly trend. For most buckets, the average perplexity of dumps from more recent years seems to be lower than that of earlier years. This could be biased since we always keep the newest document when we find a duplicate."),
         #Img(src="images/prep-across-diff-year-global-dup-buckets.png", height = "300", width = "600" ),
         plotly2fasthtml(graph2222),
 
     ),
     Section(
-        H3("Perplexity vs Document Duplication"),
-        P("…
+        H3("Perplexity vs. Document Duplication"),
+        P("Instead of bucketing, we also plot the relationship between perplexity and the number of duplicates directly. The graph becomes a bit noisy at the end because of insufficient samples with larger duplication counts. However, we can observe that there seems to be a low point at around 10-20 duplicates. To see the results more clearly, we recommend turning off the other years, looking at one year at a time, and zooming in to the 0-100 region on the X axis."),
         #Img(src="images/prep-across-diff-docs-dup-count-global.png", height = "300", width = "600" ),
         plotly2fasthtml(graph3),
     ),
     Section(
-        H3("Perplexity vs Dump Duplication"),
-        P("…
+        H3("Perplexity vs. Dump Duplication"),
+        P("FineWeb hypothesizes that a document appearing across multiple snapshots (CC dumps) might be an indicator of quality. Hence, we also plot perplexity versus the number of snapshots a document appears in. From the graph below we can see that documents duplicated across around 40-60 snapshots usually have lower perplexity."),
         #Img(src="images/prep-across-diff-dump-dup-counts-global.png", height = "300", width = "600" ),
         plotly2fasthtml(graph4),
     ),
     Section(
-        H3("Perplexity…
-        P("Previously we have seen that documents in recent…
+        H3("Perplexity Plots before Global Deduplication"),
+        P("Previously we have seen that documents in recent snapshots tend to have lower perplexity. This might be related to the way global deduplication was implemented: during global deduplication, we only keep the copy in the latest dump, so documents duplicated across multiple dumps only appear in the latest one. To avoid the bias introduced by this strategy, we tried to recover the state before global deduplication using the stored metadata (i.e., the locally deduplicated dataset state). The trends here are a bit different: in the figure below, we do not observe a clear trend of which year has higher quality, especially in the 2-10 bucket region."),
         #Img(src="images/prep-across-diff-buckets-local.png", height = "300", width = "600" ),
         plotly2fasthtml(graph5),
     ),
     Section(
-        H3("Perplexity vs…
-        P("Following the same practice, we can plot the…
+        H3("Perplexity vs. Dump Duplication before Global Deduplication"),
+        P("Following the same practice, we can plot average perplexity against dump duplication count before global deduplication. The conclusion is similar: documents with a dump duplication count of around 40-60 have lower perplexity."),
         #Img(src="images/prep-diff-dump-dump-counts-local.png", height = "300", width = "600" ),
         plotly2fasthtml(graph6),
     ),
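The KenLM-based scoring and duplicate-count bucketing described above could look roughly like the sketch below. The model path, document fields, and bucket helper are assumptions for illustration, not the evaluation code used here.

import kenlm  # https://github.com/kpu/kenlm

# Illustrative sketch of the perplexity evaluation described above; the model
# path and document structure are assumptions, not the repository's code.
model = kenlm.Model("en_wikipedia.arpa.bin")  # assumed path to a 5-gram KenLM model

def doc_perplexity(text: str) -> float:
    # kenlm.Model.perplexity() scores a whitespace-tokenized sentence.
    return model.perplexity(text)

BUCKETS = [(1, 1), (2, 5), (6, 10), (11, 100), (101, 1000), (1001, float("inf"))]

def bucket_of(dup_count: int) -> str:
    for lo, hi in BUCKETS:
        if lo <= dup_count <= hi:
            return f"{lo}-{hi}"
    return "unknown"

docs = [{"text": "the quick brown fox", "dup_count": 3}]  # toy example
for d in docs:
    print(bucket_of(d["dup_count"]), doc_perplexity(d["text"]))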
@@ -874,27 +875,27 @@ llama_div = Div(
         P("For comparison purpose, we run the same perplexity evaluation with llama 3.1 8B model."),
     ),
     Section(
-        H3("Perplexity vs Buckets"),
+        H3("Perplexity vs. Buckets"),
         #Img(src="images/perp-across-diff-buckets-global.png", height = "300", width = "600" ),
         plotly2fasthtml(llama_graph1),
     ),
     Section(
-        H3("Perplexity vs Years"),
+        H3("Perplexity vs. Years"),
         #Img(src="images/prep-across-diff-years-global.png", height = "300", width = "600" ),
         plotly2fasthtml(llama_graph2),
     ),
     Section(
-        H3("Perplexity vs Dump Duplication"),
+        H3("Perplexity vs. Dump Duplication"),
         #Img(src="images/prep-vs-dump-dup-global.png", height = "300", width = "600" ),
         plotly2fasthtml(llama_graph4),
     ),
     Section(
-        H3("Perplexity vs…
+        H3("Perplexity vs. Buckets before Global Deduplication"),
         #Img(src="images/prep-diff-buckets-local.png", height = "300", width = "600" ),
         plotly2fasthtml(llama_graph5),
     ),
     Section(
-        H3("Perplexity vs…
+        H3("Perplexity vs. Dump Duplication Count before Global Deduplication"),
         #Img(src="images/prep-vs-dump-dup-global.png", height = "300", width = "600" ),
         plotly2fasthtml(llama_graph6),
     ),
@@ -928,16 +929,19 @@ for title, data in topic_charts:
 cluster_div = Div(
     Section(
         H2("Topic Analysis"),
-        P("…
+        P("In order to understand our dataset better, we tried to cluster our data into topic groups and examine correlations between topics and other attributes of the documents. We suspect documents from different topic groups manifest different distribution characteristics, which can give us some insight into the composition of the dataset."),
         H3("Methodology"),
-        P("We took…
+        P("We took an early version of the TxT360 Common Crawl portion and clustered it into 17 topic groups using ", A("BERTopic", href="https://maartengr.github.io/BERTopic/index.html"), ". We collected and aggregated a series of metrics from the stored metadata. For each topic group, we calculated average scores and generated the corresponding bar charts over different metrics for comparison and analysis."),
         H3("Cluster Groups"),
-        P("We grouped data into the following 17 clusters"),
+        P("We grouped the data into the following 17 clusters. These clusters were obtained by first clustering a seed portion of the dataset into 128 clusters, and then manually inspecting and merging them into 17 semantically meaningful groups."),
         Ul(*(
             Li(topic_name, style = "margin-bottom: 5px")
             for topic_name in ("Arts", "Business & Economics & Finance", "Culture & Cultural geography", "Daily Life & Home & Lifestyle", "Education", "Entertainment & Travel & Hobby", "Environment", "Food & Drink & Cooking", "Health & Wellness & Medicine", "Law & Justice", "Natural Science & Formal Science & Technology", "Personal Development & Human Resources & Career", "Politics & Government", "Religion & Spirituality", "Shopping & Commodity", "Society & Social Issues & Human Rights", "Sports")
         )),
-        H3("…
+        H3("Topic vs. Various Metrics"),
+        P(
+            "In the following section, we plot each cluster against its average score for a particular metric stored in the metadata. We recommend readers jump to the metrics they are most interested in."
+        ),
         *(
             Section(H4(title), plotly2fasthtml(topic_graphs[i]), P(data.get("comment", '')))
             for i, (title, data) in enumerate(topic_charts)
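The clustering step described above can be sketched with the BERTopic library's standard workflow. The sample corpus and parameters below are illustrative assumptions; the actual run used an early version of the TxT360 Common Crawl portion followed by a manual merge into 17 groups.

from sklearn.datasets import fetch_20newsgroups
from bertopic import BERTopic

# Illustrative sketch of the topic-clustering step described above; the corpus,
# parameters, and topic count here are assumptions, not the actual pipeline.
docs = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes")).data[:2000]

topic_model = BERTopic(verbose=True)
topics, probs = topic_model.fit_transform(docs)

# Merge fine-grained topics into a smaller set, mirroring the manual merge into 17 groups.
topic_model.reduce_topics(docs, nr_topics=17)
print(topic_model.get_topic_info().head(20))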
web.py
CHANGED
@@ -376,10 +376,10 @@ def web_data():
     return Div(
         Section(
             Div(
-                …
-                …
-                …
-                …
+                # H1("Web Data Processing"),
+                H2("Common Crawl Snapshot Processing"),
+                H3("What This Section Contains"),
+                P("This section provides a complete discussion on the filtering applied to the 99 Common Crawl snapshots that comprise the web data section of TxT360. The section is split into the following topic areas: "),
                 Ul(
                     Li("Web Data Processing Summary", style = "margin-bottom: 5px"),
                     Li("Document Preparation", style = "margin-bottom: 5px"),