Default Chunk Strategy
The Default Chunk Strategy extracts content from a PDF file, splits it into Knowledge Chunks of a fixed token length, and recognizes tables. Based on our research, the Default Chunk Strategy returns better structured results and works best with the Top K value set to 5 in the Search Extract Output Node.
Alternative Chunk Strategy
The Alternative Chunk Strategy is effective when your original PDF is divided into paragraphs in a logical order, separated by double line breaks. This strategy attempts to identify the structure, such as sections, and then splits each section into one or more Knowledge Chunks. To apply the Alternative Chunk Strategy, add .preset_uiolc_ls to the file name, directly before the .pdf extension, before you upload the file. For example, if you have a PDF file named cognigy.pdf, rename it to cognigy.preset_uiolc_ls.pdf before the upload. The .preset_uiolc_ls suffix triggers the Alternative Chunk Strategy after you upload the PDF file.
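The renaming step can be scripted when many files need the preset. The following is a minimal sketch, not part of Knowledge AI: the helper name with_alternative_preset is hypothetical, and only the .preset_uiolc_ls marker comes from the description above. The function only computes the new name and does not touch the file system.

```python
from pathlib import Path

# Marker taken from the documentation text; everything else is illustrative.
PRESET = ".preset_uiolc_ls"


def with_alternative_preset(path: str) -> Path:
    """Return the path with the preset marker inserted before the .pdf extension."""
    p = Path(path)
    if p.suffix.lower() != ".pdf":
        raise ValueError(f"expected a PDF file, got {p.name!r}")
    if p.stem.endswith(PRESET):
        # The marker is already present, so leave the name unchanged.
        return p
    return p.with_name(f"{p.stem}{PRESET}{p.suffix}")


print(with_alternative_preset("cognigy.pdf"))  # cognigy.preset_uiolc_ls.pdf
```

To rename a file on disk, the result could be passed to Path.rename, for example Path("cognigy.pdf").rename(with_alternative_preset("cognigy.pdf")).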
While processing the PDF file, Knowledge AI may omit complex elements such as visually intricate headers or lists and not include these elements in the Knowledge Chunks.
Examples
Assume you have the following text from the Cognigy blog in a PDF file:
Figure: PDF sample
Example 1: Default Chunk Strategy
If you use the Default Chunk Strategy, Knowledge AI splits the PDF file into 3 equal Knowledge Chunks.
Figure: Default Chunk Splitting
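The general idea of splitting by a fixed token length can be illustrated with a short sketch. This is not Knowledge AI's implementation: the function name split_fixed_length is hypothetical, whitespace-separated words stand in for tokens, and table recognition is not modeled.

```python
def split_fixed_length(text: str, chunk_size: int) -> list[str]:
    """Split text into chunks of at most chunk_size whitespace-separated tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Assumption: a "token" is approximated by a whitespace-separated word.
    tokens = text.split()
    return [
        " ".join(tokens[i:i + chunk_size])
        for i in range(0, len(tokens), chunk_size)
    ]


chunks = split_fixed_length("one two three four five six", chunk_size=2)
print(chunks)  # ['one two', 'three four', 'five six']
```

Because the split ignores paragraph and section boundaries, chunks come out the same size except possibly the last one, which holds whatever tokens remain.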
Example 2: Alternative Chunk Strategy
If you use the Alternative Chunk Strategy, Knowledge AI splits this text into 5 Knowledge Chunks. Note that in the first Knowledge Chunk, a title is missing because it is formatted as a complex element. In comparison, the Default Chunk Strategy recognizes and includes the title in the Knowledge Chunks.
Figure: Alternative Chunk Splitting
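Splitting on double line breaks, which the Alternative Chunk Strategy relies on, can be sketched as follows. This is an illustration under stated assumptions rather than Knowledge AI's actual logic: the function name split_paragraphs is hypothetical, the input is already extracted plain text, and section detection and the handling of complex elements are not modeled.

```python
import re


def split_paragraphs(text: str) -> list[str]:
    """Split text on blank lines (double line breaks) and drop empty pieces."""
    # One or more blank lines, possibly containing spaces, separate paragraphs.
    parts = re.split(r"\n\s*\n", text)
    return [part.strip() for part in parts if part.strip()]


sample = "Title\n\nFirst paragraph.\n\n\nSecond paragraph.\n"
print(split_paragraphs(sample))  # ['Title', 'First paragraph.', 'Second paragraph.']
```

A PDF whose paragraphs are not separated by double line breaks would yield a single large piece with this approach, which is consistent with the recommendation to use the Alternative Chunk Strategy only for logically ordered, clearly separated paragraphs.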