Knowledge AI supports Web Page as a source type when you create a new Knowledge Source.
Using a web page as a Knowledge Source currently has the following restrictions:
- All visible text on the page, including items such as cookie notices, will be included in the result.
- The content must be hosted on a publicly accessible website, reachable from the Cognigy environment.
- Content hosted on websites with anti-crawling measures cannot be accessed.
- Images are not processed, and OCR (Optical Character Recognition) is not supported.
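Before pasting a URL, it can help to rule out addresses that are clearly not public. The sketch below is an offline check only and is not part of Knowledge AI. The function name `looks_publicly_reachable` is hypothetical. It cannot detect anti-crawling measures, and it cannot tell whether your Cognigy environment can reach the host.

```python
import ipaddress
from urllib.parse import urlparse


def looks_publicly_reachable(url: str) -> bool:
    """Return False for URLs that are obviously not on the public web."""
    parsed = urlparse(url)
    # Only http and https pages can be opened in a browser session.
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    host = parsed.hostname.lower()
    # Common names that only resolve on a local machine or local network.
    if host == "localhost" or host.endswith(".local"):
        return False
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal. A DNS name cannot be judged without a lookup,
        # so this sketch gives it the benefit of the doubt.
        return True
    # Rejects private, loopback, and link-local ranges.
    return ip.is_global
```

A URL that passes this check may still fail to ingest, so treat the result as a quick filter and not as a guarantee.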
How to Ingest a Web Page
Before you can use a web page as a source, you must create a Knowledge Store.
To ingest a web page as a new Knowledge Source, perform the following steps:
- In the left-side menu, navigate to Build and then select Knowledge.
- Click + New Knowledge Store and enter a name and optional description.
- Click Save and then click New Knowledge Sources.
- Click the drop-down menu under Type and select Web Page.
- Copy and paste the URL of the web page you want to ingest into the URL field.
- (Optional) Add a description and Source Tags.
- Click Create. A new entry appears in the Knowledge Store, and a task starts to parse the web page content.
- Wait until the status column shows a green checkmark.
You can now click the name of your Knowledge Source and inspect the results in the Chunk Editor.
When ingesting a web page, the Knowledge AI chunking process performs the following steps:
- Visit the URL as a page in a browser session.
- Scroll to the bottom of the web page.
- Access lazy-loaded¹ content by checking for any text changes until the page is stable and no longer loading additional text.
- Generate Knowledge Source content based on the visible text result.
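The scroll-and-wait behavior described above can be pictured as a polling loop. The following is a minimal sketch, not the Knowledge AI implementation. The names `collect_stable_text` and `FakePage`, the polling interval, and the stability threshold are all assumptions made for illustration. The page is simulated so that the example runs without a browser.

```python
import time
from typing import Callable


def collect_stable_text(
    scroll_to_bottom: Callable[[], None],
    read_visible_text: Callable[[], str],
    poll_seconds: float = 0.5,
    stable_polls: int = 3,
    max_polls: int = 60,
) -> str:
    """Scroll and re-read the page until its text stops changing."""
    previous = read_visible_text()
    unchanged = 0
    for _ in range(max_polls):
        scroll_to_bottom()          # may trigger lazy-loaded content
        time.sleep(poll_seconds)    # give new content time to appear
        current = read_visible_text()
        # Count consecutive polls with identical text; reset on any change.
        unchanged = unchanged + 1 if current == previous else 0
        previous = current
        if unchanged >= stable_polls:
            break                   # page is considered stable
    return previous


class FakePage:
    """Simulated page that reveals one more section on each scroll."""

    def __init__(self, sections: list[str]) -> None:
        self.sections = sections
        self.loaded = 1

    def scroll_to_bottom(self) -> None:
        self.loaded = min(self.loaded + 1, len(self.sections))

    def visible_text(self) -> str:
        return "\n".join(self.sections[: self.loaded])


page = FakePage(["Intro", "Details", "Footer"])
text = collect_stable_text(page.scroll_to_bottom, page.visible_text, poll_seconds=0)
```

In this simulation, `text` ends up containing all three sections, because the loop keeps scrolling until several consecutive reads return the same text. The `max_polls` cap is a design choice in this sketch so that a page that never settles cannot block the loop forever.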
The web page content is imported into the Knowledge Source only once. The source is not automatically updated to reflect later changes to the web page.
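Because the source is not refreshed automatically, you may want your own way to notice when a page has changed and needs to be ingested again. One possible approach, sketched here outside of Knowledge AI, is to store a fingerprint of the page text at ingestion time and compare it later. The function names `content_fingerprint` and `needs_reingest` are hypothetical, and how you obtain the current page text is left open.

```python
import hashlib


def content_fingerprint(text: str) -> str:
    """Hash the text after collapsing whitespace, so layout-only changes are ignored."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def needs_reingest(stored_fingerprint: str, current_text: str) -> bool:
    """Return True when the page text differs from what was fingerprinted earlier."""
    return content_fingerprint(current_text) != stored_fingerprint
```

When `needs_reingest` returns `True`, the page would have to be ingested again manually to pick up the new content.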
¹ Lazy loading is a technique in web development that defers the loading of non-critical or non-visible content until it is needed, improving page load times and user experience.