Searching through various PDF files


Posted on 16 Aug 2022.


manpreet 2 years ago

 

I'm just looking for advice on how I can get my code to run faster. It's pretty quick right now when searching through 30 three-page PDFs, but I imagine that once there are thousands of files to search it will take longer than I'd like. I can change SearchOption.AllDirectories to SearchOption.TopDirectoryOnly. I've done some testing, though, and it seems that what takes the longest is searching inside the files, not enumerating the directory.

    public string ReadPdfFile(string fileName, String searchText)
    {
        List<int> pages = new List<int>();
        if (File.Exists(fileName))
        {
            PdfReader pdfReader = new PdfReader(fileName);
            for (int page = 1; page <= pdfReader.NumberOfPages; page++)
            {
                ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();

                string currentPageText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
                if (currentPageText.Contains(searchText))
                {
                    pages.Add(page);
                }
            }
            pdfReader.Close();
        }
        if (pages.Count == 0)
            return null;
        else
            return fileName;
    }

    protected void txtBoxSearchPDF_Click(object sender, EventArgs e)
    {
        if (txtBoxSearchString.Text == "")
        {
            lblNoSearchString.Visible = true;               
        }
        else
        {
            lblNoSearchString.Visible = false;
            var files = from file in Directory.EnumerateFiles(@"C:\schools\syllabus", "*.pdf", SearchOption.AllDirectories)
                        select new
                        {
                            File = file,
                        };

            StringBuilder sb = new StringBuilder();

            foreach (var f in files)
            {
                string fileNameOnly = string.Empty;
                string pdfSearchMatch = ReadPdfFile(f.File, txtBoxSearchString.Text);
                if (pdfSearchMatch != null)
                {
                    string domainURL = Regex.Replace(pdfSearchMatch, @"C:\\schools\\syllabus", @"https://mywebsite.com/search/syllabus/");                                
                    string finalSyllabusURL 
                                                
                                                
manpreet 2 years ago

The major bottleneck is most likely the ReadPdfFile method, since that is where each PDF is actually parsed.

In your ReadPdfFile method, a PdfReader reads through every page of the document looking for searchText, and the numbers of the pages on which it is found are stored in a List named pages.
Once the reader has gone through every page, the method returns null if pages is empty and the file name otherwise.

What you could do is return as soon as you have found the text, so that you don't scan the rest of the document for nothing.

In the version below, the method has been renamed to better reflect what it actually does, and the return type has been changed to bool, since we only need to know whether the file contains the search text.

// Requires: using System.IO; using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
public bool SearchPdfFile(string fileName, String searchText)
{
    // The caller passes in paths it has just enumerated, so a missing file is
    // unexpected and is reported as an error. The original workflow was:
    //     if (!File.Exists(fileName)) return false;
    if (!File.Exists(fileName))
        throw new FileNotFoundException("File not found", fileName);

    using (PdfReader reader = new PdfReader(fileName))
    {
        for (int page = 1; page <= reader.NumberOfPages; page++)
        {
            // Use a fresh strategy per page, as in the original code; a reused
            // SimpleTextExtractionStrategy keeps appending text from earlier pages.
            var strategy = new SimpleTextExtractionStrategy();
            var currentPageText = PdfTextExtractor.GetTextFromPage(reader, page, strategy);
            if (currentPageText.Contains(searchText))
                return true; // stop at the first match
        }
    }

    return false;
}
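
Since the method now returns bool instead of the file name, the calling loop has to keep track of the matching paths itself. The following is only a rough sketch of how that might look. SyllabusSearch and FindMatches are hypothetical names that are not part of your code, and the search itself is passed in as a delegate so the example runs without a PDF library. In your click handler the delegate would presumably be something like path => SearchPdfFile(path, txtBoxSearchString.Text), and files would come from Directory.EnumerateFiles as before.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class SyllabusSearch
{
    // Hypothetical helper: returns every path for which containsText is true.
    // containsText stands in for a call such as
    //     path => SearchPdfFile(path, searchText)
    public static List<string> FindMatches(IEnumerable<string> files, Func<string, bool> containsText)
    {
        var matches = new List<string>();
        foreach (var file in files)
        {
            try
            {
                if (containsText(file))
                    matches.Add(file);
            }
            catch (IOException)
            {
                // FileNotFoundException derives from IOException, so a file that
                // disappears between enumeration and search is skipped here.
            }
        }
        return matches;
    }

    public static void Main()
    {
        // Stand-in data and predicate, for illustration only.
        var files = new[] { "a.pdf", "b.pdf", "c.pdf" };
        var matches = FindMatches(files, path => path == "b.pdf");
        Console.WriteLine(string.Join(", ", matches)); // prints "b.pdf"
    }
}
```

Whether to skip unreadable files or let the exception surface is a design choice. Skipping keeps one bad file from aborting the whole search, at the cost of hiding the failure unless you log it.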
