>15,000 pdfs indexed but zero c:/ drive matches


Viewing 10 posts - 1 through 10 (of 10 total)
  • Author
    Posts
  • #16005 Reply
    A Guy
    Guest

    The great majority of files on my PC are pdfs.

    Most are on the C:/ drive, a small fraction on (an old, second internal) E:/ drive.

    Anytxt returns pdf file matches from the E:/ drive, but ZERO from the C:/ drive, irrespective of the search text entered.

     

    The Anytxt ‘File Index Manager’ Count shows 15,760 pdf files present, or recognised (C:/ drive plus E:/ drive).

    The Everything search program lists slightly more (15,858) as it includes pdf files in System (not just User) directories.

    The Everything program lists User pdf files present as 15,781 total (15,107):(674)  (C:/ drive):(E:/ drive).

    Yet, as stated, Anytxt displays ZERO pdf file matches from the main drive, main pdf repository, the C:/ drive.

     

    Why would this otherwise promising search tool return ZERO results for/from the main body of pdf files, on the main drive, even despite the File Index Manager evidently recognising them, and is there a workaround?

     

    #16025 Reply
    Abbie
    Moderator

    Hello.

    Please check your index by Options->Index Manager->Index Rules.

    Then select the file type and click “Edit”.

    Please check whether the settings include or exclude the C:/ drive.

    Thank you.

    #16026 Reply
    A Guy
    Guest

    Abbie, much thanks for replying.

    The Index Rules was one of the first things I checked (before submitting the initial query).

    The settings, by default, and for all file types, were (already), and are (still) as you indicated in your last step.

    However, as another check, I also ran searches, again with various search-term entries, after changing the Index Rule to:

    “All ‘.pdf’ Files Will Be Included”.

    Same result. ZERO file matches from the 15,000+ pdf files on the c:/ drive.

    But it’s actually worse, and stranger, than that.

    My initial focus was pdf files because that file format is by far what I have most of.

    I have determined subsequently that the same problem occurs with virtually* all file formats, except EPUB documents, as follows:

    With bare basic search terms such as “the” or “and“, Anytxt returns results for various (presumably ALL match-relevant) file formats on the E:/ drive (pdf, doc, docx, txt, ppt, pptx, xls, xlsx, xlsm, one – etc.).

    However, it returns ZERO results from the C:/ drive for file formats – pdf, doc, docx, txt, ppt, pptx, xls, xlsx, xlsm, one – etc., WITH one notable exception. From the C:/ drive it returns results (hundreds, so again, presumably ALL match-relevant files) for EPUB documents.

    (*There were also a trivial number of matches from txt files matched from the C:/ drive, those txt files all being AppData, not User-generated files.)

     

    #16060 Reply
    Abbie
    Moderator

    Hello.

    Have you activated the anti-virus software?

    #16095 Reply
    A Guy
    Guest

    Hi Abbie, without specific details, I cannot guess what specifically, or generically, you are referring to as “the anti-virus software”. But I believe that avenue may be not relevant in this case. . .

    The following is intended as feedback to the programme’s developer(s), and for the benefit of other users of this programme who may encounter unexpected outcomes from it:

    For the past few days, I have been ‘characterising’ the Anytxt programme (checking and testing what I can, toward trying to understand and summarise its behaviour, to be able to – hopefully – use it effectively), and found the following:

    (1) It now (finally) is returning matches from the C:/ drive, and with (large) counts of file numbers as might be expected, and for all file formats.

    (2) However, its status indicators, in terms of files indexed/processing, are multi-layered and confusing:

    (2a) On start up – pop-up messages are displayed, including:

    “Loading drive C: index database.”

    “Load drive E: index database successful.”

    “Load all index database successful.”

    However, these do not mean it has finished processing all files.

    (2b) After those messages another pop-up is displayed:

    “The Full Text Search Engine is starting, please wait a moment…”

    Accompanied by a ‘doing something’ segmented spinning indicator in the top right corner of the programme window.

    (2c) While waiting for the ‘moment‘ to pass, I checked Option(s) > ‘Index Rules’:

    About an hour after programme launch, PDF files status was ‘Pending’, all other file formats ‘Finish’ [sic].

    About 4 hours after programme launch, PDF files status was ‘Finish’ [sic; should be ‘Finished’].

    It was some time between 1 and 12 hours after launch that text-search matches from the C:/ drive were (finally) being returned.

    (2d) While STILL waiting for the promised ‘moment‘ to pass, I checked the Index Store Path, via:

    Windows Explorer  > PC > C:/ drive > Program Data > AnyTxt > data

    The following time series shows respective drives’ data file sizes:

    File sizes in KB, by hours since launch:

    File     ~1 h      ~4 h       ~8 h         ~11 h        ~12 h        ~30 to ~80 h*
    C.ati    65,064    668,260    3,974,372    6,229,748    7,626,620    same
    E.ati    71,332    same       same         same         same         same
    F.ati    80        same       same         same         same         same
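The C.ati series above can be checked for a plateau mechanically rather than by eye. The following is only a minimal sketch: the samples are typed in by hand from the figures above, the "~30 to ~80" column is represented by a single sample at 30 hours, and the function names and the 1% tolerance are illustrative assumptions, not anything Anytxt provides.

```python
# C.ati samples from the post: (approximate hours since launch, size in KB).
# The final "~30 to ~80 h" column is represented by one sample at 30 h (assumption).
C_ATI_SAMPLES = [
    (1, 65_064),
    (4, 668_260),
    (8, 3_974_372),
    (11, 6_229_748),
    (12, 7_626_620),
    (30, 7_626_620),
]


def growth_rates(samples):
    """Return the growth in KB per hour between each pair of consecutive samples."""
    rates = []
    for (t0, s0), (t1, s1) in zip(samples, samples[1:]):
        rates.append((s1 - s0) / (t1 - t0))
    return rates


def has_plateaued(samples, tolerance=0.01):
    """True if the last two samples differ by at most `tolerance` of the final size."""
    (_, prev), (_, last) = samples[-2], samples[-1]
    return abs(last - prev) <= tolerance * last


for (hours, _), rate in zip(C_ATI_SAMPLES[1:], growth_rates(C_ATI_SAMPLES)):
    print(f"up to ~{hours} h: {rate:,.0f} KB/hour")
print("plateaued:", has_plateaued(C_ATI_SAMPLES))
```

A flat index file only suggests that little new data is being written; it does not prove that indexing has finished.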

    (2e) *BUT, after more than three days of continuous PC and Anytxt running, and after the C.ati data file size seemed to plateau, and the match counts for certain search terms likewise (e.g. 3597 matches for one search term became 3599 by 24 hours later, 7568 for another became 7569, and 9332 for a third remained at 9332), the ‘moment’ has still not finished. In other words, I am still waiting for the programme to finish doing what it’s doing, assuming it ever will.

    Three days after the third launch, the programme still states:

    “Load all index database successful. But some files are still being processed and your search results may be incomplete at this time.”

    “… search results may be incomplete at this time as some files are still being processed … when the status in the upper right corner changes to green [tick], all scans are completed.”

     

    SUGGESTION 1: It would seem that, instead of a “wait a moment” message, where the ‘moment‘ may equate to days, a user-meaningful processing progress status indicator would be useful, such as a 0 to 100% progress bar or other graphic, or an estimated time-to-completion counting-down clock.
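To make SUGGESTION 1 concrete, here is a rough sketch of the arithmetic behind a percentage bar and a time-to-completion estimate. It assumes the programme knows how many files it has processed and how many exist in total, which the thread does not confirm; the function name and the numbers in the usage line are hypothetical.

```python
def progress_report(processed, total, elapsed_hours):
    """Return (percent done, estimated hours remaining) for an indexing run.

    The estimate assumes files keep being processed at the average rate seen
    so far. The estimate is None when no rate can be computed yet.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    percent = 100.0 * processed / total
    if processed == 0 or elapsed_hours <= 0:
        return percent, None
    rate = processed / elapsed_hours            # files per hour so far
    remaining_hours = (total - processed) / rate
    return percent, remaining_hours


# Hypothetical figures, for illustration only.
print(progress_report(processed=12_000, total=15_000, elapsed_hours=8))
```

An average-rate estimate like this would be badly skewed by a few very large files, so a real indicator would probably need to weight by file size or page count.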

     

    (3) Further ‘characterisation’ leads to further improvement suggestions. . .

     

    There are currently the following discrete search method options for users:

    (1) Whole Match

    (2) Advanced Search (with various search syntax variable inputs possible, per the Help(H) menu.)

    (3) Regular Match

    SUGGESTION 2: However, none of these appear to equate necessarily to an “Exact Match“.

    For example, if one wishes to search the term ‘MS’ the programme returns matches including:

    ms (for milliseconds), Ms. (short for Miss), msg, msgbox, msn, sums, mums, films, hamster, etc.

    But the same (number of) matches are returned if one instead enters the search term ‘ MS ‘, (space em ess space).

    Or if one tries to ‘outsmart’ by entering, for example the search term ‘ MS  !ms.’

    This might be expected to return instances of ‘MS’ but not of ‘MS.’.

    But instead, no matches for either string are returned.

    It seems (unless I am incorrect) that spaces and full-stops, if not also other non alpha-numeric characters, are not enabled with the programme’s current search capabilities. It therefore seems not possible for users to cut down to a shortlist of desired as against unuseful matches when such a term as ‘MS’ may be critical and urgent, as in the context of it standing for ‘Multiple Sclerosis’.
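As an illustration of what an "Exact Match" could mean for a term like MS, the sketch below uses Python's standard `re` module. This is not Anytxt's search syntax or implementation; the function name and the sample sentence are made up for the example. Here a match is case-sensitive and must not touch a letter or digit on either side.

```python
import re

# Made-up sample text containing several of the unwanted matches listed above.
SAMPLE = "Films about hamsters took 20 ms; Ms. Jones sent a msg about MS research."


def exact_matches(term, text):
    """Return start offsets where `term` occurs as a standalone, case-sensitive token."""
    pattern = re.compile(
        r"(?<![A-Za-z0-9])" + re.escape(term) + r"(?![A-Za-z0-9])"
    )
    return [m.start() for m in pattern.finditer(text)]


print(exact_matches("MS", SAMPLE))
```

With this rule, "Films", "hamsters", "ms", "Ms." and "msg" are all rejected and only the standalone "MS" is kept. Excluding "MS." as well would need one more condition on the character that follows.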

     

    SUGGESTION 3: Also, none of the current search options appear to facilitate a “Proximity Match” (a feature in some PDF reader software).

    ‘Proximity’ with such software equates to, for example:

    The presence of two or more search terms, within the same paragraph, or page, or document.

    It appears from the results fields that whole-document-matches are returned by default. But Anytxt’s indexing already incorporates line or row numbers, so it should be easy to facilitate finer ‘Proximity Matches’ for users, by enabling, for example, users to choose, if not set: ‘Within plus-or-minus X lines or rows of each other’. (If X were 30 to 50 rows, that would equate to approximately one page for many texts, depending on their respective font and page size etc.)

    This search functionality would enable users to drill down to a subset of file matches ‘most likely‘ to contain what is being looked for, while ignoring potentially large numbers of less likely if not outright time-wasting files.
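The "within plus-or-minus X lines" idea can be sketched in a few lines of Python. This assumes the document text is already available as a list of lines; the function name and sample lines are hypothetical, and nothing here reflects how Anytxt stores its index.

```python
def proximity_hits(lines, term_a, term_b, max_gap):
    """Return (line_a, line_b) pairs, 1-based, where the two terms occur
    within `max_gap` lines of each other (case-insensitive substring test)."""
    a_lines = [i for i, line in enumerate(lines, 1) if term_a.lower() in line.lower()]
    b_lines = [i for i, line in enumerate(lines, 1) if term_b.lower() in line.lower()]
    return [(a, b) for a in a_lines for b in b_lines if abs(a - b) <= max_gap]


# Made-up document, one string per line.
DOC = [
    "Vitamin D3 dosage",
    "filler",
    "calcium absorption",
    "filler",
    "filler",
    "filler",
    "vitamin D again",
]

print(proximity_hits(DOC, "vitamin", "calcium", max_gap=2))
```

A document would then count as a proximity match only if this list is non-empty, which is how the shortlist of ‘most likely’ files would be narrowed.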

     

    SUGGESTION 4: There are four column/field headers for search-term matches: Name, Modified (date), Path, (file) Type.

    However, results displayed initially do not appear to be in any A-Z or Z-A order based upon any of those columns/fields.

    If one clicks on one of the headers, that column becomes sorted. However, it may be useful for some users to be able to have, if not set as a default, one of those headers in sorted order, e.g. Date Z-A (descending order, newest at top, oldest at bottom).

    It may be even more useful for some users to be able to sort by, if not set as default, one or more columns, e.g. by file 1Type then 2Date, or vice-versa, or 1Date, 2Type, 3Name, etc.
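The multi-column ordering described above (e.g. 1Type, 2Date, 3Name) is a standard multi-key sort. A minimal sketch, with made-up result rows and a hypothetical function name, relying on the fact that Python's `sorted` is stable:

```python
from datetime import date
from operator import itemgetter

# Made-up search results, for illustration only.
RESULTS = [
    {"name": "notes.txt", "modified": date(2021, 6, 3), "type": "txt"},
    {"name": "b.pdf", "modified": date(2021, 6, 1), "type": "pdf"},
    {"name": "a.pdf", "modified": date(2021, 6, 1), "type": "pdf"},
    {"name": "c.pdf", "modified": date(2021, 6, 10), "type": "pdf"},
]


def sort_results(rows):
    """Sort by type A-Z, then by modified date newest first, then by name A-Z."""
    # The sort is stable, so apply the least significant key first.
    rows = sorted(rows, key=itemgetter("name"))
    rows = sorted(rows, key=itemgetter("modified"), reverse=True)
    rows = sorted(rows, key=itemgetter("type"))
    return rows


for row in sort_results(RESULTS):
    print(row["type"], row["modified"], row["name"])
```

Sorting in passes like this lets each column have its own direction, which matches the wish for, say, Type ascending combined with Date descending.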

     

    Lastly, to recap on indexing status, now being 10 days after my initial download and launch of the programme (version 1.3.1380), and more than three days after leaving it and my PC running continuously but STILL NO GREEN TICK, I continue to wonder when, or even if, I may be able to expect the programme to ever finish processing? (And, consequently, how many, if any, matching files for certain search terms may remain missing from the programme’s results.) Is it actually still processing, or stuck in a loop, or on one (or a few) particular file(s)? There seems to be no (easy) way for users to know.

    #16102 Reply
    Abbie
    Moderator

    Hello.

    Thank you very much for your valuable suggestions.

    Are there many images in those PDF documents?

    #16106 Reply
    A Guy
    Guest

    Not in most. But certainly in a significant proportion of them, i.e. one may guess that among 15,000+ texts, several thousand of them will include images.

    (And some files have in the order of 500+ pages, or even (though only a few) 1000+ pages.)

    The Anytxt ‘File Index Manager’ Count has been ‘crawling’ up (snail-paced). Initially (Day 1) it showed 15,760 pdf files present, or recognised (C:/ drive plus E:/ drive). Several and up to ten days later that count had risen to 15,810. Today it is 15,813.

    This leads to another (obvious?) suggestion. . .

    SUGGESTION 5: Perhaps in addition to SUGGESTION 1 (inclusion of a 0-100% processing progress bar and/or countdown clock), the programme could indicate to users precisely how many of all files present on the computer have been processed (X of Y). That would allow users to decide whether to keep waiting for 100% processing to finish, or to risk missing matches from a minor proportion of unprocessed files.

     

    By the way:

    SUGGESTION 6: It may be useful, and for all sorts of potential reasons, to have a file-name selection option facilitated, such as simple click-tick or leave-blank (which would be the default) ‘Select’ion boxes as a fifth column/field alongside Name, Modified, Path and Type. For example, visually skim-scanning through some of the text of respective match items via the right-side portion of the Anytxt window, may lead users who wish to, to select file names on the left-side as ‘keepers’, i.e. useful or worthy of inclusion if not further study, or not; and/or to be exported to Excel; or deleted, etc.

    Question: Do the current ‘Delete’ and ‘Delete All’ options delete entire files, or only remove them from the current list of file matches?

     

    SUGGESTION 7: It would be useful, and presumably avoid lots of similar or repeat questions from different users, if the ‘Help‘ menu explained: ALL features and options – i.e. ALL toolbar icons and ALL menu drop-down items – available in the programme (such as what exactly will be exported with the ‘Export’ options?), its behaviors (such as how long a full-index may be expected to take), its limitations (such as: Does NOT have the ability to search with a space either side of search-text terms), things to be wary of (such as risk of accidentally deleting master files), etc.

     

    #16123 Reply
    A Guy
    Guest

    Day 12 update:

    Twelve days after initial download and launch, the Anytxt programme is STILL processing files on my PC drives.

    STILL no GREEN TICK; instead it displays pulsating ‘doing something’ bars in the top right corner of the programme window, per the screenshot below:

    (a) Despite (as on Day 1, and still now as at Day 12), the ‘File Index Manager’ showing status as ‘Finish’ [sic, (should be ‘Finished’)] and for ALL file types (see screenshot).

    (b) AND despite seemingly plateaued data file sizes for respective drives (per my June 20 post).

    (c) BUT logical in at least one aspect – the PDF file count shown via ‘Index Manager Rules’ has still been creeping up:

    Day          1         7/8/9?    10        12
    PDF count    15,760    15,810    15,813    15,819

    [Screenshot: Anytxt Index Manager Rules status, Day 12]

     

    If the programme is STILL indexing/processing files, its ‘Index Manager Rules’ window should NOT show ‘Finish’ [sic] for unprocessed/unindexed (i.e. unFINISHED) files/file-types.

    Hence my SUGGESTION 5 (per previous post). The essence of that suggestion is to show users actual ‘Indexing Status’ in terms of X files done of Y files present, or X files ‘Processed’ or ‘Indexed’ of Y files ‘Total’. Such functionality seems of fundamental importance now. Such status could be reflected – for respective file types – possibly with an additional column in the ‘Index Manager Rules’. Its current layout gives an ambiguous ‘Count’ column. It may be deemed ‘ambiguous’ because a ‘Count’ could be of files processed, or of files present/total, or possibly even something else. Given that, in my case, the Count for pdf files is creeping up, one would have to assume that the Count is of ‘processed files’ and not of ‘total files’. But perhaps that assumption is also wrong, as was my initial one that the ‘Count’ appeared to be of ‘total files’ given that the Status column had/has been showing ‘Finish’ [sic], and for each file type (including PDFs – for which the Count continues to rise), for days.

     

    Only if     X = Y      for a given file type should the ‘Index Manager Rules’ show that file type’s status as ‘Finished’.

    The status for any file type whose status is     X < Y      should be shown as ‘Pending’, or as appropriate*.
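The X-versus-Y rule above is simple enough to state as code. This is a sketch of the proposed behaviour only, not of how Anytxt computes its Status column; the function name is hypothetical, and the optional `unindexable` count is an assumption covering files that can never be indexed (e.g. textless scans).

```python
def index_status(processed, total, unindexable=0):
    """Return a status line for one file type.

    'Finished' only when every file is accounted for (X = Y); otherwise 'Pending'.
    """
    accounted_for = processed + unindexable
    if accounted_for > total:
        raise ValueError("more files accounted for than are present")
    label = "Finished" if accounted_for == total else "Pending"
    return f"{label}: {processed} of {total} indexed, {unindexable} unindexable"


# Hypothetical counts, for illustration only.
print(index_status(processed=7, total=10))
print(index_status(processed=7, total=10, unindexable=3))
```

Reporting the three numbers alongside the label would remove the ambiguity about what a single ‘Count’ column means.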

     

    The actual status with my situation/PC as at Day 12 after programme launch, which includes about 9 days straight now with both the PC and the programme running 24/7, is ‘Pending’ despite the ‘Index Manager Rules’ window stating ‘Finish’ [sic].

    I have no idea, from within the Anytxt programme itself, (as would be the case for all other users of the current version where the Indexing Status is in question):

    (1) how many file types are still yet to be processed,

    (2) how many files of a given file type, are still yet to be processed,

    (3) which specific files have not yet been processed,

    (4) nor, therefore, whether their absence from the results fields may matter for certain text-searches,

    (5) *nor whether certain (e.g. textless) files may simply be unable to be indexed, i.e. fall under a third ‘Count’ possibility – unprocessable/unindexable or ‘require the OCR version of the programme’, etc.

    #16126 Reply
    A Guy
    Guest

    Deja Vu:

    The initial (Day 1) problem I posted about in this thread was that text-searches yielded ZERO PDF-file matches from the main drive, (C:/ drive).

    At some point later, and for 12 days after that post, the programme ‘came good’, doing as originally expected – yielding matches for ALL file types and from ALL drives.

    Just now, however, I attempted to run a new search, and the initial problem repeated – ZERO matches for ANY file type from the C:/ drive.

    Further, while attempting merely to grab a screenshot to attach to this update, the programme froze. It was completely unresponsive to any mouse click on any part of its window, including the Close (X) at top right corner. It did not indicate “not responding”, and nor did Windows Task Manager. After waiting minutes with no change, I assumed it had frozen permanently and elected to force the programme to shut, via the Task Manager, then relaunch it.

    Behaviour was back to normal, (successful searches, all drives) after the re-launch.

    I have posted this update on the basis that any feedback on the programme’s behaviours/misbehaviours may benefit its developer(s) toward bug identification and overall improvement, and then, by extension, the user community.

    I wonder if issues such as ZERO matches and freezing may have been because it had been running continuously for 12 days, or because I had removed the third drive (F:/), or from a lack of PC system resources such as ordinary memory for the data files, or RAM for processing, or other (and if so, what the minimum system resource requirements for the programme are)?

    SUGGESTION  8: As is standard practice with many other programmes, if system resources may be performance limiting factors or result in fatal-errors such as freezes, minimum resource requirements for the programme should be specified up front to users, ideally at some point in or during the download and/or install processes, but also through on-screen prompts over time if and as the programme recognises resources (such as PC memory) dwindling and potentially becoming limiting.

    #16142 Reply
    A Guy
    Guest

    Alphanumeric and Numeric and Double-Quote-Enclosed Text Searches.

    The screenshot below shows the contiguous characters   d3   entered into the text-to-search bar (which prompts users with the words: ‘Input ANY text you like, then press enter’).

    (The search-term ‘d3’ is as might be applied by some when looking for document instances/matches of ‘vitamin D’, for which vitamin D3 is one molecular form, and ‘D3’ is one shorthand synonym.)

    The sub-windows in the right side of the Anytxt window screenshot below show typical results for one document using ‘Advanced Search’ mode.

     

    They appear to make little sense on several counts:

    (a) The lower right window states ” d3 (7 hits) “, yet only one actual match  “D3”  is among the seven.

    (b) Both right-side windows show other, inexplicable and incorrect/undesired ‘matches‘:

    “d3” (entered) = (returns, somehow, and incorrectly) “31” and “D#” (D followed by ANY integer).

    (c) The listings in the TOP right window are in source-document line number order, but those in the BOTTOM right window are, confusingly, NOT.

    (d) If instead of   d3   the search term “d3” is entered, (the same term but within double-quote symbols), match results are much better, minus the likes of “31” and “D#”, inclusive of “D3” itself, but also inclusive of “D  3” (the letter D followed by the number 3 but separated by one or more spaces), (again, despite the characters entered into the search bar being contiguous).

    (e) The outcomes explained in (d) are mirrored if instead of the Advanced Search mode the Whole Match mode is selected.

    (f) Similar results occur if search terms are numeric, e.g. with ‘4,000’ entered: many inexplicable/incorrect/undesired ‘matches’, whereas 4000 (no comma), “4000” or “4,000” (enclosed in double quotes in Advanced Search mode) give seemingly accurate results.

     

    SUGGESTION 9:

    Characters that are entered into a (any) search bar contiguously, (adjacent to each other/no spaces included or between, e.g. MS or D3) should return only (and all) corresponding actual matches, and (in the specific case of the Anytxt programme) irrespective of the Search mode selected (Whole Match or Advanced Search or Regular Match or – per the above SUGGESTIONS 2 and 3 previously – Exact Match or Proximity Match).

    In other words, whether while in ‘Whole Match mode‘, or, ‘Advanced Search mode and using enclosing double-quotes‘, ANY string of characters that is:

    (a) entered into a search bar contiguously, (e.g. in Whole Match mode, adjacent to each other without spaces, such as the two characters: MS or D3 ), or,

    (b) entered into a search bar contiguously, (e.g. in Advanced Search mode, adjacent to each other and with bracketing spaces and double quote symbols, such as the four characters inside respective double-quotes: “ MS ” or “ D3 ”)

    should return only (and all) corresponding actual matches:

    (i) with or without spaces (or any other character) either side in Case (a), but,

    (ii) exclusively only with a space either side, (and with no other character type(s) either side) in Case (b).
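The two cases can be expressed with Python's standard `re` module to show the intended difference. This is a sketch of the behaviour being requested, not of Anytxt's matching; the function names and sample sentence are made up, and matching is case-insensitive on the assumption that ‘d3’ is meant to find ‘D3’.

```python
import re

# Made-up sample containing the variants discussed above.
SAMPLE = "Vitamin D3 and D 3 differ; see D31, (D3) and D3."


def contiguous_matches(term, text):
    """Case (a): the characters of `term` appear adjacent, whatever surrounds them."""
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    return [m.start() for m in pattern.finditer(text)]


def space_delimited_matches(term, text):
    """Case (b): `term` with exactly a space character on each side."""
    pattern = re.compile(r"(?<= )" + re.escape(term) + r"(?= )", re.IGNORECASE)
    return [m.start() for m in pattern.finditer(text)]


print(contiguous_matches("d3", SAMPLE))
print(space_delimited_matches("d3", SAMPLE))
```

In neither case does "D 3" match, since the characters are not adjacent. Under case (b) as written, a term at the very start or end of the text, or followed by punctuation, is also excluded.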

    [Screenshot: Anytxt search results for ‘d3’]
