Eva Van Assche*, Wouter Duyck and Robert J.
Department of Experimental Psychology, Ghent University, Ghent, Belgium.

This article provides an overview of bilingualism research on visual word recognition in isolation and in sentence context. Many studies investigating the processing of words out-of-context have shown that lexical representations from both languages are activated when reading in one language (language-non-selective lexical access). A newly developed research line asks whether language-non-selective access generalizes to word recognition in sentence contexts, which provide a language cue and/or semantic constraint information for upcoming words. Recent studies suggest that the language of the preceding words is insufficient to restrict lexical access to words of the target language, even when reading in the native language.

This is the monolingual "cleaning" pipeline. It does a few things:
- run some Moses normalization+cleaning on the sentences
- filter out the sentences that do not match some criteria (length, character ratios, etc.)
- run script detection at the sentence level; if this doesn't match the expected lang, throw the sentence out
- run LID detection at the sentence level; if this doesn't match the expected lang, throw the sentence out
- deduplicate sentences (this is done by sorting sentences)

The core filtering is in monolingual_line_processor.py and utils/text_filter.py.

Run it

python monolingual_pipeline.py data_dir=yourdatahere langs=''

- data_dir is where the raw data is; it should have subfolders per lang and files named with the pattern corpus_.
- langs: an array of langs to process in this run.
- launcher.cluster=local local_tmp_dir=/tmp/monolingual if you want to run this locally instead of on slurm.
- preproces_requirements.cpus_per_task=40: the number of CPUs used to process each lang file in a slurm job. Higher means it will go faster, but you'll have a harder time getting a machine from the queue.
- corpus_filter=yourcorpus: filter the lang files you'll process to only work on a specific corpus.
- input_file_glob_template: replace this if the files in your data_dir do not follow the expected template.

See monolingual.yaml for more possible configurations.

The run will be started with a custom working directory that follows the pattern: outputs/. All the logs will go there (including executor_logs from slurm jobs). By default, the data output is set in monolingual.yaml to be output_dir; this means that the outputs will go to lang dirs in the working directory and will end up in different places depending on the day/time you start the run. This is useful for testing, but if you want to output somewhere else (like a central clean monolingual repo), override output_dir=/somethingstable/ when starting the run.

The run will log to the wandb monolingual dashboard. Go to wandb and make sure to enable grouping. There will be one sub-run per process (see num_cpu above) per lang, plus a global run for the root script. The global run will have a funny name and will only report data at the end of everything. You can check progress in each sub-run.
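The per-sentence cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code (which lives in monolingual_line_processor.py and utils/text_filter.py): the function names, the thresholds, and the `lid` callable are all hypothetical stand-ins.

```python
import re
import unicodedata

def clean_and_filter(sentence, expected_lang, lid,
                     min_len=3, max_len=1000, min_alpha_ratio=0.5):
    """Hypothetical sketch of one cleaning pass over a sentence.

    `lid` stands in for a sentence-level language-ID model; the real
    pipeline also runs a separate script-detection check.
    """
    # Normalize unicode and whitespace (stand-in for Moses normalization+cleaning).
    s = re.sub(r"\s+", " ", unicodedata.normalize("NFKC", sentence)).strip()
    # Length criterion.
    if not (min_len <= len(s) <= max_len):
        return None
    # Character-ratio criterion: require mostly alphabetic characters.
    if sum(c.isalpha() for c in s) / len(s) < min_alpha_ratio:
        return None
    # LID check: throw the sentence out if it doesn't match the expected lang.
    if lid(s) != expected_lang:
        return None
    return s

def dedup_sorted(sentences):
    """Deduplicate by sorting, then dropping adjacent repeats (like `sort | uniq`)."""
    out, prev = [], object()
    for s in sorted(sentences):
        if s != prev:
            out.append(s)
        prev = s
    return out
```

Deduplicating via sorting rather than a hash set is what makes the step cheap to shard: sorted files can be merged and uniqued streamingly without holding all sentences in memory.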
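To make the expected data_dir layout concrete (one subfolder per lang, files named with the corpus_ pattern), here is a small sketch of how input files could be discovered. `find_input_files` and its parameters are hypothetical illustrations of what input_file_glob_template and corpus_filter control, not the pipeline's API.

```python
from pathlib import Path

def find_input_files(data_dir, langs, glob_template="corpus_*", corpus_filter=None):
    """Hypothetical file discovery for a data_dir with one subfolder per lang.

    `glob_template` plays the role of input_file_glob_template;
    `corpus_filter` narrows the match to a single corpus.
    """
    template = f"corpus_{corpus_filter}*" if corpus_filter else glob_template
    # One sorted file list per requested lang.
    return {lang: sorted(Path(data_dir, lang).glob(template)) for lang in langs}
```

With this layout, corpus_filter=news would only pick up files like data_dir/en/corpus_news.txt and skip other corpora in the same lang folder.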