Hi, I am quite new to using RStudio and I need some help getting language data into a processable format. My general interest relates to Natural Language Processing.
My data consists of several sets of texts, each set produced by a different person. I want to compare these sets using, for example, a tokenizer and the stylo package. So I would like to see Texts 1, 2, 3 and 4 by Person 1, then Texts 1, 2, 3 and 4 by Person 2, and so on.
I currently have each passage in a separate .txt file. I know how to import them; I know how to specify a working directory.
I would like to know:
1) How to get my data into a data frame in R so that I can identify and select individual lines or texts for processing. When I use stylo(), the output is not organised in a way that lets me identify, for example, which line belongs to which text and person.
Also,
2) When I simply import the data files and try to use the tm package, for example, I get an error message saying that there are more rows than data points in line 1. Is this a major issue if that is how the original data is structured?
Note that I cannot use CSV files as the data contains commas that are meaningful.
I'd appreciate any advice or directions to useful tutorials in this regard.
Thanks in advance.